dplyr

+ here and skimr

Author

Steen Flammild Harsted

Published

January 10, 2024

1 Getting started

Load the `tidyverse` and the `here` package

Code

library(tidyverse)
library(here)

1.1 `here()`

Try out the `here()` function

What happens if you write here() ?
What happens if you write here("raw_data") ?
Compare your output with your neighbors.
Compare your output with someone who has a different operating system (Windows, Mac, Linux) than you. (Hint: look for \ /)
Discuss if and how the here()function could ever be of any use to anybody

Load the `soldiers` dataset

Use the function read_csv2()
The file argument should be here("raw_data", "soldiers.csv")

Code

read_csv2(here("raw_data", "soldiers.csv"))

Assign the `soldiers` dataset to an object called `soldiers`

Great, now read the data again and assign it (remember <-) to an object called soldiers

Code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

1.2 `skim()`

There are many ways to explore a dataframe. In this course we will take a shortcut and use the skim() function from the skimr package. The skim() function provides an exellent overview of a dataframe.

Load the skimr package and use the skim() function to get an impression of the soldiers dataset.

Discuss with your neighbor:

Nr of rows?
Nr of columns?
Missing values?
Types of variables?
How many unique values does the different character variables have?
Any fake data? (hint: Yes, for educational purposes I have added some fake data)

Code

library(skimr)
skim(soldiers)

2 `dplyr`

The main functions of dplyr that you will practice today are:

select()
filter()
summarise()
group_by()
arrange()
mutate()
count()

2.1 `select()`

select the columns `subjectid`, `sex`, `age`

Code

soldiers %>% 
  select(subjectid, sex, age)

select the columns 1, 3, 5:7

Code

soldiers %>% 
  select(1,3,5:7)

remove the columns 3:5

Code

soldiers %>% 
  select(-(3:5))

select all columns that contains the word “circumference”

HINT

Use one of the tidyselect helper functions.

Code

soldiers %>% 
  select(contains("circumference"))

remove all columns containing the letter “c”

HINT

Use one of the tidyselect helper functions.
Use a minus sign.

Code

soldiers %>% 
  select(-contains("c"))

select all columns that contains a “c” OR “x” OR “y” OR “z”

HINT

In R(and many other programming languages) the | sign is used as a logical operator for OR.

Code

soldiers %>% 
  select(contains("c") | contains("x") | contains("y") | contains("z"))

select all columns that contains a “c” OR “x” OR “y” OR “z”

This time use the tidyselect helper function called matches()
matches() allows you to use logical operators inside the function. E.g., matches("this|that")

Code

soldiers %>% 
  select(matches("c|x|y|z"))

Challenge: Why not always use `matches()`?

Use the preloaded iris data-set. (just write iris)
Try to use matches() to select all columns containing a “.”

Code

iris %>% 
  select(matches(".")) %>% 
  head() # This line is just to prevent a very long output.

Why did we select the Species column?
What happens if we use contains() instead

Code

iris %>% 
  select(contains(".")) %>% 
  head() # This line is just to prevent a very long output.

What is the difference in the output? Why is it different?

ANSWER

contains(): Contains a literal string.
matches(): Matches a regular expression.
In regular expressions . is a wildcard.

The wildcard . matches any character. For example, a.b matches any string that contains an “a”, and then any character and then “b”; and a.*b matches any string that contains an “a”, and then the character “b” at some later point.
https://en.wikipedia.org/wiki/Regular_expression

2.2 `filter()`

Restart R. (Click Session -> Restart R)
Load the tidyverse and the here package

Code

library(tidyverse)
library(here)

Load the soldiersdata and assign it to an object called soldiers

Code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

Keep all rows where `sex` is Female:

Hint

???? == "Female"

Code

soldiers %>% filter(sex == "Female")

Keep all rows where `weightkg` is missing (NA value)

Hint

use the is.na() function

Code

soldiers %>% 
  filter(is.na(weightkg))

Keep all rows where `WritingPreference` is “Left hand” AND `sex` is “Female”

Code

soldiers %>% 
  filter(WritingPreference == "Left hand" ,  sex == "Female")  # you can use & instead of a ,

Keep all rows where `WritingPreference` is “Left hand” OR `sex` is “Female”

Code

soldiers %>% 
  filter(WritingPreference == "Left hand" |  sex == "Female")

What is going wrong in this code?

soldiers %>% 
  select(1:5) %>% 
  filter(WritingPreference == "Left hand" |  sex == "Female")

The error message is:

Error in `filter()`:
ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`.
Caused by error:
! object 'WritingPreference' not found
Run `rlang::last_error()` to see where the error occurred.

ANSWER

The variable WritingPreference was not selected in the first line.

Keep all rows where `age` is above 30 and the `weightkg` is below 600

Code

soldiers %>% 
  filter(age > 30, weightkg < 600)

Keep all rows where `Ethnicity` is either “Mexican” OR “Filipino” OR “Caribbean Islander”

Hint

you need to use %in% and c()

Code

soldiers %>% 
  filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander"))

Difference between == and %in%

tldr - Use %in% instead of == when you want to filter for multiple values.

Read on if you want to understand why. (You don’t have to)

The code filter(Ethnicity == c("Mexican", "Filipino")) is likely not doing what you expect. The ‘==’ operator does an element-wise comparison, which means it compares the first element of ‘Ethnicity’ to the first element of the vector (“Mexican”), the second element of ‘Ethnicity’ to the second element of the vector (“Filipino”). The short vector is then recycled so now the third element of ‘Ethnicity’ is compared to the first element of the vector (“Mexican”), the fourth element of ‘Ethnicity’ to the second element of the vector (“Filipino”), and so on.

Inspect the differences in how may rows these lines of code produce

soldiers %>% 
  filter(Ethnicity %in% c("Mexican", "Filipino"))

soldiers %>% 
  filter(Ethnicity == c("Mexican", "Filipino"))

Run this code chunk line by line. Inspect the differences.

# Create a data frame
df <- data.frame(
  Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
  Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
  stringsAsFactors = FALSE
)

# Investigate the data frame
df

# Filter using %in%
df %>% filter(Ethnicity %in% c("Mexican", "Filipino"))

# Filter using ==
df %>% filter(Ethnicity == c("Mexican", "Filipino"))

2.3 `summarise()`

Restart R. (Click Session -> Restart R)
Load the tidyverse and the here package

Code

library(tidyverse)
library(here)

Load the soldiersdata and assign it to an object called soldiers

Code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

Calculate the mean and standard deviation of `footlength`

Code

soldiers %>% summarise(
  footlength_avg = mean(footlength),
  footlength_sd = sd(footlength))

Calculate the median and interquartile range of `earlength`

HINT

use the IQR() function

Code

soldiers %>% 
  summarise(
    earlength_median = median(earlength),
    earlength_iqr = IQR(earlength))

Count the number of rows where `WritingPreference` is equal to “Right hand”

Code

soldiers %>%  
  summarise(
    n_righthanded = sum(WritingPreference == "Right hand"))

How old is the oldest soldier?

HINT if you can’t work out why get an NA value

Many Base R functions, including mean(), does not ignore NA values by default. This means that if the vector contains an NA value the result will be NA. Is this a good or bad thing?
You can set the argument na.rm = TRUE, to ignore missing values.

Code

soldiers %>% 
  summarise(
    age_max = max(age, na.rm = TRUE))

Calculate the mean weight of the Females

HINT if you can’t work out why get an NA value

Code

soldiers %>% 
  filter(sex == "Female") %>% 
  summarise(
    weight_avg = mean(weightkg, na.rm = TRUE))

Calculate the range in weight (max-min) within Males

Code

soldiers %>% 
  filter(sex == "Male") %>% 
  summarise(
    weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE))

2.4 `group_by()` and `arrange()`

Restart R. (Click Session -> Restart R)
Load the tidyverse and the here package

Code

library(tidyverse)
library(here)

Load the soldiersdata and assign it to an object called soldiers

Code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s

Code

soldiers %>% 
  group_by(Installation) %>% 
  summarise(weight_avg = mean(weightkg, na.rm = TRUE),
            weight_sd = sd(weightkg, na.rm = TRUE),
            age_avg = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE))

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es

Code

soldiers %>% 
  group_by(Installation, sex) %>% 
  summarise(weight_avg = mean(weightkg, na.rm = TRUE),
            weight_sd = sd(weightkg, na.rm = TRUE),
            age_avg = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE))

Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`

Hint

To count missings, use the functions sum() and is.na()

Code

soldiers %>% 
  group_by(Installation) %>% 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)))

As before, but now also add the number of observations (rows) within each `Installation`

Hint

Use n()

Code

soldiers %>% 
  group_by(Installation) %>% 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)),
            n = n())

As before, but now arrange the output after number of soldiers at each `Installation` in descending order.

Code

soldiers %>% 
  group_by(Installation) %>% 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)),
            n = n()) %>% 
  arrange(desc(n))

2.5 `mutate()`

Restart R. (Click Session -> Restart R)
Load the tidyverse and the here package

Code

library(tidyverse)
library(here)

Load the soldiersdata and assign it to an object called soldiers

Code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

Add a column called `heightcm` with the height of the soldiers in cm

Update the soldiers dataset with the new variable
place the new variable after Heightin

Code

soldiers <- soldiers %>% 
  mutate(
    heightcm = Heightin * 2.54,
    .after = Heightin)

Update the `weightkg` column to kg instead of kg*10

Update the soldiers dataset with the new weightkg column

Code

soldiers <- soldiers %>% 
  mutate(
    weightkg = weightkg/10
    )

Add a column called `BMI` with the Body mass index (BMI) of the soldiers

BMI

Update the soldiers dataset to with the new variable
place the new variable after weightkg

Code

soldiers <- soldiers %>% 
  mutate(BMI = weightkg/(heightcm/100)^2,
         .after = weightkg)

Add a column called `obese` that contains the value TRUE if BMI is > 30

Code

soldiers %>% 
  mutate(
    obese = if_else(BMI > 30, TRUE, FALSE),
    .before = 1 # This line code just places the variable at the front
  )

# OR

soldiers %>% 
  mutate(
    obese = BMI > 30,
    .before = 1 # This line code just places the variable at the front
  )

Inspect the below table from Wikipedia

Category	BMI (kg/m²)
Underweight (Severe thinness)	< 16.0
Underweight (Moderate thinness)	16.0 – 16.9
Underweight (Mild thinness)	17.0 – 18.4
Normal range	18.5 – 24.9
Overweight (Pre-obese)	25.0 – 29.9
Obese (Class I)	30.0 – 34.9
Obese (Class II)	35.0 – 39.9
Obese (Class III)	>= 40.0

Create the variable `category` that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

Update the soldiers dataset with the new variable
place the new variable after BMI

Hint 1

Use case_when()

Hint 2

soldiers %>% 
  mutate(
    category = ????
    )

Hint 3

soldiers %>% 
  mutate(
    category = case_when(
      #TEST HERE ~ OUTPUT, 
      #TEST HERE ~ OUTPUT,
      #TEST HERE ~ OUTPUT,
      #.default = OUTPUT
    )
    )

Code

soldiers <- soldiers %>% 
  mutate(
    category = case_when(
      BMI < 18.5 ~ "Underweight",
      BMI < 25   ~ "Normal range",
      BMI < 30   ~ "Overweight",
      BMI >= 30  ~ "Obese",
      .default = NA),
    .after = BMI
    )

2.6 `count()`

For simple counting count() is faster than summarise(n = n()) or mutate(n = n())

What is this code equivalent to?

diamonds %>% 
  count()

ANSWER

count() works like summarise(n = n())

What is this code equivalent to?

diamonds %>% 
  count(cut)

Answer

# The code is equivalent to:
diamonds %>% group_by(cut) %>% summarise(n = n())

# OR

diamonds %>% summarise(n = n(), .by = cut)

What is this code equivalent to?

diamonds %>% 
  count(cut, color)

Answer

# The code is equivalent to:
diamonds %>% group_by(cut, color) %>% summarise(n = n())

# OR

diamonds %>% summarise(n = n(), .by = c(cut, color))


# However, notice that the first solution returns a grouped tibble

What is this code equivalent to?

diamonds %>% 
  add_count()

ANSWER

add_count() works like mutate(n = n())

Count the number of diamonds within each type of `cut` and calculate the percentage of each cut

Code

diamonds %>% 
  count(cut) %>% 
  mutate(percent = n/sum(n)*100)

Inspect the output of this code. What is going wrong?

diamonds %>% 
  group_by(cut) %>% 
  count() %>% 
  mutate(percent = n/sum(n)*100)

# A tibble: 5 × 3
# Groups:   cut [5]
  cut           n percent
  <ord>     <int>   <dbl>
1 Fair       1610     100
2 Good       4906     100
3 Very Good 12082     100
4 Premium   13791     100
5 Ideal     21551     100

--- title: "dplyr" subtitle: "+ here and skimr" author: "Steen Flammild Harsted" date: today output: distill::distill_article: self_contained: false format: html: toc: true toc-depth: 2 number-sections: true number-depth: 2 code-fold: true code-summary: "Show the code" code-tools: true execute: message: false warning: false --- # Getting started ## Load the `tidyverse` and the `here` package {.unnumbered} ```{r} library(tidyverse) library(here) ``` \ ## `here()` ### Try out the `here()` function * What happens if you write `here()` ? * What happens if you write `here("raw_data")` ? * Compare your output with your neighbors. * Compare your output with someone who has a different operating system (Windows, Mac, Linux) than you. (Hint: look for \\ /) * Discuss if and how the `here()`function could ever be of any use to anybody \ ### Load the `soldiers` dataset Use the function `read_csv2()` \ The `file` argument should be `here("raw_data", "soldiers.csv")` ```{r read_csv} #| eval: false read_csv2(here("raw_data", "soldiers.csv")) ``` \ ### Assign the `soldiers` dataset to an object called `soldiers` Great, now read the data again and assign it (remember `<-`) to an object called `soldiers` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ## `skim()` There are many ways to explore a dataframe. In this course we will take a shortcut and use the `skim()` function from the `skimr` package. The `skim()` function provides an exellent overview of a dataframe. Load the `skimr` package and use the `skim()` function to get an impression of the `soldiers` dataset. Discuss with your neighbor: * Nr of rows? * Nr of columns? * Missing values? * Types of variables? * How many unique values does the different character variables have? * Any fake data? *(hint: Yes, for educational purposes I have added some fake data)* ```{r} #| eval: false library(skimr) skim(soldiers) ``` # `dplyr` The main functions of dplyr that you will practice today are: * `select()` * `filter()` * `summarise()` * `group_by()` * `arrange()` * `mutate()` * `count()` ## `select()` ### select the columns `subjectid`, `sex`, `age` ```{r} #| eval: false soldiers %>% select(subjectid, sex, age) ``` \ ### select the columns 1, 3, 5:7 ```{r} #| eval: false soldiers %>% select(1,3,5:7) ``` \ ### remove the columns 3:5 ```{r} #| eval: false soldiers %>% select(-(3:5)) ``` \ ### select all columns that contains the word "circumference" ::: {.callout-tip collapse=true} #### HINT Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions. ::: ```{r soldiers_6} #| eval: false soldiers %>% select(contains("circumference")) ``` ### remove all columns containing the letter "c" ::: {.callout-tip collapse=true} #### HINT Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions. Use a minus sign. ::: ```{r soldiers_7} #| eval: false soldiers %>% select(-contains("c")) ``` ### select all columns that contains a "c" OR "x" OR "y" OR "z" ::: {.callout-tip collapse=true} #### HINT In R(and many other programming languages) the \| sign is used as a logical operator for OR. ::: ```{r} #| eval: false soldiers %>% select(contains("c") | contains("x") | contains("y") | contains("z")) ``` ### select all columns that contains a "c" OR "x" OR "y" OR "z" This time use the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper function called `matches()`\ `matches()` allows you to use logical operators inside the function. E.g., `matches("this|that")` ```{r soldiers_8} #| eval: false soldiers %>% select(matches("c|x|y|z")) ``` ### Challenge: Why not always use `matches()`? Use the preloaded `iris` data-set. (just write `iris`)\ Try to use `matches()` to select all columns containing a "." ```{r} #| eval: false iris %>% select(matches(".")) %>% head() # This line is just to prevent a very long output. ``` Why did we select the `Species` column?\ What happens if we use `contains()` instead ```{r} #| eval: false iris %>% select(contains(".")) %>% head() # This line is just to prevent a very long output. ``` What is the difference in the output? Why is it different? <details><summary>ANSWER</summary> `contains()`: Contains a literal string.\ `matches()`: Matches a [regular expression](https://en.wikipedia.org/wiki/Regular_expression).\ In regular expressions `.` is a wildcard. > The wildcard . matches any character. For example, a.b matches any string that contains an "a", and then any character and then "b"; and a.\*b matches any string that contains an "a", and then the character "b" at some later point.\ > [*https://en.wikipedia.org/wiki/Regular_expression*](https://en.wikipedia.org/wiki/Regular_expression){.uri} </details> ## `filter()` Restart R. (Click Session -> Restart R)\ Load the `tidyverse` and the `here` package ```{r} library(tidyverse) library(here) ``` \ Load the `soldiers `data and assign it to an object called `soldiers` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ### Keep all rows where `sex` is Female: ::: {.callout-tip collapse=true} #### Hint `???? == "Female"` ::: ```{r} #| eval: false soldiers %>% filter(sex == "Female") ``` ### Keep all rows where `weightkg` is missing (NA value) ::: {.callout-tip collapse=true} #### Hint use the `is.na()` function ::: ```{r} #| eval: false soldiers %>% filter(is.na(weightkg)) ``` ### Keep all rows where `WritingPreference` is "Left hand" AND `sex` is "Female" ```{r} #| eval: false soldiers %>% filter(WritingPreference == "Left hand" , sex == "Female") # you can use & instead of a , ``` ### Keep all rows where `WritingPreference` is "Left hand" OR `sex` is "Female" ```{r} #| eval: false soldiers %>% filter(WritingPreference == "Left hand" | sex == "Female") ``` ### What is going wrong in this code? ```{r} #| eval: false #| code-fold: false soldiers %>% select(1:5) %>% filter(WritingPreference == "Left hand" | sex == "Female") ``` The error message is: ```{r} #| eval: false #| code-fold: false Error in `filter()`: ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`. Caused by error: ! object 'WritingPreference' not found Run `rlang::last_error()` to see where the error occurred. ``` ```{r} #| eval: false #| code-summary: "ANSWER" The variable WritingPreference was not selected in the first line. ``` ### Keep all rows where `age` is above 30 and the `weightkg` is below 600 ```{r} #| eval: false soldiers %>% filter(age > 30, weightkg < 600) ``` \ ### Keep all rows where `Ethnicity` is either "Mexican" OR "Filipino" OR "Caribbean Islander" ::: {.callout-tip collapse=true} #### Hint you need to use `%in%` and `c()` ::: ```{r} #| eval: false soldiers %>% filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander")) ``` ::: {.callout-caution collapse=true} #### Difference between `==` and `%in%` tldr - Use `%in%` instead of `==` when you want to filter for multiple values. Read on if you want to understand why. (You don't have to) The code `filter(Ethnicity == c("Mexican", "Filipino"))` is likely not doing what you expect. The '==' operator does an element-wise comparison, which means it compares the first element of 'Ethnicity' to the first element of the vector ("Mexican"), the second element of 'Ethnicity' to the second element of the vector ("Filipino"). The short vector is then recycled so now the third element of 'Ethnicity' is compared to the first element of the vector ("Mexican"), the fourth element of 'Ethnicity' to the second element of the vector ("Filipino"), and so on. Inspect the differences in how may rows these lines of code produce ```{r} #| code-fold: false #| eval: false soldiers %>% filter(Ethnicity %in% c("Mexican", "Filipino")) soldiers %>% filter(Ethnicity == c("Mexican", "Filipino")) ``` Run this code chunk line by line. Inspect the differences. ```{r} #| code-fold: false #| eval: false # Create a data frame df <- data.frame( Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"), Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"), stringsAsFactors = FALSE ) # Investigate the data frame df # Filter using %in% df %>% filter(Ethnicity %in% c("Mexican", "Filipino")) # Filter using == df %>% filter(Ethnicity == c("Mexican", "Filipino")) ``` ::: ## `summarise()` Restart R. (Click Session -> Restart R)\ Load the `tidyverse` and the `here` package ```{r} library(tidyverse) library(here) ``` \ Load the `soldiers `data and assign it to an object called `soldiers` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ### Calculate the mean and standard deviation of `footlength` ```{r} #| eval: false soldiers %>% summarise( footlength_avg = mean(footlength), footlength_sd = sd(footlength)) ``` ### Calculate the median and interquartile range of `earlength` <details><summary>HINT</summary> use the `IQR()` function </details> ```{r} #| eval: false soldiers %>% summarise( earlength_median = median(earlength), earlength_iqr = IQR(earlength)) ``` ### Count the number of rows where `WritingPreference` is equal to "Right hand" ```{r} #| eval: false soldiers %>% summarise( n_righthanded = sum(WritingPreference == "Right hand")) ``` \ ### How old is the oldest soldier? <details> <summary> HINT if you can't work out why get an `NA` value </summary> Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\ You can set the argument `na.rm = TRUE`, to ignore missing values. </details> ```{r} #| eval: false soldiers %>% summarise( age_max = max(age, na.rm = TRUE)) ``` \ ### Calculate the mean weight of the Females <details> <summary> HINT if you can't work out why get an `NA` value </summary> Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\ You can set the argument `na.rm = TRUE`, to ignore missing values. </details> ```{r} #| eval: false soldiers %>% filter(sex == "Female") %>% summarise( weight_avg = mean(weightkg, na.rm = TRUE)) ``` ### Calculate the range in weight (max-min) within Males ```{r} #| eval: false soldiers %>% filter(sex == "Male") %>% summarise( weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE)) ``` ## `group_by()` and `arrange()` Restart R. (Click Session -> Restart R)\ Load the `tidyverse` and the `here` package ```{r} library(tidyverse) library(here) ``` \ Load the `soldiers `data and assign it to an object called `soldiers` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s ```{r} #| eval: false soldiers %>% group_by(Installation) %>% summarise(weight_avg = mean(weightkg, na.rm = TRUE), weight_sd = sd(weightkg, na.rm = TRUE), age_avg = mean(age, na.rm = TRUE), age_sd = sd(age, na.rm = TRUE)) ``` ### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es ```{r} #| eval: false soldiers %>% group_by(Installation, sex) %>% summarise(weight_avg = mean(weightkg, na.rm = TRUE), weight_sd = sd(weightkg, na.rm = TRUE), age_avg = mean(age, na.rm = TRUE), age_sd = sd(age, na.rm = TRUE)) ``` ### Calcualate the average height for each `Installation` and count the number of missing values within each `Installation` ::: {.callout-tip collapse=true} #### Hint To count missings, use the functions `sum()` and `is.na()` ::: ```{r} #| eval: false soldiers %>% group_by(Installation) %>% summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin))) ``` ### As before, but now also add the number of observations (rows) within each `Installation` ::: {.callout-tip collapse=true} #### Hint Use `n()` ::: ```{r} #| eval: false soldiers %>% group_by(Installation) %>% summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin)), n = n()) ``` ### As before, but now arrange the output after number of soldiers at each `Installation` in descending order. ```{r} #| eval: false soldiers %>% group_by(Installation) %>% summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin)), n = n()) %>% arrange(desc(n)) ``` ## `mutate()` Restart R. (Click Session -> Restart R)\ Load the `tidyverse` and the `here` package ```{r} library(tidyverse) library(here) ``` \ Load the `soldiers `data and assign it to an object called `soldiers` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ### Add a column called `heightcm` with the height of the soldiers in cm * Update the `soldiers` dataset with the new variable * place the new variable after `Heightin` ```{r} #| output: false soldiers <- soldiers %>% mutate( heightcm = Heightin * 2.54, .after = Heightin) ``` \ ### Update the `weightkg` column to kg instead of kg*10 * Update the `soldiers` dataset with the new `weightkg` column ```{r} #| output: false soldiers <- soldiers %>% mutate( weightkg = weightkg/10 ) ``` \ ### Add a column called `BMI` with the Body mass index (BMI) of the soldiers [BMI](https://en.wikipedia.org/wiki/Body_mass_index) * Update the `soldiers` dataset to with the new variable * place the new variable after `weightkg` ```{r} #| output: false soldiers <- soldiers %>% mutate(BMI = weightkg/(heightcm/100)^2, .after = weightkg) ``` \ ### Add a column called `obese` that contains the value TRUE if BMI is > 30 ```{r} #| eval: false soldiers %>% mutate( obese = if_else(BMI > 30, TRUE, FALSE), .before = 1 # This line code just places the variable at the front ) # OR soldiers %>% mutate( obese = BMI > 30, .before = 1 # This line code just places the variable at the front ) ``` \ ### Inspect the below table from [Wikipedia](https://en.wikipedia.org/wiki/Body_mass_index) {#sec-BMI} ```{r} #| echo: false tibble::tribble( ~Category, ~`BMI (kg/m^2^)`, "Underweight (Severe thinness)", "< 16.0", "Underweight (Moderate thinness)", "16.0 – 16.9", "Underweight (Mild thinness)", "17.0 – 18.4", "Normal range", "18.5 – 24.9", "Overweight (Pre-obese)", "25.0 – 29.9", "Obese (Class I)", "30.0 – 34.9", "Obese (Class II)", "35.0 – 39.9", "Obese (Class III)", ">= 40.0", ) %>% knitr::kable() ``` \ ### Create the variable `category` that tells us whether the soldiers are "Underweight", "Normal range", "Overweight", or "Obese" * Update the `soldiers` dataset with the new variable * place the new variable after `BMI`\ ::: {.callout-tip collapse=true} #### Hint 1 Use `case_when()` ::: ::: {.callout-tip collapse=true} #### Hint 2 ```{r} #| eval: false #| code-fold: false soldiers %>% mutate( category = ???? ) ``` ::: ::: {.callout-tip collapse=true} #### Hint 3 ```{r} #| eval: false #| code-fold: false soldiers %>% mutate( category = case_when( #TEST HERE ~ OUTPUT, #TEST HERE ~ OUTPUT, #TEST HERE ~ OUTPUT, #.default = OUTPUT ) ) ``` ::: ```{r} #| output: false soldiers <- soldiers %>% mutate( category = case_when( BMI < 18.5 ~ "Underweight", BMI < 25 ~ "Normal range", BMI < 30 ~ "Overweight", BMI >= 30 ~ "Obese", .default = NA), .after = BMI ) ``` ## `count()` For simple counting `count()` is faster than `summarise(n = n())` or `mutate(n = n())` ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds %>% count() ``` <details><summary>ANSWER</summary> `count()` works like `summarise(n = n())` </details> ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds %>% count(cut) ``` ```{r} #| code-fold: true #| code-summary: "Answer" #| eval: false # The code is equivalent to: diamonds %>% group_by(cut) %>% summarise(n = n()) # OR diamonds %>% summarise(n = n(), .by = cut) ``` ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds %>% count(cut, color) ``` ```{r} #| code-fold: true #| code-summary: "Answer" #| eval: false # The code is equivalent to: diamonds %>% group_by(cut, color) %>% summarise(n = n()) # OR diamonds %>% summarise(n = n(), .by = c(cut, color)) # However, notice that the first solution returns a grouped tibble ``` ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds %>% add_count() ``` <details><summary>ANSWER</summary> `add_count()` works like `mutate(n = n())` </details> ### Count the number of diamonds within each type of `cut` and calculate the percentage of each cut ```{r} #| output: false diamonds %>% count(cut) %>% mutate(percent = n/sum(n)*100) ``` ### Inspect the output of this code. What is going wrong? ```{r} #| code-fold: false diamonds %>% group_by(cut) %>% count() %>% mutate(percent = n/sum(n)*100) ``` <script src="https://giscus.app/client.js" data-repo="sorenoneill/r4phd" data-repo-id="R_kgDOJ9etDQ" data-category="Announcements" data-category-id="DIC_kwDOJ9etDc4CYApF" data-mapping="pathname" data-strict="0" data-reactions-enabled="0" data-emit-metadata="0" data-input-position="bottom" data-theme="cobalt" data-lang="en" crossorigin="anonymous" async> </script>

1 Getting started

Load the tidyverse and the here package

1.1 here()

Try out the here() function

Load the soldiers dataset

Assign the soldiers dataset to an object called soldiers

1.2 skim()

2 dplyr

2.1 select()

select the columns subjectid, sex, age

select the columns 1, 3, 5:7

remove the columns 3:5

select all columns that contains the word “circumference”

remove all columns containing the letter “c”

select all columns that contains a “c” OR “x” OR “y” OR “z”

select all columns that contains a “c” OR “x” OR “y” OR “z”

Challenge: Why not always use matches()?

2.2 filter()

Keep all rows where sex is Female:

Keep all rows where weightkg is missing (NA value)

Keep all rows where WritingPreference is “Left hand” AND sex is “Female”

Keep all rows where WritingPreference is “Left hand” OR sex is “Female”

What is going wrong in this code?

Keep all rows where age is above 30 and the weightkg is below 600

Keep all rows where Ethnicity is either “Mexican” OR “Filipino” OR “Caribbean Islander”

2.3 summarise()

Calculate the mean and standard deviation of footlength

Calculate the median and interquartile range of earlength

Count the number of rows where WritingPreference is equal to “Right hand”

How old is the oldest soldier?

Calculate the mean weight of the Females

Calculate the range in weight (max-min) within Males

2.4 group_by() and arrange()

Calculate the mean and sd of weightkg and age for all Installations

Calculate the mean and sd of weightkg and age for all Installations for both sexes

Calcualate the average height for each Installation and count the number of missing values within each Installation

As before, but now also add the number of observations (rows) within each Installation

As before, but now arrange the output after number of soldiers at each Installation in descending order.

2.5 mutate()

Add a column called heightcm with the height of the soldiers in cm

Update the weightkg column to kg instead of kg*10

Add a column called BMI with the Body mass index (BMI) of the soldiers

Add a column called obese that contains the value TRUE if BMI is > 30

Inspect the below table from Wikipedia

Create the variable category that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

2.6 count()

What is this code equivalent to?

What is this code equivalent to?

What is this code equivalent to?

What is this code equivalent to?

Count the number of diamonds within each type of cut and calculate the percentage of each cut

Inspect the output of this code. What is going wrong?

Load the `tidyverse` and the `here` package

1.1 `here()`

Try out the `here()` function

Load the `soldiers` dataset

Assign the `soldiers` dataset to an object called `soldiers`

1.2 `skim()`

2 `dplyr`

2.1 `select()`

select the columns `subjectid`, `sex`, `age`

Challenge: Why not always use `matches()`?

2.2 `filter()`

Keep all rows where `sex` is Female:

Keep all rows where `weightkg` is missing (NA value)

Keep all rows where `WritingPreference` is “Left hand” AND `sex` is “Female”

Keep all rows where `WritingPreference` is “Left hand” OR `sex` is “Female”

Keep all rows where `age` is above 30 and the `weightkg` is below 600

Keep all rows where `Ethnicity` is either “Mexican” OR “Filipino” OR “Caribbean Islander”

2.3 `summarise()`

Calculate the mean and standard deviation of `footlength`

Calculate the median and interquartile range of `earlength`

Count the number of rows where `WritingPreference` is equal to “Right hand”

2.4 `group_by()` and `arrange()`

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es

Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`

As before, but now also add the number of observations (rows) within each `Installation`

As before, but now arrange the output after number of soldiers at each `Installation` in descending order.

2.5 `mutate()`

Add a column called `heightcm` with the height of the soldiers in cm

Update the `weightkg` column to kg instead of kg*10

Add a column called `BMI` with the Body mass index (BMI) of the soldiers

Add a column called `obese` that contains the value TRUE if BMI is > 30

Create the variable `category` that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

2.6 `count()`

Count the number of diamonds within each type of `cut` and calculate the percentage of each cut