Code
library(tidyverse)
library(here)
+ here and skimr
Steen Flammild Harsted
January 10, 2024
tidyverse
and the here
package
here()
here()
functionhere()
?here("raw_data")
?here()
function could ever be of any use to anybody
soldiers
datasetUse the function read_csv2()
The file
argument should be here("raw_data", "soldiers.csv")
soldiers
dataset to an object called soldiers
Great, now read the data again and assign it (remember <-
) to an object called soldiers
skim()
There are many ways to explore a dataframe. In this course we will take a shortcut and use the skim()
function from the skimr
package. The skim()
function provides an exellent overview of a dataframe.
Load the skimr
package and use the skim()
function to get an impression of the soldiers
dataset.
Discuss with your neighbor:
dplyr
The main functions of dplyr that you will practice today are:
select()
filter()
summarise()
group_by()
arrange()
mutate()
count()
select()
subjectid
, sex
, age
Use one of the tidyselect helper functions.
Use one of the tidyselect helper functions.
Use a minus sign.
In R(and many other programming languages) the | sign is used as a logical operator for OR.
This time use the tidyselect helper function called matches()
matches()
allows you to use logical operators inside the function. E.g., matches("this|that")
matches()
?Use the preloaded iris
data-set. (just write iris
)
Try to use matches()
to select all columns containing a “.”
Why did we select the Species
column?
What happens if we use contains()
instead
What is the difference in the output? Why is it different?
contains()
: Contains a literal string.
matches()
: Matches a regular expression.
In regular expressions .
is a wildcard.
The wildcard . matches any character. For example, a.b matches any string that contains an “a”, and then any character and then “b”; and a.*b matches any string that contains an “a”, and then the character “b” at some later point.
https://en.wikipedia.org/wiki/Regular_expression
filter()
Restart R. (Click Session -> Restart R)
Load the tidyverse
and the here
package
Load the soldiers
data and assign it to an object called soldiers
sex
is Female:???? == "Female"
weightkg
is missing (NA value)use the is.na()
function
WritingPreference
is “Left hand” AND sex
is “Female”WritingPreference
is “Left hand” OR sex
is “Female”The error message is:
age
is above 30 and the weightkg
is below 600
Ethnicity
is either “Mexican” OR “Filipino” OR “Caribbean Islander”you need to use %in%
and c()
==
and %in%
tldr - Use %in%
instead of ==
when you want to filter for multiple values.
Read on if you want to understand why. (You don’t have to)
The code filter(Ethnicity == c("Mexican", "Filipino"))
is likely not doing what you expect. The ‘==’ operator does an element-wise comparison, which means it compares the first element of ‘Ethnicity’ to the first element of the vector (“Mexican”), the second element of ‘Ethnicity’ to the second element of the vector (“Filipino”). The short vector is then recycled so now the third element of ‘Ethnicity’ is compared to the first element of the vector (“Mexican”), the fourth element of ‘Ethnicity’ to the second element of the vector (“Filipino”), and so on.
Inspect the differences in how may rows these lines of code produce
Run this code chunk line by line. Inspect the differences.
# Create a data frame
df <- data.frame(
Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
stringsAsFactors = FALSE
)
# Investigate the data frame
df
# Filter using %in%
df %>% filter(Ethnicity %in% c("Mexican", "Filipino"))
# Filter using ==
df %>% filter(Ethnicity == c("Mexican", "Filipino"))
summarise()
Restart R. (Click Session -> Restart R)
Load the tidyverse
and the here
package
Load the soldiers
data and assign it to an object called soldiers
footlength
earlength
IQR()
function
WritingPreference
is equal to “Right hand”
HINT if you can’t work out why get an NA
value
Many Base R functions, including mean()
, does not ignore NA values by default. This means that if the vector contains an NA
value the result will be NA
. Is this a good or bad thing?
You can set the argument na.rm = TRUE
, to ignore missing values.
HINT if you can’t work out why get an NA
value
Many Base R functions, including mean()
, does not ignore NA values by default. This means that if the vector contains an NA
value the result will be NA
. Is this a good or bad thing?
You can set the argument na.rm = TRUE
, to ignore missing values.
group_by()
and arrange()
Restart R. (Click Session -> Restart R)
Load the tidyverse
and the here
package
Load the soldiers
data and assign it to an object called soldiers
weightkg
and age
for all Installation
sweightkg
and age
for all Installation
s for both sex
esInstallation
and count the number of missing values within each Installation
To count missings, use the functions sum()
and is.na()
Installation
Use n()
Installation
in descending order.mutate()
Restart R. (Click Session -> Restart R)
Load the tidyverse
and the here
package
Load the soldiers
data and assign it to an object called soldiers
heightcm
with the height of the soldiers in cmsoldiers
dataset with the new variableHeightin
weightkg
column to kg instead of kg*10soldiers
dataset with the new weightkg
column
BMI
with the Body mass index (BMI) of the soldierssoldiers
dataset to with the new variableweightkg
obese
that contains the value TRUE if BMI is > 30
Category | BMI (kg/m2) |
---|---|
Underweight (Severe thinness) | < 16.0 |
Underweight (Moderate thinness) | 16.0 – 16.9 |
Underweight (Mild thinness) | 17.0 – 18.4 |
Normal range | 18.5 – 24.9 |
Overweight (Pre-obese) | 25.0 – 29.9 |
Obese (Class I) | 30.0 – 34.9 |
Obese (Class II) | 35.0 – 39.9 |
Obese (Class III) | >= 40.0 |
category
that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”soldiers
dataset with the new variableBMI
Use case_when()
count()
For simple counting count()
is faster than summarise(n = n())
or mutate(n = n())
count()
works like summarise(n = n())
add_count()
works like mutate(n = n())
cut
and calculate the percentage of each cut---
title: "dplyr"
subtitle: "+ here and skimr"
author: "Steen Flammild Harsted"
date: today
output:
distill::distill_article:
self_contained: false
format:
html:
toc: true
toc-depth: 2
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
<br><br>
# Getting started
## Load the `tidyverse` and the `here` package {.unnumbered}
```{r}
library(tidyverse)
library(here)
```
\
## `here()`
### Try out the `here()` function
* What happens if you write `here()` ?
* What happens if you write `here("raw_data")` ?
* Compare your output with your neighbors.
* Compare your output with someone who has a different operating system (Windows, Mac, Linux) than you. (Hint: look for \\ /)
* Discuss if and how the `here()`function could ever be of any use to anybody
\
### Load the `soldiers` dataset
Use the function `read_csv2()` \
The `file` argument should be `here("raw_data", "soldiers.csv")`
```{r read_csv}
#| eval: false
read_csv2(here("raw_data", "soldiers.csv"))
```
\
### Assign the `soldiers` dataset to an object called `soldiers`
Great, now read the data again and assign it (remember `<-`) to an object called `soldiers`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
## `skim()`
There are many ways to explore a dataframe. In this course we will take a shortcut and use the `skim()` function from the `skimr` package.
The `skim()` function provides an exellent overview of a dataframe.
Load the `skimr` package and use the `skim()` function to get an impression of the `soldiers` dataset.
Discuss with your neighbor:
* Nr of rows?
* Nr of columns?
* Missing values?
* Types of variables?
* How many unique values does the different character variables have?
* Any fake data? *(hint: Yes, for educational purposes I have added some fake data)*
```{r}
#| eval: false
library(skimr)
skim(soldiers)
```
<br><br><br><br><br><br><br><br>
# `dplyr`
The main functions of dplyr that you will practice today are:
* `select()`
* `filter()`
* `summarise()`
* `group_by()`
* `arrange()`
* `mutate()`
* `count()`
<br><br>
## `select()`
### select the columns `subjectid`, `sex`, `age`
```{r}
#| eval: false
soldiers %>%
select(subjectid, sex, age)
```
\
### select the columns 1, 3, 5:7
```{r}
#| eval: false
soldiers %>%
select(1,3,5:7)
```
\
### remove the columns 3:5
```{r}
#| eval: false
soldiers %>%
select(-(3:5))
```
\
### select all columns that contains the word "circumference"
::: {.callout-tip collapse=true}
#### HINT
Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions.
:::
```{r soldiers_6}
#| eval: false
soldiers %>%
select(contains("circumference"))
```
<br>
### remove all columns containing the letter "c"
::: {.callout-tip collapse=true}
#### HINT
Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions.
Use a minus sign.
:::
```{r soldiers_7}
#| eval: false
soldiers %>%
select(-contains("c"))
```
<br>
### select all columns that contains a "c" OR "x" OR "y" OR "z"
::: {.callout-tip collapse=true}
#### HINT
In R(and many other programming languages) the \| sign is used as a logical operator for OR.
:::
```{r}
#| eval: false
soldiers %>%
select(contains("c") | contains("x") | contains("y") | contains("z"))
```
<br>
### select all columns that contains a "c" OR "x" OR "y" OR "z"
This time use the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper function called `matches()`\
`matches()` allows you to use logical operators inside the function. E.g., `matches("this|that")`
```{r soldiers_8}
#| eval: false
soldiers %>%
select(matches("c|x|y|z"))
```
<br>
### Challenge: Why not always use `matches()`?
Use the preloaded `iris` data-set. (just write `iris`)\
Try to use `matches()` to select all columns containing a "."
```{r}
#| eval: false
iris %>%
select(matches(".")) %>%
head() # This line is just to prevent a very long output.
```
<br>
Why did we select the `Species` column?\
What happens if we use `contains()` instead
```{r}
#| eval: false
iris %>%
select(contains(".")) %>%
head() # This line is just to prevent a very long output.
```
<br>
What is the difference in the output? Why is it different?
<details><summary>ANSWER</summary>
`contains()`: Contains a literal string.\
`matches()`: Matches a [regular expression](https://en.wikipedia.org/wiki/Regular_expression).\
In regular expressions `.` is a wildcard.
> The wildcard . matches any character. For example, a.b matches any string that contains an "a", and then any character and then "b"; and a.\*b matches any string that contains an "a", and then the character "b" at some later point.\
> [*https://en.wikipedia.org/wiki/Regular_expression*](https://en.wikipedia.org/wiki/Regular_expression){.uri}
</details>
<br><br><br><br><br><br><br><br>
## `filter()`
Restart R. (Click Session -> Restart R)\
Load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
\
Load the `soldiers `data and assign it to an object called `soldiers`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
### Keep all rows where `sex` is Female:
::: {.callout-tip collapse=true}
#### Hint
`???? == "Female"`
:::
```{r}
#| eval: false
soldiers %>% filter(sex == "Female")
```
<br>
### Keep all rows where `weightkg` is missing (NA value)
::: {.callout-tip collapse=true}
#### Hint
use the `is.na()` function
:::
```{r}
#| eval: false
soldiers %>%
filter(is.na(weightkg))
```
<br>
### Keep all rows where `WritingPreference` is "Left hand" AND `sex` is "Female"
```{r}
#| eval: false
soldiers %>%
filter(WritingPreference == "Left hand" , sex == "Female") # you can use & instead of a ,
```
<br>
### Keep all rows where `WritingPreference` is "Left hand" OR `sex` is "Female"
```{r}
#| eval: false
soldiers %>%
filter(WritingPreference == "Left hand" | sex == "Female")
```
<br>
### What is going wrong in this code?
```{r}
#| eval: false
#| code-fold: false
soldiers %>%
select(1:5) %>%
filter(WritingPreference == "Left hand" | sex == "Female")
```
The error message is:
```{r}
#| eval: false
#| code-fold: false
Error in `filter()`:
ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`.
Caused by error:
! object 'WritingPreference' not found
Run `rlang::last_error()` to see where the error occurred.
```
```{r}
#| eval: false
#| code-summary: "ANSWER"
The variable WritingPreference was not selected in the first line.
```
<br>
### Keep all rows where `age` is above 30 and the `weightkg` is below 600
```{r}
#| eval: false
soldiers %>%
filter(age > 30, weightkg < 600)
```
\
### Keep all rows where `Ethnicity` is either "Mexican" OR "Filipino" OR "Caribbean Islander"
::: {.callout-tip collapse=true}
#### Hint
you need to use `%in%` and `c()`
:::
```{r}
#| eval: false
soldiers %>%
filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander"))
```
::: {.callout-caution collapse=true}
#### Difference between `==` and `%in%`
tldr - Use `%in%` instead of `==` when you want to filter for multiple values.
Read on if you want to understand why. (You don't have to)
The code `filter(Ethnicity == c("Mexican", "Filipino"))` is likely not doing what you expect. The '==' operator does an element-wise comparison, which means it compares the first element of 'Ethnicity' to the first element of the vector ("Mexican"), the second element of 'Ethnicity' to the second element of the vector ("Filipino"). The short vector is then recycled so now the third element of 'Ethnicity' is compared to the first element of the vector ("Mexican"), the fourth element of 'Ethnicity' to the second element of the vector ("Filipino"), and so on.
Inspect the differences in how may rows these lines of code produce
```{r}
#| code-fold: false
#| eval: false
soldiers %>%
filter(Ethnicity %in% c("Mexican", "Filipino"))
soldiers %>%
filter(Ethnicity == c("Mexican", "Filipino"))
```
Run this code chunk line by line. Inspect the differences.
```{r}
#| code-fold: false
#| eval: false
# Create a data frame
df <- data.frame(
Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
stringsAsFactors = FALSE
)
# Investigate the data frame
df
# Filter using %in%
df %>% filter(Ethnicity %in% c("Mexican", "Filipino"))
# Filter using ==
df %>% filter(Ethnicity == c("Mexican", "Filipino"))
```
:::
<br><br><br><br><br><br><br><br>
## `summarise()`
Restart R. (Click Session -> Restart R)\
Load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
\
Load the `soldiers `data and assign it to an object called `soldiers`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
### Calculate the mean and standard deviation of `footlength`
```{r}
#| eval: false
soldiers %>% summarise(
footlength_avg = mean(footlength),
footlength_sd = sd(footlength))
```
<br>
### Calculate the median and interquartile range of `earlength`
<details><summary>HINT</summary>
use the `IQR()` function
</details>
```{r}
#| eval: false
soldiers %>%
summarise(
earlength_median = median(earlength),
earlength_iqr = IQR(earlength))
```
<br>
### Count the number of rows where `WritingPreference` is equal to "Right hand"
```{r}
#| eval: false
soldiers %>%
summarise(
n_righthanded = sum(WritingPreference == "Right hand"))
```
\
### How old is the oldest soldier?
<details>
<summary>
HINT if you can't work out why get an `NA` value
</summary>
Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\
You can set the argument `na.rm = TRUE`, to ignore missing values.
</details>
```{r}
#| eval: false
soldiers %>%
summarise(
age_max = max(age, na.rm = TRUE))
```
\
### Calculate the mean weight of the Females
<details>
<summary>
HINT if you can't work out why get an `NA` value
</summary>
Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\
You can set the argument `na.rm = TRUE`, to ignore missing values.
</details>
```{r}
#| eval: false
soldiers %>%
filter(sex == "Female") %>%
summarise(
weight_avg = mean(weightkg, na.rm = TRUE))
```
<br>
### Calculate the range in weight (max-min) within Males
```{r}
#| eval: false
soldiers %>%
filter(sex == "Male") %>%
summarise(
weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE))
```
<br><br><br><br><br><br><br><br>
## `group_by()` and `arrange()`
Restart R. (Click Session -> Restart R)\
Load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
\
Load the `soldiers `data and assign it to an object called `soldiers`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s
```{r}
#| eval: false
soldiers %>%
group_by(Installation) %>%
summarise(weight_avg = mean(weightkg, na.rm = TRUE),
weight_sd = sd(weightkg, na.rm = TRUE),
age_avg = mean(age, na.rm = TRUE),
age_sd = sd(age, na.rm = TRUE))
```
<br>
### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es
```{r}
#| eval: false
soldiers %>%
group_by(Installation, sex) %>%
summarise(weight_avg = mean(weightkg, na.rm = TRUE),
weight_sd = sd(weightkg, na.rm = TRUE),
age_avg = mean(age, na.rm = TRUE),
age_sd = sd(age, na.rm = TRUE))
```
<br>
### Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`
::: {.callout-tip collapse=true}
#### Hint
To count missings, use the functions `sum()` and `is.na()`
:::
```{r}
#| eval: false
soldiers %>%
group_by(Installation) %>%
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)))
```
<br>
### As before, but now also add the number of observations (rows) within each `Installation`
::: {.callout-tip collapse=true}
#### Hint
Use `n()`
:::
```{r}
#| eval: false
soldiers %>%
group_by(Installation) %>%
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)),
n = n())
```
<br>
### As before, but now arrange the output after number of soldiers at each `Installation` in descending order.
```{r}
#| eval: false
soldiers %>%
group_by(Installation) %>%
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)),
n = n()) %>%
arrange(desc(n))
```
<br><br><br><br><br><br><br><br>
## `mutate()`
Restart R. (Click Session -> Restart R)\
Load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
\
Load the `soldiers `data and assign it to an object called `soldiers`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
### Add a column called `heightcm` with the height of the soldiers in cm
* Update the `soldiers` dataset with the new variable
* place the new variable after `Heightin`
```{r}
#| output: false
soldiers <- soldiers %>%
mutate(
heightcm = Heightin * 2.54,
.after = Heightin)
```
\
### Update the `weightkg` column to kg instead of kg*10
* Update the `soldiers` dataset with the new `weightkg` column
```{r}
#| output: false
soldiers <- soldiers %>%
mutate(
weightkg = weightkg/10
)
```
\
### Add a column called `BMI` with the Body mass index (BMI) of the soldiers
[BMI](https://en.wikipedia.org/wiki/Body_mass_index)
* Update the `soldiers` dataset to with the new variable
* place the new variable after `weightkg`
```{r}
#| output: false
soldiers <- soldiers %>%
mutate(BMI = weightkg/(heightcm/100)^2,
.after = weightkg)
```
\
### Add a column called `obese` that contains the value TRUE if BMI is > 30
```{r}
#| eval: false
soldiers %>%
mutate(
obese = if_else(BMI > 30, TRUE, FALSE),
.before = 1 # This line code just places the variable at the front
)
# OR
soldiers %>%
mutate(
obese = BMI > 30,
.before = 1 # This line code just places the variable at the front
)
```
\
### Inspect the below table from [Wikipedia](https://en.wikipedia.org/wiki/Body_mass_index) {#sec-BMI}
```{r}
#| echo: false
tibble::tribble( ~Category, ~`BMI (kg/m^2^)`,
"Underweight (Severe thinness)", "< 16.0",
"Underweight (Moderate thinness)", "16.0 – 16.9",
"Underweight (Mild thinness)", "17.0 – 18.4",
"Normal range", "18.5 – 24.9",
"Overweight (Pre-obese)", "25.0 – 29.9",
"Obese (Class I)", "30.0 – 34.9",
"Obese (Class II)", "35.0 – 39.9",
"Obese (Class III)", ">= 40.0",
) %>% knitr::kable()
```
\
### Create the variable `category` that tells us whether the soldiers are "Underweight", "Normal range", "Overweight", or "Obese"
* Update the `soldiers` dataset with the new variable
* place the new variable after `BMI`\
::: {.callout-tip collapse=true}
#### Hint 1
Use `case_when()`
:::
::: {.callout-tip collapse=true}
#### Hint 2
```{r}
#| eval: false
#| code-fold: false
soldiers %>%
mutate(
category = ????
)
```
:::
::: {.callout-tip collapse=true}
#### Hint 3
```{r}
#| eval: false
#| code-fold: false
soldiers %>%
mutate(
category = case_when(
#TEST HERE ~ OUTPUT,
#TEST HERE ~ OUTPUT,
#TEST HERE ~ OUTPUT,
#.default = OUTPUT
)
)
```
:::
```{r}
#| output: false
soldiers <- soldiers %>%
mutate(
category = case_when(
BMI < 18.5 ~ "Underweight",
BMI < 25 ~ "Normal range",
BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",
.default = NA),
.after = BMI
)
```
<br><br><br><br><br><br><br><br>
## `count()`
For simple counting `count()` is faster than `summarise(n = n())` or `mutate(n = n())`
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds %>%
count()
```
<details><summary>ANSWER</summary>
`count()` works like `summarise(n = n())`
</details>
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds %>%
count(cut)
```
```{r}
#| code-fold: true
#| code-summary: "Answer"
#| eval: false
# The code is equivalent to:
diamonds %>% group_by(cut) %>% summarise(n = n())
# OR
diamonds %>% summarise(n = n(), .by = cut)
```
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds %>%
count(cut, color)
```
```{r}
#| code-fold: true
#| code-summary: "Answer"
#| eval: false
# The code is equivalent to:
diamonds %>% group_by(cut, color) %>% summarise(n = n())
# OR
diamonds %>% summarise(n = n(), .by = c(cut, color))
# However, notice that the first solution returns a grouped tibble
```
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds %>%
add_count()
```
<details><summary>ANSWER</summary>
`add_count()` works like `mutate(n = n())`
</details>
<br><br>
### Count the number of diamonds within each type of `cut` and calculate the percentage of each cut
```{r}
#| output: false
diamonds %>%
count(cut) %>%
mutate(percent = n/sum(n)*100)
```
<br><br>
### Inspect the output of this code. What is going wrong?
```{r}
#| code-fold: false
diamonds %>%
group_by(cut) %>%
count() %>%
mutate(percent = n/sum(n)*100)
```
<script src="https://giscus.app/client.js"
data-repo="sorenoneill/r4phd"
data-repo-id="R_kgDOJ9etDQ"
data-category="Announcements"
data-category-id="DIC_kwDOJ9etDc4CYApF"
data-mapping="pathname"
data-strict="0"
data-reactions-enabled="0"
data-emit-metadata="0"
data-input-position="bottom"
data-theme="cobalt"
data-lang="en"
crossorigin="anonymous"
async>
</script>