Code
library(tidyverse)
library(here)
Steen Flammild Harsted
January 10, 2024
In this section we will explore a few other functions that you most likely are going to need.
pivot_wider()
and pivot_longer()
_join()
across()
tidyverse
and the here
packageIn the realm of data science, it’s common to encounter data in what we call a “wide” format. To visualize this, imagine that each measurement of the same attribute is spread across multiple columns instead of being listed down in rows. Consider a scenario where we record a person’s weight at different intervals; in the wide format, we might see individual columns dedicated to each time point, like weightkg_week0, weightkg_week10 and so forth. Essentially, the column labels often carry dual information: the attribute (in this case, ‘weight’) and the condition or time (like ‘week0’ or ‘week10’).
Often, data is inputted in this wide format for convenience, as it visually follows the chronological order of data collection or event occurrence. However, in the practice of data science, it’s generally more efficient to work with data in what we call a “long” format.
In long format, the data is structured such that each row is a unique observation or instance, and each column represents a unique variable or attribute. Continuing with the weight example, instead of having separate columns for each week’s weight (weightkg_week0, weightkg_week10), we would have separate rows for each weight measurement, with a column for the weight (weightkg) and another column for the time point (week).
The long format, also known as “tidy” data, provides a systematic and consistent structure, allowing us to work with data more effectively. Plus, many functions and packages in R, such as the popular ggplot2 for data visualization, are designed to work optimally with this long format.
But how do we move from wide format to long format? And vice versa? This is where the pivot_longer()
and pivot_wider()
functions from the tidyverse in R come into play.
The code creates a made-up data set. You dont need to understand the code.
set.seed(1)
weight_data <- tibble(
id = 1:100,
sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
weightkg_week0 = rnorm(100, mean = 95, sd = 5)) %>%
# Mean wiegh more than women
mutate(
weightkg_week0 = if_else(
sex == "Male", weightkg_week0 * rnorm(100, mean = 1.15, sd = 0.3),
weightkg_week0
)
) %>%
# Weight change over weeks
mutate(
weightkg_week1 = rnorm(100, mean = weightkg_week0*0.96, sd = 0.5),
weightkg_week2 = rnorm(100, mean = weightkg_week1*0.97, sd = 0.5),
weightkg_week3 = rnorm(100, mean = weightkg_week2*0.98, sd = 0.5),
weightkg_week4 = rnorm(100, mean = weightkg_week3*0.99, sd = 0.5),
weightkg_week5 = rnorm(100, mean = weightkg_week4, sd = 0.5))
weight_data
This exercise is only here to show you how troublesome it can be to work with data in a wide format. You don’t need to write all the code.
pivot_longer
The pivot_longer()
function takes wide format data and makes it longer by increasing the number of rows and decreasing the number of columns. For example, it can take our weight data from multiple columns (weightkg_week0, weightkg_week10) and collapse them into two columns: one for weight and one for timepoint.
pivot_longer()
What does this line in the help file mean?
cols <tidy-select> Columns to pivot into longer format. ::: {.callout collapse=true} #### Answer This means that you use all the tidyselect functions that you also use with select() when you tell R what columns you to pivot into a longer format. :::
What does the names_to
and values_to
arguments do?
weight_data
called weight_data_long
We will fix the values in the week
column later. For now we are happy with the long data format.
weight_data_long_sum
pivot_wider()
The pivot_wider()
function takes long format data and makes it wider by decreasing the number of rows and increasing the number of columns. For instance, it can spread our long format weight data back into multiple columns for each week’s weight.
pivot_wider()
weight_data_long_sum
wider using the week
variable# A tibble: 2 × 7
# Groups: sex [2]
sex weightkg_week0 weightkg_week1 weightkg_week2 weightkg_week3
<chr> <dbl> <dbl> <dbl> <dbl>
1 Female 94.7 90.9 88.3 86.4
2 Male 113. 108. 105. 103.
# ℹ 2 more variables: weightkg_week4 <dbl>, weightkg_week5 <dbl>
weight_data_long_sum
wider using the sex
variable# A tibble: 6 × 3
week Female Male
<chr> <dbl> <dbl>
1 weightkg_week0 94.7 113.
2 weightkg_week1 90.9 108.
3 weightkg_week2 88.3 105.
4 weightkg_week3 86.4 103.
5 weightkg_week4 85.5 102.
6 weightkg_week5 85.4 102.
pivot_wider()
and pivot_longer()
Change it into three variables:
sex
(“female”, “male”)pregnant
(“yes”, “no”)count
, which is a non-negative integer representing the number of observations.table2
for this exercisetable2
is a dataset that is load with the tidyverse
. Try to type table2
in the console.
table2
displays the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population).
Change table2
into the output below:
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
_join()
The join functions in the dplyr package in R are used to combine two dataframes based on a shared key (or set of keys). There are several types of joins, each with their own use cases:
inner_join()
: This returns only the rows in which the key exists in both dataframes.
left_join()
: This returns all the rows from the first (left-hand) dataframe, and any rows from the second (right-hand) dataframe that match. If there’s no match, the columns for the right-hand dataframe will be filled with NA.
right_join()
: This is the reverse of left_join(). It returns all the rows from the second (right-hand) dataframe, and any rows from the first (left-hand) dataframe that match. If there’s no match, the columns for the left-hand dataframe will be filled with NA.
full_join()
: This returns all rows from both dataframes. If there’s no match, the columns for the other dataframe will be filled with NA.
semi_join()
: This returns all rows from the first (left-hand) dataframe where there are matching values in the second dataframe, but it doesn’t add any columns from the second dataframe.
anti_join()
: This returns all rows from the first (left-hand) dataframe where there are not matching values in the second dataframe. It also doesn’t add any columns from the second dataframe.
These functions are essential tools for merging and manipulating data in R, allowing you to effectively combine and cross-reference information from multiple dataframes.
Read the slides from Day 1 to see code examples.
Run this code to get example data
across()
Restart R Load the tidyverse
and the here
package
Load the soldiers data set and assign it to an object called soliders
The across()
function is part of the dplyr
package in R and is used for applying a function (or functions) across multiple columns in a dataframe.
across()
is particularly useful when you want to perform the same operation on multiple columns but don’t want to write out each operation individually. It works in combination with mutate() and summarise().
Very importantly the .cols argument allows you to use the tidyselect functions that you are now familiar with.
Here’s a basic example of its usage.
Let’s say we want to take the mean of weightkg
and earlength
in the soldiers
dataframe.
You can see that the mean of weightkg
is NA
. This is because is has missings values and mean()
defaults to na.rm = FALSE
. If we want to change this, we have to write a slightly more complicated code. In the following code you will find two new things:
~
You tell R - “I am going to show you a function”.x
You tell R - “Put the column that you have selected here”.x
can also be used outside of a function. Lets say we want to double all the values instead of taking the mean
soldiers
dataframe and change their values from cm to inches# A tibble: 6,208 × 24
subjectid fake_cpr sex age Ethnicity DODRace WritingPreference
<dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 1017 151168-2263 Male 41 <NA> 1 Right hand
2 4775 210475-1945 Man 35 <NA> 1 Left hand
3 2177 290568-1069 Male 42 <NA> 2 Left hand
4 5026 121283-0610 Female 26 <NA> 2 Right hand
5 1533 090189-0422 Woman 21 Mexican 3 Right hand
6 4567 020487-3860 Female 23 <NA> 1 Right hand
7 2347 041287-8106 Female 22 Caribbean Islan… 2 Right hand
8 270 130765-7114 Woman 45 <NA> 1 Right hand
9 4050 020166-4680 Female 44 <NA> 2 Left hand
10 5307 101188-3148 Female 21 <NA> 1 Right hand
# ℹ 6,198 more rows
# ℹ 17 more variables: Installation <chr>, Component <chr>, Branch <chr>,
# Heightin <dbl>, weightkg <dbl>, earlength <dbl>, footlength <dbl>,
# handlength <dbl>, handbreadth <dbl>, hipbreadth <dbl>, chestbreadth <dbl>,
# thighcircumference <dbl>, waistcircumference <dbl>,
# anklecircumference <dbl>, bicepscircumference <dbl>,
# calfcircumference <dbl>, span <dbl>
.cols = where(is.[column_type])
You can also target columns based on their type (logical, double, character, factor, numeric). Using the functions is.logical()
, is.double()
, etc.. However, you need to use them inside the function where()
In this example, we change all character columns to factors (You will learn about the column types on day 3). where(is.character)
selects the columns where the condition is true (i.e., the column is character), and as.factor
is the function that gets applied to these columns. As a result, all character columns get converted to factor.
soldiers
dataframe and add 1.5# A tibble: 6,208 × 24
subjectid fake_cpr sex age Ethnicity DODRace WritingPreference
<dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 1018. 151168-2263 Male 42.5 <NA> 2.5 Right hand
2 4776. 210475-1945 Man 36.5 <NA> 2.5 Left hand
3 2178. 290568-1069 Male 43.5 <NA> 3.5 Left hand
4 5028. 121283-0610 Female 27.5 <NA> 3.5 Right hand
5 1534. 090189-0422 Woman 22.5 Mexican 4.5 Right hand
6 4568. 020487-3860 Female 24.5 <NA> 2.5 Right hand
7 2348. 041287-8106 Female 23.5 Caribbean Islan… 3.5 Right hand
8 272. 130765-7114 Woman 46.5 <NA> 2.5 Right hand
9 4052. 020166-4680 Female 45.5 <NA> 3.5 Left hand
10 5308. 101188-3148 Female 22.5 <NA> 2.5 Right hand
# ℹ 6,198 more rows
# ℹ 17 more variables: Installation <chr>, Component <chr>, Branch <chr>,
# Heightin <dbl>, weightkg <dbl>, earlength <dbl>, footlength <dbl>,
# handlength <dbl>, handbreadth <dbl>, hipbreadth <dbl>, chestbreadth <dbl>,
# thighcircumference <dbl>, waistcircumference <dbl>,
# anklecircumference <dbl>, bicepscircumference <dbl>,
# calfcircumference <dbl>, span <dbl>
soldiers
dataframeStandardizing a variable is a statistical process that transforms the variable’s values to have a mean of 0 and a standard deviation of 1. This is often used often used within machinelearning.
The most common way of standardizing a values is to subtract the mean from each value and divide by the standard deviation.
The standard score (or z-score) of a value x is calculated as:
\[z = (x - μ) / σ\]
across()
across()
can do much more than this. For instance, it can invoke multiple functions simultaneously in a single call, like calculating the mean and standard deviation of chosen variables. Or change the names of the newly created columns. Read more here
---
title: "Other data wrangling functions"
author: "Steen Flammild Harsted"
date: today
output:
distill::distill_article:
self_contained: false
format:
html:
toc: true
toc-depth: 2
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
# Other important functions for data wrangling.
In this section we will explore a few other functions that you most likely are going to need.
- `pivot_wider()` and `pivot_longer()`
- `_join()`
- `across()`
<br><br><br><br><br><br><br><br>
## Getting started for now
### Restart R and load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
<br><br><br><br><br><br><br><br>
## Wide and Long data formats
In the realm of data science, it's common to encounter data in what we call a "wide" format. To visualize this, imagine that each measurement of the same attribute is spread across multiple columns instead of being listed down in rows. Consider a scenario where we record a person's weight at different intervals; in the wide format, we might see individual columns dedicated to each time point, like weightkg_week0, weightkg_week10 and so forth. Essentially, the column labels often carry dual information: the attribute (in this case, 'weight') and the condition or time (like 'week0' or 'week10').
Often, data is inputted in this wide format for convenience, as it visually follows the chronological order of data collection or event occurrence. However, in the practice of data science, it's generally more efficient to work with data in what we call a "long" format.
In long format, the data is structured such that each row is a unique observation or instance, and each column represents a unique variable or attribute. Continuing with the weight example, instead of having separate columns for each week's weight (weightkg_week0, weightkg_week10), we would have separate rows for each weight measurement, with a column for the weight (weightkg) and another column for the time point (week).
The long format, also known as "tidy" data, provides a systematic and consistent structure, allowing us to work with data more effectively. Plus, many functions and packages in R, such as the popular ggplot2 for data visualization, are designed to work optimally with this long format.
But how do we move from wide format to long format? And vice versa? This is where the `pivot_longer()` and `pivot_wider()` functions from the tidyverse in R come into play.
<br>
### Run the code below and explore the dataset that it creates
The code creates a made-up data set. You dont need to understand the code.
- Is the data frame in a wide or long format? How can you tell?
```{r}
#| code-fold: false
#| output: false
set.seed(1)
weight_data <- tibble(
id = 1:100,
sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
weightkg_week0 = rnorm(100, mean = 95, sd = 5)) %>%
# Mean wiegh more than women
mutate(
weightkg_week0 = if_else(
sex == "Male", weightkg_week0 * rnorm(100, mean = 1.15, sd = 0.3),
weightkg_week0
)
) %>%
# Weight change over weeks
mutate(
weightkg_week1 = rnorm(100, mean = weightkg_week0*0.96, sd = 0.5),
weightkg_week2 = rnorm(100, mean = weightkg_week1*0.97, sd = 0.5),
weightkg_week3 = rnorm(100, mean = weightkg_week2*0.98, sd = 0.5),
weightkg_week4 = rnorm(100, mean = weightkg_week3*0.99, sd = 0.5),
weightkg_week5 = rnorm(100, mean = weightkg_week4, sd = 0.5))
weight_data
```
<br><br>
### What is the mean weight of men and women at the different time-points?
*This exercise is only here to show you how troublesome it can be to work with data in a wide format. You don't need to write all the code.*
```{r}
#| output: false
weight_data %>%
group_by(sex) %>%
summarise(week0 = mean(weightkg_week0),
week1 = mean(weightkg_week1),
# etc. etc.
)
```
<br><br>
## `pivot_longer`
The `pivot_longer()` function takes wide format data and makes it longer by increasing the number of rows and decreasing the number of columns. For example, it can take our weight data from multiple columns (weightkg_week0, weightkg_week10) and collapse them into two columns: one for weight and one for timepoint.
<br><br>
### Read the help file for `pivot_longer()`
```{r}
#| output: false
#| eval: false
?pivot_longer
```
- What does this line in the help file mean?
*cols \<tidy-select\> Columns to pivot into longer format.*
::: {.callout collapse=true}
#### Answer
This means that you use all the [tidyselect](https://tidyselect.r-lib.org/) functions that you also use with select() when you tell R what columns you to pivot into a longer format.
:::
- What does the `names_to` and `values_to` arguments do?
<br><br>
### Create a long version of `weight_data` called `weight_data_long`
```{r}
#| output: false
weight_data_long <- weight_data %>%
pivot_longer(cols = starts_with("weight"))
weight_data_long
```
<br><br>
### Try again, but this time use the names_to and values_to arguments to make names for the new columns
```{r}
#| output: false
weight_data_long <- weight_data %>%
pivot_longer(cols = starts_with("weight"),
names_to = "week",
values_to = "weightkg")
weight_data_long
```
We will fix the values in the `week` column later. For now we are happy with the long data format.
<br><br>
### What is the mean and standard deviation of weight of men and women at the different time-points?
* assign you results to an object called `weight_data_long_sum`
```{r}
#| output: false
weight_data_long_sum <- weight_data_long %>%
group_by(sex, week) %>%
summarise(mean = mean(weightkg))
weight_data_long_sum
```
<br><br><br><br>
## `pivot_wider()`
The `pivot_wider()` function takes long format data and makes it wider by decreasing the number of rows and increasing the number of columns. For instance, it can spread our long format weight data back into multiple columns for each week's weight.
<br><br>
### Read the help file for `pivot_wider()`
```{r}
#| output: false
#| eval: false
?pivot_wider
```
<br><br>
### Make `weight_data_long_sum` wider using the `week` variable
```{r}
weight_data_long_sum %>%
pivot_wider(names_from = "week",
values_from = mean)
```
<br><br>
### Make `weight_data_long_sum` wider using the `sex` variable
```{r}
weight_data_long_sum %>%
pivot_wider(names_from = "sex",
values_from = mean)
```
<br><br>
## More examples of `pivot_wider()` and `pivot_longer()`
<br><br>
### Tidy the simple tibble below.
Change it into three variables:
- `sex` ("female", "male")
- `pregnant` ("yes", "no")
- `count`, which is a non-negative integer representing the number of observations.
```{r}
#| code-fold: false
#| output: false
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
preg
```
```{r}
#| output: false
preg %>%
pivot_longer(c(male, female), names_to = "sex", values_to = "count")
```
<br><br>
### Use `table2` for this exercise
`table2` is a dataset that is load with the `tidyverse`. Try to type `table2` in the console.
`table2` displays the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population).
Change `table2` into the output below:
```{r}
#| eval: true
table2 %>%
pivot_wider(names_from = type,
values_from = count)
```
<br><br><br><br>
## `_join()`
The join functions in the dplyr package in R are used to combine two dataframes based on a shared key (or set of keys). There are several types of joins, each with their own use cases:
* `inner_join()`: This returns only the rows in which the key exists in both dataframes.
* `left_join()`: This returns all the rows from the first (left-hand) dataframe, and any rows from the second (right-hand) dataframe that match. If there's no match, the columns for the right-hand dataframe will be filled with NA.
* `right_join()`: This is the reverse of left_join(). It returns all the rows from the second (right-hand) dataframe, and any rows from the first (left-hand) dataframe that match. If there's no match, the columns for the left-hand dataframe will be filled with NA.
* `full_join()`: This returns all rows from both dataframes. If there's no match, the columns for the other dataframe will be filled with NA.
* `semi_join()`: This returns all rows from the first (left-hand) dataframe where there are matching values in the second dataframe, but it doesn't add any columns from the second dataframe.
* `anti_join()`: This returns all rows from the first (left-hand) dataframe where there are not matching values in the second dataframe. It also doesn't add any columns from the second dataframe.
These functions are essential tools for merging and manipulating data in R, allowing you to effectively combine and cross-reference information from multiple dataframes.
<br><br>
Read the slides from Day 1 to see code examples.
<br><br>
Run this code to get example data
```{r}
#| code-fold: false
set.seed(1)
A <- tibble(id = c(1:5, 7:8),
sex = sample(c("Male", "Female"), 7, replace = TRUE),
nr_of_orders = sample(5:10, 7, replace = TRUE))
B <- tibble(id = 1:6,
age = sample(25:75, 6),
total_order_value = sample(5:50, 6)*100)
```
<br><br>
#### What rows doesnt match between A and B
```{r}
#| eval: false
anti_join(A, B)
anti_join(B, A)
```
<br><br>
#### What will be the results of the below code
```{r}
#| eval: false
#| code-fold: false
semi_join(A, B)
```
<br><br>
#### Keep all rows in A and all columns from A and B
```{r}
#| eval: false
left_join(A, B)
```
<br><br>
#### Keep all rows and all columns from both A and B, and sort by id
```{r}
#| eval: false
full_join(A,B) %>%
arrange(id)
```
<br><br>
#### Calculate the average order value
```{r}
#| eval: false
full_join(A,B) %>%
mutate(
avg_order_value = total_order_value/nr_of_orders)
```
<br><br>
#### Calculate the mean of the average order value for males and females
```{r}
#| eval: false
full_join(A,B) %>%
mutate(
avg_order_value = total_order_value/nr_of_orders) %>%
group_by(sex) %>%
summarise(
mean_avg_order_value = mean(avg_order_value, na.rm = TRUE)
)
```
<br><br><br><br><br><br><br><br>
## `across()`
Restart R
Load the `tidyverse` and the `here` package
```{r}
library(tidyverse)
library(here)
```
Load the soldiers data set and assign it to an object called soliders
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
The `across()` function is part of the `dplyr` package in R and is used for applying a function (or functions) across multiple columns in a dataframe.
`across()` is particularly useful when you want to perform the same operation on multiple columns but don't want to write out each operation individually. It works in combination with mutate() and summarise().
Very importantly the .cols argument allows you to use the [tidyselect](https://tidyselect.r-lib.org/) functions that you are now familiar with.
Here's a basic example of its usage.
Let's say we want to take the mean of `weightkg` and `earlength` in the `soldiers` dataframe.
```{r}
#| code-fold: false
#| output: false
soldiers %>%
summarise(across(
.cols = c("weightkg", "earlength"), # What columns to target
.fns = mean) # What function to apply - Notice that mean is not written mean()
)
```
You can see that the mean of `weightkg` is `NA`. This is because is has missings values and `mean()` defaults to `na.rm = FALSE`. If we want to change this, we have to write a slightly more complicated code. In the following code you will find two new things:
* `~` You tell R - "I am going to show you a function"
* `.x` You tell R - "Put the column that you have selected here"
```{r}
#| code-fold: false
#| output: false
soldiers %>%
summarise(across(
.cols = c("weightkg", "earlength"), # What columns to target
.fns = ~ mean(.x, na.rm = TRUE)) # What function to apply
)
```
`.x` can also be used outside of a function. Lets say we want to double all the values instead of taking the mean
```{r}
#| code-fold: false
#| output: false
soldiers %>%
mutate(across( # Notice now we use mutate() instead of summarise()
.cols = c("weightkg", "earlength"), # What columns to target
.fns = ~ .x * 2
))
```
<br><br>
### Target all circumference columns in the `soldiers` dataframe and change their values from cm to inches
```{r}
soldiers %>%
mutate(across(
.cols = contains("circumference"),
.fns = ~.x *0.393701))
```
<br><br>
### `.cols = where(is.[column_type])`
You can also target columns based on their type (logical, double, character, factor, numeric). Using the functions `is.logical()`, `is.double()`, etc.. However, you need to use them inside the function `where()`
In this example, we change all character columns to factors (You will learn about the column types on day 3). `where(is.character)` selects the columns where the condition is true (i.e., the column is character), and `as.factor` is the function that gets applied to these columns. As a result, all character columns get converted to factor.
```{r}
#| code-fold: false
#| output: false
soldiers %>%
mutate(across(
.cols = where(is.character), # What columns to target - notice is.character is not written is.charater()
.fns = as.factor # What function to apply - notice as.factor is written without ()
))
```
<br><br>
### Target all numeric columns in the `soldiers` dataframe and add 1.5
```{r}
soldiers %>%
mutate(
across(
.cols = where(is.numeric),
.fns = ~.x+1.5
)
)
```
<br><br>
### Challenge... Standardise all numeric columns in the `soldiers` dataframe
Standardizing a variable is a statistical process that transforms the variable's values to have a mean of 0 and a standard deviation of 1. This is often used often used within machinelearning.
The most common way of standardizing a values is to subtract the mean from each value and divide by the standard deviation.
The standard score (or z-score) of a value x is calculated as:
$$z = (x - μ) / σ$$
::: {.callout-tip collapse=true}
#### Hint
If we would target just a single columns the code could be
```{r}
#| output: false
#| code-fold: false
soldiers %>%
mutate(
weightkg = (weightkg-mean(weightkg, na.rm = TRUE))/sd(weightkg, na.rm=TRUE)
)
```
:::
```{r}
#| output: false
#| code-fold: true
#| code-summary: "Show me the code"
soldiers %>%
mutate(
across(
.cols = where(is.numeric),
.fns = ~(.x - mean(.x, na.rm = TRUE))/sd(.x, na.rm = TRUE)
)
)
```
<br><br>
### Learn more about `across()`
`across()` can do much more than this. For instance, it can invoke multiple functions simultaneously in a single call, like calculating the mean and standard deviation of chosen variables. Or change the names of the newly created columns. Read more [here](https://dplyr.tidyverse.org/reference/across.html)