Other data wrangling functions

Author

Steen Flammild Harsted

Published

January 10, 2024

1 Other important functions for data wrangling.

In this section we will explore a few other functions that you most likely are going to need.

  • pivot_wider() and pivot_longer()
  • _join()
  • across()









1.1 Getting started for now

Restart R and load the tidyverse and the here package

Code
library(tidyverse)
library(here)









1.2 Wide and Long data formats

In the realm of data science, it’s common to encounter data in what we call a “wide” format. To visualize this, imagine that each measurement of the same attribute is spread across multiple columns instead of being listed down in rows. Consider a scenario where we record a person’s weight at different intervals; in the wide format, we might see individual columns dedicated to each time point, like weightkg_week0, weightkg_week10 and so forth. Essentially, the column labels often carry dual information: the attribute (in this case, ‘weight’) and the condition or time (like ‘week0’ or ‘week10’).

Often, data is inputted in this wide format for convenience, as it visually follows the chronological order of data collection or event occurrence. However, in the practice of data science, it’s generally more efficient to work with data in what we call a “long” format.

In long format, the data is structured such that each row is a unique observation or instance, and each column represents a unique variable or attribute. Continuing with the weight example, instead of having separate columns for each week’s weight (weightkg_week0, weightkg_week10), we would have separate rows for each weight measurement, with a column for the weight (weightkg) and another column for the time point (week).

The long format, also known as “tidy” data, provides a systematic and consistent structure, allowing us to work with data more effectively. Plus, many functions and packages in R, such as the popular ggplot2 for data visualization, are designed to work optimally with this long format.

But how do we move from wide format to long format? And vice versa? This is where the pivot_longer() and pivot_wider() functions from the tidyverse in R come into play.


Run the code below and explore the dataset that it creates

The code creates a made-up data set. You dont need to understand the code.

  • Is the data frame in a wide or long format? How can you tell?
set.seed(1)

weight_data <- tibble(
  id = 1:100,
  sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
  weightkg_week0 = rnorm(100, mean = 95, sd = 5)) %>% 
  
  # Mean wiegh more than women
  mutate(
    weightkg_week0 = if_else(
      sex == "Male", weightkg_week0 * rnorm(100, mean = 1.15, sd = 0.3),
      weightkg_week0
    )
  ) %>% 
  
  # Weight change over weeks
  mutate(
    weightkg_week1 = rnorm(100, mean = weightkg_week0*0.96, sd = 0.5), 
    weightkg_week2 = rnorm(100, mean = weightkg_week1*0.97, sd = 0.5),
    weightkg_week3 = rnorm(100, mean = weightkg_week2*0.98, sd = 0.5), 
    weightkg_week4 = rnorm(100, mean = weightkg_week3*0.99, sd = 0.5),
    weightkg_week5 = rnorm(100, mean = weightkg_week4, sd = 0.5))
  

weight_data



What is the mean weight of men and women at the different time-points?

This exercise is only here to show you how troublesome it can be to work with data in a wide format. You don’t need to write all the code.

Code
weight_data %>% 
  group_by(sex) %>% 
  summarise(week0 = mean(weightkg_week0),
            week1 = mean(weightkg_week1),
            # etc. etc.
            )



1.3 pivot_longer

The pivot_longer() function takes wide format data and makes it longer by increasing the number of rows and decreasing the number of columns. For example, it can take our weight data from multiple columns (weightkg_week0, weightkg_week10) and collapse them into two columns: one for weight and one for timepoint.



Read the help file for pivot_longer()

Code
?pivot_longer
  • What does this line in the help file mean?

    cols <tidy-select> Columns to pivot into longer format. ::: {.callout collapse=true} #### Answer This means that you use all the tidyselect functions that you also use with select() when you tell R what columns you to pivot into a longer format. :::

  • What does the names_to and values_to arguments do?



Create a long version of weight_data called weight_data_long

Code
weight_data_long <- weight_data %>% 
  pivot_longer(cols = starts_with("weight"))

weight_data_long



Try again, but this time use the names_to and values_to arguments to make names for the new columns

Code
weight_data_long <- weight_data %>% 
  pivot_longer(cols = starts_with("weight"),
               names_to = "week",
               values_to = "weightkg")
weight_data_long

We will fix the values in the week column later. For now we are happy with the long data format.



What is the mean and standard deviation of weight of men and women at the different time-points?

  • assign you results to an object called weight_data_long_sum
Code
weight_data_long_sum <- weight_data_long %>% 
  group_by(sex, week) %>% 
  summarise(mean = mean(weightkg))

weight_data_long_sum





1.4 pivot_wider()

The pivot_wider() function takes long format data and makes it wider by decreasing the number of rows and increasing the number of columns. For instance, it can spread our long format weight data back into multiple columns for each week’s weight.



Read the help file for pivot_wider()

Code
?pivot_wider



Make weight_data_long_sum wider using the week variable

Code
weight_data_long_sum %>% 
  pivot_wider(names_from = "week",
              values_from = mean)
# A tibble: 2 × 7
# Groups:   sex [2]
  sex    weightkg_week0 weightkg_week1 weightkg_week2 weightkg_week3
  <chr>           <dbl>          <dbl>          <dbl>          <dbl>
1 Female           94.7           90.9           88.3           86.4
2 Male            113.           108.           105.           103. 
# ℹ 2 more variables: weightkg_week4 <dbl>, weightkg_week5 <dbl>



Make weight_data_long_sum wider using the sex variable

Code
weight_data_long_sum %>% 
  pivot_wider(names_from = "sex",
              values_from = mean)
# A tibble: 6 × 3
  week           Female  Male
  <chr>           <dbl> <dbl>
1 weightkg_week0   94.7  113.
2 weightkg_week1   90.9  108.
3 weightkg_week2   88.3  105.
4 weightkg_week3   86.4  103.
5 weightkg_week4   85.5  102.
6 weightkg_week5   85.4  102.



1.5 More examples of pivot_wider() and pivot_longer()



Tidy the simple tibble below.

Change it into three variables:

  • sex (“female”, “male”)
  • pregnant (“yes”, “no”)
  • count, which is a non-negative integer representing the number of observations.
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no", 20, 12
)
preg
Code
preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count")



Use table2 for this exercise

table2 is a dataset that is load with the tidyverse. Try to type table2 in the console.

table2 displays the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population).

Change table2 into the output below:

Code
table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583





1.6 _join()

The join functions in the dplyr package in R are used to combine two dataframes based on a shared key (or set of keys). There are several types of joins, each with their own use cases:

  • inner_join(): This returns only the rows in which the key exists in both dataframes.

  • left_join(): This returns all the rows from the first (left-hand) dataframe, and any rows from the second (right-hand) dataframe that match. If there’s no match, the columns for the right-hand dataframe will be filled with NA.

  • right_join(): This is the reverse of left_join(). It returns all the rows from the second (right-hand) dataframe, and any rows from the first (left-hand) dataframe that match. If there’s no match, the columns for the left-hand dataframe will be filled with NA.

  • full_join(): This returns all rows from both dataframes. If there’s no match, the columns for the other dataframe will be filled with NA.

  • semi_join(): This returns all rows from the first (left-hand) dataframe where there are matching values in the second dataframe, but it doesn’t add any columns from the second dataframe.

  • anti_join(): This returns all rows from the first (left-hand) dataframe where there are not matching values in the second dataframe. It also doesn’t add any columns from the second dataframe.

These functions are essential tools for merging and manipulating data in R, allowing you to effectively combine and cross-reference information from multiple dataframes.



Read the slides from Day 1 to see code examples.



Run this code to get example data

set.seed(1)
A <- tibble(id = c(1:5, 7:8),
            sex = sample(c("Male", "Female"), 7, replace = TRUE),
            nr_of_orders = sample(5:10, 7, replace = TRUE))
B <- tibble(id = 1:6,
            age = sample(25:75, 6),
            total_order_value = sample(5:50, 6)*100)



What rows doesnt match between A and B

Code
anti_join(A, B)
anti_join(B, A)



What will be the results of the below code

semi_join(A, B)



Keep all rows in A and all columns from A and B

Code
left_join(A, B)



Keep all rows and all columns from both A and B, and sort by id

Code
full_join(A,B) %>% 
  arrange(id)



Calculate the average order value

Code
full_join(A,B) %>% 
  mutate(
    avg_order_value = total_order_value/nr_of_orders) 



Calculate the mean of the average order value for males and females

Code
full_join(A,B) %>% 
  mutate(
    avg_order_value = total_order_value/nr_of_orders) %>% 
  group_by(sex) %>% 
  summarise(
    mean_avg_order_value = mean(avg_order_value, na.rm = TRUE)
  )









1.7 across()

Restart R Load the tidyverse and the here package

Code
library(tidyverse)
library(here)

Load the soldiers data set and assign it to an object called soliders

Code
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

The across() function is part of the dplyr package in R and is used for applying a function (or functions) across multiple columns in a dataframe.

across() is particularly useful when you want to perform the same operation on multiple columns but don’t want to write out each operation individually. It works in combination with mutate() and summarise().

Very importantly the .cols argument allows you to use the tidyselect functions that you are now familiar with.

Here’s a basic example of its usage.

Let’s say we want to take the mean of weightkg and earlength in the soldiers dataframe.

soldiers %>% 
  summarise(across(
    .cols = c("weightkg", "earlength"),  # What columns to target
    .fns = mean)                         # What function to apply - Notice that mean is not written mean()
    )

You can see that the mean of weightkg is NA. This is because is has missings values and mean() defaults to na.rm = FALSE. If we want to change this, we have to write a slightly more complicated code. In the following code you will find two new things:

  • ~ You tell R - “I am going to show you a function”
  • .x You tell R - “Put the column that you have selected here”
soldiers %>% 
  summarise(across(
    .cols = c("weightkg", "earlength"),  # What columns to target
    .fns = ~ mean(.x, na.rm = TRUE))     # What function to apply
    )

.x can also be used outside of a function. Lets say we want to double all the values instead of taking the mean

soldiers %>% 
  mutate(across(                         # Notice now we use mutate() instead of summarise()
    .cols = c("weightkg", "earlength"),  # What columns to target
    .fns = ~ .x * 2                           
    )) 



Target all circumference columns in the soldiers dataframe and change their values from cm to inches

Code
soldiers %>% 
  mutate(across(
    .cols = contains("circumference"), 
    .fns = ~.x *0.393701))
# A tibble: 6,208 × 24
   subjectid fake_cpr    sex      age Ethnicity        DODRace WritingPreference
       <dbl> <chr>       <chr>  <dbl> <chr>              <dbl> <chr>            
 1      1017 151168-2263 Male      41 <NA>                   1 Right hand       
 2      4775 210475-1945 Man       35 <NA>                   1 Left hand        
 3      2177 290568-1069 Male      42 <NA>                   2 Left hand        
 4      5026 121283-0610 Female    26 <NA>                   2 Right hand       
 5      1533 090189-0422 Woman     21 Mexican                3 Right hand       
 6      4567 020487-3860 Female    23 <NA>                   1 Right hand       
 7      2347 041287-8106 Female    22 Caribbean Islan…       2 Right hand       
 8       270 130765-7114 Woman     45 <NA>                   1 Right hand       
 9      4050 020166-4680 Female    44 <NA>                   2 Left hand        
10      5307 101188-3148 Female    21 <NA>                   1 Right hand       
# ℹ 6,198 more rows
# ℹ 17 more variables: Installation <chr>, Component <chr>, Branch <chr>,
#   Heightin <dbl>, weightkg <dbl>, earlength <dbl>, footlength <dbl>,
#   handlength <dbl>, handbreadth <dbl>, hipbreadth <dbl>, chestbreadth <dbl>,
#   thighcircumference <dbl>, waistcircumference <dbl>,
#   anklecircumference <dbl>, bicepscircumference <dbl>,
#   calfcircumference <dbl>, span <dbl>



.cols = where(is.[column_type])

You can also target columns based on their type (logical, double, character, factor, numeric). Using the functions is.logical(), is.double(), etc.. However, you need to use them inside the function where()

In this example, we change all character columns to factors (You will learn about the column types on day 3). where(is.character) selects the columns where the condition is true (i.e., the column is character), and as.factor is the function that gets applied to these columns. As a result, all character columns get converted to factor.

soldiers %>% 
  mutate(across(
    .cols = where(is.character), # What columns to target - notice is.character is not written is.charater()
    .fns = as.factor             # What function to apply - notice as.factor is written without ()                           
    )) 



Target all numeric columns in the soldiers dataframe and add 1.5

Code
soldiers %>% 
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~.x+1.5
    )
  )
# A tibble: 6,208 × 24
   subjectid fake_cpr    sex      age Ethnicity        DODRace WritingPreference
       <dbl> <chr>       <chr>  <dbl> <chr>              <dbl> <chr>            
 1     1018. 151168-2263 Male    42.5 <NA>                 2.5 Right hand       
 2     4776. 210475-1945 Man     36.5 <NA>                 2.5 Left hand        
 3     2178. 290568-1069 Male    43.5 <NA>                 3.5 Left hand        
 4     5028. 121283-0610 Female  27.5 <NA>                 3.5 Right hand       
 5     1534. 090189-0422 Woman   22.5 Mexican              4.5 Right hand       
 6     4568. 020487-3860 Female  24.5 <NA>                 2.5 Right hand       
 7     2348. 041287-8106 Female  23.5 Caribbean Islan…     3.5 Right hand       
 8      272. 130765-7114 Woman   46.5 <NA>                 2.5 Right hand       
 9     4052. 020166-4680 Female  45.5 <NA>                 3.5 Left hand        
10     5308. 101188-3148 Female  22.5 <NA>                 2.5 Right hand       
# ℹ 6,198 more rows
# ℹ 17 more variables: Installation <chr>, Component <chr>, Branch <chr>,
#   Heightin <dbl>, weightkg <dbl>, earlength <dbl>, footlength <dbl>,
#   handlength <dbl>, handbreadth <dbl>, hipbreadth <dbl>, chestbreadth <dbl>,
#   thighcircumference <dbl>, waistcircumference <dbl>,
#   anklecircumference <dbl>, bicepscircumference <dbl>,
#   calfcircumference <dbl>, span <dbl>



Challenge… Standardise all numeric columns in the soldiers dataframe

Standardizing a variable is a statistical process that transforms the variable’s values to have a mean of 0 and a standard deviation of 1. This is often used often used within machinelearning.

The most common way of standardizing a values is to subtract the mean from each value and divide by the standard deviation.

The standard score (or z-score) of a value x is calculated as:

\[z = (x - μ) / σ\]

If we would target just a single columns the code could be

soldiers %>% 
  mutate(
    weightkg = (weightkg-mean(weightkg, na.rm = TRUE))/sd(weightkg, na.rm=TRUE)
  )
Show me the code
soldiers %>% 
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~(.x - mean(.x, na.rm = TRUE))/sd(.x, na.rm = TRUE)
    )
  )



Learn more about across()

across() can do much more than this. For instance, it can invoke multiple functions simultaneously in a single call, like calculating the mean and standard deviation of chosen variables. Or change the names of the newly created columns. Read more here