factors

and the forcats package

Author

Steen Flammild Harsted

Published

January 10, 2024

1 forcats

 

1.1 Base exercises

Use the soldiers dataset for the following exercises.


race

race is now ordered alphabetically (try: soldiers %>% count(race)))

  • fix race so that it is ordered after frequency (fct_infreq())
  • Add the code to your 01_import.R file
  • Try to recreate one of the plots from day two where you used x = race. Do you see the difference?
Code
soldiers %>% 
  mutate(
    race = fct_infreq(race)
  ) %>% 
  
  ggplot(aes(x = race, fill = race))+
  geom_bar()+
  scale_x_discrete(labels = scales::label_wrap(8)) # one of many ways to fix long labels on the axis


Ethnicity

Background information: DODRace was collected by assigning fixed race values (1:7) to each soldier. Ethnicity was a black space where the soldiers themselves have filled out their race.

  • Explore Ethnicity using count() and view()
soldiers %>% 
  count(Ethnicity) # add  %>% view()

 

fct_lump()

OMG.. we probably need to merge some of the many Ethnicity groups.
Try with fct_lump()

  • Can you find a number of levels that’s reasonable?
Code
soldiers %>% 
  mutate(
    Ethnicity = fct_lump(Ethnicity, 5) # I dont think this is reasonable see below
  ) %>% 
  count(Ethnicity)


fct_collapse()

Hmmm… fct_lump() is probably not the best choice for the Ethnicity variable. It has too many groups, and many groups have similar sounding names. We need to fix this manually.

  • Here is a code example where we merge all the Apache.
  • Merge a few more groups together by building on this code.
  • Add the code to your 01_import.R file
soldiers %>% 
  mutate(
    Ethnicity = fct_collapse(Ethnicity,
      Apache = c("Apache Blackfoot",
                 "Apache Blackfoot Cherokee Crow",
                 "Apache Cherokee",
                 "Apache Kiowa Mexican",
                 "Apache Mexican" )
    )
  ) 


category

DODRace, race, and Ethnicity are all true factors in the sense that no values in any of these variables are more ´race´ than other values. Think about category do the values here imply more or less of the same thing?

  • Fix category by changing it into an ordered variable. Use the function factor() and set the argument ordered = TRUE
  • Add the code to your 01_import.R file
  • Try to recreate some of the plots we did on day 2 were we colored the points using category - Notice the difference?
Code
soldiers %>% 
  mutate(
    category = factor(category,
                      levels = (c("Underweight",
                                "Normal range",
                                "Overweight",
                                "Obese")),
                      ordered = TRUE)
  ) 

 

Go back to your home assignment and see if you can improve anything using your new factor and forcats skills

 

1.2 Challenges with forcats + ggplot + dplyr

In mpg

  • Collapse the model types a4 quattro and a4 into a4
  • Create a bar plot sorted by frequency
  • Work out a way to show what manufacturers the models belong to
Here is my suggestion

I have cheated and used some functions (str_to_title() and facet_grid()), and theme options that I haven´t showed you before.

Code
mpg %>% 
  group_by(manufacturer) %>% 
  mutate(
    model = model %>% str_to_title() %>%  fct_collapse(A4 = c("A4", "A4 Quattro")) %>% fct_infreq(),
    manufacturer = str_to_title(manufacturer)) %>% 
  ggplot(aes(y = model, fill = manufacturer)) +
  geom_bar()+
  geom_text(aes(x = 0.5, label = model), size =3,  hjust = 0, check_overlap = TRUE)+ # Display the model name at the position (x=0.5, y = model)
  scale_x_continuous(expand = c(0,0))+ # Remove the padding between the y-axis and the start of the bars
  facet_grid(rows = vars(manufacturer), 
             scales = "free_y", 
             space = "free_y",
             switch = "y")+
  ggthemes::theme_pander()+
  theme(axis.text.y=element_blank(), # Remove the names from y-axis (we used geom_tect instead)
        axis.ticks.y = element_blank(), # Remove y axis ticks (the small lines)
        strip.text.y.left = element_text(angle = 0, hjust = 1), # Change strip text orientation
        legend.position = "none" # remove fill legend
        ) +
  labs(
    title = "Count of car models in `ggplot2::mpg` data set",
    x = "Count of car models",
    y = NULL,
    caption = "Consider a different fill color scale. The current one seems to imply a gradient"
  )



Fix this starwars plot

  • Order the columns after frequency
  • Pimp it with a nice theme and consider adding some nice colors
  • Correct the y-axis label and give the plot a title
  • Consider lumping together some of the levels
Code
ggplot(starwars, aes(y = eye_color)) + 
  geom_bar() 
Here is my suggestion

I have cheated and used three functions (str_to_title(), after_stat(), and scale_fill_gradient()), that I haven´t showed you before.

Code
starwars %>%
  mutate(eye_color = fct_recode(str_to_title(eye_color))) %>%  # Change all factor levels to Title case
  mutate(eye_color = fct_lump(eye_color, 7) %>% fct_infreq()) %>% 
  ggplot(aes(y = eye_color,
             fill = after_stat(count))) +                      # Set the fill color to the count value
  geom_bar() +
  ggthemes::theme_foundation()+
  scale_fill_gradient(low = "grey", high = "black")+           # Create a new fill scale going from grey to black
  labs(
    x = "Count", 
    y = "Eye Color",
    title = "Eye color counts of Starwars characters",
    caption = "Consider... Is the grey gradient disturbing? e.g. ´brown´ has a black color ")+
  theme(plot.title.position = "plot")   # Place the title all the way to the left side



In table1 (another inbuilt dataset)

table1 displays the number of TB cases documented by WHO in Afghanistan, Brazil, and China between 1999 and 2000.

  • Create a new variable called case_pr_mill_pop (cases pr. million)
  • recode country labels so that (Afghanistan = Afg, Brazil = Bra, China = Chi)
  • reorder country factor after case_pr_mill_pop
  • arrange the table after country
Code
table1 %>%
  mutate(
    case_pr_mill_pop = cases/(population/1e6),
    country = fct_recode(country,
                         "Afg" = "Afghanistan",
                         "Bra" = "Brazil",
                         "Chi" = "China"),
    country = fct_reorder(country, case_pr_mill_pop)) %>% 
  arrange(country)