Base R - missing data

Author

Søren O’Neill

Published

January 10, 2024

1 Handling missing data

Take a look at this simple code:

# Our seven participants self-reported their age:
participants_age <- c(23,25,23,21,25,NA,20)

# What is the mean age?
mean(participants_age)
[1] NA

The NA value indicates a data point that is Not Available, i.e. missing. This is fundamentally not the same as NULL or 0. So how would you calculate a mean value? Well, when the age of one participant is unavailable, so is the mean value, and R, quite rightly, reports the mean as NA.
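To make the difference between NA, NULL and 0 concrete, here is a small sketch. The vectors `with_na` and `with_null` are made-up examples and not part of the participant data:

```r
# NA is a placeholder for a missing value; it still occupies a position.
with_na <- c(23, NA, 20)
length(with_na)       # 3
is.na(with_na)        # FALSE  TRUE FALSE

# NULL is the absence of an object; it vanishes when combined into a vector.
with_null <- c(23, NULL, 20)
length(with_null)     # 2

# 0 is an ordinary number and takes part in the calculation.
mean(c(23, 0, 20))    # 14.33333

# NA, on the other hand, makes the result unknown.
mean(with_na)         # NA
```

Replacing a missing age with 0 would pull the mean down, and using NULL would silently drop the participant, which is why NA exists as its own kind of value.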

You might think the obvious thing to do is to calculate the mean of the available values and ignore the one NA. In some cases that makes sense and is a valid decision to make (in others, perhaps not). The important thing is: you have to decide how to deal with missing values - R won’t do it for you. This is not a bug, it’s a (safety) feature.
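Before making that decision, it can help to check how much data is actually missing. The following is only a sketch of one way to do that with `is.na()`; the variable names `n_missing`, `which_missing` and `share_missing` are chosen here for illustration:

```r
# Our seven participants self-reported their age:
participants_age <- c(23, 25, 23, 21, 25, NA, 20)

# How many values are missing?
n_missing <- sum(is.na(participants_age))
n_missing        # 1

# Which participant(s) did not report an age?
which_missing <- which(is.na(participants_age))
which_missing    # 6

# What share of the values is missing?
share_missing <- mean(is.na(participants_age))
share_missing    # 0.1428571
```

One missing value out of seven may be acceptable to ignore in one study and a serious problem in another. That judgement stays with you.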

Look at this amended code:

# Our seven participants self-reported their age:
participants_age <- c(23,25,23,21,25,NA,20)

# What is the mean age?
mean(participants_age, na.rm=TRUE)
[1] 22.83333

What is your best guess as to the meaning of na.rm=TRUE?