Inspect your data

RStudio
Inspection

Making sure your data is what you think it is

Published: November 26, 2023


Many new R users (and quite a few experienced ones, too) have spent a lot of time scratching their heads, because they thought they knew how their data was structured, only to find it was not.

It is a very good idea, once data has been loaded into memory, to spend a little time inspecting it more carefully.

1 Data structure

1.1 Names

The function names() returns the names of the variables (columns) in an R data frame, or of the elements in a list. For instance, on this page we make use of a simple data frame called ‘df’:

names(df)
[1] "id"              "age"             "department"      "passport_stamps"

The output of names() tells us that ‘df’ contains four columns called ‘id’, ‘age’, etc. For a plain vector, names() returns NULL unless its elements have explicitly been given names.
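As a small sketch of how names() behaves on vectors: an unnamed vector yields NULL, whereas a vector whose elements have been named reports those names. The vectors ‘x’ and ‘y’ are made up for illustration.

```r
# An unnamed vector has no names attribute, so names() returns NULL
x <- c(10, 20, 30)
names(x)
is.null(names(x))

# Once the elements are named, names() reports the names
y <- c(a = 10, b = 20, c = 30)
names(y)
```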

names() can also be used to set the column names of a data frame. What do you think the following code does?

names(df) <- c("Id", "Age", "Department", "Stamps")
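To check your answer, here is a minimal sketch on a hypothetical two-column data frame called ‘toy’ (a stand-in, so that ‘df’ itself is left untouched). It shows that assigning to names() replaces the column names in order, and that a single name can be replaced as well.

```r
# 'toy' is a hypothetical stand-in for 'df'
toy <- data.frame(id = 1:3, age = c(23, 43, 32))
names(toy)

# Assigning to names() replaces all column names, in column order
names(toy) <- c("Id", "Age")
names(toy)

# A single column can be renamed by matching its current name
names(toy)[names(toy) == "Age"] <- "age_years"
names(toy)
```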

1.2 Structure

R also provides the function str() which details the structure of a variable. An example output of the str() function can look like this:

str(df)
'data.frame':   5 obs. of  4 variables:
 $ id             : int  1 2 3 4 5
 $ age            : num  23 43 32 NA 43
 $ department     : chr  "Management" "Marketing" "Accounting" "Public relations" ...
 $ passport_stamps:List of 5
  ..$ : chr  "Norway" "Netherlands"
  ..$ : NULL
  ..$ : chr  "Uruguay" "Mexico" "Canada"
  ..$ : chr "India"
  ..$ : chr  "Kenya" "Botswana"

This output is somewhat more comprehensive and tells us that the variable ‘df’ is a data frame consisting of 5 observations (rows) of 4 variables (columns).

Furthermore, we can see the data type and the first actual data points for each column – e.g. ‘id’ is of type ‘int’ (integer).

Perhaps we expected the variable ‘department’ to be a factor rather than a character vector. Let’s fix that, using mutate() from the dplyr package:

df <- df |> mutate(department = as.factor(department))
str(df)
'data.frame':   5 obs. of  4 variables:
 $ id             : int  1 2 3 4 5
 $ age            : num  23 43 32 NA 43
 $ department     : Factor w/ 4 levels "Accounting","Management",..: 2 3 1 4 1
 $ passport_stamps:List of 5
  ..$ : chr  "Norway" "Netherlands"
  ..$ : NULL
  ..$ : chr  "Uruguay" "Mexico" "Canada"
  ..$ : chr "India"
  ..$ : chr  "Kenya" "Botswana"

Notice that ‘department’ is now of type ‘Factor’ with 4 levels.

Also notice the column ‘passport_stamps’, which is of type ‘list’: each row holds its own character vector of varying length, or NULL where there are no stamps.
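List columns are easy to overlook, so it can help to inspect them separately. The sketch below builds a hypothetical data frame ‘trips’ with a list column similar to ‘passport_stamps’ and uses the base R functions class(), sapply() and lengths() to look inside it.

```r
# 'trips' is a hypothetical data frame with a list column
trips <- data.frame(id = 1:3)
trips$stamps <- list(c("Norway", "Netherlands"), NULL, "India")

# The column itself is a list
class(trips$stamps)

# Each element has its own type; the empty entry is NULL
sapply(trips$stamps, class)

# lengths() gives the number of entries per row (0 for NULL)
lengths(trips$stamps)
```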

2 Data content

2.1 Inspect the head of the data

The function head() lists the first 6 observations of a data frame or a vector by default. This is useful for getting a quick first impression of the data at hand. The number of observations displayed can be specified in the function call, e.g. head(data, n = 10).
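A short sketch of head() in use, on a made-up vector ‘x’ and on the built-in data set ‘mtcars’. The companion function tail(), which is not discussed above, shows the last observations instead.

```r
x <- 1:20

head(x)          # the first 6 elements (the default)
head(x, n = 3)   # the first 3 elements
tail(x, n = 3)   # the last 3 elements

# Works the same way on data frames: the first 10 rows of 'mtcars'
head(mtcars, n = 10)
```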

2.2 Look for missing values

The function is.na() tests whether a given value is NA and returns TRUE or FALSE. If a data frame is passed to is.na(), it returns a logical matrix with the same dimensions, each cell being TRUE or FALSE. Look at this simple example:

df2 <- data.frame(c1 = 1:5, c2 = sample(LETTERS[1:24], 5, TRUE), c3 = letters[1:5])
df2[3, 3] <- NA
df2
  c1 c2   c3
1  1  T    a
2  2  R    b
3  3  C <NA>
4  4  M    d
5  5  B    e

We can now use the is.na() function to look for NA values:

is.na(df2)
        c1    c2    c3
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE

If the data frame is too large to get a good overview of, we could instead count the NAs:

is.na(df2) |> sum()
[1] 1

The code above works like this: the logical values TRUE and FALSE are coerced to the integers 1 and 0 by functions that expect numerical input. The sum() function therefore adds up all the FALSE (0) and TRUE (1) values in the matrix, and the result is the number of NAs.
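The coercion can be made visible step by step. The sketch below uses a hypothetical vector ‘x’ with two missing values; mean() is added to show that the same trick gives the proportion of NAs.

```r
x <- c(4, NA, 7, NA, 1)   # hypothetical vector with two missing values

missing <- is.na(x)       # FALSE TRUE FALSE TRUE FALSE
as.integer(missing)       # 0 1 0 1 0

sum(missing)              # 2: the number of NAs
mean(missing)             # 0.4: the proportion of NAs
```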

If you were interested in finding the row and column of NA values, you could do it like so:

which(is.na(df2), arr.ind = TRUE)
     row col
[1,]   3   3

Or perhaps, if you want to count the NAs on a per-column basis (summarise() and across() come from the dplyr package):

df2 |> summarise(across(c1:c3, ~ sum(is.na(.x))))
  c1 c2 c3
1  0  0  1
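If dplyr is not loaded, a per-column count can also be sketched in base R with colSums(). The data frame ‘d’ below is a deterministic stand-in for ‘df2’ (no random sampling); anyNA() and complete.cases() are base R helpers not covered above.

```r
# 'd' is a stand-in for 'df2' without the random column
d <- data.frame(c1 = 1:5, c2 = LETTERS[1:5], c3 = letters[1:5])
d[3, 3] <- NA

# Number of NAs per column
colSums(is.na(d))

# Is there any NA at all?
anyNA(d)

# Which rows are free of NAs?
complete.cases(d)
```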

2.3 Look at value ranges

Let us return to the data frame ‘df’ and look at the column ‘age’.

summary(df$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  23.00   29.75   37.50   35.25   43.00   43.00       1 

The function summary() gives us some summary statistics for the variable ‘age’, including the minimum and maximum values.
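The minimum and maximum can also be checked against what you consider plausible. In this sketch, ‘age’ holds the values shown in the str() output above, while the limits 0 and 120 are purely illustrative assumptions.

```r
# The 'age' values from the str() output above
age <- c(23, 43, 32, NA, 43)

# Smallest and largest value, ignoring NAs
range(age, na.rm = TRUE)

# Flag values outside a plausible interval (the limits are an assumption)
out_of_range <- !is.na(age) & (age < 0 | age > 120)
which(out_of_range)   # positions of suspicious values, if any
```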

Similarly, look at the output of summary() for the ‘department’ variable, which we converted to a factor:

summary(df$department)
      Accounting       Management        Marketing Public relations 
               2                1                1                1 
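For categorical variables, it is often worth checking the categories themselves, since misspelled or unexpected levels show up immediately. A minimal sketch, where ‘dept’ reproduces the ‘department’ values from the str() output above:

```r
# The 'department' values from the str() output above
dept <- factor(c("Management", "Marketing", "Accounting",
                 "Public relations", "Accounting"))

levels(dept)                    # the distinct categories
nlevels(dept)                   # how many there are
table(dept, useNA = "ifany")    # counts per category, including NAs if present
```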

The rstatix package includes the function get_summary_stats(), which provides even more detailed summary statistics. Look at the output for the data frame ‘df’ below, formatted as a table with kable() from the knitr package. Notice that it only includes the numerical variables/columns:

get_summary_stats(df) |> kable()
variable  n  min  max  median     q1  q3    iqr    mad   mean     sd     se      ci
id        5    1    5     3.0   2.00   4   2.00  1.483   3.00  1.581  0.707   1.963
age       4   23   43    37.5  29.75  43  13.25  8.154  35.25  9.674  4.837  15.393

3 Manually cleaning data

If you find that your data is not structured correctly (e.g. a variable is stored as a character but should be a factor), has unexpected NA values, or has some other issue with its values, you should write R code to clean and restructure the data. Do not edit the raw data.

That way, the data cleaning remains transparent and reversible.
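As a rough sketch of what this can look like, the example below keeps a hypothetical raw data frame ‘raw’ unchanged and puts every cleaning step in a function, here called clean_data(). The data values, the "n/a" placeholder and the function name are all assumptions made for illustration.

```r
# 'raw' stands in for data as loaded from a file; the values are hypothetical
raw <- data.frame(
  id = 1:4,
  age = c("23", "43", "n/a", "32"),
  department = c("Management", "marketing", "Accounting", "Marketing")
)

# All cleaning steps live in one function, so they can be read and re-run
clean_data <- function(d) {
  # Turn the placeholder "n/a" into a real NA, then convert to numeric
  d$age[d$age == "n/a"] <- NA
  d$age <- as.numeric(d$age)

  # Harmonise capitalisation, then store as a factor
  d$department <- paste0(toupper(substring(d$department, 1, 1)),
                         substring(d$department, 2))
  d$department <- factor(d$department)
  d
}

cleaned <- clean_data(raw)
str(cleaned)   # the cleaned copy
str(raw)       # the raw data is unchanged
```

Because ‘raw’ is never modified, the cleaning can be adjusted and re-run at any time.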