Wrangle your data into tidy data

RStudio
Inspection

A few words about tidy data structure…

Author
Published

November 26, 2023


It is often said that 80% of data analysis is spent on cleaning and preparing the data.

… a quote from the tidyverse page on tidy data.

Please consider the differences between raw, cleaned, and wrangled data:

The raw data is the unadulterated version of the data as collected by whatever means: questionnaires, machine readings, etc. You should always keep a version of the raw data in its original form.

Cleaned data is the raw data after the minimal changes necessary to make the data useful: for example, deletion of observations that are flawed due to apparatus malfunction or data entry mistakes, and deletion of variables that were never collected. The process whereby the raw data is cleaned should be scripted (coded) to ensure that it is repeatable and documented.

The cleaned data probably needs to be wrangled into a shape (and content) appropriate for analysis. For example, a wrangled data set may include only the specific observations and variables relevant to a given analysis, in a format/shape suited for that analysis. This should also be scripted to ensure that it is repeatable and documented.
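As a sketch of what such a scripted cleaning step might look like, consider the snippet below. The file names and the variables `malfunction` and `never_collected` are hypothetical, purely for illustration:

```r
library(readr)
library(dplyr)

# Read the raw data from its original, untouched file (hypothetical file name)
raw <- read_csv("raw_data.csv")

cleaned <- raw |>
  # Drop observations flagged as apparatus malfunctions (hypothetical variable)
  filter(!malfunction) |>
  # Drop a variable that was never actually collected (hypothetical variable)
  select(-never_collected)

# Save the cleaned data; the raw file is left untouched
write_csv(cleaned, "cleaned_data.csv")
```

Because the cleaning lives in a script, rerunning it on the original raw file reproduces the cleaned data exactly, and the script itself documents every change made.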

The link to the tidyverse page on tidy data provides a lot of information about tidy data, but the central principle is that with tidy data, every column is a variable, every row is an observation, and every cell holds a single value.

In the following three tables, you can see three examples of the same data set, structured in different ways. Look at each of them…

Table 1: A
id test_a test_b day_of_test_a day_of_test_b
1 794 739 MON TUE
2 842 658 THU MON
3 545 973 FRI MON
4 676 859 TUE MON
5 744 429 TUE FRI
6 752 323 FRI THU
7 387 426 WED WED
8 215 199 FRI TUE
9 458 802 TUE WED
10 403 857 TUE THU
Table 2: B
id test day measurement
1 a MON 794
1 b TUE 739
2 a THU 842
2 b MON 658
3 a FRI 545
3 b MON 973
4 a TUE 676
4 b MON 859
5 a TUE 744
5 b FRI 429
6 a FRI 752
6 b THU 323
7 a WED 387
7 b WED 426
8 a FRI 215
8 b TUE 199
9 a TUE 458
9 b WED 802
10 a TUE 403
10 b THU 857
Table 3: C
id test_a test_b
1 794,MON 739,TUE
2 842,THU 658,MON
3 545,FRI 973,MON
4 676,TUE 859,MON
5 744,TUE 429,FRI
6 752,FRI 323,THU
7 387,WED 426,WED
8 215,FRI 199,TUE
9 458,TUE 802,WED
10 403,TUE 857,THU

Consider the three different ways to structure the data in light of:

Which of the three structures/tables represents the tidiest data structure?

Ask yourself: what different pieces of information (i.e. data points) constitute each observation, and how does the relation between data points specify such an observation?

It is obvious that each of the numerical values (measurements) represents data, but so do the ‘id’ and the weekday, as well as the test ‘a’ versus ‘b’.

It seems from the data that each id was tested on two occasions (‘a’ and ‘b’), which may fall on different weekdays.

In other words, ‘id’, ‘test’, ‘weekday’ and ‘measurement’ all represent information (data points) which together constitute an observation, but they are related in a non-trivial manner:

For instance, the data id=1, test=a, weekday=MON and measurement=794 are related as a single observation.

The tidiest data structure is thus Table 2 above: each row represents an observation, and each column represents one of the variables that constitute each observation. Note, however, that no single variable is unique per observation – instead, it is the combination of variables that constitutes a unique identifier (in this case, ‘id’ and ‘test’ in combination). This is not a problem!

Table 1 may seem more intuitive, and it is probably easier to set up as a data entry interface, e.g. in a spreadsheet. At first impression, it also has the benefit that each line includes a unique identifier (id). In reality, however, this data structure stores some information (e.g. whether the test was ‘a’ or ‘b’) as column names, rather than as actual data in cells. Table 3 is even more problematic: not only does it store data in the column names, it also stores multiple data points in each cell, and data of different types (numeric vs text) at that.
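Even Table 3 can be rescued with a script: first gather the two test columns, then split each ‘measurement,weekday’ cell into two proper columns. A sketch with tidyr (only the first two ids typed in; `separate()` splits on the comma, and `convert = TRUE` turns the measurement back into a number):

```r
library(tidyr)

# Table 3, with two data points packed into each cell
table3 <- tibble::tribble(
  ~id,   ~test_a,   ~test_b,
    1, "794,MON", "739,TUE",
    2, "842,THU", "658,MON"
)

tidy <- table3 |>
  pivot_longer(
    cols = -id,
    names_to = "test",
    names_prefix = "test_",  # keep only the "a"/"b" part of the column name
    values_to = "value"
  ) |>
  # split "794,MON" into measurement = 794 and day = "MON"
  separate(value, into = c("measurement", "day"), sep = ",", convert = TRUE)
```

This recovers the same tidy shape as Table 2, which illustrates the general point: untidy data can usually be repaired, but only by extra wrangling code that a tidy structure would not have needed.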