Tidy data

A few words about tidy data structure…

Published: November 26, 2023


It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data.

… a quote from the tidyverse page on tidy data.

Please consider the differences between raw data, cleaned data and wrangled data:

The raw data is the unadulterated version of the data as collected by whatever means you collect your data: questionnaires, machine sensor readings, etc. Within GDPR rules, you should always keep a version of the raw data in its original form.

Cleaned data is the raw data after you have made the minimal changes necessary to make the data useful, for example: deletion of observations which are flawed due to apparatus malfunction or data entry mistakes, deletion of variables that were never collected, and conversion of data types where necessary. The process whereby the raw data is cleaned should be scripted (coded) to ensure that it is repeatable and documented.
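As a minimal sketch of what such a cleaning script might look like in R (the file paths and column names here are purely hypothetical), something along these lines would do:

```r
library(readr)
library(dplyr)

# Hypothetical file path and column names -- substitute your own.
raw <- read_csv("data/raw/measurements_raw.csv")

cleaned <- raw |>
  filter(!apparatus_malfunction) |>      # drop observations flawed by apparatus malfunction
  select(-never_collected_variable) |>   # drop a variable that was never actually collected
  mutate(id = as.integer(id))            # convert data types where necessary

# The raw file stays untouched; the cleaned version is written separately.
write_csv(cleaned, "data/cleaned/measurements_cleaned.csv")
```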

The cleaned data probably needs to be wrangled into a shape (and content) appropriate for specific analyses. For example, a wrangled data set may include only specific observations of specific variables relevant to a given analysis, in a format/shape suited for that analysis. This should also be scripted to ensure that it is repeatable and documented.
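The wrangling step can be scripted in the same way. Again a minimal sketch with hypothetical file paths and column names, keeping only the observations and variables that one particular analysis needs:

```r
library(readr)
library(dplyr)

cleaned <- read_csv("data/cleaned/measurements_cleaned.csv")

wrangled <- cleaned |>
  filter(!is.na(test_a)) |>            # only observations with a test 'a' result
  select(id, test_a, day_of_test_a)    # only the variables this analysis needs

write_csv(wrangled, "data/wrangled/test_a_analysis.csv")
```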

The above link to the tidyverse page on tidy data provides a lot of information about tidy data, but the central principle is that, with tidy data, every column is a variable, every row is an observation, and every cell is a single value.

…this sort of assumes that data is stored in a rectangular data frame (or tibble).
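As a small, purely hypothetical illustration of such a rectangular structure as a tibble:

```r
library(tibble)

# Hypothetical toy data: each column is a variable, each row is an observation,
# and each cell holds a single value.
tibble(
  subject   = c("s1", "s2", "s3"),
  height_cm = c(172, 181, 165),
  weight_kg = c(68, 75, 59)
)
```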

In the three tables below (A, B and C), you can see three examples of the same data set, structured in different ways. Look at each of them …

Table 1: A

id  test_a  test_b  day_of_test_a  day_of_test_b
 1     741     528  FRI            MON
 2     491     367  FRI            THU
 3     490     980  MON            TUE
 4     341     234  TUE            THU
 5     544     107  TUE            WED
 6     956     213  THU            WED
 7     201     156  MON            FRI
 8     793     945  WED            FRI
 9     756     144  FRI            WED
10     164     745  MON            THU
Table 2: B

id  test  day  measurement
 1  a     FRI          741
 1  b     MON          528
 2  a     FRI          491
 2  b     THU          367
 3  a     MON          490
 3  b     TUE          980
 4  a     TUE          341
 4  b     THU          234
 5  a     TUE          544
 5  b     WED          107
 6  a     THU          956
 6  b     WED          213
 7  a     MON          201
 7  b     FRI          156
 8  a     WED          793
 8  b     FRI          945
 9  a     FRI          756
 9  b     WED          144
10  a     MON          164
10  b     THU          745
Table 3: C

id  test_a   test_b
 1  741,FRI  528,MON
 2  491,FRI  367,THU
 3  490,MON  980,TUE
 4  341,TUE  234,THU
 5  544,TUE  107,WED
 6  956,THU  213,WED
 7  201,MON  156,FRI
 8  793,WED  945,FRI
 9  756,FRI  144,WED
10  164,MON  745,THU

Consider the three different ways to structure the data in light of the tidy data principles above.

Which of the three structures/tables represents the tidiest data structure?

Ask yourself what different units of information (i.e. data points) constitute each observation … and how the relations between data points specify such an observation.

It is obvious that each of the numerical values (measurements) represents data, but so do ‘id’ and ‘weekday’, as well as the test (‘a’ versus ‘b’).

It seems from the data that each id was tested on two occasions (‘a’ and ‘b’), which fell on different weekdays.

In other words, ‘id’, ‘test’, ‘weekday’ and ‘measurement’ all represent units of information (data points) which together constitute an observation, but they are related in a non-trivial manner:

For instance, the data points id=1, test=a, weekday=FRI and measurement=741 are related as a single observation. Similarly, the data points id=1, test=b, weekday=MON and measurement=528 are related as another, distinct observation.

The tidiest data structure is thus Table 2 above: each row represents an observation, and each column represents one of the variables that constitute an observation. Note, however, that no single variable is unique per observation; instead, it is the combination of variables (in this case, ‘id’ and ‘test’) that constitutes a unique identifier. This is not necessarily a problem.
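One way to convince yourself of this is to count the id/test combinations. A sketch, assuming Table 2 has been read into a tibble called table_b (a hypothetical name):

```r
library(dplyr)

# Each id/test combination should occur exactly once if, together,
# the two variables identify an observation.
table_b |>
  count(id, test) |>
  filter(n > 1)
# zero rows returned: no combination occurs more than once
```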

Table 1 may seem more intuitive, and is probably easier to set up as a data entry interface, e.g. a spreadsheet. At first impression, it also has the benefit that each line includes a unique identifier (id). In reality, however, this data structure stores some information (i.e. whether the test was ‘a’ or ‘b’) as column names rather than as actual data in cells. Table 3 is even more problematic: not only does it store data in the column names, it also stores multiple data points in each cell, and data of different types (numeric vs. text) at that.
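Both structures can nevertheless be wrangled into the tidy shape of Table 2 with tidyr. The sketch below assumes Tables 1 and 3 have been read into tibbles called table_a and table_c (hypothetical names):

```r
library(dplyr)
library(tidyr)

# Table 1 (A): the test ('a'/'b') is hidden in the column names.
tidy_a <- table_a |>
  rename(measurement_a = test_a, measurement_b = test_b,
         day_a = day_of_test_a, day_b = day_of_test_b) |>
  pivot_longer(cols = -id,
               names_to = c(".value", "test"),   # split names into value column + test
               names_sep = "_") |>
  select(id, test, day, measurement)

# Table 3 (C): the test is in the column names AND each cell holds two values.
tidy_c <- table_c |>
  pivot_longer(cols = -id,
               names_to = "test", names_prefix = "test_",
               values_to = "value") |>
  separate(value, into = c("measurement", "day"), sep = ",", convert = TRUE) |>
  select(id, test, day, measurement)
```

If all goes well, both tidy_a and tidy_c end up with the same 20 rows as Table 2 (up to row order).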