subject_id | subject_name | measurement | test_day | test | type |
---|---|---|---|---|---|
1 | John | 778,431 | MON,WED | a,b | Blood,Saliva |
2 | Billy | 518,37 | WED,TUE | a,b | Blood,Saliva |
3 | Joan | 784,133 | FRI,FRI | a,b | Blood,Saliva |
4 | Clare | 962,557 | FRI,TUE | a,b | Blood,Saliva |
5 | Peter | 151,305 | MON,FRI | a,b | Blood,Saliva |
6 | Maggy | 441,772 | MON,TUE | a,b | Blood,Saliva |
7 | Jack | 31,533 | THU,MON | a,b | Blood,Saliva |
8 | Suzy | 976,244 | MON,FRI | a,b | Blood,Saliva |
9 | Billy | 658,514 | WED,THU | a,b | Blood,Saliva |
10 | Mary | 221,842 | WED,WED | a,b | Blood,Saliva |
Normalization of data
Levels of normalization of data structure…
Observing a tidy data format is sufficient for most purposes.
A more comprehensive way to approach data structure is called data normalization: “Normalization is a process that database designers use to eliminate data redundancy, improve data integrity, and enhance the overall efficiency of the database.” 1
The normalization of data has nothing to do with the normal (Gaussian) distribution.
Data normalization it is not a case of either/or – data can be normalized to different levels, depending on your needs. In this text, we will deal only with levels 1 and 2.
1 First Normal Form – 1NF
The First Normal Form (of data normalization) is roughly similar to the Tidy data format and requires that:
- Each column represents one variable
- Each variable contains atomic values – i.e. the smallest (indivisible) unit of information
- Variables do not contain arrays of atomic values
- The data set does not contain repeating groups of similar variables
Look at the different examples of structures of the same data on the tabs below in light of the requirements for First Normal Form data listed above:
More than one value is stored per cell. In other words, each cell consists of an array (or a series of) values, separated by a comma. This does not conform to 1NF.
subject_id | subject_name | test_a | test_b |
---|---|---|---|
1 | John | 778 (MON : Blood) | 431 (WED : Saliva) |
2 | Billy | 518 (WED : Blood) | 37 (TUE : Saliva) |
3 | Joan | 784 (FRI : Blood) | 133 (FRI : Saliva) |
4 | Clare | 962 (FRI : Blood) | 557 (TUE : Saliva) |
5 | Peter | 151 (MON : Blood) | 305 (FRI : Saliva) |
6 | Maggy | 441 (MON : Blood) | 772 (TUE : Saliva) |
7 | Jack | 31 (THU : Blood) | 533 (MON : Saliva) |
8 | Suzy | 976 (MON : Blood) | 244 (FRI : Saliva) |
9 | Billy | 658 (WED : Blood) | 514 (THU : Saliva) |
10 | Mary | 221 (WED : Blood) | 842 (WED : Saliva) |
The data is not atomic, i.e. indivisible. The cells of columns ‘test_a’ and ‘test_b’ contain three different values (measurement, weekday, and type). This does not conform to 1NF.
subject_id | subject_name | test_a | test_b | day_of_test_a | day_of_test_b | test_a_type | test_b_type |
---|---|---|---|---|---|---|---|
1 | John | 778 | 431 | MON | WED | Blood | Saliva |
2 | Billy | 518 | 37 | WED | TUE | Blood | Saliva |
3 | Joan | 784 | 133 | FRI | FRI | Blood | Saliva |
4 | Clare | 962 | 557 | FRI | TUE | Blood | Saliva |
5 | Peter | 151 | 305 | MON | FRI | Blood | Saliva |
6 | Maggy | 441 | 772 | MON | TUE | Blood | Saliva |
7 | Jack | 31 | 533 | THU | MON | Blood | Saliva |
8 | Suzy | 976 | 244 | MON | FRI | Blood | Saliva |
9 | Billy | 658 | 514 | WED | THU | Blood | Saliva |
10 | Mary | 221 | 842 | WED | WED | Blood | Saliva |
The data is atomic, and there is only one data point per cell. However, the data set contains repeating groups of similar variables (‘a’ and ‘b’). This does not conform to 1NF.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | MON | 778 |
1 | John | b | Saliva | WED | 431 |
2 | Billy | a | Blood | WED | 518 |
2 | Billy | b | Saliva | TUE | 37 |
3 | Joan | a | Blood | FRI | 784 |
3 | Joan | b | Saliva | FRI | 133 |
4 | Clare | a | Blood | FRI | 962 |
4 | Clare | b | Saliva | TUE | 557 |
5 | Peter | a | Blood | MON | 151 |
5 | Peter | b | Saliva | FRI | 305 |
6 | Maggy | a | Blood | MON | 441 |
6 | Maggy | b | Saliva | TUE | 772 |
7 | Jack | a | Blood | THU | 31 |
7 | Jack | b | Saliva | MON | 533 |
8 | Suzy | a | Blood | MON | 976 |
8 | Suzy | b | Saliva | FRI | 244 |
9 | Billy | a | Blood | WED | 658 |
9 | Billy | b | Saliva | THU | 514 |
10 | Mary | a | Blood | WED | 221 |
10 | Mary | b | Saliva | WED | 842 |
This data conforms to the First Normal Form (1NF) data structure.
2 Second Normal Form – 2NF
The Second Normal Form requires that data conforms to the 1NF requirements and additionally:
- Data contains a single-column primary key.
- That all non-key variables ‘depend’ on the entire primary key.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | MON | 778 |
1 | John | b | Saliva | WED | 431 |
2 | Billy | a | Blood | WED | 518 |
2 | Billy | b | Saliva | TUE | 37 |
3 | Joan | a | Blood | FRI | 784 |
3 | Joan | b | Saliva | FRI | 133 |
4 | Clare | a | Blood | FRI | 962 |
4 | Clare | b | Saliva | TUE | 557 |
5 | Peter | a | Blood | MON | 151 |
5 | Peter | b | Saliva | FRI | 305 |
6 | Maggy | a | Blood | MON | 441 |
6 | Maggy | b | Saliva | TUE | 772 |
7 | Jack | a | Blood | THU | 31 |
7 | Jack | b | Saliva | MON | 533 |
8 | Suzy | a | Blood | MON | 976 |
8 | Suzy | b | Saliva | FRI | 244 |
9 | Billy | a | Blood | WED | 658 |
9 | Billy | b | Saliva | THU | 514 |
10 | Mary | a | Blood | WED | 221 |
10 | Mary | b | Saliva | WED | 842 |
The data does not contain a single-column primary key. Each observation is unique and identifiable by the (compound) primary key subject_id
+ test_id
.
key | subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|---|
A | 1 | John | a | Blood | MON | 778 |
B | 1 | John | b | Saliva | WED | 431 |
C | 2 | Billy | a | Blood | WED | 518 |
D | 2 | Billy | b | Saliva | TUE | 37 |
E | 3 | Joan | a | Blood | FRI | 784 |
F | 3 | Joan | b | Saliva | FRI | 133 |
G | 4 | Clare | a | Blood | FRI | 962 |
H | 4 | Clare | b | Saliva | TUE | 557 |
I | 5 | Peter | a | Blood | MON | 151 |
J | 5 | Peter | b | Saliva | FRI | 305 |
K | 6 | Maggy | a | Blood | MON | 441 |
L | 6 | Maggy | b | Saliva | TUE | 772 |
M | 7 | Jack | a | Blood | THU | 31 |
N | 7 | Jack | b | Saliva | MON | 533 |
O | 8 | Suzy | a | Blood | MON | 976 |
P | 8 | Suzy | b | Saliva | FRI | 244 |
Q | 9 | Billy | a | Blood | WED | 658 |
R | 9 | Billy | b | Saliva | THU | 514 |
S | 10 | Mary | a | Blood | WED | 221 |
T | 10 | Mary | b | Saliva | WED | 842 |
The data does contains a single-column primary key (key
), but all non-key variables do not ‘depend’ on the entire primary key. Specifically, the variable type
is contingent exclusively on test_id
. In this example, there is a very simple \(1:1\) relationship between test_id
and type
(test ‘a’ is always ‘Blood’, and ‘b’ always ‘Saliva’), but that need not be the case.
Data frame #1
key | subject_id | subject_name | test_id | day | measurement |
---|---|---|---|---|---|
A | 1 | John | a | MON | 778 |
B | 1 | John | b | WED | 431 |
C | 2 | Billy | a | WED | 518 |
D | 2 | Billy | b | TUE | 37 |
E | 3 | Joan | a | FRI | 784 |
F | 3 | Joan | b | FRI | 133 |
G | 4 | Clare | a | FRI | 962 |
H | 4 | Clare | b | TUE | 557 |
I | 5 | Peter | a | MON | 151 |
J | 5 | Peter | b | FRI | 305 |
K | 6 | Maggy | a | MON | 441 |
L | 6 | Maggy | b | TUE | 772 |
M | 7 | Jack | a | THU | 31 |
N | 7 | Jack | b | MON | 533 |
O | 8 | Suzy | a | MON | 976 |
P | 8 | Suzy | b | FRI | 244 |
Q | 9 | Billy | a | WED | 658 |
R | 9 | Billy | b | THU | 514 |
S | 10 | Mary | a | WED | 221 |
T | 10 | Mary | b | WED | 842 |
Data frame #2
test_id | type |
---|---|
a | Blood |
b | Saliva |
This data structure conforms to Second Normal Form (2NF) data structure. The data has been split into two separate tables, with no redundancy of information. The variable test_id
in the first, larger data set, corresponds to test_id
in the second, smaller data set. This structure makes data much easier to survey, when the amount of data grows large. Conversely, if data from datasets are needed for a given analysis, the data has to wrangled first (specifically merge with a join
command).
3 Further Normal Forms
For most statistical analyses 2NF will suffice. If your data set is very large and complex, requiring a relational database system, look online for further details about data normalization - e.g. this link