Normalization of data

Levels of normalization of data structure…

Author
Published

November 26, 2023


Observing a tidy data format is sufficient for most purposes.

A more comprehensive way to approach data structure is called data normalization: “Normalization is a process that database designers use to eliminate data redundancy, improve data integrity, and enhance the overall efficiency of the database.” 1

The normalization of data has nothing to do with the normal (Gaussian) distribution.

Data normalization it is not a case of either/or – data can be normalized to different levels, depending on your needs. In this text, we will deal only with levels 1 and 2.

1 First Normal Form – 1NF

The First Normal Form (of data normalization) is roughly similar to the Tidy data format and requires that:

  • Each column represents one variable
  • Each variable contains atomic values – i.e. the smallest (indivisible) unit of information
  • Variables do not contain arrays of atomic values
  • The data set does not contain repeating groups of similar variables

Look at the different examples of structures of the same data on the tabs below in light of the requirements for First Normal Form data listed above:

Table 1: A
subject_id subject_name measurement test_day test type
1 John 778,431 MON,WED a,b Blood,Saliva
2 Billy 518,37 WED,TUE a,b Blood,Saliva
3 Joan 784,133 FRI,FRI a,b Blood,Saliva
4 Clare 962,557 FRI,TUE a,b Blood,Saliva
5 Peter 151,305 MON,FRI a,b Blood,Saliva
6 Maggy 441,772 MON,TUE a,b Blood,Saliva
7 Jack 31,533 THU,MON a,b Blood,Saliva
8 Suzy 976,244 MON,FRI a,b Blood,Saliva
9 Billy 658,514 WED,THU a,b Blood,Saliva
10 Mary 221,842 WED,WED a,b Blood,Saliva

More than one value is stored per cell. In other words, each cell consists of an array (or a series of) values, separated by a comma. This does not conform to 1NF.

Table 2: B
subject_id subject_name test_a test_b
1 John 778 (MON : Blood) 431 (WED : Saliva)
2 Billy 518 (WED : Blood) 37 (TUE : Saliva)
3 Joan 784 (FRI : Blood) 133 (FRI : Saliva)
4 Clare 962 (FRI : Blood) 557 (TUE : Saliva)
5 Peter 151 (MON : Blood) 305 (FRI : Saliva)
6 Maggy 441 (MON : Blood) 772 (TUE : Saliva)
7 Jack 31 (THU : Blood) 533 (MON : Saliva)
8 Suzy 976 (MON : Blood) 244 (FRI : Saliva)
9 Billy 658 (WED : Blood) 514 (THU : Saliva)
10 Mary 221 (WED : Blood) 842 (WED : Saliva)

The data is not atomic, i.e. indivisible. The cells of columns ‘test_a’ and ‘test_b’ contain three different values (measurement, weekday, and type). This does not conform to 1NF.

Table 3: C
subject_id subject_name test_a test_b day_of_test_a day_of_test_b test_a_type test_b_type
1 John 778 431 MON WED Blood Saliva
2 Billy 518 37 WED TUE Blood Saliva
3 Joan 784 133 FRI FRI Blood Saliva
4 Clare 962 557 FRI TUE Blood Saliva
5 Peter 151 305 MON FRI Blood Saliva
6 Maggy 441 772 MON TUE Blood Saliva
7 Jack 31 533 THU MON Blood Saliva
8 Suzy 976 244 MON FRI Blood Saliva
9 Billy 658 514 WED THU Blood Saliva
10 Mary 221 842 WED WED Blood Saliva

The data is atomic, and there is only one data point per cell. However, the data set contains repeating groups of similar variables (‘a’ and ‘b’). This does not conform to 1NF.

Table 4: D
subject_id subject_name test_id type day measurement
1 John a Blood MON 778
1 John b Saliva WED 431
2 Billy a Blood WED 518
2 Billy b Saliva TUE 37
3 Joan a Blood FRI 784
3 Joan b Saliva FRI 133
4 Clare a Blood FRI 962
4 Clare b Saliva TUE 557
5 Peter a Blood MON 151
5 Peter b Saliva FRI 305
6 Maggy a Blood MON 441
6 Maggy b Saliva TUE 772
7 Jack a Blood THU 31
7 Jack b Saliva MON 533
8 Suzy a Blood MON 976
8 Suzy b Saliva FRI 244
9 Billy a Blood WED 658
9 Billy b Saliva THU 514
10 Mary a Blood WED 221
10 Mary b Saliva WED 842

This data conforms to the First Normal Form (1NF) data structure.

2 Second Normal Form – 2NF

The Second Normal Form requires that data conforms to the 1NF requirements and additionally:

  • Data contains a single-column primary key.
  • That all non-key variables ‘depend’ on the entire primary key.
Table 5: A
subject_id subject_name test_id type day measurement
1 John a Blood MON 778
1 John b Saliva WED 431
2 Billy a Blood WED 518
2 Billy b Saliva TUE 37
3 Joan a Blood FRI 784
3 Joan b Saliva FRI 133
4 Clare a Blood FRI 962
4 Clare b Saliva TUE 557
5 Peter a Blood MON 151
5 Peter b Saliva FRI 305
6 Maggy a Blood MON 441
6 Maggy b Saliva TUE 772
7 Jack a Blood THU 31
7 Jack b Saliva MON 533
8 Suzy a Blood MON 976
8 Suzy b Saliva FRI 244
9 Billy a Blood WED 658
9 Billy b Saliva THU 514
10 Mary a Blood WED 221
10 Mary b Saliva WED 842

The data does not contain a single-column primary key. Each observation is unique and identifiable by the (compound) primary key subject_id + test_id.

Table 6: B
key subject_id subject_name test_id type day measurement
A 1 John a Blood MON 778
B 1 John b Saliva WED 431
C 2 Billy a Blood WED 518
D 2 Billy b Saliva TUE 37
E 3 Joan a Blood FRI 784
F 3 Joan b Saliva FRI 133
G 4 Clare a Blood FRI 962
H 4 Clare b Saliva TUE 557
I 5 Peter a Blood MON 151
J 5 Peter b Saliva FRI 305
K 6 Maggy a Blood MON 441
L 6 Maggy b Saliva TUE 772
M 7 Jack a Blood THU 31
N 7 Jack b Saliva MON 533
O 8 Suzy a Blood MON 976
P 8 Suzy b Saliva FRI 244
Q 9 Billy a Blood WED 658
R 9 Billy b Saliva THU 514
S 10 Mary a Blood WED 221
T 10 Mary b Saliva WED 842

The data does contains a single-column primary key (key), but all non-key variables do not ‘depend’ on the entire primary key. Specifically, the variable type is contingent exclusively on test_id. In this example, there is a very simple \(1:1\) relationship between test_id and type (test ‘a’ is always ‘Blood’, and ‘b’ always ‘Saliva’), but that need not be the case.

Data frame #1

Table 7: C1
key subject_id subject_name test_id day measurement
A 1 John a MON 778
B 1 John b WED 431
C 2 Billy a WED 518
D 2 Billy b TUE 37
E 3 Joan a FRI 784
F 3 Joan b FRI 133
G 4 Clare a FRI 962
H 4 Clare b TUE 557
I 5 Peter a MON 151
J 5 Peter b FRI 305
K 6 Maggy a MON 441
L 6 Maggy b TUE 772
M 7 Jack a THU 31
N 7 Jack b MON 533
O 8 Suzy a MON 976
P 8 Suzy b FRI 244
Q 9 Billy a WED 658
R 9 Billy b THU 514
S 10 Mary a WED 221
T 10 Mary b WED 842

Data frame #2

Table 8: C2
test_id type
a Blood
b Saliva

This data structure conforms to Second Normal Form (2NF) data structure. The data has been split into two separate tables, with no redundancy of information. The variable test_id in the first, larger data set, corresponds to test_id in the second, smaller data set. This structure makes data much easier to survey, when the amount of data grows large. Conversely, if data from datasets are needed for a given analysis, the data has to wrangled first (specifically merge with a join command).

3 Further Normal Forms

For most statistical analyses 2NF will suffice. If your data set is very large and complex, requiring a relational database system, look online for further details about data normalization - e.g. this link