Normalization of data

Levels of normalization of data structure…

Author

Published

November 26, 2023

Observing a tidy data format is sufficient for most purposes.

A more comprehensive way to approach data structure is called data normalization: “Normalization is a process that database designers use to eliminate data redundancy, improve data integrity, and enhance the overall efficiency of the database.” ¹

The normalization of data has nothing to do with the normal (Gaussian) distribution.

Data normalization it is not a case of either/or – data can be normalized to different levels, depending on your needs. In this text, we will deal only with levels 1 and 2.

1 First Normal Form – 1NF

The First Normal Form (of data normalization) is roughly similar to the Tidy data format and requires that:

Each column represents one variable
Each variable contains atomic values – i.e. the smallest (indivisible) unit of information
Variables do not contain arrays of atomic values
The data set does not contain repeating groups of similar variables

Look at the different examples of structures of the same data on the tabs below in light of the requirements for First Normal Form data listed above:

Table 1: A
subject_id	subject_name	measurement	test_day	test	type
1	John	778,431	MON,WED	a,b	Blood,Saliva
2	Billy	518,37	WED,TUE	a,b	Blood,Saliva
3	Joan	784,133	FRI,FRI	a,b	Blood,Saliva
4	Clare	962,557	FRI,TUE	a,b	Blood,Saliva
5	Peter	151,305	MON,FRI	a,b	Blood,Saliva
6	Maggy	441,772	MON,TUE	a,b	Blood,Saliva
7	Jack	31,533	THU,MON	a,b	Blood,Saliva
8	Suzy	976,244	MON,FRI	a,b	Blood,Saliva
9	Billy	658,514	WED,THU	a,b	Blood,Saliva
10	Mary	221,842	WED,WED	a,b	Blood,Saliva

More than one value is stored per cell. In other words, each cell consists of an array (or a series of) values, separated by a comma. This does not conform to 1NF.

Table 2: B
subject_id	subject_name	test_a	test_b
1	John	778 (MON : Blood)	431 (WED : Saliva)
2	Billy	518 (WED : Blood)	37 (TUE : Saliva)
3	Joan	784 (FRI : Blood)	133 (FRI : Saliva)
4	Clare	962 (FRI : Blood)	557 (TUE : Saliva)
5	Peter	151 (MON : Blood)	305 (FRI : Saliva)
6	Maggy	441 (MON : Blood)	772 (TUE : Saliva)
7	Jack	31 (THU : Blood)	533 (MON : Saliva)
8	Suzy	976 (MON : Blood)	244 (FRI : Saliva)
9	Billy	658 (WED : Blood)	514 (THU : Saliva)
10	Mary	221 (WED : Blood)	842 (WED : Saliva)

The data is not atomic, i.e. indivisible. The cells of columns ‘test_a’ and ‘test_b’ contain three different values (measurement, weekday, and type). This does not conform to 1NF.

Table 3: C
subject_id	subject_name	test_a	test_b	day_of_test_a	day_of_test_b	test_a_type	test_b_type
1	John	778	431	MON	WED	Blood	Saliva
2	Billy	518	37	WED	TUE	Blood	Saliva
3	Joan	784	133	FRI	FRI	Blood	Saliva
4	Clare	962	557	FRI	TUE	Blood	Saliva
5	Peter	151	305	MON	FRI	Blood	Saliva
6	Maggy	441	772	MON	TUE	Blood	Saliva
7	Jack	31	533	THU	MON	Blood	Saliva
8	Suzy	976	244	MON	FRI	Blood	Saliva
9	Billy	658	514	WED	THU	Blood	Saliva
10	Mary	221	842	WED	WED	Blood	Saliva

The data is atomic, and there is only one data point per cell. However, the data set contains repeating groups of similar variables (‘a’ and ‘b’). This does not conform to 1NF.

Table 4: D
subject_id	subject_name	test_id	type	day	measurement
1	John	a	Blood	MON	778
1	John	b	Saliva	WED	431
2	Billy	a	Blood	WED	518
2	Billy	b	Saliva	TUE	37
3	Joan	a	Blood	FRI	784
3	Joan	b	Saliva	FRI	133
4	Clare	a	Blood	FRI	962
4	Clare	b	Saliva	TUE	557
5	Peter	a	Blood	MON	151
5	Peter	b	Saliva	FRI	305
6	Maggy	a	Blood	MON	441
6	Maggy	b	Saliva	TUE	772
7	Jack	a	Blood	THU	31
7	Jack	b	Saliva	MON	533
8	Suzy	a	Blood	MON	976
8	Suzy	b	Saliva	FRI	244
9	Billy	a	Blood	WED	658
9	Billy	b	Saliva	THU	514
10	Mary	a	Blood	WED	221
10	Mary	b	Saliva	WED	842

This data conforms to the First Normal Form (1NF) data structure.

2 Second Normal Form – 2NF

The Second Normal Form requires that data conforms to the 1NF requirements and additionally:

Data contains a single-column primary key.
That all non-key variables ‘depend’ on the entire primary key.

Table 5: A
subject_id	subject_name	test_id	type	day	measurement
1	John	a	Blood	MON	778
1	John	b	Saliva	WED	431
2	Billy	a	Blood	WED	518
2	Billy	b	Saliva	TUE	37
3	Joan	a	Blood	FRI	784
3	Joan	b	Saliva	FRI	133
4	Clare	a	Blood	FRI	962
4	Clare	b	Saliva	TUE	557
5	Peter	a	Blood	MON	151
5	Peter	b	Saliva	FRI	305
6	Maggy	a	Blood	MON	441
6	Maggy	b	Saliva	TUE	772
7	Jack	a	Blood	THU	31
7	Jack	b	Saliva	MON	533
8	Suzy	a	Blood	MON	976
8	Suzy	b	Saliva	FRI	244
9	Billy	a	Blood	WED	658
9	Billy	b	Saliva	THU	514
10	Mary	a	Blood	WED	221
10	Mary	b	Saliva	WED	842

The data does not contain a single-column primary key. Each observation is unique and identifiable by the (compound) primary key subject_id + test_id.

Table 6: B
key	subject_id	subject_name	test_id	type	day	measurement
A	1	John	a	Blood	MON	778
B	1	John	b	Saliva	WED	431
C	2	Billy	a	Blood	WED	518
D	2	Billy	b	Saliva	TUE	37
E	3	Joan	a	Blood	FRI	784
F	3	Joan	b	Saliva	FRI	133
G	4	Clare	a	Blood	FRI	962
H	4	Clare	b	Saliva	TUE	557
I	5	Peter	a	Blood	MON	151
J	5	Peter	b	Saliva	FRI	305
K	6	Maggy	a	Blood	MON	441
L	6	Maggy	b	Saliva	TUE	772
M	7	Jack	a	Blood	THU	31
N	7	Jack	b	Saliva	MON	533
O	8	Suzy	a	Blood	MON	976
P	8	Suzy	b	Saliva	FRI	244
Q	9	Billy	a	Blood	WED	658
R	9	Billy	b	Saliva	THU	514
S	10	Mary	a	Blood	WED	221
T	10	Mary	b	Saliva	WED	842

The data does contains a single-column primary key (key), but all non-key variables do not ‘depend’ on the entire primary key. Specifically, the variable type is contingent exclusively on test_id. In this example, there is a very simple \(1:1\) relationship between test_id and type (test ‘a’ is always ‘Blood’, and ‘b’ always ‘Saliva’), but that need not be the case.

Data frame #1

Table 7: C1
key	subject_id	subject_name	test_id	day	measurement
A	1	John	a	MON	778
B	1	John	b	WED	431
C	2	Billy	a	WED	518
D	2	Billy	b	TUE	37
E	3	Joan	a	FRI	784
F	3	Joan	b	FRI	133
G	4	Clare	a	FRI	962
H	4	Clare	b	TUE	557
I	5	Peter	a	MON	151
J	5	Peter	b	FRI	305
K	6	Maggy	a	MON	441
L	6	Maggy	b	TUE	772
M	7	Jack	a	THU	31
N	7	Jack	b	MON	533
O	8	Suzy	a	MON	976
P	8	Suzy	b	FRI	244
Q	9	Billy	a	WED	658
R	9	Billy	b	THU	514
S	10	Mary	a	WED	221
T	10	Mary	b	WED	842

Data frame #2

Table 8: C2
test_id	type
a	Blood
b	Saliva

This data structure conforms to Second Normal Form (2NF) data structure. The data has been split into two separate tables, with no redundancy of information. The variable test_id in the first, larger data set, corresponds to test_id in the second, smaller data set. This structure makes data much easier to survey, when the amount of data grows large. Conversely, if data from datasets are needed for a given analysis, the data has to wrangled first (specifically merge with a join command).

3 Further Normal Forms

For most statistical analyses 2NF will suffice. If your data set is very large and complex, requiring a relational database system, look online for further details about data normalization - e.g. this link

Footnotes

https://www.linkedin.com/pulse/understanding-normalization-databases-crucial-step-data-organization-dwnwc ↩︎