Normalizing Tables in SQL
So far in all SQL examples we were dealing with a single table. The truth is that in real life when dealing with databases you’ll have to work with many tables, which are interrelated. The true power of the Relational Database Management Systems is the fact that they are Relational.
The relationships in a RDBMS ensure that there is no redundant data. For example: An online store, offers computers for sale and the easiest way to track the sales will be to keep them in a database. We can have a table called Product, which will hold information about each computer - model name, price and the manufacturer. We also need to keep some details about the manufacturer like their website and their support email. If you store the manufacturer details in the Product table, you will have the manufacturer contact info repeated for each computer model the manufacturer produces:
The relationships in a RDBMS ensure that there is no redundant data. For example: An online store, offers computers for sale and the easiest way to track the sales will be to keep them in a database. We can have a table called Product, which will hold information about each computer - model name, price and the manufacturer. We also need to keep some details about the manufacturer like their website and their support email. If you store the manufacturer details in the Product table, you will have the manufacturer contact info repeated for each computer model the manufacturer produces:
model | Price | Manufacturer | ManufacturerWebsite | ManufacturerEmail |
Inspiron B120 | $499 | Dell | http://www.dell.com | support@dell.com |
Inspiron B130 | $599 | Dell | http://www.dell.com | support@dell.com |
Inspiron E1705 | $949 | Dell | http://www.dell.com | support@dell.com |
Satellite A100 | $549 | Toshiba | http://www.toshiba.com | support@toshiba.com |
Satellite P100 | $934 | Toshiba | http://www.toshiba.com | support@toshiba.com |
To get rid of the redundant manufacturer data in the Product table, we can create a new table called Manufacturer, which will have only one entry (row) for each manufacturer and we can link (relate) this table to the Product table. To create this relation we need to add additional column in the Product table that references the entries in the Manufacturer table.
A relationship between 2 tables is established when the data in one of the columns in the first table matches the data in a column in the second table. To explain this further we have to understand SQL relational concepts – Primary Key and Foreign Key. Primary Key is a column or a combination of columns that uniquely identifies each row in a table. Foreign Key is a column or a combination of columns whose values match a Primary Key in a different table. In the most common scenario the relationship between 2 tables matches the Primary Key in one of the tables with a Foreign Key in the second table. Consider the new Product and Manufacturer tables below:
A relationship between 2 tables is established when the data in one of the columns in the first table matches the data in a column in the second table. To explain this further we have to understand SQL relational concepts – Primary Key and Foreign Key. Primary Key is a column or a combination of columns that uniquely identifies each row in a table. Foreign Key is a column or a combination of columns whose values match a Primary Key in a different table. In the most common scenario the relationship between 2 tables matches the Primary Key in one of the tables with a Foreign Key in the second table. Consider the new Product and Manufacturer tables below:
Manufacturer
ManufacturerID | Manufacturer | ManufacturerWebsite | ManufacturerEmail |
1 | Dell | http://www.dell.com | support@dell.com |
2 | Toshiba | http://www.toshiba.com | support@toshiba.com |
Product
model | Price | ManufacturerID |
Inspiron B120 | $499 | 1 |
Inspiron B130 | $599 | 1 |
Inspiron E1705 | $949 | 1 |
Satellite A100 | $549 | 2 |
Satellite P100 | $934 | 2 |
The first table is Manufacturer which has 2 entries for Dell and Toshiba respectively. Each of these entries has a ManufacturerID value, which is unique integer number. Because the ManufacturerID column is unique for the Manufacturer table we can use it as a Primary Key in this table. The Product table retains the Model and the Price columns, but has a new column called ManufacturerID, which matches the values of the ManufacturerID column in the Manufacturer table. All values in the ManufacturerID column in the Product table have to match one of the values in the Manufacturer table Primary Key (for example you can’t have ManufacturerID with value of 3 in the Product table, simply because there is no manufacturer with this ManufacturerID defined in the Manufacturer table).
Notice that we used the same name for the Primary Key in the first table as for the Foreign Key in the second. This was done on purpose to show the relationship between the 2 tables based on these columns. Of course you can call the 2 columns with different names, but if somebody sees your database for a first time it won’t be immediately clear that these 2 tables are related.
But how do we ensure that the Product table doesn’t have invalid entries like the last entry below:
Notice that we used the same name for the Primary Key in the first table as for the Foreign Key in the second. This was done on purpose to show the relationship between the 2 tables based on these columns. Of course you can call the 2 columns with different names, but if somebody sees your database for a first time it won’t be immediately clear that these 2 tables are related.
But how do we ensure that the Product table doesn’t have invalid entries like the last entry below:
model | Price | ManufacturerID |
Inspiron B120 | $499 | 1 |
Inspiron B130 | $599 | 1 |
Inspiron E1705 | $949 | 1 |
Satellite A100 | $549 | 2 |
Satellite P100 | $934 | 2 |
ThinkPad Z60t | $849 | 3 |
We do not have a manufacturer with ManufacturerID of 3 in our Manufacturer table, hence this entry in the Product table is invalid. The answer is that you have to enforce referential integrity between the 2 tables. Different RDBMS have different ways to enforce referential integrity, and I will not go into more details as this is not important to understand the concept of relationship between tables.
There are 3 types of relations between tables – One-To-Many, Many-To-Many and One-To-One. The relation we created above is One-To-Many and is the most common of the 3 types. In One-To-Many relation a row in one of the tables can have many matching rows in the second table, but a row the second table can match only one row in the first table.
In our example, each manufacturer (a row in the Manufacturer table) produces several different computer models (several rows in the Product table), but each particular product (a row in the Product table) has only one manufacturer (a row in the Manufacturer table).
The second type is the Many-To-Many relation. In this relation many rows from the first table can match many rows in the second and the other way around. To define this type of relation you need a third table whose primary key is composed of the 2 foreign keys from the other 2 table. To clarify this relation lets review the following example. We have a Article table (ArticleID is primary key) and Category (CategoryID is primary key) table.
Every article published in the Article table can belong to multiple categories. To accommodate that, we create a new table called ArticleCategory, which has only 2 columns – ArticleID and CategoryID (these 2 columns form the primary key for this table). This new table called sometimes junction table defines the Many-To-Many relationship between the 2 main tables. One article can belong to multiple categories, and every category may contain more than one article.
In the One-To-One relation each row in the first table may match only one row in the second and the other way around. This relationship is very uncommon simply because if you have this type of relation you may as well keep all the info in one single table.
By dividing the data into 2 tables we successfully removed the redundant manufacturer details from the initial Product table adding an integer column referencing the new Manufacturer table instead.
The process of removing redundant data by creating relations between tables is known as Normalization. Normalization process uses formal methods to design the database in interrelated tables.
There are 3 types of relations between tables – One-To-Many, Many-To-Many and One-To-One. The relation we created above is One-To-Many and is the most common of the 3 types. In One-To-Many relation a row in one of the tables can have many matching rows in the second table, but a row the second table can match only one row in the first table.
In our example, each manufacturer (a row in the Manufacturer table) produces several different computer models (several rows in the Product table), but each particular product (a row in the Product table) has only one manufacturer (a row in the Manufacturer table).
The second type is the Many-To-Many relation. In this relation many rows from the first table can match many rows in the second and the other way around. To define this type of relation you need a third table whose primary key is composed of the 2 foreign keys from the other 2 table. To clarify this relation lets review the following example. We have a Article table (ArticleID is primary key) and Category (CategoryID is primary key) table.
Every article published in the Article table can belong to multiple categories. To accommodate that, we create a new table called ArticleCategory, which has only 2 columns – ArticleID and CategoryID (these 2 columns form the primary key for this table). This new table called sometimes junction table defines the Many-To-Many relationship between the 2 main tables. One article can belong to multiple categories, and every category may contain more than one article.
In the One-To-One relation each row in the first table may match only one row in the second and the other way around. This relationship is very uncommon simply because if you have this type of relation you may as well keep all the info in one single table.
By dividing the data into 2 tables we successfully removed the redundant manufacturer details from the initial Product table adding an integer column referencing the new Manufacturer table instead.
The process of removing redundant data by creating relations between tables is known as Normalization. Normalization process uses formal methods to design the database in interrelated tables.
NORMALIZATION EXAMPLE 1: EMPLOYEE PROJECT DETAIL
Given below is the data in an un-normalized table. Normalize it to 1NF, 2NF and 3NF.
DETAILS TABLE
Proj_no | Proj_name | Emp_no | Emp_name | Rate_category | Hourly_rate |
101 | AirLine Reservation | 1001 1002 1003 | John Smith Amit | A B C | 60 50 40 |
102 | Advertising Agency | 1001 1004 | Rahul Preeti | A B | 60 50 |
Solution:
Table in 1NF:
DETAILS TABLE
Proj_no | Proj_name | Emp_no | Emp_name | Rate_category | Hourly_rate |
101 | AirLine Reservation | 1001 | John | A | 60 |
101 | AirLine Reservation | 1002 | Smith | B | 50 |
101 | AirLine Reservation | 1003 | Amit | C | 40 |
102 | Advertising Agency | 1001 | Rahul | A | 60 |
102 | Advertising Agency | 1004 | Preeti | B | 50 |
The key is (Proj_no,Emp_no)
Functional Dependencies:
Proj_no -> Proj_name
Emp_no -> Emp_name, Rate_category, Hourly_rate
Transitive Dependencies :
Rate_category -> Hourly_rate
Tables in 2NF:
Emp_Proj Table
Proj_no | Emp_no |
101 | 1001 |
101 | 1002 |
101 | 1003 |
102 | 1001 |
102 | 1004 |
Employee Table
Emp_no | Emp_name | Rate_category | Hourly_rate |
1001 | John | A | 60 |
1002 | Smith | B | 50 |
1003 | Rahul | A | 40 |
1004 | Preeti | B | 50 |
Project Table
Proj_no | Proj_name |
1001 | Airline Reservation |
1002 | Advertising Agency |
Tables in 3NF:
Emp_Proj Table
Proj_no | Emp_no |
101 | 1001 |
101 | 1002 |
101 | 1003 |
102 | 1001 |
102 | 1004 |
Project Table
Proj_no | Proj_name |
1001 | Airline Reservation |
1002 | Advertising Agency |
Employee Table
Emp_no | Emp_name | Rate_category |
1001 | John | A |
1002 | Smith | B |
1003 | Rahul | A |
1004 | Preeti | B |
Rate Table
Rate_category | Hourly_rate |
A | 60 |
B | 50 |
C | 40 |
NORMALIZATION EXAMPLE 2 :DEPARTMENT COMPLAINTS
Important Assumption
Assume that the Dept. # and Cust. # are all needed to uniquely identify the date and nature of the complaint about the department.
Step 1: Identify Dependencies
Draw lines showing the dependencies between each attribute
Unnormalized Table
Dept# | Dept Name | Loc | Mgr Name | Mgr id Name | Tel Extn. | Cust# | Cust Name | Date of comp | Nature of comp |
11232 Soap Cincinnati Mary S11 7711 P10451 Robert 1/12/1998 PS Division Samuel Drumtree P10480 Steven 1/14/1998 DAParks |
Step 2:Convert UNF to 1NF
When moving from an Unnormalized table to 1NF, take out repeating groups. One approach is to fill in empty cells. A second is to split out the repeating elements into a new table.
In this case, I removed the fields that were repeating.
1NF Table(s)
Dept # | Cust # | Cust Name | Date of Complaint | Nature of Complaint | Dept Name | Location | Mgr | Mgr ID No. |
Step 3:Convert 1NF to 2NF
To move a table from 1NF to 2NF, remove partial dependencies.
In this case, the Customer Name depends only on Cust#, which is a partial dependency. Therefore remove the Customer Name and make a copy of Cust #
2NF Tables
Dept# | Cust# | Date Of Complaint | Nature Of Complaint |
Dept# | Dept Name | Location | Mgr Name | Mgr ID No. | Tel Ext. |
Cust# | Cust Name |
Step 4:Convert 2NF to 3NF
To move a table from 2NF to 3NF, remove transitive dependencies.
In this case, the Mgr Name is dependent on the Mgr ID No., which is a transitive dependency. Therefore, remove the Mgr Name and make a copy of the the Mgr ID No.
3NF Table
Dept# | Cust# | Date Of Complaint | Nature Of Complaint |
Dept# | Dept Name | Location | Mgr Name | Mgr ID No. | Tel Ext. |
Cust# | Cust Name |
Mgr Id No | Mgr Name |