Amateur Techies: August 2011

Normalizing Tables in SQL

So far in all SQL examples we were dealing with a single table. The truth is that in real life when dealing with databases you’ll have to work with many tables, which are interrelated. The true power of the Relational Database Management Systems is the fact that they are Relational.

The relationships in a RDBMS ensure that there is no redundant data. For example: An online store, offers computers for sale and the easiest way to track the sales will be to keep them in a database. We can have a table called Product, which will hold information about each computer - model name, price and the manufacturer. We also need to keep some details about the manufacturer like their website and their support email. If you store the manufacturer details in the Product table, you will have the manufacturer contact info repeated for each computer model the manufacturer produces:

model	Price	Manufacturer	ManufacturerWebsite	ManufacturerEmail
Inspiron B120	$499	Dell	http://www.dell.com	support@dell.com
Inspiron B130	$599	Dell	http://www.dell.com	support@dell.com
Inspiron E1705	$949	Dell	http://www.dell.com	support@dell.com
Satellite A100	$549	Toshiba	http://www.toshiba.com	support@toshiba.com
Satellite P100	$934	Toshiba	http://www.toshiba.com	support@toshiba.com

To get rid of the redundant manufacturer data in the Product table, we can create a new table called Manufacturer, which will have only one entry (row) for each manufacturer and we can link (relate) this table to the Product table. To create this relation we need to add additional column in the Product table that references the entries in the Manufacturer table.

A relationship between 2 tables is established when the data in one of the columns in the first table matches the data in a column in the second table. To explain this further we have to understand SQL relational concepts – Primary Key and Foreign Key. Primary Key is a column or a combination of columns that uniquely identifies each row in a table. Foreign Key is a column or a combination of columns whose values match a Primary Key in a different table. In the most common scenario the relationship between 2 tables matches the Primary Key in one of the tables with a Foreign Key in the second table. Consider the new Product and Manufacturer tables below:

Manufacturer

ManufacturerID	Manufacturer	ManufacturerWebsite	ManufacturerEmail
1	Dell	http://www.dell.com	support@dell.com
2	Toshiba	http://www.toshiba.com	support@toshiba.com

Product

model	Price	ManufacturerID
Inspiron B120	$499	1
Inspiron B130	$599	1
Inspiron E1705	$949	1
Satellite A100	$549	2
Satellite P100	$934	2

The first table is Manufacturer which has 2 entries for Dell and Toshiba respectively. Each of these entries has a ManufacturerID value, which is unique integer number. Because the ManufacturerID column is unique for the Manufacturer table we can use it as a Primary Key in this table. The Product table retains the Model and the Price columns, but has a new column called ManufacturerID, which matches the values of the ManufacturerID column in the Manufacturer table. All values in the ManufacturerID column in the Product table have to match one of the values in the Manufacturer table Primary Key (for example you can’t have ManufacturerID with value of 3 in the Product table, simply because there is no manufacturer with this ManufacturerID defined in the Manufacturer table).

Notice that we used the same name for the Primary Key in the first table as for the Foreign Key in the second. This was done on purpose to show the relationship between the 2 tables based on these columns. Of course you can call the 2 columns with different names, but if somebody sees your database for a first time it won’t be immediately clear that these 2 tables are related.
But how do we ensure that the Product table doesn’t have invalid entries like the last entry below:

model	Price	ManufacturerID
Inspiron B120	$499	1
Inspiron B130	$599	1
Inspiron E1705	$949	1
Satellite A100	$549	2
Satellite P100	$934	2
ThinkPad Z60t	$849	3

We do not have a manufacturer with ManufacturerID of 3 in our Manufacturer table, hence this entry in the Product table is invalid. The answer is that you have to enforce referential integrity between the 2 tables. Different RDBMS have different ways to enforce referential integrity, and I will not go into more details as this is not important to understand the concept of relationship between tables.

There are 3 types of relations between tables – One-To-Many, Many-To-Many and One-To-One. The relation we created above is One-To-Many and is the most common of the 3 types. In One-To-Many relation a row in one of the tables can have many matching rows in the second table, but a row the second table can match only one row in the first table.

In our example, each manufacturer (a row in the Manufacturer table) produces several different computer models (several rows in the Product table), but each particular product (a row in the Product table) has only one manufacturer (a row in the Manufacturer table).

The second type is the Many-To-Many relation. In this relation many rows from the first table can match many rows in the second and the other way around. To define this type of relation you need a third table whose primary key is composed of the 2 foreign keys from the other 2 table. To clarify this relation lets review the following example. We have a Article table (ArticleID is primary key) and Category (CategoryID is primary key) table.
Every article published in the Article table can belong to multiple categories. To accommodate that, we create a new table called ArticleCategory, which has only 2 columns – ArticleID and CategoryID (these 2 columns form the primary key for this table). This new table called sometimes junction table defines the Many-To-Many relationship between the 2 main tables. One article can belong to multiple categories, and every category may contain more than one article.

In the One-To-One relation each row in the first table may match only one row in the second and the other way around. This relationship is very uncommon simply because if you have this type of relation you may as well keep all the info in one single table.

By dividing the data into 2 tables we successfully removed the redundant manufacturer details from the initial Product table adding an integer column referencing the new Manufacturer table instead.

The process of removing redundant data by creating relations between tables is known as Normalization. Normalization process uses formal methods to design the database in interrelated tables.

NORMALIZATION EXAMPLE 1: EMPLOYEE PROJECT DETAIL

Given below is the data in an un-normalized table. Normalize it to 1NF, 2NF and 3NF.

DETAILS TABLE

Proj_no	Proj_name	Emp_no	Emp_name	Rate_category	Hourly_rate
101	AirLine Reservation	1001 1002 1003	John Smith Amit	A B C	60 50 40
102	Advertising Agency	1001 1004	Rahul Preeti	A B	60 50

Solution:

Table in 1NF:

DETAILS TABLE

Proj_no	Proj_name	Emp_no	Emp_name	Rate_category	Hourly_rate
101	AirLine Reservation	1001	John	A	60
101	AirLine Reservation	1002	Smith	B	50
101	AirLine Reservation	1003	Amit	C	40
102	Advertising Agency	1001	Rahul	A	60
102	Advertising Agency	1004	Preeti	B	50

The key is (Proj_no,Emp_no)

Functional Dependencies:

Proj_no -> Proj_name

Emp_no -> Emp_name, Rate_category, Hourly_rate

Transitive Dependencies :

Rate_category -> Hourly_rate

Tables in 2NF:

Emp_Proj Table

Proj_no	Emp_no
101	1001
101	1002
101	1003
102	1001
102	1004

Employee Table

Emp_no	Emp_name	Rate_category	Hourly_rate
1001	John	A	60
1002	Smith	B	50
1003	Rahul	A	40
1004	Preeti	B	50

Project Table

Proj_no	Proj_name
1001	Airline Reservation
1002	Advertising Agency

Tables in 3NF:

Emp_Proj Table

Proj_no	Emp_no
101	1001
101	1002
101	1003
102	1001
102	1004

Project Table

Proj_no	Proj_name
1001	Airline Reservation
1002	Advertising Agency

Employee Table

Emp_no	Emp_name	Rate_category
1001	John	A
1002	Smith	B
1003	Rahul	A
1004	Preeti	B

Rate Table

Rate_category	Hourly_rate
A	60
B	50
C	40

NORMALIZATION EXAMPLE 2 :DEPARTMENT COMPLAINTS

Important Assumption

Assume that the Dept. # and Cust. # are all needed to uniquely identify the date and nature of the complaint about the department.

Step 1: Identify Dependencies

Draw lines showing the dependencies between each attribute

Unnormalized Table

Dept#

Dept Name

Loc

Mgr Name

Mgr id Name

Tel Extn.

Cust#

Cust Name

Date of comp

Nature of comp

11232 Soap Cincinnati Mary S11 7711 P10451 Robert 1/12/1998 PS

Division Samuel Drumtree

P10480 Steven 1/14/1998 DA

Parks

Step 2:Convert UNF to 1NF

When moving from an Unnormalized table to 1NF, take out repeating groups. One approach is to fill in empty cells. A second is to split out the repeating elements into a new table.

In this case, I removed the fields that were repeating.

1NF Table(s)

Dept #

Cust #

Cust Name

Date of Complaint

Nature of Complaint

Dept Name

Location

Mgr

Mgr ID No.

Step 3:Convert 1NF to 2NF

To move a table from 1NF to 2NF, remove partial dependencies.

In this case, the Customer Name depends only on Cust#, which is a partial dependency. Therefore remove the Customer Name and make a copy of Cust #

2NF Tables

Dept#

Cust#

Date Of Complaint

Nature Of Complaint

Dept#

Dept Name

Location

Mgr Name

Mgr ID No.

Tel Ext.

Cust#

Cust Name

Step 4:Convert 2NF to 3NF

To move a table from 2NF to 3NF, remove transitive dependencies.

In this case, the Mgr Name is dependent on the Mgr ID No., which is a transitive dependency. Therefore, remove the Mgr Name and make a copy of the the Mgr ID No.

3NF Table

Dept#

Cust#

Date Of Complaint

Nature Of Complaint

Dept#

Dept Name

Location

Mgr Name

Mgr ID No.

Tel Ext.

Cust#

Cust Name

Mgr Id No

Mgr Name

Amateur Techies

Monday, 8 August 2011

Basic normalization concepts & Statistica(tool for data mining)

P10480 Steven 1/14/1998 DA

Parks