Powered By Blogger

Monday, 8 August 2011

Basic normalization concepts & Statistica(tool for data mining)





 
Normalizing Tables in SQL
So far in all SQL examples we were dealing with a single table. The truth is that in real life when dealing with databases you’ll have to work with many tables, which are interrelated. The true power of the Relational Database Management Systems is the fact that they are Relational.

The relationships in a RDBMS ensure that there is
no redundant data. For example: An online store, offers computers for sale and the easiest way to track the sales will be to keep them in a database. We can have a table called Product, which will hold information about each computer - model name, price and the manufacturer. We also need to keep some details about the manufacturer like their website and their support email. If you store the manufacturer details in the Product table, you will have the manufacturer contact info repeated for each computer model the manufacturer produces:
 
model Price Manufacturer ManufacturerWebsite ManufacturerEmail
Inspiron B120 $499 Dell http://www.dell.com support@dell.com
Inspiron B130 $599 Dell http://www.dell.com support@dell.com
Inspiron E1705 $949 Dell http://www.dell.com support@dell.com
Satellite A100 $549 Toshiba http://www.toshiba.com support@toshiba.com
Satellite P100 $934 Toshiba http://www.toshiba.com support@toshiba.com

 
To get rid of the redundant manufacturer data in the Product table, we can create a new table called Manufacturer, which will have only one entry (row) for each manufacturer and we can link (relate) this table to the Product table. To create this relation we need to add additional column in the Product table that references the entries in the Manufacturer table.

A relationship between 2 tables is established when the data in one of the columns in the first table matches the data in a column in the second table. To explain this further we have to understand SQL relational concepts –
Primary Key and Foreign Key. Primary Key is a column or a combination of columns that uniquely identifies each row in a table. Foreign Key is a column or a combination of columns whose values match a Primary Key in a different table. In the most common scenario the relationship between 2 tables matches the Primary Key in one of the tables with a Foreign Key in the second table. Consider the new Product and Manufacturer tables below:

 
Manufacturer 
ManufacturerID Manufacturer ManufacturerWebsite ManufacturerEmail
1 Dell http://www.dell.com support@dell.com
2 Toshiba http://www.toshiba.com support@toshiba.com

Product 
model Price ManufacturerID
Inspiron B120 $499 1
Inspiron B130 $599 1
Inspiron E1705 $949 1
Satellite A100 $549 2
Satellite P100 $934 2




The first table is Manufacturer which has 2 entries for Dell and Toshiba respectively. Each of these entries has a ManufacturerID value, which is unique integer number. Because the ManufacturerID column is unique for the Manufacturer table we can use it as a Primary Key in this table. The Product table retains the Model and the Price columns, but has a new column called ManufacturerID, which matches the values of the ManufacturerID column in the Manufacturer table. All values in the ManufacturerID column in the Product table have to match one of the values in the Manufacturer table Primary Key (for example you can’t have ManufacturerID with value of 3 in the Product table, simply because there is no manufacturer with this ManufacturerID defined in the Manufacturer table).

Notice that we used the same name for the Primary Key in the first table as for the Foreign Key in the second. This was done on purpose to show the relationship between the 2 tables based on these columns. Of course you can call the 2 columns with different names, but if somebody sees your database for a first time it won’t be immediately clear that these 2 tables are related.
But how do we ensure that the Product table doesn’t have invalid entries like the last entry below:
 
model Price ManufacturerID
Inspiron B120 $499 1
Inspiron B130 $599 1
Inspiron E1705 $949 1
Satellite A100 $549 2
Satellite P100 $934 2
ThinkPad Z60t $849 3


We do not have a manufacturer with ManufacturerID of 3 in our Manufacturer table, hence this entry in the Product table is invalid. The answer is that you have to enforce referential integrity between the 2 tables. Different RDBMS have different ways to enforce referential integrity, and I will not go into more details as this is not important to understand the concept of relationship between tables.

There are 3 types of relations between tables –
One-To-Many, Many-To-Many and One-To-One. The relation we created above is One-To-Many and is the most common of the 3 types. In One-To-Many relation a row in one of the tables can have many matching rows in the second table, but a row the second table can match only one row in the first table.

In our example, each manufacturer (a row in the Manufacturer table) produces several different computer models (several rows in the Product table), but each particular product (a row in the Product table) has only one manufacturer (a row in the Manufacturer table).

The second type is the Many-To-Many relation. In this relation many rows from the first table can match many rows in the second and the other way around. To define this type of relation you need a third table whose primary key is composed of the 2 foreign keys from the other 2 table. To clarify this relation lets review the following example. We have a Article table (ArticleID is primary key) and Category (CategoryID is primary key) table.
Every article published in the Article table can belong to multiple categories. To accommodate that, we create a new table called ArticleCategory, which has only 2 columns – ArticleID and CategoryID (these 2 columns form the primary key for this table). This new table called sometimes junction table defines the Many-To-Many relationship between the 2 main tables. One article can belong to multiple categories, and every category may contain more than one article.

In the One-To-One relation each row in the first table may match only one row in the second and the other way around. This relationship is very uncommon simply because if you have this type of relation you may as well keep all the info in one single table.

By dividing the data into 2 tables we successfully removed the redundant manufacturer details from the initial Product table adding an integer column referencing the new Manufacturer table instead.

The process of removing redundant data by creating relations between tables is known as Normalization. Normalization process uses formal methods to design the database in interrelated tables.

NORMALIZATION EXAMPLE 1: EMPLOYEE PROJECT DETAIL

Given below is the data in an un-normalized table. Normalize it to 1NF, 2NF and 3NF.
DETAILS TABLE





Proj_no
Proj_name
Emp_no
Emp_name
Rate_category
Hourly_rate
101
AirLine Reservation
1001

1002

1003
John

Smith

Amit
A

B

C
60

50

40
102
Advertising Agency
1001

1004
Rahul

Preeti
A

B
60

50


Solution:
Table in 1NF:
DETAILS TABLE






Proj_no
Proj_name
Emp_no
Emp_name
Rate_category
Hourly_rate
101
AirLine Reservation
1001
John
A
60
101
AirLine Reservation
1002
Smith

B
50
101
AirLine Reservation
1003
Amit
C
40
102
Advertising Agency
1001

Rahul

A

60

102
Advertising Agency
1004
Preeti
B
50


The key is (Proj_no,Emp_no)




Functional Dependencies:

Proj_no -> Proj_name
Emp_no -> Emp_name, Rate_category, Hourly_rate

Transitive Dependencies :

Rate_category -> Hourly_rate

Tables in 2NF:

Emp_Proj Table





Proj_no
Emp_no
101
1001
101
1002
101
1003
102
1001
102
1004


Employee Table





Emp_no
Emp_name
Rate_category
Hourly_rate
1001
John
A
60
1002
Smith
B
50
1003
Rahul
A
40
1004
Preeti
B
50


Project Table





Proj_no
Proj_name
1001
Airline Reservation
1002
Advertising Agency





Tables in 3NF:
Emp_Proj Table





Proj_no
Emp_no
101
1001
101
1002
101
1003
102
1001
102
1004


Project Table





Proj_no
Proj_name
1001
Airline Reservation
1002
Advertising Agency


Employee Table





Emp_no
Emp_name
Rate_category
1001
John
A
1002
Smith
B
1003
Rahul
A
1004
Preeti
B


Rate Table





Rate_category
Hourly_rate
A
60
B
50
C
40





NORMALIZATION EXAMPLE 2 :DEPARTMENT COMPLAINTS

Important Assumption
Assume that the Dept. # and Cust. # are all needed to uniquely identify the date and nature of the complaint about the department.

Step 1: Identify Dependencies
Draw lines showing the dependencies between each attribute

Unnormalized Table





Dept#
Dept Name
Loc
Mgr Name
Mgr id Name
Tel Extn.
Cust#
Cust Name
Date of comp
Nature of comp
11232 Soap Cincinnati Mary S11 7711 P10451 Robert 1/12/1998 PS
Division Samuel Drumtree
P10480 Steven 1/14/1998 DA
Parks


Step 2:Convert UNF to 1NF
When moving from an Unnormalized table to 1NF, take out repeating groups. One approach is to fill in empty cells. A second is to split out the repeating elements into a new table.

In this case, I removed the fields that were repeating.

1NF Table(s)






Dept # Cust # Cust Name Date of Complaint Nature of Complaint Dept Name Location
Mgr
Mgr ID No.



Step 3:Convert 1NF to 2NF
To move a table from 1NF to 2NF, remove partial dependencies.
In this case, the Customer Name depends only on Cust#, which is a partial dependency. Therefore remove the Customer Name and make a copy of Cust #

2NF Tables

Dept#
Cust#
Date Of Complaint
Nature Of Complaint

Dept#
Dept Name
Location
Mgr Name
Mgr ID No.
Tel Ext.

Cust#
Cust Name

Step 4:Convert 2NF to 3NF
To move a table from 2NF to 3NF, remove transitive dependencies.
In this case, the Mgr Name is dependent on the Mgr ID No., which is a transitive dependency. Therefore, remove the Mgr Name and make a copy of the the Mgr ID No.

3NF Table

Dept#
Cust#
Date Of Complaint
Nature Of Complaint

Dept#
Dept Name
Location
Mgr Name
Mgr ID No.
Tel Ext.

Cust#
Cust Name

Mgr Id No
Mgr Name