Denormalization

[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]

Denormalization

Chris Date wrote in a paper for InfoMIS 4 No. 22^nd Quarter 1984 titled “Some Relational Myths Exploded”. Date’s myth number 25 is “Third normal form is a panacea”. In this section of the paper Date states, “In a sense, the 3NF discipline optimizes for update at the expense of retrieval: It eliminates certain “update anomalies”, but it can lead to the need for lots of joins on retrieval.” (Emphasis is in the original).

There are many other issues that are important to a system design besides the logical data model. At times these issues can be at odds with one and other. When a developer encounters one of these circumstances a decision has to be made as to which issue will win out.

Sometimes normalization will win and sometimes it won’t. When a developer decides to back off from a fully normalized design, the process has been named denormalization by Date.

Denormalization: Controlled redundancy.

Chris Date, “A Practical Approach to Database Design”, IBM Technical Report No. TR 03.220, December 1984.

What date is talking about here is that a fully normalized design is optimized for the elimination of update problems. However, the very design decisions that optimize for updates can be detrimental for retrieval.

Denormalization should not be undertaken until a fully normalized design has been achieved. Denormalization is not an excuse for not knowing or applying the process of normalization. It is, rather, an acknowledgement that the database design has to work well for both updates and retrievals.

The figure below shows a design that is 3^rd NF in the top and denormalized at the bottom.

In order to justify denormalization we need to have a business reason for the alteration form 3^rd NF. This example design is for a mail order company that has 120,000,000 customers to whom they must send catalogs. The top design is in 3^rd normal form because the attributes for State, City, and Country are codependent on the PostalCode field. To fix that we created the PostalCode entity and related it back to the Customer records on the PostalCode foreign key in the Customer table.

However, during system testing it was discovered that the time required to produce 120,000,000 mailing labels through a two-table join was much longer than if the labels could be produced from a single table. The table was then denormalized by reintroducing the City, State, and Country attributes to the Customer table.

After doing this it would be heavily documented including the normal form that is violated, the reason for the violation, and the consequences of the violation. Any violation of a fully normalized design carries with it potential problems in the area of updates.

My rule of thumb for breaking any programming “rules” is that If:

you know what the rules is
you know why the rule exists
you know what the consequences of breaking the rule are
you know why you want to break the rule

Then go ahead and break the rule. Just be sure that you fully document what you have done and why you have done it so programmers coming after you don’t have a difficult time trying to figure what you did and why you did it.

Data Partitioning

Data Partitioning: Dividing the contents of a single entity into two or more tables that are related to each other.

Every database manager has some type of restriction or another on the size and amount of data that can be stored in it. For many of the “big iron” database servers this limitation is practically insignificant, in that the limitation is so high that very few systems will ever approach the limit.

Visual FoxPro is not one of those database managers. VFP has a limit on file size, 2 gigabytes, table size 1 billion records, and record size 255 fields and 65,500 characters in a record. These limits, though pretty high, can and often will be encountered is business systems.

Combine the limitations issue with the fact that often, with very large data sets, it is advisable to divide things up for faster retrieval and you have a good argument for data partitioning.

The next sections will discuss the two approaches to data partitioning and the reasons for selecting one over the other.

Horizontal Partitioning

The two types of partitioning are horizontal and vertical. These titles may be somewhat meaningless until you see the contents of an entity in a row and column table orientation. The example below should help in understanding the use of horizontal and vertical partitioning.

CustID Name City Credit Limit

101 Joe’s bar and grill Washington, D.C. 1,000

102 Mike’s barber shop New York 1,000

105 Sally’s dress shop Philadelphia 2,500

107 You plug ‘em we plant ‘em funeral services New York 1,500

110 Princely Cigars San Francisco 2,750

123 Eat at Joe’s Pittsburgh 1,000

124 Dewey, Skruewem, and Howe Attorneys at law Washington, D.C. 100

CustID	Name	City	Credit Limit
101	Joe’s bar and grill	Washington, D.C.	1,000
102	Mike’s barber shop	New York	1,000
105	Sally’s dress shop	Philadelphia	2,500
107	You plug ‘em we plant ‘em funeral services	New York	1,500
110	Princely Cigars	San Francisco	2,750
123	Eat at Joe’s	Pittsburgh	1,000
124	Dewey, Skruewem, and Howe Attorneys at law	Washington, D.C.	100

In the table the data is oriented in horizontal columns and vertical rows. If we divide the fields into two or more tables we are dividing the entity horizontally. If we make our division on the records, then it is vertical partitioning.

Each method of partitioning has its benefits and its drawbacks. We will use the example below for our discussion.

This example is not large enough to require any data partitioning but it will be useful in explaining the concepts of partitioning anyway. The limitation of space is the reason for using this size example.

The first partitioning method we are discussing is horizontal. In horizontal partitioning we will divide the table into two or more tables each having a subset of the attributes from the original table. The design below shows one partitioning that is possible.

We have created two tables that are related to each other on a one-to-one basis. The combination of the data in both of these tables is representative of the actual entity. Either table alone has only partial data for the entity.

Notice that the partitioning was done along some logical lines, that is one table is holding demographic data while the other is holding financial data. It is a good idea to take this approach, the division of the data follows some logical concept, when partitioning horizontally. Doing the partitioning this way allows for the possibility that one of the tables alone can provide all of the needed data. For example, when printing mailing labels for customers, the Customer1 table can be used alone.

Managing data in horizontal partitioned entities is as simple as using the JOIN clause of the SELECT command.

SELECT Customer1.*, Customer2.* ;

FROM Customer1 JOIN Customer2 ;

ON Customer1.CustID = Customer2.CustID ;

WHERE Customer1.State = “NY” …

Vertical Partitioning

Vertical, in contrast to horizontal, partitioning divides the table based on records rather then fields. Vertical partitioning is a reasonable approach when the system using the partitioned data has a limited requirement to combine the records between the partitions.

The design below shows the customer entity vertically partitioned.

Note that the two tables both have the full set of attributes, however the records found in the two tables are not the same. In the design above the partitioning is done on the spelling of the customer’s name (which is not a good idea). A more appropriate partitioning might be on AccountRep or on State.

The combination of these partitioned tables into the original full entity is not as easy as a JOIN. The combination requires the use of a UNION. The sample code below would retrieve the records for all customers in NY.

SELECT Customer1.* ;

FROM Customer1 ;

WHERE State = “NY” ;

UNION ALL ;

SELECT Customer2.* ;

FROM Customer2 ;

WHERE State = “NY” …

Want to read more about relational database design?

The following two books are among the best I have seen for detailed, complete, and understandable discussions of relational database design concepts.

Handbook of Relational Database Design, Candace C. Fleming and Barbara von Halle,
Addison-Wesley ISBN: 0-201-11434-8

Relational Database Selected Writings, C. J. Date, Addison-Wesley

ISBN: 0-201-14196-5

[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]