database

ebscohost2.pdf

Home >Computer Science homework help >database

88 cHAPTER 5 The Relational Data Model

that make up your schema. However, relational operations on tables produce additional tables as their result. Such tables, which exist only in main memory, are known as virtual tables. Virtual tables may not be legal relations—in particular, they may have no primary key—but because virtual tables are never stored in the database, this presents no problem in terms of the overall design of the database.

The use of virtual tables benefits a DBMS in several ways. First, it allows the DBMS to keep intermediate query tables in main memory rather than storing them on disk, enhancing query performance. Second, it allows tables that violate the rules of the relational data model to exist in main memory without affecting the integrity of the database. Third, it helps avoid fragmentation of database files and disk surfaces by avoiding repeated write, read, and delete operations on temporary tables.

Note: SQL, the language used to manage most relational databases, also supports “temporary base tables.” Although called base tables, temporary tables are actually virtual tables in the sense that they exist only in main memory for a short time and are never stored in the physical database.

a Notation for relations You will see instances of relations throughout this book used as exam- ples. However, we do not usually include data in a relation when documenting that relation. One common way to express a relation is as follows:

relation_name (primary_key, non_primary_key_column …)

For example, the Customer relation that you saw in Figure 5-1 would be written as:

customers (customer_numb, first_name last_name, phone)

The preceding expression is a true relation, an expression of the struc- ture of a relation. It correctly does not contain any data. (As mentioned earlier, when data are included, you have an instance of a relation.)

priMarY KEYs As you have just read, a unique primary key makes it possible to uniquely identify every row in a table. Why is this so important? The

Co py ri gh t © 2 00 9. M or ga n Ka uf ma nn . Al l ri gh ts r es er ve d. M ay n ot b e re pr od uc ed i n an y fo rm w it ho ut p er mi ss io n fr om t he p ub li sh er , ex ce pt f ai r us es p er mi tt ed u nd er U .S . or

ap pl ic ab le c op yr ig ht l aw .

EBSCO Publishing : eBook Collection (EBSCOhost) - printed on 10/17/2017 10:20 PM via EXCELSIOR COLLEGE AN: 319641 ; Harrington, Jan L..; Relational Database Design and Implementation : Clearly Explained Account: s4946256

89Primary Keys

issue is the same as with entity identifiers: You want to be able to retrieve every single piece of data you put into a database.

As far as a relational database is concerned, you should need only three pieces of information to retrieve any specific bit of data: the name of the table, the name of the column, and the primary key of the row. If primary keys are unique for every row, then we can be sure that we are retrieving exactly the row we want. If they are not unique, then we are retrieving only some row with the primary key value, which may not be the row that contains the data for which we are searching.

Along with being unique, a primary key must not contain the value null. Null is a special database value meaning “unknown.” It is not the same as a zero or a blank. If you have one row with a null primary key, then you are actually all right. However, the minute you introduce a second one, you have lost the property of uniqueness. We therefore forbid the presence of nulls in any primary key columns. This con- straint, known as entity integrity, will be enforced by a DBMS whenever data are entered or modified.

Selecting a good primary key can be a challenge. As you may remem- ber from Chapter 4, some entities have natural primary keys, such as purchase order numbers. These are arbitrary, meaningless, unique identifiers that a company attaches to the orders it sends to vendors and are therefore ideal primary keys.

primary Keys to identify people What about a primary key to identify people? The first thing that pops into your mind might be a Social Security number (or, for those outside the United States, a national identification number). Every person in the United States is supposed to have a Social Security number. Parents apply for them in the hospital where a baby is born, right? And because they are assigned by the U.S. government, they must be unique, right? Unfortunately, the answer to both questions is “no.”

The Social Security Administration has been known to give everyone in an entire town the same Social Security number; over time, the numbers may be reused. However, these are minor problems com- pared to a Social Security number being null.

Consider what happens at a college that uses Social Security numbers as student numbers when international students enroll. Upon entry

ap pl ic ab le c op yr ig ht l aw .

90 cHAPTER 5 The Relational Data Model

into the country, the international students do not have Social Security numbers. Because primary keys cannot be null, the international stu- dents cannot sign up for classes or even enroll in the college until they have a Social Security number.

The college’s solution is to give them “fake” numbers in the format 999-99-XXXX, where XXXX is some number currently not in use. Then, when the student receives a “real” Social Security number from the government, the college supposedly replaces the fake value with the real one. Sometimes, however, the process does not work. A graduate student ended up with his first semester’s grades being stored under the fake Social Security number but the rest of his grades under his real number. (Rather than changing the original data, someone created an entire new transcript for the student.) When the time came to audit his transcript to see if he had satisfied all of his graduation require- ments, he was told that he was missing an entire semester’s worth of courses.

This example leads us to two important desirable qualities of primary keys:

n A primary key should be some value that is highly unlikely ever to be null.

n A primary key value should never change.

In addition, there is significant concern about security problems that can arise from the use of Social Security numbers as identifiers in a database. The danger of identity theft has made it risky to share a national identifier. Many U.S. state governments, for example, have mandated that publicly supported organizations use something other than the Social Security number as a customer/client/student ID to help protect individual privacy.

Although Social Security numbers initially look like good natural identifiers, you will be much better off in the long run using arbitrary numbers for people—such as student numbers or account numbers— rather than relying on Social Security numbers.

avoiding Meaningful identifiers It can be very tempting to code meaning into a primary key. For example, assume that Antique Opticals wants to assign codes to its distributors rather than giving them arbitrary distributor numbers. Someone might create codes such as TLC for The Laser Club and JS

ap pl ic ab le c op yr ig ht l aw .

91Primary Keys

for Jones Services. At first, this might seem like a good idea. The codes are short, and by looking at them you can figure out which distributor they reference.

But what happens if one of the companies changes its name? Perhaps Jones Services is renamed Jones Distribution House. Do you change the primary key value in the distributor table? Do you change the code so that it reads JDH? If the distributor table were all that we cared about, that would be the easy solution.

However, consider that the table that describes merchandise items contains the code for the distributor so that Antique Opticals can know which distributor provides the item. (You’ll read a great deal more about this concept in the next major section of this chapter.) If you change the distributor code value in the distributor table, you must change the value of the code for every merchandise item that comes from that distributor. Without the change, Antique Opticals will not be able to match the code to a distributor and get information about the distributor. It will appear that the item comes from a non- existent distributor!

Note: This is precisely the same problem you read about in Chapter 3 concerning Antique Opticals’ identifiers for its customers.

Meaningful primary keys tend to change and therefore introduce the potential for major data inconsistencies between tables. Resist the temptation to use them at all costs. Here, then, is yet another property of a good primary key:

n A primary key should avoid using meaningful data: Use arbitrary identifiers or concatenations of arbitrary identifiers wherever possible.

It is not always possible to use completely meaningless primary keys. You may find, for example, that you need to include dates or times in primary keys to distinguish among events. The suggestion that you should not use meaningful primary keys is therefore not a hard and fast rule but a guideline you should follow whenever you can.

Concatenated primary Keys Some tables have no single column in which the values never duplicate. As an example, look at the sample order items table in

ap pl ic ab le c op yr ig ht l aw .

92 cHAPTER 5 The Relational Data Model

Figure 5-2. Because there is more than one item on an order and because the same item can appear on more than one order, order numbers are repeated. Therefore, neither column by itself can serve as the table’s primary key. The combination of an order number and an item number, however, is unique. We can therefore concatenate the two columns that form the table’s primary key.

It is true that you could also concatenate all three columns in the table and still ensure a unique primary key. However, the quantity column is not necessary to ensure uniqueness and therefore should not be used. We now have some additional properties of a good primary key:

n A concatenated primary key should be made up of the smallest number of columns necessary to ensure the uniqueness of the primary key.

n Whenever possible, the columns used in a concatenated primary key should be meaningless identifiers.

all-Key relations It is possible to have a table in which every column is part of the primary key. As an example, consider a library book catalog. Each book title owned by a library has a natural unique primary key: the ISBN (Inter- national Standard Book Number). Each ISBN is assigned to one or more subject headings in the library’s catalog, and each subject heading is also assigned to one or more books. We therefore have a many-to-many relationship between books and subject headings.

A relation to represent this relationship might be:

subject_catalog (isbn, subject_heading)

All we need to do is pair a subject heading with a book identifier. No additional data are needed. Therefore, all columns in the table become part of the primary key.

Order Number Item Number Quantity 10991 0022 1 10991 0209 2 10991 1001 1 10992 0022 1 10992 0486 1 10993 0209 1 10993 1001 2 10994 0621 1

n FIguRE 5-2 A sample order items table.

ap pl ic ab le c op yr ig ht l aw .

93Representing Data Relationships

There is absolutely no problem with having all-key relations in a database. In fact, they occur whenever a database design contains a composite entity that has no relationship data. They are not necessar- ily errors, and you can use them wherever needed.

rEprEsENTiNg DaTa rELaTiONsHips In the preceding section we alluded to the use of identifiers in more than one relation. This is the one way in which relational databases represent relationships between entities. To make this concept clearer, take a look at the three tables in Figure 5-3.

Each table in the illustration is directly analogous to the entity by the same name in the Antique Opticals ER diagram. The Orders table (the order entity) is identified by an order number, an arbitrary unique

Items Item Number Title

Distributor Number Price

1001 Gone with the Wind 002 39.95 1002 Star Wars: Special Edition 002 59.95 1003 Die Hard 004 29.95 1004 Bambi 006 29.95

Orders Order Number

Customer Number

Order Date

11100 0012 12/18/09 11101 0186 12/18/09 11102 0056 12/18/09

Order Items Order Number Item Number Quantity Shipped? 11100 1001 1 Y 11100 1002 1 Y 11101 1002 2 Y 11102 1002 1 N 11102 1003 1 N 11102 1001 1 N

n FIguRE 5-3 Three relations from the Antique Opticals database.

ap pl ic ab le c op yr ig ht l aw .

94 cHAPTER 5 The Relational Data Model

primary key assigned by Antique Opticals. The Items table (the item entity) is identified by an item number, which could be another arbi- trary unique identifier or a UPC.

The third table—Order Items (the order items entity)—tells the company which items are part of which order. As you saw earlier in this chapter, this table requires a concatenated primary key because multiple items can appear on multiple orders. The selection of the primary key, however, has more significance than simply uniquely identifying each row. It also represents a relationship between the order items, the orders on which they appear, and the items being ordered.

The item number column in the order items relation is the same as the primary key of the item table. This indicates a one-to-many rela- tionship between the two tables. By the same token, there is a one-to- many relationship between the orders and order items tables because the order number column in the order items table is the same as the primary key of the orders table.

When a table contains a column (or concatenation of columns) that is the same as the primary key of a table, the column is called a foreign key. The matching of foreign key values to primary key values repre- sents data relationships in a relational database. As far as the user of a relational database is concerned, there are no structures that show relationships other than the matching column’s values.

Note: This is why the idea that relational databases have “relationships between files” is so absurd. The relationships in a relational database are between logical constructs—tables—and nothing else. Such structures make absolutely no assumptions about physical storage.

Foreign keys may be part of a concatenated primary key, or they may not be part of their table’s primary key at all. Consider, for example, a pair of simple Antique Opticals customers and orders relations:

customers (customer_numb, first_name, last_name, phone)

orders (order_numb, customer_numb, order_date)

The customer number column in the orders table is a foreign key that matches the primary key of the customers table. It represents the one-

ap pl ic ab le c op yr ig ht l aw .

95Representing Data Relationships

to-many relationship between customers and the orders they place. However, the customer number is not part of the primary key of its table but is a non-key attribute that is nonetheless a foreign key.

Technically, foreign keys need not have values unless they are part of a concatenated primary key; they can be null. However, in this particu- lar database, Antique Opticals would be in serious trouble if customer numbers were null: There would be no way to know which customer placed an order!

A relational DBMS uses the relationships indicated by matching data between primary and foreign keys. For example, assume that an Antique Opticals employee wanted to see what titles had been ordered on order number 11102. First, the DBMS identifies the rows in the order items table that contain an order number of 11102. Then, it takes the items number from those rows and matches them to the item numbers in the items table. In the rows where there are matches, the DBMS retrieves the associated data.

referential integrity The procedure described in the preceding paragraph works very well— unless for some reason there is no order number in the orders table to match a row in the order items table. This is a very undesirable situ- ation because you can’t ship the ordered items if you don’t know which customer placed the order.

This relational data model therefore enforces a constraint called refer- ential integrity, which states that every non-null foreign key value must match an existing primary key value. Of all the constraints on a relational database, this is probably the most important because it ensures the consistency of the cross-references among tables.

Referential integrity constraints are stored in the database and enforced automatically by the DBMS. As with all other constraints, each time a user enters or modifies data, the DBMS checks the constraints and verifies that they are met. If the constraints are violated, the data modification will not be allowed.

foreign Keys and primary Keys in the same Table Foreign keys do not necessarily need to reference a primary key in a different table; they need only reference a primary key. As an example, consider the following employee relation:

ap pl ic ab le c op yr ig ht l aw .

96 cHAPTER 5 The Relational Data Model

employee (employee_ID, first_name, last_name, department, manager_ID)

A manager is also an employee. Therefore, the manager ID, although named differently from the employee ID, is actually a foreign key that references the primary key of its own table. The DBMS will therefore always ensure that whenever a user enters a manager ID, that manager already exists in the table as an employee.

viEWs The people responsible for developing a database schema and those who write application programs for use by technologically unsophis- ticated users typically have knowledge of and access to the entire schema, including direct access to the database’s base tables. However, it is usually undesirable to have end users working directly with base tables, primarily for security reasons.

The relational data model therefore includes a way to provide end users with their own window into the database, one that hides the details of the overall database design and prohibits direct access to the base tables.

The view Mechanism A view is not stored with data. Instead, it is stored under a name in the database itself along with a database query that will retrieve its data. A view can therefore contain data from more than one table, selected rows, and selected columns.

Note: Although a view can be constructed in just about any way that you can query a relational database, many views can be used for data display. As you will learn in Chapter 10, only views that meet a strict set of rules can be used to modify data.

The real beauty of storing views in this way, however, is that whenever the user includes the name of the view in a data manipulation lan- guage statement, the DBMS executes the query associated with the view name and recreates the view’s table. This means that the data in a view will always be current.

A view table remains in main memory only for the duration of the data manipulation statement in which it was used. As soon as the user Co

py ri gh t © 2 00 9. M or ga n Ka uf ma nn . Al l ri gh ts r es er ve d. M ay n ot b e re pr od uc ed i n an y fo rm w it ho ut p er mi ss io n fr om t he p ub li sh er , ex ce pt f ai r us es p er mi tt ed u nd er U .S . or

ap pl ic ab le c op yr ig ht l aw .

97The Data Dictionary

issues another query, the view table is removed from main memory to be replaced by the result of the most recent query. A view table is therefore a virtual table.

Note: Some end user DBMSs give the user the ability to save the contents of a view as a base table. This is a particularly undesirable feature, as there are no provisions for automatically updating the data in the saved table whenever the tables on which it was based change. The view table therefore will quickly become out of date and inaccurate.

Why use views? There are three good reasons to include views in the design of a database:

n As mentioned earlier, views provide a significant security mecha- nism by restricting users from viewing portions of a schema to which they should not have access.

n Views can simplify the design of a database for technologically unsophisticated users.

n Because views are stored as named queries, they can be used to store frequently used, complex queries. The queries can then be executed by using the name of the view in a simple query.

Like other structural elements in a relational database, views can be created and destroyed at any time. However, because views do not contain stored data but only specification of a query that will generate a virtual table, adding or removing view definitions has no impact on base tables or the data they contain. Removing a view will create prob- lems only when that view is used in an application program and the program is not modified to work with a different view or base table.

THE DaTa DiCTiONarY The structure of a relational database is stored in the database’s data dictionary or catalog. The data dictionary is made up of a set of relations that are identical in properties to the relations used to hold data. They can be queried using the same tools used to query data-handling rela- tions. No user can modify the data dictionary tables directly. However, data manipulation language commands that create, modify, and destroy database structural elements work by modifying rows in data dictionary tables.Co

ap pl ic ab le c op yr ig ht l aw .

98 cHAPTER 5 The Relational Data Model

You will typically find the following types of information in a data dictionary:

n Definitions of the columns that make up each table n Integrity constraints placed on relations n Security information (which user has the right to perform

which operation on which table) n Definitions of other database structure elements, such as views

(discussed further in Chapter 8) and user-defined domains

When a user attempts to access data in any way, a relational DBMS first goes to the data dictionary to determine whether the database elements the user has requested are actually part of the schema. In addition, the DBMS verifies that the user has the access right to what- ever he or she is requesting.

When a user attempts to modify data, the DBMS also goes to the data dictionary to look for integrity constraints that may have been placed on the relation. If the data meet the constraints, then the modification is permitted. Otherwise the DBMS returns an error message and does not make the change.

Because all access to a relational database is through the data diction- ary, relational DBMSs are said to be data dictionary driven. The data in the data dictionary are known as metadata: data about data.

sample Data Dictionary Tables The precise tables that make up a data dictionary depend somewhat on the DBMS. In this section you will see one example of a typical way in which a DBMS might organize its data dictionary.

The linchpin of the data dictionary is actually a table that documents all the data dictionary tables (often named syscatalog, the first few rows of which can be found in Figure 5-4). From the names of the data dictionary tables, you can probably guess that there are tables to store data about base tables, their columns, their indexes, and their foreign keys.

The syscolumn table describes the columns in each table (including the data dictionary tables). In Figure 5-5, for example, you can see a portion of a syscolumn table that describes the Antique Opticals mer- chandise item table.

Keep in mind that these data dictionary tables have the same structure and must adhere to the same rules as base tables. They must have

ap pl ic ab le c op yr ig ht l aw .

99A Bit of History

non-null unique primary keys, and they must enforce referential integ- rity among themselves.

a biT Of HisTOrY When Codd published his paper describing the relational data model in 1970, software developers were bringing databases based on older data models to market. The software was becoming relatively mature and was being widely installed. Although many theorists recognized the benefits of the relational data model, it was some time before relational systems actually appeared.

IBM had a working prototype of its System R by 1976. This product, however, was never released. Instead, the first relational DBMS to feature SQL—an IBM development—was Oracle, released by the company of the same name in 1977. IBM didn’t actually market a relational DBMS until 1981, when it released SQL/DS.

Oracle debuted on minicomputers running UNIX. SQL/DS ran under VM (often specifically using CMS on top of VM) on IBM mainframes.

creator tname dbspace tabletype ncols Primary_key

SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS

SYSTABLE SYSCOLUMN SYSINDEX SYSIXCOL SYSFOREIGNKEY SYSKCOL SYSFILE SYSDOMAIN SYSUSERPERM SYTSTABLEPERM SYSCOLPERM

SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM

TABLE TABLE TABLE TABLE TABLE TABLE TABLE TABLE TABLE TABLE TABLE

12 14 8 5 8 4 3 4 10 11 6

Y Y Y Y Y Y Y Y Y Y Y

n FIguRE 5-4 A portion of a syscatalog table.

creator

DBA DBA DBA DBA DBA

cname

item_numb title distributor_numb release_date retail_price

tname

items items items items items

coltype

integer varchar integer date numeric

nulls

N Y Y Y Y

length

4 60 4 6 8

Inprimarykey

Y N N N N

Colno

1 2 3 4 5

n FIguRE 5-5 Selected rows from a syscolumn table.

ap pl ic ab le c op yr ig ht l aw .

100 cHAPTER 5 The Relational Data Model

There was also a crop of early products that were designed specifically for PCs, the first of which was dBase II, from a company named Ashton-Tate. Released in 1981, the product ran on IBM PCs and Apple II+s.

Note: It is seriously questionable whether dBase was ever truly a “relational” DBMS. However, most consumers saw it as such, and it is therefore considered the first relational product for PCs.

Oracle was joined by a large number of competing products in the UNIX market, including Informix and Ingres. Oracle has been the biggest winner in this group because it now runs on virtually every OS/hardware platform combination imaginable and has scaled well (down to PCs and up to mainframes). Prior to the widespread deploy- ment of mySQL as a database server for Web sites, it was safe to say that there were more copies of Oracle running on computers than any other DBMS.

The PC market for relational DBMSs has been flooded with products. As often happens with software, the best has not necessarily become the most successful. In 1983, Microrim released its R:BASE product, the first truly relational product for a PC. With its support for standard SQL, a powerful integrity rule facility, and a capable programming language, R:BASE was a robust product. It succumbed, however, to the market penetration of dBASE. The same can be said for Paradox (origi- nally a Borland product and later purchased by Corel) and FoxPro (a dBase-compatible product originally developed by Fox Software).

dBase faded from prominence after being purchased by Borland in 1991. FoxPro, dBase’s major competitor, was purchased by Microsoft in 1992. It, too, has faded from the small computer DBMS market. Instead, the primary end user desktop DBMS for Windows today is Access, first released by Microsoft in 1993.

Note: You may be wondering why no newer products are mentioned in the preceding discussion. That is because, for the most part, there haven’t been any major new relational DBMSs released in some time. A DBMS is a complex product for a software developer to create. Added to that, once an organization has invested time and money in purchasing, installing, and writing applications for a specific product, it is unlikely to want to change as long as the current product can be made to work. The barriers to entry into the DBMS software market are very high for a new product.

ap pl ic ab le c op yr ig ht l aw .

101For Further Reading

fOr furTHEr rEaDiNg If you want to follow the history of Codd’s specifications for relational databases, consult the following:

Codd, E. F. “A Relational Model of Data for Large Shared Databanks.” Communications of the ACM 13(6), 377–387, 1970.

Codd, E. F. “Extending the Relational Model to Capture More Meaning.” Transactions on Database Systems 4(4), 397–434, 1979.

Codd, E. F. “Relational Database: A Practical Foundation for Productivity.” Communications of the ACM. 25(2), 109–117, 1982.

Codd, E. F. “Is Your DBMS Really Relational?” Computerworld, October 14: ID/1-1D/9, 1983.

Codd, E. F. The Relational Data Model, Version 2. Addison-Wesley, 1990.

There are also literally hundreds of books that discuss the details of specific relational DBMSs. After you finish reading this book, you may want to consult one or more books that deal with your specific product to help you learn to develop applications using that product’s tools.

Another title of interest is:

Lightstone, Sam S., Toby J. Teorey, and Tom Nadeau. Physical Database Design: The Database Professional’s Guide to Exploiting Views, Storage and More. Morgan Kaufmann, 2007.

ap pl ic ab le c op yr ig ht l aw .

This page intentionally left blank

ap pl ic ab le c op yr ig ht l aw .

Given any pool of entities and attributes, you can group them into relations in many ways. In this chapter, you will be introduced to the process of normalization, through which you create relations that avoid most of the problems that arise from bad relational design.

There are at least two ways to approach normalization. The first is to work from an ER diagram. If the diagram is drawn correctly, then there are some simple rules you can use to translate it into relations that will avoid most relational design problems. The drawback to this approach is that it can be difficult to determine whether your design is correct. The second approach is to use the theoretical concepts behind good design to create your relations. This is a bit more difficult than working from an ER diagram, but it often results in a better design.

In practice, you may find it useful to use a combination of both approaches. First, create an ER diagram and use it to design your rela- tions. Then, check those relations against the theoretical rules for good design and make any changes necessary to meet the rules.

TraNsLaTiNg aN Er DiagraM iNTO rELaTiONs An ER diagram in which all many-to-many relationships have been transformed into one-to-many relationships through the introduction of composite entities can be translated directly into a set of relations. To do so:

n Create one table for each entity.

n For each entity that is only at the “one” end of one or more relation- ships and not at the “many” end of any relationship, create a single- column primary key, using an arbitrary unique identifier if no natural primary key is available.

103

6Chapter Normalization

ap pl ic ab le c op yr ig ht l aw .

104 cHAPTER 6 Normalization

n For each entity that is at the “many” end of one or more relation- ships, include the primary key of each parent entity (those at the “one” end of the relationships) in the table as foreign keys.

n If an entity at the “many” end of one or more relationships has a natural primary key (for example, an order number or an invoice number), use that single column as the primary key. Otherwise, concatenate the primary key of its parent with any other column or columns needed for uniqueness to form the table’s primary key.

Following these guidelines, we end up with the following tables for the Antique Opticals database:

Customer (customer_numb, customer_first_name, customer_last_name, customer_street, customer_city, customer_state, customer_zip, customer_phone)

Distributor (distributor_numb, distributor_name, distributor_street, distributor_city, distributor_state, distributor_zip, distributor_phone, distributor_contact_person, contact_person_ext)

Item (item_numb, item_type, title, distributor_numb, retail_price, release_date, genre, quant_in_stock)

Order (order_numb, customer_numb, order_date, credit_card_numb, credit_card_exp_date, order_complete?, pickup_or_ship?)

Order item (order_numb, item_numb, quantity, discount_percent, selling_price, line_cost, shipped?, shipping_date)

Purchase (purchase_date, customer_numb, items_received?, customer_paid?)

Purchase item (purchase_date, customer_numb, item_numb, condition, price_paid)

Actor (actor_numb, actor_name) Performance (actor_numb, item_numb, role) Producer (producer_name, studio) Production (producer_name,item_numb)

Note: You will see these relations reworked a bit throughout the remainder of the first part of this book to help illustrate various aspects of database design. However, the preceding is the design that results from a direct translation of the ER diagram.

ap pl ic ab le c op yr ig ht l aw .

105Normal Forms

NOrMaL fOrMs The theoretical rules that the design of a relation meet are known as normal forms. Each normal form represents an increasingly stringent set of rules. Theoretically, the higher the normal form, the better the design of the relation. As you can see in Figure 6-1, there are six nested normal forms, indicating that if a relation is in one of the higher, inner normal forms, it is also in all of the normal forms below it.

In most cases, if you can place your relations in third normal form (3NF), then you will have avoided most of the problems common to bad relational designs. The three higher normal forms—Boyce-Codd, fourth normal form (4NF), and fifth normal form (5NF)—handle special situations that arise only occasionally. However, the situations that these normal forms handle are conceptually easy to understand and can be used in practice if the need arises.

Fifth Normal Form

Fo ur

th N ormal Form

Bo yce

-Codd Normal Form

Thi rd Normal Form

Sec ond Normal Form

First Normal Form

n FIguRE 6-1 Nested normal forms.

ap pl ic ab le c op yr ig ht l aw .

106 cHAPTER 6 Normalization

In recent years, sixth normal form has been added to relational data- base design theory. It is not precisely a more rigorous normal form than fifth normal form, although it uses the same principles to trans- form relations from one form to another. You will be introduced to it briefly at the end of this chapter.

Note: In addition to the normal forms in Figure 6-1 and sixth normal form, another normal form—domain/key normal form—which is of purely theoretical importance, has not been used as a practical design objective.

firsT NOrMaL fOrM A table is in first normal form (1NF) if it meets the following criteria:

1. The data are stored in a two-dimensional table. 2. There are no repeating groups.

The key to understanding 1NF, therefore, is understanding the nature of a repeating group of data.

understanding repeating groups A repeating group is an attribute that has more than one value in each row of a table. For example, assume that you were working with an employee relation and needed to store the names and birth dates of the employees’ children. Because each employee can have more than one child, the names of the children and their birth dates each form a repeating group.

Note: A repeating group is directly analogous to a multivalued attribute in an ER diagram.

There is actually a very good reason why repeating groups are not permitted. To see what might happen if they were used, take a look at Figure 6-2, an instance of an employee table containing repeating groups. Notice that there are multiple values in a single row in both the children’s names and the children’s birth dates columns. This presents two major problems:

ap pl ic ab le c op yr ig ht l aw .

107First Normal Form

n There is no way to know exactly which birth date belongs to which child. It is tempting to say that we can associate the birth dates with the children by their positions in the list, but there is nothing to ensure that the relative positions will always be maintained.

n Searching the table is very difficult. If, for example, we want to know which employees have children born before 2005, the DBMS will need to perform data manipulation to extract the individual dates themselves. Given that there is no way to know how many birth dates there are in the column for any specific row, the processing overhead for searching becomes even greater.

The solution to these problems, of course, is to get rid of the repeating groups altogether.

Emp# First Last Children’s Names Children’s Birthdates

1001 Jane Doe Mary, Sam 1/9/02, 5/15/04

1002 John Doe Lisa, David 1/9/00, 5/15/01

1003 Jane Smith John, Pat, Lee, Mary 10/5/04, 10/12/00, 6/6/2006, 8/21/04

1004 John Smith Michael 7/4/06

1005 Jane Jones Edward, Martha 10/21/05, 10/15/99

n FIguRE 6-2 A table with repeating groups.

Note: The table in Figure 6-2 is not a legal relation because it contains those repeating groups. Therefore, we should not call it a relation.

Handling repeating groups There are two ways to get rid of repeating groups to bring a table into conformance with the rules for first normal form: a right way and a wrong way. We will look first at the wrong way so you will know what not to do.

In Figure 6-3 you can see a relation that handles repeating groups by creating multiple columns for the multiple values. This particular example includes three pairs of columns for a child’s name and birth date. The relation in Figure 6-3 does meet the criteria for first normal form. The repeating groups are gone, and there is no problem identify- ing which birth date belongs to which child. However, the design has introduced several problems of its own, as follows:

ap pl ic ab le c op yr ig ht l aw .

108 cHAPTER 6 Normalization

n The relation is limited to three children for any given employee. This means that there is no room to store Jane Smith’s fourth child. Should you put another row for Jane Smith into the table? If so, then the primary key of this relation can no longer be just the employee number. The primary key must include one child’s name as well.

n The relation wastes space for people who have less than three chil- dren. Given that disk space is one of the least expensive elements of a database system, this is probably the least of the problems with this relation.

n Searching for a specific child becomes very clumsy. To answer the question “Does anyone have a child named Lee?,” the DBMS must construct a query that includes a search of all three child name columns because there is no way to know in which column the name might be found.

The right way to handle repeating groups is to create another table (another entity) to handle multiple instances of the repeating group. In the example we have been using, we would create a second table for the children, producing something like Figure 6-4.

Neither of the two new tables contains any repeating groups, and this form of the design avoids all the problems of the preceding solution:

n There is no limit to the number of children that can be stored for a given employee. To add another child, you simply add another row to the table.

n There is no wasted space. The children table uses space only for data that are present.

n Searching for a specific child is much easier because children’s names are found in only one column.

Emp# First Last Child Name 1

Child Bdate 1

Child Name 2

Child Bdate 2

Child Name 3

Child Bdate 3

1001 Jane Doe Mary 1/1/02 Sam 5/15/04

1002 John Doe Lisa 1/1/00 David 5/15/01

1003 Jane Smith John 10/5/04 Pat 10/12/00 Lee 6/6/06

1004 John Smith Michael 7/4/06

1005 Joe Jones Edward 10/21/05 Martha 10/15/99

n FIguRE 6-3 A relation handling repeating groups in the wrong way

ap pl ic ab le c op yr ig ht l aw .

109First Normal Form

problems with first Normal form Although first normal form relations have no repeating groups, they usually have many other problems. To examine the most typical, we will look at the table underlying the data entry form in Chapter 3. (This table comes from Antique Opticals’ original data management system rather than the new and improved design you saw earlier in this chapter.) Expressed in the notation for relations that we have been using, the relation is:

orders (customer_numb, first_name, last_name, street, city, state, zip, phone, order_numb, order_date, item_numb, title, price, has_shipped?)

The first thing we need to do is determine the primary key for this table. The customer number alone will not be sufficient because the customer number repeats for every item ordered by the customer. The item number will also not suffice because it is repeated for every order on which it appears. We cannot use the order number because it is repeated for every item on the order. The only solution is a concate- nated key, in this example, the combination of the order number and the item number.

Employee

Emp# First Last 1001 Jane Doe 1002 John Doe 1003 Jane Smith 1004 John Smith 1005 Joe Jones

Children 1001 Mary 1/1/02 1001 Sam 5/15/04 1002 Lisa 1/1/00 1002 David 5/15/01 1003 John 10/5/04 1003 Pat 10/12/00 1003 Lee 6/6/06 1003 Mary 8/21/04 1004 Michael 7/4/06 1005 Edward 10/21/05 1005 Martha 1015/99

n FIguRE 6-4 The correct way to handle a repeating group.

ap pl ic ab le c op yr ig ht l aw .

110 cHAPTER 6 Normalization

Given that the primary key is made up of the order number and the item number, there are two important things we cannot do with this relation:

1. We cannot add data about a customer until the customer places at least one order because without an order and an item on that order, we do not have a complete primary key.

2. We cannot add data about a merchandise item we are carrying without that item being ordered. There must be an order number to complete the primary key.

The preceding are insertion anomalies, situations that arise when you are prevented from inserting data into a relation because a complete primary key is not available. (Remember that no part of a primary key can be null.)

Note: To be strictly correct, there is a third insertion anomaly in the orders relation. You cannot insert an order until you know one item on the order. In a practical sense, however, no one would enter an order without there being an item ordered.

Insertion anomalies are common in first normal form relations that are not also in any of the higher normal forms. In practical terms, they occur because there are data about more than one entity in the rela- tion. The anomaly forces you to insert data about an unrelated entity (for example, a merchandise item) when you want to insert data about another entity (such as a customer).

First normal form relations can also cause problems when data are deleted. Consider, for example, what happens if a customer cancels the order of a single item:

n In cases where the deleted item was the only item on the order, you lose all data about the order.

n In cases where the order was the only order on which the item appeared, you lose data about the item.

n In cases where the deleted item was the item ordered by a customer, you lose all data about the customer.

These deletion anomalies occur because part of the primary key of a row becomes null when the merchandise item data are deleted, forcing you to remove the entire row. The result of a deletion anomaly is the loss of data that you would like to keep. In practical terms, you are

ap pl ic ab le c op yr ig ht l aw .

111Second Normal Form

forced to remove data about an unrelated entity when you delete data about another entity in the same table.

Note: Moral to the story: More than one entity in a table is a bad thing.

There is a final type of anomaly in the orders relation that is not related to the primary key: a modification, or update, anomaly. The order rela- tion has a great deal of unnecessary duplicated data—in particular, information about customers. When a customer moves, then the cus- tomer’s data must be changed in every row, for every item on every order ever placed by the customer. If every row is not changed, then data that should be the same are no longer the same. The potential for these inconsistent data is the modification anomaly.

sECOND NOrMaL fOrM The solution to anomalies in a first normal form relation is to break down the relation so there is one relation for each entity in the 1NF relation. The orders relation, for example, will break down into four relations (customers, items, orders, and line items). Such relations are in at least second normal form (2NF).

In theoretical terms, second formal form relations are defined as follows:

1. The relation is in first normal form. 2. All non-key attributes are functionally dependent on the

entire primary key.

The new term in the preceding is functionally dependent, a special rela- tionship between attributes.

understanding functional Dependencies A functional dependency is a one-way relationship between two attri- butes such that at any given time, for each unique value of attribute A, only one value of attribute B is associated with it through the rela- tion. For example, assume that A is the customer number from the orders relation. Each customer number is associated with one cus- tomer first name, one last name, one street address, one city, one state, one zip code, and one phone number. Although the values for those attributes may change, at any moment, there is only one.

ap pl ic ab le c op yr ig ht l aw .

112 cHAPTER 6 Normalization

We can therefore say that first name, last name, street, city, state, zip, and phone are functionally dependent on the customer number. This relationship is often written:

Customer_numb -> first_name, last_name, street, city, state, zip, phone

and read “customer number determines first name, last name, street, city, state, zip, and phone.” In this relationship, customer number is known as the determinant (an attribute that determines the value of other attributes).

Notice that the functional dependency does not necessarily hold in the reverse direction. For example, any given first or last name may be associated with more than one customer number. (It would be unusual to have a customer table of any size without some duplication of names.)

The functional dependencies in the orders table are:

Customer_numb −> first_name, last_name, street, city, state, zip, phone

Item_numb −> title, price Order_numb −> customer_numb, order_date Item_numb + order_numb −> has_shipped?

Notice that there is one determinant for each entity in the relation and the determinant is what we have chosen as the entity identifier. Notice also that when an entity has a concatenated identifier, the determinant is also concatenated. In this example, whether an item has shipped depends on the combination of the item and the order.

using functional Dependencies to reach 2Nf If you have correctly identified the functional dependencies among the attributes in a database environment, then you can use them to create second normal form relations. Each determinant becomes the primary key of a relation. All the attributes that are functionally depen- dent on it become non-key attributes in the relation.

The original orders relation should be broken into the following four relations:

Customer (customer_numb, first_name, last_name, street, city, state, zip, phone)

Item (item_numb, title, price)

ap pl ic ab le c op yr ig ht l aw .

113Second Normal Form

Order (order_numb, customer_numb, order_date) Order items (order_numb, item_numb, has_shipped?)

Each of these should in turn correspond to a single entity in your ER diagram.

Note: When it comes to deciding what is driving database design— functional dependencies or entities—it is really a “chicken and egg” situation. What is most important is that there is consistency between the ER diagram and the functional dependencies you identify in your relations. It makes no difference whether you design by looking for functional dependencies or for entities. In most cases, database design is an iterative process in which you create an initial design, check it, modify it, and check it again. You can look at either functional dependencies and/or entities at any stage in the process, checking one against the other for consistency.

The relations we have created from the original orders relation have eliminated the anomalies present in the original:

n It is now possible to insert data about a customer before the customer places an order.

n It is now possible to insert data about an order before we know an item on the order.

n It is now possible to store data about merchandise items before they are ordered.

n Line items can be deleted from an order without affecting data describing that item, the order itself, or the merchandise item.

n Data describing the customer are stored only once, and therefore any change to those data needs to be made only once. A modification anomaly cannot occur.

problems with 2Nf relations Although second normal form eliminates problems from many rela- tions, you will occasionally run into relations that are in second normal form yet still exhibit anomalies. Assume, for example, that each new DVD title that Antique Opticals carries comes from one distributor and that each distributor has only one warehouse that has only one phone number. The following relation is therefore in second normal form:

ap pl ic ab le c op yr ig ht l aw .

114 cHAPTER 6 Normalization

Item (item_numb, title, distrib_numb, warehouse_phone_number)

For each item number, there is only one value for the item’s title, distributor, and warehouse phone number. However, there is one insertion anomaly: You cannot insert data about a distributor until you have an item from that distributor. There is one deletion anomaly: If you delete the only item from a distributor, you lose data about the distributor. There is also a modification anomaly: The distributor’s warehouse phone number is duplicated for every item the company gets from that distributor. The relation is in second normal form but not third.

THirD NOrMaL fOrM Third normal form is designed to handle situations like the one you just read about in the preceding section. In terms of entities, the item relation does contain two entities: the merchandise item and the dis- tributor. That alone should convince you that the relation needs to be broken down into two smaller relations, both of which are now in third normal form:

Item (item_numb, distrib_numb) Distributor (distrib_numb, warehouse_phone_number)

The theoretical definition of third normal form says:

1. The relation is in second normal form. 2. There are no transitive dependencies.

The functional dependencies found in the original relation are an example of a transitive dependency.

Transitive Dependencies A transitive dependency exists when you have the following functional dependency pattern:

This is precisely the case with the original items relation. The only reason that the warehouse phone number is functionally dependent on the item number is because the distributor is functionally depen- dent on the item number and the phone number is functionally dependent on the distributor. The functional dependencies are really:

ap pl ic ab le c op yr ig ht l aw .

115Third Normal Form

Item_numb −> distrib_numb Distrib_numb −> warehouse_phone_number

There are two determinants in the original items relation, each of which should be the primary key of its own relation. However, it is not merely the presence of the second determinant that creates the transitive dependency. What really matters is that the second determi- nant is not a candidate key for the relation.

Consider for example, this relation:

Item (item_numb, UPC, distrib_numb, price)

The item number is an arbitrary number that Antique Opticals assigns to each merchandise item. The UPC is an industry-wide code that is unique to each item as well. The functional dependencies in this relation are:

Item_numb −> UPC, distrib_numb, price UPC −> item_numb, distrib_numb, price

Is there a transitive dependency here? No, because the second deter- minant is a candidate key. (Antique Opticals could have just as easily used the UPC as the primary key.) There are no insertion, deletion, or modification anomalies in this relation; it describes only one entity: the merchandise item.

A transitive dependency therefore exists only when the determinant that is not the primary key is not a candidate key for the relation. In the items table we have been using, for example, the distributor is a determinant but not a candidate key for the table. (There can be more than one item coming from a single distributor.)

When you have a transitive dependency in a 2NF relation, you should break the relation into two smaller relations, each of which has one of the determinants in the transitive dependency as its primary key. The attributes determined by the determinant become non-key attri- butes in each relation. This removes the transitive dependency—and its associated anomalies—and places the relation in third normal form.

Note: Transitive dependencies take their name from the transitive property in mathematics, which states that if a > b and b > c, then a > c.

ap pl ic ab le c op yr ig ht l aw .

116 cHAPTER 6 Normalization

bOYCE-CODD NOrMaL fOrM For most relations, third normal form is a good design objective. Rela- tions in that state are free of most anomalies. However, occasionally you run into relations that exhibit special characteristics where anom- alies still occur. Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF) were created to handle such special situations.

Note: A second normal form relation that has no transitive dependencies is, of course, automatically in third normal form.

Note: If your relations are in third normal form and do not exhibit the special characteristics that BCNF, 4NF, and 5NF were designed to handle, then they are automatically in 5NF.

The easiest way to understand BCNF is to start with an example. Assume that Antique Opticals decides to add a relation to its database to handle employee work scheduling. Each employee works one or two 4-hour shifts a day at the store. During each shift, an employee is assigned to one station (a place in the store, such as the front desk or the stockroom). Only one employee works a station during the given shift.

A relation to handle the schedule might be designed as follows:

Schedule (employee_ID, date, shift, station, worked_shift?)

Given the rules for the scheduling (one person per station per shift), there are two possible primary keys for this relation:

employee_ID + date + shift or date + shift + station

The functional dependencies in the relation are:

employee_ID + date + shift −> station, worked_shift? date + shift + stations −> employee_ID,

worked_shift?

Keep in mind that this holds true only because there is only one person working each station during each shift.

ap pl ic ab le c op yr ig ht l aw .

117Fourth Normal Form

This schedule relation exhibits overlapping candidate keys. (Both can- didate keys have date and shift in common.) Boyce-Codd normal form was designed to deal with relations that exhibit this characteris- tic. To be in Boyce-Codd normal form, a relation must meet the fol- lowing rules:

1. The relation must be in third normal form. 2. All determinants must be candidate keys.

BCNF is considered to be a more general way of looking at 3NF because it includes those relations with the overlapping candidate keys. The sample schedule relation we have been considering does meet the criteria for BCNF because the two determinants are indeed candidate keys.

fOurTH NOrMaL fOrM Like BCNF, fourth normal form was designed to handle relations that exhibit a special characteristic that does not arise too often. In this case, the special characteristic is something known as a multivalued dependency.

As an example, consider the following relation:

movie info (title, star, producer)

A given movie can have more than one star; it can also have more than one producer. The same star can appear in more than one movie; a producer can also work on more than one movie (for example, see the instance in Figure 6-5). The relation must therefore include all columns in its key.

ProducerStarTitle

Great Film Great Film Great Film Great Film Boring Movie Boring Movie

Lovely Lady Handsome Man Lovely Lady Handsome Man Lovely Lady Precocious Child

Money Bags Money Bags Helen Pursestrings Helen Pursestrings Helen Pursestrings Helen Pursestrings

n FIguRE 6-5 A relation with a multivalued dependency.

Note: There is very little difference between the two candidate keys as far as the choice of a primary key is concerned. In cases like this, you can choose either one.

ap pl ic ab le c op yr ig ht l aw .

118 cHAPTER 6 Normalization

Because there are no non-key attributes, this relation is in BCNF. Nonetheless, the relation exhibits anomalies:

n You cannot insert the stars of a movie without knowing at least one producer.

n You cannot insert the producer of a movie without knowing at least one star.

n If you delete the only producer from a movie, you lose information about the stars.

n If you delete the only star from a movie, you lose information about its producers.

n Each producer’s name is duplicated for every star in the movie. By the same token, each star’s name is duplicated for each producer of the movie. These unnecessary duplicated data form the basis of a modification anomaly.

There are at least two unrelated entities in this relation: one that handles the relationship between a movie and its stars and another that handles the relationship between a movie and its producers. In a practical sense, that is the cause of the anomalies. (Arguably, there are also movie, star, and producer entities involved.)

However, in a theoretical sense, the anomalies are caused by the pres- ence of a multivalued dependency in the same relation, which must be eliminated to get to fourth normal form. The rules for fourth normal form are:

1. The relation is in Boyce-Codd normal form. 2. There are no multivalued dependencies.

Multivalued Dependencies A multivalued dependency exists when for each value of attribute A, there exists a finite set of values of both attribute B and attribute C that are associated with it. Attributes B and C, however, are indepen- dent of each other. In the example that we have been using, there is just such a dependency. First, for each movie title, there is a group of actors (the stars) who are associated with the movie. For each title, there is also a group of producers who are associated with it. However, the actors and the producers are independent of one another.

ap pl ic ab le c op yr ig ht l aw .

119Fifth Normal Form

The multivalued dependency can be written:

and read “title multidetermines star and title multidetermines producer.”

Note: At this point, do not let semantics get in the way of database theory. Yes, it is true that producers fund the movies in which the actors are starring, but in terms of database relationships, there is no direct connection between the two.

Note: To be strictly accurate, a functional dependency is a special case of a multivalued dependency, where what is being determined is one value rather than a group of values.

To eliminate the multivalued dependency and bring this relation into fourth normal form, you split the relation, placing each part of the dependency in its own relation:

movie_stars (title, star) movie_producers (title, producer)

With this design, you can independently insert and remove stars and producers without affecting the other. Star and producer names also appear only once for each movie with which they are involved.

fifTH NOrMaL fOrM Fifth normal form—also known as projection-join normal form—is designed to handle a general case of a multivalued dependency, known as a join dependency. Before we can consider 5NF, we must therefore look at the relational algebra operations project and join.

Note: Relational algebra is a set of operations used to manipulate and extract data from relations. Each operation performs a single manipulation of one or two tables. To complete a query, a DBMS uses a sequence of relational algebra operations; relational algebra is therefore procedural. It is not used directly by people using a database but instead is a tool used by the DBMS.

ap pl ic ab le c op yr ig ht l aw .

120 cHAPTER 6 Normalization

projections and Joins When you split relations during the normalization process, you are actually creating a relational algebra projection. Join combines tables on matching attributes and is used extensively in queries to match data based on primary and foreign keys.

Projection The project operation creates a subset of any relation by extracting specified columns. It makes no provision for choosing rows: You get all of them. The theoretical project operation removes duplicate rows so that the result is a legal relation.

As an example, consider the following relation that you saw earlier in this chapter:

Item (item_numb, UPC, distrib_numb, price)

We can make a number of projections, all of which are legal relations:

(item_numb, UPC) (item_numb, distrib_numb) (item_numb, price) (UPC, distrib_numb) (UPC, price) (distrib_numb, price) (item_numb, UPC, distrib_numb) (item_numb, UPC, price) (UPC, distrib_numb, price)

Equi-Join In its most common form, a join forms new rows when data in the two source tables match. Because we are looking for rows with equal values, this type of join is known as an equi-join (or a natural equi-join). As an example, consider the two tables in Figure 6-6. Notice that the customer number column is the primary key of the Customers table and that the same column is a foreign key in the Orders table. The customer number column in orders therefore serves to relate orders to the customers to which they belong.

Assume that you wanted to see the names of the customers who placed each order. To do so, you must join the two tables, creating combined rows wherever there is a matching customer number. In database terminology, we are joining the two tables “over” the customer number. The result table can be found in Figure 6-7.Co

ap pl ic ab le c op yr ig ht l aw .

121Fifth Normal Form

An equi-join can begin with either source table. (The result should be the same regardless of the direction in which the join is performed.) The join compares each row in one source table with the row in the second. For each row in the first source table that matches data in the second source table in the column or columns over which the join is being performed a new row is placed in the result table.

Customers customer_numb first_name last_name

001 Jane Doe 002 John Doe 003 Jane Smith 004 John Smith 005 Jane Jones 006 John Jones

Orders order_numb customer_numb order_date order_total

1001 002 10/10/09 250.85 1002 002 2/21/10 125.89 1003 003 11/15/09 1567.99 1004 004 11/22/09 180.92 1005 004 12/15/09 565.00 1006 006 11/22/09 25.00 1007 006 10/8/09 85.00 1008 006 12/29/09 109.12

n FIguRE 6-6 Two tables with a primary key-foreign key relationship.

Result Table customer_numb first_name last_name order_numb order_date order_total

002 John Doe 1001 10/10/09 250.65

002 John Doe 1002 2/21/10 125.89

003 Jane Smith 1003 11/15/09 1597.99

004 John Smith 1004 11/22/09 180.92

004 John Smith 1005 12/15/09 565.00

006 John Jones 1006 11/22/09 25.00

006 John Jones 1007 10/8/09 85.00

006 John Jones 1008 12/29/09 109.12

n FIguRE 6-7 The joined table.

ap pl ic ab le c op yr ig ht l aw .

122 cHAPTER 6 Normalization

Assuming that we are using the customers table as the first source table, producing the result table in Figure 6-7 might therefore proceed conceptually as follows:

1. Search orders for rows with a customer number of 001. Because there are now matching rows in orders, do not place a row in the result table.

2. Search orders for rows with a customer number of 002. There are two matching rows in orders. Create two new rows in the result table, placing the same customer information at the end of each row in orders.

3. Search orders for rows with a customer number of 003. There is one matching row in orders. Place one new row in the result table.

4. Search orders for rows with a customer number of 004. There are two matching rows in orders. Place two rows in the result table.

5. Search orders for rows with a customer number of 005. There are no matching rows in orders. Therefore, do not place a row in the result table.

6. Search orders for rows with a customer number of 006. There are three matching rows in orders. Place three rows in the result table.

Notice that if a customer number does not appear in both tables, then no row is placed in the result table. This behavior categorizes this type of join as an inner join.

understanding 5Nf Now that you know how the project and join operations work, we can take a look at fifth normal form. As an example, consider the following relation:

Selections (customer_numb, series, item_numb)

This relation represents various series of discs, such as Spider-man or Rambo. Customers place orders for a series; when a customer orders a series, he or she must take all items in that series. Determining fifth normal form becomes relevant only when this type of rule is in place. If customers could request selected titles from a series, then the rela- tion would be fine. Because it would be all-key, it would automatically fall through the normal form rules to 5NF.

ap pl ic ab le c op yr ig ht l aw .

123Fifth Normal Form

To make the problems with this table under the preceding rule clearer, consider the instance of the relation in Figure 6-8. Because this table is all-key, it is automatically in fourth normal form. However, there is a great deal of unnecessary duplicated data in this relation. For example, the item numbers are repeated for every customer that orders a given series. The series name is also repeated for every item in the series and for every customer ordering that series. This relation is therefore prone to modification anomalies.

There is also a more subtle issue: Under the rules of this relation, if customer 2180 orders the first Harry Potter movie and indicates that he or she would like more movies in the series, then the only way to put that choice in the table is to add rows for all five Harry Potter movies. You may be forced to add rows that you don’t want to add and introduce data that aren’t accurate.

Customer Number Series

Item Number

1005 Star Wars 2090 1005 Star Wars 2091 1005 Star Wars 2092 1005 Star Wars 4689 1005 Star Wars 4690 1005 Star Wars 4691 1010 Harry Potter 3200 1010 Harry Potter 3201 1010 Harry Potter 3202 1010 Harry Potter 3203 1010 Harry Potter 3204 2180 Star Wars 2090 2180 Star Wars 2091 2180 Star Wars 2092 2180 Star Wars 4689 2180 Star Wars 4690 2180 Star Wars 4691

n FIguRE 6-8 A relation in 4NF but not 5NF.

Note: There is no official term for the preceding anomaly. It is precisely the opposite of the insertion anomalies described earlier in this chapter, although it does involve a problem with inserting data.

ap pl ic ab le c op yr ig ht l aw .

124 cHAPTER 6 Normalization

By the same token, if a customer doesn’t want one item in a series, then you must remove from the table all the rows for that customer for that series. If the customer still wants the remaining items in the series, then you have a deletion anomaly.

As you might guess, you can solve the problem by breaking the table into two smaller tables, eliminating the unnecessary duplicated data and the insertion and deletion anomalies:

series_subscription (customer_numb, series) series_content (series, item_numb)

The official definition for 5NF is:

1. The relation is in fourth normal form. 2. All join dependencies are implied by the candidate keys.

A join dependency occurs when a table can be put together correctly by joining two or more tables, all of which contain only attributes from the original table. The original selections relation does have a join dependency because it can be created by joining the series subscription and series content relations. The join is valid only because of the rule that requires a customer to order all of the items in a series.

A join dependency is implied by candidate keys when all possible projections from the original relation that form a join dependency are candidate keys for the original relation. For example, the following projections can be made from the selections relation:

A: (customer_numb, series) B: (customer_numb, item_numb) C: (series, item_numb)

We can regenerate the selections relation by combining any two of the preceding relations. Therefore, the join dependencies are A + B, A + C, B + C, and A + B + C. Like other relational algebra operations, the join theoretically removes duplicate rows, so although the raw result of the join contains extra rows, they will be removed from the result, producing the original table.

Note: One of the problems with 5NF is that as the number of columns in a table increases, the number of possible projections increases exponentially. It can therefore be very difficult to determine 5NF for a large relation.

ap pl ic ab le c op yr ig ht l aw .

125Sixth Normal Form

However, each of the projections is not a candidate key for the selec- tions relation. All three columns from the original relation are required for a candidate key. Therefore, the relation is not in 5NF. When we break down the selections relation into series_selections and series_content, we eliminate the join dependencies, ensuring that the relations are in 5NF.

sixTH NOrMaL fOrM Normalization theory has been very stable for nearly 40 years. However, in the late 1990s, C.J. Date, one of the foremost experts in database theory, proposed sixth normal form, particularly to handle situations in which there is temporal data. However, this is not techni- cally a project-join normal form like the others we discussed earlier in this chapter.

Consider the following relation:

customers (ID, valid_interval, street, city, state, zip, phone)

The intent of this relation is to maintain a history of a customer’s location and when they were valid (starting date to ending date). Depending on the circumstances, there may be a great deal of dupli- cated data in this relation (for example, if only the phone number changed) or very little (for example, if there is a move to a new state with a new phone number). Nonetheless, there is only one functional dependency in the relation:

There are no transitive dependencies, no overlapping candidate keys, no multivalued dependencies, and all join dependencies are implied by the candidate key(s). The relation is therefore in fifth normal form.

Sixth normal form was created to handle the situation where temporal data vary independently to avoid unnecessary duplication. The result is tables that cannot be decomposed any further; in most cases, the tables include the primary key and a single non-key attribute. The sixth normal form tables for the sample customers relation would be as follows:

street_addresses (ID, valid_interval, street) cities (ID, valid_interval, city) states (ID, valid_interval, state)

ap pl ic ab le c op yr ig ht l aw .

126 cHAPTER 6 Normalization

zip_codes (ID, valid_interval, zip) phone_numbers (ID, valid_interval, phone)

The resulting tables eliminate the possibility of redundant data but introduce some time-consuming joins to find a customer’s current address or to assemble a history for a customer.

Going to sixth normal form may also introduce the need for a circular inclusion constraint. There is little point in including a street address for a customer unless a city, state, and zip code exist for the same date interval. The circular inclusion constraint would therefore require that if a row for any given interval and any given customer ID exists in any of street_addresses, cities, states, or zip_codes, matching rows must exist in all of those tables. Today’s relational DBMSs do not support circular inclusion constraints, nor are they included in the current SQL standard. If such a constraint is necessary, it will need to be enforced through application code.

fOr furTHEr rEaDiNg There are many books available that deal with the theory of relational databases. You can find useful supplementary information in the following:

Chapple, Mike. “Database Normalization Basics”; available at http:// databases.about.com/od/specificproducts/a/normalization.htm

Date, C. J. “On DN/NF Normal Form”; available at www.dbdebunk.com/ page/page/621935.htm

Date, C. J., Hugh Darwen, and Nikos Lorentzos. Temporal Data and the Relational Model. Morgan Kaufmann, 2002.

Earp, Richard. Database Design Using Entity-Relationship Diagrams. Taylor & Frances, 2007.

Halpin, Terry, and Tony Morgan. Information Modeling and Relational Databases, 2nd ed. Morgan Kaufmann, 2008.

Hillyer, Mike. “An Introduction to Database Normalization”; available at http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

Olivé, Antoni. Conceptual Modeling of Information Systems. Springer, 2007.

Pratt, Philip J., and Joseph J. Adamski. Concepts of Database Management, 6th ed. Course Technology, 2007.

Ritchie, Colin. Database Principles and Design, 3rd ed. Cengage Learning Business Press, 2008.

Wise, Barry. “Database Normalization and Design Techniques”; avail- able at www.barrywise.com/2008/01/database-normalization-and- design-techniques/

ap pl ic ab le c op yr ig ht l aw .

How long are you willing to wait for a computer to respond to your request for information? 30 seconds? 10 seconds? 5 seconds? In truth, we humans aren’t very patient at all. Even five seconds can seem like an eternity when you’re waiting for something to appear on the screen. A database that has a slow response time to user queries usually means that you will have dissatisfied users.

Slow response times can be the result of any number of problems. You might be dealing with a client workstation that isn’t properly configured, a poorly written application program, a query involving multiple join operations, a query that requires reading amounts of data from disk, a congested network, or even a DBMS that isn’t robust enough to handle the volume of queries submitted to it.

One of the duties of a database administrator (DBA) is to optimize database performance (also known as performance tuning). This includes modifying the design—where possible—to avoid perfor- mance bottlenecks, especially involving queries.

For the most part, a DBMS takes care of storing and retrieving data based on a user’s commands without human intervention. The strat- egy used to process a data manipulation request is handled by the DBMS’s query optimizer, a portion of the program that determines the most efficient sequence of relational algebra operations to perform a query.

Although most of the query optimizer’s choices are out of the hands of a database designer or application developer, you can influence the behavior of the query optimizer and also optimize database perfor-

127

7Chapter Database Structure and

Performance Tuning

ap pl ic ab le c op yr ig ht l aw .