DB Development Exercise
Chapter 13
The Data Warehouse
BLCN-534: Fundamentals of Database Systems
Chapter Objectives
- Compare the data needs of transaction processing systems with those of decision support systems.
- Describe the data warehouse concept and list its main features.
- Compare the enterprise data warehouse with the data mart.
- Design a data warehouse.
- Build a data warehouse, including the steps of data extraction, data cleaning, data transformation, and data loading.
- Describe how to use a data warehouse with online analytic processing and data mining.
13-*
Chapter Objectives
List the types of expertise needed to administer a data warehouse.
List the challenges in data warehousing.
13-*
Application Systems
- Transaction Processing Systems (TPS)
- Everyday application systems that support banking and insurance operations, manage the parts inventory on manufacturing assembly lines, keep track of airline and hotel reservations, support Web-based sales, etc.
- Decision Support Systems (DSS)
- specifically designed to aid managers in decision-making tasks.
13-*
The Data Warehouse Concept
A data warehouse is a broad-based, shared database for management decision making that contains data that has been accumulated over time.
Formally, a database warehouse is, “a subject oriented, integrated, non-volatile, and time variant collection of data in support of management’s decisions.”
13-*
Characteristics of
Data Warehouse Data
- The data is subject oriented
- The data is integrated
- The data is non-volatile
- The data is time variant
- The data must be high quality
- The data may be aggregated
- The data is often denormalized
- The data is not necessarily absolutely current
13-*
The Data is Subject Oriented
Data warehouses are organized around subjects, really the major entities of concern in the business environment.
- Sales, customers, orders, claims, accounts, employees, other entities that are central to the company’s business.
13-*
The Data is Integrated
Data about each of the subjects in the data warehouse is typically collected from several of the company’s transactional databases, each of which supports one or more applications that have something to do with the particular subject.
All of the data about a subject must be organized or integrated in such a way that it provides a unified, overall picture of all the important details about the subject over time.
Data from disparate application databases must be transformed into common measurements, codes, data types.
13-*
The Data is Non-Volatile
Once data is added to the data warehouse, it doesn’t change.
It will never change. Changing it would be like going back and rewriting history.
13-*
The Data is Time Variant
Data warehouse data, with its historic nature, always includes some kind of a timestamp.
If we are storing sales data on a weekly or monthly basis and we have accumulated ten years of such historic data, each weekly or monthly sales figure must be accompanied by a timestamp indicating the week or month (and year!) that it represents.
13-*
The Data Must Be
High Quality
Consider a section of a data warehouse in which the subject is customer.
If there is a customer address misspelling in one transactional file, when the data from that file is integrated with the data from the other transactional files, there will be some difficulty in reconciling whether the two different addresses both represent one customer, or whether they actually represent two different customers.
This must be reconciled as the data is integrated and entered into the data warehouse.
13-*
The Data May Be Aggregated
The type of data that management requires for decision making is generally summarized data.
The sheer volume of all the historic detail data would make the data warehouse unacceptably huge in many cases.
If the detail data was stored in the data warehouse, the amount of time that it would take to summarize the data for management every time a query was posed would often be unacceptable.
13-*
The Data is Often Denormalized
If a company is willing to tolerate the substantial additional space taken up by the redundant denormalized data, it can gain the advantage of the improved query performance that redundancy provides without paying the penalties of increased update time and potential data integrity problems.
This works because the data integrity problems that can be caused by redundant data only arise when the data is updated. The historic data in the data warehouse will not be updated.
13-*
The Data is Not Necessarily Absolutely Current
Data warehouse data is updated at some time interval -- weekly, monthly, etc.
Any changes since the last data warehouse update are not recorded in it until the next scheduled update.
Inconsequential when looking at long-term trends.
13-*
Types of Data Warehouses
Enterprise Data Warehouse (EDW)
Data Mart (DM)
13-*
Enterprise Data Warehouse
Large-scale; incorporates the data of an entire company or of a major division, site, or activity of a company.
A full scale EDW is built around several different subjects.
Support a wide variety of DSS applications and serve as a data resource with which company managers can explore new ways of using the company’s data to its advantage.
13-*
The Data Mart
Small-scale; designed to support a small part of an organization.
A company will often have several DMs.
Are based on a limited number of subjects (possibly one) and are constructed from a limited number of transactional databases.
13-*
Which to Choose:
The EDW, the DM, or Both?
It depends from company to company.
Top-down development implies that the EDW was created first and then later data was extracted from an EDW to create one or more DMs.
A company that has deliberately or as a matter of circumstance developed a series of independent DMs may decide, in a bottom-up development fashion to build an EDW out of the existing DMs.
13-*
Designing a Data Warehouse
Two characteristics of data warehouses are central to any design:
- The subject orientation.
- The historic nature of the data.
Data warehouses are often referred to as multidimensional databases because each occurrence of the subject is referenced by an occurrence of each of several dimensions or characteristics of the subject, one of which is time.
13-*
Multidimensional Databases
Two dimensions can easily be visualized on a flat piece of paper.
13-*
Multidimensional Databases
- Three dimensions can easily be visualized on a flat piece of paper as a cube.
- Four or more dimensions are more difficult to visualize.
13-*
Storing Multidimensional Data
There is much interest in storing multidimensional data in relational databases.
The star schema.
- Visual design in which the subject is in the middle and the dimensions radiate outwards.
- Have a “fact table” which represents the data warehouse “subject” and several “dimension tables.”
13-*
General Hardware Company Data Warehouse
Here is the General Hardware transactional database.
13-*
General Hardware Company Data Warehouse
SALE is the fact table.
Like any relational table, must have a primary key.
Dimension tables:
SALESPERSON
PRODUCT
TIME PERIOD
13-*
General Hardware Company Data Warehouse
13-*
Good Reading Bookstores Data Warehouse
- Do they need a data warehouse, since they already store a date attribute?
- Yes, for two reasons:
- While the transactional database performs acceptably with perhaps the last couple of months of data in it, its performance would degrade to an unacceptable level if we tried to keep ten years of data in it.
- The kinds of management decision making that require long-term historic sales data require aggregate not daily data.
13-*
Good Reading Bookstores Data Warehouse
- SALE is the fact table.
- Like any relational table, must have a primary key.
- Dimension tables:
- BOOK
- PUBLISHER
- CUSTOMER
- TIME PERIOD
- Snowflake design
- One dimension table (BOOK) leads to another dimension table (PUBLISHER).
13-*
Lucky Rent-A-Car Data Warehouse
RENTAL is the fact table.
Does not contain aggregated data.
Dimension tables:
CAR
MANUFACTURER
CUSTOMER
TIME PERIOD
Snowflake design
One dimension table (CAR) leads to another dimension table (MANUFACTURER).
13-*
What About a World Music Association Data Warehouse?
- There is already a Year attribute in the RECORDING table.
- The essence of the WMA data is historic.
- By its nature, the amount of data in a WMA type transactional database is much lower than the amount of data in a Good Reading or Lucky-type transactional database.
- Since the nature of the WMA transactional database blurs with what a WMA data warehouse would look like, no WMA data warehouse is needed.
13-*
Building a Data Warehouse
- Data Extraction
- Data Cleaning
- Data Transformation
- Data Loading
13-*
Building a Data Warehouse:
Data Extraction
Process of copying the data from the transactional databases in preparation for loading it into the data warehouse.
This is not a one-time event.
The data is likely to come from several transactional databases.
Some of the data entering into this process may come from outside of the company (data enrichment).
13-*
Lucky Rent-A-Car with Enrichment Data
13-*
In the CUSTOMER table, Customer Age, Customer Income, and Customer Education is the enrichment data.
Data Cleaning
Transactional data can have all kinds of errors in it.
Data warehouses are very sensitive to data errors
Data errors must be “cleaned” or “cleansed” or “scrubbed” as the data is loaded into the data warehouse.
13-*
Data Cleaning
There are two steps to cleaning transactional data in preparation for loading it into a data warehouse.
Identify the problem data.
Due to the massive volume of data, this is typically done using a program.
Fix it.
Can be handled by using sophisticated artificial intelligence programs or by creating exception reports for employees to scrutinize.
13-*
Good Reading Bookstores Before Data Cleaning
Errors in Customer:
Missing data - in row 1, city is blank.
Questionable data - the state for rows 2 & 6 should be the same.
13-*
Good Reading Bookstores Before Data Cleaning
Errors in Customer:
Possible Misspelling - do rows 3 & 8 refer to the same person?
Impossible Data - row 10s state “RP” is wrong.
13-*
Good Reading Bookstores Before Data Cleaning
Errors in SALE:
Questionable data - is the book quantity of 21 in row 2 correct?
Impossible/Out-of-Range Data - row 5 indicates that a single book costs $3,200.99.
13-*
Good Reading Bookstores Before Data Cleaning
Errors in SALE:
Apparently Incorrect Data - there is no customer number 12738, as stated in row 8.
Impossible Data - row 10 shows a negative price for a book, which is impossible.
13-*
Data Transformation
As the data is extracted from the transactional databases, it must go through several kinds of data transformations on its way to the data warehouse:
- Data from different transactional databases being merged to form the data warehouse tables.
- Data will often be aggregated as it is being extracted from the transactional databases and prepared for the data warehouse.
13-*
Data Transformation
Units of measure used for attributes in different transactional databases must be reconciled as they are being merged into common data warehouse tables.
Coding schemes used for attributes in different transactional databases must be reconciled as they are being merged into common data warehouse tables.
Sometimes values from different attributes in transactional databases are combined into a single attribute in the data warehouse (e.g., employee name).
13-*
Data Loading
- After all of the extracting, cleaning, and transforming, the data is ready to be loaded into the data warehouse.
- A schedule for regularly updating the data warehouse must be put in place.
13-*
Using a Data Warehouse
- Online analytic processing (OLAP)
- Data Mining
13-*
Online Analytic Processing
- A decision support methodology based on viewing data in multiple dimensions.
- There are many OLAP systems on the market today.
- The OLAP environment’s multidimensional data is very well suited for querying and for multi-time period trend analyses.
13-*
Online Analytic Processing
- Drill-Down
- Going back to the database and retrieving finer levels of data detail than you have already retrieved.
- Slice
- A subset of the data that focuses on a single value of one of the dimensions.
- Pivot or Rotation
- Merely a matter of interchanging the data dimensions.
13-*
Online Analytic Processing
- A slice of the patient data cube.
13-*
Data Mining
The searching out of hidden knowledge in the company’s data that can give the company a competitive advantage in its marketplace.
Due to the massive volume of data warehouse data, data mining must be done by software.
Case-based learning
Decision trees
Neural networks
Genetic algorithms
13-*
Data Mining Application: Market Based Analysis
Consider the data collected by a supermarket as it checks out its customers by scanning the bar codes on the products they’re purchasing.
The company might have software study the collected market baskets, each of which is literally the goods that a particular customer bought in one trip to the store.
The software might try to discover whether certain items “fall into” the same market basket more frequently than would otherwise be expected.
Then the items often bought in the same shopping trip can be placed next to each other in the store to remind someone buying one that they might also need the other.
13-*
Data Mining: Lucky Rent-A Car
13-*
A data mining application may look for patterns in the data.
Rows 2, 5, 8, and 11 all involve rentals of luxury class cars with high-cost (revenue to the company) figures.
Data Mining: Lucky Rent-A Car
13-*
If, as is the case here, these similar rentals were made by people with similar demographics, a “cluster”, then future marketing can concentrate on selling this product to people with these demographics.
Administering a Data Warehouse
- The data warehouse requires a serious level of management.
- Data warehouse administrator - personnel specialization in the management of the data warehouse.
- Three kinds of employee expertise is required:
- Business expertise
- Data expertise
- Technical expertise
13-*
Administering a Data Warehouse: Business Expertise
- An understanding of the company’s business processes that underlies an understanding of the company’s transactional data and databases.
- An understanding of the company’s business goals to help in determining what data should be stored in the data warehouse for eventual OLAP and data mining purposes.
13-*
Administering a Data Warehouse: Data Expertise
- An understanding of the company’s transactional data and databases for selection and integration into the data warehouse.
- An understanding of the company’s transactional data and databases to design and manage data cleaning and data transformation, as necessary.
- Familiarity with outside data sources for the acquisition of enrichment data.
13-*
Administering a Data Warehouse: Technical Expertise
- An understanding of data warehouse design principles for the initial design.
- An understanding of OLAP and data mining techniques so that the data warehouse design will properly support these processes.
13-*
Administering a Data Warehouse: Technical Expertise
- An understanding of the company’s transactional databases in order to manage or coordinate the regularly scheduled appending of new data to the data warehouse.
- An understanding of how to handle very large databases with their unique requirements for security, backup and recovery, being split across multiple disk devices, etc.
13-*
Challenges in Data Warehousing
- Data cleaning and finding more “dirty” data than expected.
- Problems associated with coordinating the regular appending of new data from the transactional databases to the data warehouse.
- Difficulties in managing very large databases.
- The challenge of building and maintaining the data dictionary.
13-*