Oracle Advanced person Needed
Oracle/FIT5195-2-Star Schema.pdf
Week 2 – Star Schema
Semester 1, 2020
FIT5195 – Business Intelligence and Data Warehousing
Developed by: Agnes Haryanto
Agnes.Haryanto@monash.edu
MONASH INFORMATION
TECHNOLOGY
Agenda
1. Notations and Processes
1. Star Schema Notation
2. E/R Diagram Notation
3. Transformation Process (Case Study)
2. Two-Column Table Methodology
Recall – The Big Picture
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
Recall – Data Warehouse
▪ To address the drawback of operational database, and a need for decision-
making support data, data warehouse is needed.
▪ A data warehouse is a multi-dimensional view of databases, with
aggregates and pre-computed summaries.
➢ In many ways, it is basically doing aggregates in advance; that is exactly pre-
computation done at the design level, rather than at the query level.
Recall – Data Warehouse
Star Schema
▪ A Star Schema is a design representation of a multi-dimensional view. It is a
data modeling technique used to map multidimensional decision support
data into a relational database.
▪ The reason for the star schema’s development is that existing relational
modeling techniques: ER and normalization, did not yield a database
structure that served the advanced data analysis requirements well.
Star Schema Components
▪ There are Three main components of the Star Schema:
1. Facts
2. Dimensions
3. Attributes
Star Schema Components
1. Facts
Facts are numeric measurements (values) that represent a specific business aspect
or activity.
For example, sales figures are numeric measurements that represent product and/or
service sales.
2. Dimensions
Dimensions are qualifying characteristics that provide additional perspectives to a
given fact.
For example, sales might be viewed from specific dimension(s), such as sales
location, sales period, sales product, etc.
Star Schema Notation
▪ A Sales Star Schema
➢ Fact:
• Sales
➢ Dimensions:
• Time
• Product
• Branch
▪ Notation-wise, the Fact uses a bolder line, to differentiate between Fact
from Dimensions.
Star Schema Notation
▪ A Sales Star Schema
➢ Fact:
• Sales
➢ Dimensions:
• Time
• Product
• Branch
▪ The lines that represent a relationship
between the fact and dimensions can be
straight lines or bended lines.
Star Schema Notation
▪ Using the star schema notation, the
number of dimensions can be unlimited.
▪ If there is more dimensions, then we just
add more dimensions linked to the Fact.
Star Schema Components
3. Attributes
Each dimension table contains attributes.
For example:
Product dimension: Prod Type,
Description.
Location dimension: Region,
State,
City.
Time dimension: Year,
Month.
Star Schema Notation
Star Schema Notation
▪ Sales Star Schema
(b) Sales Star Schema complete with the Attributes(a) Outline star schema for Sales
E/R Diagram Notation
E/R Diagram Notation
(a) An Entity in E/R Diagram
(b) Relationships in E/R Diagram
E/R Diagram Notation
E/R Diagram Notation
Associative
Relationship
E/R Diagram Notation
Non-Associative
Relationship
Transformation Process
Transformation Process
Transformation Process
Case Study #1
Case Study #1: International College
The admission office handles enrolment, payment, and marketing campaigns to international
students, often through educational agents located overseas. This admission office has an
operational system that maintains all the details of international students enrolled in the College.
Payment details are also handled by this office. Basically, the operational system has the following
features:
▪ Every student details are kept in the database. This includes the courses that the students enroll.
▪ As the College is a multi-campus university, some courses are offered in a different campus. The
admission office handles international students of all campuses.
▪ Some international students coming to the College are handled by an educational agent. This is
particularly common for the first course that a student enrolls in. Subsequent courses are not
normally handled by an agent, because the students themselves deal directly with the College.
▪ International students pay tuition fees several times (normally once every semester) for each
course they are doing.
Case Study #1: International College
The College now requires a data warehouse for analysis
purposes. The analysis is needed for identifying at least the
following questions:
1. How many students come from certain countries?
2. What is the total income for certain postgraduate
courses?
3. How many students are handled by certain agents?
4. How the number of enrolment of courses fluctuates
across the year?
Case Study #1: International College
▪ College Star Schema
➢ Fact:
• Number of Students
• Total Income
➢ Dimensions:
• Country
• Agent
• Course
• Year
Transformation Process
Case Study #2
Case Study #2: Sales
▪ Suppose that we would like to analyze
Total Sales from various point of
views, such as Quarter, Branch, and
Product Category.
Case Study #2: Sales
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Time
• Branch
• Product Category
Two Column Table Methodology
Two Column Table Methodology
When creating a star schema, you need to
imagine that the data you want to analyse
consists of two columns.
The first column is the category (e.g. A, C,
D, E), and the second column is the
statistical numerical figure (e.g. B).
The second column (e.g. B) has to be
consistent throughout all the two-column
tables.
One Fact Measurement:
Two Column Table Methodology
▪ Case Study 1: Analysis of Accountants
Suppose the CPA organization would like to analyze its members (i.e.
accountants) in a particular city. Assume that the organization has the full
details of its members.
Education Number of Accountants
Diploma 84953
Bachelor 349203
Higher Degree 98943
Others 2322
Two Column Table Methodology
▪ We can also look at the figures from the gender point of view, like:
▪ Another way to analyze number of accountants is form the type of the accountant job
itself; something like:
▪ Note that the figures are fictitious, and the “Types” of Accountants (indicating different
roles of accountants) are also fictitious.
Gender Number of Accountants
Male 434322
Female 89932
Type Number of Accountants
Government 3843
Private Business 45303
Personal 45930
etc
etc
Two Column Table Methodology
▪ You can further identify other example to analyze number of accountants. In
the above three tables, the first angle to look at the number of accountants
is from the educational background, the last one is from the type of the
accountant itself, whether it is a private business accountant, etc.
▪ As you can see, the second column is CONSISTENTLY UNIFORM. In the
above example, it is number of accountants. The first column changes
depending on from which angle that you want to see.
Two Column Table Methodology
▪ Therefore, in this case study, the star schema could look like the following:
Two Column Table Methodology
The second column in the two-column
tables, which is the numerical fact
measurement (e.g. column B) can
actually be multiple columns (call them:
B1, B2, B3), as long as all of these
columns (e.g. B1, B2, B3) relate to all of
the categories (e.g. A, C, D, E).
Multiple Fact Measurements:
Two Column Table Methodology
▪ Case Study 2: Student Enrollment
The University Administrator(s) needs to keep track of the number of
enrollment for particular unit or campus and the students’ performance
each year in order to maintain the University performance. The head of
admin has assigned you the task of developing a small Data Warehouse in
which to keep track the enrollment and performance statistics.
Two Column Table Methodology
▪ For example:
▪ Another example could be something like this:
Subject Number of Students Total Score
Database 8 539
Java 5 327
SAP 1 63
Network 2 105
Semester Number of Students Total Score
One 9 618
Two 7 416
F1 F2
F1 F2
Two Column Table Methodology
▪ In analyzing number of students (apart from the subject and semester as
shown above), you could also see the number of student from another
angle, for example from the campus and grade:
Campus Number of Students Total Score
Main 9 658
City 5 271
DE 2 105
F1 F2
Two Column Table Methodology
▪ For example:
Grade Number of Students Total Score
HD 3 253
D 4 300
C 4 256
P 2 105
N 3 120
F1 F2
Two Column Table Methodology
▪ The first columns of the above examples are the dimensions, whereas
the other columns that contain the statistical/summarized/aggregated
values is the fact.
▪ In the above example, the fact is then STUDENT_ENROLLMENT_FACT,
and the dimensions are SUBJECT, SEMESTER, GRADE and CAMPUS.
Two Column Table Methodology
▪ The star schema for the STUDENT ENROLLMENT is shown as follows:
End of Week 2 Lecture
Oracle/FIT5195-3-Bridge Table.pdf
Week 3 – Bridge Tables
Semester 1, 2020
FIT5195 – Business Intelligence and Data Warehousing
Developed by: Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH INFORMATION
TECHNOLOGY
Agenda
1. Bridge Tables
2. Temporary Tables
1. Temporary Dimension Tables
2. Temporary Tables in the Operational Database
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
Bridge Tables
Bridge Tables
▪ A bridge table is a table that links between two dimensions; and only one
of these two dimensions are linked to the fact.
➢ As a result, the star schema becomes a snowflake schema.
Bridge Tables
▪ Two reasons on why a dimension cannot be connected directly to the Fact: ▪
a) The Fact table has a fact measure, and the dimension has a key identity. In
order to connect a dimension to the Fact, the dimension’s key identity must
contribute directly to the calculation of the fact measure. Unfortunately, this
cannot happen if the operational database does not have this data.
b) The operational database does not have this data if the relationship between
two entities in the operational database that hold the information about
dimension’s key identity and the intended fact measure is a many-many
relationship.
Bridge Tables
Case Study #1
Case Study #1 – A Product Sales Case Study
▪ A company management team would like to
analyze the statistics of its product sales
history. The analysis is needed to identify
popular products, suppliers supplying those
products, the best time to purchase more
stock, etc.
▪ A small data warehouse is to be built to keep
track of the statistics.
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
Case Study #1 – A Product Sales Case Study
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
Case Study #1 – A Product Sales Case Study
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Product
• Customer locations/suburbs
• Time period
• Supplier
▪ Possible Two-Column Methodology Tables:
Case Study #1 – A Product Sales Case Study
ProductNo TotalSales
A1 $130,000
B2 $15,900
C3 $2,500,000
… …
TimeID TotalSales
201801 $25,000
201802 $4,700
201803 $3,500
… …
Suburb TotalSales
Caulfield $6,500
Chadstone $12,000
Clayton $1,800
… …
(a) Product point of view (b) Time point of view (c) Suburb point of view
Case Study #1 – A Product Sales Case Study
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Product
• Customer locations/suburbs
• Time period
• Supplier
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Supplier point of view
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Supplier point of view
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
Bridge Table
▪ To create Time Dimension: - create table TimeDim as
select
distinct to_char(SalesDate, ’YYYYMM’) as TimeID,
to_char(SalesDate, ’YYYY’) as Year,
to_char(SalesDate, ’MM’) as Month
from Sales;
▪ To create Customer Location Dimension: - create table CustLocDim as
select distinct Suburb, Postcode
from Customer;
Case Study #1 – A Product Sales Case Study
▪ To create Product Dimension: - create table ProductDim as
select distinct ProductNo, ProductName
from Product;
▪ To create Bridge Table: - create table ProductSupplierBridge as
select *
from StockSupplier;
▪ To create Supplier Dimension: - create table SupplierDim as
select SupplierID, Name as SupplierName
from Supplier;
Case Study #1 – A Product Sales Case Study
▪ To create Fact Table: - create table ProductSalesFact as
Select
to_char(S.SalesDate, ’YYYYMM’) as TimeID,
P.ProductNo,
C.Suburb,
sum(SI.QtySold*P.Price) as TotalSales
from Sales S, Product P, Customer C, SalesItem SI
where S.SalesNo = SI.SalesNo
and SI.ProductNo= P.ProductNo
and C.CustomerID = S.CustomerID
group by
to_char(S.SalesDate, ’YYYYMM’), P.ProductNo, C.Suburb;
Case Study #1 – A Product Sales Case Study
Bridge Tables
Case Study #2
Case Study #2 – A Truck Delivery Case Study
▪ A trucking company is responsible for picking up goods from warehouses of a retail chain
company, and delivering the goods to individual retail stores.
▪ A truck carry goods during a single trip, which is
identified by TripID, and delivers these goods to
multiple stores. Trucks have different capacities
for both the volumes they can hold and the
weights they can carry.
▪ At the moment, a truck makes several trips each
week. An operational database is being used to
keep track the deliveries, including the scheduling
of trucks, which provide timely deliveries to
stores.
Case Study #2 – A Truck Delivery Case Study
▪ A trip may pick up goods from many
warehouses
o i.e. a many-many relationship between
Warehouse and Trip
▪ A trip uses one truck only, and a truck may
have many trips in the history
o i.e. a many-1 relationship between Trip and
Truck
▪ A trip delivers goods (e.g. TVs, fridges, etc)
potentially to several stores
o a many-many relationship between Trip and
Store, which is represented by the Destination
table
▪ Sample data in the operational database:
Case Study #2 – A Truck Delivery Case Study
WarehouseID Location
W1 Warehouse1
W2 Warehouse1
W3 Warehouse1
… …
(d) Truck Table
(b) Trip Table (c) TripFrom Table
TripID Date TotalKm TruckID
Trip1 14-Apr-2018 370 Truck1
Trip2 14-Apr-2018 570 Truck2
Trip3 14-Apr-2018 250 Truck3
Trip4 15-Jul-2018 450 Truck1
… … … …
TripID WarehouseID
Trip1 W1
Trip1 W2
Trip1 W3
Trip2 W1
Trip2 W2
… …
TruckID VolCapacity WeightCategory CostPerKm
Truck1 250 Medium $1.20
Truck2 300 Medium $1.50
Truck3 100 Small $0.80
Truck4 550 Large $2.30
Truck5 650 Large $2.50
… … … …
StoreID StoreName Address
M1 MyStore City Melbourne
M2 MyStore Chaddy Chadstone
M3 MyStore HiPoint High Point
M4 MyStore Donc Doncaster
M5 MyStore North Northland
M6 MyStore South Southland
M7 MyStore East Eastland
M8 MyStore Knox Knox
… …
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(a) Warehouse Table
(e) Store Table (f) Destination Table
Case Study #2 – A Truck Delivery Case Study
▪ The management of this trucking company would like to analyze the deliver cost, based on
trucks, time period, and store.
Case Study #2 – A Truck Delivery Case Study
▪ Sales Star Schema
➢ Fact:
• Total Delivery Cost
(distance * cost per kilometre)
➢ Dimensions:
• Truck
• Time period
• Store
Case Study #2 – A Truck Delivery Case Study
▪ From the Truck point of view, Truck1 has two trips (e.g. Trip1 and Trip4), with the total kilometres of
820km (370km + 450km). The cost for Truck1 is $1.20. Hence, calculating the cost for Truck1 is
straightforward. Other trucks can be calculated this way.
▪ From the Period point of view, 14-Apr-2018 has three trips (e.g. Trip1,Trip2, and Trip3). Trip1 (370km) is
delivered by Truck1 which costs $1.20/km. Trip2 and Trip 3, on the same day, can be calculated the same
way. Hence, on 14-Apr-2018, the total cost can be calculated.
▪ From the Store point of view; The cost is calculated
based on Trip, but a trip delivers goods to many
stores. Therefore, the delivery cost for each store
cannot be calculated. The delivery cost is for the trip –
not for the store.
Case Study #2 – A Truck Delivery Case Study
Solution Model 1 – Using a Bridge Table
Case Study #2 – A Truck Delivery Case Study
Solution Model 2 – add a Weight Factor attribute
A weight factor is only needed if we want to estimate
the contribution that a dimension made to the fact
Case Study #2 – A Truck Delivery Case Study
StoreID StoreName Address
M1 MyStore City Melbourne
M2 MyStore Chaddy Chadstone
M3 MyStore HiPoint High Point
M4 MyStore Donc Doncaster
M5 MyStore North Northland
M6 MyStore South Southland
M7 MyStore East Eastland
M8 MyStore Knox Knox
… …
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(c) Store Table(b) Bridge Table
(a) Trip Dimension Table
▪ To create Trip Dimension: - create table TripDim2 as
select T.TripID, T.TripDate, T.TotalKm,
1.0/count(*) as WeightFactor
from Trip T, Destination D
where T.TripID = D.TripID
group by T.tripid, T.tripdate, T.totalkm;
Case Study #2 – A Truck Delivery Case Study
Case Study #2 – A Truck Delivery Case Study
Solution Model 3 – a List Aggregate version
Case Study #2 – A Truck Delivery Case Study
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(b) Bridge Table
(a) Trip Dimension Table
listagg (Attr1, ‘_’) within group
(order by Attr1) as ColumnName
▪ To create Trip Dimension: - create table TripDim3 as
select T.TripID, T.TripDate, T.TotalKm,
1.0/count(D.StoreID) as WeightFactor,
listagg (D.StoreID, ’_’) within group
(order by D.StoreID) as StoreGroupList
from Trip T, Destination D
where T.TripID = D.TripID
group by T.TripID, T.TripDate, T.TotalKm;
Case Study #2 – A Truck Delivery Case Study
▪ Joining based on the StoreGroupList attribute in the Trip dimension table and the StoreID in the Store dimension table:
- select *
from TripDim3 T, StoreDim3 S
where T.StoreGroupList like ’%’||S.StoreID||’%’;
▪ Without the StoreGroupList attribute in the Trip dimension, we need to join three tables:
- select *
from TripDim3 T, BridgeTable3 B, StoreDim3 S
where T.TripID = B.TripID
and B.StoreID = S.StoreID;
Case Study #2 – A Truck Delivery Case Study
Bridge Tables
Summary
Bridge Tables – Summary
▪ In principal, a Bridge Table is used: ▪
a) When it is impossible to have a dimension connected directed to the Fact table,
because simply there is no relationship between this dimension and the Fact table
(e.g. in the Product Sales case study, it is impossible to have a direct link from
SupplierDim to ProductSalesFact)
b) When an entity (which will become a dimension) has a many-many relationship with
another entity (dimension) in the E/R schema of the operational database (e.g.
Supplier and Stock has a many-many relationship).
c) When temporality aspect (data history) is maintained in the operational database
and the bridge table can be used to accommodate the dimension that has temporal
attributes (e.g. product supply history is maintained in the second snowflake schema
example).
Bridge Tables – Summary
▪ When a Bridge Table is used in the schema, there are two additional options: ▪
a) A Weight Factor is used to estimate the contribution of a dimension in the calculation
of the fact measure. Because this is only an estimate, a weight factor is option.
b) Every snowflake schema (whether it has Weight Factor or not) can be implemented
in two ways: a List Aggregate version, and a non-List Aggregate version.
Temporary Dimension Tables
43
1. Direct copying from a relational table in the operational database
2. Selecting certain attributes (or records) from a relational table in the
operational database
3. Creating the dimension table from scratch
– Insert records into the newly created dimension table
– Suitable for small dimension table
– Temporary dimension table is needed in circumstances where a new
dimension table cannot be created directly
Ways to Create Dimension Tables
44
Temporary Dimension Tables - Sales Case Study
An E/R Diagram for the
Sales Case Study
A Star Schema for the Sales Case Study
QuarterID - ‘YYYYQ’ is a unique
combination between year and quarter,
where Q is between 1 and 4
45
▪ TimeDIM has to be created manually
▪ Questions:
Can we insert the records into TimeDIM
one by one?
▪ If yes, how many records to be
inserted?
▪ If no, why?
Temporary Dimension Tables - Sales Case Study (cont.)
46
Solution for TimeDIM:
1. Create a temporary Time Dimension table that contains Month and Year
2. Add with a new attribute, called QuarterID.
3. Perform a series of updates to fill in the QuarterID column with the correct
values
4. Create the final TimeDIM table
Temporary Dimension Tables - Sales Case Study (cont.)
47
Temporary Dimension Tables - Sales Case Study (cont.)
1
2
3
Exercise:
Construct the SQL command for Step 4
48
Temporary fact table is required
because TimeDIM is not directly copied
from the operational database
Temporary Fact Tables - Sales Case Study
1
49
Temporary Fact Tables - Sales Case Study (cont.)
2 3
4
50
Temporary Tables in Operational Database
Total_Sales = SUM(TotalPrice)
An E/R Diagram for the
Sales Case Study
A Star Schema for the Sales Case Study
51
Temporary Tables in Operational Database - Case Study
An E/R Diagram for the
Sessional Contract Jobs
52
Temporary Tables in Operational Database - Case Study (cont.)
A Star Schema for the
Sessional Contract Jobs
Last Degree of the
Employee
53
▪ One employee may have multiple
degrees.
▪ Most of the work requires a Bachelor
degree, to allow current Master
students to work as a teaching
assistant or as a programmer
▪ Current PhD students who have a
Master degree often work as a
teaching assistant too
▪ BUT, only the last degree is required
in DigreeDIM table.
Temporary Tables in Operational Database - Case Study (cont.)
54
Temporary Tables - EmployeeTemp Table
Retrieves the last degree of the
employees
55
Sample Records for Employee and Emp_Degree Table
Employee
Emp_Degree
56
RANK() OVER Function
Retrieved for
EmployeeTemp
table.
57
Create Fact Table from EmployeeTemp Table
58
Temporary Dimension table
Needed when it is not possible to create a dimension table directly; need to
apply a series of additional steps when creating a dimension table through the
creation of an intermediate table
Temporary table in operational database
Tables in the operational database are not readily used to create the Fact table.
Need transformation phase, where an intermediate table, called the Temporary
Operational Database Table is created.
Summary
Oracle/FIT5195-4-Snowflake Schema.pdf
Week 4 – Snowflake Schema
Semester 1, 2020
FIT5195 – Business Intelligence and Data Warehousing
Developed by: Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH INFORMATION
TECHNOLOGY
Agenda
1. Hierarchies
1. Hierarchy vs. Non-Hierarchy
2. Hierarchy versus Multiple Independent Dimensions
3. Linked Dimensions
4. Hierarchy Design Considerations
2. Determinant Dimensions
1. Determinant vs. Non-Determinant Dimensions
2. Determinant Dimension vs. Pivoted Fact Table
3. Non-Type as a Determinant Dimension
4. Multiple Relationship between a Dimension and the Fact
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
Hierarchies
Hierarchies
▪ A Hierarchy is formed when a dimension is broken down to two or
more dimensions in a hierarchical manner.
➢ As a result, the star schema becomes a snowflake schema.
(a) A simple Star Schema
(b) A Snowflake with Hierarchy
Hierarchies
(a) A Snowflake with a Bridge Table (b) A Snowflake with Hierarchy
Hierarchies
1. Hierarchy versus Non-Hierarchy Dimensions
2. Hierarchy versus Multiple Independent Dimensions
3. Linked Dimensions
4. Hierarchy Design Considerations
1. Hierarchy vs. Non-Hierarchy
1. Hierarchy vs. Non-Hierarchy
1. Hierarchy vs. Non-Hierarchy
HierarchyNon-Hierarchy
1. Hierarchy vs. Non-Hierarchy
CityID City Name State CountryID
MEL Melbourne Victoria AU
SYD Sydney New South Wales AU
KUL Kuala Lumpur Selangor MA
JNB Johannesburg Johannesburg SA
CampusID CampusName Address Suburb Postcode CityID
CL Clayton Campus Wellington Rd Clayton 3800MEL
CA Caulfield Campus Dandenong Rd Caulfield East 3145MEL
PA Parkville Campus Royal Parade Parkville 3052MEL
SY Sydney Campus Opera Boulevard Sydney 2001SYD
MUM Malaysia Campus Jalan Lagoon Bandar Sunway 47500KUL
MSA South Africa Campus Peter St Johannesburg 1725JNB
(a) Campus Dimension Table
CountryID Country Name
AU Australia
MA Malaysia
SA South Africa
(b) City Dimension Table (c) Country Dimension Table
1. Hierarchy vs. Non-Hierarchy
(a) Correct Hierarchy (b) A Wrong Hierarchy
1. Hierarchy vs. Non-Hierarchy
1. One table vs. many tables
The consequence is in the join query processing when producing reports.
2. Normalized vs. Un-normalized
With the hierarchy option, the tables are normalized, which follows the relational
model. In contrast, the non-hierarchy option, the table (e.g. the Campus
Dimension table) is unnormalized, which is basically 1NF with visible replication
on information, which is prone to anomalies (e.g. insert, update, and delete
anomalies).
3. Drilling down and rolling up
The hierarchy model is not about drilling down information exploration.
1. Hierarchy vs. Non-Hierarchy – Summary
▪ From the query point of view, both versions need two queries.
▪ From the query processing point of view, the non-hierarchy version uses
one join operation only, because it only needs to join the fact and one
dimension.
▪ From the conceptual point of view, the hierarchy model does not actually
offer a better roll up or drill down features.
2. Hierarchy vs. Multiple Independent Dimensions
2. Hierarchy vs. Multiple Independent Dimensions
(a) Separate Dimensions (b) Combined Dimensions
(c) Hierarchy Dimensions
2.1. Separate vs Combined Dimension
(a) Separate Dimensions (b) Combined Dimensions
2.1. Separate vs Combined Dimension
(a) Separate Dimension Model – Campus Dimension Table
(b) Separate Dimension Model –
Country Dimension Table
(c) Separate Dimension Model – Fact Table
2.1. Separate vs Combined Dimension
(d) Combined Dimension Model – Campus Dimension Table
(e) Combined Dimension Model
– Fact Table
2.1. Separate vs Combined Dimension (c) Separate Dimension Model – Fact Table (e) Combined Dimension Model – Fact Table
2.1. Separate vs Combined Dimension
(a) Separate Dimension Model – Campus Dimension Table (b) Separate Dimension Model –
Country Dimension Table
(d) Combined Dimension Model – Campus Dimension Table
2.1. Separate vs Combined Dimension
(a) Separate Dimensions (b) Combined Dimensions
2.2. Combined Dimension vs. Hierarchy
(b) Combined Dimensions (c) Hierarchy Dimensions
2.2. Combined Dimension vs. Hierarchy
(a) Hierarchy Model – Campus Dimension Table
(c) Hierarchy Model – Fact Table
(b) Hierarchy Model – Country Dimension Table
2.2. Combined Dimension vs. Hierarchy
▪ Reasons on why the Hierarchy model is not ideal:
➢ An efficient query processing to query the fact and dimensions,
➢ The identifier of the Campus Dimension (e.g. CampusID) already covers the
child dimension (e.g. Country Dimension), and
➢ Campus and Country information is often regarded as one entity; at least in this
case study.
(b) Combined Dimensions (c) Hierarchy Dimensions
3. Linked Dimensions
3. Linked Dimensions
(a) Linked Dimension Model – Campus Dimension Table
(c) Linked Dimension Model – Fact Table
(b) Linked Dimension Model – Country Dimension Table
4. Hierarchy Design Considerations
Hierarchy – Summary
▪ A dimension hierarchy is connecting two or more dimensions in a hierarchical
manner, using a many-1 relationship.
➢ As a result, the dimensions in a hierarchy are normalized, in 3NF, using the
context of Relational Database Design.
▪ Compare and contrasts five different models:
1. Separate Dimension model
2. Combined Dimension model
3. Hierarchy model
4. Linked Dimension model
5. Hierarchy Design Considerations
Determinant Dimensions
31
Dimensions and Fact Tables - Recap
A Star Schema for the Sales Case Study
32
Example #1:
Querying Fact Table
33
Example #2:
Querying Fact Table (cont.)
34
A star schema may have a special dimension
all data retrieval from this star schema must use this dimension
else, the retrieved data will become meaningless.
Determinant Dimensions
Determinant
_DIM
Determinant_
Att
..
..
Special dimension
Key identifier
35
Introducing a Determinant Dimension – Petrol Station Case Study
Sample operational database of a company about petrol prices from all petrols
stations in Victoria:
36
The requirements for the data warehouse
are to answer the questions related to
(a) average petrol price,
(b) minimum petrol price, and
(c) maximum petrol price
Petrol Station - Star Schema Version-1
Chosen Dimensions:
(i) day of week,
(ii) suburb, and
(iii) petrol company.
Two-column Table for Petrol Prices based on Day of Week
Category Fact Measures
37
Minimum price for which petrol type?
Petrol Station - Star Schema Version-1 (cont.)
38
▪ Not all petrol stations sell all types of petrol, e.g. Premium 98.
▪ Calculating the average price for Premium 98 on Monday cannot be:
Total_Petrol_Price_P98 / Number_of_Petrol_Station
▪ The number of petrol stations is different for different petrol type!
Petrol Station - Star Schema Version-1 (cont.)
39
▪ Solution: Split the Number of Petrol Station column into six columns – one
for each petrol type.
▪ Question: How many columns in total?
Petrol Station - Star Schema Version-1 (cont.)
Two-column
Table for Petrol
Prices based
on Suburb Two-column Table for Petrol Prices based on Company
40
Petrol Station - Star Schema Version-1 (cont.)
41
Petrol Station - Star Schema Version-2
42
Petrol Station - Star Schema Version-2
Min. price for
which petrol
type??
43
select DayofWeek,
Min_Petrol_Price
from PetrolFact
where DayofWeek = ’Monday’
and PetrolType = ’Unleaded’;
Petrol Station - Star Schema Version-2
select DayofWeek, PetrolType,
Min_Petrol_Price
from PetrolFact
where DayofWeek = ’Monday’;
44
Petrol Station - Star Schema Version-2
Determinant Dimension uses dotted box in the star schema.
The new star schema with a Determinant Dimension for Petrol Case Study
Determinant vs. Non-
Determinant
Dimensions
46
The Olympic Games Case Study
▪ The Olympic Games committee maintains an operational database that
stores all matches, games, as well as the medal winners of the Olympic
Games over the years
▪ Let’s build a data warehouse to analyse the medal counts, by each country,
sport, and at which Olympic Games
Determinant vs. Non-Determinant Dimensions
47
The Olympic Games Case Study - The Star Schemas
Version-1
Version-2
The questions are:
1. What is the difference
between these two
versions of star schema?
2. Is Medal Type Dimension
a Determinant
Dimension?
48
The Olympic Games Case Study - The Star Schema Version-1
Two-column tables
(pivot tables) for each
dimension category in
star schema version-1
49
The Olympic Games Case Study - The Star Schema Version-2
50
▪ Is MedalTypeDIM a determinant dimension?
– To retrieve the data from version-2 star schema, must we have the
information from Medal Type Dimension?
Determinant vs. Non-Determinant Dimensions
Version-2
COUNT = total medals from
other dimensions, without the medal type dimension)
51
Fact Table for Version-1
The Olympic Games Case Study - The Star Schema Version-1 vs Version-2
Fact Table for Version-2
52
The Olympic Games Case Study - The Star Schema Version-1 vs Version-2
Differenc
es
Version 1 Version 2 (with
MedalTypeDIM)
Storage Lower with only three records Nine records
Modelling The schema looks more
complex and crowded; but the
storage cost is lower
More concise and compact with
less number of measures in the
fact, easier to understand
Query
Processin
g
Less join processing between
dimension tables and the fact
table.
Requires additional join with
MedalTypeDIM
Determinant
Dimensions vs.
Pivoted Fact Table
54
The PTE Academic Test Case Study ▪ PTE Test consists of four main components: Listening, Reading, Speaking, and
Writing
▪ There are many test venues (all major cities in Australia, and overseas as well), each
country may have a different price setting
▪ A student taking a PTE test will have one score for each of these four components,
as well as one overall score
▪ The score is a numerical score, ranging from 10 to 90
▪ The results will normally be released five days after the actual test date
▪ A student may take a PTE Academic Test several times, in order to improve the
score
▪ A student is identified by his/her RegistrationID, which stays the same student stays
the same
Determinant Dimensions vs. Pivoted Fact Table
55
ERD for PTE Academic Test
56
The PTE Academic Test Case Study - Sample Records for 8 Students
57
The PTE Academic Test Case Study - Sample Records for 8 Students
58
The PTE Academic Test Case Study - Star Schema with Determinant Dim.
59
To implement the star schema in SQL:
create table CountryVenueDim
as
select distinct C.CountryCode, C.CountryName, T.TestPrice
from Test_Venue T, Country C
where T.CountryCode = C.CountryCode;
create table CitizenshipDim
as
select distinct C.CountryCode as Citizenship, C.CountryName
from Student S, Country C
where S.Citizenship = C.CountryCode;
create table YearDim
as
select distinct to_char(TestDate, ’YYYY’) as Year from Test;
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
60
create table GradeDim (
Grade varchar2(3),
Description varchar2(20),
MinScore number,
MaxScore number
);
insert into GradeDim values (’4.5’, ’Functional’, 30, 35);
insert into GradeDim values (’5’, ’Vocational’, 36, 49);
insert into GradeDim values (’6’, ’Competent’, 50, 64);
insert into GradeDim values (’7’, ’Proficient’, 65, 78);
insert into GradeDim values (’8-9’, ’Superior’, 79, 90);
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
Superior
61
create table TestComponentDim (
TestComponent varchar2(20)
);
insert into TestComponentDim values
(’Listening’);
insert into TestComponentDim values
(’Reading’);
insert into TestComponentDim values
(’Writing’);
insert into TestComponentDim values
(’Speaking’);
insert into TestComponentDim values
(’Overall’);
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
Create temporary fact table:
create table TempFact
as
select TV.CountryCode, S.Citizenship,
to_char(T.TestDate, ’YYYY’) as Year,
TR.ListeningScore, TR.ReadingScore,
TR.WritingScore, TR.SpeakingScore,
TR.OverallScore, TR.RegistrationID
from Test_Venue TV, Test T, Student S,
Test_Result TR
where TV.VenueID = T.VenueID
and T.TestNo = TR.TestNo
and TR.RegistrationID = S.RegistrationID;
62
alter table TempFact
add (
GradeOverall varchar2(3),
GradeListening varchar2(3),
GradeReading varchar2(3),
GradeWriting varchar2(3),
GradeSpeaking varchar2(3)
);
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
update TempFact
set GradeOverall = (
case
when OverallScore >= 30 and
OverallScore <= 35 then ’4.5’
when OverallScore >= 36 and
OverallScore <= 49 then ’5’
when OverallScore >= 50 and
OverallScore <= 64 then ’6’
when OverallScore >= 65 and
OverallScore <= 78 then ’7’
when OverallScore >= 79 and
OverallScore <= 90 then ’8-9’
end);
Repeat for
GradeListening, GradeReading,
GradeWriting, and GradeSpeaking
Superior
63
▪ The TempFact table has the correct grades for each student.
▪ The problem now is to breakdown the grades for a student into multiple test
components – Create another temporary fact for each of the test components, eg. a temporary
fact called ’OverallFact’ for the Overall Score, and so on so forth
create table OverallFact
as
select CountryCode, Citizenship, Year,
GradeOverall As Grade, ’Overall’ as TestComponent,
count(RegistrationID) as Total_Students_Overall
from TempFact
group by CountryCode, Citizenship, Year, GradeOverall, ’Overall’;
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
Repeat for
GradeListening, GradeReading, GradeWriting, and GradeSpeaking
64
▪ There were eleven students taking the test which are now grouped into nine
records in Table OverallFact
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
65
▪ There are also three other fact tables: ReadingFact, WritingFact, and
SpeakingFact, with identical table structures
▪ These five fact tables can now be ’combined’ to form one final fact table
using a union operator: create table FinalFact as
select CountryCode, Citizenship, Year, Grade, TestComponent,
Total_Students_Overall as Total_Students
from OverallFact
union select * from ListeningFact
union select * from ReadingFact
union select * from WritingFact
union select * from SpeakingFact;
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
66
▪ The final fact table contains 45
records
▪ Conclusion: – When the input record in the
operational database contains the
scores of all the test components
as one record, it will be
challenging to break down the
records into multiple records in
the final fact table, because there
is a Test Component dimension.
The PTE Academic Test - Star Schema with Determinant Dim. (cont.)
67
The PTE Academic Test - Star Schema without Determinant Dim.
● Fact measure for all the test components
● This means that the fact table is a Pivoted Fact Table ○ Shifts the Test Component into the fact measure;
○ The Determinant Dimension - TestComponentDim is then removed from the star schema
68
▪ How many records do we have in fact table with five fact measures?
▪ However, since our test result data is incomplete, we will have problems
with the final fact, because in the non-determinant dimension version, we
need to keep track the zeroes – Even if the data is the test result is a lot, but still does not cover all possible
combinations from the dimension!
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
Answer:
60 records in the final fact table
= 2 countries of test venues * 6 countries of citizenship * 1 year * 5 grades
69
▪ ‘Superior’ grade is only available for Listening and Reading scores
▪ Using the normal fact table, there will not be entry in the fact table for Superior-Speaking, and for
Superior-Writing;
▪ There will not be zero value in TotalStudents, where the Grade column is ’Superior’ and the Test
Component column is either ’Speaking’ or ’Writing’ → less than 60 rows of records in the fact table
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
Superior
70
Solution: Step1
Get all possible combinations from all dimensions using Cartesian product
between all dimensions:
create table AllDimensions
as
select CO.CountryCode, CI.Citizenship, Y.Year, G.Grade
from CountryVenueDim CO,CitizenshipDim CI,
YearDim Y, GradeDim G;
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
71
Solution: Step 2
Re-create five temporary fact tables, one for each test component, using Outer
Join operation between AllDimensions table and each of the temporary fact
tables created in the previous section (see slide #35).
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
create table OverallFactNew as
select A.CountryCode, A.Citizenship,
A.Year, A.Grade,
nvl(O.Total_Students_Overall, 0)
as Total_Students_Overall
from AllDimensions A, OverallFact O
where A.CountryCode = O.CountryCode(+)
and A.Citizenship = O.Citizenship(+)
and A.Year = O.Year(+)
and A.Grade = O.Grade(+);
create table ListeningFactNew as
select A.CountryCode, A.Citizenship,
A.Year, A.Grade,
nvl(O.Total_Students_Listening, 0)
as Total_Students_Listening
from AllDimensions A, ListeningFact O where
A.CountryCode = O.CountryCode(+)
and A.Citizenship = O.Citizenship(+)
and A.Year = O.Year(+)
and A.Grade = O.Grade(+);
create table ReadingFactNew as select ...;
create table WritingFactNew as select ...;
create table SpeakingFactNew as select ...;
72
Solution: Step 3
Create the final fact table using the join operation
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
create table FinalFact2 as
select O.CountryCode, O.Citizenship, O.Year, O.Grade, O.Total_Students_Overall, L.Total_Students_Listening,
R.Total_Students_Reading, W.Total_Students_Writing, S.Total_Students_Speaking
from OverallFactNew O, ListeningFactNew L, ReadingFactNew R, WritingFactNew W, SpeakingFactNew S
where O.CountryCode = L.CountryCode
and L.CountryCode = R.CountryCode and R.CountryCode = W.CountryCode
and W.CountryCode = S.CountryCode and O.Citizenship = L.Citizenship
and L.Citizenship = R.Citizenship and R.Citizenship = W.Citizenship and W.Citizenship = S.Citizenship
and O.Year = L.Year and L.Year = R.Year and R.Year = W.Year and W.Year = S.Year
and O.Grade = L.Grade and L.Grade = R.Grade and R.Grade = W.Grade and W.Grade = S.Grade;
73
Solution: Step 4
Because our sample data is small, there are many records with zeroes values in
the Total Students columns. Delete records where Total Students in all of the
five test components are equal to zero.
The PTE Academic Test - Star Schema without Determinant Dim. (cont.)
delete from FinalFact2
where Total_Students_Overall = 0
and Total_Students_Listening = 0
and Total_Students_Reading = 0
and Total_Students_Writing = 0
and Total_Students_Speaking = 0;
Non-Type as a
Determinant
Dimension
75
▪ Type Dimension maybe a candidate for a Determinant Dimension:
– PetroTypeDim is a Determinant Dimension in the Petrol Price case
study,
– MedalTypeDim in the Olympic Games case study is a not a
Determinant Dimension
▪ Type Dimension is a good candidate for a Determinant Dimension, but
needs further examination to check if it is really a Determinant Dimension
▪ Other dimensions - non-type dimension can potentially be a Determinant
Dimension
Non-Type as a Determinant Dimension
76
▪ The Class Type Dimension in
this example is not a
Determinant Dimension
because we could get Number
of Students without referring to
any of the class types. ▪ E.g. retrieving number of
students in the Science
Faculty last year is certainly a
useful information, even
without Class Type
information.
Non-Type as a Determinant Dimension – University Enrolment Case Study
Type Dimension
77
Total student of
Science Faculty
Non-Type as a Determinant Dimension – University Enrolment Case Study
Total student of
Science Faculty
last year
vs.
The information is
dependent on the YearDIM
- determinant dimension
78
Converting a
Determinant Dimension
to a Pivoted Fact Table
Non-Type as a Determinant Dimension – University Enrolment Case Study
79
Question 1:
What if the number of records in the Determinant Dimension is huge?
– Number of fact measures increased to 20, 50?
– May not be practical
Solution:
Enforce Determinant Dimension through the User-Interface level ▪ Users are forced to enter/choose a criteria from the Determinant
Dimension.
▪ In filling a web-based form, this enforcement is normally indicated by a
star (*) next to the text entry to indicate that this text entry must be filled
in, before the form can proceed.
Pivoted Fact Table for University Enrolment Case Study
80
Question 2:
What if there are a lot of attributes in the Determinant Dimension? – If we shift the Determinant Dimension to the fact measures, only the key
identifier is kept in the fact measure.
– All other attributes from the Determinant Dimension will be lost from the star
schema. This issue will be discussed in the next section.
Pivoted Fact Table for University Enrolment Case Study (cont.)
Multiple Relationship
between a Dimension
and the Fact
82
▪ Private Taxi Case Study
– The dimension (or a Determinant Dimension) to be shifted to the fact
measure (e.g. Pivoted Fact Table) has many attributes.
– If only the key identifier of the dimension (or the Determinant
Dimension) is used as the fact measure, all other attributes of the
dimension (or the Determinant Dimension) are no longer exists in the
star schema
Multiple Relationship between a Dimension and the Fact
83
▪ Assume that
– The business rules stated that there is no Determinant Dimension. Even without a
Determinant Dimension, we could potentially shift a dimension to become fact measures
– The company is a small private taxi company, has only five cars
▪ In the fact table, instead of having three fact measures, for each fact measure, we could
replicate five times – one for each car.
Private Taxi Case Study
Replicate for each
car ===> 15 fact
measures
Remove this
CarDIM?
84
▪ CarDIM should not be removed – Other details of the car are
included in the data retrieval
▪ Because Car Dimension is not a
Determinant Dimension, we can
simply keep the original star
schema – Include only CarNo in the fact
table
– Use join operation to get more
details about car in query
Private Taxi Case Study (cont.)
select F.MonthYear, D.Make,
D.Model,
sum(F.Total_Kilometers)
from PrivateTaxiFact1 F,
CarDim D
where F.CarNo = D.CarNo
and F.CarNo = 5
and F.MonthYear = ’1908’
group by F.MonthYear, D.Make,
D.Model;
85
Insist that fact measures are broken into five cars
▪ Need to have five Car Dimension’s key
identifier in the fact
Private Taxi Case Study (cont.)
To retrieve Total Kilometers of five cars:
select F.MonthYear, D1.CarMake, D1.Year,
D2.CarMake, D2.Year, D3.CarMake, D3.Year,
D4.CarMake, D4.Year, D5.CarMake, D5.Year,
sum(Total_Kilometers)
from PrivateTaxiFact2 F, CarDim D1, CarDim
D2, CarDim D3, CarDim D4, CarDim D5
where F.CarNo1 = D1.CarNo
and F.CarNo2 = D2.CarNo
and F.CarNo3 = D3.CarNo
and F.CarNo4 = D4.CarNo
and F.CarNo5 = D5CarNo
and MonthYear = ’1908’
group by F.MonthYear, D1.CarMake, D1.Year,
D2.CarMake, D2.Year, D3.CarMake, D3.Year,
D4.CarMake, D4.Year, D5.CarMake, D5.Year;
86
▪ Although conceptually the dimension is already shifted to the fact measure,
we still need to keep the original dimension in the star schema.
▪ This is the only way to preserve all other attributes of that dimension.
▪ This relationship replication is applicable to shifting a Determinant
Dimension to the fact measure.
Multiple Relationship between a Dimension and the Fact
87
Two alternatives to enforce the implementation of a star schema that features a
Determinant Dimension
1. Enforce through the user-interface, whereby the user has to fill in or to
choose the Determinant Attribute during the search.
2. Shift the Determinant Dimension into a Pivoted Fact Table, whereby the key
identifier of the Determinant Dimension is incorporated into the fact
measure.
Determinant Dimension - Conclusion
Oracle/FIT5195-5-MoreComplexProcesses_update (1).pdf
Week 5 – More Complex Processes
Semester 1, 2020
FIT5195 – Business Intelligence and Data Warehousing
Developed by: Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH INFORMATION
TECHNOLOGY
Agenda
1. More Complex Processes
1. Use of count Function
2. Average in the Fact
3. Outer Join
2. One Attribute Dimensions
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
More Complex Processes
in Creating Fact
Recall – Star Schema Components
▪ There are Three main components of the Star Schema:
1. Facts
2. Dimensions
3. Attributes
Recall – Fact
▪ A Fact Table consists of key attributes from each dimension,
and fact measures.
▪ A Fact Table is created by a join operation, that joins several
tables from the operational database.
▪ Fact tables are created either through TempFact or directly
retrieval from the tables in the operational database.
▪ The fact measure itself is an aggregated value.
➢ In the SQL command, the fact measure attribute in the Fact Table is created using an aggregate function, such as count
or sum, and the group by operation.
Use of count Function
▪ The count function is one of the most common aggregate functions used to
create the fact measures.
▪ In SQL, there are three ways of using the count function:
1. count(*) – counts number of records in the query result
2. count(attribute) – counts number of records of that attribute excluding
null values
3. count(distinct attribute) – counts number of unique values in that
mentioned attribute
Case Study #1: Mobile Apps Repositories
Monalisa University is an international university. It has an online mobile app store: Monalisa App Store.
The Monalisa App Store allows students from any university in the world to publish their applications and
receive feedbacks. This app store is considered as a research environment where applications developed
during researches and studies can be tested and used by real users. The applications are free and open
source.
Basically, the operational system has the following features:
▪ Every user details are kept in the database, which includes the universities that the students enroll.
▪ Users can publish their apps or download other apps.
▪ Users can give feedbacks and ratings to other apps.
▪ The download and feedback statistics are stored in the database.
▪ Apps are organized into different categories; and many authors may have more than one app in
different categories.
Case Study #1: Mobile Apps Repositories
▪ A data warehouse is needed
analyze the ratings and feedbacks
by different authors and apps, so
that the staff at Monalisa University
can connect with the talented
authors and send them awards
annually.
▪ The author with the highest
average apps rating will be named
author of the year, and will receive
an award.
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table CategoryDim as
select * from Category;
Case Study #1: Mobile Apps Repositories
create table UniversityDim as
select distinct UniversityID,
UniversityName
from University;
Case Study #1: Mobile Apps Repositories
create table LocationDim as
select distinct
Country || City as LocationID,
City,
Country
from University;
Case Study #1: Mobile Apps Repositories
create table TimeDim as
select distinct
to_char(DownloadDate, ’YYYYMM’) as TimeID,
to_char(DownloadDate, ’MM’) as Month,
to_char(DownloadDate, ’YYYY’) as Year
from Download;
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table AppsDownloadFact as
select
to_char(D.DownloadDate, ’YYYYMM’) as DownloadMonth,
U.Country || U.City as LocationID,
A.CategoryID,
A.ApplicationID,
U.UniversityID,
count(*) as TotalDownloads
from University U, App_User R, Download D, Application A
where
U.UniversityID = R.UniversityID and
R.UserID = D.DownloaderID and
D.ApplicationID = A.ApplicationID
group by
to_char(D.DownloadDate, ’YYYYMM’),
U.Country || U.City,
A.CategoryID,
A.ApplicationID,
U.UniversityID;
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table TimeDim as
select distinct
to_char(CreationDate, ’YYYYMM’) as TimeID,
to_char(CreationDate, ’MM’) as Month,
to_char(CreationDate, ’YYYY’) as Year
from Application;
Case Study #1: Mobile Apps Repositories
create table AppsFact as
select
to_char(A.CreationDate, ’YYYYMM’) as CreationMonth,
U.Country || U.City as LocationID,
A.CategoryID,
A.ApplicationID,
U.UniversityID,
count(distinct ApplicationID) as TotalApps
from University U, App_User R, Download D, Application A
where
U.UniversityID = R.UniversityID and
R.UserID = D.DownloaderID and
D.ApplicationID = A.ApplicationID
group by
to_char(A.CreationDate, ’YYYYMM’),
U.Country || U.City,
A.CategoryID,
A.ApplicationID,
U.UniversityID;
Average in the Fact
Average in the Fact Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
The operational database
contains:
• 9 records of Semester one
• 7 records of Semester two
Average in the Fact Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
The operational database
contains:
• 9 records of Semester one
• 7 records of Semester two
• 8 records of Database Unit
(6 Semester one and 2
Semester two)
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact (b) Subject Dimension
(c) Semester Dimension
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact (b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact (b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
• Semester two: (64+32)/2 = 48
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact (b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
• Semester two: (64+32)/2 = 48
These are actually
incorrect!
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact (b) Subject Dimension
(c) Semester Dimension
The SQL command: select avg(Average_Score)
from EnrolmentFact
where UnitCode = ’IT001’;
Query: Calculate Average Score for the Database Unit:
• (73.833 + 48) / 2 = 60.9165
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database Query: Calculate Average Score for the Database Unit.
Calculation using Fact:
• (73.833 + 48) / 2 = 60.9165
Calculation based on the Operational Database:
• (81+41+74+85+87+75+64+32) / 8 =
539 / 8 = 67.375
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database Query: Calculate Average Score for the Java Unit in both Semesters.
Calculation using Fact:
• (56+71.667) / 2 = 63.833
Calculation based on the Operational Database: • (65+47+78+73+64) / 5 = 65.4
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database Query: Calculate Average Score for Semester One.
Calculation using Fact:
• (73.833+56+63) / 3 = 64.287
Calculation based on the Operational Database: • (81+41+74+85+87+75+65+47+63) / 9 = 68.667
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database Query: Calculate Average Score for Semester Two.
Calculation using Fact:
• (48+71.667+52.5) / 3 = 57.389
Calculation based on the Operational Database: • (64+32+78+73+64+53+53) / 7 = 59.4286
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Average in the Fact
Unit Code Semester Total_Score Number_of_Students
IT001 1 443 6
IT001 2 96 2
IT002 1 112 2
IT002 2 215 3
IT003 1 63 1
IT004 2 105 2
(a) Fact Version 2
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(b) Subject Dimension
(c) Semester Dimension
Average in the Fact
Unit Code Semester Total_Score Number_of_Students
IT001 1 443 6
IT001 2 96 2
IT002 1 112 2
IT002 2 215 3
IT003 1 63 1
IT004 2 105 2
(a) Fact Version 2
The SQL command: select sum(Total_Score)/
sum(Number_of_Students)
as Average_Score
from EnrolmentFact2
where UnitCode = ’IT001’;
Query: Calculate Average Score for
the Database Unit:
• (443+96)/(6+2) = 67.375
Average in the Fact
▪ The problem of Average in the Fact is known as the Average of an Average
problem.
➢ This problem is well known in Mathematics and Statistics.
➢ Average of an average will simply produce an incorrect average result.
▪ Hence, it is not desirable to have an average measure in the fact.
➢ Exceptional case: when the analysis ALWAYS uses all the dimensions (e.g.
Determinant Dimensions).
Min & Max in the Fact
▪ If Average should not be used in the fact, how about Min or Max?
➢ Yes, we can.
➢ Because Max of Max is always a global Max, and Min of Min is always a global
Min.
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension (c) Semester Dimension
Min & Max in the Fact
▪ Query: Find the Maximum Score of Database Unit.
• Max of {87, 64} is 87.
▪ The SQL command: select max(Max_Score) from EnrolmentFact
where UnitCode = ’IT001’;
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension (c) Semester Dimension
Min & Max in the Fact
▪ Query: Find the Minimum Score of Database Unit.
• Min of {41, 32} is 32.
▪ The SQL command: select min(Min_Score) from EnrolmentFact
where UnitCode = ’IT001’;
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension (c) Semester Dimension
Average in the Fact – Conclusion
▪ Average in the Fact is not desirable, although technically it satisfies the two
criteria of the fact (e.g. must be a numerical and aggregate value).
▪ Min and Max in the Fact can still be used, since Min Score and Max Score are
valid fact measures (e.g. they are numerical and aggregated values).
▪ In general, count and sum are more common.
Outer Join
Case Study #2: Employment Agency
The Employment Agency has a file of candidates who are willing to work.
▪ They record the candidates number, name, contact address, contact phone number and
maximum hours the candidate is available per week.
▪ Each candidate may have several qualifications.
▪ The Agency uses special codes to record the candidate’s qualification for a job opening.
▪ As well as recording the code, the Agency also records the experience of the candidate in
each qualification area, expressed as the number of months experience.
Case Study #2: Employment Agency
The Agency also has a list of companies that request temporaries.
▪ Each company is assigned a company number as an identifier.
▪ The company name, type of business and principal contact for employment placements is also recorded.
▪ Each time a company requests a temporary employee, the Agency makes an entry in the (job) openings file. This file contains an opening number, the company requesting an employee, required qualification, starting date, anticipated ending date, and hourly pay.
▪ Each opening requires only one specific or main qualification.
▪ The Agency may be able to fill the opening from the staff on its books, however in some circumstances it cannot fill the request.
Case Study #2: Employment Agency
▪ When a candidate matches the qualification, and is available, he/she is given the job, and an entry is made in the placement record folder.
➢ This folder contains an opening number, candidate number, actual start date, total hours worked to date, and end date if the placement is completed.
➢ The placement record folder is used by the Agency as a source of placement histories for it’s various temporaries.
Case Study #2: Employment Agency
▪ Note:
➢ Not all Openings have
Placements, but each Placement
comes from the Opening.
➢ A Candidate may have several
Placements.
➢ A Company may open several
Openings.
➢ Each Opening has one main
qualification requirement.
Case Study #2: Employment Agency
▪ A data warehouse to analyze
the total openings and total
placements by job durations,
qualification requirements, and
placement starting month.
▪ Several types of job durations:
➢ Short-term (<10 days)
➢ Medium-term (10 to 30 days)
➢ Long-term (>30 days)
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
▪ Several types of job durations:
➢ Short-term (<10 days)
➢ Medium-term (10 to 30 days)
➢ Long-term (>30 days)
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
1. How many short-term job openings?
2. How many jobs requiring certain qualifications?
3. How many people started working in a certain
month?
Case Study #2: Employment Agency
create table QualificationDim as
select * from Qualification;
Case Study #2: Employment Agency
create table MonthDim as
select distinct
to_char(ActualStartDate, ’Month’)
as MonthName
from Placement;
Case Study #2: Employment Agency
create table DurationDim
(DurationID number,
DurationDesc varchar2(20));
insert into DurationDim values (1, ’Short-Term’);
insert into DurationDim values (2, ’Medium-Term’);
insert into DurationDim values (3, ’Long-Term’);
Case Study #2: Employment Agency
create table TempFact as
select O.QCode,
O.StartDate,
O.EndDate,
to_char(P.ActualStartDate, ’Month’)
as MonthName,
O.OpenNo,
P.CandNo
from Opening O, Placement P
where O.OpenNo = P.OpenNo (+);
Case Study #2: Employment Agency
create table TempFact as
select O.QCode,
O.StartDate,
O.EndDate,
to_char(P.ActualStartDate, ’Month’)
as MonthName,
O.OpenNo,
P.CandNo
from Opening O, Placement P
where O.OpenNo = P.OpenNo (+); Left Outer Join
Case Study #2: Employment Agency
alter table TempFact
add (DurationID number);
update TempFact
set DurationID = 1
where EndDate - StartDate < 10;
update TempFact
set DurationID = 2
where EndDate - StartDate >= 10
and EndDate - StartDate <=30;
update TempFact
Set DurationID = 3
where EndDate - StartDate > 30;
Case Study #2: Employment Agency
create table AgencyFact as
select
QCode, DurationID, MonthName,
count(OpenNo) as TotalOpening,
count(CandNo) as TotalPlacement
from TempFact
group by QCode, DurationID, MonthName;
One-Attribute
Dimensions
61
One-attribute dimensions
One-attribute dimensions ● Dimensions with only one attribute, i.e. key attribute
62
Column-based Solution in the Fact
● Medal Type Dimension table seems unnecessary, because Medal Type also
exists in the fact table
● Hence, there is no need to reference to the Medal Type Dimension
63
Column-based Solution in the Fact (cont.)
Column-based solution ● Convert the dimension attribute into
new columns (fact measures) in the
fact table
● Only applicable for dimensions with
low cardinality, because it is not
practical to create a new column in
the fact table for each record of the
dimension
64
Row-based Solution in the Fact
● What happens if the one-attribute dimension has a lot of records?
● The solution: ○ Remove the dimension and keep the attribute of the dimension as a dimension-
less key in the fact table
Row-based solution whereby
each medal type is
represented by records in the
fact table
65
Row-based Solution in the Fact - Sales Case Study
● Assume that there are only four quarters (e.g. Q1, Q2, ..., Q4), regardless
the year
● The Quarter could be shifted to the fact measures
● Result: Pivoted fact table
Q1 to Q4
A Sales Star Schema with Pivoted Fact
66
▪ What if the Quarter also embeds the Year, e.g. Q12020, Q22018, and etc.? – Pivot Fact table cannot be the solution, because the number of possible records
in the Time Dimension can be large
▪ The solution: – Include the Quarter attribute as a Dimension-less Key in the fact, and to remove
the Time Dimension from the star schema
– Result: Row-based solutions - not restricted by the number of records in the
one-attribute dimension
Row-based Solution in the Fact (cont.)
67
▪ Similarity: – The contents of the fact table are identical.
▪ Difference: – The dimension-less key attribute has no references to the Time Dimension,
because the Time Dimension has been removed, whereas the Quarter in the star
schema has a reference to the Time Dimension.
Star Schema with Dimension-less Key vs. One-Attribute Dimension
vs
Star Schema with Dimension-less key
Star Schema with TimeDIM
68
▪ Looking at the high level design, it is
often more desirable to keep the
one-attribute dimension, i.e. the Time
Dimension in this case
– Provides a clearer design,
considering that the fact table
remain the same, with or without
the one-attribute dimensions
Star Schema with Dimension-less Key vs. One-Attribute Dimension (cont.)
69
One-attribute dimensions
70
▪ If there is more than one dimension with single attribute, and these
dimensions are not related to each other, it is often desirable to combine
them into a junk dimension
▪ Case study: – Analyse the number of sales based on
these dimensions
Combine all one-attribute dimensions
Sales Fact Table
71
Combine all one-attribute dimensions (cont.)
Cartesian product of 4
dimension tables
72
73
▪ Assuming that each book can only have one category, then Category is
related to Book
▪ In this case, we can simply move the category into the book dimension
Combine with other normal dimensions - Book Sales Case Study
74
Combine with other normal dimensions - Book Sales Case Study (cont.)
75
Determinant Dimension with one-attribute only - Weather Data Case Study
76
Determinant Dimension with one-attribute only - Weather Data Case Study
▪ WeatherMonth attribute is a Determinant Dimension in this case.
▪ Hence, WeatherMonth Dimension must be kept in the star schema
– To enforce that WeatherMonth Dimension (or WeatherMonth attribute)
must be used in all queries to this star schema
WeatherMonthDI
M
77
Questions:
1. What is the granularity of this
star schema?
2. How can we have a higher
granularity in this star schema
(no category to go for)?
Answer to Question #2: ▪ Combine books into groups
based on the authors
One-attribute dimension with bridge - Book Sales Case Study
78
▪ The ISBN Group List can be
created by using the listagg function
▪ The Book Group List
Dimension is a one-attribute
dimension
▪ It is not possible to remove
this dimension as it links the
fact and the bridge table
One-attribute dimension with bridge - Book Sales Case Study (cont.)
79
1. To move the one attribution dimensions to the fact table
– The one-attribute dimension must have low cardinality
– Else, keep the dimension attribute in the fact as a dimension-less key
– From high level point of view, sometimes it is desirable to keep the
attribute dimension in the design
2. To combine the one-attribute dimension with other dimensions
– Form a junk dimension that combines several one-attribute dimensions
– Move the one-attribute dimension into another normal dimension
– One-attribute dimensions cannot be removed if:
▪ It is a determinant dimension
▪ It is connected to a bridge table
Summary for One-Attribute Dimensions
End of Week 5 Lecture
Oracle/Requiremnts-Oracle.pdf
FIT5195 S1 2020 – Take Home Test (10%)
Due date: Week 6, Friday 1-May-2020, 11:55pm
This Take Home Test consists of two Case Studies. You are required to answer all questions.
Case Study 1: Government Hospital Data Warehouse (50 marks) Currently, the Victorian government has a simple database system implemented to store the
collective data of all government-run hospitals. The database system includes information about their
services, staff, and patients. The simple database system consists of the following tables and
attributes:
Figure 1 : A simplified structure of the Government Hospital’s current E/R Diagram
Due to the nationwide impact of a new pandemic, the Premier of Victoria (i.e. the head of government
in the Australian state of Victoria) requires a data warehouse for analyzing trends and forecasting
purposes. The goal of the analyzing is to identify which frequently used services, popular hospital
locations, etc.
You are required to design a small Data Warehouse to keep track of the statistics. The Premier is
particularly interested in analyzing the total number (population quantity) of patients and total
service charged by services used, time periods, age groups, and hospital locations.
The Premier also wants to be able to drill down based on the cost of services range (low price < $20,
medium price between $20 and $50, and high price >$50).
Tasks:
[1]. Create the operational database based on the above E/R Diagram, and then populate with sufficient numbers of records in each table. It is suggested that each table would have 5-20
records depending on the use of that table in the operational database. You need to create a script
file containing the CREATE TABLE and INSERT INTO commands.
You can follow the list of services from this link:
https://www.monash.edu/health/medical/services.
[2]. Then develop a GOVERNMENT HOSPITAL star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you
have to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[3]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.
[4]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables.
[5]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of patients making appointments during Winter. b) Show the total service charged for each service cost type. c) Show the total number of patients by each age group (infant <1, children <18, adult 18+,
senior 65+) in April 2020.
d) Show the total service charged for general medical consultations in each suburb.
Submission checklist for Case Study 1:
A. A pdf file or word document containing:
i. The SQL commands to create tables and insert records to the operational database –
Task 1
ii. The star schema diagram – Task 2
iii. The Two-Column Table Methodology illustration – Task 3
iv. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 4
v. The SQL commands to answer the queries in Task 5 and the query results
B. The .sql files containing:
i. The SQL commands to create the operational database.
ii. The SQL commands to create the data warehouse.
iii. The SQL commands to answer the queries in Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy1_yourStudentID.
Case Study 2: Accident Records Data Warehouse (50 marks)
Figure 2 : A simplified structure of the Accident Record’s current database system
The above figure displays how the Victorian Roads and Safety (VicR&S) Department stores accident
related data. When there are any road accidents, the details about the accidents such as the location,
date, time and event (e.g. what exactly happened during the accident) are recorded. For each accident,
the information (e.g. name, contact number, employment start date, current employment branch)
about one police officer who is in charge of overseeing the events of the accident (e.g finding out
who/what caused the accident) are stored. In addition, to identify a particular accident, each accident
is given a unique accident number.
On the other hand, one accident may be caused by several vehicles (e.g. a collision between 2
vehicles). All vehicles have owners who may or may not be the same as the vehicle driver during the
accident. Therefore, the VicR&S database stores information about vehicles involved in the accident
(i.e. vehicle number, vehicle model, vehicle make and vehicle manufacturer), the information about
the owners of the vehicles (i.e. owner name, owner address, owner contact number) and the
information about the drivers at the time of the accident (driver name, driver address, driver contact
number, driver licence no, and when the driver started to drive the vehicle). Furthermore, for each
accident, the severity of damage incurred by each vehicle is recorded.
Currently, to reduce the number of road accidents VicR&S department is interested in the following:
- The total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).
- The total number of accidents by each vehicle model. - The number of vehicles involved in every accident event on different locations. - The number of accidents taken care of different police officer branches.
Tasks:
[1]. Develop an ACCIDENT RECORDS star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you have
to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[2]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.
[3]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables. The
operational tables are accessible from the ACCIDENT account. The result of this task is the SQL
commands. You will also need to show the contents of the tables that you have created.
[4]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).
b) Show the total number of accidents by each vehicle model. c) Show the number of vehicles involved in every accident event on different locations. d) Show the number of accidents taken care of by different police officer branches.
[5]. You need to come up with additional two more questions and answer these questions using the SQL commands. Also explain the reason for why the management would like to have such
information.
Submission checklist for Case Study 2:
A. A pdf file or word document containing:
i. The star schema diagram – Task 1
ii. The Two-Column Table Methodology illustration – Task 2
iii. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 3
iv. The SQL commands to answer the queries in Task 4 and the query results
v. The additional two questions with the SQL commands, query result, and your explanation – Task 5
B. The .sql files containing:
i. The SQL commands to create the data warehouse.
ii. The SQL commands to answer the queries in Task 4 and Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy2_yourStudentID.
Submission Method:
1. A zip file: a. Save the CaseStudy1_yourStudentID and CaseStudy2_yourStudentID folders into one
folder named FIT5195_Test_yourname_studentID.
b. ZIP the folder FIT5195_Test_yourname_studentID.zip. This must be a ZIP file and not other types of compressed folder. The zip file should contain the prescribed folder
structure as follows:
● FIT5195_Test_yourname_studentID/ ○ CaseStudy1_yourStudentID/
○ Files for tasks ○ CaseStudy2_yourStudentID/
○ Files for tasks
2. Upload your zip file on Moodle by the due date: Friday 1-May-2020, 11:55pm (No late submission is accepted).
END OF TAKE HOME TEST
Oracle/Week-1c-SQL-revision.pdf
11
SQL - Revision
Assoc. Prof. David Taniar
22
33
SQL - Revision
Outline: A. Introduction to SQL client environment
B. Create tables
C. Insert into
D. Simple query retrieval
E. Updating and deleting records
F. Commit
G. Joining multiple tables
H. Aggregate functions and group by
I. Alter tables
44
A. Introduction to SQL client
• There are several SQL client software:
• SQL Developer
• SQL*Plus (Windows, Mac, Unix)
• Login details:
• Username: S12345678 (12345678 is your student id)
• Password: student
• Host string: the unit code
5
Introduction to SQL Developer
Oracle username which starts
with S and followed by your
student ID. Password is student
Location of
the server
This is the hoststring, which
is the database name
rhino.its.monash.edu
FIT5195
FIT5195
6
Introduction to SQL Developer
Single Statement Execution Script Execution
Output of Execution
Type Command
77
B. Create Tables
• General Syntax:
CREATE TABLE <table_name>
(attribute1 data_type <NOT NULL>,
attribute2 data_type,
…
PRIMARY KEY (attribute1),
FOREIGN KEY (attribute1)
REFERENCES <references table_name> (key_attribute));
88
Create Tables – Data Types
• Data type denotes a kind of data of an attribute value
• Character data types: VARCHAR2 and CHAR
Make VARCHAR2(20)
Model VARCHAR2(30)
• Number data types: NUMBER
Year NUMBER(4)
StampDuty NUMBER(8,2)
• Date data type: DATE
Salesdate DATE
B
99
Create Tables – Data Types
• Date Data Type
• DATE stores dates from 1/1/4712 BC to 12/31/4712 AD
• Default date format: DD-MON-YYYY
• example: 05-JUN-2011
• Example declaration: salesDate DATE
• DATE data type also stores time values
B
1010
Create Tables – Data Types
• Default time format: HH:MI:SS A.M.
• If no time value is given when a date is inserted, default value is
12:00:00 A.M.
• If no date value is given when a time is inserted, default date is
first day of current month
• Example salesDate field: 07-JUN-2016 12:00:00 A.M.
B
1111
Create Tables – Constraints
• Integrity Constraints
• Primary Key attribute
• NOT NULL constraints
• Specifies that a field cannot be NULL
• Sample Declaration: Field_name data_type NOT NULL
• Foreign Key attribute in a table refers to another record in another table
B
1212
Create Tables – Example 1
• Example: create CAR, CARSALES and CUSTOMER tables
• Commands:
• CAR table
CREATE TABLE Car
(carID NUMBER(5) NOT NULL,
make VARCHAR2(20) NOT NULL,
model VARCHAR2(30),
year NUMBER(4),
colour VARCHAR2(25),
advertisedPrice NUMBER(10),
PRIMARY KEY (carID)
);
B
1313
Create Tables – Example 2
• CUSTOMER table
CREATE TABLE Customer
(customerID NUMBER(5) NOT NULL,
cNAME VARCHAR2(20) NOT NULL,
cPhone VARCHAR2(10),
address VARCHAR2(50),
PRIMARY KEY (customerID)
);
B
1414
Create Tables – Example 3
• CARSALES table
CREATE TABLE Carsales
(carID NUMBER(5) NOT NULL,
customerID NUMBER(5) NOT NULL,
salesDate DATE NOT NULL,
purchasedPrice NUMBER(10) NOT NULL,
stampDuty NUMBER(8,2) NOT NULL,
PRIMARY KEY (carID,customerID),
FOREIGN KEY (carID) REFERENCES Car(carID),
FOREIGN KEY (customerID) REFERENCES Customer(customerID)
);
B
1515
Create Tables – by Copying
B
• You can create a new table by copying from an existing table:
CREATE TABLE <table_name>
AS SELECT *
FROM … ;
• For example:
CREATE TABLE car
AS SELECT *
FROM dtaniar.car;
• Notes:
• It creates and copy the records from the existing table
• However, it does not copy the PK and FK
• In the above example, it copies the table from dtaniar account.
1616
Create Tables – View Tables
• Viewing Information about Tables
• To view all tables in the database, the general syntax is:
SELECT * FROM TAB;
• For example:
SELECT * FROM tab;
TNAME TABTYPE CLUSTERID
------------------------------ ------- ----------
CAR TABLE
CARSALES TABLE
CUSTOMER TABLE
B
1717
Create Tables – Describe Tables
• To view the table structure, the general syntax is:
DESCRIBE <table_name>
• For example:
DESC car;
Name Null Type
--------------- -------- ------------
CARID NOT NULL NUMBER(5)
MAKE NOT NULL VARCHAR2(20)
MODEL VARCHAR2(30)
YEAR NUMBER(4)
COLOUR VARCHAR2(25)
ADVERTISEDPRICE NUMBER(10)
B
1818
Create Tables – Drop Tables
• To drop an unwanted table, the general syntax is:
DROP TABLE <table_name>;
• For example:
DROP TABLE Car;
• If the table that you want to delete (e.g. table Car) is being used as a FK by
another table (e.g. table Carsales), then you cannot delete table Car.
• In this case, you need to delete table Carsales first, before deleting table Car.
B
1919
C. Insert Into
• General Syntax
• To insert values for every single attribute in a record:
INSERT INTO <table_name>
VALUES (attribute1_value, attribute2_value,….);
• Example:
INSERT INTO Car
VALUES (1, 'Holden', 'Cruze', 2015, 'Black', 25780);
• Strings are enclosed in single quotes (‘) and are case-sensitive
(e.g. ‘General Practice’ is different from ‘general practice’)
20
Insert Into
• To insert a value of selected attributes:
INSERT INTO <table_name> (attribute1,attribute2,…)
VALUES (attribute1_value, attribute2_value,….)
• Example:
INSERT INTO Car (carID, make) VALUES (’16’, ’Audi’);
C
2121
Insert Into
• The TO_DATE function:
• TO_DATE (‘date_value’, ’format mask’);
• Example:
INSERT INTO Carsales
VALUES (1,4, TO_DATE('04/Feb/2015', 'DD/MON/YYYY'),25780,824.96);
C
2222
Insert Into
• The common DATE format masks
Format Mask Formatted Data
DD-MON-YYYY 05-FEB-2007
MM/DD/YYYY 02/05/2007
HH:MI AM 02:30 PM
MONTH DAY, YYYY FEB 5, 2007
MM/DD/YYYY HH:MI AM 02/05/2007 02:30 PM
Sample DATE format masks
C
23
Insert Into
• Insert multiple records one-by-one:
INSERT INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800);
INSERT INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200);
INSERT INTO Car VALUES (4, 'Holden', 'Commodore', 2008, 'Grey', 12650);
• Insert multiple records at once:
INSERT ALL
INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800)
INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200)
INTO Car VALUES (’ 4, 'Holden', 'Commodore', 2008, 'Grey', 12650)
SELECT * FROM DUAL;
C
2424
D. Simple Query Retrieval
• Simple Retrieval
• Retrieve all Records
• General Syntax :
SELECT *
FROM <table_name>;
• Example: retrieve everything from the CAR table:
SELECT * FROM car;
• Output:
2525
Simple Query Retrieval
• Retrieve Specific Fields
• General Syntax:
SELECT <attribute1, attribute2…>
FROM <table_name>;
• Example: select only the model from the CAR table
SELECT model FROM car;
D
• Output:
2626
Simple Query Retrieval
• Eliminating Duplicated Records (DISTINCT qualifier)
• General Syntax:
SELECT DISTINCT <attribute1,attribute2,…>
FROM <table_name>;
• Example: eliminating duplicates for the MAKE values:
SELECT DISTINCT make FROM car;
D
2727
Simple Query Retrieval
• Output (with DISTINCT):
D
• Original Output (without DISTINCT):
2828
Simple Query Retrieval
• DISTINCT Multiple Attributes
• Example: eliminating duplicates for the MAKE and YEAR values:
SELECT DISTINCT make, year
FROM car;
D
MAKE YEAR
-------------------- ----------
Volkwagen 2015
Toyota 2013
Mini 2009
Holden 2015
BMW 2016
Honda 2012
Subaru 2014
Holden 2008
Jeep 2006
Mazda 2011
Audi 2016
Lexus 2009
Honda 2010
Nissan 2010
Mercedes Benz 2006
15 rows selected
2929
Simple Query Retrieval
• Conditional Retrieval
• Search Conditions specified for more complex data retrieval
• The WHERE Clause
• Operators:
• equal (=)
• greater than (>)
• less than (<)
• greater than or equal to (>=)
• less than or equal to (<=)
• Not equal (<>)
D
3030
Simple Query Retrieval
• General Syntax:
SELECT <attribute1,attribute2,…>
FROM <ownername.table_name1>
WHERE <search condition>;
• Example:
SELECT year, model, advertisedPrice
FROM car
WHERE make='Holden'
YEAR MODEL ADVERTISEDPRICE
---------- ------------ ---------------
2015 Cruze 25780
2008 Commodore 12650
D
3131
Simple Query Retrieval
• “BETWEEN”
• Example: list the carid, salesDate and purchasedPrice after 2014 and
before 2016
SELECT carid, salesdate, purchasedPrice
FROM carsales
WHERE salesdate
BETWEEN TO_DATE('01-JAN-2014', 'DD-MON-YYYY')
AND TO_DATE('31-DEC-2015','DD-MON-YYYY’);
CARID SALESDATE PURCHASEDPRICE
---------- ----------- --------------
1 04-FEB-2015 25780
8 12-DEC-2015 13999
D
3232
Simple Query Retrieval
• Using a String Comparison
• Example: list the carid, salesDate and purchasedPrice after 2014 and
before 2016
SELECT carid, salesdate, purchasedPrice
FROM carsales
WHERE TO_CHAR(salesdate, ‘YYYYMMDD’) >’20140101’
AND TO_CHAR(salesdate, ‘YYYY’MMDD’) < ‘20151231’;
CARID SALESDATE PURCHASEDPRICE
---------- ----------- --------------
1 04-FEB-2015 25780
8 12-DEC-2015 13999
D
3333
Simple Query Retrieval
• “AND” or “OR”
• AND: both conditions must be true
• Example:
SELECT make, model, year
FROM car
WHERE year>2014 AND make='Holden';
MAKE MODEL YEAR
--------- ---------------- ----------
Holden Cruze 2015
D
3434
Simple Query Retrieval
• OR: either one of the condition is true
• Example:
SELECT make, model, colour, year, advertisedPrice
FROM car
WHERE colour='Red’ OR year>2014;
MAKE MODEL COLOUR YEAR ADVERTISEDPRICE
--------- ---------- ---------- ---------- ----------------
Holden Cruze Black 2015 25780
BMW 520d Grey 2016 98800
Audi A5 Black 2016 68200
Honda Civic Red 2012 13488
Lexus RX350 Red 2009 33500
Volkwagen Golf White 2015 39888
6 rows selected
D
3535
Simple Query Retrieval
• Other Conditions
• LIKE/NOT LIKE
• Example: displaying all MAKE from car table that has their first
character as ‘M’
SELECT make
FROM CAR
WHERE make LIKE 'M%';
MAKE
----------------
Mazda
Mercedes Benz
Mini
D
3636
Simple Query Retrieval
• IN/NOT
• suitable to perform a set member search
• Example:
• IN - displaying all MODEL that its MAKE is either ‘Holden’ or ‘Honda’
SELECT make, model
FROM car
WHERE make IN ('Holden', 'Honda');
MAKE MODEL
-------------------- -------------
Holden Cruze
Holden Commodore
Honda Civic
Honda City
D
3737
Simple Query Retrieval
• IN/NOT
• Example:
• NOT IN - displaying make, model and colour of cars that are not ‘Black’
and ‘White’
SELECT make, model, colour
FROM car
WHERE colour NOT IN ('Black’, 'White');
MAKE MODEL COLOUR
-------------- ------------ ------------
BMW 520d Grey
Holden Commodore Grey
Honda Civic Red
Lexus RX350 Red
Mazda 3Series Blue
Mercedes Benz C200 Blue
Honda City Grey
D
3838
Simple Query Retrieval
• NULL/NOT NULL
• Example:
• NULL Operator:
SELECT carid, make
FROM car
WHERE colour IS NULL;
CARID MAKE
------ ---------
12 Subaru
D
3939
Simple Query Retrieval • NOT NULL Operator:
SELECT carid, make
FROM car
WHERE colour IS NOT NULL;
CARID MAKE
------- --------------------
1 Holden
2 BMW
3 Audi
4 Holden
5 Honda
6 Jeep
7 Lexus
8 Mazda
9 Nissan
10 Volkwagen
11 Mercedes Benz
13 Honda
14 Mini
15 Toyota
14 rows selected
D
4040
Simple Query Retrieval
• MULTIPLE OPERATORs
• Example: list cars make start with ‘M’ and was made before 2016
SELECT year, make, model
FROM car
WHERE make LIKE 'M%'
AND year<2016;
YEAR MAKE MODEL
---- ---------------- -------------
2011 Mazda 3Series
2006 Mercedes Benz C200
2009 Mini Cooper
D
4141
Simple Query Retrieval
• Sorting
• specify to sort the output by using ORDER BY
• General Syntax:
SELECT <attribute1,attribtue2,..>
FROM <table_name>
ORDER BY <attribute_name> [DESC];
D
4242
Simple Query Retrieval • Example: retrieving all make, model, advertisedPrice and colour in a descending
order of the advertisedPrice SELECT make, model, advertisedPrice, colour
FROM car
ORDER BY advertisedPrice DESC;
MAKE MODEL ADVERTISEDPRICE COLOUR
--------------- ------------------ --------------- --------
BMW 520d 98800 Grey
Audi A5 68200 Black
Volkwagen Golf 39888 White
Subaru Outback 33950 NULL
Lexus RX350 33500 Red
Holden Cruze 25780 Black
Toyota Corolla 23888 White
Jeep Grand Cherokee 21999 Black
Mini Cooper 19750 Black
Nissan Dualis 18888 White
Mercedes Benz C200 16995 Blue
Mazda 3Series 13999 Blue
Honda Civic 13488 Grey
Holden Commodore 12650 Grey
Honda City 7490 Grey
15 rows selected
D
43
Simple Query Retrieval
• “AS”: to rename a column
• Example: list all SalesDate, PurchasedPrice, StampDuty and
TotalPrice (TotalPrice = PurchasedPrice+StampDuty )
SELECT salesdate, purchasedPrice, stampduty,
(purchasedPrice+StampDuty) as TotalPrice
FROM carSales;
SALESDATE PURCHASEDPRICE STAMPDUTY TOTALPRICE
----------- -------------- ---------- ----------
04-FEB-2015 25780 824.96 26604.96
13-JUL-2016 12650 506 13156
12-DEC-2015 13999 559.96 14558.96
14-JUN-2016 39888 1276.42 41164.42
18-MAY-2016 98800 5631.6 104431.6
4444
E. Updating and Deleting Records
• In the created tables
• UPDATE command – updating
• DELETE command – deletion
4545
Updating Records
• Update
• General Syntax:
UPDATE <table_name>
SET <attribute_name> = <new_value>
WHERE <expression> <operator> <expression>;
• records can be updated in only one table at a time
• update multiple fields that are within the same table
• WHERE clause - make the command updates specific records only
E
46
Updating Records
• Example:
• Updating colour of carID ‘5’ from ‘Red’ to ‘Grey’
UPDATE car
SET colour='Grey'
WHERE carID =5;
46
E
47
Deleting Records
• Delete
• General Syntax:
DELETE FROM <table_name>
WHERE <search_condition>;
• remove specific records from a database table
• use WHERE clause to specify multiple records to delete multiple records at
one time
• If the search condition is omitted, all records in the table are deleted.
47
E
48
Deleting Records
• Example:
• deleting a single record from the CAR table
DELETE FROM car
WHERE carid= 5;
• deleting multiple records from the CAR table that contain MAKE
starting with ‘M’
DELETE FROM car
WHERE make LIKE ‘M%’;
• deleting all records from the CAR table
DELETE FROM car;
48
E
49
Deleting Records
• Notice:
• not allowed to delete a primary key record that has its corresponded foreign
key somewhere else in another table
• Example: delete a primary key CarId is 5 in the CAR table that has a foreign
key record that CarId is 5 in the CARSALES table
DELETE FROM car WHERE carid=5;
DELETE FROM car
*
ERROR at line 1:
ORA-02292: integrity constraint (SYSTEM.SYS_C005453) violated - child
record found
49
E
5050
F. Commit
• Commit
• When inserted data by issuing the INSERT command
• the changes are only saved in the local database buffer
• are not saved in the database
• until you COMMIT the transaction
• it is important to remember to COMMIT whenever you have finished inserting
values or make changes to the database values
5151
Commit
• General Syntax:
• Sample of inserting a new record with commit
Effect of Inserting New Record with Commit
F
Insert a new record
‘16’ with commit
All original
records are
currently in CAR
table
5252
G. Joining Multiple Tables
• Join
• database query to join multiple database tables together
• the data needed or the conditions specified come from more than one table.
• Syntax:
SELECT <attribute1,attribute2,… >
FROM <table_name1,table_name2,…>
WHERE <table_name1.join_attribute> = <table_name2.join_attribut>
AND <search_condition>
53
Joining Multiple Tables
• joining the CAR, CARSALES and CUSTOMER tables:
SELECT customer.cname, customer.cphone, carsales.salesdate, carsales.purchasedPrice,
carsales.stampDuty, car.make, car.model
FROM customer, carsales, car
WHERE customer.customerID = carsales.customerID
AND car.carID = carsales.carID;
• using prefix when joining tables:
• when more than one table is involved, a prefix for each attribute is
recommended to avoid ambiguity
SELECT ct.cname, ct.cphone, cs.salesdate,
cs.purchasedPrice,cs.stampDuty,
c.make, c.model
FROM customer ct, carsales cs, car c
WHERE ct.customerID = cs.customerID
AND c.carID = cs.carID;;
53
G
• The output of both example:
54
Joining Multiple Tables
• Subquery
• Example: retrieving the make, model and advertisedPrice of the cars that
are not sold
SELECT make, model, advertisedPrice FROM car
WHERE carid NOT IN
(SELECT carid FROM carsales);
MAKE MODEL ADVERTISEDPRICE
----------------- ----------------- ---------------
Subaru Outback 33950
Lexus RX350 33500
Mini Cooper 19750
Nissan Dualis 18888
Mercedes Benz C200 16995
Jeep Grand Cherokee 21999
6 rows selected
5555
H. Aggregate Functions and Group By
• Aggregate Functions
• summarize the input table
• often used include:
• COUNT - count number of records in the input table
• SUM - calculate the sum of a numerical attribute
• MIN and MAX - find the smallest and the largest value of a certain
attribute
56
Aggregate Functions
• Count(*)
• Example: returning the number of records that is available from
the CAR table
SELECT COUNT(*)
FROM car;
COUNT(*)
------------------
15
H
57
Aggregate Functions
• Example: returning the number of values of colour that is available from the CAR table
SELECT COUNT(colour) FROM car;
COUNT(COLOUR)
-------------
14
• Example: returning the number of distinct values of colour that is available from the CAR table
SELECT COUNT(DISTINCT colour) FROM car;
COUNT(DISTINCTCOLOUR)
---------------------
5
COUNT(attribute) & COUNT(DISTINCT attribute)
Cause by a
record with
null value 15 records
include one
NULL value
Output:
58
Aggregate Functions
• MIN and MAX
• Example: calculate the minimum and maximum advertisedPrice
of the cars
SELECT MIN(advertisedPrice) AS "Lowest Price", MAX(advertisedPrice) AS "Highest Price”
FROM car;
Lowest Price Highest Price
------------ --------------
7490 98800
5959
Aggregate Functions and Group By
• Group By
• group an input table into a number of groups based on one or more
nominated attributes
• often used in conjunction with aggregate functions
• CAR table: group by Make
SELECT make, count(*)
FROM car
GROUP BY make;
H
MAKE COUNT(*)
-------------------- ----------
Holden 2
Lexus 1
Subaru 1
BMW 1
Mazda 1
Nissan 1
Audi 1
Volkwagen 1
Toyota 1
Honda 2
Mercedes Benz 1
Jeep 1
Mini 1
60
Aggregate Functions and Group By
• CAR table: grouping by Make with combination of getting only the record
groups that contain the count value greater than 1 is as follows
SELECT make, COUNT(DISTINCT model)
FROM car
GROUP BY make
HAVING COUNT(DISTINCT model)>1;
MAKE COUNT(DISTINCTMODEL)
-------------------- --------------------
Holden 2
Honda 2
H
61
Aggregate Functions and Group By
• multiple tables: selecting the number of sold cars for each colour
SELECT colour, COUNT(DINSTINCT customerId)
FROM car c, carsales cs
WHERE c.carid=cs.carid
GROUP BY colour;
COLOUR COUNT(DISTINCTCUSTOMERID)
----------- -------------------------
White 2
Grey 4
Blue 1
Black 2
Red 1
H
6262
Aggregate Functions
• Count vs. Sum
• COUNT - count number of records in the input table
• SUM - calculate the sum of a numerical attribute
SELECT <attribute1, attribute2>, COUNT(*)
FROM <table_name>
GROUP BY <attribute1, attribute2>;
SELECT <attribute1, attribute2>, SUM (attribute3)
FROM <table_name>
GROUP BY <attribute1, attribute2>;
H
Aggregate Functions
63
• Sum without Group By
• Example: calculate the sum of purchasedPrice
SELECT SUM(purchasedPrice) AS totalsales
FROM carsales;
TOTALSALES
----------
402983
Aggregate Functions
64
• Sum with Group By
• Example: calculate the sum of purchasedPrice for each make in
carsales table
SELECT c.make, SUM(cs.purchasedPrice) AS totalsales
FROM carsales cs, car c
WHERE c.CARID=cs.CARID
GROUP BY c.make;
MAKE TOTALSALES
-------------------- ----------
Holden 38430
BMW 197600
Mazda 13999
Audi 68200
65
Simple Query Retrieval
• SUM with Calculations
• Example: list all Cname, SalesDate and calculate the TotalPrice for
each customer (TotalPrice = PurchasedPrice+StampDuty )
SELECT ct.cname, cs.salesdate, SUM(purchasedPrice+StampDuty) as TotalPrice
FROM carSales cs, customer ct
WHERE cs.customerid=ct.customerid
GROUP BY ct.cname, cs.salesdate;
CNAME SALESDATE TOTALPRICE
-------------------- ----------- ----------
Jone 14-JUN-2016 14027.52
Rex 12-DEC-2015 14558.96
Lily 13-JUL-2016 13156
6666
I. Alter Tables – Add New Fields
• Syntax:
ALTER TABLE <table_name>
ADD (attribute_name data_type_declare constraints_declare);
• attribute name: referring to the new attribute that you want to add into the
existing table
• data type: defines the data type and the size of the new attribute
• constraint: defines the constraints that the new attribute might be enforced by
certain constraints
• Example: add a transmission attribute to the CAR table
ALTER TABLE car
ADD (transmission VARCHAR2(10));
67
Alter Tables – Modify Fields
• Syntax:
ALTER TABLE <table_name>
MODIFY (attribute_name new_data_type);
• attribute name: refers to the attribute that you want to modify
• new data: defines the new data type that you want to use replacing the old one
• Example: change the data type of the transmission to CHAR with a size of
30
ALTER TABLE car
MODIFY (transmission CHAR(30));
I
68
Alter Tables – Drop Columns
• Syntax:
ALTER TABLE <table_name>
DROP COLUMN attribute_name;
• Example: delete the attribute transmission from the CAR table
ALTER TABLE car
DROP COLUMN transmission ;
I
Oracle/Week-4a-StarSchema-StudentEnrollment.pdf
1
More on Student Enrollment Case Study (Tutorial 2)
1. Two dimensions vs. One dimension The star schema of the case study used in Tutorial 2 (Student-Enrollment system) is shown as follows.
Star Schema-1 (one semester-year dimension)
The SQL codes to create dimension and fact tables are as follows:
--Campus dimension
Create table campus_dim as
SELECT distinct Ocampus
FROM Offering;
--Semester_year dimension
Create table sem_year_dim as
SELECT distinct Oyear||Osem as sem_id, Oyear, Osem
FROM Offering;
--Subject Dimension
Create table subject_dim as
SELECT *
FROM subject;
--Grade Dimension
Create table grade_dim as
SELECT distinct Grade
FROM Enrollment;
2
--The Fact
CREATE TABLE student_enrollment_fact as
SELECT
o.Ocampus,
o.Oyear||o.Osem as sem_id,
s.Ucode,
e.Grade,
count(st.sid) as num_of_student,
sum(e.score) as Total_score
FROM subject s, enrollment e, offering o, student st
WHERE e.OID = o.OID
and s.Ucode = o.Ucode
and st.SID = e.SID
GROUP BY
o.Ocampus,
o.Oyear||o.Osem,
s.Ucode,
e.Grade;
The records in the data warehouse are as follows:
SQL> select * from campus_dim;
OCAMPUS
---------
City
Main
DE
SQL> select * from sem_year_dim;
SEM_ID OYEAR OSEM
---------- ---------- ----
20092 2009 2
20091 2009 1
SQL> select * from subject_dim;
UCODE UTITLE UCREDIT
---------- -------------------- ----------
IT001 Database 5
IT002 Java 5
IT003 SAP 10
IT004 Network 5
IT005 ASP.NET 5
SQL> select * from grade_dim;
GRADE
-----
HD
P
D
C
N
3
SQL> select * from student_enrollment_fact;
OCAMPUS SEM_ID UCODE GRADE NUM_OF_STUDENT TOTAL_SCORE
--------- --------- ---------- ----- -------------- -----------
Main 20091 IT001 HD 3 253
Main 20091 IT001 D 2 149
Main 20091 IT002 C 1 65
Main 20091 IT002 N 1 47
Main 20092 IT002 D 2 151
City 20092 IT001 C 1 64
Main 20091 IT001 N 1 41
Main 20092 IT002 C 1 64
DE 20092 IT004 P 2 105
City 20091 IT003 C 1 63
City 20092 IT001 N 1 32
11 rows selected.
Look at particularly the sem_year_dim dimension, both in the SQL code and the contents of this dimension. In the above version of the star schema, the semester and the year information are actually combined into one dimension, called sem_year_dim dimension. Can we have two separate dimensions: one for semester, and the other for year? The answer is “yes, you can”. The new star schema would look like as follows. Because the original sem_year_dim is now split into two dimensions: sem_dim, and year_dim, consequently, in the fact table, we will have two attributes: semester and year, instead of just one attribute sem_id.
4
Star Schema-2 (two separate dimensions for semester and year)
The SQL codes to create these two new dimensions (sem_dim and year_dim) and the fact tables are as follows:
--Semester dimension
Create table sem_dim as
SELECT distinct Osem as sem_id
FROM Offering;
--Year dimension
Create table year_dim as
SELECT distinct Oyear as year_id
FROM Offering;
5
--The Fact
CREATE TABLE student_enrollment_fact_2 as
SELECT
o.Ocampus,
o.Osem as sem_id,
o.Oyear as year_id,
s.Ucode,
e.Grade,
count(st.sid) as num_of_student,
sum(e.score) as Total_score
FROM subject s, enrollment e, offering o, student st
WHERE e.OID = o.OID
and s.Ucode = o.Ucode
and st.SID = e.SID
GROUP BY
o.Ocampus,
o.Osem,
o.Oyear,
s.Ucode,
e.Grade;
The contents of the two dimensions and the new fact are as follows:
SQL> select * from sem_dim;
SEM_ID
----------
1
2
SQL> select * from year_dim;
YEAR_ID
----------
2009
SQL> select * from student_enrollment_fact_2;
OCAMPUS SEM_ID YEAR_ID UCODE GR NUM_OF_STUDENT TOTAL_SCORE
--------- ------ -------- ---------- -- -------------- -----------
DE 2 2009 IT004 P 2 105
Main 1 2009 IT001 N 1 41
Main 1 2009 IT001 D 2 149
Main 2 2009 IT002 C 1 64
Main 1 2009 IT002 N 1 47
Main 2 2009 IT002 D 2 151
City 2 2009 IT001 N 1 32
Main 1 2009 IT002 C 1 65
City 1 2009 IT003 C 1 63
Main 1 2009 IT001 HD 3 253
City 2 2009 IT001 C 1 64
11 rows selected.
Now compare the contents of Fact-1 and Fact-2. They are almost the same, except that Fact-1 has one attribute from the sem_year_dim dimension, whereas
6
Fact-2 has two attributes, called sem_id (from semester_dim) and year_id (from year_dim). Conclusion: From the FACT point of view, it does not really matter whether we have two separate dimensions or just one dimension. They are the same. However, if two information, like semester and year, is often seen as one entity or one piece of information, it would be easier if the two information is present in one dimension.
7
2. Dimension Hierarchy (Hierarchy vs. No Hierarchy) Suppose we have richer information about campus, as follows: Campus Dimension CampusID CampusName Address Suburb City State Postcode Country CL Clayton
Campus Wellington Rd
Clayton Melbourne Victoria 3800 Australia
CA Caulfield Campus
Dandenong Rd
Caulfield East
Melbourne Victoria 3145 Australia
PA Parkville Campus
Royal Parade
Parkville Melbourne Victoria 3052 Australia
SY Sydney Campus
Opera Boulevard
Sydney Sydney New South Wales
2001 Australia
MA Malaysia Campus
Jalan Lagoon
Bandar Sunway
Sydney Selangor 47500 Malaysia
SA South Africa Campus
Peter St Johanne sburg
Johannesb urg
Johanne sburg
1725 South Africa
In the star schema, we have two options: Option 1 is to have one dimension called campusDIM containing all of these attributes.
Option 1 (no hierarchy)
8
The SQL to create the Campus Dimension is shown as follows: --Campus dimension
Create Table campus_dim As
SELECT Distinct Ocampus, <<and any other attributes, eg campus name, city, etc>>
FROM Offering;
Option 2 is to have a hierarchy of three dimensions (e.g. campusDIM connected to the Fact, campusDIM is also the parent of cityDIM – because campus-city forms a hierarchy, and cityDIM becomes the parent of countryDIM).
Option 2 (with hierarchy)
The Campus dimension is still linked to the Fact, whereas the City dimension is linked (through hierarchy) to the Campus, and the Country dimension is linked
9
to the City dimension. Therefore, there is a hierarchy from Campus to City, and to Country.
Note that the information about city is pulled out from the campus, and then create a hierarchy between campus and city. The information about country is also pulled out from the campus and city. The SQL to create the three dimensions in the hierarchy is as follows: --Campus dimension
Create Table campus_dim As
SELECT Distinct Ocampus, <<and any other attributes, eg campus name, etc>>
FROM <<table that stores information about campus, city, country>>;
--City dimension
Create Table city_dim As
SELECT Distinct Ocity, <<and any other attributes about the city>>
FROM <<table that stores information about campus, city, country>>;
--Country dimension
Create Table country_dim As
SELECT Distinct Ocountry, <<and any other attributes about the country>>
FROM <<table that stores information about campus, city, country>>;
The contents of the dimensions in the hierarchy (campus, city, and country dimensions are as follows): Campus Dimension CampusID CampusName Address Suburb Postcode CityID CL Clayton Campus Wellington Rd Clayton 3800 MEL CA Caulfield Campus Dandenong Rd Caulfield East 3145 MEL PA Parkville Campus Royal Parade Parkville 3052 MEL SY Sydney Campus Opera Boulevard Sydney 2001 SYD MA Malaysia Campus Jalan Lagoon Bandar Sunway 47500 KUL SA South Africa Campus Peter St Johannesburg 1725 JNB
City Dimension CityID CityName State CountryID MEL Melbourne Victoria AU SYD Sydney New South Wales AU KUL Kuala Lumpur Selangor MA JNB Johannesburg Johannesburg SA
Country Dimension CountryID CountryName AU Australia MA Malaysia SA South Africa
Note that the creation of the fact table is not affected at all. Either model (option 1 or option 2) will not affect the fact and other dimensions (e.g. grade, subject, semester_year dimensions).
10
Hierarchy should use the higher detail dimension (e.g. campus) as the parent, which is linked to the fact. The lower detail dimension (e.g. city) becomes the child in the hierarchy. The lowest detail dimension (e.g. country) becomes the lowest in the hierarchy. One of the most common mistakes is the wrong ordering of the dimensions in the hierarchy, such as in the following star schema:
An incorrect hierarchy (upside down)
11
Hierarchy vs No Hierarchy Since there are two options in implementing a star schema: with and without hierarchy, which option is more preferred?
1. One table vs. many tables Without hierarchy, there is only one table (e.g. the campus dimension table). With hierarchy, there are three tables (e.g. campus, city, and country dimension tables). The consequence is in the join query processing when producing reports. When you want to produce a report involving the fact and campus, with the no hierarchy option, we only need to use two tables: the campus dimension table, and the fact table. Hence, the join query processing is simple. With the hierarchy option, we need to join the fact with three tables (campus, city, and country). Consequently, the join query processing will be more complex
2. Normalized vs. Unnormalized With the hierarchy option, the tables are normalized, which follows the relational model. In contrast, the non-hierarchy option, the table (e.g. the campus dimension table) is unnormalized, which is basically 1NF with visible replication on information which is prone to anomalies (e.g. insert, update, and delete anomalies). Since no update, insert, or delete operations are performed to the dimension tables (once they have created), hence, anomalies become irrelevant. Therefore, the unnormalized option may be preferred, also due to a simpler join query processing.
3. Drilling down and rolling up People often mistakenly thought that the hierarchy model is good for drilling down. The hierarchy is not from country to city and to campus. The correct hierarchy is from the detail to the more general, such as from campus to city, and to country. Hence, we are not drilling down from campus to country, because drilling down should from something that is general (such as country) to something more detail (such as city and campus). In other words, the hierarchy model is not about drilling down information exploration. How about rolling up? Rolling up is exploring information from the detail to a more general. For example, retrieve number of students located at a campus level, and then rolling up to the country level. With the hierarchy option, we still need to do two queries: one for the campus and the second for the country.
12
Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select CountryName, Sum(Num_of_student) From Fact F, CampusDim C1, CityDim C2, CountryDim C3 Where F.CampusID = C1.CampusID And C1.CityID = C2.CityID And C2.CountryID = C3.CountryID Group By CountryName; Whereas with the non-hierarchy option, we still need to do two queries: Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select Country, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By Country; From the query point of view, both need two queries. From the query processing point of view, the non-hierarchy option uses one join operation only, because it only needs to join the fact and one dimension. From the conceptual point of view, the hierarchy model does not actually offer a better roll up or drill down features. If we insist to have a drill down, that is from country to campus, we still need to do the two queries above, but we will do the country query first, and then the campus query second.
Oracle/Week-4b-Determinant Dimensions.pdf
1
Determinant Dimensions The Petrol Price Case Study
1. Snapshot of an operational database The following is a sample operational database that keeps the petrol prices (of different kind of petrol) of petrol stations within a period of time
2
2. Star Schema The requirements for the data warehouse are to answer questions related to (a) average petrol price, (b) min petrol price, and (c) max petrol prices. From the sample data above, we choose the following three angles to view the fact measures: (i) day of week, (ii) suburb, and (iii) petrol company. From the Day of Week point of view, it is sensible to answer the question something like which day is has the lowest petrol price. We use a two-‐column table methodology to visualize a conceptual model of the star schema. Hence, the two-‐column table for the day of week is as follows: Day of week Total Petrol
Price Num of Petrol Station
Min Petrol Price
Max Petrol Price
Monday Tuesday … Notes:
1. In the above 2-‐column table, the first column (say column A, which is Day of Week) is a category. The second columns (say, columns B1, B2, B3, and B4) are the fact measures.
2. One of the requirements for the fact measure is average petrol price. Since, the fact table should not include an average as a fact measure, we include two attributes: total petrol price, and number of petrol stations, instead.
The visualization of the above two-‐column table for Day of Week is rather incomplete, because there is no data in the fact measure columns. Take a look at the Min Petrol Price column in the above table. What kind of value should be in that column? Supposed Monday has a value in the Min Petrol Price column, what does that value min? It does not specify which petrol type, whether it is Unleaded 91, or Premium 95, or Premium 98, for example. Therefore, the Min Petrol Price column, although seems right, does not make sense at all. How can this problem be solved? One solution is to have one Min Petrol Price for each petrol type (e.g. Unleaded 91, Premium 95, Premium 98, E10, Diesel, and LPG). Therefore, we have 6 attributes to represent Min Petrol Prices, namely: Min Unleaded 91 Price, Min Premium 95 Price, Min Premium 98 Price, Min E10 Price, Min Diesel Price, and Min LPG Price.
Min Petrol Price Day of week
Total Petrol Price
Num of Petrol Station
Unleaded P95 P98 E10 Diesel LPG Max Petrol Price
Monday 119.9c 132.9c 139.9c 114.9c 115.9c 55.7c Tuesday 118.9c 131.9c 136.9c 114.9c 115.9c 56.9c …
3
We need to do the same for the Total Price, and Max Petrol Price; each of which should be divided into 6 columns. The new two-‐column table for Day of Week is then shown as follows:
Total Petrol Price Min Petrol Price Max Petrol Price Day of Week U 95 98 E D LPG
Num of Petrol Station
U 95 98 E D LPG U 95 98 E D LPG
Monday Tuesday
The matter will become more complicated if number of petrol stations that has Unleaded 91 is different from that of Premium 98, for example. Then to calculate an average of Unleaded 91 on Monday, and an average of Premium 98 on Monday cannot be simply a division between Tot91 and Num of Petrol Stations, and between Tot98 and Num of Petrol Stations, because the based line, which is Num of Petrol Stations for different fuel type is different. Therefore, we need to divide the Num of Petrol Stations column into 6 columns – one for each petrol type. As a result, we will end up with 24 columns to store the fact measures, instead of just 4 columns, as initially thought. Using the same principle, we can have a two-‐column table for Suburb, and another one for Company.
Total Petrol Price Num of Petrol Station
Min Petrol Price Max Petrol Price Suburb
U 95 98 E10 D L U 95 98 E10 D L U 95 98 E10 D L U 95 98 E10 D L
Box Hill Clayton
Total Petrol Price Num of Petrol Station
Min Petrol Price Max Petrol Price Company
U 95 98 E D L U 95 98 E D L U 95 98 E D L U 95 98 E D L
Caltex 7-‐Eleven Based on the above three two-‐column tables, we are now confident to have the following star schema:
4
5
3. Star Schema version-2 Is there another way to reduce the number of fact measures? The above star schema has 24 fact measures, because there are 6 different kinds of petrol. The answer is yes!!! We can have a new dimension called Petrol Type DIM, which stores the different kinds of petrol. With this new dimension, the fact measure is reduced to 4, which is the original fact measures we had in the beginning. Therefore, the new star schema becomes as follows:
With this new star schema, as an example, we can retrieve data that shows the Min Petrol Price of Premium 98 on Monday. Or another example is to get the Average Petrol Price (which is Total Petrol Price/Num of Petrol Station) of Unleaded Petrol in 7-‐Eleven petrol station located at Clayton. Note that when retrieving data from a data warehouse, there is no obligation that we must use all dimensions. In the Monday Premium 98 example, only two dimensions are used, which are Day of Week DIM, and Petrol Type DIM. And in the second example, only three dimensions are used: Petrol Type DIM, Company DIM, and Suburb DIM. So, to retrieve data from the data warehouse, we can use
6
as few as one dimension table; but we can also use all dimension tables. We are not restricted to which dimension we have to use. However, this imposes a new problem. For example, the above star schema allows us to retrieve the Min Petrol Price of Monday, which uses only Day of Week DIM. This data retrieval does not make any sense; which petrol type are we referring to? Based on this case study, it is clear that the Petrol Type DIM holds the key to any data retrieval. We must use the Petrol Type DIM in any retrieval, such as the Min Petrol Price of Premium 98 on Monday, or the Average Petrol Price of Unleaded in 7-‐Eleven petrol station located in Clayton suburb. The Petrol Type DIM, in this example, is called a “Determinant Dimension” – a dimension which needs to be used in the data retrieval, because the fact measures are determined by this dimension. In order to differentiate between Determinant Dimension and all other dimensions, Determinant Dimension is denoted by a star. The fact measures are also denoted by a star. The new star schema with a Determinant Dimension is as follows:
Oracle/Week-4c-Determinant -vs-non-determinant.pdf
1
Determinant vs. Non-Determinant Dimensions
1. The Olympic Games Case Study The Olympic Games committee maintains an operational database that stores all matches, games, as well as the medal winners of the Olympic Games over the years (https://www.olympic.org/). We would like to build a data warehouse to analyze the medal counts, by each country, sport, and at which Olympic Games. There are two possible star schemas. Version-‐1 star schema contains 3 dimensions, whereas Version-‐2 star schema contains 4 dimensions. For simplicity, a very minimal number of attributes are included in each dimension. Version-‐1 star schema has CountryDIM, SportDIM, and OlympicGamesDIM as the dimensions. Three fact measures are included in the fact table, namely number of gold medals, silver medals, and bronze medals.
Version-1 Star Schema
2
Version-‐2 star schema has four dimensions – with an additional of Medal Type DIM (which is either Gold, Silver, or Bronze), but there is only one fact measure in the Fact table, which is Num of Medals.
Version-2 Star Schema with the Medal Type DIM
What is the difference between these two versions of star schema? Is Medal Type DIM a Determinant Dimension? In order to answer these questions, we need to visualize two-‐column tables for each dimension category. 2. Two-Column Tables for Version-1 Star Schema (without the Medal Type DIM) The first two-‐column table is from the Country point of view, which is as follows: Country Num of Gold Num of Silver Num of Bronze USA 733 602 488 China 199 143 133 Australia 167 170 189
3
These are the number of Gold, Silver and Bronze medals that these country got on all Olympic Games (several Olympic Games) recorded in the operational database. Note that this two-‐column table methodology is to help the data warehouse designer to visualize the view of the fact measures from each dimension. The second two-‐column table is from the Sport point of view. Assuming that in the operational database, it records 20 past Olympic Games, and at each Olympic, there is only one gold for 100m Butterfly Men, for instance. Sport Num of
Gold Num of Silver
Num of Bronze
Swimming 100m Butterfly Men 20 20 20 Swimming 400m Freestyle Women 20 20 20 Swimming 4x100m Medley Relay Men 20 20 20 The third two-‐column table is from the Olympic Name point of view. Olympic Name Num of
Gold Num of Silver
Num of Bronze
London 2012 302 304 356 Beijing 2008 302 303 353 Athens 2004 301 301 327 All the three two-‐column tables above seem to be reasonably correct. The first columns are the categories, while the other columns are the fact measures which are numeric and aggregate values. Because these three two-‐column tables make sense, we are confident that version-‐1 star schema is correct. 3. Two-Column Tables for Version-2 Star Schema (with the Medal Type DIM) The two-‐column tables for the first three dimensions, namely Country, Sport, and Olympic Names are as follows: Country Num of Medals USA 1823 China 475 Australia 526 Sport Num of Medals Swimming 100m Butterfly Men 60 Swimming 400m Freestyle Women 60 Swimming 4x100m Medley Relay Men 60
4
Olympic Name Num of Medals London 2012 962 Beijing 2008 958 Athens 2004 929 The question is whether these two-‐column tables make sense. If we look at the country, it makes sense to see how many medals Australia has received in all Olympic Games; the same with Sport, and Olympic Names. Finding how many medals (regardless the medal types) for each country, for each sport, and for each Olympic seems to be reasonable. The fourth two-‐column table for version-‐2 star schema is the Medal Type, which is as follows: Medal Type Num of Medals Gold 4115 Silver 4095 Bronze 4474 This two-‐column table on Medal Type seems to be reasonable too. Hence, version-‐2 star schema (with Medal Type DIM) is correct. So in conclusion, both star schemas (with or without Medal Type DIM) are correct. Now going back to the original question: Is Medal Type DIM a Determinant Dimension? 4. Determinant or Non-Determinant Dimensions A “Determinant Dimension” is a dimension that the fact measure relies on, and consequently, all data retrieval from the data warehouse must include this dimension. If the data retrieval from the data warehouse does not include this determinant dimension, the retrieval result will not make sense at all. In the previous case study on Petrol Price, Petrol Type DIM is a Determinant Dimension (refer to the star schema below). Note that a Determinant Dimension is denoted by a star. The fact measures affected by the determinant dimension are also starred.
5
The fact measures: Total Petrol Price, Num of Petrol Station, Min Petrol Price, and Max Petrol Price, depend on the Petrol Type, which is indicated by the Petrol Type DIM. That means analyzing the min petrol price from the day of week point of view, must include the petrol type. Otherwise, it doesn’t make sense to retrieve data to show that on Monday the lowest (min) petrol price is, for example, 109.90cents. As it does not indicate which petrol type it is, this lowest petrol price is meaningless. Therefore, a better data retrieval is to retrieve the record to show that for example, on Monday, the min “Unleaded” petrol price is 109.90cents (e.g. Unleaded is a petrol type obtained from the Petrol Type DIM). Now going back to the Olympic Games case study (refer to Version-‐2 star schema with Medal Type DIM). Is Medal Type DIM a Determinant Dimension? The answer to this question can be answered by another question. To retrieve the data from version-‐2 star schema, must we have the information from Medal Type DIM? The answer is clearly no, because we can simply retrieve a record from the fact to show that Australia in London 2012 Olympic Games received 10 medals in Swimming. This covers three dimensions, namely Country (Australia), Olympic Name (London 2012), and Sport (Swimming). In this example, Medal Type DIM is not involved, and the information retrieved still makes sense.
6
So the answer to the question whether Medal Type DIM is a Determinant Dimension or not, the answer is clearly No!! The next question is: what is the difference between the Olympic Games case study and the Petrol Price case study. Both are very similar, but the Olympic Games Medal Type DIM is not a determinant dimension, whereas the Petrol Type DIM is a determinant dimension. The answer is the aggregate function used in the fact measure. In the Olympic Games case study, the fact measure function is COUNT, which is count of medals. The breakdown of the medals is gold, silver, and bronze; but the main aggregate function of the fact measure is number of medals, which is a count. If the fact measure is a count, then the dimension (e.g. Medal Type DIM) is not a determinant dimension, because we can still analyze the fact measure which is the total medals from other dimensions, without the medal type dimension. On the other hand, the Petrol Price case study uses AVG, MIN, and MAX as the aggregation functions. Note that we do not store average as a fact measure, but total price and number of stations. These two fact measures will be used to calculate the average. Although the average is not explicitly stored in the fact, implicitly, the total price and number of stations represent the average. If the aggregate function to calculate the fact measures is not COUNT, then a determinant dimension is needed. In this case, Petrol Type DIM is hence a determinant dimension, because Min Petrol Price, for example, does not have any meaning without petrol type. 5. Version-1 (without Medal Type DIM) vs. version-2 (with Medal Type DIM) As both versions in the Olympic Games case study are correct, let’s compare and contract these two versions. In order to do this, let’s have a look at the records in the respective fact tables. The fact table for version-‐1 star schema (without Medal Type DIM) has 6 attributes: three from the dimensions, and the other three for the fact measures. The contents of the fact table are as follows: Fact (version-1 star schema) Country Sport Olympic Name Num of
Gold Num of Silver
Num of Bronze
USA Swimming London 2012 16 9 6 China Swimming London 2012 5 1 4 Australia Swimming London 2012 1 6 3 The fact table for version-‐2 star schema (with Medal Type DIM) consists of 5 columns: 4 from the dimension, but only one fact measure.
7
Fact (version-2 star schema) Country Sport Olympic Name Medal Type Num of Medals USA Swimming London 2012 Gold 16 USA Swimming London 2012 Silver 9 USA Swimming London 2012 Bronze 6 China Swimming London 2012 Gold 5 China Swimming London 2012 Silver 1 China Swimming London 2012 Bronze 4 Australia Swimming London 2012 Gold 1 Australia Swimming London 2012 Silver 6 Australia Swimming London 2012 Bronze 3 From the storage point of view, it is clear that version-‐1 is the winner. It has only 3 records, whereas in version-‐2, the same information is represented in 9 records. From the modeling point of view, some may prefer version-‐2, because the model is concise and more compact. When the number of fact measure is reasonably large (like in the Petrol Price case study), the star schema with a determinant dimension looks very slim and compact – hence, it is easy to understand. But consequently, the storage requirement increases as well. In contrast, with many different petrol types, if the star schema does not use a determinant dimension, the number of attributes in the fact will dramatically increase, and the schema looks more complex and crowded; but the storage cost is lower.
Oracle/Week-5-AVG-in-Fact.pdf
Page 1/5
Average in the Fact? Should we store an “average” fact measure in the fact table? No!!!!!!! Consider the following example. Suppose we have the following 16 records as our data source in the operational database. Note that there are 9 Semester One and 7 Semester Two records, respectively. Out of the 8 Database unit recodes, 6 of them are Semester One, and 2 of them are Semester Two. An Operational Database Ucode Utitle Semester Sfname Score
IT001 Database 1 Mirriam 81 IT001 Database 1 Allan 41 IT001 Database 1 Ben 74 IT001 Database 1 Kate 85 IT001 Database 1 Larry 87 IT001 Database 1 Leonard 75 IT001 Database 2 Juan 64 IT001 Database 2 Andy 32 IT002 Java 1 Ally 65 IT002 Java 1 Menson 47 IT002 Java 2 Mirriam 78 IT002 Java 2 Ben 73 IT002 Java 2 Larry 64 IT003 SAP 1 Ally 63 IT004 Network 2 Juan 53 IT004 Network 2 Menson 52 The star schema of the above operational database contains one fact and two dimensions. The dimensions are: Subject and Semester (One or Two); and the fact measure is Average Score.
The fact table aggregates these score records based on their dimensions, which are subject and semester. If we store Average Score in the fact table, this is how the fact table will look like:
Page 2/5
Fact Table UCode Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
The dimension tables look like as follows: SubjectDIM Table UCode UTitle
IT001 Database IT002 Java IT003 SAP
IT004 Network
SemesterDIM Table Semester
1 2 Looking at the Fact Table, the average score for the unit Database in Semester One is 73.833 (average of the first 6 score records); the average score for the unit Database in Semester Two is 48 ((64 + 32)/2). Is this fact table correct? It looks correct. But actually it is incorrect. For example, if we want to query the fact table to find out what is the average score of the Database unit, by looking at the above fact table, the answer would be (73.833+48)/2= 60.9165. The SQL to query the Fact Table is as follows:
Select Avg(Average_Score) From FactTable Where UCode = ‘IT001’;
Is this correct? No. In the Operational Database, there are 8 records for Database unit in Semester One and Two (see the first eight records in the operational database). If we sum all the score of these eight records and divided by eight records, the result will be 539/8=67.375; not 60.9165. Let’s do further comparisons:
Page 3/5
The average score for Java unit in Semester One and Two using the above fact is (56+71.667)/2=63.833. The actual average score for Java unit in Semester One and Two is not 63.833, but 65.4 (see the next 5 records in the above score list, and sum these scores and then divide by 5, 327/5). So again, the above fact table, which stores the average score, will not produce correct results. Ok, now let’s calculate further. The average score for Semester One using the above fact is (73.833+56+63)/3=64.278. In the above score list records, there are nine Semester One records, and the average is in fact 68.667. For Semester Two, using the above fact the average scare for Semester Two is (48+71.667+52.5)/3=57.389; whereas the actual average score for the seven Semester Two records is 59.4286. So, storing average as a fact measurement is not a good idea. How do we solve the above problems? In the fact table, we should store the “total score” and “number of students” in each aggregate group. Hence, the fact table should look like this. Fact Table 2 UCode Semester Total_Score NumberofStudents
IT001 1 443 6
IT001 2 96 2
IT002 1 112 2
IT002 2 215 3
IT003 1 63 1
IT004 2 105 2
Note that the dimension tables remain unchanged: SubjectDIM Table UCode UTitle
IT001 Database IT002 Java IT003 SAP
IT004 Network
SemesterDIM Table Semester
1 2 Using the correct fact table above, it is easy to calculate the average score of Database unit, which is (443+96)/(6+2)=67.375. The SQL to query the Correct Fact Table is as follows:
Page 4/5
Select Sum(Total_Score)/Sum(NumberofStudents) From FactTable2 Where UCode = ‘IT001’;
The correct star schema is then as follows:
Conclusion The problem of AVG in the Fact is known as the “Average of an Average” problem. This problem is well known in Mathematics. Average of an average will simply produce an incorrect average result (almost all the time). Hence, it is not desirable to have an average measure in the fact – unless the analysis ALWAYS uses all the dimensions. How about Min or Max in the Fact? Can we do it? – Yes we can. Because Max of Max is always a global max, and Min of Min is always a global min. For example, using the above sample data, assume we have Max_Score and Min_Score in the Fact, as follows: Fact Table 3 Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
SubjectDIM Table UCode UTitle
IT001 Database IT002 Java IT003 SAP
IT004 Network
Page 5/5
SemesterDIM Table Semester
1 2 (*Note: the dimension tables are unchanged) Assuming we want to get the Max_Score of IT001, then the max of {87, 64} will produce 87, and 87 is the maximum score of IT001, because 87 is the max in semester 1, which is greater than any max of IT001 (e.g. in semester 2). In other words, “Max of Max” is correct. The SQL to retrieve the maximum score of IT001 is as follows:
Select Max(Max_Score) From FactTable3 Where UCode = ‘IT001’;
The same applies to “Min of Min”. If we want to get the minimum score of IT001, the result will be 32, which is the minimum between 41 and 32.
Select Min(Min_Score) From FactTable3 Where UCode = ‘IT001’;
We certainly don’t want to mix between min and max. For example, retrieving the minimum of Max_Score would be meaningless; the same as retrieving the maximum of Min_Score. As a final conclusion:
• Average in the fact is not desirable, although technically it satisfies the two criteria of the fact (e.g. must be a numerical and aggregate value)
• Min and Max in the fact can still be used, since min_score and max_score are valid fact measures (e.g. they are numerical and aggregated values)
• In general, count and sum are more common. Count is “number of”, and Sum is “total of”.