Oracle Advanced person Needed

profileHelpStudent13
Oracle.zip

Oracle/FIT5195-2-Star Schema.pdf

Week 2 – Star Schema

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Notations and Processes

1. Star Schema Notation

2. E/R Diagram Notation

3. Transformation Process (Case Study)

2. Two-Column Table Methodology

Recall – The Big Picture

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

Recall – Data Warehouse

▪ To address the drawback of operational database, and a need for decision-

making support data, data warehouse is needed.

▪ A data warehouse is a multi-dimensional view of databases, with

aggregates and pre-computed summaries.

➢ In many ways, it is basically doing aggregates in advance; that is exactly pre-

computation done at the design level, rather than at the query level.

Recall – Data Warehouse

Star Schema

▪ A Star Schema is a design representation of a multi-dimensional view. It is a

data modeling technique used to map multidimensional decision support

data into a relational database.

▪ The reason for the star schema’s development is that existing relational

modeling techniques: ER and normalization, did not yield a database

structure that served the advanced data analysis requirements well.

Star Schema Components

▪ There are Three main components of the Star Schema:

1. Facts

2. Dimensions

3. Attributes

Star Schema Components

1. Facts

Facts are numeric measurements (values) that represent a specific business aspect

or activity.

For example, sales figures are numeric measurements that represent product and/or

service sales.

2. Dimensions

Dimensions are qualifying characteristics that provide additional perspectives to a

given fact.

For example, sales might be viewed from specific dimension(s), such as sales

location, sales period, sales product, etc.

Star Schema Notation

▪ A Sales Star Schema

➢ Fact:

• Sales

➢ Dimensions:

• Time

• Product

• Branch

▪ Notation-wise, the Fact uses a bolder line, to differentiate between Fact

from Dimensions.

Star Schema Notation

▪ A Sales Star Schema

➢ Fact:

• Sales

➢ Dimensions:

• Time

• Product

• Branch

▪ The lines that represent a relationship

between the fact and dimensions can be

straight lines or bended lines.

Star Schema Notation

▪ Using the star schema notation, the

number of dimensions can be unlimited.

▪ If there is more dimensions, then we just

add more dimensions linked to the Fact.

Star Schema Components

3. Attributes

Each dimension table contains attributes.

For example:

Product dimension: Prod Type,

Description.

Location dimension: Region,

State,

City.

Time dimension: Year,

Month.

Star Schema Notation

Star Schema Notation

▪ Sales Star Schema

(b) Sales Star Schema complete with the Attributes(a) Outline star schema for Sales

E/R Diagram Notation

E/R Diagram Notation

(a) An Entity in E/R Diagram

(b) Relationships in E/R Diagram

E/R Diagram Notation

E/R Diagram Notation

Associative

Relationship

E/R Diagram Notation

Non-Associative

Relationship

Transformation Process

Transformation Process

Transformation Process

Case Study #1

Case Study #1: International College

The admission office handles enrolment, payment, and marketing campaigns to international

students, often through educational agents located overseas. This admission office has an

operational system that maintains all the details of international students enrolled in the College.

Payment details are also handled by this office. Basically, the operational system has the following

features:

▪ Every student details are kept in the database. This includes the courses that the students enroll.

▪ As the College is a multi-campus university, some courses are offered in a different campus. The

admission office handles international students of all campuses.

▪ Some international students coming to the College are handled by an educational agent. This is

particularly common for the first course that a student enrolls in. Subsequent courses are not

normally handled by an agent, because the students themselves deal directly with the College.

▪ International students pay tuition fees several times (normally once every semester) for each

course they are doing.

Case Study #1: International College

The College now requires a data warehouse for analysis

purposes. The analysis is needed for identifying at least the

following questions:

1. How many students come from certain countries?

2. What is the total income for certain postgraduate

courses?

3. How many students are handled by certain agents?

4. How the number of enrolment of courses fluctuates

across the year?

Case Study #1: International College

▪ College Star Schema

➢ Fact:

• Number of Students

• Total Income

➢ Dimensions:

• Country

• Agent

• Course

• Year

Transformation Process

Case Study #2

Case Study #2: Sales

▪ Suppose that we would like to analyze

Total Sales from various point of

views, such as Quarter, Branch, and

Product Category.

Case Study #2: Sales

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Time

• Branch

• Product Category

Two Column Table Methodology

Two Column Table Methodology

When creating a star schema, you need to

imagine that the data you want to analyse

consists of two columns.

The first column is the category (e.g. A, C,

D, E), and the second column is the

statistical numerical figure (e.g. B).

The second column (e.g. B) has to be

consistent throughout all the two-column

tables.

One Fact Measurement:

Two Column Table Methodology

▪ Case Study 1: Analysis of Accountants

Suppose the CPA organization would like to analyze its members (i.e.

accountants) in a particular city. Assume that the organization has the full

details of its members.

Education Number of Accountants

Diploma 84953

Bachelor 349203

Higher Degree 98943

Others 2322

Two Column Table Methodology

▪ We can also look at the figures from the gender point of view, like:

▪ Another way to analyze number of accountants is form the type of the accountant job

itself; something like:

▪ Note that the figures are fictitious, and the “Types” of Accountants (indicating different

roles of accountants) are also fictitious.

Gender Number of Accountants

Male 434322

Female 89932

Type Number of Accountants

Government 3843

Private Business 45303

Personal 45930

etc

etc

Two Column Table Methodology

▪ You can further identify other example to analyze number of accountants. In

the above three tables, the first angle to look at the number of accountants

is from the educational background, the last one is from the type of the

accountant itself, whether it is a private business accountant, etc.

▪ As you can see, the second column is CONSISTENTLY UNIFORM. In the

above example, it is number of accountants. The first column changes

depending on from which angle that you want to see.

Two Column Table Methodology

▪ Therefore, in this case study, the star schema could look like the following:

Two Column Table Methodology

The second column in the two-column

tables, which is the numerical fact

measurement (e.g. column B) can

actually be multiple columns (call them:

B1, B2, B3), as long as all of these

columns (e.g. B1, B2, B3) relate to all of

the categories (e.g. A, C, D, E).

Multiple Fact Measurements:

Two Column Table Methodology

▪ Case Study 2: Student Enrollment

The University Administrator(s) needs to keep track of the number of

enrollment for particular unit or campus and the students’ performance

each year in order to maintain the University performance. The head of

admin has assigned you the task of developing a small Data Warehouse in

which to keep track the enrollment and performance statistics.

Two Column Table Methodology

▪ For example:

▪ Another example could be something like this:

Subject Number of Students Total Score

Database 8 539

Java 5 327

SAP 1 63

Network 2 105

Semester Number of Students Total Score

One 9 618

Two 7 416

F1 F2

F1 F2

Two Column Table Methodology

▪ In analyzing number of students (apart from the subject and semester as

shown above), you could also see the number of student from another

angle, for example from the campus and grade:

Campus Number of Students Total Score

Main 9 658

City 5 271

DE 2 105

F1 F2

Two Column Table Methodology

▪ For example:

Grade Number of Students Total Score

HD 3 253

D 4 300

C 4 256

P 2 105

N 3 120

F1 F2

Two Column Table Methodology

▪ The first columns of the above examples are the dimensions, whereas

the other columns that contain the statistical/summarized/aggregated

values is the fact.

▪ In the above example, the fact is then STUDENT_ENROLLMENT_FACT,

and the dimensions are SUBJECT, SEMESTER, GRADE and CAMPUS.

Two Column Table Methodology

▪ The star schema for the STUDENT ENROLLMENT is shown as follows:

End of Week 2 Lecture

Oracle/FIT5195-3-Bridge Table.pdf

Week 3 – Bridge Tables

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Bridge Tables

2. Temporary Tables

1. Temporary Dimension Tables

2. Temporary Tables in the Operational Database

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

Bridge Tables

Bridge Tables

▪ A bridge table is a table that links between two dimensions; and only one

of these two dimensions are linked to the fact.

➢ As a result, the star schema becomes a snowflake schema.

Bridge Tables

▪ Two reasons on why a dimension cannot be connected directly to the Fact: ▪

a) The Fact table has a fact measure, and the dimension has a key identity. In

order to connect a dimension to the Fact, the dimension’s key identity must

contribute directly to the calculation of the fact measure. Unfortunately, this

cannot happen if the operational database does not have this data.

b) The operational database does not have this data if the relationship between

two entities in the operational database that hold the information about

dimension’s key identity and the intended fact measure is a many-many

relationship.

Bridge Tables

Case Study #1

Case Study #1 – A Product Sales Case Study

▪ A company management team would like to

analyze the statistics of its product sales

history. The analysis is needed to identify

popular products, suppliers supplying those

products, the best time to purchase more

stock, etc.

▪ A small data warehouse is to be built to keep

track of the statistics.

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

Case Study #1 – A Product Sales Case Study

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

Case Study #1 – A Product Sales Case Study

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Product

• Customer locations/suburbs

• Time period

• Supplier

▪ Possible Two-Column Methodology Tables:

Case Study #1 – A Product Sales Case Study

ProductNo TotalSales

A1 $130,000

B2 $15,900

C3 $2,500,000

… …

TimeID TotalSales

201801 $25,000

201802 $4,700

201803 $3,500

… …

Suburb TotalSales

Caulfield $6,500

Chadstone $12,000

Clayton $1,800

… …

(a) Product point of view (b) Time point of view (c) Suburb point of view

Case Study #1 – A Product Sales Case Study

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Product

• Customer locations/suburbs

• Time period

• Supplier

Case Study #1 – A Product Sales Case Study

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Supplier point of view

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Supplier point of view

Case Study #1 – A Product Sales Case Study

Case Study #1 – A Product Sales Case Study

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Case Study #1 – A Product Sales Case Study

Case Study #1 – A Product Sales Case Study

Bridge Table

▪ To create Time Dimension: - create table TimeDim as

select

distinct to_char(SalesDate, ’YYYYMM’) as TimeID,

to_char(SalesDate, ’YYYY’) as Year,

to_char(SalesDate, ’MM’) as Month

from Sales;

▪ To create Customer Location Dimension: - create table CustLocDim as

select distinct Suburb, Postcode

from Customer;

Case Study #1 – A Product Sales Case Study

▪ To create Product Dimension: - create table ProductDim as

select distinct ProductNo, ProductName

from Product;

▪ To create Bridge Table: - create table ProductSupplierBridge as

select *

from StockSupplier;

▪ To create Supplier Dimension: - create table SupplierDim as

select SupplierID, Name as SupplierName

from Supplier;

Case Study #1 – A Product Sales Case Study

▪ To create Fact Table: - create table ProductSalesFact as

Select

to_char(S.SalesDate, ’YYYYMM’) as TimeID,

P.ProductNo,

C.Suburb,

sum(SI.QtySold*P.Price) as TotalSales

from Sales S, Product P, Customer C, SalesItem SI

where S.SalesNo = SI.SalesNo

and SI.ProductNo= P.ProductNo

and C.CustomerID = S.CustomerID

group by

to_char(S.SalesDate, ’YYYYMM’), P.ProductNo, C.Suburb;

Case Study #1 – A Product Sales Case Study

Bridge Tables

Case Study #2

Case Study #2 – A Truck Delivery Case Study

▪ A trucking company is responsible for picking up goods from warehouses of a retail chain

company, and delivering the goods to individual retail stores.

▪ A truck carry goods during a single trip, which is

identified by TripID, and delivers these goods to

multiple stores. Trucks have different capacities

for both the volumes they can hold and the

weights they can carry.

▪ At the moment, a truck makes several trips each

week. An operational database is being used to

keep track the deliveries, including the scheduling

of trucks, which provide timely deliveries to

stores.

Case Study #2 – A Truck Delivery Case Study

▪ A trip may pick up goods from many

warehouses

o i.e. a many-many relationship between

Warehouse and Trip

▪ A trip uses one truck only, and a truck may

have many trips in the history

o i.e. a many-1 relationship between Trip and

Truck

▪ A trip delivers goods (e.g. TVs, fridges, etc)

potentially to several stores

o a many-many relationship between Trip and

Store, which is represented by the Destination

table

▪ Sample data in the operational database:

Case Study #2 – A Truck Delivery Case Study

WarehouseID Location

W1 Warehouse1

W2 Warehouse1

W3 Warehouse1

… …

(d) Truck Table

(b) Trip Table (c) TripFrom Table

TripID Date TotalKm TruckID

Trip1 14-Apr-2018 370 Truck1

Trip2 14-Apr-2018 570 Truck2

Trip3 14-Apr-2018 250 Truck3

Trip4 15-Jul-2018 450 Truck1

… … … …

TripID WarehouseID

Trip1 W1

Trip1 W2

Trip1 W3

Trip2 W1

Trip2 W2

… …

TruckID VolCapacity WeightCategory CostPerKm

Truck1 250 Medium $1.20

Truck2 300 Medium $1.50

Truck3 100 Small $0.80

Truck4 550 Large $2.30

Truck5 650 Large $2.50

… … … …

StoreID StoreName Address

M1 MyStore City Melbourne

M2 MyStore Chaddy Chadstone

M3 MyStore HiPoint High Point

M4 MyStore Donc Doncaster

M5 MyStore North Northland

M6 MyStore South Southland

M7 MyStore East Eastland

M8 MyStore Knox Knox

… …

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(a) Warehouse Table

(e) Store Table (f) Destination Table

Case Study #2 – A Truck Delivery Case Study

▪ The management of this trucking company would like to analyze the deliver cost, based on

trucks, time period, and store.

Case Study #2 – A Truck Delivery Case Study

▪ Sales Star Schema

➢ Fact:

• Total Delivery Cost

(distance * cost per kilometre)

➢ Dimensions:

• Truck

• Time period

• Store

Case Study #2 – A Truck Delivery Case Study

▪ From the Truck point of view, Truck1 has two trips (e.g. Trip1 and Trip4), with the total kilometres of

820km (370km + 450km). The cost for Truck1 is $1.20. Hence, calculating the cost for Truck1 is

straightforward. Other trucks can be calculated this way.

▪ From the Period point of view, 14-Apr-2018 has three trips (e.g. Trip1,Trip2, and Trip3). Trip1 (370km) is

delivered by Truck1 which costs $1.20/km. Trip2 and Trip 3, on the same day, can be calculated the same

way. Hence, on 14-Apr-2018, the total cost can be calculated.

▪ From the Store point of view; The cost is calculated

based on Trip, but a trip delivers goods to many

stores. Therefore, the delivery cost for each store

cannot be calculated. The delivery cost is for the trip –

not for the store.

Case Study #2 – A Truck Delivery Case Study

Solution Model 1 – Using a Bridge Table

Case Study #2 – A Truck Delivery Case Study

Solution Model 2 – add a Weight Factor attribute

A weight factor is only needed if we want to estimate

the contribution that a dimension made to the fact

Case Study #2 – A Truck Delivery Case Study

StoreID StoreName Address

M1 MyStore City Melbourne

M2 MyStore Chaddy Chadstone

M3 MyStore HiPoint High Point

M4 MyStore Donc Doncaster

M5 MyStore North Northland

M6 MyStore South Southland

M7 MyStore East Eastland

M8 MyStore Knox Knox

… …

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(c) Store Table(b) Bridge Table

(a) Trip Dimension Table

▪ To create Trip Dimension: - create table TripDim2 as

select T.TripID, T.TripDate, T.TotalKm,

1.0/count(*) as WeightFactor

from Trip T, Destination D

where T.TripID = D.TripID

group by T.tripid, T.tripdate, T.totalkm;

Case Study #2 – A Truck Delivery Case Study

Case Study #2 – A Truck Delivery Case Study

Solution Model 3 – a List Aggregate version

Case Study #2 – A Truck Delivery Case Study

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(b) Bridge Table

(a) Trip Dimension Table

listagg (Attr1, ‘_’) within group

(order by Attr1) as ColumnName

▪ To create Trip Dimension: - create table TripDim3 as

select T.TripID, T.TripDate, T.TotalKm,

1.0/count(D.StoreID) as WeightFactor,

listagg (D.StoreID, ’_’) within group

(order by D.StoreID) as StoreGroupList

from Trip T, Destination D

where T.TripID = D.TripID

group by T.TripID, T.TripDate, T.TotalKm;

Case Study #2 – A Truck Delivery Case Study

▪ Joining based on the StoreGroupList attribute in the Trip dimension table and the StoreID in the Store dimension table:

- select *

from TripDim3 T, StoreDim3 S

where T.StoreGroupList like ’%’||S.StoreID||’%’;

▪ Without the StoreGroupList attribute in the Trip dimension, we need to join three tables:

- select *

from TripDim3 T, BridgeTable3 B, StoreDim3 S

where T.TripID = B.TripID

and B.StoreID = S.StoreID;

Case Study #2 – A Truck Delivery Case Study

Bridge Tables

Summary

Bridge Tables – Summary

▪ In principal, a Bridge Table is used: ▪

a) When it is impossible to have a dimension connected directed to the Fact table,

because simply there is no relationship between this dimension and the Fact table

(e.g. in the Product Sales case study, it is impossible to have a direct link from

SupplierDim to ProductSalesFact)

b) When an entity (which will become a dimension) has a many-many relationship with

another entity (dimension) in the E/R schema of the operational database (e.g.

Supplier and Stock has a many-many relationship).

c) When temporality aspect (data history) is maintained in the operational database

and the bridge table can be used to accommodate the dimension that has temporal

attributes (e.g. product supply history is maintained in the second snowflake schema

example).

Bridge Tables – Summary

▪ When a Bridge Table is used in the schema, there are two additional options: ▪

a) A Weight Factor is used to estimate the contribution of a dimension in the calculation

of the fact measure. Because this is only an estimate, a weight factor is option.

b) Every snowflake schema (whether it has Weight Factor or not) can be implemented

in two ways: a List Aggregate version, and a non-List Aggregate version.

Temporary Dimension Tables

43

1. Direct copying from a relational table in the operational database

2. Selecting certain attributes (or records) from a relational table in the

operational database

3. Creating the dimension table from scratch

– Insert records into the newly created dimension table

– Suitable for small dimension table

– Temporary dimension table is needed in circumstances where a new

dimension table cannot be created directly

Ways to Create Dimension Tables

44

Temporary Dimension Tables - Sales Case Study

An E/R Diagram for the

Sales Case Study

A Star Schema for the Sales Case Study

QuarterID - ‘YYYYQ’ is a unique

combination between year and quarter,

where Q is between 1 and 4

45

▪ TimeDIM has to be created manually

▪ Questions:

Can we insert the records into TimeDIM

one by one?

▪ If yes, how many records to be

inserted?

▪ If no, why?

Temporary Dimension Tables - Sales Case Study (cont.)

46

Solution for TimeDIM:

1. Create a temporary Time Dimension table that contains Month and Year

2. Add with a new attribute, called QuarterID.

3. Perform a series of updates to fill in the QuarterID column with the correct

values

4. Create the final TimeDIM table

Temporary Dimension Tables - Sales Case Study (cont.)

47

Temporary Dimension Tables - Sales Case Study (cont.)

1

2

3

Exercise:

Construct the SQL command for Step 4

48

Temporary fact table is required

because TimeDIM is not directly copied

from the operational database

Temporary Fact Tables - Sales Case Study

1

49

Temporary Fact Tables - Sales Case Study (cont.)

2 3

4

50

Temporary Tables in Operational Database

Total_Sales = SUM(TotalPrice)

An E/R Diagram for the

Sales Case Study

A Star Schema for the Sales Case Study

51

Temporary Tables in Operational Database - Case Study

An E/R Diagram for the

Sessional Contract Jobs

52

Temporary Tables in Operational Database - Case Study (cont.)

A Star Schema for the

Sessional Contract Jobs

Last Degree of the

Employee

53

▪ One employee may have multiple

degrees.

▪ Most of the work requires a Bachelor

degree, to allow current Master

students to work as a teaching

assistant or as a programmer

▪ Current PhD students who have a

Master degree often work as a

teaching assistant too

▪ BUT, only the last degree is required

in DigreeDIM table.

Temporary Tables in Operational Database - Case Study (cont.)

54

Temporary Tables - EmployeeTemp Table

Retrieves the last degree of the

employees

55

Sample Records for Employee and Emp_Degree Table

Employee

Emp_Degree

56

RANK() OVER Function

Retrieved for

EmployeeTemp

table.

57

Create Fact Table from EmployeeTemp Table

58

Temporary Dimension table

Needed when it is not possible to create a dimension table directly; need to

apply a series of additional steps when creating a dimension table through the

creation of an intermediate table

Temporary table in operational database

Tables in the operational database are not readily used to create the Fact table.

Need transformation phase, where an intermediate table, called the Temporary

Operational Database Table is created.

Summary

Oracle/FIT5195-4-Snowflake Schema.pdf

Week 4 – Snowflake Schema

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Hierarchies

1. Hierarchy vs. Non-Hierarchy

2. Hierarchy versus Multiple Independent Dimensions

3. Linked Dimensions

4. Hierarchy Design Considerations

2. Determinant Dimensions

1. Determinant vs. Non-Determinant Dimensions

2. Determinant Dimension vs. Pivoted Fact Table

3. Non-Type as a Determinant Dimension

4. Multiple Relationship between a Dimension and the Fact

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

Hierarchies

Hierarchies

▪ A Hierarchy is formed when a dimension is broken down to two or

more dimensions in a hierarchical manner.

➢ As a result, the star schema becomes a snowflake schema.

(a) A simple Star Schema

(b) A Snowflake with Hierarchy

Hierarchies

(a) A Snowflake with a Bridge Table (b) A Snowflake with Hierarchy

Hierarchies

1. Hierarchy versus Non-Hierarchy Dimensions

2. Hierarchy versus Multiple Independent Dimensions

3. Linked Dimensions

4. Hierarchy Design Considerations

1. Hierarchy vs. Non-Hierarchy

1. Hierarchy vs. Non-Hierarchy

1. Hierarchy vs. Non-Hierarchy

HierarchyNon-Hierarchy

1. Hierarchy vs. Non-Hierarchy

CityID City Name State CountryID

MEL Melbourne Victoria AU

SYD Sydney New South Wales AU

KUL Kuala Lumpur Selangor MA

JNB Johannesburg Johannesburg SA

CampusID CampusName Address Suburb Postcode CityID

CL Clayton Campus Wellington Rd Clayton 3800MEL

CA Caulfield Campus Dandenong Rd Caulfield East 3145MEL

PA Parkville Campus Royal Parade Parkville 3052MEL

SY Sydney Campus Opera Boulevard Sydney 2001SYD

MUM Malaysia Campus Jalan Lagoon Bandar Sunway 47500KUL

MSA South Africa Campus Peter St Johannesburg 1725JNB

(a) Campus Dimension Table

CountryID Country Name

AU Australia

MA Malaysia

SA South Africa

(b) City Dimension Table (c) Country Dimension Table

1. Hierarchy vs. Non-Hierarchy

(a) Correct Hierarchy (b) A Wrong Hierarchy

1. Hierarchy vs. Non-Hierarchy

1. One table vs. many tables

The consequence is in the join query processing when producing reports.

2. Normalized vs. Un-normalized

With the hierarchy option, the tables are normalized, which follows the relational

model. In contrast, the non-hierarchy option, the table (e.g. the Campus

Dimension table) is unnormalized, which is basically 1NF with visible replication

on information, which is prone to anomalies (e.g. insert, update, and delete

anomalies).

3. Drilling down and rolling up

The hierarchy model is not about drilling down information exploration.

1. Hierarchy vs. Non-Hierarchy – Summary

▪ From the query point of view, both versions need two queries.

▪ From the query processing point of view, the non-hierarchy version uses

one join operation only, because it only needs to join the fact and one

dimension.

▪ From the conceptual point of view, the hierarchy model does not actually

offer a better roll up or drill down features.

2. Hierarchy vs. Multiple Independent Dimensions

2. Hierarchy vs. Multiple Independent Dimensions

(a) Separate Dimensions (b) Combined Dimensions

(c) Hierarchy Dimensions

2.1. Separate vs Combined Dimension

(a) Separate Dimensions (b) Combined Dimensions

2.1. Separate vs Combined Dimension

(a) Separate Dimension Model – Campus Dimension Table

(b) Separate Dimension Model –

Country Dimension Table

(c) Separate Dimension Model – Fact Table

2.1. Separate vs Combined Dimension

(d) Combined Dimension Model – Campus Dimension Table

(e) Combined Dimension Model

– Fact Table

2.1. Separate vs Combined Dimension (c) Separate Dimension Model – Fact Table (e) Combined Dimension Model – Fact Table

2.1. Separate vs Combined Dimension

(a) Separate Dimension Model – Campus Dimension Table (b) Separate Dimension Model –

Country Dimension Table

(d) Combined Dimension Model – Campus Dimension Table

2.1. Separate vs Combined Dimension

(a) Separate Dimensions (b) Combined Dimensions

2.2. Combined Dimension vs. Hierarchy

(b) Combined Dimensions (c) Hierarchy Dimensions

2.2. Combined Dimension vs. Hierarchy

(a) Hierarchy Model – Campus Dimension Table

(c) Hierarchy Model – Fact Table

(b) Hierarchy Model – Country Dimension Table

2.2. Combined Dimension vs. Hierarchy

▪ Reasons on why the Hierarchy model is not ideal:

➢ An efficient query processing to query the fact and dimensions,

➢ The identifier of the Campus Dimension (e.g. CampusID) already covers the

child dimension (e.g. Country Dimension), and

➢ Campus and Country information is often regarded as one entity; at least in this

case study.

(b) Combined Dimensions (c) Hierarchy Dimensions

3. Linked Dimensions

3. Linked Dimensions

(a) Linked Dimension Model – Campus Dimension Table

(c) Linked Dimension Model – Fact Table

(b) Linked Dimension Model – Country Dimension Table

4. Hierarchy Design Considerations

Hierarchy – Summary

▪ A dimension hierarchy is connecting two or more dimensions in a hierarchical

manner, using a many-1 relationship.

➢ As a result, the dimensions in a hierarchy are normalized, in 3NF, using the

context of Relational Database Design.

▪ Compare and contrasts five different models:

1. Separate Dimension model

2. Combined Dimension model

3. Hierarchy model

4. Linked Dimension model

5. Hierarchy Design Considerations

Determinant Dimensions

31

Dimensions and Fact Tables - Recap

A Star Schema for the Sales Case Study

32

Example #1:

Querying Fact Table

33

Example #2:

Querying Fact Table (cont.)

34

A star schema may have a special dimension

all data retrieval from this star schema must use this dimension

else, the retrieved data will become meaningless.

Determinant Dimensions

Determinant

_DIM

Determinant_

Att

..

..

Special dimension

Key identifier

35

Introducing a Determinant Dimension – Petrol Station Case Study

Sample operational database of a company about petrol prices from all petrols

stations in Victoria:

36

The requirements for the data warehouse

are to answer the questions related to

(a) average petrol price,

(b) minimum petrol price, and

(c) maximum petrol price

Petrol Station - Star Schema Version-1

Chosen Dimensions:

(i) day of week,

(ii) suburb, and

(iii) petrol company.

Two-column Table for Petrol Prices based on Day of Week

Category Fact Measures

37

Minimum price for which petrol type?

Petrol Station - Star Schema Version-1 (cont.)

38

▪ Not all petrol stations sell all types of petrol, e.g. Premium 98.

▪ Calculating the average price for Premium 98 on Monday cannot be:

Total_Petrol_Price_P98 / Number_of_Petrol_Station

▪ The number of petrol stations is different for different petrol type!

Petrol Station - Star Schema Version-1 (cont.)

39

▪ Solution: Split the Number of Petrol Station column into six columns – one

for each petrol type.

▪ Question: How many columns in total?

Petrol Station - Star Schema Version-1 (cont.)

Two-column

Table for Petrol

Prices based

on Suburb Two-column Table for Petrol Prices based on Company

40

Petrol Station - Star Schema Version-1 (cont.)

41

Petrol Station - Star Schema Version-2

42

Petrol Station - Star Schema Version-2

Min. price for

which petrol

type??

43

select DayofWeek,

Min_Petrol_Price

from PetrolFact

where DayofWeek = ’Monday’

and PetrolType = ’Unleaded’;

Petrol Station - Star Schema Version-2

select DayofWeek, PetrolType,

Min_Petrol_Price

from PetrolFact

where DayofWeek = ’Monday’;

44

Petrol Station - Star Schema Version-2

Determinant Dimension uses dotted box in the star schema.

The new star schema with a Determinant Dimension for Petrol Case Study

Determinant vs. Non-

Determinant

Dimensions

46

The Olympic Games Case Study

▪ The Olympic Games committee maintains an operational database that

stores all matches, games, as well as the medal winners of the Olympic

Games over the years

▪ Let’s build a data warehouse to analyse the medal counts, by each country,

sport, and at which Olympic Games

Determinant vs. Non-Determinant Dimensions

47

The Olympic Games Case Study - The Star Schemas

Version-1

Version-2

The questions are:

1. What is the difference

between these two

versions of star schema?

2. Is Medal Type Dimension

a Determinant

Dimension?

48

The Olympic Games Case Study - The Star Schema Version-1

Two-column tables

(pivot tables) for each

dimension category in

star schema version-1

49

The Olympic Games Case Study - The Star Schema Version-2

50

▪ Is MedalTypeDIM a determinant dimension?

– To retrieve the data from version-2 star schema, must we have the

information from Medal Type Dimension?

Determinant vs. Non-Determinant Dimensions

Version-2

COUNT = total medals from

other dimensions, without the medal type dimension)

51

Fact Table for Version-1

The Olympic Games Case Study - The Star Schema Version-1 vs Version-2

Fact Table for Version-2

52

The Olympic Games Case Study - The Star Schema Version-1 vs Version-2

Differenc

es

Version 1 Version 2 (with

MedalTypeDIM)

Storage Lower with only three records Nine records

Modelling The schema looks more

complex and crowded; but the

storage cost is lower

More concise and compact with

less number of measures in the

fact, easier to understand

Query

Processin

g

Less join processing between

dimension tables and the fact

table.

Requires additional join with

MedalTypeDIM

Determinant

Dimensions vs.

Pivoted Fact Table

54

The PTE Academic Test Case Study ▪ PTE Test consists of four main components: Listening, Reading, Speaking, and

Writing

▪ There are many test venues (all major cities in Australia, and overseas as well), each

country may have a different price setting

▪ A student taking a PTE test will have one score for each of these four components,

as well as one overall score

▪ The score is a numerical score, ranging from 10 to 90

▪ The results will normally be released five days after the actual test date

▪ A student may take a PTE Academic Test several times, in order to improve the

score

▪ A student is identified by his/her RegistrationID, which stays the same student stays

the same

Determinant Dimensions vs. Pivoted Fact Table

55

ERD for PTE Academic Test

56

The PTE Academic Test Case Study - Sample Records for 8 Students

57

The PTE Academic Test Case Study - Sample Records for 8 Students

58

The PTE Academic Test Case Study - Star Schema with Determinant Dim.

59

To implement the star schema in SQL:

create table CountryVenueDim

as

select distinct C.CountryCode, C.CountryName, T.TestPrice

from Test_Venue T, Country C

where T.CountryCode = C.CountryCode;

create table CitizenshipDim

as

select distinct C.CountryCode as Citizenship, C.CountryName

from Student S, Country C

where S.Citizenship = C.CountryCode;

create table YearDim

as

select distinct to_char(TestDate, ’YYYY’) as Year from Test;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

60

create table GradeDim (

Grade varchar2(3),

Description varchar2(20),

MinScore number,

MaxScore number

);

insert into GradeDim values (’4.5’, ’Functional’, 30, 35);

insert into GradeDim values (’5’, ’Vocational’, 36, 49);

insert into GradeDim values (’6’, ’Competent’, 50, 64);

insert into GradeDim values (’7’, ’Proficient’, 65, 78);

insert into GradeDim values (’8-9’, ’Superior’, 79, 90);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Superior

61

create table TestComponentDim (

TestComponent varchar2(20)

);

insert into TestComponentDim values

(’Listening’);

insert into TestComponentDim values

(’Reading’);

insert into TestComponentDim values

(’Writing’);

insert into TestComponentDim values

(’Speaking’);

insert into TestComponentDim values

(’Overall’);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Create temporary fact table:

create table TempFact

as

select TV.CountryCode, S.Citizenship,

to_char(T.TestDate, ’YYYY’) as Year,

TR.ListeningScore, TR.ReadingScore,

TR.WritingScore, TR.SpeakingScore,

TR.OverallScore, TR.RegistrationID

from Test_Venue TV, Test T, Student S,

Test_Result TR

where TV.VenueID = T.VenueID

and T.TestNo = TR.TestNo

and TR.RegistrationID = S.RegistrationID;

62

alter table TempFact

add (

GradeOverall varchar2(3),

GradeListening varchar2(3),

GradeReading varchar2(3),

GradeWriting varchar2(3),

GradeSpeaking varchar2(3)

);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

update TempFact

set GradeOverall = (

case

when OverallScore >= 30 and

OverallScore <= 35 then ’4.5’

when OverallScore >= 36 and

OverallScore <= 49 then ’5’

when OverallScore >= 50 and

OverallScore <= 64 then ’6’

when OverallScore >= 65 and

OverallScore <= 78 then ’7’

when OverallScore >= 79 and

OverallScore <= 90 then ’8-9’

end);

Repeat for

GradeListening, GradeReading,

GradeWriting, and GradeSpeaking

Superior

63

▪ The TempFact table has the correct grades for each student.

▪ The problem now is to breakdown the grades for a student into multiple test

components – Create another temporary fact for each of the test components, eg. a temporary

fact called ’OverallFact’ for the Overall Score, and so on so forth

create table OverallFact

as

select CountryCode, Citizenship, Year,

GradeOverall As Grade, ’Overall’ as TestComponent,

count(RegistrationID) as Total_Students_Overall

from TempFact

group by CountryCode, Citizenship, Year, GradeOverall, ’Overall’;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Repeat for

GradeListening, GradeReading, GradeWriting, and GradeSpeaking

64

▪ There were eleven students taking the test which are now grouped into nine

records in Table OverallFact

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

65

▪ There are also three other fact tables: ReadingFact, WritingFact, and

SpeakingFact, with identical table structures

▪ These five fact tables can now be ’combined’ to form one final fact table

using a union operator: create table FinalFact as

select CountryCode, Citizenship, Year, Grade, TestComponent,

Total_Students_Overall as Total_Students

from OverallFact

union select * from ListeningFact

union select * from ReadingFact

union select * from WritingFact

union select * from SpeakingFact;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

66

▪ The final fact table contains 45

records

▪ Conclusion: – When the input record in the

operational database contains the

scores of all the test components

as one record, it will be

challenging to break down the

records into multiple records in

the final fact table, because there

is a Test Component dimension.

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

67

The PTE Academic Test - Star Schema without Determinant Dim.

● Fact measure for all the test components

● This means that the fact table is a Pivoted Fact Table ○ Shifts the Test Component into the fact measure;

○ The Determinant Dimension - TestComponentDim is then removed from the star schema

68

▪ How many records do we have in fact table with five fact measures?

▪ However, since our test result data is incomplete, we will have problems

with the final fact, because in the non-determinant dimension version, we

need to keep track the zeroes – Even if the data is the test result is a lot, but still does not cover all possible

combinations from the dimension!

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

Answer:

60 records in the final fact table

= 2 countries of test venues * 6 countries of citizenship * 1 year * 5 grades

69

▪ ‘Superior’ grade is only available for Listening and Reading scores

▪ Using the normal fact table, there will not be entry in the fact table for Superior-Speaking, and for

Superior-Writing;

▪ There will not be zero value in TotalStudents, where the Grade column is ’Superior’ and the Test

Component column is either ’Speaking’ or ’Writing’ → less than 60 rows of records in the fact table

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

Superior

70

Solution: Step1

Get all possible combinations from all dimensions using Cartesian product

between all dimensions:

create table AllDimensions

as

select CO.CountryCode, CI.Citizenship, Y.Year, G.Grade

from CountryVenueDim CO,CitizenshipDim CI,

YearDim Y, GradeDim G;

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

71

Solution: Step 2

Re-create five temporary fact tables, one for each test component, using Outer

Join operation between AllDimensions table and each of the temporary fact

tables created in the previous section (see slide #35).

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

create table OverallFactNew as

select A.CountryCode, A.Citizenship,

A.Year, A.Grade,

nvl(O.Total_Students_Overall, 0)

as Total_Students_Overall

from AllDimensions A, OverallFact O

where A.CountryCode = O.CountryCode(+)

and A.Citizenship = O.Citizenship(+)

and A.Year = O.Year(+)

and A.Grade = O.Grade(+);

create table ListeningFactNew as

select A.CountryCode, A.Citizenship,

A.Year, A.Grade,

nvl(O.Total_Students_Listening, 0)

as Total_Students_Listening

from AllDimensions A, ListeningFact O where

A.CountryCode = O.CountryCode(+)

and A.Citizenship = O.Citizenship(+)

and A.Year = O.Year(+)

and A.Grade = O.Grade(+);

create table ReadingFactNew as select ...;

create table WritingFactNew as select ...;

create table SpeakingFactNew as select ...;

72

Solution: Step 3

Create the final fact table using the join operation

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

create table FinalFact2 as

select O.CountryCode, O.Citizenship, O.Year, O.Grade, O.Total_Students_Overall, L.Total_Students_Listening,

R.Total_Students_Reading, W.Total_Students_Writing, S.Total_Students_Speaking

from OverallFactNew O, ListeningFactNew L, ReadingFactNew R, WritingFactNew W, SpeakingFactNew S

where O.CountryCode = L.CountryCode

and L.CountryCode = R.CountryCode and R.CountryCode = W.CountryCode

and W.CountryCode = S.CountryCode and O.Citizenship = L.Citizenship

and L.Citizenship = R.Citizenship and R.Citizenship = W.Citizenship and W.Citizenship = S.Citizenship

and O.Year = L.Year and L.Year = R.Year and R.Year = W.Year and W.Year = S.Year

and O.Grade = L.Grade and L.Grade = R.Grade and R.Grade = W.Grade and W.Grade = S.Grade;

73

Solution: Step 4

Because our sample data is small, there are many records with zeroes values in

the Total Students columns. Delete records where Total Students in all of the

five test components are equal to zero.

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

delete from FinalFact2

where Total_Students_Overall = 0

and Total_Students_Listening = 0

and Total_Students_Reading = 0

and Total_Students_Writing = 0

and Total_Students_Speaking = 0;

Non-Type as a

Determinant

Dimension

75

▪ Type Dimension maybe a candidate for a Determinant Dimension:

– PetroTypeDim is a Determinant Dimension in the Petrol Price case

study,

– MedalTypeDim in the Olympic Games case study is a not a

Determinant Dimension

▪ Type Dimension is a good candidate for a Determinant Dimension, but

needs further examination to check if it is really a Determinant Dimension

▪ Other dimensions - non-type dimension can potentially be a Determinant

Dimension

Non-Type as a Determinant Dimension

76

▪ The Class Type Dimension in

this example is not a

Determinant Dimension

because we could get Number

of Students without referring to

any of the class types. ▪ E.g. retrieving number of

students in the Science

Faculty last year is certainly a

useful information, even

without Class Type

information.

Non-Type as a Determinant Dimension – University Enrolment Case Study

Type Dimension

77

Total student of

Science Faculty

Non-Type as a Determinant Dimension – University Enrolment Case Study

Total student of

Science Faculty

last year

vs.

The information is

dependent on the YearDIM

- determinant dimension

78

Converting a

Determinant Dimension

to a Pivoted Fact Table

Non-Type as a Determinant Dimension – University Enrolment Case Study

79

Question 1:

What if the number of records in the Determinant Dimension is huge?

– Number of fact measures increased to 20, 50?

– May not be practical

Solution:

Enforce Determinant Dimension through the User-Interface level ▪ Users are forced to enter/choose a criteria from the Determinant

Dimension.

▪ In filling a web-based form, this enforcement is normally indicated by a

star (*) next to the text entry to indicate that this text entry must be filled

in, before the form can proceed.

Pivoted Fact Table for University Enrolment Case Study

80

Question 2:

What if there are a lot of attributes in the Determinant Dimension? – If we shift the Determinant Dimension to the fact measures, only the key

identifier is kept in the fact measure.

– All other attributes from the Determinant Dimension will be lost from the star

schema. This issue will be discussed in the next section.

Pivoted Fact Table for University Enrolment Case Study (cont.)

Multiple Relationship

between a Dimension

and the Fact

82

▪ Private Taxi Case Study

– The dimension (or a Determinant Dimension) to be shifted to the fact

measure (e.g. Pivoted Fact Table) has many attributes.

– If only the key identifier of the dimension (or the Determinant

Dimension) is used as the fact measure, all other attributes of the

dimension (or the Determinant Dimension) are no longer exists in the

star schema

Multiple Relationship between a Dimension and the Fact

83

▪ Assume that

– The business rules stated that there is no Determinant Dimension. Even without a

Determinant Dimension, we could potentially shift a dimension to become fact measures

– The company is a small private taxi company, has only five cars

▪ In the fact table, instead of having three fact measures, for each fact measure, we could

replicate five times – one for each car.

Private Taxi Case Study

Replicate for each

car ===> 15 fact

measures

Remove this

CarDIM?

84

▪ CarDIM should not be removed – Other details of the car are

included in the data retrieval

▪ Because Car Dimension is not a

Determinant Dimension, we can

simply keep the original star

schema – Include only CarNo in the fact

table

– Use join operation to get more

details about car in query

Private Taxi Case Study (cont.)

select F.MonthYear, D.Make,

D.Model,

sum(F.Total_Kilometers)

from PrivateTaxiFact1 F,

CarDim D

where F.CarNo = D.CarNo

and F.CarNo = 5

and F.MonthYear = ’1908’

group by F.MonthYear, D.Make,

D.Model;

85

Insist that fact measures are broken into five cars

▪ Need to have five Car Dimension’s key

identifier in the fact

Private Taxi Case Study (cont.)

To retrieve Total Kilometers of five cars:

select F.MonthYear, D1.CarMake, D1.Year,

D2.CarMake, D2.Year, D3.CarMake, D3.Year,

D4.CarMake, D4.Year, D5.CarMake, D5.Year,

sum(Total_Kilometers)

from PrivateTaxiFact2 F, CarDim D1, CarDim

D2, CarDim D3, CarDim D4, CarDim D5

where F.CarNo1 = D1.CarNo

and F.CarNo2 = D2.CarNo

and F.CarNo3 = D3.CarNo

and F.CarNo4 = D4.CarNo

and F.CarNo5 = D5CarNo

and MonthYear = ’1908’

group by F.MonthYear, D1.CarMake, D1.Year,

D2.CarMake, D2.Year, D3.CarMake, D3.Year,

D4.CarMake, D4.Year, D5.CarMake, D5.Year;

86

▪ Although conceptually the dimension is already shifted to the fact measure,

we still need to keep the original dimension in the star schema.

▪ This is the only way to preserve all other attributes of that dimension.

▪ This relationship replication is applicable to shifting a Determinant

Dimension to the fact measure.

Multiple Relationship between a Dimension and the Fact

87

Two alternatives to enforce the implementation of a star schema that features a

Determinant Dimension

1. Enforce through the user-interface, whereby the user has to fill in or to

choose the Determinant Attribute during the search.

2. Shift the Determinant Dimension into a Pivoted Fact Table, whereby the key

identifier of the Determinant Dimension is incorporated into the fact

measure.

Determinant Dimension - Conclusion

Oracle/FIT5195-5-MoreComplexProcesses_update (1).pdf

Week 5 – More Complex Processes

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. More Complex Processes

1. Use of count Function

2. Average in the Fact

3. Outer Join

2. One Attribute Dimensions

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

More Complex Processes

in Creating Fact

Recall – Star Schema Components

▪ There are Three main components of the Star Schema:

1. Facts

2. Dimensions

3. Attributes

Recall – Fact

▪ A Fact Table consists of key attributes from each dimension,

and fact measures.

▪ A Fact Table is created by a join operation, that joins several

tables from the operational database.

▪ Fact tables are created either through TempFact or directly

retrieval from the tables in the operational database.

▪ The fact measure itself is an aggregated value.

➢ In the SQL command, the fact measure attribute in the Fact Table is created using an aggregate function, such as count

or sum, and the group by operation.

Use of count Function

▪ The count function is one of the most common aggregate functions used to

create the fact measures.

▪ In SQL, there are three ways of using the count function:

1. count(*) – counts number of records in the query result

2. count(attribute) – counts number of records of that attribute excluding

null values

3. count(distinct attribute) – counts number of unique values in that

mentioned attribute

Case Study #1: Mobile Apps Repositories

Monalisa University is an international university. It has an online mobile app store: Monalisa App Store.

The Monalisa App Store allows students from any university in the world to publish their applications and

receive feedbacks. This app store is considered as a research environment where applications developed

during researches and studies can be tested and used by real users. The applications are free and open

source.

Basically, the operational system has the following features:

▪ Every user details are kept in the database, which includes the universities that the students enroll.

▪ Users can publish their apps or download other apps.

▪ Users can give feedbacks and ratings to other apps.

▪ The download and feedback statistics are stored in the database.

▪ Apps are organized into different categories; and many authors may have more than one app in

different categories.

Case Study #1: Mobile Apps Repositories

▪ A data warehouse is needed

analyze the ratings and feedbacks

by different authors and apps, so

that the staff at Monalisa University

can connect with the talented

authors and send them awards

annually.

▪ The author with the highest

average apps rating will be named

author of the year, and will receive

an award.

Case Study #1: Mobile Apps Repositories

Case Study #1: Mobile Apps Repositories

create table CategoryDim as

select * from Category;

Case Study #1: Mobile Apps Repositories

create table UniversityDim as

select distinct UniversityID,

UniversityName

from University;

Case Study #1: Mobile Apps Repositories

create table LocationDim as

select distinct

Country || City as LocationID,

City,

Country

from University;

Case Study #1: Mobile Apps Repositories

create table TimeDim as

select distinct

to_char(DownloadDate, ’YYYYMM’) as TimeID,

to_char(DownloadDate, ’MM’) as Month,

to_char(DownloadDate, ’YYYY’) as Year

from Download;

Case Study #1: Mobile Apps Repositories

Case Study #1: Mobile Apps Repositories

create table AppsDownloadFact as

select

to_char(D.DownloadDate, ’YYYYMM’) as DownloadMonth,

U.Country || U.City as LocationID,

A.CategoryID,

A.ApplicationID,

U.UniversityID,

count(*) as TotalDownloads

from University U, App_User R, Download D, Application A

where

U.UniversityID = R.UniversityID and

R.UserID = D.DownloaderID and

D.ApplicationID = A.ApplicationID

group by

to_char(D.DownloadDate, ’YYYYMM’),

U.Country || U.City,

A.CategoryID,

A.ApplicationID,

U.UniversityID;

Case Study #1: Mobile Apps Repositories

Case Study #1: Mobile Apps Repositories

create table TimeDim as

select distinct

to_char(CreationDate, ’YYYYMM’) as TimeID,

to_char(CreationDate, ’MM’) as Month,

to_char(CreationDate, ’YYYY’) as Year

from Application;

Case Study #1: Mobile Apps Repositories

create table AppsFact as

select

to_char(A.CreationDate, ’YYYYMM’) as CreationMonth,

U.Country || U.City as LocationID,

A.CategoryID,

A.ApplicationID,

U.UniversityID,

count(distinct ApplicationID) as TotalApps

from University U, App_User R, Download D, Application A

where

U.UniversityID = R.UniversityID and

R.UserID = D.DownloaderID and

D.ApplicationID = A.ApplicationID

group by

to_char(A.CreationDate, ’YYYYMM’),

U.Country || U.City,

A.CategoryID,

A.ApplicationID,

U.UniversityID;

Average in the Fact

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

The operational database

contains:

• 9 records of Semester one

• 7 records of Semester two

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

The operational database

contains:

• 9 records of Semester one

• 7 records of Semester two

• 8 records of Database Unit

(6 Semester one and 2

Semester two)

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact (b) Subject Dimension

(c) Semester Dimension

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact (b) Subject Dimension

(c) Semester Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact (b) Subject Dimension

(c) Semester Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

• Semester two: (64+32)/2 = 48

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact (b) Subject Dimension

(c) Semester Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

• Semester two: (64+32)/2 = 48

These are actually

incorrect!

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact (b) Subject Dimension

(c) Semester Dimension

The SQL command: select avg(Average_Score)

from EnrolmentFact

where UnitCode = ’IT001’;

Query: Calculate Average Score for the Database Unit:

• (73.833 + 48) / 2 = 60.9165

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for the Database Unit.

Calculation using Fact:

• (73.833 + 48) / 2 = 60.9165

Calculation based on the Operational Database:

• (81+41+74+85+87+75+64+32) / 8 =

539 / 8 = 67.375

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for the Java Unit in both Semesters.

Calculation using Fact:

• (56+71.667) / 2 = 63.833

Calculation based on the Operational Database: • (65+47+78+73+64) / 5 = 65.4

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for Semester One.

Calculation using Fact:

• (73.833+56+63) / 3 = 64.287

Calculation based on the Operational Database: • (81+41+74+85+87+75+65+47+63) / 9 = 68.667

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for Semester Two.

Calculation using Fact:

• (48+71.667+52.5) / 3 = 57.389

Calculation based on the Operational Database: • (64+32+78+73+64+53+53) / 7 = 59.4286

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Average in the Fact

Unit Code Semester Total_Score Number_of_Students

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

(a) Fact Version 2

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(b) Subject Dimension

(c) Semester Dimension

Average in the Fact

Unit Code Semester Total_Score Number_of_Students

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

(a) Fact Version 2

The SQL command: select sum(Total_Score)/

sum(Number_of_Students)

as Average_Score

from EnrolmentFact2

where UnitCode = ’IT001’;

Query: Calculate Average Score for

the Database Unit:

• (443+96)/(6+2) = 67.375

Average in the Fact

▪ The problem of Average in the Fact is known as the Average of an Average

problem.

➢ This problem is well known in Mathematics and Statistics.

➢ Average of an average will simply produce an incorrect average result.

▪ Hence, it is not desirable to have an average measure in the fact.

➢ Exceptional case: when the analysis ALWAYS uses all the dimensions (e.g.

Determinant Dimensions).

Min & Max in the Fact

▪ If Average should not be used in the fact, how about Min or Max?

➢ Yes, we can.

➢ Because Max of Max is always a global Max, and Min of Min is always a global

Min.

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Min & Max in the Fact

▪ Query: Find the Maximum Score of Database Unit.

• Max of {87, 64} is 87.

▪ The SQL command: select max(Max_Score) from EnrolmentFact

where UnitCode = ’IT001’;

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Min & Max in the Fact

▪ Query: Find the Minimum Score of Database Unit.

• Min of {41, 32} is 32.

▪ The SQL command: select min(Min_Score) from EnrolmentFact

where UnitCode = ’IT001’;

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

1

2

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Average in the Fact – Conclusion

▪ Average in the Fact is not desirable, although technically it satisfies the two

criteria of the fact (e.g. must be a numerical and aggregate value).

▪ Min and Max in the Fact can still be used, since Min Score and Max Score are

valid fact measures (e.g. they are numerical and aggregated values).

▪ In general, count and sum are more common.

Outer Join

Case Study #2: Employment Agency

The Employment Agency has a file of candidates who are willing to work.

▪ They record the candidates number, name, contact address, contact phone number and

maximum hours the candidate is available per week.

▪ Each candidate may have several qualifications.

▪ The Agency uses special codes to record the candidate’s qualification for a job opening.

▪ As well as recording the code, the Agency also records the experience of the candidate in

each qualification area, expressed as the number of months experience.

Case Study #2: Employment Agency

The Agency also has a list of companies that request temporaries.

▪ Each company is assigned a company number as an identifier.

▪ The company name, type of business and principal contact for employment placements is also recorded.

▪ Each time a company requests a temporary employee, the Agency makes an entry in the (job) openings file. This file contains an opening number, the company requesting an employee, required qualification, starting date, anticipated ending date, and hourly pay.

▪ Each opening requires only one specific or main qualification.

▪ The Agency may be able to fill the opening from the staff on its books, however in some circumstances it cannot fill the request.

Case Study #2: Employment Agency

▪ When a candidate matches the qualification, and is available, he/she is given the job, and an entry is made in the placement record folder.

➢ This folder contains an opening number, candidate number, actual start date, total hours worked to date, and end date if the placement is completed.

➢ The placement record folder is used by the Agency as a source of placement histories for it’s various temporaries.

Case Study #2: Employment Agency

▪ Note:

➢ Not all Openings have

Placements, but each Placement

comes from the Opening.

➢ A Candidate may have several

Placements.

➢ A Company may open several

Openings.

➢ Each Opening has one main

qualification requirement.

Case Study #2: Employment Agency

▪ A data warehouse to analyze

the total openings and total

placements by job durations,

qualification requirements, and

placement starting month.

▪ Several types of job durations:

➢ Short-term (<10 days)

➢ Medium-term (10 to 30 days)

➢ Long-term (>30 days)

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

▪ Several types of job durations:

➢ Short-term (<10 days)

➢ Medium-term (10 to 30 days)

➢ Long-term (>30 days)

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

1. How many short-term job openings?

2. How many jobs requiring certain qualifications?

3. How many people started working in a certain

month?

Case Study #2: Employment Agency

create table QualificationDim as

select * from Qualification;

Case Study #2: Employment Agency

create table MonthDim as

select distinct

to_char(ActualStartDate, ’Month’)

as MonthName

from Placement;

Case Study #2: Employment Agency

create table DurationDim

(DurationID number,

DurationDesc varchar2(20));

insert into DurationDim values (1, ’Short-Term’);

insert into DurationDim values (2, ’Medium-Term’);

insert into DurationDim values (3, ’Long-Term’);

Case Study #2: Employment Agency

create table TempFact as

select O.QCode,

O.StartDate,

O.EndDate,

to_char(P.ActualStartDate, ’Month’)

as MonthName,

O.OpenNo,

P.CandNo

from Opening O, Placement P

where O.OpenNo = P.OpenNo (+);

Case Study #2: Employment Agency

create table TempFact as

select O.QCode,

O.StartDate,

O.EndDate,

to_char(P.ActualStartDate, ’Month’)

as MonthName,

O.OpenNo,

P.CandNo

from Opening O, Placement P

where O.OpenNo = P.OpenNo (+); Left Outer Join

Case Study #2: Employment Agency

alter table TempFact

add (DurationID number);

update TempFact

set DurationID = 1

where EndDate - StartDate < 10;

update TempFact

set DurationID = 2

where EndDate - StartDate >= 10

and EndDate - StartDate <=30;

update TempFact

Set DurationID = 3

where EndDate - StartDate > 30;

Case Study #2: Employment Agency

create table AgencyFact as

select

QCode, DurationID, MonthName,

count(OpenNo) as TotalOpening,

count(CandNo) as TotalPlacement

from TempFact

group by QCode, DurationID, MonthName;

One-Attribute

Dimensions

61

One-attribute dimensions

One-attribute dimensions ● Dimensions with only one attribute, i.e. key attribute

62

Column-based Solution in the Fact

● Medal Type Dimension table seems unnecessary, because Medal Type also

exists in the fact table

● Hence, there is no need to reference to the Medal Type Dimension

63

Column-based Solution in the Fact (cont.)

Column-based solution ● Convert the dimension attribute into

new columns (fact measures) in the

fact table

● Only applicable for dimensions with

low cardinality, because it is not

practical to create a new column in

the fact table for each record of the

dimension

64

Row-based Solution in the Fact

● What happens if the one-attribute dimension has a lot of records?

● The solution: ○ Remove the dimension and keep the attribute of the dimension as a dimension-

less key in the fact table

Row-based solution whereby

each medal type is

represented by records in the

fact table

65

Row-based Solution in the Fact - Sales Case Study

● Assume that there are only four quarters (e.g. Q1, Q2, ..., Q4), regardless

the year

● The Quarter could be shifted to the fact measures

● Result: Pivoted fact table

Q1 to Q4

A Sales Star Schema with Pivoted Fact

66

▪ What if the Quarter also embeds the Year, e.g. Q12020, Q22018, and etc.? – Pivot Fact table cannot be the solution, because the number of possible records

in the Time Dimension can be large

▪ The solution: – Include the Quarter attribute as a Dimension-less Key in the fact, and to remove

the Time Dimension from the star schema

– Result: Row-based solutions - not restricted by the number of records in the

one-attribute dimension

Row-based Solution in the Fact (cont.)

67

▪ Similarity: – The contents of the fact table are identical.

▪ Difference: – The dimension-less key attribute has no references to the Time Dimension,

because the Time Dimension has been removed, whereas the Quarter in the star

schema has a reference to the Time Dimension.

Star Schema with Dimension-less Key vs. One-Attribute Dimension

vs

Star Schema with Dimension-less key

Star Schema with TimeDIM

68

▪ Looking at the high level design, it is

often more desirable to keep the

one-attribute dimension, i.e. the Time

Dimension in this case

– Provides a clearer design,

considering that the fact table

remain the same, with or without

the one-attribute dimensions

Star Schema with Dimension-less Key vs. One-Attribute Dimension (cont.)

69

One-attribute dimensions

70

▪ If there is more than one dimension with single attribute, and these

dimensions are not related to each other, it is often desirable to combine

them into a junk dimension

▪ Case study: – Analyse the number of sales based on

these dimensions

Combine all one-attribute dimensions

Sales Fact Table

71

Combine all one-attribute dimensions (cont.)

Cartesian product of 4

dimension tables

72

73

▪ Assuming that each book can only have one category, then Category is

related to Book

▪ In this case, we can simply move the category into the book dimension

Combine with other normal dimensions - Book Sales Case Study

74

Combine with other normal dimensions - Book Sales Case Study (cont.)

75

Determinant Dimension with one-attribute only - Weather Data Case Study

76

Determinant Dimension with one-attribute only - Weather Data Case Study

▪ WeatherMonth attribute is a Determinant Dimension in this case.

▪ Hence, WeatherMonth Dimension must be kept in the star schema

– To enforce that WeatherMonth Dimension (or WeatherMonth attribute)

must be used in all queries to this star schema

WeatherMonthDI

M

77

Questions:

1. What is the granularity of this

star schema?

2. How can we have a higher

granularity in this star schema

(no category to go for)?

Answer to Question #2: ▪ Combine books into groups

based on the authors

One-attribute dimension with bridge - Book Sales Case Study

78

▪ The ISBN Group List can be

created by using the listagg function

▪ The Book Group List

Dimension is a one-attribute

dimension

▪ It is not possible to remove

this dimension as it links the

fact and the bridge table

One-attribute dimension with bridge - Book Sales Case Study (cont.)

79

1. To move the one attribution dimensions to the fact table

– The one-attribute dimension must have low cardinality

– Else, keep the dimension attribute in the fact as a dimension-less key

– From high level point of view, sometimes it is desirable to keep the

attribute dimension in the design

2. To combine the one-attribute dimension with other dimensions

– Form a junk dimension that combines several one-attribute dimensions

– Move the one-attribute dimension into another normal dimension

– One-attribute dimensions cannot be removed if:

▪ It is a determinant dimension

▪ It is connected to a bridge table

Summary for One-Attribute Dimensions

End of Week 5 Lecture

Oracle/Requiremnts-Oracle.pdf

FIT5195 S1 2020 – Take Home Test (10%)

Due date: Week 6, Friday 1-May-2020, 11:55pm

This Take Home Test consists of two Case Studies. You are required to answer all questions.

Case Study 1: Government Hospital Data Warehouse (50 marks) Currently, the Victorian government has a simple database system implemented to store the

collective data of all government-run hospitals. The database system includes information about their

services, staff, and patients. The simple database system consists of the following tables and

attributes:

Figure 1 : A simplified structure of the Government Hospital’s current E/R Diagram

Due to the nationwide impact of a new pandemic, the Premier of Victoria (i.e. the head of government

in the Australian state of Victoria) requires a data warehouse for analyzing trends and forecasting

purposes. The goal of the analyzing is to identify which frequently used services, popular hospital

locations, etc.

You are required to design a small Data Warehouse to keep track of the statistics. The Premier is

particularly interested in analyzing the total number (population quantity) of patients and total

service charged by services used, time periods, age groups, and hospital locations.

The Premier also wants to be able to drill down based on the cost of services range (low price < $20,

medium price between $20 and $50, and high price >$50).

Tasks:

[1]. Create the operational database based on the above E/R Diagram, and then populate with sufficient numbers of records in each table. It is suggested that each table would have 5-20

records depending on the use of that table in the operational database. You need to create a script

file containing the CREATE TABLE and INSERT INTO commands.

You can follow the list of services from this link:

https://www.monash.edu/health/medical/services.

[2]. Then develop a GOVERNMENT HOSPITAL star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you

have to also include the Weight Factor and List Aggregate.

The result of this task is a star schema diagram. You can use any drawing tool, such as

Lucidchart, to draw the star schema.

[3]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.

[4]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables.

[5]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and

show the results):

a) Show the total number of patients making appointments during Winter. b) Show the total service charged for each service cost type. c) Show the total number of patients by each age group (infant <1, children <18, adult 18+,

senior 65+) in April 2020.

d) Show the total service charged for general medical consultations in each suburb.

Submission checklist for Case Study 1:

A. A pdf file or word document containing:

i. The SQL commands to create tables and insert records to the operational database –

Task 1

ii. The star schema diagram – Task 2

iii. The Two-Column Table Methodology illustration – Task 3

iv. The SQL commands to create the dimension and fact tables, as well as the contents of

these tables – Task 4

v. The SQL commands to answer the queries in Task 5 and the query results

B. The .sql files containing:

i. The SQL commands to create the operational database.

ii. The SQL commands to create the data warehouse.

iii. The SQL commands to answer the queries in Task 5.

(Notes: The marker of this test will simply run this file. So you need to make sure that the

SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,

especially if you cut and paste from a Word file).

C. A folder:

i. Save all the files from [A] and the SQL files from [B] in one folder.

ii. Name the folder CaseStudy1_yourStudentID.

Case Study 2: Accident Records Data Warehouse (50 marks)

Figure 2 : A simplified structure of the Accident Record’s current database system

The above figure displays how the Victorian Roads and Safety (VicR&S) Department stores accident

related data. When there are any road accidents, the details about the accidents such as the location,

date, time and event (e.g. what exactly happened during the accident) are recorded. For each accident,

the information (e.g. name, contact number, employment start date, current employment branch)

about one police officer who is in charge of overseeing the events of the accident (e.g finding out

who/what caused the accident) are stored. In addition, to identify a particular accident, each accident

is given a unique accident number.

On the other hand, one accident may be caused by several vehicles (e.g. a collision between 2

vehicles). All vehicles have owners who may or may not be the same as the vehicle driver during the

accident. Therefore, the VicR&S database stores information about vehicles involved in the accident

(i.e. vehicle number, vehicle model, vehicle make and vehicle manufacturer), the information about

the owners of the vehicles (i.e. owner name, owner address, owner contact number) and the

information about the drivers at the time of the accident (driver name, driver address, driver contact

number, driver licence no, and when the driver started to drive the vehicle). Furthermore, for each

accident, the severity of damage incurred by each vehicle is recorded.

Currently, to reduce the number of road accidents VicR&S department is interested in the following:

- The total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).

- The total number of accidents by each vehicle model. - The number of vehicles involved in every accident event on different locations. - The number of accidents taken care of different police officer branches.

Tasks:

[1]. Develop an ACCIDENT RECORDS star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you have

to also include the Weight Factor and List Aggregate.

The result of this task is a star schema diagram. You can use any drawing tool, such as

Lucidchart, to draw the star schema.

[2]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.

[3]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables. The

operational tables are accessible from the ACCIDENT account. The result of this task is the SQL

commands. You will also need to show the contents of the tables that you have created.

[4]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and

show the results):

a) Show the total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).

b) Show the total number of accidents by each vehicle model. c) Show the number of vehicles involved in every accident event on different locations. d) Show the number of accidents taken care of by different police officer branches.

[5]. You need to come up with additional two more questions and answer these questions using the SQL commands. Also explain the reason for why the management would like to have such

information.

Submission checklist for Case Study 2:

A. A pdf file or word document containing:

i. The star schema diagram – Task 1

ii. The Two-Column Table Methodology illustration – Task 2

iii. The SQL commands to create the dimension and fact tables, as well as the contents of

these tables – Task 3

iv. The SQL commands to answer the queries in Task 4 and the query results

v. The additional two questions with the SQL commands, query result, and your explanation – Task 5

B. The .sql files containing:

i. The SQL commands to create the data warehouse.

ii. The SQL commands to answer the queries in Task 4 and Task 5.

(Notes: The marker of this test will simply run this file. So you need to make sure that the

SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,

especially if you cut and paste from a Word file).

C. A folder:

i. Save all the files from [A] and the SQL files from [B] in one folder.

ii. Name the folder CaseStudy2_yourStudentID.

Submission Method:

1. A zip file: a. Save the CaseStudy1_yourStudentID and CaseStudy2_yourStudentID folders into one

folder named FIT5195_Test_yourname_studentID.

b. ZIP the folder FIT5195_Test_yourname_studentID.zip. This must be a ZIP file and not other types of compressed folder. The zip file should contain the prescribed folder

structure as follows:

● FIT5195_Test_yourname_studentID/ ○ CaseStudy1_yourStudentID/

○ Files for tasks ○ CaseStudy2_yourStudentID/

○ Files for tasks

2. Upload your zip file on Moodle by the due date: Friday 1-May-2020, 11:55pm (No late submission is accepted).

END OF TAKE HOME TEST

Oracle/Week-1c-SQL-revision.pdf

11

SQL - Revision

Assoc. Prof. David Taniar

22

33

SQL - Revision

Outline: A. Introduction to SQL client environment

B. Create tables

C. Insert into

D. Simple query retrieval

E. Updating and deleting records

F. Commit

G. Joining multiple tables

H. Aggregate functions and group by

I. Alter tables

44

A. Introduction to SQL client

• There are several SQL client software:

• SQL Developer

• SQL*Plus (Windows, Mac, Unix)

• Login details:

• Username: S12345678 (12345678 is your student id)

• Password: student

• Host string: the unit code

5

Introduction to SQL Developer

Oracle username which starts

with S and followed by your

student ID. Password is student

Location of

the server

This is the hoststring, which

is the database name

rhino.its.monash.edu

FIT5195

FIT5195

6

Introduction to SQL Developer

Single Statement Execution Script Execution

Output of Execution

Type Command

77

B. Create Tables

• General Syntax:

CREATE TABLE <table_name>

(attribute1 data_type <NOT NULL>,

attribute2 data_type,

PRIMARY KEY (attribute1),

FOREIGN KEY (attribute1)

REFERENCES <references table_name> (key_attribute));

88

Create Tables – Data Types

• Data type denotes a kind of data of an attribute value

• Character data types: VARCHAR2 and CHAR

Make VARCHAR2(20)

Model VARCHAR2(30)

• Number data types: NUMBER

Year NUMBER(4)

StampDuty NUMBER(8,2)

• Date data type: DATE

Salesdate DATE

B

99

Create Tables – Data Types

• Date Data Type

• DATE stores dates from 1/1/4712 BC to 12/31/4712 AD

• Default date format: DD-MON-YYYY

• example: 05-JUN-2011

• Example declaration: salesDate DATE

• DATE data type also stores time values

B

1010

Create Tables – Data Types

• Default time format: HH:MI:SS A.M.

• If no time value is given when a date is inserted, default value is

12:00:00 A.M.

• If no date value is given when a time is inserted, default date is

first day of current month

• Example salesDate field: 07-JUN-2016 12:00:00 A.M.

B

1111

Create Tables – Constraints

• Integrity Constraints

• Primary Key attribute

• NOT NULL constraints

• Specifies that a field cannot be NULL

• Sample Declaration: Field_name data_type NOT NULL

• Foreign Key attribute in a table refers to another record in another table

B

1212

Create Tables – Example 1

• Example: create CAR, CARSALES and CUSTOMER tables

• Commands:

• CAR table

CREATE TABLE Car

(carID NUMBER(5) NOT NULL,

make VARCHAR2(20) NOT NULL,

model VARCHAR2(30),

year NUMBER(4),

colour VARCHAR2(25),

advertisedPrice NUMBER(10),

PRIMARY KEY (carID)

);

B

1313

Create Tables – Example 2

• CUSTOMER table

CREATE TABLE Customer

(customerID NUMBER(5) NOT NULL,

cNAME VARCHAR2(20) NOT NULL,

cPhone VARCHAR2(10),

address VARCHAR2(50),

PRIMARY KEY (customerID)

);

B

1414

Create Tables – Example 3

• CARSALES table

CREATE TABLE Carsales

(carID NUMBER(5) NOT NULL,

customerID NUMBER(5) NOT NULL,

salesDate DATE NOT NULL,

purchasedPrice NUMBER(10) NOT NULL,

stampDuty NUMBER(8,2) NOT NULL,

PRIMARY KEY (carID,customerID),

FOREIGN KEY (carID) REFERENCES Car(carID),

FOREIGN KEY (customerID) REFERENCES Customer(customerID)

);

B

1515

Create Tables – by Copying

B

• You can create a new table by copying from an existing table:

CREATE TABLE <table_name>

AS SELECT *

FROM … ;

• For example:

CREATE TABLE car

AS SELECT *

FROM dtaniar.car;

• Notes:

• It creates and copy the records from the existing table

• However, it does not copy the PK and FK

• In the above example, it copies the table from dtaniar account.

1616

Create Tables – View Tables

• Viewing Information about Tables

• To view all tables in the database, the general syntax is:

SELECT * FROM TAB;

• For example:

SELECT * FROM tab;

TNAME TABTYPE CLUSTERID

------------------------------ ------- ----------

CAR TABLE

CARSALES TABLE

CUSTOMER TABLE

B

1717

Create Tables – Describe Tables

• To view the table structure, the general syntax is:

DESCRIBE <table_name>

• For example:

DESC car;

Name Null Type

--------------- -------- ------------

CARID NOT NULL NUMBER(5)

MAKE NOT NULL VARCHAR2(20)

MODEL VARCHAR2(30)

YEAR NUMBER(4)

COLOUR VARCHAR2(25)

ADVERTISEDPRICE NUMBER(10)

B

1818

Create Tables – Drop Tables

• To drop an unwanted table, the general syntax is:

DROP TABLE <table_name>;

• For example:

DROP TABLE Car;

• If the table that you want to delete (e.g. table Car) is being used as a FK by

another table (e.g. table Carsales), then you cannot delete table Car.

• In this case, you need to delete table Carsales first, before deleting table Car.

B

1919

C. Insert Into

• General Syntax

• To insert values for every single attribute in a record:

INSERT INTO <table_name>

VALUES (attribute1_value, attribute2_value,….);

• Example:

INSERT INTO Car

VALUES (1, 'Holden', 'Cruze', 2015, 'Black', 25780);

• Strings are enclosed in single quotes (‘) and are case-sensitive

(e.g. ‘General Practice’ is different from ‘general practice’)

20

Insert Into

• To insert a value of selected attributes:

INSERT INTO <table_name> (attribute1,attribute2,…)

VALUES (attribute1_value, attribute2_value,….)

• Example:

INSERT INTO Car (carID, make) VALUES (’16’, ’Audi’);

C

2121

Insert Into

• The TO_DATE function:

• TO_DATE (‘date_value’, ’format mask’);

• Example:

INSERT INTO Carsales

VALUES (1,4, TO_DATE('04/Feb/2015', 'DD/MON/YYYY'),25780,824.96);

C

2222

Insert Into

• The common DATE format masks

Format Mask Formatted Data

DD-MON-YYYY 05-FEB-2007

MM/DD/YYYY 02/05/2007

HH:MI AM 02:30 PM

MONTH DAY, YYYY FEB 5, 2007

MM/DD/YYYY HH:MI AM 02/05/2007 02:30 PM

Sample DATE format masks

C

23

Insert Into

• Insert multiple records one-by-one:

INSERT INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800);

INSERT INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200);

INSERT INTO Car VALUES (4, 'Holden', 'Commodore', 2008, 'Grey', 12650);

• Insert multiple records at once:

INSERT ALL

INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800)

INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200)

INTO Car VALUES (’ 4, 'Holden', 'Commodore', 2008, 'Grey', 12650)

SELECT * FROM DUAL;

C

2424

D. Simple Query Retrieval

• Simple Retrieval

• Retrieve all Records

• General Syntax :

SELECT *

FROM <table_name>;

• Example: retrieve everything from the CAR table:

SELECT * FROM car;

• Output:

2525

Simple Query Retrieval

• Retrieve Specific Fields

• General Syntax:

SELECT <attribute1, attribute2…>

FROM <table_name>;

• Example: select only the model from the CAR table

SELECT model FROM car;

D

• Output:

2626

Simple Query Retrieval

• Eliminating Duplicated Records (DISTINCT qualifier)

• General Syntax:

SELECT DISTINCT <attribute1,attribute2,…>

FROM <table_name>;

• Example: eliminating duplicates for the MAKE values:

SELECT DISTINCT make FROM car;

D

2727

Simple Query Retrieval

• Output (with DISTINCT):

D

• Original Output (without DISTINCT):

2828

Simple Query Retrieval

• DISTINCT Multiple Attributes

• Example: eliminating duplicates for the MAKE and YEAR values:

SELECT DISTINCT make, year

FROM car;

D

MAKE YEAR

-------------------- ----------

Volkwagen 2015

Toyota 2013

Mini 2009

Holden 2015

BMW 2016

Honda 2012

Subaru 2014

Holden 2008

Jeep 2006

Mazda 2011

Audi 2016

Lexus 2009

Honda 2010

Nissan 2010

Mercedes Benz 2006

15 rows selected

2929

Simple Query Retrieval

• Conditional Retrieval

• Search Conditions specified for more complex data retrieval

• The WHERE Clause

• Operators:

• equal (=)

• greater than (>)

• less than (<)

• greater than or equal to (>=)

• less than or equal to (<=)

• Not equal (<>)

D

3030

Simple Query Retrieval

• General Syntax:

SELECT <attribute1,attribute2,…>

FROM <ownername.table_name1>

WHERE <search condition>;

• Example:

SELECT year, model, advertisedPrice

FROM car

WHERE make='Holden'

YEAR MODEL ADVERTISEDPRICE

---------- ------------ ---------------

2015 Cruze 25780

2008 Commodore 12650

D

3131

Simple Query Retrieval

• “BETWEEN”

• Example: list the carid, salesDate and purchasedPrice after 2014 and

before 2016

SELECT carid, salesdate, purchasedPrice

FROM carsales

WHERE salesdate

BETWEEN TO_DATE('01-JAN-2014', 'DD-MON-YYYY')

AND TO_DATE('31-DEC-2015','DD-MON-YYYY’);

CARID SALESDATE PURCHASEDPRICE

---------- ----------- --------------

1 04-FEB-2015 25780

8 12-DEC-2015 13999

D

3232

Simple Query Retrieval

• Using a String Comparison

• Example: list the carid, salesDate and purchasedPrice after 2014 and

before 2016

SELECT carid, salesdate, purchasedPrice

FROM carsales

WHERE TO_CHAR(salesdate, ‘YYYYMMDD’) >’20140101’

AND TO_CHAR(salesdate, ‘YYYY’MMDD’) < ‘20151231’;

CARID SALESDATE PURCHASEDPRICE

---------- ----------- --------------

1 04-FEB-2015 25780

8 12-DEC-2015 13999

D

3333

Simple Query Retrieval

• “AND” or “OR”

• AND: both conditions must be true

• Example:

SELECT make, model, year

FROM car

WHERE year>2014 AND make='Holden';

MAKE MODEL YEAR

--------- ---------------- ----------

Holden Cruze 2015

D

3434

Simple Query Retrieval

• OR: either one of the condition is true

• Example:

SELECT make, model, colour, year, advertisedPrice

FROM car

WHERE colour='Red’ OR year>2014;

MAKE MODEL COLOUR YEAR ADVERTISEDPRICE

--------- ---------- ---------- ---------- ----------------

Holden Cruze Black 2015 25780

BMW 520d Grey 2016 98800

Audi A5 Black 2016 68200

Honda Civic Red 2012 13488

Lexus RX350 Red 2009 33500

Volkwagen Golf White 2015 39888

6 rows selected

D

3535

Simple Query Retrieval

• Other Conditions

• LIKE/NOT LIKE

• Example: displaying all MAKE from car table that has their first

character as ‘M’

SELECT make

FROM CAR

WHERE make LIKE 'M%';

MAKE

----------------

Mazda

Mercedes Benz

Mini

D

3636

Simple Query Retrieval

• IN/NOT

• suitable to perform a set member search

• Example:

• IN - displaying all MODEL that its MAKE is either ‘Holden’ or ‘Honda’

SELECT make, model

FROM car

WHERE make IN ('Holden', 'Honda');

MAKE MODEL

-------------------- -------------

Holden Cruze

Holden Commodore

Honda Civic

Honda City

D

3737

Simple Query Retrieval

• IN/NOT

• Example:

• NOT IN - displaying make, model and colour of cars that are not ‘Black’

and ‘White’

SELECT make, model, colour

FROM car

WHERE colour NOT IN ('Black’, 'White');

MAKE MODEL COLOUR

-------------- ------------ ------------

BMW 520d Grey

Holden Commodore Grey

Honda Civic Red

Lexus RX350 Red

Mazda 3Series Blue

Mercedes Benz C200 Blue

Honda City Grey

D

3838

Simple Query Retrieval

• NULL/NOT NULL

• Example:

• NULL Operator:

SELECT carid, make

FROM car

WHERE colour IS NULL;

CARID MAKE

------ ---------

12 Subaru

D

3939

Simple Query Retrieval • NOT NULL Operator:

SELECT carid, make

FROM car

WHERE colour IS NOT NULL;

CARID MAKE

------- --------------------

1 Holden

2 BMW

3 Audi

4 Holden

5 Honda

6 Jeep

7 Lexus

8 Mazda

9 Nissan

10 Volkwagen

11 Mercedes Benz

13 Honda

14 Mini

15 Toyota

14 rows selected

D

4040

Simple Query Retrieval

• MULTIPLE OPERATORs

• Example: list cars make start with ‘M’ and was made before 2016

SELECT year, make, model

FROM car

WHERE make LIKE 'M%'

AND year<2016;

YEAR MAKE MODEL

---- ---------------- -------------

2011 Mazda 3Series

2006 Mercedes Benz C200

2009 Mini Cooper

D

4141

Simple Query Retrieval

• Sorting

• specify to sort the output by using ORDER BY

• General Syntax:

SELECT <attribute1,attribtue2,..>

FROM <table_name>

ORDER BY <attribute_name> [DESC];

D

4242

Simple Query Retrieval • Example: retrieving all make, model, advertisedPrice and colour in a descending

order of the advertisedPrice SELECT make, model, advertisedPrice, colour

FROM car

ORDER BY advertisedPrice DESC;

MAKE MODEL ADVERTISEDPRICE COLOUR

--------------- ------------------ --------------- --------

BMW 520d 98800 Grey

Audi A5 68200 Black

Volkwagen Golf 39888 White

Subaru Outback 33950 NULL

Lexus RX350 33500 Red

Holden Cruze 25780 Black

Toyota Corolla 23888 White

Jeep Grand Cherokee 21999 Black

Mini Cooper 19750 Black

Nissan Dualis 18888 White

Mercedes Benz C200 16995 Blue

Mazda 3Series 13999 Blue

Honda Civic 13488 Grey

Holden Commodore 12650 Grey

Honda City 7490 Grey

15 rows selected

D

43

Simple Query Retrieval

• “AS”: to rename a column

• Example: list all SalesDate, PurchasedPrice, StampDuty and

TotalPrice (TotalPrice = PurchasedPrice+StampDuty )

SELECT salesdate, purchasedPrice, stampduty,

(purchasedPrice+StampDuty) as TotalPrice

FROM carSales;

SALESDATE PURCHASEDPRICE STAMPDUTY TOTALPRICE

----------- -------------- ---------- ----------

04-FEB-2015 25780 824.96 26604.96

13-JUL-2016 12650 506 13156

12-DEC-2015 13999 559.96 14558.96

14-JUN-2016 39888 1276.42 41164.42

18-MAY-2016 98800 5631.6 104431.6

4444

E. Updating and Deleting Records

• In the created tables

• UPDATE command – updating

• DELETE command – deletion

4545

Updating Records

• Update

• General Syntax:

UPDATE <table_name>

SET <attribute_name> = <new_value>

WHERE <expression> <operator> <expression>;

• records can be updated in only one table at a time

• update multiple fields that are within the same table

• WHERE clause - make the command updates specific records only

E

46

Updating Records

• Example:

• Updating colour of carID ‘5’ from ‘Red’ to ‘Grey’

UPDATE car

SET colour='Grey'

WHERE carID =5;

46

E

47

Deleting Records

• Delete

• General Syntax:

DELETE FROM <table_name>

WHERE <search_condition>;

• remove specific records from a database table

• use WHERE clause to specify multiple records to delete multiple records at

one time

• If the search condition is omitted, all records in the table are deleted.

47

E

48

Deleting Records

• Example:

• deleting a single record from the CAR table

DELETE FROM car

WHERE carid= 5;

• deleting multiple records from the CAR table that contain MAKE

starting with ‘M’

DELETE FROM car

WHERE make LIKE ‘M%’;

• deleting all records from the CAR table

DELETE FROM car;

48

E

49

Deleting Records

• Notice:

• not allowed to delete a primary key record that has its corresponded foreign

key somewhere else in another table

• Example: delete a primary key CarId is 5 in the CAR table that has a foreign

key record that CarId is 5 in the CARSALES table

DELETE FROM car WHERE carid=5;

DELETE FROM car

*

ERROR at line 1:

ORA-02292: integrity constraint (SYSTEM.SYS_C005453) violated - child

record found

49

E

5050

F. Commit

• Commit

• When inserted data by issuing the INSERT command

• the changes are only saved in the local database buffer

• are not saved in the database

• until you COMMIT the transaction

• it is important to remember to COMMIT whenever you have finished inserting

values or make changes to the database values

5151

Commit

• General Syntax:

• Sample of inserting a new record with commit

Effect of Inserting New Record with Commit

F

Insert a new record

‘16’ with commit

All original

records are

currently in CAR

table

5252

G. Joining Multiple Tables

• Join

• database query to join multiple database tables together

• the data needed or the conditions specified come from more than one table.

• Syntax:

SELECT <attribute1,attribute2,… >

FROM <table_name1,table_name2,…>

WHERE <table_name1.join_attribute> = <table_name2.join_attribut>

AND <search_condition>

53

Joining Multiple Tables

• joining the CAR, CARSALES and CUSTOMER tables:

SELECT customer.cname, customer.cphone, carsales.salesdate, carsales.purchasedPrice,

carsales.stampDuty, car.make, car.model

FROM customer, carsales, car

WHERE customer.customerID = carsales.customerID

AND car.carID = carsales.carID;

• using prefix when joining tables:

• when more than one table is involved, a prefix for each attribute is

recommended to avoid ambiguity

SELECT ct.cname, ct.cphone, cs.salesdate,

cs.purchasedPrice,cs.stampDuty,

c.make, c.model

FROM customer ct, carsales cs, car c

WHERE ct.customerID = cs.customerID

AND c.carID = cs.carID;;

53

G

• The output of both example:

54

Joining Multiple Tables

• Subquery

• Example: retrieving the make, model and advertisedPrice of the cars that

are not sold

SELECT make, model, advertisedPrice FROM car

WHERE carid NOT IN

(SELECT carid FROM carsales);

MAKE MODEL ADVERTISEDPRICE

----------------- ----------------- ---------------

Subaru Outback 33950

Lexus RX350 33500

Mini Cooper 19750

Nissan Dualis 18888

Mercedes Benz C200 16995

Jeep Grand Cherokee 21999

6 rows selected

5555

H. Aggregate Functions and Group By

• Aggregate Functions

• summarize the input table

• often used include:

• COUNT - count number of records in the input table

• SUM - calculate the sum of a numerical attribute

• MIN and MAX - find the smallest and the largest value of a certain

attribute

56

Aggregate Functions

• Count(*)

• Example: returning the number of records that is available from

the CAR table

SELECT COUNT(*)

FROM car;

COUNT(*)

------------------

15

H

57

Aggregate Functions

• Example: returning the number of values of colour that is available from the CAR table

SELECT COUNT(colour) FROM car;

COUNT(COLOUR)

-------------

14

• Example: returning the number of distinct values of colour that is available from the CAR table

SELECT COUNT(DISTINCT colour) FROM car;

COUNT(DISTINCTCOLOUR)

---------------------

5

COUNT(attribute) & COUNT(DISTINCT attribute)

Cause by a

record with

null value 15 records

include one

NULL value

Output:

58

Aggregate Functions

• MIN and MAX

• Example: calculate the minimum and maximum advertisedPrice

of the cars

SELECT MIN(advertisedPrice) AS "Lowest Price", MAX(advertisedPrice) AS "Highest Price”

FROM car;

Lowest Price Highest Price

------------ --------------

7490 98800

5959

Aggregate Functions and Group By

• Group By

• group an input table into a number of groups based on one or more

nominated attributes

• often used in conjunction with aggregate functions

• CAR table: group by Make

SELECT make, count(*)

FROM car

GROUP BY make;

H

MAKE COUNT(*)

-------------------- ----------

Holden 2

Lexus 1

Subaru 1

BMW 1

Mazda 1

Nissan 1

Audi 1

Volkwagen 1

Toyota 1

Honda 2

Mercedes Benz 1

Jeep 1

Mini 1

60

Aggregate Functions and Group By

• CAR table: grouping by Make with combination of getting only the record

groups that contain the count value greater than 1 is as follows

SELECT make, COUNT(DISTINCT model)

FROM car

GROUP BY make

HAVING COUNT(DISTINCT model)>1;

MAKE COUNT(DISTINCTMODEL)

-------------------- --------------------

Holden 2

Honda 2

H

61

Aggregate Functions and Group By

• multiple tables: selecting the number of sold cars for each colour

SELECT colour, COUNT(DINSTINCT customerId)

FROM car c, carsales cs

WHERE c.carid=cs.carid

GROUP BY colour;

COLOUR COUNT(DISTINCTCUSTOMERID)

----------- -------------------------

White 2

Grey 4

Blue 1

Black 2

Red 1

H

6262

Aggregate Functions

• Count vs. Sum

• COUNT - count number of records in the input table

• SUM - calculate the sum of a numerical attribute

SELECT <attribute1, attribute2>, COUNT(*)

FROM <table_name>

GROUP BY <attribute1, attribute2>;

SELECT <attribute1, attribute2>, SUM (attribute3)

FROM <table_name>

GROUP BY <attribute1, attribute2>;

H

Aggregate Functions

63

• Sum without Group By

• Example: calculate the sum of purchasedPrice

SELECT SUM(purchasedPrice) AS totalsales

FROM carsales;

TOTALSALES

----------

402983

Aggregate Functions

64

• Sum with Group By

• Example: calculate the sum of purchasedPrice for each make in

carsales table

SELECT c.make, SUM(cs.purchasedPrice) AS totalsales

FROM carsales cs, car c

WHERE c.CARID=cs.CARID

GROUP BY c.make;

MAKE TOTALSALES

-------------------- ----------

Holden 38430

BMW 197600

Mazda 13999

Audi 68200

65

Simple Query Retrieval

• SUM with Calculations

• Example: list all Cname, SalesDate and calculate the TotalPrice for

each customer (TotalPrice = PurchasedPrice+StampDuty )

SELECT ct.cname, cs.salesdate, SUM(purchasedPrice+StampDuty) as TotalPrice

FROM carSales cs, customer ct

WHERE cs.customerid=ct.customerid

GROUP BY ct.cname, cs.salesdate;

CNAME SALESDATE TOTALPRICE

-------------------- ----------- ----------

Jone 14-JUN-2016 14027.52

Rex 12-DEC-2015 14558.96

Lily 13-JUL-2016 13156

6666

I. Alter Tables – Add New Fields

• Syntax:

ALTER TABLE <table_name>

ADD (attribute_name data_type_declare constraints_declare);

• attribute name: referring to the new attribute that you want to add into the

existing table

• data type: defines the data type and the size of the new attribute

• constraint: defines the constraints that the new attribute might be enforced by

certain constraints

• Example: add a transmission attribute to the CAR table

ALTER TABLE car

ADD (transmission VARCHAR2(10));

67

Alter Tables – Modify Fields

• Syntax:

ALTER TABLE <table_name>

MODIFY (attribute_name new_data_type);

• attribute name: refers to the attribute that you want to modify

• new data: defines the new data type that you want to use replacing the old one

• Example: change the data type of the transmission to CHAR with a size of

30

ALTER TABLE car

MODIFY (transmission CHAR(30));

I

68

Alter Tables – Drop Columns

• Syntax:

ALTER TABLE <table_name>

DROP COLUMN attribute_name;

• Example: delete the attribute transmission from the CAR table

ALTER TABLE car

DROP COLUMN transmission ;

I

Oracle/Week-4a-StarSchema-StudentEnrollment.pdf

1

More on Student Enrollment Case Study (Tutorial 2)

1. Two dimensions vs. One dimension The star schema of the case study used in Tutorial 2 (Student-Enrollment system) is shown as follows.

Star Schema-1 (one semester-year dimension)

The SQL codes to create dimension and fact tables are as follows:

--Campus dimension

Create table campus_dim as

SELECT distinct Ocampus

FROM Offering;

--Semester_year dimension

Create table sem_year_dim as

SELECT distinct Oyear||Osem as sem_id, Oyear, Osem

FROM Offering;

--Subject Dimension

Create table subject_dim as

SELECT *

FROM subject;

--Grade Dimension

Create table grade_dim as

SELECT distinct Grade

FROM Enrollment;

2

--The Fact

CREATE TABLE student_enrollment_fact as

SELECT

o.Ocampus,

o.Oyear||o.Osem as sem_id,

s.Ucode,

e.Grade,

count(st.sid) as num_of_student,

sum(e.score) as Total_score

FROM subject s, enrollment e, offering o, student st

WHERE e.OID = o.OID

and s.Ucode = o.Ucode

and st.SID = e.SID

GROUP BY

o.Ocampus,

o.Oyear||o.Osem,

s.Ucode,

e.Grade;

The records in the data warehouse are as follows:

SQL> select * from campus_dim;

OCAMPUS

---------

City

Main

DE

SQL> select * from sem_year_dim;

SEM_ID OYEAR OSEM

---------- ---------- ----

20092 2009 2

20091 2009 1

SQL> select * from subject_dim;

UCODE UTITLE UCREDIT

---------- -------------------- ----------

IT001 Database 5

IT002 Java 5

IT003 SAP 10

IT004 Network 5

IT005 ASP.NET 5

SQL> select * from grade_dim;

GRADE

-----

HD

P

D

C

N

3

SQL> select * from student_enrollment_fact;

OCAMPUS SEM_ID UCODE GRADE NUM_OF_STUDENT TOTAL_SCORE

--------- --------- ---------- ----- -------------- -----------

Main 20091 IT001 HD 3 253

Main 20091 IT001 D 2 149

Main 20091 IT002 C 1 65

Main 20091 IT002 N 1 47

Main 20092 IT002 D 2 151

City 20092 IT001 C 1 64

Main 20091 IT001 N 1 41

Main 20092 IT002 C 1 64

DE 20092 IT004 P 2 105

City 20091 IT003 C 1 63

City 20092 IT001 N 1 32

11 rows selected.

Look at particularly the sem_year_dim dimension, both in the SQL code and the contents of this dimension. In the above version of the star schema, the semester and the year information are actually combined into one dimension, called sem_year_dim dimension. Can we have two separate dimensions: one for semester, and the other for year? The answer is “yes, you can”. The new star schema would look like as follows. Because the original sem_year_dim is now split into two dimensions: sem_dim, and year_dim, consequently, in the fact table, we will have two attributes: semester and year, instead of just one attribute sem_id.

4

Star Schema-2 (two separate dimensions for semester and year)

The SQL codes to create these two new dimensions (sem_dim and year_dim) and the fact tables are as follows:

--Semester dimension

Create table sem_dim as

SELECT distinct Osem as sem_id

FROM Offering;

--Year dimension

Create table year_dim as

SELECT distinct Oyear as year_id

FROM Offering;

5

--The Fact

CREATE TABLE student_enrollment_fact_2 as

SELECT

o.Ocampus,

o.Osem as sem_id,

o.Oyear as year_id,

s.Ucode,

e.Grade,

count(st.sid) as num_of_student,

sum(e.score) as Total_score

FROM subject s, enrollment e, offering o, student st

WHERE e.OID = o.OID

and s.Ucode = o.Ucode

and st.SID = e.SID

GROUP BY

o.Ocampus,

o.Osem,

o.Oyear,

s.Ucode,

e.Grade;

The contents of the two dimensions and the new fact are as follows:

SQL> select * from sem_dim;

SEM_ID

----------

1

2

SQL> select * from year_dim;

YEAR_ID

----------

2009

SQL> select * from student_enrollment_fact_2;

OCAMPUS SEM_ID YEAR_ID UCODE GR NUM_OF_STUDENT TOTAL_SCORE

--------- ------ -------- ---------- -- -------------- -----------

DE 2 2009 IT004 P 2 105

Main 1 2009 IT001 N 1 41

Main 1 2009 IT001 D 2 149

Main 2 2009 IT002 C 1 64

Main 1 2009 IT002 N 1 47

Main 2 2009 IT002 D 2 151

City 2 2009 IT001 N 1 32

Main 1 2009 IT002 C 1 65

City 1 2009 IT003 C 1 63

Main 1 2009 IT001 HD 3 253

City 2 2009 IT001 C 1 64

11 rows selected.

Now compare the contents of Fact-1 and Fact-2. They are almost the same, except that Fact-1 has one attribute from the sem_year_dim dimension, whereas

6

Fact-2 has two attributes, called sem_id (from semester_dim) and year_id (from year_dim). Conclusion: From the FACT point of view, it does not really matter whether we have two separate dimensions or just one dimension. They are the same. However, if two information, like semester and year, is often seen as one entity or one piece of information, it would be easier if the two information is present in one dimension.

7

2. Dimension Hierarchy (Hierarchy vs. No Hierarchy) Suppose we have richer information about campus, as follows: Campus Dimension CampusID CampusName Address Suburb City State Postcode Country CL Clayton

Campus Wellington Rd

Clayton Melbourne Victoria 3800 Australia

CA Caulfield Campus

Dandenong Rd

Caulfield East

Melbourne Victoria 3145 Australia

PA Parkville Campus

Royal Parade

Parkville Melbourne Victoria 3052 Australia

SY Sydney Campus

Opera Boulevard

Sydney Sydney New South Wales

2001 Australia

MA Malaysia Campus

Jalan Lagoon

Bandar Sunway

Sydney Selangor 47500 Malaysia

SA South Africa Campus

Peter St Johanne sburg

Johannesb urg

Johanne sburg

1725 South Africa

In the star schema, we have two options: Option 1 is to have one dimension called campusDIM containing all of these attributes.

Option 1 (no hierarchy)

8

The SQL to create the Campus Dimension is shown as follows: --Campus dimension

Create Table campus_dim As

SELECT Distinct Ocampus, <<and any other attributes, eg campus name, city, etc>>

FROM Offering;

Option 2 is to have a hierarchy of three dimensions (e.g. campusDIM connected to the Fact, campusDIM is also the parent of cityDIM – because campus-city forms a hierarchy, and cityDIM becomes the parent of countryDIM).

Option 2 (with hierarchy)

The Campus dimension is still linked to the Fact, whereas the City dimension is linked (through hierarchy) to the Campus, and the Country dimension is linked

9

to the City dimension. Therefore, there is a hierarchy from Campus to City, and to Country.

Note that the information about city is pulled out from the campus, and then create a hierarchy between campus and city. The information about country is also pulled out from the campus and city. The SQL to create the three dimensions in the hierarchy is as follows: --Campus dimension

Create Table campus_dim As

SELECT Distinct Ocampus, <<and any other attributes, eg campus name, etc>>

FROM <<table that stores information about campus, city, country>>;

--City dimension

Create Table city_dim As

SELECT Distinct Ocity, <<and any other attributes about the city>>

FROM <<table that stores information about campus, city, country>>;

--Country dimension

Create Table country_dim As

SELECT Distinct Ocountry, <<and any other attributes about the country>>

FROM <<table that stores information about campus, city, country>>;

The contents of the dimensions in the hierarchy (campus, city, and country dimensions are as follows): Campus Dimension CampusID CampusName Address Suburb Postcode CityID CL Clayton Campus Wellington Rd Clayton 3800 MEL CA Caulfield Campus Dandenong Rd Caulfield East 3145 MEL PA Parkville Campus Royal Parade Parkville 3052 MEL SY Sydney Campus Opera Boulevard Sydney 2001 SYD MA Malaysia Campus Jalan Lagoon Bandar Sunway 47500 KUL SA South Africa Campus Peter St Johannesburg 1725 JNB

City Dimension CityID CityName State CountryID MEL Melbourne Victoria AU SYD Sydney New South Wales AU KUL Kuala Lumpur Selangor MA JNB Johannesburg Johannesburg SA

Country Dimension CountryID CountryName AU Australia MA Malaysia SA South Africa

Note that the creation of the fact table is not affected at all. Either model (option 1 or option 2) will not affect the fact and other dimensions (e.g. grade, subject, semester_year dimensions).

10

Hierarchy should use the higher detail dimension (e.g. campus) as the parent, which is linked to the fact. The lower detail dimension (e.g. city) becomes the child in the hierarchy. The lowest detail dimension (e.g. country) becomes the lowest in the hierarchy. One of the most common mistakes is the wrong ordering of the dimensions in the hierarchy, such as in the following star schema:

An incorrect hierarchy (upside down)

11

Hierarchy vs No Hierarchy Since there are two options in implementing a star schema: with and without hierarchy, which option is more preferred?

1. One table vs. many tables Without hierarchy, there is only one table (e.g. the campus dimension table). With hierarchy, there are three tables (e.g. campus, city, and country dimension tables). The consequence is in the join query processing when producing reports. When you want to produce a report involving the fact and campus, with the no hierarchy option, we only need to use two tables: the campus dimension table, and the fact table. Hence, the join query processing is simple. With the hierarchy option, we need to join the fact with three tables (campus, city, and country). Consequently, the join query processing will be more complex

2. Normalized vs. Unnormalized With the hierarchy option, the tables are normalized, which follows the relational model. In contrast, the non-hierarchy option, the table (e.g. the campus dimension table) is unnormalized, which is basically 1NF with visible replication on information which is prone to anomalies (e.g. insert, update, and delete anomalies). Since no update, insert, or delete operations are performed to the dimension tables (once they have created), hence, anomalies become irrelevant. Therefore, the unnormalized option may be preferred, also due to a simpler join query processing.

3. Drilling down and rolling up People often mistakenly thought that the hierarchy model is good for drilling down. The hierarchy is not from country to city and to campus. The correct hierarchy is from the detail to the more general, such as from campus to city, and to country. Hence, we are not drilling down from campus to country, because drilling down should from something that is general (such as country) to something more detail (such as city and campus). In other words, the hierarchy model is not about drilling down information exploration. How about rolling up? Rolling up is exploring information from the detail to a more general. For example, retrieve number of students located at a campus level, and then rolling up to the country level. With the hierarchy option, we still need to do two queries: one for the campus and the second for the country.

12

Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select CountryName, Sum(Num_of_student) From Fact F, CampusDim C1, CityDim C2, CountryDim C3 Where F.CampusID = C1.CampusID And C1.CityID = C2.CityID And C2.CountryID = C3.CountryID Group By CountryName; Whereas with the non-hierarchy option, we still need to do two queries: Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select Country, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By Country; From the query point of view, both need two queries. From the query processing point of view, the non-hierarchy option uses one join operation only, because it only needs to join the fact and one dimension. From the conceptual point of view, the hierarchy model does not actually offer a better roll up or drill down features. If we insist to have a drill down, that is from country to campus, we still need to do the two queries above, but we will do the country query first, and then the campus query second.

Oracle/Week-4b-Determinant Dimensions.pdf

  1  

Determinant  Dimensions   The  Petrol  Price  Case  Study  

  1.  Snapshot  of  an  operational  database     The  following  is  a  sample  operational  database  that  keeps  the  petrol  prices  (of   different  kind  of  petrol)  of  petrol  stations  within  a  period  of  time    

 

  2  

  2.  Star  Schema     The  requirements  for  the  data  warehouse  are  to  answer  questions  related  to  (a)   average  petrol  price,  (b)  min  petrol  price,  and  (c)  max  petrol  prices.     From  the  sample  data  above,  we  choose  the  following  three  angles  to  view  the   fact  measures:  (i)  day  of  week,  (ii)  suburb,  and  (iii)  petrol  company.       From  the  Day  of  Week  point  of  view,  it  is  sensible  to  answer  the  question   something  like  which  day  is  has  the  lowest  petrol  price.  We  use  a  two-­‐column   table  methodology  to  visualize  a  conceptual  model  of  the  star  schema.  Hence,  the   two-­‐column  table  for  the  day  of  week  is  as  follows:     Day  of  week   Total  Petrol  

Price   Num  of  Petrol   Station  

Min  Petrol   Price  

Max  Petrol   Price  

Monday           Tuesday           …             Notes:    

1. In  the  above  2-­‐column  table,  the  first  column  (say  column  A,  which  is  Day   of  Week)  is  a  category.  The  second  columns  (say,  columns  B1,  B2,  B3,  and   B4)  are  the  fact  measures.  

2. One  of  the  requirements  for  the  fact  measure  is  average  petrol  price.  Since,   the  fact  table  should  not  include  an  average  as  a  fact  measure,  we  include   two  attributes:  total  petrol  price,  and  number  of  petrol  stations,  instead.    

The  visualization  of  the  above  two-­‐column  table  for  Day  of  Week  is  rather   incomplete,  because  there  is  no  data  in  the  fact  measure  columns.  Take  a  look  at   the  Min  Petrol  Price  column  in  the  above  table.  What  kind  of  value  should  be  in   that  column?  Supposed  Monday  has  a  value  in  the  Min  Petrol  Price  column,  what   does  that  value  min?  It  does  not  specify  which  petrol  type,  whether  it  is  Unleaded   91,  or  Premium  95,  or  Premium  98,  for  example.  Therefore,  the  Min  Petrol  Price   column,  although  seems  right,  does  not  make  sense  at  all.  How  can  this  problem   be  solved?     One  solution  is  to  have  one  Min  Petrol  Price  for  each  petrol  type  (e.g.  Unleaded   91,  Premium  95,  Premium  98,  E10,  Diesel,  and  LPG).  Therefore,  we  have  6   attributes  to  represent  Min  Petrol  Prices,  namely:  Min  Unleaded  91  Price,  Min   Premium  95  Price,  Min  Premium  98  Price,  Min  E10  Price,  Min  Diesel  Price,  and   Min  LPG  Price.    

Min  Petrol  Price  Day  of   week  

Total   Petrol   Price  

Num  of   Petrol   Station  

Unleaded   P95   P98   E10   Diesel   LPG   Max   Petrol   Price  

Monday       119.9c   132.9c   139.9c   114.9c   115.9c   55.7c     Tuesday       118.9c   131.9c   136.9c   114.9c   115.9c   56.9c     …                      

  3  

We  need  to  do  the  same  for  the  Total  Price,  and  Max  Petrol  Price;  each  of  which   should  be  divided  into  6  columns.  The  new  two-­‐column  table  for  Day  of  Week  is   then  shown  as  follows:    

Total  Petrol  Price   Min  Petrol  Price   Max  Petrol  Price  Day  of   Week   U   95   98   E   D   LPG  

Num  of   Petrol   Station  

U   95   98   E   D   LPG   U   95   98   E   D   LPG  

Monday                                         Tuesday                                        

                                       

  The  matter  will  become  more  complicated  if  number  of  petrol  stations  that  has   Unleaded  91  is  different  from  that  of  Premium  98,  for  example.  Then  to  calculate   an  average  of  Unleaded  91  on  Monday,  and  an  average  of  Premium  98  on   Monday  cannot  be  simply  a  division  between  Tot91  and  Num  of  Petrol  Stations,   and  between  Tot98  and  Num  of  Petrol  Stations,  because  the  based  line,  which  is   Num  of  Petrol  Stations  for  different  fuel  type  is  different.  Therefore,  we  need  to   divide  the  Num  of  Petrol  Stations  column  into  6  columns  –  one  for  each  petrol   type.  As  a  result,  we  will  end  up  with  24  columns  to  store  the  fact  measures,   instead  of  just  4  columns,  as  initially  thought.     Using  the  same  principle,  we  can  have  a  two-­‐column  table  for  Suburb,  and   another  one  for  Company.    

Total  Petrol  Price   Num  of  Petrol   Station  

Min  Petrol  Price   Max  Petrol  Price  Suburb  

U   95   98   E10   D   L   U   95   98   E10   D   L   U   95   98   E10   D   L   U   95   98   E10   D   L  

Box  Hill                                                   Clayton                                                                                                      

Total  Petrol  Price   Num  of  Petrol   Station  

Min  Petrol  Price   Max  Petrol  Price  Company  

U   95   98   E   D   L   U   95   98   E   D   L   U   95   98   E   D   L   U   95   98   E   D   L  

Caltex                                                   7-­‐Eleven                                                                                                       Based  on  the  above  three  two-­‐column  tables,  we  are  now  confident  to  have  the   following  star  schema:        

  4  

     

  5  

3.  Star  Schema  version-­2     Is  there  another  way  to  reduce  the  number  of  fact  measures?  The  above  star   schema  has  24  fact  measures,  because  there  are  6  different  kinds  of  petrol.  The   answer  is  yes!!!     We  can  have  a  new  dimension  called  Petrol  Type  DIM,  which  stores  the  different   kinds  of  petrol.  With  this  new  dimension,  the  fact  measure  is  reduced  to  4,  which   is  the  original  fact  measures  we  had  in  the  beginning.  Therefore,  the  new  star   schema  becomes  as  follows:    

  With  this  new  star  schema,  as  an  example,  we  can  retrieve  data  that  shows  the   Min  Petrol  Price  of  Premium  98  on  Monday.  Or  another  example  is  to  get  the   Average  Petrol  Price  (which  is  Total  Petrol  Price/Num  of  Petrol  Station)  of   Unleaded  Petrol  in  7-­‐Eleven  petrol  station  located  at  Clayton.     Note  that  when  retrieving  data  from  a  data  warehouse,  there  is  no  obligation  that   we  must  use  all  dimensions.  In  the  Monday  Premium  98  example,  only  two   dimensions  are  used,  which  are  Day  of  Week  DIM,  and  Petrol  Type  DIM.  And  in   the  second  example,  only  three  dimensions  are  used:  Petrol  Type  DIM,  Company   DIM,  and  Suburb  DIM.  So,  to  retrieve  data  from  the  data  warehouse,  we  can  use  

  6  

as  few  as  one  dimension  table;  but  we  can  also  use  all  dimension  tables.  We  are   not  restricted  to  which  dimension  we  have  to  use.     However,  this  imposes  a  new  problem.  For  example,  the  above  star  schema   allows  us  to  retrieve  the  Min  Petrol  Price  of  Monday,  which  uses  only  Day  of   Week  DIM.  This  data  retrieval  does  not  make  any  sense;  which  petrol  type  are   we  referring  to?     Based  on  this  case  study,  it  is  clear  that  the  Petrol  Type  DIM  holds  the  key  to  any   data  retrieval.  We  must  use  the  Petrol  Type  DIM  in  any  retrieval,  such  as  the  Min   Petrol  Price  of  Premium  98  on  Monday,  or  the  Average  Petrol  Price  of  Unleaded   in  7-­‐Eleven  petrol  station  located  in  Clayton  suburb.     The  Petrol  Type  DIM,  in  this  example,  is  called  a  “Determinant  Dimension”  –  a   dimension  which  needs  to  be  used  in  the  data  retrieval,  because  the  fact   measures  are  determined  by  this  dimension.  In  order  to  differentiate  between   Determinant  Dimension  and  all  other  dimensions,  Determinant  Dimension  is   denoted  by  a  star.  The  fact  measures  are  also  denoted  by  a  star.  The  new  star   schema  with  a  Determinant  Dimension  is  as  follows:    

 

Oracle/Week-4c-Determinant -vs-non-determinant.pdf

  1  

Determinant  vs.  Non-­Determinant   Dimensions  

  1.  The  Olympic  Games  Case  Study     The  Olympic  Games  committee  maintains  an  operational  database  that  stores  all   matches,  games,  as  well  as  the  medal  winners  of  the  Olympic  Games  over  the   years  (https://www.olympic.org/).       We  would  like  to  build  a  data  warehouse  to  analyze  the  medal  counts,  by  each   country,  sport,  and  at  which  Olympic  Games.     There   are   two   possible   star   schemas.   Version-­‐1   star   schema   contains   3   dimensions,   whereas   Version-­‐2   star   schema   contains   4   dimensions.   For   simplicity,  a  very  minimal  number  of  attributes  are  included  in  each  dimension.     Version-­‐1  star  schema  has  CountryDIM,  SportDIM,  and  OlympicGamesDIM  as  the   dimensions.  Three  fact  measures  are  included  in  the  fact  table,  namely  number   of  gold  medals,  silver  medals,  and  bronze  medals.    

Version-­1  Star  Schema  

 

  2  

Version-­‐2  star  schema  has  four  dimensions  –  with  an  additional  of  Medal  Type   DIM  (which  is  either  Gold,  Silver,  or  Bronze),  but  there  is  only  one  fact  measure   in  the  Fact  table,  which  is  Num  of  Medals.    

Version-­2  Star  Schema  with  the  Medal  Type  DIM  

  What  is  the  difference  between  these  two  versions  of  star  schema?     Is  Medal  Type  DIM  a  Determinant  Dimension?     In  order  to  answer  these  questions,  we  need  to  visualize  two-­‐column  tables  for   each  dimension  category.       2.  Two-­Column  Tables  for  Version-­1  Star  Schema  (without  the   Medal  Type  DIM)     The  first  two-­‐column  table  is  from  the  Country  point  of  view,  which  is  as  follows:     Country   Num  of  Gold   Num  of  Silver   Num  of  Bronze   USA   733   602   488   China   199   143   133   Australia   167   170   189            

  3  

These  are  the  number  of  Gold,  Silver  and  Bronze  medals  that  these  country  got   on   all   Olympic   Games   (several   Olympic   Games)   recorded   in   the   operational   database.   Note   that   this   two-­‐column   table   methodology   is   to   help   the   data   warehouse   designer   to   visualize   the   view   of   the   fact   measures   from   each   dimension.       The  second  two-­‐column  table  is  from  the  Sport  point  of  view.  Assuming  that  in   the  operational  database,  it  records  20  past  Olympic  Games,  and  at  each  Olympic,   there  is  only  one  gold  for  100m  Butterfly  Men,  for  instance.     Sport   Num  of  

Gold   Num  of   Silver  

Num  of   Bronze  

Swimming  100m  Butterfly  Men   20   20   20   Swimming  400m  Freestyle  Women   20   20   20   Swimming  4x100m  Medley  Relay  Men   20   20   20             The  third  two-­‐column  table  is  from  the  Olympic  Name  point  of  view.       Olympic  Name   Num  of  

Gold   Num  of   Silver  

Num  of   Bronze  

London  2012   302   304   356   Beijing  2008   302   303   353   Athens  2004   301   301   327             All  the  three  two-­‐column  tables  above  seem  to  be  reasonably  correct.  The  first   columns  are  the  categories,  while  the  other  columns  are  the  fact  measures  which   are  numeric  and  aggregate  values.  Because  these  three  two-­‐column  tables  make   sense,  we  are  confident  that  version-­‐1  star  schema  is  correct.       3.  Two-­Column  Tables  for  Version-­2  Star  Schema  (with  the   Medal  Type  DIM)     The  two-­‐column  tables  for  the  first  three  dimensions,  namely  Country,  Sport,  and   Olympic  Names  are  as  follows:     Country   Num  of  Medals   USA   1823   China   475   Australia   526         Sport   Num  of  Medals   Swimming  100m  Butterfly  Men   60   Swimming  400m  Freestyle  Women   60   Swimming  4x100m  Medley  Relay  Men   60        

  4  

Olympic  Name   Num  of  Medals   London  2012   962   Beijing  2008   958   Athens  2004   929         The  question  is  whether  these  two-­‐column  tables  make  sense.  If  we  look  at  the   country,   it  makes  sense  to  see  how  many  medals  Australia  has  received  in  all   Olympic  Games;  the  same  with  Sport,  and  Olympic  Names.  Finding  how  many   medals  (regardless   the  medal   types)   for  each  country,   for  each  sport,  and   for   each  Olympic  seems  to  be  reasonable.     The  fourth  two-­‐column  table  for  version-­‐2  star  schema  is  the  Medal  Type,  which   is  as  follows:     Medal  Type   Num  of  Medals   Gold   4115   Silver   4095   Bronze   4474     This   two-­‐column   table   on   Medal   Type   seems   to   be   reasonable   too.   Hence,   version-­‐2  star  schema  (with  Medal  Type  DIM)  is  correct.     So   in   conclusion,   both   star   schemas   (with   or   without   Medal   Type   DIM)   are   correct.   Now   going   back   to   the   original   question:   Is   Medal   Type   DIM   a   Determinant  Dimension?       4.  Determinant  or  Non-­Determinant  Dimensions     A  “Determinant  Dimension”  is  a  dimension  that  the  fact  measure  relies  on,  and   consequently,   all   data   retrieval   from   the   data   warehouse   must   include   this   dimension.  If  the  data  retrieval  from  the  data  warehouse  does  not  include  this   determinant  dimension,  the  retrieval  result  will  not  make  sense  at  all.     In   the  previous  case  study  on  Petrol  Price,  Petrol  Type  DIM  is  a  Determinant   Dimension  (refer  to  the  star  schema  below).  Note  that  a  Determinant  Dimension   is  denoted  by  a  star.  The  fact  measures  affected  by  the  determinant  dimension   are  also  starred.    

  5  

  The  fact  measures:  Total  Petrol  Price,  Num  of  Petrol  Station,  Min  Petrol  Price,   and  Max  Petrol  Price,  depend  on  the  Petrol  Type,  which  is  indicated  by  the  Petrol   Type  DIM.  That  means  analyzing  the  min  petrol  price  from  the  day  of  week  point   of   view,   must   include   the   petrol   type.   Otherwise,   it   doesn’t   make   sense   to   retrieve   data   to   show   that   on   Monday   the   lowest   (min)   petrol   price   is,   for   example,  109.90cents.  As  it  does  not  indicate  which  petrol  type  it  is,  this  lowest   petrol  price  is  meaningless.  Therefore,  a  better  data  retrieval  is  to  retrieve  the   record  to  show  that  for  example,  on  Monday,  the  min  “Unleaded”  petrol  price  is   109.90cents  (e.g.  Unleaded  is  a  petrol  type  obtained  from  the  Petrol  Type  DIM).     Now  going  back  to  the  Olympic  Games  case  study  (refer  to  Version-­‐2  star  schema   with  Medal  Type  DIM).  Is  Medal  Type  DIM  a  Determinant  Dimension?     The  answer  to  this  question  can  be  answered  by  another  question.  To  retrieve   the  data  from  version-­‐2  star  schema,  must  we  have  the  information  from  Medal   Type  DIM?  The  answer  is  clearly  no,  because  we  can  simply  retrieve  a  record   from  the  fact  to  show  that  Australia  in  London  2012  Olympic  Games  received  10   medals  in  Swimming.  This  covers  three  dimensions,  namely  Country  (Australia),   Olympic  Name  (London  2012),  and  Sport  (Swimming).   In  this  example,  Medal   Type  DIM  is  not  involved,  and  the  information  retrieved  still  makes  sense.    

  6  

So   the   answer   to   the   question   whether   Medal   Type   DIM   is   a   Determinant   Dimension  or  not,  the  answer  is  clearly  No!!     The  next  question  is:  what   is   the  difference  between  the  Olympic  Games  case   study  and  the  Petrol  Price  case  study.  Both  are  very  similar,  but   the  Olympic   Games  Medal  Type  DIM  is  not  a  determinant  dimension,  whereas  the  Petrol  Type   DIM  is  a  determinant  dimension.     The  answer  is  the  aggregate  function  used  in  the  fact  measure.  In  the  Olympic   Games  case  study,  the  fact  measure  function  is  COUNT,  which  is  count  of  medals.   The  breakdown  of  the  medals  is  gold,  silver,  and  bronze;  but  the  main  aggregate   function  of  the  fact  measure  is  number  of  medals,  which  is  a  count.  If  the  fact   measure   is   a   count,   then   the   dimension   (e.g.   Medal   Type   DIM)   is   not   a   determinant  dimension,  because  we  can  still  analyze  the  fact  measure  which  is   the  total  medals  from  other  dimensions,  without  the  medal  type  dimension.     On  the  other  hand,  the  Petrol  Price  case  study  uses  AVG,  MIN,  and  MAX  as  the   aggregation  functions.  Note  that  we  do  not  store  average  as  a  fact  measure,  but   total   price   and   number   of   stations.   These   two   fact   measures   will   be   used   to   calculate  the  average.  Although  the  average  is  not  explicitly  stored  in  the  fact,   implicitly,   the  total  price  and  number  of  stations  represent  the  average.   If   the   aggregate   function   to   calculate   the   fact   measures   is   not   COUNT,   then   a   determinant   dimension   is   needed.   In   this   case,   Petrol   Type   DIM   is   hence   a   determinant  dimension,  because  Min  Petrol  Price,   for  example,  does  not  have   any  meaning  without  petrol  type.       5.  Version-­1  (without  Medal  Type  DIM)  vs.  version-­2  (with   Medal  Type  DIM)     As  both  versions  in  the  Olympic  Games  case  study  are  correct,  let’s  compare  and   contract  these  two  versions.  In  order  to  do  this,  let’s  have  a  look  at  the  records  in   the  respective  fact  tables.     The   fact   table   for   version-­‐1   star   schema   (without   Medal   Type   DIM)   has   6   attributes:  three  from  the  dimensions,  and  the  other  three  for  the  fact  measures.   The  contents  of  the  fact  table  are  as  follows:     Fact  (version-­1  star  schema)   Country   Sport   Olympic  Name   Num  of  

Gold   Num  of   Silver  

Num  of   Bronze  

USA   Swimming   London  2012   16   9   6   China   Swimming   London  2012   5   1   4   Australia   Swimming   London  2012   1   6   3                 The  fact  table   for  version-­‐2  star  schema  (with  Medal  Type  DIM)  consists  of  5   columns:  4  from  the  dimension,  but  only  one  fact  measure.    

  7  

Fact  (version-­2  star  schema)   Country   Sport   Olympic  Name   Medal  Type   Num  of  Medals   USA   Swimming   London  2012   Gold   16   USA   Swimming   London  2012   Silver   9   USA   Swimming   London  2012   Bronze   6   China   Swimming   London  2012   Gold   5   China   Swimming   London  2012   Silver   1   China   Swimming   London  2012   Bronze   4   Australia   Swimming   London  2012   Gold   1   Australia   Swimming   London  2012   Silver   6   Australia   Swimming   London  2012   Bronze   3               From  the  storage  point  of  view,  it  is  clear  that  version-­‐1  is  the  winner.  It  has  only   3   records,   whereas   in   version-­‐2,   the   same   information   is   represented   in   9   records.     From  the  modeling  point  of  view,  some  may  prefer  version-­‐2,  because  the  model   is  concise  and  more  compact.       When  the  number  of  fact  measure  is  reasonably  large  (like  in  the  Petrol  Price   case  study),  the  star  schema  with  a  determinant  dimension  looks  very  slim  and   compact   –   hence,   it   is   easy   to   understand.   But   consequently,   the   storage   requirement  increases  as  well.     In  contrast,  with  many  different  petrol  types,  if  the  star  schema  does  not  use  a   determinant  dimension,   the  number  of  attributes   in   the   fact  will  dramatically   increase,  and  the  schema  looks  more  complex  and  crowded;  but  the  storage  cost   is  lower.    

Oracle/Week-5-AVG-in-Fact.pdf

Page 1/5

Average in the Fact? Should we store an “average” fact measure in the fact table? No!!!!!!! Consider the following example. Suppose we have the following 16 records as our data source in the operational database. Note that there are 9 Semester One and 7 Semester Two records, respectively. Out of the 8 Database unit recodes, 6 of them are Semester One, and 2 of them are Semester Two. An Operational Database Ucode Utitle Semester Sfname Score

IT001 Database 1 Mirriam 81 IT001 Database 1 Allan 41 IT001 Database 1 Ben 74 IT001 Database 1 Kate 85 IT001 Database 1 Larry 87 IT001 Database 1 Leonard 75 IT001 Database 2 Juan 64 IT001 Database 2 Andy 32 IT002 Java 1 Ally 65 IT002 Java 1 Menson 47 IT002 Java 2 Mirriam 78 IT002 Java 2 Ben 73 IT002 Java 2 Larry 64 IT003 SAP 1 Ally 63 IT004 Network 2 Juan 53 IT004 Network 2 Menson 52 The star schema of the above operational database contains one fact and two dimensions. The dimensions are: Subject and Semester (One or Two); and the fact measure is Average Score.

The fact table aggregates these score records based on their dimensions, which are subject and semester. If we store Average Score in the fact table, this is how the fact table will look like:

Page 2/5

Fact Table UCode Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

The dimension tables look like as follows: SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

SemesterDIM Table Semester

1 2 Looking at the Fact Table, the average score for the unit Database in Semester One is 73.833 (average of the first 6 score records); the average score for the unit Database in Semester Two is 48 ((64 + 32)/2). Is this fact table correct? It looks correct. But actually it is incorrect. For example, if we want to query the fact table to find out what is the average score of the Database unit, by looking at the above fact table, the answer would be (73.833+48)/2= 60.9165. The SQL to query the Fact Table is as follows:

Select Avg(Average_Score) From FactTable Where UCode = ‘IT001’;

Is this correct? No. In the Operational Database, there are 8 records for Database unit in Semester One and Two (see the first eight records in the operational database). If we sum all the score of these eight records and divided by eight records, the result will be 539/8=67.375; not 60.9165. Let’s do further comparisons:

Page 3/5

The average score for Java unit in Semester One and Two using the above fact is (56+71.667)/2=63.833. The actual average score for Java unit in Semester One and Two is not 63.833, but 65.4 (see the next 5 records in the above score list, and sum these scores and then divide by 5, 327/5). So again, the above fact table, which stores the average score, will not produce correct results. Ok, now let’s calculate further. The average score for Semester One using the above fact is (73.833+56+63)/3=64.278. In the above score list records, there are nine Semester One records, and the average is in fact 68.667. For Semester Two, using the above fact the average scare for Semester Two is (48+71.667+52.5)/3=57.389; whereas the actual average score for the seven Semester Two records is 59.4286. So, storing average as a fact measurement is not a good idea. How do we solve the above problems? In the fact table, we should store the “total score” and “number of students” in each aggregate group. Hence, the fact table should look like this. Fact Table 2 UCode Semester Total_Score NumberofStudents

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

Note that the dimension tables remain unchanged: SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

SemesterDIM Table Semester

1 2 Using the correct fact table above, it is easy to calculate the average score of Database unit, which is (443+96)/(6+2)=67.375. The SQL to query the Correct Fact Table is as follows:

Page 4/5

Select Sum(Total_Score)/Sum(NumberofStudents) From FactTable2 Where UCode = ‘IT001’;

The correct star schema is then as follows:

Conclusion The problem of AVG in the Fact is known as the “Average of an Average” problem. This problem is well known in Mathematics. Average of an average will simply produce an incorrect average result (almost all the time). Hence, it is not desirable to have an average measure in the fact – unless the analysis ALWAYS uses all the dimensions. How about Min or Max in the Fact? Can we do it? – Yes we can. Because Max of Max is always a global max, and Min of Min is always a global min. For example, using the above sample data, assume we have Max_Score and Min_Score in the Fact, as follows: Fact Table 3 Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

Page 5/5

SemesterDIM Table Semester

1 2 (*Note: the dimension tables are unchanged) Assuming we want to get the Max_Score of IT001, then the max of {87, 64} will produce 87, and 87 is the maximum score of IT001, because 87 is the max in semester 1, which is greater than any max of IT001 (e.g. in semester 2). In other words, “Max of Max” is correct. The SQL to retrieve the maximum score of IT001 is as follows:

Select Max(Max_Score) From FactTable3 Where UCode = ‘IT001’;

The same applies to “Min of Min”. If we want to get the minimum score of IT001, the result will be 32, which is the minimum between 41 and 32.

Select Min(Min_Score) From FactTable3 Where UCode = ‘IT001’;

We certainly don’t want to mix between min and max. For example, retrieving the minimum of Max_Score would be meaningless; the same as retrieving the maximum of Min_Score. As a final conclusion:

• Average in the fact is not desirable, although technically it satisfies the two criteria of the fact (e.g. must be a numerical and aggregate value)

• Min and Max in the fact can still be used, since min_score and max_score are valid fact measures (e.g. they are numerical and aggregated values)

• In general, count and sum are more common. Count is “number of”, and Sum is “total of”.