Oracle Advanced person Needed

HelpStudent13

Oracle.zip

Home >Computer Science homework help >Oracle Advanced person Needed

Oracle/FIT5195-2-Star Schema.pdf

Week 2 – Star Schema

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Notations and Processes

1. Star Schema Notation

2. E/R Diagram Notation

3. Transformation Process (Case Study)

2. Two-Column Table Methodology

Recall – The Big Picture

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

http://flux.qa/1AW6N8

Recall – Data Warehouse

▪ To address the drawback of operational database, and a need for decision-

making support data, data warehouse is needed.

▪ A data warehouse is a multi-dimensional view of databases, with

aggregates and pre-computed summaries.

➢ In many ways, it is basically doing aggregates in advance; that is exactly pre-

computation done at the design level, rather than at the query level.

Recall – Data Warehouse

Star Schema

▪ A Star Schema is a design representation of a multi-dimensional view. It is a

data modeling technique used to map multidimensional decision support

data into a relational database.

▪ The reason for the star schema’s development is that existing relational

modeling techniques: ER and normalization, did not yield a database

structure that served the advanced data analysis requirements well.

Star Schema Components

▪ There are Three main components of the Star Schema:

1. Facts

2. Dimensions

3. Attributes

Star Schema Components

1. Facts

Facts are numeric measurements (values) that represent a specific business aspect

or activity.

For example, sales figures are numeric measurements that represent product and/or

service sales.

2. Dimensions

Dimensions are qualifying characteristics that provide additional perspectives to a

given fact.

For example, sales might be viewed from specific dimension(s), such as sales

location, sales period, sales product, etc.

Star Schema Notation

▪ A Sales Star Schema

➢ Fact:

• Sales

➢ Dimensions:

• Time

• Product

• Branch

▪ Notation-wise, the Fact uses a bolder line, to differentiate between Fact

from Dimensions.

Star Schema Notation

▪ A Sales Star Schema

➢ Fact:

• Sales

➢ Dimensions:

• Time

• Product

• Branch

▪ The lines that represent a relationship

between the fact and dimensions can be

straight lines or bended lines.

Star Schema Notation

▪ Using the star schema notation, the

number of dimensions can be unlimited.

▪ If there is more dimensions, then we just

add more dimensions linked to the Fact.

Star Schema Components

3. Attributes

Each dimension table contains attributes.

For example:

Product dimension: Prod Type,

Description.

Location dimension: Region,

State,

City.

Time dimension: Year,

Month.

Star Schema Notation

▪ Sales Star Schema

(b) Sales Star Schema complete with the Attributes(a) Outline star schema for Sales

E/R Diagram Notation

(a) An Entity in E/R Diagram

(b) Relationships in E/R Diagram

E/R Diagram Notation

Associative

Relationship

E/R Diagram Notation

Non-Associative

Relationship

Transformation Process

Case Study #1

Case Study #1: International College

The admission office handles enrolment, payment, and marketing campaigns to international

students, often through educational agents located overseas. This admission office has an

operational system that maintains all the details of international students enrolled in the College.

Payment details are also handled by this office. Basically, the operational system has the following

features:

▪ Every student details are kept in the database. This includes the courses that the students enroll.

▪ As the College is a multi-campus university, some courses are offered in a different campus. The

admission office handles international students of all campuses.

▪ Some international students coming to the College are handled by an educational agent. This is

particularly common for the first course that a student enrolls in. Subsequent courses are not

normally handled by an agent, because the students themselves deal directly with the College.

▪ International students pay tuition fees several times (normally once every semester) for each

course they are doing.

Case Study #1: International College

The College now requires a data warehouse for analysis

purposes. The analysis is needed for identifying at least the

following questions:

1. How many students come from certain countries?

2. What is the total income for certain postgraduate

courses?

3. How many students are handled by certain agents?

4. How the number of enrolment of courses fluctuates

across the year?

Case Study #1: International College

▪ College Star Schema

➢ Fact:

• Number of Students

• Total Income

➢ Dimensions:

• Country

• Agent

• Course

• Year

Transformation Process

Case Study #2

Case Study #2: Sales

▪ Suppose that we would like to analyze

Total Sales from various point of

views, such as Quarter, Branch, and

Product Category.

Case Study #2: Sales

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Time

• Branch

• Product Category

Two Column Table Methodology

When creating a star schema, you need to

imagine that the data you want to analyse

consists of two columns.

The first column is the category (e.g. A, C,

D, E), and the second column is the

statistical numerical figure (e.g. B).

The second column (e.g. B) has to be

consistent throughout all the two-column

tables.

One Fact Measurement:

Two Column Table Methodology

▪ Case Study 1: Analysis of Accountants

Suppose the CPA organization would like to analyze its members (i.e.

accountants) in a particular city. Assume that the organization has the full

details of its members.

Education Number of Accountants

Diploma 84953

Bachelor 349203

Higher Degree 98943

Others 2322

Two Column Table Methodology

▪ We can also look at the figures from the gender point of view, like:

▪ Another way to analyze number of accountants is form the type of the accountant job

itself; something like:

▪ Note that the figures are fictitious, and the “Types” of Accountants (indicating different

roles of accountants) are also fictitious.

Gender Number of Accountants

Male 434322

Female 89932

Type Number of Accountants

Government 3843

Private Business 45303

Personal 45930

etc

Two Column Table Methodology

▪ You can further identify other example to analyze number of accountants. In

the above three tables, the first angle to look at the number of accountants

is from the educational background, the last one is from the type of the

accountant itself, whether it is a private business accountant, etc.

▪ As you can see, the second column is CONSISTENTLY UNIFORM. In the

above example, it is number of accountants. The first column changes

depending on from which angle that you want to see.

Two Column Table Methodology

▪ Therefore, in this case study, the star schema could look like the following:

Two Column Table Methodology

The second column in the two-column

tables, which is the numerical fact

measurement (e.g. column B) can

actually be multiple columns (call them:

B1, B2, B3), as long as all of these

columns (e.g. B1, B2, B3) relate to all of

the categories (e.g. A, C, D, E).

Multiple Fact Measurements:

Two Column Table Methodology

▪ Case Study 2: Student Enrollment

The University Administrator(s) needs to keep track of the number of

enrollment for particular unit or campus and the students’ performance

each year in order to maintain the University performance. The head of

admin has assigned you the task of developing a small Data Warehouse in

which to keep track the enrollment and performance statistics.

Two Column Table Methodology

▪ For example:

▪ Another example could be something like this:

Subject Number of Students Total Score

Database 8 539

Java 5 327

SAP 1 63

Network 2 105

Semester Number of Students Total Score

One 9 618

Two 7 416

F1 F2

Two Column Table Methodology

▪ In analyzing number of students (apart from the subject and semester as

shown above), you could also see the number of student from another

angle, for example from the campus and grade:

Campus Number of Students Total Score

Main 9 658

City 5 271

DE 2 105

F1 F2

Two Column Table Methodology

▪ For example:

Grade Number of Students Total Score

HD 3 253

D 4 300

C 4 256

P 2 105

N 3 120

F1 F2

Two Column Table Methodology

▪ The first columns of the above examples are the dimensions, whereas

the other columns that contain the statistical/summarized/aggregated

values is the fact.

▪ In the above example, the fact is then STUDENT_ENROLLMENT_FACT,

and the dimensions are SUBJECT, SEMESTER, GRADE and CAMPUS.

Two Column Table Methodology

▪ The star schema for the STUDENT ENROLLMENT is shown as follows:

End of Week 2 Lecture

Oracle/FIT5195-3-Bridge Table.pdf

Week 3 – Bridge Tables

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Bridge Tables

2. Temporary Tables

1. Temporary Dimension Tables

2. Temporary Tables in the Operational Database

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

http://flux.qa/1AW6N8

Bridge Tables

▪ A bridge table is a table that links between two dimensions; and only one

of these two dimensions are linked to the fact.

➢ As a result, the star schema becomes a snowflake schema.

Bridge Tables

▪ Two reasons on why a dimension cannot be connected directly to the Fact: ▪

a) The Fact table has a fact measure, and the dimension has a key identity. In

order to connect a dimension to the Fact, the dimension’s key identity must

contribute directly to the calculation of the fact measure. Unfortunately, this

cannot happen if the operational database does not have this data.

b) The operational database does not have this data if the relationship between

two entities in the operational database that hold the information about

dimension’s key identity and the intended fact measure is a many-many

relationship.

Bridge Tables

Case Study #1

Case Study #1 – A Product Sales Case Study

▪ A company management team would like to

analyze the statistics of its product sales

history. The analysis is needed to identify

popular products, suppliers supplying those

products, the best time to purchase more

stock, etc.

▪ A small data warehouse is to be built to keep

track of the statistics.

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

Case Study #1 – A Product Sales Case Study

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

Case Study #1 – A Product Sales Case Study

▪ The management is particularly interested in

analyzing the total sales (quantity * price) by

product, customer suburbs, sales time

periods (month and year), and supplier.

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Product

• Customer locations/suburbs

• Time period

• Supplier

▪ Possible Two-Column Methodology Tables:

Case Study #1 – A Product Sales Case Study

ProductNo TotalSales

A1 $130,000

B2 $15,900

C3 $2,500,000

… …

TimeID TotalSales

201801 $25,000

201802 $4,700

201803 $3,500

… …

Suburb TotalSales

Caulfield $6,500

Chadstone $12,000

Clayton $1,800

… …

(a) Product point of view (b) Time point of view (c) Suburb point of view

Case Study #1 – A Product Sales Case Study

▪ Sales Star Schema

➢ Fact:

• Total Sales

➢ Dimensions:

• Product

• Customer locations/suburbs

• Time period

• Supplier

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Supplier point of view

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Supplier point of view

Case Study #1 – A Product Sales Case Study

SupplierID TotalSales

S1 $77,000

S2 $5,700

S3 $12,500

… …

Case Study #1 – A Product Sales Case Study

Bridge Table

▪ To create Time Dimension: - create table TimeDim as

select

distinct to_char(SalesDate, ’YYYYMM’) as TimeID,

to_char(SalesDate, ’YYYY’) as Year,

to_char(SalesDate, ’MM’) as Month

from Sales;

▪ To create Customer Location Dimension: - create table CustLocDim as

select distinct Suburb, Postcode

from Customer;

Case Study #1 – A Product Sales Case Study

▪ To create Product Dimension: - create table ProductDim as

select distinct ProductNo, ProductName

from Product;

▪ To create Bridge Table: - create table ProductSupplierBridge as

select *

from StockSupplier;

▪ To create Supplier Dimension: - create table SupplierDim as

select SupplierID, Name as SupplierName

from Supplier;

Case Study #1 – A Product Sales Case Study

▪ To create Fact Table: - create table ProductSalesFact as

Select

to_char(S.SalesDate, ’YYYYMM’) as TimeID,

P.ProductNo,

C.Suburb,

sum(SI.QtySold*P.Price) as TotalSales

from Sales S, Product P, Customer C, SalesItem SI

where S.SalesNo = SI.SalesNo

and SI.ProductNo= P.ProductNo

and C.CustomerID = S.CustomerID

group by

to_char(S.SalesDate, ’YYYYMM’), P.ProductNo, C.Suburb;

Case Study #1 – A Product Sales Case Study

Bridge Tables

Case Study #2

Case Study #2 – A Truck Delivery Case Study

▪ A trucking company is responsible for picking up goods from warehouses of a retail chain

company, and delivering the goods to individual retail stores.

▪ A truck carry goods during a single trip, which is

identified by TripID, and delivers these goods to

multiple stores. Trucks have different capacities

for both the volumes they can hold and the

weights they can carry.

▪ At the moment, a truck makes several trips each

week. An operational database is being used to

keep track the deliveries, including the scheduling

of trucks, which provide timely deliveries to

stores.

Case Study #2 – A Truck Delivery Case Study

▪ A trip may pick up goods from many

warehouses

o i.e. a many-many relationship between

Warehouse and Trip

▪ A trip uses one truck only, and a truck may

have many trips in the history

o i.e. a many-1 relationship between Trip and

Truck

▪ A trip delivers goods (e.g. TVs, fridges, etc)

potentially to several stores

o a many-many relationship between Trip and

Store, which is represented by the Destination

table

▪ Sample data in the operational database:

Case Study #2 – A Truck Delivery Case Study

WarehouseID Location

W1 Warehouse1

W2 Warehouse1

W3 Warehouse1

… …

(d) Truck Table

(b) Trip Table (c) TripFrom Table

TripID Date TotalKm TruckID

Trip1 14-Apr-2018 370 Truck1

Trip2 14-Apr-2018 570 Truck2

Trip3 14-Apr-2018 250 Truck3

Trip4 15-Jul-2018 450 Truck1

… … … …

TripID WarehouseID

Trip1 W1

Trip1 W2

Trip1 W3

Trip2 W1

Trip2 W2

… …

TruckID VolCapacity WeightCategory CostPerKm

Truck1 250 Medium $1.20

Truck2 300 Medium $1.50

Truck3 100 Small $0.80

Truck4 550 Large $2.30

Truck5 650 Large $2.50

… … … …

StoreID StoreName Address

M1 MyStore City Melbourne

M2 MyStore Chaddy Chadstone

M3 MyStore HiPoint High Point

M4 MyStore Donc Doncaster

M5 MyStore North Northland

M6 MyStore South Southland

M7 MyStore East Eastland

M8 MyStore Knox Knox

… …

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(a) Warehouse Table

(e) Store Table (f) Destination Table

Case Study #2 – A Truck Delivery Case Study

▪ The management of this trucking company would like to analyze the deliver cost, based on

trucks, time period, and store.

Case Study #2 – A Truck Delivery Case Study

▪ Sales Star Schema

➢ Fact:

• Total Delivery Cost

(distance * cost per kilometre)

➢ Dimensions:

• Truck

• Time period

• Store

Case Study #2 – A Truck Delivery Case Study

▪ From the Truck point of view, Truck1 has two trips (e.g. Trip1 and Trip4), with the total kilometres of

820km (370km + 450km). The cost for Truck1 is $1.20. Hence, calculating the cost for Truck1 is

straightforward. Other trucks can be calculated this way.

▪ From the Period point of view, 14-Apr-2018 has three trips (e.g. Trip1,Trip2, and Trip3). Trip1 (370km) is

delivered by Truck1 which costs $1.20/km. Trip2 and Trip 3, on the same day, can be calculated the same

way. Hence, on 14-Apr-2018, the total cost can be calculated.

▪ From the Store point of view; The cost is calculated

based on Trip, but a trip delivers goods to many

stores. Therefore, the delivery cost for each store

cannot be calculated. The delivery cost is for the trip –

not for the store.

Case Study #2 – A Truck Delivery Case Study

Solution Model 1 – Using a Bridge Table

Case Study #2 – A Truck Delivery Case Study

Solution Model 2 – add a Weight Factor attribute

A weight factor is only needed if we want to estimate

the contribution that a dimension made to the fact

Case Study #2 – A Truck Delivery Case Study

StoreID StoreName Address

M1 MyStore City Melbourne

M2 MyStore Chaddy Chadstone

M3 MyStore HiPoint High Point

M4 MyStore Donc Doncaster

M5 MyStore North Northland

M6 MyStore South Southland

M7 MyStore East Eastland

M8 MyStore Knox Knox

… …

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(a) Trip Dimension Table

▪ To create Trip Dimension: - create table TripDim2 as

select T.TripID, T.TripDate, T.TotalKm,

1.0/count(*) as WeightFactor

from Trip T, Destination D

where T.TripID = D.TripID

group by T.tripid, T.tripdate, T.totalkm;

Case Study #2 – A Truck Delivery Case Study

Solution Model 3 – a List Aggregate version

Case Study #2 – A Truck Delivery Case Study

TripID StoreID

Trip1 M1

Trip1 M2

Trip1 M4

Trip1 M3

Trip1 M8

Trip2 M4

Trip2 M1

Trip2 M2

… …

(b) Bridge Table

(a) Trip Dimension Table

listagg (Attr1, ‘_’) within group

(order by Attr1) as ColumnName

▪ To create Trip Dimension: - create table TripDim3 as

select T.TripID, T.TripDate, T.TotalKm,

1.0/count(D.StoreID) as WeightFactor,

listagg (D.StoreID, ’_’) within group

(order by D.StoreID) as StoreGroupList

from Trip T, Destination D

where T.TripID = D.TripID

group by T.TripID, T.TripDate, T.TotalKm;

Case Study #2 – A Truck Delivery Case Study

▪ Joining based on the StoreGroupList attribute in the Trip dimension table and the StoreID in the Store dimension table:

- select *

from TripDim3 T, StoreDim3 S

where T.StoreGroupList like ’%’||S.StoreID||’%’;

▪ Without the StoreGroupList attribute in the Trip dimension, we need to join three tables:

- select *

from TripDim3 T, BridgeTable3 B, StoreDim3 S

where T.TripID = B.TripID

and B.StoreID = S.StoreID;

Case Study #2 – A Truck Delivery Case Study

Bridge Tables

Summary

Bridge Tables – Summary

▪ In principal, a Bridge Table is used: ▪

a) When it is impossible to have a dimension connected directed to the Fact table,

because simply there is no relationship between this dimension and the Fact table

(e.g. in the Product Sales case study, it is impossible to have a direct link from

SupplierDim to ProductSalesFact)

b) When an entity (which will become a dimension) has a many-many relationship with

another entity (dimension) in the E/R schema of the operational database (e.g.

Supplier and Stock has a many-many relationship).

c) When temporality aspect (data history) is maintained in the operational database

and the bridge table can be used to accommodate the dimension that has temporal

attributes (e.g. product supply history is maintained in the second snowflake schema

example).

Bridge Tables – Summary

▪ When a Bridge Table is used in the schema, there are two additional options: ▪

a) A Weight Factor is used to estimate the contribution of a dimension in the calculation

of the fact measure. Because this is only an estimate, a weight factor is option.

b) Every snowflake schema (whether it has Weight Factor or not) can be implemented

in two ways: a List Aggregate version, and a non-List Aggregate version.

Temporary Dimension Tables

1. Direct copying from a relational table in the operational database

2. Selecting certain attributes (or records) from a relational table in the

operational database

3. Creating the dimension table from scratch

– Insert records into the newly created dimension table

– Suitable for small dimension table

– Temporary dimension table is needed in circumstances where a new

dimension table cannot be created directly

Ways to Create Dimension Tables

Temporary Dimension Tables - Sales Case Study

An E/R Diagram for the

Sales Case Study

A Star Schema for the Sales Case Study

QuarterID - ‘YYYYQ’ is a unique

combination between year and quarter,

where Q is between 1 and 4

▪ TimeDIM has to be created manually

▪ Questions:

Can we insert the records into TimeDIM

one by one?

▪ If yes, how many records to be

inserted?

▪ If no, why?

Temporary Dimension Tables - Sales Case Study (cont.)

Solution for TimeDIM:

1. Create a temporary Time Dimension table that contains Month and Year

2. Add with a new attribute, called QuarterID.

3. Perform a series of updates to fill in the QuarterID column with the correct

values

4. Create the final TimeDIM table

Temporary Dimension Tables - Sales Case Study (cont.)

Exercise:

Construct the SQL command for Step 4

Temporary fact table is required

because TimeDIM is not directly copied

from the operational database

Temporary Fact Tables - Sales Case Study

Temporary Fact Tables - Sales Case Study (cont.)

2 3

Temporary Tables in Operational Database

Total_Sales = SUM(TotalPrice)

An E/R Diagram for the

Sales Case Study

A Star Schema for the Sales Case Study

Temporary Tables in Operational Database - Case Study

An E/R Diagram for the

Sessional Contract Jobs

Temporary Tables in Operational Database - Case Study (cont.)

A Star Schema for the

Sessional Contract Jobs

Last Degree of the

Employee

▪ One employee may have multiple

degrees.

▪ Most of the work requires a Bachelor

degree, to allow current Master

students to work as a teaching

assistant or as a programmer

▪ Current PhD students who have a

Master degree often work as a

teaching assistant too

▪ BUT, only the last degree is required

in DigreeDIM table.

Temporary Tables in Operational Database - Case Study (cont.)

Temporary Tables - EmployeeTemp Table

Retrieves the last degree of the

employees

Sample Records for Employee and Emp_Degree Table

Employee

Emp_Degree

RANK() OVER Function

Retrieved for

EmployeeTemp

table.

Create Fact Table from EmployeeTemp Table

Temporary Dimension table

Needed when it is not possible to create a dimension table directly; need to

apply a series of additional steps when creating a dimension table through the

creation of an intermediate table

Temporary table in operational database

Tables in the operational database are not readily used to create the Fact table.

Need transformation phase, where an intermediate table, called the Temporary

Operational Database Table is created.

Summary

Oracle/FIT5195-4-Snowflake Schema.pdf

Week 4 – Snowflake Schema

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. Hierarchies

1. Hierarchy vs. Non-Hierarchy

2. Hierarchy versus Multiple Independent Dimensions

3. Linked Dimensions

4. Hierarchy Design Considerations

2. Determinant Dimensions

1. Determinant vs. Non-Determinant Dimensions

2. Determinant Dimension vs. Pivoted Fact Table

3. Non-Type as a Determinant Dimension

4. Multiple Relationship between a Dimension and the Fact

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

http://flux.qa/1AW6N8

Hierarchies

▪ A Hierarchy is formed when a dimension is broken down to two or

more dimensions in a hierarchical manner.

➢ As a result, the star schema becomes a snowflake schema.

(a) A simple Star Schema

(b) A Snowflake with Hierarchy

Hierarchies

(a) A Snowflake with a Bridge Table (b) A Snowflake with Hierarchy

Hierarchies

1. Hierarchy versus Non-Hierarchy Dimensions

2. Hierarchy versus Multiple Independent Dimensions

3. Linked Dimensions

4. Hierarchy Design Considerations

1. Hierarchy vs. Non-Hierarchy

HierarchyNon-Hierarchy

1. Hierarchy vs. Non-Hierarchy

CityID City Name State CountryID

MEL Melbourne Victoria AU

SYD Sydney New South Wales AU

KUL Kuala Lumpur Selangor MA

JNB Johannesburg Johannesburg SA

CampusID CampusName Address Suburb Postcode CityID

CL Clayton Campus Wellington Rd Clayton 3800MEL

CA Caulfield Campus Dandenong Rd Caulfield East 3145MEL

PA Parkville Campus Royal Parade Parkville 3052MEL

SY Sydney Campus Opera Boulevard Sydney 2001SYD

MUM Malaysia Campus Jalan Lagoon Bandar Sunway 47500KUL

MSA South Africa Campus Peter St Johannesburg 1725JNB

(a) Campus Dimension Table

CountryID Country Name

AU Australia

MA Malaysia

SA South Africa

(b) City Dimension Table (c) Country Dimension Table

1. Hierarchy vs. Non-Hierarchy

(a) Correct Hierarchy (b) A Wrong Hierarchy

1. Hierarchy vs. Non-Hierarchy

1. One table vs. many tables

The consequence is in the join query processing when producing reports.

2. Normalized vs. Un-normalized

With the hierarchy option, the tables are normalized, which follows the relational

model. In contrast, the non-hierarchy option, the table (e.g. the Campus

Dimension table) is unnormalized, which is basically 1NF with visible replication

on information, which is prone to anomalies (e.g. insert, update, and delete

anomalies).

3. Drilling down and rolling up

The hierarchy model is not about drilling down information exploration.

1. Hierarchy vs. Non-Hierarchy – Summary

▪ From the query point of view, both versions need two queries.

▪ From the query processing point of view, the non-hierarchy version uses

one join operation only, because it only needs to join the fact and one

dimension.

▪ From the conceptual point of view, the hierarchy model does not actually

offer a better roll up or drill down features.

2. Hierarchy vs. Multiple Independent Dimensions

(a) Separate Dimensions (b) Combined Dimensions

2.1. Separate vs Combined Dimension

(a) Separate Dimensions (b) Combined Dimensions

2.1. Separate vs Combined Dimension

(a) Separate Dimension Model – Campus Dimension Table

(b) Separate Dimension Model –

Country Dimension Table

2.1. Separate vs Combined Dimension

(d) Combined Dimension Model – Campus Dimension Table

(e) Combined Dimension Model

– Fact Table

2.1. Separate vs Combined Dimension (c) Separate Dimension Model – Fact Table (e) Combined Dimension Model – Fact Table

2.1. Separate vs Combined Dimension

(a) Separate Dimension Model – Campus Dimension Table (b) Separate Dimension Model –

Country Dimension Table

(d) Combined Dimension Model – Campus Dimension Table

2.1. Separate vs Combined Dimension

(a) Separate Dimensions (b) Combined Dimensions

2.2. Combined Dimension vs. Hierarchy

(b) Combined Dimensions (c) Hierarchy Dimensions

2.2. Combined Dimension vs. Hierarchy

(a) Hierarchy Model – Campus Dimension Table

(b) Hierarchy Model – Country Dimension Table

2.2. Combined Dimension vs. Hierarchy

▪ Reasons on why the Hierarchy model is not ideal:

➢ An efficient query processing to query the fact and dimensions,

➢ The identifier of the Campus Dimension (e.g. CampusID) already covers the

child dimension (e.g. Country Dimension), and

➢ Campus and Country information is often regarded as one entity; at least in this

case study.

(b) Combined Dimensions (c) Hierarchy Dimensions

3. Linked Dimensions

(a) Linked Dimension Model – Campus Dimension Table

(b) Linked Dimension Model – Country Dimension Table

4. Hierarchy Design Considerations

Hierarchy – Summary

▪ A dimension hierarchy is connecting two or more dimensions in a hierarchical

manner, using a many-1 relationship.

➢ As a result, the dimensions in a hierarchy are normalized, in 3NF, using the

context of Relational Database Design.

▪ Compare and contrasts five different models:

1. Separate Dimension model

2. Combined Dimension model

3. Hierarchy model

4. Linked Dimension model

5. Hierarchy Design Considerations

Determinant Dimensions

Dimensions and Fact Tables - Recap

A Star Schema for the Sales Case Study

Example #1:

Querying Fact Table

Example #2:

Querying Fact Table (cont.)

A star schema may have a special dimension

all data retrieval from this star schema must use this dimension

else, the retrieved data will become meaningless.

Determinant Dimensions

Determinant

_DIM

Determinant_

Att

Special dimension

Key identifier

Introducing a Determinant Dimension – Petrol Station Case Study

Sample operational database of a company about petrol prices from all petrols

stations in Victoria:

The requirements for the data warehouse

are to answer the questions related to

(a) average petrol price,

(b) minimum petrol price, and

Petrol Station - Star Schema Version-1

Chosen Dimensions:

(i) day of week,

(ii) suburb, and

(iii) petrol company.

Two-column Table for Petrol Prices based on Day of Week

Category Fact Measures

Minimum price for which petrol type?

Petrol Station - Star Schema Version-1 (cont.)

▪ Not all petrol stations sell all types of petrol, e.g. Premium 98.

▪ Calculating the average price for Premium 98 on Monday cannot be:

Total_Petrol_Price_P98 / Number_of_Petrol_Station

▪ The number of petrol stations is different for different petrol type!

Petrol Station - Star Schema Version-1 (cont.)

▪ Solution: Split the Number of Petrol Station column into six columns – one

for each petrol type.

▪ Question: How many columns in total?

Petrol Station - Star Schema Version-1 (cont.)

Two-column

Table for Petrol

Prices based

on Suburb Two-column Table for Petrol Prices based on Company

Petrol Station - Star Schema Version-1 (cont.)

Petrol Station - Star Schema Version-2

Min. price for

which petrol

type??

select DayofWeek,

Min_Petrol_Price

from PetrolFact

where DayofWeek = ’Monday’

and PetrolType = ’Unleaded’;

Petrol Station - Star Schema Version-2

select DayofWeek, PetrolType,

Min_Petrol_Price

from PetrolFact

where DayofWeek = ’Monday’;

Petrol Station - Star Schema Version-2

Determinant Dimension uses dotted box in the star schema.

The new star schema with a Determinant Dimension for Petrol Case Study

Determinant vs. Non-

Determinant

Dimensions

The Olympic Games Case Study

▪ The Olympic Games committee maintains an operational database that

stores all matches, games, as well as the medal winners of the Olympic

Games over the years

▪ Let’s build a data warehouse to analyse the medal counts, by each country,

sport, and at which Olympic Games

Determinant vs. Non-Determinant Dimensions

The Olympic Games Case Study - The Star Schemas

Version-1

Version-2

The questions are:

1. What is the difference

between these two

versions of star schema?

2. Is Medal Type Dimension

a Determinant

Dimension?

The Olympic Games Case Study - The Star Schema Version-1

Two-column tables

(pivot tables) for each

dimension category in

star schema version-1

The Olympic Games Case Study - The Star Schema Version-2

▪ Is MedalTypeDIM a determinant dimension?

– To retrieve the data from version-2 star schema, must we have the

information from Medal Type Dimension?

Determinant vs. Non-Determinant Dimensions

Version-2

COUNT = total medals from

other dimensions, without the medal type dimension)

Fact Table for Version-1

The Olympic Games Case Study - The Star Schema Version-1 vs Version-2

Fact Table for Version-2

The Olympic Games Case Study - The Star Schema Version-1 vs Version-2

Differenc

Version 1 Version 2 (with

MedalTypeDIM)

Storage Lower with only three records Nine records

Modelling The schema looks more

complex and crowded; but the

storage cost is lower

More concise and compact with

less number of measures in the

fact, easier to understand

Query

Processin

Less join processing between

dimension tables and the fact

table.

Requires additional join with

MedalTypeDIM

Determinant

Dimensions vs.

Pivoted Fact Table

The PTE Academic Test Case Study ▪ PTE Test consists of four main components: Listening, Reading, Speaking, and

Writing

▪ There are many test venues (all major cities in Australia, and overseas as well), each

country may have a different price setting

▪ A student taking a PTE test will have one score for each of these four components,

as well as one overall score

▪ The score is a numerical score, ranging from 10 to 90

▪ The results will normally be released five days after the actual test date

▪ A student may take a PTE Academic Test several times, in order to improve the

score

▪ A student is identified by his/her RegistrationID, which stays the same student stays

the same

Determinant Dimensions vs. Pivoted Fact Table

ERD for PTE Academic Test

The PTE Academic Test Case Study - Sample Records for 8 Students

The PTE Academic Test Case Study - Star Schema with Determinant Dim.

To implement the star schema in SQL:

create table CountryVenueDim

select distinct C.CountryCode, C.CountryName, T.TestPrice

from Test_Venue T, Country C

where T.CountryCode = C.CountryCode;

create table CitizenshipDim

select distinct C.CountryCode as Citizenship, C.CountryName

from Student S, Country C

where S.Citizenship = C.CountryCode;

create table YearDim

select distinct to_char(TestDate, ’YYYY’) as Year from Test;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

create table GradeDim (

Grade varchar2(3),

Description varchar2(20),

MinScore number,

MaxScore number

);

insert into GradeDim values (’4.5’, ’Functional’, 30, 35);

insert into GradeDim values (’5’, ’Vocational’, 36, 49);

insert into GradeDim values (’6’, ’Competent’, 50, 64);

insert into GradeDim values (’7’, ’Proficient’, 65, 78);

insert into GradeDim values (’8-9’, ’Superior’, 79, 90);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Superior

create table TestComponentDim (

TestComponent varchar2(20)

);

insert into TestComponentDim values

(’Listening’);

insert into TestComponentDim values

(’Reading’);

insert into TestComponentDim values

(’Writing’);

insert into TestComponentDim values

(’Speaking’);

insert into TestComponentDim values

(’Overall’);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Create temporary fact table:

create table TempFact

select TV.CountryCode, S.Citizenship,

to_char(T.TestDate, ’YYYY’) as Year,

TR.ListeningScore, TR.ReadingScore,

TR.WritingScore, TR.SpeakingScore,

TR.OverallScore, TR.RegistrationID

from Test_Venue TV, Test T, Student S,

Test_Result TR

where TV.VenueID = T.VenueID

and T.TestNo = TR.TestNo

and TR.RegistrationID = S.RegistrationID;

alter table TempFact

add (

GradeOverall varchar2(3),

GradeListening varchar2(3),

GradeReading varchar2(3),

GradeWriting varchar2(3),

GradeSpeaking varchar2(3)

);

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

update TempFact

set GradeOverall = (

case

when OverallScore >= 30 and

OverallScore <= 35 then ’4.5’

when OverallScore >= 36 and

OverallScore <= 49 then ’5’

when OverallScore >= 50 and

OverallScore <= 64 then ’6’

when OverallScore >= 65 and

OverallScore <= 78 then ’7’

when OverallScore >= 79 and

OverallScore <= 90 then ’8-9’

end);

Repeat for

GradeListening, GradeReading,

GradeWriting, and GradeSpeaking

Superior

▪ The TempFact table has the correct grades for each student.

▪ The problem now is to breakdown the grades for a student into multiple test

components – Create another temporary fact for each of the test components, eg. a temporary

fact called ’OverallFact’ for the Overall Score, and so on so forth

create table OverallFact

select CountryCode, Citizenship, Year,

GradeOverall As Grade, ’Overall’ as TestComponent,

count(RegistrationID) as Total_Students_Overall

from TempFact

group by CountryCode, Citizenship, Year, GradeOverall, ’Overall’;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

Repeat for

GradeListening, GradeReading, GradeWriting, and GradeSpeaking

▪ There were eleven students taking the test which are now grouped into nine

records in Table OverallFact

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

▪ There are also three other fact tables: ReadingFact, WritingFact, and

SpeakingFact, with identical table structures

▪ These five fact tables can now be ’combined’ to form one final fact table

using a union operator: create table FinalFact as

select CountryCode, Citizenship, Year, Grade, TestComponent,

Total_Students_Overall as Total_Students

from OverallFact

union select * from ListeningFact

union select * from ReadingFact

union select * from WritingFact

union select * from SpeakingFact;

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

▪ The final fact table contains 45

records

▪ Conclusion: – When the input record in the

operational database contains the

scores of all the test components

as one record, it will be

challenging to break down the

records into multiple records in

the final fact table, because there

is a Test Component dimension.

The PTE Academic Test - Star Schema with Determinant Dim. (cont.)

The PTE Academic Test - Star Schema without Determinant Dim.

● Fact measure for all the test components

● This means that the fact table is a Pivoted Fact Table ○ Shifts the Test Component into the fact measure;

○ The Determinant Dimension - TestComponentDim is then removed from the star schema

▪ How many records do we have in fact table with five fact measures?

▪ However, since our test result data is incomplete, we will have problems

with the final fact, because in the non-determinant dimension version, we

need to keep track the zeroes – Even if the data is the test result is a lot, but still does not cover all possible

combinations from the dimension!

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

Answer:

60 records in the final fact table

= 2 countries of test venues * 6 countries of citizenship * 1 year * 5 grades

▪ ‘Superior’ grade is only available for Listening and Reading scores

▪ Using the normal fact table, there will not be entry in the fact table for Superior-Speaking, and for

Superior-Writing;

▪ There will not be zero value in TotalStudents, where the Grade column is ’Superior’ and the Test

Component column is either ’Speaking’ or ’Writing’ → less than 60 rows of records in the fact table

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

Superior

Solution: Step1

Get all possible combinations from all dimensions using Cartesian product

between all dimensions:

create table AllDimensions

select CO.CountryCode, CI.Citizenship, Y.Year, G.Grade

from CountryVenueDim CO,CitizenshipDim CI,

YearDim Y, GradeDim G;

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

Solution: Step 2

Re-create five temporary fact tables, one for each test component, using Outer

Join operation between AllDimensions table and each of the temporary fact

tables created in the previous section (see slide #35).

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

create table OverallFactNew as

select A.CountryCode, A.Citizenship,

A.Year, A.Grade,

nvl(O.Total_Students_Overall, 0)

as Total_Students_Overall

from AllDimensions A, OverallFact O

where A.CountryCode = O.CountryCode(+)

and A.Citizenship = O.Citizenship(+)

and A.Year = O.Year(+)

and A.Grade = O.Grade(+);

create table ListeningFactNew as

select A.CountryCode, A.Citizenship,

A.Year, A.Grade,

nvl(O.Total_Students_Listening, 0)

as Total_Students_Listening

from AllDimensions A, ListeningFact O where

A.CountryCode = O.CountryCode(+)

and A.Citizenship = O.Citizenship(+)

and A.Year = O.Year(+)

and A.Grade = O.Grade(+);

create table ReadingFactNew as select ...;

create table WritingFactNew as select ...;

create table SpeakingFactNew as select ...;

Solution: Step 3

Create the final fact table using the join operation

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

create table FinalFact2 as

select O.CountryCode, O.Citizenship, O.Year, O.Grade, O.Total_Students_Overall, L.Total_Students_Listening,

R.Total_Students_Reading, W.Total_Students_Writing, S.Total_Students_Speaking

from OverallFactNew O, ListeningFactNew L, ReadingFactNew R, WritingFactNew W, SpeakingFactNew S

where O.CountryCode = L.CountryCode

and L.CountryCode = R.CountryCode and R.CountryCode = W.CountryCode

and W.CountryCode = S.CountryCode and O.Citizenship = L.Citizenship

and L.Citizenship = R.Citizenship and R.Citizenship = W.Citizenship and W.Citizenship = S.Citizenship

and O.Year = L.Year and L.Year = R.Year and R.Year = W.Year and W.Year = S.Year

and O.Grade = L.Grade and L.Grade = R.Grade and R.Grade = W.Grade and W.Grade = S.Grade;

Solution: Step 4

Because our sample data is small, there are many records with zeroes values in

the Total Students columns. Delete records where Total Students in all of the

five test components are equal to zero.

The PTE Academic Test - Star Schema without Determinant Dim. (cont.)

delete from FinalFact2

where Total_Students_Overall = 0

and Total_Students_Listening = 0

and Total_Students_Reading = 0

and Total_Students_Writing = 0

and Total_Students_Speaking = 0;

Non-Type as a

Determinant

Dimension

▪ Type Dimension maybe a candidate for a Determinant Dimension:

– PetroTypeDim is a Determinant Dimension in the Petrol Price case

study,

– MedalTypeDim in the Olympic Games case study is a not a

Determinant Dimension

▪ Type Dimension is a good candidate for a Determinant Dimension, but

needs further examination to check if it is really a Determinant Dimension

▪ Other dimensions - non-type dimension can potentially be a Determinant

Dimension

Non-Type as a Determinant Dimension

▪ The Class Type Dimension in

this example is not a

Determinant Dimension

because we could get Number

of Students without referring to

any of the class types. ▪ E.g. retrieving number of

students in the Science

Faculty last year is certainly a

useful information, even

without Class Type

information.

Non-Type as a Determinant Dimension – University Enrolment Case Study

Type Dimension

Total student of

Science Faculty

Non-Type as a Determinant Dimension – University Enrolment Case Study

Total student of

Science Faculty

last year

vs.

The information is

dependent on the YearDIM

- determinant dimension

Converting a

Determinant Dimension

to a Pivoted Fact Table

Non-Type as a Determinant Dimension – University Enrolment Case Study

Question 1:

What if the number of records in the Determinant Dimension is huge?

– Number of fact measures increased to 20, 50?

– May not be practical

Solution:

Enforce Determinant Dimension through the User-Interface level ▪ Users are forced to enter/choose a criteria from the Determinant

Dimension.

▪ In filling a web-based form, this enforcement is normally indicated by a

star (*) next to the text entry to indicate that this text entry must be filled

in, before the form can proceed.

Pivoted Fact Table for University Enrolment Case Study

Question 2:

What if there are a lot of attributes in the Determinant Dimension? – If we shift the Determinant Dimension to the fact measures, only the key

identifier is kept in the fact measure.

– All other attributes from the Determinant Dimension will be lost from the star

schema. This issue will be discussed in the next section.

Pivoted Fact Table for University Enrolment Case Study (cont.)

Multiple Relationship

between a Dimension

and the Fact

▪ Private Taxi Case Study

– The dimension (or a Determinant Dimension) to be shifted to the fact

measure (e.g. Pivoted Fact Table) has many attributes.

– If only the key identifier of the dimension (or the Determinant

Dimension) is used as the fact measure, all other attributes of the

dimension (or the Determinant Dimension) are no longer exists in the

star schema

Multiple Relationship between a Dimension and the Fact

▪ Assume that

– The business rules stated that there is no Determinant Dimension. Even without a

Determinant Dimension, we could potentially shift a dimension to become fact measures

– The company is a small private taxi company, has only five cars

▪ In the fact table, instead of having three fact measures, for each fact measure, we could

replicate five times – one for each car.

Private Taxi Case Study

Replicate for each

car ===> 15 fact

measures

Remove this

CarDIM?

▪ CarDIM should not be removed – Other details of the car are

included in the data retrieval

▪ Because Car Dimension is not a

Determinant Dimension, we can

simply keep the original star

schema – Include only CarNo in the fact

table

– Use join operation to get more

details about car in query

Private Taxi Case Study (cont.)

select F.MonthYear, D.Make,

D.Model,

sum(F.Total_Kilometers)

from PrivateTaxiFact1 F,

CarDim D

where F.CarNo = D.CarNo

and F.CarNo = 5

and F.MonthYear = ’1908’

group by F.MonthYear, D.Make,

D.Model;

Insist that fact measures are broken into five cars

▪ Need to have five Car Dimension’s key

identifier in the fact

Private Taxi Case Study (cont.)

To retrieve Total Kilometers of five cars:

select F.MonthYear, D1.CarMake, D1.Year,

D2.CarMake, D2.Year, D3.CarMake, D3.Year,

D4.CarMake, D4.Year, D5.CarMake, D5.Year,

sum(Total_Kilometers)

from PrivateTaxiFact2 F, CarDim D1, CarDim

D2, CarDim D3, CarDim D4, CarDim D5

where F.CarNo1 = D1.CarNo

and F.CarNo2 = D2.CarNo

and F.CarNo3 = D3.CarNo

and F.CarNo4 = D4.CarNo

and F.CarNo5 = D5CarNo

and MonthYear = ’1908’

group by F.MonthYear, D1.CarMake, D1.Year,

D2.CarMake, D2.Year, D3.CarMake, D3.Year,

D4.CarMake, D4.Year, D5.CarMake, D5.Year;

▪ Although conceptually the dimension is already shifted to the fact measure,

we still need to keep the original dimension in the star schema.

▪ This is the only way to preserve all other attributes of that dimension.

▪ This relationship replication is applicable to shifting a Determinant

Dimension to the fact measure.

Multiple Relationship between a Dimension and the Fact

Two alternatives to enforce the implementation of a star schema that features a

Determinant Dimension

1. Enforce through the user-interface, whereby the user has to fill in or to

choose the Determinant Attribute during the search.

2. Shift the Determinant Dimension into a Pivoted Fact Table, whereby the key

identifier of the Determinant Dimension is incorporated into the fact

measure.

Determinant Dimension - Conclusion

Oracle/FIT5195-5-MoreComplexProcesses_update (1).pdf

Week 5 – More Complex Processes

Semester 1, 2020

FIT5195 – Business Intelligence and Data Warehousing

Developed by: Agnes Haryanto

[email protected]

Soon Lay-Ki

[email protected]

MONASH INFORMATION

TECHNOLOGY

Agenda

1. More Complex Processes

1. Use of count Function

2. Average in the Fact

3. Outer Join

2. One Attribute Dimensions

Using FLUX

1. Visit http://flux.qa/ on your internet enabled device

2. Log in using your Monash account (not required if you are already logged in to Monash)

3. Click on the “+” to join audience

4. Enter the Audience Code:

• Caulfield – 3GANT7

• Fully Flex – 39WRG8

• Malaysia – VTVPLW

5. Select FIT5195 in the Active Presentation menu

6. Answer questions when they pop up

http://flux.qa/1AW6N8

More Complex Processes

in Creating Fact

Recall – Star Schema Components

▪ There are Three main components of the Star Schema:

1. Facts

2. Dimensions

3. Attributes

Recall – Fact

▪ A Fact Table consists of key attributes from each dimension,

and fact measures.

▪ A Fact Table is created by a join operation, that joins several

tables from the operational database.

▪ Fact tables are created either through TempFact or directly

retrieval from the tables in the operational database.

▪ The fact measure itself is an aggregated value.

➢ In the SQL command, the fact measure attribute in the Fact Table is created using an aggregate function, such as count

or sum, and the group by operation.

Use of count Function

▪ The count function is one of the most common aggregate functions used to

create the fact measures.

▪ In SQL, there are three ways of using the count function:

1. count(*) – counts number of records in the query result

2. count(attribute) – counts number of records of that attribute excluding

null values

3. count(distinct attribute) – counts number of unique values in that

mentioned attribute

Case Study #1: Mobile Apps Repositories

Monalisa University is an international university. It has an online mobile app store: Monalisa App Store.

The Monalisa App Store allows students from any university in the world to publish their applications and

receive feedbacks. This app store is considered as a research environment where applications developed

during researches and studies can be tested and used by real users. The applications are free and open

source.

Basically, the operational system has the following features:

▪ Every user details are kept in the database, which includes the universities that the students enroll.

▪ Users can publish their apps or download other apps.

▪ Users can give feedbacks and ratings to other apps.

▪ The download and feedback statistics are stored in the database.

▪ Apps are organized into different categories; and many authors may have more than one app in

different categories.

Case Study #1: Mobile Apps Repositories

▪ A data warehouse is needed

analyze the ratings and feedbacks

by different authors and apps, so

that the staff at Monalisa University

can connect with the talented

authors and send them awards

annually.

▪ The author with the highest

average apps rating will be named

author of the year, and will receive

an award.

Case Study #1: Mobile Apps Repositories

create table CategoryDim as

select * from Category;

Case Study #1: Mobile Apps Repositories

create table UniversityDim as

select distinct UniversityID,

UniversityName

from University;

Case Study #1: Mobile Apps Repositories

create table LocationDim as

select distinct

Country || City as LocationID,

City,

Country

from University;

Case Study #1: Mobile Apps Repositories

create table TimeDim as

select distinct

to_char(DownloadDate, ’YYYYMM’) as TimeID,

to_char(DownloadDate, ’MM’) as Month,

to_char(DownloadDate, ’YYYY’) as Year

from Download;

Case Study #1: Mobile Apps Repositories

create table AppsDownloadFact as

select

to_char(D.DownloadDate, ’YYYYMM’) as DownloadMonth,

U.Country || U.City as LocationID,

A.CategoryID,

A.ApplicationID,

U.UniversityID,

count(*) as TotalDownloads

from University U, App_User R, Download D, Application A

where

U.UniversityID = R.UniversityID and

R.UserID = D.DownloaderID and

D.ApplicationID = A.ApplicationID

group by

to_char(D.DownloadDate, ’YYYYMM’),

U.Country || U.City,

A.CategoryID,

A.ApplicationID,

U.UniversityID;

Case Study #1: Mobile Apps Repositories

create table TimeDim as

select distinct

to_char(CreationDate, ’YYYYMM’) as TimeID,

to_char(CreationDate, ’MM’) as Month,

to_char(CreationDate, ’YYYY’) as Year

from Application;

Case Study #1: Mobile Apps Repositories

create table AppsFact as

select

to_char(A.CreationDate, ’YYYYMM’) as CreationMonth,

U.Country || U.City as LocationID,

A.CategoryID,

A.ApplicationID,

U.UniversityID,

count(distinct ApplicationID) as TotalApps

from University U, App_User R, Download D, Application A

where

U.UniversityID = R.UniversityID and

R.UserID = D.DownloaderID and

D.ApplicationID = A.ApplicationID

group by

to_char(A.CreationDate, ’YYYYMM’),

U.Country || U.City,

A.CategoryID,

A.ApplicationID,

U.UniversityID;

Average in the Fact

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

The operational database

contains:

• 9 records of Semester one

• 7 records of Semester two

Average in the Fact Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

The operational database

contains:

• 9 records of Semester one

• 7 records of Semester two

• 8 records of Database Unit

(6 Semester one and 2

Semester two)

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact (b) Subject Dimension

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact (b) Subject Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact (b) Subject Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

• Semester two: (64+32)/2 = 48

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact (b) Subject Dimension

Average Score for the Database Unit in:

• Semester one: (81+41+74+85+87+75)/6 = 73.833

• Semester two: (64+32)/2 = 48

These are actually

incorrect!

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact (b) Subject Dimension

The SQL command: select avg(Average_Score)

from EnrolmentFact

where UnitCode = ’IT001’;

Query: Calculate Average Score for the Database Unit:

• (73.833 + 48) / 2 = 60.9165

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for the Database Unit.

Calculation using Fact:

• (73.833 + 48) / 2 = 60.9165

Calculation based on the Operational Database:

• (81+41+74+85+87+75+64+32) / 8 =

539 / 8 = 67.375

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for the Java Unit in both Semesters.

Calculation using Fact:

• (56+71.667) / 2 = 63.833

Calculation based on the Operational Database: • (65+47+78+73+64) / 5 = 65.4

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for Semester One.

Calculation using Fact:

• (73.833+56+63) / 3 = 64.287

Calculation based on the Operational Database: • (81+41+74+85+87+75+65+47+63) / 9 = 68.667

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

(a) Fact

(b) Operational Database Query: Calculate Average Score for Semester Two.

Calculation using Fact:

• (48+71.667+52.5) / 3 = 57.389

Calculation based on the Operational Database: • (64+32+78+73+64+53+53) / 7 = 59.4286

Unit Code Unit Title Semester Student First Name Score

IT001 Database 1 Mirriam 81

IT001 Database 1 Allan 41

IT001 Database 1 Ben 74

IT001 Database 1 Kate 85

IT001 Database 1 Larry 87

IT001 Database 1 Leonard 75

IT001 Database 2 Juan 64

IT001 Database 2 Andy 32

IT002 Java 1 Ally 65

IT002 Java 1 Menson 47

IT002 Java 2 Mirriam 78

IT002 Java 2 Ben 73

IT002 Java 2 Larry 64

IT003 SAP 1 Ally 63

IT004 Network 2 Juan 53

IT004 Network 2 Menson 52

Average in the Fact

Unit Code Semester Total_Score Number_of_Students

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

(a) Fact Version 2

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(b) Subject Dimension

Average in the Fact

Unit Code Semester Total_Score Number_of_Students

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

(a) Fact Version 2

The SQL command: select sum(Total_Score)/

sum(Number_of_Students)

as Average_Score

from EnrolmentFact2

where UnitCode = ’IT001’;

Query: Calculate Average Score for

the Database Unit:

• (443+96)/(6+2) = 67.375

Average in the Fact

▪ The problem of Average in the Fact is known as the Average of an Average

problem.

➢ This problem is well known in Mathematics and Statistics.

➢ Average of an average will simply produce an incorrect average result.

▪ Hence, it is not desirable to have an average measure in the fact.

➢ Exceptional case: when the analysis ALWAYS uses all the dimensions (e.g.

Determinant Dimensions).

Min & Max in the Fact

▪ If Average should not be used in the fact, how about Min or Max?

➢ Yes, we can.

➢ Because Max of Max is always a global Max, and Min of Min is always a global

Min.

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Min & Max in the Fact

▪ Query: Find the Maximum Score of Database Unit.

• Max of {87, 64} is 87.

▪ The SQL command: select max(Max_Score) from EnrolmentFact

where UnitCode = ’IT001’;

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Min & Max in the Fact

▪ Query: Find the Minimum Score of Database Unit.

• Min of {41, 32} is 32.

▪ The SQL command: select min(Min_Score) from EnrolmentFact

where UnitCode = ’IT001’;

Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

Unit Code Unit Title

IT001 Database

IT002 Java

IT003 SAP

IT004 Network

Semester

(a) Fact

(b) Subject Dimension (c) Semester Dimension

Average in the Fact – Conclusion

▪ Average in the Fact is not desirable, although technically it satisfies the two

criteria of the fact (e.g. must be a numerical and aggregate value).

▪ Min and Max in the Fact can still be used, since Min Score and Max Score are

valid fact measures (e.g. they are numerical and aggregated values).

▪ In general, count and sum are more common.

Outer Join

Case Study #2: Employment Agency

The Employment Agency has a file of candidates who are willing to work.

▪ They record the candidates number, name, contact address, contact phone number and

maximum hours the candidate is available per week.

▪ Each candidate may have several qualifications.

▪ The Agency uses special codes to record the candidate’s qualification for a job opening.

▪ As well as recording the code, the Agency also records the experience of the candidate in

each qualification area, expressed as the number of months experience.

Case Study #2: Employment Agency

The Agency also has a list of companies that request temporaries.

▪ Each company is assigned a company number as an identifier.

▪ The company name, type of business and principal contact for employment placements is also recorded.

▪ Each time a company requests a temporary employee, the Agency makes an entry in the (job) openings file. This file contains an opening number, the company requesting an employee, required qualification, starting date, anticipated ending date, and hourly pay.

▪ Each opening requires only one specific or main qualification.

▪ The Agency may be able to fill the opening from the staff on its books, however in some circumstances it cannot fill the request.

Case Study #2: Employment Agency

▪ When a candidate matches the qualification, and is available, he/she is given the job, and an entry is made in the placement record folder.

➢ This folder contains an opening number, candidate number, actual start date, total hours worked to date, and end date if the placement is completed.

➢ The placement record folder is used by the Agency as a source of placement histories for it’s various temporaries.

Case Study #2: Employment Agency

▪ Note:

➢ Not all Openings have

Placements, but each Placement

comes from the Opening.

➢ A Candidate may have several

Placements.

➢ A Company may open several

Openings.

➢ Each Opening has one main

qualification requirement.

Case Study #2: Employment Agency

▪ A data warehouse to analyze

the total openings and total

placements by job durations,

qualification requirements, and

placement starting month.

▪ Several types of job durations:

➢ Short-term (<10 days)

➢ Medium-term (10 to 30 days)

➢ Long-term (>30 days)

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

▪ Several types of job durations:

➢ Short-term (<10 days)

➢ Medium-term (10 to 30 days)

➢ Long-term (>30 days)

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

Case Study #2: Employment Agency

▪ Star Schema

➢ Fact:

• Total Opening

• Total Placement

➢ Dimensions:

• Duration

• Qualification

• Month

1. How many short-term job openings?

2. How many jobs requiring certain qualifications?

3. How many people started working in a certain

month?

Case Study #2: Employment Agency

create table QualificationDim as

select * from Qualification;

Case Study #2: Employment Agency

create table MonthDim as

select distinct

to_char(ActualStartDate, ’Month’)

as MonthName

from Placement;

Case Study #2: Employment Agency

create table DurationDim

(DurationID number,

DurationDesc varchar2(20));

insert into DurationDim values (1, ’Short-Term’);

insert into DurationDim values (2, ’Medium-Term’);

insert into DurationDim values (3, ’Long-Term’);

Case Study #2: Employment Agency

create table TempFact as

select O.QCode,

O.StartDate,

O.EndDate,

to_char(P.ActualStartDate, ’Month’)

as MonthName,

O.OpenNo,

P.CandNo

from Opening O, Placement P

where O.OpenNo = P.OpenNo (+);

Case Study #2: Employment Agency

create table TempFact as

select O.QCode,

O.StartDate,

O.EndDate,

to_char(P.ActualStartDate, ’Month’)

as MonthName,

O.OpenNo,

P.CandNo

from Opening O, Placement P

where O.OpenNo = P.OpenNo (+); Left Outer Join

Case Study #2: Employment Agency

alter table TempFact

add (DurationID number);

update TempFact

set DurationID = 1

where EndDate - StartDate < 10;

update TempFact

set DurationID = 2

where EndDate - StartDate >= 10

and EndDate - StartDate <=30;

update TempFact

Set DurationID = 3

where EndDate - StartDate > 30;

Case Study #2: Employment Agency

create table AgencyFact as

select

QCode, DurationID, MonthName,

count(OpenNo) as TotalOpening,

count(CandNo) as TotalPlacement

from TempFact

group by QCode, DurationID, MonthName;

One-Attribute

Dimensions

One-attribute dimensions

One-attribute dimensions ● Dimensions with only one attribute, i.e. key attribute

Column-based Solution in the Fact

● Medal Type Dimension table seems unnecessary, because Medal Type also

exists in the fact table

● Hence, there is no need to reference to the Medal Type Dimension

Column-based Solution in the Fact (cont.)

Column-based solution ● Convert the dimension attribute into

new columns (fact measures) in the

fact table

● Only applicable for dimensions with

low cardinality, because it is not

practical to create a new column in

the fact table for each record of the

dimension

Row-based Solution in the Fact

● What happens if the one-attribute dimension has a lot of records?

● The solution: ○ Remove the dimension and keep the attribute of the dimension as a dimension-

less key in the fact table

Row-based solution whereby

each medal type is

represented by records in the

fact table

Row-based Solution in the Fact - Sales Case Study

● Assume that there are only four quarters (e.g. Q1, Q2, ..., Q4), regardless

the year

● The Quarter could be shifted to the fact measures

● Result: Pivoted fact table

Q1 to Q4

A Sales Star Schema with Pivoted Fact

▪ What if the Quarter also embeds the Year, e.g. Q12020, Q22018, and etc.? – Pivot Fact table cannot be the solution, because the number of possible records

in the Time Dimension can be large

▪ The solution: – Include the Quarter attribute as a Dimension-less Key in the fact, and to remove

the Time Dimension from the star schema

– Result: Row-based solutions - not restricted by the number of records in the

one-attribute dimension

Row-based Solution in the Fact (cont.)

▪ Similarity: – The contents of the fact table are identical.

▪ Difference: – The dimension-less key attribute has no references to the Time Dimension,

because the Time Dimension has been removed, whereas the Quarter in the star

schema has a reference to the Time Dimension.

Star Schema with Dimension-less Key vs. One-Attribute Dimension

Star Schema with Dimension-less key

Star Schema with TimeDIM

▪ Looking at the high level design, it is

often more desirable to keep the

one-attribute dimension, i.e. the Time

Dimension in this case

– Provides a clearer design,

considering that the fact table

remain the same, with or without

the one-attribute dimensions

Star Schema with Dimension-less Key vs. One-Attribute Dimension (cont.)

One-attribute dimensions

▪ If there is more than one dimension with single attribute, and these

dimensions are not related to each other, it is often desirable to combine

them into a junk dimension

▪ Case study: – Analyse the number of sales based on

these dimensions

Combine all one-attribute dimensions

Sales Fact Table

Combine all one-attribute dimensions (cont.)

Cartesian product of 4

dimension tables

▪ Assuming that each book can only have one category, then Category is

related to Book

▪ In this case, we can simply move the category into the book dimension

Combine with other normal dimensions - Book Sales Case Study

Combine with other normal dimensions - Book Sales Case Study (cont.)

Determinant Dimension with one-attribute only - Weather Data Case Study

▪ WeatherMonth attribute is a Determinant Dimension in this case.

▪ Hence, WeatherMonth Dimension must be kept in the star schema

– To enforce that WeatherMonth Dimension (or WeatherMonth attribute)

must be used in all queries to this star schema

WeatherMonthDI

Questions:

1. What is the granularity of this

star schema?

2. How can we have a higher

granularity in this star schema

(no category to go for)?

Answer to Question #2: ▪ Combine books into groups

based on the authors

One-attribute dimension with bridge - Book Sales Case Study

▪ The ISBN Group List can be

created by using the listagg function

▪ The Book Group List

Dimension is a one-attribute

dimension

▪ It is not possible to remove

this dimension as it links the

fact and the bridge table

One-attribute dimension with bridge - Book Sales Case Study (cont.)

1. To move the one attribution dimensions to the fact table

– The one-attribute dimension must have low cardinality

– Else, keep the dimension attribute in the fact as a dimension-less key

– From high level point of view, sometimes it is desirable to keep the

attribute dimension in the design

2. To combine the one-attribute dimension with other dimensions

– Form a junk dimension that combines several one-attribute dimensions

– Move the one-attribute dimension into another normal dimension

– One-attribute dimensions cannot be removed if:

▪ It is a determinant dimension

▪ It is connected to a bridge table

Summary for One-Attribute Dimensions

End of Week 5 Lecture

Oracle/Requiremnts-Oracle.pdf

FIT5195 S1 2020 – Take Home Test (10%)

Due date: Week 6, Friday 1-May-2020, 11:55pm

This Take Home Test consists of two Case Studies. You are required to answer all questions.

Case Study 1: Government Hospital Data Warehouse (50 marks) Currently, the Victorian government has a simple database system implemented to store the

collective data of all government-run hospitals. The database system includes information about their

services, staff, and patients. The simple database system consists of the following tables and

attributes:

Figure 1 : A simplified structure of the Government Hospital’s current E/R Diagram

Due to the nationwide impact of a new pandemic, the Premier of Victoria (i.e. the head of government

in the Australian state of Victoria) requires a data warehouse for analyzing trends and forecasting

purposes. The goal of the analyzing is to identify which frequently used services, popular hospital

locations, etc.

You are required to design a small Data Warehouse to keep track of the statistics. The Premier is

particularly interested in analyzing the total number (population quantity) of patients and total

service charged by services used, time periods, age groups, and hospital locations.

The Premier also wants to be able to drill down based on the cost of services range (low price < $20,

medium price between $20 and $50, and high price >$50).

Tasks:

[1]. Create the operational database based on the above E/R Diagram, and then populate with sufficient numbers of records in each table. It is suggested that each table would have 5-20

records depending on the use of that table in the operational database. You need to create a script

file containing the CREATE TABLE and INSERT INTO commands.

You can follow the list of services from this link:

https://www.monash.edu/health/medical/services.

[2]. Then develop a GOVERNMENT HOSPITAL star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you

have to also include the Weight Factor and List Aggregate.

The result of this task is a star schema diagram. You can use any drawing tool, such as

Lucidchart, to draw the star schema.

[3]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.

[4]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables.

[5]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and

show the results):

a) Show the total number of patients making appointments during Winter. b) Show the total service charged for each service cost type. c) Show the total number of patients by each age group (infant <1, children <18, adult 18+,

senior 65+) in April 2020.

d) Show the total service charged for general medical consultations in each suburb.

Submission checklist for Case Study 1:

A. A pdf file or word document containing:

i. The SQL commands to create tables and insert records to the operational database –

Task 1

ii. The star schema diagram – Task 2

iii. The Two-Column Table Methodology illustration – Task 3

iv. The SQL commands to create the dimension and fact tables, as well as the contents of

these tables – Task 4

v. The SQL commands to answer the queries in Task 5 and the query results

B. The .sql files containing:

i. The SQL commands to create the operational database.

ii. The SQL commands to create the data warehouse.

iii. The SQL commands to answer the queries in Task 5.

(Notes: The marker of this test will simply run this file. So you need to make sure that the

SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,

especially if you cut and paste from a Word file).

C. A folder:

i. Save all the files from [A] and the SQL files from [B] in one folder.

ii. Name the folder CaseStudy1_yourStudentID.

https://www.monash.edu/health/medical/services

Case Study 2: Accident Records Data Warehouse (50 marks)

Figure 2 : A simplified structure of the Accident Record’s current database system

The above figure displays how the Victorian Roads and Safety (VicR&S) Department stores accident

related data. When there are any road accidents, the details about the accidents such as the location,

date, time and event (e.g. what exactly happened during the accident) are recorded. For each accident,

the information (e.g. name, contact number, employment start date, current employment branch)

about one police officer who is in charge of overseeing the events of the accident (e.g finding out

who/what caused the accident) are stored. In addition, to identify a particular accident, each accident

is given a unique accident number.

On the other hand, one accident may be caused by several vehicles (e.g. a collision between 2

vehicles). All vehicles have owners who may or may not be the same as the vehicle driver during the

accident. Therefore, the VicR&S database stores information about vehicles involved in the accident

(i.e. vehicle number, vehicle model, vehicle make and vehicle manufacturer), the information about

the owners of the vehicles (i.e. owner name, owner address, owner contact number) and the

information about the drivers at the time of the accident (driver name, driver address, driver contact

number, driver licence no, and when the driver started to drive the vehicle). Furthermore, for each

accident, the severity of damage incurred by each vehicle is recorded.

Currently, to reduce the number of road accidents VicR&S department is interested in the following:

- The total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).

- The total number of accidents by each vehicle model. - The number of vehicles involved in every accident event on different locations. - The number of accidents taken care of different police officer branches.

Tasks:

[1]. Develop an ACCIDENT RECORDS star schema. Identify the fact table, dimensions and attributes required to support the schema. If the star schema consists of a Bridge Table, you have

to also include the Weight Factor and List Aggregate.

The result of this task is a star schema diagram. You can use any drawing tool, such as

Lucidchart, to draw the star schema.

[2]. Validate your star schema using the Two-Column Table Methodology. You are required to illustrate some two-column tables for this task based on your star schema design.

[3]. Write the SQL commands to create the fact and dimension tables. You need to create a script file containing the appropriate SQL commands to create the fact and dimension tables. The

operational tables are accessible from the ACCIDENT account. The result of this task is the SQL

commands. You will also need to show the contents of the tables that you have created.

[4]. Write the SQL commands to answer the following queries: (you need to make sure that there are records in your fact and dimensions tables. For each of the following queries, write the SQL and

show the results):

a) Show the total number of accidents happening by different locations and by different lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).

b) Show the total number of accidents by each vehicle model. c) Show the number of vehicles involved in every accident event on different locations. d) Show the number of accidents taken care of by different police officer branches.

[5]. You need to come up with additional two more questions and answer these questions using the SQL commands. Also explain the reason for why the management would like to have such

information.

Submission checklist for Case Study 2:

A. A pdf file or word document containing:

i. The star schema diagram – Task 1

ii. The Two-Column Table Methodology illustration – Task 2

iii. The SQL commands to create the dimension and fact tables, as well as the contents of

these tables – Task 3

iv. The SQL commands to answer the queries in Task 4 and the query results

v. The additional two questions with the SQL commands, query result, and your explanation – Task 5

B. The .sql files containing:

i. The SQL commands to create the data warehouse.

ii. The SQL commands to answer the queries in Task 4 and Task 5.

(Notes: The marker of this test will simply run this file. So you need to make sure that the

SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,

especially if you cut and paste from a Word file).

C. A folder:

i. Save all the files from [A] and the SQL files from [B] in one folder.

ii. Name the folder CaseStudy2_yourStudentID.

Submission Method:

1. A zip file: a. Save the CaseStudy1_yourStudentID and CaseStudy2_yourStudentID folders into one

folder named FIT5195_Test_yourname_studentID.

b. ZIP the folder FIT5195_Test_yourname_studentID.zip. This must be a ZIP file and not other types of compressed folder. The zip file should contain the prescribed folder

structure as follows:

● FIT5195_Test_yourname_studentID/ ○ CaseStudy1_yourStudentID/

○ Files for tasks ○ CaseStudy2_yourStudentID/

○ Files for tasks

2. Upload your zip file on Moodle by the due date: Friday 1-May-2020, 11:55pm (No late submission is accepted).

END OF TAKE HOME TEST

Oracle/Week-1c-SQL-revision.pdf

SQL - Revision

Assoc. Prof. David Taniar

SQL - Revision

Outline: A. Introduction to SQL client environment

B. Create tables

C. Insert into

D. Simple query retrieval

E. Updating and deleting records

F. Commit

G. Joining multiple tables

H. Aggregate functions and group by

I. Alter tables

A. Introduction to SQL client

• There are several SQL client software:

• SQL Developer

• SQL*Plus (Windows, Mac, Unix)

• Login details:

• Username: S12345678 (12345678 is your student id)

• Password: student

• Host string: the unit code

Introduction to SQL Developer

Oracle username which starts

with S and followed by your

student ID. Password is student

Location of

the server

This is the hoststring, which

is the database name

rhino.its.monash.edu

FIT5195

Introduction to SQL Developer

Single Statement Execution Script Execution

Output of Execution

Type Command

B. Create Tables

• General Syntax:

CREATE TABLE <table_name>

(attribute1 data_type <NOT NULL>,

attribute2 data_type,

…

PRIMARY KEY (attribute1),

FOREIGN KEY (attribute1)

REFERENCES <references table_name> (key_attribute));

Create Tables – Data Types

• Data type denotes a kind of data of an attribute value

• Character data types: VARCHAR2 and CHAR

Make VARCHAR2(20)

Model VARCHAR2(30)

• Number data types: NUMBER

Year NUMBER(4)

StampDuty NUMBER(8,2)

• Date data type: DATE

Salesdate DATE

Create Tables – Data Types

• Date Data Type

• DATE stores dates from 1/1/4712 BC to 12/31/4712 AD

• Default date format: DD-MON-YYYY

• example: 05-JUN-2011

• Example declaration: salesDate DATE

• DATE data type also stores time values

1010

Create Tables – Data Types

• Default time format: HH:MI:SS A.M.

• If no time value is given when a date is inserted, default value is

12:00:00 A.M.

• If no date value is given when a time is inserted, default date is

first day of current month

• Example salesDate field: 07-JUN-2016 12:00:00 A.M.

1111

Create Tables – Constraints

• Integrity Constraints

• Primary Key attribute

• NOT NULL constraints

• Specifies that a field cannot be NULL

• Sample Declaration: Field_name data_type NOT NULL

• Foreign Key attribute in a table refers to another record in another table

1212

Create Tables – Example 1

• Example: create CAR, CARSALES and CUSTOMER tables

• Commands:

• CAR table

CREATE TABLE Car

(carID NUMBER(5) NOT NULL,

make VARCHAR2(20) NOT NULL,

model VARCHAR2(30),

year NUMBER(4),

colour VARCHAR2(25),

advertisedPrice NUMBER(10),

PRIMARY KEY (carID)

);

1313

Create Tables – Example 2

• CUSTOMER table

CREATE TABLE Customer

(customerID NUMBER(5) NOT NULL,

cNAME VARCHAR2(20) NOT NULL,

cPhone VARCHAR2(10),

address VARCHAR2(50),

PRIMARY KEY (customerID)

);

1414

Create Tables – Example 3

• CARSALES table

CREATE TABLE Carsales

(carID NUMBER(5) NOT NULL,

customerID NUMBER(5) NOT NULL,

salesDate DATE NOT NULL,

purchasedPrice NUMBER(10) NOT NULL,

stampDuty NUMBER(8,2) NOT NULL,

PRIMARY KEY (carID,customerID),

FOREIGN KEY (carID) REFERENCES Car(carID),

FOREIGN KEY (customerID) REFERENCES Customer(customerID)

);

1515

Create Tables – by Copying

• You can create a new table by copying from an existing table:

CREATE TABLE <table_name>

AS SELECT *

FROM … ;

• For example:

CREATE TABLE car

AS SELECT *

FROM dtaniar.car;

• Notes:

• It creates and copy the records from the existing table

• However, it does not copy the PK and FK

• In the above example, it copies the table from dtaniar account.

1616

Create Tables – View Tables

• Viewing Information about Tables

• To view all tables in the database, the general syntax is:

SELECT * FROM TAB;

• For example:

SELECT * FROM tab;

TNAME TABTYPE CLUSTERID

------------------------------ ------- ----------

CAR TABLE

CARSALES TABLE

CUSTOMER TABLE

1717

Create Tables – Describe Tables

• To view the table structure, the general syntax is:

DESCRIBE <table_name>

• For example:

DESC car;

Name Null Type

--------------- -------- ------------

CARID NOT NULL NUMBER(5)

MAKE NOT NULL VARCHAR2(20)

MODEL VARCHAR2(30)

YEAR NUMBER(4)

COLOUR VARCHAR2(25)

ADVERTISEDPRICE NUMBER(10)

1818

Create Tables – Drop Tables

• To drop an unwanted table, the general syntax is:

DROP TABLE <table_name>;

• For example:

DROP TABLE Car;

• If the table that you want to delete (e.g. table Car) is being used as a FK by

another table (e.g. table Carsales), then you cannot delete table Car.

• In this case, you need to delete table Carsales first, before deleting table Car.

1919

C. Insert Into

• General Syntax

• To insert values for every single attribute in a record:

INSERT INTO <table_name>

VALUES (attribute1_value, attribute2_value,….);

• Example:

INSERT INTO Car

VALUES (1, 'Holden', 'Cruze', 2015, 'Black', 25780);

• Strings are enclosed in single quotes (‘) and are case-sensitive

(e.g. ‘General Practice’ is different from ‘general practice’)

Insert Into

• To insert a value of selected attributes:

INSERT INTO <table_name> (attribute1,attribute2,…)

VALUES (attribute1_value, attribute2_value,….)

• Example:

INSERT INTO Car (carID, make) VALUES (’16’, ’Audi’);

2121

Insert Into

• The TO_DATE function:

• TO_DATE (‘date_value’, ’format mask’);

• Example:

INSERT INTO Carsales

VALUES (1,4, TO_DATE('04/Feb/2015', 'DD/MON/YYYY'),25780,824.96);

2222

Insert Into

• The common DATE format masks

Format Mask Formatted Data

DD-MON-YYYY 05-FEB-2007

MM/DD/YYYY 02/05/2007

HH:MI AM 02:30 PM

MONTH DAY, YYYY FEB 5, 2007

MM/DD/YYYY HH:MI AM 02/05/2007 02:30 PM

Sample DATE format masks

Insert Into

• Insert multiple records one-by-one:

INSERT INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800);

INSERT INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200);

INSERT INTO Car VALUES (4, 'Holden', 'Commodore', 2008, 'Grey', 12650);

• Insert multiple records at once:

INSERT ALL

INTO Car VALUES (2, 'BMW', '520d', 2016, 'Grey', 98800)

INTO Car VALUES (3, 'Audi', 'A5', 2016, 'Black', 68200)

INTO Car VALUES (’ 4, 'Holden', 'Commodore', 2008, 'Grey', 12650)

SELECT * FROM DUAL;

2424

D. Simple Query Retrieval

• Simple Retrieval

• Retrieve all Records

• General Syntax :

SELECT *

FROM <table_name>;

• Example: retrieve everything from the CAR table:

SELECT * FROM car;

• Output:

2525

Simple Query Retrieval

• Retrieve Specific Fields

• General Syntax:

SELECT <attribute1, attribute2…>

FROM <table_name>;

• Example: select only the model from the CAR table

SELECT model FROM car;

• Output:

2626

Simple Query Retrieval

• Eliminating Duplicated Records (DISTINCT qualifier)

• General Syntax:

SELECT DISTINCT <attribute1,attribute2,…>

FROM <table_name>;

• Example: eliminating duplicates for the MAKE values:

SELECT DISTINCT make FROM car;

2727

Simple Query Retrieval

• Output (with DISTINCT):

• Original Output (without DISTINCT):

2828

Simple Query Retrieval

• DISTINCT Multiple Attributes

• Example: eliminating duplicates for the MAKE and YEAR values:

SELECT DISTINCT make, year

FROM car;

MAKE YEAR

-------------------- ----------

Volkwagen 2015

Toyota 2013

Mini 2009

Holden 2015

BMW 2016

Honda 2012

Subaru 2014

Holden 2008

Jeep 2006

Mazda 2011

Audi 2016

Lexus 2009

Honda 2010

Nissan 2010

Mercedes Benz 2006

15 rows selected

2929

Simple Query Retrieval

• Conditional Retrieval

• Search Conditions specified for more complex data retrieval

• The WHERE Clause

• Operators:

• equal (=)

• greater than (>)

• less than (<)

• greater than or equal to (>=)

• less than or equal to (<=)

• Not equal (<>)

3030

Simple Query Retrieval

• General Syntax:

SELECT <attribute1,attribute2,…>

FROM <ownername.table_name1>

WHERE <search condition>;

• Example:

SELECT year, model, advertisedPrice

FROM car

WHERE make='Holden'

YEAR MODEL ADVERTISEDPRICE

---------- ------------ ---------------

2015 Cruze 25780

2008 Commodore 12650

3131

Simple Query Retrieval

• “BETWEEN”

• Example: list the carid, salesDate and purchasedPrice after 2014 and

before 2016

SELECT carid, salesdate, purchasedPrice

FROM carsales

WHERE salesdate

BETWEEN TO_DATE('01-JAN-2014', 'DD-MON-YYYY')

AND TO_DATE('31-DEC-2015','DD-MON-YYYY’);

CARID SALESDATE PURCHASEDPRICE

---------- ----------- --------------

1 04-FEB-2015 25780

8 12-DEC-2015 13999

3232

Simple Query Retrieval

• Using a String Comparison

• Example: list the carid, salesDate and purchasedPrice after 2014 and

before 2016

SELECT carid, salesdate, purchasedPrice

FROM carsales

WHERE TO_CHAR(salesdate, ‘YYYYMMDD’) >’20140101’

AND TO_CHAR(salesdate, ‘YYYY’MMDD’) < ‘20151231’;

CARID SALESDATE PURCHASEDPRICE

---------- ----------- --------------

1 04-FEB-2015 25780

8 12-DEC-2015 13999

3333

Simple Query Retrieval

• “AND” or “OR”

• AND: both conditions must be true

• Example:

SELECT make, model, year

FROM car

WHERE year>2014 AND make='Holden';

MAKE MODEL YEAR

--------- ---------------- ----------

Holden Cruze 2015

3434

Simple Query Retrieval

• OR: either one of the condition is true

• Example:

SELECT make, model, colour, year, advertisedPrice

FROM car

WHERE colour='Red’ OR year>2014;

MAKE MODEL COLOUR YEAR ADVERTISEDPRICE

--------- ---------- ---------- ---------- ----------------

Holden Cruze Black 2015 25780

BMW 520d Grey 2016 98800

Audi A5 Black 2016 68200

Honda Civic Red 2012 13488

Lexus RX350 Red 2009 33500

Volkwagen Golf White 2015 39888

6 rows selected

3535

Simple Query Retrieval

• Other Conditions

• LIKE/NOT LIKE

• Example: displaying all MAKE from car table that has their first

character as ‘M’

SELECT make

FROM CAR

WHERE make LIKE 'M%';

MAKE

----------------

Mazda

Mercedes Benz

Mini

3636

Simple Query Retrieval

• IN/NOT

• suitable to perform a set member search

• Example:

• IN - displaying all MODEL that its MAKE is either ‘Holden’ or ‘Honda’

SELECT make, model

FROM car

WHERE make IN ('Holden', 'Honda');

MAKE MODEL

-------------------- -------------

Holden Cruze

Holden Commodore

Honda Civic

Honda City

3737

Simple Query Retrieval

• IN/NOT

• Example:

• NOT IN - displaying make, model and colour of cars that are not ‘Black’

and ‘White’

SELECT make, model, colour

FROM car

WHERE colour NOT IN ('Black’, 'White');

MAKE MODEL COLOUR

-------------- ------------ ------------

BMW 520d Grey

Holden Commodore Grey

Honda Civic Red

Lexus RX350 Red

Mazda 3Series Blue

Mercedes Benz C200 Blue

Honda City Grey

3838

Simple Query Retrieval

• NULL/NOT NULL

• Example:

• NULL Operator:

SELECT carid, make

FROM car

WHERE colour IS NULL;

CARID MAKE

------ ---------

12 Subaru

3939

Simple Query Retrieval • NOT NULL Operator:

SELECT carid, make

FROM car

WHERE colour IS NOT NULL;

CARID MAKE

------- --------------------

1 Holden

2 BMW

3 Audi

4 Holden

5 Honda

6 Jeep

7 Lexus

8 Mazda

9 Nissan

10 Volkwagen

11 Mercedes Benz

13 Honda

14 Mini

15 Toyota

14 rows selected

4040

Simple Query Retrieval

• MULTIPLE OPERATORs

• Example: list cars make start with ‘M’ and was made before 2016

SELECT year, make, model

FROM car

WHERE make LIKE 'M%'

AND year<2016;

YEAR MAKE MODEL

---- ---------------- -------------

2011 Mazda 3Series

2006 Mercedes Benz C200

2009 Mini Cooper

4141

Simple Query Retrieval

• Sorting

• specify to sort the output by using ORDER BY

• General Syntax:

SELECT <attribute1,attribtue2,..>

FROM <table_name>

ORDER BY <attribute_name> [DESC];

4242

Simple Query Retrieval • Example: retrieving all make, model, advertisedPrice and colour in a descending

order of the advertisedPrice SELECT make, model, advertisedPrice, colour

FROM car

ORDER BY advertisedPrice DESC;

MAKE MODEL ADVERTISEDPRICE COLOUR

--------------- ------------------ --------------- --------

BMW 520d 98800 Grey

Audi A5 68200 Black

Volkwagen Golf 39888 White

Subaru Outback 33950 NULL

Lexus RX350 33500 Red

Holden Cruze 25780 Black

Toyota Corolla 23888 White

Jeep Grand Cherokee 21999 Black

Mini Cooper 19750 Black

Nissan Dualis 18888 White

Mercedes Benz C200 16995 Blue

Mazda 3Series 13999 Blue

Honda Civic 13488 Grey

Holden Commodore 12650 Grey

Honda City 7490 Grey

15 rows selected

Simple Query Retrieval

• “AS”: to rename a column

• Example: list all SalesDate, PurchasedPrice, StampDuty and

TotalPrice (TotalPrice = PurchasedPrice+StampDuty )

SELECT salesdate, purchasedPrice, stampduty,

(purchasedPrice+StampDuty) as TotalPrice

FROM carSales;

SALESDATE PURCHASEDPRICE STAMPDUTY TOTALPRICE

----------- -------------- ---------- ----------

04-FEB-2015 25780 824.96 26604.96

13-JUL-2016 12650 506 13156

12-DEC-2015 13999 559.96 14558.96

14-JUN-2016 39888 1276.42 41164.42

18-MAY-2016 98800 5631.6 104431.6

4444

E. Updating and Deleting Records

• In the created tables

• UPDATE command – updating

• DELETE command – deletion

4545

Updating Records

• Update

• General Syntax:

UPDATE <table_name>

SET <attribute_name> = <new_value>

WHERE <expression> <operator> <expression>;

• records can be updated in only one table at a time

• update multiple fields that are within the same table

• WHERE clause - make the command updates specific records only

Updating Records

• Example:

• Updating colour of carID ‘5’ from ‘Red’ to ‘Grey’

UPDATE car

SET colour='Grey'

WHERE carID =5;

Deleting Records

• Delete

• General Syntax:

DELETE FROM <table_name>

WHERE <search_condition>;

• remove specific records from a database table

• use WHERE clause to specify multiple records to delete multiple records at

one time

• If the search condition is omitted, all records in the table are deleted.

Deleting Records

• Example:

• deleting a single record from the CAR table

DELETE FROM car

WHERE carid= 5;

• deleting multiple records from the CAR table that contain MAKE

starting with ‘M’

DELETE FROM car

WHERE make LIKE ‘M%’;

• deleting all records from the CAR table

DELETE FROM car;

Deleting Records

• Notice:

• not allowed to delete a primary key record that has its corresponded foreign

key somewhere else in another table

• Example: delete a primary key CarId is 5 in the CAR table that has a foreign

key record that CarId is 5 in the CARSALES table

DELETE FROM car WHERE carid=5;

DELETE FROM car

ERROR at line 1:

ORA-02292: integrity constraint (SYSTEM.SYS_C005453) violated - child

record found

5050

F. Commit

• Commit

• When inserted data by issuing the INSERT command

• the changes are only saved in the local database buffer

• are not saved in the database

• until you COMMIT the transaction

• it is important to remember to COMMIT whenever you have finished inserting

values or make changes to the database values

5151

Commit

• General Syntax:

• Sample of inserting a new record with commit

Effect of Inserting New Record with Commit

Insert a new record

‘16’ with commit

All original

records are

currently in CAR

table

5252

G. Joining Multiple Tables

• Join

• database query to join multiple database tables together

• the data needed or the conditions specified come from more than one table.

• Syntax:

SELECT <attribute1,attribute2,… >

FROM <table_name1,table_name2,…>

WHERE <table_name1.join_attribute> = <table_name2.join_attribut>

AND <search_condition>

Joining Multiple Tables

• joining the CAR, CARSALES and CUSTOMER tables:

SELECT customer.cname, customer.cphone, carsales.salesdate, carsales.purchasedPrice,

carsales.stampDuty, car.make, car.model

FROM customer, carsales, car

WHERE customer.customerID = carsales.customerID

AND car.carID = carsales.carID;

• using prefix when joining tables:

• when more than one table is involved, a prefix for each attribute is

recommended to avoid ambiguity

SELECT ct.cname, ct.cphone, cs.salesdate,

cs.purchasedPrice,cs.stampDuty,

c.make, c.model

FROM customer ct, carsales cs, car c

WHERE ct.customerID = cs.customerID

AND c.carID = cs.carID;;

• The output of both example:

Joining Multiple Tables

• Subquery

• Example: retrieving the make, model and advertisedPrice of the cars that

are not sold

SELECT make, model, advertisedPrice FROM car

WHERE carid NOT IN

(SELECT carid FROM carsales);

MAKE MODEL ADVERTISEDPRICE

----------------- ----------------- ---------------

Subaru Outback 33950

Lexus RX350 33500

Mini Cooper 19750

Nissan Dualis 18888

Mercedes Benz C200 16995

Jeep Grand Cherokee 21999

6 rows selected

5555

H. Aggregate Functions and Group By

• Aggregate Functions

• summarize the input table

• often used include:

• COUNT - count number of records in the input table

• SUM - calculate the sum of a numerical attribute

• MIN and MAX - find the smallest and the largest value of a certain

attribute

Aggregate Functions

• Count(*)

• Example: returning the number of records that is available from

the CAR table

SELECT COUNT(*)

FROM car;

COUNT(*)

------------------

Aggregate Functions

• Example: returning the number of values of colour that is available from the CAR table

SELECT COUNT(colour) FROM car;

COUNT(COLOUR)

-------------

• Example: returning the number of distinct values of colour that is available from the CAR table

SELECT COUNT(DISTINCT colour) FROM car;

COUNT(DISTINCTCOLOUR)

---------------------

COUNT(attribute) & COUNT(DISTINCT attribute)

Cause by a

record with

null value 15 records

include one

NULL value

Output:

Aggregate Functions

• MIN and MAX

• Example: calculate the minimum and maximum advertisedPrice

of the cars

SELECT MIN(advertisedPrice) AS "Lowest Price", MAX(advertisedPrice) AS "Highest Price”

FROM car;

Lowest Price Highest Price

------------ --------------

7490 98800

5959

Aggregate Functions and Group By

• Group By

• group an input table into a number of groups based on one or more

nominated attributes

• often used in conjunction with aggregate functions

• CAR table: group by Make

SELECT make, count(*)

FROM car

GROUP BY make;

MAKE COUNT(*)

-------------------- ----------

Holden 2

Lexus 1

Subaru 1

BMW 1

Mazda 1

Nissan 1

Audi 1

Volkwagen 1

Toyota 1

Honda 2

Mercedes Benz 1

Jeep 1

Mini 1

Aggregate Functions and Group By

• CAR table: grouping by Make with combination of getting only the record

groups that contain the count value greater than 1 is as follows

SELECT make, COUNT(DISTINCT model)

FROM car

GROUP BY make

HAVING COUNT(DISTINCT model)>1;

MAKE COUNT(DISTINCTMODEL)

-------------------- --------------------

Holden 2

Honda 2

Aggregate Functions and Group By

• multiple tables: selecting the number of sold cars for each colour

SELECT colour, COUNT(DINSTINCT customerId)

FROM car c, carsales cs

WHERE c.carid=cs.carid

GROUP BY colour;

COLOUR COUNT(DISTINCTCUSTOMERID)

----------- -------------------------

White 2

Grey 4

Blue 1

Black 2

Red 1

6262

Aggregate Functions

• Count vs. Sum

• COUNT - count number of records in the input table

• SUM - calculate the sum of a numerical attribute

SELECT <attribute1, attribute2>, COUNT(*)

FROM <table_name>

GROUP BY <attribute1, attribute2>;

SELECT <attribute1, attribute2>, SUM (attribute3)

FROM <table_name>

GROUP BY <attribute1, attribute2>;

Aggregate Functions

• Sum without Group By

• Example: calculate the sum of purchasedPrice

SELECT SUM(purchasedPrice) AS totalsales

FROM carsales;

TOTALSALES

----------

402983

Aggregate Functions

• Sum with Group By

• Example: calculate the sum of purchasedPrice for each make in

carsales table

SELECT c.make, SUM(cs.purchasedPrice) AS totalsales

FROM carsales cs, car c

WHERE c.CARID=cs.CARID

GROUP BY c.make;

MAKE TOTALSALES

-------------------- ----------

Holden 38430

BMW 197600

Mazda 13999

Audi 68200

Simple Query Retrieval

• SUM with Calculations

• Example: list all Cname, SalesDate and calculate the TotalPrice for

each customer (TotalPrice = PurchasedPrice+StampDuty )

SELECT ct.cname, cs.salesdate, SUM(purchasedPrice+StampDuty) as TotalPrice

FROM carSales cs, customer ct

WHERE cs.customerid=ct.customerid

GROUP BY ct.cname, cs.salesdate;

CNAME SALESDATE TOTALPRICE

-------------------- ----------- ----------

Jone 14-JUN-2016 14027.52

Rex 12-DEC-2015 14558.96

Lily 13-JUL-2016 13156

6666

I. Alter Tables – Add New Fields

• Syntax:

ALTER TABLE <table_name>

ADD (attribute_name data_type_declare constraints_declare);

• attribute name: referring to the new attribute that you want to add into the

existing table

• data type: defines the data type and the size of the new attribute

• constraint: defines the constraints that the new attribute might be enforced by

certain constraints

• Example: add a transmission attribute to the CAR table

ALTER TABLE car

ADD (transmission VARCHAR2(10));

Alter Tables – Modify Fields

• Syntax:

ALTER TABLE <table_name>

MODIFY (attribute_name new_data_type);

• attribute name: refers to the attribute that you want to modify

• new data: defines the new data type that you want to use replacing the old one

• Example: change the data type of the transmission to CHAR with a size of

ALTER TABLE car

MODIFY (transmission CHAR(30));

Alter Tables – Drop Columns

• Syntax:

ALTER TABLE <table_name>

DROP COLUMN attribute_name;

• Example: delete the attribute transmission from the CAR table

ALTER TABLE car

DROP COLUMN transmission ;

Oracle/Week-4a-StarSchema-StudentEnrollment.pdf

More on Student Enrollment Case Study (Tutorial 2)

1. Two dimensions vs. One dimension The star schema of the case study used in Tutorial 2 (Student-Enrollment system) is shown as follows.

Star Schema-1 (one semester-year dimension)

The SQL codes to create dimension and fact tables are as follows:

--Campus dimension

Create table campus_dim as

SELECT distinct Ocampus

FROM Offering;

--Semester_year dimension

Create table sem_year_dim as

SELECT distinct Oyear||Osem as sem_id, Oyear, Osem

FROM Offering;

--Subject Dimension

Create table subject_dim as

SELECT *

FROM subject;

--Grade Dimension

Create table grade_dim as

SELECT distinct Grade

FROM Enrollment;

--The Fact

CREATE TABLE student_enrollment_fact as

SELECT

o.Ocampus,

o.Oyear||o.Osem as sem_id,

s.Ucode,

e.Grade,

count(st.sid) as num_of_student,

sum(e.score) as Total_score

FROM subject s, enrollment e, offering o, student st

WHERE e.OID = o.OID

and s.Ucode = o.Ucode

and st.SID = e.SID

GROUP BY

o.Ocampus,

o.Oyear||o.Osem,

s.Ucode,

e.Grade;

The records in the data warehouse are as follows:

SQL> select * from campus_dim;

OCAMPUS

---------

City

Main

SQL> select * from sem_year_dim;

SEM_ID OYEAR OSEM

---------- ---------- ----

20092 2009 2

20091 2009 1

SQL> select * from subject_dim;

UCODE UTITLE UCREDIT

---------- -------------------- ----------

IT001 Database 5

IT002 Java 5

IT003 SAP 10

IT004 Network 5

IT005 ASP.NET 5

SQL> select * from grade_dim;

GRADE

-----

SQL> select * from student_enrollment_fact;

OCAMPUS SEM_ID UCODE GRADE NUM_OF_STUDENT TOTAL_SCORE

--------- --------- ---------- ----- -------------- -----------

Main 20091 IT001 HD 3 253

Main 20091 IT001 D 2 149

Main 20091 IT002 C 1 65

Main 20091 IT002 N 1 47

Main 20092 IT002 D 2 151

City 20092 IT001 C 1 64

Main 20091 IT001 N 1 41

Main 20092 IT002 C 1 64

DE 20092 IT004 P 2 105

City 20091 IT003 C 1 63

City 20092 IT001 N 1 32

11 rows selected.

Look at particularly the sem_year_dim dimension, both in the SQL code and the contents of this dimension. In the above version of the star schema, the semester and the year information are actually combined into one dimension, called sem_year_dim dimension. Can we have two separate dimensions: one for semester, and the other for year? The answer is “yes, you can”. The new star schema would look like as follows. Because the original sem_year_dim is now split into two dimensions: sem_dim, and year_dim, consequently, in the fact table, we will have two attributes: semester and year, instead of just one attribute sem_id.

Star Schema-2 (two separate dimensions for semester and year)

The SQL codes to create these two new dimensions (sem_dim and year_dim) and the fact tables are as follows:

--Semester dimension

Create table sem_dim as

SELECT distinct Osem as sem_id

FROM Offering;

--Year dimension

Create table year_dim as

SELECT distinct Oyear as year_id

FROM Offering;

--The Fact

CREATE TABLE student_enrollment_fact_2 as

SELECT

o.Ocampus,

o.Osem as sem_id,

o.Oyear as year_id,

s.Ucode,

e.Grade,

count(st.sid) as num_of_student,

sum(e.score) as Total_score

FROM subject s, enrollment e, offering o, student st

WHERE e.OID = o.OID

and s.Ucode = o.Ucode

and st.SID = e.SID

GROUP BY

o.Ocampus,

o.Osem,

o.Oyear,

s.Ucode,

e.Grade;

The contents of the two dimensions and the new fact are as follows:

SQL> select * from sem_dim;

SEM_ID

----------

SQL> select * from year_dim;

YEAR_ID

----------

2009

SQL> select * from student_enrollment_fact_2;

OCAMPUS SEM_ID YEAR_ID UCODE GR NUM_OF_STUDENT TOTAL_SCORE

--------- ------ -------- ---------- -- -------------- -----------

DE 2 2009 IT004 P 2 105

Main 1 2009 IT001 N 1 41

Main 1 2009 IT001 D 2 149

Main 2 2009 IT002 C 1 64

Main 1 2009 IT002 N 1 47

Main 2 2009 IT002 D 2 151

City 2 2009 IT001 N 1 32

Main 1 2009 IT002 C 1 65

City 1 2009 IT003 C 1 63

Main 1 2009 IT001 HD 3 253

City 2 2009 IT001 C 1 64

11 rows selected.

Now compare the contents of Fact-1 and Fact-2. They are almost the same, except that Fact-1 has one attribute from the sem_year_dim dimension, whereas

Fact-2 has two attributes, called sem_id (from semester_dim) and year_id (from year_dim). Conclusion: From the FACT point of view, it does not really matter whether we have two separate dimensions or just one dimension. They are the same. However, if two information, like semester and year, is often seen as one entity or one piece of information, it would be easier if the two information is present in one dimension.

2. Dimension Hierarchy (Hierarchy vs. No Hierarchy) Suppose we have richer information about campus, as follows: Campus Dimension CampusID CampusName Address Suburb City State Postcode Country CL Clayton

Campus Wellington Rd

Clayton Melbourne Victoria 3800 Australia

CA Caulfield Campus

Dandenong Rd

Caulfield East

Melbourne Victoria 3145 Australia

PA Parkville Campus

Royal Parade

Parkville Melbourne Victoria 3052 Australia

SY Sydney Campus

Opera Boulevard

Sydney Sydney New South Wales

2001 Australia

MA Malaysia Campus

Jalan Lagoon

Bandar Sunway

Sydney Selangor 47500 Malaysia

SA South Africa Campus

Peter St Johanne sburg

Johannesb urg

Johanne sburg

1725 South Africa

In the star schema, we have two options: Option 1 is to have one dimension called campusDIM containing all of these attributes.

Option 1 (no hierarchy)

The SQL to create the Campus Dimension is shown as follows: --Campus dimension

Create Table campus_dim As

SELECT Distinct Ocampus, <<and any other attributes, eg campus name, city, etc>>

FROM Offering;

Option 2 is to have a hierarchy of three dimensions (e.g. campusDIM connected to the Fact, campusDIM is also the parent of cityDIM – because campus-city forms a hierarchy, and cityDIM becomes the parent of countryDIM).

Option 2 (with hierarchy)

The Campus dimension is still linked to the Fact, whereas the City dimension is linked (through hierarchy) to the Campus, and the Country dimension is linked

to the City dimension. Therefore, there is a hierarchy from Campus to City, and to Country.

Note that the information about city is pulled out from the campus, and then create a hierarchy between campus and city. The information about country is also pulled out from the campus and city. The SQL to create the three dimensions in the hierarchy is as follows: --Campus dimension

Create Table campus_dim As

SELECT Distinct Ocampus, <<and any other attributes, eg campus name, etc>>

FROM <<table that stores information about campus, city, country>>;

--City dimension

Create Table city_dim As

SELECT Distinct Ocity, <<and any other attributes about the city>>

FROM <<table that stores information about campus, city, country>>;

--Country dimension

Create Table country_dim As

SELECT Distinct Ocountry, <<and any other attributes about the country>>

FROM <<table that stores information about campus, city, country>>;

The contents of the dimensions in the hierarchy (campus, city, and country dimensions are as follows): Campus Dimension CampusID CampusName Address Suburb Postcode CityID CL Clayton Campus Wellington Rd Clayton 3800 MEL CA Caulfield Campus Dandenong Rd Caulfield East 3145 MEL PA Parkville Campus Royal Parade Parkville 3052 MEL SY Sydney Campus Opera Boulevard Sydney 2001 SYD MA Malaysia Campus Jalan Lagoon Bandar Sunway 47500 KUL SA South Africa Campus Peter St Johannesburg 1725 JNB

City Dimension CityID CityName State CountryID MEL Melbourne Victoria AU SYD Sydney New South Wales AU KUL Kuala Lumpur Selangor MA JNB Johannesburg Johannesburg SA

Country Dimension CountryID CountryName AU Australia MA Malaysia SA South Africa

Note that the creation of the fact table is not affected at all. Either model (option 1 or option 2) will not affect the fact and other dimensions (e.g. grade, subject, semester_year dimensions).

Hierarchy should use the higher detail dimension (e.g. campus) as the parent, which is linked to the fact. The lower detail dimension (e.g. city) becomes the child in the hierarchy. The lowest detail dimension (e.g. country) becomes the lowest in the hierarchy. One of the most common mistakes is the wrong ordering of the dimensions in the hierarchy, such as in the following star schema:

An incorrect hierarchy (upside down)

Hierarchy vs No Hierarchy Since there are two options in implementing a star schema: with and without hierarchy, which option is more preferred?

1. One table vs. many tables Without hierarchy, there is only one table (e.g. the campus dimension table). With hierarchy, there are three tables (e.g. campus, city, and country dimension tables). The consequence is in the join query processing when producing reports. When you want to produce a report involving the fact and campus, with the no hierarchy option, we only need to use two tables: the campus dimension table, and the fact table. Hence, the join query processing is simple. With the hierarchy option, we need to join the fact with three tables (campus, city, and country). Consequently, the join query processing will be more complex

2. Normalized vs. Unnormalized With the hierarchy option, the tables are normalized, which follows the relational model. In contrast, the non-hierarchy option, the table (e.g. the campus dimension table) is unnormalized, which is basically 1NF with visible replication on information which is prone to anomalies (e.g. insert, update, and delete anomalies). Since no update, insert, or delete operations are performed to the dimension tables (once they have created), hence, anomalies become irrelevant. Therefore, the unnormalized option may be preferred, also due to a simpler join query processing.

3. Drilling down and rolling up People often mistakenly thought that the hierarchy model is good for drilling down. The hierarchy is not from country to city and to campus. The correct hierarchy is from the detail to the more general, such as from campus to city, and to country. Hence, we are not drilling down from campus to country, because drilling down should from something that is general (such as country) to something more detail (such as city and campus). In other words, the hierarchy model is not about drilling down information exploration. How about rolling up? Rolling up is exploring information from the detail to a more general. For example, retrieve number of students located at a campus level, and then rolling up to the country level. With the hierarchy option, we still need to do two queries: one for the campus and the second for the country.

Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select CountryName, Sum(Num_of_student) From Fact F, CampusDim C1, CityDim C2, CountryDim C3 Where F.CampusID = C1.CampusID And C1.CityID = C2.CityID And C2.CountryID = C3.CountryID Group By CountryName; Whereas with the non-hierarchy option, we still need to do two queries: Select CampusName, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By CampusName; Select Country, Sum(Num_of_student) From Fact F, CampusDim C Where F.CampusID = C.CampusID Group By Country; From the query point of view, both need two queries. From the query processing point of view, the non-hierarchy option uses one join operation only, because it only needs to join the fact and one dimension. From the conceptual point of view, the hierarchy model does not actually offer a better roll up or drill down features. If we insist to have a drill down, that is from country to campus, we still need to do the two queries above, but we will do the country query first, and then the campus query second.

Oracle/Week-4b-Determinant Dimensions.pdf

Determinant Dimensions The Petrol Price Case Study

1. Snapshot of an operational database The following is a sample operational database that keeps the petrol prices (of different kind of petrol) of petrol stations within a period of time

2. Star Schema The requirements for the data warehouse are to answer questions related to (a) average petrol price, (b) min petrol price, and (c) max petrol prices. From the sample data above, we choose the following three angles to view the fact measures: (i) day of week, (ii) suburb, and (iii) petrol company. From the Day of Week point of view, it is sensible to answer the question something like which day is has the lowest petrol price. We use a two-‐column table methodology to visualize a conceptual model of the star schema. Hence, the two-‐column table for the day of week is as follows: Day of week Total Petrol

Price Num of Petrol Station

Min Petrol Price

Max Petrol Price

Monday Tuesday … Notes:

1. In the above 2-‐column table, the first column (say column A, which is Day of Week) is a category. The second columns (say, columns B1, B2, B3, and B4) are the fact measures.

2. One of the requirements for the fact measure is average petrol price. Since, the fact table should not include an average as a fact measure, we include two attributes: total petrol price, and number of petrol stations, instead.

The visualization of the above two-‐column table for Day of Week is rather incomplete, because there is no data in the fact measure columns. Take a look at the Min Petrol Price column in the above table. What kind of value should be in that column? Supposed Monday has a value in the Min Petrol Price column, what does that value min? It does not specify which petrol type, whether it is Unleaded 91, or Premium 95, or Premium 98, for example. Therefore, the Min Petrol Price column, although seems right, does not make sense at all. How can this problem be solved? One solution is to have one Min Petrol Price for each petrol type (e.g. Unleaded 91, Premium 95, Premium 98, E10, Diesel, and LPG). Therefore, we have 6 attributes to represent Min Petrol Prices, namely: Min Unleaded 91 Price, Min Premium 95 Price, Min Premium 98 Price, Min E10 Price, Min Diesel Price, and Min LPG Price.

Min Petrol Price Day of week

Total Petrol Price

Num of Petrol Station

Unleaded P95 P98 E10 Diesel LPG Max Petrol Price

Monday 119.9c 132.9c 139.9c 114.9c 115.9c 55.7c Tuesday 118.9c 131.9c 136.9c 114.9c 115.9c 56.9c …

We need to do the same for the Total Price, and Max Petrol Price; each of which should be divided into 6 columns. The new two-‐column table for Day of Week is then shown as follows:

Total Petrol Price Min Petrol Price Max Petrol Price Day of Week U 95 98 E D LPG

Num of Petrol Station

U 95 98 E D LPG U 95 98 E D LPG

Monday Tuesday

The matter will become more complicated if number of petrol stations that has Unleaded 91 is different from that of Premium 98, for example. Then to calculate an average of Unleaded 91 on Monday, and an average of Premium 98 on Monday cannot be simply a division between Tot91 and Num of Petrol Stations, and between Tot98 and Num of Petrol Stations, because the based line, which is Num of Petrol Stations for different fuel type is different. Therefore, we need to divide the Num of Petrol Stations column into 6 columns – one for each petrol type. As a result, we will end up with 24 columns to store the fact measures, instead of just 4 columns, as initially thought. Using the same principle, we can have a two-‐column table for Suburb, and another one for Company.

Total Petrol Price Num of Petrol Station

Min Petrol Price Max Petrol Price Suburb

U 95 98 E10 D L U 95 98 E10 D L U 95 98 E10 D L U 95 98 E10 D L

Box Hill Clayton

Total Petrol Price Num of Petrol Station

Min Petrol Price Max Petrol Price Company

U 95 98 E D L U 95 98 E D L U 95 98 E D L U 95 98 E D L

Caltex 7-‐Eleven Based on the above three two-‐column tables, we are now confident to have the following star schema:

3. Star Schema version-2 Is there another way to reduce the number of fact measures? The above star schema has 24 fact measures, because there are 6 different kinds of petrol. The answer is yes!!! We can have a new dimension called Petrol Type DIM, which stores the different kinds of petrol. With this new dimension, the fact measure is reduced to 4, which is the original fact measures we had in the beginning. Therefore, the new star schema becomes as follows:

With this new star schema, as an example, we can retrieve data that shows the Min Petrol Price of Premium 98 on Monday. Or another example is to get the Average Petrol Price (which is Total Petrol Price/Num of Petrol Station) of Unleaded Petrol in 7-‐Eleven petrol station located at Clayton. Note that when retrieving data from a data warehouse, there is no obligation that we must use all dimensions. In the Monday Premium 98 example, only two dimensions are used, which are Day of Week DIM, and Petrol Type DIM. And in the second example, only three dimensions are used: Petrol Type DIM, Company DIM, and Suburb DIM. So, to retrieve data from the data warehouse, we can use

as few as one dimension table; but we can also use all dimension tables. We are not restricted to which dimension we have to use. However, this imposes a new problem. For example, the above star schema allows us to retrieve the Min Petrol Price of Monday, which uses only Day of Week DIM. This data retrieval does not make any sense; which petrol type are we referring to? Based on this case study, it is clear that the Petrol Type DIM holds the key to any data retrieval. We must use the Petrol Type DIM in any retrieval, such as the Min Petrol Price of Premium 98 on Monday, or the Average Petrol Price of Unleaded in 7-‐Eleven petrol station located in Clayton suburb. The Petrol Type DIM, in this example, is called a “Determinant Dimension” – a dimension which needs to be used in the data retrieval, because the fact measures are determined by this dimension. In order to differentiate between Determinant Dimension and all other dimensions, Determinant Dimension is denoted by a star. The fact measures are also denoted by a star. The new star schema with a Determinant Dimension is as follows:

Oracle/Week-4c-Determinant -vs-non-determinant.pdf

Determinant vs. Non-Determinant Dimensions

1. The Olympic Games Case Study The Olympic Games committee maintains an operational database that stores all matches, games, as well as the medal winners of the Olympic Games over the years (https://www.olympic.org/). We would like to build a data warehouse to analyze the medal counts, by each country, sport, and at which Olympic Games. There are two possible star schemas. Version-‐1 star schema contains 3 dimensions, whereas Version-‐2 star schema contains 4 dimensions. For simplicity, a very minimal number of attributes are included in each dimension. Version-‐1 star schema has CountryDIM, SportDIM, and OlympicGamesDIM as the dimensions. Three fact measures are included in the fact table, namely number of gold medals, silver medals, and bronze medals.

Version-1 Star Schema

Version-‐2 star schema has four dimensions – with an additional of Medal Type DIM (which is either Gold, Silver, or Bronze), but there is only one fact measure in the Fact table, which is Num of Medals.

Version-2 Star Schema with the Medal Type DIM

What is the difference between these two versions of star schema? Is Medal Type DIM a Determinant Dimension? In order to answer these questions, we need to visualize two-‐column tables for each dimension category. 2. Two-Column Tables for Version-1 Star Schema (without the Medal Type DIM) The first two-‐column table is from the Country point of view, which is as follows: Country Num of Gold Num of Silver Num of Bronze USA 733 602 488 China 199 143 133 Australia 167 170 189

These are the number of Gold, Silver and Bronze medals that these country got on all Olympic Games (several Olympic Games) recorded in the operational database. Note that this two-‐column table methodology is to help the data warehouse designer to visualize the view of the fact measures from each dimension. The second two-‐column table is from the Sport point of view. Assuming that in the operational database, it records 20 past Olympic Games, and at each Olympic, there is only one gold for 100m Butterfly Men, for instance. Sport Num of

Gold Num of Silver

Num of Bronze

Swimming 100m Butterfly Men 20 20 20 Swimming 400m Freestyle Women 20 20 20 Swimming 4x100m Medley Relay Men 20 20 20 The third two-‐column table is from the Olympic Name point of view. Olympic Name Num of

Gold Num of Silver

Num of Bronze

London 2012 302 304 356 Beijing 2008 302 303 353 Athens 2004 301 301 327 All the three two-‐column tables above seem to be reasonably correct. The first columns are the categories, while the other columns are the fact measures which are numeric and aggregate values. Because these three two-‐column tables make sense, we are confident that version-‐1 star schema is correct. 3. Two-Column Tables for Version-2 Star Schema (with the Medal Type DIM) The two-‐column tables for the first three dimensions, namely Country, Sport, and Olympic Names are as follows: Country Num of Medals USA 1823 China 475 Australia 526 Sport Num of Medals Swimming 100m Butterfly Men 60 Swimming 400m Freestyle Women 60 Swimming 4x100m Medley Relay Men 60

Olympic Name Num of Medals London 2012 962 Beijing 2008 958 Athens 2004 929 The question is whether these two-‐column tables make sense. If we look at the country, it makes sense to see how many medals Australia has received in all Olympic Games; the same with Sport, and Olympic Names. Finding how many medals (regardless the medal types) for each country, for each sport, and for each Olympic seems to be reasonable. The fourth two-‐column table for version-‐2 star schema is the Medal Type, which is as follows: Medal Type Num of Medals Gold 4115 Silver 4095 Bronze 4474 This two-‐column table on Medal Type seems to be reasonable too. Hence, version-‐2 star schema (with Medal Type DIM) is correct. So in conclusion, both star schemas (with or without Medal Type DIM) are correct. Now going back to the original question: Is Medal Type DIM a Determinant Dimension? 4. Determinant or Non-Determinant Dimensions A “Determinant Dimension” is a dimension that the fact measure relies on, and consequently, all data retrieval from the data warehouse must include this dimension. If the data retrieval from the data warehouse does not include this determinant dimension, the retrieval result will not make sense at all. In the previous case study on Petrol Price, Petrol Type DIM is a Determinant Dimension (refer to the star schema below). Note that a Determinant Dimension is denoted by a star. The fact measures affected by the determinant dimension are also starred.

The fact measures: Total Petrol Price, Num of Petrol Station, Min Petrol Price, and Max Petrol Price, depend on the Petrol Type, which is indicated by the Petrol Type DIM. That means analyzing the min petrol price from the day of week point of view, must include the petrol type. Otherwise, it doesn’t make sense to retrieve data to show that on Monday the lowest (min) petrol price is, for example, 109.90cents. As it does not indicate which petrol type it is, this lowest petrol price is meaningless. Therefore, a better data retrieval is to retrieve the record to show that for example, on Monday, the min “Unleaded” petrol price is 109.90cents (e.g. Unleaded is a petrol type obtained from the Petrol Type DIM). Now going back to the Olympic Games case study (refer to Version-‐2 star schema with Medal Type DIM). Is Medal Type DIM a Determinant Dimension? The answer to this question can be answered by another question. To retrieve the data from version-‐2 star schema, must we have the information from Medal Type DIM? The answer is clearly no, because we can simply retrieve a record from the fact to show that Australia in London 2012 Olympic Games received 10 medals in Swimming. This covers three dimensions, namely Country (Australia), Olympic Name (London 2012), and Sport (Swimming). In this example, Medal Type DIM is not involved, and the information retrieved still makes sense.

So the answer to the question whether Medal Type DIM is a Determinant Dimension or not, the answer is clearly No!! The next question is: what is the difference between the Olympic Games case study and the Petrol Price case study. Both are very similar, but the Olympic Games Medal Type DIM is not a determinant dimension, whereas the Petrol Type DIM is a determinant dimension. The answer is the aggregate function used in the fact measure. In the Olympic Games case study, the fact measure function is COUNT, which is count of medals. The breakdown of the medals is gold, silver, and bronze; but the main aggregate function of the fact measure is number of medals, which is a count. If the fact measure is a count, then the dimension (e.g. Medal Type DIM) is not a determinant dimension, because we can still analyze the fact measure which is the total medals from other dimensions, without the medal type dimension. On the other hand, the Petrol Price case study uses AVG, MIN, and MAX as the aggregation functions. Note that we do not store average as a fact measure, but total price and number of stations. These two fact measures will be used to calculate the average. Although the average is not explicitly stored in the fact, implicitly, the total price and number of stations represent the average. If the aggregate function to calculate the fact measures is not COUNT, then a determinant dimension is needed. In this case, Petrol Type DIM is hence a determinant dimension, because Min Petrol Price, for example, does not have any meaning without petrol type. 5. Version-1 (without Medal Type DIM) vs. version-2 (with Medal Type DIM) As both versions in the Olympic Games case study are correct, let’s compare and contract these two versions. In order to do this, let’s have a look at the records in the respective fact tables. The fact table for version-‐1 star schema (without Medal Type DIM) has 6 attributes: three from the dimensions, and the other three for the fact measures. The contents of the fact table are as follows: Fact (version-1 star schema) Country Sport Olympic Name Num of

Gold Num of Silver

Num of Bronze

USA Swimming London 2012 16 9 6 China Swimming London 2012 5 1 4 Australia Swimming London 2012 1 6 3 The fact table for version-‐2 star schema (with Medal Type DIM) consists of 5 columns: 4 from the dimension, but only one fact measure.

Fact (version-2 star schema) Country Sport Olympic Name Medal Type Num of Medals USA Swimming London 2012 Gold 16 USA Swimming London 2012 Silver 9 USA Swimming London 2012 Bronze 6 China Swimming London 2012 Gold 5 China Swimming London 2012 Silver 1 China Swimming London 2012 Bronze 4 Australia Swimming London 2012 Gold 1 Australia Swimming London 2012 Silver 6 Australia Swimming London 2012 Bronze 3 From the storage point of view, it is clear that version-‐1 is the winner. It has only 3 records, whereas in version-‐2, the same information is represented in 9 records. From the modeling point of view, some may prefer version-‐2, because the model is concise and more compact. When the number of fact measure is reasonably large (like in the Petrol Price case study), the star schema with a determinant dimension looks very slim and compact – hence, it is easy to understand. But consequently, the storage requirement increases as well. In contrast, with many different petrol types, if the star schema does not use a determinant dimension, the number of attributes in the fact will dramatically increase, and the schema looks more complex and crowded; but the storage cost is lower.

Oracle/Week-5-AVG-in-Fact.pdf

Page 1/5

Average in the Fact? Should we store an “average” fact measure in the fact table? No!!!!!!! Consider the following example. Suppose we have the following 16 records as our data source in the operational database. Note that there are 9 Semester One and 7 Semester Two records, respectively. Out of the 8 Database unit recodes, 6 of them are Semester One, and 2 of them are Semester Two. An Operational Database Ucode Utitle Semester Sfname Score

IT001 Database 1 Mirriam 81 IT001 Database 1 Allan 41 IT001 Database 1 Ben 74 IT001 Database 1 Kate 85 IT001 Database 1 Larry 87 IT001 Database 1 Leonard 75 IT001 Database 2 Juan 64 IT001 Database 2 Andy 32 IT002 Java 1 Ally 65 IT002 Java 1 Menson 47 IT002 Java 2 Mirriam 78 IT002 Java 2 Ben 73 IT002 Java 2 Larry 64 IT003 SAP 1 Ally 63 IT004 Network 2 Juan 53 IT004 Network 2 Menson 52 The star schema of the above operational database contains one fact and two dimensions. The dimensions are: Subject and Semester (One or Two); and the fact measure is Average Score.

The fact table aggregates these score records based on their dimensions, which are subject and semester. If we store Average Score in the fact table, this is how the fact table will look like:

Page 2/5

Fact Table UCode Semester Average_Score

IT001 1 73.833

IT001 2 48

IT002 1 56

IT002 2 71.667

IT003 1 63

IT004 2 52.5

The dimension tables look like as follows: SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

SemesterDIM Table Semester

1 2 Looking at the Fact Table, the average score for the unit Database in Semester One is 73.833 (average of the first 6 score records); the average score for the unit Database in Semester Two is 48 ((64 + 32)/2). Is this fact table correct? It looks correct. But actually it is incorrect. For example, if we want to query the fact table to find out what is the average score of the Database unit, by looking at the above fact table, the answer would be (73.833+48)/2= 60.9165. The SQL to query the Fact Table is as follows:

Select Avg(Average_Score) From FactTable Where UCode = ‘IT001’;

Is this correct? No. In the Operational Database, there are 8 records for Database unit in Semester One and Two (see the first eight records in the operational database). If we sum all the score of these eight records and divided by eight records, the result will be 539/8=67.375; not 60.9165. Let’s do further comparisons:

Page 3/5

The average score for Java unit in Semester One and Two using the above fact is (56+71.667)/2=63.833. The actual average score for Java unit in Semester One and Two is not 63.833, but 65.4 (see the next 5 records in the above score list, and sum these scores and then divide by 5, 327/5). So again, the above fact table, which stores the average score, will not produce correct results. Ok, now let’s calculate further. The average score for Semester One using the above fact is (73.833+56+63)/3=64.278. In the above score list records, there are nine Semester One records, and the average is in fact 68.667. For Semester Two, using the above fact the average scare for Semester Two is (48+71.667+52.5)/3=57.389; whereas the actual average score for the seven Semester Two records is 59.4286. So, storing average as a fact measurement is not a good idea. How do we solve the above problems? In the fact table, we should store the “total score” and “number of students” in each aggregate group. Hence, the fact table should look like this. Fact Table 2 UCode Semester Total_Score NumberofStudents

IT001 1 443 6

IT001 2 96 2

IT002 1 112 2

IT002 2 215 3

IT003 1 63 1

IT004 2 105 2

Note that the dimension tables remain unchanged: SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

SemesterDIM Table Semester

1 2 Using the correct fact table above, it is easy to calculate the average score of Database unit, which is (443+96)/(6+2)=67.375. The SQL to query the Correct Fact Table is as follows:

Page 4/5

Select Sum(Total_Score)/Sum(NumberofStudents) From FactTable2 Where UCode = ‘IT001’;

The correct star schema is then as follows:

Conclusion The problem of AVG in the Fact is known as the “Average of an Average” problem. This problem is well known in Mathematics. Average of an average will simply produce an incorrect average result (almost all the time). Hence, it is not desirable to have an average measure in the fact – unless the analysis ALWAYS uses all the dimensions. How about Min or Max in the Fact? Can we do it? – Yes we can. Because Max of Max is always a global max, and Min of Min is always a global min. For example, using the above sample data, assume we have Max_Score and Min_Score in the Fact, as follows: Fact Table 3 Ucode Semester Min_Score Max_Score

IT001 1 41 87

IT001 2 32 64

IT002 1 47 65

IT002 2 64 78

IT003 1 63 63

IT004 2 52 53

SubjectDIM Table UCode UTitle

IT001 Database IT002 Java IT003 SAP

IT004 Network

Page 5/5

SemesterDIM Table Semester

1 2 (*Note: the dimension tables are unchanged) Assuming we want to get the Max_Score of IT001, then the max of {87, 64} will produce 87, and 87 is the maximum score of IT001, because 87 is the max in semester 1, which is greater than any max of IT001 (e.g. in semester 2). In other words, “Max of Max” is correct. The SQL to retrieve the maximum score of IT001 is as follows:

Select Max(Max_Score) From FactTable3 Where UCode = ‘IT001’;

The same applies to “Min of Min”. If we want to get the minimum score of IT001, the result will be 32, which is the minimum between 41 and 32.

Select Min(Min_Score) From FactTable3 Where UCode = ‘IT001’;

We certainly don’t want to mix between min and max. For example, retrieving the minimum of Max_Score would be meaningless; the same as retrieving the maximum of Min_Score. As a final conclusion:

• Average in the fact is not desirable, although technically it satisfies the two criteria of the fact (e.g. must be a numerical and aggregate value)

• Min and Max in the fact can still be used, since min_score and max_score are valid fact measures (e.g. they are numerical and aggregated values)

• In general, count and sum are more common. Count is “number of”, and Sum is “total of”.