Report on Unstructured data management

profileRenaaz
DATA4200_Workshop_02_T1_2021.pdf

DATA4200

Data Acquisition and

Management

Lesson 2

Relational Database Design: Data

Storage, pre-processing, querying

and Maintenance

Copyright Notice

COPYRIGHT COMMONWEALTH OF AUSTRALIA

Copyright Regulations 1969 WARNING

This material has been reproduced and communicated to you by or on behalf of Kaplan Higher

Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in

this communication may be subject to copyright under the Act. Any further reproduction

or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice

2

Lesson Learning Outcomes

1 Examine data storage options

2 Investigate traditional relational database

design

3 Define data pre-processing

4 Assess data pre-processing options

5 Create SQL search commands in

preparation for the hands-on workshops

This Week’s Big Idea

“Data that is loved tends to survive.” – Kurt Bollacker, computer scientist.

This Photo by Unknown Author is licensed under CC BY

Software Choices for SQL

• See week 1 webinar recording by Santos (if you haven’t already) on how to install a version of SQL

• Choices – Macs and low grade PCs, Azure Data Studio

– https://docs.microsoft.com/en-us/sql/azure-data- studio/download-azure-data-studio?view=sql-server-ver15

– Or for other PCs, Microsoft SQL management server

– https://docs.microsoft.com/en-us/sql/ssms/download-sql-server- management-studio-ssms?view=sql-server-ver15

– Don’t forget Power BI too

Approaches to Data Storage

• In this workshop we will compare old and new

ways of storing, cleaning and querying data

• Note: There are many data storage architectures

old and new.

https://www.juku.it/year-enterprise-object-storage-quick-product-roundup-part-2/

Storage of Data (p1)

• Data Lake – Repository for large amounts of raw data from multiple sources and in

many formats, some of which may not be useful

• Data hub – Collection of raw data organised for sharing.

– The data is physically moved and re-indexed (harmonised) into a new

system.

• Data Virtualisation

– Data access and management without having to know all of the details.

– Original data is not moved, the user has virtual views of the data stored in

existing databases

https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/

Storage of Data (p2)

• Data warehouse (traditional approach) – A repository of data from various sources, partially re-organised,

structured and used to support decision makers in the organisation

– Takes data from data lake and transforms it (also a type of DB)

• Database (DB) – Relational DB: Organised collection of structured data

– Other DBs: Organised collection of semi-, quasi or unstructured data

– For example, Graph database, in which the data is structured a certain

way to determine relationships

https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/

Comparison Table

Task Data Hub (e.g.

that of MarkLogic)

Data Lake Data Virtualisation

Data ingestion Load raw data

Data migrated to

a DB and stays

there

Load raw data

Data migrated to a

DB then stored in a

HDFS*

See a virtual view of

the data

Data Model Multi-model Hadoop multiple file

model

Underlying system

model (see notes)

Search and

Query

Built in Varies depending on

tools

Queries optimised

and passed to the

underlying systems

Security Granular and

advanced

Can be poor Security controls

required for virtual

and underlying DB

* HDFS = Hadoop Distributed File System

https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/

Comparison Table

Task Data Hub (e.g.

that of

MarkLogic)

Data Lake Data Virtualisation

Performance High

performance

Transactions and

analytics

High performance

analytics

High performance

Analytics, however

depends on underlying

infrastructure

Scalability Pentabyte

scalability *

Pentabyte

scalability

Only performs as well

as the slowest

underlying system

(and is impacted by

any issues in the

originating DBs)

https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/

Activity 1: Quick Quiz

1. An unorganised repository of data is most likely to be

a. A database

b. A library

c. A data lake

d. Data virtualisation

2. No security is required for data virtualisation

a. True

b. False

3. Scalability is similar for a Data Hub and Data Lake?

a. True

b. False

Big Data Frameworks

• Hadoop is a distributed database (filing) system for big data

o Uses the MapReduce algorithm (see notes)

• Spark is an open-source processing engine o Also described as a distributed general-purpose cluster-computing

framework.

This Photo by Unknown Author is licensed under CC BY-ND

Factors Spark Hadoop MapReduce

Speed 100x faster than MapReduce Faster than previous

systems

Code name Scala Java

Data processing Batch, real-time, iterative,

interactive and graph

Batch processing

Ease of use Compact and easier than Hadoop complex

Caching Caches* the data in memory No caching

https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/

Activity 2: Big Data Frameworks

• Apache Sparke is taking over from Hadoop or included in Hadoop

installations

• Watch the video below

• https://www.youtube.com/watch?v=aUCLLJFSJ-g

Q. Using your slides and the information in the video, list the various ways

in which the frameworks differ?

This Photo by Unknown Author is licensed under CC BY-ND

Traditional Data Warehouse Model LO2

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

Newer Cloud-Based Warehouse

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

• For example, Panoply’s “Smart data infrastructure” with

Amazon Redshift and Amazon S3

Case Study 1: Storage via AWS

• Amazon Simple Storage Service (Amazon S3) is a cloud storage service.

• Allows you to store and retrieve any amount of data at any time, from anywhere on the web.

• Uses a simple and intuitive web interface of the Amazon Web Services (AWS) Management Console.

• Example AWS warehouse: the Amazon Redshift data Warehouse (see notes)

https://aws.amazon.com/#

Activity 3: 10-min Research

• Check out the “Products” or “Solutions” menu items on the

AWS website https://aws.amazon.com/#

• For example AWS Data Lakes and Analytics from the

“Solutions” menu

• Spend 10 minutes researching the information and report

back to class with information on AWS

Database Design Terms

• Entity = a person, place, thing or event about which data will be collected or stored

Example: a customer

• Attribute = a characteristic of an entity

Examples: customer ID, first name, date of

birth

Entity relationship models show how entities, (often in different tables), are related

Coronel, C & Morris, S 2019, Database Systems: Design, implementation, & management, 13th

ed, Cengage Learning, Inc., Boston, USA, pp. 36-37.

Traditional Relational

Database Structure

• Tables are the building blocks of relational database design

• Each row in a table represents a single entity (e.g. an

employee)

• Also called a record or tuple

• Each column represents an attribute, e.g. First name, last

name, payroll ID, year to date tax, etc

First_name Last _name Payroll_ID Year_to_date_tax

Julia Wang P123414 $5,693

Bob Jones P231118 $13,004

Harpreet Singh P284923 $8,934

Rosemary Chu P285026 $10,567

Entity Relationship Diagrams

• Entity Relationship Diagrams (ERDs) are used in relational

(SQL) database design

• Are a graphical representation of how entities are related

• Three types of relationships

• One-to-one: One customer – generates one invoice

Symbol

• One-to-many: One office – many employees

• Many-to-many: Many employees – many skills

• Zero or many

• Zero or one

Example 1: Customer Order ERD

CUSTOMER

Mobile_no

First Name

Last Name

DOB

ORDER

Order_no

Ord_date

Tax

Mobile_no

ORDER ITEMS

Item_no

Order_no

Quantity

Customer

makes

orders

Orders have

many items

CUSTOMER

MOBILE PHONE

Mobile_no

A customer

has one

mobile phone

Example 2: Student – Professor ERD

http://www.cs.uregina.ca/Links/class-info/215/erd/

Activity 4: Patient – Doctor ERD

1. What is an ERD and why do we use it?

2. As a group, create connections for this ERD.

3. Add three attributes to each entity, e.g. patient Medicare

number

Appointment Patient Doctor

Prescription

Order

Medication Bill

What is data pre-processing?

• Cleaning (incomplete, noisy, inconsistent, incorrect data)

• Integration (multiple sources and types)

• Transformation (e.g. rescaling, smoothing, aggregation,

generalisation)

• Reduction (irrelevant or redundant dimensions)

http://iasri.res.in/ebook/win_school_aa/notes/Data_Preprocessing.pdf

More on Data Cleaning (Cleansing)

Simple processes include:

• Removing spaces and blank cells

• Checking data types and formats

• Removing duplicates

• Checking inconsistencies

• Investigating outliers and missing values

• Deleting irrelevant data

• Verifying data

https://trumpexcel.com/clean-data-in-excel/

This Photo by Unknown Author is licensed under CC BY-SA

Where does data cleaning sit in the

scheme of things?

This Photo by Unknown Author is licensed under CC BY-SA-NC

Augmented Data Preparation

• Machine learning algorithms can be used to extract,

clean, format or manipulate data.

• Allows data scientists to focus on more specific problems

of interest, example Software: Smarten, Dataform

• Why consider augmented analytics?

• Helps those with less technical skills to gain insight

This Photo by Unknown Author is licensed under CC BY-SA-NC

Extra video if time:

Data Preparation (Pre-Processing)

• See video https://www.youtube.com/watch?v=ar2J0pX0T3M

• Data blending (combining data set)

• Handling errors (missing values)

• Dimension introduction (which vars are useful)

• Outlier detection

Filtering Techniques

• Filters allow you to view specific rows/values in

a table, while hiding others

• Many ways to filter data

• Many software tools have menu-driven filters

– Excel

– Spotfire

– Tableau

• Other software relies on code

Example 1: Excel Filter

Filtering other start stations out

except for Bayside Park

Example 2: Filtering in Tableau

Click on the dimension of your choice, e.g. Start Station Name

Choose the “show filter” command to get a list for filtering (on RHS)

Example 3: Filtering in Spotfire

Activity 5: Explore Filtering in Power

BI

• As a group or individually,

• Load the Citi trip data file JC-201610-citibike-

tripdata.xlsx in Power BI

• Use the filter to focus on:

– Start station “Brunswick street”

– Start time 1/10/2016

– Gender 0

– Formulate a visualisation which would show this

information most effectively

Citi-bike Data in Power BI

Activity 5: Filtering in Power BI

• Each column has a filter on it, use the filters to focus on:

– Untick ‘Select all’ then chose Start station “Brunswick street”

– Similarly, choose end station “Grove St PATH”

– Gender 1

– More specific filtering is accesses next to the Viz menu

– Formulate a visualisation which would show this information most

effectively

Activity 5: Filtering in Power BI

More specific filtering is accesses

next to the Viz menu

– Formulate a visualisation which would

show this information most effectively

Example: Power BI

• On a doughnut chart of trip duration by birth year, I have

filtered later birth years by applying the filter ‘greater than

or equal 1980’

Search Tools for Relational Databases

• Consider our Citi Bike files with many trips from many stations.

• Business Task: As well as pre-processing the data, you wish to search the Citi Bike database and simply use code to extract the trips of your choice.

• Recall: A ‘Query’ language is a computer language for retrieving records in a database. For example, Structured Query Language (SQL).

You can use SQL code to extract the information that you want.

Examples of SQL management systems:

– Microsoft SQL Server

– MySQL

– SQL Developer Suite

– Azure Data Studio

– Aquafold

– DBVisualizer Free

Microsoft SQL Server

• Please use the SQL server which suits you

• The video below shows you have to give a database a

name and import data from a .csv or excel file into the MS

SQL Server- management studio.

• Your teacher will help you import the first 1000 rows of

the Citi Bike data for Oct 2016 into yor SQL database.

• Steps may vary depending on the version of the SQL

software.

https://www.youtube.com/watch?v=E24oT1hnjIE

SQL Server View of Citi Bike Data

Common SQL Commands

• Let’s learn about some of the common SQL

commands with examples now

• We will do a workshop on SQL in week 4

This Photo by Unknown Author is licensed under CC BY-NC

Common SQL Commands

• SELECT specifies the particular attribute you want to find

• FROM specifies the table(s) from which the data will be

selected

e.g. SELECT Trip_Duration, Start_Station_Name, Birth_year, Gender, End_Station_Name

FROM [CitibikeDB].[dbo].[JC-201608-citibike-tripdata]

WHERE Command

• WHERE filters the rows of interest based on given criteria

• SELECT Trip_Duration,Start_Station_Name,End_Station_Name,

Birth_Year,Bike_ID, Gender FROM [CitibikeDB].[dbo].[JC-201608-

citibike-tripdata] WHERE Start_Station_Name = 'Morris Canal'

WHERE With AND Command

• AND allows you to add another condition (limit)

e.g. filtering trips taken by customers between 1970 and 1885

SELECT Trip_Duration,Start_Station_Name,End_Station_Name,

Birth_Year,Bike_ID, Gender FROM [CitibikeDB].[dbo].[JC-201608-citibike-

tripdata] WHERE Birth_Year > 1970 AND Birth_Year <= 1985

ORDER BY Command

• ORDER BY allows you to sequentially sort columns data

SELECT Trip_Duration,Start_Station_Name,End_Station_Name,

Birth_Year,Bike_ID, Gender

FROM [CitibikeDB].[dbo].[JC-201608-citibike-tripdata] WHERE

Start_Station_Name = 'Morris Canal’

ORDER BY Birth_Year, Gender

Activity 4: Guess the Code

1. The council wants to order the bike ID list for Hamilton

park. They ask you as their analyst to investigate.

How could you use WHERE and ORDER BY to achieve the

follow output? Detail your process. (See previous slide)

More SQL Commands for Week 4

• Feel free to explore the S

• HAVING filters the groups formed in the GROUPED BY

statement

• GROUPED BY groups rows based on given attributes

• JOIN … on Var1 = Var2 allows you to joins two tables

where two different variables have the same value.

Explore more with teacher

• Feel free to spend more time doing initial

exploration of SQL commands with your

teacher

This Photo by Unknown Author is licensed under CC BY-SA

Next Week

• Methods of sampling data

This Photo by Unknown Author is licensed under CC BY-SA-NC