Report on Unstructured data management
DATA4200
Data Acquisition and
Management
Lesson 2
Relational Database Design: Data
Storage, pre-processing, querying
and Maintenance
Copyright Notice
COPYRIGHT COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969 WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan Higher
Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in
this communication may be subject to copyright under the Act. Any further reproduction
or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice
2
Lesson Learning Outcomes
1 Examine data storage options
2 Investigate traditional relational database
design
3 Define data pre-processing
4 Assess data pre-processing options
5 Create SQL search commands in
preparation for the hands-on workshops
This Week’s Big Idea
“Data that is loved tends to survive.” – Kurt Bollacker, computer scientist.
This Photo by Unknown Author is licensed under CC BY
Software Choices for SQL
• See week 1 webinar recording by Santos (if you haven’t already) on how to install a version of SQL
• Choices – Macs and low grade PCs, Azure Data Studio
– https://docs.microsoft.com/en-us/sql/azure-data- studio/download-azure-data-studio?view=sql-server-ver15
– Or for other PCs, Microsoft SQL management server
– https://docs.microsoft.com/en-us/sql/ssms/download-sql-server- management-studio-ssms?view=sql-server-ver15
– Don’t forget Power BI too
Approaches to Data Storage
• In this workshop we will compare old and new
ways of storing, cleaning and querying data
• Note: There are many data storage architectures
old and new.
https://www.juku.it/year-enterprise-object-storage-quick-product-roundup-part-2/
Storage of Data (p1)
• Data Lake – Repository for large amounts of raw data from multiple sources and in
many formats, some of which may not be useful
• Data hub – Collection of raw data organised for sharing.
– The data is physically moved and re-indexed (harmonised) into a new
system.
• Data Virtualisation
– Data access and management without having to know all of the details.
– Original data is not moved, the user has virtual views of the data stored in
existing databases
https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/
Storage of Data (p2)
• Data warehouse (traditional approach) – A repository of data from various sources, partially re-organised,
structured and used to support decision makers in the organisation
– Takes data from data lake and transforms it (also a type of DB)
• Database (DB) – Relational DB: Organised collection of structured data
– Other DBs: Organised collection of semi-, quasi or unstructured data
– For example, Graph database, in which the data is structured a certain
way to determine relationships
https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/
Comparison Table
Task Data Hub (e.g.
that of MarkLogic)
Data Lake Data Virtualisation
Data ingestion Load raw data
Data migrated to
a DB and stays
there
Load raw data
Data migrated to a
DB then stored in a
HDFS*
See a virtual view of
the data
Data Model Multi-model Hadoop multiple file
model
Underlying system
model (see notes)
Search and
Query
Built in Varies depending on
tools
Queries optimised
and passed to the
underlying systems
Security Granular and
advanced
Can be poor Security controls
required for virtual
and underlying DB
* HDFS = Hadoop Distributed File System
https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/
Comparison Table
Task Data Hub (e.g.
that of
MarkLogic)
Data Lake Data Virtualisation
Performance High
performance
Transactions and
analytics
High performance
analytics
High performance
Analytics, however
depends on underlying
infrastructure
Scalability Pentabyte
scalability *
Pentabyte
scalability
Only performs as well
as the slowest
underlying system
(and is impacted by
any issues in the
originating DBs)
https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/
Activity 1: Quick Quiz
1. An unorganised repository of data is most likely to be
a. A database
b. A library
c. A data lake
d. Data virtualisation
2. No security is required for data virtualisation
a. True
b. False
3. Scalability is similar for a Data Hub and Data Lake?
a. True
b. False
Big Data Frameworks
• Hadoop is a distributed database (filing) system for big data
o Uses the MapReduce algorithm (see notes)
• Spark is an open-source processing engine o Also described as a distributed general-purpose cluster-computing
framework.
This Photo by Unknown Author is licensed under CC BY-ND
Factors Spark Hadoop MapReduce
Speed 100x faster than MapReduce Faster than previous
systems
Code name Scala Java
Data processing Batch, real-time, iterative,
interactive and graph
Batch processing
Ease of use Compact and easier than Hadoop complex
Caching Caches* the data in memory No caching
https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/
Activity 2: Big Data Frameworks
• Apache Sparke is taking over from Hadoop or included in Hadoop
installations
• Watch the video below
• https://www.youtube.com/watch?v=aUCLLJFSJ-g
Q. Using your slides and the information in the video, list the various ways
in which the frameworks differ?
This Photo by Unknown Author is licensed under CC BY-ND
Traditional Data Warehouse Model LO2
https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/
Newer Cloud-Based Warehouse
https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/
• For example, Panoply’s “Smart data infrastructure” with
Amazon Redshift and Amazon S3
Case Study 1: Storage via AWS
• Amazon Simple Storage Service (Amazon S3) is a cloud storage service.
• Allows you to store and retrieve any amount of data at any time, from anywhere on the web.
• Uses a simple and intuitive web interface of the Amazon Web Services (AWS) Management Console.
• Example AWS warehouse: the Amazon Redshift data Warehouse (see notes)
https://aws.amazon.com/#
Activity 3: 10-min Research
• Check out the “Products” or “Solutions” menu items on the
AWS website https://aws.amazon.com/#
• For example AWS Data Lakes and Analytics from the
“Solutions” menu
• Spend 10 minutes researching the information and report
back to class with information on AWS
Database Design Terms
• Entity = a person, place, thing or event about which data will be collected or stored
Example: a customer
• Attribute = a characteristic of an entity
Examples: customer ID, first name, date of
birth
Entity relationship models show how entities, (often in different tables), are related
Coronel, C & Morris, S 2019, Database Systems: Design, implementation, & management, 13th
ed, Cengage Learning, Inc., Boston, USA, pp. 36-37.
Traditional Relational
Database Structure
• Tables are the building blocks of relational database design
• Each row in a table represents a single entity (e.g. an
employee)
• Also called a record or tuple
• Each column represents an attribute, e.g. First name, last
name, payroll ID, year to date tax, etc
First_name Last _name Payroll_ID Year_to_date_tax
Julia Wang P123414 $5,693
Bob Jones P231118 $13,004
Harpreet Singh P284923 $8,934
Rosemary Chu P285026 $10,567
Entity Relationship Diagrams
• Entity Relationship Diagrams (ERDs) are used in relational
(SQL) database design
• Are a graphical representation of how entities are related
• Three types of relationships
• One-to-one: One customer – generates one invoice
Symbol
• One-to-many: One office – many employees
• Many-to-many: Many employees – many skills
• Zero or many
• Zero or one
Example 1: Customer Order ERD
CUSTOMER
Mobile_no
First Name
Last Name
DOB
ORDER
Order_no
Ord_date
Tax
Mobile_no
ORDER ITEMS
Item_no
Order_no
Quantity
Customer
makes
orders
Orders have
many items
CUSTOMER
MOBILE PHONE
Mobile_no
A customer
has one
mobile phone
Example 2: Student – Professor ERD
http://www.cs.uregina.ca/Links/class-info/215/erd/
Activity 4: Patient – Doctor ERD
1. What is an ERD and why do we use it?
2. As a group, create connections for this ERD.
3. Add three attributes to each entity, e.g. patient Medicare
number
Appointment Patient Doctor
Prescription
Order
Medication Bill
What is data pre-processing?
• Cleaning (incomplete, noisy, inconsistent, incorrect data)
• Integration (multiple sources and types)
• Transformation (e.g. rescaling, smoothing, aggregation,
generalisation)
• Reduction (irrelevant or redundant dimensions)
http://iasri.res.in/ebook/win_school_aa/notes/Data_Preprocessing.pdf
More on Data Cleaning (Cleansing)
Simple processes include:
• Removing spaces and blank cells
• Checking data types and formats
• Removing duplicates
• Checking inconsistencies
• Investigating outliers and missing values
• Deleting irrelevant data
• Verifying data
https://trumpexcel.com/clean-data-in-excel/
This Photo by Unknown Author is licensed under CC BY-SA
Where does data cleaning sit in the
scheme of things?
This Photo by Unknown Author is licensed under CC BY-SA-NC
Augmented Data Preparation
• Machine learning algorithms can be used to extract,
clean, format or manipulate data.
• Allows data scientists to focus on more specific problems
of interest, example Software: Smarten, Dataform
• Why consider augmented analytics?
• Helps those with less technical skills to gain insight
This Photo by Unknown Author is licensed under CC BY-SA-NC
Extra video if time:
Data Preparation (Pre-Processing)
• See video https://www.youtube.com/watch?v=ar2J0pX0T3M
• Data blending (combining data set)
• Handling errors (missing values)
• Dimension introduction (which vars are useful)
• Outlier detection
Filtering Techniques
• Filters allow you to view specific rows/values in
a table, while hiding others
• Many ways to filter data
• Many software tools have menu-driven filters
– Excel
– Spotfire
– Tableau
• Other software relies on code
Example 1: Excel Filter
Filtering other start stations out
except for Bayside Park
Example 2: Filtering in Tableau
Click on the dimension of your choice, e.g. Start Station Name
Choose the “show filter” command to get a list for filtering (on RHS)
Example 3: Filtering in Spotfire
Activity 5: Explore Filtering in Power
BI
• As a group or individually,
• Load the Citi trip data file JC-201610-citibike-
tripdata.xlsx in Power BI
• Use the filter to focus on:
– Start station “Brunswick street”
– Start time 1/10/2016
– Gender 0
– Formulate a visualisation which would show this
information most effectively
Citi-bike Data in Power BI
Activity 5: Filtering in Power BI
• Each column has a filter on it, use the filters to focus on:
– Untick ‘Select all’ then chose Start station “Brunswick street”
– Similarly, choose end station “Grove St PATH”
– Gender 1
– More specific filtering is accesses next to the Viz menu
– Formulate a visualisation which would show this information most
effectively
Activity 5: Filtering in Power BI
More specific filtering is accesses
next to the Viz menu
– Formulate a visualisation which would
show this information most effectively
Example: Power BI
• On a doughnut chart of trip duration by birth year, I have
filtered later birth years by applying the filter ‘greater than
or equal 1980’
Search Tools for Relational Databases
• Consider our Citi Bike files with many trips from many stations.
• Business Task: As well as pre-processing the data, you wish to search the Citi Bike database and simply use code to extract the trips of your choice.
• Recall: A ‘Query’ language is a computer language for retrieving records in a database. For example, Structured Query Language (SQL).
You can use SQL code to extract the information that you want.
Examples of SQL management systems:
– Microsoft SQL Server
– MySQL
– SQL Developer Suite
– Azure Data Studio
– Aquafold
– DBVisualizer Free
Microsoft SQL Server
• Please use the SQL server which suits you
• The video below shows you have to give a database a
name and import data from a .csv or excel file into the MS
SQL Server- management studio.
• Your teacher will help you import the first 1000 rows of
the Citi Bike data for Oct 2016 into yor SQL database.
• Steps may vary depending on the version of the SQL
software.
https://www.youtube.com/watch?v=E24oT1hnjIE
SQL Server View of Citi Bike Data
Common SQL Commands
• Let’s learn about some of the common SQL
commands with examples now
• We will do a workshop on SQL in week 4
This Photo by Unknown Author is licensed under CC BY-NC
Common SQL Commands
• SELECT specifies the particular attribute you want to find
• FROM specifies the table(s) from which the data will be
selected
e.g. SELECT Trip_Duration, Start_Station_Name, Birth_year, Gender, End_Station_Name
FROM [CitibikeDB].[dbo].[JC-201608-citibike-tripdata]
WHERE Command
• WHERE filters the rows of interest based on given criteria
• SELECT Trip_Duration,Start_Station_Name,End_Station_Name,
Birth_Year,Bike_ID, Gender FROM [CitibikeDB].[dbo].[JC-201608-
citibike-tripdata] WHERE Start_Station_Name = 'Morris Canal'
WHERE With AND Command
• AND allows you to add another condition (limit)
e.g. filtering trips taken by customers between 1970 and 1885
SELECT Trip_Duration,Start_Station_Name,End_Station_Name,
Birth_Year,Bike_ID, Gender FROM [CitibikeDB].[dbo].[JC-201608-citibike-
tripdata] WHERE Birth_Year > 1970 AND Birth_Year <= 1985
ORDER BY Command
• ORDER BY allows you to sequentially sort columns data
SELECT Trip_Duration,Start_Station_Name,End_Station_Name,
Birth_Year,Bike_ID, Gender
FROM [CitibikeDB].[dbo].[JC-201608-citibike-tripdata] WHERE
Start_Station_Name = 'Morris Canal’
ORDER BY Birth_Year, Gender
Activity 4: Guess the Code
1. The council wants to order the bike ID list for Hamilton
park. They ask you as their analyst to investigate.
How could you use WHERE and ORDER BY to achieve the
follow output? Detail your process. (See previous slide)
More SQL Commands for Week 4
• Feel free to explore the S
• HAVING filters the groups formed in the GROUPED BY
statement
• GROUPED BY groups rows based on given attributes
• JOIN … on Var1 = Var2 allows you to joins two tables
where two different variables have the same value.
Explore more with teacher
• Feel free to spend more time doing initial
exploration of SQL commands with your
teacher
This Photo by Unknown Author is licensed under CC BY-SA
Next Week
• Methods of sampling data
This Photo by Unknown Author is licensed under CC BY-SA-NC