Report on Unstructured data management

Renaaz

DATA4200_Workshop_02_T1_2021.pdf

Home >Information Systems homework help >Report on Unstructured data management

DATA4200

Data Acquisition and

Management

Lesson 2

Relational Database Design: Data

Storage, pre-processing, querying

and Maintenance

This material has been reproduced and communicated to you by or on behalf of Kaplan Higher

Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in

this communication may be subject to copyright under the Act. Any further reproduction

or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice

Lesson Learning Outcomes

1 Examine data storage options

2 Investigate traditional relational database

design

3 Define data pre-processing

4 Assess data pre-processing options

5 Create SQL search commands in

preparation for the hands-on workshops

This Week’s Big Idea

“Data that is loved tends to survive.” – Kurt Bollacker, computer scientist.

This Photo by Unknown Author is licensed under CC BY

https://www.linkedin.com/in/kurt-bollacker-997b851/

https://www.flickr.com/photos/bixentro/2157792847

https://creativecommons.org/licenses/by/3.0/

Software Choices for SQL

• See week 1 webinar recording by Santos (if you haven’t already) on how to install a version of SQL

• Choices – Macs and low grade PCs, Azure Data Studio

– https://docs.microsoft.com/en-us/sql/azure-data- studio/download-azure-data-studio?view=sql-server-ver15

– Or for other PCs, Microsoft SQL management server

– https://docs.microsoft.com/en-us/sql/ssms/download-sql-server- management-studio-ssms?view=sql-server-ver15

– Don’t forget Power BI too

https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15

https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver15

Approaches to Data Storage

• In this workshop we will compare old and new

ways of storing, cleaning and querying data

• Note: There are many data storage architectures

old and new.

https://www.juku.it/year-enterprise-object-storage-quick-product-roundup-part-2/

Storage of Data (p1)

• Data Lake – Repository for large amounts of raw data from multiple sources and in

many formats, some of which may not be useful

• Data hub – Collection of raw data organised for sharing.

– The data is physically moved and re-indexed (harmonised) into a new

system.

• Data Virtualisation

– Data access and management without having to know all of the details.

– Original data is not moved, the user has virtual views of the data stored in

existing databases

https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/

Storage of Data (p2)

• Data warehouse (traditional approach) – A repository of data from various sources, partially re-organised,

structured and used to support decision makers in the organisation

– Takes data from data lake and transforms it (also a type of DB)

• Database (DB) – Relational DB: Organised collection of structured data

– Other DBs: Organised collection of semi-, quasi or unstructured data

– For example, Graph database, in which the data is structured a certain

way to determine relationships

https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/

Comparison Table

Task Data Hub (e.g.

that of MarkLogic)

Data Lake Data Virtualisation

Data ingestion Load raw data

Data migrated to

a DB and stays

there

Load raw data

Data migrated to a

DB then stored in a

HDFS*

See a virtual view of

the data

Data Model Multi-model Hadoop multiple file

model

Underlying system

model (see notes)

Search and

Query

Built in Varies depending on

tools

Queries optimised

and passed to the

underlying systems

Security Granular and

advanced

Can be poor Security controls

required for virtual

and underlying DB

* HDFS = Hadoop Distributed File System

https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/

Comparison Table

Task Data Hub (e.g.

that of

MarkLogic)

Data Lake Data Virtualisation

Performance High

performance

Transactions and

analytics

High performance

analytics

High performance

Analytics, however

depends on underlying

infrastructure

Scalability Pentabyte

scalability *

Pentabyte

scalability

Only performs as well

as the slowest

underlying system

(and is impacted by

any issues in the

originating DBs)

https://www.marklogic.com/product/comparisons/data-hub-vs-data-lake/

Activity 1: Quick Quiz

1. An unorganised repository of data is most likely to be

a. A database

b. A library

c. A data lake

d. Data virtualisation

2. No security is required for data virtualisation

a. True

b. False

3. Scalability is similar for a Data Hub and Data Lake?

a. True

b. False

Big Data Frameworks

• Hadoop is a distributed database (filing) system for big data

o Uses the MapReduce algorithm (see notes)

• Spark is an open-source processing engine o Also described as a distributed general-purpose cluster-computing

framework.

This Photo by Unknown Author is licensed under CC BY-ND

Factors Spark Hadoop MapReduce

Speed 100x faster than MapReduce Faster than previous

systems

Code name Scala Java

Data processing Batch, real-time, iterative,

interactive and graph

Batch processing

Ease of use Compact and easier than Hadoop complex

Caching Caches* the data in memory No caching

https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/

https://blog.toright.com/posts/4369/apache-spark-%E6%95%99%E5%AD%B8-%E7%94%A8-machine-learning-%E8%BE%A8%E8%AD%98%E9%B3%B6%E5%B0%BE%E8%8A%B1.html

https://creativecommons.org/licenses/by-nd/3.0/

Activity 2: Big Data Frameworks

• Apache Sparke is taking over from Hadoop or included in Hadoop

installations

• Watch the video below

• https://www.youtube.com/watch?v=aUCLLJFSJ-g

Q. Using your slides and the information in the video, list the various ways

in which the frameworks differ?

This Photo by Unknown Author is licensed under CC BY-ND

https://www.youtube.com/watch?v=aUCLLJFSJ-g

https://blog.toright.com/posts/4369/apache-spark-%E6%95%99%E5%AD%B8-%E7%94%A8-machine-learning-%E8%BE%A8%E8%AD%98%E9%B3%B6%E5%B0%BE%E8%8A%B1.html

https://creativecommons.org/licenses/by-nd/3.0/

Traditional Data Warehouse Model LO2

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

Newer Cloud-Based Warehouse

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

• For example, Panoply’s “Smart data infrastructure” with

Amazon Redshift and Amazon S3

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

Case Study 1: Storage via AWS

• Amazon Simple Storage Service (Amazon S3) is a cloud storage service.

• Allows you to store and retrieve any amount of data at any time, from anywhere on the web.

• Uses a simple and intuitive web interface of the Amazon Web Services (AWS) Management Console.

• Example AWS warehouse: the Amazon Redshift data Warehouse (see notes)

https://aws.amazon.com/#

Activity 3: 10-min Research

• Check out the “Products” or “Solutions” menu items on the

AWS website https://aws.amazon.com/#

• For example AWS Data Lakes and Analytics from the

“Solutions” menu

• Spend 10 minutes researching the information and report

back to class with information on AWS

https://aws.amazon.com/

Database Design Terms

• Entity = a person, place, thing or event about which data will be collected or stored

Example: a customer

• Attribute = a characteristic of an entity

Examples: customer ID, first name, date of

birth

Entity relationship models show how entities, (often in different tables), are related

Coronel, C & Morris, S 2019, Database Systems: Design, implementation, & management, 13th

ed, Cengage Learning, Inc., Boston, USA, pp. 36-37.

Traditional Relational

Database Structure

• Tables are the building blocks of relational database design

• Each row in a table represents a single entity (e.g. an

employee)

• Also called a record or tuple

• Each column represents an attribute, e.g. First name, last

name, payroll ID, year to date tax, etc

First_name Last _name Payroll_ID Year_to_date_tax

Julia Wang P123414 $5,693

Bob Jones P231118 $13,004

Harpreet Singh P284923 $8,934

Rosemary Chu P285026 $10,567

Entity Relationship Diagrams

• Entity Relationship Diagrams (ERDs) are used in relational

(SQL) database design

• Are a graphical representation of how entities are related

• Three types of relationships

• One-to-one: One customer – generates one invoice

Symbol

• One-to-many: One office – many employees

• Many-to-many: Many employees – many skills

• Zero or many

• Zero or one

Example 1: Customer Order ERD

CUSTOMER

Mobile_no

First Name

Last Name

DOB

ORDER

Order_no

Ord_date

Tax

Mobile_no

ORDER ITEMS

Item_no

Order_no

Quantity

Customer

makes

orders

Orders have

many items

CUSTOMER

MOBILE PHONE

Mobile_no

A customer

has one

mobile phone

Example 2: Student – Professor ERD

http://www.cs.uregina.ca/Links/class-info/215/erd/

Activity 4: Patient – Doctor ERD

1. What is an ERD and why do we use it?

2. As a group, create connections for this ERD.

3. Add three attributes to each entity, e.g. patient Medicare

number

Appointment Patient Doctor

Prescription

Order

Medication Bill

What is data pre-processing?

• Cleaning (incomplete, noisy, inconsistent, incorrect data)

• Integration (multiple sources and types)

• Transformation (e.g. rescaling, smoothing, aggregation,

generalisation)

• Reduction (irrelevant or redundant dimensions)

http://iasri.res.in/ebook/win_school_aa/notes/Data_Preprocessing.pdf