
BUS105

Business Information Systems

Lesson week 8

Semi-structured and unstructured data management

Lesson Learning Outcomes

1 Define semi-structured and unstructured data

2 Distinguish between the various NoSQL and NewSQL databases

3 Learn about various software packages for the management of semi-structured and unstructured data

4 Evaluate case studies

5 Final discussion of the individual report with your teacher

Dark analytics: Analyzing unstructured data

Did you know that 95% of data in the world is unstructured?

Watch the video on Dark Analytics

https://www.youtube.com/watch?v=X4f-GCGraXI

What sorts of data are really difficult to analyse?

Glossary 1 LO1

Recall that

• Semi-structured data has some structure

- e.g. CSV files with comma-separated data, and XML and JavaScript Object Notation (JSON) documents used to exchange data to/from a web server. Note: some analysts consider .csv files to be structured data.

• Unstructured data has no predefined data model, is not organised, and may contain multiple types of data

- e.g. data from thermostats, sensors, home electronic devices and cars, images, sounds and PDF files.

EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.

https://www.google.com/search?q=parsing+definition&ie=&oe=

Glossary 3 LO1

• Quasi-structured data is textual data that comes in various formats and takes effort to handle and analyse

– e.g. web clickstream data


EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.

https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti

Why do we need non-relational databases?

• Big data has driven the need for:

– NoSQL databases: for unstructured data

– NewSQL databases: bridging the gap between relational and NoSQL database design

• Note: Querying language/method depends on the database used


Recall: NoSQL Databases

NoSQL (Not only SQL), i.e. non-relational databases

• Are used to manage unstructured & semi-structured data

• Sometimes called “Cloud” databases

• Usually open source

• Work on a distributed (parallel) data approach

• General categories of non-relational databases (DBs):

– Key-value DBs, e.g. shopping cart, sensor data

– Document DBs, e.g. tweets, customer data, blog posts

– Column-oriented DBs, e.g. time series, banking

– Graph DBs, e.g. networks, social connections

Coronel, C & Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th edn, Cengage, Boston, USA.

Activity 1: Match database type and application

Database types

• Key-value DBs

• Document DBs

• Column-oriented DBs

• Graph DBs

Applications

• Shopping cart

• Tweets

• Networks

• Time series

• Sensor data

• Banking

• Blog posts

• Social connections

Example of Key-Value Database

For example, student names and ages. The name is used as the key.

Software

• Windows Azure

• Riak

• Redis

• Dynamo

https://www.c-sharpcorner.com/UploadFile/f0b2ed/introduction-of-nosql-database/
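The key-value idea can be sketched with a plain Python dictionary. This is only an in-memory illustration, not how Riak, Redis or Dynamo are actually accessed, and the student names and ages below are invented for the example.

```python
# Hypothetical data: names (keys) and ages (values) are made up for illustration.
student_ages = {"Ma": 21, "Jimmy": 19, "Peter": 22}

# Look up a value by its key.
print(student_ages["Jimmy"])

# Add a new key-value pair (or overwrite an existing one).
student_ages["Sundar"] = 20

# Asking for a missing key with .get() returns a default instead of an error.
print(student_ages.get("Jiping", "not found"))
```

The store knows nothing about what the value means; everything is found through the key, which is why the key must be unique.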

Example: Document Database

• For example, student names, ages & salaries

• Each document has a unique key for searching

• Documents appear as JavaScript Object Notation (JSON) files (semi-structured)

Software

• MongoDB

• RavenDb

• CouchDB

• OrientDB

https://www.c-sharpcorner.com/UploadFile/f0b2ed/introduction-of-nosql-database/

Example: JSON code

JSON format code examples that could be used to exchange data to or from a web server:

{"name": "John", "age": 30, "Car": "Ford"}

{"StreetNum": 5, "streetName": "King William", "Lanes": 4}

Each example is made up of keys, values, colons (:) and curly braces.

1. JSON objects are surrounded by curly braces {},

2. They are written in key & value pairs.

3. Keys must be strings, and values must be a valid JSON data type (string (i.e. text), number, object, array, boolean or null).

4. Keys and values are separated by a colon.

5. Each key/value pair is separated by a comma.
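The rules above can be tried out with Python's built-in json module. This is a minimal sketch: the two objects are the slide examples written with straight double quotes, and the variable names are chosen only for illustration.

```python
import json

# Text as it might arrive from a web server (a JSON object in a string).
text = '{"name": "John", "age": 30, "Car": "Ford"}'

# Parse the text into a Python dictionary of key-value pairs.
person = json.loads(text)
print(person["name"])
print(type(person["age"]).__name__)  # the JSON number becomes a Python int

# Go the other way: turn a Python dictionary into JSON text to send out.
street = {"StreetNum": 5, "streetName": "King William", "Lanes": 4}
encoded = json.dumps(street)
print(encoded)
```

Note that JSON requires straight double quotes ("); the curly quotes that word processors insert automatically are not accepted by a parser.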

Javascript: JSON and Ajax, 1998-2014 O’Reilly Media, Inc., available at archive.oreilly.com/oreillyschool/courses/javascript2/Javascript%20JSON%20and%20Ajax%20v2.pdf

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Activity 2: JSON code

• Why are these incorrectly coded?

1. ("name": "John", "age":30, "Car": "Ford" )

2. {name: "John", age:30, Car: "Ford" }

3. {"name": "age":30, "Car": "Ford" }

4. {"name": "John", "age":30, [Car]: [Ford] }

5. {"name": "John" "age":30 "Car": "Ford" }
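After working out your own answers, one way to confirm that each line really is invalid is to hand it to a JSON parser. The sketch below uses Python's json module; the helper name is_valid_json is hypothetical, and the check only reports whether a line parses, not why it fails.

```python
import json

def is_valid_json(text):
    """Return True if the text parses as JSON, otherwise False."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The five activity lines, written with straight double quotes.
candidates = [
    '("name": "John", "age":30, "Car": "Ford" )',
    '{name: "John", age:30, Car: "Ford" }',
    '{"name": "age":30, "Car": "Ford" }',
    '{"name": "John", "age":30, [Car]: [Ford] }',
    '{"name": "John" "age":30 "Car": "Ford" }',
]

for number, text in enumerate(candidates, start=1):
    print(number, is_valid_json(text))

# A correctly coded version for comparison.
print(is_valid_json('{"name": "John", "age":30, "Car": "Ford" }'))
```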

More about MongoDB

• A document database

• Documents do not have to conform to the same structure (schema-less)

• Documents with similar types are stored in collections; related collections are stored in a DB

• The documents appear as JSON files to users
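The schema-less idea can be sketched in plain Python without a MongoDB server: a collection is modelled as a list, and each document as a dictionary. The student records and the helper find_by_id are hypothetical and are not MongoDB's own API; the "_id" field name follows MongoDB's convention for the unique document key.

```python
import json

# A "collection" of student documents. Each document has a unique key,
# but the remaining fields differ from one document to the next.
students = [
    {"_id": 1, "name": "John", "age": 30},
    {"_id": 2, "name": "Mary", "age": 25, "salary": 52000},
    {"_id": 3, "name": "Ali", "clubs": ["chess", "rowing"]},
]

def find_by_id(collection, doc_id):
    """Return the document with the given unique key, or None if absent."""
    for document in collection:
        if document["_id"] == doc_id:
            return document
    return None

# The document is shown to the user as JSON text.
print(json.dumps(find_by_id(students, 2), indent=2))
```

In a relational table every row would need the same columns; here the third document can carry a list of clubs without any change to the other two.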

Coronel, C & Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th edn, Cengage, Boston, USA.

Example: Column-Oriented Database

• The same example in a row store (relational) and a column store (non-relational). Software: Cassandra and HBase

Relational Table

Column-centric storage

Block 1 | 125670,145679,234466,785940,785840

Block 2 | Ma,Jimmy,Peter,Sundar,Jiping

Block 3 | 130,128,144,132,110

Block 4 | 85,78,88,82,70
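The conversion from rows to column blocks can be sketched in a few lines of Python. The rows below are rebuilt from the four blocks on the assumption that the first value in each block belongs to the first row, the second to the second row, and so on; the original relational table and its column headings are not reproduced here.

```python
# Row store: each tuple is one record (assumed pairing, see note above).
rows = [
    (125670, "Ma", 130, 85),
    (145679, "Jimmy", 128, 78),
    (234466, "Peter", 144, 88),
    (785940, "Sundar", 132, 82),
    (785840, "Jiping", 110, 70),
]

# Column store: zip(*rows) regroups the values so that every block
# holds all the values of a single column.
blocks = [list(column) for column in zip(*rows)]

for number, block in enumerate(blocks, start=1):
    print(f"Block {number} | " + ",".join(str(value) for value in block))
```

A query that only needs one column, such as an average over Block 3, then reads a single block instead of every row.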

Activity 3: Column-Oriented Database

• Convert the subset of data from the week 7 Excel file (shown below) into column-centric format

Relational Table

Column-centric storage

Block 1 |

Block 2 |

Block 3 |

Block 4 |

Case study: Fraud detection using a Graph Database

• Neo4j video on Fraud detection

• Watch the video and learn about graph database design

https://www.youtube.com/watch?v=ujimD6MP87I

Aggregate awareness

• Aggregate awareness means that the data is grouped (or “aggregated”) around a central topic

• For example, data collected in connection with an individual blog post, including

– Title, content, date posted

– Username, screen name

– Comments made on the post, etc.

• Key-value, document and column DBs are all aggregate aware
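The blog post example can be sketched as a single nested Python structure, stored and fetched as one unit. All field values and the key "post:1" are invented for illustration.

```python
# One aggregate: everything connected with a single blog post is kept together.
blog_post = {
    "title": "Week 8 notes",
    "content": "Semi-structured and unstructured data management.",
    "date_posted": "2020-05-01",
    "username": "jsmith",
    "screen_name": "John S",
    "comments": [
        {"username": "mlee", "text": "Very helpful, thanks."},
        {"username": "apatel", "text": "Which database did you use?"},
    ],
}

# The whole aggregate is stored under one key and retrieved in one step.
store = {"post:1": blog_post}
post = store["post:1"]
print(post["title"], "has", len(post["comments"]), "comments")
```

In a relational design the comments would normally sit in a separate table and be joined to the post at query time; here they travel with the post.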

Coronel, C & Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th edn, Cengage, Boston, USA.

NewSQL Databases

• Cloud-based to handle large amounts of data

• E.g. ClustrixDB, NuoDB

• Use SQL for queries

• Use massively parallel query processing (MPP), i.e. data is distributed across multiple servers, each of which processes its own data locally

• Key-value and column-oriented data stores
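The MPP idea of processing data where it is stored can be sketched as follows. This is a simplified simulation under stated assumptions: the "servers" are just Python lists processed one after another, the sales figures are invented, and the function names are hypothetical. A real NewSQL system runs the local step on separate machines at the same time.

```python
def partition(rows, node_count):
    """Spread the rows across the nodes in round-robin order."""
    nodes = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        nodes[index % node_count].append(row)
    return nodes

def local_sum(node_rows):
    """Work done locally on one node, using only that node's data."""
    return sum(node_rows)

# Hypothetical sales amounts to be totalled.
sales = [120, 75, 300, 45, 210, 90, 60]

nodes = partition(sales, 3)
partials = [local_sum(node_rows) for node_rows in nodes]

# Only the small partial results are sent back and combined.
total = sum(partials)
print(partials, total)
```

The equivalent SQL would be a single SUM query; the database splits the work across servers without the user having to change the query.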

Case Study: Hit Labs

ClustrixDB customer success story

• Application: Hit Labs created the Bubble Group Messenger App (for group messaging and group chat)

• It is free on iOS and Android devices

• Originally built on Amazon's Aurora

• Problem: Hit Labs wanted a database to support their rapid user growth

• Solution: Hit Labs decided to use ClustrixDB

https://www.clustrix.com/resources/customer-success-story-hitlabs/

Activity 4: Review Quiz

Q1. Each document in a ___________ has a unique key for searching and storing data, similar to a key-value system

a. relational database

b. document database

c. graph database

d. No database

Q2. MPP stands for

a. Mainly parallel processing

b. Massively parallel providers

c. Massively parallel query processing

d. Managing parallel processors

Hadoop

• Open source Java-based framework for the storage and processing of big data

• Hadoop Distributed File System (HDFS) assumes

– A high volume of data

– Files written once, closed, and read many times

– Batch processing of entire files

– Data distributed across many computers (nodes)

Coronel, C & Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th edn, Cengage, Boston, USA.

Case study: Tesco uses Hadoop

• Watch the video and see how Tesco supermarkets use Hadoop to revolutionise grocery shopping

• https://www.youtube.com/watch?v=tvVx3sRfydg

Benefits and Challenges of unstructured data

• Benefits

– Deeper understanding of data (e.g. customers, products)

– Can boost a company’s revenue

• Challenges

– Large volume of data

– Storage and security

– Need a clear management strategy

https://data-insider.com/2016/10/the-challenge-of-managing-unstructured-data/


Individual Report

• Please ask your teacher if you have last-minute questions regarding the individual report


Exit Activity 5: Mind Map

Help your teacher construct a mind map of today’s content on unstructured data management.

Unstructured data