report
BUS105 Business Information Systems
Lesson week 8
Semi-structured and unstructured data management
Lesson Learning Outcomes
1 Define semi-structured and unstructured data
2 Distinguish between the various NoSQL and NewSQL databases
3 Learn about various software packages for the management of semi-structured and unstructured data
4 Evaluate case studies
5 Final discussion with your teacher of individual report
Dark analytics: Analyzing
unstructured data
Did you know that 95% of data in the world is unstructured?
Watch the video on Dark Analytics
https://www.youtube.com/watch?v=X4f-GCGraXI
What sorts of data are really difficult to analyse?
Glossary 1 LO1
Recall that
• Semi-structured data has some structure
- e.g. CSV files with comma-separated data, and XML and JavaScript Object Notation (JSON) documents used to
exchange data to/from a web server.
- Note: some analysts consider CSV files to be structured data.
• Unstructured data has no predefined data model, is not
organised, and may contain multiple types of data
- e.g. data from thermostats, sensors, home electronic
devices and cars; images, sounds and PDF files.
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.
https://www.google.com/search?q=parsing+definition&ie=&oe=
Glossary 3 LO1
• Quasi-structured data is textual data that comes in various
formats and takes effort to handle and analyse
– e.g. web clickstream data
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.
https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti
Why do we need non-relational
databases?
• Big data has driven the need for
– NoSQL databases – For unstructured data
– NewSQL databases – Bridging the gap between relational and NoSQL database design
• Note: The querying language/method depends on the database used
Recall: NoSQL Databases
NoSQL (Not only SQL), i.e. non-relational databases
• Are used to manage unstructured & semi-structured data
• Sometimes called “Cloud” databases
• Usually open source
• Work on a distributed (parallel) data approach
• General categories of non-relational databases (DBs):
– Key-value DBs, e.g. shopping cart, sensor data
– Document DBs, e.g. tweets, customer data, blog posts
– Column-oriented DBs, e.g. time series, banking
– Graph DBs, e.g. networks, social connections
Coronel, C, and Morris, S 2019, Database Systems: Design, Implementation, &
Management, 13th Edn., Cengage, Boston, USA.
Activity 1: Match database type and application
Key-value DBs
Document DBs
Column-oriented DBs
Graph DBs
Shopping cart
Tweets
Networks
Time series
Sensor data
Banking
Blog posts
Social connections
Example of Key-Value Database
For example, student names and ages. The name is
used as the key.
Software
• Windows Azure
• Riak
• Redis
• Dynamo
https://www.c-sharpcorner.com/UploadFile/f0b2ed/introduction-of-nosql-database/
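The key-value idea above can be sketched with a plain Python dictionary. This is only an illustration of the data model, not how any of the listed products is implemented; the student names, ages and the put/get helper names are made up for the example.

```python
# A minimal in-memory key-value store: the student's name is the key,
# the age is the value. All names and ages here are hypothetical.
student_ages = {"Alice": 21, "Bob": 23}


def put(store, key, value):
    """Store (or overwrite) the value held under a key."""
    store[key] = value


def get(store, key, default=None):
    """Look a value up by its key; return default if the key is absent."""
    return store.get(key, default)


put(student_ages, "Chen", 19)
print(get(student_ages, "Chen"))   # the age stored under the key "Chen"
print(get(student_ages, "Dana"))   # no such key, so the default (None)
```

The store only knows how to find a value from its key. It cannot, for example, search for all students of a given age without scanning every entry.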
Example: Document Database
• For example, student names, ages & salaries
• Each document has a unique key for searching
• Documents appear as JavaScript Object Notation (JSON) files (semi-structured)
Software
• MongoDB
• RavenDb
• CouchDB
• OrientDB
https://www.c-sharpcorner.com/UploadFile/f0b2ed/introduction-of-nosql-database/
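As a rough sketch of the document model described above, each document below is stored under a unique key and can be written out as JSON. The keys ("stu001", "stu002"), field values and the find_by_key helper are assumptions for illustration; this is plain Python, not the API of MongoDB or any other product listed.

```python
import json

# Each document is a set of fields stored under a unique key.
# Keys and field values are hypothetical sample data.
documents = {
    "stu001": {"name": "Alice", "age": 21, "salary": 52000},
    "stu002": {"name": "Bob", "age": 23, "salary": 48000},
}


def find_by_key(store, key):
    """Return the whole document stored under a unique key, or None."""
    return store.get(key)


# Serialise one document to JSON text, the form in which it appears to users.
doc_json = json.dumps(find_by_key(documents, "stu001"))
print(doc_json)
```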
Example: JSON code
JSON format code examples that could be used to exchange data to or from a web server:
{"name": "John", "age": 30, "Car": "Ford"}
{"StreetNum": 5, "streetName": "King William", "Lanes": 4}
Parts labelled in the example: key, value, colon (:), curly brace
1. JSON objects are surrounded by curly braces {},
2. They are written in key & value pairs.
3. Keys must be strings, and values must be a valid JSON data type
(string (i.e. text), number, object, array, boolean or null).
4. Keys and values are separated by a colon.
5. Each key/value pair is separated by a comma.
JavaScript: JSON and Ajax, 1998-2014 O’Reilly Media, Inc., available at
archive.oreilly.com/oreillyschool/courses/javascript2/Javascript%20JSON%20and%20Ajax%20v2.pdf
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
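The rules above can be tried out with Python's standard json module. The sketch below parses the first example object from this slide and reads the values back; the variable names are chosen for the example only.

```python
import json

# JSON text as it might arrive from a web server (first example on the slide).
text = '{"name": "John", "age": 30, "Car": "Ford"}'

# json.loads turns JSON text into a Python dictionary of key/value pairs.
person = json.loads(text)

print(person["name"])                 # value stored under the key "name"
print(type(person["age"]).__name__)   # JSON numbers become Python numbers

# json.dumps goes the other way: Python data back to JSON text.
round_trip = json.dumps(person)
```

Note that the keys and string values use straight double quotes. Curly quotes, as produced by some word processors, are not valid JSON.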
Activity 2: JSON code
• Why are these incorrectly coded?
1. ("name": "John", "age":30, "Car": "Ford" )
2. {name: "John", age:30, Car: "Ford" }
3. {"name": "age":30, "Car": "Ford" }
4. {"name": "John", "age":30, [Car]: [Ford] }
5. {"name": "John" "age":30 "Car": "Ford" }
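One way to check answers to this activity is to hand each line to a JSON parser and see whether it is accepted. The sketch below uses Python's standard json module; the is_valid_json helper is a made-up name, and the parser's messages are only hints, so working out the rule that each line breaks is still the exercise.

```python
import json

# The five candidate lines from the activity, typed with straight quotes.
candidates = [
    '("name": "John", "age":30, "Car": "Ford" )',
    '{name: "John", age:30, Car: "Ford" }',
    '{"name": "age":30, "Car": "Ford" }',
    '{"name": "John", "age":30, [Car]: [Ford] }',
    '{"name": "John" "age":30 "Car": "Ford" }',
]


def is_valid_json(text):
    """Return True if the text parses as JSON, otherwise False."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


for number, text in enumerate(candidates, start=1):
    try:
        json.loads(text)
        print(number, "parsed without error")
    except json.JSONDecodeError as error:
        # error.msg is the parser's short description of the problem
        print(number, "rejected:", error.msg)
```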
More about MongoDB
• A document database
• Documents do not have to conform to the
same structure (schema-less)
• Documents with similar types are stored in
collections, related collections are stored in a
DB
• The documents appear as JSON files to users
Coronel, C, and Morris, S 2019, Database Systems: Design, Implementation,
& Management, 13th Edn., Cengage, Boston, USA.
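The "schema-less" point can be illustrated in plain Python. In the sketch below a database holds collections, and documents in the same collection carry different fields. It does not use MongoDB or its driver; the collection names, field names and the find helper are all assumptions for the example.

```python
# A database is modelled as a dictionary of collections, and each collection
# as a list of documents. Documents in one collection need not share fields.
database = {
    "students": [
        {"_id": 1, "name": "Alice", "age": 21},
        {"_id": 2, "name": "Bob", "salary": 48000, "skills": ["SQL", "Python"]},
    ],
    "courses": [
        {"_id": "BUS105", "title": "Business Information Systems"},
    ],
}


def find(collection, **criteria):
    """Return every document whose fields match all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(database["students"], name="Bob")
print(matches)
```

Because no schema is enforced, the second student document can hold a list of skills without any change to the first.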
Example: Column-Oriented Database
• The same example shown as a row store (relational) and as a column
store (non-relational). Software: Cassandra and HBase
Relational Table
Column-centric storage
Block 1 | 125670,145679,234466,785940,785840
Block 2 | Ma,Jimmy,Peter,Sundar,Jiping
Block 3 | 130,128,144,132,110
Block 4 | 85,78,88,82,70
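The conversion from rows to column blocks can be sketched in a few lines of Python. The rows below are rebuilt from the four blocks above, assuming the values in each block are listed in the same row order; the original column headings are not shown here, so the code refers to columns by position only.

```python
# Row store: one tuple per record, as in a relational table.
# Rows are rebuilt from the blocks above (same position = same record).
rows = [
    (125670, "Ma", 130, 85),
    (145679, "Jimmy", 128, 78),
    (234466, "Peter", 144, 88),
    (785940, "Sundar", 132, 82),
    (785840, "Jiping", 110, 70),
]

# Column store: one block per column, each holding that column's values.
# zip(*rows) regroups the data by column instead of by row.
blocks = [list(column) for column in zip(*rows)]

for number, block in enumerate(blocks, start=1):
    print(f"Block {number} |", ",".join(str(value) for value in block))

# A calculation on one column only needs to read one block.
average_of_last_column = sum(blocks[3]) / len(blocks[3])
```

Reading a single block is why column stores suit queries that aggregate one attribute over many records.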
Activity 3: Column-Oriented Database
• Convert the subset of data from the week 7 Excel file
(shown below) into column-centric format
Relational Table
Column-centric storage
Block 1 |
Block 2 |
Block 3 |
Block 4 |
Case study: Fraud detection using a
Graph Database
• Neo4j video on Fraud detection
• Watch the video and learn about graph database design
https://www.youtube.com/watch?v=ujimD6MP87I
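Graph DBs store entities and the connections between them. As a rough illustration of that idea only (it is not taken from the video and does not use Neo4j or its query language), the sketch below links hypothetical account holders to the contact details they registered, then lists the pairs of people who share a detail. All names, details and the shared_details helper are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges: (account holder, detail they registered).
edges = [
    ("Ann", "phone:555-0101"),
    ("Ben", "phone:555-0101"),
    ("Ben", "address:12 High St"),
    ("Cat", "address:12 High St"),
    ("Dan", "phone:555-0199"),
]


def shared_details(edge_list):
    """Map each pair of people to the details that connect them."""
    holders = defaultdict(set)
    for person, detail in edge_list:
        holders[detail].add(person)

    links = {}
    for detail, people in holders.items():
        for pair in combinations(sorted(people), 2):
            links.setdefault(pair, []).append(detail)
    return links


links = shared_details(edges)
for pair, details in links.items():
    print(pair, "share", details)
```

Questions about who is connected to whom, and through what, are the kind a graph model is designed to answer directly.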
Aggregate awareness
• Aggregate awareness means that the data is grouped (or “aggregated”) around a central topic
• For example, data collected in connection with an individual blog post, including
– Title, content, date posted
– Username, screen name
– Comments made on the post, etc
• Key value, document and column DBs are all aggregate aware
Coronel, C, and Morris, S 2019, Database Systems: Design, Implementation,
& Management, 13th Edn., Cengage, Boston, USA.
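The blog post example above can be written as a single aggregate: everything about one post is stored together and fetched with one lookup. The sketch below is plain Python with made-up values; the key "post-42", the field names and the load_post helper are assumptions for illustration.

```python
# One aggregate per blog post: the post, its author and its comments are
# kept together under a single key. All values are hypothetical.
posts = {
    "post-42": {
        "title": "Managing unstructured data",
        "content": "Text of the post...",
        "date_posted": "2024-05-01",
        "author": {"username": "jsmith", "screen_name": "Jo S"},
        "comments": [
            {"username": "akhan", "text": "Very useful, thanks."},
            {"username": "lwong", "text": "What about images?"},
        ],
    }
}


def load_post(store, post_id):
    """Fetch the whole aggregate for one post in a single lookup."""
    return store[post_id]


post = load_post(posts, "post-42")
print(post["title"], "has", len(post["comments"]), "comments")
```

In a relational design the same data would typically be split across several tables and joined back together at query time.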
NewSQL Databases
• Cloud-based to handle large amounts of data
• E.g. ClustrixDB, NuoDB
• Use SQL for queries
• Use massively parallel query processing (MPP), i.e. the data is spread across multiple servers, each of which processes its own data locally
• Key-value and column-oriented data stores
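The MPP idea can be sketched as "split, process locally, combine". In the example below each inner list stands for the data held on one server, worker threads stand in for the servers, and the partial results are added together at the end. The figures and function names are hypothetical, and a real NewSQL system would distribute the work across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales amounts, already split across three "servers".
partitions = [
    [120, 80, 45],
    [200, 15],
    [60, 60, 60, 20],
]


def local_sum(partition):
    """Work done on one server: total only the data it holds."""
    return sum(partition)


def parallel_total(parts):
    """Run local_sum on every partition in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_results = list(pool.map(local_sum, parts))
    return sum(partial_results), partial_results


total, partial_results = parallel_total(partitions)
print(partial_results, total)
```

Only the small partial results travel between servers; the raw data stays where it is stored.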
Case Study: Hit Labs
ClustrixDB customer success story
• Application: Hit Labs created the Bubble Group Messenger
App (for group messaging and group chat)
• It is free on iOS and Android devices
• Originally built on Amazon's Aurora
• Problem: Hit Labs wanted a database to support their rapid
user growth
• Solution: Hit Labs decided to use ClustrixDB
https://www.clustrix.com/resources/customer-success-story-hitlabs/
Activity 4: Review Quiz
Q1. Each document in a ___________ has a unique key for searching and storing data, similar to a key-value system
a. relational database
b. document database
c. graph database
d. No database
Q2. MPP stands for
a. Mainly parallel processing
b. Massively parallel providers
c. Massively parallel query processing
d. Managing parallel processors
Hadoop
• Open-source, Java-based framework for the storage and processing of big data
• Hadoop Distributed File System (HDFS) assumes
– A high volume of data
– Files written once, closed, and read many times
– Batch processing of entire files
– Data distributed across many computers (nodes)
Coronel, C, and Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th
Edn., Cengage, Boston, USA.
Case study: Tesco uses Hadoop
• Watch the video and see how Tesco
supermarkets use Hadoop to revolutionise
grocery shopping
• https://www.youtube.com/watch?v=tvVx3sRfydg
Benefits and Challenges of
unstructured data
• Benefits
– Deeper understanding of data (e.g. customers, products)
– Boosts company’s revenue
• Challenges
– Large volume of data
– Storage and security
– Need a clear management strategy
https://data-insider.com/2016/10/the-challenge-of-managing-unstructured-data/
Individual Report
• Please ask your teacher if you have any last-minute
questions regarding the individual report
Exit Activity 5: Mind Map
Help your teacher construct a mind map of today’s content on
unstructured data management.
Unstructured data