BUS105
Business Information
Systems
Workshop Week 3 Small and big Data Collection, Storage
and Management in Relation to
Information Systems
Copyright Notice
COPYRIGHT COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969 WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan Higher
Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in
this communication may be subject to copyright under the Act. Any further reproduction
or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice
Lesson Learning Outcomes
1 Review different types of data
2 Contrast small and big data collection
3 Learn about data storage and management
4 Examine business case studies in relation to
the type of data requirements for particular
information systems
Splunk: Slicing Data for
Domino’s Pizza
• Watch the video on how Splunk is helping to improve
Domino’s business functions
https://www.youtube.com/watch?v=LXMjN6kVmUY
Q: What was the big event
that occurred in the US that
required many pizza orders?
• Raw data (primary data)
– Numbers, words, symbols collected from a source
– Not cleaned or processed
– May have errors or outliers
• Metadata
– Data that provides information about other data
– “Metadata explains the origin, purpose, time, geographic
location, creator, access, and terms of use of the data.” https://data.library.arizona.edu/data-management-tips/data-documentation-and-metadata
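A minimal sketch of metadata in practice, assuming a throwaway text file created just for the demonstration: the file's contents are the data, while its name, size and modification time are metadata about that data. The helper name `describe_file` and the sample contents are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a small dictionary of metadata about the file at `path`."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Create a throwaway file so the example runs anywhere (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("order_id,pizza\n1,margherita\n")
    sample_path = handle.name

metadata = describe_file(sample_path)
print(metadata)

# Clean up the temporary file.
os.remove(sample_path)
```

Richer metadata such as creator or terms of use is usually stored inside the file format itself (as with a PDF) or in separate documentation.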
Glossary 1 LO1
• Metadata from a pdf file
Metadata Example
Glossary 2 LO1
• Structured data is formatted for use, has a well-defined data structure, generally stored in rows and columns - e.g. age (in years), first name (text), address (text),
income ($), etc. We will learn more about this in the relational database section of the slides.
• Semi-structured data has some structure - e.g. CSV files with comma-separated values, and XML and
JavaScript Object Notation (JSON) documents used to exchange data to/from a web server
• Parse means to analyse (a string or text) into logical syntactic
components.
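To make "parse" concrete, here is a short sketch that parses one CSV string and one JSON string with Python's standard library. The sample text and field names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical semi-structured inputs.
csv_text = "first_name,age\nAna,34\nBen,29\n"
json_text = '{"first_name": "Ana", "age": 34, "orders": ["pizza", "salad"]}'

# Parsing CSV: each line is split on commas and matched to the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Parsing JSON: the text is turned into nested Python dictionaries and lists.
record = json.loads(json_text)

print(rows[0]["first_name"], rows[0]["age"])
print(record["orders"])
```

Note that the CSV parser returns every value as text, so `age` would still need converting to a number, whereas JSON carries basic types with it.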
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley &
Sons, Indianapolis, US.
https://www.google.com/search?q=parsing+definition&ie=&oe=
Glossary 3 LO1
• Quasi-structured data is textual data which has various
formats and takes effort to handle and analyse
– e.g. web clickstream data
• Unstructured data has no predefined data model, is not
organised, and may contain multiple types of data
- e.g. data from thermostats, sensors, home electronic
devices, cars, images, sounds and PDF files.
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.
https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti
Numerical vs Categorical Data LO1
Data
Numerical (quantitative)
Discrete: takes numerical values from counting
Continuous: takes numerical values from measurements
Categorical (qualitative)
Nominal : an identifier or label and has no numerical meaning
Ordinal: categories that can be ranked (ordered), although the gaps between ranks have no numerical meaning
Examples of Numerical and
Categorical Data
Data
Numerical (quantitative)
Discrete: number of chairs in this room
Continuous: height
Categorical (qualitative)
Nominal: colours, i.e. blue, green, yellow.....
Ordinal: risk, e.g.
1. High risk,
2. Medium risk
3. Low risk
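The four data types above suggest different operations, which the following sketch illustrates with made-up values: counts can be summed, measurements averaged, nominal labels tallied, and ordinal labels sorted by an agreed ranking. The variable names and the `RISK_ORDER` list are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values for each data type.
chairs_per_room = [12, 30, 24]                      # discrete (counted)
heights_cm = [162.5, 171.0, 180.3]                  # continuous (measured)
colours = ["blue", "green", "blue"]                 # nominal (labels only)
risks = ["Medium risk", "High risk", "Low risk"]    # ordinal (rankable)

# The ranking for ordinal data must be stated explicitly.
RISK_ORDER = ["Low risk", "Medium risk", "High risk"]

total_chairs = sum(chairs_per_room)                   # adding counts makes sense
average_height = mean(heights_cm)                     # averaging measurements makes sense
colour_counts = Counter(colours)                      # nominal data can only be tallied
sorted_risks = sorted(risks, key=RISK_ORDER.index)    # ordinal data can be ordered

print(total_chairs, round(average_height, 1))
print(colour_counts.most_common(1))
print(sorted_risks)
```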
Activity 1: Numerical and
Categorical Data
• Form groups and find more examples of the data types
Data
Numerical (quantitative)
Discrete:
Continuous:
Categorical (qualitative)
Nominal:
Ordinal:
• Suppose that you have been employed by bicycle hire
company Citibike to analyse bike trips made by customers
in 2018. Some of the questions you may have are:
• Where do the customers ride most often?
• How far do the customers ride?
• How old, on average, are the customers?
https://www.citibikenyc.com/
Q: What sort of data would
you collect and how much?
Who Wants to Ride Around
New York City?
This is structured data.
Q: How do you think this customer data is collected?
• We obtained a data set of 12,677 trips taken in January
2018.
• Variables include
• Trip Duration (seconds)
• Start Time and Date
• Stop Time and Date
• Start Station Name
• End Station Name
• Station ID
• Station Lat/Long
• Bike ID
• User Type (Customer = 24-hour pass or 3-day pass user;
Subscriber = Annual Member)
• Gender (Zero=unknown; 1=male; 2=female)
• Year of Birth
https://data.world/citibikenyc/citibike-tripdata-january-2018
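A sketch of how variables like these could answer the earlier questions about where customers ride and how old they are. The three trips, the station names and the dictionary keys below are invented for illustration; they are not taken from the real data set.

```python
from collections import Counter
from statistics import mean

# Hypothetical trips using a few of the variables listed above.
trips = [
    {"trip_duration_s": 600, "start_station": "Station A", "birth_year": 1985},
    {"trip_duration_s": 300, "start_station": "Station B", "birth_year": 1990},
    {"trip_duration_s": 900, "start_station": "Station A", "birth_year": 1978},
]

DATA_YEAR = 2018  # the trips were taken in January 2018

# Average trip duration, converted from seconds to minutes.
average_minutes = mean(trip["trip_duration_s"] for trip in trips) / 60

# Most frequently used start station.
busiest_station, busiest_count = Counter(
    trip["start_station"] for trip in trips
).most_common(1)[0]

# Approximate average age, derived from year of birth.
average_age = mean(DATA_YEAR - trip["birth_year"] for trip in trips)

print(average_minutes, busiest_station, round(average_age, 1))
```

Distance travelled is not recorded directly, so it would have to be estimated from the start and end station latitude and longitude.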
Q. What type of variables
are these?
Activity 2: Contrast Small and
Big Data LO2
• Watch the video and list four of the ten ways in
which small and big data differ
• Report back to class
https://www.youtube.com/watch?v=nh-FrpMqlIs
Small Data Summary LO2
1. Goal: often for a very specific purpose
2. Location: usually stored in one place
3. Structure: more likely to be structured data
4. Data preparation: often handled by a single person
5. Longevity: may only be kept for 7 years
6. Measurements: usually measurements taken by a smaller group
or one person/machine and are consistent
7. Reproducibility: easier to reproduce
8. Stakes (cost): less expensive
9. Introspection: easier to interpret and data points clearer
10. Analysis: often easier to organise and analyse
Video on content from Jules Berman’s book called Principles of Big Data: Preparing, Sharing, and Analyzing
Complex Information https://www.youtube.com/watch?v=nh-FrpMqlIs
Big Data Summary LO2
1. Goals: one may not know how they are going to use all of their big data
2. Location: in multiple places (servers)
3. Structure: all types (structured, semi, quasi and unstructured)
4. Data preparation: by several persons
5. Longevity: may be kept for much longer and possibly used across
different projects, or linked to other data later
6. Measurements: by different persons/machines with different protocols
7. Reproducibility: more difficult to recover data if something goes wrong.
8. Stakes (cost): can be expensive
9. Introspection: you may not be able to identify data type or use
10. Analysis: more complex, e.g. requires extraction, transformation, etc.
How Business Collects Customer
Big Data
Internally collected as:
• Sales data (transaction history, customer interaction)
• Customer feedback (e.g. Facebook)
Externally collected by:
• Directly asking
• Indirect tracking (emails, apps and third-party trackers)
• Websites, cookies and web beacons
• Adding other data sources to their own by
– purchasing third party data (e.g. from data
companies Acxiom and Oracle)
https://www.itchronicles.com/big-data/how-do-big-companies-collect-customer-data/
Activity 3: Quick Quiz LO2
1. Big data is usually collected for one specific purpose.
a. True
b. False
2. Small data is usually stored in one place (on one computer or server).
a. True
b. False
3. The Kaplan Information systems course code BUS105 is a:
a. Continuous numerical variable
b. Ordinal variable
c. Nominal variable
d. Discrete numerical variable
Storage of Data LO3
• Data Lake – Repository for large amounts of raw data from multiple sources and in
many formats, some of which may not be useful
• Data warehouse – A repository of data from various sources, partially re-organised, and
used to support decision makers in the organisation
– Takes data from data lake and transforms it
• Data mart
– A low-cost, scaled-down version of a data warehouse designed for the
end-user needs in a strategic business unit (SBU) or a department
• Database
– Organised collection of structured data (relational) or specific
semi-, quasi- and unstructured data (non-relational)
Big Data Storage and
Management Options
Top 10 Big Data Storage Companies
https://selecthub.com/big-data-storage-software/
We will learn more about
semi and unstructured data
management in week 8.
Relational Database Management Systems
• Database management system (DBMS)
– A set of tools to add, delete, access, modify, and analyse stored data
Relational databases
• Data represented as two-dimensional tables with columns and rows
Example of such a table: a Microsoft Excel worksheet
Software for storage and finding data: MySQL, Microsoft Access, Google
Spanner, MemSQL
http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
Non-Relational Database Management
Systems
Non-relational databases
• For big data and real-time web data
• Usually open source and work on a distributed (parallel)
data approach
General categories of non-relational databases:
Key-value stores for shopping cart, sensor data
Document stores for tweets, customer data, blog posts
Wide-column stores for time series, banking
Graph stores for networks, social connections
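To give a feel for three of these categories, the sketch below imitates a key-value store, a document store and a graph store with plain Python structures. This is only an analogy under simplifying assumptions, not how any particular database product works, and all names and values are hypothetical.

```python
import json

# Key-value store: a value (the cart) is looked up by a single key.
cart_store = {"session-42": ["margherita", "garlic bread"]}

# Document store: each record is a self-describing, nested document.
tweet_document = {"user": "ana", "text": "Great pizza!", "tags": ["pizza", "review"]}

# Graph store: nodes connected by edges (here, who follows whom).
follows = {"ana": {"ben"}, "ben": {"ana", "cy"}, "cy": set()}


def followers_of(person):
    """Return, in alphabetical order, everyone who follows `person`."""
    return sorted(name for name, followed in follows.items() if person in followed)


print(cart_store["session-42"])
print(json.dumps(tweet_document))
print(followers_of("ana"))
```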
http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
https://stackoverflow.com/questions/35281066/neo4j-is-it-possible-to-visualise-a-simple-overview-of-my-database
Non-relational databases
NoSQL databases:
• Store data in a non-tabular form,
e.g. MongoDB (JSON), Neo4j, HBASE
XML databases:
• Have an XML format,
e.g. Oracle Berkeley DB XML, eXist-db, BaseX
http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
https://stackoverflow.com/questions/35281066/neo4j-is-it-possible-to-visualise-a-simple-overview-of-my-database
Non-Relational Database Management
Systems Cont.
Query Languages
• Query languages request information from databases.
• Querying language and method used depends on the
database used.
• One of the oldest and most widely used query languages is Structured Query Language
(SQL), designed for relational databases.
– SQL does complicated searches using simple key
words, e.g.
• SELECT (specifies a desired attribute)
• FROM (specifies the table to be used)
• WHERE (specifies conditions to apply in the query)
Other types:
• UnQL for NoSQL databases
• XQuery, XQL for XML databases
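A minimal sketch of the SELECT, FROM and WHERE keywords in action, using Python's built-in `sqlite3` module and an in-memory database. The `customers` table and its three rows are made up for illustration.

```python
import sqlite3

# An in-memory relational database; nothing is written to disk.
connection = sqlite3.connect(":memory:")

# Hypothetical table with one row per customer.
connection.execute(
    "CREATE TABLE customers (first_name TEXT, age INTEGER, income REAL)"
)
connection.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", 34, 72000.0), ("Ben", 29, 55000.0), ("Cy", 41, 91000.0)],
)

# SELECT names the attribute, FROM names the table, WHERE filters the rows.
query = "SELECT first_name FROM customers WHERE age > ? ORDER BY first_name"
names = [row[0] for row in connection.execute(query, (30,))]

print(names)
connection.close()
```

The `?` placeholder passes the age limit as a parameter rather than pasting it into the query text, which is the safer habit when values come from users.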
Activity 4: Review Quiz
Q1: SQL stands for:
a. Sequence query language
b. Structured query language
c. Semi query language
d. Social query language
Q2: Would you use a data mart across a large organisation or just in a
department?
Q3: MongoDB is a
a. Relational database
b. Table
c. XML database
d. NoSQL database using JSON
Data Governance
• Data governance:
– The policies and processes for managing data and information
across an entire organisation for a specified time.
• Master data management
– How and where data is managed and maintained for the entire
organisation
• Roles and responsibilities
– Staff in charge of making policies and managing data
Example (see next slide)
• Cancer Institute NSW data governance policy
[Diagram: data governance, master data management, roles and responsibilities]
http://databaseanswers.org/downloads/Data_Governance_by_Example.pdf
Case Study: Cancer Institute NSW
Data Governance
• Extract from page 6 of the policy document
https://www.cancer.nsw.gov.au/getmedia/b6a63978-f588-493c-af45-ee4716a4066b/CINSW-data-governance-policy.PDF
Data Management Summary LO3
Data management is how you:
– Organise, structure, and maintain the data
– Store, back up, and preserve data
– Prepare material for analysis, or to share with others
Management is part of
governance (hence the
overlap)
Activity 5: Data Governance
• Form groups, watch the video on data governance and answer the questions below.
https://www.youtube.com/watch?v=t4IOS5csv40
Q1: Define data governance. Why do we need it?
Q2: What keywords came up in the video in relation to
data governance?
Q3: What are the three key components of data
governance? Can you explain them in your own words?
Data Documentation
• Data documentation is important for transparency.
• Methods include data dictionaries, schemas and metadata.
A data dictionary is a reference (document) of the variables in a database.
– Defines the format necessary to enter the data into the database, i.e. ranges, codes, decimal places
– Creates standard definitions for all attributes
– Provides organisational data resource inventory for effective data management
Creating a Data Dictionary
Watch the video on creating a data dictionary.
https://www.youtube.com/watch?v=AeVJy-ow2b0
Do you understand these basic elements now?
Field name
Field size
Data type
Data format
Description
Example (optional)
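As a rough illustration of these elements, the sketch below writes two data dictionary entries as Python dictionaries and uses them to check values before they are entered. The fields describe a hypothetical customer table, and the `check_value` helper is an assumption for illustration rather than a standard tool.

```python
# Each entry records the basic elements of one field (hypothetical fields).
data_dictionary = [
    {
        "field_name": "customer_id",
        "field_size": 8,
        "data_type": "integer",
        "data_format": "NNNNNNNN",
        "description": "Unique number identifying a customer",
        "example": "10004512",
    },
    {
        "field_name": "first_name",
        "field_size": 30,
        "data_type": "text",
        "data_format": "Free text",
        "description": "Customer's given name",
        "example": "Ana",
    },
]


def check_value(entry, value):
    """Return True if `value` fits the size and type defined in `entry`."""
    text = str(value)
    if len(text) > entry["field_size"]:
        return False
    if entry["data_type"] == "integer":
        return text.isdigit()
    return True


for entry in data_dictionary:
    print(entry["field_name"], check_value(entry, entry["example"]))
```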
See activity on next slide
Activity 6: Create a Simple Data
Dictionary for the Citibike Data
• Form a group
• Download the file ‘JC-201801-citibike-tripdata.xlsx’
• As a group, construct a simple data dictionary for at least
four variables in the Citibike data
• Report back to class
Case Study: H&R Block Partner
With Xero LO3
• The video shows how H&R Block has adopted
Xero to customise service, given customer tax
data
• Click on link: Xero
• Xero partners dominate nominations for the
Australian Accounting Awards 2019
Case Study: Yamaha Partner 2nd
Watch and AWS Cloud Services
“Established in 1960 as Yamaha International Corporation, Yamaha
Corporation of America (YCA) offers a full line of musical instruments
and audio/visual products to the U.S. market.”
Business Problem:
• Yamaha’s data management was based at a single data centre.
• All production, test, and development systems running in a co-location
arrangement at another data centre.
• Yamaha had an expensive 30-month replacement cycle for its leased
hardware.
Solution:
• Yamaha migrated data & some management to the AWS Cloud
• Company 2nd Watch was hired to assist.
• The migration to AWS was timely.
• 2nd Watch provides ongoing management, optimisation and planning
services.
https://aws.amazon.com/partners/apn-journal/all/yamaha-2nd-watch/