
BUS105 Business Information Systems

Workshop Week 3: Small and Big Data Collection, Storage and Management in Relation to Information Systems

Copyright Notice

COPYRIGHT COMMONWEALTH OF AUSTRALIA

Copyright Regulations 1969 WARNING

This material has been reproduced and communicated to you by or on behalf of Kaplan Higher Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice


Lesson Learning Outcomes

1 Review different types of data

2 Contrast small and big data collection

3 Learn about data storage and management

4 Examine business case studies in relation to the type of data requirements for particular information systems

Splunk: Slicing Data for Domino’s Pizza

• Watch the video on how Splunk is helping to improve Domino’s business functions

https://www.youtube.com/watch?v=LXMjN6kVmUY

Q: What was the big event that occurred in the US that required many pizza orders?

Glossary 1 LO1

• Raw data (primary data)

– Numbers, words, symbols collected from a source

– Not cleaned or processed

– May have errors or outliers

• Metadata

– Data that provides information about other data

– “Metadata explains the origin, purpose, time, geographic location, creator, access, and terms of use of the data.” https://data.library.arizona.edu/data-management-tips/data-documentation-and-metadata

Metadata Example

• Metadata from a PDF file

Glossary 2 LO1

• Structured data is formatted for use and has a well-defined data structure, generally stored in rows and columns, e.g. age (in years), first name (text), address (text), income ($). We will learn more about this in the relational database section of the slides.

• Semi-structured data has some structure, e.g. CSV files with comma-separated values, and XML and JavaScript Object Notation (JSON) documents used to exchange data to and from a web server.

• Parse means to analyse (a string or text) into logical syntactic components.

EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.

https://www.google.com/search?q=parsing+definition&ie=&oe=
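The definitions of semi-structured data and parsing above can be illustrated with a minimal sketch. It assumes two made-up records (the names and ages are hypothetical, not workshop data) stored once as CSV text and once as JSON text, and parses both into the same Python structure using the standard library.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in two semi-structured formats.
csv_text = "first_name,age\nAna,34\nBen,27\n"
json_text = '[{"first_name": "Ana", "age": 34}, {"first_name": "Ben", "age": 27}]'

# Parsing CSV: every value comes back as text, so age is converted to a number.
csv_rows = [
    {"first_name": row["first_name"], "age": int(row["age"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Parsing JSON: the parser itself distinguishes numbers from text.
json_rows = json.loads(json_text)

# Both formats parse into the same list of records.
print(csv_rows == json_rows)
```

Once parsed, the records have named fields and typed values, which is what makes them usable as structured data.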

Glossary 3 LO1

• Quasi-structured data is textual data that has various formats and takes effort to handle and analyse

– e.g. web clickstream data

• Unstructured data has no predefined data model, is not organised, and may contain multiple types of data

– e.g. data from thermostats, sensors, home electronic devices and cars; images, sounds and PDF files

EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.

https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti

Numerical vs Categorical Data LO1

Data

• Numerical (quantitative)

– Discrete: takes numerical values from counting

– Continuous: takes numerical values from measurements

• Categorical (qualitative)

– Nominal: an identifier or label that has no numerical meaning

– Ordinal: categories that can be ranked (ordered), although the gaps between ranks are not necessarily equal

Examples of Numerical and Categorical Data

Data

• Numerical (quantitative)

– Discrete: number of chairs in this room

– Continuous: height

• Categorical (qualitative)

– Nominal: colours, e.g. blue, green, yellow

– Ordinal: risk, e.g. 1. High risk, 2. Medium risk, 3. Low risk
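The four data types in the examples above can be sketched in Python. All values are hypothetical and chosen only for illustration; the helper `risk_rank` is an assumed name, not part of the workshop material. The point of the sketch is that ordinal labels need an explicit rank order before they can be sorted meaningfully, whereas nominal labels have no such order.

```python
# Hypothetical values used only to illustrate the four data types.
chairs_in_room = 24   # discrete: a count, whole numbers only
height_cm = 172.5     # continuous: a measurement, any value in a range
colour = "blue"       # nominal: a label with no order

# Ordinal: labels with a rank order, listed here from lowest to highest.
RISK_ORDER = ["Low risk", "Medium risk", "High risk"]


def risk_rank(level: str) -> int:
    """Return the position of a risk level in RISK_ORDER (0 = lowest)."""
    return RISK_ORDER.index(level)


observed = ["High risk", "Low risk", "Medium risk", "Low risk"]

# Sorting by rank respects the category order; alphabetical sorting would not.
sorted_levels = sorted(observed, key=risk_rank)
print(sorted_levels)
```

Sorting the same list alphabetically would place "High risk" first, which shows why the rank order has to be stored with ordinal data.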

Activity 1: Numerical and Categorical Data

• Form groups and find more examples of the data types

Data

• Numerical (quantitative)

– Discrete:

– Continuous:

• Categorical (qualitative)

– Nominal:

– Ordinal:

Who Wants to Ride Around New York City?

• Suppose that you have been employed by bicycle hire company Citibike to analyse bike trips made by customers in 2018. Some of the questions you may have are:

– Where do the customers ride most often?

– How far do the customers ride?

– How old, on average, are the customers?

https://www.citibikenyc.com/

Q: What sort of data would you collect and how much?

This is structured data.

Q: How do you think this customer data is collected?

• We obtained a data set of 12,677 trips taken in January 2018.

• Variables include

– Trip Duration (seconds)

– Start Time and Date

– Stop Time and Date

– Start Station Name

– End Station Name

– Station ID

– Station Lat/Long

– Bike ID

– User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)

– Gender (Zero=unknown; 1=male; 2=female)

– Year of Birth

https://data.world/citibikenyc/citibike-tripdata-january-2018

Q. What type of variables are these?
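As a rough illustration of how the listed variables could answer the earlier questions about the customers, here is a minimal sketch. The three rows are invented for the example and are not taken from the real data set; the field names (`tripduration`, `start_station`, `birth_year`) are assumptions about how the columns might be labelled.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows that follow the variable list above; not real trip records.
trips = [
    {"tripduration": 600, "start_station": "Station A",
     "usertype": "Subscriber", "gender": 1, "birth_year": 1988},
    {"tripduration": 1200, "start_station": "Station B",
     "usertype": "Customer", "gender": 0, "birth_year": 1998},
    {"tripduration": 300, "start_station": "Station A",
     "usertype": "Subscriber", "gender": 2, "birth_year": 1978},
]

DATA_YEAR = 2018  # the trips were taken in January 2018

# Average trip length in minutes (trip duration is recorded in seconds).
avg_minutes = mean(t["tripduration"] for t in trips) / 60

# Approximate average age, derived from year of birth.
avg_age = mean(DATA_YEAR - t["birth_year"] for t in trips)

# Most frequently used start station.
busiest_station = Counter(t["start_station"] for t in trips).most_common(1)[0][0]

print(round(avg_minutes, 1), avg_age, busiest_station)
```

Note that trip duration measures time rather than distance, so "how far" would need the station latitude and longitude as well.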


Activity 2: Contrast Small and Big Data LO2

• Watch the video and list four of the ten ways in which small and big data differ

• Report back to class

https://www.youtube.com/watch?v=nh-FrpMqlIs

Small Data Summary LO2

1. Goal: often for a very specific purpose

2. Location: usually stored in one place

3. Structure: more likely to be structured data

4. Data preparation: often handled by a single person

5. Longevity: may only be kept for 7 years

6. Measurements: usually taken by a smaller group or one person/machine, and consistent

7. Reproducibility: easier to reproduce

8. Stakes (cost): less expensive

9. Introspection: easier to interpret and data points clearer

10. Analysis: often easier to organise and analyse

Video on content from Jules Berman’s book Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information: https://www.youtube.com/watch?v=nh-FrpMqlIs

Big Data Summary LO2

1. Goals: one may not know how they are going to use all of their big data

2. Location: in multiple places (servers)

3. Structure: all types (structured, semi, quasi and unstructured)

4. Data preparation: by several persons

5. Longevity: may be kept for much longer and possibly used across different projects, or linked to other data later

6. Measurements: by different persons/machines with different protocols

7. Reproducibility: more difficult to recover data if something goes wrong.

8. Stakes (cost): can be expensive

9. Introspection: you may not be able to identify data type or use

10. Analysis: more complex, e.g. requires extraction, transformation, etc.

How Business Collects Customer Big Data

Internally collected as:

• Sales data (transaction history, customer interaction)

• Customer feedback (e.g. Facebook)

Externally collected by:

• Directly asking

• Indirect tracking (emails, apps, third-party trackers, websites, cookies and web beacons)

• Adding other data sources to their own by purchasing third-party data (e.g. from data companies Acxiom and Oracle)

https://www.itchronicles.com/big-data/how-do-big-companies-collect-customer-data/

Activity 3: Quick Quiz LO2

1. Big data is usually collected for one specific purpose.

a. True

b. False

2. Small data is usually stored in one place (on one computer or server).

a. True

b. False

3. The Kaplan Information Systems course code BUS105 is a:

a. Continuous numerical variable

b. Ordinal variable

c. Nominal variable

d. Discrete numerical variable

Storage of Data LO3

• Data lake

– Repository for large amounts of raw data from multiple sources and in many formats, some of which may not be useful

• Data warehouse

– A repository of data from various sources, partially re-organised, and used to support decision makers in the organisation

– Takes data from the data lake and transforms it

• Data mart

– A low-cost, scaled-down version of a data warehouse designed for the end-user needs in a strategic business unit (SBU) or a department

• Database

– Organised collection of structured data (relational) or specific semi-structured, quasi-structured and unstructured data (non-relational)
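The relationship between a data lake, a data warehouse and a data mart can be sketched as a small pipeline. This is a conceptual toy under stated assumptions, not how any storage product works: the records, the function names `to_warehouse` and `to_data_mart`, and the cleaning rules are all hypothetical.

```python
# Hypothetical raw records as they might land in a data lake:
# mixed formats, inconsistent fields, some rows not usable.
data_lake = [
    {"dept": "sales", "amount": "120.50", "date": "2018-01-03"},
    {"dept": "Sales", "amount": "80", "date": "2018-01-04"},
    {"dept": "hr", "amount": "n/a", "date": "2018-01-04"},
    "free-text note from a call centre",
]


def to_warehouse(records):
    """Keep usable records and standardise their fields."""
    cleaned = []
    for rec in records:
        if not isinstance(rec, dict):
            continue  # not usable in this sketch; it remains in the lake only
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # amount is not a number, so the row is skipped
        cleaned.append(
            {"dept": rec["dept"].lower(), "amount": amount, "date": rec["date"]}
        )
    return cleaned


def to_data_mart(warehouse, dept):
    """Select the subset of warehouse data that one department needs."""
    return [rec for rec in warehouse if rec["dept"] == dept]


warehouse = to_warehouse(data_lake)
sales_mart = to_data_mart(warehouse, "sales")
print(len(data_lake), len(warehouse), len(sales_mart))
```

The lake keeps everything as received, the warehouse holds the transformed subset, and the mart is a narrower view for one business unit.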

Big Data Storage and Management Options

Top 10 Big Data Storage Companies

https://selecthub.com/big-data-storage-software/

We will learn more about semi-structured and unstructured data management in week 8.

Relational Database Management Systems

• Database management system (DBMS)

– A set of tools to add, delete, access, modify, and analyse stored data

• Relational databases

– Data represented as two-dimensional tables with columns and rows

– Example of a table layout: a Microsoft Excel worksheet (a spreadsheet rather than a relational database)

– Software for storing and finding data: MySQL, Microsoft Access, Google Spanner, MemSQL

http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
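The DBMS operations listed above (add, delete, access, modify, analyse) can be tried with SQLite, a small relational database bundled with Python. SQLite is swapped in here for convenience and is not one of the products named on the slide; the `customer` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A relational table: named columns, one row per customer.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)"
)

# Add rows.
conn.executemany(
    "INSERT INTO customer (first_name, age) VALUES (?, ?)",
    [("Ana", 34), ("Ben", 27), ("Cy", 45)],
)

# Modify a row.
conn.execute("UPDATE customer SET age = 28 WHERE first_name = 'Ben'")

# Delete a row.
conn.execute("DELETE FROM customer WHERE first_name = 'Cy'")

# Access the remaining rows.
rows = conn.execute("SELECT first_name, age FROM customer ORDER BY id").fetchall()

# Analyse: average age of the remaining customers.
avg_age = conn.execute("SELECT AVG(age) FROM customer").fetchone()[0]

conn.close()
print(rows, avg_age)
```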

Non-Relational Database Management Systems

Non-relational databases

• For big data and real-time web data

• Usually open source and work on a distributed (parallel) data approach

General categories of non-relational databases:

• Key-value stores for shopping carts, sensor data

• Document stores for tweets, customer data, blog posts

• Wide-column stores for time series, banking

• Graph stores for networks, social connections

http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/

https://stackoverflow.com/questions/35281066/neo4j-is-it-possible-to-visualise-a-simple-overview-of-my-database
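The four categories of non-relational database differ mainly in the shape of the data they hold. The sketch below imitates each shape with plain Python structures; it does not use or represent the API of any real NoSQL product, and all keys and values are hypothetical.

```python
# Key-value store: one value looked up by one key (e.g. a shopping cart).
key_value = {"cart:1001": ["pizza", "cola"]}

# Document store: self-describing records whose fields may differ.
documents = [
    {"_id": 1, "type": "blog_post", "title": "Week 3 notes", "tags": ["data"]},
    {"_id": 2, "type": "customer", "name": "Ana"},
]

# Wide-column store: each row key holds its own set of columns
# (here, timestamped sensor readings as a time series).
wide_column = {
    "sensor-7": {"2018-01-01T10:00": 21.5, "2018-01-01T11:00": 22.1},
    "sensor-9": {"2018-01-01T10:00": 19.8},
}

# Graph store: nodes and the connections between them.
graph = {"Ana": {"Ben", "Cy"}, "Ben": {"Ana"}, "Cy": {"Ana"}}

# A typical lookup for each shape.
cart_items = key_value["cart:1001"]
posts = [d for d in documents if d["type"] == "blog_post"]
latest_reading_time = max(wide_column["sensor-7"])  # ISO timestamps sort as text
friends_of_ana = sorted(graph["Ana"])

print(cart_items, len(posts), latest_reading_time, friends_of_ana)
```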

Non-Relational Database Management Systems Cont.

Non-relational databases

NoSQL databases:

• Store data in a non-tabular form, e.g. MongoDB (JSON), Neo4j, HBase

XML databases:

• Have an XML format, e.g. Oracle Berkeley DB XML, eXist-db, BaseX

http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/

https://stackoverflow.com/questions/35281066/neo4j-is-it-possible-to-visualise-a-simple-overview-of-my-database

Query Languages

• Query languages request information from databases.

• The query language and method used depend on the database used.

• One of the oldest and most widely used query languages is Structured Query Language (SQL), for relational databases.

– SQL does complicated searches using simple keywords, e.g.

• SELECT (specifies a desired attribute)

• FROM (specifies the table to be used)

• WHERE (specifies conditions to apply in the query)

• Other types: UnQL for NoSQL databases; XQuery and XQL for XML databases
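The three SQL keywords above can be seen working together in a short query. The sketch below runs it through Python's built-in sqlite3 module; the `trips` table, its column names and its three rows are hypothetical and only loosely modelled on the Citibike variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (bike_id INTEGER, usertype TEXT, tripduration INTEGER)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(101, "Subscriber", 600), (102, "Customer", 1200), (103, "Subscriber", 300)],
)

query = """
    SELECT bike_id, tripduration      -- the attributes wanted
    FROM trips                        -- the table to read
    WHERE usertype = 'Subscriber'     -- the conditions rows must meet
      AND tripduration > 500
"""
long_subscriber_trips = conn.execute(query).fetchall()
conn.close()

print(long_subscriber_trips)
```

Only one row satisfies both conditions, so the query returns a single (bike_id, tripduration) pair.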

Activity 4: Review Quiz

Q1: SQL stands for:

a. Sequence query language

b. Structured query language

c. Semi query language

d. Social query language

Q2: Would you use a data mart across a large organisation or just in a department?

Q3: MongoDB is a:

a. Relational database

b. Table

c. XML database

d. NoSQL database using JSON

Data Governance

• Data governance

– The policies and processes for managing data and information across an entire organisation for a specified time

• Master data management

– How and where data is managed and maintained for the entire organisation

• Roles and responsibilities

– Staff in charge of making policies and managing data

Example (see next slide)

• Cancer Institute NSW data governance policy

[Diagram labels: Data governance; Master data management; Roles and responsibilities]

http://databaseanswers.org/downloads/Data_Governance_by_Example.pdf

Case Study: Cancer Institute NSW Data Governance

• Extract from page 6 of the policy document

https://www.cancer.nsw.gov.au/getmedia/b6a63978-f588-493c-af45-ee4716a4066b/CINSW-data-governance-policy.PDF

Data Management Summary LO3

Data management is how you:

– Organise, structure, and maintain the data

– Store, back up, and preserve data

– Prepare material for analysis, or to share with others

Management is part of governance (hence the overlap)

Activity 5: Data Governance

• Form groups, watch the video on data governance and answer the questions below.

https://www.youtube.com/watch?v=t4IOS5csv40

Q1: Define data governance. Why do we need it?

Q2: What keywords came up in the video in relation to data governance?

Q3: What are the three key components of data governance? Can you explain them in your own words?

Data Documentation

• Data documentation is important for transparency.

• Methods include data dictionaries, schemas and metadata.

A data dictionary is a reference (document) of the variables in a database.

– Defines the format necessary to enter the data into the database, i.e. ranges, codes, decimal places

– Creates standard definitions for all attributes

– Provides organisational data resource inventory for effective data management

Creating a Data Dictionary

Watch the video on creating a data dictionary.

https://www.youtube.com/watch?v=AeVJy-ow2b0

Do you understand these basic elements now?

Field name

Field size

Data type

Data format

Description

Example (optional)

See activity on next slide
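The basic elements listed above can be written down as one record per variable. The sketch below is one possible layout, not a prescribed format: the field names, field sizes and the `allowed_codes` key are assumptions made for the example, while the gender codes come from the Citibike variable list. The helper `is_valid` (a hypothetical name) shows how a dictionary entry can be used to check a value before it is entered.

```python
# One entry per variable, using the basic elements of a data dictionary.
# Field names and field sizes are assumptions made for this sketch.
data_dictionary = [
    {
        "field_name": "tripduration",
        "field_size": 6,
        "data_type": "integer",
        "data_format": "whole seconds",
        "description": "Length of the trip in seconds",
        "example": "600",
    },
    {
        "field_name": "gender",
        "field_size": 1,
        "data_type": "integer",
        "data_format": "single-digit code",
        "description": "0 = unknown, 1 = male, 2 = female",
        "example": "1",
        "allowed_codes": {0, 1, 2},
    },
]


def is_valid(field_name, value):
    """Check a value against the size and codes recorded for its field."""
    entry = next(e for e in data_dictionary if e["field_name"] == field_name)
    if len(str(value)) > entry["field_size"]:
        return False
    allowed = entry.get("allowed_codes")
    return allowed is None or value in allowed


print(is_valid("gender", 2), is_valid("gender", 5))
```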

Activity 6: Create a Simple Data Dictionary for the Citibike Data

• Form a group

• Download the file ‘JC-201801-citibike-tripdata.xlsx’

• As a group, construct a simple data dictionary for at least four variables in the Citibike data

• Report back to class

Case Study: H&R Block Partners With Xero LO3

• The video shows how H&R Block has adopted Xero to customise its service, given customer tax data

• Click on link: Xero

• Xero partners dominate nominations for the Australian Accounting Awards 2019

Case Study: Yamaha Partners With 2nd Watch and AWS Cloud Services

“Established in 1960 as Yamaha International Corporation, Yamaha Corporation of America (YCA) offers a full line of musical instruments and audio/visual products to the U.S. market.”

Business Problem:

• Yamaha’s data management was based at a single data centre.

• All production, test, and development systems were running in a co-location arrangement at another data centre.

• Yamaha had an expensive 30-month replacement cycle for its leased hardware.

Solution:

• Yamaha migrated data and some management to the AWS Cloud.

• The company 2nd Watch was hired to assist.

• The migration to AWS was timely.

• 2nd Watch provides ongoing management, optimisation and planning services.

https://aws.amazon.com/partners/apn-journal/all/yamaha-2nd-watch/