Report on Unstructured data management

DATA4200_Workshop_01_T1_2021.pdf

Data 4200

Data Acquisition and

Management

Lesson 1

Data Sources, Ethical and Best

Practice in Data Acquisition

Copyright Notice

COPYRIGHT COMMONWEALTH OF AUSTRALIA

Copyright Regulations 1969 WARNING

This material has been reproduced and communicated to you by or on behalf of Kaplan Higher

Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in

this communication may be subject to copyright under the Act. Any further reproduction

or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice

Roadmap

• In this subject we want you to understand how data collection, storage, sampling and analysis are evolving

• This course will provide you with hands-on experience in SQL and Power BI

• There will also be introductory-level Neo4j and Python

Lesson Learning Outcomes

1 Investigate data types, data sources and acquisition

2 Explore open data sources

3 Perform some simple web extraction exercises

4 Analyse best practice, ethics and challenges in relation to data acquisition

5 Review the evolution of databases and example database models

This Topic’s Big Idea

“Web users ultimately want to get at data

quickly and easily. They don’t care as much

about attractive sites and pretty design.” Tim Berners-Lee

Activity 1: Many Ways to Acquire and Store Data

• Watch the video at: https://www.coursera.org/lecture/big-data-introduction/step-1-acquiring-data-4SI3T

• Answer the questions below.

Q1: It is said that leaving out data can lead to errors in your results. However, is it realistic to think that all of the data we need will be available if we search hard enough?

Q2: Name a software package that can be used to acquire text files.

Ways of Distinguishing Between Data Types

Many ways of categorising the data you acquire

• Primary, secondary

• Quantitative, qualitative

• Raw data, metadata

• Structured, semi-structured, unstructured

• Internal versus external

https://lcolumbus.files.wordpress.com/2012/03/image-for-data-center-forecast.jpg

Recall Data Types…

• Structured data is formatted for use, has a well-defined data structure and is generally stored in rows and columns

- e.g. age (in years), first name (text), address (text), income ($), etc.

• Semi-structured data has some structure

- e.g. CSV files with comma-separated data; XML and JavaScript Object Notation (JSON) documents used to exchange data to/from a web server

• Unstructured data has no predefined data model, is not organised and may contain multiple types of data

- e.g. data from thermostats, sensors, home electronic devices and cars; images, sounds and PDF files

Tomcy, J & Pankaj, M 2017, Data Lake for Enterprises, Packt Publishing Ltd, Birmingham, UK.
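The three data types above can be contrasted in a short Python sketch. This is only an illustration: the table, JSON document and free-text note are invented sample data, and the names `customer`, `doc` and `note` are assumptions rather than part of the course material.

```python
import json
import re
import sqlite3

# Structured: a fixed schema of typed columns, queried by column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (first_name TEXT, age INTEGER, income REAL)")
conn.execute("INSERT INTO customer VALUES ('Ana', 34, 72000.0)")
structured_age = conn.execute(
    "SELECT age FROM customer WHERE first_name = 'Ana'"
).fetchone()[0]

# Semi-structured: self-describing keys, but fields can be nested or missing.
doc = json.loads('{"first_name": "Ana", "contact": {"email": "ana@example.com"}}')
semi_structured_age = doc.get("age")  # None: this record has no age field

# Unstructured: no data model, so values must be inferred from the content.
note = "Spoke with Ana today; she mentioned she turned 34 last week."
match = re.search(r"\b(\d{1,3})\b", note)
unstructured_age = int(match.group(1)) if match else None

print(structured_age, semi_structured_age, unstructured_age)
```

The further the data is from a fixed schema, the more work (and guesswork) is needed to pull out a single value such as an age.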

Internal vs External Data Sources

• Sometimes called “data resources”

Internal:

• More traditionally: source data from information systems in each functional area

• More recently: access data from a data lake or hub (in the cloud) with all company data

External:

• Closed data for purchase (or used under strict agreement)

• Open data (free)

https://fermi.gsfc.nasa.gov/ssc/

Activity 2: Open Big Data Sources

Below is a list of a few ‘open big data sources’

• The World Factbook

• Amazon Web Services

• Open Government Data

• Open Data Network

• Google Public Data explorer

• Worldbank.org

• UNData

• World Census Open Data

1. In groups, go to the link below on your device/phone and open up one of the open data websites.

2. Which data source is most accessible, understandable and user friendly for a business analyst?

3. Report back to the class

https://www.investintech.com/resources/blog/archives/6304-open-data-resources.html

https://www.flickr.com/photos/notbrucelee/8016172703

How Big Business Collects Customer Data

Internally collected as:

• Sales data (transaction history, customer interaction)

• Customer feedback (e.g. Facebook)

Externally collected by:

• Directly asking

• Indirect tracking (emails, apps and third-party trackers)

• Websites, scraping, cookies and web beacons

• Adding other data sources to their own by purchasing third-party data from data companies, e.g. Acxiom and Oracle

https://www.itchronicles.com/big-data/how-do-big-companies-collect-customer-data/

Example: Google Analytics

• Google Analytics and similar tools allow you to analyse

consumer behaviour once collected

https://analytics.google.com/analytics/web/

Web data extraction tools and apps

Why do businesses use Web extraction tools?

• To aggregate news or market data

• For insurance

• For price comparisons

• To do real-time or text analytics

• To train ML models

• To generate sales leads

Data extraction tools

• Web crawling tools automatically group webpages/topics using a text analytics algorithm

• Web scraping tools automate the extraction of data from websites

Example software: mozenda.com, pattern.web, scrapy.org, octoparse.com, Python, Content Grabber, Nintex RPA, Carrot2

https://www.predictiveanalyticstoday.com/top-web-scraping-software/

https://towardsdatascience.com/https-medium-com-hiren787-patel-web-scraping-applications-a6f370d316f4
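To make "automate the extraction of data from websites" concrete, here is a minimal scraping sketch that uses only Python's standard library. It parses an inline sample page rather than a live site, so `SAMPLE_PAGE` and its values are invented, and `TableScraper` is a hypothetical helper written for this illustration; dedicated tools such as those listed above do far more.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample page standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body><h1>Rates</h1>
<table>
  <tr><th>Currency</th><th>Value</th></tr>
  <tr><td>EUR-USD</td><td>1.10</td></tr>
  <tr><td>USD-JPY</td><td>150.00</td></tr>
</table></body></html>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
header, *records = scraper.rows
print(header)
print(records)
```

The scraper keeps only the table cells and ignores the rest of the page, which is the essential difference between scraping a page and simply downloading it.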

Scraping versus Crawling

Activity 3 and 4

• We will now perform web extraction in Excel and Power BI for comparison

Activity 3: Webscraping Financial Data Into Excel

In this case we will scrape data from the Bloomberg website.

• As a group, or individually, open a new Microsoft Excel spreadsheet.

• Open a web browser and go to the Bloomberg website: https://www.bloomberg.com/markets/currencies

• Copy the URL above.

• Return to Excel and click on the ‘Data’ menu.

• In the ‘Get Data’ options, choose ‘From Other Sources’ then choose ‘From Web’.

• Paste the Bloomberg URL in here.

• Click OK and another pop-up window will prompt you for the view.

Activity 3 contd

• Choose Table 0 and click ‘Load’

• It will take a little while and then you should see your data

Activity 3 contd

• You should see the Bloomberg data

You can refresh the data by clicking here

Activity 4: Webscraping Basketball Data Using Power BI

• Open a new file in Power BI

• To connect to the data, use the “Get data” menu

• Select “Other” and then the “Web” option

• Click Connect

https://spencerbaucke.com/2020/04/29/web-scraping-in-power-bi/

Activity 4 contd

• Paste in the URL below and click OK

• https://www.basketball-reference.com/teams/CHI/1995.html

Activity 4 contd

• Scroll down and select the “Per Game” table

• Click on Transform Data

• Let’s do a small amount of transforming and visualising of the data

Activity 4 contd

• Rename column 2 to “Player” by double-clicking on its header

• Select Add Column in the top menu and then select Custom Column. Name your column Year, make the value 1995 and click OK

Activity 4 contd

• Rename the Per Game table to reflect the year of the games by clicking on the name and typing, e.g. Games 1995

• Go to the New Query menu, choose New Source and search for the 2000 data: https://www.basketball-reference.com/teams/CHI/2000.html

• Repeat the steps used for 1995; you can add in some more years if you have time

• To make a union of the datasets, go to the Home menu and on the top ribbon click on Append Queries as New

• Select the table names you want to join and click OK

• Click on Close & Apply

Visualise the results

• You should see the Appended table and can view the variables in the Fields menu

• Click on Ribbon Chart

• Drag Year to the Axis box

• Drag Player to the Legend box

• Drag Points per Game (Σ PTS/G) to the Values box

If you have time you can experiment with the formatting
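For comparison, the transform, append and summarise steps of Activity 4 can be sketched in plain Python. This is not how Power BI works internally; it only mirrors the logic of the steps. The two small tables are made-up stand-ins for the scraped “Per Game” tables, and `add_year` is a hypothetical helper.

```python
# Hypothetical stand-ins for the scraped "Per Game" tables (values are made up).
games_1995 = [
    {"Player": "Player A", "PTS/G": 20.0},
    {"Player": "Player B", "PTS/G": 10.5},
]
games_2000 = [
    {"Player": "Player B", "PTS/G": 12.0},
    {"Player": "Player C", "PTS/G": 8.5},
]


def add_year(rows, year):
    """Mirror the Custom Column step: copy each row and attach a Year value."""
    return [{**row, "Year": year} for row in rows]


# Mirror "Append Queries as New": stack the rows of both tables (a union of rows).
appended = add_year(games_1995, 1995) + add_year(games_2000, 2000)

# Mirror the chart's Values box: sum PTS/G for each (Year, Player) pair.
points = {}
for row in appended:
    key = (row["Year"], row["Player"])
    points[key] = points.get(key, 0.0) + row["PTS/G"]

for (year, player), pts in sorted(points.items()):
    print(year, player, pts)
```

The Year column is what makes the appended table usable: without it, rows from different seasons could not be told apart once they are stacked.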

Activity 5: Ethics, Privacy and Data Collection

Watch the video at:

https://www.youtube.com/watch?v=naaDBNSx610

Q1: What ethical issues have been brought up here?

Q2: Should you have to pay to keep your privacy?

Recall: the Definition of Ethics

https://en.wiktionary.org/wiki/ethics

https://theodi.org/article/data-ethics-canvas/

Ethics originates from the Greek word ēthikós (ἠθικός), which means "relating to one's character".

Activity 6: Case Study: Snowden Revelations about National Security Agency (NSA) Surveillance

In 2013 it was revealed by Edward Snowden, a former Central Intelligence Agency employee and NSA contractor, that the NSA (using an order from the Foreign Intelligence Surveillance Court) had demanded that the US telecommunications company Verizon hand over metadata from millions of Americans’ phone calls to the Federal Bureau of Investigation and the NSA.

Snowden also leaked classified documents that revealed other global surveillance programs and suggested that the so-called “PRISM” program gave the NSA direct access to the servers of some large technology companies, e.g. Facebook, Yahoo and Google.

After Snowden was charged with leaking Government information he fled to Russia and is still living there.

In groups, discuss and answer these questions:

Q1: What are possible consequences for the public, given these leaks?

Q2: Is this surveillance unethical, or should intelligence agencies be able to access all of our data for safety reasons?

Lyon, D 2014, ‘Surveillance, Snowden, and Big Data: Capacities, consequences, critique’, Big Data and Society, Jul-Dec, pp. 1-13, accessed on 26 March 2019

Where is database design heading?

• In coming weeks we will be performing

data queries in traditional (SQL, SQL in

Power BI) and non-traditional databases

(Neo4j)

• To understand why, let’s look at a brief

history of databases and compare a

simple example of old and new

Database History Chart

a-brief-history-of-databases.pdf (peterjamesthomas.com)

Database Evolution

• This evolution of database models influences how

– data is stored and manipulated by software

– you perform analyses and the output obtained (trends vs relationships vs basic stats)

• For example:

– Relational database data is usually imported into an analytics package or loaded in SQL for queries to look at trends or make predictions

– The same data would have to be reformatted for Neo4j or another graph database to look at relationships

– The way the information is linked provides different insight (see next page)

Comparison of relational and graph database set-up

Relational Database

Graph Database

https://neo4j.com/developer/graph-db-vs-rdbms/
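The difference in set-up can be sketched in Python. The relational side uses the standard-library `sqlite3` module; the graph side is only a plain dictionary standing in for a graph database such as Neo4j, not Neo4j itself. The `person` and `follows` tables and the three names are invented for illustration.

```python
import sqlite3

# Relational set-up: people and their "follows" links live in separate tables,
# and a relationship is recovered by joining on keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER);
    INSERT INTO person VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO follows VALUES (1, 2), (2, 3);
""")
sql_result = [
    row[0]
    for row in conn.execute(
        """
        SELECT p2.name
        FROM person p1
        JOIN follows f ON f.follower_id = p1.id
        JOIN person p2 ON p2.id = f.followed_id
        WHERE p1.name = 'Ana'
        """
    )
]

# Graph-style set-up: each node stores its relationships directly,
# so following a link is a lookup rather than a join.
graph = {"Ana": ["Ben"], "Ben": ["Cy"], "Cy": []}
graph_result = graph["Ana"]

# Two hops along the stored relationships (who do Ana's follows follow?).
two_hops = [far for near in graph["Ana"] for far in graph[near]]

print(sql_result, graph_result, two_hops)
```

Both set-ups hold the same facts. In the relational version each extra hop needs further joins, while in the graph version it is another step along links that are already stored.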

Case Study: Why does Instagram use a graph database?

• Business problem: Understanding customers and how they relate to each other to provide a customised service

• Instagram makes money by selling targeted adverts, just as other social media websites do

• They build profiles and determine user relationships

• Instagram collects

– your location (even by looking at the background in your photographs)

– your mobile carrier

– phone number

– IP address

– the types of products that interest you

• Hashtags on Instagram are compared with likes on Facebook and messages on Messenger

https://vpnoverview.com/privacy/social-media/what-does-instagram-know-about-me/

AI-Based Data Collection

• Examples of AI-based tools in daily life: Amazon Echo, Alexa, Google Home, Apple’s Siri, Woebot and other chatbots

• Used on the web, within apps, and on messaging

platforms

• For example, chatbots use natural language processing

and machine learning (ML) to enable them to converse

with humans (and other chatbots)

• Challenges:

– Determining the user’s emotional state

– understanding unusual accents

https://www.businessinsider.com/what-is-chatbot-talking-ai-robot-chat-simulators-2017-10/?r=AU&IR=T

Activity 7: Bias in Data Collection Using Artificial Intelligence

Veale, M & Binns, R 2017, ‘Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data’, Big Data and Society, Jul-Dec, pp. 1-17.

• Veale and Binns (2017) argue that decisions based on ML algorithms can be “unfair” and reproduce biases in the historical data used to train them.

• Methods used to handle the discrimination are

- Discrimination-aware data mining (DADM)

- Fairness, accountability and transparency in machine learning (FATML)

• Not all organisations hold information such as gender, ethnicity, disability, etc., which is required at times in order to avoid discrimination.

In small groups, discuss the merits of Veale and Binns’ three suggestions:

1. “Trusted third parties hold the necessary data for discrimination discovery.”

2. “Collaborative online platforms would allow diverse organisations to record, share and access contextual and experiential knowledge to promote fairness in machine learning systems.”

3. “Unsupervised learning and pedagogically interpretable algorithms might allow fairness hypotheses to be built for further selective testing and exploration.”

Case Study: Bottos AI Data Sharing

• Business problem: Non-transparency and large

companies having more access to AI data than smaller

companies

• Solution: Bottos created a decentralised AI data sharing

network (see how it works at https://bottos.org/?lang=en )

– The network is based on blockchain infrastructure

• Benefit: Access for more companies, data collected is

transparent and secure

https://hackernoon.com/transforming-the-ai-industry-with-next-level-data-collection-72f61b6a1b44

Best Practices for Collecting Data Responsibly

• Be transparent

✓ Explain why you require the data

✓ Offer users the opportunity to “opt-in” or “opt-out”, e.g. customer check-boxes

✓ Make your data acquisition and sharing policy easy to understand

• Offer some benefit/reward for the user data

• Securely collect/transport data

• Understand government regulations (e.g. Australian privacy laws,

GDPR)

• Ensure data integrity and accuracy and keep data up-to-date

https://rtslabs.com/8-best-practices-for-collecting-data-responsibly/
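As a small illustration of the opt-in practice above, the sketch below keeps only records whose owner ticked a consent check-box. The `SignupRecord` class, the `marketing_opt_in` field and the sample addresses are all hypothetical; a real system would also need to handle the other practices on this slide, such as secure transport and regulatory requirements.

```python
from dataclasses import dataclass


@dataclass
class SignupRecord:
    email: str
    marketing_opt_in: bool  # hypothetical field: state of the customer check-box


def records_allowed_for_marketing(records):
    """Keep only records whose owner explicitly opted in."""
    return [r for r in records if r.marketing_opt_in]


# Invented sample data.
signups = [
    SignupRecord("ana@example.com", True),
    SignupRecord("ben@example.com", False),
]
allowed = records_allowed_for_marketing(signups)
print([r.email for r in allowed])
```

Storing the consent flag next to the data makes it straightforward to check, at the point of use, that each record may be used for that purpose.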

Exit Activity: Mind Map

Help your teacher construct a mind map of today’s content on

data sources, ethical and best practice in data acquisition.

Data