Report on Unstructured data management
Data 4200
Data Acquisition and
Management
Lesson 1
Data Sources, Ethical and Best
Practice in Data Acquisition
Copyright Notice
COPYRIGHT COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan Higher
Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in
this communication may be subject to copyright under the Act. Any further reproduction
or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice
Roadmap
• In this subject we want you to understand how data collection, storage, sampling and analysis are evolving
• This course will provide you with hands-on experience in SQL and Power BI
• There will also be introductory-level Neo4j and Python
Lesson Learning Outcomes
1 Investigate data types, data sources and acquisition
2 Explore open data sources
3 Perform some simple web extraction exercises
4 Analyse best practice, ethics and challenges in relation to data acquisition
5 Review the evolution of databases and example database models
This Topic’s Big Idea
“Web users ultimately want to get at data
quickly and easily. They don’t care as much
about attractive sites and pretty design.” Tim Berners-Lee
Activity 1: Many Ways to Acquire and
Store Data
• Watch the video at:
https://www.coursera.org/lecture/big-data-introduction/step-1-acquiring-data-4SI3T
• Answer the questions below.
Q1: It is said that leaving out data
can lead to errors in your results,
however is it realistic to think that
all of the data we need will be
available if we search hard
enough?
Q2: Name a software package that
can be used to acquire text files.
Ways of Distinguishing Between
Data Types
Many ways of categorising the data you acquire
• Primary, secondary
• Quantitative, qualitative
• Raw data, metadata
• Structured, semi-structured, unstructured
• Internal versus external
https://lcolumbus.files.wordpress.com/2012/03/image-for-data-center-forecast.jpg
Recall Data Types…
• Structured data is formatted for use, has a well-defined data
structure and is generally stored in rows and columns
- e.g. Age (in years), first name (text), address (text), income ($),
etc.
• Semi-structured data has some structure
- e.g. CSV files with comma-separated data; XML and JavaScript
Object Notation (JSON) documents used to exchange data
to/from a web server.
• Unstructured data has no predefined data model, is not organised and
may contain multiple types of data
- e.g. Data from thermostats, sensors, home electronic devices and
cars; images, sounds and PDF files
Tomcy, J & Pankaj, M 2017, Data Lake for Enterprises, Packt publishing Ltd, Birmingham, UK.
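The difference between these formats can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample records and field names (`first_name`, `age`, `income`, `devices`) are made up for the example and do not come from any real data set.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this sketch.
CSV_TEXT = "first_name,age,income\nAlex,34,72000\nSam,41,58000\n"
JSON_TEXT = '{"first_name": "Alex", "age": 34, "devices": [{"type": "thermostat"}]}'
XML_TEXT = "<person><first_name>Sam</first_name><age>41</age></person>"

# CSV: delimited text; the header row names the columns.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
ages = [int(row["age"]) for row in rows]

# JSON: keys describe the values, nesting is allowed and fields may vary
# from one record to the next.
record = json.loads(JSON_TEXT)
device_types = [d["type"] for d in record.get("devices", [])]

# XML: tags carry the structure.
person = ET.fromstring(XML_TEXT)
xml_age = int(person.findtext("age"))

print(ages, device_types, xml_age)
```

Unstructured data such as images or sound files has no equivalent one-line parser, which is part of what makes it harder to manage.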
Internal vs External Data Sources
• Sometimes called “data resources”
Internal:
• More traditionally: source data from information systems
in each functional area
• More recently: access data from a data lake or hub
(in the cloud) with all company data
External:
• Closed data for purchase (or used under strict agreement)
• Open data (free)
https://fermi.gsfc.nasa.gov/ssc/
Activity 2: Open Big Data Sources
Below is a list of a few ‘open big data sources’
• The World Factbook
• Amazon Web Services
• Open Government Data
• Open Data Network
• Google Public Data Explorer
• Worldbank.org
• UNData
• World Census Open Data
1. In groups, go to the link below on your device/phone and open up one of the
open data websites.
2. Which data source is most accessible, understandable and user friendly for a
business analyst?
3. Report back to the class
https://www.investintech.com/resources/blog/archives/6304-open-data-resources.html
https://www.flickr.com/photos/notbrucelee/8016172703
How Big Business Collects
Customer Data
Internally collected as:
• Sales data (transaction history, customer interaction)
• Customer feedback (e.g. Facebook)
Externally collected by:
• Directly asking
• Indirect tracking (emails, apps and third-party trackers)
• Websites, scraping, cookies and web beacons
• Adding other data sources to their own by purchasing
third party data from data companies, e.g. Acxiom and Oracle
https://www.itchronicles.com/big-data/how-do-big-companies-collect-customer-data/
Example: Google Analytics
• Google Analytics and similar tools allow you to analyse
consumer behaviour once collected
https://analytics.google.com/analytics/web/
Web data extraction tools and apps
Why do businesses use Web extraction tools?
• To aggregate new or market data
• For insurance
• For price comparisons
• To do real-time or text analytics
• To train ML models
• To generate sales leads
Data extraction tools
• Web crawling tools automatically follow links to discover webpages, and some
group the pages/topics using a text analytics algorithm
• Web scraping tools automate the extraction of data from websites
Example software: mozenda.com, pattern.web, scrapy.org,
octoparse.com, Python, Content Grabber, Nintex RPA, Carrot2
https://www.predictiveanalyticstoday.com/top-web-scraping-software/
https://towardsdatascience.com/https-medium-com-hiren787-patel-web-scraping-applications-a6f370d316f4
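The difference between crawling and scraping can be sketched with Python's standard `html.parser` module. This is an illustration only, not how any of the tools listed above work internally. The `SITE` dictionary is a hypothetical in-memory stand-in for a website, so no network access is needed, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML. No network access is used.
SITE = {
    "/": '<a href="/prices">Prices</a> <a href="/about">About</a>',
    "/prices": '<span class="price">19.99</span> <a href="/">Home</a>',
    "/about": "<p>About us</p>",
}


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (the crawling step)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceParser(HTMLParser):
    """Collects the text inside <span class="price"> (the scraping step)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))


def crawl(start):
    """Breadth-first walk over links, returning pages in visit order."""
    seen, queue = [start], [start]
    while queue:
        page = queue.pop(0)
        parser = LinkParser()
        parser.feed(SITE[page])
        parser.close()
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


def scrape_prices(page):
    """Extract the price values from a single page."""
    parser = PriceParser()
    parser.feed(SITE[page])
    parser.close()
    return parser.prices


pages = crawl("/")
prices = [p for page in pages for p in scrape_prices(page)]
print(pages, prices)
```

Crawling answers "which pages exist?", while scraping answers "what data is on this page?". Real projects should also check a site's terms of use before doing either.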
Scraping versus Crawling
Activity 3 and 4
• We will now perform web extraction in
Excel and Power BI for comparison
Activity 3: Webscraping Financial Data
Into Excel
In this case we will scrape data from the Bloomberg website.
• As a group, or individually, open a new Microsoft Excel
spreadsheet.
• Open a web browser and go to the Bloomberg website:
https://www.bloomberg.com/markets/currencies
• Copy the URL above.
• Return to Excel and click on the ‘Data’ menu.
• In the ‘Get Data’ options, choose ‘From Other Sources’
then choose ‘From Web’.
• Paste the Bloomberg URL in here
• Click OK and another pop-up window will prompt you for the view
Activity 3 contd
• Choose Table 0 and click ‘Load’
• It will take a little while and then you should see your data
Activity 3 contd
• You should see the Bloomberg data
You can refresh the data by clicking here
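What the ‘From Web’ import does can be approximated in Python: find every table on a page, number them from 0 and turn the chosen one into rows. The sketch below uses only the standard library and is not Excel's actual implementation. `PAGE` is hypothetical content standing in for a downloaded page, and the currency codes and values are placeholders, not real market data.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
# The currency codes and values are placeholders, not real market data.
PAGE = """
<table>
  <tr><th>Currency</th><th>Value</th></tr>
  <tr><td>AAA-BBB</td><td>1.25</td></tr>
  <tr><td>CCC-DDD</td><td>0.80</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>Second table</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one entry per <table>, in page order
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.in_cell = True
            self.tables[-1][-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            # Tidy the finished cell and stop collecting text.
            self.tables[-1][-1][-1] = self.tables[-1][-1][-1].strip()
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.tables[-1][-1][-1] += data


parser = TableParser()
parser.feed(PAGE)
parser.close()

# "Table 0" is simply the first table found on the page.
header, *body = parser.tables[0]
records = [dict(zip(header, row)) for row in body]
print(len(parser.tables), records)
```

This sketch does not handle nested tables or merged cells, which dedicated tools deal with for you.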
Activity 4: Webscraping Basketball data
Using Power BI
• Open a new file in Power BI
• To connect to the data, use
the “Get data” menu
• Select “Other” and then the
“Web” option
• Click Connect
https://spencerbaucke.com/2020/04/29/web-scraping-in-power-bi/
Activity 4 contd
• Paste in the URL below and click on OK
• https://www.basketball-reference.com/teams/CHI/1995.html
Activity 4 contd
• Scroll down and select the “Per Game” table
• Click on Transform Data
• Let’s do a small amount of transforming and visualising of the data
Activity 4 contd
• Rename column 2 to “Player” by double-clicking on the column header
• Select Add Column in the top menu and then select Custom
Column. Name your column Year, set the value to 1995 and click on OK
Activity 4 contd
• Rename the Per Game table to reflect the year of games, by clicking on the
name and typing, e.g. Games 1995
• Go to the New Query menu, select New Source and search for the 2000 data
https://www.basketball-reference.com/teams/CHI/2000.html
• Repeat the steps you followed for 1995; you can add in some more
years if you have time
• To make a union of the datasets, go to the Home menu and on the top
ribbon click on Append Queries as New
• Select the table names you want to join
and click OK
• Click on Close & Apply
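The two transformations in this activity, adding a constant Year column and appending the yearly tables, can be sketched in plain Python to show what they do to the rows. This is an illustration, not what Power BI runs internally. The rows, player names and points values are placeholders rather than real statistics, and `add_year` is a name invented for the example.

```python
# Hypothetical rows standing in for two scraped "Per Game" tables.
# Player names and points values are placeholders, not real statistics.
games_1995 = [
    {"Player": "Player A", "PTS/G": 20.0},
    {"Player": "Player B", "PTS/G": 10.0},
]
games_2000 = [
    {"Player": "Player A", "PTS/G": 15.0},
    {"Player": "Player C", "PTS/G": 5.0},
]


def add_year(rows, year):
    """Return copies of the rows with a constant Year column added."""
    return [{**row, "Year": year} for row in rows]


# Appending stacks the rows of both tables into one new table (a union of
# rows); the original tables are left unchanged.
appended = add_year(games_1995, 1995) + add_year(games_2000, 2000)

# Total points per game by year, the kind of summary a chart would plot.
totals = {}
for row in appended:
    totals[row["Year"]] = totals.get(row["Year"], 0.0) + row["PTS/G"]

print(len(appended), totals)
```

The Year column is what lets the appended rows be told apart afterwards, which is why it is added before the tables are combined.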
Visualise the results
• You should see the Appended table and can view the variables in
the Fields menu
• Click on Ribbon Chart
• Drag Year to the Axis box
• Drag Player to the Legend box
• Drag Points per Game (Σ PTS/G) to the Values box
If you have time you can experiment with the formatting
Activity 5: Ethics, Privacy and Data
Collection
Watch the video at:
https://www.youtube.com/watch?v=naaDBNSx610
Q1: What ethical issues have been brought up here?
Q2: Should you have to pay to keep your privacy?
Recall: the Definition of Ethics
https://en.wiktionary.org/wiki/ethics
https://theodi.org/article/data-ethics-canvas/
Ethics originates from the Greek word ēthikós (ἠθικός), which
means "relating to one's character".
Activity 6: Case Study
Snowden Revelations about National Security Agency (NSA) surveillance
In 2013 it was revealed by Edward Snowden, a former Central Intelligence Agency
employee then working as an NSA contractor, that the NSA (using an order from the
Foreign Intelligence Surveillance Court) had demanded that the US telecommunications
company Verizon hand over metadata from millions of Americans’ phone calls to the
Federal Bureau of Investigation and the NSA.
Snowden also revealed other global surveillance programs, leaked classified
documents, and suggested that the so-called “PRISM” program gave the NSA
direct access to servers of some large technology companies, e.g. Facebook,
Yahoo and Google.
After Snowden was charged with leaking Government information he fled to Russia
and is still living there.
In groups, discuss and answer these questions:
Q1: What are possible consequences for the public, given these leaks?
Q2: Is this surveillance unethical or should intelligence agencies be able to access
all of our data for safety reasons?
Lyon, D 2014, ‘Surveillance, Snowden, and Big Data: Capacities, consequences, critique’, Big Data &
Society, Jul-Dec, pp. 1-13, accessed on 26 March 2019
Where is database design heading?
• In coming weeks we will be performing
data queries in traditional (SQL, SQL in
Power BI) and non-traditional databases
(Neo4j)
• To understand why, let’s look at a brief
history of databases and compare a
simple example of old and new
Database History Chart
a-brief-history-of-databases.pdf (peterjamesthomas.com)
Database Evolution
• This evolution of database models influences how
– data is stored and manipulated by software
– you perform analyses and the output obtained (trends vs
relationships vs basic stats)
• For example:
– relational database data is usually imported into an analytics
package or queried with SQL to look at trends or make
predictions
– the same data would have to be reformatted for Neo4j or another
graph database to look at relationships
– the way the information is linked provides different insights
(see next page)
Comparison of relational and graph
database set-up
Relational Database
Graph Database
https://neo4j.com/developer/graph-db-vs-rdbms/
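The contrast between the two set-ups can be sketched in Python. The relational half uses the standard `sqlite3` module. The graph half uses a plain dictionary as a stand-in for a graph database, so it shows the idea of stored relationships but not Neo4j itself or its query language. The people and "follows" relationships are invented for the example.

```python
import sqlite3

# Hypothetical people and "follows" relationships, invented for this sketch.
PEOPLE = [(1, "Ana"), (2, "Ben"), (3, "Cy")]
FOLLOWS = [(1, 2), (2, 3)]  # (follower_id, followed_id)

# Relational set-up: two tables, with relationships recovered through JOINs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", PEOPLE)
conn.executemany("INSERT INTO follows VALUES (?, ?)", FOLLOWS)

sql_result = conn.execute(
    """
    SELECT p2.name
    FROM person AS p1
    JOIN follows AS f ON f.follower_id = p1.id
    JOIN person AS p2 ON p2.id = f.followed_id
    WHERE p1.name = ?
    """,
    ("Ana",),
).fetchall()
conn.close()

# Graph-style set-up: each node stores its outgoing relationships directly.
names = dict(PEOPLE)
graph = {person_id: [] for person_id in names}
for follower, followed in FOLLOWS:
    graph[follower].append(followed)


def reachable(start):
    """Names reachable from start by following relationships (depth-first)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return sorted(names[n] for n in seen)


print(sql_result, reachable(1))
```

In the relational version each extra "hop" (friends of friends, and so on) needs another JOIN, whereas the graph version simply keeps following stored relationships.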
Case Study: Why does Instagram
use a graph database?
• Business problem: Understanding customers and how they relate to each other to provide a customised service
• Instagram makes money by selling targeted adverts, just like other social media websites
• They build profiles and determine user relationships
• Instagram collects
– your location (even by looking at the background in your photographs)
– your mobile carrier
– phone number
– IP address
– the types of products that interest you
• Hashtags on Instagram are compared with likes on Facebook and messages on Messenger
https://vpnoverview.com/privacy/social-media/what-does-instagram-know-about-me/
AI-Based Data Collection
• Examples of AI-based tools in daily life: Amazon
Echo and Alexa, Google Home, Apple’s Siri, Woebot
and other chatbots
• Used on the web, within apps, and on messaging
platforms
• For example, chatbots use natural language processing
and machine learning (ML) to enable them to converse
with humans (and other chatbots)
• Challenges:
– Determining the user’s emotional state
– understanding unusual accents
https://www.businessinsider.com/what-is-chatbot-talking-ai-robot-chat-simulators-2017-10/?r=AU&IR=T
Activity 7: Bias in Data Collection Using Artificial
Intelligence
Veale, M & Binns, R 2017, ‘Fairer machine learning in the real world: Mitigating discrimination without
collecting sensitive data’, Big Data & Society, Jul-Dec, pp. 1-17.
• Veale and Binns (2017) argue that decisions based on ML algorithms can be “unfair” and reproduce
biases in the historical data used to train them.
• Approaches used to handle the discrimination are
- Discrimination-aware data mining (DADM)
- Fairness, accountability and transparency in machine learning (FATML)
• Not all organisations hold information such as gender, ethnicity, disability, etc., which are required at
times in order to avoid discrimination.
In small groups, discuss the merits of Veale and Binns’ three suggestions
1. “Trusted third parties hold the necessary data for discrimination discovery.”
2. “Collaborative online platforms would allow diverse organisations to record, share and access
contextual and experiential knowledge to promote fairness in machine learning systems.”
3. “Unsupervised learning and pedagogically interpretable algorithms might allow fairness hypotheses to
be built for further selective testing and exploration.”
Case Study: Bottos AI Data Sharing
• Business problem: Non-transparency and large
companies having more access to AI data than smaller
companies
• Solution: Bottos created a decentralised AI data sharing
network (see how it works at https://bottos.org/?lang=en )
– The network is based on blockchain infrastructure
• Benefit: Access for more companies, data collected is
transparent and secure
https://hackernoon.com/transforming-the-ai-industry-with-next-level-data-collection-72f61b6a1b44
Best Practices for Collecting Data
Responsibly
• Be transparent
✓ Explain why you require the data
✓ Offer users the opportunity to “opt-in” or “opt-out”, e.g. customer
check-boxes
✓ Make your data acquisition and sharing policy easy to understand
• Offer some benefit/reward for the user data
• Securely collect/transport data
• Understand government regulations (e.g. Australian privacy laws,
GDPR)
• Ensure data integrity and accuracy and keep data up-to-date
https://rtslabs.com/8-best-practices-for-collecting-data-responsibly/
Exit Activity: Mind Map
Help your teacher construct a mind map of today’s content on
data sources, ethical and best practice in data acquisition.