week 8-data science

DATA SCIENCE AND BIG DATA ANALYTICS

CHAPTER 1: INTRODUCTION TO BIG DATA ANALYTICS

1.1 BIG DATA OVERVIEW

• Industries that gather and exploit data

• Credit card companies monitor purchases

• Good at identifying fraudulent purchases

• Mobile phone companies analyze calling patterns – e.g., even on rival networks

• Look for customers who might switch providers

• For social networks, data is the primary product

• Intrinsic value increases as data grows

ATTRIBUTES DEFINING BIG DATA CHARACTERISTICS

• Huge volume of data

• Not just thousands/millions, but billions of items

• Complexity of data types and structures

• Variety of sources, formats, structures

• Speed of new data creation and growth

• High velocity, rapid ingestion, fast analysis

SOURCES OF BIG DATA DELUGE

• Mobile sensors – GPS, accelerometer, etc.

• Social media – 700 Facebook updates/sec in 2012

• Video surveillance – street cameras, stores, etc.

• Video rendering – processing video for display

• Smart grids – gather and act on information

• Geophysical exploration – oil, gas, etc.

• Medical imaging – reveals internal body structures

• Gene sequencing – more prevalent and less expensive; healthcare would like to predict personal illnesses

EXAMPLE: GENOTYPING FROM 23ANDME.COM

1.1.1 DATA STRUCTURES: CHARACTERISTICS OF BIG DATA

• Structured – defined data type, format, structure

• Transactional data, OLAP cubes, RDBMS, CSV files, spreadsheets

• Semi-structured

• Text data with discernable patterns – e.g., XML data

• Quasi-structured

• Text data with erratic data formats – e.g., clickstream data

• Unstructured

• Data with no inherent structure – text docs, PDFs, images, video
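The difference between structured and semi-structured data can be sketched in a few lines of Python. This is a minimal illustration, not material from the chapter: the order records, field names, and helper functions are all hypothetical. The CSV text has a fixed schema, so every row is guaranteed to have an amount; the XML text is self-describing but lets a record omit a field, so the reader has to check for it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical structured data: every row has the same columns.
CSV_TEXT = "order_id,customer,amount\n1,alice,19.99\n2,bob,5.00\n"

# Hypothetical semi-structured data: tags describe the fields,
# but an individual record may leave a field out.
XML_TEXT = """<orders>
  <order id="1"><customer>alice</customer><amount>19.99</amount></order>
  <order id="2"><customer>bob</customer></order>
</orders>"""


def total_from_csv(text):
    """Sum the amount column; the fixed schema means it is always present."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(float(row["amount"]) for row in reader)


def total_from_xml(text):
    """Sum <amount> elements, skipping orders that omit the field."""
    root = ET.fromstring(text)
    total = 0.0
    for order in root.findall("order"):
        amount = order.find("amount")
        if amount is not None:
            total += float(amount.text)
    return total


if __name__ == "__main__":
    print(total_from_csv(CSV_TEXT))
    print(total_from_xml(XML_TEXT))
```

The structured reader needs no defensive checks, while the semi-structured reader must handle the missing field; quasi-structured and unstructured data push even more of that work onto the analyst.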

EXAMPLE OF STRUCTURED DATA

EXAMPLE OF SEMI-STRUCTURED DATA

EXAMPLE OF QUASI-STRUCTURED DATA

VISITING 3 WEBSITES ADDS 3 URLS TO USER’S LOG FILES
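As a rough sketch of why clickstream data counts as quasi-structured, the Python below pulls the visited hosts out of three log lines whose layout differs from line to line. The log lines, the `extract_hosts` helper, and the example.com-style URLs are hypothetical and chosen only for illustration; real clickstream logs vary by browser and server.

```python
from urllib.parse import urlparse

# Hypothetical clickstream log: three visits, three different line formats.
LOG_LINES = [
    "2014-01-05 10:02:11 GET https://www.example.com/search?q=big+data",
    "visited http://news.example.org/articles/42 ref=homepage",
    "10:05 https://shop.example.net/cart",
]


def extract_hosts(lines):
    """Return the host of every URL found in the log lines, in order."""
    hosts = []
    for line in lines:
        # No fixed columns exist, so scan each token for something URL-like.
        for token in line.split():
            if token.startswith(("http://", "https://")):
                hosts.append(urlparse(token).netloc)
    return hosts


if __name__ == "__main__":
    print(extract_hosts(LOG_LINES))
```

The URLs themselves follow a pattern, but the text around them does not, so the extraction step has to search rather than read a known column.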

EXAMPLE OF UNSTRUCTURED DATA: VIDEO ABOUT ANTARCTICA EXPEDITION

1.1.2 TYPES OF DATA REPOSITORIES FROM AN ANALYST PERSPECTIVE

1.2 STATE OF THE PRACTICE IN ANALYTICS

• Business Intelligence (BI) versus Data Science

• Current Analytical Architecture

• Drivers of Big Data

• Emerging Big Data Ecosystem and a New Approach to Analytics

BUSINESS DRIVERS FOR ADVANCED ANALYTICS

1.2.1 BUSINESS INTELLIGENCE (BI) VERSUS DATA SCIENCE

1.2.2 CURRENT ANALYTICAL ARCHITECTURE

TYPICAL ANALYTIC ARCHITECTURE

• Data sources must be well understood

• EDW – Enterprise Data Warehouse

• Applications read data from the EDW

• Data scientists get data for downstream analytics processing

1.2.3 DRIVERS OF BIG DATA: DATA EVOLUTION & RISE OF BIG DATA SOURCES

1.2.4 EMERGING BIG DATA ECOSYSTEM AND A NEW APPROACH TO ANALYTICS

• Four main groups of players

• Data devices

• Games, smartphones, computers, etc.

• Data collectors

• Phone and TV companies, Internet, Gov’t, etc.

• Data aggregators – make sense of data

• Websites, credit bureaus, media archives, etc.

• Data users and buyers

• Banks, law enforcement, marketers, employers, etc.

1.3 KEY ROLES FOR THE NEW BIG DATA ECOSYSTEM

1. Deep analytical talent

• Advanced training in quantitative disciplines – e.g., math, statistics, machine learning

2. Data savvy professionals

• Savvy but less technical than group 1

3. Technology and data enablers

• Support people – e.g., DB admins, programmers, etc.

THREE KEY ROLES OF THE NEW BIG DATA ECOSYSTEM

THREE RECURRING DATA SCIENTIST ACTIVITIES

1. Reframe business challenges as analytics challenges

2. Design, implement, and deploy statistical models and data mining techniques on Big Data

3. Develop insights that lead to actionable recommendations

PROFILE OF A DATA SCIENTIST: FIVE MAIN SETS OF SKILLS

• Quantitative skill – e.g., math, statistics

• Technical aptitude – e.g., software engineering, programming

• Skeptical mindset and critical thinking – ability to examine work critically

• Curious and creative – passionate about data and finding creative solutions

• Communicative and collaborative – can articulate ideas, can work with others

1.4 EXAMPLES OF BIG DATA ANALYTICS

• Retailer Target

• Uses life events: marriage, divorce, pregnancy

• Apache Hadoop

• Open source Big Data infrastructure innovation

• MapReduce paradigm, ideal for many projects

• Social Media Company LinkedIn

• Social network for working professionals

• Can graph a user’s professional network

• 250 million users in 2014
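The MapReduce paradigm mentioned under Apache Hadoop can be illustrated with the classic word-count task. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it does not use Hadoop or any of its APIs, and the function names are hypothetical. In a real cluster, the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

Because each document is mapped independently and each key is reduced independently, the work splits naturally across many machines, which is what makes the paradigm a good fit for very large data sets.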

DATA VISUALIZATION OF USER’S SOCIAL NETWORK USING INMAPS

SUMMARY

• Big Data comes from myriad sources

• Social media, sensors, IoT, video surveillance, and sources only recently considered

• Companies are finding creative and novel ways to use Big Data

• Exploiting Big Data opportunities requires

• New data architectures

• New machine learning algorithms, ways of working

• People with new skill sets

• Always Review Chapter Exercises

FOCUS OF COURSE

• Focus on quantitative disciplines – e.g., math, statistics, machine learning

• Provide overview of Big Data analytics

• In-depth study of several key algorithms