week 8-data science

profilerav
Chapter_10.pptx

Data Science and Big Data Analytics

Chapter 10: Adv. Analytics – Tech & Tools: MapReduce and Hadoop

1

Chapter Contents

10.1 Analytics for Unstructured Data

10.1.1 Use Cases

10.1.2 MapReduce

10.1.3 Apache Hadoop

10.2 The Hadoop Ecosystem

10.2.1 Pig

10.2.2 Hive

10.2.3 HBase

10.2.4 Mahout

10.3 NoSQL

Summary

2

10.1 Analytics for Unstructured Data

This chapter presents some key technologies and tools related to the Apache Hadoop software library

Hadoop stores data in in a distributed system

Hadoop implements a simple programming paradigm known as MapReduce

Tools of the Hadoop ecosystem

Pig

Hive

HBase

Mahout

3

10.1 Analytics for Unstructured Data 10.1.1 Use Cases

IBM Watson – Jeopardy playing machine

To educate Watson, Hadoop was utilized to process data sources

Encyclopedias, dictionaries, news wire feeds, literature, Wikipedia, etc.

LinkedIn – network of over 250 million users in 200 countries

Hadoop is used to process daily transaction logs, examine users’ activities, feed extracted data back to production systems, restructure the data, develop and test analytic models

Yahoo! – large Hadoop deployment

Search index creation and maintenance, Webpage content optimization, spam filters, etc.

4

10.1 Analytics for Unstructured Data 10.1.2 MapReduce

The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output

Map

Applies an operation to a piece of data

Provides some intermediate output

Reduce

Consolidates the intermediate outputs from the map steps

Provides the final output

Each step uses key/value pairs, denoted <key, value> as input and output

5

10.1 Analytics for Unstructured Data 10.1.2 MapReduce

MapReduce word count example

6

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

MapReduce is a simple paradigm to understand but not easy to implement

Executing a MapReduce job requires

Jobs scheduled based on system’s workload

Input data spread across cluster of machines

Map step spread across distributed system

Intermediate outputs collected and provided to proper machines for reduce step

Final output made available to another user, another application, or another MapReduce job

Next few slides present overview of Hadoop environment

7

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Hadoop Distributed File System (HDFS)

File system that distributes data across a cluster to take advantage of the parallel processing of MapReduce

HDFS uses three Java daemons (background processors)

NameNode – determines and tracks where various blocks of data are stored

DataNode – manages the data stored on each machine

Secondary NameNode – performs some of the NameNode tasks to reduce the load on NameNode

8

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Structuring a MapReduce Job in Hadoop

A typical MapReduce Java program has three classes

Driver – provides details such as input file locations, names of mapper and reducer classes, location of reduce class output, etc.

Mapper – provides logic to process each data block

Reducer – reduces the data provided by the mapper

9

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Structuring a MapReduce Job in Hadoop

A file stored in HDFS

10

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Additional Considerations in Structuring a MapReduce Job

Shuffle and Sort

11

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Using a combiner

12

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Using a custom partitioner

13

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Developing and Executing a Hadoop MapReduce Program

Common practice is to use an IDE tool such as Eclipse

The MapReduce program consists of three Java files

Driver code, map code, and reduce code

Java code is compiled and stored in a JAR file and executed against the specified HDFS input files

14

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop

Yet Another Resource Negotiator (YARN)

Hadoop continues to undergo development

An important one was to separate the MapReduce functionality from the management of running jobs and the distributed environment

This rewrite is sometimes called

Yet Another Resource Negotiator (YARN)

15

10.2 The Hadoop Ecosystem

Tools have been developed to make Hadoop easier to use and provide additional functionality and features

Pig – provides a high-level data-flow programming language

Hive – provides SQL-like access

HBase – provides real-time reads and writes

Mahout – provides analytical tools

16

10.2 The Hadoop Ecosystem 10.2.1 Pig

Pig consists of

A data flow language called Pig Latin

An environment to execute the Pig code

Example of Pig commands

17

10.2 The Hadoop Ecosystem 10.2.1 Pig – Built-in Pig Functions

18

10.2 The Hadoop Ecosystem 10.2.2 Hive

The Hive language, Hive Query Language (HiveQL), resembles SQL rather than a scripting language

Example Hive code

19

10.2 The Hadoop Ecosystem 10.2.3 HBase

Unlike Pig and Hive, intended for batch applications, HBase can provide real-time read and write access to huge datasets

Example - Choosing a shipping address at checkout

20

10.2 The Hadoop Ecosystem 10.2.4 Mahout

Mahout permits the application of analytical techniques within the Hadoop environment

Mahout provides Java code that implements the usual classification, clustering, and recommenders/collaborative filtering algorithms

Users can download Apache Hadoop directly from www.apache.org or use commercial packages

For example, the Pivital company provides Pivital HD Enterprise

21

10.2 The Hadoop Ecosystem 10.2.4 Mahout

Components of Pivotal HD Enterprise

22

10.3 NoSQL

NoSQL = Not only Structured Query Language

Four major categories of NoSQL tools

Key/value stores – contains data (the value) accessed by the key

Document stores – good when the value of the key/value pair is a file

Column family stores – good for sparse datasets

Graph stores – good for items and relationships between them

Social networks like Facebook and LinkedIn

23

10.3 NoSQL

Examples of NoSQL Data Stores

24