week 8-data science
Data Science and Big Data Analytics
Chapter 10: Advanced Analytics – Technology and Tools: MapReduce and Hadoop
Chapter Contents
10.1 Analytics for Unstructured Data
10.1.1 Use Cases
10.1.2 MapReduce
10.1.3 Apache Hadoop
10.2 The Hadoop Ecosystem
10.2.1 Pig
10.2.2 Hive
10.2.3 HBase
10.2.4 Mahout
10.3 NoSQL
Summary
10.1 Analytics for Unstructured Data
This chapter presents some key technologies and tools related to the Apache Hadoop software library
Hadoop stores data in a distributed system
Hadoop implements a simple programming paradigm known as MapReduce
Tools of the Hadoop ecosystem
Pig
Hive
HBase
Mahout
10.1 Analytics for Unstructured Data 10.1.1 Use Cases
IBM Watson – Jeopardy!-playing machine
To educate Watson, Hadoop was utilized to process data sources
Encyclopedias, dictionaries, news wire feeds, literature, Wikipedia, etc.
LinkedIn – network of over 250 million users in 200 countries
Hadoop is used to process daily transaction logs, examine users’ activities, feed extracted data back to production systems, restructure the data, develop and test analytic models
Yahoo! – large Hadoop deployment
Search index creation and maintenance, Webpage content optimization, spam filters, etc.
10.1 Analytics for Unstructured Data 10.1.2 MapReduce
The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output
Map
Applies an operation to a piece of data
Provides some intermediate output
Reduce
Consolidates the intermediate outputs from the map steps
Provides the final output
Each step uses key/value pairs, denoted <key, value> as input and output
MapReduce word count example
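The word count example can be sketched in plain Python. This is only a single-machine simulation of the paradigm, not Hadoop code; the function names (`map_words`, `reduce_counts`, `word_count`) and the sample sentences are made up for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a <word, 1> pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: consolidate all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    # Run the map step on each line (Hadoop would run these in parallel).
    intermediate = []
    for line in lines:
        intermediate.extend(map_words(line))

    # Group the intermediate values by key before reducing.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Run the reduce step once per distinct key.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


result = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
```

Every step takes and returns <key, value> pairs: the map step emits <word, 1>, and the reduce step emits <word, total>.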
10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop
MapReduce is a simple paradigm to understand but not easy to implement
Executing a MapReduce job requires
Jobs scheduled based on system’s workload
Input data spread across cluster of machines
Map step spread across distributed system
Intermediate outputs collected and provided to proper machines for reduce step
Final output made available to another user, another application, or another MapReduce job
Next few slides present overview of Hadoop environment
Hadoop Distributed File System (HDFS)
File system that distributes data across a cluster to take advantage of the parallel processing of MapReduce
HDFS uses three Java daemons (background processes)
NameNode – determines and tracks where various blocks of data are stored
DataNode – manages the data stored on each machine
Secondary NameNode – performs some of the NameNode tasks to reduce the load on NameNode
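The division of labor between NameNode and DataNodes can be illustrated with a toy model. This is a sketch under loud assumptions: the block size, the node names, the replication factor, and the round-robin placement rule are all invented for illustration and do not reflect how HDFS actually chooses block locations.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes, replication):
    """Toy NameNode bookkeeping: record which DataNodes hold each block.

    Placement is simple round-robin, an assumption made for this sketch.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 10-byte "file" with a 4-byte block size and 3 replicas per block.
blocks = split_into_blocks(b"x" * 10, block_size=4)
block_map = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

for block_id, nodes in block_map.items():
    print(block_id, nodes)
```

In this model the `block_map` dictionary plays the NameNode's role (tracking where blocks live), while the node names stand in for the DataNodes that store the bytes.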
Structuring a MapReduce Job in Hadoop
A typical MapReduce Java program has three classes
Driver – provides details such as input file locations, names of the mapper and reducer classes, location of the reducer output, etc.
Mapper – provides logic to process each data block
Reducer – reduces the data provided by the mapper
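The three roles can be mirrored in a small Python analogue. A real Hadoop job would implement these as Java classes against the Hadoop API; the class and method names below are simplified stand-ins chosen for this sketch, and the line index is used as the mapper's input key purely as an assumption.

```python
from collections import defaultdict


class Mapper:
    """Holds the logic applied to each input record."""

    def map(self, key, value):
        # key: position of the line in the input (unused here); value: the line text.
        for word in value.split():
            yield word, 1


class Reducer:
    """Consolidates all values the mappers emitted for one key."""

    def reduce(self, key, values):
        yield key, sum(values)


class Driver:
    """Wires the job together: which mapper, which reducer, which input."""

    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def run(self, input_lines):
        grouped = defaultdict(list)
        for offset, line in enumerate(input_lines):
            for key, value in self.mapper.map(offset, line):
                grouped[key].append(value)

        output = {}
        for key in sorted(grouped):
            for out_key, out_value in self.reducer.reduce(key, grouped[key]):
                output[out_key] = out_value
        return output


job_output = Driver(Mapper(), Reducer()).run(["a b a", "b c"])
print(job_output)
```

Keeping the driver separate means the same mapper and reducer can be reused with different inputs or output locations.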
A file stored in HDFS
Additional Considerations in Structuring a MapReduce Job
Shuffle and Sort
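Shuffle and sort is the step between map and reduce: intermediate pairs from all mappers are gathered, ordered by key, and grouped so that each reducer call sees every value for one key. A minimal single-process sketch, with hypothetical mapper outputs:

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge pairs from all mappers, sort them by key, and group values per key."""
    all_pairs = [pair for output in map_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]


# Intermediate <key, value> pairs as two mappers might emit them.
mapper_a = [("fox", 1), ("the", 1)]
mapper_b = [("the", 1), ("dog", 1), ("the", 1)]

reducer_input = shuffle_and_sort([mapper_a, mapper_b])
print(reducer_input)
```

In a cluster this step also moves data across the network, which is why reducing the volume of intermediate output matters.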
Using a combiner
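A combiner applies reduce-like logic to a single mapper's output before that output is sent on, so fewer pairs have to be shuffled. The sketch below simulates this for word counting; the function names and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def map_line(line):
    """Map step: one <word, 1> pair per word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's pairs locally, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


raw = map_line("to be or not to be")
combined = combine(raw)

print(len(raw), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

This works for counting because addition is associative and commutative, so partial sums can be merged later by the reducer. An operation without those properties, such as a plain average, cannot be combined this naively.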
Using a custom partitioner
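A partitioner decides which reducer receives each intermediate key. The sketch below contrasts a hash-based rule with a hypothetical custom rule that routes keys by first letter. Both functions are illustrative Python stand-ins, not the Hadoop API, and CRC32 is used here only to get a hash that is stable across runs.

```python
import zlib


def hash_partition(key, num_reducers):
    """Hash-style partitioning: spread keys across reducers by a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def first_letter_partition(key, num_reducers):
    """Hypothetical custom rule: keys starting with a-m go to reducer 0,
    all other keys go to the last reducer."""
    return 0 if key[:1].lower() <= "m" else num_reducers - 1


keys = ["apple", "mango", "zebra"]
assignments = {key: first_letter_partition(key, 2) for key in keys}
print(assignments)
print({key: hash_partition(key, 2) for key in keys})
```

A custom rule like this is useful when related keys must land on the same reducer or when output files should be split along a meaningful boundary.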
Developing and Executing a Hadoop MapReduce Program
Common practice is to use an IDE tool such as Eclipse
The MapReduce program consists of three Java files
Driver code, map code, and reduce code
Java code is compiled and stored in a JAR file and executed against the specified HDFS input files
Yet Another Resource Negotiator (YARN)
Hadoop continues to undergo development
An important change was to separate the MapReduce functionality from the functionality that manages running jobs and the distributed environment
This rewrite is sometimes called Yet Another Resource Negotiator (YARN)
10.2 The Hadoop Ecosystem
Tools have been developed to make Hadoop easier to use and provide additional functionality and features
Pig – provides a high-level data-flow programming language
Hive – provides SQL-like access
HBase – provides real-time reads and writes
Mahout – provides analytical tools
10.2 The Hadoop Ecosystem 10.2.1 Pig
Pig consists of
A data flow language called Pig Latin
An environment to execute the Pig code
Example of Pig commands
10.2 The Hadoop Ecosystem 10.2.1 Pig – Built-in Pig Functions
10.2 The Hadoop Ecosystem 10.2.2 Hive
The Hive language, Hive Query Language (HiveQL), resembles SQL rather than a scripting language
Example Hive code
10.2 The Hadoop Ecosystem 10.2.3 HBase
Unlike Pig and Hive, which are intended for batch applications, HBase can provide real-time read and write access to huge datasets
Example - Choosing a shipping address at checkout
10.2 The Hadoop Ecosystem 10.2.4 Mahout
Mahout permits the application of analytical techniques within the Hadoop environment
Mahout provides Java code that implements the usual classification, clustering, and recommenders/collaborative filtering algorithms
Users can download Apache Hadoop directly from www.apache.org or use commercial packages
For example, Pivotal provides Pivotal HD Enterprise
Components of Pivotal HD Enterprise
10.3 NoSQL
NoSQL = Not only Structured Query Language
Four major categories of NoSQL tools
Key/value stores – contain data (the value) accessed by the key
Document stores – good when the value of the key/value pair is a file
Column family stores – good for sparse datasets
Graph stores – good for items and relationships between them
Social networks like Facebook and LinkedIn
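The four categories differ mainly in how the stored value is shaped. The following Python structures are rough in-memory analogies only; the keys, names, and fields are invented, and real NoSQL products add distribution, persistence, and query features that are not modeled here.

```python
# Key/value store: an opaque value looked up by its key.
key_value = {"session:42": b"\x00\x01binary-blob"}

# Document store: the value is a whole document with its own internal structure.
documents = {"user:1": {"name": "Ana", "addresses": [{"city": "Lisbon"}]}}

# Column family store: each row keeps only the columns it actually has,
# which suits sparse data.
column_family = {
    "row1": {"profile": {"name": "Ana"}, "orders": {"2014-01": "book"}},
    "row2": {"profile": {"name": "Bo"}},
}

# Graph store: items (nodes) plus the relationships (edges) between them.
graph = {"Ana": {"Bo"}, "Bo": {"Ana", "Cy"}, "Cy": {"Bo"}}


def friends_of_friends(graph, person):
    """Relationship query typical of a social network: people two hops away."""
    direct = graph[person]
    two_hops = {fof for friend in direct for fof in graph[friend]}
    return two_hops - direct - {person}


suggestions = friends_of_friends(graph, "Ana")
print(suggestions)
```

The graph query shows why relationship-heavy data fits a graph store: the question is about connections between items rather than about any single item's value.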
Examples of NoSQL Data Stores