Lab#-01 Assignment

Manoj Martha

Big Data Labs Objectives

How the Hadoop Ecosystem fits in with the data processing lifecycle

How data is distributed, stored, and processed in a Hadoop cluster

How to use Sqoop and Flume to ingest data

How to process distributed data with Spark

Best practices for data storage

How to model structured data as tables in Impala and Hive

The Big Data Hub from 50,000 feet

Source: Cloudera.com

The Hadoop Ecosystem

Cluster Installed Technologies

• HDFS

• MapReduce

• Impala

• Spark

• HBase

• Solr

• ZooKeeper

Client Technologies

• Hive (which uses MapReduce)

• Hive on Spark

• Pig (which uses MapReduce)

• Spork (which uses Spark)

• Search (which uses Solr and ZooKeeper)

• Oozie

• 100% open source, enterprise-ready distribution of Hadoop and related projects

• The most complete, tested, and widely-deployed distribution of Hadoop

• Integrates all the key Hadoop ecosystem projects

Distribution System Including Apache Hadoop

Common Hadoop Use Cases

• Extract/Transform/Load (ETL)

• Text mining

• Index building

• Graph creation and analysis

• Pattern recognition

• Collaborative filtering

• Prediction models

• Sentiment analysis

• Risk assessment

What do these workloads have in common? The nature of the data:

a) Volume

b) Velocity

c) Variety

Core Hadoop

Distributed processing with Spark solves the issues with three key components:

• Task distribution: The Spark framework works out how to divide a job into steps that can be executed in parallel, and handles the “plumbing” such as coordinating tasks and copying data across the network.

• Cluster computing: Spark programs run across a cluster using a cluster resource management framework (YARN), which allows an application to run on many nodes, share resources, and have its lifecycle managed.

• Storage: Data is distributed when it is stored, and replicated for efficiency and fault tolerance. “Bring the computation to the data.”
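The task-distribution idea above can be illustrated with a small sketch in plain Python. This is not Spark code: it assumes a thread pool as a stand-in for cluster nodes, and the function names (`split_into_partitions`, `count_words`, `distributed_word_count`) are hypothetical, chosen only for illustration. It shows the general pattern of splitting the input into partitions, running the same task on each partition in parallel, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Divide the input into roughly equal chunks, one per parallel task."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Task run independently on each partition (the 'map' side)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial results (the 'reduce' side)."""
    return left + right


def distributed_word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # Assumption: worker threads play the role of cluster nodes here.
    # A real framework would also schedule tasks and move data between nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    sample = [
        "spark runs on yarn",
        "hdfs stores data",
        "spark reads data from hdfs",
    ]
    print(distributed_word_count(sample, num_partitions=2))
```

In this sketch the caller only supplies the per-partition task and the merge step. The splitting and coordination live in one helper, which is roughly the division of labour a framework such as Spark provides at cluster scale.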

Processing

• Spark

• MapReduce

Storage

• HDFS
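As a rough illustration of "data is distributed when it is stored" and "replicated for fault tolerance", the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. It assumes the commonly used HDFS defaults of a 128 MB block size and a replication factor of 3. The round-robin placement and the helper names (`place_blocks`, `nodes_holding`) are simplifications invented for this example, not the actual HDFS placement policy.

```python
import math

BLOCK_SIZE_MB = 128      # assumed default HDFS block size
REPLICATION_FACTOR = 3   # assumed default HDFS replication factor


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return a mapping of block id -> list of nodes holding a replica."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(num_blocks):
        # Simplified round-robin placement: each replica goes to a distinct node.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def nodes_holding(placement, block_id, failed_nodes=()):
    """List the nodes that can still serve a block after some nodes fail."""
    return [n for n in placement[block_id] if n not in failed_nodes]


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]
    layout = place_blocks(300, cluster)
    for block_id, replicas in layout.items():
        print(block_id, replicas)
    print(nodes_holding(layout, 0, failed_nodes={"node1"}))
```

Because every block lives on several nodes, losing one node still leaves copies available, and a processing task can be scheduled on any node that already holds the block it needs.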

Resource Management

• YARN

A Hadoop Cluster

Big Data Processing