Lab#-01 Assignment
Big Data Labs Objectives
How the Hadoop Ecosystem fits in with the data processing lifecycle
How data is distributed, stored, and processed in a Hadoop cluster
How to use Sqoop and Flume to ingest data
How to process distributed data with Spark
Best practices for data storage
How to model structured data as tables in Impala and Hive
The Big Data Hub from 50k Feet
Source: Cloudera.com
The Hadoop Ecosystem
Cluster Installed Technologies
• HDFS
• MapReduce
• Impala
• Spark
• HBase
• Solr
• ZooKeeper
Client Technologies
• Hive (which uses MapReduce)
• Hive on Spark
• Pig (which uses MapReduce)
• Spork (which uses Spark)
• Search (which uses Solr and ZooKeeper)
• Oozie
• 100% open source, enterprise-ready distribution of Hadoop and related projects
• The most complete, tested, and widely-deployed distribution of Hadoop
• Integrates all the key Hadoop ecosystem projects
Distribution System Including Apache Hadoop
Common Hadoop Use Cases
• Extract/Transform/Load (ETL)
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
What do these workloads have in common? The nature of the data: a) Volume b) Velocity c) Variety
Core Hadoop
Distributed processing with Spark solves the issues with three key components:
• Task Distribution: The Spark framework works out how to divide a job into steps that can be executed in parallel, and handles the “plumbing” such as coordinating tasks and copying data across the network.
• Cluster Computing: Spark programs run across a cluster using a cluster resource management framework (YARN), which allows an application to run on many nodes, share resources, and have its application lifecycle managed.
• Storage: Data is distributed when it is stored, and replicated for efficiency and fault tolerance. “Bring the computation to the data.”
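To make the task distribution idea concrete, here is a minimal sketch in plain Python. It is not the Spark API: worker threads stand in for cluster nodes, and the function names (`split_into_partitions`, `count_words`, `merge_counts`, `distributed_word_count`) are hypothetical, chosen only for illustration. It shows the general pattern of splitting the input, running the same task on each piece in parallel, and combining the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Divide the input into roughly equal chunks, one per parallel task."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Task that runs independently on each partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(partial_counts):
    """Combine the per-partition results into one final answer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def distributed_word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial = list(pool.map(count_words, partitions))
    return merge_counts(partial)


lines = ["big data labs", "spark processes big data", "hdfs stores data"]
result = distributed_word_count(lines)
print(result.most_common(2))
```

In a real cluster the framework decides the partitioning and scheduling itself; the point here is only that each task touches its own slice of the data.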
Processing
• Spark • MapReduce
Storage
• HDFS
Resource Management
• YARN
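The storage layer's "distributed and replicated" behavior can be sketched in the same way. The example below is a simplified illustration, not how HDFS is implemented: the 128 MB block size and replication factor of 3 are assumed values (they are not stated in these slides), the node names are made up, and the round-robin placement is a simplification that ignores rack awareness.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block a file of the given size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical 300 MB file stored on a hypothetical four-node cluster.
blocks = split_into_blocks(300)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes)

for block_id, replicas in placement.items():
    print(f"block {block_id} ({blocks[block_id]} MB) -> {replicas}")
```

Because every block lives on several nodes, losing one node does not lose data, and a processing task can be scheduled on any node that already holds the block it needs.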
A Hadoop Cluster
Big Data Processing