week6
Lab #8 – Apache Spark 2
Narender Reddy Kudumula
University of the Cumberlands
Data Science & Big Data Analysis (ITS-836)
Prof. Dr. Gasan Elkhodari
11/09/2019
The Data
Files and Data Used in This Homework:
Exercise Directory: $DEV1/exercises/spark-etl
Data files (local):
1. $DEV1DATA/activations/*
2. Review the data in $DEV1DATA/activations.
3. Copy this data to /loudacre in HDFS
4. Create a new RDD (e.g. test-01) from any single file under HDFS: /loudacre/activations/
5. Display the contents of the RDD using the “*.collect()” function.
6. Create an additional RDD (e.g. test-02) from any other file.
7. Display the contents of that RDD using the “*.collect()” function.
8. Use the “*.union” function to merge both RDDs (test-01 and test-02).
9. Examine and validate the new union.
10. Use the “*.filter” function to extract and display all records that contain the text “account-number”.
11. Display the results using the “*.collect()” function.
The results should be similar to the screen below.
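The RDD steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the official lab solution. It assumes the activations data has already been copied into HDFS (for example with `hdfs dfs -put $DEV1DATA/activations /loudacre/`) and that it runs in the pyspark shell, where the SparkContext `sc` is predefined. The two file names are hypothetical placeholders, and the RDDs are named `test_01` and `test_02` because hyphens are not valid in Python identifiers.

```python
# Hypothetical placeholders: substitute any two files that actually
# exist under /loudacre/activations/ in HDFS.
FILE_1 = "/loudacre/activations/FIRST_FILE"
FILE_2 = "/loudacre/activations/SECOND_FILE"


def has_account_number(line):
    """Return True when a line contains the text 'account-number'."""
    return "account-number" in line


def run_steps(sc, path_1=FILE_1, path_2=FILE_2):
    """Load two files as RDDs, union them, filter, and collect the matches."""
    # Create the first RDD and display its contents.
    test_01 = sc.textFile(path_1)
    print(test_01.collect())

    # Create the second RDD from another file and display its contents.
    test_02 = sc.textFile(path_2)
    print(test_02.collect())

    # Merge both RDDs. union() keeps duplicates, so the merged record
    # count should equal the sum of the two input counts.
    merged = test_01.union(test_02)
    assert merged.count() == test_01.count() + test_02.count()

    # Keep only the records containing "account-number" and return them.
    return merged.filter(has_account_number).collect()
```

In the pyspark shell, calling `run_steps(sc)` prints both source RDDs and returns the filtered records. Note that `collect()` pulls every record to the driver, which is fine for a single small file but not for large datasets.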