week6

Lab8-ApacheSpark-ITS836.docx

The Data

Files and Data Used in This Homework:

Exercise Directory: $DEV1/exercises/spark-etl

Data files (local):

1. $DEV1DATA/activations/*

2. Review the data in $DEV1DATA/activations.

3. Copy this data to /loudacre in HDFS, so that the files are available under /loudacre/activations/.

4. Create a new RDD (e.g., test-01) from any single file under the HDFS directory /loudacre/activations/.

5. Display the contents of the RDD by using the "*.collect()" function.

6. Create an additional RDD (e.g., test-02) from any other file.

7. Display the contents of the second RDD by using the "*.collect()" function.

8. Use the "*.union()" function to merge both RDDs (test-01 and test-02).

9. Examine and validate the new union RDD.

10. Use the "*.filter()" function to extract all records that contain the text "account-number".

11. Display the results by using the "*.collect()" function.
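
The RDD steps above (create, collect, union, filter, collect) could be sketched in PySpark roughly as follows. This is a minimal sketch rather than the official solution. It assumes it is run inside the pyspark shell, where the SparkContext `sc` already exists. The file names `file-a.xml` and `file-b.xml` and the helper names `has_account_number` and `run_lab` are hypothetical; substitute two files that actually exist under /loudacre/activations/. Python identifiers cannot contain hyphens, so the RDDs are named `test_01` and `test_02` here.

```python
# Sketch of the RDD steps for the pyspark shell, where `sc` is predefined.
# FILE_A and FILE_B are hypothetical names: replace them with real files
# found under /loudacre/activations/ in HDFS.
ACTIVATIONS_DIR = "/loudacre/activations/"
FILE_A = ACTIVATIONS_DIR + "file-a.xml"  # hypothetical file name
FILE_B = ACTIVATIONS_DIR + "file-b.xml"  # hypothetical file name


def has_account_number(line):
    """Return True if the line contains the text 'account-number'."""
    return "account-number" in line


def run_lab(sc, path_a=FILE_A, path_b=FILE_B):
    """Create two RDDs, union them, and filter for 'account-number'."""
    # Create the first RDD from a single file and display its contents.
    test_01 = sc.textFile(path_a)
    print(test_01.collect())

    # Create the second RDD from another file and display its contents.
    test_02 = sc.textFile(path_b)
    print(test_02.collect())

    # Merge both RDDs. union() keeps duplicates, so the merged count
    # should equal the sum of the two input counts.
    merged = test_01.union(test_02)
    assert merged.count() == test_01.count() + test_02.count()

    # Keep only the records that contain the text "account-number".
    matches = merged.filter(has_account_number)
    result = matches.collect()
    for record in result:
        print(record)
    return result
```

In the pyspark shell, calling `run_lab(sc)` would run all of the steps in order. Note that collect() brings the whole RDD back to the driver, which is fine for a single small file but not for large data sets.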

The results should look similar to the screen below.