Lab10-ApacheSparkAggregation-ITS836.docx

Lab# 10 – FALL 2019

Lab#10-Apache Spark

Narender Reddy Kudumula

University of Cumberlands

Data Science & Big Data Analysis (ITS-836)

Prof. Dr. Gasan Elkhodari

11/24/2019

1) Using map-reduce, count the number of requests from each user.

a) Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value. (The user ID is the third field in each line.) Your data will look something like this: (userid, 1) (userid,1) (userid,1)

b) Use reduce to sum the values for each user ID. Your RDD data will be similar to: (userid, 5) (userid,7) (userid,2)

2) Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times and so on.

a) Use map to reverse the key and value, like this: (5,userid) (7,userid) (2,userid)

b) Use the countByKey action to return a Map of frequency:user-count pairs.

3) Create an RDD where the user id is the key, and the value is the list of all the IP addresses that user has connected from. (IP address is the first field in each request line.)

Hint: Map to (userid,ipaddress) and then use groupByKey.

(userid, [20.1.34.55, 74.125.139.981]))

(userid, [245.33.1.1, 245.33.1.1, 66.79.233.99])

(userid, [65.50.196.141, 142.456.23.1, 671.143.222.1]))

4) Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.

a) Create an RDD based on the accounts data consisting of key/value-array pairs: (userid,[values...])

b) Join the Pair RDD with the set of user-id/hit-count pairs calculated in the first step.

c) Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first 5 elements, e.g.:

Lab#10 – solution

# Step 1 - Create an RDD based on a subset of weblogs (those ending in digit 6)

logs=sc.textFile("/loudacre/weblogs/*6")

# map each request (line) to a pair (userid, 1), then sum the values

userreqs = logs \

.map(lambda line: line.split()) \

.map(lambda words: (words[2],1)) \

.reduceByKey(lambda count1,count2: count1 + count2)

# Step 2 - Show the records for the 10 users with the highest count

# Step 3 - Group IPs by user ID

userips = logs \

.map(lambda line: line.split()) \

.map(lambda words: (words[2],words[0])) \

.groupByKey()

# print out the first 10 user ids, and their IP list

for (userid,ips) in userips.take(10):

print userid, ":"

for ip in ips: print "\t",ip

#Step 4a - Map account data to (userid,[values....])

accounts = sc.textFile("/loudacre/accounts") \

.map(lambda s: s.split(',')) \

.map(lambda account: (account[0],account))

# Step 4b - Join account data with userreqs then merge hit count into valuelist

accounthits = accounts.join(userreqs)

# Step 4c - Display userid, hit count, first name, last name for the first 5 elements

for (userid,(values,count)) in accounthits.take(5) :

print userid, count, values[3],values[4]

1