Lab home work
Lab#10
In this Exercise you will continue exploring the Loudacre web server log files, as well as the Loudacre user account data, using key-value Pair RDDs.
Explore WeB Log Files
Continue working with the web log files, as in the previous exercise. In this exercise you will be reducing and joining large datasets, which can take a lot of time. You may wish to perform the exercises below using a smaller dataset, consisting of only a few of the web log files, rather than all of them. Remember that you can specify a wildcard textFile("/loudacre/weblogs/*6") would include only filenames ending with the digit 6.
1) Using map-reduce, count the number of requests from each user.
a) Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value. (The user ID is the third field in each line.) Your data will look something like this:
(userid, 1)
(userid,1)
(userid,1)
b) Use reduce to sum the values for each user ID. Your RDD data will be similar to:
(userid, 5)
(userid,7)
(userid,2)
2) Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times and so on.
a) Use map to reverse the key and value, like this:
(5,userid)
(7,userid)
(2,userid)
b) Use the countByKey action to return a Map of frequency:user-count pairs.
3) Create an RDD where the user id is the key, and the value is the list of all the IP addresses that user has connected from. (IP address is the first field in each request line.)
Hint: Map to (userid,ipaddress) and then use groupByKey.
(userid, [20.1.34.55, 74.125.139.981]))
(userid, [245.33.1.1, 245.33.1.1, 66.79.233.99])
(userid, [65.50.196.141, 142.456.23.1, 671.143.222.1]))
4) Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.
a) Create an RDD based on the accounts data consisting of key/value-array pairs: (userid,[values...])
b) Join the Pair RDD with the set of user-id/hit-count pairs calculated in the first step.
c) Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first 5 elements, e.g.:
============================ Lab#10 – solution # Step 1 - Create an RDD based on a subset of weblogs (those ending in digit 6) logs=sc.textFile("/loudacre/weblogs/*6") # map each request (line) to a pair (userid, 1), then sum the values userreqs = logs \ .map(lambda line: line.split()) \ .map(lambda words: (words[2],1)) \ .reduceByKey(lambda count1,count2: count1 + count2) # Step 2 - Show the records for the 10 users with the highest count # Step 3 - Group IPs by user ID userips = logs \ .map(lambda line: line.split()) \ .map(lambda words: (words[2],words[0])) \ .groupByKey() # print out the first 10 user ids, and their IP list for (userid,ips) in userips.take(10): print userid, ":" for ip in ips: print "\t",ip # Step 4a - Map account data to (userid,[values....]) accounts = sc.textFile("/loudacre/accounts") \ .map(lambda s: s.split(',')) \ .map(lambda account: (account[0],account)) # Step 4b - Join account data with userreqs then merge hit count into valuelist accounthits = accounts.join(userreqs) # Step 4c - Display userid, hit count, first name, last name for the first 5 elements for (userid,(values,count)) in accounthits.take(5) : print userid, count, values[3],values[4]
- 1) Using map-reduce, count the number of requests from each user.
- 2) Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times and so on.
- 3) Create an RDD where the user id is the key, and the value is the list of all the IP addresses that user has connected from. (IP address is the first field in each request line.)
- 4) Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.