Big Data


MapReduce: How many neighbors do you have?

In this project, you are asked to work with the MapReduce framework. As covered in the lectures, which you should refer to, MapReduce is one of the important techniques for solving Big Data problems. It has two main phases, the Map phase and the Reduce phase, and each of them has sub-phases. Briefly, on a cluster of nodes/cores, the cluster nodes running the map program emit key-value pairs from the input file during the Map phase. These key-value pairs are then consumed by the cluster nodes running the reduce program. The reduce component usually summarizes the data received from the map phase: it combines the key-value pairs produced by the mappers on several nodes and produces the final output.

Objectives:
1- Understanding the conceptual basis of the MapReduce framework
2- Solving a problem by applying the MapReduce framework
3- Gaining hands-on experience in developing and running MapReduce programs
4- Gaining the skills not only for developing the code, but also for:
a. creating executable files,
b. moving data back and forth between the local system and the Hadoop file system,
c. configuring a MapReduce job,
d. submitting the job for execution, and
e. obtaining the results.

The network you will work with is provided as a compressed file at the link given below. After downloading it, extract it, for instance with WinRAR. Notepad or Notepad++ is not a good choice for opening the extracted file, because the file is too large for a simple text editor. However, you can open it with a free editor such as PilotLite. The file needs one very simple cleaning step: remove the first 4 lines so that only the network remains.

Main Task: (20 points)
In this project, you will solve the problem of counting the neighboring nodes of each node in the network. If two nodes share a link/an edge, they are considered neighbors. You can think of the network as a friendship network (nodes are friends and links represent friendship relations), a co-authorship network (nodes are authors and links represent common work), etc.
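The cleaning step described above (dropping the first 4 lines) can also be done without opening the file in an editor. The following is only a minimal sketch, not a required part of the project: the function name `clean_network`, its parameters, and the output file name are hypothetical, and it assumes the downloaded file is still gzip-compressed. The optional `max_edges` argument keeps only the first few edges, which is convenient when a small sample is wanted for testing.

```python
import gzip
from itertools import islice


def clean_network(src_path, dst_path, header_lines=4, max_edges=None):
    """Copy the edge list out of a .gz file, skipping the leading header lines.

    Hypothetical helper: names and defaults are assumptions for illustration.
    Returns the number of edge lines written.
    """
    written = 0
    # Stream line by line so the 1.64 GB file is never held in memory.
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        body = islice(src, header_lines, None)  # drop the first header_lines lines
        if max_edges is not None:
            body = islice(body, max_edges)      # keep only a small sample
        for line in body:
            dst.write(line)
            written += 1
    return written
```

For example, `clean_network("com-orkut.ungraph.txt.gz", "network.txt")` would write the full cleaned network, while passing `max_edges=100` would write only the first 100 edges. Both output names are placeholders.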

Bonus Task: (15 points)
In this part, you are only asked to report the top 30 nodes with the largest number of neighbors in the network. If several nodes have the same number of neighbors, you can break the ties randomly. If you decide to submit the bonus part, you have to submit separate files named Bonus.py. For the bonus part, the code has to produce correct results; no partial credit will be given.

For more information, please see below:

1. General Information: Data Description
a. The network has
i. Nodes: 3072441
ii. Edges: 117185083
b. The uncompressed file size: 1.64 GB
c. The data in the file has the following format:
d. [snapshot of the data file; image not reproduced here]
e. The numbers represent node ids
f. Each line shows a link/edge in the network
g. In a line, the two nodes are delimited by the tab character (\t)

h. For instance, in the snapshot given in point d., the neighbors of node 1 are 11, 12, and 13. The neighbors count is 3 (this is the number you need to report for each node)
i. For example, the output for the snapshot network given in d. would look like:

1     3
2     5
5     1
6     1
11    1
12    1
13    1
15    1
60    1
62    1
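When testing on a small sample, it can help to have an independent reference count to compare the job's output against. The sketch below is a plain in-memory check, not a MapReduce solution. The function name `count_neighbors` is hypothetical, and the code assumes that each link appears once in the file and counts for both of its endpoints. The three sample edges come from the example above, where node 1 has the neighbors 11, 12, and 13.

```python
from collections import Counter


def count_neighbors(edge_lines):
    """Return {node_id: neighbor_count} for tab-separated edge lines.

    Assumption: every link is listed once and is undirected, so it adds
    one neighbor to each of its two endpoints.
    """
    counts = Counter()
    for line in edge_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        u, v = line.split("\t")
        counts[u] += 1
        counts[v] += 1
    return counts


# Edges taken from the example: node 1 is linked to 11, 12, and 13.
toy_edges = ["1\t11", "1\t12", "1\t13"]
for node, n in sorted(count_neighbors(toy_edges).items(), key=lambda kv: int(kv[0])):
    print(f"{node}\t{n}")
```

On these three edges the check reports 3 neighbors for node 1 and 1 neighbor each for nodes 11, 12, and 13. This approach only works for data that fits in memory, which is why it is suitable for a toy sample but not for the full network.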

Hints:
1- You have to study the MapReduce lectures (both the theory and the practical lectures) that you find on BB to know what steps are needed.
2- I highly encourage you to check the code you have on BB and to understand it very well. Solving this project is similar, but with some tweaks and modifications.
3- As a starting point, you can use only a small part of the network during development.
a. For example, you may use only the first 100 links/edges, or even fewer. This will make testing your code easier.
b. This is called working with a toy dataset. This way, producing results will be faster and solving problems will be quicker.
4- I recommend that you think thoroughly about what your map and reduce parts will do before implementation begins.

Notes:
1- The project is to be done in groups of two or fewer. Forming a group, if you want one, is the responsibility of the students.

2- You should develop this project under the Cloudera virtual machine, which you should have installed at the beginning of this semester.
a. You should not need to install any special packages or libraries beyond the default compilers and libraries.
3- You can develop the code using Java, Python, or C++.
a. If you use a language not covered in the lectures, please note that you are responsible for figuring out how the code should be run AND for including that information as a comment in your code.
4- For the main part, you should submit:
a. The code file.
b. The executable file; it will be used to run your code
• Only one zipped file should be submitted per group.
• Your code file has to start with a comment block.
• This comment block has: students' names and ids
5- For the bonus part, you should submit:
a. The code file.
b. The executable file; it will be used to run your code
• Only one zipped file should be submitted per group.
• Your code file has to start with a comment block.
• This comment block has: students' names, ids, and sections
6- The network data file can be found at:
a. http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz
7- You have to make sure that your code runs error-free; in particular, it must have no compilation errors. We will not debug or fix any errors, and a very low score is to be expected in that case.

8- Example that you may use for how to run the code (similarly for the bonus part, with the needed modifications):
a. Java:
• $ hadoop jar MR.jar MR /user/cloudera/input_mr /user/cloudera/output_mr
• The input_mr folder will have the input network data to work on
• The output will be stored in the output_mr folder
b. Python:
• $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar -mapper /home/cloudera/Desktop/PyMapRed/mapper.py -reducer /home/cloudera/Desktop/PyMapRed/reducer.py -input /user/cloudera/input_mr -output /user/cloudera/output_mr
• The input_mr folder will have the input network data to work on
• The output will be stored in the output_mr folder
9- You can use any resources to understand more about the technique.
a. You should not copy solution code from any source.
• You have to develop the code yourself.
10- Copying and cheating will have serious consequences, so avoid them.

Good Luck!