can somebody do that by dec 12?
CAS CS 112 A1. Assignment #6 Due 11:59 PM on Wednesday, December 12, 2012.
In this assignment you have to investigate the structure of a real world graph using some of the techniques that we discussed in class. The graph that you will use comes from scientific publications. It is based on a web site (DBLP) that contains information about articles published in a large number of computer science journals and conferences:
http://www.informatik.uni-trier.de/~ley/db/.
The graph that you will use is the co-authorship graph: each vertex in the graph represents an author and there is a link between two authors if they have published at least one paper together. This is an undirected graph. Your task is to use simple techniques to analyze this graph. You will try to find the “importance” of an author based on a number of different measures including number of co-authors, neighborhood characteristics, and its “centrality” in this graph:
http://en.wikipedia.org/wiki/Eigenvector centrality#eigenvector centrality.
Building the Graph
(20 Points)
You will be provided with a file (graph.txt) which contains the graph data using the format from the text described on p. 525. Each vertex in the graph is represented as an integer number between 0 and 226412, some of which refer to authors of single-authored papers (no incident edges). There are roughly 700 thousand edges in the graph. The edges are undirected. Therefore, if an edge (a, b) is specified, so is (b,a). Since the graph is very sparse, use an adjacency list representation to store it. Your first task is to read in the graph.txt file line by line and add the corresponding edge to the graph representation. The first line in the file specifies the number of vertices and the second the number of edges. After that, you have the edges for each vertex one by one. (Hint: We will provide a smaller graph to test you code. So, you can start with the small graph first and then go to the huge graph.)
Ranking Nodes by Degree
(10 Points) The simplest way to determine the importance of a vertex is by its degree. For example, in a social interaction network the person who knows the most people may be important (but of course that is not always the case). Using our graph representation, it is very easy to determine the degree of each vertex (just get the size of its neighbor set). Produce a list of vertices sorted by decreasing degree in a file called degreeRanking.txt, where each line contains the name of the vertex and its degree. You should notice that most vertices have very low degree, but there is still a considerable number of vertices with high degree. These high degree vertices are typically called hubs.
Ranking Nodes by Clustering Coefficient
(30 Points) Another way to determine the importance of a vertex is by considering how dense is the immediate neighborhood of this vertex. One way to measure this is by considering what fraction of its neighbors are connected. This metric is known as the clustering coefficient. Formally, for undirected graph G = (V ; E), define the neighbor set of vi ∈ V , denoted by Nvi , to be the set of vertices connected to vi:
1
Nvi = {vj|(vi, vj ) ∈ E}
Then the clustering coefficient of vi, denoted by CC(vi), is the fraction of vertex pairs in Nvi that share an edge:
CC(vi) = |{(vj ,vk)}|
( |Nvi
|
2 )
: vj , vk ∈ Nvi , (vj , vk) ∈ E.
Compute the clustering coefficient of the top-100 vertices in the degreeRanking.txt file (that is the 100 vertices with the highest degrees), and again produce a sorted list (in the same format as before, sorted by decreasing clustering coefficient) in a file called clusteringRanking.txt.
Ranking Nodes by Closeness Centrality
(40 Points) Another measure that signifies the “importance” of a vertex in a graph is the closeness centrality of this vertex. The idea behind this measure is that if a vertex is close to the center of the graph, it should be easy for this vertex to reach any other vertex in the graph. Given a graph G = (V, E) that is connected (that is, there is at least one path between every pair of vertices), the closeness centrality CLC of a vertex v is defined as follows:
CLC(v) =
∑ u∈V ′
d(v,u)
|V |−1
where V ′ = V − {v} and d(v, u) is the shortest distance between vertices v and u in the graph (the size of the shortest path). Thus, the CLC(v) is the average distance of vertex v to every other vertex in the graph.
To calculate the CLC(v) of a vertex v you need to run a breadth-first search from this vertex and keep a counter of the number of vertices with distance 1, 2, 3, and so on. After all vertices of the graph have been visited, you can use these counters to estimate CLC(v).
Compute the closeness centrality of the the top-100 vertices in the degreeRanking.txt file and produce a sorted list in a file called closenessRanking.txt.
Submission
Submit only the code that you used in this assignment. Put the code in a file GraphTools.java that contains the code to create the graph and generate the output in the files specified in the assignment. You do not have to submit the three files.
2