You are expected to submit both a report and code. For your code, please include clear readme
files. When not specifically mentioned, use the default data provided with our program.
“/****************Please Fill Missing Lines Here*****************/” marks
the places where input from you is needed.
1. Clustering Evaluation.
ID  Conference Name  Ground Truth Label  Algorithm Output Label
1   IJCAI            3                   2
2   AAAI             3                   2
3   ICDE             1                   3
4   VLDB             1                   3
5   SIGMOD           1                   3
6   SIGIR            4                   4
7   ICML             3                   2
8   NIPS             3                   2
9   CIKM             4                   3
10  KDD              2                   1
11  WWW              4                   4
12  PAKDD            2                   1
13  PODS             1                   3
14  ICDM             2                   1
15  ECML             3                   2
16  PKDD             2                   1
17  EDBT             1                   2
18  SDM              2                   1
19  ECIR             4                   4
20  WSDM             4                   4
Suppose we want to cluster the 20 conferences above into four areas, with the ground truth label and the
algorithm output label shown in the third and fourth columns. Please evaluate the quality of the clustering
algorithm according to purity, precision, recall, F-measure, and normalized mutual information (NMI), respectively.
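As a starting point for the evaluation above, the following is a minimal sketch of purity and pair-counting precision, recall, and F-measure. It assumes the pair-counting definition (a pair of items is a true positive when it shares both a ground truth label and an output label) and F-measure with beta = 1; the class name ClusterEval and the toy labels in main are hypothetical and are not part of the provided program.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: purity and pair-counting precision/recall/F1 for a flat clustering. */
final class ClusterEval {

    /** counts.get(k).get(c) = number of items with output label k and ground truth label c. */
    static Map<Integer, Map<Integer, Integer>> contingency(int[] truth, int[] pred) {
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int i = 0; i < truth.length; i++) {
            counts.computeIfAbsent(pred[i], k -> new HashMap<>())
                  .merge(truth[i], 1, Integer::sum);
        }
        return counts;
    }

    /** Purity: each output cluster is credited with its majority ground truth class. */
    static double purity(int[] truth, int[] pred) {
        int correct = 0;
        for (Map<Integer, Integer> row : contingency(truth, pred).values()) {
            int best = 0;
            for (int n : row.values()) {
                best = Math.max(best, n);
            }
            correct += best;
        }
        return (double) correct / truth.length;
    }

    /** Returns {precision, recall, F1} counted over all unordered pairs of items. */
    static double[] pairwisePrf(int[] truth, int[] pred) {
        long tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < truth.length; i++) {
            for (int j = i + 1; j < truth.length; j++) {
                boolean sameTruth = truth[i] == truth[j];
                boolean samePred = pred[i] == pred[j];
                if (samePred && sameTruth) tp++;
                else if (samePred) fp++;
                else if (sameTruth) fn++;
            }
        }
        // Assumed convention: a ratio with an empty denominator is reported as 0.
        double p = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double r = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double f = (p + r == 0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        // Hypothetical toy labels; substitute the third and fourth table columns.
        int[] truth = {1, 1, 2, 2};
        int[] pred = {1, 1, 1, 2};
        double[] prf = pairwisePrf(truth, pred);
        System.out.println("purity = " + purity(truth, pred));
        System.out.printf("P = %.3f, R = %.3f, F1 = %.3f%n", prf[0], prf[1], prf[2]);
    }
}
```

Because both measures depend only on which items are grouped together, the numeric values of the output labels do not need to match the ground truth labels.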
2. Understanding and comparing different clustering algorithms.
(1) Fill in the missing lines in the K-means, DBSCAN, and EM (for GMM) algorithms, and apply each of
them to the three datasets (data1.txt, data2.txt, and data3.txt).
(2) Plot the clustering results for the three datasets using scatter plots, with different colors representing
different clusters. Evaluate the above algorithms using (a) purity and (b) normalized mutual information
for each dataset. You need to state the parameter settings used in these algorithms. (For
DBSCAN, you may treat each noise point as its own cluster in the evaluation.)
(3) Can you explain why some algorithms work better than others on each of these datasets?
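For the evaluation in part (2), the following is a minimal sketch of normalized mutual information together with a helper that turns each DBSCAN noise point into its own cluster. Several NMI normalizations are in use; this sketch assumes division by the geometric mean of the two entropies, so check which convention the course expects. The class name NmiSketch, the noise label -1, and the toy labels are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: NMI between two labelings, plus noise handling for DBSCAN output. */
final class NmiSketch {

    /** Gives every point labeled noiseLabel a fresh cluster ID above the current maximum. */
    static int[] noiseAsSingletons(int[] pred, int noiseLabel) {
        int next = Integer.MIN_VALUE;
        for (int p : pred) {
            next = Math.max(next, p);
        }
        int[] out = pred.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == noiseLabel) {
                out[i] = ++next;
            }
        }
        return out;
    }

    /** Shannon entropy (natural log) of a labeling given its label counts. */
    static double entropy(Map<Integer, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * Math.log(p);
        }
        return h;
    }

    /** NMI = I(C; K) / sqrt(H(C) * H(K)); the normalization is an assumed convention. */
    static double nmi(int[] truth, int[] pred) {
        int n = truth.length;
        Map<Integer, Integer> classCounts = new HashMap<>();
        Map<Integer, Integer> clusterCounts = new HashMap<>();
        Map<Long, Integer> joint = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(truth[i], 1, Integer::sum);
            clusterCounts.merge(pred[i], 1, Integer::sum);
            // Pack the (class, cluster) pair into a single long key.
            long key = ((long) truth[i] << 32) | (pred[i] & 0xffffffffL);
            joint.merge(key, 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : joint.entrySet()) {
            long key = e.getKey();
            double nc = classCounts.get((int) (key >> 32));
            double nk = clusterCounts.get((int) key);
            double nck = e.getValue();
            mi += nck / n * Math.log(n * nck / (nc * nk));
        }
        double denom = Math.sqrt(entropy(classCounts, n) * entropy(clusterCounts, n));
        // Assumed convention: report 0 when either labeling has a single group.
        return denom == 0.0 ? 0.0 : mi / denom;
    }

    public static void main(String[] args) {
        // Hypothetical toy labels; -1 marks DBSCAN noise here.
        int[] truth = {1, 1, 2, 2, 2};
        int[] dbscan = {0, 0, 1, 1, -1};
        System.out.println("NMI = " + nmi(truth, noiseAsSingletons(dbscan, -1)));
    }
}
```

Relabeling noise before scoring keeps unrelated noise points from being counted as one large cluster, which would otherwise distort both purity and NMI.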
3. Topic Model: PLSA.
(1) Fill in the missing lines in PLSA.java.
(2) Run PLSA on the provided dataset, where each document is a computer science venue and the text
for each venue is the collection of titles of papers published in that venue. Write down the top-10 words
for each of the 4 topics according to the word distribution of each topic (beta).
(3) According to the topic distribution of each venue (theta), assign each venue to the topic with the highest
probability as the clustering result, and calculate NMI based on the ground truth given in Question 1.
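Parts (2) and (3) both read results off the fitted distributions, which could be sketched as follows. This sketch assumes beta is stored as a topics-by-vocabulary array and theta as a venues-by-topics array; the actual layout in PLSA.java may differ. The class name TopicReadout, the method names, and the toy values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: reading top words and hard cluster assignments off PLSA output. */
final class TopicReadout {

    /** Returns the n highest-probability words of one topic's word distribution. */
    static List<String> topWords(double[] betaRow, String[] vocab, int n) {
        List<Integer> idx = new ArrayList<>();
        for (int w = 0; w < betaRow.length; w++) {
            idx.add(w);
        }
        // Sort word indices by descending probability under this topic.
        idx.sort((a, b) -> Double.compare(betaRow[b], betaRow[a]));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, idx.size()); i++) {
            out.add(vocab[idx.get(i)]);
        }
        return out;
    }

    /** Index of the largest entry; ties go to the lowest index. */
    static int argmax(double[] row) {
        int best = 0;
        for (int k = 1; k < row.length; k++) {
            if (row[k] > row[best]) {
                best = k;
            }
        }
        return best;
    }

    /** Assigns each venue (row of theta) to its most probable topic, 0-based. */
    static int[] assignVenues(double[][] theta) {
        int[] labels = new int[theta.length];
        for (int d = 0; d < theta.length; d++) {
            labels[d] = argmax(theta[d]);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical toy values: 2 topics, 3 vocabulary words, 2 venues.
        String[] vocab = {"query", "learning", "retrieval"};
        double[][] beta = {{0.7, 0.1, 0.2}, {0.1, 0.8, 0.1}};
        double[][] theta = {{0.9, 0.1}, {0.3, 0.7}};
        for (int k = 0; k < beta.length; k++) {
            System.out.println("topic " + k + ": " + topWords(beta[k], vocab, 2));
        }
        System.out.println(java.util.Arrays.toString(assignVenues(theta)));
    }
}
```

The resulting topic indices are arbitrary and 0-based, which is fine for NMI because it is invariant to how the clusters are numbered.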