Data Mining



You are expected to submit both a report and your code. Please include a clear readme
file with your code. Unless stated otherwise, use the default data provided in our program.
The marker “/****************Please Fill Missing Lines Here*****************/” is
used where input from you is needed.

1. Clustering Evaluation.

ID   Conference Name   Ground Truth Label   Algorithm Output Label
1    IJCAI             3                    2
2    AAAI              3                    2
3    ICDE              1                    3
4    VLDB              1                    3
5    SIGMOD            1                    3
6    SIGIR             4                    4
7    ICML              3                    2
8    NIPS              3                    2
9    CIKM              4                    3
10   KDD               2                    1
11   WWW               4                    4
12   PAKDD             2                    1
13   PODS              1                    3
14   ICDM              2                    1
15   ECML              3                    2
16   PKDD              2                    1
17   EDBT              1                    2
18   SDM               2                    1
19   ECIR              4                    4
20   WSDM              4                    4

Suppose we want to cluster the 20 conferences above into four areas, with the ground truth labels and the
algorithm output labels shown in the third and fourth columns. Please evaluate the quality of the clustering
algorithm according to purity, precision, recall, F-measure, and normalized mutual information, respectively.
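
As a starting point, the measures named above can be sketched in Python as follows. This is only a sketch under stated assumptions: precision, recall, and F-measure are taken to be the pair-counting versions (a pair of conferences is a true positive when it shares both a cluster and a ground truth label), and NMI is assumed to be normalized by the square root of the product of the two entropies. If the course uses other definitions, adjust accordingly. The function names are illustrative and are not part of the provided program.

```python
import math
from collections import Counter

# Labels copied from the table above, in ID order 1..20.
truth = [3, 3, 1, 1, 1, 4, 3, 3, 4, 2, 4, 2, 1, 2, 3, 2, 1, 2, 4, 4]
pred = [2, 2, 3, 3, 3, 4, 2, 2, 3, 1, 4, 1, 3, 1, 2, 1, 2, 1, 4, 4]


def comb2(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def purity(truth, pred):
    """Fraction of points that fall in the majority class of their cluster."""
    best = Counter()
    for (cluster, _), n in Counter(zip(pred, truth)).items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(truth)


def pairwise_prf(truth, pred):
    """Pair-counting precision, recall, and F-measure (assumed definition)."""
    tp = sum(comb2(n) for n in Counter(zip(pred, truth)).values())
    same_cluster = sum(comb2(n) for n in Counter(pred).values())  # TP + FP
    same_class = sum(comb2(n) for n in Counter(truth).values())   # TP + FN
    precision = tp / same_cluster
    recall = tp / same_class
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by sqrt(H(truth) * H(pred)) (assumed)."""
    n = len(truth)
    n_cluster, n_class = Counter(pred), Counter(truth)
    mi = sum(
        (n_ij / n) * math.log(n * n_ij / (n_cluster[c] * n_class[t]))
        for (c, t), n_ij in Counter(zip(pred, truth)).items()
    )
    return mi / math.sqrt(entropy(truth) * entropy(pred))


if __name__ == "__main__":
    print("purity:", purity(truth, pred))
    print("precision, recall, F:", pairwise_prf(truth, pred))
    print("NMI:", nmi(truth, pred))
```

Note that the sketch does not guard against degenerate inputs such as a single cluster (zero entropy) or all-singleton clusters (no same-cluster pairs). Your report should still show the hand calculation behind each number.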

2. Understanding and comparing different clustering algorithms.

(1) Fill in the missing lines in the K-means, DBSCAN, and EM (for GMM) algorithms, and apply each of them
to the three datasets (data1.txt, data2.txt, and data3.txt).

(2) Plot the clustering results for the three datasets as scatter plots, with different colors representing
different clusters. Evaluate the above algorithms on each dataset using purity and normalized mutual
information. State the parameter settings you used for each algorithm. (For DBSCAN, you may treat each
noise point as a cluster in the evaluation.)

(3) Can you explain why some algorithms work better than others on each of these datasets?
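
For orientation, the K-means loop and the DBSCAN noise convention mentioned in this question might look like the sketch below. It is written independently of the provided program, so every name in it (kmeans, noise_as_singletons, the seed and n_iter parameters) is hypothetical. It assumes the data is an (n, d) NumPy array, that initial centers are drawn at random from the data points, and that noise points carry the label -1. Giving every noise point its own cluster id is one reading of the DBSCAN note above. Check each of these assumptions against the provided code.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign points to nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers


def noise_as_singletons(labels, noise=-1):
    """Give every noise point its own cluster id (assumed noise label: -1)."""
    labels = np.asarray(labels).copy()
    next_id = labels.max() + 1
    for i in np.flatnonzero(labels == noise):
        labels[i] = next_id
        next_id += 1
    return labels


if __name__ == "__main__":
    # Two small, well-separated groups of made-up points.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
    labels, centers = kmeans(X, k=2)
    print(labels, centers)
    print(noise_as_singletons([0, -1, 1, -1]))
```

Whichever conventions you end up using, report them together with the parameter values (for example the number of clusters, or the DBSCAN radius and minimum point count), since both purity and NMI depend on them.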

3. Topic Model: PLSA.

(1) Fill in the missing lines in

(2) Run PLSA on the provided dataset, where each document is a computer science venue and the text
for each venue is the collection of titles of papers published in that venue. Write down the top-10 words
for each of the 4 topics according to the word distribution of each topic (beta).

(3) According to the topic distribution of each venue (theta), assign each venue to the topic with the highest
probability as the clustering result, and calculate NMI based on the ground truth given in Question 1.
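
The PLSA steps above could be sketched as follows. This is an illustration under assumptions, not the provided program: counts is assumed to be a (venues x vocabulary) word-count matrix, theta a (venues x topics) matrix, and beta a (topics x vocabulary) matrix, and the function names plsa, top_words, and assign_topics are hypothetical. The E-step here builds a full venues x topics x vocabulary array, which is fine for small data but may need restructuring for a large vocabulary.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for PLSA on a (docs x words) count matrix; returns theta, beta."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, each row normalized to a distribution.
    theta = rng.random((n_docs, n_topics))
    theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((n_topics, n_words))
    beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z | d, w) proportional to theta[d, z] * beta[z, w].
        post = theta[:, :, None] * beta[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post
        beta = weighted.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta


def top_words(beta, vocab, n=10):
    """The n highest-probability words of each topic, most probable first."""
    order = np.argsort(-beta, axis=1)[:, :n]
    return [[vocab[i] for i in row] for row in order]


def assign_topics(theta):
    """Index of the most probable topic for each document (venue)."""
    return theta.argmax(axis=1)


if __name__ == "__main__":
    # Tiny made-up corpus: 4 documents over a 4-word vocabulary.
    vocab = ["query", "index", "learning", "kernel"]
    counts = np.array([[5, 4, 0, 0],
                       [4, 6, 0, 1],
                       [0, 0, 7, 5],
                       [1, 0, 4, 6]], dtype=float)
    theta, beta = plsa(counts, n_topics=2)
    print(top_words(beta, vocab, n=2))
    print(assign_topics(theta))
```

The topic indices returned by assign_topics are arbitrary, so they do not need to match the label numbers in Question 1. NMI is invariant to such relabeling, which is why it is a suitable measure here.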