Data Mining Assignment


1. Clustering Evaluation.

ID  Conference Name  Ground Truth Label  Algorithm Output Label
1   IJCAI            3                   2
2   AAAI             3                   2
3   ICDE             1                   3
4   VLDB             1                   3
5   SIGMOD           1                   3
6   SIGIR            4                   4
7   ICML             3                   2
8   NIPS             3                   2
9   CIKM             4                   3
10  KDD              2                   1
11  WWW              4                   4
12  PAKDD            2                   1
13  PODS             1                   3
14  ICDM             2                   1
15  ECML             3                   2
16  PKDD             2                   1
17  EDBT             1                   2
18  SDM              2                   1
19  ECIR             4                   4
20  WSDM             4                   1

Suppose we want to cluster the 20 conferences above into four areas, with the ground truth labels and
the algorithm output labels shown in the third and fourth columns. Please evaluate the quality of the
clustering algorithm according to purity, precision, recall, F-measure, and normalized mutual
information, respectively.
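
The measures named above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the expected solution: precision, recall, and F-measure are taken here in their pairwise form (a pair of conferences counts as a true positive when it shares both a cluster and a ground truth label), and NMI is normalized by the geometric mean of the two entropies. Other definitions exist, so check which one the course uses. The function names are hypothetical; the two label lists are copied from the table.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

# Labels copied from the table, in ID order 1..20.
truth = [3, 3, 1, 1, 1, 4, 3, 3, 4, 2, 4, 2, 1, 2, 3, 2, 1, 2, 4, 4]
pred = [2, 2, 3, 3, 3, 4, 2, 2, 3, 1, 4, 1, 3, 1, 2, 1, 2, 1, 4, 1]


def purity(truth, pred):
    """Fraction of points that carry the majority true label of their cluster."""
    joint = Counter(zip(pred, truth))
    best = Counter()
    for (c, _t), n in joint.items():
        best[c] = max(best[c], n)
    return sum(best.values()) / len(truth)


def pairwise_prf(truth, pred):
    """Pairwise precision, recall, and F-measure (assumed definition)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        tp += same_class and same_cluster
        fp += same_cluster and not same_class
        fn += same_class and not same_cluster
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(v / n * log(v / n) for v in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by sqrt(H(clusters) * H(classes)) (assumed normalization)."""
    n = len(truth)
    joint = Counter(zip(pred, truth))
    size_c = Counter(pred)
    size_t = Counter(truth)
    mi = sum(
        n_ct / n * log(n * n_ct / (size_c[c] * size_t[t]))
        for (c, t), n_ct in joint.items()
    )
    return mi / sqrt(entropy(pred) * entropy(truth))


precision, recall, f_measure = pairwise_prf(truth, pred)
print("purity   ", purity(truth, pred))
print("precision", precision)
print("recall   ", recall)
print("F-measure", f_measure)
print("NMI      ", nmi(truth, pred))
```

Counting pairs explicitly is quadratic in the number of items, which is harmless for 20 conferences and keeps the definitions easy to check by hand against the table.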

2. Understanding and comparing different clustering algorithms.

(1) Implement KNN, DBSCAN, and the EM algorithm for a Gaussian mixture model on each of the three
datasets (data1.txt, data2.txt, and data3.txt). [Upload your code.]

(2) Plot the clustering results as scatter plots, using a different color for each cluster. Calculate and
compare (1) purity and (2) normalized mutual information for each algorithm on each dataset. State the
parameter settings used in these algorithms, if any are required.

(3) Explain why some algorithms work better than others (if any do) on each of these datasets.
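
To make the role of parameter settings concrete, here is a minimal sketch of DBSCAN, one of the algorithms named in (1). It is an illustration only: the names `eps` and `min_pts` are the conventional ones for the neighborhood radius and the core-point threshold, the toy points are made up (they are not from data1.txt, data2.txt, or data3.txt), and the neighbor search is a brute-force scan rather than an indexed one.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is also a core point
        cluster_id += 1
    return labels


# Toy data: two tight squares and one isolated point (made up for illustration).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these toy settings each square forms one cluster and the isolated point is marked as noise. Shrinking `eps` or raising `min_pts` changes that outcome, which is why the settings need to be reported alongside the results.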

3. Clustering the real-world data.

By looking at our AP_train.txt dataset, we can see that many features can be extracted for conferences.
Please provide a solution for clustering the 20 conferences from Question 1 into 4 areas using the
AP_train.txt dataset. Please (1) write down the major steps, including feature selection, clustering
criterion, algorithm used, etc.; (2) report the purity and NMI measures given the labels in Question 1;
(3) analyze whether your solution is reasonable, and why. [Note: you first need to link the conference
acronyms to the proceedings names used in AP_train.txt. For example, “#c Proceedings of the Fifth ACM
SIGKDD international conference on Knowledge discovery and data mining” is the 1999 proceedings of
“KDD.”]
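
The linking step described in the note could be sketched as below. This is a hedged illustration, not a description of the dataset: it assumes only what the note states, namely that a proceedings name appears on a line starting with `#c`. The `VENUE_PATTERNS` table and the `venue_of` helper are hypothetical names, and only the KDD pattern is grounded in the example above; patterns for the other 19 conferences would have to be worked out by inspecting AP_train.txt.

```python
import re

# Hypothetical acronym -> pattern table. Only the KDD entry comes from the
# example in the note; the remaining conferences are deliberately left out.
VENUE_PATTERNS = {
    "KDD": re.compile(r"\bSIGKDD\b", re.IGNORECASE),
}


def venue_of(line):
    """Return the conference acronym for a '#c' line, or None if unmatched."""
    if not line.startswith("#c"):
        return None
    name = line[2:].strip()
    for acronym, pattern in VENUE_PATTERNS.items():
        if pattern.search(name):
            return acronym
    return None


sample = ("#c Proceedings of the Fifth ACM SIGKDD international conference "
          "on Knowledge discovery and data mining")
print(venue_of(sample))
```

Matching on a distinctive token such as "SIGKDD" rather than on "KDD" alone is a deliberate choice here, since a bare "KDD" substring would also match names containing "PAKDD" or "PKDD".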