Computer Science Assignment work
K-means Cluster Analysis
The objective of this algorithm is to partition a data set S consisting of n-tuples of real numbers into k clusters C1, …, Ck in an efficient way. For each cluster Cj, one element cj is chosen from that cluster called a centroid.
Definition 1: The basic k-means clustering algorithm is defined as follows:
· Step 1: Choose the number of clusters k
· Step 2: Make an initial selection of k centroids
· Step 3: Assign each data element to its nearest centroid (in this way k clusters are formed one for each centroid, where each cluster consists of all the data elements assigned to that centroid)
· Step 4: For each cluster make a new selection of its centroid
· Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)
There are various choices available for each step in the process.
An alternative version of the algorithm is as follows:
· Step 1: Choose the number of clusters k
· Step 2: Make an initial assignment of the data elements to the k clusters
· Step 3: For each cluster select its centroid
· Step 4: Based on centroids make a new assignment of data elements to the k clusters
· Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)
Distance
There are a number of ways to define the distance between two n-tuples in the data set S, but we will focus on the Euclidean measure, namely, if x = ( x1, …, xn) and y = (y1, …, y n) then the distance between x and y is defined by
Since minimizing the distance is equivalent to minimizing the square of the distance, we will instead look at dist2( x, y) = ( dist( x, y))2. If there are k clusters C1, …, Ck with corresponding centroids c1, …, ck, then for each data element x in S, step 3 of the k-means algorithm consists of finding the value j which minimizes dist2( x, cj); i.e.
If we don’t require that the centroids belong to the data set S, then we typically define the new centroid cj for cluster Cj in step 4 to be the mean of all the elements in that cluster, i.e.
where mj is the number of data elements in Cj.
If we think of the distance squared between any data element in S and its nearest centroid as an error value (between a data element and its prototype) then across the system we are trying to minimize
Initial Choice
There is no guarantee that we will find centroids c1, …, ck that minimize SSE and a lot depends on our initial choices for the centroids in step 2.
Property 1: The best choice for the centroids c1, …, ck are the n-tuples which are the means of the C1, …, Ck,. By best choice, we mean the choice that minimizes SSE.
Click here for a proof of this property (using calculus).
Example
Example 1: Apply the second version of the k-means clustering algorithm to the data in range B3:C13 of Figure 1 with k = 2.
Figure 1 – K-means cluster analysis (part 1)
The data consists of 10 data elements which can be viewed as two-dimensional points (see Figure 3 for a graphical representation). Since there are two clusters, we start by assigning the first element to cluster 1, the second to cluster 2, the third to cluster 1, etc. (step 2), as shown in range E3:E13.
We now set the centroids of each cluster to be the mean of all the elements in that cluster. The centroid of the first cluster is (2.6, 1.4) where the X value (in cell H4) is calculated by the formula =AVERAGEIF(E4:E13,1,B4:B13) and the Y value (in cell H5) is calculated by the worksheet formula =AVERAGEIF(E4:E13,1,C4:C13). The centroid for the second cluster (3.2, 3.0) is calculated in a similar way.
We next calculate the squared distance of each of the ten data elements to each centroid. E.g. the squared distance of the first data element to the first centroid is 7.72 (cell L4) as calculated by =(B4-H4)^2+(C4-H5)^2 or equivalently =SUMXMY2($B4:$C4,H$4:H$5). Since the squared distance to the second cluster is 12.24 (cell M4) is higher we see that the first data element is closer to cluster 1 and so we keep that point in cluster 1 (cell O4). Here cell K4 contains the formula =MIN(L4:M4) and cell O4 contains the formula =IF(L4<=M4,1,2).
Convergence
We proceed in this way to determine a new assignment of clusters to each of the 10 data elements as described in range O4:O13. The value of SSE for this assignment is 36.92 (cell H7). Since the original cluster assignment (range E4:E13) is different from the new cluster assignment, the algorithm has not yet converged and so we continue. We simply copy the latest cluster assignment into the range E16:E25 and repeat the same steps.
After four steps we get convergence, as shown in Figure 2 (range E40:E49 contains the same values as O40:O49). The final assignment of data elements to clusters is shown in range E40:E49. We also see that SSE = 22.083 and note that each step in the algorithm has reduced the value of SSE.
Figure 2 – K-means cluster analysis (part 2)
Graph of Assignments
Figure 3 graphically shows the assignment of data elements to the two clusters. This chart is created by highlighting the range B40:C49 and selecting Insert > Charts|Scatter.
Figure 3 – Cluster Assignment
You can add the labels (1 and 2) to the points on the chart shown in Figure 3 as follows. First, right-click on any of the points in the chart. Next, click on the Y Value option in the dialog box that appears as shown in Figure 4.
Figure 4 – Adding labels containing cluster assignment
References
PennState (2015) K-Mean procedure. STAT 505: Applied Multivariate Statistical Analysis https://online.stat.psu.edu/stat505/lesson/14/14.8
Wilks, D. (2011) Cluster analysis http://www.yorku.ca/ptryfos/f1500.pdf
Wikipedia (2015) K-means clustering https://en.wikipedia.org/wiki/K-means_clustering