Computer Science Assignment work

profileRavi12
Week4-K-meansClusterAnalysisStepsusingExcel.docx

K-means Cluster Analysis

The objective of this algorithm is to partition a data set  S consisting of n-tuples of real numbers into  k  clusters  C1, …,  Ck in an efficient way. For each cluster  Cj, one element  cj is chosen from that cluster called a  centroid.

Definition 1: The basic  k-means clustering algorithm is defined as follows:

· Step 1: Choose the number of clusters  k

· Step 2: Make an initial selection of  k centroids

· Step 3: Assign each data element to its nearest centroid (in this way  k clusters are formed one for each centroid, where each cluster consists of all the data elements assigned to that centroid)

· Step 4: For each cluster make a new selection of its centroid

· Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)

There are various choices available for each step in the process.

An alternative version of the algorithm is as follows:

· Step 1: Choose the number of clusters  k

· Step 2: Make an initial assignment of the data elements to the  k clusters

· Step 3: For each cluster select its centroid

· Step 4: Based on centroids make a new assignment of data elements to the  k clusters

· Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)

Distance

There are a number of ways to define the distance between two n-tuples in the data set  S, but we will focus on the Euclidean measure, namely, if  x = ( x1, …,  xn) and y = (y1, …, y n) then the  distance between  x and y is defined by

image9248

Since minimizing the distance is equivalent to minimizing the square of the distance, we will instead look at  dist2( x, y) = ( dist( x, y))2. If there are  k clusters  C1, …,  Ck with corresponding centroids  c1, …,  ck, then for each data element  x in  S, step 3 of the k-means algorithm consists of finding the value  j which minimizes  dist2( x, cj); i.e.

image9249

If we don’t require that the centroids belong to the data set  S, then we typically define the new centroid  cj for cluster  Cj in step 4 to be the mean of all the elements in that cluster, i.e.

image9250

where  mj is the number of data elements in  Cj.

If we think of the distance squared between any data element in  S and its nearest centroid as an error value (between a data element and its prototype) then across the system we are trying to minimize

image9251

Initial Choice

There is no guarantee that we will find centroids  c1, …,  ck that minimize  SSE and a lot depends on our initial choices for the centroids in step 2.

Property 1: The best choice for the centroids  c1, …,  ck are the  n-tuples which are the means of the  C1, …,  Ck,. By best choice, we mean the choice that minimizes  SSE.

Click here for a proof of this property (using calculus).

Example

Example 1: Apply the second version of the k-means clustering algorithm to the data in range B3:C13 of Figure 1 with  k = 2.

Cluster analysis k-means

Figure 1 – K-means cluster analysis (part 1)

The data consists of 10 data elements which can be viewed as two-dimensional points (see Figure 3 for a graphical representation). Since there are two clusters, we start by assigning the first element to cluster 1, the second to cluster 2, the third to cluster 1, etc. (step 2), as shown in range E3:E13.

We now set the centroids of each cluster to be the mean of all the elements in that cluster. The centroid of the first cluster is (2.6, 1.4) where the X value (in cell H4) is calculated by the formula =AVERAGEIF(E4:E13,1,B4:B13) and the Y value (in cell H5) is calculated by the worksheet formula =AVERAGEIF(E4:E13,1,C4:C13). The centroid for the second cluster (3.2, 3.0) is calculated in a similar way.

We next calculate the squared distance of each of the ten data elements to each centroid. E.g. the squared distance of the first data element to the first centroid is 7.72 (cell L4) as calculated by =(B4-H4)^2+(C4-H5)^2 or equivalently =SUMXMY2($B4:$C4,H$4:H$5). Since the squared distance to the second cluster is 12.24 (cell M4) is higher we see that the first data element is closer to cluster 1 and so we keep that point in cluster 1 (cell O4). Here cell K4 contains the formula =MIN(L4:M4) and cell O4 contains the formula =IF(L4<=M4,1,2).

Convergence

We proceed in this way to determine a new assignment of clusters to each of the 10 data elements as described in range O4:O13. The value of  SSE for this assignment is 36.92 (cell H7). Since the original cluster assignment (range E4:E13) is different from the new cluster assignment, the algorithm has not yet converged and so we continue. We simply copy the latest cluster assignment into the range E16:E25 and repeat the same steps.

After four steps we get convergence, as shown in Figure 2 (range E40:E49 contains the same values as O40:O49). The final assignment of data elements to clusters is shown in range E40:E49. We also see that  SSE = 22.083 and note that each step in the algorithm has reduced the value of  SSE.

K-means algorithm Excel

Figure 2 – K-means cluster analysis (part 2)

Graph of Assignments

Figure 3 graphically shows the assignment of data elements to the two clusters. This chart is created by highlighting the range B40:C49 and selecting  Insert > Charts|Scatter.

Cluster analysis chart

Figure 3 – Cluster Assignment

You can add the labels (1 and 2) to the points on the chart shown in Figure 3 as follows. First, right-click on any of the points in the chart. Next, click on the Y Value option in the dialog box that appears as shown in Figure 4.

Format data labels

Figure 4 – Adding labels containing cluster assignment

References

PennState (2015)  K-Mean procedure. STAT 505: Applied Multivariate Statistical Analysis https://online.stat.psu.edu/stat505/lesson/14/14.8

Wilks, D. (2011)  Cluster analysis http://www.yorku.ca/ptryfos/f1500.pdf

Wikipedia (2015)  K-means clustering https://en.wikipedia.org/wiki/K-means_clustering

image3.png

image4.png

image5.png

image6.png

image7.png

image8.png

image1.png

image2.png