Mid Term
Data Science and Big Data Analytics
Chap 4: Advanced Analytical Theory and Methods: Clustering
1
4.1 Overview of Clustering
Clustering is the use of unsupervised techniques for grouping similar objects
Supervised methods use labeled objects
Unsupervised methods use unlabeled objects
Clustering looks for hidden structure in the data, similarities based on attributes
Often used for exploratory analysis
No predictions are made
2
4.2 K-means Algorithm
Given a collection of objects each with n measurable attributes and a chosen value k of the number of clusters, the algorithm identifies the k clusters of objects based on the objects proximity to the centers of the k groups.
The algorithm is iterative with the centers adjusted to the mean of each cluster’s n-dimensional vector of attributes
3
4.2.1 Use Cases
Clustering is often used as a lead-in to classification, where labels are applied to the identified clusters
Some applications
Image processing
With security images, successive frames are examined for change
Medical
Patients can be grouped to identify naturally occurring clusters
Customer segmentation
Marketing and sales groups identify customers having similar behaviors and spending patterns
4
4.2.2 Overview of the Method Four Steps
Choose the value of k and the initial guesses for the centroids
Compute the distance from each data point to each centroid, and assign each point to the closest centroid
Compute the centroid of each newly defined cluster from step 2
Repeat steps 2 and 3 until the algorithm converges (no changes occur)
5
4.2.2 Overview of the Method Example – Step 1
Set k = 3 and initial clusters centers
6
4.2.2 Overview of the Method Example – Step 2
Points are assigned to the closest centroid
7
4.2.2 Overview of the Method Example – Step 3
Compute centroids of the new clusters
8
4.2.2 Overview of the Method Example – Step 4
Repeat steps 2 and 3 until convergence
Convergence occurs when the centroids do not change or when the centroids oscillate back and forth
This can occur when one or more points have equal distances from the centroid centers
Videos
9
4.2.3 Determining Number of Clusters
Reasonable guess
Predefined requirement
Use heuristic – e.g., Within Sum of Squares (WSS)
WSS metric is the sum of the squares of the distances between each data point and the closest centroid
The process of identifying the appropriate value of k is referred to as finding the “elbow” of the WSS curve
10
4.2.3 Determining Number of Clusters Example of WSS vs #Clusters curve
The elbow of the curve appears to occur at k = 3.
11
4.2.3 Determining Number of Clusters High School Student Cluster Analysis
12
4.2.4 Diagnostics
When the number of clusters is small, plotting the data helps refine the choice of k
The following questions should be considered
Are the clusters well separated from each other?
Do any of the clusters have only a few points
Do any of the centroids appear to be too close to each other?
13
4.2.4 Diagnostics Example of distinct clusters
14
4.2.4 Diagnostics Example of less obvious clusters
15
4.2.4 Diagnostics Six clusters from points of previous figure
16
4.2.5 Reasons to Choose and Cautions
Decisions the practitioner must make
What object attributes should be included in the analysis?
What unit of measure should be used for each attribute?
Do the attributes need to be rescaled?
What other considerations might apply?
17
4.2.5 Reasons to Choose and Cautions Object Attributes
Important to understand what attributes will be known at the time a new object is assigned to a cluster
E.g., customer satisfaction may be available for modeling but not available for potential customers
Best to reduce number of attributes when possible
Too many attributes minimize the impact of key variables
Identify highly correlated attributes for reduction
Combine several attributes into one: e.g., debt/asset ratio
18
4.2.5 Reasons to Choose and Cautions Object attributes: scatterplot matrix for seven attributes
19
4.2.5 Reasons to Choose and Cautions Units of Measure
K-means algorithm will identify different clusters depending on the units of measure
k = 2
20
4.2.5 Reasons to Choose and Cautions Units of Measure
Age dominates
k = 2
21
4.2.5 Reasons to Choose and Cautions Rescaling
Rescaling can reduce domination effect
E.g., divide each variable by the appropriate standard deviation
Rescaled
attributes
k = 2
22
4.2.5 Reasons to Choose and Cautions Additional Considerations
K-means sensitive to starting seeds
Important to rerun with several seeds – R has the nstart option
Could explore distance metrics other than Euclidean
E.g., Manhattan, Mahalanobis, etc.
K-means is easily applied to numeric data and does not work well with nominal attributes
E.g., color
23
4.2.5 Additional Algorithms
K-modes clustering
kmod()
Partitioning around Medoids (PAM)
pam()
Hierarchical agglomerative clustering
hclust()
24
Summary
Clustering analysis groups similar objects based on the objects’ attributes
To use k-means properly, it is important to
Properly scale the attribute values to avoid domination
Assure the concept of distance between the assigned values of an attribute is meaningful
Carefully choose the number of clusters, k
Once the clusters are identified, it is often useful to label them in a descriptive way
25