big data
Chapter 4: Clustering
I
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Applications of Cluster Analysis
Understanding
Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization
Reduce the size of large data sets
Clustering precipitation in Australia
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
What is not Cluster Analysis?
Simple segmentation
Dividing students into different registration groups alphabetically, by last name
Results of a query
Groupings are a result of an external specification
Clustering is a grouping of objects based on the data
Supervised classification
Have class label information
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Notion of a Cluster can be Ambiguous
How many clusters?
Four Clusters
Two Clusters
Six Clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Partitional Clustering
Original Points
A Partitional Clustering
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Hierarchical Clustering
Traditional Hierarchical Clustering
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram
Traditional Dendrogram
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple clusters.
Can represent multiple classes or ‘border’ points
Fuzzy versus non-fuzzy
In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
Weights must sum to 1
Probabilistic clustering has similar characteristics
Partial versus complete
In some cases, we only want to cluster some of the data
Heterogeneous versus homogeneous
Clusters of widely different sizes, shapes, and densities
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Well-Separated
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
3 well-separated clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is closer to the “center” of a cluster, than to the center of any other cluster
The center of a cluster is a centroid.
The average of all the points in the cluster is a medoid
4 center-based clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
A point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.
.
2 Overlapping Circles
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function. (NP Hard)
Can have global or local objectives.
Hierarchical clustering algorithms typically have local objectives
Partitional algorithms typically have global objectives
A variation of the global objective function approach is to fit the data to a parameterized model.
Parameters for the model are determined from the data.
Mixture models assume that the data is a ‘mixture' of a number of statistical distributions.
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Characteristics of the Input Data Are Important
Type of proximity or density measure
Central to clustering
Depends on data and application
Data characteristics that affect proximity and/or density are
Dimensionality
Sparseness
Attribute type
Special relationships in the data
For example, autocorrelation
Distribution of the data
Noise and Outliers
Often interfere with the operation of the clustering algorithm
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
K-means Clustering
Partitional clustering approach
Number of clusters, K, must be specified
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
The basic algorithm is very simple
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Example of K-means Clustering
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Example of K-means Clustering
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest cluster
To get SSE, we square these errors and sum them.
x is a data point in cluster Ci and mi is the representative point for cluster Ci
can show that mi corresponds to the center (mean) of the cluster
Given two sets of clusters, we prefer the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Two different K-means Clusterings
Sub-optimal Clustering
Optimal Clustering
Original Points
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Limitations of K-means
K-means has problems when clusters are of differing
Sizes
Densities
Non-globular Shapes
K-means has problems when the data contains outliers/noise.
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Limitations of K-means: Differing Sizes
Original Points
K-means (3 Clusters)
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Limitations of K-means: Differing Density
Original Points
K-means (3 Clusters)
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters.
Find parts of clusters, but need to put together.
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Overcoming K-means Limitations
Original Points K-means Clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Overcoming K-means Limitations
Original Points K-means Clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Importance of Choosing Initial Centroids
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Importance of Choosing Initial Centroids
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Importance of Choosing Initial Centroids …
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Importance of Choosing Initial Centroids …
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Problems with Selecting Initial Points
If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
Chance is relatively small when K is large
If clusters are the same size, n, then
For example, if K = 10, then probability = 10!/1010 = 0.00036
Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t
Consider an example of five pairs of clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Solutions to Initial Centroids Problem
Multiple runs
Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial centroids
Select more than k initial centroids and then select among these initial centroids
Select most widely separated
Generate a larger number of clusters and then perform a hierarchical clustering
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
A tree like diagram that records the sequences of merges or splits
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
They may correspond to meaningful taxonomies
Example in biological sciences
02/14/2018 Introduction to Data Mining, 2nd Edition ‹#›
Discovered Clusters Industry Group
1
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Technology1-DOWN
2
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Technology2-DOWN
3
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN
Financial-DOWN
4
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Oil-UP
|
|
Discovered Clusters |
Industry Group |
|
1 |
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN, Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN, Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, Sun-DOWN |
Technology1-DOWN |
|
2 |
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, ADV-Micro-Device-DOWN,Andrew-Corp-DOWN, Computer-Assoc-DOWN,Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN |
Technology2-DOWN |
|
3 |
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,Morgan-Stanley-DOWN |
Financial-DOWN |
|
4 |
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Schlumberger-UP |
Oil-UP |
p4
p1
p3
p
2
p4
p1
p3
p
2
p4
p1
p2
p3
p4
p1
p2
p3
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 1
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 3
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 4
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 5
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 6
å
å
=
Î
=
K
i
C
x
i
i
x
m
dist
SSE
1
2
)
,
(
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 1
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 3
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 4
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 5
0
5
10
15
20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0
5
10
15
20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0
5
10
15
20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0
5
10
15
20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
1
2
3
4
5
6
1
2
3
4
5
1
3
2
5
4
6
0
0.05
0.1
0.15
0.2