Introduction to data mining
Measures of Similaritv and Dissimilaritv 65
the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.
The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if r is a variable, then the absolute standard deviation of r is given by oa : Dlrl*n - ltl, where ri is lhe 'ith value of the variable, rn is the number of objects, and. p, is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to define a standardi zation transformation.
2.4 Measures of Similarity and Dissimilarity
Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many cases) the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis.
We begin with a discussion of the basics: high-level definitions of similarity and dissimilarity, and a discussion of how they are related. For convenience, the term proximity is used to refer to either similarity or dissimilarity. Since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects, we first describe how to measure the proximity between objects having only one simple attribute, and then consider proximity measures for objects with multiple attributes. This in- cludes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data like documents. Next, we consider several important issues concerning proximity measures. The section concludes with a brief discussion of how to select the right proximity measure.
2.4
66 Chapter 2 Data
2.4.L Basics
Definitions
Informally, the similarity between two objects is a numerical measure of the
degree to which the two objects are alike. Consequently, similarities are hi,gher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).
The dissimilarity between two objects is a numerical measure of the de- gree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. FYequently, the term distance is used as a synonym for dissimilarity, although, as we shall see, distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval
[0,1], but it is also common for them to range from 0 to oo.
TYansformations
Thansformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular
range, such as [0,1]. For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to only work with dissimilarities, or it may only work with similarities in the interval 10,1]. We discuss these issues here because we will employ such transformations later in our discussion of proximity. In addi- tion, these issues are relatively independent of the details of specific proximity
measures. Frequently, proximity measures, especially similarities, are defined or trans-
formed to have values in the interval [0,1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often rela- tively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range 10, 1] by using the transformation s' : (s - 1)/9, where s and s/ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0,1] is given by the expression
"' : (" - mi,n-s) I (mar -s - mi,n-s) , where mar -s and m'in-s are the
maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0,1] by using the formula d' : (d - rni,n-d)l(mar-d - mi,n-d).
There can be various complications in mapping proximity measures to the interval 10,1], however. If, for example, the proximity measure originally takes
Measures of Similaritv and Dissimilaritv 67
values in the interval [0,-], then a non-linear transformation is needed and values will not have the same relationship to one another on the new scale. Consider the transformation d,' : dl(I* d) for a dissimilarity measure that ranges from 0 to oo. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether or not this is desirable depends on the application. Another complication is that the meaning of the proximity measure may be changed. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [-1,1]. Mapping these values to the interval [0,1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 94.
Tlansforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a non-linear scale. If the similarity (or dissimilar- ity) falls in the interval [0,1], then the dissimilarity can be defined as d : 1- s (s : 1 - d). Another simple approach is to define similarity as the nega- tive of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, -1, -10, and -100,
respectively. The similarities resulting from the negation transformation are not re-
stricted to the range [0,1], but if that is desired, then transformations such as
" : ;{1, s : e-d, or s : 1- *mZ can be used. For the transformation
s : 7fi, the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s : e-d, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s: 1- *ffin they become 1.00, 0.99, 0.00, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 94.
In general, any monotonic decreasing function can be used to convert dis- similarities to similarities, or vice versa. Of course, other factors also must be considered when transforming similarities to dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new scale. We have mentioned issues related to preserving meaning, distortion of scale, and requirements of data analysis tools, but this list is certainly not exhaustive.
2.4.2 Similarity and Dissimilarity between Simple Attributes
The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss
2.4
68 Chapter 2 Data
proximity between objects having a single attribute. Consider objects de-
scribed by one nominal attribute. What would it mean for two such objects
to be similar? Since nominal attributes only convey information about the
distinctness of objects, all we can say is that two objects either have the same value or they do not. Hence, in this case similarity is traditionally defined as 1
if attribute values match, and as 0 otherwise. A dissimilarity would be defined
in the opposite way: 0 if the attribute values match, and 1 if they do not. For objects with a single ordinal attribute, the situation is more compli-
cated because information about order should be taken into account. Consider an attribute that measures the quality of a product, €.8., a candy bar, on the
scale {poor, fai,r, OK, good, wonderful).It would seem reasonable that a prod-
uct, P1, which is rated wonderful, would be closer to a produclP2, which is
rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped
to successive integers, beginning at 0 or 1, e.g., {poor:O, fair:|, OK:2, good:3, wonderful:4). Then, d(Pl,P2) - 3 - 2 : 7 or, if we want the dis-
similarity to fall between 0 and 1, d(P1, P2) : TZ :0.25. A similarity for ordinal attributes can then be defined as s : 7 - d.
This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals, and this is not
so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values
OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.
For interval or ratio attributes, the natural measure of dissimilarity be- tween two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am
ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to oo, rather than from 0 to 1. The similarity of interval or ratio at- tributes is typically expressed by transforming a similarity into a dissimilarity, as previously described.
Table 2.7 summarizes this discussion. In this table, r and g are two objects that have one attribute of the indicated type. AIso, d(r,a) and s(r,gr) are the dissimilarity and similarity between r and g, respectively. Other approaches are possible; these are the most common ones.
The following two sections consider more complicated measures of prox-
imity between objects that involve multiple attributes: (1) dissimilarities be- tween data objects and (2) similarities between data objects. This division
2.4 Measures of Similaritv and Dissimilaritv 69
allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.
2.4.3 Dissimilarities between Data Objects
In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.
Distances
We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher- dimensional space, is given by the following familiar formula:
d(*, y) : n
\ - r \ t ) \ tn
- ak ) ' , K = l
(2 .1 )
where n is the number of dimensions and rp and.Ak are) respectively,the kth attributes (components) of r and g. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the e and gr coordinates of these points, and the distance matrix containing the pairwise distances of these points.
Table 2.7 . Similarity and dissimilarity for simple attributes
Attribute T'ype
Dissimilarity Similaritv
Nominal ) _11, - 0 1 f r : A I i f r l y s :
I \ f r : y 0 i f n l y
Ordinal d: l r -a l l ( " - t ) (values mapped to integers 0 to n-1 where n is the number of values)
s : I - d
Interval or Ratio d : l r - a l s : - d , s : ; i , s : e - o ,' L t a ^ - 1 d -min-d-
70 Chapter 2 Data
The Euclidean distance measure given the Minkowski distance metric shown in
in Equation 2.1 is generalized Equation 2.2,
by
d(x, y) : ( , , \
where r is a parameter. The following are the three most common examples of Minkowski distances.
. r :1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.
o r :2. Euclidean distance (L2 norm).
. r : oo. Supremum (L*o, or L- norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L- distance is defined by Equation 2.3
(2 .3)
(Lor-orr)''"
J* (U'wr-rrt')"'d(*, Y) :
The r parameter should not be confused with the number of dimensions (at- tributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: I,2,3,..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and Loo distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.
Distances, such as the Euclidean distance, have some well-known proper- ties. If d(*, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, x) > 0 for all x and y,
(b) d(x, Y) : 0 onlY if x : Y.
Measures of Similaritv and Dissimilaritv 7L
Figure 2.15. Four two-dimensional points.
Tabfe 2.8. r and y coordinates of four points. Table 2.9. Euclidean distance matrix for Table 2.g. point z coordinate y coordinate
p1 0 2 p2 2 0 p3 .) 1 p4 1
Table 2.10. L1 distance matrix for Table 2.8. Table 2,11. L- distance matrix for Table 2.8.
2. Symmetry d(*,Y) : d(Y,x) for al l x and Y.
3. T[iangle Inequality d(x,z) < d(*, y) + d(y, z) for all points x, y,, and z.
Measures that satisfy all three properties are known as metrics. Some people only use the term distance for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. AIso, if the triangle in- equality holds, then this property can be used to increase the efficiency of tech- niques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. We give two examples of such measures.
2.4
1 2 3 4 5 6 X
p l p2 p3 p4 pl 0.0 2 .8 3 .2 5 . 1 p2 2.8 0.0 t .4 3 .2 p3 3.2 L .4 0.0 2.0 p4 c . r 3.2 2 .0 0 .0
L1 p1 p2 p3 p4 p l 0.0 4 .0 4.0 6.0 p2 4.0 0.0 2.0 4.0 p3 4.0 2 .0 0.0 2 .0 p4 6.0 4.0 2 .0 0.0
L- p l p2 p3 p4 p1 0.0 2 .0 3.0 5.0 p2 2.0 0 .0 1 .0 3.0 p3 3.0 1 .0 0.0 2.0 p4 5.0 3 .0 2.0 0 .0
72 Chapter 2 Data
Example 2.L4 (Non-metric Dissimilarities: Set Differences). This ex- ample is based on the notion of the difference of two sets, as defined in set theory. Given two sets -4 and B, A - B is the set of elements of A that are not in B . For example , i f A : { I ,2 ,3 ,4 } and B : {2 ,3 ,4 } , then A- B : {1 } and B - A - 0, the empty set. We can define the distance d between two sets A and B as d(A,B): size(A- B), where s'ize ts a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the pos- itivity property the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as fol lows: d(A,B): s ize(A- B) + si ,ze(B - A). See Exercise 21 on page 94. r
Example 2.15 (Non-metric Dissimilarities: Time). This example gives
a more everyday example of a dissimilarity measure that is not a metric, but that is still useful. Define a measure of the distance between times of the day as follows:
d,(t1,t2) : {7^ *t,az _ tr) l i l l i I:} To illustrate, d(lPM, 2PM) : t hour, while d(2PM, 1PM) : 23 hours.
Such a definition would make sense, for example, when answering the question:
"ff an event occurs at lPM every day, and it is now 2PM, how Iong do I have to wait for that event to occur again?"
2.4.4 Similarities between Data Objects
For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:
1 . s (x ,y ) : 1 on ly i f x : y . (0 < s S 1)
2. s(x,y) : s(y, x) for all x and y. (Symmetry)
There is no general analog of the triangle inequality for similarity mea- sures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific sim- ilarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.
(2.4)
Measures of Similarity and 73
Example 2.L6 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. For instance, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that 'o' appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. If we take these counts as a measure of the similarity between two characters, then we have a similarity measure, but one that is not symmetric. In such situations, the similarity measure is often made symmetric by setting s'(x, y) : s'(y, x) : (s(x, y)+ s(y,
"D 12, where s/ indicates the new similarity
measure,
2.4 .5
This section provides specific examples of some similarity and dissimilarity measures.
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefificient is better than another in specific instances.
Let x and y be two objects that consist of n binary attributes. The com- parison of two such objects, i.e., two binary vectors, Ieads to the following four quantities (frequencies) :
.foo : the number of attributes where x is 0 and y is 0
.for : the number of attributes where x is 0 and y is 1
,fio : the number of attributes where x is 1 and y is 0
"fir : the number of attributes where x is 1 and y is 1
Simple Matching Coefficient One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as
number of matching attribute values ft * fss
2.4
Examples of Proximity Measures
S M C : number of attributes for -l fn * ,frr * ,foo'
(2.5)
74 Chapter 2 Data
This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.
Jaccard Coefficient Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indi- cates that the item was purchased, while a 0 indicates that the product was not purchased. Sincb the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmet- ric binary attributes. The Jaccard coefficient, which is often symbolized by ,./,'is given by the following equation:
J : number of matching presences (2.6)
number of attributes not involved in 00 matches fot * fn -t ft
Example 2.17 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and -I for the following two binary vectors.
x : (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) y : ( 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 1 )
for :2 the number of attributes where x was 0 and y was 1
frc: I the number of attributes where x was 1 and y was 0
foo :7 the number of attributes where x was 0 and y was 0
/rr : 0 the number of attributes where x was 1 and y was 1
S M C : - ; - , = - - ; t ' - u . l
J : -T - ,+ - - - - ; - : - -4 - :g J O I T J I O T J I I Z T I T U
Cosine Similarity
Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. It is more complicated than this, of course, since certain common words are ig-
J 7 I
Measures of Similaritv and Dissimilaritv 75
nored and various processing techniques are used to account for different forms of the same word, differing document lengths, and different word frequencies.
Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few non-zero attributes. (The normalizations used for documents do not create a non-zero entry where there was azero entry; i.e., they preserve sparsity.) Thus, as with transaction data, similarity should not depend on the number of shared 0 values since any two documents are likely to "not contain" many of the same words, and therefore, if 0-0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignores 0-0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measure of document similarity. If x and y are two document vectors, then
cos(x,y) : ffi,
2.4
( , 7 \
where . indicates the vector dot product, x .y : D[:trplp, and llxll is the
length of vector x, ll*ll : 1f D|:rr2r: 1/x4.
Example 2.18 (Cosine Similarity of Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:
* : ( 3 , 2 , 0 , 5 , 0 , 0 , 0 , 2 , 0 , 0 ) y : ( 1 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 2 )
x . y : 3 i . 1 * 2 x 0 * 0 * 0 * 5 x 0 * 0 x 0 f 0 * 0 * 0 * 0 * 2 * l * 0 x 0 * 0 x 2 : 5
l l x l l : roxo *oxo *2x2*oxo *oxo :6 .48 l l y l l : :2 .24 cos(x, y) : 0.31
I
As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0o, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90o, and they do not share any terms (words).
76 Chapter 2 Data
Figule 2.16. Geometric illustration of the cosine measure.
Equation 2.7 can be written as Equation 2.8.
cos(x ,y ) : + . , ,Y , , : x ' . y ' , l l x l l l l v l l
(2.8)
where x' : x/llxll and y/ : V lllyll. Dividing x and y by their lengths normal- izes them to have a length of 1. This means that cosine similarity does not take the magnitude ofthe two data objects into account when computing similarity. (Euclidean distance might be a better choice when magnitude is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.
Extended Jaccard Coefficient (Tanimoto Coefficient)
The extended Jaccard coefficient can be used for document data and that re- duces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto coefficient. (However, there is another coefficient that is also known as the Tanimoto coefficient.) This co- efficient, which we shall represent as E J , is defined by the following equation:
EJ(x , y ) : x . y (2.e)
l l " l l '+ l lv l l ' - * .v
Correlation
The correlation between two data objects that have binary or continuous vari- ables is a measure of the linear relationship between the attributes of the objects. (The calculation of correlation between attributes, which is more common, can be defined similarly.) More precisely, Pearson's correlation
Measures of Similaritv and Dissimilaritv 77
coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) : covariance(x, y) StA
(2 .10) standard-deviation(x) xstandard-deviation(y) tr ta'
where we are using the following standard statistical notation and definitions:
2.4
1 n
covariance(x,y) : s,s: :+ I("u -z)(yx -9)n - r - . E : L
(2 .11)
standard-deviation(x) : su :
standard-deviation(y) : su:
1 f l
i : I ) - r * i s t h e m e a n o f x n 4
k :1
1 f l
g : 1 f y * i s t h e m e a n o f y h : 1
Example 2.19 (Perfect Correlation). Correlation is always in the range -1 to 1. A correlation of 1 (-1) means that x and y have a perfect positive (negative) linear relationship; that is, 16 - aAx -f b, where a, and b are con- stants. The following two sets of values for x and y indicate cases where the correlation is -1 and *1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.
x : ( -3 , 6 , 0 , 3 , -6 )
y : ( 1 , - 2 , 0 , - 7 , 2 )
x : ( 3 , 6 , 0 , 3 , 6 ) y : ( 1 , 2 , 0 , L , 2 )
I
fil@n-n)'
; \ l@*-vt '
78 Chapter 2 Data
-1.00 -.0.90 4.70 --0.60
0.300.200.10
1.000.80
-o.10
Figure 2.17. Scatter plots illustrating correlations from -1 to 1.
Example 2.20 (Non-linear Relationships). If the correlation is 0, then there is no linear relationship between the attributes of the two data objects. However, non-linear relationships may still exist. In the following example, n*: A7, but their correlation is 0.
* : ( - 3 , - 2 , - 1 , 0 , I , 2 , 3 ) Y : (9 , 4 ,1 ,0 ,1 ,4 ,9 )
I
Example 2.21 (Visualizing Correlation). It is also easy to judge the cor- relation between two data objects x and y by plotting pairs of corresponding attribute values. Figure 2.17 shows a number of these plots when x and y have 30 attributes and the values of these attributes are randomly generated (with a normal distribution) so that the correlation of x and y ranges from -1
to 1. Each circle in a plot represents one of the 30 attributes; its r coordinate is the value of one of the attributes for x, while its 3r coordinate is the value of the same attribute for y. I
If we transform x and y by subtracting off their means and then normaliz- ing them so that their lengths are 1, then their correlation can be calculated by
Measures of Similaritv and Dissimilaritv 79
taking the dot product. Notice that this is not the same as the standardization used in other contexts, where we make the transformations, r'* : (rp - ,) lt" and y'r : (A* - T) I sa.
Bregman Divergence* This section provides a brief description of Breg- man divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data min- ing algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section
8.2). Note that this section requires knowledge of vector calculus. Bregman divergences are loss or distortion functions. To understand the
idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated, for example, by adding random noise to y. The goal is to measure the resulting distortion or Ioss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.
More formally, we have the following definition.
Definition 2.6 (Bregman Divergence). Given a strictly convex function
@ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:
D(*,y) : d(x) - Q0) - (Vd(v), (* - y)) (2.r2)
where Vd(V) is the gradient of / evaluated at y, x - y, is the vector difference between x and y, and (Vd(V), (" - V)) is the inner product between Vd(*) and (x-y). For points in Euclidean space, the inner product is just the dot product.
D(*,y) can be wri t ten as D(x,V) : d(x) - L(*) , where.L(x) : d$)+ (Vd(V), (* - V)) and represents the equation of a plane that is tangent to the function Q at y. Using calculus terminology, L(x) is the linearization of. Q around the point y and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for S.
Example 2.22. We provide a concrete example using squared Euclidean dis- tance, but restrict ourselves to one dimension to simplify the mathematics. Let
2.4
80 Chapter 2 Data
r and y be real numbers and /(t) be the real valued function, d(t) : t2. In that case, the gradient reduces to the derivative and the dot product reduces to multiplication. Specifically, Equation 2.L2 becomes Equation 2.13.
D(*, i l : 12 - A2 - 2A@ - A) : @ - i l ' (2.13)
The graph for this example, with 3r : 1, is shown in Figure 2.18. The Bregman divergence is shown for two values of r: r :2 and r :3. r
Figure 2.18. lllustration of Bregman divergence.
2.4.6 fssues in Proximity Calculation
This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, (3) and how to handle proximity calculation when attributes have different weights; i.e., when not all attributes contribute equally to the proximity of objects.
- 1 0 1 2 3 4 x
2.4 Measures of Similarity and Dissimilarity 81
Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") Earlier, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.
A related issue is how to compute distance when there is correlation be- tween some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (dif- ferent variances), and the distribution of the data is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as
mahalanobis(x,y) : (x - y)>-1(x -y)r, (2.14)
where E-1 is the inverse of the covariance matrix of the data. Note that the covariance matrix E is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.II.
Example 2.23. In Figure 2.19, there are 1000 points, whose r and g at- tributes have a correlation of 0.6. The distance between the two large points
at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. In practice, com- puting the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.
I
Combining Similarities for fleterogeneous Attributes
The previous definitions of similarity were based on approaches that assumed all the attributes were of the same type. A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity between each attribute separately using Table 2.7, and then combine these similarities using a method that results in a similarity between 0 and 1. Typically, the overall similarity is defined as the average of all the individual attribute similarities.
82 Chapter 2 Data
Figure 2.19. Set of two-dimensional points. The Mahalanobis distance between the two points repre- sented by large dots is 6;their Euclidean distance is 14.7.
Unfortunately, this approach does not work well if some of the attributes are asymmetric attributes. For example, if all the attributes are asymmetric binary attributes, then the similarity measure suggested previously reduces to the simple matching coefficient, a measure that is not appropriate for asym- metric binary attributes. The easiest way to fix this problem is to omit asym- metric attributes from the similarity calculation when their values are 0 for both of the objects whose similarity is being computed. A similar approach also works well for handling missing values.
In summary, Algorithm 2.7 is effective for computing an overall similar- ity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.
Using Weights
In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more im- portant to the definition of proximity than others. To address these situations,
2.4 Measures of Similaritv and Dissimilaritv 83
Algorithm 2.1 Similarities of heterogeneous objects.
1-: 2 :
For the kth attribute, compute a similarity, s,r(x,y), in the range [0, 1]. Define an indicator variable, d'7., for the kth attribute as follows:
3: Compute the overall similarity between the two objects using the following for- mula:
similarity(x,") :$ffi (2.15)
the formulas for proximity can each attribute.
If the weights u.r6 sum to L,
be modified by weighting the contribution of
bhen (2.15) becomes
(2 .16)
The definition of the Minkowski distance can also be modified as follows:
d(x, y) : (2.r7)
2.4.7 Selecting the Right Proximity Measure
The following are a few general observations that may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are of- ten used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although at- tributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier.
For sparse data, which ofben consists of asymmetric attributes, we typi- cally employ similarity measures that ignore 0-0 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of charac- teristics they both lack. More specifically, for sparse, asymmetric data, most
similarity(x, y) : D?:tl|lns !(*' Y) DT:'6x
/ n \ 1 / '
( D,'ol"r - akl'I \ t : r /
84 Chapter 2 Data
objects have only a few of the characteristics described by the attributes, and thus, are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.
There are other characteristics of data vectors that may need to be consid- ered. Suppose, for example, that we are interested in comparing time series. If the magnitude of the time series is important (for example, each time series represent total sales of the same organization for a different year), then we could use Euclidean distance. If the time series represent different quantities (for example, blood pressure and oxygen consumption), then we usually want to determine if the time series have the same shape, not the same magnitude. Correlation, which uses a built-in normalization that accounts for differences in magnitude and level, would be more appropriate.
In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure since such transformations are not always present in proximity measures. For instance, time series may have trends or periodic patterns that significantly impact similarity. Also, a proper computation of similarity may require that time lags be taken into account. Finally, two time series may only be similar over specific periods of time. For example, there is a strong relationship between temperature and the use of natural gas, but only during the heating season.
Practical consideration can also be important. Sometimes, a one or more proximity measures are already in use in a particular field, and thus, others will have answered the question of which proximity measures should be used. Other times, the software package or clustering algorithm being used may drastically limit the choices. If efficiency is a concern, then we may want to choose a proximity measure that has a property, such as the triangle inequality, that can be used to reduce the number of proximity calculations. (See Exercise 25.)
However, if common practice or practical restrictions do not dictate a choice, then the proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.
2.5 Bibliographic Notes
It is essential to understand the nature of the data that is being analyzed, and at a fundamental level, this is the subject of measurement theory. In