Anomaly Detection


Homework 9

Answer the following questions (10 points each):

1. Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.

a. Compare this definition to the standard model-based definition of an anomaly.

b. For what sizes of data sets (small, medium, or large) is this definition appropriate?

2. In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.

a. To which of the definitions of an anomaly in Section 9.2 is this definition most closely related?

b. Name two problems with this definition of an anomaly.
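
For reference, the shell construction described in question 2 can be sketched in two dimensions as repeated convex hull peeling. This is only a minimal sketch, assuming distinct 2-D points given as tuples; the helper names convex_hull and shell_depths are hypothetical, and points that lie exactly on a hull edge are left for a later shell rather than the current one.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def shell_depths(points):
    """Map each point to its shell number; shell 1 is the outermost hull."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)  # peel this shell off and repeat
        layer += 1
    return depth


# Illustrative data (an assumption): two nested squares and a center point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
for point, shell in sorted(shell_depths(points).items(), key=lambda kv: kv[1]):
    print(point, "shell", shell)
```

Under this scheme a point in a low-numbered shell would be flagged as an anomaly; how many outer shells to flag is a separate choice that the sketch does not make.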

3. Consider the (relative distance) K-means scheme for outlier detection described in Section 9.5 and the accompanying figure, Figure 9.10.

a. The points at the bottom of the compact cluster shown in Figure 9.10 have a somewhat higher outlier score than the points at the top of the compact cluster. Why?

b. Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?

c. The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.
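
As a reminder of the mechanics behind question 3, the relative distance score can be sketched as follows. This is a hedged illustration, not the textbook's code: it assumes that relative distance means a point's distance from its closest centroid divided by the median distance of that cluster's points from the centroid, and that the centroids have already been found by K-means. The function name relative_distance_scores and the sample data are hypothetical.

```python
import math
from statistics import median


def relative_distance_scores(points, centroids):
    """Distance to the closest centroid divided by that cluster's median distance."""
    # Assign each point to its closest centroid and record the distance.
    assigned = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        k = min(range(len(centroids)), key=dists.__getitem__)
        assigned.append((k, dists[k]))

    # Median distance to the centroid within each non-empty cluster.
    med = {}
    for k in {k for k, _ in assigned}:
        med[k] = median([d for kk, d in assigned if kk == k])

    scores = []
    for k, d in assigned:
        if med[k] > 0:
            scores.append(d / med[k])
        else:
            # Degenerate cluster whose median distance is zero.
            scores.append(0.0 if d == 0 else math.inf)
    return scores


# Assumed example: a compact cluster near (0, 0) and a loose one near (100, 0).
compact = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 3)]
loose = [(110, 0), (90, 0), (100, 10), (100, -10), (100, 30)]
centroids = [(0, 0), (100, 0)]  # assumed to come from a prior K-means run

for p, s in zip(compact + loose, relative_distance_scores(compact + loose, centroids)):
    print(p, s)
```

In this sample the points (0, 3) and (100, 30) receive the same relative score even though their absolute distances from their centroids differ by a factor of ten, which is the density adjustment that part (c) asks about.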

4. Compare the following two measures of the extent to which an object belongs to a cluster: (1) distance of an object from the centroid of its closest cluster and (2) the silhouette coefficient described in Section 7.5.2.
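
To make the two measures in question 4 concrete, both can be computed for a single point as sketched below. The sketch assumes Euclidean distance, distinct points, and the usual silhouette form s = (b - a) / max(a, b), where a is the mean distance to the other points in the object's own cluster and b is the smallest mean distance to the points of any other cluster. The function names and sample clusters are hypothetical.

```python
import math


def centroid_distance(p, cluster):
    """Measure (1): distance from p to the centroid (mean) of a cluster."""
    dim = len(p)
    centroid = tuple(sum(q[i] for q in cluster) / len(cluster) for i in range(dim))
    return math.dist(p, centroid)


def silhouette(p, own, others):
    """Measure (2): silhouette coefficient of p, a member of cluster `own`."""
    rest = [q for q in own if q != p]
    if not rest:
        return 0.0  # common convention for a singleton cluster
    # a: mean distance to the other points in p's own cluster.
    a = sum(math.dist(p, q) for q in rest) / len(rest)
    # b: smallest mean distance to the points of any other cluster.
    b = min(sum(math.dist(p, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)


# Assumed one-dimensional example with two well-separated clusters.
cluster_a = [(0.0,), (1.0,), (2.0,)]
cluster_b = [(10.0,), (11.0,), (12.0,)]

for p in cluster_a:
    print(p, centroid_distance(p, cluster_a), silhouette(p, cluster_a, [cluster_b]))
```

The centroid distance uses only the object's own cluster, while the silhouette coefficient also depends on how far away the nearest other cluster is.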

5. Consider a set of points that are uniformly distributed on the interval [0,1]. Is the statistical notion of an outlier as an infrequently observed value meaningful for this data?