Data Mining (Anomaly Detection)


Chapter 9A Problems

1. Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.

(a) Compare this definition to the standard model-based definition of an anomaly.

(b) For what sizes of data sets (small, medium, or large) is this definition appropriate?

2. In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.

(a) To which of the definitions of an anomaly in the first part of the video is this definition most closely related?

(b) Name two problems with this definition of an anomaly.
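To make the shell idea in Problem 2 concrete, the following is a minimal sketch of one possible reading of it: repeated convex-hull "peeling" of 2-D points, where shell 1 is the outermost hull. The function names (`convex_hull`, `shell_depths`), the restriction to two dimensions, and the choice of Andrew's monotone chain for the hull are all assumptions made for illustration; the problem statement does not prescribe a particular construction.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors OA and OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Hull vertices via Andrew's monotone chain.

    Points lying on a hull edge (collinear) are not reported as vertices.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def shell_depths(points: List[Point]) -> Dict[Point, int]:
    """Assign each distinct point a shell index; 1 is the outermost shell."""
    remaining = set(points)
    depth: Dict[Point, int] = {}
    level = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = level
        remaining -= set(hull)  # peel off the current shell
        level += 1
    return depth


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (4, 4), (0, 4),   # outer square
           (1, 1), (3, 1), (3, 3), (1, 3),   # inner square
           (2, 2)]                           # center
    for p, d in sorted(shell_depths(pts).items(), key=lambda kv: kv[1]):
        print(p, d)
```

Under this sketch, a point would be flagged as anomalous when its shell index is small. Note that every data set has an outermost shell, which is worth keeping in mind when answering part (b).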



3. Discuss techniques for combining multiple anomaly detection techniques to improve the identification of anomalous objects. Consider both supervised and unsupervised cases.

Chapter 9B Problems

1. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.

(a) For a set of 1,000,000 values, how likely are we to have outliers according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)


(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?
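As an aid for checking an answer to Problem 1, the sketch below computes the two-sided tail probability of a standard normal distribution and scales it to a data set of size n. It assumes the values are independent draws from a normal distribution, as the problem states; the function names are hypothetical and chosen only for this illustration.

```python
import math


def two_sided_tail_probability(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    # For a standard normal, P(|Z| > k) = erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2.0))


def expected_flagged(n: int, k: float = 3.0) -> float:
    """Expected number of values more than k standard deviations from the mean."""
    return n * two_sided_tail_probability(k)


def prob_at_least_one(n: int, k: float = 3.0) -> float:
    """Probability that at least one of n independent values is flagged."""
    p = two_sided_tail_probability(k)
    # log1p/expm1 keep the computation stable for small p and large n.
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    n = 1_000_000
    print("tail probability:", two_sided_tail_probability(3.0))
    print("expected flagged:", expected_flagged(n))
    print("P(at least one): ", prob_at_least_one(n))
```

Varying `k` and `n` in this sketch is one way to explore part (b), for example by looking for the threshold at which the expected number of flagged values falls below one.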


2. Consider the (relative distance) K-means scheme for outlier detection described in Section 10.5 and the accompanying figure.

(a) The points at the bottom of the compact cluster shown in the figure have a somewhat higher outlier score than those points at the top of the compact cluster. Why?


(b) Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?


(c) The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.
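For experimenting with Problem 2, the sketch below scores points by relative distance. It assumes that "relative distance" means the ratio of a point's distance from its closest centroid to the median distance of all points in that cluster from the centroid; check this against Section 10.5 before relying on it. The centroids are supplied by hand rather than fitted by K-means, and the function name and sample data are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def relative_distance_scores(points: Sequence[Point],
                             centroids: Sequence[Point]) -> List[float]:
    """Outlier score = distance to closest centroid / median distance in that cluster.

    Assumed definition of relative distance; centroids are given, not learned.
    """
    # Assign each point to its closest centroid.
    assign = [min(range(len(centroids)),
                  key=lambda j: math.dist(p, centroids[j]))
              for p in points]
    dists = [math.dist(p, centroids[a]) for p, a in zip(points, assign)]

    # Median distance to the centroid within each non-empty cluster.
    med = {j: median([d for d, a in zip(dists, assign) if a == j])
           for j in set(assign)}

    scores = []
    for d, a in zip(dists, assign):
        if med[a] > 0:
            scores.append(d / med[a])
        else:
            # Degenerate cluster: at least half its points sit on the centroid.
            scores.append(0.0 if d == 0 else math.inf)
    return scores


if __name__ == "__main__":
    centroids = [(0.0, 0.0), (10.0, 0.0)]
    compact = [(1, 0), (-1, 0), (0, 1), (0, -1), (3, 0)]
    loose = [(13, 0), (7, 0), (10, 3), (10, -3), (10, 9)]
    for p, s in zip(compact + loose,
                    relative_distance_scores(compact + loose, centroids)):
        print(p, round(s, 2))
```

In this made-up data, the point (3, 0) in the compact cluster and the point (10, 9) in the loose cluster receive the same score even though their absolute distances differ by a factor of three, which illustrates how the ratio adjusts for density.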