Data Mining Homework

Homework 9_1

Answer the following questions (10 points each):

1. Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.

a. Compare this definition to that of the standard model-based definition of an anomaly.

b. For what sizes of data sets (small, medium, or large) is this definition appropriate?
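As a way to picture the influence-based definition in question 1, the following minimal sketch treats the sample mean as the "data model" and scores each object by how far the mean moves when that object is left out. The choice of the mean as the model, the function name leave_one_out_influence, and the sample values are all assumptions made for illustration; they do not come from the assignment.

```python
from statistics import mean

def leave_one_out_influence(values):
    """Score each value by how much the mean changes when it is removed.

    Assumption: the 'model' is just the sample mean; any fitted model
    could be substituted. Requires at least two values.
    """
    full = mean(values)
    return [
        abs(full - mean(values[:i] + values[i + 1:]))
        for i in range(len(values))
    ]

# Hypothetical data: four similar values and one extreme value.
data = [10.0, 11.0, 9.0, 10.0, 50.0]
influence = leave_one_out_influence(data)
most_influential = influence.index(max(influence))
```

Rerunning the sketch with many more typical values added to the list shows how the influence of any single object changes with the size of the data set, which is the issue part b asks about.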

2. In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.

a. To which of the definitions of an anomaly in Section 9.2 is this definition most closely related?

b. Name two problems with this definition of an anomaly.
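The shell idea in question 2 can be sketched in two dimensions as repeated convex hull "peeling": compute the hull, assign its vertices to the current shell, remove them, and repeat. This is only one possible reading of the description, using Andrew's monotone chain algorithm for the hull; the function names and the sample points are hypothetical. Points that lie on a hull edge without being a vertex are left for the next shell in this sketch.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return hull vertices of 2-D points via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def shell_depths(points):
    """Map each point to its shell number (1 = outermost shell)."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth

# Hypothetical data: an outer square, an inner square, and a center point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
depth = shell_depths(points)
```

Under this scheme an object with a small shell number would be flagged as an anomaly. Note that every data set has an outermost shell, however tightly grouped its points are.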

3. Consider the (relative distance) K-means scheme for outlier detection described in Section 9.5 and the accompanying figure, Figure 9.10.

a. The points at the bottom of the compact cluster shown in Figure 9.10 have a somewhat higher outlier score than those points at the top of the compact cluster. Why?

b. Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?

c. The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.
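For question 3, a relative-distance outlier score can be sketched as follows, assuming that the relative distance of a point is its distance to the nearest centroid divided by the median distance of all points in that cluster to the same centroid. The centroids are supplied directly rather than fitted by K-means, and the function name and sample points are hypothetical, so this is an illustration of the scoring step only.

```python
import math
from statistics import median

def relative_distance_scores(points, centroids):
    """Score each point by distance to its centroid over the cluster median.

    Assumption: centroids are given (e.g., from a prior K-means run).
    """
    # Assign each point to its nearest centroid.
    assigned = [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]
    dists = [math.dist(p, centroids[a]) for p, a in zip(points, assigned)]
    # Median distance to the centroid within each cluster.
    med = {
        j: median(d for d, a in zip(dists, assigned) if a == j)
        for j in set(assigned)
    }
    # Guard against a zero median (all points sitting on the centroid).
    return [d / med[a] if med[a] > 0 else 0.0 for d, a in zip(dists, assigned)]

# Hypothetical data: a compact cluster near (0, 0) and a loose one near (10, 10).
points = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.0, 0.5),
          (11.0, 10.0), (9.0, 10.0), (10.0, 11.0), (10.0, 9.0), (10.0, 15.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
scores = relative_distance_scores(points, centroids)
```

In this example the point (0.0, 0.5) and the point (10.0, 15.0) receive the same score even though their absolute distances to their centroids differ by a factor of ten, which shows how the ratio adjusts for cluster density.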
