chap9_anomaly_detection.pptx

Anomaly/Outlier Detection

What are anomalies/outliers?

The set of data points that are considerably different than the remainder of the data

Natural implication is that anomalies are relatively rare

One in a thousand occurs often if you have lots of data

Context is important, e.g., freezing temps in July

Can be important or a nuisance

10 foot tall 2 year old

Unusually high blood pressure

9/29/2019

Introduction to Data Mining, 2nd Edition


Importance of Anomaly Detection

Ozone Depletion History

In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html


Causes of Anomalies

Data from different classes

Measuring the weights of oranges, but a few grapefruit are mixed in

Natural variation

Unusually tall people

Data errors

200 pound 2 year old


Distinction Between Noise and Anomalies

Noise is erroneous, perhaps random, values or contaminating objects

Weight recorded incorrectly

Grapefruit mixed in with the oranges

Noise doesn’t necessarily produce unusual values or objects

Noise is not interesting

Anomalies may be interesting if they are not a result of noise

Noise and anomalies are related but distinct concepts


General Issues: Number of Attributes

Many anomalies are defined in terms of a single attribute

Height

Shape

Color

Can be hard to find an anomaly using all attributes

Noisy or irrelevant attributes

Object is only anomalous with respect to some attributes

However, an object may not be anomalous in any one attribute


General Issues: Anomaly Scoring

Many anomaly detection techniques provide only a binary categorization

An object is an anomaly or it isn’t

This is especially true of classification-based approaches

Other approaches assign a score to all points

This score measures the degree to which an object is an anomaly

This allows objects to be ranked

In the end, you often need a binary decision

Should this credit card transaction be flagged?

Still useful to have a score

How many anomalies are there?


Other Issues for Anomaly Detection

Find all anomalies at once or one at a time

Swamping

Masking

Evaluation

How do you measure performance?

Supervised vs. unsupervised situations

Efficiency

Context

Professional basketball team


Variants of Anomaly Detection Problems

Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t

Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores

Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
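
The three problem variants differ only in how scores are turned into an answer. The following is a minimal sketch, assuming a toy anomaly score (distance from the mean) and a made-up data set; the function names and the data are illustrative and not from the original slides.

```python
def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def distance_from_mean(x, data):
    """Toy score (an assumption of this sketch): distance of x from the mean of data."""
    return abs(x - sum(data) / len(data))


# Hypothetical one-dimensional data set D with one unusual value.
D = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
scores = [distance_from_mean(x, D) for x in D]

flagged = above_threshold(scores, t=5.0)   # variant 1: threshold on the score
ranked = top_n(scores, n=2)                # variant 2: top-n ranking
test_score = distance_from_mean(30.0, D)   # variant 3: score a new test point against D
```

Any of the scoring functions discussed later (statistical, distance-based, density-based, clustering-based) could stand in for the toy score here.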


Model-Based Anomaly Detection

Build a model for the data and see how each point relates to it

Unsupervised

Anomalies are those points that don’t fit well

Anomalies are those points that distort the model

Examples:

Statistical distribution

Clusters

Regression

Geometric

Graph

Supervised

Anomalies are regarded as a rare class

Need to have training data


Additional Anomaly Detection Techniques

Proximity-based

Anomalies are points far away from other points

Can detect this graphically in some cases

Density-based

Low density points are outliers

Pattern matching

Create profiles or templates of atypical but important events or objects

Algorithms to detect these patterns are usually simple and efficient


Visual Approaches

Boxplots or scatter plots

Limitations

Not automatic

Subjective


Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.

Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)

Apply a statistical test that depends on

Data distribution

Parameters of distribution (e.g., mean, variance)

Number of expected outliers (confidence limit)

Issues

Identifying the distribution of a data set

Heavy tailed distribution

Number of attributes

Is the data a mixture of distributions?
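
As a rough illustration of the probabilistic definition, the sketch below fits a one-dimensional normal distribution to the data and scores each value by its negative log density, so low-probability values get high scores. The normal assumption, the function name, and the height values are all assumptions made for this example.

```python
import math
from statistics import NormalDist, mean, stdev


def gaussian_outlier_scores(data):
    """Score each value by -log of its density under a normal model fit to data."""
    model = NormalDist(mean(data), stdev(data))
    return [-math.log(model.pdf(x)) for x in data]


# Hypothetical heights in cm; the last value is deliberately unusual.
heights_cm = [172.0, 168.5, 175.2, 170.1, 169.4, 173.8, 171.0, 210.0]
scores = gaussian_outlier_scores(heights_cm)
most_unusual = scores.index(max(scores))
```

Note that the unusual value is included when the mean and standard deviation are estimated, so it inflates both; this is one way anomalies can distort the parameters of the fitted distribution.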


Normal Distributions

One-dimensional Gaussian

Two-dimensional Gaussian


Grubbs’ Test

Detect outliers in univariate data

Assume data comes from normal distribution

Detects one outlier at a time: remove the outlier and repeat

H0: There is no outlier in data

HA: There is at least one outlier

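Grubbs’ test statistic:

G = \frac{\max_i |X_i - \bar{X}|}{s}

Reject H0 if:

G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}

Here \bar{X} and s are the sample mean and standard deviation, N is the number of data points, and t_{(\alpha/N,\,N-2)} is the critical value of the t distribution with N - 2 degrees of freedom at significance level \alpha/N.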
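
The test can be sketched in a few lines. This is a minimal sketch using only the Python standard library, which has no t-distribution quantile function, so the t critical value is passed in by the caller (from a table or a statistics package). The helper names and the `t_crit_for` callback are hypothetical, and the value 3.0 used in the usage lines is a placeholder, not a looked-up critical value.

```python
import math
from statistics import mean, stdev


def grubbs_statistic(data):
    """Return (G, index) with G = max|x_i - mean| / s for the most extreme point."""
    m, s = mean(data), stdev(data)
    idx = max(range(len(data)), key=lambda i: abs(data[i] - m))
    if s == 0:
        return 0.0, idx  # all values identical: nothing can be extreme
    return abs(data[idx] - m) / s, idx


def grubbs_critical(n, t_crit):
    """Critical G for n points, given t_crit = t critical value at level
    alpha/n with n - 2 degrees of freedom (supplied by the caller)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def grubbs_outliers(data, t_crit_for):
    """Test the most extreme point, remove it if H0 is rejected, and repeat.

    t_crit_for is a hypothetical callback mapping n to the t critical value.
    """
    remaining, removed = list(data), []
    while len(remaining) > 2:
        g, idx = grubbs_statistic(remaining)
        if g <= grubbs_critical(len(remaining), t_crit_for(len(remaining))):
            break  # fail to reject H0: stop removing points
        removed.append(remaining.pop(idx))
    return removed, remaining


# Usage with made-up data and a placeholder critical value.
data = [10.0, 10.2, 9.9, 10.1, 9.8, 15.0]
g, idx = grubbs_statistic(data)
removed, remaining = grubbs_outliers(data, t_crit_for=lambda n: 3.0)
```

In practice the callback would return the proper critical value for each sample size, since N shrinks by one every time a point is removed.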


Statistical-based – Likelihood Approach

Assume the data set D contains samples from a mixture of two probability distributions:

M (majority distribution)

A (anomalous distribution)

General Approach:

Initially, assume all the data points belong to M

Let L_t(D) be the log likelihood of D at time t

For each point x_t that belongs to M, move it to A

Let L_{t+1}(D) be the new log likelihood.

Compute the difference, Δ = L_t(D) – L_{t+1}(D)

If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A


Statistical-based – Likelihood Approach

Data distribution, D = (1 – λ) M + λ A

M is a probability distribution estimated from data

Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)

A is initially assumed to be a uniform distribution

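Likelihood at time t:

L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)

LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)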
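
To make the log likelihood concrete, the sketch below evaluates LL_t(D) for a given split of the data into M and A, and reports how it changes when a single point is moved from M to A. It assumes a one-dimensional Gaussian for M (refit to whatever points M currently holds) and a uniform A of fixed width; the mixing weight `lam`, the width `a_width`, the function names, and the data are all assumptions of the example.

```python
import math
from statistics import mean, pstdev


def log_likelihood(M, A, lam, a_width):
    """LL_t(D) for a split of D into M (majority) and A (anomalous).

    Assumptions of this sketch: P_M is a Gaussian refit to the points
    currently in M; P_A is uniform with density 1 / a_width.
    """
    mu, sigma = mean(M), pstdev(M)
    ll = len(M) * math.log(1 - lam)
    for x in M:
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    ll += len(A) * (math.log(lam) - math.log(a_width))
    return ll


def change_if_moved(M, A, i, lam, a_width):
    """Return LL_t(D) - LL_{t+1}(D) when M[i] is moved from M to A."""
    before = log_likelihood(M, A, lam, a_width)
    after = log_likelihood(M[:i] + M[i + 1:], A + [M[i]], lam, a_width)
    return before - after


# Made-up data: five similar values and one unusual value, all initially in M.
D = [10.0, 10.2, 9.9, 10.1, 9.8, 15.0]
deltas = [change_if_moved(D, [], i, lam=0.1, a_width=10.0) for i in range(len(D))]
```

In this toy run the change is largest in magnitude for the unusual value 15.0, and its sign is opposite to that of the other points, because moving a poorly fitting point out of M raises the overall log likelihood. The sketch only reports the change; how the threshold c is applied to it is left to the caller.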


Strengths/Weaknesses of Statistical Approaches

Firm mathematical foundation

Can be very efficient

Good results if distribution is known

In many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Anomalies can distort the parameters of the distribution


Distance-Based Approaches

Several different techniques

An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)

Some statistical definitions are special cases of this

The outlier score of an object is the distance to its kth nearest neighbor
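
Both definitions above are short to write down with a brute-force neighbor search. The following is a minimal sketch, assuming Euclidean distance and a small made-up point set; the function names are illustrative, and counting the fraction over the other objects (excluding the point itself) is a choice made for this example.

```python
import math


def knn_score(points, i, k):
    """Outlier score of points[i]: distance to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def is_distance_outlier(points, i, frac, d):
    """True if at least `frac` of the other objects lie more than d away."""
    others = [q for j, q in enumerate(points) if j != i]
    far = sum(1 for q in others if math.dist(points[i], q) > d)
    return far / len(others) >= frac


# Four points on a unit square plus one distant point.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [knn_score(pts, i, k=1) for i in range(len(pts))]
far_flag = is_distance_outlier(pts, 4, frac=0.9, d=3.0)
near_flag = is_distance_outlier(pts, 0, frac=0.9, d=3.0)
```

The search compares every pair of points, which is where the quadratic cost of the simple approach comes from.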


One Nearest Neighbor - One Outlier

Outlier Score


One Nearest Neighbor - Two Outliers

Outlier Score


Five Nearest Neighbors - Small Cluster

Outlier Score


Five Nearest Neighbors - Differing Density

Outlier Score


Strengths/Weaknesses of Distance-Based Approaches

Simple

Expensive – O(n^2)

Sensitive to parameters

Sensitive to variations in density

Distance becomes less meaningful in high-dimensional space


Density-Based Approaches

Density-based Outlier: The outlier score of an object is the inverse of the density around the object.

Can be defined in terms of the k nearest neighbors

One definition: Inverse of distance to kth neighbor

Another definition: Inverse of the average distance to k neighbors

DBSCAN definition

If there are regions of different density, this approach can have problems
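
The two k-nearest-neighbor density definitions can be sketched as follows. This assumes Euclidean distance, no duplicate points (a zero distance would make the density undefined), and made-up data; the function names are illustrative.

```python
import math


def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbors."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)[:k]


def density_kth(points, i, k):
    """Density as the inverse of the distance to the k-th nearest neighbor."""
    return 1.0 / knn_distances(points, i, k)[-1]


def density_avg(points, i, k):
    """Density as the inverse of the average distance to the k nearest neighbors."""
    d = knn_distances(points, i, k)
    return 1.0 / (sum(d) / len(d))


def outlier_score(points, i, k, density=density_avg):
    """Outlier score: inverse of the density around points[i]."""
    return 1.0 / density(points, i, k)


pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
scores = [outlier_score(pts, i, k=2) for i in range(len(pts))]
```

Because the score is an absolute density, points in a legitimately sparse region would also score high, which is the problem with regions of different density noted above.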


Relative Density

Consider the density of a point relative to that of its k nearest neighbors


Relative Density Outlier Scores

Outlier Score


Density-based: LOF approach

For each point, compute the density of its local neighborhood

Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of p's nearest neighbors to the density of p

Outliers are points with largest LOF value

In the nearest-neighbor (NN) approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 to be outliers (p1 and p2 are points labeled in the accompanying figure)
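
The relative-density idea can be sketched with a simplified score. This is not the full LOF algorithm: it uses the inverse average distance to the k nearest neighbors as the density rather than LOF's reachability-based density, and it assumes no duplicate points. The function names and the point set are made up for the example.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i]."""
    pairs = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return pairs[:k]


def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    return k / sum(d for d, _ in knn(points, i, k))


def relative_density_score(points, i, k):
    """Average ratio of each neighbor's density to the density of points[i]."""
    dens_i = density(points, i, k)
    return sum(density(points, j, k) / dens_i for _, j in knn(points, i, k)) / k


# A tight cluster, a loose cluster, and one point just outside the tight cluster.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),          # tight cluster
       (10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0),  # loose cluster
       (0.6, 0.6)]                                              # near the tight cluster
rel = [relative_density_score(pts, i, k=2) for i in range(len(pts))]
kth = [knn(pts, i, 2)[-1][0] for i in range(len(pts))]
```

With these points, the last point has a smaller k-th neighbor distance than any point in the loose cluster, so a plain nearest-neighbor score would not rank it first, yet its relative-density score is the largest because its neighbors are much denser than it is.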


Strengths/Weaknesses of Density-Based Approaches

Simple

Expensive – O(n^2)

Sensitive to parameters

Density becomes less meaningful in high-dimensional space


Clustering-Based Approaches

Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster

For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center

For density-based clusters, an object is an outlier if its density is too low

For graph-based clusters, an object is an outlier if it is not well connected

Other issues include the impact of outliers on the clusters and the number of clusters


Distance of Points from Closest Centroids

Outlier Score


Relative Distance of Points from Closest Centroid

Outlier Score
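
For prototype-based clusters, the two scores named on these slides can be sketched as below. The centroids are taken as given (for example from a prior k-means run, which is not shown), and the relative distance is computed here as the point's distance divided by the median distance of its cluster's points to the centroid; that normalization, the function name, and the data are assumptions of this example.

```python
import math
from statistics import median


def centroid_scores(points, centroids):
    """Return (distance, relative distance) to the closest centroid for each point."""
    assign = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    dist = [math.dist(p, centroids[a]) for p, a in zip(points, assign)]
    # Median distance to the centroid within each cluster (assumed non-zero).
    med = {c: median(d for d, a in zip(dist, assign) if a == c) for c in set(assign)}
    rel = [d / med[a] for d, a in zip(dist, assign)]
    return dist, rel


# A compact cluster near (0, 0) and a loose cluster near (10, 10).
pts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.5, 0.0),
       (12.0, 10.0), (8.0, 10.0), (10.0, 12.0), (10.0, 8.0), (10.0, 13.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical, not recomputed from pts
dist, rel = centroid_scores(pts, centroids)
```

With these points, the plain distance ranks (10.0, 13.0) as the strongest outlier, while the relative distance ranks (0.5, 0.0) first, since it is far from its centroid compared with the rest of its compact cluster.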


Strengths/Weaknesses of Clustering-Based Approaches

Simple

Many clustering techniques can be used

Can be difficult to decide on a clustering technique

Can be difficult to decide on number of clusters

Outliers can distort the clusters



