main discussion

profileEswari_ke
chap9_anomaly_detection1.pdf

Anomaly Detection

Lecture Notes for Chapter 9

Introduction to Data Mining, 2nd Edition by

Tan, Steinbach, Karpatne, Kumar

11/20/2020 Introduction to Data Mining, 2nd Edition 1

Anomaly/Outlier Detection

 What are anomalies/outliers? – The set of data points that are

considerably different than the remainder of the data

 Natural implication is that anomalies are relatively rare – One in a thousand occurs often if you have lots of data – Context is important, e.g., freezing temps in July

 Can be important or a nuisance – 10 foot tall 2 year old – Unusually high blood pressure

11/20/2020 Introduction to Data Mining, 2nd Edition 2

Importance of Anomaly Detection

Ozone Depletion History  In 1985 three researchers (Farman,

Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

 Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

 The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources:

http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html

11/20/2020 Introduction to Data Mining, 2nd Edition 3

Causes of Anomalies

 Data from different classes – Measuring the weights of oranges, but a few grapefruit

are mixed in

 Natural variation – Unusually tall people

 Data errors – 200 pound 2 year old

11/20/2020 Introduction to Data Mining, 2nd Edition 4

Distinction Between Noise and Anomalies

 Noise is erroneous, perhaps random, values or contaminating objects – Weight recorded incorrectly

– Grapefruit mixed in with the oranges

 Noise doesn’t necessarily produce unusual values or objects

 Noise is not interesting  Anomalies may be interesting if they are not a result of

noise  Noise and anomalies are related but distinct concepts

11/20/2020 Introduction to Data Mining, 2nd Edition 5

General Issues: Number of Attributes

 Many anomalies are defined in terms of a single attribute – Height – Shape – Color

 Can be hard to find an anomaly using all attributes – Noisy or irrelevant attributes – Object is only anomalous with respect to some attributes

 However, an object may not be anomalous in any one attribute

11/20/2020 Introduction to Data Mining, 2nd Edition 6

General Issues: Anomaly Scoring

 Many anomaly detection techniques provide only a binary categorization – An object is an anomaly or it isn’t – This is especially true of classification-based approaches

 Other approaches assign a score to all points – This score measures the degree to which an object is an anomaly – This allows objects to be ranked

 In the end, you often need a binary decision – Should this credit card transaction be flagged? – Still useful to have a score

 How many anomalies are there? 11/20/2020 Introduction to Data Mining, 2nd Edition 7

Other Issues for Anomaly Detection

 Find all anomalies at once or one at a time – Swamping – Masking

 Evaluation – How do you measure performance? – Supervised vs. unsupervised situations

 Efficiency

 Context – Professional basketball team

11/20/2020 Introduction to Data Mining, 2nd Edition 8

Variants of Anomaly Detection Problems

 Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t

 Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores

 Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

11/20/2020 Introduction to Data Mining, 2nd Edition 9

Model-Based Anomaly Detection

 Build a model for the data and see – Unsupervised

 Anomalies are those points that don’t fit well  Anomalies are those points that distort the model  Examples:

– Statistical distribution – Clusters – Regression – Geometric – Graph

– Supervised  Anomalies are regarded as a rare class  Need to have training data

11/20/2020 Introduction to Data Mining, 2nd Edition 10

Additional Anomaly Detection Techniques

 Proximity-based – Anomalies are points far away from other points – Can detect this graphically in some cases

 Density-based – Low density points are outliers

 Pattern matching – Create profiles or templates of atypical but important

events or objects – Algorithms to detect these patterns are usually simple

and efficient

11/20/2020 Introduction to Data Mining, 2nd Edition 11

Visual Approaches

 Boxplots or scatter plots

 Limitations – Not automatic – Subjective

11/20/2020 Introduction to Data Mining, 2nd Edition 12

Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.

 Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)

 Apply a statistical test that depends on – Data distribution – Parameters of distribution (e.g., mean, variance) – Number of expected outliers (confidence limit)

 Issues – Identifying the distribution of a data set

 Heavy tailed distribution – Number of attributes – Is the data a mixture of distributions?

11/20/2020 Introduction to Data Mining, 2nd Edition 13

Normal Distributions

One-dimensional Gaussian

Two-dimensional Gaussian

x

y

-4 -3 -2 -1 0 1 2 3 4 5

-5

-4

-3

-2

-1

0

1

2

3

4

5

6

7

8

probability density

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.1

11/20/2020 Introduction to Data Mining, 2nd Edition 14

Grubbs’ Test

 Detect outliers in univariate data  Assume data comes from normal distribution  Detects one outlier at a time, remove the outlier,

and repeat – H0: There is no outlier in data – HA: There is at least one outlier

 Grubbs’ test statistic:

 Reject H0 if: s

XX G

− =

max

2

2

)2,/(

)2,/(

2 )1(

+− −

> NN

NN

tN t

N N

G α

α

11/20/2020 Introduction to Data Mining, 2nd Edition 15

Statistical-based – Likelihood Approach

 Assume the data set D contains samples from a mixture of two probability distributions: – M (majority distribution) – A (anomalous distribution)

 General Approach: – Initially, assume all the data points belong to M – Let Lt(D) be the log likelihood of D at time t – For each point xt that belongs to M, move it to A

 Let Lt+1 (D) be the new log likelihood.  Compute the difference, ∆ = Lt(D) – Lt+1 (D)  If ∆ > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A

11/20/2020 Introduction to Data Mining, 2nd Edition 16

Statistical-based – Likelihood Approach

 Data distribution, D = (1 – λ) M + λ A  M is a probability distribution estimated from data

– Can be based on any modeling method (naïve Bayes, maximum entropy, etc)

 A is initially assumed to be uniform distribution  Likelihood at time t:

∑∑

∏∏∏

∈∈

∈∈=

+++−=

  

   

   

   

 −==

ti

t

ti

t

ti

t

t

ti

t

t

Ax iAt

Mx iMtt

Ax iA

A

Mx iM

M N

i iDt

xPAxPMDLL

xPxPxPDL

)(loglog)(log)1log()(

)()()1()()( |||| 1

λλ

λλ

11/20/2020 Introduction to Data Mining, 2nd Edition 17

Strengths/Weaknesses of Statistical Approaches

 Firm mathematical foundation

 Can be very efficient

 Good results if distribution is known

 In many cases, data distribution may not be known

 For high dimensional data, it may be difficult to estimate the true distribution

 Anomalies can distort the parameters of the distribution

11/20/2020 Introduction to Data Mining, 2nd Edition 18

Distance-Based Approaches

 Several different techniques

 An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998) – Some statistical definitions are special cases of this

 The outlier score of an object is the distance to its kth nearest neighbor

11/20/2020 Introduction to Data Mining, 2nd Edition 19

One Nearest Neighbor - One Outlier

D

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Outlier Score 11/20/2020 Introduction to Data Mining, 2nd Edition 20

One Nearest Neighbor - Two Outliers

D

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

Outlier Score 11/20/2020 Introduction to Data Mining, 2nd Edition 21

Five Nearest Neighbors - Small Cluster

D

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Outlier Score 11/20/2020 Introduction to Data Mining, 2nd Edition 22

Five Nearest Neighbors - Differing Density

D

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Outlier Score 11/20/2020 Introduction to Data Mining, 2nd Edition 23

Strengths/Weaknesses of Distance-Based Approaches

 Simple

 Expensive – O(n2)

 Sensitive to parameters

 Sensitive to variations in density

 Distance becomes less meaningful in high- dimensional space

11/20/2020 Introduction to Data Mining, 2nd Edition 24

Density-Based Approaches

 Density-based Outlier: The outlier score of an object is the inverse of the density around the object. – Can be defined in terms of the k nearest neighbors – One definition: Inverse of distance to kth neighbor – Another definition: Inverse of the average distance to k

neighbors – DBSCAN definition

 If there are regions of different density, this approach can have problems

11/20/2020 Introduction to Data Mining, 2nd Edition 25

Relative Density

 Consider the density of a point relative to that of its k nearest neighbors

11/20/2020 Introduction to Data Mining, 2nd Edition 26

Relative Density Outlier Scores

Outlier Score 1

2

3

4

5

6

6.85

1.33

1.40

A

C

D

11/20/2020 Introduction to Data Mining, 2nd Edition 27

Density-based: LOF approach

 For each point, compute the density of its local neighborhood  Compute local outlier factor (LOF) of a sample p as the average of

the ratios of the density of sample p and the density of its nearest neighbors

 Outliers are points with largest LOF value

p2 × p1

×

In the NN approach, p2 is not considered as outlier, while LOF approach find both p1 and p2 as outliers

11/20/2020 Introduction to Data Mining, 2nd Edition 28

Strengths/Weaknesses of Density-Based Approaches

 Simple

 Expensive – O(n2)

 Sensitive to parameters

 Density becomes less meaningful in high- dimensional space

11/20/2020 Introduction to Data Mining, 2nd Edition 29

Clustering-Based Approaches

 Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster – For prototype-based clusters, an

object is an outlier if it is not close enough to a cluster center

– For density-based clusters, an object is an outlier if its density is too low

– For graph-based clusters, an object is an outlier if it is not well connected

 Other issues include the impact of outliers on the clusters and the number of clusters

11/20/2020 Introduction to Data Mining, 2nd Edition 30

Distance of Points from Closest Centroids

Outlier Score

0.5

1

1.5

2

2.5

3

3.5

4

4.5

D

C

A

1.2

0.17

4.6

11/20/2020 Introduction to Data Mining, 2nd Edition 31

Relative Distance of Points from Closest Centroid

Outlier Score

0.5

1

1.5

2

2.5

3

3.5

4

11/20/2020 Introduction to Data Mining, 2nd Edition 32

Strengths/Weaknesses of Distance-Based Approaches

 Simple

 Many clustering techniques can be used

 Can be difficult to decide on a clustering technique

 Can be difficult to decide on number of clusters

 Outliers can distort the clusters

11/20/2020 Introduction to Data Mining, 2nd Edition 33

  • Anomaly Detection
  • Anomaly/Outlier Detection
  • Importance of Anomaly Detection
  • Causes of Anomalies
  • Distinction Between Noise and Anomalies
  • General Issues: Number of Attributes
  • General Issues: Anomaly Scoring
  • Other Issues for Anomaly Detection
  • Variants of Anomaly Detection Problems
  • Model-Based Anomaly Detection
  • Additional Anomaly Detection Techniques
  • Visual Approaches
  • Statistical Approaches
  • Normal Distributions
  • Grubbs’ Test
  • Statistical-based – Likelihood Approach
  • Statistical-based – Likelihood Approach
  • Strengths/Weaknesses of Statistical Approaches
  • Distance-Based Approaches
  • One Nearest Neighbor - One Outlier
  • One Nearest Neighbor - Two Outliers
  • Five Nearest Neighbors - Small Cluster
  • Five Nearest Neighbors - Differing Density
  • Strengths/Weaknesses of Distance-Based Approaches
  • Density-Based Approaches
  • Relative Density
  • Relative Density Outlier Scores
  • Density-based: LOF approach
  • Strengths/Weaknesses of Density-Based Approaches
  • Clustering-Based Approaches
  • Distance of Points from Closest Centroids
  • Relative Distance of Points from Closest Centroid
  • Strengths/Weaknesses of Distance-Based Approaches