2 Discussions - 1 Project

profileSinners0043
Chap_10_DataMining.ppt

Data Mining
Anomaly Detection

Lecture Notes for Chapter 10

Introduction to Data Mining

by

Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *

Anomaly/Outlier Detection

  • What are anomalies/outliers?
  • The set of data points that are considerably different than the remainder of the data
  • Variants of Anomaly/Outlier Detection Problems
  • Given a database D, find all the data points x  D with anomaly scores greater than some threshold t
  • Given a database D, find all the data points x  D having the top-n largest anomaly scores f(x)
  • Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
  • Applications:
  • Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

Importance of Anomaly Detection

Ozone Depletion History

  • In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

  • Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

  • The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Anomaly Detection

  • Challenges
  • How many outliers are there in the data?
  • Method is unsupervised
  • Validation can be quite challenging (just like for clustering)
  • Finding needle in a haystack
  • Working assumption:
  • There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes

  • General Steps
  • Build a profile of the “normal” behavior
  • Profile can be patterns or summary statistics for the overall population
  • Use the “normal” profile to detect anomalies
  • Anomalies are observations whose characteristics
    differ significantly from the normal profile
  • Types of anomaly detection
    schemes
  • Graphical & Statistical-based
  • Distance-based
  • Model-based

Statistical Approaches

  • Assume a parametric model describing the distribution of the data (e.g., normal distribution)

  • Apply a statistical test that depends on
  • Data distribution
  • Parameter of distribution (e.g., mean, variance)
  • Number of expected outliers (confidence limit)

Grubbs’ Test

  • Detect outliers in univariate data
  • Assume data comes from normal distribution
  • Detects one outlier at a time, remove the outlier, and repeat
  • H0: There is no outlier in data
  • HA: There is at least one outlier
  • Grubbs’ test statistic:
  • Reject H0 if:

Statistical-based – Likelihood Approach

  • Assume the data set D contains samples from a mixture of two probability distributions:
  • M (majority distribution)
  • A (anomalous distribution)
  • General Approach:
  • Initially, assume all the data points belong to M
  • Let Lt(D) be the log likelihood of D at time t
  • For each point xt that belongs to M, move it to A
  • Let Lt+1 (D) be the new log likelihood.
  • Compute the difference,  = Lt(D) – Lt+1 (D)
  • If  > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A

Limitations of Statistical Approaches

  • Most of the tests are for a single attribute

  • In many cases, data distribution may not be known

  • For high dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches

  • Data is represented as a vector of features

  • Three major approaches
  • Nearest-neighbor based
  • Density based
  • Clustering based

Nearest-Neighbor Based Approach

  • Approach:
  • Compute the distance between every pair of data points

  • There are various ways to define outliers:
  • Data points for which there are fewer than p neighboring points within a distance D

  • The top n data points whose distance to the kth nearest neighbor is greatest

  • The top n data points whose average distance to the k nearest neighbors is greatest

Outliers in Lower Dimensional Projection

  • In high-dimensional space, data is sparse and notion of proximity becomes meaningless
  • Every point is an almost equally good outlier from the perspective of proximity-based definitions
  • Lower-dimensional projection methods
  • A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density

Clustering-Based

  • Basic idea:
  • Cluster the data into groups of different density
  • Choose points in small cluster as candidate outliers
  • Compute the distance between candidate points and non-candidate clusters.
  • If candidate points are far from all other non-candidate points, they are outliers

Base Rate Fallacy in Intrusion Detection

  • I: intrusive behavior,
    I: non-intrusive behavior
    A: alarm
    A: no alarm

  • Detection rate (true positive rate): P(A|I)
  • False alarm rate: P(A|I)

  • Goal is to maximize both
  • Bayesian detection rate, P(I|A)
  • P(I|A)

s

X

X

G

-

=

max

2

2

)

2

,

/

(

)

2

,

/

(

2

)

1

(

-

-

+

-

-

>

N

N

N

N

t

N

t

N

N

G

a

a