Data mining
Data Mining
Anomaly Detection
Lecture Notes for Chapter 10
Introduction to Data Mining
by
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *
Anomaly/Outlier Detection
- What are anomalies/outliers?
- The set of data points that are considerably different than the remainder of the data
- Variants of Anomaly/Outlier Detection Problems
- Given a database D, find all the data points x D with anomaly scores greater than some threshold t
- Given a database D, find all the data points x D having the top-n largest anomaly scores f(x)
- Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
- Applications:
- Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
Importance of Anomaly Detection
Ozone Depletion History
- In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Anomaly Detection
- Challenges
- How many outliers are there in the data?
- Method is unsupervised
- Validation can be quite challenging (just like for clustering)
- Finding needle in a haystack
- Working assumption:
- There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
- General Steps
- Build a profile of the “normal” behavior
- Profile can be patterns or summary statistics for the overall population
- Use the “normal” profile to detect anomalies
- Anomalies are observations whose characteristics
differ significantly from the normal profile - Types of anomaly detection
schemes - Graphical & Statistical-based
- Distance-based
- Model-based
Graphical Approaches
- Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
- Limitations
- Time consuming
- Subjective
Convex Hull Method
- Extreme points are assumed to be outliers
- Use convex hull method to detect extreme values
- What if the outlier occurs in the middle of the data?
Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
- Data distribution
- Parameter of distribution (e.g., mean, variance)
- Number of expected outliers (confidence limit)
Grubbs’ Test
- Detect outliers in univariate data
- Assume data comes from normal distribution
- Detects one outlier at a time, remove the outlier, and repeat
- H0: There is no outlier in data
- HA: There is at least one outlier
- Grubbs’ test statistic:
- Reject H0 if:
Statistical-based – Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
- M (majority distribution)
- A (anomalous distribution)
- General Approach:
- Initially, assume all the data points belong to M
- Let Lt(D) be the log likelihood of D at time t
- For each point xt that belongs to M, move it to A
- Let Lt+1 (D) be the new log likelihood.
- Compute the difference, = Lt(D) – Lt+1 (D)
- If > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A
Statistical-based – Likelihood Approach
- Data distribution, D = (1 – ) M + A
- M is a probability distribution estimated from data
- Can be based on any modeling method (naïve Bayes, maximum entropy, etc)
- A is initially assumed to be uniform distribution
- Likelihood at time t:
Limitations of Statistical Approaches
- Most of the tests are for a single attribute
- In many cases, data distribution may not be known
- For high dimensional data, it may be difficult to estimate the true distribution
Distance-based Approaches
- Data is represented as a vector of features
- Three major approaches
- Nearest-neighbor based
- Density based
- Clustering based
Nearest-Neighbor Based Approach
- Approach:
- Compute the distance between every pair of data points
- There are various ways to define outliers:
- Data points for which there are fewer than p neighboring points within a distance D
- The top n data points whose distance to the kth nearest neighbor is greatest
- The top n data points whose average distance to the k nearest neighbors is greatest
Outliers in Lower Dimensional Projection
- In high-dimensional space, data is sparse and notion of proximity becomes meaningless
- Every point is an almost equally good outlier from the perspective of proximity-based definitions
- Lower-dimensional projection methods
- A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density
Outliers in Lower Dimensional Projection
- Divide each attribute into equal-depth intervals
- Each interval contains a fraction f = 1/ of the records
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
- If attributes are independent, we expect region to contain a fraction fk of the records
- If there are N points, we can measure sparsity of a cube D as:
- Negative sparsity indicates cube contains smaller number of points than expected
Example
- N=100, = 5, f = 1/5 = 0.2, N f2 = 4
Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with largest LOF value
In the NN approach, p2 is not considered as outlier, while LOF approach find both p1 and p2 as outliers
p2
p1
Clustering-Based
- Basic idea:
- Cluster the data into groups of different density
- Choose points in small cluster as candidate outliers
- Compute the distance between candidate points and non-candidate clusters.
- If candidate points are far from all other non-candidate points, they are outliers
Base Rate Fallacy
- Bayes theorem:
- More generally:
Base Rate Fallacy (Axelsson, 1999)
Base Rate Fallacy
- Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than sick people
Base Rate Fallacy in Intrusion Detection
- I: intrusive behavior,
I: non-intrusive behavior
A: alarm
A: no alarm
- Detection rate (true positive rate): P(A|I)
- False alarm rate: P(A|I)
- Goal is to maximize both
- Bayesian detection rate, P(I|A)
- P(I|A)
Detection Rate vs False Alarm Rate
- Suppose:
- Then:
- False alarm rate becomes more dominant if P(I) is very low
Detection Rate vs False Alarm Rate
- Axelsson: We need a very low false alarm rate to achieve a reasonable Bayesian detection rate
s
X
X
G
-
=
max
2
2
)
2
,
/
(
)
2
,
/
(
2
)
1
(
-
-
+
-
-
>
N
N
N
N
t
N
t
N
N
G
a
a
å
å
Õ
Õ
Õ
Î
Î
Î
Î
=
+
+
+
-
=
÷
÷
ø
ö
ç
ç
è
æ
÷
÷
ø
ö
ç
ç
è
æ
-
=
=
t
i
t
t
i
t
t
i
t
t
t
i
t
t
A
x
i
A
t
M
x
i
M
t
t
A
x
i
A
A
M
x
i
M
M
N
i
i
D
t
x
P
A
x
P
M
D
LL
x
P
x
P
x
P
D
L
)
(
log
log
)
(
log
)
1
log(
)
(
)
(
)
(
)
1
(
)
(
)
(
|
|
|
|
1
l
l
l
l