Statistics 202: Data Mining
Outliers

Based in part on slides from the textbook and slides of Susan Holmes

© Jonathan Taylor
December 2, 2012
Outliers
Concepts
What is an outlier? A set of data points that are considerably different from the remainder of the data.

When do they appear in data mining tasks?

- Given a data matrix $X$, find all cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$, or the top $n$ outlier scores.
- Given a data matrix $X$ containing mostly normal (but unlabeled) data points, and a test case $x_{new}$, compute an anomaly/outlier score of $x_{new}$ with respect to $X$.

Applications

- Credit card fraud detection
- Network intrusion detection
- Misspecification of a model
What is an outlier?
Issues
How many outliers are there in the data?

The method is unsupervised, similar to clustering or to finding clusters containing only one point.

Usual assumption: there are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data.
General steps
Build a profile of the “normal” behavior. The profile generally consists of summary statistics of this “normal” population.

Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.

General types of schemes involve a statistical model of “normal”, and “far” is measured in terms of likelihood.

Other schemes based on distances can be quasi-motivated by such statistical techniques.
Statistical approach
Assume a parametric model describing the distribution of the data (e.g., normal distribution).

Apply a statistical test that depends on:

- the data distribution (e.g. normal);
- the parameters of the distribution (e.g., mean, variance);
- the number of expected outliers (confidence limit, $\alpha$ or Type I error).
Grubbs’ Test
Suppose we have a sample of $n$ numbers $Z = \{Z_1, \ldots, Z_n\}$, i.e. an $n \times 1$ data matrix.

Assuming the data are from a normal distribution, Grubbs’ test uses the distribution of

$\max_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$

to search for outlying large values.
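As a quick sketch (in Python rather than R, with made-up numbers), the upper-tail Grubbs statistic is:

```python
import numpy as np

# Grubbs statistic for the largest value, as in the display above:
# (max_i Z_i - mean(Z)) / SD(Z), using the usual n-1 standard deviation.
def grubbs_upper(z):
    z = np.asarray(z, dtype=float)
    return (z.max() - z.mean()) / z.std(ddof=1)
```

For example, `grubbs_upper([1, 2, 3, 4, 5, 30])` is about 2.02; whether that counts as “large” is settled by the critical values on the following slides.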
Grubbs’ Test
Lower tail variant:

$\min_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$

Two-sided variant:

$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$
Grubbs’ Test
Having chosen a test statistic, we must determine a threshold that defines our rejection rule.

Often this is set via a hypothesis test to control Type I error.

For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that

$P_0\left( \max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)} \ge c_\alpha \right) \approx \alpha$

Above, $P_0$ denotes the distribution of $Z$ under the assumption there are no outliers.

If the $Z_i$ are IID $N(\mu, \sigma^2)$ it is generally possible to compute a decent approximation of this probability using Bonferroni.
Grubbs’ Test
The two-sided critical value has the form

$c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$

where $t_{\gamma,k}$, defined by $P(T_k \ge t_{\gamma,k}) = \gamma$, is the upper tail quantile of $T_k$.

In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
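The slides point to R’s qt for the t quantile; a sketch of the same computation in Python, using scipy.stats.t.ppf as the analogue of qt:

```python
import numpy as np
from scipy import stats

def grubbs_critical_value(n, alpha=0.05):
    """Two-sided Grubbs critical value c_alpha from the formula above."""
    # upper-tail quantile t_{alpha/(2n), n-2}: P(T_{n-2} >= t) = alpha/(2n)
    t = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
```

One declares the largest $|Z_i - \bar{Z}|/SD(Z)$ an outlier when it exceeds $c_\alpha$; for $n = 10$ and $\alpha = 0.05$ this gives $c_\alpha \approx 2.29$, matching standard Grubbs tables.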
Model based: linear regression with outliers
Model-based techniques (from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004):

- First build a model.
- Points which don’t fit the model well are identified as outliers.
- For data following a linear trend, a least squares regression model is appropriate.
- Residuals can be fed into Grubbs’ test.

Figure: Residuals from the model can be fed into Grubbs’ test or a Bonferroni variant.
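A minimal sketch of this pipeline, with simulated data and one injected outlier (the data and the magnitude 8.0 are made up for illustration):

```python
import numpy as np

x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x                 # a clean linear trend
y[15] += 8.0                      # inject one gross outlier

A = np.c_[np.ones_like(x), x]     # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# two-sided Grubbs-type score of each residual
g = np.abs(resid - resid.mean()) / resid.std(ddof=1)
```

The injected case dominates the scores (`g.argmax()` is 15); comparing `g.max()` to the Grubbs critical value for $n = 30$ completes the test.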
Multivariate data
If the non-outlying data is assumed to be multivariate Gaussian, what is the analogue of Grubbs’ statistic

$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$?

Answer: use the Mahalanobis distance

$\max_{1 \le i \le n} (Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$

Above, each individual statistic has what looks like a Hotelling’s $T^2$ distribution.
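A sketch of the Mahalanobis version with numpy (hypothetical data; each row’s statistic is $(Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$):

```python
import numpy as np

def mahalanobis_outlier_stat(X):
    """Squared Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # quadratic form c_i^T Sigma^{-1} c_i for every row i
    return np.einsum('ij,jk,ik->i', centered, Sigma_inv, centered)
```

The max over rows plays the role of Grubbs’ statistic.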
Likelihood approach
Assume the data are a mixture

$F = (1 - \lambda)M + \lambda A.$

Above, $M$ is the distribution of “most of the data.”

The distribution $A$ is an “outlier” distribution; it could be uniform on a bounding box for the data.

This is a mixture model. If $M$ is parametric, then the EM algorithm fits naturally here.

Any points assigned to $A$ are “outliers.”
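A sketch of scoring points under this mixture in Python. Rather than a full EM fit, this stand-in estimates $M = N(\mu, \sigma^2)$ robustly with the median and MAD (my simplification, not from the slides) and takes $A$ uniform on the data’s range; the returned value is the posterior probability that each point came from $A$:

```python
import numpy as np
from scipy import stats

def outlier_posterior(x, lam=0.05):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - med))       # MAD-based scale
    f_M = stats.norm.pdf(x, med, sigma)               # "most of the data"
    f_A = np.full_like(x, 1.0 / (x.max() - x.min()))  # uniform outlier model
    return lam * f_A / ((1 - lam) * f_M + lam * f_A)
```

Points with posterior near 1 are assigned to $A$, i.e. declared outliers.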
Likelihood approach
Do we estimate λ or fix it?
The book starts describing an algorithm that tries to maximize the equivalent classification likelihood

$L(\theta_M, \theta_A; l) = \left[ (1-\lambda)^{\#l_M} \prod_{i \in l_M} f_M(x_i; \theta_M) \right] \times \left[ \lambda^{\#l_A} \prod_{i \in l_A} f_A(x_i; \theta_A) \right]$
Likelihood approach: Algorithm
The algorithm tries to maximize this by forming iterative estimates $(M_t, A_t)$ of the “normal” and “outlying” data points.

1. At each stage, try moving individual points from $M_t$ to $A_t$.
2. Find $(\hat{\theta}_M, \hat{\theta}_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.
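A toy sketch of this greedy scheme in one dimension, assuming $M$ normal and $A$ uniform on the data’s range, with $\lambda$ fixed (the gain threshold and all numbers are illustrative, not from the book):

```python
import numpy as np
from scipy import stats

def greedy_outlier_partition(x, lam=0.05, min_gain=1e-6):
    x = np.asarray(x, dtype=float)
    log_fA = -np.log(x.max() - x.min())   # A: uniform on the data's range

    def loglik(normal):
        xm = x[normal]
        n_A = x.size - xm.size
        ll = xm.size * np.log(1 - lam)
        ll += stats.norm.logpdf(xm, xm.mean(), xm.std(ddof=1)).sum()
        return ll + n_A * (np.log(lam) + log_fA)

    normal = np.ones(x.size, dtype=bool)
    improved = True
    while improved:                        # repeat until no further changes
        improved = False
        current = loglik(normal)
        for i in np.where(normal)[0]:      # try moving each point to A
            trial = normal.copy()
            trial[i] = False
            if loglik(trial) > current + min_gain:
                normal, improved = trial, True
                break
    return ~normal                         # mask of points assigned to A
```

Moving a gross outlier out of $M$ both collects the higher uniform likelihood for that point and sharpens the normal fit of the remainder, so the likelihood jump is large.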
Nearest neighbour approach
There are many ways to define outliers.

Example: data points for which there are fewer than $k$ neighboring points within a distance $\epsilon$.

Example: the $n$ points whose distance to their $k$-th nearest neighbour is largest.

Example: the $n$ points whose average distance to their first $k$ nearest neighbours is largest.

Each of these methods depends on the choice of some parameters: $k$, $n$, $\epsilon$. It is difficult to choose these in a systematic way.
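A sketch of the second definition (distance to the $k$-th nearest neighbour as the outlier score), with numpy and hypothetical data:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its distance to its k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)          # column 0 is the distance of a point to itself
    return D[:, k]          # k-th nearest neighbour distance
```

The $n$ points with the largest scores are then reported as outliers.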
Density approach
For each point $x_i$, compute a density estimate $f_{x_i,k}$ using its $k$ nearest neighbours.

The density estimate used is

$f_{x_i,k} = \left( \frac{\sum_{y \in N(x_i,k)} d(x_i, y)}{\#N(x_i,k)} \right)^{-1}$

Define

$LOF(x_i) = \frac{f_{x_i,k}}{\left( \sum_{y \in N(x_i,k)} f_{y,k} \right) / \#N(x_i,k)}$
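A sketch of these two formulas with numpy (hypothetical data). Note that with the ratio exactly as written, an isolated point gets a small LOF; the common convention, and the textbook figure on the next slide, reports the reciprocal so that outliers get the largest values:

```python
import numpy as np

def lof_scores(X, k=3):
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbours (not self)
    # f_{x_i,k}: inverse of the average distance to the k neighbours
    f = 1.0 / D[np.arange(n)[:, None], nbrs].mean(axis=1)
    # LOF(x_i) = f_{x_i,k} / (average f over the neighbours)
    return f / f[nbrs].mean(axis=1)
```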
Density-based LOF approach (from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004):

- For each point, compute the density of its local neighborhood.
- Compute the local outlier factor (LOF) of a sample $p$ as the average of the ratios of the density of sample $p$ and the density of its nearest neighbors.
- Outliers are points with the largest LOF values.
- In the nearest-neighbour approach, $p_2$ is not considered an outlier, while the LOF approach finds both $p_1$ and $p_2$ as outliers.

Figure: Nearest neighbour vs. density based
Detection rate
Set $P(O)$ to be the proportion of outliers or anomalies.

Set $P(D|O)$ to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.

Set $P(D|O^c)$ to be the probability of declaring an outlier if it is truly not an outlier.
Bayesian detection rate
The Bayesian detection rate is

$P(O|D) = \frac{P(D|O)P(O)}{P(D|O)P(O) + P(D|O^c)P(O^c)}.$

The false alarm rate or false discovery rate is

$P(O^c|D) = \frac{P(D|O^c)P(O^c)}{P(D|O^c)P(O^c) + P(D|O)P(O)}.$
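A quick numerical sketch of Bayes’ rule here (the numbers are made up to show the base-rate effect):

```python
def bayesian_detection_rate(p_outlier, p_detect, p_false_alarm):
    """P(O|D) from P(O), P(D|O) and P(D|O^c) via Bayes' rule."""
    num = p_detect * p_outlier
    return num / (num + p_false_alarm * (1 - p_outlier))
```

With 1% outliers, a 99% detection rate and a 1% false alarm rate, $P(O|D)$ is only 0.5: half of all declared outliers are false alarms.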