Statistics 202: Data Mining
Outliers

Based in part on slides from the textbook and slides of Susan Holmes

© Jonathan Taylor
December 2, 2012
Outliers
Concepts
What is an outlier? A set of data points that are considerably different from the remainder of the data.

When do they appear in data mining tasks?

- Given a data matrix $X$, find all cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$, or the top $n$ outlier scores.
- Given a data matrix $X$ containing mostly normal (but unlabeled) data points, and a test case $x_{new}$, compute an anomaly/outlier score of $x_{new}$ with respect to $X$.

Applications

- Credit card fraud detection
- Network intrusion detection
- Misspecification of a model
What is an outlier?
Issues
How many outliers are there in the data?

The method is unsupervised, similar to clustering or to finding clusters containing only one point.

Usual assumption: there are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data.
General steps
Build a profile of the “normal” behavior. The profile generally consists of summary statistics of this “normal” population.

Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.

General types of schemes involve a statistical model of “normal”, and “far” is measured in terms of likelihood.

Other schemes based on distances can be quasi-motivated by such statistical techniques.
Statistical approach
Assume a parametric model describing the distribution of the data (e.g., normal distribution).

Apply a statistical test that depends on:

- the data distribution (e.g. normal);
- the parameters of the distribution (e.g., mean, variance);
- the number of expected outliers (confidence limit, $\alpha$ or Type I error).
Grubbs’ Test
Suppose we have a sample of $n$ numbers $Z = \{Z_1, \ldots, Z_n\}$, i.e. an $n \times 1$ data matrix.

Assuming the data are from a normal distribution, Grubbs’ test uses the distribution of

$\max_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$

to search for outlying large values.
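As a quick sketch (in Python rather than R, with made-up numbers), the upper-tail Grubbs statistic is:

```python
import numpy as np

# Grubbs statistic for the largest value, as in the display above:
# (max_i Z_i - mean(Z)) / SD(Z), using the usual n-1 standard deviation.
def grubbs_upper(z):
    z = np.asarray(z, dtype=float)
    return (z.max() - z.mean()) / z.std(ddof=1)
```

For example, `grubbs_upper([1, 2, 3, 4, 5, 30])` is about 2.02; whether that counts as “large” is settled by the critical values on the following slides.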
Grubbs’ Test
Lower tail variant:

$\min_{1 \le i \le n} \frac{Z_i - \bar{Z}}{SD(Z)}$

Two-sided variant:

$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$
Grubbs’ Test
Having chosen a test statistic, we must determine a threshold that defines our rejection rule.

Often this is set via a hypothesis test to control Type I error.

For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that

$P_0\left( \max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)} \ge c_\alpha \right) \approx \alpha$

Above, $P_0$ denotes the distribution of $Z$ under the assumption there are no outliers.

If the $Z_i$ are IID $N(\mu, \sigma^2)$ it is generally possible to compute a decent approximation of this probability using Bonferroni.
Grubbs’ Test
The two-sided critical value has the form

$c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$

where $t_{\gamma,k}$, defined by $P(T_k \ge t_{\gamma,k}) = \gamma$, is the upper tail quantile of $T_k$.

In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
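The slides point to R’s qt for the t quantile; a sketch of the same computation in Python, using scipy.stats.t.ppf as the analogue of qt:

```python
import numpy as np
from scipy import stats

def grubbs_critical_value(n, alpha=0.05):
    """Two-sided Grubbs critical value c_alpha from the formula above."""
    # upper-tail quantile t_{alpha/(2n), n-2}: P(T_{n-2} >= t) = alpha/(2n)
    t = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
```

One declares the largest $|Z_i - \bar{Z}|/SD(Z)$ an outlier when it exceeds $c_\alpha$; for $n = 10$ and $\alpha = 0.05$ this gives $c_\alpha \approx 2.29$, matching standard Grubbs tables.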
Model based: linear regression with outliers
Model-based techniques (from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004):

- First build a model.
- Points which don’t fit the model well are identified as outliers.
- For data following a linear trend, a least squares regression model is appropriate.
- Residuals can be fed into Grubbs’ test.

Figure: Residuals from the model can be fed into Grubbs’ test or a Bonferroni variant.
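A minimal sketch of this pipeline, with simulated data and one injected outlier (the data and the magnitude 8.0 are made up for illustration):

```python
import numpy as np

x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x                 # a clean linear trend
y[15] += 8.0                      # inject one gross outlier

A = np.c_[np.ones_like(x), x]     # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# two-sided Grubbs-type score of each residual
g = np.abs(resid - resid.mean()) / resid.std(ddof=1)
```

The injected case dominates the scores (`g.argmax()` is 15); comparing `g.max()` to the Grubbs critical value for $n = 30$ completes the test.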
Multivariate data
If the non-outlying data is assumed to be multivariate Gaussian, what is the analogue of Grubbs’ statistic

$\max_{1 \le i \le n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$?

Answer: use the Mahalanobis distance

$\max_{1 \le i \le n} (Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$

Above, each individual statistic has what looks like a Hotelling’s $T^2$ distribution.
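A sketch of the Mahalanobis version with numpy (hypothetical data; each row’s statistic is $(Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z})$):

```python
import numpy as np

def mahalanobis_outlier_stat(X):
    """Squared Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # quadratic form c_i^T Sigma^{-1} c_i for every row i
    return np.einsum('ij,jk,ik->i', centered, Sigma_inv, centered)
```

The max over rows plays the role of Grubbs’ statistic.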
Likelihood approach
Assume the data are a mixture

$F = (1 - \lambda)M + \lambda A.$

Above, $M$ is the distribution of “most of the data.”

The distribution $A$ is an “outlier” distribution; it could be uniform on a bounding box for the data.

This is a mixture model. If $M$ is parametric, then the EM algorithm fits naturally here.

Any points assigned to $A$ are “outliers.”
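A sketch of scoring points under this mixture in Python. Rather than a full EM fit, this stand-in estimates $M = N(\mu, \sigma^2)$ robustly with the median and MAD (my simplification, not from the slides) and takes $A$ uniform on the data’s range; the returned value is the posterior probability that each point came from $A$:

```python
import numpy as np
from scipy import stats

def outlier_posterior(x, lam=0.05):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - med))       # MAD-based scale
    f_M = stats.norm.pdf(x, med, sigma)               # "most of the data"
    f_A = np.full_like(x, 1.0 / (x.max() - x.min()))  # uniform outlier model
    return lam * f_A / ((1 - lam) * f_M + lam * f_A)
```

Points with posterior near 1 are assigned to $A$, i.e. declared outliers.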
Likelihood approach
Do we estimate λ or fix it?
The book starts describing an algorithm that tries to maximize the equivalent classification likelihood

$L(\theta_M, \theta_A; l) = \left[ (1-\lambda)^{\#l_M} \prod_{i \in l_M} f_M(x_i; \theta_M) \right] \times \left[ \lambda^{\#l_A} \prod_{i \in l_A} f_A(x_i; \theta_A) \right]$
Likelihood approach: Algorithm
The algorithm tries to maximize this by forming iterative estimates $(M_t, A_t)$ of the “normal” and “outlying” data points.

1. At each stage, try moving individual points from $M_t$ to $A_t$.
2. Find $(\hat{\theta}_M, \hat{\theta}_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.
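A toy sketch of this greedy scheme in one dimension, assuming $M$ normal and $A$ uniform on the data’s range, with $\lambda$ fixed (the gain threshold and all numbers are illustrative, not from the book):

```python
import numpy as np
from scipy import stats

def greedy_outlier_partition(x, lam=0.05, min_gain=1e-6):
    x = np.asarray(x, dtype=float)
    log_fA = -np.log(x.max() - x.min())   # A: uniform on the data's range

    def loglik(normal):
        xm = x[normal]
        n_A = x.size - xm.size
        ll = xm.size * np.log(1 - lam)
        ll += stats.norm.logpdf(xm, xm.mean(), xm.std(ddof=1)).sum()
        return ll + n_A * (np.log(lam) + log_fA)

    normal = np.ones(x.size, dtype=bool)
    improved = True
    while improved:                        # repeat until no further changes
        improved = False
        current = loglik(normal)
        for i in np.where(normal)[0]:      # try moving each point to A
            trial = normal.copy()
            trial[i] = False
            if loglik(trial) > current + min_gain:
                normal, improved = trial, True
                break
    return ~normal                         # mask of points assigned to A
```

Moving a gross outlier out of $M$ both collects the higher uniform likelihood for that point and sharpens the normal fit of the remainder, so the likelihood jump is large.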
Nearest neighbour approach
There are many ways to define outliers.

Example: data points for which there are fewer than $k$ neighboring points within a distance $\epsilon$.

Example: the $n$ points whose distance to their $k$-th nearest neighbour is largest.

Example: the $n$ points whose average distance to their first $k$ nearest neighbours is largest.

Each of these methods depends on the choice of some parameters: $k$, $n$, $\epsilon$. It is difficult to choose these in a systematic way.
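A sketch of the second definition (distance to the $k$-th nearest neighbour as the outlier score), with numpy and hypothetical data:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its distance to its k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)          # column 0 is the distance of a point to itself
    return D[:, k]          # k-th nearest neighbour distance
```

The $n$ points with the largest scores are then reported as outliers.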
Density approach
For each point $x_i$, compute a density estimate $f_{x_i,k}$ using its $k$ nearest neighbours.

The density estimate used is

$f_{x_i,k} = \left( \frac{\sum_{y \in N(x_i,k)} d(x_i, y)}{\#N(x_i,k)} \right)^{-1}$

Define

$LOF(x_i) = \frac{f_{x_i,k}}{\left( \sum_{y \in N(x_i,k)} f_{y,k} \right) / \#N(x_i,k)}$
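A sketch of these two formulas with numpy (hypothetical data). Note that with the ratio exactly as written, an isolated point gets a small LOF; the common convention, and the textbook figure on the next slide, reports the reciprocal so that outliers get the largest values:

```python
import numpy as np

def lof_scores(X, k=3):
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbours (not self)
    # f_{x_i,k}: inverse of the average distance to the k neighbours
    f = 1.0 / D[np.arange(n)[:, None], nbrs].mean(axis=1)
    # LOF(x_i) = f_{x_i,k} / (average f over the neighbours)
    return f / f[nbrs].mean(axis=1)
```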
Density-based LOF approach (from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004):

- For each point, compute the density of its local neighborhood.
- Compute the local outlier factor (LOF) of a sample $p$ as the average of the ratios of the density of sample $p$ and the density of its nearest neighbors.
- Outliers are points with the largest LOF values.
- In the nearest-neighbour approach, $p_2$ is not considered an outlier, while the LOF approach finds both $p_1$ and $p_2$ as outliers.

Figure: Nearest neighbour vs. density based
Detection rate
Set $P(O)$ to be the proportion of outliers or anomalies.

Set $P(D|O)$ to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.

Set $P(D|O^c)$ to be the probability of declaring an outlier if it is truly not an outlier.
Bayesian detection rate
The Bayesian detection rate is

$P(O|D) = \frac{P(D|O)P(O)}{P(D|O)P(O) + P(D|O^c)P(O^c)}.$

The false alarm rate or false discovery rate is

$P(O^c|D) = \frac{P(D|O^c)P(O^c)}{P(D|O^c)P(O^c) + P(D|O)P(O)}.$
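A quick numerical sketch of Bayes’ rule here (the numbers are made up to show the base-rate effect):

```python
def bayesian_detection_rate(p_outlier, p_detect, p_false_alarm):
    """P(O|D) from P(O), P(D|O) and P(D|O^c) via Bayes' rule."""
    num = p_detect * p_outlier
    return num / (num + p_false_alarm * (1 - p_outlier))
```

With 1% outliers, a 99% detection rate and a 1% false alarm rate, $P(O|D)$ is only 0.5: half of all declared outliers are false alarms.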