Transcript
Page 1

Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics, Week 3

Lecturer: Beate Sick

[email protected]

Remark: Much of the material has been developed together with Oliver Dürr.

Page 2

Topics of today

2

• Similarity and Distances

• Numeric data

• Categorical data

• Mixed data types

• Multidimensional Scaling

• 2D plots of high dimensional data starting from pair-wise distances

• Outlier detection

• Univariate outlier detection by visual checks and additional tests

• Multivariate outlier detection

• Parametric: squared Mahalanobis distances and Chi-Square test

• Non-parametric: Robust PCA for multi-variate outlier detection

Page 3

What is Similarity?

"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.

Page 4

Defining Distance Measures (Recap)

Definition: Let O1 and O2 be two objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by d(O1, O2).

[Figure: example distances 0.23, 3, and 342.7 between the names "Peter" and "Piotr", depending on the measure chosen]

Page 5

(Dis-)similarity / Distance

Pairs of objects:

● Similarity (large ⇒ similar), vague definition
● Dissimilarity (small ⇒ similar), Rules 1-3
● Distance / Metric (small ⇒ similar), Rule 4 in addition

Examples of metrics (more follow with the examples):

● Euclidean and other Lp metrics
● Jaccard distance (1 - Jaccard index)
● Graph distance (shortest path)

Page 6

Example of a Metric

Task 1

• Draw 3 objects on a piece of paper and measure their distances (e.g., with a ruler).

• Is this a proper distance? Are Axioms 1-4 fulfilled?

Task 2

• The 3 entities A,B,C have the dissimilarity:

d(A,B) = 1

d(B,C) = 1

d(A,C) = 3

• Is this dissimilarity a distance?

• Can you try to draw them on a piece of paper?

6

Page 7

Problematic: Word maps

Try to draw a word map with:

Bank

Finance

Sitting

Triangle inequality: not just a mathematical gimmick!

The triangle inequality would imply:

d("sitting", "finance") ≤ d("sitting", "bank") + d("bank", "finance")

Page 8

We live in a Euclidean space

If we are presented objects in the two-dimensional plane, we intuitively assume Euclidean distances between the objects.

[Figure: four points p1 to p4 in the plane with coordinate axes]

Page 9

Euclidean Distance and its Generalization

Distance between observations o_i and o_j, with p features describing each observation.

Euclidean distance for two observations o_i, o_j described by p numeric features:

d(o_i, o_j) = sqrt( Σ_{k=1}^{p} (o_ik - o_jk)² )

Minkowski distance as a generalization:

d_r(o_i, o_j) = ( Σ_{k=1}^{p} |o_ik - o_jk|^r )^(1/r)

2D example (2 features per observation):

obs    x1  x2
o1=p1   0   2
o2=p2   2   0
o3=p3   3   1
o4=p4   5   1

e.g. d(o2, o3) = sqrt( (2 - 3)² + (0 - 1)² ) = sqrt(2)

[Figure: the four points p1 to p4 plotted in the (x1, x2) plane]
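The worked example above can be checked numerically. The slides use R; the following is just an illustrative Python/NumPy sketch of the same formulas:

```python
import numpy as np

# Observations from the 2D example: rows are o1..o4, columns x1, x2
X = np.array([[0, 2],
              [2, 0],
              [3, 1],
              [5, 1]], dtype=float)

def minkowski(a, b, r):
    """Minkowski distance d_r(a, b) = (sum_k |a_k - b_k|^r)^(1/r)."""
    return float((np.abs(a - b) ** r).sum() ** (1.0 / r))

# Euclidean distance (r = 2) between o2 and o3: sqrt((2-3)^2 + (0-1)^2) = sqrt(2)
d23 = minkowski(X[1], X[2], r=2)
print(round(d23, 4))  # 1.4142
```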

Page 10

L1: Manhattan Distance

One block is one unit.

• How many blocks do you have to walk from A to B?
• What is the L1 distance from A to B (r = 1)?
• What is the Euclidean distance?

d_r(o_i, o_j) = ( Σ_{k=1}^{p} |o_ik - o_jk|^r )^(1/r)

[Figure: Manhattan street grid with points A and B. Image from Wikipedia]

Page 11

Minkowski Distances

r = 1: City block (Manhattan, taxicab, L1 norm) distance:

d(o_i, o_j) = Σ_{k=1}^{p} |o_ik - o_jk|

r = 2: Euclidean distance (L2 norm):

d(o_i, o_j) = sqrt( Σ_{k=1}^{p} (o_ik - o_jk)² )

r = ∞: "Supremum" or maximum (Lmax norm, L∞ norm) distance; this is the maximum difference between any component of the vectors:

d(o_i, o_j) = max_{k=1..p} |o_ik - o_jk|
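The three special cases can be compared on one pair of vectors; the loop also illustrates that d_r approaches the L∞ distance as r grows (Python sketch for illustration, with hypothetical vectors):

```python
import numpy as np

a = np.array([0.0, 2.0])
b = np.array([3.0, 1.0])

d1 = np.abs(a - b).sum()            # r = 1: city block, |0-3| + |2-1| = 4
d2 = np.sqrt(((a - b) ** 2).sum())  # r = 2: Euclidean, sqrt(9 + 1)
dinf = np.abs(a - b).max()          # r = inf: maximum component difference, 3

# d_r approaches the maximum component difference as r grows
for r in (1, 2, 8, 64):
    print(r, round(float((np.abs(a - b) ** r).sum() ** (1.0 / r)), 4))
```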

Page 12

Distance matrix

As discussed on the last couple of slides, there are different possibilities to determine the pairwise distance d_ij = d(o_i, o_j) between two observations o_i and o_j.

We can collect all these pairwise distances d_ij in a distance matrix:

D = ( 0     d_12  ...  d_1n
      d_21  0     ...  d_2n
      ...         ...
      d_n1  d_n2  ...  0   )

All diagonal elements are 0: d_kk = d(o_k, o_k) = 0.

Symmetry: d_ij = d(o_i, o_j) = d(o_j, o_i) = d_ji
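Assembling such a matrix for four example points and checking the zero diagonal and symmetry can be sketched in Python (scipy.spatial.distance offers pdist/squareform for the same purpose):

```python
import numpy as np

# Hypothetical 2D observations
X = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
n = len(X)

# Pairwise Euclidean distance matrix D with D[i, j] = d(o_i, o_j)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())

assert np.allclose(np.diag(D), 0)  # all diagonal elements are 0
assert np.allclose(D, D.T)         # symmetry: d_ij = d_ji
```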

Page 13

How to calculate dissimilarities

with categorical variables?

13

Page 14

Similarity measures for binary data

A common situation is that objects o1 and o2 have only binary attributes, for example gender (f/m), driving license (yes/no), Nobel prize holder (yes/no):

o1 = (o11, o12, ..., o1p) and o2 = (o21, o22, ..., o2p)

We distinguish between symmetric and asymmetric binary variables.

In symmetric binary variables, both levels have roughly comparable frequencies. Example: gender.

In asymmetric binary variables, the two levels have very different frequencies. Example: Nobel prize holder.

Page 15

Matching coefficient: similarity measure for "symmetric" binary vectors

The objects o1 and o2 have only binary attributes:

o1 = (o11, o12, ..., o1p) and o2 = (o21, o22, ..., o2p)

The Simple Matching Coefficient for symmetric binary variables (possibly only a subset of the p binary variables) is defined as:

SMC = # matches / # attributes = (M11 + M00) / (M01 + M10 + M11 + M00)

corresponding to the proportion of matching attributes over all attributes, where

M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M00 = number of attributes where o1i is 0 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1

Page 16

Jaccard coefficient: similarity measure for "asymmetric" binary vectors

The objects o1 and o2 have only binary attributes:

o1 = (o11, o12, ..., o1p) and o2 = (o21, o22, ..., o2p)

The Jaccard coefficient for asymmetric binary variables (possibly only a subset of the p binary variables) is defined as:

J = # both-1 matches / # attributes that are not both zero = M11 / (M01 + M10 + M11)

corresponding to the proportion of matching attributes over those attributes which are 1 in at least one of the two observations, where

M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1

Page 17

Example: How similar are two given binary vectors?

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p is 0 and q is 1)
M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
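The counts and both coefficients can be reproduced directly (illustrative Python sketch):

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four match/mismatch types over all attribute pairs
m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # 7/10 = 0.7
jac = m11 / (m01 + m10 + m11)                # 0/3  = 0.0
print(smc, jac)  # 0.7 0.0
```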

Page 18

More than 2 levels (nominal data)

Simple mismatching coefficient (ranges between 0 and 1):

d_ij = # mismatches / # features = mm / p

the proportion of features where the observations differ, with

mm: number of variables where objects i and j do not match
p: number of features

Character strings can be understood as nominal data. If the strings are of equal length, this is also called the Hamming distance (sometimes without dividing by p).

What is the Hamming distance between HOUSE and MOUSE?
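A minimal sketch of the Hamming distance and the normalized mismatch coefficient (Python, for illustration):

```python
def hamming(s, t):
    """Number of positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))

mm = hamming("HOUSE", "MOUSE")  # only the first letter differs -> 1
d = mm / len("HOUSE")           # normalized mismatch coefficient 1/5
print(mm, d)  # 1 0.2
```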

Page 19

Gower's dissimilarity for mixed data types

Idea: use a distance measure d_ij^(k) between 0 and 1 for each corresponding pair of variables or features in two observations.

- If the kth variable is binary or nominal, use the methods discussed, e.g.:

  d_ij^(k) = 1 - (M11 + M00) / (M01 + M10 + M11 + M00)

- If the kth variable is numeric:

  d_ij^(k) = |x_ik - x_jk| / R_k

  x_ik: value for object i in variable k
  R_k: range of variable k over all objects

- If the kth variable is ordinal, use normalized ranks, then proceed as with numeric variables.

Aggregate the distance measures over all variables/features/dimensions:

  d_ij = (1/p) Σ_{k=1}^{p} d_ij^(k)
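Gower's aggregation can be sketched for a small mixed example; the observations, feature names, and ranges below are hypothetical:

```python
def gower(x, y, kinds, ranges):
    """Gower dissimilarity between two observations.

    kinds[k] is 'num' or 'cat'; ranges[k] is the range R_k of numeric
    variable k over all objects (ignored for categorical variables).
    """
    parts = []
    for xk, yk, kind, rk in zip(x, y, kinds, ranges):
        if kind == "num":
            parts.append(abs(xk - yk) / rk)        # |x_ik - x_jk| / R_k, in [0, 1]
        else:
            parts.append(0.0 if xk == yk else 1.0)  # simple mismatch
    return sum(parts) / len(parts)                  # average over the p features

# Hypothetical mixed observations: (height_cm, smoker, eye_color)
a = (170, "yes", "blue")
b = (180, "no", "blue")
d = gower(a, b, kinds=("num", "cat", "cat"), ranges=(50, None, None))
print(round(d, 4))  # (10/50 + 1 + 0) / 3 = 0.4
```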

Page 20

Dissimilarity for mixed data types with the R function "daisy", calculating Gower's dissimilarity

> str(flower)
'data.frame': 18 obs. of 8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3",..: 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<..: 15 3 1 16 2 12 ...
 $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
> library(cluster)
> dist = daisy(flower, type=list(asymm=c(1, 3), symm=2, ordratio=7))
> str(dist)
Classes 'dissimilarity', 'dist' atomic [1:153] 0.901 0.618 ...
 ..- attr(*, "Size")= int 18
 ..- attr(*, "Metric")= chr "mixed"
 ..- attr(*, "Types")= chr [1:8] "A" "S" "A" "N" ...

Page 21

Visualize the distance matrix: heatmaps are great!

library(cluster)
dist = daisy(flower)
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)

Page 22

How to visualize

multivariate observations of

mixed data types in 2D?

22

Page 23

Goal of multidimensional scaling

MDS takes as input the distances between observations or data points and produces a visualization of the points in 2D. The bars between points represent the given distances between the points.

As input to MDS we only know the distances; we look for a low-dimensional point configuration in which the points have the same or similar distances.

Page 24

Example for metric MDS

Problem: given the Euclidean distances among points, recover the positions of the points!

Example: road distances between 21 European cities (almost Euclidean, but not quite).

[Distance matrix of the eurodist data]
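Classical (metric) MDS, as implemented by R's cmdscale, can be sketched via double centering and an eigendecomposition of the squared distance matrix. The four points below are hypothetical; only their pairwise distances are fed to the sketch, which recovers an equivalent configuration:

```python
import numpy as np

def cmdscale(D, k=2):
    """Classical MDS: recover a k-dim configuration from a distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]    # largest k eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# Hypothetical planar points -> their Euclidean distance matrix
X = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

Y = cmdscale(D)
D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.allclose(D, D2))  # True: distances are reproduced (up to rotation/flip)
```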

Page 25

Example for metric MDS

MDS in R, applied to the eurodist data:

res.cmd = cmdscale(eurodist)
plot(res.cmd, pch="")
text(res.cmd, labels=rownames(res.cmd))

[Figure: 2D MDS configuration (res.cmd[,1] vs. res.cmd[,2]) with the city labels Athens, Barcelona, Brussels, Calais, Cherbourg, Cologne, Copenhagen, Geneva, Gibraltar, Hamburg, Hook of Holland, Lisbon, Lyons, Madrid, Marseilles, Milan, Munich, Paris, Rome, Stockholm, Vienna]

The configuration can be
- shifted
- rotated
- reflected
without changing the distances.

Page 26

Example for metric MDS

After flipping the vertical axis:

[Figure: the flipped MDS configuration]

Page 27

Equivalence of PCA and MDS with Euclidean distance

PCA works on the representation in the data matrix; MDS works on the representation in the Euclidean distance matrix.

MDS on Euclidean distances results in a low-dimensional representation that is equivalent (up to rotation, flipping, and shifts) to PCA on the data matrix (however, the data matrix must first be derived from the distance matrix).
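The equivalence can be verified numerically: PCA scores of a centered data matrix and classical MDS coordinates computed from its Euclidean distance matrix agree up to a sign flip per axis (Python sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Xc = X - X.mean(axis=0)  # center the data matrix

# PCA scores: project the centered data on the right singular vectors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T       # n x 3 PCA representation

# Classical MDS on the Euclidean distance matrix
D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
evals, evecs = np.linalg.eigh(B)
idx = np.argsort(evals)[::-1][:3]
Y = evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# Same representation up to a sign per axis
match = all(np.allclose(scores[:, k], Y[:, k]) or
            np.allclose(scores[:, k], -Y[:, k]) for k in range(3))
print(match)  # True
```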

Page 28

Distance matrix and 2D plot of multivariate mixed data

library(cluster)
dist = daisy(flower)
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)

library(MASS)
mds = isoMDS(mdist, k=2)
d.mds = as.data.frame(mds$points)
names(d.mds) = c("c1", "c2")

library(ggplot2)
ggplot(data=d.mds, aes(x=c1, y=c2)) +
  geom_point() +
  geom_text(label=row.names(mdist), hjust=1.2)

Page 29

Outlier detection

29

Page 30

How much does an observation differ from the average?

Page 31

z-score

Let's standardize and look at the z-score.

We start from a variable X with E(X) = μ_X and Var(X) = σ_X², and apply the z-transformation:

Z = (X - μ_X) / σ_X

The standardized variable Z has mean zero and variance one:

E(Z) = 0, sd(Z) = 1

Often the z-transformation is applied to different univariate features to make them "comparable". A z-score of -2 always means that the observation is two standard deviations smaller than the population mean.

In case of a normally distributed X, we know that Z ~ N(0, 1).
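A quick numeric check of the z-transformation (Python sketch; the sample values are hypothetical):

```python
import statistics

# Hypothetical IQ-like sample
x = [85, 100, 115, 100, 130, 70, 100]
mu = statistics.mean(x)
sigma = statistics.pstdev(x)         # population standard deviation

z = [(xi - mu) / sigma for xi in x]  # z-transformation

# The standardized values have mean 0 and standard deviation 1
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```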

Page 32

The z-score has the unit "standard deviation"

How much is my IQ above/below the average?

[Figure: scale from 0 to 10sd under a distribution curve]

Remark: mean and SD can also be determined for non-normally distributed variables, but the intuition is lost and we might prefer to work with quantiles.

Page 33

The multivariate normal distribution

Remarks:

• All marginal distributions of a multivariate normal distribution are univariate normal distributions.

• All conditional distributions of a multivariate normal distribution are univariate normal distributions.

• Each iso-density line is an ellipse, or its higher-dimensional generalization.

Density of x ~ N(μ, Σ) in k dimensions:

f(x) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ) )

[Figure: density surface and iso-density ellipses of a bivariate normal example (X, Y)]

Page 34

Mahalanobis distance is the multivariate z-score

MD(x) = sqrt( (x - μ)ᵀ Σ⁻¹ (x - μ) )

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of standard deviations.

In case of a multivariate normally distributed x, the squared distance MD²(x) follows a χ² distribution with df = p.

[Figure: iso-distance ellipses with MD = 1 and MD = 2]
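A small sketch of the Mahalanobis computation; μ and Σ below are hypothetical, chosen so that two points at the same Euclidean distance from the mean get different MD values:

```python
import numpy as np

# Hypothetical bivariate normal parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """MD(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

# Two points with the same Euclidean distance 2 from the mean ...
print(mahalanobis(np.array([2.0, 0.0]), mu, Sigma_inv))  # 1.0: one sd along x1
print(mahalanobis(np.array([0.0, 2.0]), mu, Sigma_inv))  # 2.0: two sds along x2
```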

Page 35

Outlier

We need expectations or a model to identify an outlier!

Page 36

Outlier detection using a boxplot representation

All points beyond the whiskers are called “extreme” values.

Is there any model?

36

Page 37

The model behind the "extreme" value definition in boxplots

About 99% of N(0,1) data lie within the whiskers (illustrated by a boxplot of 100k data points simulated from a N(0,1)).

When visualizing non-normally distributed data, this model is not valid.

Page 38

Outlier detection in the univariate case via the Grubbs test

library(outliers)
x = c(45,56,54,34,32,45,67,45,67,65,154)  # 154 is the potential outlier
grubbs.test(x)

# Grubbs test for one outlier
#
# data: x
# G = 2.80490, U = 0.13459, p-value = 0.0001816
# alternative hypothesis: highest value 154 is an outlier

Grubbs developed this test statistic in 1950 (assuming a normal distribution, as in the t-test for small n) to investigate whether "some time during the experiment something might have happened to cause an extraneous variation on the high side or on the low side". It is nowadays also routinely used in regression model checking procedures (i.e. to find outliers in Cook's d values or standardized residuals).
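The Grubbs statistic G = max|x_i - x̄| / s can be recomputed by hand from the same data; it matches the G value in the R output above (illustrative Python sketch):

```python
import statistics

x = [45, 56, 54, 34, 32, 45, 67, 45, 67, 65, 154]
mean = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation (n - 1 in the denominator)

# Grubbs statistic: largest absolute deviation in units of s
G = max(abs(xi - mean) for xi in x) / s
print(round(G, 4))  # 2.8049
```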

Page 39

Outlier detection in the multivariate case via χ² test

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of SD.

For x_{p×1} ~ N_p(μ_{p×1}, Σ_{p×p}):

MD(x) = sqrt( (x - μ)ᵀ Σ⁻¹ (x - μ) )

MD²(x) ~ χ² with df = p

Outlier detection via the Mahalanobis distance can be performed for data for which the multivariate normal assumption is reasonable, by checking whether the MD² of a p-dimensional observation is "sticking out" of the χ² distribution with df = p.
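The χ² cutoff rule can be sketched as follows; the data are simulated, one outlier is planted, and the 97.5% quantile of χ² with df = 3 (≈ 9.348) is hard-coded to keep the sketch dependency-free:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-dim normal data with one planted outlier
X = rng.multivariate_normal(mean=np.zeros(3), cov=np.eye(3), size=200)
X[0] = [8.0, 8.0, 8.0]  # far from the center

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = X - mu
md2 = np.einsum("ij,jk,ik->i", d, Sigma_inv, d)  # squared Mahalanobis distances

cutoff = 9.348  # 97.5% quantile of the chi-square distribution with df = 3
outliers = np.where(md2 > cutoff)[0]
print(0 in outliers.tolist())  # True: the planted point exceeds the cutoff
```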

Page 40

Outlier detection via χ² quantiles

• Compute for each p-dimensional observation x the (robust version of the) squared Mahalanobis distance from the assumed normal-distribution center: MD(x)².

• Generate a quantile-quantile plot to identify observations that deviate from the expected χ² distribution with df = p (MD(x)² > 97.5% quantile of χ²_p).

• Use in addition "adjusted quantiles", which are estimated by simulations from the expected chi-square distribution without outliers.

• Use (robust) PCA to visualize the data in a 2D score plot.

Page 41

Extreme quantiles of the χ² distribution indicate outliers

library(mvoutlier)
aq.plot(dat)     # to get the shown adjusted quantile plot
chisq.plot(dat)  # to get an interactive qq plot to select outliers

[Figure: adjusted quantile plots and PC1/PC2 score plots with the flagged outliers]

Page 42

Adjusted quantile via simulation

The point where the ECDF leaves the "plausible" range defines an adaptive cutoff.

Slide credit: Markus Kalisch

Page 43

Outlier detection via robust PCA

imagine 784 dimensions ;-)

Assumption: The manifold hypothesis holds.

43

Page 44

Dimension reduction via PCA

A PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:

Y_{n×p} = X_{n×p} A_{p×p}    (PCA representation)
X_{n×p} = Y_{n×p} Aᵀ_{p×p}    (full reconstruction)

Partly reconstruct X with only k < p PCs:

X̂_{n×p} = [ Y_{n×k}, 0_{n×(p-k)} ] Aᵀ

How good is the data representation? The reconstruction error is given by the squared orthogonal distance between a data point and its projection on the plane.

PCA minimizes the reconstruction error over all available m data points:

Σ_{i=1}^{m} || x^(i) - x̂^(i) ||² = || X - X̂ ||²
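The partial reconstruction and its error can be sketched with simulated, approximately planar 3D data (illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 50 points in 3 dims, mostly varying within a 2D plane
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 3)) \
    + 0.05 * rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)

# Columns of A are the principal directions (an orthogonal rotation matrix)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = Vt.T

Y = Xc @ A        # PCA representation
k = 2
Yk = Y.copy()
Yk[:, k:] = 0     # keep only the first k PCs
X_hat = Yk @ A.T  # partial reconstruction

# Reconstruction error = sum of squared orthogonal distances
err = ((Xc - X_hat) ** 2).sum()
total = (Xc ** 2).sum()
print(err < 0.01 * total)  # True: 2 PCs capture almost all variance here
```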

Page 45

PCA is not robust against outliers

The first two PCs point in the directions of maximal variance. Since the variance is not robust against outliers, the result of PCA is also not robust against outliers.

We can use a robust version of PCA which is resistant to outliers.

[Figure: PC1 with classical PCA vs. PC1 with robust PCA]

Page 46

PCA can be used for outlier detection

The reconstruction of the red point has a reconstruction error that corresponds to the squared distance between the red and green points; PCA minimizes the sum of squared distances.

Points with extreme reconstruction errors are identified as outliers.

[Figure: data points and their projections onto the first PC]

Page 47

We should use robust PCA to identify outliers via reconstruction errors

In robust PCA the directions of the PCs are not heavily influenced by the positions of a few outliers. Outliers therefore have larger distances to the hyperplane which is spanned by the first couple of PCs and which captures a large part of the variance of the non-outlying points.

[Figure: outliers lying far from the robust PC plane]

Page 48

PCA in R

There are two major R implementations of PCA: prcomp() and princomp().

- prcomp is numerically more stable and is therefore preferred (see chapter 2.7 in the StDM script).

- princomp has a few more options and is therefore sometimes used.

For robust PCA and outlier detection we can use the package rrcov:

- PcaHubert performs robust PCA.

Page 49

Summary

• We can use different measures to quantify the dissimilarity between two observations described by the same features, e.g.

• Euclidean and other Lp metrics for quantitative data

• matching coefficient for (symmetric) categorical data

• Jaccard coefficient for (asymmetric) categorical data

• Gower dissimilarities for mixed data types (see the R function daisy in the package cluster)

• A distance matrix holds the pair-wise distances between several observation units and can be visualized by a heatmap.

• Multidimensional scaling (MDS) yields a 2D plot for high-dimensional data:

• MDS starts with a distance matrix.

• The pair-wise distances are preserved as well as possible in the 2D plot.

• PCA yields the same PC1-PC2 2D plot as MDS on the Euclidean distances.

• Outlier detection in high-dimensional data can be tackled by

• quantile plots of the χ² distributed squared Mahalanobis distances from the assumed normal-distribution center,

• robust PCA and the distances to the PC1-PC2 hyperplane.