A robust clustering approach to fraud detection

Luis Angel García-Escudero

Dpto. de Estadística e I.O. and IMUVA - Universidad de Valladolid

Joint (and ongoing) work with A. Mayo-Iscar, A. Gordaliza, C. Matrán (U. Valladolid), M. Riani, A. Cerioli (U. Parma) and D. Perrotta, F. Torti (JRC-Ispra)

Conference on Benford’s Law for fraud detection. Stresa, July 2019

1.- CLUSTERING AND ROBUSTNESS

• Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters.

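• Sample mean:

  - $m = \frac{1}{n} \sum_{i=1}^{n} x_i$ minimizes $\sum_{i=1}^{n} \|x_i - m\|^2$

  - $m$ may be seen as the “center” of a data-cloud:

[Figure: scatter plot of a data cloud with a point marked “X”; axes labelled x1 and y2.]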

• k clusters ⇒ k “data-clouds” ⇒ k-means

• k-means: Search for

  - k centers $m_1, \ldots, m_k$

  - a partition $\{R_1, \ldots, R_k\}$ of $\{1, 2, \ldots, n\}$

minimizing $\sum_{j=1}^{k} \sum_{i \in R_j} \|x_i - m_j\|^2$.

• Cluster j:

$$R_j = \{i : \|x_i - m_j\| \le \|x_i - m_l\| \text{ for every } l = 1, \ldots, k\}$$

(...assignment to the closest center...)
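
A minimal sketch of how this search is commonly approximated: alternate the closest-center assignment with a recomputation of each center as the mean of its cluster (Lloyd-style iterations). This is plain Python written for illustration, not code from the talk; the helper names (`sq_dist`, `mean_point`, `kmeans`), the toy data and the choice of initial centers are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise sample mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(data, init_centers, n_iter=50):
    """Alternate closest-center assignment and mean updates until stable."""
    centers = list(init_centers)
    k = len(centers)
    for _ in range(n_iter):
        # Assignment step: point i joins R_j for the closest center m_j.
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in data]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    # Final assignment and value of the k-means objective.
    labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in data]
    objective = sum(sq_dist(x, centers[lab]) for x, lab in zip(data, labels))
    return centers, labels, objective


# Hypothetical toy data: two small, well-separated groups.
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels, objective = kmeans(data, init_centers=[data[0], data[3]])
```

The result depends on the initial centers, so in practice several random starts are usually tried and the solution with the smallest objective is kept.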

• Robustness: Many statistical procedures are strongly affected by even a few outlying observations:

  - The mean is not robust:

$$\bar{x} = \frac{1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78}{7} \approx 1.746$$

$$\bar{x} = \frac{1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78}{7} \approx 27.486$$

  - k-means inherits that lack of robustness from the mean
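
The arithmetic of this example can be checked with a few lines of Python. The sketch below recomputes both means and, as a point of comparison that is not in the talk, also the median, which is unaffected by the single mistyped value. The variable names are illustrative.

```python
from statistics import mean, median

values = [1.72, 1.67, 1.80, 1.70, 1.82, 1.73, 1.78]
corrupted = [1.72, 1.67, 1.80, 1.70, 182, 1.73, 1.78]  # 1.82 recorded as 182

# One wrong observation is enough to move the mean far from the bulk of the data.
clean_mean = mean(values)
bad_mean = mean(corrupted)

# The median (added here only for comparison) stays where it was.
clean_median = median(values)
bad_median = median(corrupted)

print(round(clean_mean, 3), round(bad_mean, 3), clean_median, bad_median)
```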

• Lack of robustness of k-means:

[Figure: two scatter plots with axes x1 and x2. (a) 3-means, annotated “2 groups artificially joined”. (b) 2-means, with legend “Cluster 1” and “Cluster 2”.]

• Outliers can be seen as “clusters by themselves”

• So, why not increase the number of clusters?

  - But:

    · For (physical, economic, ...) reasons we may have an initial idea of k without being aware of the existence of outliers

    · “Radial/background” noise requires large values of k

• Moreover, the detection of outliers may be the goal itself!

• Outliers in trade data can be associated with “frauds”:

  - Heterogeneous sources of data (clustering) + a few outliers (frauds?)

[Figure: “Trade data” scatter plot; x-axis: Quantity (tons), y-axis: Value (1000 euros).]

2.- TRIMMED k-MEANS

• Trimming is the oldest and most widely used way to achieve robustness.

• Trimmed mean: The proportion α/2 of smallest observations and the proportion α/2 of largest observations are discarded before computing the mean:

[Figure: one-dimensional sample on the x axis, with the observations in both tails marked “Trimmed”.]
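
A short sketch of the symmetric trimmed mean, applied to the corrupted sample from the earlier example of the non-robust mean. The function name `trimmed_mean` and the rounding rule (the number of observations dropped per tail is rounded down) are assumptions made for illustration.

```python
import math


def trimmed_mean(values, alpha):
    """Mean after discarding a proportion alpha/2 of the observations in each tail."""
    xs = sorted(values)
    n = len(xs)
    cut = math.floor(n * alpha / 2)  # observations dropped per tail (rounding down is an assumption)
    kept = xs[cut:n - cut]
    return sum(kept) / len(kept)


sample = [1.72, 1.67, 1.80, 1.70, 182, 1.73, 1.78]
tm = trimmed_mean(sample, alpha=0.3)  # drops the smallest and the largest value
```

With one observation removed from each tail, the mistyped value 182 no longer enters the average.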

• But... how to trim in clustering?

  - Why not trim outlying “bridge” points?

[Figure: one-dimensional sample on the x axis, annotated “Non-trimmed ‘bridge’ points”.]

  - Why symmetric trimming?

[Figure: one-dimensional sample on the x axis, annotated “Symmetric trimming?”.]

  - How to trim in multivariate clustering problems?

• Idea: Let the data themselves tell us which are the most outlying observations!

  - Data-driven, adaptive, impartial, ... trimming!

• Trimmed k-means: we search for

  - k centers $m_1, \ldots, m_k$ and

  - a partition $\{R_0, R_1, \ldots, R_k\}$ of $\{1, 2, \ldots, n\}$ with $\#R_0 = [n\alpha]$

minimizing $\sum_{j=1}^{k} \sum_{i \in R_j} \|x_i - m_j\|^2$.

[A fraction α of the data, the observations in $R_0$, is not taken into account: it is “trimmed”]
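
One way to approximate this search is to add a trimming step to the usual k-means iterations: at each pass, the observations farthest from their closest center are set aside before the centers are updated. The sketch below is plain Python for illustration, not the implementation used in the talk. The function names, the toy data, the initial centers and the reading of $[n\alpha]$ as rounding up are assumptions; label 0 stands for the trimmed set $R_0$.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise sample mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def trimmed_kmeans(data, init_centers, alpha, n_iter=50):
    """k-means iterations in which the farthest points are trimmed at every pass."""
    n, k = len(data), len(init_centers)
    n_trim = math.ceil(n * alpha)  # size of R_0 (rounding up is an assumption)
    centers = list(init_centers)
    labels = [0] * n
    for _ in range(n_iter):
        # Closest center and squared distance to it, for every point.
        closest = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in data]
        dists = [sq_dist(x, centers[j]) for x, j in zip(data, closest)]
        # Trimming step: the n_trim points farthest from their center go to R_0.
        order = sorted(range(n), key=lambda i: dists[i])
        trimmed = set(order[n - n_trim:])
        new_labels = [0 if i in trimmed else closest[i] + 1 for i in range(n)]
        # Update step: means of the non-trimmed points of each cluster.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(data, new_labels) if lab == j + 1]
            new_centers.append(mean_point(members) if members else centers[j])
        converged = new_labels == labels and new_centers == centers
        labels, centers = new_labels, new_centers
        if converged:
            break
    return centers, labels


# Hypothetical toy data: two groups plus one gross outlier (last point).
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2), (4.8, 4.8),
        (50.0, -40.0)]
centers, labels = trimmed_kmeans(data, [data[0], data[4]], alpha=0.1)
```

In this toy run the outlier ends up with label 0 and does not pull either center away from its group.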

• Black circles: trimmed points (k = 3 and α = 0.05):

[Figure: two scatter plots, (a) and (b), with axes x1 and x2.]

• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previous eruption length” and n = 271

[Figure: “Classification, k = 3, α = 0.03”; x-axis: Eruption length, y-axis: Previous eruption length.]

  - k = 3 and α = 0.03 (0.03 · 271 ≈ 8.13, rounded up to 9 trimmed obs.): 6 rare “short-followed-by-short” eruptions trimmed, 3 bridge points...

3.- ROBUST MODEL-BASED CLUSTERING

• k-means and trimmed k-means prefer spherical clusters:

[Figure: two scatter plots. (a) 2-means (spherical groups). (b) 2-means (elliptical groups).]

• Elliptically contoured clusters?

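• Multivariate normal distributions with densities $\phi(\cdot; \mu, \Sigma)$:

  - $\mu = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ [spherical] in (a)

  - $\mu = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$ [non-spherical] in (b)

[Figure: two panels, (a) and (b), with axes x1 and y2.]

$$\phi(x; \mu, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left( -(x - \mu)' \Sigma^{-1} (x - \mu) / 2 \right)$$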
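
To make the density formula concrete, the sketch below evaluates it for p = 2 with the two covariance matrices of the example. It is an illustrative plain-Python version (the function name `normal_density_2d` is an assumption) that writes out the 2 × 2 determinant and inverse by hand.

```python
import math


def normal_density_2d(x, mu, sigma):
    """Bivariate normal density phi(x; mu, Sigma) for a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    p = 2
    return (2 * math.pi) ** (-p / 2) * det ** (-0.5) * math.exp(-quad / 2)


mu = (2.0, 2.0)
spherical = ((1.0, 0.0), (0.0, 1.0))
non_spherical = ((2.0, 1.0), (1.0, 1.0))

# Both matrices have determinant 1, so the densities coincide at the center...
peak_a = normal_density_2d(mu, mu, spherical)
peak_b = normal_density_2d(mu, mu, non_spherical)
# ...but differ away from it, because the contours in (b) are not circles.
off_a = normal_density_2d((2.0, 3.0), mu, spherical)
off_b = normal_density_2d((2.0, 3.0), mu, non_spherical)
```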

• Trimmed likelihoods: Search for

  - k centers $m_1, \ldots, m_k$,

  - k scatter matrices $S_1, \ldots, S_k$, and

  - a partition $\{R_0, R_1, \ldots, R_k\}$ of $\{1, 2, \ldots, n\}$ with $\#R_0 = [n\alpha]$

maximizing

$$\sum_{j=1}^{k} \sum_{i \in R_j} \log \phi(x_i; m_j, S_j)$$

(obs. in $R_0$ not taken into account)

García-Escudero et al. 2008, Neykov et al. 2007, Gallegos and Ritter 2005, ...

• Constraints on the $S_j$ scatter matrices are needed:

  - The target likelihood function is unbounded without them

  - They avoid detecting (non-interesting) “spurious” clusters

• Control relative axes’ lengths (eigenvalue constraints):

[Figure: two panels, labelled “c = 1” and “Large c value”.]
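
A small sketch of what such a constraint can look like, assuming it takes the form "largest eigenvalue divided by smallest eigenvalue, over all scatter matrices, is at most c" (the talk does not spell the formula out, so this form is an assumption here). The code handles symmetric 2 × 2 matrices only, and the function names are illustrative.

```python
import math


def eigenvalues_sym_2x2(s):
    """Eigenvalues (largest first) of a symmetric 2x2 matrix ((a, b), (b, d))."""
    (a, b), (_, d) = s
    half_trace = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return half_trace + radius, half_trace - radius


def eigenvalue_ratio(scatter_matrices):
    """Largest eigenvalue over smallest eigenvalue, taken across all matrices."""
    eigs = [lam for s in scatter_matrices for lam in eigenvalues_sym_2x2(s)]
    return max(eigs) / min(eigs)


def satisfies_constraint(scatter_matrices, c):
    """True when the eigenvalue ratio does not exceed the constant c."""
    return eigenvalue_ratio(scatter_matrices) <= c


spherical = ((1.0, 0.0), (0.0, 1.0))
non_spherical = ((2.0, 1.0), (1.0, 1.0))
ratio = eigenvalue_ratio([spherical, non_spherical])
```

Under this reading, c = 1 forces every cluster to be spherical with the same scatter, while a large c leaves the shapes almost free.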

• The FSDA Matlab toolbox:

• The R package tclust, available from the CRAN repository:

• The R package tclust:

> library(tclust)

• tkmeans(data, k, alpha)

  - k = “number of groups”

  - alpha = “trimming proportion”

• tclust(data, k, alpha, restr.fact, ...)

  - restr.fact = “strength of the constraints”

• tclust(X, k = 3, alpha = 0.03, restr.fact = 50)

[Figure: “Large restr.fact, k = 3, α = 0.03”; axes x1 and x2.]

• Old Faithful Geyser data again:

[Figure: two classification plots, “k = 4, α = 0” and “k = 3, α = 0.03”; x-axis: Eruption length, y-axis: Previous eruption length.]

• Why was k = 3 and α = 0.03 a sensible solution?

• Applying ctlcurves to the Old Faithful Geyser data:

[Figure: two “CTL-Curves” plots (Restriction Factor = 50) of the objective function value against α, with curves labelled 1 to 5; the first covers α from 0 to 0.8 and the second zooms in on α from 0 to 0.10.]

4.- ROBUST CLUSTERING AROUND LINEAR SUBSPACES

• Robust linear grouping: Higher dimension p, but assuming that our data “live” in k low-dimensional (affine) subspaces...

  - We search for

    · k linear subspaces $h_1, \ldots, h_k$ in $\mathbb{R}^p$

    · a partition $\{R_0, R_1, \ldots, R_k\}$ of $\{1, 2, \ldots, n\}$ with $\#R_0 = [n\alpha]$

minimizing $\sum_{j=1}^{k} \sum_{i \in R_j} \|x_i - \mathrm{Pr}_{h_j}(x_i)\|^2$.

  - $\mathrm{Pr}_h(\cdot)$ denotes the “orthogonal” projection onto the linear subspace h
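
For the simplest case, where each subspace is a line, the term $\|x_i - \mathrm{Pr}_{h}(x_i)\|^2$ can be computed as in the sketch below. It is a plain-Python illustration under that one-dimensional assumption; the function names and the two example lines are hypothetical.

```python
def sq_dist_to_line(x, point, direction):
    """Squared orthogonal distance from x to the line {point + t * direction}."""
    diff = [xi - pi for xi, pi in zip(x, point)]
    norm_sq = sum(d * d for d in direction)
    # Coordinate of the orthogonal projection along the direction vector.
    t = sum(di * d for di, d in zip(diff, direction)) / norm_sq
    residual = [di - t * d for di, d in zip(diff, direction)]
    return sum(r * r for r in residual)


def closest_line(x, lines):
    """Index of the line (given as (point, direction) pairs) closest to x."""
    return min(range(len(lines)), key=lambda j: sq_dist_to_line(x, *lines[j]))


# Two hypothetical lines in the plane: y = x and y = 4 - x.
lines = [((0.0, 0.0), (1.0, 1.0)), ((0.0, 4.0), (1.0, -1.0))]
d0 = sq_dist_to_line((3.0, 4.0), *lines[0])
d1 = sq_dist_to_line((3.0, 4.0), *lines[1])
```

Assigning each observation to its closest subspace, and trimming the observations with the largest such distances, mirrors the assignment and trimming steps of trimmed k-means.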

• Example: Three linear structures in the presence of noise:

[Figure: two scatter plots with axes X and Y. (a) α = 0. (b) α = 0.1 (◦ = “Trimmed”).]

Trimmed “mixtures of regressions” can also be applied...

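• k = 1 case ⇒ Robust “Principal Components Analysis (PCA)”:

  - PCA provides a q-dimensional ($q \ll p$) representation of data by

$$\min_{B_q, A_q, m} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2 \quad \text{for} \quad \hat{x}_i = \mathrm{Pr}_h(x_i) = \hat{x}_i(B_q, A_q, m) = m + B_q a_i$$

    · $A_q = \begin{pmatrix} a_1^T \\ \vdots \\ a_i^T \\ \vdots \\ a_n^T \end{pmatrix}$ is the scores matrix ($n \times q$)

    · $B_q = \begin{pmatrix} b_1^T \\ \vdots \\ b_j^T \\ \vdots \\ b_p^T \end{pmatrix}$ is a matrix ($p \times q$) whose columns generate a q-dimensional approximating subspace h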

• Principal Components Analysis is highly non-robust!!!

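• Least Trimmed Squares PCA (Maronna 2005): Minimize

$$\sum_{i=1}^{n} w_i \|x_i - \hat{x}_i\|^2 = \sum_{i=1}^{n} w_i \|x_i - \hat{x}_i(B_q, A_q, m)\|^2,$$

with $\{w_i\}_{i=1}^{n}$ being “0-1 weights” such that $\sum_{i=1}^{n} w_i = [n(1 - \alpha)]$

  - Weights: $w_i = \begin{cases} 1 & \text{if } x_i \text{ is not trimmed} \\ 0 & \text{if } x_i \text{ is trimmed.} \end{cases}$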
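
For a fixed fit, the best 0-1 weights keep the cases with the smallest squared residuals $\|x_i - \hat{x}_i\|^2$. The sketch below shows only that weighting step, not the full alternating algorithm; the function name, the example residuals and the reading of $[n(1 - \alpha)]$ as rounding down are assumptions.

```python
import math


def casewise_weights(sq_residuals, alpha):
    """0-1 weights keeping the [n(1 - alpha)] cases with the smallest squared residuals."""
    n = len(sq_residuals)
    n_keep = math.floor(n * (1 - alpha))  # rounding down is an assumption
    order = sorted(range(n), key=lambda i: sq_residuals[i])
    kept = set(order[:n_keep])
    return [1 if i in kept else 0 for i in range(n)]


# Hypothetical squared residuals for n = 10 cases; two of them are clearly large.
sq_residuals = [0.2, 0.1, 9.0, 0.3, 0.4, 0.1, 0.2, 0.5, 0.3, 25.0]
weights = casewise_weights(sq_residuals, alpha=0.2)
```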

• Cases → $x_i = (x_{i1}, \ldots, x_{ip})' \in \mathbb{R}^p$ and Cells → $x_{ij} \in \mathbb{R}$

  - i denotes a country (or a trader, a company, ...) for i = 1, ..., n

  - $x_{ij}$ is the “quantity-value ratio” for country i in the j-th month (or the j-th year, the j-th product, ...) for j = 1, ..., p

• Casewise trimming: Trim $x_i$ cases with (at least one) outlying $x_{ij}$

[Figure: n = 100 × p = 4 data matrix with 2% outlying cells; panels: “Outlying $x_{ij}$ cells” and “Trimmed $x_i$ cases (black lines)”.]

• But when the dimension p increases... we do not expect many $x_i$ completely free of outlying $x_{ij}$ cells:

[Figure: n = 100 × p = 80 data matrix with 2% outlying cells; panels: “Outlying $x_{ij}$ cells” and “Trimmed $x_i$ cases (black lines)”.]

• Cellwise trimming:

  - Only trimming outlying cells... (⇒ “Particular” frauds identified?)
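
A back-of-the-envelope check of why casewise trimming breaks down as p grows. The sketch assumes that each cell is outlying independently with probability 0.02 (the talk only states "2% outlying cells", so independence is an assumption), in which case a case with p cells is completely clean with probability $0.98^p$. The function name is illustrative.

```python
def clean_case_probability(p, cell_rate):
    """Probability that none of the p cells of a case is outlying (independent cells assumed)."""
    return (1 - cell_rate) ** p


low_dim = clean_case_probability(p=4, cell_rate=0.02)
high_dim = clean_case_probability(p=80, cell_rate=0.02)
```

Under this assumption roughly 92% of the cases are clean for p = 4, but only about 20% for p = 80, so trimming whole cases would discard most of the data.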

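• PCA approximation $\hat{x}_i = m + B_q a_i = (\hat{x}_{i1}, \ldots, \hat{x}_{ip})^T$ re-written as

$$\hat{x}_{ij} = m_j + a_i^T b_j.$$

• Cellwise LTS (Cevallos-Valdiviezo 2016): Minimize

$$\sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} (x_{ij} - m_j - a_i^T b_j)^2$$

  - $w_{ij} = 0$ if cell $x_{ij}$ is trimmed and $w_{ij} = 1$ if not, with

$$\sum_{i=1}^{n} w_{ij} = [n(1 - \alpha)], \quad \text{for } j = 1, \ldots, p.$$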
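
The constraint on the weights acts column by column: in every variable j, the cells with the largest squared residuals are trimmed. A plain-Python sketch of that weighting step for a fixed fit follows; the function name, the example residual matrix and the reading of $[n(1 - \alpha)]$ as rounding down are assumptions.

```python
import math


def cellwise_weights(sq_residuals, alpha):
    """Per-column 0-1 weights keeping [n(1 - alpha)] cells with the smallest squared residuals."""
    n = len(sq_residuals)
    p = len(sq_residuals[0])
    n_keep = math.floor(n * (1 - alpha))  # rounding down is an assumption
    weights = [[0] * p for _ in range(n)]
    for j in range(p):
        # Rank the cells of column j by squared residual and keep the smallest ones.
        order = sorted(range(n), key=lambda i: sq_residuals[i][j])
        for i in order[:n_keep]:
            weights[i][j] = 1
    return weights


# Hypothetical squared residuals (x_ij - m_j - a_i' b_j)^2 for n = 5 cases and p = 2 variables.
sq_residuals = [[0.1, 0.2], [0.2, 9.0], [0.1, 0.1], [16.0, 0.3], [0.3, 0.2]]
weights = cellwise_weights(sq_residuals, alpha=0.2)
```

Here the second and fourth cases each lose a single cell, and their remaining cell still contributes to the fit.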

• Different patterns/structures in data ⇒ G subspace approximations:

$$\hat{x}_i^g(B^g_{q_g}, A^g_{q_g}, m^g) = m^g + B^g_{q_g} a_i^g \quad \text{or} \quad \hat{x}_{ij}^g = m_j^g + (a_i^g)^T b_j^g, \quad \text{for } g = 1, \ldots, G$$

• Minimize

$$\min_{w_{ij}^g, B^g_{q_g}, A^g_{q_g}, m^g} \sum_{i=1}^{n} \sum_{j=1}^{p} \sum_{g=1}^{G} w_{ij}^g (x_{ij} - \hat{x}_{ij}^g)^2.$$

  - $w_{ij}^g = 1$ if cell $x_{ij}$ is assigned to cluster g and non-trimmed, and 0 otherwise

  - Appropriate constraints on the $w_{ij}^g$

$q_1, \ldots, q_G$ are intrinsic dimensions...

• Example 1: n = 400 in dimension p = 100 with 2 groups and 2% “scattered” outliers:

[Figure: two panels of curves plotted over the interval from 0 to 1.]

• k = 2, q = 2 and α = 0.05:

“-” are the trimmed cells

• Cluster means and trimmed cells (◦):

• Example 2: n = 400 in dimension p = 100 with 2 groups and a few curves with 20% of consecutive cells corrupted:

[Figure: two panels of curves plotted over the interval from 0 to 1.]

• Results:

• Real data example: Average daily temperatures at 83 Spanish meteorological stations over 2007-2009 (n = 83 and p = 1096).

• Artificial outliers:

  - Two periods of 50 consecutive days in Oviedo replaced by 0 °C.

  - 150 consecutive days in Huelva replaced by 0 °C.

• Cluster means:

[Figure: four panels of cluster means, labelled “Meseta” (Central plateau-Castile), Mediterranean, Cantabrian Coast and Canary Islands.]

• Clustered stations:

• Clusters found and trimmed cells:

[Figure: four scatter plots of station scores, one per cluster. Stations labelled in each panel:]

First two scores of cluster 1: PONFERRADA, OURENSE, SORIA, BURGOS, VALLADOLID, ÁVILA, PUERTO DE NAVAC, SEGOVIA, VALLADOLID AIR, ZAMORA, LEÓN, SALAMANCA AIR, SALAMANCA, MADRID AIR, TORREJÓN ARDOZ, COLMENAR V., MADRID, MADRID CUAT VI, GETAFE, TOLEDO, CIUDAD REAL, GRANADA BA, GRANADA AIR, CUENCA, ALBACETE BA, ALBACETE, TERUEL, FORONDA, DAROCA

First two scores of cluster 2: BARCELONA AIR, BARCELONA FABRA, CACERES, BADAJOZ AIR, HUELVA R. ESTE, JAÉN, CORDOBA, SEVILLA, MORÓN FRONTE, ROTA, JEREZ FRONT, MELILLA, MALAGA, ALMERIA AIR, SAN JAVIER, MURCIA, ALCANTARILLA, ALICANTE AIR, ALICANTE, VALENCIA AIR, VALENCIA, CASTELLÓN ALMAZ, LOGROÑO, PAMPLONA, PAMPLONA AIR, ZARAGOZA, LLEIDA, HUESCA AIR, TORTOSA, PALMA PORT, PALMA AIR, MENORCA, IBIZA

First two scores of cluster 3: HONDARRIBIA, SAN SEBASTIÁN, BILBAO, SANTANDER AIR, SANTANDER, GIJÓN PORT, VITORIA, OVIEDO, CORUÑA, CORUÑA AIR, SANTIAGO DE C., PONTEVEDRA, VIGO, LUGO, TENERIFE NORTE

First two scores of cluster 4: LANZAROTE, LA PALMA, FUERTEVENTURA, STA CRUZ TENER, GRAN CANARIA, HIERRO

[Figure: chart with one row per station (the 83 stations listed by name) against days on the x-axis (“Dias”, ticks from 0 to 900), with legend “Clusters”: 0, 1, 2, 3, 4.]

• Reconstructed curves and true data in Oviedo:

• Conclusions:

  - Different patterns/structures in data ⇒ Cluster Analysis

  - Robust clustering aims at (jointly) detecting the main clusters (bulk of data) and outliers ⇒ Potential “frauds”...

  - Higher dimensional problems: Assume clusters “living” in low-dimensional subspaces

  - “Casewise” and “cellwise” trimming

Some References:

· Cuesta-Albertos, J.A., Gordaliza, A. and Matrán, C. (1997), “Trimmed k-means: An attempt to robustify quantizers,” Ann. Statist., 25, 553-576.

· García-Escudero, L.A. and Gordaliza, A. (1999), “Robustness properties of k-means and trimmed k-means,” J. Amer. Statist. Assoc., 94, 956-969.

· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2008), “A General Trimming Approach To Robust Cluster Analysis,” Ann. Statist., 36, 1324-1345.

· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2010), “A review of robust clustering methods,” Advances in Data Analysis and Classification, 4, 89-109.

· Fritz, H., García-Escudero, L.A. and Mayo-Iscar, A. (2012), “tclust: An R package for a trimming approach to Cluster Analysis,” Journal of Statistical Software, 47, Issue 12.

Thanks for your attention!!!

