A robust clustering approach to fraud detection
Luis Angel García-Escudero
Dpto. de Estadística e I.O. and IMUVA - Universidad de Valladolid
joint (and on-going...) work with A. Mayo-Iscar, A. Gordaliza, C. Matrán (U. Valladolid) and with M. Riani, A. Cerioli (U. Parma) and D. Perrotta, F. Torti (JRC-Ispra)
Conference on Benford's Law for fraud detection. Stresa, July 2019
1. CLUSTERING AND ROBUSTNESS
• Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters:
• Sample mean:
◦ m = (1/n) ∑_{i=1}^n xi minimizes ∑_{i=1}^n ‖xi − m‖²
◦ m may be seen as the "center" of a data cloud:
[Figure: a scatter of points with the sample mean marked as its center]
• k clusters ⇒ k "data clouds" ⇒ k-means
• k-means: search for
◦ k centers m1, ..., mk
◦ a partition {R1, ..., Rk} of {1, 2, ..., n}
minimizing ∑_{j=1}^k ∑_{i∈Rj} ‖xi − mj‖².
• Cluster j:
Rj = {i : ‖xi − mj‖ ≤ ‖xi − ml‖ for every l = 1, ..., k}
(...assignment to the closest center...)
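The alternation between this closest-center assignment and the mean update is the classical Lloyd iteration for k-means. A minimal pure-Python sketch — the toy data and starting centers are illustrative assumptions, not from the talk:

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def kmeans(points, centers, iters=20):
    """Lloyd's algorithm: alternate closest-center assignment and mean update."""
    k = len(centers)
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        # Assignment step: R_j = points whose closest center is m_j
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: dist(x, centers[j]))
            clusters[j].append(x)
        # Update step: each center becomes the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers, clusters

# Two well-separated small clouds in R^2
data = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1),
        (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(data, [(0.0, 0.0), (1.0, 1.0)])
```

Each center ends up at the sample mean of its cloud, which is exactly why outliers can drag it away.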
• Robustness: many statistical procedures are strongly affected by even a few outlying observations:
◦ The mean is not robust:
x̄ = (1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78)/7 ≈ 1.746
but a single typo (1.82 recorded as 182) gives
x̄ = (1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78)/7 ≈ 27.486
◦ k-means inherits this lack of robustness from the mean
• Lack of robustness of k-means:
[Figure: (a) 3-means with two groups artificially joined because of outliers; (b) 2-means with clusters 1 and 2 distorted by outliers]
• Outliers can be seen as "clusters by themselves"
• So, why not increase the number of clusters...?
◦ But:
· For (physical, economic, ...) reasons we may have an initial idea of k without being aware of the existence of outliers
· "Radial/background" noise would require large values of k
• Moreover, the detection of outliers may be the goal itself!!!
• Outliers in trade data can be associated with "frauds":
◦ Heterogeneous sources of data (clustering) + a few outliers (frauds??)
[Figure: Trade data — Quantity (tons) vs. Value (1000 euros)]
2. TRIMMED k-MEANS
• Trimming is the oldest and most widely used way to achieve robustness.
• Trimmed mean: the proportion α/2 of the smallest and α/2 of the largest observations are discarded before computing the mean:
[Figure: a univariate sample with both tails trimmed]
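The symmetric trimmed mean is easy to sketch in a few lines of Python; the height data reuse the typo example above, and the trimming proportion is chosen here just to discard one value at each tail:

```python
def trimmed_mean(xs, alpha):
    """Drop the [n*alpha/2] smallest and [n*alpha/2] largest values, then average."""
    xs = sorted(xs)
    m = int(len(xs) * alpha / 2)  # observations trimmed at each tail
    kept = xs[m:len(xs) - m] if m else xs
    return sum(kept) / len(kept)

heights = [1.72, 1.67, 1.80, 1.70, 182, 1.73, 1.78]  # 1.82 mistyped as 182
plain = sum(heights) / len(heights)   # ruined by the single outlier (~27.5)
robust = trimmed_mean(heights, 0.30)  # trims one value at each tail (~1.75)
```

The trimmed mean discards the gross value 182 together with the smallest observation, so the estimate stays near the bulk of the data.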
• But... how to trim in clustering?
◦ Why not trim outlying "bridge" points?
[Figure: non-trimmed 'bridge' points]
◦ Why a symmetric trimming?
[Figure: symmetric trimming?]
◦ How to trim in multivariate clustering problems?
• Idea: the data themselves tell us which are the most outlying observations!!
◦ Data-driven, adaptive, impartial... trimming!
• Trimmed k-means: we search for
◦ k centers m1, ..., mk and
◦ a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
minimizing ∑_{j=1}^k ∑_{i∈Rj} ‖xi − mj‖².
[The fraction α of the data in R0 is not taken into account: it is trimmed]
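A common way to attack this objective is by concentration steps: assign, trim the worst-fitting [nα] points, then update the centers on what remains. A pure-Python sketch under those assumptions (the toy data, starting centers and α are illustrative):

```python
from math import dist

def trimmed_kmeans(points, centers, alpha, iters=20):
    """Concentration steps for trimmed k-means: assign each point to its
    closest center, discard the [n*alpha] worst-fitting points (the set R0),
    and recompute each center as the mean of its untrimmed points."""
    n, k = len(points), len(centers)
    n_trim = int(n * alpha)
    centers = [tuple(c) for c in centers]
    trimmed = set()
    for _ in range(iters):
        # closest center and distance for every point
        best = [min(range(k), key=lambda j: dist(x, centers[j])) for x in points]
        d = [dist(points[i], centers[best[i]]) for i in range(n)]
        # R0 = indices of the [n*alpha] largest distances (trimmed)
        trimmed = set(sorted(range(n), key=lambda i: -d[i])[:n_trim])
        for j in range(k):
            cl = [points[i] for i in range(n) if best[i] == j and i not in trimmed]
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers, trimmed

data = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1),
        (30.0, -30.0)]  # one gross outlier
centers, trimmed = trimmed_kmeans(data, [(0.0, 0.0), (4.0, 4.0)], alpha=0.15)
```

With α = 0.15 exactly one point is trimmed, and it is the gross outlier, so both centers stay on the two genuine clouds.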
• Black circles: trimmed points (k = 3 and α = 0.05):
[Figure: panels (a) and (b) — trimmed 3-means solutions with the trimmed points drawn as black circles]
• Old Faithful Geyser data: x1 = "Eruption length", x2 = "Previous eruption length" and n = 271
[Figure: classification with k = 3, α = 0.03 — Eruption length vs. Previous eruption length]
◦ k = 3 and α = 0.03 (0.03 · 271 ≈ 9 trimmed obs.): 6 rare "short-followed-by-short" eruptions trimmed, plus 3 bridge points...
3. ROBUST MODEL-BASED CLUSTERING
• k-means and trimmed k-means prefer spherical clusters:
[Figure: (a) 2-means on spherical groups; (b) 2-means on elliptical groups]
• Elliptically contoured clusters?
• Multivariate normal distributions with densities φ(·; µ, Σ):
◦ µ = (2, 2)′ and Σ = ( 1 0 ; 0 1 ) [spherical] in (a)
◦ µ = (2, 2)′ and Σ = ( 2 1 ; 1 1 ) [non-spherical] in (b)
[Figure: panels (a) and (b) — samples from the spherical and non-spherical normal distributions]
φ(x; µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(x − µ)′ Σ^{−1} (x − µ)/2)
• Trimmed likelihoods: search for
◦ k centers m1, ..., mk,
◦ k scatter matrices S1, ..., Sk, and
◦ a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
maximizing ∑_{j=1}^k ∑_{i∈Rj} log φ(xi; mj, Sj) (obs. in R0 not taken into account)
García-Escudero et al. 2008, Neykov et al. 2007, Gallegos and Ritter 2005, ...
• Constraints on the Sj scatter matrices are needed:
◦ the target likelihood is otherwise unbounded
◦ they avoid detecting (non-interesting) "spurious" clusters
• Control the relative axes' lengths (eigenvalue constraints): the ratio between eigenvalues of the Sj matrices is bounded by a constant c ≥ 1:
[Figure: solutions with c = 1 and with a large c value]
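One way such an eigenvalue constraint can be enforced is by truncating the eigenvalues toward an interval [m, c·m]. The sketch below is a simplified, equal-weight version of the truncation used in tclust-type algorithms; the candidate search over data-driven thresholds is an assumption of this sketch, not the exact tclust procedure:

```python
from math import log

def constrain_eigenvalues(eigs, c):
    """Truncate positive eigenvalues so that max/min <= c, choosing the
    threshold m that minimizes sum(log t + e/t) over the truncated values
    t = clip(e, m, c*m) (a simplified, equal-weight tclust-style step)."""
    if max(eigs) / min(eigs) <= c:
        return list(eigs)  # constraint already satisfied

    def truncate(m):
        return [min(max(e, m), c * m) for e in eigs]

    def loss(ts):
        return sum(log(t) + e / t for e, t in zip(eigs, ts))

    # candidate thresholds derived from the eigenvalues themselves
    candidates = list(eigs) + [e / c for e in eigs]
    return min((truncate(m) for m in candidates), key=loss)

eigs = [0.01, 1.0, 4.0]  # ratio 400: a nearly degenerate direction
fixed = constrain_eigenvalues(eigs, c=16)
```

The nearly null eigenvalue is inflated to 0.25 so that the ratio becomes exactly c = 16, which prevents the likelihood from blowing up along a degenerate "spurious" direction.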
• The FSDA Matlab toolbox
• The R package tclust, available from the CRAN repository:
> library(tclust)
• tkmeans(data, k, alpha)
◦ k = "number of groups"
◦ alpha = "trimming proportion"
• tclust(data, k, alpha, restr.fact, ...)
◦ restr.fact = "strength of the constraints"
• tclust(X,k=3,alpha=0.03,restr.fact=50)
[Figure: tclust solution with k = 3, α = 0.03 and a large restr.fact]
• Old Faithful Geyser data again:
[Figure: left — classification with k = 4, α = 0; right — classification with k = 3, α = 0.03 (Eruption length vs. Previous eruption length)]
• Why was k = 3 with α = 0.03 a sensible choice?
• Applying ctlcurves to the Old Faithful Geyser data:
[Figure: CTL-curves — objective function value vs. α for k = 1, ..., 5 with restriction factor = 50; full range of α (left) and zoom on α ∈ [0, 0.10] (right)]
4. ROBUST CLUSTERING AROUND LINEAR SUBSPACES
• Robust linear grouping: higher dimensions p, but assuming that our data "live" in k low-dimensional (affine) subspaces...
◦ We search for
· k linear subspaces h1, ..., hk in Rp
· a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
minimizing ∑_{j=1}^k ∑_{i∈Rj} ‖xi − Pr_{hj}(xi)‖².
◦ Pr_h(·) denotes the "orthogonal" projection onto the linear subspace h
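The building block of this objective is the orthogonal projection residual. A minimal sketch for the one-dimensional case (h an affine line {a + t·u}; the general q-dimensional case projects onto a basis instead, and the example point and line are illustrative assumptions):

```python
def project_to_line(x, a, u):
    """Orthogonal projection Pr_h(x) of x onto the affine line h = {a + t*u}
    (u need not be unit length)."""
    t = sum((xi - ai) * ui for xi, ai, ui in zip(x, a, u)) \
        / sum(ui * ui for ui in u)
    return tuple(ai + t * ui for ai, ui in zip(a, u))

def residual(x, a, u):
    """Distance ||x - Pr_h(x)|| entering the trimmed linear-grouping objective."""
    p = project_to_line(x, a, u)
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p)) ** 0.5

# distance from (3, 4) to the line y = x through the origin
r = residual((3.0, 4.0), (0.0, 0.0), (1.0, 1.0))
```

Assigning each point to the subspace with the smallest residual, and trimming the [nα] largest residuals, mirrors the trimmed k-means scheme with centers replaced by subspaces.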
• Example: Three linear structures in presence of noise:
[Figure: (a) α = 0; (b) α = 0.1 (◦ = "Trimmed")]
Trimmed “mixtures of regressions” can also be applied...
• The k = 1 case ⇒ robust "Principal Components Analysis (PCA)":
◦ PCA provides a q-dimensional (q ≪ p) representation of the data by solving
min_{Bq, Aq, m} ∑_{i=1}^n ‖xi − x̂i‖² for x̂i = Pr_h(xi) = x̂i(Bq, Aq, m) = m + Bq ai
· Aq is the n × q scores matrix, with rows a′1, ..., a′n
· Bq is a p × q matrix whose columns generate the q-dimensional approximating subspace h
• Principal Components Analysis is highly non-robust!!!
• Least Trimmed Squares PCA (Maronna 2005): minimize
∑_{i=1}^n wi ‖xi − x̂i‖² = ∑_{i=1}^n wi ‖xi − x̂i(Bq, Aq, m)‖²,
with {wi}_{i=1}^n being "0-1 weights" such that ∑_{i=1}^n wi = [n(1 − α)]
◦ Weights: wi = 1 if xi is not trimmed, wi = 0 if xi is trimmed.
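LTS-PCA is typically attacked by alternating a PCA fit on the kept points with re-selecting the [n(1 − α)] smallest residuals. A pure-Python sketch restricted to q = 1 in two dimensions, where the first principal component has a closed form from the 2×2 covariance matrix (the data, one gross outlier and α are illustrative assumptions):

```python
from math import sqrt

def first_pc(points):
    """First principal component (mean and unit direction) of 2-D data, via
    the closed-form eigendecomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    lam = (sxx + syy + sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:  # already axis-aligned
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    nrm = sqrt(vx * vx + vy * vy)
    return (mx, my), (vx / nrm, vy / nrm)

def lts_pca(points, alpha, iters=10):
    """LTS-PCA sketch (q = 1): alternate fitting the PC on the kept points
    and keeping the [n(1-alpha)] points with smallest squared residuals."""
    n_keep = int(len(points) * (1 - alpha))
    kept = list(points)
    for _ in range(iters):
        (mx, my), (ux, uy) = first_pc(kept)

        def res2(p):
            dx, dy = p[0] - mx, p[1] - my
            t = dx * ux + dy * uy            # score a_i along the component
            return (dx - t * ux) ** 2 + (dy - t * uy) ** 2

        kept = sorted(points, key=res2)[:n_keep]
    return kept, (ux, uy)

data = [(t / 10, t / 10) for t in range(10)] + [(0.2, 2.0)]  # line + 1 outlier
kept, direction = lts_pca(data, alpha=0.09)
```

The outlier is trimmed and the recovered direction is the 45° line of the clean points, whereas ordinary PCA on all eleven points would be tilted toward the outlier.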
• Cases → xi = (xi1, ..., xip)′ ∈ Rp and cells → xij ∈ R
◦ i denotes a country (or a trader, a company, ...) for i = 1, ..., n
◦ xij is the "quantity-value ratio" for country i in the j-th month (or the j-th year, the j-th product, ...) for j = 1, ..., p
• Casewise trimming: trim those cases xi with (at least one) outlying cell xij
An n = 100 × p = 4 data matrix with 2% outlying cells:
[Figure: outlying xij cells (left) and trimmed xi cases drawn as black lines (right)]
• But when the dimension p increases... we cannot expect many xi to be completely free of outlying cells xij:
An n = 100 × p = 80 data matrix with 2% outlying cells:
[Figure: outlying xij cells (left) and trimmed xi cases drawn as black lines (right)]
• Cellwise trimming:
◦ Only outlying cells are trimmed... (⇒ "particular" frauds identified...??)
• The PCA approximation x̂i = m + Bq ai = (x̂i1, ..., x̂ip)′ can be re-written cellwise as
x̂ij = mj + a′i bj.
• Cellwise LTS (Cevallos-Valdiviezo 2016): minimize
∑_{j=1}^p ∑_{i=1}^n wij (xij − mj − a′i bj)²
◦ wij = 0 if cell xij is trimmed and wij = 1 if not, with ∑_{i=1}^n wij = [n(1 − α)] for j = 1, ..., p.
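The new ingredient with respect to casewise LTS is the weight update: within every column, only the [nα] cells with the largest residuals are flagged. A sketch of just that step, on an illustrative residual matrix (the full algorithm would alternate it with refitting mj, ai and bj, which is omitted here):

```python
def cellwise_weights(residuals, alpha):
    """Cellwise trimming step: in every column j, flag the [n*alpha] cells
    with the largest squared residuals (w_ij = 0), keeping [n(1-alpha)]
    cells per column."""
    n, p = len(residuals), len(residuals[0])
    n_trim = int(n * alpha)
    w = [[1] * p for _ in range(n)]
    for j in range(p):
        worst = sorted(range(n), key=lambda i: -residuals[i][j] ** 2)[:n_trim]
        for i in worst:
            w[i][j] = 0
    return w

# residual matrix of some fit: one gross cell in row 0, column 1
res = [[0.1, 9.0, 0.2],
       [0.2, 0.1, 0.1],
       [0.1, 0.2, 0.3],
       [0.3, 0.1, 0.2]]
w = cellwise_weights(res, alpha=0.25)
```

Note that row 0 is not discarded as a whole: only its corrupted cell gets weight 0, which is exactly the point of cellwise versus casewise trimming.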
• Different patterns/structures in the data ⇒ G subspace approximations:
x̂^g_i(B^g_{qg}, A^g_{qg}, m^g) = m^g + B^g_{qg} a^g_i, or cellwise x̂^g_{ij} = m^g_j + (a^g_i)′ b^g_j, for g = 1, ..., G
• Minimize
min over w^g_{ij}, B^g_{qg}, A^g_{qg}, m^g of ∑_{i=1}^n ∑_{j=1}^p ∑_{g=1}^G w^g_{ij} (xij − x̂^g_{ij})².
◦ w^g_{ij} = 1 if cell xij is assigned to cluster g and not trimmed, and 0 otherwise
◦ Appropriate constraints on the w^g_{ij}
◦ q1, ..., qG are the intrinsic dimensions...
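For concreteness, the triple-sum objective above can be evaluated as follows; fitting the G subspaces and the constrained weight update are omitted, and the tiny 2×2 example with hand-made approximations is purely illustrative:

```python
def clustered_cellwise_objective(X, approx, w):
    """Evaluate sum_i sum_j sum_g w[g][i][j] * (x_ij - xhat^g_ij)^2, the
    target of the G-subspace cellwise-trimmed fit. approx[g][i][j] holds the
    g-th subspace approximation xhat^g_ij; w[g][i][j] is 1 only when cell
    (i, j) is assigned to cluster g and not trimmed."""
    G, n, p = len(approx), len(X), len(X[0])
    return sum(w[g][i][j] * (X[i][j] - approx[g][i][j]) ** 2
               for g in range(G) for i in range(n) for j in range(p))

X = [[1.0, 2.0],
     [10.0, 10.0]]
approx = [[[1.1, 2.1], [0.0, 0.0]],    # cluster 1 fits row 0
          [[0.0, 0.0], [9.5, 10.5]]]   # cluster 2 fits row 1
w = [[[1, 1], [0, 0]],
     [[0, 0], [1, 0]]]                 # cell (1, 1) is trimmed in every cluster
obj = clustered_cellwise_objective(X, approx, w)
```

Only the three assigned, untrimmed cells contribute to the objective; the trimmed cell (1, 1) is ignored no matter how badly any subspace fits it.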
• Example 1: n = 400 curves in dimension p = 100 with 2 groups and 2% "scattered" outlying cells:
[Figure: the two groups of simulated curves]
• k = 2, q = 2 and α = 0.05:
[Figure: "-" marks the trimmed cells]
• Cluster means and trimmed cells (◦):
[Figure]
• Example 2: n = 400 curves in dimension p = 100 with 2 groups and a few curves with 20% consecutive corrupted cells:
[Figure: the two groups of simulated curves]
• Results:
[Figure: fitted clusters and trimmed cells]
• Real data example: average daily temperatures at 83 Spanish meteorological stations in 2007-2009 (n = 83 and p = 1096).
• Artificial outliers:
◦ Two periods of 50 consecutive days in Oviedo replaced by 0°C.
◦ 150 consecutive days of the Huelva temperatures replaced by 0°C.
• Cluster means:
[Figure: mean curves — "Meseta" (central plateau, Castile), Mediterranean, Cantabrian Coast and Canary Islands]
• Clustered stations:
[Figure: map of the clustered stations]
• Clusters found and trimmed cells:
[Figure: first two scores of each cluster, with station labels — cluster 1: "Meseta" stations (VALLADOLID, SORIA, BURGOS, MADRID, ...); cluster 2: Mediterranean stations (BARCELONA, SEVILLA, VALENCIA, ZARAGOZA, ...); cluster 3: Cantabrian Coast stations (BILBAO, SANTANDER, OVIEDO, ...); cluster 4: Canary Islands stations (GRAN CANARIA, LANZAROTE, HIERRO, ...); plus the trimmed cells of every station over the 1096 days]
• Reconstructed curves and true observed data in Oviedo:
[Figure: reconstructed vs. observed temperature curves in Oviedo]
• Conclusions:
◦ Different patterns/structures in the data ⇒ cluster analysis
◦ Robust clustering aims at (jointly) detecting the main clusters (the bulk of the data) and the outliers ⇒ potential "frauds"...
◦ Higher-dimensional problems: assume clusters "living" in low-dimensional subspaces
◦ "Casewise" and "cellwise" trimming
Some References:
· Cuesta-Albertos, J.A., Gordaliza, A. and Matrán, C. (1997), "Trimmed k-means: An attempt to robustify quantizers," Ann. Statist., 25, 553-576.
· García-Escudero, L.A. and Gordaliza, A. (1999), "Robustness properties of k-means and trimmed k-means," J. Amer. Statist. Assoc., 94, 956-969.
· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2008), "A general trimming approach to robust cluster analysis," Ann. Statist., 36, 1324-1345.
· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2010), "A review of robust clustering methods," Advances in Data Analysis and Classification, 4, 89-109.
· Fritz, H., García-Escudero, L.A. and Mayo-Iscar, A. (2012), "tclust: An R package for a trimming approach to cluster analysis," Journal of Statistical Software, 47, Issue 12.
Thanks for your attention!!!