cnrs - upmc laboratoire d’informatique de paris 6 Outskewer: Using Skewness to Spot Outliers in Samples and Time Series Sébastien Heymann, Matthieu Latapy, Clémence Magnien ASONAM 2012
Transcript
1. cnrs - upmc laboratoire d'informatique de paris 6. Outskewer:
Using Skewness to Spot Outliers in Samples and Time Series. Sébastien
Heymann, Matthieu Latapy, Clémence Magnien. ASONAM 2012
2. Did you know? Outlier detection is an important problem in
data mining. Source: https://xkcd.com/539/
3. How to detect outliers? No formal definition; it is a subjective
concept. It depends on cases and hypotheses on the data. Intuitively: to
identify values which deviate remarkably from the remainder of values
(Grubbs, 1969).
4. Usual approaches in literature. Hypothesis: the data follow a
normal distribution. Distance between data points and theoretical
values.
5. Problem statement. Most of the time, we can't make strong
assumptions on: the theoretical distribution of values; how the data
should evolve over time (time series). Thus we want a method which
makes no hypothesis on data.
6. Our Method
7. Skewness coefficient:
$\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^3$
[Figure: Example of skewed distributions (density vs. x).]
8. Skewness coefficient (continued). The skewness coefficient is
sensitive to extremal values (min/max) far from the mean!
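The coefficient above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function name `skewness` is hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and the sample is borrowed from the example on the skewness signature slide. It shows how a single extremal value far from the mean changes the coefficient.

```python
from math import sqrt


def skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = sum(values) / n
    # Assumption: s is the sample standard deviation (n - 1 denominator).
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    third_moment_sum = sum((x - mean) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * third_moment_sum / s ** 3


sample = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
with_max = skewness(sample)                             # extremal value 7 kept
without_max = skewness([x for x in sample if x != 7])   # extremal value 7 removed
print(round(with_max, 2), round(without_max, 2))        # prints: 1.09 0.22
```

Removing the single value 7 drops the coefficient from about 1.09 to about 0.22, which is the sensitivity to extremal values that the method relies on.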
9. Skewness signature. Definition: evolution of the skewness
coefficient when extremal values are removed one by one from the
sample. Algorithm: if $\gamma > 0$ then remove max(X), else remove min(X).
Example: X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}
$\gamma$ as extremal values are removed: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73
[Figure: skewness vs. # extremal values removed.]
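The removal rule on this slide can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `skewness` and `skewness_signature` are hypothetical, the standard deviation uses the n - 1 denominator, and the loop stops once fewer than 3 values remain or all remaining values are equal (the slide does not state a stopping rule).

```python
from math import sqrt


def skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in values) / s ** 3


def skewness_signature(values):
    """Skewness after 0, 1, 2, ... extremal values have been removed."""
    remaining = sorted(values)
    signature = []
    # Assumed stopping rule: skewness needs n >= 3 and a non-constant sample.
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        g = skewness(remaining)
        signature.append(g)
        if g > 0:
            remaining.pop()     # positive skew: remove max(X)
        else:
            remaining.pop(0)    # otherwise: remove min(X)
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
signature = skewness_signature(X)
print([round(g, 2) for g in signature])
# prints: [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

The printed values match the sequence given in the slide's example, with the first entry being the skewness of the full sample.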
10. Our method: Outskewer. Our definition: outlier = extremal value
which skews a distribution of values. Implication: the removal of
these extremal values one by one should reduce the skewness of the
distribution. Otherwise, there is no outlier as we define it.
11. Outskewer: non-relevant cases. Where extremal values far from
the mean are common, e.g. power-law distributions.
12. Outskewer: p-stability. Is the signature p-stable? p: fraction
of extremal values removed.
p-stable $\Leftrightarrow |\gamma| \le 0.5 - p$, for each $p'$ from $p$ to 0.5
($\gamma$ taken after removing a fraction $p'$ of extremal values).
[Figure: cumulative distribution of x, and |skewness| vs. p.]
Example: 0.16-stable but not 0.30-stable.
13. Outskewer: p-stability (continued). If the signature is
p-stable: there may be outliers.
14. Outskewer: p-stability (continued). If the signature is
p-stable for no p: the skewness coefficient is always too large, thus
no outlier as we define it can lie in the sample.
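The p-stability test can be sketched in Python under one reading of the slides: the signature is p-stable when |skewness| stays at or below 0.5 - p for every removed fraction between p and 0.5. The function name `is_p_stable`, the mapping of the k-th signature entry to the fraction k / n, and the behaviour when no signature point falls in the range are all assumptions of this sketch, not taken from the authors' code. The signature used is the rounded one from the skewness signature example (n = 9).

```python
def is_p_stable(signature, n, p):
    """Check p-stability of a skewness signature.

    signature[k] is the skewness after removing k extremal values from a
    sample of size n, so entry k corresponds to the fraction k / n (assumed).
    """
    bound = 0.5 - p
    in_range = [abs(g) for k, g in enumerate(signature) if p <= k / n <= 0.5]
    # Note: if no signature point falls in [p, 0.5], this returns True
    # vacuously; a full implementation would have to decide that case.
    return all(g <= bound for g in in_range)


# Rounded signature of X = {-3, -2, -1, -1, 0, 1, 2, 3, 7} from the slides.
example_signature = [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]

print(is_p_stable(example_signature, 9, 0.0))   # False: 1.09 > 0.5
print(is_p_stable(example_signature, 9, 0.05))  # True: all entries <= 0.45
print(is_p_stable(example_signature, 9, 0.3))   # False: 0.4 > 0.2
```

As in the slide's "0.16-stable but not 0.30-stable" example, a larger p tightens the bound, so stability at a small p does not imply stability at a larger one.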
15. Outskewer: outlier detection.
[Figure: cumulative frequency of x, and |skewness| vs. p split into an
area of outliers, an area of potential outliers and an area with no
outlier; values marked as not outlier, potential outlier or outlier.]
t: smallest t-stable value, t smallest value so that $|\gamma| \le 0.5 - t$.
T: largest T-stable value, T smallest value so that $|\gamma| \le 0.5 - T$.
Example: 50 values, including 7 outliers and 5 potential outliers.
16. Outskewer: outcome. Each value of the sample is classified as
follows: not outlier, potential outlier, outlier, or unknown when the
method is not applicable (skewness signature never p-stable).
17. Extension to time series. On a sliding window of size w, each
value of X is classified w times. The final class of a value is the
one that appears the most.
[Figure: sliding window along the time axis.]
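The sliding-window vote can be sketched as below. This is a minimal Python illustration: `classify_time_series` is a hypothetical name, and `toy_classifier` is a deliberately simple stand-in for the per-sample classification (it is not Outskewer). Tie-breaking and the handling of series shorter than the window are assumptions, since the slides do not specify them.

```python
from collections import Counter
from statistics import median


def toy_classifier(window):
    """Hypothetical stand-in classifier: flags values far from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 10 else "not outlier" for x in window]


def classify_time_series(values, w, classify_window):
    """Classify each value in every window containing it, then take the majority class."""
    votes = [Counter() for _ in values]
    for start in range(len(values) - w + 1):
        labels = classify_window(values[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Assumption: ties go to the class seen first; values never covered by a
    # window (series shorter than w) are reported as "unknown".
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


series = [1, 2, 1, 3, 50, 2, 1, 2, 3, 1]
labels = classify_time_series(series, 4, toy_classifier)
print(labels[4], labels.count("outlier"))  # prints: outlier 1
```

Values near the two ends of the series fall in fewer than w windows, so they receive fewer than w votes in this sketch.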
19. False positive rate. Normal distribution: 3% for n = 10, 0.01%
for n = 100. Pareto distribution: 5% for n = 100, 0.01% for n = 1000.
21. Experimental Results. French population during the 20th
century. Logs of a P2P search engine.
22. French population during the 20th century.
[Figure: Number of inhabitants per year (population vs. year, 1900 to 2000).]
[Figure: Difference over years (population difference vs. year, 1900 to
2000); values marked as not outlier, potential outlier or outlier.]
23. Harry Potter on eDonkey.
[Figure: Number of outliers per day (outliers and potential outliers),
15 Jul to 1 Dec, annotated with: in theatre, unknown event, pirate
release.]
Data: search logs on the P2P network eDonkey. Number of queries
containing "half blood prince" per hour, computed every 10 minutes,
during 28 weeks, over 205 million queries, for 24.4 million IP
addresses.
24. Contributions. Our method: is non-parametric except for the
size of the time window; classifies values only when the statistical
conditions are met; is naturally generalized to on-line analysis.
25. Conclusion. Motivation: outlier detection with no hypothesis
on data. Method based on the skewness of distributions. Excellent
experimental results. Relevant on various data sets. Open source code
in R on http://outskewer.sebastien.pro
26. Questions? Outskewer: Using Skewness to Spot Outliers in
Samples and Time Series
27. Homogeneous / heterogeneous data. Outlier = unexpected
extremal value? Extremal values far from the mean? Heterogeneous
(Pareto, Zipf...): common. Homogeneous (normal, Laplace...): uncommon.
[Figure: Probability density function of normal and Pareto laws
(density vs. x).]
28. Skewness signature.
[Figure: skewness signature s(p) vs. p, one panel for Normal and one for
Pareto; curves: min, q1, median, q3, max.]
29. Local view of the internet topology.
[Figure: Nb nodes vs. Nb rounds; values marked as outlier, potential
outlier, not outlier or unknown.]
M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in
Complex Systems, 20 (1), 23-30, 2011.