Date post: | 03-Jun-2018 |
Category: |
Documents |
Upload: | gianni-pantaleo |
8/12/2019 L6 Curse of Dimensionality Parchas
http://slidepdf.com/reader/full/l6-curse-of-dimensionality-parchas 1/40
The Curse of Dimensionality
Panagiotis Parchas
Advanced Data Management
Spring 2012
CSE, HKUST

Multiple Dimensions
• As we discussed in the lectures, it is often convenient to transform a signal (time series, picture) into a point in a multidimensional space.
• This transformation is handy because we can then apply conventional database indexing techniques to search queries such as NN.
• This transform may lead us to very high "dimensionality" (hundreds of dimensions).
• In high dimensionality there are a number of problems (geometrical and index performance) that are usually referred to as the "Curse of Dimensionality".
• In this presentation:
– Some intuition about the Curse.
– Exploring techniques that try to overcome it.
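As a minimal sketch of the transformation described above (an illustration, not part of the original slides): each time series of length n is read as one point in n-dimensional space, and an NN query becomes a search for the closest point. The function names and the toy data are assumptions, and a plain linear scan stands in for a real index.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of the same dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(database, query):
    """Linear-scan NN query: index of the stored series closest to the query."""
    return min(range(len(database)), key=lambda i: euclidean(database[i], query))

# Three toy time series of length 4, i.e. three points in 4-D space.
database = [
    [10.0, 10.5, 11.0, 10.5],
    [9.5, 9.6, 9.7, 9.8],
    [11.2, 11.0, 10.8, 10.6],
]
query = [9.6, 9.6, 9.8, 9.9]
best = nearest_neighbor(database, query)
print("nearest series:", best)
```

With hundreds of dimensions per series, the cost of each distance computation and the poor pruning of indexes are exactly where the problems discussed next appear.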

The Curse
• Volume and area depend exponentially on the number of dimensions.
• Non-intuitive effects:
– Geometric effects concerning the volume of hypercubes and spheres
– Indexing effects
– Effects in the Database environment (query selectivity)
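To put numbers on the first bullet, the sketch below (an illustration, not from the slides) uses the standard closed form for the volume of a d-dimensional ball, V_d(r) = π^(d/2) r^d / Γ(d/2 + 1), and compares the ball inscribed in the unit hypercube with the cube itself. The function names are assumptions.

```python
import math

def ball_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def inscribed_ball_fraction(d):
    """Fraction of the unit hypercube [0, 1]^d covered by its inscribed ball (r = 0.5)."""
    return ball_volume(d, 0.5)  # the unit cube has volume 1

for d in (2, 3, 10, 20):
    print(d, inscribed_ball_fraction(d))
```

The fraction starts at π/4 in 2D and shrinks rapidly with d, so in high dimensions almost all of the cube's volume lies in its corners, outside the inscribed ball.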

a) Geometric Effects
Lemma:
A sphere touching or intersecting all the (d-1)-dimensional borders of a cube will contain the center.
• True for 2D and 3D (by visualization)
• It should be true for higher dimensions (hypercubes, hyperspheres)!
It is NOT!
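The failure of the lemma can be checked numerically. The sketch below is an illustration under assumed values (sphere center at 0.3 in every coordinate, radius 0.75, unit cube [0, 1]^d); these numbers are not taken from the slides. Because the sphere center lies inside the cube, the sphere reaches a face whenever its extent along that axis crosses the face's coordinate.

```python
import math

def touches_all_borders(center, radius):
    """True if the sphere reaches every (d-1)-dimensional face of [0, 1]^d.

    Assumes the sphere center lies inside the cube.
    """
    return all(c - radius <= 0.0 and c + radius >= 1.0 for c in center)

def contains_cube_center(center, radius):
    """True if the cube center (0.5, ..., 0.5) lies inside the sphere."""
    dist = math.sqrt(sum((c - 0.5) ** 2 for c in center))
    return dist <= radius

def check(d, offset=0.3, radius=0.75):
    center = [offset] * d
    return touches_all_borders(center, radius), contains_cube_center(center, radius)

for d in (2, 3, 16):
    print(d, check(d))
```

In 2D and 3D both checks succeed. In 16 dimensions the sphere still reaches every face, but the cube center is at distance 0.2·√16 = 0.8 from the sphere center, which is larger than the radius.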

b) Indexing Effects

b) Indexing effects (cont.)
• The higher the dimensionality, the coarser the indexing (which renders it useless!)
• This affects all the indexing techniques.
Christian Böhm, 2001

c) Query selectivity

When is NN meaningful?
Kevin Beyer et al., 1999
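The question in the title can be made concrete with a small experiment. The sketch below is an illustration, not code from the cited paper: it assumes uniformly random points in [0, 1]^d and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name, sample size and seed are assumptions.

```python
import math
import random

def relative_contrast(d, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 20, 200):
    print(d, round(relative_contrast(d), 3))
```

As d grows the contrast shrinks, so the nearest neighbor is barely closer than any other point, and an NN query distinguishes less and less.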

What is the spell for the curse?
• Various attempts at multidimensional indexing were proved not to make sense for a big category of data distributions [Christian Böhm, 2001]
Dimensionality Reduction techniques.
• They basically apply ideas of compression to the data, in order to reduce the dimensionality.
• In what follows we will focus mainly on Time Series.

Introduction
[Figure: example time series from 9/1/2011 to 2/1/2012, with values between 9.5 and 11.5. Its 128 data points become a single point in a 128D space.]

[Figure: the same time series approximated with DFT, DWT, APCA, PAA and PLA; each panel has an x-axis from 0 to 120]
Tutorial in IEEE ICDM 2004 by Dr. Keogh

Discrete Fourier Transform (DFT)
• "Every signal, no matter how complex, can be represented as a summation of sinusoids"
• Idea:
– Find the hidden sinusoids that form the time series
– Store two numbers for each: (A, φ), the magnitude and the phase
– Higher-frequency sinusoids generally correspond to details of the time series
– We can discard them and keep just the first ones (low frequency)
– Then we use the Inverse DFT to get the approximation of the time series.
[Equations: DFT and Inverse DFT]
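The steps above can be sketched in a few lines. This is a minimal illustration, not the slides' implementation: it assumes the common unnormalized convention X_k = Σ_t x_t · e^(-2πi·k·t/n) with the 1/n factor in the inverse (the convention on the original slide may differ), it uses a naive O(n²) transform for clarity instead of the O(n log n) FFT, and all function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive forward DFT (unnormalized)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT; returns the real part, scaled by 1/n."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def dft_approximation(x, k):
    """Keep the k lowest-frequency coefficients and rebuild the series from them."""
    n = len(x)
    X = dft(x)
    k = min(k, n // 2 + 1)
    kept = [0j] * n
    for f in range(k):
        kept[f] = X[f]
        if f != 0:
            kept[n - f] = X[n - f]  # conjugate partner, so the result stays real
    stored = [(abs(X[f]), cmath.phase(X[f])) for f in range(k)]  # (A, phi) pairs
    return stored, idft(kept)

n = 32
x = [10.5 + 0.5 * math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 9 * t / n)
     for t in range(n)]
stored, approx = dft_approximation(x, 3)
print(stored[0], max(abs(a - b) for a, b in zip(x, approx)))
```

Under this convention the first magnitude A is simply the sum of the samples, and the approximation error here is the discarded high-frequency component.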

DFT example
[Figure: the time series (x-axis 1 to 127, values between 9.5 and 11.5) is passed through the DFT, giving one magnitude A and one phase φ per sinusoid]
We store the (A, φ) values.

DFT example (cont.)
[Figure: only the first (low-frequency) (A, φ) pairs are kept; the inverse DFT (IDFT) turns them into the approximate TS, shown as the DFT approximation of the original series]

DFT

DFT (Pros & Cons)
• O(n log n) complexity
• Hardware implementations
• Good ability to compress most signals
• Many applications
• Not a good approximation for bursty signals
• Not a good approximation if the signal contains both flat and busy segments
• Cannot support other distance metrics
• Contains info only for the frequency distribution
– The time domain?

Why is DFT not enough?
• It gives us information about the frequency components of a time series, without telling where each frequency lies in the time domain.
[Figure: z(t) = sin(5*t) , sin(10*t) (one sinusoid followed by the other); x(t) = sin(5*t) + sin(10*t); Fourier Decomposition (Spectrum)]
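The point of the figure can be reproduced with a short experiment. The sketch below is an illustration with assumed sampling (256 samples over one period of length 2π) and a naive DFT; it builds x(t) as the sum of the two sinusoids and z(t) as one sinusoid followed by the other, then reports the two strongest frequency bins of each.

```python
import cmath
import math

def magnitudes(x):
    """Magnitudes of the first n/2 DFT bins (naive O(n^2) transform)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def top_two_bins(x):
    """The two frequency bins with the largest magnitude, in increasing order."""
    mags = magnitudes(x)
    ranked = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
    return sorted(ranked[:2])

n = 256
ts = [2 * math.pi * i / n for i in range(n)]
x = [math.sin(5 * t) + math.sin(10 * t) for t in ts]                    # both at once
z = [math.sin(5 * t) if i < n // 2 else math.sin(10 * t)                # one after the other
     for i, t in enumerate(ts)]
print(top_two_bins(x), top_two_bins(z))
```

Both signals peak at the same two frequencies, although they look completely different in the time domain; the spectrum alone does not say when each frequency occurs.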

Discrete Wavelet Transform (DWT)
• This comes as a solution to the previous problem.
• The wavelet transform contains information both for the frequency domain AND the time domain.
• The basic idea is to express the time series as a linear combination of a wavelet basis function. The Haar wavelet is mostly used.

DWT: Graphical Intuition
• The wavelet is stretched and shifted in time, and this is done for all the possible stretches and shifts.
• Afterwards, each is multiplied with the TS.
• We keep only the ones with a high product.

DWT: Numerical Intuition
Resolution   Averages     Details
4            [9 7 3 5]
2            [8 4]        [1 -1]
1            [6]          [2]
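The table above can be reproduced with pairwise averaging and differencing. The following is a minimal sketch (function names are assumptions) of the unnormalized Haar decomposition for a signal whose length is a power of two.

```python
def haar_decompose(signal):
    """Return [overall average, coarsest detail, ..., finest details]."""
    averages = list(signal)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        details = [(a - b) / 2 for a, b in pairs] + details  # finer details go last
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details

def haar_reconstruct(coeffs):
    """Invert haar_decompose: expand each average with its detail coefficient."""
    averages = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(averages)]
        averages = [v for a, d in zip(averages, details) for v in (a + d, a - d)]
        pos += len(details)
    return averages

coeffs = haar_decompose([9, 7, 3, 5])
print(coeffs, haar_reconstruct(coeffs))
```

For [9 7 3 5] the coefficients are the final average 6 followed by the details 2, 1 and -1, as in the table. Dropping the smallest details and reconstructing gives the compressed approximation.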

Example taken from Stollnitz, E. et al., 1995

DWT
In our example:
• We had 128 pts
• The approximation (red line) uses only 16 Haar coefficients
[Figure: Wavelet Approximation of the example time series]

DWT (Pros & Cons)
• Good ability to compress stationary signals.
• Fast linear-time algorithms for DWT exist.
• Able to support some interesting non-Euclidean similarity measures.
• Signals must have a length n = 2^some_integer
• Works best if N is = 2^some_integer. Otherwise wavelets approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.

Singular Value Decomposition (SVD)
• All the previous methods try to transform each time series independently of the others.
• What if we take into account all the time series?
• We can then achieve the desired dimensionality reduction for the specific Dataset

SVD: Basic Idea (1)
SVD: Basic Idea (2)
SVD: Basic Idea (3)

SVD (more)
• The goal is to find the axes with the biggest variance.
High-variance axes: a lot of information (important axes)
Low-variance axes: little information / noise (noise axes). These axes can be truncated.

SVD (more)
• In the previous intuition, we can keep the coefficients of the projections onto the new axes.
• This can be efficiently done by SVD.
• So we perform the dimensionality reduction in an aggregate way, taking into account the whole dataset.
• This idea was traditionally used in linear algebra for matrix compression.
• The idea was to find the (nearly) linearly dependent columns of a matrix A and eliminate them.
• It can be proved that this compression is optimal.
$A = U \Sigma V^{T}$
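As a rough sketch of the projection step (not the slides' method for computing the decomposition): the code below assumes the dataset is a matrix A with one time series per row, and swaps in plain power iteration on AᵀA to find only the axis of the biggest singular value, instead of calling a full SVD routine. It also assumes A is non-zero and that the uniform starting vector is not orthogonal to that axis. All names are hypothetical.

```python
import math

def top_singular_axis(A, iterations=200):
    """Power iteration on A^T A: returns (sigma_1, v_1) for a matrix given as a list of rows."""
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        Av = [sum(a * b for a, b in zip(row, v)) for row in A]               # A v
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]  # A^T (A v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    Av = [sum(a * b for a, b in zip(row, v)) for row in A]
    return math.sqrt(sum(c * c for c in Av)), v

def project_rank1(A):
    """Keep one coefficient per series: its projection onto the top axis."""
    sigma, v = top_singular_axis(A)
    coeffs = [sum(a * b for a, b in zip(row, v)) for row in A]
    approx = [[c * vj for vj in v] for c in coeffs]  # reconstruction from one number per row
    return sigma, coeffs, approx

A = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # rows are exactly linearly dependent
sigma, coeffs, approx = project_rank1(A)
print(sigma, coeffs, approx)
```

In this toy matrix every row is a multiple of (1, 2), so a single coefficient per row reconstructs A exactly; with real data the discarded axes carry the approximation error.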

SVD: compression
Projection onto the axis denoted by the biggest singular value s:
MINIMUM information loss
Good for compression

SVD: Clustering
Projection onto the axis denoted by the smallest singular value s:
MAXIMUM information loss
Good for clustering

SVD (Pros & Cons)
• Optimal linear dimensionality reduction technique.
• The eigenvalues tell us something about the underlying structure of the data.
• Computationally very expensive.
– Time: O(Mn^2)
– Space: O(Mn)
• An insertion into the database requires recomputing the SVD.
• Cannot support weighted distance measures or non-Euclidean measures.

Piecewise Aggregate Approximation (PAA)
• Very simple, intuitive
• Represent the time series as a summation of boxes of equal length.
• We keep 13 boxes
[Figure: PAA approximation]
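A minimal sketch of PAA follows (function names are assumptions): the series is cut into boxes of as equal a length as possible, and each box is replaced by its mean.

```python
def paa(series, n_boxes):
    """Mean of each of n_boxes (near-)equal-length boxes."""
    n = len(series)
    if not 1 <= n_boxes <= n:
        raise ValueError("n_boxes must be between 1 and len(series)")
    bounds = [i * n // n_boxes for i in range(n_boxes + 1)]
    return [sum(series[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]

def paa_reconstruct(means, n):
    """Expand the box means back to a step function of length n."""
    k = len(means)
    bounds = [i * n // k for i in range(k + 1)]
    out = []
    for m, a, b in zip(means, bounds, bounds[1:]):
        out.extend([m] * (b - a))
    return out

series = [1, 3, 5, 7, 2, 2, 4, 0]
means = paa(series, 4)
print(means, paa_reconstruct(means, len(series)))
```

When the length is not a multiple of the number of boxes, the integer boundaries make some boxes one point longer than others; this is one simple way to handle the remainder, not necessarily the one used in the original work.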

PAA (Pros & Cons)
• Fast, easy to implement, intuitive
• The authors claim it is as efficient as other approaches (empirically)
• Supports queries of arbitrary lengths
• Supports non-Euclidean measures
• It seems to be a simplification of DWT that cannot be generalized to other types of signals

Adaptive Piecewise Constant Approximation (APCA)
• What about signals with flat areas and peaks?
IDEA: generalize PAA so it can automatically adapt itself to the correct box size.
(we should now keep both the length and height of the box)
[Figure: Raw Data (Electrocardiogram). Adaptive Representation (APCA): Reconstruction Error 2.61. Haar Wavelet or PAA: Reconstruction Error 3.27. DFT: Reconstruction Error 3.11]
example by E. Keogh, IEEE ICDM 2004

APCA (more)
• In order to implement it, the authors first propose a DWT transformation, followed by merging of the similar, adjacent wavelets.
• However, the indexing is more complicated than for PAA, since we need two numbers for each box.
• That is the reason why it is not used very often.
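To illustrate the "two numbers per box" representation, here is a simplified sketch. It is not the authors' DWT-based construction: it swaps in a greedy bottom-up merge that repeatedly joins the two adjacent boxes whose union adds the least squared error, which is slower but easy to follow. The function name and the toy series are assumptions.

```python
def apca_greedy(series, n_boxes):
    """Adaptive boxes as (height, right_end) pairs; right_end is exclusive."""
    segs = [(i, i + 1) for i in range(len(series))]  # start with one box per point

    def sse(a, b):
        """Squared error of replacing series[a:b] by its mean."""
        chunk = series[a:b]
        mean = sum(chunk) / len(chunk)
        return sum((v - mean) ** 2 for v in chunk)

    while len(segs) > n_boxes:
        # Extra error caused by merging each pair of adjacent boxes.
        costs = [sse(segs[i][0], segs[i + 1][1]) - sse(*segs[i]) - sse(*segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = min(range(len(costs)), key=costs.__getitem__)
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    return [(sum(series[a:b]) / (b - a), b) for a, b in segs]

series = [1, 1, 1, 1, 1, 1, 9, 9, 1, 1]  # flat areas with one peak
boxes = apca_greedy(series, 3)
print(boxes)
```

The long flat area gets one wide box and the peak gets its own narrow box, which equal-length PAA boxes cannot do with the same budget.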

Piecewise Linear Approximation (PLA)
Linear segments for representation (not necessarily connected)
Although efficient in some cases, the implementation is slow and it is not indexable
example for visualization only

Non Linear Techniques
Dimensionality Reduction: A Comparative Review, L.J.P. van der Maaten, 2008

Non Linear techniques (2)
• A lot of techniques have emerged in recent years.
• However, [Maaten et al. 2008] compared them, and on most of the datasets all these complicated techniques turn out to be worse.
• The reasons, the authors claim, are data overfitting and the curse of dimensionality!

Conclusion
• All the aforementioned techniques have their strong and weak points.
• Dr. Keogh tested them over 65 different datasets with different characteristics.
On average, they are all about the same. In particular, on 80% of the datasets they are all within 10% of each other.
So the choice of the best method depends on the characteristics of the Dataset