Multivariate Analysis: A Unified Perspective
Harrison B. Prosper
Florida State University
Advanced Statistical Techniques in Particle Physics
Durham, UK, 20 March 2002
Outline

- Introduction
- Some Multivariate Methods
  - Fisher Linear Discriminant (FLD)
  - Principal Component Analysis (PCA)
  - Independent Component Analysis (ICA)
  - Self Organizing Map (SOM)
  - Random Grid Search (RGS)
  - Probability Density Estimation (PDE)
  - Artificial Neural Network (ANN)
  - Support Vector Machine (SVM)
- Comments
- Summary
Introduction – i
Multivariate analysis is hard! Our mathematical intuition, based on analysis in one dimension, often fails rather badly in spaces of very high dimension.
One should distinguish the problem to be solved from the algorithm to solve it.
Typically, the problems to be solved, when viewed with sufficient detachment, are relatively few in number whereas algorithms to solve them are invented every day.
Introduction – ii
So why bother with multivariate analysis? Because:
The variables we use to describe events are usually statistically dependent.
Therefore, the N-d density of the variables contains more information than is contained in the set of 1-d marginal densities $f_i(x_i)$.
This extra information may be useful.
[Figure: DØ top-quark discovery (1995), $p\bar{p} \to t\bar{t} \to \ell + \mathrm{jets}$]
Introduction – iii
Problems that may benefit from multivariate analysis:
- Signal to background discrimination
- Variable selection (e.g., to give maximum signal/background discrimination)
- Dimensionality reduction of the feature space
- Finding regions of interest in the data
- Simplifying optimization (by a mapping $f: U^N \to U^1$)
- Model comparison
- Measuring stuff (e.g., $\tan\beta$ in SUSY)
Fisher Linear Discriminant
Purpose
- Signal/background discrimination
$$\log\frac{g(x\,|\,\mu_1,\Sigma)}{g(x\,|\,\mu_2,\Sigma)} = w \cdot x + b, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2)$$

where g is a Gaussian.

The hyperplane $w \cdot x + b = 0$ separates the two classes: $w \cdot x + b > 0$ on one side, $w \cdot x + b < 0$ on the other.
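A minimal NumPy sketch of this construction (the toy samples and the function name are my own illustrations, not from the talk):

```python
import numpy as np

def fisher_discriminant(signal, background):
    """Return (w, b) for D(x) = w.x + b, assuming Gaussian classes
    with a shared covariance, as on this slide."""
    mu1, mu2 = signal.mean(axis=0), background.mean(axis=0)
    # Pooled covariance approximates the common Sigma
    sigma = 0.5 * (np.cov(signal, rowvar=False) + np.cov(background, rowvar=False))
    w = np.linalg.solve(sigma, mu1 - mu2)   # w = Sigma^-1 (mu1 - mu2)
    b = -0.5 * w @ (mu1 + mu2)              # midpoint threshold (equal priors)
    return w, b

rng = np.random.default_rng(0)
sig = rng.normal(1.0, 0.5, size=(500, 2))   # toy signal sample
bkg = rng.normal(0.0, 0.5, size=(500, 2))   # toy background sample
w, b = fisher_discriminant(sig, bkg)
print("fraction of signal with w.x + b > 0:", np.mean(sig @ w + b > 0))
```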
Principal Component Analysis
Purpose
- Reduce dimensionality of data
1st principal axis:

$$w_1 = \arg\max_{w} \sum_{i=1}^{K} d_i^2, \qquad d_i = w \cdot x_i$$

2nd principal axis:

$$w_2 = \arg\max_{w} \sum_{i=1}^{K} \left[\, w \cdot \big(x_i - (w_1 \cdot x_i)\, w_1\big) \right]^2$$

[Figure: data points in the $(x_1, x_2)$ plane, showing the projection $d_i$ of $x_i$ onto $w$ and the 1st and 2nd principal axes.]
PCA algorithm in practice
Transform from $X = (x_1, \ldots, x_N)^T$ to $U = (u_1, \ldots, u_N)^T$, in which the lowest-order correlations are absent:
- Compute Cov(X)
- Compute its eigenvalues $\lambda_i$ and eigenvectors $v_i$
- Construct the matrix $T = \mathrm{Col}(v_i)^T$
- $U = TX$

Typically, one eliminates the $u_i$ with the smallest amount of variation.
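A NumPy sketch of this recipe (function and variable names are mine):

```python
import numpy as np

def pca_transform(X, keep=2):
    """Rows of X are events.  Follows the eigen-decomposition recipe
    above and keeps the 'keep' components with the largest variance."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # Cov(X)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues, ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    T = eigvecs[:, order].T                  # rows are the eigenvectors v_i
    U = Xc @ T.T                             # u = T x, event by event
    return U[:, :keep], eigvals[order]
```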
Independent Component Analysis
Purpose
- Find statistically independent variables
- Dimensionality reduction

Basic Idea
- Assume $X = (x_1, \ldots, x_N)^T$ is a linear sum $X = AS$ of independent sources $S = (s_1, \ldots, s_N)^T$. Both A, the mixing matrix, and S are unknown.
- Find a de-mixing matrix T such that the components of $U = TX$ are statistically independent.
ICA – Algorithm
Given two densities f(U) and g(U), one measure of their "closeness" is the Kullback-Leibler divergence

$$K(f\,|\,g) = \int f(U) \log\frac{f(U)}{g(U)}\, dU \;\ge\; 0,$$

which is zero if, and only if, f(U) = g(U).

We set

$$g(U) = \prod_i f_i(u_i)$$

and minimize $K(f\,|\,g)$ (now called the mutual information) with respect to the de-mixing matrix T.
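In practice one rarely codes this minimization from scratch; a common stand-in is the FastICA algorithm from scikit-learn. A toy sketch (the sources and mixing matrix are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.cos(5 * t))]   # two independent sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                         # "unknown" mixing matrix
X = S @ A.T                                        # observed mixtures, X = AS

ica = FastICA(n_components=2, random_state=0)
U = ica.fit_transform(X)        # components of U = TX, made independent
```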
Self Organizing Map
Purpose
- Find regions of interest in the data; that is, clusters
- Summarize the data

Basic Idea (Kohonen, 1988)
- Map each of K feature vectors $X = (x_1, \ldots, x_N)^T$ into one of M regions of interest defined by the vectors $w_m$, so that all X mapped to a given $w_m$ are closer to it than to all remaining $w_m$.
- Basically, perform a coarse-graining of the feature space.
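A heavily simplified sketch of the competitive update (my own toy code: it omits the neighbourhood function of a full Kohonen map, which reduces it to online vector quantization):

```python
import numpy as np

def coarse_grain(X, n_cells=16, n_steps=5000, eta=0.1):
    """Move the winning cell w_m toward each presented feature vector."""
    rng = np.random.default_rng(1)
    W = X[rng.choice(len(X), n_cells)]        # initialize cells from the data
    for step in range(n_steps):
        x = X[rng.integers(len(X))]           # present one feature vector
        m = np.argmin(np.sum((W - x) ** 2, axis=1))       # closest cell wins
        W[m] += eta * (1 - step / n_steps) * (x - W[m])   # decaying update
    return W                                  # the M regions of interest
```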
Grid Search

Purpose
- Signal/background discrimination

Apply cuts at each grid point:

$$x > x_i, \qquad y > y_i$$

We refer to $(x_i, y_i)$ as a cut-point.

Number of cut-points $\sim N_{\mathrm{bin}}^{N_{\mathrm{dim}}}$

[Figure: rectangular grid of cut-points in the (x, y) plane.]
Random Grid Search
Take each point of the signal class as a cut-point:

$$x > x_i, \qquad y > y_i$$

$N_{\mathrm{tot}}$ = # events before cuts
$N_{\mathrm{cut}}$ = # events after cuts
Fraction = $N_{\mathrm{cut}} / N_{\mathrm{tot}}$

[Figure: signal fraction versus background fraction, each axis running from 0 to 1, with one point per cut-point.]

H.B.P. et al., Proceedings, CHEP 1995
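A short sketch of the idea in NumPy (two variables, one-sided cuts as above; array names are mine):

```python
import numpy as np

def random_grid_search(sig, bkg):
    """Use each signal event (x_i, y_i) as a cut-point and record the
    signal and background fractions surviving x > x_i, y > y_i."""
    points = []
    for x_i, y_i in sig:
        s_frac = np.mean((sig[:, 0] > x_i) & (sig[:, 1] > y_i))
        b_frac = np.mean((bkg[:, 0] > x_i) & (bkg[:, 1] > y_i))
        points.append((b_frac, s_frac))   # one point on the RGS plot
    return np.array(points)
```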
Probability Density Estimation
Purpose
- Signal/background discrimination
- Parameter estimation
Basic Idea
- Parzen estimation (1960s):

$$\hat{p}(x) = \frac{1}{N h^d} \sum_{n=1}^{N} K\!\left(\frac{x - x_n}{h}\right)$$

- Mixtures:

$$p(x) = \sum_{j=1}^{N} p(x\,|\,j)\, q(j)$$
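A minimal Parzen estimator with a Gaussian kernel (the kernel choice and bandwidth are illustrative assumptions):

```python
import numpy as np

def parzen(x, sample, h=0.1):
    """p-hat(x) = (1 / (N h^d)) * sum_n K((x - x_n) / h), Gaussian K."""
    N, d = sample.shape
    z = (x - sample) / h                       # (x - x_n) / h for every x_n
    K = np.exp(-0.5 * np.sum(z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (N * h ** d)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 2))    # toy 2-D sample
print(parzen(np.zeros(2), data, h=0.2))        # density estimate at the origin
```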
Artificial Neural Networks
Purpose
- Signal/background discrimination
- Parameter estimation
- Function estimation
- Density estimation
Basic Idea
Encode the mapping $f: U^N \to U^M$, $f(x) = [F_1, \ldots, F_K]$, using a set of 1-D functions (Kolmogorov, 1950s).
Feedforward Networks
$$n(x, w) = f\!\left(\sum_{i=1}^{5} w_i\, f(a_i)\right), \qquad a_i = \sum_{j=1}^{2} w_{ij}\, x_j$$

[Figure: network with input nodes $x_1, x_2$, hidden nodes, and one output node; $w_{ij}$ are the input-to-hidden weights, $w_i$ the hidden-to-output weights, and $f(a)$ the node activation function.]
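The same network written out in NumPy (random placeholder weights; taking f to be the logistic sigmoid is my assumption, since the slide only names it f(a)):

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))    # logistic sigmoid (assumed)

def network(x, W, w):
    """n(x, w) = f( sum_i w_i f(a_i) ),  a_i = sum_j w_ij x_j."""
    a = W @ x                          # hidden activations a_i (5-vector)
    return f(w @ f(a))                 # scalar network output

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))            # input -> hidden weights w_ij
w = rng.normal(size=5)                 # hidden -> output weights w_i
print(network(np.array([0.5, -1.0]), W, w))
```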
ANN – Algorithm

Minimize the empirical risk function with respect to the network parameters w:

$$R(w) = \frac{1}{N} \sum_{i=1}^{N} \left[\, t_i - n(x_i, w) \,\right]^2$$
Solution (for large N):

$$n(x, w) = \int t\, p(t\,|\,x)\, dt$$

If $t(x) = I_k(x)$, where $I_k(x) = 1$ if x is of class k and 0 otherwise, then

$$n(x, w) = p(k\,|\,x) = \frac{p(x\,|\,k)\, p(k)}{\sum_{k} p(x\,|\,k)\, p(k)}$$
D.W. Ruck et al., IEEE Trans. Neural Networks 1(4), 296-298 (1990)
E.A. Wan, IEEE Trans. Neural Networks 1(4), 303-305 (1990)
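A toy gradient-descent minimization of R(w) for the two-class case (the data, learning rate, and schedule are my own choices; a real analysis would use a proper optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class sample; targets t = 1 (signal), t = 0 (background)
X = np.vstack([rng.normal( 1.0, 1.0, (500, 2)),
               rng.normal(-1.0, 1.0, (500, 2))])
t = np.r_[np.ones(500), np.zeros(500)]

f = lambda a: 1.0 / (1.0 + np.exp(-a))
W = rng.normal(size=(5, 2))                      # w_ij
w = rng.normal(size=5)                           # w_i
eta = 0.5                                        # arbitrary learning rate

for epoch in range(2000):
    h = f(X @ W.T)                               # f(a_i) for every event
    n = f(h @ w)                                 # network outputs n(x, w)
    delta = (n - t) * n * (1 - n)                # from dR/dw (constants absorbed)
    grad_h = delta[:, None] * w * (h * (1 - h))  # back-propagated factor
    w -= eta * h.T @ delta / len(X)
    W -= eta * grad_h.T @ X / len(X)

# For large samples n(x, w) approximates p(signal | x), per Ruck and Wan
```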
Support Vector Machines
Purpose
- Signal/background discrimination

Basic Idea
- Data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension.
- Use a linear discriminant to partition the high-dimensional feature space:

$$D(x) = w \cdot \phi(x) + b, \qquad \phi: \mathbb{R}^N \to \mathbb{R}^{\mathrm{Huge}}$$
SVM – Kernel Trick
Or how to cope with a possibly infinite number of parameters!
$$\phi: (x_1, x_2) \to (z_1, z_2, z_3)$$

[Figure: classes $y = -1$ and $y = +1$ that are non-separable in the $(x_1, x_2)$ plane become separable in $(z_1, z_2, z_3)$.]

$$D(x) = w \cdot \phi(x) + b = \sum_j \alpha_j y_j \left[\phi(x_j) \cdot \phi(x)\right] + b$$

$$K(x, x_j) = \phi(x) \cdot \phi(x_j)$$

Try different kernels K, because the mapping $\phi$ is unknown!
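This can be made concrete with the textbook quadratic embedding (my example, not necessarily the mapping drawn on the slide): for $\phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, the dot product in z-space equals $(u \cdot v)^2$ computed in x-space, so $\phi$ never has to be evaluated explicitly.

```python
import numpy as np

def phi(x):
    # phi: (x1, x2) -> (z1, z2, z3), a quadratic embedding
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(u, v):
    # The same dot product, computed entirely in the original 2-D space
    return (u @ v) ** 2

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(phi(u) @ phi(v), K(u, v))   # identical values: the kernel trick
```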
Comments – i
Every classification task tries to solve the same fundamental problem: after adequately pre-processing the data, find a good, and practical, approximation to the Bayes decision rule: given X, if P(S|X) > P(B|X), choose hypothesis S; otherwise choose B.

If we knew the densities p(X|S) and p(X|B) and the priors p(S) and p(B), we could compute the Bayes Discriminant Function (BDF):

$$D(X) = P(S\,|\,X)\, /\, P(B\,|\,X)$$
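A one-dimensional numerical illustration (the Gaussian densities and priors are invented): since the evidence p(X) cancels in the ratio, D(X) > 1 is exactly the rule P(S|X) > P(B|X).

```python
from scipy.stats import norm

p_S, p_B = 0.5, 0.5   # invented priors

def D(x):
    # D(X) = P(S|X) / P(B|X) = p(X|S) p(S) / (p(X|B) p(B))
    num = norm.pdf(x, loc=1.0, scale=1.0) * p_S
    den = norm.pdf(x, loc=-1.0, scale=1.0) * p_B
    return num / den

for x in (-1.0, 0.0, 0.5):
    print(x, "-> S" if D(x) > 1 else "-> B")
```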
Comments – ii
The Fisher discriminant (FLD), random grid search (RGS), probability density estimation (PDE), neural network (ANN) and support vector machine (SVM) are simply different algorithms to approximate the Bayes discriminant function D(X), or a function thereof.
It follows, therefore, that if a method is already close to the Bayes limit, then no other method, however sophisticated, can be expected to yield dramatic improvements.
Summary
Multivariate analysis is hard, but useful if it is important to extract as much information from the data as possible.
For classification problems, the common methods provide different approximations to the Bayes discriminant.
There is considerable empirical evidence that, as yet, no uniformly most powerful method exists. Therefore, be wary of claims to the contrary!