Date post: | 30-Dec-2015 |
Upload: | wayne-kirk |
Multivariate Data Analysis
Principal Component Analysis
Principal Component Analysis (PCA)
• Singular Value Decomposition
• Eigenvector / eigenvalue calculation
Data Matrix (IxK)
• Reduce variables
• Improve projections
• Remove noise
• Find outliers
• Find classes
[Figure: data matrix X with I rows (objects) and K columns (variables)]
PCA
• Example with 2 variables, 6 objects
• Find best (most informative) direction in space
• Describe direction
• Make projection
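[Figure: objects plotted against x1 and x2, with the 1st PC drawn through them]
[Figure: one object relative to the 1st PC, labelled Score and Residual]
[Figure: the 1st PC as a unit vector with Loading p1 = cos(θ) and Loading p2 = sin(θ), where θ is the angle between the 1st PC and the x1 axis]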
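The two-variable case can be sketched in a few lines of NumPy. The six data points below are hypothetical (the slide's own values are not given), and the SVD is used as one possible way to find the most informative direction.

```python
import numpy as np

# Hypothetical data: 6 objects measured on 2 variables (x1, x2).
X = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.8], [5.0, 5.1], [6.0, 5.9]])
Xc = X - X.mean(axis=0)                # mean-center each variable

# The first right singular vector is the direction of largest variation.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p = Vt[0]                              # loading vector (p1, p2), unit length
theta = np.arctan2(p[1], p[0])         # angle between the 1st PC and the x1 axis

t = Xc @ p                             # scores: projections onto the 1st PC
E = Xc - np.outer(t, p)                # residuals: what the 1st PC leaves out

print("loadings:", p)
print("angle (degrees):", np.degrees(theta))
print("scores:", t)
```

The sign of a principal component is arbitrary, so p and t may both come out with flipped signs; the projection itself is unchanged.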
[Figure: data matrix X (IxK) with a score vector t (one element per object i) and a loading vector p (one element per variable k)]
X = t1p1’ + t2p2’ + ... + tApA’ + E
X=TP’+E
X: properly preprocessed data matrix (IxK)
T: score matrix (IxA)
P: loading matrix (KxA)
E: residual matrix (IxK)
ta: score vector
pa: loading vector
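As a minimal sketch of the decomposition X = TP’ + E, the function below computes T, P and E from a singular value decomposition. The function name pca_svd and the random test data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def pca_svd(X, n_components):
    """Return scores T (I x A), loadings P (K x A) and residuals E (I x K).

    X is assumed to be preprocessed already (for example mean-centered).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # score vectors t_a as columns
    P = Vt[:n_components].T                      # loading vectors p_a as columns
    E = X - T @ P.T                              # residual matrix
    return T, P, E

# Hypothetical data: I = 10 objects, K = 5 variables, mean-centered.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
X = X - X.mean(axis=0)

T, P, E = pca_svd(X, n_components=2)
print(T.shape, P.shape, E.shape)
```

With this construction the loading vectors are orthonormal and the score vectors are mutually orthogonal, so each term ta pa’ adds a separate part of X.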
The Wine Example
People magazine
Wise & Gallagher
          Wine   Beer  Spirit  LifeEx  HeartD
France    63.5   40.1     2.5    78.0    61.1
Italy     58.0   25.1     0.9    78.0    94.1
Switz     46.0   65.0     1.7    78.0   106.4
Austra    15.7  102.1     1.2    78.0   173.0
Brit      12.2  100.0     1.5    77.0   199.7
U.S.A.     8.9   87.8     2.0    76.0   176.0
Russia     2.7   17.1     3.8    69.0   373.6
Czech      1.7  140.0     1.0    73.0   283.7
Japan      1.0   55.0     2.1    79.0    34.7
Mexico     0.2   50.4     0.8    73.0    36.4

                    Wine     Beer     Spirit  LifeEx   HeartD
Mean                20.9900  68.2600  1.7500  75.9000  153.8700
Standard deviation  24.9270  38.6718  0.9132   3.2128  110.8182
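The mean and standard deviation rows can be checked directly from the table, and they are what is needed to autoscale the data (mean-center, then divide by the standard deviation). The sketch below assumes the sample standard deviation (divisor I - 1), which reproduces the listed values.

```python
import numpy as np

variables = ["Wine", "Beer", "Spirit", "LifeEx", "HeartD"]
countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
X = np.array([
    [63.5,  40.1, 2.5, 78.0,  61.1],
    [58.0,  25.1, 0.9, 78.0,  94.1],
    [46.0,  65.0, 1.7, 78.0, 106.4],
    [15.7, 102.1, 1.2, 78.0, 173.0],
    [12.2, 100.0, 1.5, 77.0, 199.7],
    [ 8.9,  87.8, 2.0, 76.0, 176.0],
    [ 2.7,  17.1, 3.8, 69.0, 373.6],
    [ 1.7, 140.0, 1.0, 73.0, 283.7],
    [ 1.0,  55.0, 2.1, 79.0,  34.7],
    [ 0.2,  50.4, 0.8, 73.0,  36.4],
])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)      # sample standard deviation (divisor I - 1)
Z = (X - mean) / std             # autoscaled data: zero mean, unit variance

for name, m, sd in zip(variables, mean, std):
    print(f"{name:8s} mean = {m:9.4f}  std = {sd:9.4f}")
```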
[Figure: singular value versus component (1 to 5) for the wine data; the components are labelled 46%, 32%, 12%, 8% and 2%]
[Figure: score plot for the wine data, Score 1 (46%) versus Score 2 (32%), objects 1 to 10 labelled France, Italy, Switz, Austral, Brit, USA, Russia, Czech, Japan, Mex]
[Figure: loading plot for the wine data, Loading 1 versus Loading 2, variables 1 to 5 labelled Wine, Beer, Spirit, Life exp., Heart dis.]
Conclusions
Scores = positions of objects in multivariate space
Loadings = importance of original variables for new directions
Try to explain a large enough portion of X (46+32 = 78%)
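A rough sketch of how scores, loadings and the explained portion could be computed for the wine table follows. It assumes the data were autoscaled before PCA, which the slides suggest but do not state, and it makes no claim to reproduce the plotted axes exactly (signs and score scaling conventions differ between programs).

```python
import numpy as np

variables = ["Wine", "Beer", "Spirit", "LifeEx", "HeartD"]
countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
X = np.array([
    [63.5,  40.1, 2.5, 78.0,  61.1],
    [58.0,  25.1, 0.9, 78.0,  94.1],
    [46.0,  65.0, 1.7, 78.0, 106.4],
    [15.7, 102.1, 1.2, 78.0, 173.0],
    [12.2, 100.0, 1.5, 77.0, 199.7],
    [ 8.9,  87.8, 2.0, 76.0, 176.0],
    [ 2.7,  17.1, 3.8, 69.0, 373.6],
    [ 1.7, 140.0, 1.0, 73.0, 283.7],
    [ 1.0,  55.0, 2.1, 79.0,  34.7],
    [ 0.2,  50.4, 0.8, 73.0,  36.4],
])

# Assumption: autoscaling (mean-center, divide by sample standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
T = U * s                              # scores: positions of the objects
P = Vt.T                               # loadings: weights of the variables
pct = 100 * s**2 / np.sum(s**2)        # portion of Z explained per component

print("explained (%):", np.round(pct, 1))
print("first two components together (%):", round(pct[:2].sum(), 1))
for name, row in zip(countries, T[:, :2]):
    print(f"{name:8s} score 1 = {row[0]:7.3f}  score 2 = {row[1]:7.3f}")
for name, row in zip(variables, P[:, :2]):
    print(f"{name:8s} loading 1 = {row[0]:7.3f}  loading 2 = {row[1]:7.3f}")
```

Plotting the first two columns of T against each other gives a score plot, and the first two columns of P give the corresponding loading plot.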
The Apricot Example
Manley & Geladi
[Figure: Appelkoos (apricot) spectra, pseudoabsorbance versus wavelength (1000 to 2500 nm)]
[Figure: scree plot for the apricot data, singular value versus component number (1 to 10)]
What is rank?
Mathematical rank: at most min(I,K)
Gives zero residual
Effective rank = A
Separates model from noise
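The difference between mathematical and effective rank can be illustrated with simulated data. In this sketch the true dimensionality A = 2, the noise level and the matrix size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
I, K, A = 10, 50, 2

# Hypothetical data: an exact rank-A model plus a small amount of noise.
T_true = rng.normal(size=(I, A))
P_true = rng.normal(size=(K, A))
X_model = T_true @ P_true.T
X = X_model + 1e-3 * rng.normal(size=(I, K))

# Mathematical rank: noise makes every one of the min(I, K) components nonzero.
math_rank = np.linalg.matrix_rank(X)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
E_full = X - (U * s) @ Vt                   # all components: zero residual
E_A = X - (U[:, :A] * s[:A]) @ Vt[:A]       # A components: residual is noise

print("mathematical rank:", math_rank)
print("singular values:", np.round(s, 4))
print("residual RMS with A components:", np.sqrt(np.mean(E_A**2)))
```

The singular values drop sharply after component A, which is the kind of break a scree plot is meant to reveal.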
ANOVA
Comp#   SS        SS%     SS%cum
1       68.8269   98.10    98.10
2        1.2843    1.83    99.93
3        0.0463    0.07   100
4        0.0045    0.01
5        0.0007    0.00
6        0.0003    0.00
7        0.0002    0.00
8        0.0001    0.00
9        0.0000    0.00
10       0.0000    0.00
Total   70.1634  100
[Figure: score plot for the apricot data, Score 1 (98%) versus Score 2 (2%), objects 1 to 10]
ANOVA
SStot = SS1 + SS2 + SS3 + ... + SS(I or K)
SStot = λ1 + λ2 + λ3 + ... + λ(I or K)
From largest to smallest!
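As a quick check, the percentage and cumulative columns of the apricot ANOVA table can be recomputed from the listed sums of squares. The SS values and the total are copied from the table; the variable names are only illustrative.

```python
import numpy as np

# Sums of squares per component, as listed for the apricot data.
ss = np.array([68.8269, 1.2843, 0.0463, 0.0045, 0.0007,
               0.0003, 0.0002, 0.0001, 0.0000, 0.0000])
ss_tot = 70.1634                 # total as listed in the table

ss_pct = 100 * ss / ss_tot       # SS%
ss_cum = np.cumsum(ss_pct)       # SS%cum

print("Comp#        SS     SS%  SS%cum")
for a, (v, pct_a, cum_a) in enumerate(zip(ss, ss_pct, ss_cum), start=1):
    print(f"{a:5d} {v:9.4f} {pct_a:7.2f} {cum_a:7.2f}")
```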
ANOVA
X = TP’ + E
data = model + residual
SStot = SSmod + SSres
R2 = SSmod / SStot = 1 - SSres / SStot
Coefficient of determination (often in %)
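The split SStot = SSmod + SSres and the resulting R2 can be sketched as follows. The data are random and hypothetical, and A = 2 is an arbitrary choice for the number of components.

```python
import numpy as np

# Hypothetical data: 8 objects, 6 variables, mean-centered.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 6))
X = X - X.mean(axis=0)

A = 2                                          # number of components kept
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_model = (U[:, :A] * s[:A]) @ Vt[:A]          # model part, TP'
E = X - X_model                                # residual part

ss_tot = np.sum(X**2)
ss_mod = np.sum(X_model**2)
ss_res = np.sum(E**2)
r2 = ss_mod / ss_tot                           # same as 1 - ss_res / ss_tot

print(f"SStot = {ss_tot:.4f}, SSmod = {ss_mod:.4f}, SSres = {ss_res:.4f}")
print(f"R2 = {100 * r2:.2f}%")
```

The split is exact because the model and the residual are orthogonal to each other when the components come from the SVD.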
Examples
Wines: R2 = SSmod = 78%, SSres = 22% (2 comp.)
Apricots 1: R2 = SSmod = 99.93%, SSres = 0.07% (2 comp.)
Apricots 2: R2 = SSmod = 100%, SSres = ±0.0% (3 comp.)
[Figure: apricot spectra with outliers removed, absorbance versus wavelength (1000 to 2500 nm)]
[Figure: singular values versus component (1 to 8), no outliers; the first three components are labelled 81%, 16% and 3%]
[Figure: score plot, Score 2 (16%) versus Score 3 (3%), objects 1 to 8 grouped as Whole fruit, No kernel and Thin slice]
[Figure: Loadings 2 and 3 versus wavelength (1000 to 2500 nm)]
[Figure: Loading 2 versus Loading 3]
More nomenclature
Score = Latent Variable
Loading vector = Eigenvector
Effective rank = Pseudorank = Model dimensionality = Number of components
SSa = Eigenvalue
Singular value = SSa^(1/2)
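These equivalences can be verified numerically. The sketch below uses hypothetical random data and compares the SVD route with an eigenvalue decomposition of X’X.

```python
import numpy as np

# Hypothetical data: 12 objects, 4 variables, mean-centered.
rng = np.random.default_rng(3)
X = rng.normal(size=(12, 4))
X = X - X.mean(axis=0)

# Route 1: singular value decomposition of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                    # scores (latent variables)
ss_a = np.sum(T**2, axis=0)                  # sum of squares per component

# Route 2: eigenvalues and eigenvectors of X'X, sorted largest first.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("SSa:              ", np.round(ss_a, 4))
print("eigenvalues:      ", np.round(eigvals, 4))
print("singular values:  ", np.round(s, 4))
print("sqrt(eigenvalues):", np.round(np.sqrt(eigvals), 4))
```

The eigenvectors agree with the loading vectors up to sign, since the direction of a component is arbitrary.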
An analysis sequence
• 1. Scale, mean-center data
• 2. Calculate a few components
• 3. Check scores, loadings
• 4. Find outliers, groupings, explain
• 5. Remove outliers
An analysis sequence
• 6. Scale, mean-center data
• 7. Calculate enough components
• 8. Try to determine pseudorank
• 9. Check score plots
• 10. Check loading plots
• 11. Check residuals
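Steps 6, 7 and 11 of the sequence might look like the sketch below. The helper names autoscale and residual_std are hypothetical, the data are simulated, and the residual standard deviation uses one simple convention, sqrt(SSres / ((I - 1) K)), without any degrees-of-freedom correction for the components; other conventions exist.

```python
import numpy as np

def autoscale(X):
    """Mean-center each variable and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def residual_std(X, max_components):
    """Residual standard deviation after 0, 1, ..., max_components components."""
    I, K = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    out = []
    for a in range(max_components + 1):
        E = X - (U[:, :a] * s[:a]) @ Vt[:a]      # residual after a components
        out.append(np.sqrt(np.sum(E**2) / ((I - 1) * K)))
    return np.array(out)

# Hypothetical data: 10 objects, 5 variables, two underlying factors plus noise.
rng = np.random.default_rng(4)
X_raw = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 5))
X_raw = X_raw + 0.1 * rng.normal(size=(10, 5))

Z = autoscale(X_raw)                         # step 6
res = residual_std(Z, max_components=5)      # steps 7 and 11
for a, r in enumerate(res):
    print(f"{a} components: residual stdev = {r:.4f}")
```

A curve that flattens out after a certain number of components is one indication of the pseudorank, to be read together with the score and loading plots.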
[Figures: residual standard deviation (Residual stdev) for the wine data, two panels, horizontal axis labelled 0 to 4]