What is multivariate data?
◦ How to represent data as matrices and vectors
Introducing Chemometrics
◦ The most commonly used methods for treating analytical chemistry data
Avoiding “Overfitting”
◦ How the incorrect use of these statistical methods can lead to nonsense results
Multivariate data: an experimental dataset that contains measurements of multiple attributes made on a collection of objects (people, samples, etc.)
Measurements of 3 different “attributes” of 8 “objects” (=children)
Suppose we want to investigate what factors influence a child’s weight – this is the “dependent” variate
The remaining variates are “predictors”
Child (‘object’)   Height (cm) (‘predictor’)   Age (years) (‘predictor’)   Weight (kg) (‘dependent’)
Alison             146                         9                           29
Bethany            138                         10                          35
Chloe              143                         9                           31
Denise             127                         8                           28
Emma               141                         9                           26
Frances            132                         10                          37
Grace              155                         12                          34
Helen              128                         7                           26
Traditional statistical methods treat one predictor at a time. A well-known approach is linear least-squares (LS) regression.
[Figure: two scatter plots, WEIGHT (kg) versus HEIGHT (cm) and WEIGHT (kg) versus AGE (yrs)]
y = m x + c WEIGHT = slope.HEIGHT + intercept
y = m x + c WEIGHT = slope.AGE + intercept
m and c are estimated by LS regression of the dependent variate y onto the predictor variate x
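As a minimal sketch of the one-predictor-at-a-time approach, the two straight-line fits can be computed with NumPy. The data are the eight children from the table. The helper name `fit_line` is made up for this illustration, and `np.polyfit` is only one of several ways to get the LS estimates.

```python
import numpy as np

# Data from the table of eight children
height = np.array([146, 138, 143, 127, 141, 132, 155, 128], dtype=float)  # cm
age = np.array([9, 10, 9, 8, 9, 10, 12, 7], dtype=float)                  # years
weight = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)          # kg

def fit_line(x, y):
    """Return (m, c) minimising sum((y - (m*x + c))**2). Hypothetical helper."""
    m, c = np.polyfit(x, y, deg=1)  # degree-1 polynomial = straight line
    return m, c

m_h, c_h = fit_line(height, weight)
m_a, c_a = fit_line(age, weight)

print(f"WEIGHT = {m_h:.3f} * HEIGHT + {c_h:.2f}")
print(f"WEIGHT = {m_a:.3f} * AGE + {c_a:.2f}")
```

Each fit uses only one predictor and ignores the other, which is the limitation the next slides address.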
Make use of both predictor variates in modelling the dependent:
• Use multiple linear regression to estimate the m and c values
• Before computers became readily available in the 1980s, this was slow and laborious, even for datasets of this size
• The regression model defines a plane in a 3-d coordinate system
WEIGHT = m1.HEIGHT + m2.AGE + c
[Figure: 3-d plot with axes WEIGHT, HEIGHT and AGE, showing the regression plane]
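The two-predictor model WEIGHT = m1.HEIGHT + m2.AGE + c could be estimated as below. This is a sketch using `np.linalg.lstsq` on the children data; appending a column of ones to carry the intercept c is a choice made for this example.

```python
import numpy as np

height = np.array([146, 138, 143, 127, 141, 132, 155, 128], dtype=float)  # cm
age = np.array([9, 10, 9, 8, 9, 10, 12, 7], dtype=float)                  # years
weight = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)          # kg

# Design matrix: one row per child, columns = HEIGHT, AGE, and 1 (for c)
A = np.column_stack([height, age, np.ones_like(height)])

# Least-squares estimates of m1, m2 and c in one call
coef, *_ = np.linalg.lstsq(A, weight, rcond=None)
m1, m2, c = coef

fitted = A @ coef  # points on the fitted plane
print(f"WEIGHT = {m1:.3f} * HEIGHT + {m2:.3f} * AGE + {c:.2f}")
```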
60 MHz proton NMR spectra from a selection of edible oils
Dependent data: concentrations of oleic acid (C18:1) (by reference method)
Predictor data: peak areas calculated from the spectra
[Figure: NMR spectrum, signal intensity (normalized) versus chemical shift. Labelled peaks: terminal -CH3; omega-3; olefinic -CH=CH-; glyceride backbone -CH2OCOR; bis-allylic =CH-CH2-CH=]
Concentration of oleic acid (C18:1) plotted versus “olefinic” and “bis-allylic” peak areas separately:
[Figure: two scatter plots, concentration (% w/w) versus OLEFINIC PEAK AREA and concentration (% w/w) versus BIS-ALLYLIC PEAK AREA]
Concentration of oleic acid (C18:1) plotted versus “olefinic” and “bis-allylic” peak areas simultaneously:
[Figure: 3-d plot of concentration (% w/w) versus OLEFINIC PEAK AREA and BIS-ALLYLIC PEAK AREA]
Child (‘object’)   Height (cm) (‘predictor’)   Age (years) (‘predictor’)   Wrist circumf. (cm) (‘predictor’)   Weight (kg) (‘dependent’)
Alison             146                         9                           13.2                                29
Bethany            138                         10                          13.9                                35
Chloe              143                         9                           14.1                                31
Denise             127                         8                           12.8                                28
Emma               141                         9                           12.9                                26
Frances            132                         10                          13.6                                37
Grace              155                         12                          14.3                                34
Helen              128                         7                           12.3                                26
• Total of four variates to deal with
• Can’t easily plot on axes - need to leave behind “2-d thinking”
• Need different ways of representing the data graphically
• Need a matrix language to conveniently deal with the dataset
General representation of large, multivariate dataset
            Dependent   Predictors
Object 1    y1          x11   x12   . . .   x1d
Object 2    y2          x21   x22   . . .   x2d
Object 3    y3          x31   x32   . . .   x3d
.           .           .     .             .
.           .           .     .             .
Object n    yn          xn1   xn2   . . .   xnd
Dependent: this could be something like:
◦ A “continuous” variate – e.g. concentration of some chemical component measured by a reference technique
◦ A “category” variate – e.g. species, variety, cultivar, genotype, etc.
Predictors: a set of d attributes of the objects, e.g. infrared absorbance values at d different wavelengths (each row in the data matrix = a spectrum)
Example:
              Dependent          Predictors (>hundreds!)
Sample 1      type ‘A’ coffee    Spectrum 1
Sample 2      type ‘A’ coffee    Spectrum 2
Sample 3      type ‘A’ coffee    Spectrum 3
.             .                  .
.             .                  .
Sample n-1    type ‘B’ coffee    Spectrum n-1
Sample n      type ‘B’ coffee    Spectrum n
In matrix notation:
◦ Dependent (Sample 1 … Sample n) = vector y
◦ Predictors (>hundreds!) (Spectrum 1 … Spectrum n) = matrix X, one spectrum per row
Linear regression using matrix algebra
y = m1.x1 + m2.x2 + ..... + c
Using matrix algebra this becomes:
y = X m (to remove the need for c, X and y are mean-centered)
Least-squares solution for m is:
m = (X^T X)^{-1} X^T y (^T = matrix transpose, ^{-1} = matrix inverse)
Practically impossible to do this before computers, but very easy to do in modern matrix programming languages, e.g. Matlab, SAS, R,…
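A sketch of the matrix solution in NumPy, using the children data as X (height, age) and y (weight). Both X and y are mean-centered here so that no intercept column is needed; recovering c afterwards from the column means is a step added for this illustration.

```python
import numpy as np

X = np.array([[146, 9], [138, 10], [143, 9], [127, 8],
              [141, 9], [132, 10], [155, 12], [128, 7]], dtype=float)  # height, age
y = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)           # weight

# Mean-centre so the model y = X m needs no intercept
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# m = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly
m = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# The intercept for uncentred data follows from the means
c = y_mean - x_mean @ m
print("m =", m, "c =", c)
```

Using `np.linalg.solve` rather than `np.linalg.inv` gives the same m and is generally the more numerically stable choice.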
Challenges in multivariate analysis
Multivariate analysis is a modern discipline, arising over the last 20 years due to advances in computers
Large numbers of objects and huge numbers of variates mean that the data set is…
◦ Difficult to examine graphically
◦ Usually “under-determined” and “high-dimensional”
  – X has fewer rows than columns (number of attributes per object exceeds number of objects)
  – Most columns of X are inter-correlated, sometimes to a large extent
  – (X^T X) is singular and not invertible
This leads to mathematical problems in applying many multivariate modelling methods directly
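The under-determined case can be checked numerically. The sketch below uses made-up random numbers (not real spectra) with more columns than rows and shows that X^T X does not have full rank, so the inverse in the least-squares formula does not exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 20                      # fewer objects (rows) than attributes (columns)
X = rng.normal(size=(n, d))       # illustrative random data only
Xc = X - X.mean(axis=0)           # mean-centre each column

XtX = Xc.T @ Xc                   # d x d matrix
rank = np.linalg.matrix_rank(XtX)

print("shape of X^T X:", XtX.shape)
print("rank of X^T X:", rank, "(full rank would be", d, ")")
```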
Introducing Chemometrics
“Chemometrics” – a family of multivariate statistical methods for treating the large datasets of modern analytical chemistry
Originated in the 1980s when computers first started to become connected to analytical instruments (especially near-infrared)
Some chemometric methods were proposed theoretically several decades earlier, but could not be carried out in practice at the time
Can offer solutions to the difficulties of treating high-dimensional data
Some well-known methods:
Principal component analysis (PCA), Partial least squares (PLS), artificial neural networks, genetic algorithms, discriminant analysis,...and many more... (+ lots of synonyms!)
Same methods have spread throughout the sciences – e.g. psychology, econometrics, meteorology, bioinformatics,...
Data exploration (“Unsupervised analysis”)
◦ Examine a matrix of experimental data (e.g. a large collection of spectra) looking for patterns, groups, etc.
Probably the most widely used chemometric approach for data exploration is Principal Component Analysis (PCA)
[Figure: the 1st PC and 2nd PC axes]
PCA rearranges the variance (= information) in the data set to make it easier to deal with (visualise, display, analyse further, etc.)
◦ Definition using matrix algebra: Z = X . P
◦ The data in X are post-multiplied by loadings P (“weights”) to give a set of scores Z
◦ This can be considered as a data rotation
◦ Loadings are the eigenvectors of the data covariance matrix: X^T X/(n-1) (with X mean-centered)
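A minimal sketch of PCA as defined above: the loadings P are taken as the eigenvectors of X^T X/(n-1) and the scores are Z = X P. The data are made-up random numbers with correlated columns, and sorting the components by decreasing variance is a convention assumed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
# Illustrative data: mixing random columns makes the variates inter-correlated
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)                   # mean-centre each variate

cov = Xc.T @ Xc / (n - 1)                 # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices, ascending order

order = np.argsort(eigvals)[::-1]         # largest variance first
eigvals, P = eigvals[order], eigvecs[:, order]

Z = Xc @ P                                # scores = rotated data
score_cov = Z.T @ Z / (n - 1)             # covariance of the scores

print("variance along each PC:", eigvals)
```

The covariance matrix of the scores comes out diagonal: the scores are uncorrelated and their variances are the eigenvalues.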
Modelling (“Supervised analysis”)
• “Calibration” type problems
– e.g. relate spectral data to sensory data, concentration data, etc.
• “Classification” type problems
– e.g. model the differences between groups of data from different sample types
A method commonly used for these modelling approaches is Partial Least Squares (PLS) Regression
PLS regression (PLSR) is a commonly used supervised method for performing multiple linear regression on high-dimensional data
Regression model: y = X m (but can’t solve this directly if X has d > n)
PLSR writes:
◦ y = Z b = X P b (where Z = X P, the data compression step)
◦ b = (Z^T Z)^{-1} Z^T y (Z^T Z is invertible, whereas X^T X is not)
◦ m = P b (allows us to get a solution for m)
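The three PLSR steps above could be sketched as follows. The slides do not say how P is obtained; this example assumes the standard NIPALS algorithm for a single dependent variate (PLS1). The function name `pls1`, the number of components r and the random data are all choices made for the illustration.

```python
import numpy as np

def pls1(Xc, yc, r):
    """PLS1 by NIPALS on mean-centred Xc (n x d) and yc (n,), keeping r components.

    Returns P (d x r, so that Z = Xc @ P), b (r,) and m = P @ b (d,).
    """
    d = Xc.shape[1]
    W = np.zeros((d, r))    # weight vectors, one per component
    L = np.zeros((d, r))    # X-loadings used to deflate Xc
    Xk = Xc.copy()
    for k in range(r):
        w = Xk.T @ yc                   # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                      # score vector for this component
        load = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, load)     # remove what this component explains
        W[:, k], L[:, k] = w, load
    P = W @ np.linalg.inv(L.T @ W)      # maps the original Xc straight to scores
    Z = Xc @ P
    b = np.linalg.solve(Z.T @ Z, Z.T @ yc)   # small r x r system
    return P, b, P @ b

rng = np.random.default_rng(0)
n, d, r = 10, 40, 3                     # d > n, so X^T X is singular
X = rng.normal(size=(n, d))             # illustrative random data
y = rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()

P, b, m = pls1(Xc, yc, r)
Z = Xc @ P
print("scores:", Z.shape, " regression vector:", m.shape)
```

Because the score columns are mutually orthogonal, Z^T Z is diagonal and trivially invertible even though X^T X is not.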
PCA and PLS – Both are data compression methods
[Diagram: PCA or PLS turns the data matrix X (n samples × d data points) into a matrix of scores Z (n × r) and a matrix of loadings P^T (r × d), with r << d]
Scores:
◦ Z is smaller than X – easier to explore (e.g. by plotting graphically)
◦ “redundancy” removed
◦ may reveal patterns or groups which were not clear in original data
How the scores are obtained (Z = X . P):
The kth spectrum (row k of X) times the 1st and 2nd loadings gives the scores along the 1st and 2nd axes for that spectrum
[Figure: scores plot, Axis 1 versus Axis 2, one point for each spectrum 1 … k … n]
Scores plots:
Each point represents an individual spectrum
Loadings plots:
Show relative importance of each variate
Information on same scale as original spectra
[Figure: two loadings plots, weight versus data point (0 to d)]
After data compression, scores have successively maximized…
◦ “Information content”
(= variance, in PCA)
◦ “Relevant information content”
(= covariance with the dependent variate, in PLS)
Scores are uncorrelated, and there are fewer of them than the original variates
This makes further statistical and graphical analyses easier, using methods that would not otherwise be possible
Data compression: a simple example of its usefulness
Raw data: 60 spectra of edible oils (hazelnut and extra virgin olive)
[Figure: raw spectra, absorbance units (×10^-4) versus wavenumbers (1000 to 3000), marked by oil type: hazelnut, extra virgin olive]
X matrix – [60 x 250] (oils x absorbance values)
y vector – a category “dummy” variate indicating whether a spectrum was from an olive or hazelnut oil
PLS Scores
[Figure: the raw spectra (hazelnut, extra virgin olive) shown alongside their PLS scores]
Scores plots reveal grouping not apparent in raw data
Scores on first axis are enough to almost entirely distinguish the groups
[Figure: 1st PLS loading, weight versus wavenumbers (1000 to 3000)]
Loading indicates regions of the spectrum associated with each group type
Any mathematical model can be “overfit”, by including irrelevant information (noise) in the model
Other potential problems with modelling include:
◦ Lack of generalization ability – i.e., model is unable to extrapolate successfully to new data
◦ Incorrect assumption about nature of model (e.g. is linear modelling appropriate?)
"When Elvis Presley died, there were 48
professional Elvis impersonators. Today, there
are 7328. If that growth is projected, by the
year 2012, one person in four on the face of the
globe will be an Elvis impersonator."
Financial Times, March 1995
[Figure: PREDICTED GROWTH IN "ELVIS" IMPERSONATORS. Number of Elvis impersonators (log scale, 10^1, 10^4 and 10^9 marked) versus YEAR (1977, 1994, 2012)]
The “analyst” had…
◦ Assumed log-linear model
◦ Extrapolated
◦ Incorporated noise into the model
◦ n = 2, not enough data!
[30 x 200] Matrix of “data” assigned to 3 (meaningless) groups
[Figure: the data matrix (data points 0 to 200) is passed to PLS; the resulting PLS model is shown as a scores plot, PLS Score 1 versus PLS Score 2]
Partial Least Squares Regression attempts to model the difference between “groups”
n = 30 here; but overfitting occurs easily because n << d
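The effect can be reproduced with pure noise. The sketch below is a simplification of the slide's example: it uses two meaningless groups rather than three, and only the first PLS component (weight vector proportional to X^T y). All data are random numbers generated for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 30, 200
X = rng.normal(size=(n, d))              # pure noise standing in for "spectra"
y = np.repeat([-1.0, 1.0], n // 2)       # meaningless group labels

x_mean = X.mean(axis=0)
Xc, yc = X - x_mean, y - y.mean()

# First PLS component: the direction in X with maximum covariance with y
w = Xc.T @ yc
w /= np.linalg.norm(w)
t_train = Xc @ w                         # scores of the training samples
train_corr = np.corrcoef(t_train, yc)[0, 1]

# Fresh noise with equally meaningless labels, projected with the same w
n_new = 300
X_new = rng.normal(size=(n_new, d))
y_new = np.repeat([-1.0, 1.0], n_new // 2)
t_new = (X_new - x_mean) @ w
test_corr = np.corrcoef(t_new, y_new)[0, 1]

print(f"correlation of score with labels, training data: {train_corr:.2f}")
print(f"correlation of score with labels, new data:      {test_corr:.2f}")
```

The training scores follow the labels closely although there is nothing to find; the same projection applied to new data does not, which is the signature of overfitting.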
Any multivariate model can be overfit, but this is especially likely when d >> n
◦ This is the case for virtually all spectroscopy techniques, which measure large numbers of properties (absorbance values, counts, etc.) on each sample
The only way to be confident that a model is not overfit is to perform some kind of model validation
◦ Use one of various cross-validation schemes
◦ Use permutation tests
◦ Apply model to totally independent test samples
Detecting overfitting in a real experimental dataset
Raman Spectra from Apple skins
[Figure: two PLS scores plots, second PLS score versus first PLS score. One is labelled by the real groups (Packaging, No packaging), the other by Random group 1 and Random group 2]
PLS is repeated with a randomly scrambled y-vector. This shows that any random group assignment would have produced the same outcome – so the finding is NOT significant
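A permutation test of this kind could be sketched as below. The statistic (absolute correlation between the first PLS score and the group labels), the number of permutations and the function names are assumptions made for the illustration; the data are random numbers, not the apple spectra.

```python
import numpy as np

def first_score_corr(Xc, y):
    """|correlation| between the first PLS score and y (hypothetical statistic)."""
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    return abs(np.corrcoef(Xc @ w, yc)[0, 1])

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Compare the statistic for the real y with randomly scrambled copies of y."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    observed = first_score_corr(Xc, y)
    null = np.array([first_score_corr(Xc, rng.permutation(y))
                     for _ in range(n_perm)])
    # Fraction of scrambled results at least as good as the real one
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))           # illustrative noise data
y = np.repeat([0.0, 1.0], 15)            # two arbitrary groups
observed, null, p = permutation_p_value(X, y)
print(f"observed = {observed:.2f}, typical scrambled = {np.median(null):.2f}, p = {p:.3f}")
```

If scrambled labels do about as well as the real ones, as on the slide, p is large and the apparent grouping cannot be called significant.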
Many standard software tools for multivariate analysis provide only limited validation, allowing multivariate methods to be misused
◦ Particularly true for the software supplied as standard with analytical instrumentation
Solution: use of bespoke software (coding in statistical and matrix languages)
The R project for Statistical Computing
[Diagram: validation scheme. The observations (1 … n) in the DATA MATRIX belong to group 1 or group 2. The observations within each group are randomly permuted and divided into a training part and a test part. Training (grp 1) and training (grp 2) are used for model development; test (grp 1) and test (grp 2) are used for model validation. Iterate.]
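The scheme in the diagram could be sketched as a generator of repeated random splits, permuting the observations within each group separately. The function name `stratified_splits`, the test fraction and the number of iterations are assumptions made for this illustration.

```python
import numpy as np

def stratified_splits(groups, n_iter=100, test_fraction=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs, splitting every group separately."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    for _ in range(n_iter):
        train, test = [], []
        for g in np.unique(groups):
            # Random permutation of the observations in this group
            idx = rng.permutation(np.flatnonzero(groups == g))
            n_test = max(1, int(round(test_fraction * idx.size)))
            test.extend(idx[:n_test])
            train.extend(idx[n_test:])
        yield np.array(train), np.array(test)

groups = np.array([1] * 12 + [2] * 8)    # illustrative group labels
for train_idx, test_idx in stratified_splits(groups, n_iter=3):
    # Model development would use rows train_idx, model validation rows test_idx
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Averaging the test-set performance over many iterations gives an estimate that does not depend on one lucky or unlucky split.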
Multivariate = more than one predictor variate
Data from modern analytical techniques is usually highly multivariate, with large d (number of attributes) and relatively smaller n (number of samples)
Data compression methods like PCA and PLS are very useful, especially for graphical representation
Overfitting is a real possibility - naive use of multivariate statistics can lead to misleading results
Best practice requires the use of a suitable model validation technique
Matrix language software packages are the best platforms for chemometric analysis
Software:
Matlab Student License/Trial license, & “Getting Started” Manual
http://www.mathworks.co.uk/academia/student_version/
http://www.mathworks.co.uk/help/pdf_doc/matlab/getstart.pdf
Books:
“Multivariate Calibration” by Harald Martens & Tormod Næs
“Principles of Multivariate Analysis: A User's Perspective” by Wojtek J. Krzanowski