Dr Kate Kemsley Analytical Sciences Unit Institute of Food ...

Dr Kate Kemsley Analytical Sciences Unit

Institute of Food Research [email protected]

What is multivariate data?
◦ How to represent data as matrices and vectors

Introducing Chemometrics
◦ The most commonly used methods for treating analytical chemistry data

Avoiding “Overfitting”
◦ How the incorrect use of these statistical methods can lead to nonsense results

Multivariate data: an experimental dataset that contains measurements of multiple attributes made on a collection of objects (people, samples, etc.)

Measurements of 3 different “attributes” of 8 “objects” (=children)

Suppose we want to investigate what factors influence a child’s weight – this is the “dependent” variate

The remaining variates are “predictors”

Child (‘object’)             Alison  Bethany  Chloe  Denise  Emma  Frances  Grace  Helen
Height (cm) (‘predictor’)    146     138      143    127     141   132      155    128
Age (years) (‘predictor’)    9       10       9      8       9     10       12     7
Weight (kg) (‘dependent’)    29      35       31     28      26    37       34     26

Traditional statistical methods treat one predictor at a time; a well-known example is linear least-squares (LS) regression

[Figure: two scatter plots. Left: WEIGHT (kg) versus HEIGHT (cm). Right: WEIGHT (kg) versus AGE (yrs).]

y = m x + c, i.e. WEIGHT = slope.HEIGHT + intercept

y = m x + c, i.e. WEIGHT = slope.AGE + intercept

m and c are estimated by LS regression of the dependent variate y onto the predictor variate x
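As a minimal sketch of the one-predictor-at-a-time approach, the closed-form LS estimates of m and c can be computed for the children's data from the table. The function name `ls_line` is an illustrative choice, not something taken from the slides.

```python
import numpy as np

# Data from the table of eight children
height = np.array([146, 138, 143, 127, 141, 132, 155, 128], dtype=float)  # cm
age = np.array([9, 10, 9, 8, 9, 10, 12, 7], dtype=float)                  # years
weight = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)          # kg


def ls_line(x, y):
    """Return (m, c) that minimise sum((y - (m*x + c))**2)."""
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c


# One regression per predictor, each ignoring the other predictor
m_height, c_height = ls_line(height, weight)
m_age, c_age = ls_line(age, weight)
print(f"WEIGHT = {m_height:.3f} * HEIGHT + {c_height:.2f}")
print(f"WEIGHT = {m_age:.3f} * AGE + {c_age:.2f}")
```

Each fitted line passes through the mean point of its own scatter plot; neither model knows anything about the other predictor.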

Make use of both predictor variates in modelling the dependent:

• Use multiple linear regression to estimate the m and c values

• Before computers became readily available in the 1980s, this was slow and laborious, even for datasets of this size

• The regression model defines a plane in a 3-d coordinate system

WEIGHT = m1.HEIGHT + m2.AGE + c
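The two-predictor model can be sketched with a general least-squares solver. This is an illustration under the assumption that NumPy is available; the design-matrix name `A` is a hypothetical choice, and the column of ones is one common way of estimating the intercept c.

```python
import numpy as np

height = np.array([146, 138, 143, 127, 141, 132, 155, 128], dtype=float)  # cm
age = np.array([9, 10, 9, 8, 9, 10, 12, 7], dtype=float)                  # years
weight = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)          # kg

# Design matrix: one column per predictor, plus a column of ones for the intercept c
A = np.column_stack([height, age, np.ones_like(height)])

# Least-squares solution of A @ [m1, m2, c] ~ weight
coeffs, _, rank, _ = np.linalg.lstsq(A, weight, rcond=None)
m1, m2, c = coeffs

# Points on the fitted plane WEIGHT = m1.HEIGHT + m2.AGE + c
fitted = A @ coeffs
print(f"WEIGHT = {m1:.3f} * HEIGHT + {m2:.3f} * AGE + {c:.2f}")
```

The residuals `weight - fitted` are the vertical distances of each child from the fitted plane.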

[Figure: 3-d plot with axes HEIGHT, AGE and WEIGHT]

60MHz proton NMR spectra from a selection of edible oils

Dependent data: concentrations of oleic acid (C18:1) (by reference method)

Predictor data: peak areas calculated from the spectra

[Figure: NMR spectrum, signal intensity (normalized) versus chemical shift, with labelled peaks: terminal -CH3; omega-3; olefinic -CH=CH-; glyceride backbone -CH2OCOR; bis-allylic =CH-CH2-CH=]

Concentration of oleic acid (C18:1) plotted versus “olefinic” and “bis-allylic” peak areas separately:

[Figure: two scatter plots. Left: concentration (% w/w) versus olefinic peak area. Right: concentration (% w/w) versus bis-allylic peak area.]

Concentration of oleic acid (C18:1) plotted versus “olefinic” and “bis-allylic” peak areas simultaneously:

[Figure: 3-d scatter plot of concentration (% w/w) versus olefinic peak area and bis-allylic peak area]

Child (‘object’)                   Alison  Bethany  Chloe  Denise  Emma  Frances  Grace  Helen
Height (cm) (‘predictor’)          146     138      143    127     141   132      155    128
Age (years) (‘predictor’)          9       10       9      8       9     10       12     7
Wrist circumf. (cm) (‘predictor’)  13.2    13.9     14.1   12.8    12.9  13.6     14.3   12.3
Weight (kg) (‘dependent’)          29      35       31     28      26    37       34     26

• Total of four variates to deal with

• Can’t easily plot on axes: need to leave behind “2-d thinking”

• Need different ways of representing the data graphically

• Need a matrix language to conveniently deal with the dataset

General representation of large, multivariate dataset

            Dependent   Predictors
Object 1    y1          x11  x12  . . .  x1d
Object 2    y2          x21  x22  . . .  x2d
Object 3    y3          x31  x32  . . .  x3d
.           .           .    .           .
.           .           .    .           .
Object n    yn          xn1  xn2  . . .  xnd

Dependent: this could be something like:

◦ A “continuous” variate, e.g. concentration of some chemical component measured by a reference technique

◦ A “category” variate, e.g. species, variety, cultivar, genotype, etc.

Predictors: a set of d attributes of the objects, e.g. infrared absorbance values at d different wavelengths (each row in the data matrix = a spectrum)
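As a small illustration of this layout, the children's dataset (with the wrist measurement included) can be held as a predictor matrix and a dependent vector. The variable names are illustrative assumptions; only the numbers come from the table.

```python
import numpy as np

names = ["Alison", "Bethany", "Chloe", "Denise", "Emma", "Frances", "Grace", "Helen"]
height = [146, 138, 143, 127, 141, 132, 155, 128]            # cm
age = [9, 10, 9, 8, 9, 10, 12, 7]                            # years
wrist = [13.2, 13.9, 14.1, 12.8, 12.9, 13.6, 14.3, 12.3]     # cm
weight = [29, 35, 31, 28, 26, 37, 34, 26]                    # kg

# Predictor matrix X: n rows (objects) by d columns (attributes)
X = np.column_stack([height, age, wrist]).astype(float)

# Dependent vector y: one value per object, in the same row order as X
y = np.array(weight, dtype=float)

n, d = X.shape
print(f"n = {n} objects, d = {d} predictors")
print(names[0], X[0], y[0])   # row 0 of X holds every predictor measured on the first child
```

For spectral data the same structure applies, with each row of X holding one spectrum and d typically in the hundreds.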


              Dependent          Predictors (>hundreds!)
Sample 1      type ‘A’ coffee    Spectrum 1
Sample 2      type ‘A’ coffee    Spectrum 2
Sample 3      type ‘A’ coffee    Spectrum 3
.             .                  .
.             .                  .
Sample n-1    type ‘B’ coffee    Spectrum n-1
Sample n      type ‘B’ coffee    Spectrum n

In matrix notation, the dependent column is the vector y, and the spectra (one per row) form the matrix X



Linear regression using matrix algebra

y = m1.x1 + m2.x2 + ..... + c

Using matrix algebra this becomes:

y = X m (to remove the need for c, X and y are mean-centered)

Least-squares solution for m is:

m = (X^T X)^{-1} X^T y (T = matrix transpose, -1 = matrix inverse)

Practically impossible to do this before computers, but very easy to do in modern matrix programming languages, e.g. Matlab, SAS, R,…
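A minimal sketch of the matrix solution, written in Python with NumPy rather than the languages named on the slide, and applied to the children's data as an assumed example. The system (X^T X) m = X^T y is solved directly instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

X_raw = np.column_stack([
    [146, 138, 143, 127, 141, 132, 155, 128],            # height (cm)
    [9, 10, 9, 8, 9, 10, 12, 7],                         # age (years)
    [13.2, 13.9, 14.1, 12.8, 12.9, 13.6, 14.3, 12.3],    # wrist circumference (cm)
]).astype(float)
y_raw = np.array([29, 35, 31, 28, 26, 37, 34, 26], dtype=float)  # weight (kg)

# Mean-centre X and y so that no intercept term is needed
x_mean, y_mean = X_raw.mean(axis=0), y_raw.mean()
X, y = X_raw - x_mean, y_raw - y_mean

# m = (X^T X)^{-1} X^T y, obtained by solving (X^T X) m = X^T y
m = np.linalg.solve(X.T @ X, X.T @ y)

# The intercept for the uncentred variables can be recovered afterwards
c = y_mean - x_mean @ m
print("m =", m, " c =", c)
```

This only works here because n = 8 exceeds d = 3 and the columns of X are not exactly collinear, so X^T X is invertible.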

Challenges in multivariate analysis

Multivariate analysis is a modern discipline, arising over the last 20 years due to advances in computers

Large numbers of objects and huge numbers of variates mean that it is…

◦ Difficult to examine the data set graphically

◦ Usually “under-determined” and “high-dimensional”

  • X has fewer rows than columns (number of attributes per object exceeds number of objects)

  • Most columns of X are inter-correlated, sometimes to a large extent

  • (X^T X) is singular and not invertible

This leads to mathematical problems in applying many multivariate modelling methods directly
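The singularity problem can be checked numerically. The sketch below uses simulated data with assumed sizes (10 objects, 50 attributes); any matrix with fewer rows than columns behaves the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                       # fewer objects than attributes (assumed sizes)
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)              # mean-centre each column

XtX = X.T @ X                       # d x d matrix, but built from only n rows
rank = np.linalg.matrix_rank(XtX)
condition_number = np.linalg.cond(XtX)

print(f"X^T X is {d} x {d} but its rank is only {rank}")
print(f"condition number: {condition_number:.3g}")
```

The rank of X^T X can never exceed n, so with d > n the matrix has no inverse and the least-squares formula cannot be applied directly.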

Introducing Chemometrics

“Chemometrics”: a family of multivariate statistical methods for treating the large datasets of modern analytical chemistry

Originated in the 1980s, when computers first started to become connected to analytical instruments (especially near-infrared)

Some chemometric methods were proposed theoretically several decades earlier, but could not be carried out at the time

Can offer solutions to the difficulties of treating high-dimensional data

Some well-known methods:

Principal component analysis (PCA), Partial least squares (PLS), artificial neural networks, genetic algorithms, discriminant analysis,...and many more... (+ lots of synonyms!)

The same methods have spread throughout the sciences, e.g. psychology, econometrics, meteorology, bioinformatics,...

Data exploration (“Unsupervised analysis”)

◦ Examine a matrix of experimental data (e.g. a large collection of spectra) looking for patterns, groups, etc.

Probably the most widely used chemometric approach for data exploration is Principal Component Analysis (PCA)

[Figure: data points with the 1st PC and 2nd PC axes marked]

PCA rearranges the variance (= information) in the data set to make it easier to deal with (visualise, display, analyse further, etc.)

◦ Definition using matrix algebra:

Z = X . P

◦ The data in X are post-multiplied by loadings P (“weights”) to give a set of scores Z

◦ This can be considered as a data rotation

◦ Loadings are the eigenvectors of the data covariance matrix (X mean-centered):

X^T X / (n-1)
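The definition above can be sketched directly: mean-centre X, form the covariance matrix, take its eigenvectors as loadings and post-multiply. The function name `pca` and the simulated data are illustrative assumptions.

```python
import numpy as np


def pca(X_raw, r):
    """Return scores Z (n x r), loadings P (d x r) and the variances of the first r PCs."""
    X = X_raw - X_raw.mean(axis=0)            # mean-centre each variate
    n = X.shape[0]
    cov = X.T @ X / (n - 1)                   # d x d data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]     # indices of the r largest eigenvalues
    P = eigvecs[:, order]                     # loadings = leading eigenvectors
    Z = X @ P                                 # scores: Z = X . P
    return Z, P, eigvals[order]


rng = np.random.default_rng(1)
# Simulated data: 20 objects, 5 inter-correlated variates driven by 2 hidden factors
latent = rng.normal(size=(20, 2))
X_raw = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(20, 5))

Z, P, variances = pca(X_raw, r=2)
print("variance along each PC:", variances)
```

Because the columns of P are orthonormal, multiplying by P only rotates the data, and the variance of each score column equals the corresponding eigenvalue.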

“Supervised analysis” covers two types of modelling problem:

• “Calibration” type problems
  e.g. relate spectral data to sensory data, concentration data, etc.

• “Classification” type problems
  e.g. model the differences between groups of data from different sample types

A method commonly used for these modelling approaches is Partial Least Squares (PLS) Regression

PLS regression (PLSR) is a commonly used supervised method for performing multiple linear regression on high-dimensional data

Regression model: y = X m (but can’t solve this directly if X has d > n)

PLSR writes:

◦ y = Z b = X P b (where Z = X P, the data compression step)

◦ b = (Z^T Z)^{-1} Z^T y (Z^T Z is invertible, whereas X^T X is not)

◦ m = P b (allows us to get a solution for m)
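The compression-then-regression idea can be sketched with a NIPALS-style PLS algorithm for a single dependent variate. This is one common formulation, not necessarily the one used for the slides; in it, the matrix that maps X to the scores (called P on the slide) is the hypothetical variable `proj`, and the names `pls1`, `W` and `q` are likewise illustrative.

```python
import numpy as np


def pls1(X, y, r):
    """PLS regression sketch for one dependent variate.

    X (n x d) and y (n,) must already be mean-centred.
    Returns the regression vector m (d,) and the scores Z (n x r).
    """
    n, d = X.shape
    W = np.zeros((d, r))    # weight vectors
    L = np.zeros((d, r))    # X loadings used for deflation
    q = np.zeros(r)         # regression of y on each score (the vector b on the slide)
    Z = np.zeros((n, r))    # scores
    Xk, yk = X.copy(), y.copy()
    for k in range(r):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # direction of maximum covariance with y
        t = Xk @ w                      # score vector for this component
        tt = t @ t
        p = Xk.T @ t / tt
        q[k] = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)        # deflate: remove what this component explains
        yk = yk - q[k] * t
        W[:, k], L[:, k], Z[:, k] = w, p, t
    proj = W @ np.linalg.inv(L.T @ W)   # Z = X @ proj (the compression step)
    m = proj @ q                        # m = P b in the slide's notation
    return m, Z


rng = np.random.default_rng(2)
n, d, r = 15, 100, 3                    # d > n, so X^T X is not invertible
X = rng.normal(size=(n, d))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)
X, y = X - X.mean(axis=0), y - y.mean()

m, Z = pls1(X, y, r)
print("m has", m.shape[0], "coefficients, estimated from", n, "objects")
```

The scores produced this way are mutually orthogonal, so Z^T Z is diagonal and trivially invertible even though X^T X is singular.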

PCA and PLS: both are data compression methods

[Diagram: PCA or PLS decomposes the data matrix X (n samples x d data points) into a matrix of scores Z (n x r) and a matrix of loadings P^T (r x d)]

Scores:

◦ Z is smaller than X, so easier to explore (e.g. by plotting graphically)

◦ “redundancy” removed

◦ may reveal patterns or groups which were not clear in the original data

◦ r << d

Z = X . P

The kth spectrum times the 1st and 2nd loading gives the scores along the 1st and 2nd axes for that spectrum

[Figure: scores plot, Axis 2 versus Axis 1, with the point for spectrum k marked]

Scores plots:

◦ Each point represents an individual spectrum

Loadings plots:

◦ Show relative importance of each variate

◦ Information on same scale as original spectra

[Figure: two loadings plots, weight versus data point (0 to d)]

After data compression, scores have successively maximized…

◦ “Information content” (= variance, in PCA)

◦ “Relevant information content” (= covariance with the dependent variate, in PLS)

Scores are uncorrelated, and there are fewer of them than the original variates

This makes further statistical and graphical analyses easier, using methods that would not otherwise be possible
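These properties can be verified numerically for PCA. The sketch below uses simulated, strongly inter-correlated data with assumed dimensions of 60 x 250, and computes the loadings through the singular value decomposition, a standard alternative to the eigenvector route.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 60, 250, 3
# Simulated "spectra": 250 variates driven by 4 hidden factors of decreasing strength
latent = rng.normal(size=(n, 4)) * np.array([5.0, 3.0, 1.0, 0.5])
X = latent @ rng.normal(size=(4, d)) + 0.1 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)                  # mean-centre

# PCA through the SVD: X = U S V^T, loadings are the leading rows of V^T
U, S, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:r].T                            # loadings, d x r
Z = X @ P                               # scores, n x r (r << d)

score_cov = Z.T @ Z / (n - 1)           # covariance matrix of the scores
explained = S[:r] ** 2 / np.sum(S ** 2) # fraction of total variance per component

print("score covariance matrix:\n", np.round(score_cov, 6))
print("fraction of variance in each of the first", r, "scores:", explained)
```

The off-diagonal entries of `score_cov` are zero to rounding error (uncorrelated scores), and the diagonal entries decrease from the first score to the last.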

Data compression: a simple example of its usefulness

Raw data: 60 spectra of edible oils

[Figure: raw spectra, absorbance units (x 10^-4) versus wavenumbers (1000 to 3000); legend: Hazelnut, Extra virgin Olive]

X matrix: [60 x 250] (oils x absorbance values)

y vector: a category “dummy” variate indicating whether a spectrum was from an olive or hazelnut oil

[Figure: the raw spectra (absorbance units versus wavenumbers; legend: Hazelnut, Extra virgin Olive), a plot of the PLS scores, and the 1st PLS loading (weight versus wavenumbers)]

Scores plots reveal grouping not apparent in raw data

Scores on first axis are enough to almost entirely distinguish the groups

Loading indicates regions of the spectrum associated with each group type

Any mathematical model can be “overfit”, by including irrelevant information (noise) in the model

Other potential problems with modelling include:

◦ Lack of generalization ability, i.e. the model is unable to extrapolate successfully to new data

◦ Incorrect assumption about the nature of the model (e.g. is linear modelling appropriate?)

"When Elvis Presley died, there were 48 professional Elvis impersonators. Today, there are 7328. If that growth is projected, by the year 2012, one person in four on the face of the globe will be an Elvis impersonator."

Financial Times, March 1995

[Figure: PREDICTED GROWTH IN "ELVIS" IMPERSONATORS. Number of Elvis impersonators (log scale) versus YEAR (1977, 1994, 2012)]

The “analyst” had…

◦ Assumed a log-linear model

◦ Extrapolated

◦ Incorporated noise into the model

◦ n = 2, not enough data!

[30 x 200] matrix of “data” assigned to 3 (meaningless) groups

Partial Least Squares (PLS) Regression attempts to model the difference between “groups”

[Figure: the data matrix (data points 0 to 200) and the resulting PLS model scores plot, PLS Score 2 versus PLS Score 1]

n = 30 here; but overfitting occurs easily because n << d
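The ease of overfitting when n << d can be reproduced with pure noise. The sketch below does not use PLS; as a simpler stand-in it fits the minimum-norm least-squares solution to a dummy matrix of group memberships. The sizes match the slide, while the variable names and the size of the fresh test set are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_groups = 30, 200, 3

X = rng.normal(size=(n, d))                               # pure noise "data"
groups = np.repeat(np.arange(n_groups), n // n_groups)    # meaningless group labels
Y = np.eye(n_groups)[groups]                              # one dummy column per group

# With d > n, least squares can reproduce ANY Y exactly (minimum-norm solution)
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

train_pred = np.argmax(X @ M, axis=1)
train_accuracy = np.mean(train_pred == groups)

# Fresh noise with equally meaningless labels: the "model" has learned nothing
X_new = rng.normal(size=(3000, d))
groups_new = rng.integers(0, n_groups, size=3000)
test_accuracy = np.mean(np.argmax(X_new @ M, axis=1) == groups_new)

print(f"accuracy on the data used for fitting: {train_accuracy:.2f}")
print(f"accuracy on new data:                  {test_accuracy:.2f}")
```

The fit to the original 30 objects is perfect, while performance on new data is close to the one-in-three expected by chance.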

Any multivariate model can be overfit, but this is especially likely when d >> n

◦ This is the case for virtually all spectroscopy techniques, which measure large numbers of properties (absorbance values, counts, etc.) on each sample

The only way to be confident that a model is not overfit is to perform some kind of model validation

◦ Use one of various cross-validation schemes

◦ Use permutation tests

◦ Apply model to totally independent test samples
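One of the simplest cross-validation schemes, k-fold, can be sketched as follows. To keep the example short, a nearest-centroid classifier stands in for the PLS model; the function names, the simulated data and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np


def nearest_centroid_predict(X_train, labels_train, X_test):
    """Assign each test row to the group whose training mean is closest."""
    classes = np.unique(labels_train)
    centroids = np.array([X_train[labels_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]


def cross_val_accuracy(X, labels, n_folds, rng):
    """k-fold cross-validation: each object is predicted by a model that never saw it."""
    order = rng.permutation(len(labels))
    folds = np.array_split(order, n_folds)
    correct = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        pred = nearest_centroid_predict(X[train_idx], labels[train_idx], X[test_idx])
        correct += np.sum(pred == labels[test_idx])
    return correct / len(labels)


rng = np.random.default_rng(5)
n, d = 40, 200
labels = np.repeat([0, 1], n // 2)
X_signal = rng.normal(size=(n, d)) + 1.0 * labels[:, None]   # a real group difference
X_noise = rng.normal(size=(n, d))                            # no group difference at all

acc_signal = cross_val_accuracy(X_signal, labels, n_folds=5, rng=rng)
acc_noise = cross_val_accuracy(X_noise, labels, n_folds=5, rng=rng)
print(f"cross-validated accuracy, real difference: {acc_signal:.2f}")
print(f"cross-validated accuracy, pure noise:      {acc_noise:.2f}")
```

The key design point is that the model is refitted inside the loop, so the held-out objects play no part in building the model that predicts them.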

Detecting overfitting in a real experimental dataset

Raman spectra from apple skins

[Figure: two PLS scores plots, second PLS score versus first PLS score. One shows the groups “Packaging” and “No packaging”; the other shows “Random group 1” and “Random group 2”.]

PLS is repeated with a randomly scrambled y-vector. This shows that any random group assignment would have produced the same outcome, so the finding is NOT significant.
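The scrambled-y check is a permutation test, and can be sketched as below. The statistic used here, the separation of two groups along the first PLS score evaluated on the fitting data, is an assumed choice for illustration; the data are simulated noise, not the apple-skin spectra.

```python
import numpy as np


def first_pls_separation(X, y):
    """Group separation along the first PLS score, evaluated on the fitting data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)          # first PLS weight vector
    z = Xc @ w                      # first PLS score
    z1, z0 = z[y == 1], z[y == 0]
    pooled_sd = np.sqrt((z1.var(ddof=1) + z0.var(ddof=1)) / 2)
    return (z1.mean() - z0.mean()) / pooled_sd


def permutation_test(X, y, n_perm, rng):
    """Compare the observed statistic with its distribution under scrambled labels."""
    observed = first_pls_separation(X, y)
    null = np.array([first_pls_separation(X, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value


rng = np.random.default_rng(6)
n, d = 30, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))         # simulated data with no real group difference

observed, null, p_value = permutation_test(X, y, n_perm=200, rng=rng)
print(f"observed separation: {observed:.2f} pooled standard deviations")
print(f"smallest separation with scrambled labels: {null.min():.2f}")
print(f"permutation p-value: {p_value:.3f}")
```

Scrambled labels produce a separation of similar size to the "real" labels, which is the signature of an overfit model; a genuine group difference would give an observed statistic well beyond the scrambled-label values.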

Many standard software tools for multivariate analysis provide only limited validation, allowing multivariate methods to be misused

◦ Particularly true for the software supplied as standard with analytical instrumentation

Solution: use of bespoke software (coding in statistical and matrix languages)

The R project for Statistical Computing

[Diagram: iterated validation scheme. The observations (1 to n) in the data matrix belong to group 1 or group 2. Within each group the observations are randomly permuted and split into training and test sets. The training sets (grp 1 and grp 2) are used for model development and the test sets (grp 1 and grp 2) for model validation. The procedure is iterated.]

Multivariate = more than one predictor variate

Data from modern analytical techniques is usually highly multivariate, with large d (number of attributes) and relatively smaller n (number of samples)

Data compression methods like PCA and PLS are very useful, especially for graphical representation

Overfitting is a real possibility - naive use of multivariate statistics can lead to misleading results

Best practice requires the use of a suitable model validation technique

Matrix language software packages are the best platforms for chemometric analysis

Software:

Matlab Student License/Trial license, & “Getting Started” Manual

http://www.mathworks.co.uk/academia/student_version/

http://www.mathworks.co.uk/help/pdf_doc/matlab/getstart.pdf

Books:

“Multivariate Calibration” by Harald Martens & Tormod Næs

“Principles of Multivariate Analysis: A User's Perspective” by Wojtek J. Krzanowski


Date post:	18-Dec-2021