
Date post: 22-Dec-2015
Multiway Data Analysis Johan Westerhuis Biosystems Data Analysis Swammerdam Institute for Life Sciences Universiteit van Amsterdam
Transcript
Page 1: Multiway Data Analysis Johan Westerhuis Biosystems Data Analysis Swammerdam Institute for Life Sciences Universiteit van Amsterdam.

Multiway Data Analysis

Johan Westerhuis

Biosystems Data Analysis

Swammerdam Institute for Life Sciences

Universiteit van Amsterdam

Page 2:

The “future” science faculty of the Universiteit van Amsterdam

Page 3:

The Biosystems Data Analysis group officially started in 2004 as a follow-up of the process analysis group at the Universiteit van Amsterdam. Its aims are: developing and validating new data analysis methods for summarizing and visualizing complex structured biological data (Metabolomics / Proteomics).

Page 4:

Three-way Data

Three-way Models

Three-way Applications

Page 5:

Three-way Data

Page 6:

Three-way data

Three-way data is a set of two-way matrices of the same objects and variables.

IR, Raman, NMR spectra of the same samples will not give a three-way data set, but a multi-block data set.

IR Raman NMR

Page 7:

Examples of three-way data

Batch process: Batches x Time x Process variables

Fluorescence: Samples x Emission x Excitation

Sensory analysis: Products x Judges x Attributes

Chromatography: Samples x UV x Chromatogram

Image analysis: Image x RGB x Image

Page 8:

From no-way to multi-way

[Figure: arrays of increasing order: scalar (1 x 1 x 1), 1-way (I), 2-way (I x J), 3-way (I x J x K), 4-way (I x J x K x L), 5-way (I x J x K x L x M)]

Page 9:

Slabs and tubes

Vertical slab

Horizontal slab

Vertical tube

Horizontal tube

Lateral tube

Frontal slab
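
As a minimal sketch of what slabs and tubes mean in practice, the NumPy fragment below slices a hypothetical I x J x K array. Fixing one index gives a slab (a matrix); fixing two indices gives a tube (a vector). Which axis is called "horizontal", "vertical", "lateral" or "frontal" is an assumption made here for illustration and may differ from the drawing on the slide.

```python
import numpy as np

# Hypothetical three-way array with I = 5, J = 60, K = 200
I, J, K = 5, 60, 200
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# Slabs: fix one index, keep a two-way matrix.
horizontal_slab = X[0, :, :]   # J x K, first mode fixed
vertical_slab = X[:, 0, :]     # I x K, second mode fixed
frontal_slab = X[:, :, 0]      # I x J, third mode fixed

# Tubes: fix two indices, keep a one-way vector.
vertical_tube = X[:, 0, 0]     # length I
horizontal_tube = X[0, :, 0]   # length J
lateral_tube = X[0, 0, :]      # length K
```

Each slab is an ordinary two-way matrix, so any two-way method can be applied to it on its own; the three-way models discussed later use all slabs at once.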

Page 10:

Three slabs of fluorescence data

5 Samples x 60 Excitation x 200 Emission

[Figure: fluorescence data array with axes Samples, Emission and Excitation]

Page 11:

Three-way batch process data

‘Engineering’ process data, e.g. temperature, pressure, flow rate

Spectroscopic process data, e.g. NIR, Raman, UV-Vis

One batch: X (J x K)

A series of batches: X (I x J x K)

[Figure: a single batch as a process variable x time matrix, and a series of batches as a batch x process variable x time array]

Page 12:

SBR batch process data: Engineering variables

[Figure: nine trajectories over time (0 to 200): Flow S, Flow B, Temp Feed, T React, T Cool, T Jacket, Density, Conversion, E Release]

Page 13:

Spectroscopic three-way batch data

Two batch runs of a reaction followed with UV-Vis spectroscopy for 45 minutes

Page 14:

Batch Fermentation in two steps: Three-way multiblock

[Figure: two three-way blocks, each Batches x Variables x Time, one for the Inoculum step and one for the Fermentation step, together with an API block]

Page 15:

Four-way data in combinatorial catalysis

[Figure: Composition x Conditions layouts contrasting “What we want” with “What we measure”]

Page 16:

Multiway data from the Omics age

Gene expression x Experiments x Time

Metabolites x Experiments x Time

Page 17:

Three-way Models

Page 18:

Some history

M.C. Escher: Small problem with orthogonality

Page 19:

More history

Psychometrics (1944-1980)

Cattell 1944: Parallel Proportional Profiles (common factors fitted simultaneously to many data matrices).

Tucker 1964: Tucker models

Carroll & Chang 1970: Canonical Decomposition (CANDECOMP)

Harshman 1970: Parallel Factor Analysis (PARAFAC)

Chemistry

Ho 1978: Rank Annihilation (close to Parafac) on fluorescence data.

Late 1980s and early 1990s: Three-way methods to resolve LC-UV data.

Page 20:

Multiway PCA: Unfolding of three-way data

[Figure: the I x J x K array unfolded in two ways: to an IK x J matrix (Wold) and to an I x JK matrix (MacGregor)]

Page 21:

Two ways of unfolding: Different assumptions in MSPC

Wold: Nonlinear behavior in the data. Batch trajectories are monitored. On-line monitoring.

MacGregor: Nonlinearities removed. Whole batch is considered a measurement. Off-line monitoring.
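
The two unfoldings can be sketched with NumPy reshapes. The array sizes and random data below are hypothetical, and the axis order (batch, variable, time) is an assumption chosen for the illustration.

```python
import numpy as np

I, J, K = 4, 3, 5   # batches, process variables, time points (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(I, J, K))

# Batch-wise unfolding (I x JK): each row holds one complete batch.
X_batch = X.reshape(I, J * K)

# Variable-wise unfolding (IK x J): each row is one time point of one batch.
X_var = X.transpose(0, 2, 1).reshape(I * K, J)
```

In the I x JK form a whole batch is one observation, which matches off-line monitoring; in the IK x J form every time point is an observation, which matches monitoring of the trajectory.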

Page 22:

Extension of SVD to Parafac

[Figure: the SVD of a matrix drawn as a sum of rank-one terms, and the Parafac model of a three-way array, with loadings A, B, C and core G, drawn as a sum of triads]

SVD: $X = U S V^T = s_1 u_1 v_1^T + s_2 u_2 v_2^T$

Parafac: $\underline{X} = a_1 \circ b_1 \circ c_1 + a_2 \circ b_2 \circ c_2$
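
The analogy between the two decompositions can be checked numerically. The sketch below, on hypothetical random data, rebuilds a matrix from its rank-one SVD terms and builds a three-way array from two triads; the names `X2`, `X3` and the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-way case: the SVD writes X as a sum of rank-one matrices s_r * u_r v_r^T.
X2 = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(X2, full_matrices=False)
X2_sum = sum(s[r] * np.outer(U[:, r], Vt[r, :]) for r in range(len(s)))

# Three-way case: a Parafac structure is a sum of triads a_r o b_r o c_r.
I, J, K, R = 6, 4, 5, 2
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))
X3 = np.einsum('ir,jr,kr->ijk', A, B, C)
X3_sum = sum(np.multiply.outer(np.outer(A[:, r], B[:, r]), C[:, r])
             for r in range(R))
```

Both sums reproduce the arrays they were built from; the difference is that the SVD terms come with orthogonality constraints, while the triads do not.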

Page 23:

Parafac / Candecomp

Parafac is not sequential: the whole model needs to be re-estimated when more components are calculated [no deflation].

Parafac solution is unique: no rotational freedom. Changing parameters will reduce the fit.

NB! A PCA model is not unique: $X = T P^T + E = T R R^{-1} P^T + E = C S^T + E$

Unique ≠ true
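
The rotational freedom of a PCA model can be illustrated with a few lines of NumPy. This is a sketch on hypothetical random data (centering omitted for brevity); `R_mat` is an arbitrary invertible matrix chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 6))

# Two-component PCA model via the SVD: X = T P^T + E
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :2] * s[:2]
P = Vt[:2, :].T
E = X - T @ P.T

# Any invertible 2 x 2 matrix R gives new factors with exactly the same fit.
R_mat = np.array([[2.0, 1.0], [0.5, 3.0]])
C = T @ R_mat
S = P @ np.linalg.inv(R_mat).T   # chosen so that C S^T = T R R^-1 P^T
E_rot = X - C @ S.T
```

The residuals `E` and `E_rot` are identical although the factors differ, which is why PCA loadings cannot be read as pure spectra without extra constraints.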

Page 24:

Extension of Two Mode Component Analysis (TMCA)

TMCA: $X = A G C^T$, with A (I x P), G (P x R) and C (K x R)

Tucker III: $X = A G (C \otimes B)^T$, with A (I x P), B (J x Q), C (K x R) and core array G (P x Q x R), unfolded to P x QR

Page 25:

Tucker models

[Figure: the three Tucker models drawn as array decompositions]

Tucker III: X is decomposed into A, B, C and core G

Tucker II: X is decomposed into A, C and core G

Tucker I: X is decomposed into A and core G (equals MPCA)

Page 26:

Tucker models

Core array can be fully filled: P x Q x R triads (1,1,1 / 1,1,2 / 1,2,1 etc.)

Not unique: rotational freedom. Components can be rotated towards orthogonality.

Not sequential

Restricted Tucker models can be developed when using prior chemical knowledge

Page 27:

Number of parameters

X (I x J x K), example: I = 50, J = 9, K = 100, P = Q = R = 3

Parafac: R x (I + J + K) = 477

Tucker3: P x I + Q x J + R x K + P x Q x R = 504

MPCA: R x (I + JK) = 2850

Fit MPCA > Parafac (Overfit?)
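
The three counts follow directly from the formulas on this slide; the short calculation below reproduces them for the example sizes. The variable names are illustrative.

```python
I, J, K = 50, 9, 100
P = Q = R = 3

n_parafac = R * (I + J + K)                     # three loading matrices
n_tucker3 = P * I + Q * J + R * K + P * Q * R   # loadings plus full core
n_mpca = R * (I + J * K)                        # scores plus unfolded loadings

print(n_parafac, n_tucker3, n_mpca)  # 477 504 2850
```

MPCA estimates roughly six times as many parameters as Parafac for the same number of components, which is the reason for the overfitting question.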

Page 28:

Soft models vs hard models

Two-way bilinear model

Beer’s law: $A_{i,\lambda} = c_{i,1}\,\varepsilon_{1,\lambda} + c_{i,2}\,\varepsilon_{2,\lambda} + e_{i,\lambda}$ (no orthogonal constraints)

PCA: $x_{ij} = t_{i1} p_{j1} + t_{i2} p_{j2} + e_{ij}$ (orthogonal constraints)

Trilinear model

Parafac (fluorescence): $x_{ijk} = a_{i1} b_{j1} c_{k1} + a_{i2} b_{j2} c_{k2} + e_{ijk}$ (no orthogonal constraints)

Page 29:

Multiway Regression I

Two step approach:

$X = A\tilde{P} + E$, with $\min_{A,\tilde{P}} \|X - A\tilde{P}\|^2$

$y = Ab + f$, with $\min_{b} \|y - Ab\|^2$

Step 1: decomposition of X into A and $\tilde{P}$. The model can be Parafac, Tucker, MPCA etc.

Step 2: regression of y on A.

No information from y is used in the decomposition. Similar to the PCR method.
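
A minimal sketch of the two step approach, assuming MPCA (PCA on the I x JK unfolding) as the decomposition and ordinary least squares as the regression. The data, sizes and variable names are hypothetical; a Parafac or Tucker decomposition could take the place of step 1.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K, R = 20, 4, 6, 2
X = rng.normal(size=(I, J, K))          # hypothetical three-way data
y = rng.normal(size=I)                  # hypothetical response

# Step 1: decompose X without looking at y (here MPCA on the I x JK unfolding).
Xu = X.reshape(I, J * K)
Xu = Xu - Xu.mean(axis=0)
U, s, Vt = np.linalg.svd(Xu, full_matrices=False)
A = U[:, :R] * s[:R]                    # scores, I x R
P_tilde = Vt[:R, :]                     # loadings, R x JK

# Step 2: regress the centred y on the scores A by least squares.
yc = y - y.mean()
b, *_ = np.linalg.lstsq(A, yc, rcond=None)
f = yc - A @ b                          # regression residuals
```

Because `A` is computed before `y` is touched, the scores are optimal for describing X but not necessarily for predicting y, which is the same trade-off as in principal component regression.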

Page 30:

Multiway Regression II

Direct approach

$\min_{A,\tilde{P},b} \left( \|X - A\tilde{P}\|^2 + \|y - Ab\|^2 \right)$

$X = A\tilde{P} + E$

$y = Ab + f$

Now X is decomposed with y in mind. This leads to a suboptimal decomposition of X but an improved fit of y.

Page 31:

When data are not exactly 3-way

[Figure: batch x process variable x time array, with the time axis related to an indicator variable; axis labels: batch, process variable, time, indicator variable, time / variable]

Page 32:

Alignment problems

Peak shifts in LC-MS / GC-MS

Warping methods to align the peaks: Dynamic Time Warping, Correlation Optimized Warping
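
To make the idea of warping concrete, here is a plain textbook dynamic time warping sketch in NumPy, applied to two hypothetical Gaussian peaks that differ only by a shift. It is not correlation optimized warping and not a production aligner; the function name `dtw_path` and the test signals are assumptions for illustration.

```python
import numpy as np

def dtw_path(x, y):
    """Dynamic time warping of two 1-D signals; returns total cost and path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover which points are matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

t = np.arange(50)
reference = np.exp(-0.5 * ((t - 20) / 3.0) ** 2)   # peak at t = 20
shifted = np.exp(-0.5 * ((t - 25) / 3.0) ** 2)     # same peak, shifted to t = 25
cost, path = dtw_path(reference, shifted)
```

The returned path lists index pairs that are matched to each other; resampling the shifted signal along this path puts its peak on top of the reference peak before a three-way model is fitted.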

Page 33:

Three-way Applications

Page 34:

Fluorescence data

5 samples with varying concentration of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water.

Excitation wavelength: 240 – 300 nm

Emission wavelength: 250 – 450 nm

Page 35:

Unfold PCA model of Fluorescence data

99.97% explained with 3 PCs

Loadings refolded into Excitation / Emission form

Overfit of data:

Loading 2 has negative parts. This is not in accordance with fluorescence theory.

[Figure: plot with horizontal axis 1 to 5 and vertical axis from -1.5 to 2.5 (x 10^4)]

Page 36:

Parafac model of Fluorescence data

99.93% explained variation: Good Fit

Loadings are very well interpretable.

Intensity in A mode can be related to concentration

A mode

B and C mode

Page 37:

Fluorescence data

$I_{Em,Ex,k} = a_{Em,1} b_{Ex,1} c_{k,1} + a_{Em,2} b_{Ex,2} c_{k,2} + a_{Em,3} b_{Ex,3} c_{k,3} + e_{ijk}$

Fluorescence data fit the trilinear model applied by Parafac perfectly.

Due to the uniqueness property of Parafac, the loadings found will perfectly resemble the emission spectra and excitation spectra of the three compounds in the mixtures.

This is a nice example of mathematical chromatography.

Page 38:

Batch reaction monitoring

Pseudo-first-order reaction: A + B → C → D + E

UV-Vis spectrum (300-500 nm) measured every 10 seconds.

Obeys Lambert-Beer law

35 NOC batches. X (35 x 201 x 271)

In addition, some disturbed batches were measured:

pH disturbance during the reaction

Temperature change

Impurity

[Figure: concentration (MMol) versus time (s) for reactant, intermediate and product]

[Figure: absorbance (units) versus wavelength (nm), 300 to 500 nm, for reactant, intermediate and product]

Page 39:

Aims and goals of research I

Data modelling: Improve understanding of the process by interpretation of model parameters.

Analysis of historical batches: Are the current process measurements able to distinguish between ‘good’ and ‘bad’ batches?

On-line monitoring: Rapid fault detection. Easier fault diagnosis: what is the cause of the fault? Prediction of batch duration.

Page 40:

Aims and goals of research II

Which batch is different?

Page 41:

Unfold PCA model

Unfold keeping the batch direction (I x JK)

$X = T P^T + E$

$x_{i,jk} = \sum_r t_{i,r}\, p_{jk,r} + e_{i,jk}$

Page 42:

Unfold PCA model

Many parameters estimated, likely to overfit the data

Page 43:

Unrestricted Parafac model

The simplest three-way model is the PARAFAC model:

[Figure: X (batch x wavelengths x time) = loadings A, B and C combined through the identity core array I, plus E]
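
A Parafac model is commonly fitted by alternating least squares (ALS): each loading matrix in turn is solved by least squares while the other two are held fixed. The NumPy sketch below is a bare-bones illustration under stated assumptions (random start, fixed number of iterations, no convergence check, no constraints) and is not the implementation used for the results in these slides. The function names and the synthetic test array are hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; rows ordered with U's index varying slowest."""
    R = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, R)

def parafac_als(X, R, n_iter=200, seed=0):
    """Fit x_ijk ~ sum_r a_ir b_jr c_kr by alternating least squares."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(I, R))
    B = rng.normal(size=(J, R))
    C = rng.normal(size=(K, R))
    X1 = X.reshape(I, J * K)                      # I x JK
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)   # J x IK
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)   # K x IJ
    for _ in range(n_iter):
        # Each update is the least-squares solution for one mode.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Synthetic, noise-free trilinear array with two components.
rng = np.random.default_rng(42)
A0, B0, C0 = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = parafac_als(X, R=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
fit = 100 * (1 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2))   # % explained
```

Because all components are re-estimated together in every iteration, adding a component means refitting the whole model, in line with the remark that Parafac is not sequential.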

Page 44:

Unrestricted Parafac model

Loadings are highly correlated - solution may be unstable.

Model is difficult to interpret.

99.4% fit

Can external knowledge of the process be used to improve the model?

[Figure: loadings 1 to 3 in the batch mode (batch number 1 to 27), wavelength mode (300 to 500) and time mode (0 to 45)]

Page 45:

Grey Modelling of batch data

‘Black-box’ or ‘soft’ models are empirical models which aim to fit the data as well as possible, e.g. PCA, neural networks. Difficult to interpret. Good fit.

‘White’ or ‘hard’ models use known external knowledge of the process, e.g. physicochemical model, mass-energy balances. Easy to interpret. Not always available.

‘Grey’ or ‘hybrid’ models combine the two. Good fit.

Page 46:

Modelling batch data

Total variation = Systematic variation due to known causes + Systematic variation due to unknown causes + Unsystematic variation

X = white part + black part + E

Page 47:

External information

Incorporating external information can:

increase model interpretability

increase model stability

Pure Spectra

[Figure: absorbance (units) versus wavelength (nm), 300 to 500 nm, for reactant, intermediate and product]

Reaction kinetics

$A_t = A_0 e^{-k_1 t}$

$C_t = A_0 \frac{k_1}{k_2 - k_1} \left( e^{-k_1 t} - e^{-k_2 t} \right)$

$D_t = A_0 - A_t - C_t$
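
The reaction kinetics used as external information can be sketched as consecutive first-order steps, reactant to intermediate to product, with rate constants k1 and k2. The rate constants, the initial concentration and the function name below are hypothetical values picked only to produce plausible curves; the time axis follows the 45 minutes at 10 second intervals mentioned for the batch data.

```python
import numpy as np

def consecutive_first_order(t, A0, k1, k2):
    """Concentrations for reactant -> intermediate -> product (k1 != k2)."""
    A = A0 * np.exp(-k1 * t)
    C = A0 * k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
    D = A0 - A - C                     # mass balance
    return A, C, D

t = np.arange(0, 2701, 10) / 60.0      # 45 minutes sampled every 10 s, in minutes
A, C, D = consecutive_first_order(t, A0=1.0, k1=0.3, k2=0.05)
```

Profiles like these can be imposed on the time-mode loadings, so that only the rate constants remain to be estimated in that mode.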

Page 48:

Restricted ‘white’ model

External information is introduced in the form of parameter restrictions:

[Figure: X (batch x wavelengths x time) = loadings A, B and C combined through the core array G, plus E; the restrictions are labelled REACTION KINETICS, KNOWN SPECTRA and LAMBERT-BEER LAW]

Page 49:

Restricted Tucker model

[Figure: loadings in the batch mode (batch number 1 to 27), wavelength mode (300 to 500) and time mode (0 to 45)]

Model is stable.

97.6% fit - lower than for black model

Some systematic variation in the data is left unexplained by this model.

Page 50:

Grey model

White components describe known effects

Black components can be interpreted

99.8% fit (corresponds well with estimated level of spectral noise of 0.13%)

[Figure: two sets of loading plots, each with batch mode (batch number 1 to 32), wavelength mode (300 to 500) and time mode (0 to 45)]

Page 51:

Core array of restricted Tucker model

Only combinations:

g111: a1, b1, c1

g122: a1, b2, c2

g133: a1, b3, c3

g244: a2, b4, c4

g355: a3, b5, c5

G (3x5x5 core array, unfolded to 3x25):

g111 0 0 0 0 | 0 g122 0 0 0 | 0 0 g133 0 0 | 0 0 0 0 0 | 0 0 0 0 0

0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 g244 0 | 0 0 0 0 0

0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 g355
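
The effect of the restricted core can be sketched with NumPy: a Tucker3 reconstruction with only the five permitted core elements equals a sum of five triads. The loading matrices, array sizes and core values below are hypothetical random numbers; only the pattern of non-zero core positions is taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, K = 8, 12, 10            # hypothetical data sizes
P, Q, R = 3, 5, 5              # core dimensions as on the slide
A = rng.normal(size=(I, P))
B = rng.normal(size=(J, Q))
C = rng.normal(size=(K, R))

# Only five core elements may be non-zero (1-based on the slide, 0-based here).
allowed = [(0, 0, 0), (0, 1, 1), (0, 2, 2), (1, 3, 3), (2, 4, 4)]
G = np.zeros((P, Q, R))
for p, q, r in allowed:
    G[p, q, r] = rng.normal()

# Tucker3 reconstruction: x_ijk = sum_pqr g_pqr a_ip b_jq c_kr
X_hat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# The same model written as a sum of the five permitted triads.
X_triads = sum(G[p, q, r] * np.einsum('i,j,k->ijk', A[:, p], B[:, q], C[:, r])
               for p, q, r in allowed)

G_unfolded = G.reshape(P, Q * R)   # 3 x 25 layout of the core
```

The first batch-mode component is shared by three wavelength/time pairs, while the second and third each interact with a single pair, which is how the restrictions encode prior knowledge.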

Page 52:

Grey model residuals

[Figure: squared residuals versus batch number, versus wavelength (300 to 500) and versus time (0 to 45)]

Page 53:

Properties of grey models

White and black model parts can be calculated:

simultaneously (via restricted core matrix), with better % fit

sequentially, with better diagnostics; this allows partitioning of variance: 100% = 97.1% + 1.9% + 0.2%

simultaneously but with orthogonality restrictions, which also allow partitioning of variance

$\|X\|^2 = \|X_w\|^2 + \|X_b\|^2 + \|E\|^2$
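
Why a sequential fit allows the variance to be partitioned can be shown on a two-way toy example. In the sketch below the white part is a least-squares projection onto hypothetical known profiles and the black part is a rank-2 truncated SVD of the remainder; both choices are assumptions for illustration, not the three-way procedure used in the slides.

```python
import numpy as np

def ss(M):
    """Sum of squares of all elements."""
    return np.sum(M ** 2)

rng = np.random.default_rng(6)
n_time, n_wave = 30, 40
X = rng.normal(size=(n_time, n_wave))      # hypothetical (unfolded) data
C_known = rng.random(size=(n_time, 3))     # hypothetical known profiles

# White part: least-squares projection of X onto the known profiles.
X_w = C_known @ np.linalg.lstsq(C_known, X, rcond=None)[0]

# Black part: rank-2 truncated SVD of what the white part leaves over.
residual = X - X_w
U, s, Vt = np.linalg.svd(residual, full_matrices=False)
X_b = (U[:, :2] * s[:2]) @ Vt[:2, :]

E = residual - X_b
percentages = 100 * np.array([ss(X_w), ss(X_b), ss(E)]) / ss(X)
```

The three parts are mutually orthogonal by construction, so their sums of squares add up to that of X and the percentages sum to 100.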

Page 54:

Off-line batch monitoring

NOC: # 1:32

Validation: # 33-35

pH disturbed: # 36

Temp. problem: # 37

Impurity: # 38

[Figure: ln(Q-statistic) versus batch number (0 to 40); batches 8, 11, 13, 36, 37 and 38 are labelled]

Off-line monitoring: Q-statistic with 95% and 99% confidence limits
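
A minimal sketch of how the two monitoring statistics can be computed for a new batch, assuming a PCA model on unfolded NOC data rather than the grey model of the slides. Q (also called SPE) is the squared residual after projection on the model; D is a Hotelling-type distance within the model plane. The data are random placeholders, the scaling of D varies between references, and the 95% and 99% confidence limits are not computed here.

```python
import numpy as np

rng = np.random.default_rng(7)
n_noc, n_vars, R = 32, 50, 3
X_noc = rng.normal(size=(n_noc, n_vars))   # hypothetical unfolded NOC batches (I x JK)

# PCA model of the centred NOC data.
mean = X_noc.mean(axis=0)
U, s, Vt = np.linalg.svd(X_noc - mean, full_matrices=False)
P = Vt[:R, :].T                            # loadings, JK x R
score_var = s[:R] ** 2 / (n_noc - 1)       # variance of each score

def monitor(x_new):
    """Return (D, Q) for one new unfolded batch."""
    xc = x_new - mean
    t = P.T @ xc                           # scores of the new batch
    e = xc - P @ t                         # part not described by the model
    D = np.sum(t ** 2 / score_var)         # distance within the model plane
    Q = np.sum(e ** 2)                     # squared prediction error (SPE)
    return D, Q

D_new, Q_new = monitor(rng.normal(size=n_vars))
```

A fault that breaks the correlation structure shows up mainly in Q, while a batch that follows the model but lies far from the NOC average shows up in D.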

Page 55:

On-line monitoring of a validation batch

[Figure: ln(D-statistic) versus time (0 to 45)]

On-line monitoring of batch 33: D-statistic with 95% and 99% confidence limits

[Figure: ln(SPE) versus time (0 to 45)]

On-line monitoring of batch 33: SPE with 95% and 99% confidence limits

Page 56:

On-line monitoring of the pH disturbed batch

[Figure: ln(D-statistic) versus time (0 to 45)]

On-line monitoring of batch 36: D-statistic with 95% and 99% confidence limits

[Figure: ln(SPE) versus time (0 to 45)]

On-line monitoring of batch 36: SPE with 95% and 99% confidence limits

After 23 minutes SPE goes outside control limits

pH was disturbed after 21 minutes

Only small change in D-statistic

Page 57:

On-line monitoring of the temperature disturbed batch

[Figure: ln(D-statistic) versus time (0 to 45)]

On-line monitoring of batch 37: D-statistic with 95% and 99% confidence limits

[Figure: ln(SPE) versus time (0 to 45)]

On-line monitoring of batch 37: SPE with 95% and 99% confidence limits

Temperature slowly decreasing from start of reaction

Rate constant k1 lower than usual.

Contribution plot shows difference spectrum between reactant (too high) and intermediate (too low)

Page 58:

Want to know more? Look at Rasmus Bro’s website