
Multivariate Data Analysis in Omics Research

Diverging Alternative Splicing Fingerprints Identified in Thoracic Aortic Aneurysm

Sanela Kjellqvist, PhD
WABI RNAseq course

2017-11-08

Outline
• Why multivariate data analysis?
• Multivariate statistics
  – Different analyses
  – Data preprocessing
• Alternative splicing in thoracic aortic aneurysm
  – Thoracic aortic aneurysm
  – Study setup
  – Aim of the study
  – Results
  – Summary
• Today’s exercise

WHY MULTIVARIATE DATA ANALYSIS?

Development of Classical Statistics – 1930s

• Multiple regression
• Canonical correlation
• Linear discriminant analysis
• Analysis of variance

Assumptions:
• Independent X variables
• Many more observations than variables
• Regression analysis one Y at a time
• No missing data

[Figure: data table with N observations and K variables. Tables are long and lean.]

Today’s data
• RNASeq, array, LC-MS/MS, GC/MS or NMR data
• Problems
  – Many variables
  – Few observations
  – Noisy data
  – Missing data
  – Multiple responses
• Implications
  – High degree of correlation
  – Difficult to analyse with conventional methods
• Data ≠ Information
  – Need ways to extract information from the data
  – Need reliable, predictive information
  – Ignore random variation (noise)

[Figure: data table with N observations and K variables]

Poor Methods of Data Analysis

[Figure: variables X1, X2, X3 and Y1, Y2, Y3]

• Plot pairs of variables
  – Tedious, impractical
  – Risk of spurious correlations
  – Risk of missing information
• Select a few variables and use multiple linear regression (MLR)
  – Throwing away information
  – Assumes no ‘noise’ in X
  – One Y at a time

A Better Way...
• Multivariate analysis by projection
  – Looks at ALL the variables together
  – Avoids loss of information
  – Finds underlying trends = “latent variables”
  – More stable models

Fundamental Data Analysis Objectives

Overview
• Trends
• Outliers
• Quality control
• Biological diversity
• Patient monitoring

Discrimination
• Discriminating between groups
• Biomarker candidates
• Comparing studies or instrumentation

Regression
• Comparing blocks of omics data
• Metabolomic vs proteomic vs genomic
• Omic vs medical
• Prediction

MULTIVARIATE STATISTICS

Different methods
• Principal component analysis (PCA)
• Partial least squares to latent structures analysis (PLS)
• Orthogonal partial least squares to latent structures analysis (OPLS)
• PLS-DA
• OPLS-DA
• K-means clustering
• Hierarchical clustering
• Biplot analysis
• Canonical correlation analysis

What is a projection?

Principal component analysis (PCA)
• Algebraically
  – Summarizes the information in the observations as a few new (latent) variables
• Geometrically
  – The swarm of points in a K-dimensional space (K = number of variables) is approximated by a (hyper)plane and the points are projected onto that plane.

PCA - Geometric Interpretation

[Figure: observations of the data matrix X (N × K) plotted in variable space (axes x1, x2, x3), with Comp 1 (t1), Comp 2 (t2) and the “distance to model”]

• Fit the first principal component (the line describing maximum variation)
• Add the second component, which accounts for the next largest amount of variation and is at right angles to the first (orthogonal)
• Each component goes through the origin
• Points are projected down onto a plane with co-ordinates t1, t2

Loadings

[Figure: Comp 1 in variable space (axes x1, x2, x3), making the angles α1, α2, α3 with the variable axes; data matrix X (N × K) with scores t1, t2 and loadings p’1 = (cos(α1), cos(α2), cos(α3))]

• How do the principal components relate to the original variables?
• Look at the angles between the PCs and the variable axes
• Take cos(α) for each axis
• Loadings vector p’: one for each principal component
• One value per variable
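The relation between loadings and angles can be checked numerically. The following is a minimal sketch, assuming a small simulated data set (the data, the random seed and the use of NumPy's SVD are illustration choices, not part of the lecture material). It shows that the loading vector of the first component has unit length, so each of its elements is the cosine of the angle between the component and one variable axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: N = 50 observations, K = 3 correlated variables (x1, x2, x3)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.5, 0.1],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)   # mean-centre so the components pass through the origin

# The rows of Vt are unit-length loading vectors, one per component
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p1 = Vt[0]

# Each loading is cos(alpha) for one variable axis, so arccos recovers the angle
angles_deg = np.degrees(np.arccos(np.clip(p1, -1.0, 1.0)))
print("loadings p1:", p1)
print("angles to x1, x2, x3 (degrees):", angles_deg)
```

The sign of a loading vector is arbitrary, so an angle above 90 degrees only means that the component happens to point away from that axis.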

Principal component analysis (PCA)
• PCA compresses the X data block into A orthogonal components
• Variation seen in the score vector t can be interpreted from the corresponding loading vector p

[Figure: X (N × K) decomposed into scores T and loadings P^T, components 1…A]

PCA model:
$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_A p_A^T + E = T P^T + E$$
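The PCA model can be reproduced in a few lines. This is a minimal sketch, assuming mean-centred data and using the singular value decomposition as one possible way to obtain scores and loadings; the function name `pca` and the simulated matrix are hypothetical, and dedicated software such as SIMCA may compute the components differently (for example iteratively).

```python
import numpy as np

def pca(X, n_components):
    """Return scores T, loadings P and residuals E of mean-centred X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # score vectors t_1 ... t_A
    P = Vt[:n_components].T                      # loading vectors p_1 ... p_A
    E = Xc - T @ P.T                             # what the A components leave unexplained
    return T, P, E

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 100))   # hypothetical omics-like table: N = 20, K = 100
T, P, E = pca(X, n_components=2)
print(T.shape, P.shape, E.shape)
```

Note that the decomposition works even though there are far more variables than observations, which is the situation the classical methods struggle with.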

PCA

Recognition of molecular quasi-species (evolving units) in enzyme evolution by PCA

Emrén, L., Kurtovic, S., Runarsdottir, A., Larsson, A-K., & Mannervik, B. (2006) Proc Natl Acad Sci USA, 103, 10866-10870
Kurtovic, S., & Mannervik, B. (2009) Biochemistry, 48, 9330-9339

Orthogonal partial least squares to latent structures – Discriminant analysis (OPLS-DA)

[Figure: OPLS model relating X to a class-membership Y (Class 1, Class 2)]

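OPLS with single Y / modelling and prediction

[Figure: X decomposed into a ‘Y-predictive’ part (t1, p1^T) and a ‘Y-orthogonal’ part (T_O, P_O^T); y is related to X through t1, u1 and q1^T]

OPLS model:
$$X = t_1 p_1^T + T_O P_O^T + E$$
$$Y = t_1 q_1^T + F$$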
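To make the split into a Y-predictive and a Y-orthogonal part concrete, here is a minimal sketch of one common way to compute a single-y OPLS model with one orthogonal component. It is an assumption that this matches what the course software does; the function name `opls_single_y` and the simulated two-class data are hypothetical. X and y are mean-centred before fitting.

```python
import numpy as np

def opls_single_y(X, y):
    """One predictive and one Y-orthogonal component for centred X (N x K) and y (N,)."""
    w = X.T @ y / (y @ y)                 # predictive weight vector
    w /= np.linalg.norm(w)
    t = X @ w
    p = X.T @ t / (t @ t)
    w_o = p - (w @ p) * w                 # part of p that is orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                         # Y-orthogonal score
    p_o = X.T @ t_o / (t_o @ t_o)         # Y-orthogonal loading
    X_filt = X - np.outer(t_o, p_o)       # remove the Y-orthogonal variation
    t1 = X_filt @ w                       # Y-predictive score
    p1 = X_filt.T @ t1 / (t1 @ t1)
    q1 = y @ t1 / (t1 @ t1)
    E = X_filt - np.outer(t1, p1)
    F = y - t1 * q1
    return t1, p1, q1, t_o, p_o, E, F

rng = np.random.default_rng(2)
y = np.repeat([0.0, 1.0], 15)             # hypothetical class labels: Class 1 = 0, Class 2 = 1
X = rng.normal(size=(30, 50))
X[:, :5] += 2.0 * y[:, None]              # first five variables differ between the classes
Xc, yc = X - X.mean(axis=0), y - y.mean()
t1, p1, q1, t_o, p_o, E, F = opls_single_y(Xc, yc)
print("t_o . y =", t_o @ yc)              # numerically zero: the orthogonal score carries no y information
```

Using a 0/1 class vector as y is what turns the regression model into a discriminant analysis (OPLS-DA).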

Data Preprocessing – Scaling
• PCA and other methods are scale dependent
  – Is the size of a variable important?
• Scaling weight is 1/SD for each variable, i.e. divide each variable by its standard deviation
  – Unit Variance (UV) scaling
• Variance of scaled variables = 1
• Many other kinds of scaling exist

[Figure: UV scaling, each variable in X multiplied by its scaling weight ws = 1/SD]
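UV scaling combined with mean centering can be sketched as follows. The function name `uv_scale` and the example matrix are hypothetical, and the sample standard deviation (`ddof=1`) is an assumption about the convention used. Variables with zero variance would need to be removed first, because their scaling weight is undefined.

```python
import numpy as np

def uv_scale(X):
    """Mean-centre each column and divide it by its standard deviation."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)      # sample standard deviation per variable
    return (X - mean) / sd, mean, sd

rng = np.random.default_rng(3)
# Hypothetical variables measured on very different scales
X = rng.normal(loc=[5.0, 200.0, 0.01], scale=[1.0, 50.0, 0.002], size=(40, 3))
Xs, mean, sd = uv_scale(X)
print(Xs.var(axis=0, ddof=1))       # every scaled variable has variance 1
```

Returning `mean` and `sd` allows new observations to be scaled with the same weights as the training data.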

Cross-Validation
• Data are divided into G groups (default in SIMCA-P is 7) and a model is generated for the data devoid of one group
• The deleted group is predicted by the model ⇒ partial PRESS (Predictive Residual Sum of Squares)
• This is repeated G times and then all partial PRESS values are summed to form the overall PRESS
• If a new component enhances the predictive power compared with the previous PRESS value, then the new component is retained
• PCA cross-validation is done in two phases and several deletion rounds:
  – first removal of observations (rows)
  – then removal of variables (columns)
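The leave-one-group-out loop can be sketched for a simple regression case. Everything specific here is an assumption made for illustration: ordinary least squares stands in for the model, observations are assigned to the G = 7 groups in an interleaved fashion, and the function name `press_cv` and the simulated data are hypothetical. The two-phase row and column deletion used for PCA is not reproduced.

```python
import numpy as np

def press_cv(X, y, n_groups=7):
    """Overall PRESS from leave-one-group-out cross-validation."""
    groups = np.arange(len(y)) % n_groups        # interleaved group assignment (assumption)
    press = 0.0
    for g in range(n_groups):
        test = groups == g
        train = ~test
        # Stand-in model (assumption): least squares with an intercept, fitted without group g
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        resid = y[test] - A_test @ coef          # prediction errors for the deleted group
        press += np.sum(resid ** 2)              # add this group's partial PRESS
    return press

rng = np.random.default_rng(4)
X = rng.normal(size=(35, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=35)
press = press_cv(X, y)
ss_y = np.sum((y - y.mean()) ** 2)
q2 = 1.0 - press / ss_y
print("PRESS:", press, "Q2:", q2)
```

For a regression model the PRESS is compared with the sum of squares of the response, as in the last lines; for PCA the reference is the sum of squares of X.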

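Model Diagnostics
• Fit or R2
  – Residuals of matrix E pooled column-wise
  – Explained variation
  – For whole model or individual variables
  – $\mathrm{RSS} = \sum (\text{observed} - \text{fitted})^2$
  – $R^2 = 1 - \mathrm{RSS}/\mathrm{SSX}$
• Predictive ability or Q2
  – Leave out 1/7th of the data in turn
  – ‘Cross validation’
  – Predict each missing block of data in turn
  – Sum the results
  – $\mathrm{PRESS} = \sum (\text{observed} - \text{predicted})^2$
  – $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SSX}$

[Figure: fit (R2) and prediction (Q2) plotted against the number of components]

Stop when Q2 starts to drop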
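The fit statistic can be illustrated for PCA with the sketch below, which computes R2 = 1 - RSS/SSX for an increasing number of components. The function name `r2x_by_component` and the simulated data are hypothetical, and the SVD is used as one possible way to obtain the components. The example shows why R2 alone cannot decide how many components to keep: it never decreases when a component is added, whereas Q2 from cross-validation can drop.

```python
import numpy as np

def r2x_by_component(X, max_components):
    """Cumulative R2 = 1 - RSS/SSX for PCA models with 1..max_components components."""
    Xc = X - X.mean(axis=0)
    ssx = np.sum(Xc ** 2)                        # total sum of squares of centred X
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    r2 = []
    for a in range(1, max_components + 1):
        fitted = (U[:, :a] * s[:a]) @ Vt[:a]     # T P^T using the first a components
        rss = np.sum((Xc - fitted) ** 2)         # residual matrix E pooled over all elements
        r2.append(1.0 - rss / ssx)
    return np.array(r2)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 6))                     # hypothetical N = 10, K = 6
r2 = r2x_by_component(X, max_components=6)
print(r2)                                        # increases towards 1 as components are added
```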

ALTERNATIVE SPLICING IN THORACIC AORTIC ANEURYSM

Kurtovic, Paloschi, Folkersen, Gottfries, Franco-Cereceda, Eriksson (2011) Molecular Medicine, 17; 665-675

Thoracic aortic aneurysm (TAA)

• Monogenic
  – Marfan syndrome
  – Loeys-Dietz syndrome

• Aneurysm associated with bicuspid aortic valve (BAV)

• Idiopathic thoracic aortic aneurysm

Outline of the study

• Biopsies are collected from both non-dilated and dilated aorta during valve replacement surgery and reconstruction of the dilated aorta, respectively
• Media from ascending aorta
• RNA
  – Affymetrix Human Exon 1.0 ST microarrays (in this study 81 patients)
  – RNAseq (30 patients)
• Protein
  – HiRiEF iTRAQ LC-MS/MS
  – 2D gel electrophoresis followed by iTRAQ LC-MS/MS

[Figure: non-dilated and dilated aorta]

Aim of the study

• Alternative splicing in the transforming growth factor-β (TGFβ) signaling pathway
• The TGFβ pathway is known to be important in aortic aneurysm
• Are there any alternatively spliced genes in the TGFβ pathway?
• Is alternative splicing an important mechanism in thoracic aortic aneurysm (TAA)?
• How do we analyze alternative splicing?

Affymetrix exon array design

[Figure: exons and introns of a gene with the probe selection regions]

PSR – probe selection region

Preprocessing of data
• Probe set core level
• Unique hybridization target
• Robust multichip average (RMA) normalized
• Splice Index calculated (in case of exon level analysis)

$$n_{i,j,k} = \frac{e_{i,j,k}}{g_{j,k}}$$

where i = exon, j = sample, k = gene, e = exon signal, g = gene signal

• Unit variance scaled and mean centered data prior to MVA
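The Splice Index is the exon signal normalized by the signal of the gene it belongs to, computed per sample. The sketch below uses made-up numbers for a single gene with four exons and three samples, and it assumes that the signals are on a linear scale (with log2-transformed signals the ratio would become a difference).

```python
import numpy as np

# Hypothetical signals for one gene k: 4 exons (rows, index i) x 3 samples (columns, index j)
exon_signal = np.array([[120.0,  90.0, 150.0],
                        [ 60.0,  45.0,  75.0],
                        [ 30.0,  60.0,  37.5],
                        [240.0, 180.0, 300.0]])
gene_signal = np.array([100.0, 75.0, 125.0])   # g_{j,k}: one gene-level value per sample

# n_{i,j,k} = e_{i,j,k} / g_{j,k}; broadcasting divides every column j by its gene signal
splice_index = exon_signal / gene_signal
print(splice_index)
```

In this toy example exons 1, 2 and 4 follow the gene level in every sample, while exon 3 has a higher index in the second sample, which is the kind of pattern that suggests differential exon usage.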

Alternative splicing pattern in the TGFβ pathway is different between dilated and non-dilated aorta

• TAV and BAV together
  – 81 patients included
  – 614 exons included
  – Good model
  – Good separation between the two groups
[Figure: non-supervised PCA and supervised OPLS-DA]

• Only TAV patients
  – 29 patients included
  – 614 exons included
  – Good model
  – Good separation between the two groups
[Figure: non-supervised PCA and supervised OPLS-DA]

• Only BAV patients
  – 52 patients included
  – 614 exons included
  – Good model
  – Good separation between the two groups
[Figure: non-supervised PCA and supervised OPLS-DA]

Alternatively spliced exons are present in both TAV and BAV groups of patients

Alternative splicing analysis of all exons in the human genome reveals the importance of TGFβ pathway exons

Gene expression patterns of differentially spliced genes

Summary
• TGFβ pathway exons are clearly important according to an overall exon level analysis
• Dilated and non-dilated aortas show different alternative splicing patterns in the TGFβ pathway with respect to TAV and BAV
• Exons responsible for the diverging alternative splicing fingerprints in the TGFβ pathway were identified
• This implies that dilatation in TAV patients has different underlying molecular mechanisms compared to BAV patients
• New methods for analyzing array data

Today during the exercise

• PCA and OPLS-DA
• Thoracic aortic aneurysm data set
• Exon level expression Affymetrix arrays
• Compare two different phenotypes and subphenotypes