9/29/08
Machine Learning Approaches to Biological Research: Bioimage Informatics and Beyond
Robert F. Murphy
External Senior Fellow, Freiburg Institute for Advanced Studies
Ray and Stephanie Lane Professor of Computational Biology, Carnegie Mellon University
September 29 - October 1, 2009
Outline
• Basic principles and paradigms of supervised and unsupervised machine learning
• Concepts of automated image analysis
• Approaches for creating predictive models from images
• Active learning paradigms for closed-loop systems of cycles of experimentation, model refinement and model testing
www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
What is Machine Learning?
• Fundamental Question of Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?
• Fundamental Question of Statistics: What can be inferred from data plus a set of modeling assumptions, with what reliability?
Tom Mitchell whitepaper

Fundamental Question of Machine Learning
• How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? – Tom Mitchell
Tom Mitchell whitepaper
Why Machine Learning?
• Learn relationships from large sets of complex data: Data mining
– Predict clinical outcome from tests
– Decide whether someone is a good credit risk
• Do tasks too complex to program by hand
– Autonomous driving
• Customize programs to user needs
– Recommend book/movie based on previous likes
Tom Mitchell whitepaper
Why Machine Learning?
• Economically efficient
• Can consider larger data spaces and hypothesis spaces than people can
• Can formalize learning problem to explicitly identify/describe goals and criteria
Successful Machine Learning Applications
• Speech recognition – Telephone menu navigation
• Computer vision – Mail sorting
• Bio-surveillance – Identifying disease outbreaks
• Robot control – Autonomous driving
• Empirical science
Tom Mitchell whitepaper
Machine Learning Paradigms
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
– Clustering
• Semi-supervised Learning
– Co-training
– Active learning
Supervised Learning
• Approaches
– Classification (discrete predictions)
– Regression (continuous predictions)
• Common considerations
– Representation (features)
– Feature selection
– Functional form
– Evaluation of predictive power
Classification vs. Regression
• If I want to predict whether a patient will die from a disease within six months, that is classification
• If I want to predict how long the patient will live, that is regression
Representation
• Definition of thing or things to be predicted
– Classification: classes
– Regression: regression variable
• Definition of things (instances) to make predictions for
– Individuals
– Families
– Neighborhoods, etc.
• Choice of descriptors (features) to describe different aspects of instances
Formal description
• Define X as a set of instances x described by features
• Given training examples D from X
• Given a target function c that maps X -> {0,1}
• Given a hypothesis space H
• Determine a hypothesis h in H such that h(x) = c(x) for all x in D
Courtesy Tom Mitchell
Inductive learning hypothesis
• Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function over other unobserved examples
Courtesy Tom Mitchell
Hypothesis space
• The hypothesis space determines the functional form
• It defines what are allowable rules/functions for classification
• Each classification method uses a different hypothesis space
Simple two-class problem
[Figure: example images labeled + and -, plus unlabeled images marked ?]
• Describe each image by features
• Train classifier
k-Nearest Neighbor (kNN)
• In feature space, training examples are points
[Figure: scatter plot of + and - training examples; Feature #1 (e.g., 'area') vs. Feature #2 (e.g., roundness)]
k-Nearest Neighbor (kNN)
• We want to label '?'
[Figure: same scatter plot with an unlabeled point '?' added]
k-Nearest Neighbor (kNN)
• Find k nearest neighbors and vote
[Figure: scatter plot; for k = 3, the nearest neighbors of '?' are mostly +, so we label it +]
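The kNN vote can be sketched in a few lines; the feature names ('area', 'roundness') follow the figure, but the function name and toy data below are illustrative, not from the lecture:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training examples by Euclidean distance to the query point
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (area, roundness) -> class
train = [((1.0, 1.0), '+'), ((1.2, 0.9), '+'), ((0.9, 1.1), '+'),
         ((3.0, 3.1), '-'), ((3.2, 2.9), '-'), ((2.8, 3.0), '-')]
print(knn_predict(train, (1.1, 1.0), k=3))  # the three nearest neighbors are all '+'
```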
Linear Discriminants
• Fit multivariate Gaussian to each class
• Measure distance from ? to each Gaussian
[Figure: scatter plot of + and - examples with '?'; axes area vs. brightness]
Decision trees
• Again we want to label '?'
[Figure: scatter plot of + and - examples with '?']
Slide courtesy of Christos Faloutsos
Decision trees
• So we build a decision tree:
[Figure: the scatter plot partitioned at area = 50 and roundness = 40]
Slide courtesy of Christos Faloutsos
Decision trees
• So we build a decision tree:
[Figure: decision tree - area < 50? Y: +; N: roundness < 40? Y: -; N: ...; shown next to the partitioned scatter plot]
Slide courtesy of Christos Faloutsos
Decision trees
• Goal: split address space in (almost) homogeneous regions
[Figure: same decision tree and partitioned scatter plot]
Slide courtesy of Christos Faloutsos
Support vector machines
• Again we want to label '?'
[Figure: scatter plot of + and - examples with '?']
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• Use single linear separator??
[Figure: scatter plot with one candidate linear separator]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• Use single linear separator??
[Figures: a sequence of slides showing different candidate linear separators through the same data]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• We want to label '?': which linear separator??
• A: the one with the widest corridor!
[Figure: scatter plot with the maximum-margin separator and its corridor]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• What if the points for each class are not readily separated by a straight line?
• Use the "kernel trick": project the points into a higher-dimensional space in which we hope that straight lines will separate the classes
• "Kernel" refers to the function used for this projection
Support Vector Machines (SVMs)
• Definition of SVMs explicitly considers only two classes
• What if we have more than two classes?
• Train multiple SVMs
• Two basic approaches
– One against all (one SVM for each class)
– Pairwise SVMs (one for each pair of classes)
– Various ways of implementing this
Questions
• What are the hypothesis spaces for
– kNN classifier
– Linear discriminants
– Decision trees
– Support Vector Machines
Cross-Validation
• If we train a classifier to minimize error on a set of data, we have no ability to estimate the (generalization) error that will be seen on a new data set
• To calculate generalizable accuracy, we use n-fold cross-validation
• Divide images into n sets, train using n-1 of them and test on the remaining set
• Repeat until each set is used as test set and average results across all trials
• A variation on this is called leave-one-out
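The n-fold procedure above can be sketched as follows; the fold-splitting scheme and the trivial majority-class "model" used in the demo are illustrative choices, not from the lecture:

```python
from collections import Counter

def n_fold_indices(n_items, n_folds):
    """Split item indices into n_folds roughly equal test sets (striding)."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def cross_validate(items, labels, train_fn, predict_fn, n_folds=5):
    """Train on n-1 folds, test on the held-out fold, and average accuracy."""
    accs = []
    for test_idx in n_fold_indices(len(items), n_folds):
        test_set = set(test_idx)
        train_x = [x for i, x in enumerate(items) if i not in test_set]
        train_y = [y for i, y in enumerate(labels) if i not in test_set]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accs.append(correct / len(test_idx))
    return sum(accs) / len(accs)

# Demo with a trivial classifier that always predicts the majority training label
train_fn = lambda xs, ys: Counter(ys).most_common(1)[0][0]
predict_fn = lambda model, x: model
acc = cross_validate(list(range(10)), ['+'] * 8 + ['-'] * 2,
                     train_fn, predict_fn, n_folds=5)
```

With leave-one-out, `n_folds` simply equals the number of items.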
Describing classifier errors
• For binary classifiers (positive or negative), define
– TP = true positives, FP = false positives
– TN = true negatives, FN = false negatives
– Recall = TP / (TP + FN)
– Precision = TP / (TP + FP)
– F-measure = 2 * Recall * Precision / (Recall + Precision)
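These definitions translate directly to a few lines of code; the function name and toy label sequences are illustrative:

```python
def binary_metrics(true, pred, positive='+'):
    """Recall, precision, and F-measure from paired true/predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Toy example: 4 true positives in `true`, one missed, one false alarm
r, p, f = binary_metrics(list('++++--'), list('+++--+'))
```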
Confusion matrix - binary

True \ Predicted    Positive          Negative
Positive            True Positive     False Negative
Negative            False Positive    True Negative
Precision-recall analysis
• Vary classifier parameter to "loosen" some performance estimate: i.e., confidence
[Figure: precision-recall curve, with ideal performance marked]
Describing classifier errors
• For multi-class classifiers, typically report
– Accuracy = (# test images correctly classified) / (# test images)
– Confusion matrix = table showing all possible combinations of true class and predicted class
Confusion matrix - multi-class
Overall accuracy = 98%

True class (rows) vs. output of the classifier (columns):

      DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA    98    2    0    0    0    0    0    0    0    0
ER      0  100    0    0    0    0    0    0    0    0
Gia     0    0  100    0    0    0    0    0    0    0
Gpp     0    0    0   96    4    0    0    0    0    0
Lam     0    0    0    4   95    0    0    0    0    2
Mit     0    0    2    0    0   96    0    2    0    0
Nuc     0    0    0    0    0    0  100    0    0    0
Act     0    0    0    0    0    0    0  100    0    0
TfR     0    0    0    0    2    0    0    0   96    2
Tub     0    2    0    0    0    0    0    0    0   98
Ground truth
• What is the source and confidence of a class label?
• Most common: Human assignment, unknown confidence
• Preferred: Assignment by experimental design, confidence ~100%
Feature selection
• Having too many features can confuse a classifier
• Can use comparison of feature distributions between classes to choose a subset of features that gets rid of uninformative or redundant features
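One simple way to compare feature distributions between classes is to rank each feature by a t-statistic-like separation score and keep the top-scoring features; the function name and toy numbers below are mine, for illustration only:

```python
def separation_score(class1, class2):
    """Score a feature by how far apart the two class means are,
    relative to the within-class spread (a Welch t-statistic-like score)."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled = (var(class1) / len(class1) + var(class2) / len(class2)) ** 0.5
    return abs(mean(class1) - mean(class2)) / pooled

# Feature A separates the classes well; feature B has overlapping distributions
a1, a2 = [1.0, 1.1, 0.9, 1.2], [3.0, 3.1, 2.9, 3.2]
b1, b2 = [1.0, 3.0, 2.0, 2.2], [1.1, 2.9, 2.1, 2.0]
```

Feature selection would then keep the features with the highest scores (redundancy between features still has to be handled separately, as the multivariate slides below note).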
Basic principle of feature selection
[Figure: univariate distributions of Features 1-4 for two classes (red = class 1, blue = class 2); some features separate the classes, others do not]
• Need to consider multivariate distance
Figure from Guyon & Elisseeff
Bad and Good Covariance
[Figure]
Figure from Guyon & Elisseeff
Feature Selection Methods
• Principal Components Analysis
• Non-Linear Principal Components Analysis
• Independent Components Analysis
• Information Gain
• Stepwise Discriminant Analysis
• Genetic Algorithms
Regression
Linear regression
[Figure: 3-D temperature data with a fitted plane]
[start Matlab demo lecture2.m]
• Given examples (input-output pairs)
• Predict the output given a new point
Slide courtesy Roman Thibaux
Linear regression
[Figure: predictions from a fitted line (1-D) and a fitted plane (2-D) on the temperature data]
Ordinary Least Squares (OLS)
[Figure: fitted line with the error or "residual" between each observation and its prediction]
• Minimize the sum squared error
Slide courtesy Roman Thibaux
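In the single-variable case, minimizing the sum of squared residuals has a well-known closed-form solution; this sketch (helper name and toy data are mine) fits y = a + b*x:

```python
def ols_fit(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared residuals
    (closed-form simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept passes through the mean point
    return a, b

a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
```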
Beyond lines and planes
• Everything is the same with nonlinear basis functions of the inputs; the model is still linear in the parameters
[Figure: polynomial fit to 1-D data]
Slide courtesy Roman Thibaux
Geometric interpretation
[Matlab demo]
[Figure: 3-D geometric view of the least-squares fit]
Slide courtesy Roman Thibaux
Assumptions vs. Reality
[Figure: voltage vs. temperature, Intel sensor network data]
Slide courtesy Roman Thibaux
Overfitting
[Matlab demo]
[Figure: degree-15 polynomial fit oscillating between the data points]
Slide courtesy Roman Thibaux
Sensitivity to outliers
• High weight given to outliers
[Figure: temperature at noon, with an outlier distorting the fit; influence function shown]
Slide courtesy Roman Thibaux
Kernel Regression
[Figure: kernel regression fit (sigma = 1)]
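One common form of kernel regression, and a plausible reading of the fit plotted here, is the Nadaraya-Watson estimator: a locally weighted average of the training responses, with Gaussian weights that decay with distance from the query point. A minimal sketch (names and data are illustrative):

```python
import math

def kernel_regression(xs, ys, x0, sigma=1.0):
    """Nadaraya-Watson estimate at x0: Gaussian-weighted average of ys."""
    weights = [math.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# On data lying on a straight line, the estimate at a symmetric interior
# point recovers the line's value there
y0 = kernel_regression([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 2.0, sigma=1.0)
```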
Spline Regression
• Regression on each interval
[Figure: separate fits on adjacent intervals]
Slide courtesy Roman Thibaux
Spline Regression
• With equality constraints
[Figure: piecewise fits constrained to meet at the interval boundaries]
Slide courtesy Roman Thibaux
Cluster analysis
• Supervised learning (classification) assumes classes are known
• Unsupervised learning (cluster analysis) seeks to discover the classes
Formal description
• Given X as a set of instances described by features
• Given an objective function g
• Given a partition space H
• Determine a partition h in H such that h(X) maximizes/minimizes g(h(X))
Formal description
• Objective function g often stated in terms of minimizing a distance function d
• Example: Euclidean distance
Hierarchical vs. k-means clustering
• Two most popular clustering algorithms
• Hierarchical builds tree sequentially from the closest pair of points (wells/cells/probes/conditions)
• k-means starts with k randomly chosen seed points, assigns each remaining point to the nearest seed, and repeats this until no point moves
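The k-means loop in the last bullet can be sketched as follows; this is a minimal version (updating centers as cluster means), and the names and toy points are illustrative:

```python
import math
import random

def kmeans(points, k, seed=0):
    """Pick k random seed points, assign every point to the nearest center,
    recompute each center as its cluster mean, and repeat until no point moves."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    while True:
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:   # no point moved: converged
            return centers, assignment
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old center if a cluster emptied out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))

# Two well-separated toy clusters
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, labels = kmeans(pts, 2)
```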
Hierarchical Clustering
[Figure: points A-F merged stepwise from closest pairs: BC, DE, then DEF, BCDEF, ABCDEF, shown alongside the dendrogram]
Slide courtesy of Elvira Garcia Osuna
Hierarchical Clustering
[Figure: resulting dendrogram over points A-F]
Slide courtesy of Elvira Garcia Osuna
K-means
[Figures: a sequence of slides showing two seed points chosen, points assigned to the nearest seed, centers recomputed, and assignments repeated until no point moves]
Slide courtesy of Elvira Garcia Osuna
Choosing the number of Clusters
• A difficult problem
• Most common approach is to try to find the solution that minimizes the Bayesian Information Criterion (BIC)

BIC = -2 ln L + k ln(n)

L = the likelihood function for the estimated model
k = # of parameters
n = # of samples
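The criterion above rewards fit (via ln L) but penalizes parameters; the clustering with the lowest BIC is preferred. A sketch of using it to pick k, where the log-likelihoods, the parameters-per-cluster count, and the sample size are all hypothetical numbers chosen for illustration:

```python
import math

def bic(log_likelihood, k_params, n_samples):
    """BIC = -2 ln L + k ln(n); lower is better."""
    return -2 * log_likelihood + k_params * math.log(n_samples)

# Hypothetical log-likelihoods for 2, 3, and 4 clusters: the jump from
# 2 to 3 clusters improves fit a lot, the jump from 3 to 4 barely at all
loglik_by_k = {2: -120.0, 3: -100.0, 4: -99.0}
params_per_cluster = 3   # e.g. a 2-D center plus one variance (assumed)
n = 100

scores = {k: bic(ll, params_per_cluster * k, n) for k, ll in loglik_by_k.items()}
best_k = min(scores, key=scores.get)   # extra parameters stop paying off after k=3
```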
Microarray raw data
• Label mRNA from one sample with a red fluorescence probe (Cy5) and mRNA from another sample with a green fluorescence probe (Cy3)
• Hybridize to a chip with specific DNAs fixed to each well
• Measure amounts of green and red fluorescence
Flash animations:
• PCR: http://www.maxanim.com/genetics/PCR/PCR.htm
• Microarray: http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
Example microarray image
[Figure: mRNA expression microarray data for 9800 genes (gene number shown vertically) for 0 to 24 h (time shown horizontally) after addition of serum to a human cell line that had been deprived of serum (from http://genome-www.stanford.edu/serum)]
Data extraction
• Adjust fluorescent intensities using standards (as necessary)
• Calculate ratio of red to green fluorescence
• Convert to log2 and round to integer
• Display saturated green = -3 to black = 0 to saturated red = +3
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
[Figure: Gene 1 plotted in the space of Array 1 vs. Array 2; table: Gene 1 has value 1 on Array 1 and 4 on Array 2]
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
[Figure: Genes 1 and 2 plotted in the space of Array 1 vs. Array 2; Gene 1 = (1, 4), Gene 2 = (1, 3); Distance(Gene 1, Gene 2) = 1]
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
• Use distances to determine clusters
[Figure: same plot and table as before]
General Multivariate Dataset
• We are given values of p variables for n independent observations
• Construct an n x p matrix M consisting of vectors X1 through Xn, each of length p
Multivariate Sample Mean
• Define mean vector I of length p

matrix notation: I(j) = ( Σ_{i=1..n} M(i, j) ) / n

vector notation: I = ( Σ_{i=1..n} X_i ) / n
Multivariate Variance
• Define variance vector σ² of length p

matrix notation: σ²(j) = ( Σ_{i=1..n} (M(i, j) - I(j))² ) / (n - 1)
Multivariate Variance
• or

vector notation: σ² = ( Σ_{i=1..n} (X_i - I)² ) / (n - 1)
Covariance Matrix
• Define a p x p matrix cov (called the covariance matrix) analogous to σ²

cov(j, k) = ( Σ_{i=1..n} (M(i, j) - I(j)) (M(i, k) - I(k)) ) / (n - 1)
Covariance Matrix
• Note that the covariance of a variable with itself is simply the variance of that variable

cov(j, j) = σ²(j)
Univariate Distance
• The simple distance between the values of a single variable j for two observations i and l is

M(i, j) - M(l, j)
Univariate z-score Distance
• To measure distance in units of standard deviation between the values of a single variable j for two observations i and l, we define the z-score distance

( M(i, j) - M(l, j) ) / σ(j)
Bivariate Euclidean Distance
• The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance

√( (M(i, j) - M(l, j))² + (M(i, k) - M(l, k))² )

[Figure: observations i and l plotted in the plane of variables j and k, with coordinates M(i, j), M(l, j), M(i, k), M(l, k)]
Multivariate Euclidean Distance
• This can be extended to more than two variables

√( Σ_{j=1..p} (M(i, j) - M(l, j))² )
Effects of variance and covariance on Euclidean distance
[Figure: ellipse showing the 50% contour of a hypothetical population, with points A and B]
• Points A and B have similar Euclidean distances from the mean, but point B is clearly "more different" from the population than point A.
Mahalanobis Distance
• To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance

D² = (X_i - X_l) cov⁻¹ (X_i - X_l)ᵀ
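The formula translates directly to code; in this sketch a 2x2 covariance matrix is inverted by hand to keep it dependency-free, and the names and numbers are illustrative:

```python
def mahalanobis_sq(xi, xl, cov):
    """D^2 = (Xi - Xl) cov^-1 (Xi - Xl)^T for two variables."""
    d = (xi[0] - xl[0], xi[1] - xl[1])
    # Invert the 2x2 covariance matrix by the adjugate formula
    (a, b), (c, e) = cov
    det = a * e - b * c
    inv = ((e / det, -b / det), (-c / det, a / det))
    # Quadratic form d * inv * d^T
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

# With identity covariance, D^2 reduces to squared Euclidean distance
d2_euclid = mahalanobis_sq((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
# A high-variance direction contributes less to the distance
d2_scaled = mahalanobis_sq((3.0, 0.0), (0.0, 0.0), ((9.0, 0.0), (0.0, 1.0)))
```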
Other distance functions
• We can use other distance functions, including ones in which the weights on each variable are learned
• Cluster analysis tools for microarray data most commonly use Pearson correlation coefficient
Input data for clustering
• Genes in rows, conditions in columns
[Table: excerpt of a yeast expression data file with columns YORF, NAME, gene annotations, SGD identifier, GWEIGHT, and expression values for Cell-cycle Alpha-Factor time points 1-3; e.g. YHR051W (COX6, oxidative phosphorylation): 0.03, 0.3, 0.37]
Clustering genes and conditions
• Rows and columns can be clustered independently; hierarchical is preferred for visualizing this
Stating Goals vs. Approaches
• The temptation when first considering using a machine learning approach to a biological problem is to describe the problem as automating the approach that you would use to solve the problem
• "I need a program to predict how much a gene is expressed by measuring how well its promoter matches a template"
Stating Goals vs. Approaches
• "I need a program that, given a gene sequence, predicts how much that gene is expressed by measuring how well its promoter matches a template"
• "I need a program that, given a gene sequence, predicts how much that gene is expressed by learning from sequences of genes whose expression is known"
Resources
• Association for the Advancement of Artificial Intelligence
– http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning
• Machine Learning - Mitchell, Carnegie Mellon
– http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
• Practical Machine Learning - Jordan, UC Berkeley
– http://www.cs.berkeley.edu/~asimma/294-fall06/
• Learning and Empirical Inference - Rish, Tesauro, Jebara, Vapnik - Columbia
– http://www1.cs.columbia.edu/~jebara/6998/