9/29/08 Machine Learning Approaches to Biological Research...

9/29/08. Machine Learning Approaches to Biological Research: Bioimage Informatics and Beyond. Robert F. Murphy, External Senior Fellow, Freiburg Institute for Advanced Studies; Ray and Stephanie Lane Professor of Computational Biology, Carnegie Mellon University. September 29-October 1, 2009. Outline: Basic principles and paradigms of supervised and unsupervised machine learning; Concepts of automated image analysis; Approaches for creating predictive models from images; Active learning paradigms for closed loop systems of cycles of experimentation, model refinement and model testing. www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

Machine Learning Approaches to Biological Research: Bioimage Informatics and Beyond

Robert F. Murphy
External Senior Fellow, Freiburg Institute for Advanced Studies
Ray and Stephanie Lane Professor of Computational Biology, Carnegie Mellon University

September 29-October 1, 2009

Outline

•  Basic principles and paradigms of supervised and unsupervised machine learning

•  Concepts of automated image analysis

•  Approaches for creating predictive models from images

•  Active learning paradigms for closed loop systems of cycles of experimentation, model refinement and model testing

www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

What is Machine Learning?

•  Fundamental Question of Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?

•  Fundamental Question of Statistics: What can be inferred from data plus a set of modeling assumptions, with what reliability?

Tom Mitchell white paper

Fundamental Question of Machine Learning

•  How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
   – Tom Mitchell

Tom Mitchell white paper

Why Machine Learning?

•  Learn relationships from large sets of complex data: Data mining
   –  Predict clinical outcome from tests
   –  Decide whether someone is a good credit risk

•  Do tasks too complex to program by hand
   –  Autonomous driving

•  Customize programs to user needs
   –  Recommend book/movie based on previous likes

Tom Mitchell white paper

Why Machine Learning?

•  Economically efficient

•  Can consider larger data spaces and hypothesis spaces than people can

•  Can formalize learning problem to explicitly identify/describe goals and criteria

Successful Machine Learning Applications

•  Speech recognition
   –  Telephone menu navigation

•  Computer vision
   –  Mail sorting

•  Bio-surveillance
   –  Identifying disease outbreaks

•  Robot control
   –  Autonomous driving

•  Empirical science

Tom Mitchell white paper

Machine Learning Paradigms

•  Supervised Learning
   –  Classification
   –  Regression

•  Unsupervised Learning
   –  Clustering

•  Semi-supervised Learning
   –  Cotraining
   –  Active learning

Supervised Learning

•  Approaches
   –  Classification (discrete predictions)
   –  Regression (continuous predictions)

•  Common considerations
   –  Representation (Features)
   –  Feature Selection
   –  Functional form
   –  Evaluation of predictive power

Classification vs. Regression

•  If I want to predict whether a patient will die from a disease within six months, that is classification

•  If I want to predict how long the patient will live, that is regression

Representation

•  Definition of thing or things to be predicted
   –  Classification: classes
   –  Regression: regression variable

•  Definition of things (instances) to make predictions for
   –  Individuals
   –  Families
   –  Neighborhoods, etc.

•  Choice of descriptors (features) to describe different aspects of instances

Formal description

•  Defining X as a set of instances x described by features

•  Given training examples D from X

•  Given a target function c that maps X -> {0,1}

•  Given a hypothesis space H

•  Determine a hypothesis h in H such that h(x) = c(x) for all x in D

Courtesy Tom Mitchell
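
As a minimal sketch of the formal description above, the following Python snippet searches a small hypothesis space H for hypotheses that agree with the target function on every training example in D. The training data, the choice of single-feature threshold rules as H, and all names are assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical training set D: (feature value x, target label c(x)).
D = [(12, 0), (25, 0), (47, 0), (55, 1), (63, 1), (80, 1)]

# Assumed hypothesis space H: threshold rules h_t(x) = 1 if x >= t else 0,
# with candidate thresholds t = 0, 10, ..., 100.
H = range(0, 101, 10)


def h(t, x):
    """Threshold hypothesis with cut-off t."""
    return 1 if x >= t else 0


def consistent(t, data):
    """True if h_t(x) == c(x) for every training example in data."""
    return all(h(t, x) == c for x, c in data)


# Keep every hypothesis in H that is consistent with D.
consistent_thresholds = [t for t in H if consistent(t, D)]
print(consistent_thresholds)
```

With a richer hypothesis space, more than one hypothesis may be consistent with D, which is why the choice of H matters as much as the data.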

Inductive learning hypothesis

•  Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples

Courtesy Tom Mitchell

Hypothesis space

•  The hypothesis space determines the functional form

•  It defines what are allowable rules/functions for classification

•  Each classification method uses a different hypothesis space

Simple two class problem

Describe each image by features
Train classifier

[Figure: example images labeled + and -, and unlabeled images marked ???]

k-Nearest Neighbor (kNN)

•  In feature space, training examples are

[Figure: scatter plot of + and - training examples; axes: Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness)]

k-Nearest Neighbor (kNN)

•  We want to label '?'

[Figure: the same scatter plot with an unlabeled point '?' added]

k-Nearest Neighbor (kNN)

•  Find k nearest neighbors and vote

[Figure: the same scatter plot with the unlabeled point '?'; axes: Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness)]

for k=3, nearest neighbors are [symbols shown in the figure, not preserved]

So we label it +
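
The voting procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the (area, roundness) values are invented stand-ins for the scatter plot, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    train: list of ((feature1, feature2), label) pairs.
    """
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest examples and return the winner.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented (area, roundness) values; not taken from the slides.
train = [((10, 60), "+"), ((20, 70), "+"), ((30, 55), "+"), ((35, 35), "+"),
         ((70, 30), "-"), ((80, 20), "-"), ((75, 45), "-")]

print(knn_predict(train, (40, 50), k=3))
```

Choosing an odd k avoids tied votes in a two-class problem.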

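Linear Discriminants

•  Fit multivariate Gaussian to each class

•  Measure distance from ? to each Gaussian

[Figure: scatter plot of + and - training examples with the unlabeled point '?'; axes: area and bright.]

Decision trees

•  Again we want to label '?'

[Figure: scatter plot of + and - training examples with the unlabeled point '?'; axes: Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness)]

Slide courtesy of Christos Faloutsos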

Decision trees

•  so we build a decision tree:

[Figure: the same scatter plot with split lines drawn at 50 and 40; axes: Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness)]

Slide courtesy of Christos Faloutsos

Decision trees

•  so we build a decision tree:

area < 50
   Y: +
   N: round. < 40
      Y: -
      N: ...

[Figure: scatter plot with axes 'area' and round., split at area = 50 and round. = 40]

Slide courtesy of Christos Faloutsos

Decision trees

•  Goal: split address space in (almost) homogeneous regions

[Figure: the decision tree (area < 50, then round. < 40) beside the scatter plot split at area = 50 and round. = 40]

Slide courtesy of Christos Faloutsos
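
One way to find a split that yields (almost) homogeneous regions is to score every candidate threshold by the impurity of the two sides. The sketch below uses Gini impurity, which is an assumed criterion since the slides do not name one; the data values and function names are likewise invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value < threshold`."""
    best = None
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        # Weight each side's impurity by the fraction of points it holds.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Invented 'area' values and class labels; not taken from the slides.
area = [10, 20, 30, 45, 50, 70, 80, 90]
label = ["+", "+", "+", "+", "-", "-", "+", "-"]

threshold, impurity = best_threshold(area, label)
print(threshold, impurity)
```

A full decision tree repeats this search recursively inside each region, over every feature, until the regions are homogeneous enough.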

Support vector machines

•  Again we want to label '?'

[Figure: scatter plot of + and - training examples with the unlabeled point '?'; axes: Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness)]

Slide courtesy of Christos Faloutsos

Support Vector Machines (SVMs)

•  Use single linear separator??

[Figure: scatter plot of + and - training examples with the unlabeled point '?' and a candidate separating line; axes: area and round.]

Slide courtesy of Christos Faloutsos

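Support Vector Machines (SVMs)

•  we want to label '?' - linear separator??

•  A: the one with the widest corridor!

[Figure: scatter plot of + and - training examples with the unlabeled point '?' and the separating line with the widest corridor; axes: area and round.]

Slide courtesy of Christos Faloutsos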

Support Vector Machines (SVMs)

•  What if the points for each class are not readily separated by a straight line?

•  Use the "kernel trick" - project the points into a higher dimensional space in which we hope that straight lines will separate the classes

•  "kernel" refers to the function used for this projection
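
To illustrate the projection idea, the sketch below maps 2-D points into 3-D with an explicit degree-2 feature map and checks two things: the classes become separable by a plane, and a polynomial kernel evaluated in the original space gives the same inner product as the mapped points. The points, the choice of a degree-2 polynomial kernel, and all function names are assumptions for illustration.

```python
import math


def phi(p):
    """Explicit degree-2 feature map from 2-D to 3-D."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (p[0] * q[0] + p[1] * q[1]) ** 2


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Invented points: one class near the origin, the other on a ring around it.
# No straight line in 2-D separates them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, -1.5)]


def side(p):
    """Signed position relative to the plane z1 + z3 = 1 in the mapped space."""
    z = phi(p)
    return z[0] + z[2] - 1.0


print([side(p) < 0 for p in inner], [side(p) > 0 for p in outer])

# The kernel equals the inner product of the mapped points.
print(math.isclose(poly_kernel(inner[2], outer[2]),
                   dot(phi(inner[2]), phi(outer[2]))))
```

Because the kernel value can be computed without ever forming `phi`, an SVM can work in the higher dimensional space at the cost of a calculation in the original one.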

Support Vector Machines (SVMs)

•  Definition of SVMs explicitly considers only two classes

•  What if we have more than two classes?

•  Train multiple SVMs

•  Two basic approaches
   –  One against all (one SVM for each class)
   –  Pairwise SVMs (one for each pair of classes)
      –  Various ways of implementing this
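
The two approaches differ in how many binary classifiers must be trained. The short sketch below only enumerates the binary problems; it does not train any SVM. The ten class names are borrowed from the multi-class confusion matrix shown later in the deck, and the variable names are hypothetical.

```python
from itertools import combinations

classes = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# One against all: one binary problem per class (that class vs. everything else).
one_vs_all = [(c, "rest") for c in classes]

# Pairwise: one binary problem for every unordered pair of classes.
pairwise = list(combinations(classes, 2))

print(len(one_vs_all), len(pairwise))
```

For n classes this gives n classifiers for one against all and n(n-1)/2 for pairwise, so the pairwise count grows much faster as classes are added.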

Questions

•  What are the hypothesis spaces for
   –  kNN classifier
   –  Linear discriminants
   –  Decision trees
   –  Support Vector Machines

Cross-Validation

•  If we train a classifier to minimize error on a set of data, we have no ability to estimate the (generalization) error that will be seen on a new data set

•  To calculate generalizable accuracy, we use n-fold cross-validation

•  Divide images into n sets, train using n-1 of them and test on the remaining set

•  Repeat until each set is used as test set and average results across all trials

•  A variation on this, in which each set holds a single image, is called leave-one-out
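
The procedure can be sketched as follows. This is a simplified illustration: the fold assignment is a plain interleaving rather than a random shuffle, and the "classifier" is a toy majority-label predictor standing in for a real one. All names and data are assumptions.

```python
def n_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices) so each item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(n_folds):
        test = indices[fold::n_folds]            # every n_folds-th item
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def cross_validate(items, labels, n_folds, train_fn, predict_fn):
    """Average test accuracy over the folds."""
    accuracies = []
    for train, test in n_fold_splits(len(items), n_folds):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy stand-in classifier: always predict the most common training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


labels = ["+"] * 8 + ["-"] * 2     # invented labels
items = list(range(10))            # placeholder feature values

print(cross_validate(items, labels, 5, train_majority, predict_majority))
# Leave-one-out is the special case n_folds == number of items.
print(cross_validate(items, labels, len(items), train_majority, predict_majority))
```

In practice the items are usually shuffled before being divided into folds, so that each fold has a similar mix of classes.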

Describing classifier errors

•  For binary classifiers (positive or negative), define
   –  TP = true positives, FP = false positives
   –  TN = true negatives, FN = false negatives
   –  Recall = TP / (TP + FN)
   –  Precision = TP / (TP + FP)
   –  F-measure = 2 * Recall * Precision / (Recall + Precision)
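
These definitions translate directly into code. The sketch below counts the four outcomes from paired true and predicted labels and applies the formulas above; the example labels and the function name `binary_metrics` are invented for illustration.

```python
def binary_metrics(true_labels, predicted_labels, positive="+"):
    """Return (TP, FP, TN, FN, recall, precision, f_measure)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # These divisions assume at least one true positive case and
    # at least one positive prediction.
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return tp, fp, tn, fn, recall, precision, f_measure


# Invented example: 4 true positives cases, 6 true negative cases.
true_labels = ["+", "+", "+", "+", "-", "-", "-", "-", "-", "-"]
predicted = ["+", "+", "+", "-", "+", "-", "-", "-", "-", "-"]

print(binary_metrics(true_labels, predicted))
```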

Confusion matrix - binary

True \ Predicted    Positive          Negative
Positive            True Positive     False Negative
Negative            False Positive    True Negative

Precision-recall analysis

Vary classifier parameter to "loosen" some performance estimate: i.e., confidence

[Figure: precision-recall plot; annotation: Ideal performance]

Describing classifier errors

•  For multi-class classifiers, typically report
   –  Accuracy = (# test images correctly classified) / (# test images)
   –  Confusion matrix = table showing all possible combinations of true class and predicted class

Confusion matrix - multi-class

Overall accuracy = 98%

Rows: True Class. Columns: Output of the Classifier.

     DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA   98    2    0    0    0    0    0    0    0    0
ER     0  100    0    0    0    0    0    0    0    0
Gia    0    0  100    0    0    0    0    0    0    0
Gpp    0    0    0   96    4    0    0    0    0    0
Lam    0    0    0    4   95    0    0    0    0    2
Mit    0    0    2    0    0   96    0    2    0    0
Nuc    0    0    0    0    0    0  100    0    0    0
Act    0    0    0    0    0    0    0  100    0    0
TfR    0    0    0    0    2    0    0    0   96    2
Tub    0    2    0    0    0    0    0    0    0   98
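
As a quick check of how the overall figure relates to the table, the sketch below averages the diagonal of the matrix. It assumes the entries are percentages of each true class and that every class has the same number of test images; neither assumption is stated on the slide.

```python
classes = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# Rows: true class. Columns: output of the classifier (values as on the slide).
confusion = [
    [98, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 100, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 100, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 96, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 95, 0, 0, 0, 0, 2],
    [0, 0, 2, 0, 0, 96, 0, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 100, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 96, 2],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 98],
]

# Per-class accuracy is the diagonal entry of each row.
per_class = [confusion[i][i] for i in range(len(classes))]

# With equal class sizes, overall accuracy is the mean of the diagonal.
overall = sum(per_class) / len(per_class)
print(round(overall))
```

Under those assumptions the mean of the diagonal is 97.9, which rounds to the 98% quoted on the slide.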

Ground truth

•  What is the source and confidence of a class label?

•  Most common: Human assignment, unknown confidence

•  Preferred: Assignment by experimental design, confidence ~100%

Feature selection

•  Having too many features can confuse a classifier

•  Can use comparison of feature distributions between classes to choose a subset of features that gets rid of uninformative or redundant features

Basic principle of feature selection

[Figure: four histogram panels (Feature 1, Feature 2, Feature 3, Feature 4) comparing feature distributions between classes; red = class 1, blue = class 2]

Need to consider multivariate distance

Figure from Guyon & Elisseeff

Bad and Good Covariance

Figure from Guyon & Elisseeff

Feature Selection Methods

•  Principal Components Analysis

•  Non-Linear Principal Components Analysis

•  Independent Components Analysis

•  Information Gain

•  Stepwise Discriminant Analysis

•  Genetic Algorithms

Regression

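Linear regression

[Figure: plots of Temperature data, one over two input dimensions and one over a single input dimension]

[start Matlab demo lecture2.m]

Given examples [symbols not preserved]

Predict [symbol not preserved] given a new point [symbol not preserved]

Slide courtesy Roman Thibaux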

Linear regression

[Figure: the same Temperature plots with the fitted line and plane; label: Prediction]

Ordinary Least Squares (OLS)

[Figure: fitted line with labels: Observation, Prediction, Error or "residual"]

Sum squared error

Slide courtesy Roman Thibaux
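
For a single input variable, the line that minimizes the sum squared error has a closed-form solution. The sketch below applies the standard textbook formulas to invented temperature readings; the data and the function names are assumptions, and the Matlab demo referenced on the slides is not reproduced here.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared residuals (observation minus prediction)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Invented data: position along one axis vs. temperature.
xs = [0, 5, 10, 15, 20]
ys = [21.0, 22.5, 23.0, 25.5, 26.0]

slope, intercept = ols_fit(xs, ys)
print(slope, intercept, sum_squared_error(xs, ys, slope, intercept))
```

Any other choice of slope or intercept gives a larger sum squared error on the same data, which is what "least squares" means.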

Beyond lines and planes

everything is the same with [equation not preserved]

still linear in [symbol not preserved]

[Figure: curve fitted to data]

Slide courtesy Roman Thibaux

Geometric interpretation

[Matlab demo]

[Figure: 3-D plot from the Matlab demo]

Slide courtesy Roman Thibaux

Assumptions vs. Reality

[Figure: Intel sensor network data; labels: Voltage, Temperature]

Slide courtesy Roman Thibaux

Overfitting

[Matlab demo]

[Figure: Degree 15 polynomial fitted to data]

Slide courtesy Roman Thibaux

Sensitivity to outliers

High weight given to outliers

[Figure: Temperature at noon data; label: Influence function]

Slide courtesy Roman Thibaux

Kernel Regression

[Figure: Kernel regression (sigma=1)]

Spline Regression

Regression on each interval

[Figure: separate fits on each interval of the data]

Slide courtesy Roman Thibaux

Spline Regression

With equality constraints

[Figure: fits on each interval, with equality constraints]

Slide courtesy Roman Thibaux

Cluster analysis

•  Supervised learning (Classification) assumes classes are known

•  Unsupervised learning (Cluster analysis) seeks to discover the classes

Formal description

•  Given X as a set of instances described by features

•  Given an objective function g

•  Given a partition space H

•  Determine a partition h in H such that h(X) maximizes/minimizes g(h(X))

Formal description

•  objective function g often stated in terms of minimizing a distance function d

•  Example: Euclidean distance

Hierarchical vs. k-means clustering

•  Two most popular clustering algorithms

•  Hierarchical builds tree sequentially from the closest pair of points (wells/cells/probes/conditions)

•  k-means starts with k randomly chosen seed points, assigns each remaining point to the nearest seed, moves each seed to the mean of its assigned points, and repeats this until no point moves
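
A minimal k-means sketch is shown below. It follows the usual textbook loop (assign each point to the nearest center, recompute each center as the mean of its members, stop when assignments no longer change). The seeding by random sampling, the fixed random seed, the iteration cap, and the toy data are all assumptions made for illustration.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k randomly chosen seed points
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # no point moved: converged
            break
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Invented data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

centers, assignment = k_means(points, k=2)
print(sorted(centers), assignment)
```

On less cleanly separated data the result can depend on the initial seeds, so k-means is often run several times with different seeds.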

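Hierarchical Clustering

[Figure: points A, B, C, D, E, F in feature space and the tree built from them; clusters formed: BC and DE, then DEF, then BCDEF, then ABCDEF]

Slide courtesy of Elvira Garcia Osuna

Hierarchical Clustering

[Figure: the same points A-F with the tree drawn in leaf order D, E, F, B, C, A]

Slide courtesy of Elvira Garcia Osuna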

K-means

[Figure: series of seven K-means illustration frames; images not preserved, one frame carries the labels 1 and 2]

Slide courtesy of Elvira Garcia Osuna

Choosing the number of Clusters

•  A difficult problem

•  Most common approach is to try to find the solution that minimizes the Bayesian Information Criterion

L = the likelihood function for the estimated model

k = # of parameters

n = # of samples

$$\mathrm{BIC} = -2\ln L + k\ln(n)$$
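
The criterion is straightforward to evaluate once a model has been fitted. The sketch below computes BIC from a log-likelihood and picks the candidate with the lowest value; the log-likelihoods, parameter counts, and sample size are hypothetical numbers, not results from any real clustering run.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """BIC = -2 ln L + k ln(n); lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


# Hypothetical fits: (number of clusters, maximized ln L, number of parameters).
candidates = [(2, -520.0, 5), (3, -480.0, 8), (4, -476.0, 11)]
n_samples = 200

scores = {k: bic(ll, params, n_samples) for k, ll, params in candidates}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In this made-up example the fourth cluster improves the likelihood only slightly, so the penalty term k ln(n) outweighs the gain and three clusters are preferred.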

Microarray raw data

•  Label mRNA from one sample with a red fluorescence probe (Cy5) and mRNA from another sample with a green fluorescence probe (Cy3)

•  Hybridize to a chip with specific DNAs fixed to each well

•  Measure amounts of green and red fluorescence

Flash animations:
PCR http://www.maxanim.com/genetics/PCR/PCR.htm
Microarray http://www.bio.davidson.edu/Courses/genomics/chip/chip.html

Example microarray image

[Figure: example microarray image]

mRNA expression microarray data for 9800 genes (gene number shown vertically) for 0 to 24 h (time shown horizontally) after addition of serum to a human cell line that had been deprived of serum (from http://genome-www.stanford.edu/serum)

Data extraction

•  Adjust fluorescent intensities using standards (as necessary)

•  Calculate ratio of red to green fluorescence

•  Convert to log2 and round to integer

•  Display saturated green = -3 to black = 0 to saturated red = +3
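
The ratio, log2, and rounding steps can be sketched as below. Clipping values to the range -3 to +3 is an interpretation of "saturated" on the slide, and the intensity values and function name are invented for illustration; the adjustment against standards is not modeled.

```python
import math


def expression_code(red, green, limit=3):
    """Rounded log2(red / green), clipped to [-limit, +limit] for display."""
    value = round(math.log2(red / green))
    return max(-limit, min(limit, value))


# Invented (red, green) intensity pairs.
spots = [(800.0, 100.0), (100.0, 100.0), (50.0, 1600.0), (300.0, 100.0)]

codes = [expression_code(r, g) for r, g in spots]
print(codes)
```

Positive codes mean more red than green signal, negative codes mean more green than red, and 0 means roughly equal amounts.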

Distances

•  High dimensionality

•  Based on vector geometry - how close are two data points?

•  Use distances to determine clusters

          Array 1   Array 2
Gene 1       1         4
Gene 2       1         3

[Figure: Gene 1 and Gene 2 plotted as points; axes: Array 1 and Array 2]

Distance(Gene 1, Gene 2) = 1

General Multivariate Dataset

•  We are given values of p variables for n independent observations

•  Construct an n x p matrix M consisting of vectors X_1 through X_n each of length p

Multivariate Sample Mean

•  Define mean vector I of length p

$$I(j) = \frac{\sum_{i=1}^{n} M(i,j)}{n}$$  (matrix notation)

or

$$I = \frac{\sum_{i=1}^{n} X_i}{n}$$  (vector notation)

Multivariate Variance

•  Define variance vector σ² of length p

$$\sigma^2(j) = \frac{\sum_{i=1}^{n} \left(M(i,j) - I(j)\right)^2}{n-1}$$  (matrix notation)

Multivariate Variance

•  or

$$\sigma^2 = \frac{\sum_{i=1}^{n} \left(X_i - I\right)^2}{n-1}$$  (vector notation)

Covariance Matrix

•  Define a p x p matrix cov (called the covariance matrix) analogous to σ²

$$\mathrm{cov}(j,k) = \frac{\sum_{i=1}^{n} \left(M(i,j) - I(j)\right)\left(M(i,k) - I(k)\right)}{n-1}$$

Covariance Matrix

•  Note that the covariance of a variable with itself is simply the variance of that variable

$$\mathrm{cov}(j,j) = \sigma^2(j)$$
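
The mean vector and covariance matrix defined above can be computed directly from an n x p matrix stored as a list of rows. The sketch below is a plain-Python illustration with an invented 4 x 2 data matrix; the function names are hypothetical, and the variable `mean` plays the role of the mean vector in the formulas.

```python
def mean_vector(M):
    """Mean of each of the p columns of an n x p matrix (list of rows)."""
    n, p = len(M), len(M[0])
    return [sum(M[i][j] for i in range(n)) / n for j in range(p)]


def covariance_matrix(M):
    """p x p sample covariance matrix with the n - 1 denominator."""
    n, p = len(M), len(M[0])
    mean = mean_vector(M)
    return [
        [
            sum((M[i][j] - mean[j]) * (M[i][k] - mean[k]) for i in range(n)) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


# Invented data: n = 4 observations of p = 2 variables.
M = [[1.0, 4.0], [1.0, 3.0], [3.0, 8.0], [3.0, 9.0]]

mean = mean_vector(M)
cov = covariance_matrix(M)
print(mean)
print(cov)
```

The diagonal entries cov[j][j] are the variances of the individual variables, and the matrix is symmetric because cov(j,k) = cov(k,j).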

Univariate Distance

•  The simple distance between the values of a single variable j for two observations i and l is

$$M(i,j) - M(l,j)$$

Univariate z-score Distance

•  To measure distance in units of standard deviation between the values of a single variable j for two observations i and l we define the z-score distance

$$\frac{M(i,j) - M(l,j)}{\sigma(j)}$$

Bivariate Euclidean Distance

•  The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance

$$\sqrt{\left(M(i,j) - M(l,j)\right)^2 + \left(M(i,k) - M(l,k)\right)^2}$$

[Figure: observations i and l plotted against variables j and k, with coordinates M(i,j), M(l,j), M(i,k), M(l,k)]

Multivariate Euclidean Distance

•  This can be extended to more than two variables

$$\sqrt{\sum_{j=1}^{p} \left(M(i,j) - M(l,j)\right)^2}$$
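
Both distances are short to implement. In the sketch below, the Euclidean example uses the Gene 1 and Gene 2 values from the Distances slide; the matrix used for the z-score example is invented, the function names are hypothetical, and taking the absolute value in the z-score distance is an assumption so that the result is non-negative.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def z_score_distance(M, i, l, j):
    """Distance between observations i and l on variable j, in standard deviations."""
    n = len(M)
    column = [row[j] for row in M]
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    # abs() is an assumption; the slide formula shows the plain difference.
    return abs(M[i][j] - M[l][j]) / sd


# Values from the Distances slide: (Array 1, Array 2) for each gene.
gene1 = (1, 4)
gene2 = (1, 3)
print(euclidean(gene1, gene2))  # 1.0

# Invented 4 x 2 data matrix for the z-score example.
M = [[1.0, 4.0], [1.0, 3.0], [3.0, 8.0], [3.0, 9.0]]
print(z_score_distance(M, 0, 2, 0))
```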

Effects of variance and covariance on Euclidean distance

Points A and B have similar Euclidean distances from the mean, but point B is clearly "more different" from the population than point A.

[Figure: points A and B relative to an ellipse]

The ellipse shows the 50% contour of a hypothetical population.

Mahalanobis Distance

•  To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance

$$D^2 = (X_i - X_l)\,\mathrm{cov}^{-1}\,(X_i - X_l)^T$$
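
The effect described above can be reproduced with a small two-variable example. The sketch below is restricted to 2 x 2 covariance matrices so that the inverse can be written out by hand; the covariance values, the two points, and the function names are invented for illustration.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_squared(x, y, cov):
    """D^2 = (x - y) cov^-1 (x - y)^T for two-variable observations."""
    inv = inverse_2x2(cov)
    diff = [x[0] - y[0], x[1] - y[1]]
    # Row vector times matrix, then times the column vector.
    tmp = [diff[0] * inv[0][0] + diff[1] * inv[1][0],
           diff[0] * inv[0][1] + diff[1] * inv[1][1]]
    return tmp[0] * diff[0] + tmp[1] * diff[1]


# Invented population: two positively correlated variables, mean at the origin.
cov = [[4.0, 3.0], [3.0, 4.0]]
mean = (0.0, 0.0)

point_a = (2.0, 2.0)     # lies along the direction of correlation
point_b = (2.0, -2.0)    # same Euclidean distance, against the correlation

print(mahalanobis_squared(point_a, mean, cov))
print(mahalanobis_squared(point_b, mean, cov))
```

Both points are the same Euclidean distance from the mean, but the point lying against the direction of correlation gets a much larger Mahalanobis distance, matching the A versus B illustration.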

Other distance functions

•  We can use other distance functions, including ones in which the weights on each variable are learned

•  Cluster analysis tools for microarray data most commonly use Pearson correlation coefficient
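
A Pearson-based distance compares the shape of two expression profiles rather than their absolute levels. The sketch below computes the correlation coefficient and converts it to a distance as 1 - r, which is one common convention and an assumption here, since the slides do not say how the coefficient is turned into a distance. The profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Assumed convention: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


# Invented expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend

print(correlation_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_c))
```

Two genes with the same pattern at very different magnitudes end up close together under this distance, whereas Euclidean distance would place them far apart.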

Input data for clustering

•  Genes in rows, conditions in columns

YORF | NAME | GWEIGHT | Cell-cycle Alpha-Factor 1 | Cell-cycle Alpha-Factor 2 | Cell-cycle Alpha-Factor 3
EWEIGHT | | | 1 | 1 | 1
YHR051W | YHR051W COX6 oxidative phosphorylation cytochrome-c oxidase subunit VI S0001093 | 1 | 0.03 | 0.3 | 0.37
YKL181W | YKL181W PRS1 purine, pyrimidine, tryptophan phosphoribosylpyrophosphate synthetase S0001664 | 1 | 0.33 | -0.2 | -0.12
YHR124W | YHR124W NDT80 meiosis transcription factor S0001166 | 1 | 0.36 | 0.08 | 0.06
YHL020C | YHL020C OPI1 phospholipid metabolism negative regulator of phospholipid biosynthes S0001012 | 1 | -0.01 | -0.03 | 0.21
YGR072W | YGR072W UPF3 mRNA decay, nonsense-mediated unknown S0003304 | 1 | 0.2 | -0.43 | -0.22
YGR145W | YGR145W unknown unknown; similar to MESA gene of Plasmodium f S0003377 | 1 | 0.11 | -1.15 | -1.03
YGR218W | YGR218W CRM1 nuclear protein targeting nuclear export factor S0003450 | 1 | 0.24 | -0.23 | 0.12
YGL041C | YGL041C unknown unknown S0003009 | 1 | 0.06 | 0.23 | 0.2
YOR202W | YOR202W HIS3 histidine biosynthesis imidazoleglycerol-phosphate dehydratase S0005728 | 1 | 0.1 | 0.48 | 0.86
YCR005C | YCR005C CIT2 glyoxylate cycle peroxisomal citrate synthase S0000598 | 1 | 0.34 | 1.46 | 1.23
YER187W | YER187W unknown unknown; similar to killer toxin Khs1p S0000989 | 1 | 0.71 | 0.03 | 0.11
YBR026C | YBR026C MRF1' mitochondrial respiration ARS-binding protein S0000230 | 1 | -0.22 | 0.14 | 0.14
YMR244W | YMR244W unknown unknown; similar to Nca3p S0004858 | 1 | 0.16 | -0.18 | -0.38
YAR047C | YAR047C unknown unknown S0000083 | 1 | -0.43 | -0.56 | -0.14
YMR317W | YMR317W unknown unknown S0004936 | 1 | -0.43 | -0.03 | 0.21

Clustering genes and conditions

•  Rows and columns can be clustered independently - hierarchical is preferred for visualizing this

Stating Goals vs. Approaches

•  Temptation when first considering using a machine learning approach to a biological problem is to describe the problem as automating the approach that you would use to solve the problem

•  "I need a program to predict how much a gene is expressed by measuring how well its promoter matches a template"

Stating Goals vs. Approaches

•  "I need a program that given a gene sequence predicts how much that gene is expressed by measuring how well its promoter matches a template"

•  "I need a program that given a gene sequence predicts how much that gene is expressed by learning from sequences of genes whose expression is known"

Resources

•  Association for the Advancement of Artificial Intelligence
   –  http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning

•  Machine Learning - Mitchell, Carnegie Mellon
   –  http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html

•  Practical Machine Learning - Jordan, UC Berkeley
   –  http://www.cs.berkeley.edu/~asimma/294-fall06/

•  Learning and Empirical Inference - Rish, Tesauro, Jebara, Vapnik - Columbia
   –  http://www1.cs.columbia.edu/~jebara/6998/

