9/29/08
Machine Learning Approaches to Biological Research: Bioimage Informatics and Beyond
Robert F. Murphy
External Senior Fellow, Freiburg Institute for Advanced Studies
Ray and Stephanie Lane Professor of Computational Biology, Carnegie Mellon University
September 29 - October 1, 2009
Outline
• Basic principles and paradigms of supervised and unsupervised machine learning
• Concepts of automated image analysis
• Approaches for creating predictive models from images
• Active learning paradigms for closed-loop systems of cycles of experimentation, model refinement and model testing
www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
What is Machine Learning?
• Fundamental Question of Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?
• Fundamental Question of Statistics: What can be inferred from data plus a set of modeling assumptions, with what reliability?
Tom Mitchell whitepaper

Fundamental Question of Machine Learning
• How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? – Tom Mitchell
Tom Mitchell whitepaper
Why Machine Learning?
• Learn relationships from large sets of complex data: Data mining
– Predict clinical outcome from tests
– Decide whether someone is a good credit risk
• Do tasks too complex to program by hand
– Autonomous driving
• Customize programs to user needs
– Recommend book/movie based on previous likes
Tom Mitchell whitepaper
Why Machine Learning?
• Economically efficient
• Can consider larger data spaces and hypothesis spaces than people can
• Can formalize learning problem to explicitly identify/describe goals and criteria
Successful Machine Learning Applications
• Speech recognition – Telephone menu navigation
• Computer vision – Mail sorting
• Bio-surveillance – Identifying disease outbreaks
• Robot control – Autonomous driving
• Empirical science
Tom Mitchell whitepaper
Machine Learning Paradigms
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
– Clustering
• Semi-supervised Learning
– Co-training
– Active learning
Supervised Learning
• Approaches
– Classification (discrete predictions)
– Regression (continuous predictions)
• Common considerations
– Representation (features)
– Feature selection
– Functional form
– Evaluation of predictive power
Classification vs. Regression
• If I want to predict whether a patient will die from a disease within six months, that is classification
• If I want to predict how long the patient will live, that is regression
Representation
• Definition of thing or things to be predicted
– Classification: classes
– Regression: regression variable
• Definition of things (instances) to make predictions for
– Individuals
– Families
– Neighborhoods, etc.
• Choice of descriptors (features) to describe different aspects of instances
Formal description
• Define X as a set of instances x described by features
• Given training examples D from X
• Given a target function c that maps X -> {0,1}
• Given a hypothesis space H
• Determine a hypothesis h in H such that h(x) = c(x) for all x in D
Courtesy Tom Mitchell
Inductive learning hypothesis
• Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function over other unobserved examples
Courtesy Tom Mitchell
Hypothesis space
• The hypothesis space determines the functional form
• It defines what are allowable rules/functions for classification
• Each classification method uses a different hypothesis space
Simple two-class problem
[Figure: example images labeled + and -, plus unlabeled images marked ?]
• Describe each image by features
• Train classifier
k-Nearest Neighbor (kNN)
• In feature space, training examples are points
[Figure: scatter plot of + and - training examples; Feature #1 (e.g., 'area') vs. Feature #2 (e.g., roundness)]
k-Nearest Neighbor (kNN)
• We want to label '?'
[Figure: same scatter plot with an unlabeled point '?' added]
k-Nearest Neighbor (kNN)
• Find k nearest neighbors and vote
[Figure: scatter plot; for k = 3, the nearest neighbors of '?' are mostly +, so we label it +]
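The kNN vote can be sketched in a few lines; the feature names ('area', 'roundness') follow the figure, but the function name and toy data below are illustrative, not from the lecture:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training examples by Euclidean distance to the query point
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (area, roundness) -> class
train = [((1.0, 1.0), '+'), ((1.2, 0.9), '+'), ((0.9, 1.1), '+'),
         ((3.0, 3.1), '-'), ((3.2, 2.9), '-'), ((2.8, 3.0), '-')]
print(knn_predict(train, (1.1, 1.0), k=3))  # the three nearest neighbors are all '+'
```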
Linear Discriminants
• Fit multivariate Gaussian to each class
• Measure distance from ? to each Gaussian
[Figure: scatter plot of + and - examples with '?'; axes area vs. brightness]
Decision trees
• Again we want to label '?'
[Figure: scatter plot of + and - examples with '?']
Slide courtesy of Christos Faloutsos
Decision trees
• So we build a decision tree:
[Figure: the scatter plot partitioned at area = 50 and roundness = 40]
Slide courtesy of Christos Faloutsos
Decision trees
• So we build a decision tree:
[Figure: decision tree - area < 50? Y: +; N: roundness < 40? Y: -; N: ...; shown next to the partitioned scatter plot]
Slide courtesy of Christos Faloutsos
Decision trees
• Goal: split address space in (almost) homogeneous regions
[Figure: same decision tree and partitioned scatter plot]
Slide courtesy of Christos Faloutsos
Support vector machines
• Again we want to label '?'
[Figure: scatter plot of + and - examples with '?']
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• Use single linear separator??
[Figure: scatter plot with one candidate linear separator]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• Use single linear separator??
[Figures: a sequence of slides showing different candidate linear separators through the same data]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• We want to label '?': which linear separator??
• A: the one with the widest corridor!
[Figure: scatter plot with the maximum-margin separator and its corridor]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
• What if the points for each class are not readily separated by a straight line?
• Use the "kernel trick": project the points into a higher-dimensional space in which we hope that straight lines will separate the classes
• "Kernel" refers to the function used for this projection
Support Vector Machines (SVMs)
• Definition of SVMs explicitly considers only two classes
• What if we have more than two classes?
• Train multiple SVMs
• Two basic approaches
– One against all (one SVM for each class)
– Pairwise SVMs (one for each pair of classes)
– Various ways of implementing this
Questions
• What are the hypothesis spaces for
– kNN classifier
– Linear discriminants
– Decision trees
– Support Vector Machines
Cross-Validation
• If we train a classifier to minimize error on a set of data, we have no ability to estimate the (generalization) error that will be seen on a new data set
• To calculate generalizable accuracy, we use n-fold cross-validation
• Divide images into n sets, train using n-1 of them and test on the remaining set
• Repeat until each set is used as test set and average results across all trials
• A variation on this is called leave-one-out
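The n-fold procedure above can be sketched as follows; the fold-splitting scheme and the trivial majority-class "model" used in the demo are illustrative choices, not from the lecture:

```python
from collections import Counter

def n_fold_indices(n_items, n_folds):
    """Split item indices into n_folds roughly equal test sets (striding)."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def cross_validate(items, labels, train_fn, predict_fn, n_folds=5):
    """Train on n-1 folds, test on the held-out fold, and average accuracy."""
    accs = []
    for test_idx in n_fold_indices(len(items), n_folds):
        test_set = set(test_idx)
        train_x = [x for i, x in enumerate(items) if i not in test_set]
        train_y = [y for i, y in enumerate(labels) if i not in test_set]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accs.append(correct / len(test_idx))
    return sum(accs) / len(accs)

# Demo with a trivial classifier that always predicts the majority training label
train_fn = lambda xs, ys: Counter(ys).most_common(1)[0][0]
predict_fn = lambda model, x: model
acc = cross_validate(list(range(10)), ['+'] * 8 + ['-'] * 2,
                     train_fn, predict_fn, n_folds=5)
```

With leave-one-out, `n_folds` simply equals the number of items.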
Describing classifier errors
• For binary classifiers (positive or negative), define
– TP = true positives, FP = false positives
– TN = true negatives, FN = false negatives
– Recall = TP / (TP + FN)
– Precision = TP / (TP + FP)
– F-measure = 2 * Recall * Precision / (Recall + Precision)
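These definitions translate directly to a few lines of code; the function name and toy label sequences are illustrative:

```python
def binary_metrics(true, pred, positive='+'):
    """Recall, precision, and F-measure from paired true/predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Toy example: 4 true positives in `true`, one missed, one false alarm
r, p, f = binary_metrics(list('++++--'), list('+++--+'))
```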
Confusion matrix - binary

True \ Predicted    Positive          Negative
Positive            True Positive     False Negative
Negative            False Positive    True Negative
Precision-recall analysis
• Vary classifier parameter to "loosen" some performance estimate: i.e., confidence
[Figure: precision-recall curve, with ideal performance marked]
Describing classifier errors
• For multi-class classifiers, typically report
– Accuracy = (# test images correctly classified) / (# test images)
– Confusion matrix = table showing all possible combinations of true class and predicted class
Confusion matrix - multi-class
Overall accuracy = 98%

True class (rows) vs. output of the classifier (columns):

      DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA    98    2    0    0    0    0    0    0    0    0
ER      0  100    0    0    0    0    0    0    0    0
Gia     0    0  100    0    0    0    0    0    0    0
Gpp     0    0    0   96    4    0    0    0    0    0
Lam     0    0    0    4   95    0    0    0    0    2
Mit     0    0    2    0    0   96    0    2    0    0
Nuc     0    0    0    0    0    0  100    0    0    0
Act     0    0    0    0    0    0    0  100    0    0
TfR     0    0    0    0    2    0    0    0   96    2
Tub     0    2    0    0    0    0    0    0    0   98
Ground truth
• What is the source and confidence of a class label?
• Most common: Human assignment, unknown confidence
• Preferred: Assignment by experimental design, confidence ~100%
Feature selection
• Having too many features can confuse a classifier
• Can use comparison of feature distributions between classes to choose a subset of features that gets rid of uninformative or redundant features
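One simple way to compare feature distributions between classes is to rank each feature by a t-statistic-like separation score and keep the top-scoring features; the function name and toy numbers below are mine, for illustration only:

```python
def separation_score(class1, class2):
    """Score a feature by how far apart the two class means are,
    relative to the within-class spread (a Welch t-statistic-like score)."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled = (var(class1) / len(class1) + var(class2) / len(class2)) ** 0.5
    return abs(mean(class1) - mean(class2)) / pooled

# Feature A separates the classes well; feature B has overlapping distributions
a1, a2 = [1.0, 1.1, 0.9, 1.2], [3.0, 3.1, 2.9, 3.2]
b1, b2 = [1.0, 3.0, 2.0, 2.2], [1.1, 2.9, 2.1, 2.0]
```

Feature selection would then keep the features with the highest scores (redundancy between features still has to be handled separately, as the multivariate slides below note).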
Basic principle of feature selection
[Figure: univariate distributions of Features 1-4 for two classes (red = class 1, blue = class 2); some features separate the classes, others do not]
• Need to consider multivariate distance
Figure from Guyon & Elisseeff
Bad and Good Covariance
[Figure]
Figure from Guyon & Elisseeff
Feature Selection Methods
• Principal Components Analysis
• Non-Linear Principal Components Analysis
• Independent Components Analysis
• Information Gain
• Stepwise Discriminant Analysis
• Genetic Algorithms
Regression
Linear regression
[Figure: 3-D temperature data with a fitted plane]
[start Matlab demo lecture2.m]
• Given examples (input-output pairs)
• Predict the output given a new point
Slide courtesy Roman Thibaux
Linear regression
[Figure: predictions from a fitted line (1-D) and a fitted plane (2-D) on the temperature data]
Ordinary Least Squares (OLS)
[Figure: fitted line with the error or "residual" between each observation and its prediction]
• Minimize the sum squared error
Slide courtesy Roman Thibaux
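In the single-variable case, minimizing the sum of squared residuals has a well-known closed-form solution; this sketch (helper name and toy data are mine) fits y = a + b*x:

```python
def ols_fit(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared residuals
    (closed-form simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept passes through the mean point
    return a, b

a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
```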
Beyond lines and planes
• Everything is the same with nonlinear basis functions of the inputs; the model is still linear in the parameters
[Figure: polynomial fit to 1-D data]
Slide courtesy Roman Thibaux
Geometric interpretation
[Matlab demo]
[Figure: 3-D geometric view of the least-squares fit]
Slide courtesy Roman Thibaux
Assumptions vs. Reality
[Figure: voltage vs. temperature, Intel sensor network data]
Slide courtesy Roman Thibaux
Overfitting
[Matlab demo]
[Figure: degree-15 polynomial fit oscillating between the data points]
Slide courtesy Roman Thibaux
Sensitivity to outliers
• High weight given to outliers
[Figure: temperature at noon, with an outlier distorting the fit; influence function shown]
Slide courtesy Roman Thibaux
Kernel Regression
[Figure: kernel regression fit (sigma = 1)]
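One common form of kernel regression, and a plausible reading of the fit plotted here, is the Nadaraya-Watson estimator: a locally weighted average of the training responses, with Gaussian weights that decay with distance from the query point. A minimal sketch (names and data are illustrative):

```python
import math

def kernel_regression(xs, ys, x0, sigma=1.0):
    """Nadaraya-Watson estimate at x0: Gaussian-weighted average of ys."""
    weights = [math.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# On data lying on a straight line, the estimate at a symmetric interior
# point recovers the line's value there
y0 = kernel_regression([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 2.0, sigma=1.0)
```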
Spline Regression
• Regression on each interval
[Figure: separate fits on adjacent intervals]
Slide courtesy Roman Thibaux
Spline Regression
• With equality constraints
[Figure: piecewise fits constrained to meet at the interval boundaries]
Slide courtesy Roman Thibaux
Cluster analysis
• Supervised learning (classification) assumes classes are known
• Unsupervised learning (cluster analysis) seeks to discover the classes
Formal description
• Given X as a set of instances described by features
• Given an objective function g
• Given a partition space H
• Determine a partition h in H such that h(X) maximizes/minimizes g(h(X))
Formal description
• Objective function g often stated in terms of minimizing a distance function d
• Example: Euclidean distance
Hierarchical vs. k-means clustering
• Two most popular clustering algorithms
• Hierarchical builds tree sequentially from the closest pair of points (wells/cells/probes/conditions)
• k-means starts with k randomly chosen seed points, assigns each remaining point to the nearest seed, and repeats this until no point moves
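The k-means loop in the last bullet can be sketched as follows; this is a minimal version (updating centers as cluster means), and the names and toy points are illustrative:

```python
import math
import random

def kmeans(points, k, seed=0):
    """Pick k random seed points, assign every point to the nearest center,
    recompute each center as its cluster mean, and repeat until no point moves."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    while True:
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:   # no point moved: converged
            return centers, assignment
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old center if a cluster emptied out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))

# Two well-separated toy clusters
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, labels = kmeans(pts, 2)
```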
Hierarchical Clustering
[Figure: points A-F merged stepwise from closest pairs: BC, DE, then DEF, BCDEF, ABCDEF, shown alongside the dendrogram]
Slide courtesy of Elvira Garcia Osuna
Hierarchical Clustering
[Figure: resulting dendrogram over points A-F]
Slide courtesy of Elvira Garcia Osuna
K-means
[Figures: a sequence of slides showing two seed points chosen, points assigned to the nearest seed, centers recomputed, and assignments repeated until no point moves]
Slide courtesy of Elvira Garcia Osuna
Choosing the number of Clusters
• A difficult problem
• Most common approach is to try to find the solution that minimizes the Bayesian Information Criterion (BIC)

BIC = -2 ln L + k ln(n)

L = the likelihood function for the estimated model
k = # of parameters
n = # of samples
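The criterion above rewards fit (via ln L) but penalizes parameters; the clustering with the lowest BIC is preferred. A sketch of using it to pick k, where the log-likelihoods, the parameters-per-cluster count, and the sample size are all hypothetical numbers chosen for illustration:

```python
import math

def bic(log_likelihood, k_params, n_samples):
    """BIC = -2 ln L + k ln(n); lower is better."""
    return -2 * log_likelihood + k_params * math.log(n_samples)

# Hypothetical log-likelihoods for 2, 3, and 4 clusters: the jump from
# 2 to 3 clusters improves fit a lot, the jump from 3 to 4 barely at all
loglik_by_k = {2: -120.0, 3: -100.0, 4: -99.0}
params_per_cluster = 3   # e.g. a 2-D center plus one variance (assumed)
n = 100

scores = {k: bic(ll, params_per_cluster * k, n) for k, ll in loglik_by_k.items()}
best_k = min(scores, key=scores.get)   # extra parameters stop paying off after k=3
```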
Microarray raw data
• Label mRNA from one sample with a red fluorescence probe (Cy5) and mRNA from another sample with a green fluorescence probe (Cy3)
• Hybridize to a chip with specific DNAs fixed to each well
• Measure amounts of green and red fluorescence
Flash animations:
• PCR: http://www.maxanim.com/genetics/PCR/PCR.htm
• Microarray: http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
Example microarray image
[Figure: mRNA expression microarray data for 9800 genes (gene number shown vertically) for 0 to 24 h (time shown horizontally) after addition of serum to a human cell line that had been deprived of serum (from http://genome-www.stanford.edu/serum)]
Data extraction
• Adjust fluorescent intensities using standards (as necessary)
• Calculate ratio of red to green fluorescence
• Convert to log2 and round to integer
• Display saturated green = -3 to black = 0 to saturated red = +3
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
[Figure: Gene 1 plotted in the space of Array 1 vs. Array 2; table: Gene 1 has value 1 on Array 1 and 4 on Array 2]
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
[Figure: Genes 1 and 2 plotted in the space of Array 1 vs. Array 2; Gene 1 = (1, 4), Gene 2 = (1, 3); Distance(Gene 1, Gene 2) = 1]
Distances
• High dimensionality
• Based on vector geometry: how close are two data points?
• Use distances to determine clusters
[Figure: same plot and table as before]
General Multivariate Dataset
• We are given values of p variables for n independent observations
• Construct an n x p matrix M consisting of vectors X1 through Xn, each of length p
Multivariate Sample Mean
• Define mean vector I of length p

matrix notation: I(j) = ( Σ_{i=1..n} M(i, j) ) / n

vector notation: I = ( Σ_{i=1..n} X_i ) / n
Multivariate Variance
• Define variance vector σ² of length p

matrix notation: σ²(j) = ( Σ_{i=1..n} (M(i, j) - I(j))² ) / (n - 1)
Multivariate Variance
• or

vector notation: σ² = ( Σ_{i=1..n} (X_i - I)² ) / (n - 1)
Covariance Matrix
• Define a p x p matrix cov (called the covariance matrix) analogous to σ²

cov(j, k) = ( Σ_{i=1..n} (M(i, j) - I(j)) (M(i, k) - I(k)) ) / (n - 1)
Covariance Matrix
• Note that the covariance of a variable with itself is simply the variance of that variable

cov(j, j) = σ²(j)
Univariate Distance
• The simple distance between the values of a single variable j for two observations i and l is

M(i, j) - M(l, j)
Univariate z-score Distance
• To measure distance in units of standard deviation between the values of a single variable j for two observations i and l, we define the z-score distance

( M(i, j) - M(l, j) ) / σ(j)
Bivariate Euclidean Distance
• The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance

√( (M(i, j) - M(l, j))² + (M(i, k) - M(l, k))² )

[Figure: observations i and l plotted in the plane of variables j and k, with coordinates M(i, j), M(l, j), M(i, k), M(l, k)]
Multivariate Euclidean Distance
• This can be extended to more than two variables

√( Σ_{j=1..p} (M(i, j) - M(l, j))² )
Effects of variance and covariance on Euclidean distance
[Figure: ellipse showing the 50% contour of a hypothetical population, with points A and B]
• Points A and B have similar Euclidean distances from the mean, but point B is clearly "more different" from the population than point A.
Mahalanobis Distance
• To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance

D² = (X_i - X_l) cov⁻¹ (X_i - X_l)ᵀ
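The formula translates directly to code; in this sketch a 2x2 covariance matrix is inverted by hand to keep it dependency-free, and the names and numbers are illustrative:

```python
def mahalanobis_sq(xi, xl, cov):
    """D^2 = (Xi - Xl) cov^-1 (Xi - Xl)^T for two variables."""
    d = (xi[0] - xl[0], xi[1] - xl[1])
    # Invert the 2x2 covariance matrix by the adjugate formula
    (a, b), (c, e) = cov
    det = a * e - b * c
    inv = ((e / det, -b / det), (-c / det, a / det))
    # Quadratic form d * inv * d^T
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

# With identity covariance, D^2 reduces to squared Euclidean distance
d2_euclid = mahalanobis_sq((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
# A high-variance direction contributes less to the distance
d2_scaled = mahalanobis_sq((3.0, 0.0), (0.0, 0.0), ((9.0, 0.0), (0.0, 1.0)))
```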
Other distance functions
• We can use other distance functions, including ones in which the weights on each variable are learned
• Cluster analysis tools for microarray data most commonly use Pearson correlation coefficient
Input data for clustering
• Genes in rows, conditions in columns
[Table: excerpt of a yeast expression data file with columns YORF, NAME, gene annotations, SGD identifier, GWEIGHT, and expression values for Cell-cycle Alpha-Factor time points 1-3; e.g. YHR051W (COX6, oxidative phosphorylation): 0.03, 0.3, 0.37]
Clustering genes and conditions
• Rows and columns can be clustered independently; hierarchical is preferred for visualizing this
Stating Goals vs. Approaches
• The temptation when first considering using a machine learning approach to a biological problem is to describe the problem as automating the approach that you would use to solve the problem
• "I need a program to predict how much a gene is expressed by measuring how well its promoter matches a template"
Stating Goals vs. Approaches
• "I need a program that, given a gene sequence, predicts how much that gene is expressed by measuring how well its promoter matches a template"
• "I need a program that, given a gene sequence, predicts how much that gene is expressed by learning from sequences of genes whose expression is known"
Resources
• Association for the Advancement of Artificial Intelligence
– http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning
• Machine Learning - Mitchell, Carnegie Mellon
– http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
• Practical Machine Learning - Jordan, UC Berkeley
– http://www.cs.berkeley.edu/~asimma/294-fall06/
• Learning and Empirical Inference - Rish, Tesauro, Jebara, Vapnik - Columbia
– http://www1.cs.columbia.edu/~jebara/6998/