Introduction to Pathway and Network Analysis
Alison Motsinger-Reif, PhD
Associate Professor
Bioinformatics Research Center
Department of Statistics
North Carolina State University
Pathway and Network Analysis
• High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system
• Analysis of high-throughput data typically yields a list of differentially expressed genes, proteins, metabolites…
  – Typically provides lists of single genes, etc.
  – Will use “genes” throughout, but the terms are mostly interchangeable
• This list often fails to provide mechanistic insights into the underlying biology of the condition being studied
• How to extract meaning from a long list of differentially expressed genes → pathway/network analysis
What makes an airplane fly?
Chas' Stainless Steel, Mark Thompson's Airplane Parts, About 1000 Pounds of Stainless Steel Wire, and Gagosian's Beverly Hills Space
From components to networks
A biological function is a result of many interacting molecules and cannot be attributed to just a single molecule.
Pathway and Network Analysis
• One approach: simplify analysis by grouping long lists of individual genes into smaller sets of related genes, which reduces the complexity of analysis
  – A large number of knowledge bases have been developed to help with this task
• Knowledge bases
  – describe biological processes, components, or structures in which individual genes are known to be involved
  – describe how and where gene products interact with each other
Pathway and Network Analysis
• Analysis at the functional level is appealing for two reasons:
  – First, grouping thousands of genes by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment
  – Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of genes
Pathway and Network Analysis
• What kinds of data are used for such analysis?
  – Gene expression data
    • Microarrays
    • RNA-seq
  – Proteomic data
  – Metabolomics data
  – Single nucleotide polymorphisms (SNPs)
  – ….
Pathway and Network Analysis
• The term “pathway analysis” gets used often, and often in different ways
  – applied to the analysis of Gene Ontology (GO) terms (also referred to as a “gene set”)
  – physical interaction networks (e.g., protein–protein interactions)
  – kinetic simulation of pathways
  – steady-state pathway analysis (e.g., flux-balance analysis)
  – inference of pathways from expression and sequence data
• May or may not actually describe biological pathways
Pathway and Network Analysis
• For the first part of this module, we will focus on methods that exploit pathway knowledge in public repositories rather than on methods that infer pathways from molecular measurements
  – Use repositories such as GO or the Kyoto Encyclopedia of Genes and Genomes (KEGG)
→ knowledge base–driven pathway analysis
A History of Pathway Analysis Approaches
• Over a decade of development of pathway analysis approaches
• Can be roughly divided into three generations:
  – 1st: Over-Representation Analysis (ORA) Approaches
  – 2nd: Functional Class Scoring (FCS) Approaches
  – 3rd: Pathway Topology (PT)-Based Approaches
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
• The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods.
• ORA methods require that the input is a list of differentially expressed genes
• FCS methods use the entire data matrix as input
• PT-based methods additionally utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database.
• The result of every pathway analysis method is a list of significant pathways in the condition under study.
Over-Representation Analysis (ORA) Approaches
• Earliest methods → over-representation analysis (ORA)
• Statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression
• It is also referred to as the “2×2 table method” in the literature
Over-Representation Analysis (ORA)
• Uses one or more variations of the following strategy:
  – First, an input list is created using a certain threshold or criterion
    • For example, may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%
  – Then, for each pathway, input genes that are part of the pathway are counted
  – This process is repeated for an appropriate background list of genes
    • (e.g., all genes measured on a microarray)
  – Next, every pathway is tested for over- or under-representation in the list of input genes
    • The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution
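The counting-and-testing strategy above can be sketched in a few lines of Python. This is a minimal illustration that uses the hypergeometric upper tail (one of the tests named above), not the implementation of any particular ORA tool. The function names `hypergeom_upper_tail` and `ora`, the gene identifiers, and the pathway name are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def ora(input_genes, background, pathways):
    """Return {pathway: (overlap, p_value)} for over-representation."""
    background = set(background)
    input_genes = set(input_genes) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background  # only measured genes count
        overlap = len(input_genes & members)
        p = hypergeom_upper_tail(overlap, len(background), len(members), len(input_genes))
        results[name] = (overlap, p)
    return results


# Toy data (hypothetical): 20 measured genes, a 5-gene pathway, 4 input genes.
background = {f"g{i}" for i in range(1, 21)}
pathways = {"toy_pathway": {"g1", "g2", "g3", "g4", "g5"}}
input_genes = {"g1", "g2", "g3", "g10"}

print(ora(input_genes, background, pathways))
```

Note that only the counts enter the test: the p-value would be identical whether the three overlapping genes changed 2-fold or 20-fold.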
Limitations of ORA Approaches
• First, the different statistics used by ORA are independent of the measured changes
  – (e.g., hypergeometric distribution, binomial distribution, chi-square distribution, etc.)
• Tests consider the number of genes alone but ignore any values associated with them
  – such as probe intensities
• By discarding this data, ORA treats each gene equally
  – Information about the extent of regulation (e.g., fold-changes, significance of a change, etc.) can be useful in assigning different weights to input genes/pathways
  – This can provide more information
Limitations of ORA Approaches
• Second, ORA typically uses only the most significant genes and discards the others
  – input list of genes is usually obtained using an arbitrary threshold (e.g., genes selected by fold-change and/or p-value cutoffs)
• Marginally less significant genes are missed, resulting in information loss
  – (e.g., fold-change = 1.999 or p-value = 0.051)
  – A few methods avoid thresholds
    • They use an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant
Limitations of ORA Approaches
• Third, ORA assumes that each gene is independent of the other genes
• However, biology is a complex web of interactions between gene products that constitute different pathways
  – One goal might be to gain insights into how interactions between gene products are manifested as changes in expression
  – A strategy that assumes the genes are independent is significantly limited in its ability to provide insights
• Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (more later), which ignores the correlation structure between genes
  – the estimated significance of a pathway may be biased or incorrect
Limitations of ORA Approaches
• Fourth, ORA assumes that each pathway is independent of other pathways → NOT TRUE!
• Examples of dependence:
  – GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions
  – The cell cycle pathway in KEGG, where the presence of a growth factor activates the MAPK signaling pathway
    • This, in turn, activates the cell cycle pathway
• No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG
Functional Class Scoring (FCS) Approaches
• The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects
• With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps.
Step 1
• First, a gene-level statistic is computed using the molecular measurements from an experiment
  – Involves computing differential expression of individual genes or proteins
• Statistics currently used at gene-level include:
  – correlation of molecular measurements with phenotype
  – ANOVA
  – Q-statistic
  – signal-to-noise ratio
  – t-test
  – Z-score
Step 1
• Choice of a gene-level statistic generally has a negligible effect on the identification of significantly enriched gene sets
  – However, when there are few biological replicates, a regularized statistic may be better
• Untransformed gene-level statistics can fail to identify pathways with up- and down-regulated genes
  – In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is better
Step 2
• Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
  – can be multivariate and account for interdependencies among genes
  – can be univariate and disregard interdependencies among genes
• The pathway-level statistics used include:
  – Kolmogorov-Smirnov statistic
  – sum, mean, or median of gene-level statistic
  – Wilcoxon rank sum
  – maxmean statistic
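Steps 1 and 2 can be illustrated together with a small Python sketch. It assumes a Welch t-statistic as the gene-level statistic and the mean as a univariate pathway-level statistic; both are just one choice among those listed, and the names `t_statistic`, `pathway_score`, and the toy genes are hypothetical. It also shows why a transformation such as the absolute value helps when a pathway contains both up- and down-regulated genes.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(case, control):
    """Welch t-statistic for one gene (gene-level statistic, Step 1)."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def pathway_score(expr, pathway, transform=lambda t: t):
    """Mean of (optionally transformed) gene-level statistics (Step 2)."""
    return mean(transform(t_statistic(*expr[g])) for g in pathway)


# Toy data (hypothetical): gene -> (case samples, control samples).
expr = {
    "up1": ([5.0, 5.2, 4.8], [3.0, 3.1, 2.9]),
    "down1": ([3.0, 3.1, 2.9], [5.0, 5.2, 4.8]),
}
pathway = ["up1", "down1"]

raw = pathway_score(expr, pathway)                      # opposite signs cancel
absolute = pathway_score(expr, pathway, transform=abs)  # direction is ignored
print(raw, absolute)
```

With the untransformed statistic the two strongly regulated genes cancel and the pathway looks unaffected; with absolute values the coordinated change is visible.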
Step 2
• Irrespective of its type, the power of a pathway-level statistic depends on
  – the proportion of differentially expressed genes in a pathway
  – the size of the pathway
  – the amount of correlation between genes in the pathway
• Univariate statistics show more power at stringent cutoffs when applied to real biological data, and power equal to that of multivariate statistics at less stringent cutoffs
Step 3
• Assessing the statistical significance of the pathway-level statistic
• When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories:
  – i) competitive null hypothesis
  – ii) self-contained null hypothesis
• A test of a self-contained null hypothesis permutes class labels (i.e., phenotypes) for each sample and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway
• A test of a competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway
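The difference between the two null hypotheses is easiest to see in what gets permuted. The sketch below is a simplified illustration, not a published method: it assumes an absolute difference in class means as the gene-level statistic and the mean as the pathway-level statistic, and it approximates gene-label permutation by drawing random gene sets of the same size. All names (`gene_stat`, `self_contained_p`, `competitive_p`) and the toy data are hypothetical.

```python
import random
from statistics import mean


def gene_stat(values, labels):
    """Absolute difference in mean expression between the two classes."""
    case = [v for v, lab in zip(values, labels) if lab == 1]
    ctrl = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(case) - mean(ctrl))


def pathway_stat(expr, labels, genes):
    return mean(gene_stat(expr[g], labels) for g in genes)


def self_contained_p(expr, labels, pathway, n_perm=500, seed=0):
    """Permute sample (class) labels; only genes in the pathway are used."""
    rng = random.Random(seed)
    observed = pathway_stat(expr, labels, pathway)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pathway_stat(expr, shuffled, pathway) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def competitive_p(expr, labels, pathway, n_perm=500, seed=0):
    """Permute gene labels: compare the pathway with random gene sets."""
    rng = random.Random(seed)
    stats = {g: gene_stat(v, labels) for g, v in expr.items()}
    observed = mean(stats[g] for g in pathway)
    all_genes = sorted(stats)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        if mean(stats[g] for g in random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 3 pathway genes change, 17 background genes do not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {f"p{i}": [5.0, 6.0, 5.0, 6.0, 1.0, 2.0, 1.0, 2.0] for i in range(1, 4)}
expr.update({f"b{i}": [3.0, 4.0, 3.0, 4.0, 3.0, 4.0, 3.0, 4.0] for i in range(1, 18)})
pathway = ["p1", "p2", "p3"]

print(self_contained_p(expr, labels, pathway), competitive_p(expr, labels, pathway))
```

The self-contained version never looks at genes outside the pathway, whereas the competitive version depends entirely on how the pathway compares with the rest of the measured genes.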
Advantages of FCS Methods
FCS methods address three limitations of ORA
1. Don’t require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis.
2. While ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway
3. By considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway
Limitations of FCS Methods
• First, similar to ORA, FCS analyzes each pathway independently
  – A gene can function in more than one pathway, meaning that pathways can cross and overlap
  – Consequently, while one pathway may be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes
• Such a phenomenon is very common when using the GO terms to define pathways due to the hierarchical nature of the GO
Limitations of FCS Methods
• Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis
  – For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively
  – As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight
• Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers.
  – A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores.
  – For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE).
Pathway Topology (PT)-Based Approaches
• A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway
  – KEGG
  – MetaCyc
  – Reactome
  – RegulonDB
  – STKE
  – BioCarta
  – PantherDB
  – ….
• Unlike GO and MSigDB, these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.)
Pathway Topology (PT)-Based Approaches
• ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases
  – Even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results
• Pathway topology (PT)-based methods have been developed to use the additional information
  – PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods
  – The key difference between the two is the use of pathway topology to compute gene-level statistics
Pathway Topology (PT)-Based Approaches
• Rahnenführer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.)
  – similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods
  – averaged to compute a pathway-level score
• Instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway
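The distance-weighting idea can be sketched as follows. This is a simplified illustration of the description above, not the published ScorePAGE algorithm: it assumes Pearson correlation as the similarity, an undirected pathway graph, and shortest-path length (found by breadth-first search) as the "number of reactions needed to connect two genes". The function names and toy data are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shortest_path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if disconnected (BFS)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def topology_weighted_score(expr, graph):
    """Average of pairwise similarities, each divided by graph distance."""
    scores = []
    for g1, g2 in combinations(sorted(expr), 2):
        dist = shortest_path_length(graph, g1, g2)
        if dist:  # skip pairs that are not connected
            scores.append(pearson(expr[g1], expr[g2]) / dist)
    return sum(scores) / len(scores)


# Toy pathway (hypothetical): a chain A - B - C.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}

print(topology_weighted_score(expr, graph))
```

Here the A–C pair is two steps apart, so its correlation contributes half as much as the correlations of the directly connected pairs.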
Pathway Topology (PT)-Based Approaches
• Impact factor (IF) analysis
  – IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway
Ali will talk more about these approaches in detail!!!
IF Analysis
• Briefly…
  – Models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them
  – Defines a gene-level statistic, called perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway
  – Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system
    • addresses loops in the pathways
  – The IF of a pathway (pathway-level statistic) is defined as a sum of PF of all genes in a pathway
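The linear-system view can be made concrete with a small sketch. It follows only the simplified description on this slide: each PF is the gene's measured change plus a weighted sum of the PFs of its upstream genes, and the pathway score is the sum of PFs. The edge weights, gene names, and function names are hypothetical, and the published IF method includes further terms that are omitted here.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def perturbation_factors(delta_expr, edges):
    """Solve PF = delta_expr + B * PF, i.e. (I - B) * PF = delta_expr.

    edges maps (source, target) -> weight; B[target][source] = weight.
    """
    genes = sorted(delta_expr)
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    system = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (source, target), weight in edges.items():
        system[index[target]][index[source]] -= weight
    rhs = [delta_expr[g] for g in genes]
    return dict(zip(genes, solve_linear_system(system, rhs)))


# Toy pathway (hypothetical) with a feedback loop: A -> B -> C -| A.
delta_expr = {"A": 1.0, "B": 0.0, "C": 0.0}
edges = {("A", "B"): 0.5, ("B", "C"): 0.5, ("C", "A"): -0.5}

pf = perturbation_factors(delta_expr, edges)
pathway_score = sum(pf.values())
print(pf, pathway_score)
```

Because the whole pathway is solved at once, the feedback loop from C back to A needs no special handling, which is the point made above about loops.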
Pathway Topology (PT)-Based Approaches
• FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change
• This assumption may be inaccurate → PT approaches improve on this
Pathway Topology (PT)-Based Approaches
• NetGSA accounts for the change in correlation as well as the change in network structure as experimental conditions change
  – like IF analysis, models gene expression as a linear function of other genes in the network
• It differs from IF in two aspects
  – First, it accounts for a gene's baseline expression by representing it as a latent variable in the model
  – Second, it requires that the pathways be represented as directed acyclic graphs (DAGs)
    • If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle.
• In contrast, IF analysis does not impose any constraint on the structure of a pathway
Limitations of PT-based Approaches
• True pathway topology is dependent on the type of cell due to cell-specific gene expression profiles and the condition being studied
  – information is rarely available
  – fragmented in knowledge bases if available
  – As annotations improve, these approaches are expected to become more useful
• Inability to model dynamic states of a system
• Inability to consider interactions between pathways due to weak inter-pathway links to account for interdependence between pathways
R package: netgsa
Outstanding Challenges
• Next generation approaches will require improvement of the existing annotations
  – necessary to create accurate, high resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene
    • PharmGKB….
  – these knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes
Annotation Challenges
• Low resolution knowledge bases
• Incomplete and inaccurate annotations
• Missing condition- and cell-specific information
Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause for those diseases or the downstream effects of the diseases.
Low Resolution Knowledge Bases
• Knowledge bases not as high resolution as technologies
  – using RNA-seq, more than 90% of human genes are estimated to be alternatively spliced
  – multiple transcripts from the same gene may have related, distinct, or even opposing functions
  – GWAS have identified a large number of SNPs that may be involved in different conditions and diseases.
  – However, current knowledge bases only specify which genes are active in a given pathway
  – Essential that they also begin specifying other information, such as transcripts that are active in a given pathway or how a given SNP affects a pathway
Low Resolution Knowledge Bases
• Because of these low resolution knowledge bases, every available pathway analysis tool first maps the input to a non-redundant namespace, typically an Entrez Gene ID
  – this type of mapping is advantageous, although it can be non-trivial, as it allows the existing pathway analysis approaches to be independent of the technology used in the experiment
  – However, mapping in this way also results in the loss of important information that may have been provided because a specific technology was used
    • XRN2a, a variant of gene XRN2, is expressed in several human tissues, whereas another variant of the same gene, XRN2b, is mainly expressed in blood leukocytes
    • Although RNA-seq can quantify expression of both variants, mapping both transcripts to a single gene causes loss of tissue-specific information, and possibly even condition-specific information
Low Resolution Knowledge Bases
• Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate exact transcripts and SNPs that participate in a given pathway
• While new approaches are being developed in this regard, they may not yet be adequate
  – Braun et al. proposed a method for analyzing SNP data from a GWAS
  – Still relies on mapping multiple SNPs to a single gene, followed by gene-to-pathway mapping
Incomplete and Inaccurate Annotation
• A surprisingly large number of genes are still not annotated
• Many of the genes are hypothetical, predicted, or pseudogenes
  – Although the number of protein-coding genes in the human genome is estimated to be between 20,000 and 25,000, according to Entrez Gene there are 45,283 human genes, of which 14,162 are pseudogenes
  – One could argue that the pseudogenes should not be included when evaluating functional annotation coverage
  – However, pseudogene-derived small interfering RNAs have been shown to regulate gene expression in mouse oocytes
  – GO provides annotations for 271 pseudogenes
  – A widely used DNA microarray, Affymetrix HG U133 Plus 2.0, contains 1,026 probe sets that correspond to 823 pseudogenes
  – Should pseudogenes be included in the count when estimating annotation coverage for the human genome?
Incomplete and Inaccurate Annotation
Number of GO-annotated genes (left panel) and number of GO annotations (right panel) for human from January 2003 to November 2009. As the estimated number of known genes in the human genome is adjusted (between January 2003 and December 2003) and annotation practices are modified (between December 2004 and December 2005, and between October 2008 and November 2009), one can argue that, although the number of annotated genes and the annotations are decreasing (which is mainly due to the adjusted number of genes in the human genome and changes in the annotation process), the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias for a small number of genes. doi:10.1371/journal.pcbi.1002375.g003
Incomplete and Inaccurate Annotation
• Additionally, many of the existing annotations are of low quality and may be inaccurate
  – >90% of the annotations in the October 2015 release of GO had the evidence code “inferred from electronic annotation (IEA)”
  – these are the only ones in GO that are not curated manually
  – Annotations inferred from indirect evidence are considered to be of lower quality than those derived from direct experimental evidence
  – If the annotations with IEA code are removed, the number of genes with good quality annotations in the November 2015 release of human GO annotations is reduced from ~18K to ~12K
Incomplete and Inaccurate Annotation
• It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality
• This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved
  – the number of non-IEA annotations is continuously increasing
• However, the rate of increase for non-IEA annotations is very slow (approximately 2,000 genes annotated in 7 years)
Incomplete and Inaccurate Annotation
• Manual curation of the entire genome is expected to take a very long time (~13–25 years)
• Entire research community could participate in the curation process
• One approach to facilitate participation of a large number of researchers is to adopt a standard annotation format similar to Minimum Information About a Microarray Experiment (MIAME)
  – should this be required like GEO?
• A format for functional annotation can be designed or adopted from the existing formats (e.g., BioPAX, SBML)
  – Such a format could allow researchers to specify an experimentally confirmed role of a specific transcript or a SNP in a pathway along with experimental and biological conditions
Missing Condition- and Cell-Specific Information
• Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions
• These details are typically not available in the knowledge bases!
• One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway
• This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway
Missing Condition- and Cell-Specific Information
• Example: Wnt/beta-catenin pathway in STKE
  – the node labeled “Genes” represents 19 genes directly targeted by Wnt in different organisms (Xenopus and human) in different cells and tissues (colon carcinoma cells and epithelial cells)
  – these non-specific genes introduce bias for these pathways in all existing analysis approaches
  – For instance, any ORA method will assign higher significance (typically an order of magnitude lower p-value) to a pathway with more genes
  – Similarly, more genes in a pathway also increase the probability of a higher pathway-level statistic in FCS approaches, yielding higher significance for a given pathway.
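One way to see the size effect in ORA is to hold the proportion of hits fixed and vary only the pathway size. The sketch below assumes a hypergeometric test and entirely hypothetical counts (a background of 1,000 genes and an input list of 100); it is meant as an illustration of the bias, not a statement about any specific tool.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


N, n = 1000, 100  # background size and input-list size (hypothetical)

# Same proportion of pathway genes in the input list (40%), different sizes.
p_small = hypergeom_upper_tail(2, N, 5, n)    # 2 of 5 pathway genes
p_large = hypergeom_upper_tail(20, N, 50, n)  # 20 of 50 pathway genes

print(p_small, p_large)
```

The larger pathway receives a much smaller p-value even though the fraction of its genes in the input list is the same, so adding non-specific genes to a node can inflate a pathway's apparent significance.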
Missing Condition- and Cell-Specific Information
• This contextual information is typically not available from most of the existing knowledge bases
• A standard functional annotation format discussed above would make this information available to curators and developers
  – For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs.
  – BCML can generate cell-, condition-, or organism-specific pathways based on user-defined query criteria, which in turn can be used for targeted analysis
Missing Condition- and Cell-Specific Information
• Existing knowledge bases do not describe the effects of an abnormal condition on a pathway
  – For example, it is not clear how the Alzheimer's disease pathway in KEGG differs from a normal pathway
  – Nor is it clear which set of interactions leads to Alzheimer's disease
• We are now understanding that context plays an important role in pathway interactions
• Information about how cell and tissue type, age, and environmental exposures affect pathway interactions will add complexity that is currently lacking
Methodological Challenges
• Benchmark data sets for comparing different methods
• Inability to model and analyze dynamic response
• Inability to model effects of external stimuli
Comparing Different Methods
• How do we compare different pathway analysis methods?
• Simulated data
  – Advantages:
    • Real signal is simulated, so “true” answer is known
  – Disadvantages:
    • Cannot contain all the complexity of real data
    • The success of the methods can reflect how well the simulation matches the knowledge base structure used
Comparing Different Methods
• Benchmark data
  – Advantages:
    • Can compare sensitivity and specificity
    • Several data sets have been consistently used in the literature
    • Includes all the complexity of real biological data
  – Disadvantages:
    • Affected by confounding factors
      – absence of a pure division into classes
      – presence of outliers
      – ….
    • No true answer is known for grounded comparisons, since the actual biology isn't known
Comparing Different Methods
• A general challenge: Different definitions of the same pathway in different knowledge bases can affect performance assessment
  – GO defines different pathways for apoptosis in different cells
    • (e.g., cardiac muscle cell apoptosis, B cell apoptosis, T cell apoptosis)
    • Further distinguishes between induction and regulation of apoptosis
  – KEGG defines a single signaling pathway for apoptosis
    • does not distinguish between induction and regulation
  – An approach using KEGG would identify a single pathway as significant, whereas GO could identify multiple pathways, and/or specific aspects of a single apoptosis pathway
Inability to model and analyze dynamic response
• No existing approach can collectively model and analyze high-throughput data as a single dynamic system
• Current approaches analyze a snapshot assuming that each pathway is independent of the others at a given time
  – measure expression changes at multiple time points, and analyze each time point individually
  – Implicitly assumes that pathways at different time points are independent
• Need models that account for dependence among pathways at different time points
  – Much of this limitation is due to technology/experimental design → not all bioinformatics limitations
Inability to model effects of external stimuli
• Gene set–based approaches often only consider genes and their products
• Completely ignore the effects of other molecules participating in a pathway
  – such as the rate limiting step of a multi-step pathway.
• Example:
  – The amount/strength of Ca2+ causes different transcription factors to be activated
  – This information is usually not available.
Summary
• In the last decade, pathway analysis has matured and become the standard for trying to dissect the biology of high throughput experiments.
• Many similarities across the three main generations of pathway analysis tools.
• Will discuss more details of some of these choices, knowledge bases, and specific approaches next.
• Many open methods development challenges!
Overview of Module
• First Half:
  – Overview of gene set and pathway analysis
    • Commonly used databases and annotation issues
    • 1st and 2nd generation tools
      – Basic differences in methods
      – Details on very popular methods
    • Issues with different “omics” data types
• Second Half:
  – “3rd generation” methods
  – Network analysis modeling