Date post: 30-Aug-2018
Introduction to Pathway and Network Analysis
Alison Motsinger-Reif, PhD
Associate Professor, Bioinformatics Research Center, Department of Statistics
North Carolina State University

Pathway and Network Analysis

•  High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system

•  Analysis of high-throughput data typically yields a list of differentially expressed genes, proteins, metabolites…
    –  Typically provides lists of single genes, etc.
    –  Will use “genes” throughout, but mostly interchangeably with proteins, metabolites, etc.

•  This list often fails to provide mechanistic insights into the underlying biology of the condition being studied

•  How to extract meaning from a long list of differentially expressed genes → pathway/network analysis

What makes an airplane fly?

Chas' Stainless Steel, Mark Thompson's Airplane Parts, About 1000 Pounds of Stainless Steel Wire, and Gagosian's Beverly Hills Space

From components to networks

A biological function is a result of many interacting molecules and cannot be attributed to just a single molecule.

Pathway and Network Analysis

•  One approach: grouping long lists of individual genes into smaller sets of related genes reduces the complexity of analysis.
    –  a large number of knowledge bases have been developed to help with this task

•  Knowledge bases
    –  describe biological processes, components, or structures in which individual genes are known to be involved
    –  describe how and where gene products interact with each other

Pathway and Network Analysis

•  Analysis at the functional level is appealing for two reasons:
    –  First, grouping thousands of genes by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment
    –  Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of genes

Pathway and Network Analysis

•  What kinds of data are used for such analysis?
    –  Gene expression data
        •  Microarrays
        •  RNA-seq
    –  Proteomic data
    –  Metabolomics data
    –  Single nucleotide polymorphisms (SNPs)
    –  ….

Pathway and Network Analysis

•  What kinds of questions can we ask/answer with these approaches?

Pathway and Network Analysis

•  The term “pathway analysis” gets used often, and often in different ways
    –  applied to the analysis of Gene Ontology (GO) terms (also referred to as a “gene set”)
    –  physical interaction networks (e.g., protein–protein interactions)
    –  kinetic simulation of pathways
    –  steady-state pathway analysis (e.g., flux-balance analysis)
    –  inference of pathways from expression and sequence data

•  May or may not actually describe biological pathways

Pathway and Network Analysis

•  For the first part of this module, we will focus on methods that exploit pathway knowledge in public repositories rather than on methods that infer pathways from molecular measurements
    –  Use repositories such as GO or Kyoto Encyclopedia of Genes and Genomes (KEGG)

→ knowledge base–driven pathway analysis

A History of Pathway Analysis Approaches

•  Over a decade of development of pathway analysis approaches

•  Can be roughly divided into three generations:
    –  1st: Over-Representation Analysis (ORA) Approaches
    –  2nd: Functional Class Scoring (FCS) Approaches
    –  3rd: Pathway Topology (PT)-Based Approaches

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

•  The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods.
•  ORA methods require that the input is a list of differentially expressed genes
•  FCS methods use the entire data matrix as input
•  PT-based methods additionally utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database.
•  The result of every pathway analysis method is a list of significant pathways in the condition under study.

Over-Representation Analysis (ORA) Approaches

•  Earliest methods → over-representation analysis (ORA)

•  Statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression

•  It is also referred to as the “2×2 table method” in the literature

Over-Representation Analysis (ORA)

•  Uses one or more variations of the following strategy:
    –  First, an input list is created using a certain threshold or criteria
        •  For example, may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%
    –  Then, for each pathway, input genes that are part of the pathway are counted
    –  This process is repeated for an appropriate background list of genes
        •  (e.g., all genes measured on a microarray)
    –  Next, every pathway is tested for over- or under-representation in the list of input genes
        •  The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution
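The counting-and-testing strategy above can be sketched with a one-sided hypergeometric test for over-representation. This is a minimal illustration under stated assumptions, not the implementation of any particular tool; the function name `ora_p_value` and all gene counts in the example are hypothetical.

```python
from math import comb

def ora_p_value(n_background, n_pathway, n_input, n_overlap):
    """Upper-tail hypergeometric p-value for over-representation.

    n_background: genes in the background list (e.g., all genes on the array)
    n_pathway:    background genes annotated to the pathway
    n_input:      genes in the input (differentially expressed) list
    n_overlap:    input genes that are also in the pathway
    Returns P(X >= n_overlap) when n_input genes are drawn at random.
    """
    upper = min(n_pathway, n_input)
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10,000 background genes, a 100-gene pathway,
# 200 input genes, 10 of which fall in the pathway (about 2 expected by chance).
p = ora_p_value(10_000, 100, 200, 10)
```

Note that only the four counts enter the calculation; the size of each gene's expression change plays no role.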

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

Limitations of ORA Approaches

•  First, the different statistics used by ORA are independent of the measured changes
    –  (e.g., hypergeometric distribution, binomial distribution, chi-square distribution, etc.)

•  Tests consider the number of genes alone but ignore any values associated with them
    –  such as probe intensities

•  By discarding this data, ORA treats each gene equally
    –  Information about the extent of regulation (e.g., fold-changes, significance of a change, etc.) can be useful in assigning different weights to input genes/pathways
    –  This can provide more information

Limitations of ORA Approaches

•  Second, ORA typically uses only the most significant genes and discards the others
    –  input list of genes is usually obtained using an arbitrary threshold (e.g., genes with fold-change > 2 and/or p-value < 0.05)

•  Marginally less significant genes are missed, resulting in information loss
    –  (e.g., fold-change = 1.999 or p-value = 0.051)
    –  A few methods avoid thresholds
        •  They use an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant
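To make the information loss concrete, the sketch below applies hard cutoffs to three genes. The cutoff values (fold-change of 2, p-value of 0.05), the gene names, and the function `select_input_genes` are illustrative assumptions, not taken from a specific tool.

```python
def select_input_genes(results, min_fold_change=2.0, max_p_value=0.05):
    """Keep only genes that pass both hard cutoffs (hypothetical defaults)."""
    return [
        gene
        for gene, (fold_change, p_value) in results.items()
        if fold_change >= min_fold_change and p_value < max_p_value
    ]

# Hypothetical per-gene results: (fold-change, p-value).
results = {
    "geneA": (2.5, 0.010),    # passes both cutoffs
    "geneB": (1.999, 0.001),  # dropped: fold-change just under the cutoff
    "geneC": (3.0, 0.051),    # dropped: p-value just over the cutoff
}
selected = select_input_genes(results)
```

Here `geneB` and `geneC` are nearly as convincing as `geneA`, yet neither reaches the ORA input list.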

Limitations of ORA Approaches

•  Third, ORA assumes that each gene is independent of the other genes

•  However, biology is a complex web of interactions between gene products that constitute different pathways
    –  One goal might be to gain insights into how interactions between gene products are manifested as changes in expression
    –  A strategy that assumes the genes are independent is significantly limited in its ability to provide insights

•  Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (more later), which ignores the correlation structure between genes
    –  the estimated significance of a pathway may be biased or incorrect

Limitations of ORA Approaches

•  Fourth, ORA assumes that each pathway is independent of other pathways → NOT TRUE!

•  Examples of dependence:
    –  GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions
    –  In the cell cycle pathway in KEGG, the presence of a growth factor activates the MAPK signaling pathway
        •  This, in turn, activates the cell cycle pathway

•  No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG

Functional Class Scoring (FCS) Approaches

•  The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects

•  With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps.

Step 1

•  First, a gene-level statistic is computed using the molecular measurements from an experiment
    –  Involves computing differential expression of individual genes or proteins

•  Statistics currently used at the gene level include:
    –  correlation of molecular measurements with phenotype
    –  ANOVA
    –  Q-statistic
    –  signal-to-noise ratio
    –  t-test
    –  Z-score

Step 1

•  Choice of a gene-level statistic generally has a negligible effect on the identification of significantly enriched gene sets
    –  However, when there are few biological replicates, a regularized statistic may be better

•  Untransformed gene-level statistics can fail to identify pathways with both up- and down-regulated genes
    –  In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is better
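As a rough illustration of Step 1, the sketch below computes two of the listed gene-level statistics (a Welch t-statistic and a signal-to-noise ratio) and shows why untransformed statistics can cancel out in a pathway with both up- and down-regulated genes. The expression values and gene names are hypothetical, and the signal-to-noise form used (difference in means over the sum of standard deviations) is one common definition assumed here.

```python
from math import sqrt
from statistics import mean, stdev, variance

def welch_t(case, control):
    """Welch t-statistic for one gene (does not assume equal variances)."""
    standard_error = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / standard_error

def signal_to_noise(case, control):
    """Difference in means divided by the sum of the standard deviations."""
    return (mean(case) - mean(control)) / (stdev(case) + stdev(control))

# Hypothetical log-expression values per gene: (case samples, control samples).
expression = {
    "up_gene": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "down_gene": ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),
}
t_stats = {gene: welch_t(case, control) for gene, (case, control) in expression.items()}

# Averaging raw statistics cancels opposite changes; absolute values do not.
raw_mean = mean(list(t_stats.values()))
abs_mean = mean([abs(t) for t in t_stats.values()])
```

With these values the raw mean is zero even though both genes change strongly, while the mean of absolute values stays large.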

Step 2

•  Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
    –  can be multivariate and account for interdependencies among genes
    –  can be univariate and disregard interdependencies among genes

•  The pathway-level statistics used include:
    –  Kolmogorov-Smirnov statistic
    –  sum, mean, or median of gene-level statistic
    –  Wilcoxon rank sum
    –  maxmean statistic
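A few of the univariate pathway-level statistics listed above can be sketched as simple aggregations. The gene-level values and gene names below are hypothetical, and the `maxmean` definition used (the larger of the average positive part and the average negative part over all genes in the set) is an assumption based on the usual description of that statistic; it should be checked against the original method before use.

```python
from statistics import mean, median

def pathway_scores(gene_stats, pathway):
    """Aggregate gene-level statistics for the genes of one pathway."""
    values = [gene_stats[gene] for gene in pathway if gene in gene_stats]
    positive_part = mean([max(v, 0.0) for v in values])
    negative_part = mean([max(-v, 0.0) for v in values])
    return {
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "maxmean": max(positive_part, negative_part),
    }

# Hypothetical gene-level statistics (e.g., t-statistics).
gene_stats = {"g1": 2.0, "g2": -1.0, "g3": 3.0, "g4": 0.5}
scores = pathway_scores(gene_stats, ["g1", "g2", "g3"])
```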

Step 2

•  Irrespective of its type, the power of a pathway-level statistic depends on
    –  the proportion of differentially expressed genes in a pathway
    –  the size of the pathway
    –  the amount of correlation between genes in the pathway

•  Univariate statistics show more power at stringent cutoffs when applied to real biological data, and equal power as multivariate statistics at less stringent cutoffs

Step 3

•  Assessing the statistical significance of the pathway-level statistic

•  When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories:
    –  i) competitive null hypothesis
    –  ii) self-contained null hypothesis

•  A self-contained null hypothesis permutes class labels (i.e., phenotypes) for each sample and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway

•  A competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway
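The two null hypotheses differ only in what gets permuted, which the sketch below tries to make explicit. It is a simplified illustration: the gene-level statistic (absolute difference in class means), the pathway-level statistic (their mean), the function names, and the toy data are all assumptions, and the competitive version draws random gene sets from all measured genes as a stand-in for permuting gene labels.

```python
import random
from statistics import mean

def gene_stat(values, labels):
    """Difference in mean expression between class 1 and class 0."""
    case = [v for v, label in zip(values, labels) if label == 1]
    control = [v for v, label in zip(values, labels) if label == 0]
    return mean(case) - mean(control)

def pathway_stat(expr, labels, genes):
    """Mean absolute gene-level statistic over a set of genes."""
    return mean([abs(gene_stat(expr[g], labels)) for g in genes])

def self_contained_p(expr, labels, pathway, n_perm=1000, seed=0):
    """Permute class labels; only genes in the pathway are used."""
    rng = random.Random(seed)
    observed = pathway_stat(expr, labels, pathway)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        hits += pathway_stat(expr, shuffled, pathway) >= observed
    return (hits + 1) / (n_perm + 1)

def competitive_p(expr, labels, pathway, n_perm=1000, seed=0):
    """Permute gene labels: compare against random gene sets of the same size."""
    rng = random.Random(seed)
    observed = pathway_stat(expr, labels, pathway)
    all_genes = sorted(expr)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        hits += pathway_stat(expr, labels, random_set) >= observed
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 samples, 2 pathway genes that change, 8 flat genes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {f"flat{i}": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0] for i in range(8)}
expr["p1"] = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
expr["p2"] = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0]
pathway = ["p1", "p2"]
```

Calling `self_contained_p(expr, labels, pathway)` and `competitive_p(expr, labels, pathway)` on the same data answers two different questions: whether the pathway genes are associated with the phenotype at all, and whether they are more associated than other genes.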

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

Advantages of FCS Methods

FCS methods address three limitations of ORA

1.  Don’t require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis.

2.  While ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway

3.  By considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway

Limitations of FCS Methods

•  First, similar to ORA, FCS analyzes each pathway independently
    –  A gene can function in more than one pathway, so pathways can cross and overlap
    –  Consequently, while only one pathway may truly be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes

•  Such a phenomenon is very common when using the GO terms to define pathways due to the hierarchical nature of the GO
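One simple way to see how much two gene sets overlap is a Jaccard index. The sketch below is only illustrative; the set names and gene members are hypothetical, with the "child" set nested inside the "parent" set to mimic a hierarchical annotation.

```python
def jaccard(set_a, set_b):
    """Shared genes divided by all genes appearing in either set."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical gene sets; the child term is a subset of the parent term.
gene_sets = {
    "parent_term": {"g1", "g2", "g3", "g4"},
    "child_term": {"g1", "g2", "g3"},
    "unrelated_term": {"g7", "g8"},
}
overlap_nested = jaccard(gene_sets["parent_term"], gene_sets["child_term"])
overlap_unrelated = jaccard(gene_sets["parent_term"], gene_sets["unrelated_term"])
```

If the three genes in `child_term` change, both `child_term` and `parent_term` are likely to be reported, even though a single biological effect is responsible.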

Limitations of FCS Methods

•  Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis
    –  For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively
    –  As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight

•  Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers.
    –  A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores.
    –  For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE).
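The trade-off between ranks and magnitudes can be shown in a few lines. The two scenarios below are hypothetical: gene B changes 20-fold in one and 3-fold in the other, while A stays at 2-fold.

```python
def ranks_descending(fold_changes):
    """Rank genes from largest to smallest fold-change (1 = largest)."""
    ordered = sorted(fold_changes, key=fold_changes.get, reverse=True)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}

# Hypothetical fold-changes for the same three-gene pathway.
scenario_large = {"A": 2.0, "B": 20.0, "C": 1.1}
scenario_small = {"A": 2.0, "B": 3.0, "C": 1.1}

same_ranks = ranks_descending(scenario_large) == ranks_descending(scenario_small)
sum_large = sum(scenario_large.values())
sum_small = sum(scenario_small.values())
```

A rank-based statistic cannot tell the two scenarios apart, whereas a sum of gene-level values can; the same property is what makes ranks less sensitive to a single outlying measurement.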

Pathway Topology (PT)-Based Approaches

•  A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway
    –  KEGG
    –  MetaCyc
    –  Reactome
    –  RegulonDB
    –  STKE
    –  BioCarta
    –  PantherDB
    –  ….

•  Unlike GO and MSigDB, these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.)

Pathway Topology (PT)-Based Approaches

•  ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases
    –  Even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results

•  Pathway topology (PT)-based methods have been developed to use the additional information
    –  PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods
    –  The key difference between the two is the use of pathway topology to compute gene-level statistics

Pathway Topology (PT)-Based Approaches

•  Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.)
    –  similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods
    –  averaged to compute a pathway-level score

•  Instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway
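The idea of down-weighting pairwise similarities by network distance can be sketched as follows. This is a simplified illustration in the spirit of the description above, not the ScorePAGE implementation: it assumes Pearson correlation as the similarity, an undirected graph whose shortest-path length stands in for the number of reactions, and hypothetical gene names and expression values.

```python
from collections import deque
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def distances_from(graph, start):
    """Breadth-first shortest-path lengths from one gene to all reachable genes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def topology_weighted_score(expr, graph):
    """Average of pairwise correlations, each divided by the path length."""
    total, n_pairs = 0.0, 0
    for g1, g2 in combinations(sorted(expr), 2):
        d = distances_from(graph, g1).get(g2)
        if d is None:  # genes not connected in this pathway graph
            continue
        total += pearson(expr[g1], expr[g2]) / d
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

# Hypothetical three-gene chain A - B - C with toy expression profiles.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
expr = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0], "C": [3.0, 2.0, 1.0]}
score = topology_weighted_score(expr, graph)
```

In this toy chain the A–C correlation counts half as much as the A–B and B–C correlations, because A and C are two steps apart.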

Pathway Topology (PT)-Based Approaches

•  Impact factor (IF) analysis
    –  IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway

Ali will talk more about these approaches in detail!!!

IF Analysis

•  Briefly…
    –  Models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them
    –  Defines a gene-level statistic, called perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway
    –  Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system
        •  addresses loops in the pathways
    –  The IF of a pathway (pathway-level statistic) is defined as a sum of PF of all genes in a pathway
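Because each PF is a linear equation in the other PFs, the whole pathway can be solved at once, loops included. The sketch below assumes the simplified form PF(g) = ΔE(g) + Σ w(u, g) · PF(u) with hypothetical edge weights `w`; the published IF method includes further details that are omitted here, so this is an illustration of the linear-system idea only.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def perturbation_factors(genes, delta_e, weights):
    """Solve (I - W) PF = delta_e, where weights[(u, g)] is u's effect on g."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (upstream, gene), weight in weights.items():
        a[index[gene]][index[upstream]] -= weight
    b = [delta_e[gene] for gene in genes]
    return dict(zip(genes, solve_linear(a, b)))

# Hypothetical two-gene feedback loop: A activates B and B activates A.
genes = ["A", "B"]
delta_e = {"A": 1.0, "B": 0.0}           # measured expression changes
weights = {("A", "B"): 0.5, ("B", "A"): 0.5}
pf = perturbation_factors(genes, delta_e, weights)
pathway_score = sum(pf.values())          # sum of PFs over the pathway
```

Gene B has no measured change of its own, yet it receives a non-zero PF through its upstream neighbor, and the loop between A and B is handled by solving both equations simultaneously.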

Pathway Topology (PT)-Based Approaches

•  FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change

•  This assumption may be inaccurate → PT approaches improve on this

Pathway Topology (PT)-Based Approaches

•  NetGSA accounts for the change in correlation as well as the change in network structure as experimental conditions change
    –  like IF analysis, models gene expression as a linear function of other genes in the network

•  It differs from IF in two aspects
    –  First, it accounts for a gene's baseline expression by representing it as a latent variable in the model
    –  Second, it requires that the pathways be represented as directed acyclic graphs (DAGs)
        •  If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle.
        •  In contrast, IF analysis does not impose any constraint on the structure of a pathway

Limitations of PT-based Approaches

•  True pathway topology is dependent on the type of cell due to cell-specific gene expression profiles and condition being studied
    –  information is rarely available
    –  fragmented in knowledge bases if available
    –  As annotations improve, these approaches are expected to become more useful

•  Inability to model dynamic states of a system

•  Inability to consider interactions between pathways due to weak inter-pathway links to account for interdependence between pathways

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

R package netgsa

Outstanding Challenges

•  Broad Categories:
    1.  annotation challenges
    2.  methodological challenges

Outstanding Challenges

•  Next generation approaches will require improvement of the existing annotations
    –  necessary to create accurate, high resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene
        •  PharmGKB….
    –  these knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes

Annotation Challenges

•  Low resolution knowledge bases
•  Incomplete and inaccurate annotations
•  Missing condition- and cell-specific information

Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause for those diseases or the downstream effects of the diseases.

Low Resolution Knowledge Bases

•  Knowledge bases not as high resolution as technologies
    –  using RNA-seq, more than 90% of the human genome is estimated to be alternatively spliced
    –  multiple transcripts from the same gene may have related, distinct, or even opposing functions
    –  GWAS have identified a large number of SNPs that may be involved in different conditions and diseases.
    –  However, current knowledge bases only specify which genes are active in a given pathway
    –  Essential that they also begin specifying other information, such as transcripts that are active in a given pathway or how a given SNP affects a pathway

Low Resolution Knowledge Bases

•  Because of these low resolution knowledge bases, every available pathway analysis tool first maps the input to a non-redundant namespace, typically an Entrez Gene ID
    –  this type of mapping is advantageous, although it can be non-trivial, as it allows the existing pathway analysis approaches to be independent of the technology used in the experiment
    –  However, mapping in this way also results in the loss of important information that may have been provided because a specific technology was used
        •  XRN2a, a variant of gene XRN2, is expressed in several human tissues, whereas another variant of the same gene, XRN2b, is mainly expressed in blood leukocytes
        •  Although RNA-seq can quantify expression of both variants, mapping both transcripts to a single gene causes loss of tissue-specific information, and possibly even condition-specific information
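The loss of transcript-level detail during identifier mapping can be illustrated with the XRN2 example. The expression values, the use of the gene symbol in place of a numeric identifier, and the choice to sum transcripts are all assumptions made for this sketch.

```python
from collections import defaultdict

# Transcript-to-gene mapping for the two variants named above.
transcript_to_gene = {"XRN2a": "XRN2", "XRN2b": "XRN2"}

# Hypothetical transcript-level expression values from one sample.
transcript_expression = {"XRN2a": 12.0, "XRN2b": 3.0}

def collapse_to_genes(expression, mapping):
    """Sum transcript-level values into one value per gene."""
    totals = defaultdict(float)
    for transcript, value in expression.items():
        totals[mapping[transcript]] += value
    return dict(totals)

gene_expression = collapse_to_genes(transcript_expression, transcript_to_gene)
```

After collapsing, only a single XRN2 value remains, so a downstream pathway tool can no longer tell which variant was expressed.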

Low Resolution Knowledge Bases

•  Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate exact transcripts and SNPs that participate in a given pathway

•  While new approaches are being developed in this regard, they may not yet be adequate
    –  Braun et al. proposed a method for analyzing SNP data from a GWAS
    –  Still relies on mapping multiple SNPs to a single gene, followed by gene-to-pathway mapping

Incomplete and Inaccurate Annotation

•  A surprisingly large number of genes are still not annotated

•  Many of the genes are hypothetical, predicted, or pseudogenes
    –  Although the number of protein-coding genes in the human genome is estimated to be between 20,000 and 25,000, according to Entrez Gene, there are 45,283 human genes, of which 14,162 are pseudogenes
    –  One could argue that the pseudogenes should not be included when evaluating functional annotation coverage
    –  pseudogene-derived small interfering RNAs have been shown to regulate gene expression in mouse oocytes
    –  GO provides annotations for 271 pseudogenes
    –  A widely used DNA microarray, Affymetrix HG U133 plus 2.0, contains 1,026 probe sets that correspond to 823 pseudogenes
    –  Should pseudogenes be included in the count when estimating annotation coverage for the human genome?

Incomplete and Inaccurate Annotation

Number of GO-annotated genes (left panel) and number of GO annotations (right panel) for human from January 2003 to November 2009. As the estimated number of known genes in the human genome is adjusted (between January 2003 and December 2003) and annotation practices are modified (between December 2004 and December 2005, and between October 2008 and November 2009), one can argue that, although the number of annotated genes and the annotations are decreasing (which is mainly due to the adjusted number of genes in the human genome and changes in the annotation process), the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias for a small number of genes. doi:10.1371/journal.pcbi.1002375.g003

Incomplete and Inaccurate Annotation

•  Additionally, many of the existing annotations are of low quality and may be inaccurate
    –  >90% of the annotations in the October 2015 release of GO had the evidence code “inferred from electronic annotations (IEA)”
    –  IEA annotations are the only ones in GO that are not curated manually
    –  Annotations inferred from indirect evidence are considered to be of lower quality than those derived from direct experimental evidence
    –  If the annotations with IEA code are removed, the number of genes with good quality annotations in the November 2015 release of human GO annotations is reduced from ~18K to ~12K

Incomplete and Inaccurate Annotation

•  It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality

•  This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved
    –  the number of non-IEA annotations is continuously increasing

•  However, the rate of increase for non-IEA annotations is very slow (approximately 2,000 genes annotated in 7 years)

Incomplete and Inaccurate Annotation

•  Manual curation of the entire genome is expected to take a very long time (~13–25 years)

•  Entire research community could participate in the curation process

•  One approach to facilitate participation of a large number of researchers is to adopt a standard annotation format similar to Minimum Information About a Microarray Experiment (MIAME)
    –  should this be required like GEO?

•  A format for functional annotation can be designed or adopted from the existing formats (e.g., BioPAX, SBML)
    –  Such a format could allow researchers to specify an experimentally confirmed role of a specific transcript or a SNP in a pathway along with experimental and biological conditions

Missing Condition- and Cell-Specific Information

•  Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions

•  These details are typically not available in the knowledge bases!

•  One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway

•  This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway

Missing Condition- and Cell-Specific Information

•  Example: Wnt/beta-catenin pathway in STKE
    –  the node labeled “Genes” represents 19 genes directly targeted by Wnt in different organisms (Xenopus and human) in different cells and tissues (colon carcinoma cells and epithelial cells)
    –  these non-specific genes introduce bias for these pathways in all existing analysis approaches
    –  For instance, any ORA method will assign higher significance (typically an order of magnitude lower p-value) to a pathway with more genes
    –  Similarly, more genes in a pathway also increase the probability of a higher pathway-level statistic in FCS approaches, yielding higher significance for a given pathway.

Missing Condition- and Cell-Specific Information

•  This contextual information is typically not available from most of the existing knowledge bases

•  A standard functional annotation format discussed above would make this information available to curators and developers
    –  For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs.
    –  BCML can generate cell-, condition-, or organism-specific pathways based on user-defined query criteria, which in turn can be used for targeted analysis

Missing Condition- and Cell-Specific Information

•  Existing knowledge bases do not describe the effects of an abnormal condition on a pathway
    –  For example, it is not clear how the Alzheimer's disease pathway in KEGG differs from a normal pathway
    –  Nor is it clear which set of interactions leads to Alzheimer's disease

•  We are now understanding that context plays an important role in pathway interactions

•  Information about how cell and tissue type, age, and environmental exposures affect pathway interactions will add complexity that is currently lacking

Methodological Challenges

•  Benchmark data sets for comparing different methods

•  Inability to model and analyze dynamic response

•  Inability to model effects of external stimuli

Comparing Different Methods

•  How do we compare different pathway analysis methods?

•  Simulated data
    –  Advantages:
        •  Real signal is simulated, so the “true” answer is known
    –  Disadvantages
        •  Cannot contain all the complexity of real data
        •  The success of a method can reflect how closely the simulation matches the knowledge base structure used

Comparing Different Methods

•  Benchmark data
    –  Advantages:
        •  Can compare sensitivity and specificity
        •  Several datasets have been consistently used in the literature
        •  Includes all the complexity of real biological data
    –  Disadvantages
        •  Affected by confounding factors
            –  absence of a pure division into classes
            –  presence of outliers
            –  ….
        •  No true answer known for grounded comparisons; the actual biology isn't known

Comparing Different Methods

•  A general challenge: Different definitions of the same pathway in different knowledge bases can affect performance assessment
    –  GO defines different pathways for apoptosis in different cells
        •  (e.g., cardiac muscle cell apoptosis, B cell apoptosis, T cell apoptosis)
        •  Further distinguishes between induction and regulation of apoptosis
    –  KEGG defines a single signaling pathway for apoptosis
        •  does not distinguish between induction and regulation
    –  An approach using KEGG would identify a single pathway as significant, whereas GO could identify multiple pathways, and/or specific aspects of a single apoptosis pathway

Inability to model and analyze dynamic response

•  No existing approach can collectively model and analyze high-throughput data as a single dynamic system

•  Current approaches analyze a snapshot assuming that each pathway is independent of the others at a given time
    –  measure expression changes at multiple time points, and analyze each time point individually
    –  Implicitly assumes that pathways at different time points are independent

•  Need models that account for dependence among pathways at different time points
    –  Much of this limitation is due to technology/experimental design → not all bioinformatics limitations

Inability to model effects of external stimuli

•  Gene set–based approaches often only consider genes and their products

•  Completely ignore the effects of other molecules participating in a pathway
    –  such as the rate limiting step of a multi-step pathway.

•  Example:
    –  The amount/strength of Ca2+ causes different transcription factors to be activated
    –  This information is usually not available.

Summary

•  In the last decade, pathway analysis has matured, and become the standard for trying to dissect the biology of high throughput experiments.

•  Many similarities across the three main generations of pathway analysis tools.

•  Will discuss more details of some of these choices, knowledge bases, and specific approaches next.

•  Many open methods development challenges!

Overview of Module

•  First Half:
    –  Overview of gene set and pathway analysis
        •  Commonly used databases and annotation issues
        •  1st and 2nd generation tools
            –  Basic differences in methods
            –  Details on very popular methods
        •  Issues with different “omics” data types

•  Second Half
    –  “3rd generation” methods
    –  Network analysis modeling

Questions?
