New Perspectives on Gene Family Evolution: losses in reconciliation and a link with supertrees
By: Cedric Chauve and Nadia El-Mabrouk
Presentation by Julie Hudson for MAT5313, March 10, 2017
Overview
• Two main problems:
– Reconciling a gene tree with a known species tree
– Determining a probable species tree given only gene trees
• Both are optimization problems
• This presentation leaves theorems without proof (or out entirely) and instead aims to broadly introduce reconciliation
• Experimental results are presented at the end
Gene Family
• Genes that evolved from a common ancestor through speciation and duplication
• Contains orthologs (gene copies in different species) and paralogs (copies evolved by duplication)
• Gene losses are also important; they arise through pseudogenization (function lost through a disruption to the coding sequence)
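Tree Terminology
• Species Tree: binary tree whose leaves are labelled by G = {1, 2, …, g}, one leaf for every species
• Gene Tree: binary tree where each leaf is labelled from G and represents a gene copy
• L(x): the genome set of vertex x (the species labelling the leaves below x)
• Forest: a set of gene trees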
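A minimal sketch of this terminology in Python. The encoding is an assumption made for illustration only (it is not from the paper): a tree is a nested 2-tuple, a leaf is an int naming its species, and the hypothetical helper `genome_set` computes L(x).

```python
def genome_set(x):
    """Return L(x): the set of species labelling the leaves below vertex x.

    Assumed encoding: a tree is a nested 2-tuple, a leaf is an int (its species).
    """
    if isinstance(x, tuple):
        left, right = x
        return genome_set(left) | genome_set(right)
    return frozenset([x])


# A species tree on G = {1, 2, 3, 4}: every species appears on exactly one leaf.
species_tree = ((1, 2), (3, 4))

# A gene tree: leaves are gene copies, so a species may repeat or be absent.
gene_tree = ((1, 1), (2, 3))

# A forest is simply a collection of gene trees.
forest = [gene_tree, (1, (2, 4))]
```

Here `genome_set(gene_tree)` is {1, 2, 3}: species 1 contributes two gene copies and species 4 contributes none.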
Reconciliation
• The most commonly used methods for inferring evolutionary scenarios are reconciliation approaches
• A reconciliation is a map between a gene tree and a species tree in which incongruences are explained through hypothesized gene duplications and losses
Reconciliation
• Defined in terms of subtree insertions
• After undergoing subtree insertions, a tree is said to be an extension of the original
• A reconciliation between a gene tree (T) and a species tree (S) is an extension of T that is DS-consistent with S
• DS-consistent: for every vertex x of T such that |L(x)| ≥ 2, there exists a vertex u of S such that L(x) = L(u) and one of the following conditions holds:
– (D) L(xr) = L(xℓ) [duplication event]
– (S) L(xr) = L(ur) and L(xℓ) = L(uℓ) [speciation event]
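The DS-consistency condition can be checked directly from the definition. The sketch below is illustrative only: the nested-tuple encoding, `genome_set` and `is_ds_consistent` are assumptions, and children are treated as unordered, so condition (S) is accepted in either left/right order.

```python
def genome_set(x):
    """L(x) for a tree encoded as nested 2-tuples with int leaves (assumed encoding)."""
    if isinstance(x, tuple):
        return genome_set(x[0]) | genome_set(x[1])
    return frozenset([x])


def is_ds_consistent(T, S):
    """Check the (D)/(S) conditions at every vertex x of T with |L(x)| >= 2."""
    # Leaves of S are uniquely labelled, so a genome set identifies at most
    # one vertex u of S.
    by_set = {}

    def index(u):
        by_set[genome_set(u)] = u
        if isinstance(u, tuple):
            index(u[0])
            index(u[1])

    index(S)

    def check(x):
        if not isinstance(x, tuple):
            return True
        xl, xr = x
        if len(genome_set(x)) >= 2:
            u = by_set.get(genome_set(x))
            if u is None:                      # no vertex u with L(x) = L(u)
                return False
            Ll, Lr = genome_set(xl), genome_set(xr)
            is_duplication = Ll == Lr          # condition (D)
            # condition (S), with children taken as unordered
            is_speciation = {Ll, Lr} == {genome_set(u[0]), genome_set(u[1])}
            if not (is_duplication or is_speciation):
                return False
        return check(xl) and check(xr)

    return check(T)
```

For S = ((1, 2), (3, 4)), the gene tree ((1, 3), (2, 4)) fails the check because no vertex of S has genome set {1, 3}, so it would need subtree insertions before it becomes a reconciliation.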
Algorithm Minimum-Reconciliation
• Theorem 2: Algorithm Minimum-Reconciliation reconstructs the unique reconciliation between T and S that minimizes the number of gene losses
• A series of subtree insertions on T corresponding to speciation events of S
• Roughly: visit a leaf of T -> check its sibling -> does it match the sibling in S? -> insert the missing subtree if it doesn't
• Turn our leaves into cherries!
Algorithm Minimum-Reconciliation: worked example
[Figure sequence; tree drawings not recoverable, only labels and captions survive.]
• Start: Species Tree (S) with leaves 1 2 3 4 5; Gene Tree (T) with leaves 1 1 2 3 4 5 1 5. Captions: "Sibling: 2", "Sibling: NA"
• After the first insertion: T has leaves 1 2 1 2 3 4 5 1 5. Captions: "Sibling: 2", "Sibling: 2"
• Checking for the matching pattern from S in T: all these match
• Finishing first iteration: Reconciled Tree (T) has leaves 1 2 1 2 3 4 5 1 2 4 5
• Final Reconciled Tree: leaves 1 2 1 2 3 4 5 1 2 4 5
• Last slide: Reconciled Tree (T) and Species Tree (S) shown side by side (partial labels: 3 4 5 and 1 2 3)
Algorithm Minimum-Reconciliation
• MinR(S,T)
• Visits each vertex exactly once
• Runs in linear time
• From the iterations of MinR(S,T), an evolutionary scenario can be drawn
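To make "extension by subtree insertions" concrete, here is a small recursive sketch. It is not the leaf-by-leaf MinR procedure from the slides and it is not linear time: it swaps in a top-down construction based on lowest common ancestors, which is one standard way to build a DS-consistent extension. The names `Lost`, `extend`, `reconcile` and `count_losses`, and the nested-tuple encoding, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lost:
    """An inserted subtree: the part of the species tree where the gene was lost."""
    species: object


def genome_set(x):
    """L(x) for a tree encoded as nested 2-tuples with int leaves (assumed encoding)."""
    if isinstance(x, tuple):
        return genome_set(x[0]) | genome_set(x[1])
    return frozenset([x])


def extend(x, u):
    """Extend gene subtree x (with L(x) inside L(u)) until it covers species vertex u."""
    if not isinstance(u, tuple):
        return x                                  # u is a single species
    ul, ur = u
    Ll, Lr = genome_set(ul), genome_set(ur)
    Lx = genome_set(x)
    if Lx <= Ll:                                  # gene absent from the right side
        return (extend(x, ul), Lost(ur))
    if Lx <= Lr:                                  # gene absent from the left side
        return (Lost(ul), extend(x, ur))
    xl, xr = x                                    # x spans both sides of u
    Lxl, Lxr = genome_set(xl), genome_set(xr)
    if Lxl <= Ll and Lxr <= Lr:                   # speciation, condition (S)
        return (extend(xl, ul), extend(xr, ur))
    if Lxl <= Lr and Lxr <= Ll:                   # speciation, children swapped
        return (extend(xr, ul), extend(xl, ur))
    return (extend(xl, u), extend(xr, u))         # duplication, condition (D)


def reconcile(T, S):
    """Start at the lowest vertex of S whose genome set contains L(T)."""
    u = S
    while isinstance(u, tuple):
        below = [c for c in u if genome_set(T) <= genome_set(c)]
        if not below:
            break
        u = below[0]
    return extend(T, u)


def count_losses(r):
    """Each inserted subtree counts as one gene loss."""
    if isinstance(r, Lost):
        return 1
    if isinstance(r, tuple):
        return count_losses(r[0]) + count_losses(r[1])
    return 0
```

With S = ((1, 2), 3) and T = (1, 3), `reconcile(T, S)` returns ((1, Lost(2)), 3): one subtree insertion, read as one loss of the gene in species 2.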
An expanded leaf of T is a vertex x such that |L(x)| = 1 and L(x) ≠ L(xp), or x is the root of T. A cherry of a tree is an internal vertex x for which both children are expanded leaves.
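The two definitions above can be sketched as follows. This reads the expanded-leaf condition as "|L(x)| = 1 and (L(x) ≠ L(xp) or x is the root)", which is an interpretation of the excerpt; the nested-tuple encoding and the helper names are likewise assumptions.

```python
def genome_set(x):
    """L(x) for a tree encoded as nested 2-tuples with int leaves (assumed encoding)."""
    if isinstance(x, tuple):
        return genome_set(x[0]) | genome_set(x[1])
    return frozenset([x])


def cherries(T):
    """Return the internal vertices of T whose two children are both expanded leaves."""
    found = []

    def is_expanded_leaf(child, parent):
        # A non-root vertex is an expanded leaf when it covers a single species
        # and its parent covers something else.
        Lc = genome_set(child)
        return len(Lc) == 1 and Lc != genome_set(parent)

    def visit(x):
        if not isinstance(x, tuple):
            return
        if all(is_expanded_leaf(c, x) for c in x):
            found.append(x)
        for c in x:
            visit(c)

    visit(T)
    return found
```

In T = (((1, 1), 2), (3, 4)) the vertex (1, 1) is an expanded leaf (a single species, duplicated), so ((1, 1), 2) counts as a cherry, while (1, 1) itself does not.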
Reconciliation. There are several definitions of reconciliation between a gene tree and a species tree. Here we define reconciliation in terms of subtree insertions, following an approach used in [16, 7]. A subtree insertion in a tree T consists in grafting a new subtree onto an existing branch of T. A tree T′ is said to be an extension of T if it can be obtained from T by a sequence of subtree insertions in T.
Given a gene tree T on G and a species tree S on G, T is said to be DS-consistent with S (following the terminology used in [7]) if, for every vertex x of T such that |L(x)| ≥ 2, there exists a vertex u of S such that L(x) = L(u) and one of the two following conditions (D) or (S) holds: (D) either L(xr) = L(xℓ), or (S) L(xr) = L(ur) and L(xℓ) = L(uℓ).
A reconciliation between a gene tree T and a species tree S is an extension R of T that is DS-consistent with S (this definition is easily shown to be equivalent to other definitions of reconciliation [3, 12]). Such a reconciliation between T and S implies an unambiguous evolution scenario for the gene family T where a vertex of R that satisfies property (D) represents a duplication (the number of duplications induced by R is denoted by d(R,S)), and an inserted subtree represents a gene loss (the number of gene losses induced by R is denoted by ℓ(R,S)). Vertices of R that satisfy property (S) represent speciation events (see Fig. 1).
[Figure 1 image not recoverable. Panels: (a) S, a species tree on genomes 1 to 4 with internal vertices A, B, C; (b) the reconciled tree; (c) H, the evolution scenario, with events labelled Speciation 1,3; Speciation 1,2; Speciation 3,4; Duplication (twice); Gene loss (three times).]
Fig. 1. (a) A species tree S; (b) The reconciliation R of S with the gene tree T represented by plain lines. Dotted lines represent subtree insertions (3 insertions). The correspondence between vertices of R and S is indicated by vertex labels. Circles represent duplications. All other internal vertices of R are speciation vertices; (c) Evolution scenario resulting from R. Each oval is a gene copy.
Given a gene tree T, it is immediate to see that every vertex x of T such that L(xℓ) ∩ L(xr) ≠ ∅ will always be a duplication vertex in any
Problem 2
• What about when we don't have a species tree to inform the evolutionary scenarios?
• Goal: Find an evolutionary history that is compatible with as many of the gene trees in the forest as possible
• Note that minimizing loss costs does NOT minimize duplication costs in this problem, so duplications are the focus
Supertree!
• A species tree induced from a set of uniquely leaf-labelled gene trees
• Treating our problem as a supertree problem means that the heuristics used on supertrees could be useful here, IF we can show this to be a supertree problem of sorts
• Issue: Our gene trees are not uniquely labelled
Solving Supertree through Bipartitions
• Bipartition: a “collapsed” uniquely-labelled tree in which only 3 internal vertices exist: a parent and its two children. The leaves hang directly from the two (non-binary) children, and together they cover all species in the genome set (B)
• C(B,S) is the number of bipartitions not consistent with the species tree S
• Minimizing this solves the supertree problem
[Figure: a tree on leaves 1 2 3 4 5 becomes a bipartition on leaves 1 2 3 4 5; drawing not recoverable.]
Relation to a Supertree
• IF we let each gene tree be a single speciation event,
• Theorem 3: Let F be a forest of gene trees on G and k be the number of apparent duplications present in the trees of F. Then for any species tree S on G, d(F,S) = k + C(B(F),S)
• Translation: the duplication cost (which we want to minimize) is a function of the number of inconsistent bipartitions
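The right-hand side of Theorem 3 can be evaluated on a toy forest as sketched below. Several details are assumptions not spelled out on the slides: a vertex x is counted as an apparent duplication when L(xℓ) ∩ L(xr) ≠ ∅, every other internal vertex contributes one bipartition (L(xℓ), L(xr)), and a bipartition (A, B) is taken to be consistent with S when some vertex u of S has A on one side and B on the other. The function names and the nested-tuple encoding are hypothetical.

```python
def genome_set(x):
    """L(x) for a tree encoded as nested 2-tuples with int leaves (assumed encoding)."""
    if isinstance(x, tuple):
        return genome_set(x[0]) | genome_set(x[1])
    return frozenset([x])


def internal_vertices(x):
    """Yield every internal vertex of the tree rooted at x."""
    if isinstance(x, tuple):
        yield x
        yield from internal_vertices(x[0])
        yield from internal_vertices(x[1])


def is_consistent(A, B, S):
    """Assumed test: some vertex u of S separates A from B across its two children."""
    for u in internal_vertices(S):
        Ll, Lr = genome_set(u[0]), genome_set(u[1])
        if (A <= Ll and B <= Lr) or (A <= Lr and B <= Ll):
            return True
    return False


def duplication_cost(F, S):
    """Return (k, C, k + C) for a forest F and a candidate species tree S."""
    k = 0             # apparent duplications
    inconsistent = 0  # bipartitions of F not consistent with S
    for T in F:
        for x in internal_vertices(T):
            A, B = genome_set(x[0]), genome_set(x[1])
            if A & B:
                k += 1
            elif not is_consistent(A, B, S):
                inconsistent += 1
    return k, inconsistent, k + inconsistent
```

For S = ((1, 2), (3, 4)) and F = [((1, 3), (2, (3, 4))), ((1, 3), (2, 4))], the first tree has one apparent duplication at its root and the second contributes one inconsistent bipartition ({1, 3}, {2, 4}), giving k = 1, C = 1 and a total of 2. A different S would change only the C term, which is why the search over species trees reduces to minimizing inconsistent bipartitions.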
So…?
• We can apply supertree heuristics to the minimum duplication optimization!
• In particular, the min-cut algorithm is a greedy approach to solving it
• Runs in polynomial time
• Major result: an MD forest will be a compatible gene tree forest!
• Compatibility means, loosely speaking, that the gene trees and the species tree agree at every point where you can cut them
• This gives ONE species tree that is the most parsimonious evolutionary history
Experimental Results
• 250 gene trees were simulated using a 12-species Drosophila tree for 4 different gene gain/loss rates
• From these gene forests, informative bipartitions were used to compute a species tree (using a min-cut algorithm)
Rate   Nb. of         Nb. of   Nb. of   Nb. of Int.   Nb. of Apparent   Nb. of
       Duplications   Losses   Genes    vertices      duplications      Bipartitions
0.02   1080           976      3014     2752          1057              831
0.05   2018           1366     3622     3360          1948              593
0.1    3126           1603     4376     4114          3007              358
0.2    6123           2552     7709     7447          5875              429
Table 1. Characteristics of simulated gene trees. Considered bipartitions are those containing more than two species.
Modified Min-Cut algorithm described in [27] to compute a species tree from these bipartitions. With rates 0.02 and 0.05, this species tree is the correct species tree, while with rate 0.1, it differs from the correct one by a single branch swap, and with rate 0.2, it differs from the correct one by the fact that two consecutive binary nodes have been replaced by a single quaternary node. The fit statistic associated to the inferred species tree, that measures how well it agrees with the bipartitions, is very high, ranging from 0.98 to 0.855 (maximum fit is 1). This shows the effectiveness of the supertree approach using bipartitions, at least on a dataset of relatively close species where few vertices indicating a speciation are false positive.
We also studied the phylogenetic signal given by triplets of species that were split by non-apparent duplication vertices. With rates 0.02 and 0.05, for each triplet of species, there is a phylogeny that appears in most cases. However, with rates 0.1 and 0.2, among the triplets that appear a significant number of times (at least 50 times), the ones where the dominant phylogeny appears in less than 90% of the bipartitions splitting this triplet, contain the two species involved in the branch swap or species involved in the unresolved node that differs from the correct species tree. This illustrates the interest in using triplets of species that are split by non-apparent duplication vertices to point at possible locations of an inferred species tree that are associated with a weaker phylogenetic signal.
6 Conclusion
In this paper, we show that minimizing losses is a more constraining criterion than minimizing duplications for reconciliation. This highlights the importance of the former criterion from a combinatorial point of view, although it has been rarely considered alone in reconciliation approaches.
Experimental Results
[Figure: two inferred species trees on the 12 Drosophila species (ANA, ERE, GRI, MEL, MOJ, PER, PSE, SEC, SIM, VIR, WIL, YAK); drawings not recoverable.]
• Tree from lowest rates of gain/loss. Fit Statistic: 0.98
• Tree from highest rate of gain/loss. Fit Statistic: 0.85
Conclusion
• Optimization problems have efficiency as a goal
• In problem 1, a simple-to-implement algorithm that builds the reconciliation and runs in linear time was introduced
• Problem 2 was first recast as a supertree problem, so that well-tested supertree algorithms can be used to solve the minimum duplication problem
• Fairly obviously, errors in gene trees can lead to erroneous duplication/loss histories and species trees. Supertree methods highlight potential errors for tree pruning purposes (using split triplets to detect weak phylogenetic signal)