INSTRAL: Discordance-aware Phylogenetic Placement using Quartet Scores
Maryam Rabiee
Department of Computer Science, University of California, San Diego
Phylogenetic placement
➤ Given:
• an existing tree
• some new data not in the tree
➤ Find: the best position of the new data on the tree
➤ Why?
➤ With the emergence of new data, trees become outdated
• De novo phylogenetic reconstruction is expensive
➤ Sample identification for query sequences, especially for mixed samples from the environment
[Figure: Gene Trees panel (gene 1, gene 2, gene 3: sequence alignments and trees on x, w, y, z, each with an added query sequence) and Backbone Tree panel (X, Y, W, Z and new species A)]
• EPA-ng [Barbera et al., Sys Bio., 2018]
• SEPP [Mirarab et al., Biocomputing, 2012]
• APPLES [Balaban et al., Sys Bio., 2019]
• PPlacer [Matsen et al., BMC Bio., 2010]
INSTRAL (INsertion of New Species using asTRAL)
[Rabiee and Mirarab, SysBio, 2019]
[Figure: input Gene Trees and Backbone Tree (X, Y, W, Z); output tree on X, Y, W, Z with the new species A inserted]
➤ INSTRAL finds a species tree on n+1 species that induces the backbone tree and has the maximum quartet score with respect to the gene trees
[Figure: Quartet support, illustrated with quartet trees over Orangutan, Gorilla, Chimp, Human, and Bonobo]
INSTRAL algorithm
➤ Finds the placement with maximum quartet support versus the gene trees
➤ Looks at all possible placements and finds the exact solution with no heuristics
➤ Runs in polynomial time with respect to #species and #genes
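To make the objective concrete, here is a minimal brute-force sketch of quartet-based placement. It is not INSTRAL's algorithm or code: it simply tries every branch of the backbone and counts the gene-tree quartets that each candidate tree agrees with. Trees are written as nested Python tuples, and every function name and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def leaves(t):
    """Set of leaf names in a nested-tuple tree."""
    return {t} if isinstance(t, str) else set().union(*(leaves(c) for c in t))

def splits_of(t):
    """Leaf sets found below each internal node (one side of each split)."""
    found = []
    def walk(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= walk(child)
        found.append(frozenset(below))
        return below
    walk(t)
    return found

def quartet_topology(splits, q):
    """Return the pairing {{a,b},{c,d}} the tree induces on q, or None if unresolved."""
    for s in splits:
        inside = frozenset(x for x in q if x in s)
        if len(inside) == 2:
            return frozenset([inside, frozenset(q) - inside])
    return None

def quartet_score(species_tree, gene_trees):
    """Number of gene-tree quartets that have the same topology in species_tree."""
    s_splits = splits_of(species_tree)
    s_leaves = leaves(species_tree)
    score = 0
    for g in gene_trees:
        g_splits = splits_of(g)
        for q in combinations(sorted(leaves(g) & s_leaves), 4):
            topo = quartet_topology(g_splits, q)
            if topo is not None and topo == quartet_topology(s_splits, q):
                score += 1
    return score

def insertions(t, new):
    """Yield (subtree below the chosen branch, tree with `new` attached there)."""
    def rec(node):
        yield node, (node, new)
        if not isinstance(node, str):
            for i, child in enumerate(node):
                for where, variant in rec(child):
                    yield where, node[:i] + (variant,) + node[i + 1:]
    for i, child in enumerate(t):
        for where, variant in rec(child):
            yield where, t[:i] + (variant,) + t[i + 1:]

def best_placement(backbone, gene_trees, new):
    """Exhaustively score every placement; return (score, branch, tree)."""
    return max(((quartet_score(tree, gene_trees), where, tree)
                for where, tree in insertions(backbone, new)),
               key=lambda r: r[0])

# Toy data (made up): two gene trees put A next to X, one puts A next to W.
backbone = (("X", "Y"), ("W", "Z"))
genes = [((("A", "X"), "Y"), ("W", "Z")),
         ((("A", "X"), "Y"), ("W", "Z")),
         (("X", "Y"), (("A", "W"), "Z"))]
score, where, tree = best_placement(backbone, genes, "A")
print(score, where, tree)
```

This enumeration costs on the order of n^4 quartet checks per gene tree for each candidate branch, so it is only useful for small examples.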
Measuring accuracy
• Remove one leaf at a time from the true species tree
• Add the left-out species back with the placement method
• Measure node distance: the number of branches between the correct placement and the reported placement
• 0 means perfect placement
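The node distance can be sketched as follows, under one stated assumption: the distance is counted as the number of nodes on the path between the two branches of an unrooted tree, so identical branches give 0 and branches sharing a node give 1. The adjacency-map representation, the function names, and the internal node names u and v are all illustrative choices, not part of INSTRAL.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency map of an unrooted tree given as (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_distance(adj, true_branch, reported_branch):
    """Number of nodes on the path between two branches (0 if they coincide)."""
    if set(true_branch) == set(reported_branch):
        return 0
    # Breadth-first search started from both endpoints of the true branch.
    dist = {v: 0 for v in true_branch}
    queue = deque(true_branch)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Edges from the nearest endpoint, plus one, gives the nodes in between.
    return min(dist[v] for v in reported_branch) + 1

# Unrooted tree ((X,Y),(W,Z)) with hypothetical internal nodes u and v.
adj = build_adjacency([("X", "u"), ("Y", "u"), ("u", "v"), ("W", "v"), ("Z", "v")])
print(node_distance(adj, ("X", "u"), ("W", "v")))
```

In a leave-one-out experiment, the same function would be called once per removed leaf and the results averaged.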
INSTRAL accuracy
[Figure: node distance (y-axis, 0.1 to 0.4) by ILS level (Moderate, High, Very High) for 50 genes; methods: CA-ML (EPA-ng), INSTRAL+de novo, INSTRAL+EPA-ng]
Comparison with concatenation
[Figure: node distance by ILS level (Moderate, High, Very High), in panels by number of genes; methods: CA-ML (EPA-ng), INSTRAL+de novo, INSTRAL+EPA-ng]
EPA-ng: maximum-likelihood method for placement [Barbera et al., Sys Bio., 2018]
INSTRAL running time
[Figure: running time in seconds (y-axis) versus backbone tree size (x-axis: 250, 500, 1000, 2500, 5000, 10000)]
INSTRAL on large trees
➤ INSTRAL inserted ~70k new genomes into a tree of 10k genomes, creating a tree with ~100k leaves, in about a week of computation (10 nodes with 24 cores)
[Zhu et al., Nature Communications, 2019]
(mini) Tutorial
• Software available on GitHub:
• https://github.com/maryamrabiee/INSTRAL
• https://github.com/maryamrabiee/Constrained-Search
• See README at GitHub site: https://github.com/maryamrabiee/INSTRAL
• Publication:
• Maryam Rabiee, Siavash Mirarab, INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores, Systematic Biology, syz045, https://doi.org/10.1093/sysbio/syz045
Step 0a: updating gene alignments
• Add new sequences into existing gene alignments
• Tools:
• SEPP [Mirarab et al., 2012]
• UPP [Nguyen et al., 2015]
• HMMER [Potter et al., 2018]
• MAFFT --addfragments [Katoh et al., 2019]
[Figure: the query sequence CATTGCT is added to an existing alignment of four sequences, with gaps inserted so that it matches the existing columns]
Step 0b: updating gene trees
• De novo reconstruction of gene trees from alignments
• RAxML [Kozlov et al., Bioinformatics, 2019], FastTree [Price et al., PLoS ONE, 2010], IQ-TREE [Nguyen et al., Mol. Biol. Evol., 2015], …
• Placement on the existing gene trees
• PPlacer, EPA, SEPP, …
Step 1: prepare input
• Concatenate all the updated gene trees in Newick format into a file
• The backbone tree should also be in Newick format
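The concatenation step can be sketched with a short script. This is an illustrative helper, not part of the INSTRAL repository; it assumes each input file holds one or more Newick trees terminated by ";" and that labels contain no whitespace or semicolons. The file names in the demo are placeholders.

```python
import tempfile
from pathlib import Path

def concatenate_gene_trees(tree_files, out_file):
    """Write every Newick tree found in tree_files to out_file, one per line.

    Returns the number of trees written. Raises ValueError on a tree whose
    parentheses do not balance, which usually signals a truncated file.
    """
    count = 0
    with open(out_file, "w") as out:
        for path in tree_files:
            for chunk in Path(path).read_text().split(";"):
                newick = "".join(chunk.split())  # drop line breaks and spaces
                if not newick:
                    continue
                if newick.count("(") != newick.count(")"):
                    raise ValueError(f"unbalanced parentheses in {path}: {newick}")
                out.write(newick + ";\n")
                count += 1
    return count

# Demo in a temporary directory with made-up gene trees.
with tempfile.TemporaryDirectory() as tmp:
    gene1 = Path(tmp, "gene1.tre")
    gene1.write_text("((A,x),(y,(w,z)));\n")
    gene2 = Path(tmp, "gene2.tre")
    gene2.write_text("((A,w),\n (x,(y,z)));\n((x,y),(w,z));")
    combined = Path(tmp, "estimatedgenetrees.tre")
    n_trees = concatenate_gene_trees([gene1, gene2], combined)
    combined_lines = combined.read_text().splitlines()

print(n_trees, combined_lines)
```

The parenthesis check is only a basic sanity test; a full Newick parser would be needed to validate the trees properly.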
Step 2: Run INSTRAL
java -Djava.library.path=/path_to_repo/lib/ -jar instral.5.13.4.jar -i estimatedgenetrees.tre -f backbone.nwk -o placement.out --placement new_species_label --no-scoring -C > placement.br 2> log.txt
• For large datasets, use the JVM option -Xmx to increase the available memory
• Use -T for the multi-threaded version
Interpreting the Output
• Internal branches of the backbone need labels
• If they don't have labels, use the "label_internal_nodes" script in the repo
• INSTRAL outputs the label of the branch and the tree with the new species inserted
• Branch labels can be used for multiple insertions
[Figure: backbone tree on A, B, C, E, F with internal branches labeled N1, N2, N3; the new species d is placed on branch N1]
Output: N1
Output tree file: (((A,(B,C)),d),(E,F));
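To illustrate what labeling internal branches means, here is a small sketch that adds a label after every unlabeled closing parenthesis of a Newick string. It is not the repository's label_internal_nodes script, and the N1, N2, ... naming and numbering order are assumptions chosen for illustration. It also assumes the tree has no quoted labels containing parentheses and no whitespace after ")".

```python
def label_internal_nodes(newick, prefix="N"):
    """Add prefix+counter after each ')' that has no label yet.

    A ')' counts as unlabeled when the next character is ',', ')', ':' or ';'.
    Existing labels and branch lengths are left untouched.
    """
    out = []
    counter = 0
    for i, ch in enumerate(newick):
        out.append(ch)
        if ch == ")":
            nxt = newick[i + 1] if i + 1 < len(newick) else ";"
            if nxt in ",):;":
                counter += 1
                out.append(f"{prefix}{counter}")
    return "".join(out)

labeled = label_internal_nodes("((A,(B,C)),(E,F));")
with_lengths = label_internal_nodes("((A:1,B:1):2,C:3);")
print(labeled)
print(with_lengths)
```

Note that this sketch also labels the root, which may or may not match what the repository script does.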
Multiple new species
• Need to run INSTRAL for each new species separately
• Combine all the insertions using the script in the repo
• The final tree is unresolved (it can contain polytomies)
• Sample run:
./multiple_placements.sh estimatedgenetrees.tre backbone.tree outdir/ final_tree.tree
[Figure: tree on X, Y, W, Z, A, B, C containing a polytomy]
Resolving polytomies
Constrained ASTRAL
• ASTRAL can be used to resolve polytomies of a species tree based on input gene trees
• Needs as input a constraint tree and gene trees
• Code available at https://github.com/maryamrabiee/Constrained-search
• Example run:
java -jar astral.5.6.9.jar -i estimatedgenetrees.tre -o resolved-speciestree.tree -j constraint.tree 2> log.txt
For more info
• Contact me: Maryam Rabiee, [email protected]
• Software available on GitHub:
• https://github.com/maryamrabiee/INSTRAL
• https://github.com/maryamrabiee/Constrained-Search
• See tutorial and README at GitHub site: https://github.com/maryamrabiee/INSTRAL
• See publications:
• Maryam Rabiee, Siavash Mirarab, INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores, Systematic Biology, syz045, https://doi.org/10.1093/sysbio/syz045