Date post: | 26-Jul-2015 |
Upload: | beiko |
Robert Beiko
When trees can’t agree
The human microbiome: an ecosystem unlike any other
Human gut microbiome: 2-3 million genes
Typically > 160 “species” at any given time
Human: ~25,000 genes
Qin et al., Nature (2010)
Microbial communities
http://upload.wikimedia.org/wikipedia/commons/2/2d/Bacteria_%28251_31%29_Airborne_microbes.jpg
Photo courtesy of Emma Allen-Vercoe, University of Guelph
Lachnospiraceae bacterium 3-1-57 CT1 “Lachnozilla”
Meehan and Beiko (2014) Genome Biol Evol
Lachno
Lachnospiraceae – commonly thought of as “Good bacteria”
Figure: Sizes of Assembly and Draft Genomes of Class Clostridia (x-axis: Number of Protein-Coding Genes, 0 to 8000; labelled genome: “Zilla”)
W. Ford Doolittle, Sci Am (1999)
PNAS, 2012
“…pathogen-driven inflammatory responses in the gut can generate transient enterobacterial blooms in which conjugative transfer occurs at unprecedented rates.”
PLoS Biol, 2007
“…lateral gene transfer, mobile elements, and gene amplification have played important roles in affecting the ability of gut-dwelling Bacteroidetes to vary their cell surface, sense their environment, and harvest nutrient resources present in the distal intestine.”
Gene transfer matters
The genomics toolkit: Gene profiles
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5
…
The genomics toolkit: “Species” trees
The genomics toolkit: Gene trees
Do this for ALL genes
Representing and understanding microbial relationships
1. Matrix-based approaches
2. Phylogenetic reconciliation
3. Gene distributions and “microbial identity”
1. The tyranny of distance
From profile to distance matrix
Profile: genomes A to F (rows) by Gene 1, Gene 2, Gene 3, Gene 4, Gene n (columns)
Per-gene similarity scores S_g for genomes A and B: 0.91, 0.82, 0.72, 0.89
$$d_{A,B} = 1.0 - \frac{1}{n}\sum_{g=1}^{n} S_g$$
A B C
A 0 0.165 0.252
B 0.165 0 0.297
C 0.252 0.297 0
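The distance formula on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `profile_distance` is made up for illustration, and the four similarity scores are assumed to be the per-gene values for genomes A and B shown on the slide.

```python
def profile_distance(similarities):
    """Distance between two genomes: 1.0 minus the mean per-gene similarity."""
    if not similarities:
        raise ValueError("need at least one per-gene similarity score")
    return 1.0 - sum(similarities) / len(similarities)

# Per-gene similarity scores from the slide (assumed to be A versus B).
s_ab = [0.91, 0.82, 0.72, 0.89]
d_ab = profile_distance(s_ab)
print(round(d_ab, 3))  # 0.165
```

The result agrees with the A/B entry of the distance matrix above.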
Neighbor-joining
Start with a ‘star’ tree
At each iteration, split off the pair of taxa that minimizes the total sum of branch lengths in the tree
Choose groups x and y to minimize the Q-criterion, which combines the distance matrix entry for (x,y) with the weighted distance from x and from y to all leaves
Continue until binary tree is obtained
Saitou and Nei (1987)
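The pair-selection step can be sketched in Python. The Q-criterion used here is the commonly cited neighbor-joining form, Q(x,y) = (n − 2)·d(x,y) − Σ_k d(x,k) − Σ_k d(y,k); it is assumed rather than copied from the slide, and the five-taxon distances are illustrative values, not data from the talk.

```python
from itertools import combinations

# Illustrative distances only; these values are not from the talk.
names = ["a", "b", "c", "d", "e"]
pair_distances = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
# Key each distance by an unordered pair so lookups work in either order.
dist = {frozenset(pair): d for pair, d in pair_distances.items()}

def q_matrix(names, dist):
    """Q(x,y) = (n - 2) * d(x,y) - sum_k d(x,k) - sum_k d(y,k) for every pair."""
    n = len(names)
    total = {x: sum(dist[frozenset((x, k))] for k in names if k != x) for x in names}
    return {
        (x, y): (n - 2) * dist[frozenset((x, y))] - total[x] - total[y]
        for x, y in combinations(names, 2)
    }

q = q_matrix(names, dist)
best_pair = min(q, key=q.get)  # the pair that neighbor-joining would join first
print(best_pair, q[best_pair])  # ('a', 'b') -50
```

In this example d and e have the smallest raw distance (3), yet the criterion selects a and b, because Q also accounts for how far each taxon is from all the others.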
Neighbor-net: Building a splits graph
Bryant and Moulton, Mol Biol Evol (2003)
Neighbor-net is guaranteed to produce a circular set of splits
This will produce a planar graph
Neighbor-net of 298 microbial genomes
Beiko, Biol Direct (2011)
Limitations of neighbor-net
• Neighbor-net still imposes a constraint on the relationships among genomes: “long-distance” connections cannot be shown
Explicit connections between genomes
• Make each genome a vertex in a graph G = (V, E)
V = {A,B,C,D,E,F,…}
E = {{A,B},…}
For some threshold t: {A,B} ∈ E iff d_{A,B} ≤ t, or if some other condition is satisfied
Edge {A,B} carries weight w_{A,B}
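A thresholded genome graph of this kind might be built as follows. This is a sketch under stated assumptions: the distances are the A, B, C values from the earlier matrix, the threshold t = 0.26 is chosen only for illustration, and the edge weight w = 1 − d is an assumed weighting, since the slide does not define w.

```python
from itertools import combinations

genomes = ["A", "B", "C"]
# Distances taken from the A/B/C matrix shown earlier in the talk.
d = {
    frozenset("AB"): 0.165,
    frozenset("AC"): 0.252,
    frozenset("BC"): 0.297,
}

def threshold_graph(genomes, d, t):
    """Return {(x, y): weight} for every pair of genomes with distance <= t."""
    edges = {}
    for x, y in combinations(genomes, 2):
        dist = d[frozenset((x, y))]
        if dist <= t:
            edges[(x, y)] = 1.0 - dist  # assumed weighting: similarity = 1 - distance
    return edges

edges = threshold_graph(genomes, d, t=0.26)  # t is a hypothetical threshold
print(sorted(edges))  # [('A', 'B'), ('A', 'C')]
```

With this threshold the B/C pair (distance 0.297) is left unconnected, so the graph records only the closer relationships.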
Linear programming
• Weighting networks based on straight genome-genome similarity highlights close relatives, redundancy
• LP introduces a weighting scheme that constrains connections and promotes distinct relationships
P. aeruginosa, P. fluorescens, P. lePewtida, P. syringae, P. entomophila, P. stutzeri, P. mendocina
Holloway and Beiko, BMC Evol Biol (2010)
“Plume”
Some like it hot
Pyrococcus furiosus optimal growth temperature: 100°C
Kunin et al. (2005) Genome Res
Networks
Networks!!!!
Dagan et al. (2008) PNAS
2. Inferring and comparing trees
Phylogenetic tree reconciliation
Species tree S
Gene tree G
Lateral gene transfer
Subtree prune and regraft
Whidden et al., Syst Biol (2014)
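A single subtree prune and regraft (SPR) move can be illustrated on small rooted trees. The sketch below assumes a nested-tuple tree representation and uses made-up five-leaf trees; the helper names (`prune`, `regraft`, `spr`, `clusters`) are hypothetical and this is not the implementation from the cited paper.

```python
def prune(tree, subtree):
    """Remove `subtree` and suppress the unary node left behind."""
    if tree == subtree:
        return None
    if isinstance(tree, str):
        return tree
    left, right = (prune(child, subtree) for child in tree)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)

def regraft(tree, subtree, target):
    """Attach `subtree` as the sibling of `target`."""
    if tree == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    return tuple(regraft(child, subtree, target) for child in tree)

def spr(tree, subtree, target):
    """One SPR move; `target` must not lie inside `subtree`."""
    return regraft(prune(tree, subtree), subtree, target)

def clusters(tree):
    """Leaf sets below each internal node, for order-independent comparison."""
    found = set()
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below
    leaves(tree)
    return found

# Made-up example trees: the gene tree places B next to D instead of A.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = (("A", "C"), (("B", "D"), "E"))

moved = spr(species_tree, "B", "D")
print(clusters(moved) == clusters(gene_tree))  # True
```

Here one SPR move turns the species tree into the gene tree, which is the pattern a single lateral transfer of the gene from the D lineage into B would leave.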
For two rooted trees, dSPR is equal to the number of components in a MAF (maximum agreement forest), minus 1
So building a MAF is equivalent to inferring the minimum number of SPR events needed to reconcile a species tree with a gene tree
Problem is NP-hard
dSPR = 1
MAF components = 2
Bordewich and Semple, Ann Combinatorics (2005)
T1 T2
Case 1 (separate components)
Case 2 (one pendant node)
Case 3 (several pendant nodes)
Chris’s algorithm
Fixed-parameter tractability
• Problem is dominated by Case 3 (3 alternatives)
• Cut all candidate edges at each step = linear 3-approximation
• Decision problem: decide whether SPR distance ≤ k
• Problem is exponential in SPR distance, NOT number of leaves, and is therefore fixed-parameter tractable (FPT)
Chris Whidden + Norbert Zeh
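The step from "three alternatives" to "exponential in SPR distance only" can be written as a simple recurrence. This is a rough sketch, assuming that each branching step does work linear in the number of leaves n and uses up one unit of the distance budget k; the actual algorithm's analysis may give a tighter bound.

```latex
% Assumption: O(n) work per step, three branches, each reducing the budget k by one.
T(n, k) \le 3\,T(n, k-1) + O(n)
\quad\Longrightarrow\quad
T(n, k) = O\!\left(3^{k}\, n\right)
```

The exponential factor depends only on k, so trees with many leaves remain tractable as long as the SPR distance is small.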
In practice
SPR Supertrees
Supertree: a tree that satisfies some optimality criterion with respect to a set of input trees
SPR supertree: given a set of gene trees, find a tree that minimizes the total number of SPR operations vs. all gene trees
Building an SPR supertree: assemble an initial tree, then propose SPR operations and evaluate each candidate tree’s total SPR distance from the input trees
Whidden et al., 2014
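The optimality criterion described above can be stated compactly. The notation is an assumption introduced here: G_1, …, G_m are the input gene trees and T ranges over candidate supertrees on the full set of genomes.

```latex
% Assumed notation: G_1..G_m are the input gene trees, T a candidate supertree.
T^{*} = \arg\min_{T} \sum_{i=1}^{m} d_{\mathrm{SPR}}(T, G_i)
```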
Why SPR supertrees?
1. Explicit representation of LGT events
2. Branches broken in MAF → implied LGT events. Can build graph of connections
244 bacterial genomes, 40,631 gene trees = Bacterial SPR supertree
LGT patterns for Clostridium
Whidden et al., 2014
(taming in progress) http://en.wikipedia.org/wiki/File:Godzilla_%2754_design.jpg
3. Taming Lachnozilla
What makes LachnoZilla LachnoZilla?
C. difficile….
“Virulence-associated protein”
Mobile DNA
Phylogenetic profile based on extremely good matches to other genomes (> 95% ID, > 95% coverage) = “recent” LGT events
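Building such a profile amounts to filtering sequence matches on the two thresholds. The sketch below assumes each match is a record with percent identity and percent coverage; the record layout, gene names, genome names, and hit values are all made up for illustration.

```python
# Hypothetical hit records; none of these values come from the talk.
hits = [
    {"gene": "gene_1", "genome": "genome_X", "identity": 99.2, "coverage": 98.0},
    {"gene": "gene_1", "genome": "genome_Y", "identity": 96.5, "coverage": 80.0},
    {"gene": "gene_2", "genome": "genome_X", "identity": 71.0, "coverage": 99.0},
    {"gene": "gene_2", "genome": "genome_Z", "identity": 97.1, "coverage": 99.5},
]

def recent_lgt_profile(hits, min_identity=95.0, min_coverage=95.0):
    """Map each gene to the genomes holding a match above both thresholds."""
    profile = {}
    for hit in hits:
        if hit["identity"] > min_identity and hit["coverage"] > min_coverage:
            profile.setdefault(hit["gene"], set()).add(hit["genome"])
    return profile

profile = recent_lgt_profile(hits)
print(profile)  # {'gene_1': {'genome_X'}, 'gene_2': {'genome_Z'}}
```

A match must pass both thresholds, so a near-identical but partial hit (such as gene_1 in genome_Y here) is excluded.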
LZ & friends
279 genomes
Conserved marker-gene tree
Ben Wright
LachnoZilla (and friends) genome graph!
Close relative (expected)
Distant relative (not so expected) (big genome though!)
Selective sharing
Gene-centric graphs
Columns: LZ, Genome 1, Genome 2, Genome 3, Genome 4, Genome 5, Genome 6
Gene 1: × ×
Gene 2: ×
Gene 3: × ×
Gene 4: × × ×
Edge weights are proportional to similarity of distribution
Use graph clustering to divide up completely connected, weighted graph
Gene 2
Gene 3
Gene 1
Gene 4
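The edge weights of a gene-centric graph can be sketched from presence/absence profiles. Two things are assumed here: Jaccard similarity stands in for "similarity of distribution", which the slide does not specify, and the genome memberships are hypothetical, keeping only the number of marks per gene from the slide.

```python
from itertools import combinations

# Hypothetical placement of the presence marks; only the counts per gene
# (2, 1, 2, 3) follow the slide.
profiles = {
    "Gene 1": {"Genome 1", "Genome 2"},
    "Gene 2": {"Genome 5"},
    "Gene 3": {"Genome 1", "Genome 2"},
    "Gene 4": {"Genome 2", "Genome 3", "Genome 4"},
}

def jaccard(a, b):
    """Shared genomes divided by all genomes containing either gene."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Completely connected, weighted graph: one edge per pair of genes.
edge_weights = {
    (g, h): jaccard(profiles[g], profiles[h])
    for g, h in combinations(profiles, 2)
}
for pair, weight in edge_weights.items():
    print(pair, weight)
```

Genes with identical distributions (Gene 1 and Gene 3 in this made-up example) get weight 1.0, so a graph clustering step would be expected to group them together.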
Legionaminic acid
Acetylneuraminic acid (pathogen associated)
Bacteroides pectinophilus
Butyrivibrio proteoclasticus
Eubacterium plexicaudatum
Roseburia
Neighbors
Weirdly named isolates
Lachnozilla in graph form (it all makes sense now)
Mystery isolate #1 (made-up example)
Mystery isolate #2 (made-up example)
Questions
Representations
Clear inference
From pattern to understanding
FIN