Inferring Evolutionary History with Network Models in Population
Genomics: Challenges and Progress
Yufeng Wu
Dept. of Computer Science and Engineering
University of Connecticut, USA
Dagstuhl Seminar, 2010
Recombination
• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third, new equal-length sequence in the genealogy
• Spatial order is important: different parts of the genome are inherited from different ancestors.
[Figure: the prefix of 110001111111001 and the suffix of 000110000001111 are joined at the breakpoint to form the recombinant 110000000001111.]
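The prefix/suffix picture can be sketched in a few lines of Python. The function name `recombine` and the convention that the breakpoint counts the sites copied from the first parent are assumptions made for illustration. A breakpoint of 5 is used because it reproduces the recombinant shown on the slide.

```python
def recombine(first: str, second: str, breakpoint: int) -> str:
    """Single-crossover recombinant: prefix of `first`, suffix of `second`."""
    if len(first) != len(second):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(first):
        raise ValueError("breakpoint out of range")
    # Sites 1..breakpoint come from `first`, the remaining sites from `second`.
    return first[:breakpoint] + second[breakpoint:]


child = recombine("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```

The recombinant keeps the length of its parents, which is why the spatial order of sites is preserved.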
Ancestral Recombination Graph (ARG)
[Figure: an ARG on two sites, with node labels 00, 01, 10 and 11.]
S1 = 00, S2 = 01, S3 = 10, S4 = 10
Mutations
S1 = 00, S2 = 01, S3 = 10, S4 = 11
Recombination
Network model: beyond tree model
Assumption: At most one mutation per site
Reconstruction of Network-based Evolutionary History
Input: DNA sequences (haplotypes) or phylogenetic trees
Biology: meiotic recombination in populations, or reticulate evolutionary processes: horizontal gene transfer or hybrid speciation
Different formulation
Reconstruct the network-based evolutionary history (and related problems)
• Efficiency
• Accuracy
Same objective
Reconstructing ARGs by Parsimony
• Input: a set of binary sequences M
• Goal: reconstruct ARGs deriving M
• Parsimony formulation
– minARG: Minimize the number of recombination events
– NP-complete (Wang, et al)
Kreitman’s data for the Adh locus of D. melanogaster (1983)
The minARG Problem
Uniform sampling of minARGs by treating each minARG as equally likely (Wu)
Estimating the range of minARGs: lower and upper bounds
Structurally constrained ARGs, e.g. galled trees (Wang, et al, Gusfield, et al).
• Simplified ARG topology
Heuristic methods, e.g. program MARGARITA (Durbin, et al.), Song, et al., Parida, et al.
Exact minARG by branch and bound (Lyngso, Song and Hein)
minARG for Kreitman’s data
Challenge: accurate inference of ARGs
Rmin: minimum number of recombinations for M.
L(M): lower bound on Rmin
U(M): upper bound on Rmin
Several lower bounds give L(M)=7.
U(M)=7 for Kreitman’s data (Song, Wu and Gusfield). Thus, Rmin(M)=7
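The reasoning on this slide is a sandwich between the two bounds. Written out with the slide's own symbols:

```latex
L(M) \le R_{\min}(M) \le U(M),
\qquad
L(M) = U(M) = 7 \;\Longrightarrow\; R_{\min}(M) = 7 .
```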
ARG Induces Local Trees
[Figure: an ARG over four sites, with mutations and recombination nodes marked.]
Local trees: evolutionary history at a genomic position.
Trace backwards in time. At recombination node, pick the branch passing alleles to the recombinant at this location.
[Figure: local tree near site 3.]
Data: 0000, 0101, 0110, 1110, 1010
Local Trees Change Across the Genome
Local trees change when moving across recombination breakpoints.
[Figure: local tree near site 2, for the same ARG and data.]
Spatial property: nearby local trees tend to be more similar.
How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees.
Inferring Local Trees
Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths)
Parsimony-based approaches
• Hein (1990, 1993), Song and Hein (2005)
• Wu (2010): shared topological features in nearby trees.
Key: local trees have different topology due to recombination
Trees or Network? Do not reconstruct full network; local trees are very informative
Challenge: How to improve the accuracy?
Accuracy: Robinson-Foulds distances between inferred trees and the simulated tree
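As an illustration of the accuracy measure, here is a minimal sketch of a Robinson-Foulds style distance. It assumes rooted trees written as nested tuples of leaf names and counts the clusters (leaf sets below internal nodes) found in one tree but not the other. The classical definition works on splits of unrooted trees, so this rooted variant and the names `clusters` and `rf_distance` are assumptions for illustration.

```python
def clusters(tree):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):          # a leaf
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def rf_distance(tree1, tree2):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(tree1) ^ clusters(tree2))


simulated = (("1", "2"), ("3", "4"))
inferred_wrong = (("1", "3"), ("2", "4"))
inferred_unresolved = (("1", "2"), "3", "4")   # non-binary tree

print(rf_distance(simulated, simulated))
print(rf_distance(simulated, inferred_wrong))
print(rf_distance(simulated, inferred_unresolved))
```

Non-binary trees are accepted, so a partially refined tree is penalized only for the clusters it is missing.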
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly non-binary) tree topology
– Initialize to a tree containing the split induced by the SNP
• Gradually refine trees by adding new splits to the trees
– Splits found by a set of rules (later)
– Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered
M   A B C
a   0 0 0
b   1 0 0
c   0 0 1
d   1 0 1
e   0 1 1
A Little Background: Compatibility
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
Sites A and B are compatible, but A and C are incompatible.
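The compatibility test can be checked directly. The sketch below assumes the slide's matrix M has rows a to e and columns A, B, C, with its 15 entries read row by row. That layout is an assumption, but it agrees with the statement about sites A, B and C. The function name `compatible` is also an assumption.

```python
def compatible(col_p, col_q):
    """Four-gamete test: incompatible iff all of 00, 01, 10, 11 occur."""
    return len(set(zip(col_p, col_q))) < 4


# Rows a..e of M (assumed row-major reading), columns A, B, C.
M = ["000", "100", "001", "101", "011"]
columns = {name: [row[j] for row in M] for j, name in enumerate("ABC")}

print(compatible(columns["A"], columns["B"]))  # True
print(compatible(columns["A"], columns["C"]))  # False
```

The same test applies to two splits of the sampled sequences by encoding each split as a 0/1 column.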
Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites where these SNPs are pairwise compatible.
– May indicate no topology-altering recombination occurred within the region
• Rule: for site s inside such a region, add the splits of the region's sites to the tree at s.
– Compatibility: a very strong property, unlikely to arise by chance.
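One way to find such a region around a site is to grow a window greedily while every pair of sites inside it stays compatible. This is only a sketch: the greedy growth order, the function names, and the three-column example are assumptions for illustration, and this is not claimed to be the procedure RENT uses.

```python
def compatible(col_p, col_q):
    """Four-gamete test on two binary columns."""
    return len(set(zip(col_p, col_q))) < 4


def fully_compatible_region(columns, s):
    """Grow [lo, hi] around site s while all sites inside are pairwise compatible."""
    lo = hi = s

    def fits(j):
        # Site j may join only if it is compatible with every site already inside.
        return all(compatible(columns[j], columns[k]) for k in range(lo, hi + 1))

    grew = True
    while grew:
        grew = False
        if lo > 0 and fits(lo - 1):
            lo -= 1
            grew = True
        if hi + 1 < len(columns) and fits(hi + 1):
            hi += 1
            grew = True
    return lo, hi


# Three sites (0-based): site 0 and site 2 are incompatible with each other.
sites = [list("01010"), list("00001"), list("00111")]
print(fully_compatible_region(sites, 0))
print(fully_compatible_region(sites, 2))
```

The splits of all sites inside the returned window are then candidates to add to the tree at `s`. A maximal region is not always unique, so a different growth order can return a different window.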
Split Propagation: More General Rule
• Three consecutive sites A, B and C. Sites A and B are incompatible. Does site C matter for the tree at site A?
– Trees at sites A and B are different.
– Suppose site C is compatible with sites A and B. Then?
– Site C may indicate a shared subtree in both trees at sites A and B.
• Rule: a split propagates in both directions until reaching an incompatible tree.
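[Figure: a reticulate network on taxa 1, 2, 3, 4 and the two trees T and T’ (both rooted at ρ) that it displays. One tree is obtained by keeping the two red edges, the other by keeping the two black edges.]
Hybridization event: nodes with in-degree two or more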
Reticulate Networks
Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Gene A
1: 0 0 0
2: 0 0 1
3: 1 1 0
4: 1 0 0
Gene B
1: 0 0 0
2: 1 0 1
3: 0 1 0
4: 0 0 1
Reticulate network: A directed acyclic graph displaying each of the gene trees
The Minimum Reticulation Problem
Given: a set of K gene trees G.
Problem: reconstruct a reticulate network that displays each gene tree in G using Rmin(G) reticulation events, the minimum number possible.
NP-complete: even for K=2
Current approaches:
• exact methods for K=2 case (see Semple, et al)
• impose topological constraints (e.g. galled networks, see Huson, et al.)
[Figure: gene trees T1, T2 and T3 on taxa 1, 2, 3, 4 and a reticulate network N displaying them.]
Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees.
Close lower and upper bounds for an arbitrary number of trees (Wu, 2010)
Performance of PIRN: Optimal Solution
• Lower and upper bounds match for many data sets
[Figure: percentage of data sets with LB=UB (vertical axis) against number of taxa (horizontal axis). K: number of trees; r: level of reticulation.]
Performance of PIRN: Gap of Bounds
• Gap between the lower and upper bounds is small for many data sets
[Figure: gap between lower and upper bounds (vertical axis) against number of taxa (horizontal axis). K: number of trees; r: level of reticulation.]
Reticulate Network for Five Poaceae Trees
[Figure: five Poaceae gene trees: rpoC2, phyB, rbcL, ndhF, ITS.]
Lower bound: 11
Upper bound: 13
Reticulate Network for Five Poaceae Trees
Upper bound: 13 used in this network
Acknowledgement
• More information available at: http://www.engr.uconn.edu/~ywu
• Research supported by National Science Foundation and UConn Research Foundation
Coalescent with Recombination
Coalescent theory: defines a probability distribution over genealogies
Likelihood computation for coalescent with recombination
Probability of ARGs under certain parameters
Likelihood: summation of the probabilities of all the ARGs
Challenging: too many ARGs (Lyngso, Song and Hein)
Importance sampling approach: draw samples (ARGs) with respect to some probability distribution
Works well with no recombination
Does not work well with recombination
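The importance sampling idea can be written out as a formula. The symbols are assumptions introduced here, not taken from the slide: D is the data, G ranges over ARGs, θ stands for the coalescent parameters, and Q is the proposal distribution from which the ARGs G_1, ..., G_N are drawn. Q must be positive wherever the summand is nonzero.

```latex
L(\theta)
  = \sum_{G} P(D \mid G)\, P(G \mid \theta)
  = \mathbb{E}_{G \sim Q}\!\left[ \frac{P(D \mid G)\, P(G \mid \theta)}{Q(G)} \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(D \mid G_i)\, P(G_i \mid \theta)}{Q(G_i)},
  \qquad G_i \sim Q .
```

The estimate is accurate only when Q places most of its mass on ARGs that contribute substantially to the sum, which is harder to arrange once recombination enlarges the space of ARGs.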
Coalescent-based ARG Sampling
Uniform sampling of minARGs (Wu, 2007)
• Treat each minARG as equally likely.
• Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)
Probability of ARGs under certain parameters
Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities.
A related problem: compute coalescent likelihood with recombination efficiently.
Recent work: exact computation of coalescent likelihood under infinite sites model with no recombination (Wu, 2009)
The Mosaic Model
M: input sequences
Assumption: input sequences are descendants of K founder sequences (unknown)
Extant sequences: concatenations of exact copies of founder segments (no shift of position)
• Coloring: assign which position of a sequence is from which founder (color); need consistency
M, K=2: 00000101011111111110
[Figure: coloring of M with breakpoints marked; 5 breakpoints in total.]
The Minimum Mosaic Problem
• Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences to minimize the number of color changes (recombination breakpoints)
• And find the K founder sequences (not part of input)
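A much easier sub-problem helps make the objective concrete: if the founders were already known, the fewest color changes for one sequence follows from a simple dynamic program over sites. The sketch below assumes given founders, which the actual problem does not provide. The function name and the example founders are hypothetical.

```python
def min_breakpoints(sequence, founders):
    """Fewest color changes needed to copy a non-empty `sequence` from `founders`.

    Returns None when some site of `sequence` matches no founder.
    """
    INF = float("inf")
    # cost[c]: fewest breakpoints so far if the current site is copied from founder c.
    cost = [0 if f[0] == sequence[0] else INF for f in founders]
    for j in range(1, len(sequence)):
        best = min(cost)
        cost = [
            # Either stay on founder c, or switch from the best founder (one breakpoint).
            min(cost[c], best + 1) if founders[c][j] == sequence[j] else INF
            for c in range(len(founders))
        ]
    best = min(cost)
    return None if best == INF else best


founders = ["0000", "1111"]          # hypothetical founders, K = 2
data = ["0000", "0011", "0101"]
total = sum(min_breakpoints(s, founders) for s in data)
print(total)
```

The full problem is harder because the founders themselves must be chosen to make this total as small as possible.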
Inferred founders
Data from Rastas and Ukkonen
20 sequences, 40 sites
55 breakpoints: minimum number of breakpoints
The Minimum Mosaic Problem
• Introduced by Ukkonen (2002)
• Simple and easier to visualize
• Main known results
– An exponential-time algorithm that runs in polynomial time for K=2 (Ukkonen 2002)
– An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
– Haplovisual program and other extensions by Rastas and Ukkonen (2007).
– Heuristic algorithm by Roli and Blum (2009)
– Lower bounds for the minimum number of breakpoints needed (Wu, 2010)
• Challenges
– Polynomial-time algorithm for K ≥ 3?
– Concrete applications in biology?