
Understanding sets of trees CS 394C September 10, 2009.

Page 1: Understanding sets of trees CS 394C September 10, 2009.

Understanding sets of trees

CS 394C

September 10, 2009

Page 2

Basic challenge

• Phylogenetic analyses are sometimes based upon a single marker, but often based upon many markers

• Each marker can be analyzed separately, or the entire set can be combined into one “super-matrix”

• Each matrix (each dataset) can result in many trees (almost no matter how you analyze the matrix)

What to do with huge numbers of trees?

Page 3

What to do?

• How to estimate evolutionary history from many trees

• How to efficiently store large sets of trees

• How to enable efficient queries of the set of trees

Page 5

First, a few questions:

• Why are gene trees different from the species tree?

• Why are estimated gene trees different from the true gene tree?

• Under what conditions is the true evolutionary history not a tree? (i.e., what is “reticulation”?)

Page 6

Reticulation

• Evolutionary histories can be reticulate (meaning non-treelike):
  – Horizontal Gene Transfer (HGT)
  – Hybrid speciation
  – Recombination

• Most phylogeny estimation methods produce trees.

• Good resource about reticulate phylogenies: book chapter by Luay Nakhleh (see 394C webpage for the link)

Page 7

• We will assume that all evolutionary histories are treelike for the remainder of today’s presentation.

• Later in the course we’ll discuss reticulate evolution…

Page 8

Estimated Gene Trees can differ from Species Trees

• Biological reasons:
  – Deep coalescent events (alleles)
  – Gene duplication and loss (gene families)

• Computational reasons:
  – Insufficient time
  – Poor methods (e.g., UPGMA)
  – Poor models (e.g., ML using Jukes-Cantor)

• Data issues:
  – Insufficient data (meaning not enough sites)
  – Poor alignments

Page 9

Examples of problems

When true gene trees can differ from the species tree:
• Given a collection of gene trees, find a species tree that minimizes the number of “deep coalescent” events

When true gene trees should equal the species tree:
• Given a collection of gene trees, find a species tree that minimizes the total distance to the gene trees
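The slide does not say which tree distance is meant. As a minimal sketch, assume the Robinson-Foulds distance (the number of bipartitions found in exactly one of two trees) and assume each unrooted tree is given as a list of its non-trivial bipartitions, written as one side of the split. The function names and the toy data are hypothetical, and the candidate species trees are restricted to the input trees purely for illustration.

```python
def canonical(split, taxa):
    """Canonical form of a bipartition: the side that excludes the smallest taxon."""
    side = frozenset(split)
    return frozenset(taxa) - side if min(taxa) in side else side


def rf_distance(tree_a, tree_b, taxa):
    """Robinson-Foulds distance: bipartitions present in exactly one of the trees."""
    a = {canonical(s, taxa) for s in tree_a}
    b = {canonical(s, taxa) for s in tree_b}
    return len(a ^ b)


def total_distance(candidate, gene_trees, taxa):
    """Sum of distances from a candidate species tree to every gene tree."""
    return sum(rf_distance(candidate, g, taxa) for g in gene_trees)


# Toy data (hypothetical): three gene trees on taxa 1..5.
taxa = {1, 2, 3, 4, 5}
gene_trees = [[{1, 2}, {1, 2, 3}], [{1, 3}, {1, 2, 3}], [{1, 2}, {4, 5}]]

# Assumption: only the input trees are considered as candidates.
best = min(gene_trees, key=lambda t: total_distance(t, gene_trees, taxa))
```

A real method would search a much larger space of candidate trees; the sketch only shows how the objective is evaluated.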

Page 10

When gene trees can differ from species tree

Software/Algorithms for deep-coalescent (see PhyloNet from Nakhleh’s webpage at Rice)

GLASS (Roch and Mossel) - distance-based

MDC (Than and Nakhleh) - parsimony

STEM (Kubatko) - ML

BEST (Liu et al.) - Bayesian

BUCKy (Ané et al.) - Bayesian

Software/Algorithms for duplication-loss

NOTUNG (Durand)

Duptree (Bansal et al.)

Hallett and Lagergren - algorithms/complexity

Page 11

When gene trees should equal the species tree

• The problem here is that estimated gene trees can differ from the true gene trees.

• Although the problem is “simple”, it is still interesting -- computationally and mathematically.

• Plus, we can still make novel contributions.

Page 12

The very simplest problem

Easiest case:
• One species tree; true gene trees will agree with the species tree
• Estimated trees are on the full set of taxa

Approaches:
• Consensus methods: return a tree on the entire set S of taxa summarizing the input trees
• Agreement methods: return a tree on a subset of the taxa on which the trees agree
• Clustering, then consensus/agreement

Page 13

Consensus methods

• These are the most common ways of analyzing datasets of trees

• Examples:
  – Strict consensus
  – Majority consensus
  – Greedy consensus (aka “extended majority”)
  – Others less frequently used include: Gordon’s, Adams, the Strict Consensus Supertree, Local Consensus methods, and more.

• Survey paper by David Bryant for some of these
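The first two consensus methods can be sketched by counting how often each bipartition occurs. This is an illustration under stated assumptions, not a reference implementation: each input tree is given as a set of bipartitions, and every bipartition is written as the side that excludes a fixed reference taxon (here "a") so that equal splits compare equal. The function names and toy trees are hypothetical.

```python
from collections import Counter


def split_counts(trees):
    """Count, for each bipartition, the number of input trees containing it."""
    counts = Counter()
    for splits in trees:
        counts.update(set(splits))
    return counts


def strict_consensus(trees):
    """Bipartitions present in every input tree."""
    counts = split_counts(trees)
    return {s for s, c in counts.items() if c == len(trees)}


def majority_consensus(trees):
    """Bipartitions present in more than half of the input trees."""
    counts = split_counts(trees)
    return {s for s, c in counts.items() if 2 * c > len(trees)}


# Toy input (hypothetical) on taxa a..e; each split is the side without "a".
t1 = {frozenset("de"), frozenset("cde")}  # abc|de and ab|cde
t2 = {frozenset("de"), frozenset("bde")}  # abc|de and ac|bde
t3 = {frozenset("de"), frozenset("cde")}
strict = strict_consensus([t1, t2, t3])      # only abc|de survives
majority = majority_consensus([t1, t2, t3])  # abc|de and ab|cde
```

The returned bipartitions are always pairwise compatible for these two methods, so they define a tree; the strict consensus is a contraction of the majority consensus.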

Page 14

Simplest problems, cont.

• “Agreement” methods return trees on subsets of S, on which the trees are the same (or compatible)
  – MAST: maximum agreement subtree (used in practice, sometimes)
  – MCST: maximum compatible subtree (Ganapathy et al., not used in practice)

• The difference between these is how polytomies are handled

Page 15

Soft vs. hard polytomies

• Polytomy: node of high degree (greater than three for an unrooted tree)

• Polytomies arise in estimations when consensus methods are used

• Polytomies also arise when contracting short branches in estimated trees

• Polytomies can be “hard” (representing true radiations) or “soft” (representing lack of information)

Page 16

Compatible source trees

• Estimated trees can be “compatible” when we interpret polytomies as “soft”

• “Compatible” means that there is a tree which is a common refinement.

• Example: 123|456, 12|3456, 1235|46.

• We can compute the compatibility tree (when it exists) in O(nk) time, where n=|S| and there are k source trees
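To make the example concrete, here is a small sketch that reads each item in the example as a bipartition of the taxa 1..6 (an assumption about the slide's notation). Two bipartitions A|B and C|D are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a set of bipartitions has a common refinement exactly when it is pairwise compatible. This naive pairwise check is not the O(nk) algorithm mentioned above; the function names are hypothetical.

```python
from itertools import combinations


def parse_split(text):
    """Parse a string such as '123|456' into a pair of taxon sets."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def compatible(s, t):
    """True if at least one of the four side-by-side intersections is empty."""
    (a, b), (c, d) = s, t
    return not (a & c) or not (a & d) or not (b & c) or not (b & d)


def all_compatible(splits):
    """Pairwise check over every pair of bipartitions."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


example = [parse_split(x) for x in ("123|456", "12|3456", "1235|46")]
conflict = [parse_split(x) for x in ("12|3456", "13|2456")]
```

Here `all_compatible(example)` is true, while `conflict` fails because taxon 1 cannot be grouped with 2 and with 3 separately in one tree.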

Page 17

Computational complexity

• Most consensus methods (which return a tree on the entire set S of taxa) are polynomial time.

• Most “agreement methods” (which return a tree on the largest subset of the taxa on which the source trees “agree”) are based upon NP-hard problems. Some (e.g., MAST) have fixed-parameter polynomial time solutions.

Page 18

Supertree problems

• Realistic complexity: not all the source trees are on the same set of taxa.

• Obvious problems:
  – Find the tree on which all the source trees agree (if it exists).
  – Find the tree on which a maximum number of the source trees agree.

• Both are NP-hard.

Page 19

Quartet compatibility

• Simple case: all the source trees are on four taxa.

• We ask: does there exist a tree which agrees with all the source trees?

• NP-hard!

Page 20

Quartet tree amalgamation

• Given a collection of quartet trees, find a tree which agrees with a maximum number of these quartet trees

• NP-hard, since compatibility is NP-hard

• Hard to approximate, but there is a PTAS if you have a tree on every quartet of taxa (Jiang et al.)
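The objective being maximized can be sketched as follows: a tree agrees with the quartet tree ab|cd when some bipartition of the tree puts a and b on one side and c and d on the other. The sketch only scores a given candidate tree; it does not search for the best one, which is the hard part. The representation (a tree as a list of bipartitions, a quartet as a pair of two-taxon strings) and all names are assumptions made for illustration.

```python
def split(text):
    """Parse a string such as '12|345' into a pair of taxon sets."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def displays(tree_splits, quartet):
    """True if some bipartition of the tree separates the two pairs of the quartet."""
    pair1, pair2 = (set(p) for p in quartet)
    for left, right in tree_splits:
        if (pair1 <= left and pair2 <= right) or (pair1 <= right and pair2 <= left):
            return True
    return False


def score(tree_splits, quartets):
    """Number of input quartet trees the candidate tree agrees with."""
    return sum(displays(tree_splits, q) for q in quartets)


# Candidate tree on taxa 1..5 with internal edges 12|345 and 123|45.
tree = [split("12|345"), split("123|45")]
# Input quartet trees 12|34, 13|24 and 13|45 (hypothetical data).
quartets = [("12", "34"), ("13", "24"), ("13", "45")]
satisfied = score(tree, quartets)
```

In this toy input the quartets 12|34 and 13|24 conflict with each other, so no tree can satisfy all three.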

Page 21

Quartet amalgamation algorithms

• Quartet Puzzling (Strimmer and von Haeseler)

• Q* (Berry et al.)

• Quartet Cleaning (Berry et al.)

• Weight Optimization (Ranwez and Gascuel)

• Quartets MaxCut (Snir and Rao)

But see also the paper by St. John et al. (on the CS 394C webpage) evaluating early quartet methods

Page 22

What about rooted trees?

Given a set of rooted source trees, we ask:

• Is there a tree that is consistent with all the rooted source trees?

Page 23

Rooted tree compatibility

• Aho, Sagiv, Szymanski, and Ullman: polynomial time, recursive algorithm:
  – If n=1, return the singleton tree.
  – If n>1, then compute an equivalence relation on the set of taxa as follows.
    • For each rooted triple ((a,b),c) in the set, put a and b in the same equivalence class.
    • Compute transitive closure.
  – If only one equivalence class, reject (set is incompatible). Otherwise, recurse on each subset, and return tree obtained by making all recursively computed trees sibling subtrees.
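The recursion above can be sketched in a few lines. This is a minimal illustration, not the authors' code: rooted triples ((a,b),c) are passed as tuples (a, b, c), the output tree is a nested tuple of taxon names, and `None` signals rejection. One detail assumed here, as in the standard formulation, is that each recursive call only uses the triples whose three taxa all lie in the current subset. The transitive closure is computed with a small union-find.

```python
def build(taxa, triples):
    """Return a nested-tuple tree consistent with all triples, or None if incompatible."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))  # singleton tree

    parent = {t: t for t in taxa}

    def find(x):
        # Follow parent pointers to the class representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: only triples fully contained in the current taxon set apply.
    relevant = [t for t in triples if set(t) <= taxa]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)  # put a and b in the same class

    classes = {}
    for t in taxa:
        classes.setdefault(find(t), set()).add(t)
    if len(classes) == 1:
        return None  # only one equivalence class: reject

    children = []
    for block in classes.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)  # recursively built trees become sibling subtrees


ok = build("abcd", [("a", "b", "c"), ("c", "d", "a")])
bad = build("abc", [("a", "b", "c"), ("a", "c", "b")])
```

For the first call the classes are {a, b} and {c, d}, giving a tree with those two clades as siblings (child order may vary). The second call is rejected because a, b, and c collapse into one class.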

Page 24

Subtree compatibility

• If source trees are rooted, then compatibility can be tested in polynomial time. Optimization problems are NP-hard, however.

• If source trees are unrooted, then compatibility is NP-hard. And so optimization problems are also NP-hard.

Page 25

Supertree problems, in practice

• In practice, the most frequently used supertree method is MRP, for “Matrix Representation with Parsimony”.

• There are, however, many other supertree methods!

Page 26

Many Supertree Methods

• MRP (Matrix Representation with Parsimony; most commonly used)
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

Page 27

MRP

• Idea: take every source tree, and replace it with a matrix of 0,1,?.

• Concatenate the matrices.

• Apply Maximum Parsimony.

If all the source trees are compatible, then an exact solution to MRP will return the compatibility trees.
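The first two steps can be sketched as below, under the usual encoding (an assumption, since the slide does not spell it out): each internal edge of a source tree becomes one column, with 1 for taxa on one side of the edge, 0 for the other taxa of that source tree, and ? for taxa absent from that source tree. Each source tree is given here as a pair (its taxon set, a list of clusters, one per internal edge). The names are hypothetical, and the maximum parsimony step is left to an external program.

```python
def mrp_matrix(source_trees, all_taxa):
    """Build the concatenated 0/1/? matrix, one row per taxon."""
    rows = {t: [] for t in sorted(all_taxa)}
    for taxa, clusters in source_trees:
        for cluster in clusters:  # one column per internal edge
            for t in rows:
                if t not in taxa:
                    rows[t].append("?")  # taxon missing from this source tree
                elif t in cluster:
                    rows[t].append("1")
                else:
                    rows[t].append("0")
    return {t: "".join(cols) for t, cols in rows.items()}


# Two source trees on overlapping taxon sets (hypothetical data).
sources = [
    ({"a", "b", "c", "d"}, [{"a", "b"}]),  # quartet ab|cd
    ({"b", "c", "d", "e"}, [{"d", "e"}]),  # quartet bc|de
]
matrix = mrp_matrix(sources, "abcde")
```

The resulting rows are a: `1?`, b: `10`, c: `00`, d: `01`, e: `?1`. Concatenation happens implicitly because columns from successive source trees are appended to the same rows.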

Page 28

Homework, due 9/15

• Read two papers (linked on the webpage):
  – St. John et al., about quartet-based methods
  – Moret et al., about sequence-length requirements

• Pick one, write summary, and include questions

Page 29

Question!

• How do you feel about occasionally having class on some Monday or Friday, so we can have guest lectures?

