Assembling the Tree of Life from public DNA sequence data

Date post: 12-Apr-2017
Upload: rutger-vos
Page 1: Assembling the Tree of Life from public DNA sequence data

ASSEMBLING THE TREE OF LIFE FROM PUBLIC DNA SEQUENCE DATA: PITFALLS, CHALLENGES AND SOLUTIONS

Rutger Vos, Naturalis Biodiversity Center
Twitter: @rvosa

Page 2: Assembling the Tree of Life from public DNA sequence data

Public DNA sequence data. A lot.

Page 3: Assembling the Tree of Life from public DNA sequence data
Page 4: Assembling the Tree of Life from public DNA sequence data

Want a moonshot? Build a rocket.

Page 5: Assembling the Tree of Life from public DNA sequence data

Assembling a Tree of Life

If you want to build The Tree of Life you will need to build a system for building Trees of Life.

Such a system has many moving parts:

- data mining
- marker selection
- tree inference
- tree calibration
- tree grafting

Page 6: Assembling the Tree of Life from public DNA sequence data

Data mining

Keyword searches allow one to locate specific genes for specific taxa, assuming that gene names are standardized and applied correctly.
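As a rough illustration of why keyword searching depends on naming conventions, the sketch below composes a single search term that covers several spellings of one gene. It assumes NCBI Entrez-style field tags (`[Organism]`, `[Gene]`); the helper name `build_entrez_query` and the list of aliases are illustrative choices, not part of the talk.

```python
def build_entrez_query(taxon, gene_names):
    """Compose an Entrez-style search term for one taxon and several gene aliases."""
    if not gene_names:
        raise ValueError("at least one gene name is required")
    # Quote every alias so multi-word names are matched as a phrase.
    gene_clause = " OR ".join(f'"{name}"[Gene]' for name in gene_names)
    return f'"{taxon}"[Organism] AND ({gene_clause})'


# Cytochrome b is annotated under several names; a search on one alias
# alone would silently miss records that use another.
query = build_entrez_query("Primates", ["cytb", "cob", "MT-CYB"])
print(query)
```

The resulting string could be passed to a search client; any alias missing from the list remains invisible to the search, which is the pitfall the slide points at.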

Page 7: Assembling the Tree of Life from public DNA sequence data

Data mining

Similarity ("BLAST") searches allow one to locate sequences that are similar to a query sequence.

Page 8: Assembling the Tree of Life from public DNA sequence data

Pitfalls in data mining

- Bad gene naming conventions
- Incomplete knowledge of orthology
- Ambiguous taxonomy

Page 9: Assembling the Tree of Life from public DNA sequence data

Possible solutions

- TNRS helps resolve ambiguities in names (e.g. synonyms): http://taxosaurus.org
- Sequence clustering databases help avoid keyword searching and haphazard BLASTing, e.g.: http://phylota.net

Page 10: Assembling the Tree of Life from public DNA sequence data
Page 11: Assembling the Tree of Life from public DNA sequence data

Digression: orthology assignment

- Based on gene names? Hopeless: outside of very few markers, gene names are a mess.
- Based on pairwise genome comparisons? Sort of done for proteins (e.g. InParanoid), but not for non-coding sequences.
- Based on pairwise reciprocal best BLAST hits? Quick and cheap, but error-prone, depending on gene losses and gains and on database completeness.
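The reciprocal-best-hit idea can be sketched in a few lines. This is a minimal illustration, assuming the BLAST results have already been parsed into dictionaries of bit scores; the sequence identifiers and scores below are invented toy values, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query id to its highest-scoring subject id.

    `scores` maps query id -> {subject id: bit score}.
    """
    return {
        query: max(hits, key=hits.get)
        for query, hits in scores.items()
        if hits
    }


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy bit scores (invented for illustration only).
a_vs_b = {
    "A1": {"B1": 250.0, "B2": 90.0},
    "A2": {"B2": 180.0, "B3": 175.0},
    "A3": {"B3": 60.0},
}
b_vs_a = {
    "B1": {"A1": 245.0},
    "B2": {"A2": 182.0, "A1": 88.0},
    "B3": {"A2": 176.0, "A3": 61.0},
}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data A3's best hit is B3, but B3's best hit is A2, so A3 gets no partner. If the true counterpart of A3 had been lost or was simply absent from the database, the method would have no way to tell, which is the error source mentioned above.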

Page 12: Assembling the Tree of Life from public DNA sequence data

Supermatrix packing

How to optimize combinations of markers to maximize taxon sampling and minimize sparseness?
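One simple way to approach this trade-off is a greedy heuristic, sketched below. It is not necessarily the method used in the talk: it assumes each marker is described only by the set of taxa it has been sequenced for, ranks markers by taxon count, and accepts a marker only if the fraction of empty cells in the resulting matrix stays under a chosen threshold. The taxon labels and the threshold are hypothetical.

```python
def sparseness(markers, coverage):
    """Fraction of empty cells in the taxa-by-markers matrix."""
    taxa = set().union(*(coverage[m] for m in markers))
    if not markers or not taxa:
        return 0.0
    filled = sum(len(coverage[m]) for m in markers)
    return 1.0 - filled / (len(taxa) * len(markers))


def pack_supermatrix(coverage, max_sparseness):
    """Greedily add markers, widest taxon coverage first, while staying dense enough."""
    # Ties in coverage are broken by marker name to keep the result deterministic.
    ranked = sorted(coverage, key=lambda m: (-len(coverage[m]), m))
    chosen = []
    for marker in ranked:
        candidate = chosen + [marker]
        if sparseness(candidate, coverage) <= max_sparseness:
            chosen = candidate
    taxa = set().union(*(coverage[m] for m in chosen))
    return chosen, taxa


# Toy coverage table: marker -> taxa with a sequence for that marker.
coverage = {
    "cytb": {"t1", "t2", "t3", "t4"},
    "COI": {"t1", "t2", "t3"},
    "RAG1": {"t1", "t5"},
    "12S": {"t2", "t3", "t4"},
}

chosen, taxa = pack_supermatrix(coverage, max_sparseness=0.25)
print(chosen, sorted(taxa))
```

Here RAG1 is rejected: it would add taxon t5 but push the proportion of missing cells to 0.4. A greedy pass like this is cheap but not guaranteed to find the best combination.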

Page 13: Assembling the Tree of Life from public DNA sequence data

Tree inference

- The current gold standard in species trees from multilocus alignments: *BEAST. Very expensive.
- Cheaper, scalable Bayesian tree inference is provided by other tools, e.g. ExaBayes.
- Maybe the two can be combined? For example:
  - A cheap, large, taxonomically broad backbone
  - Within-clade relationships resolved more expensively

Page 14: Assembling the Tree of Life from public DNA sequence data

Tree calibration

- Fossils now available through web service of http://fossilcalibrations.org
- Scalable tree calibration tools now exist, e.g. treePL
- *BEAST has more sophisticated methods

Page 15: Assembling the Tree of Life from public DNA sequence data

Tree grafting

Pitfalls:
• What if your exemplar species aren't on either side of the root?
• Grafting could then easily result in negative branch lengths
• Also, node density effects will lead to younger nodes in the backbone
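The first two pitfalls can be made concrete with a small check. This is a sketch under simplifying assumptions: the clade tree is summarized by the tip sets on the two sides of its root, node ages are plain numbers, and all species labels and ages are invented. The function names are hypothetical.

```python
def exemplars_span_root(root_children_tips, exemplars):
    """True if at least one exemplar falls on each side of the clade root."""
    left, right = root_children_tips
    return bool(exemplars & left) and bool(exemplars & right)


def stem_length(backbone_parent_age, clade_root_age):
    """Branch length from the grafted clade root up to its parent in the backbone."""
    return backbone_parent_age - clade_root_age


# Tip sets on either side of the clade tree's root (toy labels).
root_children_tips = ({"sp_a", "sp_b"}, {"sp_c", "sp_d"})

good = exemplars_span_root(root_children_tips, {"sp_a", "sp_c"})
bad = exemplars_span_root(root_children_tips, {"sp_a", "sp_b"})

# If both exemplars sit on one side, the backbone only dated a subclade.
# The full clade's root may then be older than the backbone parent node
# (toy ages), and the connecting branch comes out negative.
length = stem_length(backbone_parent_age=20.0, clade_root_age=23.5)
print(good, bad, length)
```

Checking that the exemplars span the root before grafting is one way to catch the problem early; how to repair it (rescaling, re-picking exemplars) is a separate design decision.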

Page 16: Assembling the Tree of Life from public DNA sequence data

All in one: SUPERSMART

The SUPERSMART platform helps assemble high quality marker sets and analyze them using configurable divide-and-conquer tree inference approaches.

The platform is available as a VM, a Docker container, or as an easy-to-install (using Puppet) software stack.

Page 17: Assembling the Tree of Life from public DNA sequence data

Results: Primates and Palms

Page 18: Assembling the Tree of Life from public DNA sequence data

Conclusions

- More and more moving parts for a ToL moonshot are becoming available
- Workflows that result in high-quality estimates can be composed from these
- SUPERSMART uses a recursive workflow with different inference methods at different levels

Page 19: Assembling the Tree of Life from public DNA sequence data

Acknowledgements

- Alexandre Antonelli
- Hannes Hettling
- Mike Sanderson
- Bengt Oxelman
- Karin Nilsson
- Mats Töpel
- Hervé Sauquet
- Henrik Nilsson
- Daniele Silvestro
- Fabien Condamine
- Ruud Scharn

Page 20: Assembling the Tree of Life from public DNA sequence data

Questions?

Thanks for listening! Contact me at @rvosa

For more about our project: http://www.supersmart-project.org
