Date post: | 12-Apr-2017 |
ASSEMBLING THE TREE OF LIFE FROM PUBLIC DNA SEQUENCE DATA: PITFALLS, CHALLENGES AND SOLUTIONS
Rutger Vos, Naturalis Biodiversity Center
Twitter: @rvosa
Public DNA sequence data. A lot.
Want a moonshot? Build a rocket.
Assembling a Tree of Life
If you want to build The Tree of Life you will need to build a system for building Trees of Life.
Such a system has many moving parts:
- data mining
- marker selection
- tree inference
- tree calibration
- tree grafting
Data mining
Keyword searches allow one to locate specific genes for specific taxa, assuming that gene names are standardized and applied correctly.
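As a rough illustration of why keyword searches depend on naming, here is a minimal sketch that builds an NCBI Entrez search term for one taxon and several spellings of the same gene. The function name `build_entrez_query` and the synonym list are illustrative assumptions, not part of the talk; `[Organism]` and `[Gene]` are Entrez field tags.

```python
def build_entrez_query(taxon, gene_synonyms):
    """Build an Entrez search term for one taxon and several gene-name spellings.

    Hypothetical helper: it only assembles the query string and does not
    contact NCBI.
    """
    # Quote each name so multi-word names are treated as phrases.
    gene_clause = " OR ".join(f'"{name}"[Gene]' for name in gene_synonyms)
    return f'"{taxon}"[Organism] AND ({gene_clause})'


# The synonym list is an assumption; real records may use other spellings,
# which is exactly the weakness of keyword searching.
query = build_entrez_query("Primates", ["cytb", "cyt b", "cytochrome b"])
print(query)
```

Any spelling missing from the synonym list is silently skipped by such a query, which is why unstandardized gene names are a pitfall.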
Data mining
Similarity ("BLAST") searches allow one to locate sequences that are similar to a query sequence.
Pitfalls in data mining
- Bad gene naming conventions
- Incomplete knowledge of orthology
- Ambiguous taxonomy
Possible solutions
- A Taxonomic Name Resolution Service (TNRS) helps resolve ambiguities in names (e.g. synonyms): http://taxosaurus.org
- Sequence clustering databases help avoid keyword searching and haphazard BLASTing, e.g.: http://phylota.net
Digression: orthology assignment
- Based on gene names? Hopeless: outside of very few markers, gene names are a mess.
- Based on pairwise genome comparisons? Sort of done for proteins (e.g. InParanoid), but not for non-coding sequences.
- Based on pairwise reciprocal best BLAST hits? Quick and cheap, but error-prone, depending on gene losses and gains and on database completeness.
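The reciprocal best hit idea can be sketched in a few lines. This is a minimal illustration, assuming the BLAST results have already been parsed into dictionaries that map `(query, subject)` pairs to bit scores; the function names and the toy scores are made up for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query_id, subject_id): bit_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores (assumed): a3's best hit is b2, but b2's best hit is a2,
# so a3 gets no ortholog call.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b2"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 175, ("b2", "a3"): 60}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

If the true ortholog of `a3` was lost or is simply absent from the database, the method either reports nothing or pairs it with a paralog, which is the error mode named above.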
Supermatrix packing
How to optimize combinations of markers to maximize taxon sampling and minimize sparseness?
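One simple way to think about this question is as a greedy, set-cover-style heuristic: keep adding the marker that contributes the most new taxa, then measure how sparse the resulting matrix is. The sketch below is an assumption made for illustration, not the method used by any particular tool; the marker names and taxon labels are invented.

```python
def pick_markers(coverage, max_markers):
    """Greedily choose markers that each add the most not-yet-sampled taxa.

    coverage: {marker_name: set of taxa with a sequence for that marker}
    """
    chosen, taxa = [], set()
    candidates = dict(coverage)
    while candidates and len(chosen) < max_markers:
        # Sort names first so ties are broken reproducibly.
        name = max(sorted(candidates), key=lambda m: len(candidates[m] - taxa))
        if not candidates[name] - taxa:
            break  # no remaining marker adds a new taxon
        chosen.append(name)
        taxa |= candidates.pop(name)
    return chosen, taxa


def sparseness(coverage, chosen, taxa):
    """Fraction of empty cells in the taxa-by-markers matrix."""
    cells = len(chosen) * len(taxa)
    filled = sum(len(coverage[m] & taxa) for m in chosen)
    return 1 - filled / cells if cells else 0.0


coverage = {
    "cytb": {"A", "B", "C", "D"},
    "COI": {"A", "B", "E"},
    "RAG1": {"C", "D"},
    "12S": {"E", "F"},
}
chosen, taxa = pick_markers(coverage, max_markers=3)
print(chosen, sorted(taxa), sparseness(coverage, chosen, taxa))
```

The toy run shows the trade-off: two markers cover all six taxa, but half of the matrix cells are empty. A real objective would weigh taxon gain against the sparseness each marker introduces.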
Tree inference
- The current gold standard in species trees from multilocus alignments: *BEAST. Very expensive.
- Cheaper, scalable Bayesian tree inference is provided by other tools, e.g. ExaBayes.
- Maybe the two can be combined? For example:
  - A cheap, large, taxonomically broad backbone
  - Within-clade relationships resolved more expensively
Tree calibration
- Fossils now available through the web service of http://fossilcalibrations.org
- Scalable tree calibration tools now exist, e.g. treePL
- *BEAST has more sophisticated methods
Tree grafting
Pitfalls:
• What if your exemplar species aren't on either side of the root?
• Grafting could then easily result in negative branch lengths
• Also, node density effects will lead to younger nodes in the backbone
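The negative branch length pitfall comes down to simple arithmetic on node ages. The sketch below assumes a time-calibrated backbone and a separately dated clade tree. The stem branch created by grafting is the age of the backbone node above the attachment point minus the crown age of the clade tree. The function names and the ages (in million years) are hypothetical.

```python
def stem_length(backbone_parent_age, clade_crown_age):
    """Length of the branch that connects a grafted clade to the backbone.

    Both ages are in the same time unit, measured back from the present.
    """
    return backbone_parent_age - clade_crown_age


def check_graft(backbone_parent_age, clade_crown_age):
    """Return the stem length and whether the graft is consistent in time."""
    stem = stem_length(backbone_parent_age, clade_crown_age)
    return {"stem": stem, "consistent": stem >= 0}


# Assumed ages: the clade crown (22.5) is younger than the backbone node (30.0).
print(check_graft(30.0, 22.5))

# Assumed ages: the backbone node (18.0) is younger than the clade crown (22.5),
# for example because of node density effects, so the stem would be negative.
print(check_graft(18.0, 22.5))
```

A workflow could use a check like this to flag clades whose backbone ages need to be re-estimated before grafting.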
All in one: SUPERSMART
The SUPERSMART platform helps assemble high-quality marker sets and analyze them using configurable divide-and-conquer tree inference approaches.
The platform is available as a VM, as a Docker container, or as an easy-to-install (using Puppet) software stack.
Results: Primates and Palms
Conclusions
- More and more moving parts for a ToL moonshot are becoming available
- Workflows that result in high-quality estimates can be composed from these
- SUPERSMART uses a recursive workflow with different inference methods at different levels
Acknowledgements
Alexandre Antonelli, Hannes Hettling, Mike Sanderson, Bengt Oxelman, Karin Nilsson, Mats Töpel, Hervé Sauquet, Henrik Nilsson, Daniele Silvestro, Fabien Condamine, Ruud Scharn
Questions?
Thanks for listening! Contact me at @rvosa
For more about our project: http://www.supersmart-project.org