The iPlant Tree of Life Project and Toolkit: Building a
Cyberinfrastructure for Plant Science Research
Naim Matasci, The iPlant Collaborative
Evolution 2011
Jun 17-21, 2011
What is iPlant?
Discovery Environment
NEW RELEASE COMING SOON!
http://www.iplantcollaborative.org/discovery-environment-preview-access
Physical Infrastructure
Computation
• 63K-core cluster
• 20K-core cluster
• 1 TB RAM
Storage
• 2 PB
• 20 PB archive
Cloud Storage
• Store, access and share large datasets
• Multiple points of entry: web interface, mounted FS, API
• Free and secure
AVAILABLE NOW!
http://www.iplantcollaborative.org/about/policies/data-set-hosting
Cloud Computing
• Virtual Machines
– Up to 4 cores, 32 GB RAM, 100 GB dedicated disk
– Run any x86-compatible OS (even Windows)
– Persistent or on-demand
– Log in via SSH or secure VNC
• Use Cases
– Internet-enabled servers
– Database management appliances
– Virtual desktops
– … The sky is the limit!
AVAILABLE NOW!
http://www.iplantcollaborative.org/atmosphere-preview
Consumer Applications
iPlant's CI
iPlant Tree of Life Grand Challenge
Large phylogenetic inference: Building a tree of life for up to 500,000 green plants
Tree Visualization: Scalable visualization for small to large trees
Data Assembly and Integration: Acquiring, organizing and processing the data
Taxonomic Intelligence: Sorting out different names for the same species
Tree Reconciliation: Resolving discordant gene and species trees
Trait Evolution: Using trees to understand how traits evolved
Big Trees: To optimize existing methods to construct phylogenetic trees on the order of 500K taxa.
Big Trees
NINJA/WINDJAMMER (Travis Wheeler)
Neighbor-Joining implementation that can analyze > 200K species
Six-day run time reduced 32-fold to 4.5 hours for a 220K-species data set
Two- to three-day run time reduced 1,800-fold to 2 minutes for the distance matrix calculation on the 220K set
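For readers unfamiliar with the method, the following is a minimal sketch of textbook neighbor joining (topology only, cubic time). It is not NINJA's or WINDJAMMER's optimized implementation; the function name, the nested-tuple tree representation, and the four-taxon distance matrix are assumptions made for illustration.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Textbook neighbor joining (illustrative, O(n^3)); returns a nested-tuple topology.

    `dist` maps pairs of names, e.g. ("A", "B"), to their distance.
    """
    nodes = list(names)
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other active nodes.
        total = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
                 for a in nodes}

        def q(pair):
            # Q criterion: the pair with the smallest value is joined next.
            a, b = pair
            return (n - 2) * d[frozenset(pair)] - total[a] - total[b]

        a, b = min(combinations(nodes, 2), key=q)
        merged = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c != a and c != b:
                d[frozenset((merged, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))]
                    - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c != a and c != b] + [merged]
    return tuple(nodes)

# Hypothetical additive distances for four taxa (A and B, C and D are sisters).
example = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
           ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
tree = neighbor_joining("ABCD", example)
```

Each round of this naive version scans all pairs, which is what makes it impractical at the 200K-species scale; reducing that cost is where specialized implementations differ.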
RAxML-Light (Alexandros Stamatakis)
Large Scale Maximum Likelihood implementation
55K Tree published (Stephen A. Smith et al., “Understanding angiosperm diversification using small and large phylogenetic trees,” American Journal of Botany 98, no. 3 (2011): 404-414)
AVAILABLE NOW!
Tree Visualization: To develop an application for viewing, analyzing and exploring large phylogenetic trees.
Tree Visualization
• > 500K taxa
• Fast
• Web based, platform independent
• Semantic zooming
• Metadata-driven display of information
iPlant Tree Viewer Prototype
AVAILABLE NOW!
http://portnoy.iplantcollaborative.org/
1KP: Collaboration (1KP) to support the data analysis of the Thousand Plant Transcriptomes Project
1KP
[Figure: N(genes) versus N(species). Labels: "dozens of species completed genomes"; "dozens of genes PCR in 104 species"; "unexplored territory".]
Phylogenomics of 1000 species across plant taxa
Broad phylogenetic coverage: algae, non-flowering, flowering (angiosperm)
on role of polyploidy in Darwin’s “abominable mystery”
Tree Reconciliation: To reconcile the evolutionary history of genes and species.
Gene family data courtesy John Bowers
Tree Reconciliation
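One common way to frame reconciliation is to map every gene-tree node to the smallest species-tree clade that contains all of its species, and to flag a node as a duplication when it maps to the same clade as one of its children. The sketch below illustrates that idea only; it is not the project's reconciliation tool, and the nested-tuple trees, the function names, and the convention that gene leaves are labeled with their species name are all assumptions.

```python
def clades(tree):
    """Yield the leaf set of every node in a nested-tuple tree (node itself last)."""
    if isinstance(tree, str):
        yield frozenset([tree])
        return
    leaves = frozenset()
    for child in tree:
        sub = list(clades(child))
        yield from sub
        leaves |= sub[-1]  # the last set yielded is the child's own leaf set
    yield leaves

def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes inferred as duplications by smallest-clade mapping."""
    # Clades containing a given species set are nested, so the smallest is the LCA.
    species_clades = sorted(clades(species_tree), key=len)

    def lca(species):
        return next(c for c in species_clades if species <= c)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:
            duplications += 1  # node maps to the same clade as a child
        return here

    visit(gene_tree)
    return duplications

# Hypothetical three-species example.
species = (("A", "B"), "C")
congruent = count_duplications((("A", "B"), "C"), species)
discordant = count_duplications((("A", "C"), "B"), species)
two_copies = count_duplications(((("A", "B"), ("A", "B")), "C"), species)
```

The linear scan in `lca` keeps the sketch short; a tool working on large gene families would need a faster ancestor lookup.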
Taxonomic Name Resolution: Collaboration (BIEN) to unify and resolve synonymous, erroneous, or otherwise conflicting taxonomic names.
Taxonomic uncertainty
1. Non-existent names
• Misspellings
• Contamination
• Annotations
• Morphospecies
• Digitization issues (frame shifts, character encoding)
• Lexical variants (digitization conventions)
2. Synonymy
• Nomenclatural synonyms
• Taxonomic synonyms / concepts
3. Misidentifications, incomplete identifications
AS SEEN IN NATURE!
AVAILABLE NOW!
Taxonomic Name Resolution Service
• Computer assisted standardization of plant names
• Corrects spelling errors and alternative spellings against a standard list of names
• Converts out-of-date names to currently accepted names
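The two steps above (spelling correction against a standard list, then conversion of out-of-date names) can be sketched with Python's standard `difflib` module. This is an illustration of the idea, not the Taxonomic Name Resolution Service's actual algorithm or API; the tiny name lists, the `resolve_name` function, and the 0.85 similarity cutoff are assumptions.

```python
import difflib

# Toy reference data (assumed for illustration; a real service uses curated lists).
ACCEPTED = {"Arabidopsis thaliana", "Zea mays", "Solanum lycopersicum"}
SYNONYMS = {"Lycopersicon esculentum": "Solanum lycopersicum"}

def resolve_name(raw, cutoff=0.85):
    """Return the accepted name for `raw`, or None if nothing is close enough."""
    # Normalize whitespace and capitalization: "Genus epithet".
    name = " ".join(raw.split()).capitalize()
    known = sorted(ACCEPTED | set(SYNONYMS))
    # Step 1: fuzzy-match against every known name, accepted or synonym.
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    if not matches:
        return None
    # Step 2: convert an out-of-date name to its accepted name.
    return SYNONYMS.get(matches[0], matches[0])

misspelled = resolve_name("Arabidopsis thalliana")
outdated = resolve_name("lycopersicon  esculentum")
unknown = resolve_name("Quercus alba")
```

Returning `None` rather than the nearest weak match keeps unresolved names visible for manual review.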
Trait Evolution: To develop an infrastructure for downstream analysis of large trees.
Trait Evolution
• Toolkit to study the evolution of traits of interest on very large phylogenies
– Diversification
– Biogeographic patterns
– Adaptation
– Co-evolution
– …
Current analyses (Proof of concept)
• Phylogenetically Independent Contrasts (Felsenstein 1985)
• Continuous Ancestral Character Estimation (Schluter et al. 1997, Paradis 2004)
• Discrete Ancestral Character Estimation (Pagel 1994, Paradis 2004)
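As an illustration of the first analysis listed, here is a minimal sketch of independent contrasts for one continuous trait on a fully bifurcating tree. It is not the project's implementation; the tree encoding (`(label_or_children, branch_length)` pairs), the function name, and the trait values are assumptions.

```python
import math

def independent_contrasts(node, traits):
    """Return (node value, adjusted branch length, standardized contrasts).

    A tip is ("name", length); an internal node is ((left, right), length).
    """
    label, length = node
    if isinstance(label, str):
        return traits[label], length, []
    left, right = label
    x1, v1, c1 = independent_contrasts(left, traits)
    x2, v2, c2 = independent_contrasts(right, traits)
    # Contrast between the two daughters, standardized by its expected variance.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Node value: average of the daughters weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch below this node to reflect estimation uncertainty.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]

# Hypothetical tree ((A:1,B:1):1,C:2) with made-up trait values.
tree = (((("A", 1.0), ("B", 1.0)), 1.0), ("C", 2.0))
tree = (tree, 0.0)  # root with zero-length branch
root_value, _, contrasts = independent_contrasts(tree, {"A": 1.0, "B": 3.0, "C": 5.0})
```

The recursion is fine for small examples; on trees with hundreds of thousands of tips an iterative post-order traversal would be needed to avoid Python's recursion limit.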
Community Integrated (2½-Day Workshop)
My-Plant.org: To easily share information and research, collaborate, and stay on top of the latest news in the field.
Collaborative Tool
AVAILABLE NOW!
NEW AND IMPROVED!
http://my-plant.org/
http://www.iplantcollaborative.org