Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy Meg Staton University of Tennessee, Knoxville [email protected] @hardwoodgenomics
Transcript
1. Building genomic data cyberinfrastructure with the online
database software Tripal and analysis workflows driven by Galaxy
Meg Staton University of Tennessee, Knoxville [email protected]
@hardwoodgenomics
2. Cyberinfrastructure: Need to connect people to computing
systems, data storage systems, advanced instruments, data
repositories, visualization environments, and sensors, all
distributed across the world.
3. Wilkinson et al. 2016
4. FAIR data principles. Findable: unique and persistent
identifiers. Accessible: open and free method for retrieval.
Interoperable: data are properly associated with other datasets.
Re-usable: rich metadata (attributes for who, what, when, where,
how).
5. The community (genome) database. Mission: collect data; curate
data; integrate data; provide access to data.
6. Difference from primary repositories: Why do we need
community databases? The community part: understand what is
important for your users; respond to questions; attend community
meetings; participate in grants; take data that doesn't have a home
anywhere else; manual curation.
7. Challenges: 2007, Clemson University. We were writing all the
database and web code from scratch. Starting to accumulate multiple
databases. Would like to focus on biological visualization, but were
instead cobbling together code modules to handle
usernames/passwords/permissions, front page news items, and a
calendar of meetings. There has to be an easier way! (Dorrie Main,
Stephen Ficklin)
8. A web framework for genetic and genomic data. Goals: simplify
construction of community genomics websites; encourage
high-quality, standards-based websites for data sharing and
collaboration; expand and reuse code. http://tripal.info
9. Content Management System: website construction toolkit; open
source; globally utilized and supported; manages users; module-based
design. Diagram labels: My Drupal Web Site, Calendar Module, Views,
XML Sitemap.
10. Diagram labels: My Drupal Web Site, Calendar Module, Views,
Organism, Sequence, Feature, Genotype, Drupal Database.
11. Why use Tripal? Goals: simpler construction; encourage
high-quality, standards-based websites for data sharing and
collaboration; expand and reuse code. Open source. Friendly
developers. Responsive mailing list.
12. Modules. Core Modules: Organisms; Contact; Controlled
Vocabularies; Stocks/Germplasm; Phenotypes; Genotypes; Features;
Phylogenies; Bulk Data Loader; Jobs Management. Extension Modules:
Transcriptomes; Functional annotation (BLAST, KEGG, InterPro,
GeneOntology); BLAST server; Breeding API; Genetic Maps; Libraries;
JBrowse.
14. What problem is being solved? Drupal internal search: easy
to set up and customize (for normal Drupal data types); slow to
index, slow to return results. Need a solution that will: access
Chado database; provide flexible and customizable indexing (index
only what is needed, not everything); scale to very large biological
data sets.
15. Elasticsearch Software: Distributed, open source search and
analytics engine. Massively distributed: can scale horizontally.
Multitenancy: a search cluster can manage many individual indices
that can be queried individually or as a group. Feature-rich:
autocomplete, fuzzy searching, "did you mean" suggestions. Open
source. Widely adopted.
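As a rough illustration of the fuzzy searching mentioned above, the sketch below builds an Elasticsearch query body using the `match` query with `fuzziness` set to `AUTO`. The field name `organism_name` and the helper `fuzzy_match_query` are hypothetical, chosen only for this example; the slides do not show which queries the Tripal module actually sends.

```python
import json


def fuzzy_match_query(field, text, size=10):
    """Build a search body whose match query tolerates small typos.

    `field` is whatever field was indexed; the name used below is a
    made-up example, not one taken from the slides.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    # AUTO lets Elasticsearch pick the allowed edit
                    # distance based on the length of each term.
                    "fuzziness": "AUTO",
                }
            }
        },
    }


# "rubar" is a deliberate misspelling of "rubra".
body = fuzzy_match_query("organism_name", "Quercus rubar")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a search request against an index.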
16. Elasticsearch Module Implementation: install Elasticsearch;
install Tripal Elasticsearch module; connect to the Elasticsearch
server; index Drupal nodes (site-wide search); index targeted Chado
or Drupal tables (customized search).
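To make "index targeted tables" and "index only what is needed" concrete, here is a minimal sketch of how rows from a table could be serialized for the Elasticsearch `_bulk` endpoint, keeping only selected columns. It is not the Tripal Elasticsearch module's implementation; the index name `chado_feature`, the sample rows, and the helper `bulk_index_payload` are assumptions made for illustration.

```python
import json


def bulk_index_payload(index_name, rows, id_field, fields):
    """Serialize rows as newline-delimited JSON for the _bulk endpoint.

    Each document is preceded by an action line naming the target index
    and document id. Only the columns listed in `fields` are indexed.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        document = {name: row[name] for name in fields}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, shaped like records from a feature table.
rows = [
    {"feature_id": 1, "uniquename": "transcript_0001", "residues": "ATGC" * 50},
    {"feature_id": 2, "uniquename": "transcript_0002", "residues": "GGCA" * 50},
]

# Index the names but skip the bulky sequence column.
payload = bulk_index_payload(
    "chado_feature", rows, id_field="feature_id", fields=["feature_id", "uniquename"]
)
print(payload)
```

Leaving large columns such as sequence residues out of `fields` keeps the index small, which is the point of indexing selectively rather than indexing everything.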
17. Index Chado table or materialized view
18. After indexing, build search block The block is a normal
Tripal block that can be placed on any or all pages. Blocks can
also be deleted from the admin back end.
19. Alter form fields
20. Final Custom Transcript Search
21. Final Custom Transcript Search - Results
22. Elasticsearch Module: faster indexing (if only due to
multicore usage); faster searching. Future development: multisite
installs on a single web server currently working; port to Tripal
3.0; compare to new internal searching.
24. What problem is being solved? Biological Samples RNA
Libraries Gene Expression Levels Need a better way to store and
visualize RNASeq differential gene expression experiments.
25. Expression Module Content Types. Biomaterial: similar to NCBI
BioSample and SRA; we currently do not differentiate between samples
and libraries. Expression Analysis: user specifies protocol, and
array design if a microarray was used; upload and display of gene
expression values.
26. Loading Data. Import biomaterial: BioSample data downloaded
from NCBI (XML); flat file format (based on NCBI biomaterial bulk
load form); can associate ontology terms through flat file. Create a
new expression analysis. Import expression values as text files
(assumed to be normalized, features must already exist): individual
file per sample; tab-delimited file with gene rows, sample
columns.
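The tab-delimited layout described above (one row per gene, one column per sample) can be read with a few lines of standard-library Python. This is a sketch for illustration only, not the module's loader; the sample names, gene names, and values below are invented, and the sketch assumes the first header cell labels the gene column.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix with gene rows and sample columns.

    Returns the sample names and a nested dict: gene -> sample -> value.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first header cell labels the gene column
    values = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(cells)}"
            )
        values[gene] = {s: float(c) for s, c in zip(samples, cells)}
    return samples, values


# Invented example data in the gene-by-sample layout.
example = "gene\tleaf_1\troot_1\ngene_A\t12.5\t0.0\ngene_B\t3.2\t48.1\n"
samples, values = read_expression_matrix(io.StringIO(example))
print(samples)
print(values["gene_B"]["root_1"])
```

Checking that every row has one value per sample catches truncated files early, before any values are attached to features that must already exist in the database.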
27. Visualization - Biomaterials
28. Visualization - Gene Expression
29. Visualization - Gene Expression: Hover over a library name for
a description. Some options to alter the graphic.
30. Expression Visualization Tool: Paste a list of genes in to
get a full heatmap across all libraries. Plotly allows you to zoom,
download, etc.
31. Future Work on Expression Module: Transfer the list of all
features from search results to expression visualization tool. Add
significance/p-values from differential gene expression test
results. Aid searching, e.g. limit results only to genes that
respond to cold stress. Interactive data filtering. Tie into
analysis engine. Tie into a publication module.
32. Galaxy Extension Module and Analysis Engine
https://github.com/tripal/tripal_galaxy
33. Galaxy is an open, web-based platform for accessible,
reproducible, and transparent computational biomedical research. No
need to use the command line to run NGS pipelines. Use a website to
upload data, build an analysis pipeline and run it.
34. Tripal Galaxy Module: Currently under development
(https://github.com/tripal/tripal_galaxy). Tripal sites can provide
Galaxy workflows to their users. Ensures reproducibility of data
analysis steps. Decreases curator effort/time. Provides the workflow
within the look-and-feel of the site. Can be installed by any Tripal
site once completed.
35. Galaxy Workflows: Testing on Galaxy instances at Washington
State University, University of Connecticut, and University of
Tennessee. DNA sequence data: re-sequencing alignment; variant
discovery (against the reference); variant discovery (between
samples); prediction of functional genetic variants.
https://github.com/statonlab/dibbs
36. Tripal Galaxy: Expected release in April 2016 for first
workflow on HWG. Galaxy backend will be running at WSU. Need to
continue work on: selecting and filtering data to input to a
workflow; monitoring workflow status; receiving meaningful error
messages if problems occur.
37. Going Mobile
38. Users produce messy data

    Day          Collector  Color       Diseased?
    11-14-16     Evan       Red         0
    11-14-16     Evan       Pink        0
    11-14-16     Evan       White       1
    Nov 14 2016  Becky      Fuschia     True
    Nov 14 2016  Becky      White       False
    16-11-14     Miriam     Vermillion  Yes
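The table above mixes three date styles and three ways of writing yes/no. The sketch below shows one way such values could be normalized after the fact. The list of accepted formats is an assumption based only on the six rows shown, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime

# Formats seen in the example table, tried in this order (an assumption).
DATE_FORMATS = ("%m-%d-%y", "%b %d %Y", "%y-%m-%d")
TRUE_VALUES = {"1", "true", "yes"}
FALSE_VALUES = {"0", "false", "no"}


def normalize_date(text):
    """Return an ISO 8601 date string for any of the known formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_flag(text):
    """Map the various yes/no spellings onto a boolean."""
    value = text.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized flag: {text!r}")


records = [
    ("11-14-16", "Evan", "Red", "0"),
    ("11-14-16", "Evan", "Pink", "0"),
    ("11-14-16", "Evan", "White", "1"),
    ("Nov 14 2016", "Becky", "Fuschia", "True"),
    ("Nov 14 2016", "Becky", "White", "False"),
    ("16-11-14", "Miriam", "Vermillion", "Yes"),
]

cleaned = [
    (normalize_date(day), collector, color, normalize_flag(diseased))
    for day, collector, color, diseased in records
]
for row in cleaned:
    print(row)
```

Guessing formats only works when the values are unambiguous: a date such as 11-12-10 could be read several ways, and free-text colors cannot be repaired automatically at all. That limitation is the argument for standardizing at collection time.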
39. Standardize Collection: Create forms for data collection.
Serve through a flexible mobile app. Currently prototyping as a
citizen science app.
40. Mobile App Timeline: Citizen Science app released by July
2017. Prototype of full phenotyping app by Jan 2018. Testing in
multiple systems.
41. Cyberinfrastructure: access data; find data; visualize
data; analyze data; collect data.
42. Abdullah Almsaeed (Research Associate), Bradford Condon
(Postdoc), Miriam Paya Milans (Postdoc), Ming Chen (Graduate
Student), Fang Lui (Graduate Student), Stephen Ficklin, Dorrie Main,
Jill Wegrzyn, Bert Abbott, Dana Nelson, Ellen Crocker