Building genomic data cyberinfrastructure with the online database software Tripal and analysis...

Date post: 28-Jan-2018
Author: mestato
  1. Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy. Meg Staton, University of Tennessee, Knoxville. @hardwoodgenomics
  2. Cyberinfrastructure: need to connect people to computing systems, data storage systems, advanced instruments, data repositories, visualization environments, and sensors, all distributed across the world.
  3. Wilkinson et al. 2016
  4. FAIR data principles. Findable: unique and persistent identifiers. Accessible: open and free method for retrieval. Interoperable: data are properly associated with other datasets. Re-usable: rich metadata (attributes for who, what, when, where, how).
  5. The community (genome) database. Mission: collect data, curate data, integrate data, provide access to data.
  6. Difference from primary repositories: why do we need community databases? The community part: understand what is important for your users; respond to questions; attend community meetings; participate in grants; take data that doesn't have a home anywhere else; manual curation.
  7. Challenges: 2007, Clemson University. We were writing all the database and web code from scratch and starting to accumulate multiple databases. We wanted to focus on biological visualization, but were instead cobbling together code modules to handle usernames/passwords/permissions, front page news items, and a calendar of meetings. There has to be an easier way! (Dorrie Main, Stephen Ficklin)
  8. A web framework for genetic and genomic data. Goals: simplify construction of community genomics websites; encourage high-quality, standards-based websites for data sharing and collaboration; expand and reuse code. http://tripal.info
  9. Content Management System: website construction toolkit; open source; globally utilized and supported; manages users; module-based design. Diagram labels: My Drupal Web Site; Calendar Module; Views; XML Sitemap.
  10. Diagram labels: My Drupal Web Site; Calendar Module; Views; Organism; Sequence; Feature; Genotype; Drupal Database.
  11. Why use Tripal? Goals: simpler construction; encourage high-quality, standards-based websites for data sharing and collaboration; expand and reuse code. Open source. Friendly developers. Responsive mailing list.
  12. Modules. Core modules: Organisms, Contact, Controlled Vocabularies, Stocks/Germplasm, Phenotypes, Genotypes, Features, Phylogenies, Bulk Data Loader, Jobs Management. Extension modules: Transcriptomes; Functional annotation (BLAST, KEGG, InterPro, GeneOntology); BLAST server; Breeding API; Genetic Maps; Libraries; JBrowse.
  13. Elasticsearch Extension Module: https://github.com/tripal/tripal_elasticsearch
  14. What problem is being solved? Drupal internal search is easy to set up and customize (for normal Drupal data types), but slow to index and slow to return results. Need a solution that will: access the Chado database; provide flexible and customizable indexing (index only what is needed, not everything); scale to very large biological data sets.
  15. Elasticsearch software: distributed, open source search and analytics engine. Massively distributed: can scale horizontally. Multitenancy: a search cluster can manage many individual indices that can be queried individually or as a group. Feature-rich: autocomplete, fuzzy searching, "did you mean" suggestions. Open source. Widely adopted.
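
The fuzzy searching mentioned above can be illustrated with a request body in Elasticsearch's query DSL. This is a minimal sketch, not the Tripal module's code: the field name `annotation`, the search text, and the helper `build_fuzzy_query` are hypothetical. Only the `match` query with `fuzziness` set to `AUTO` is standard Elasticsearch syntax.

```python
import json


def build_fuzzy_query(field, text, size=10):
    """Return a query DSL body for a typo-tolerant match on one field."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "match": {
                field: {
                    "query": text,
                    # AUTO lets Elasticsearch pick the allowed edit
                    # distance from the length of each search term.
                    "fuzziness": "AUTO",
                }
            }
        },
    }


# Hypothetical field name and search text, for illustration only.
body = build_fuzzy_query("annotation", "dehydrogenase")
print(json.dumps(body, indent=2))
```

The printed JSON is what a client would send to an index's `_search` endpoint.
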
  16. Elasticsearch module implementation: install Elasticsearch; install the Tripal Elasticsearch module; connect to the Elasticsearch server; index Drupal nodes (site-wide search); index targeted Chado or Drupal tables (customized search).
  17. Index Chado table or materialized view
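
Indexing selected columns of a Chado table might produce a request body for Elasticsearch's bulk API along these lines. This is a sketch under assumptions: the index name `chado_feature`, the helper `to_bulk_ndjson`, and the sample rows are hypothetical, and the Tripal Elasticsearch module may build its requests differently. The column names follow Chado's `feature` table.

```python
import json


def to_bulk_ndjson(index_name, rows, id_field, fields):
    """Build a bulk-API body: one action line plus one document line per row.

    Only the columns named in `fields` are indexed, in the spirit of
    "index only what is needed, not everything".
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        document = {name: row[name] for name in fields}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk API expects newline-delimited JSON with a trailing newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical rows, shaped like records from Chado's feature table.
rows = [
    {"feature_id": 101, "uniquename": "transcript_0001", "name": "ADH1", "residues": "ATGGCT"},
    {"feature_id": 102, "uniquename": "transcript_0002", "name": "CAB2", "residues": "ATGAAA"},
]
payload = to_bulk_ndjson("chado_feature", rows, "feature_id", ["uniquename", "name"])
print(payload, end="")
```

Leaving `residues` out of `fields` keeps the sequence itself out of the search index, which is one way to keep the index small for large data sets.
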
  18. After indexing, build a search block. The block is a normal Tripal block that can be placed on any or all pages. Blocks can also be deleted from the admin back end.
  19. Alter form fields
  20. Final Custom Transcript Search
  21. Final Custom Transcript Search: Results
  22. Elasticsearch module: faster indexing (if only due to multicore usage); faster searching. Future development: multisite installs on a single web server (currently working); port to Tripal 3.0; compare to new internal searching.
  23. Analysis Expression Extension Module: https://github.com/tripal/tripal_analysis_expression
  24. What problem is being solved? Biological samples, RNA libraries, gene expression levels. Need a better way to store and visualize RNA-Seq differential gene expression experiments.
  25. Expression module content types. Biomaterial: similar to NCBI BioSample and SRA; we currently do not differentiate between samples and libraries. Expression analysis: user specifies protocol and array design if a microarray was used; upload and display of gene expression values.
  26. Loading data. Import biomaterial: BioSample data downloaded from NCBI (XML), or a flat file format (based on the NCBI biomaterial bulk load form); ontology terms can be associated through the flat file. Create a new expression analysis. Import expression values as text files (assumed to be normalized; features must already exist): an individual file per sample, or a tab-delimited file with gene rows and sample columns.
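
The tab-delimited layout described above (gene rows, sample columns) could be read with a few lines of Python. This is an illustrative sketch, not the module's loader: the function name `parse_expression_matrix`, the gene and sample names, and the values are all made up for the example.

```python
import csv
import io


def parse_expression_matrix(handle):
    """Parse a tab-delimited matrix with gene rows and sample columns.

    Returns (samples, values), where values[gene][sample] is a float.
    The first header cell (above the gene column) is ignored.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    values = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(cells)}")
        values[gene] = {sample: float(cell) for sample, cell in zip(samples, cells)}
    return samples, values


# Hypothetical file contents, for illustration only.
text = "gene\tleaf_1\troot_1\ngene_A\t12.5\t0.0\ngene_B\t3.25\t40.0\n"
samples, values = parse_expression_matrix(io.StringIO(text))
print(samples)
print(values["gene_B"]["root_1"])
```

Rejecting rows whose length does not match the header catches truncated files before any values are stored.
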
  27. Visualization: Biomaterials
  28. Visualization: Gene Expression
  29. Visualization: Gene Expression. Hover over a library name for a description. Some options to alter the graphic.
  30. Expression visualization tool: paste in a list of genes to get a full heatmap across all libraries. Plotly allows you to zoom, download, etc.
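
A pasted gene list can be turned into a heatmap description along the following lines. This sketch only assembles a figure as plain data in the `data`/`layout` shape that Plotly uses for a `heatmap` trace. It does not import Plotly, and it is not the module's implementation. The function name `heatmap_figure` and all gene names, library names, and values are hypothetical.

```python
import json


def heatmap_figure(expression, genes, libraries):
    """Describe a genes-by-libraries heatmap as a Plotly-style figure dict.

    Genes that are not present in `expression` are skipped.
    """
    rows = [gene for gene in genes if gene in expression]
    z = [[expression[gene][library] for library in libraries] for gene in rows]
    return {
        "data": [{"type": "heatmap", "z": z, "x": libraries, "y": rows}],
        "layout": {"title": {"text": "Expression across libraries"}},
    }


# Hypothetical expression values, keyed by gene and then by library.
expression = {
    "gene_A": {"leaf_1": 12.5, "root_1": 0.0},
    "gene_B": {"leaf_1": 3.25, "root_1": 40.0},
}
figure = heatmap_figure(expression, ["gene_B", "gene_X", "gene_A"], ["leaf_1", "root_1"])
print(json.dumps(figure["data"][0]))
```

Rows follow the order of the pasted list, so users see genes in the order they supplied them.
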
  31. Future work on expression module: transfer the list of all features from search results to the expression visualization tool; add significance/p-values from differential gene expression test results; aid searching (limit results only to genes that respond to cold stress); interactive data filtering; tie into analysis engine; tie into a publication module.
  32. Galaxy Extension Module and Analysis Engine: https://github.com/tripal/tripal_galaxy
  33. Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. No need to use the command line to run NGS pipelines. Use a website to upload data, build an analysis pipeline and run it.
  34. Tripal Galaxy module: currently under development (https://github.com/tripal/tripal_galaxy). Tripal sites can provide Galaxy workflows to their users. Ensures reproducibility of data analysis steps; decreases curator effort/time; provides the workflow within the look-and-feel of the site; can be installed by any Tripal site once completed.
  35. Galaxy workflows: testing on Galaxy instances at Washington State University, University of Connecticut, and University of Tennessee. DNA sequence data: re-sequencing alignment; variant discovery (against the reference); variant discovery (between samples); prediction of functional genetic variants. https://github.com/statonlab/dibbs
  36. Tripal Galaxy: expected release in April 2016 for first workflow on HWG. Galaxy backend will be running at WSU. Need to continue work on: selecting and filtering data to input to a workflow; monitoring workflow status; receiving meaningful error messages if problems occur.
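
Monitoring workflow status usually comes down to polling until a terminal state is reached. The sketch below shows that pattern in isolation. It makes no real Galaxy API calls: `fetch_state` is a hypothetical callable standing in for whatever request a site uses, and the state names (`queued`, `running`, `ok`, `error`) are assumptions for the example.

```python
import time

# Assumed terminal states; a real site would use the states its backend reports.
TERMINAL_STATES = {"ok", "error"}


def wait_for_workflow(fetch_state, poll_seconds=5.0, max_polls=100, sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state, then return it.

    Raises TimeoutError with the last seen state if max_polls is exhausted.
    """
    state = None
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"workflow still in state {state!r} after {max_polls} polls")


# Canned states replace a real backend; sleeping is disabled for the demo.
canned = iter(["queued", "running", "running", "ok"])
final_state = wait_for_workflow(lambda: next(canned), sleep=lambda seconds: None)
print(final_state)
```

Including the last observed state in the timeout message is a small step toward the meaningful error messages the slide asks for.
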
  37. Going Mobile
  38. Users produce messy data:

          Day           Collector   Color        Diseased?
          11-14-16      Evan        Red          0
          11-14-16      Evan        Pink         0
          11-14-16      Evan        White        1
          Nov 14 2016   Becky       Fuschia      True
          Nov 14 2016   Becky       White        False
          16-11-14      Miriam      Vermillion   Yes
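
Cleaning up values like those above after the fact takes format guessing, as this sketch shows. It is an illustration only: the function names, the list of accepted date formats, and the accepted yes/no spellings are assumptions chosen to cover the rows on the slide.

```python
from datetime import datetime

# Formats are tried in order; the order decides how ambiguous dates are read.
DATE_FORMATS = ("%m-%d-%y", "%b %d %Y", "%y-%m-%d")
TRUTHY = {"1", "true", "yes"}
FALSY = {"0", "false", "no"}


def normalize_date(text):
    """Return an ISO 8601 date string for any of the assumed input formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_flag(text):
    """Map the assumed yes/no spellings onto a boolean."""
    key = text.strip().lower()
    if key in TRUTHY:
        return True
    if key in FALSY:
        return False
    raise ValueError(f"unrecognized flag: {text!r}")


for day, diseased in [("11-14-16", "0"), ("Nov 14 2016", "True"), ("16-11-14", "Yes")]:
    print(normalize_date(day), normalize_flag(diseased))
```

A value such as `11-12-10` matches more than one of these formats, and the code silently takes the first match. That ambiguity cannot be resolved after collection, which is the case for standardizing input at collection time.
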
  39. Standardize collection: create forms for data collection; serve through a flexible mobile app; currently prototyping as a citizen science app.
  40. Mobile app timeline: Citizen Science app released by July 2017; prototype of full phenotyping app by Jan 2018; testing in multiple systems.
  41. Cyberinfrastructure: access data, find data, visualize data, analyze data, collect data.
  42. Abdullah Almsaeed (Research Associate), Bradford Condon (Postdoc), Miriam Paya Milans (Postdoc), Ming Chen (Graduate Student), Fang Lui (Graduate Student). Stephen Ficklin, Dorrie Main, Jill Wegrzyn, Bert Abbott, Dana Nelson, Ellen Crocker.