
Building genomic data cyberinfrastructure with the online database software Tripal and analysis...

Date post: 28-Jan-2018
Upload: mestato
Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy. Meg Staton, University of Tennessee, Knoxville. @hardwoodgenomics
Transcript
Page 1: Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy

Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy

Meg Staton

University of Tennessee, Knoxville

@hardwoodgenomics

Page 2

Cyberinfrastructure

Need to connect people to

• Computing systems

• Data storage systems

• Advanced instruments

• Data repositories

• Visualization environments

• Sensors

All distributed across the world

Page 3

Wilkinson et al. 2016

Page 4

FAIR data principles

Findable

• Unique and persistent identifiers

Accessible

• Open and free method for retrieval

Interoperable

• Data are properly associated with other datasets

Re-usable

• Rich metadata (attributes for who, what, when, where, how)

Page 5

The community (genome) database

Mission

• Collect data

• Curate data

• Integrate data

• Provide access to data

Page 6

Difference from primary repositories

Why do we need community databases?

The “Community” Part

• Understand what is important for your users

• Respond to questions

• Attend community meetings

• Participate in grants

• Take data that doesn’t have a home anywhere else

• Manual curation

Page 7

Challenges

• 2007, Clemson University

• We were writing all the database and web code from scratch

• Starting to accumulate multiple databases

• Would like to focus on biological visualization; instead we were cobbling together code modules to handle

• Usernames/passwords/permissions

• Front page news items

• Calendar of meetings

• There has to be an easier way!

Dorrie Main

Stephen Ficklin

Page 8

A web framework for genetic and genomic data

Goals:

• Simplify construction of community genomics websites

• Encourage high-quality, standards-based websites for data sharing and collaboration

• Expand and reuse code

http://tripal.info

Page 9

Content Management System

Website construction toolkit

Open source

Globally utilized and supported

Manages users

Module-based design

My Drupal Web Site

Calendar Module

Views

XML Sitemap

Page 10

My Drupal Web Site

Calendar Module

Views

Organism

Sequence Feature

Genotype

Drupal Database

Page 11

Why use Tripal?

Goals:

• Simpler construction

• Encourage high-quality, standards-based websites for data sharing and collaboration

• Expand and reuse code

Open source

Friendly developers

Responsive mailing list

Page 12
Page 13
Page 14
Page 15
Page 16
Page 17
Page 18

Modules

Core Modules

• Organisms

• Contact

• Controlled Vocabularies

• Stocks/Germplasm

• Phenotypes

• Genotypes

• Features

• Phylogenies

Bulk Data Loader

Jobs Management

Extension Modules

• Transcriptomes

• Functional annotation:

• BLAST

• KEGG

• InterPro

• Gene Ontology

• BLAST server

• Breeding API

• Genetic Maps

• Libraries

• JBrowse

Page 19

Elasticsearch Extension Module

https://github.com/tripal/tripal_elasticsearch

Page 20

What problem is being solved?

• Drupal internal search

• Easy to set up and customize (for normal Drupal data types)

• Slow to index, slow to return results

• Need a solution that will:

• Access Chado database

• Provide flexible and customizable indexing – index only what is needed, not everything

• Scale to very large biological data sets
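
A minimal sketch of "index only what is needed", assuming rows have already been read from a Chado table such as `feature`. It serializes only the chosen columns into the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts. The function name, the index name `transcripts`, and the sample rows are hypothetical; this is not the Tripal Elasticsearch module's own code.

```python
import json


def build_bulk_payload(rows, index_name, fields, id_field):
    """Serialize selected columns of each row as an Elasticsearch _bulk request body."""
    lines = []
    for row in rows:
        # Action line: which index and document id the next line belongs to.
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        # Source line: only the columns we actually want searchable.
        document = {name: row[name] for name in fields}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    if not lines:
        return ""
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, shaped like a few columns of the Chado feature table.
rows = [
    {"feature_id": 1, "uniquename": "gene0001", "name": "hypothetical protein", "residues": "ATGC"},
    {"feature_id": 2, "uniquename": "gene0002", "name": "heat shock protein", "residues": "GGCA"},
]
payload = build_bulk_payload(rows, "transcripts", ["uniquename", "name"], "feature_id")
print(payload)
```

Leaving `residues` out of `fields` keeps the sequence itself out of the index, which is the kind of selectivity the slide asks for.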

Page 21

Elasticsearch Software

Distributed, open source search and analytics engine

• Massively distributed – can scale horizontally

• Multitenancy – a search cluster can manage many individual indices that can be queried individually or as a group

• Feature-rich - autocomplete, fuzzy searching, “did you mean” suggestions

• Open source

• Widely adopted
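
As an illustration of the fuzzy searching mentioned above, the sketch below builds a search request body using the Elasticsearch query DSL `match` query with `fuzziness` set to `AUTO`. The helper name and the field name `name` are assumptions for the example; the body would be sent to an index's `_search` endpoint.

```python
import json


def build_fuzzy_query(field, text, max_hits=10):
    """Build a search body that tolerates small typos in the search text."""
    return {
        "size": max_hits,
        "query": {
            "match": {
                field: {"query": text, "fuzziness": "AUTO"}
            }
        },
    }


# A misspelled search term that fuzzy matching is meant to tolerate.
body = build_fuzzy_query("name", "heat shok protein")
print(json.dumps(body, indent=2))
```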

Page 22

Elasticsearch Module Implementation

Install Elasticsearch

Install Tripal Elasticsearch Module

Connect to the Elasticsearch Server

Index Drupal nodes

Site-wide search

Index targeted Chado or Drupal tables

Customized search

Page 23

Index Chado table or materialized view

Page 24

After indexing, build search block

The block is a normal Tripal block that can be placed on any or all pages.

Blocks can also be deleted from the admin back end.

Page 25

Alter form fields

Page 26

Final Custom Transcript Search

Page 27

Final Custom Transcript Search - Results

Page 28

Elasticsearch Module

Faster indexing (if only due to multicore usage)

Faster searching

Future Development

• Multisite installs on a single web server – currently working

• Port to Tripal 3.0

• Compare to new internal searching

Page 29

Analysis Expression Extension Module

https://github.com/tripal/tripal_analysis_expression

Page 30

What problem is being solved?

Biological Samples

RNA Libraries

Gene Expression Levels

Need a better way to store and visualize RNA-Seq differential gene expression experiments.

Page 31

Expression Module – Content Types

• Biomaterial

• Similar to NCBI BioSample and SRA

• We currently do not differentiate between samples and libraries

• Expression Analysis

• User specifies protocol and array design if a microarray was used

• Upload and display of gene expression values

Page 32

Loading Data

• Import biomaterial

• BioSample data downloaded from NCBI (XML)

• Flat file format (based on NCBI biomaterial bulk load form)

• Can associate ontology terms through flat file

• Create a new expression analysis

• Import expression values as text files

• (assumed to be normalized, features must already exist)

• Individual file per sample

• Tab delimited file with gene rows, sample columns
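
A rough sketch of reading the second layout above (a tab-delimited file with gene rows and sample columns). It assumes the first row is a header whose first cell labels the gene column and whose remaining cells are sample names; that header layout, the function name, and the example values are assumptions, not the module's documented loader format.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix into {gene: {sample: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first cell is the gene column label
    expression = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, found {len(values)}"
            )
        expression[gene] = {s: float(v) for s, v in zip(samples, values)}
    return samples, expression


# Hypothetical file contents: two genes measured in two samples.
example = io.StringIO(
    "gene\tleaf\troot\n"
    "gene0001\t12.5\t0.0\n"
    "gene0002\t3.1\t48.2\n"
)
samples, expression = read_expression_matrix(example)
print(samples, expression["gene0002"]["root"])
```

Rejecting rows with the wrong number of values catches truncated files before anything is loaded.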

Page 33

Visualization - Biomaterials

Page 34

Visualization – Gene Expression

Page 35

Visualization – Gene Expression

Hover over a library name for a description

Some options to alter the graphic

Page 36

Expression Visualization Tool

• Paste in a list of genes to get a full heatmap across all libraries.

• Plotly allows you to zoom, download, etc.
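
One way the pasted gene list could be turned into heatmap input is sketched below. It assembles a dictionary in the shape of a Plotly `heatmap` trace (`x`, `y`, `z`), with genes as rows and libraries as columns. The function name and data are hypothetical, and genes missing from the data are simply skipped, which is an assumption about the desired behavior.

```python
def heatmap_trace(expression, genes, libraries):
    """Build a Plotly-style heatmap trace for the requested genes."""
    # Keep the user's ordering, but drop genes with no expression data.
    found = [gene for gene in genes if gene in expression]
    z = [[expression[gene][library] for library in libraries] for gene in found]
    return {"type": "heatmap", "x": list(libraries), "y": found, "z": z}


# Hypothetical expression values keyed by gene, then by library.
expression = {
    "gene0001": {"leaf": 12.5, "root": 0.0},
    "gene0002": {"leaf": 3.1, "root": 48.2},
}
trace = heatmap_trace(expression, ["gene0002", "missing_gene", "gene0001"], ["leaf", "root"])
print(trace)
```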

Page 37

Future Work on Expression Module

• Transfer the list of all features from search results to expression visualization tool

• Add significance/p-values from differential gene expression test results

• Aid searching – limit results only to genes that respond to cold stress

• Interactive data filtering

• Tie into analysis engine

• Tie into a publication module

Page 38

Galaxy Extension Module and Analysis Engine

https://github.com/tripal/tripal_galaxy

Page 39

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

No need to use the command line to run NGS pipelines.

Use a website to upload data, build an analysis pipeline and run it.

Page 40
Page 41

Tripal Galaxy Module

• Currently under development

• https://github.com/tripal/tripal_galaxy

• Tripal sites can provide Galaxy workflows to their users

• Ensures reproducibility of data analysis steps

• Decreases curator effort/time

• Provides the workflow within the look-and-feel of the site

• Can be installed by any Tripal site once completed.

Page 42
Page 43

Galaxy Workflows

Testing on Galaxy instances at Washington State University, University of Connecticut, and University of Tennessee

DNA Sequence Data

• Re-sequencing alignment

• Variant discovery (against the reference)

• Variant discovery (between samples)

• Prediction of functional genetic variants

https://github.com/statonlab/dibbs

Page 44

Tripal Galaxy

• Expected release in April 2016 for first workflow on HWG

• Galaxy backend will be running at WSU

• Need to continue work on

• Selecting and filtering data to input to a workflow

• Monitoring workflow status

• Receiving meaningful error messages if problems occur
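
A generic sketch of the status monitoring and error reporting listed above. It does not call Galaxy: `fetch_state` is a hypothetical callable standing in for whatever request returns the workflow's current state, and the state names (`queued`, `running`, `ok`, `error`, `cancelled`) are assumptions for the example.

```python
import time

# Hypothetical state names; a real workflow engine defines its own.
SUCCESS_STATES = {"ok"}
FAILURE_STATES = {"error", "cancelled"}


def wait_for_workflow(fetch_state, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll fetch_state() until the workflow succeeds, fails, or max_polls is reached."""
    for attempt in range(1, max_polls + 1):
        state = fetch_state()
        if state in SUCCESS_STATES:
            return state
        if state in FAILURE_STATES:
            # Report the state and the poll number so the user sees what went wrong.
            raise RuntimeError(f"Workflow stopped in state '{state}' on poll {attempt}")
        sleep(poll_seconds)
    raise TimeoutError(f"Workflow still unfinished after {max_polls} polls")


# Usage with a canned sequence of states standing in for a remote server.
states = iter(["queued", "running", "ok"])
result = wait_for_workflow(lambda: next(states), sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter lets the loop be exercised without real delays.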

Page 45

Going Mobile

Page 46

Users produce messy data

Day          Collector  Color       Diseased?
11-14-16     Evan       Red         0
11-14-16     Evan       Pink        0
11-14-16     Evan       White       1
Nov 14 2016  Becky      Fuschia     True
Nov 14 2016  Becky      White       False
16-11-14     Miriam     Vermillion  Yes
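
To make the problem concrete, here is a small sketch that cleans up the three date styles and three yes/no styles in the table above. The list of accepted formats and the helper names are assumptions. The order of `DATE_FORMATS` also matters, because a value like `11-12-10` is genuinely ambiguous, which is part of the argument for standardizing collection at the source.

```python
from datetime import datetime

# Formats are tried in order; the first one that parses wins.
DATE_FORMATS = ("%m-%d-%y", "%b %d %Y", "%y-%m-%d")
TRUE_VALUES = {"1", "true", "yes"}
FALSE_VALUES = {"0", "false", "no"}


def normalize_date(text):
    """Return an ISO 8601 date string for any of the known input styles."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")


def normalize_flag(text):
    """Map 0/1, True/False, and Yes/No spellings onto a real boolean."""
    value = text.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Unrecognized yes/no value: {text!r}")


records = [
    ("11-14-16", "Evan", "White", "1"),
    ("Nov 14 2016", "Becky", "Fuschia", "True"),
    ("16-11-14", "Miriam", "Vermillion", "Yes"),
]
cleaned = [
    (normalize_date(day), collector, color, normalize_flag(diseased))
    for day, collector, color, diseased in records
]
print(cleaned)
```

Free-text values such as the color names are left untouched here; a controlled vocabulary would be needed to reconcile them.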

Page 47

Standardize Collection

• Create forms for data collection

• Serve through a flexible mobile app

• Currently prototyping as a citizen science app

Page 48
Page 49

Mobile App

• Timeline

• Citizen Science app released by July 2017

• Prototype of full phenotyping app by Jan 2018

• Testing in multiple systems

Page 50

Cyberinfrastructure

Access data

Find data

Visualize data

Analyze data

Collect data

Page 51

Abdullah Almsaeed, Research Associate

Bradford Condon, Postdoc

Miriam Paya Milans, Postdoc

Ming Chen, Graduate Student

Fang Lui, Graduate Student

Stephen Ficklin

Dorrie Main

Jill Wegrzyn

Bert Abbott

Dana Nelson

Ellen Crocker

Page 52