
Welsh_BioHDF_BOSC2009

Date post: 29-Nov-2014
Upload: bosc
Todd Smith (1), Eric Olson (1), Mark Welsh (1), Mike Folk (2), Christopher Mason (3). 1. Geospiza, Inc. 100 West Harrison St. North Tower #330, Seattle WA 98119. 2. The HDF Group 1901 S. First St., Suite C-2 Champaign, IL 61820. 3. Yale University, New Haven CT. BioHDF: Toward Scalable Bioinformatics Infrastructures TM
Transcript
Page 1: Welsh_BioHDF_BOSC2009

Confidential© Copyright 2009 Geospiza, Inc. All Rights Reserved. Page 1

Todd Smith (1), Eric Olson (1), Mark Welsh (1), Mike Folk (2), Christopher Mason (3)
1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle WA 98119
2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
3. Yale University, New Haven CT

BioHDF: Toward Scalable Bioinformatics Infrastructures

TM

Page 2: Welsh_BioHDF_BOSC2009

Overview

• Driver: bioinformatics challenges in Next Generation DNA Sequencing (NGS)

• BioHDF project and examples

• HDF5 (Hierarchical Data Format)

• How you can get involved

• Geospiza

Page 3: Welsh_BioHDF_BOSC2009

Next Generation DNA Sequencing

“Genome center in a mail room”

“Democratizing genomics”

“Changing the landscape”

“The beginning of the end for microarrays”

“Transforms today’s biology”

NGS is Powerful

Page 4: Welsh_BioHDF_BOSC2009

Example: Measuring Gene Expression

dbEST - Jan 20, 2009

Total Organisms          1,683
Total ESTs          59,498,205
Human                8,163,902
Mouse                4,850,605
Maize                2,018,337
Arabidopsis          1,526,124
Cattle               1,517,143
Pig                  1,476,771
Soybean              1,380,071
Zebra Fish           1,380,017
5 others            >1,000,000

Other Technologies

Experiment     Measurements
SAGE           1,000-100,000
Microarrays    48 K (transcripts), 0.8 M (probes), 1.4 M (probes)

Next Generation Sequencing

Instrument      Measurements
454 Titanium    1,000,000
Illumina GA     80,000,000
SOLiD V3        180,000,000

Greater sensitivity, higher dynamic ranges
+ Qualitative data: isoforms, alleles, …

Page 5: Welsh_BioHDF_BOSC2009

NGS is Daunting

“Prepare for the deluge”

“Byte-ing off more than you can chew”

“These sequencers are going to totally screw you”

Page 6: Welsh_BioHDF_BOSC2009

NGS Data are Analyzed in Three Phases

Primary Data Analysis - Images to bases
– Sequences + Quality values
– Run quality

Secondary Data Analysis
– Ref Seq + Aligner
– De novo assembly => Assembler => Contigs + Annotation
– Gene lists, Read Density, Variant list
– Sample, run quality
– Secondary Data Production

Tertiary Data Analysis
– One or more Data sets
– Differential expression, Methylation sites, Gene association, Genomic structure
– Experiment, science

Page 7: Welsh_BioHDF_BOSC2009

Secondary Analysis is Complex

Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing

Examples: MAQ - http://maq.sourceforge.net

Story repeats for BWA, Bowtie, TopHat, Mapreads, SOAP …

Page 8: Welsh_BioHDF_BOSC2009

Complexity Limits Scale and Productivity

• Data are unstructured - no consistent data model
• Solve problems using redundant data processing
– Incremental processing with data filtering at each stage
– New question? Then rerun alignment operations
• Each analysis step has a new output format
– One file for tables of alignments
– Another file with bases aligned to see mismatches
– Another file to ask statistical questions
– More files and images for visualization
– Files are linked by virtue of being in the same directory
– Perl hashes used to link the data fill up memory
– Redundant text-based formats fill up disk space

Page 9: Welsh_BioHDF_BOSC2009

Makes Getting Answers Difficult

Process
• 10 - 100 million reads
• Align to reference data
• Parse files, reformat data, create reports
• Review results, make decisions

Applications
• Small RNA
• Epi-Genomics
• Variation Analysis
• Gene Expression

Page 10: Welsh_BioHDF_BOSC2009

And Comparing Between Samples Hard

Process
• 10 - 100 million reads
• Align to reference data
• Repeat n times, with n samples
• Review results, make decisions

Explore Data Between Samples and drill into details
• Compare expression
• Alternative splicing

Page 11: Welsh_BioHDF_BOSC2009

What is Desired

1. Scalable systems with smoothly operating user interfaces
2. Summarize results and drill into details for single samples
3. Compare results between samples and within groups

Data must be structured, indexed, and annotated

Need a better way to work with NGS data and information

Page 12: Welsh_BioHDF_BOSC2009

BioHDF Project

• NIH STTR
– Geospiza, Seattle WA
– The HDF Group, Urbana/Champaign IL
• Goal: Move bioinformatics problems from organizing and structuring data to asking questions and visualizing data
– Develop data models and tools to work with NGS data in HDF (Hierarchical Data Format)
– Create HDF5 domain-specific extensions and library modules to support the unique aspects of NGS data => BioHDF
– Integrate BioHDF technologies into Geospiza products
• Deliver core BioHDF technologies to the community as open-source software

Page 13: Welsh_BioHDF_BOSC2009

Performance Advantages

Storage                                    Flat File World               HDF5 World
fasta file                                 609 MB - no random access     143 MB - compressed, random access
Bowtie Alignments = fasta + alignment      1033 MB - no random access    284 MB - index, 374 MB + index

Export Alignments     Alignments     Flat File World     HDF5 World
chr5                  ~1 M           1470 ms             1100 ms
100 Mbase region      450000         735 ms              540 ms
10 Mbase region       44000          735 ms              62 ms
1 Mbase region        4000           735 ms              19 ms
0.1 Mbase region      600            735 ms              15 ms

Development
• Flat File World: > Months - develop file formats, indices, access libraries, and debug to make efficient
• HDF5 World: Days, Weeks - write I/O code - parsers, loaders, and access methods

HDF improves storage, access, and development efficiency and does not add to computational overhead

Test Case: 9.3 million GA reads aligned to HG build 36.1 (4-core 3GHz Intel Xeon)
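
The region-export timings on this slide depend on random access: a flat file has to be read end to end for every query, while an indexed store can jump to the requested range. The sketch below illustrates that idea only. It uses a sorted position list and binary search from the Python standard library, not HDF5 or BioHDF, and the record layout, read ids, and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical alignment records: (start position on chr5, read id),
# kept sorted by start position.
alignments = [
    (1_200, "r1"),
    (15_000, "r2"),
    (15_400, "r3"),
    (98_000, "r4"),
    (250_000, "r5"),
]

# The sorted key column plays the role of the index.
starts = [pos for pos, _ in alignments]


def export_region(lo, hi):
    """Indexed export: binary-search the bounds, then slice."""
    i = bisect_left(starts, lo)
    j = bisect_right(starts, hi)
    return alignments[i:j]


def export_region_scan(lo, hi):
    """Flat-file style export: every query touches every record."""
    return [rec for rec in alignments if lo <= rec[0] <= hi]


region = export_region(10_000, 100_000)
```

Both functions return the same records. The indexed version does work proportional to the size of the answer plus a logarithmic search, which fits the pattern of smaller regions exporting faster when random access is available.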

Page 14: Welsh_BioHDF_BOSC2009

Value of Development Time

Focus on Science:
• Working with 100s of millions of reads for 100s of samples
• Measuring gene expression
• Identifying isoforms
• Observing sequence and structural variation
• Drilling into details from summaries

Instead of Software / IT:
• Developing and debugging low level infrastructures to support “novel” binary data formats
• Optimizing high-end hardware systems
• Tuning and redesigning RDBMS and other implementations

[Figure: Exons observed, Sample A vs Sample B; labels: exons, SpliceIndexID]

Page 15: Welsh_BioHDF_BOSC2009

Enables a Different Approach

Series of sources
• Sample 1 - HDF file 1: reads, ref data, alignments
• Sample 2 - HDF file 2: reads, ref data, alignments
• Sample (n) - HDF file (n): reads, ref data, alignments

Layers on top of the HDF files
• annotations
• queries
• formatters

Outputs
• .wig, .bed file - Integrate between systems
• Data by range - Integrate data across platforms
• Sample 1 (exon-crossing), Sample 2 (exon-crossing), Base comp. - Integrate samples / annotations
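
As a rough illustration of the layout on this slide (one container per sample, each holding reads, ref data, and alignments, with queries running across samples), the sketch below uses plain Python dictionaries as stand-ins for the HDF files. The read ids, junction names, and function names are all invented for the example. It is not the BioHDF data model.

```python
from collections import Counter

# Hypothetical stand-ins for per-sample HDF files. Each container holds
# its own reads, reference data, and alignments.
samples = {
    "Sample 1": {
        "reads": {"r1": "ACGT", "r2": "GGCA", "r3": "TTAG"},
        "ref data": "hypothetical-reference",
        # (read id, exon junction crossed); junction names are invented
        "alignments": [("r1", "E1-E2"), ("r2", "E1-E2"), ("r3", "E2-E3")],
    },
    "Sample 2": {
        "reads": {"r1": "ACGT", "r2": "CCAT"},
        "ref data": "hypothetical-reference",
        "alignments": [("r1", "E1-E2"), ("r2", "E1-E3")],
    },
}


def junction_counts(sample):
    """Count exon-crossing reads per junction within one sample."""
    return Counter(junction for _, junction in sample["alignments"])


def compare_samples(name_a, name_b):
    """Query across two containers: junction -> (count in A, count in B)."""
    a = junction_counts(samples[name_a])
    b = junction_counts(samples[name_b])
    return {j: (a[j], b[j]) for j in sorted(set(a) | set(b))}


comparison = compare_samples("Sample 1", "Sample 2")
```

Because every sample shares the same internal structure, the comparison is a query over existing data and does not require rerunning the alignment.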

Page 16: Welsh_BioHDF_BOSC2009

Explore Data with Different Questions

One alignment step, different questions

Sources
• miRBase
• rRNA/mtDNA
• Adapter/Primers
• Transcripts
• Genome
• Exon junctions

Sample 1 - HDF file 1: reads, alignments

Layers on top of the HDF file
• annotations
• queries
• formatters

Questions
• “Subtractive” question
• “Biology” question
• Examine match quality

Analyses
• Small RNA Analysis
• Splicing / Exon Analysis
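
The "one alignment step, different questions" idea can be sketched as follows: align reads once against every source, store which sources each read matched, and then answer a "subtractive" question and a "biology" question from that stored result. The source names are taken from the slide. The read ids, the hit table, and the function names are hypothetical, and plain Python sets stand in for stored alignments.

```python
# Hypothetical result of a single alignment step:
# read id -> set of reference sources the read matched.
hits = {
    "r1": {"miRBase"},
    "r2": {"rRNA/mtDNA"},
    "r3": {"Adapter/Primers"},
    "r4": {"Genome", "Transcripts"},
    "r5": {"miRBase", "Genome"},
    "r6": set(),
}

# Sources treated as unwanted for the subtractive question (an assumption).
UNWANTED = {"rRNA/mtDNA", "Adapter/Primers"}


def subtractive(read_hits, unwanted):
    """Subtractive question: drop reads that matched any unwanted source."""
    return {read: srcs for read, srcs in read_hits.items() if not (srcs & unwanted)}


def reads_matching(read_hits, source):
    """Biology question: which reads matched a given source?"""
    return sorted(read for read, srcs in read_hits.items() if source in srcs)


kept = subtractive(hits, UNWANTED)
small_rna_reads = reads_matching(kept, "miRBase")
```

Both questions read the same stored hit table, so asking a new question means writing a new query and leaves the alignment untouched.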

Page 17: Welsh_BioHDF_BOSC2009

Why HDF?

Arrays, rich data types, groups accommodate every kind of data

Store any combination of data objects in one container.

Performance: fast random access and efficient, scalable storage

Portability, data sharing: platform independent, self describing, common data models

Tools for viewing, analysis: HDFview, MATLAB, others

Widespread: used in academia, govt, industry - MATLAB, IDL, NASA-Earth Observing System

A platform for creating software to work with many kinds of scientific data

HDF5: 20 Years in Physical Sciences

Page 18: Welsh_BioHDF_BOSC2009

HDF Software

[Diagram: Tools, Applications, Libraries => HDF I/O Library => HDF File]

BioHDF
• Command Line Tools
• Library Extensions
• Modifications

Page 19: Welsh_BioHDF_BOSC2009

Benefits

• Separates the model, implementation, and view of the data
• Combines data from multiple samples
• Compression, chunking and other performance advantages
• Rapid prototyping environment
• Significant reduction in development time
• Approach NGS analysis differently

Only had to define the data model, write data import and export tools
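
To make the "compression, chunking" benefit concrete, the sketch below compresses a sequence in fixed-size chunks so that a sub-range can be read by decompressing only the chunks that overlap it. This is the general idea behind chunked, compressed storage. It uses Python's standard zlib module, not the HDF5 library, and the chunk size, data, and function names are arbitrary choices for illustration. HDF5's actual chunk layout and filter pipeline are more involved.

```python
import zlib

CHUNK = 1000  # bases per chunk; an arbitrary illustrative size


def write_chunks(seq, chunk=CHUNK):
    """Compress each fixed-size chunk independently."""
    return [zlib.compress(seq[i:i + chunk]) for i in range(0, len(seq), chunk)]


def read_range(chunks, start, stop, chunk=CHUNK):
    """Return seq[start:stop], decompressing only the overlapping chunks."""
    if start >= stop:
        return b""
    first = start // chunk
    last = (stop - 1) // chunk
    data = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = first * chunk
    return data[start - offset:stop - offset]


# Hypothetical, highly repetitive sequence (10,000 bases) used as test data.
seq = b"ACGT" * 2500
chunks = write_chunks(seq)
window = read_range(chunks, 1998, 2005)
```

The request for positions 1998 to 2005 touches only the second and third chunks, so the rest of the sequence stays compressed. This is how a file can be both compressed and randomly accessible.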

Page 20: Welsh_BioHDF_BOSC2009

Getting Involved

• BioHDF is being built on existing, available, and proven HDF5 technology
• Import, export tools will be open source
• Geospiza and The HDF Group are seeking collaborations
– Bioinformatics pipeline developers
– Algorithm developers
• Funding - NIH STTR HG003792
• Interested? Contact [email protected]

Page 21: Welsh_BioHDF_BOSC2009

Geospiza Products

GeneSifter™ Laboratory and Analysis Software Systems

From Samples to Results™

• For: Core, Service, Data Production Labs and Research Scientists
• Working with: Sanger Sequencing, Microarray, Next Generation Sequencing, and (or) other platforms
• GeneSifter supports: Laboratory operations, Data Management, Multiple Levels of Data Analysis
• Deployment: cost-effective hosted or on-site models.

BioHDF at ISMB:
• Monday June 29th - 6pm - Poster U57 - BioHDF Poster Session
• Wednesday July 1st - 1pm - Room T1 - BioHDF hack-a-thon / BOF

BioHDF at BOSC:
• Today - 5:30pm - here: BioHDF hack-a-thon / BOF

Getting BioHDF software: http://www.hdfgroup.org/projects/bioinformatics/

