Welsh_BioHDF_BOSC2009

Confidential. © Copyright 2009 Geospiza, Inc. All Rights Reserved.

Todd Smith (1), Eric Olson (1), Mark Welsh (1), Mike Folk (2), Christopher Mason (3)
1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle WA 98119
2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
3. Yale University, New Haven CT

BioHDF: Toward Scalable Bioinformatics Infrastructures



Overview

• Driver: bioinformatics challenges in Next Generation DNA Sequencing (NGS)

• BioHDF project and examples

• HDF5 (Hierarchical Data Format)

• How you can get involved

• Geospiza


Next Generation DNA Sequencing

“Genome center in a mail room”

“Democratizing genomics”

“Changing the landscape”

“The beginning of the end for microarrays”

“Transforms today’s biology”

NGS is Powerful


Example: Measuring Gene Expression

dbEST - Jan 20, 2009

Total Organisms    1,683
Total ESTs         59,498,205
Human              8,163,902
Mouse              4,850,605
Maize              2,018,337
Arabidopsis        1,526,124
Cattle             1,517,143
Pig                1,476,771
Soybean            1,380,071
Zebra Fish         1,380,017
5 others           >1,000,000

Other Technologies

Experiment     Measurements
SAGE           1,000-100,000
Microarrays    48 K (transcripts)
               0.8 M (probes)
               1.4 M (probes)

Next Generation Sequencing

Instrument      Measurements
454 Titanium    1,000,000
Illumina GA     80,000,000
SOLiD V3        180,000,000

Greater sensitivity, higher dynamic ranges
+ Qualitative data: isoforms, alleles, …


NGS is Daunting

“Prepare for the deluge”

“Byte-ing off more than you can chew”

“These sequencers are going to totally screw you”


NGS Data are Analyzed in Three Phases

• Primary Data Analysis - Images to bases
– Sequences + Quality values
– Run quality

• Secondary Data Analysis
– Secondary Data Production: Ref Seq + Aligner; De novo assembly => Assembler => Contigs + Annotation
– Gene lists, Read Density, Variant list
– Sample, run quality

• Tertiary Data Analysis
– One or more Data sets
– Differential expression, Methylation sites, Gene association, Genomic structure
– Experiment, science


Secondary Analysis is Complex

Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing

Examples: MAQ - http://maq.sourceforge.net

Story repeats for BWA, Bowtie, TopHat, Mapreads, SOAP …


Complexity Limits Scale and Productivity

• Data are unstructured - no consistent data model

• Solve problems using redundant data processing
– Incremental processing with data filtering at each stage
– New question? Then rerun alignment operations

• Each analysis step has a new output format
– One file for tables of alignments
– Another file with bases aligned to see mismatches
– Another file to ask statistical questions
– More files and images for visualization
– Files are linked by virtue of being in the same directory
– Perl hashes used to link the data fill up memory
– Redundant text-based formats fill up disk space


Makes Getting Answers Difficult

10 - 100 million reads

Align to reference data

Review results, make decisions

Process Applications

Parse files, reformat data, create reports

Small RNA

Epi-Genomics

Variation Analysis

Gene Expression


And Comparing Between Samples Hard

10 - 100 million reads

Align to reference data

Repeat n times, with n samples

Review results, make decisions

Process

Explore Data Between Samples and drill into details

Alternative splicing

Compare expression


What is Desired

1. Scalable systems with smoothly operating user interfaces
2. Summarize results and drill into details for single samples
3. Compare results between samples and within groups

Data must be structured, indexed, and annotated

Need a better way to work with NGS data and information
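One way to read "structured, indexed" is that alignments are kept sorted by reference position, so a region query becomes a binary search rather than a scan of a text file. The sketch below only illustrates that idea in plain Python. The `Alignment` record, its fields, and the `build_index` / `query_region` helpers are hypothetical names, not part of BioHDF or HDF5.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    # Hypothetical record layout, chosen for illustration only.
    read_id: str
    chrom: str
    start: int        # 0-based position of the first aligned base
    mismatches: int


def build_index(alignments):
    """Group alignments by chromosome and sort each group by start position."""
    by_chrom = {}
    for aln in alignments:
        by_chrom.setdefault(aln.chrom, []).append(aln)
    index = {}
    for chrom, rows in by_chrom.items():
        rows.sort(key=lambda a: a.start)
        # Keep the sorted start positions next to the rows for binary search.
        index[chrom] = ([a.start for a in rows], rows)
    return index


def query_region(index, chrom, lo, hi):
    """Return alignments whose start lies in [lo, hi), found by binary search."""
    starts, rows = index.get(chrom, ([], []))
    return rows[bisect_left(starts, lo):bisect_left(starts, hi)]


# Made-up demo data.
demo = [
    Alignment("r1", "chr5", 1200, 0),
    Alignment("r2", "chr5", 150, 1),
    Alignment("r3", "chr1", 900, 0),
    Alignment("r4", "chr5", 980, 2),
]
index = build_index(demo)
hits = query_region(index, "chr5", 900, 1300)
print([a.read_id for a in hits])
```

The index is built once. Each later question (a different region, a different sample) reuses it instead of re-parsing alignment output.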


BioHDF Project

• NIH STTR
– Geospiza, Seattle WA
– The HDF Group, Urbana/Champaign IL

• Goal: Move bioinformatics problems from organizing and structuring data to asking questions and visualizing data
– Develop data models and tools to work with NGS data in HDF (Hierarchical Data Format)
– Create HDF5 domain-specific extensions and library modules to support the unique aspects of NGS data => BioHDF
– Integrate BioHDF technologies into Geospiza products

• Deliver core BioHDF technologies to the community as open-source software


Performance Advantages

Test Case: 9.3 million GA reads aligned to HG build 36.1 (4-core 3GHz Intel Xeon)

Storage                                  HDF5 World                            Flat File World
fasta file                               143 MB - compressed, random access    609 MB - no random access
Bowtie Alignments = fasta + alignment    284 MB - index, 374 MB + index        1033 MB - no random access

Export Alignments                        HDF5 World    Flat File World
chr5 (~1 M alignments)                   1100 ms       1470 ms
100 Mbase region (450000 alignments)     540 ms        735 ms
10 Mbase region (44000 alignments)       62 ms         735 ms
1 Mbase region (4000 alignments)         19 ms         735 ms
0.1 Mbase region (600 alignments)        15 ms         735 ms

Development
HDF5 World: Days, Weeks - write I/O code - parsers, loaders, and access methods
Flat File World: > Months - develop file formats, indices, access libraries, and debug to make efficient

HDF improves storage, access, and development efficiency
And does not add to computational overhead
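The ratios implied by these figures can be worked out directly. The short calculation below uses only numbers quoted on this slide. It assumes that the smaller storage figures and the region-dependent export times belong to the HDF5 column and that the constant 735 ms times belong to the flat-file column. The variable names are illustrative.

```python
# Figures as quoted on the slide; the HDF5 vs. flat-file assignment is an assumption.
fasta_mb = {"flat": 609, "hdf5": 143}
bowtie_mb = {"flat": 1033, "hdf5_with_index": 374}

# Export times in ms, ordered from whole chr5 down to a 0.1 Mbase region.
regions = ["chr5", "100 Mbase", "10 Mbase", "1 Mbase", "0.1 Mbase"]
hdf5_ms = [1100, 540, 62, 19, 15]
flat_ms = [1470, 735, 735, 735, 735]

# Storage ratio: flat-file size divided by HDF5 size.
fasta_ratio = round(fasta_mb["flat"] / fasta_mb["hdf5"], 1)
bowtie_ratio = round(bowtie_mb["flat"] / bowtie_mb["hdf5_with_index"], 1)

# Export speedup: flat-file time divided by HDF5 time, per region.
speedups = [round(f / h, 1) for f, h in zip(flat_ms, hdf5_ms)]

print(f"fasta storage ratio:  {fasta_ratio}x")
print(f"bowtie storage ratio: {bowtie_ratio}x")
for region, s in zip(regions, speedups):
    print(f"{region:>10}: {s}x faster export")
```

Under that reading, the advantage grows as the region shrinks: about 1.3x for all of chr5 and about 49x for a 0.1 Mbase region, which is the pattern expected when one side supports random access and the other must scan.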


Value of Development Time

Focus on Science:
• Working with 100s of millions of reads for 100s of samples
• Measuring gene expression
• Identifying isoforms
• Observing sequence and structural variation
• Drilling into details from summaries

Instead of Software / IT:
• Developing and debugging low level infrastructures to support “novel” binary data formats
• Optimizing high-end hardware systems
• Tuning and redesigning RDBMS and other implementations

[Figure: exons observed, Sample A vs Sample B (labels: exons, SpliceIndex, ID)]


Enables a Different Approach

• Series of sources
– HDF file 1, Sample 1: reads, ref data, alignments
– HDF file 2, Sample 2: reads, ref data, alignments
– HDF file (n), Sample (n): reads, ref data, alignments

• annotations, queries, formatters

• Data by range, Basecomp., .wig, .bed file

• Sample 1 (exon-crossing), Sample 2 (exon-crossing)

• Integrate between systems
• Integrate data across platforms
• Integrate samples / annotations
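A "formatter" here can be thought of as a small function that turns the result of a range query into an exchange format such as BED. The sketch below is an assumption about what such a formatter might look like, not BioHDF code. The record tuple and `to_bed_lines` are hypothetical. Only the six-column BED layout (chrom, 0-based start, exclusive end, name, score, strand) is standard.

```python
def to_bed_lines(alignments, chrom, lo, hi):
    """Format alignments overlapping [lo, hi) on chrom as tab-separated BED6 lines.

    Each alignment is a hypothetical (chrom, start, end, name, strand) tuple
    with a 0-based start and an exclusive end, matching BED coordinates.
    """
    lines = []
    for c, start, end, name, strand in alignments:
        if c == chrom and start < hi and end > lo:
            # Score is fixed at 0 because the demo records carry no score.
            lines.append("\t".join([c, str(start), str(end), name, "0", strand]))
    return lines


# Made-up demo records for one sample.
sample1 = [
    ("chr5", 100, 136, "s1_r1", "+"),
    ("chr5", 5000, 5036, "s1_r2", "-"),
    ("chr2", 120, 156, "s1_r3", "+"),
]
bed = to_bed_lines(sample1, "chr5", 0, 1000)
print("\n".join(bed))
```

Because the formatter only sees query results, the same stored alignments can feed several output formats without rerunning the aligner.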


Explore Data with Different Questions

One alignment step, different questions

• Sources: Adapter/Primers, rRNA/mtDNA, miRBase, Transcripts, Genome, Exon junctions

• HDF file 1, Sample 1: reads, alignments

• annotations, queries, formatters

• “Subtractive” question
• “Biology” question
• Examine match quality

• Small RNA Analysis
• Splicing / Exon Analysis
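The "one alignment step, different questions" idea can be sketched as queries over a single table of hits. The example below is a hedged illustration in plain Python, not BioHDF code. The hit tuples, the target names, and the `subtractive` / `count_by_target` helpers are hypothetical, and the source labels are taken from the slide.

```python
from collections import Counter

# Sources treated as contaminants for the "subtractive" question (an assumption).
SUBTRACT = {"Adapter/Primers", "rRNA/mtDNA"}


def subtractive(hits):
    """Drop every read that has at least one hit to a contaminant source."""
    contaminated = {read for read, source, _ in hits if source in SUBTRACT}
    return [h for h in hits if h[0] not in contaminated]


def count_by_target(hits, source):
    """'Biology' question: count hits per target within one source."""
    return Counter(target for _, src, target in hits if src == source)


# Made-up (read_id, source, target) hits from a single alignment step.
hits = [
    ("r1", "Adapter/Primers", "adapter_1"),
    ("r2", "miRBase", "mir-21"),
    ("r3", "miRBase", "mir-21"),
    ("r4", "rRNA/mtDNA", "18S"),
    ("r4", "miRBase", "let-7a"),
    ("r5", "miRBase", "let-7a"),
    ("r6", "Genome", "chr5"),
]
kept = subtractive(hits)
mirna_counts = count_by_target(kept, "miRBase")
print(sorted({h[0] for h in kept}))
print(dict(mirna_counts))
```

Both questions read the same stored hits. Changing the contaminant set or the source of interest is a new query, not a new alignment run.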


Why HDF?

Arrays, rich data types, groups accommodate every kind of data

Store any combination of data objects in one container.

Performance: fast random access and efficient, scalable storage

Portability, data sharing: platform independent, self describing, common data models

Tools for viewing, analysis: HDFView, MATLAB, others

Widespread: used in academia, govt, industry - MATLAB, IDL, NASA-Earth Observing System

A platform for creating software to work with many kinds of scientific data

HDF5: 20 Years in Physical Sciences


HDF Software

Tools, Applications, Libraries

HDF I/O Library

HDF File

Command Line Tools

Library Extensions

Modifications

BioHDF


Benefits

• Separates the model, implementation, and view of the data

• Combines data from multiple samples

• Compression, chunking and other performance advantages

• Rapid prototyping environment

• Significant reduction in development time

• Approach NGS analysis differently

Only had to define the data model, write data import and export tools
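The "compression, chunking" benefit rests on a simple idea: when data are compressed in independent chunks, a reader can decompress only the chunk that holds the record it needs. The sketch below is a plain-Python analogy using `zlib`, not HDF5 code. The chunk size and the `write_chunks` / `read_record` helpers are hypothetical and chosen only for the demo.

```python
import zlib

CHUNK = 4  # records per chunk; deliberately tiny for the demo


def write_chunks(records, chunk_size=CHUNK):
    """Compress records in fixed-size chunks so each chunk can be read alone."""
    chunks = []
    for i in range(0, len(records), chunk_size):
        block = "\n".join(records[i:i + chunk_size]).encode("ascii")
        chunks.append(zlib.compress(block))
    return chunks


def read_record(chunks, n, chunk_size=CHUNK):
    """Fetch record n by decompressing only the chunk that contains it."""
    block = zlib.decompress(chunks[n // chunk_size]).decode("ascii")
    return block.split("\n")[n % chunk_size]


# Made-up reads: ten short records with repetitive sequence.
reads = [f"read{i}:" + "ACGT" * 8 for i in range(10)]
chunks = write_chunks(reads)
print(len(chunks), "chunks")
print(read_record(chunks, 6))
```

A single compressed text file would have to be decompressed from the start to reach the same record, which is the "no random access" case.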


Getting Involved

• BioHDF is being built on existing, available, and proven HDF5 technology

• Import, export tools will be open source

• Geospiza and The HDF Group are seeking collaborations
– Bioinformatics pipeline developers
– Algorithm developers

• Funding - NIH STTR HG003792

• Interested? Contact [email protected]


Geospiza Products

GeneSifter™ Laboratory and Analysis Software Systems

From Samples to Results™

• For: Core, Service, Data Production Labs and Research Scientists

• Working with: Sanger Sequencing, Microarray, Next Generation Sequencing, and (or) other platforms

• GeneSifter supports: Laboratory operations, Data Management, Multiple Levels of Data Analysis

• Deployment: cost effective hosted or on-site models.

BioHDF at ISMB:
Monday June 29th - 6pm - Poster U57: BioHDF Poster Session
Wednesday July 1st - 1pm - Room T1: BioHDF hack-a-thon / BOF

BioHDF at BOSC:
Today - 5:30pm - here: BioHDF hack-a-thon / BOF

Getting BioHDF software: http://www.hdfgroup.org/projects/bioinformatics/
