Date posted: 29-Nov-2014
Category: Technology
Uploaded by: bosc
Confidential© Copyright 2009 Geospiza, Inc. All Rights Reserved. Page 1
Todd Smith (1), Eric Olson (1), Mark Welsh (1), Mike Folk (2), Christopher Mason (3)
1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119
2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
3. Yale University, New Haven, CT
BioHDF: Toward Scalable Bioinformatics Infrastructures
Overview
• Driver: bioinformatics challenges in Next Generation DNA Sequencing (NGS)
• BioHDF project and examples
• HDF5 (Hierarchical Data Format)
• How you can get involved
• Geospiza
Next Generation DNA Sequencing
“Genome center in a mail room”
“Democratizing genomics”
“Changing the landscape”
“The beginning of the end for microarrays”
“Transforms today’s biology”
NGS is Powerful
Example: Measuring Gene Expression
dbEST - Jan 20, 2009
Total Organisms: 1,683
Total ESTs: 59,498,205
Human: 8,163,902
Mouse: 4,850,605
Maize: 2,018,337
Arabidopsis: 1,526,124
Cattle: 1,517,143
Pig: 1,476,771
Soybean: 1,380,071
Zebra Fish: 1,380,017
5 others: >1,000,000
Other Technologies (Experiment: Measurements)
SAGE: 1,000-100,000
Microarrays: 48 K (transcripts), 0.8 M (probes), 1.4 M (probes)
Next Generation Sequencing (Instrument: Measurements)
454 Titanium: 1,000,000
Illumina GA: 80,000,000
SOLiD V3: 180,000,000
Greater sensitivity, higher dynamic ranges
+ Qualitative data: isoforms, alleles, …
NGS is Daunting
“Prepare for the deluge”
“Byte-ing off more than you can chew”
“These sequencers are going to totally screw you”
NGS Data are Analyzed in Three Phases
Primary Data Analysis - Images to bases
– Sequences + Quality values
– Run quality
Secondary Data Analysis
– Ref Seq + Aligner
– De novo assembly => Assembler, Contigs + Annotation
– Secondary Data Production
– Gene lists, Read Density, Variant list
– Sample, run quality
Tertiary Data Analysis
– One or more Data sets
– Differential expression, Methylation sites, Gene association, Genomic structure
– Experiment, science
Secondary Analysis is Complex
Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing
Examples: MAQ - http://maq.sourceforge.net
Story repeats for BWA, Bowtie, TopHat, Mapreads, SOAP …
Complexity Limits Scale and Productivity
• Data are unstructured - no consistent data model
• Solve problems using redundant data processing
– Incremental processing with data filtering at each stage
– New question? Then rerun alignment operations
• Each analysis step has a new output format
– One file for tables of alignments
– Another file with bases aligned to see mismatches
– Another file to ask statistical questions
– More files and images for visualization
– Files are linked by virtue of being in the same directory
– Perl hashes used to link the data fill up memory
– Redundant text-based formats fill up disk space
Makes Getting Answers Difficult
10 - 100 million reads
Align to reference data
Review results, make decisions
Process Applications
Parse files, reformat data, create reports
Small RNA
Epi-Genomics
Variation Analysis
Gene Expression
And Comparing Between Samples Hard
10 - 100 million reads
Align to reference data
Repeat n times, with n samples
Review results, make decisions
Process
Explore Data Between Samples and drill into details
Alternative splicing
Compare expression
What is Desired
1. Scalable systems with smoothly operating user interfaces
2. Summarize results and drill into details for single samples
3. Compare results between samples and within groups
Data must be structured, indexed, and annotated
Need a better way to work with NGS data and information
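As a rough illustration of what "structured, indexed, and annotated" can mean in practice, the sketch below keeps alignments as typed records, sorts them by position within each chromosome, and answers both a summary question and a drill-down question from the same index. The names `Alignment` and `AlignmentIndex` and the record fields are hypothetical, chosen for illustration; they are not part of BioHDF or HDF5.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One aligned read (hypothetical record layout)."""
    read_id: str
    chrom: str
    start: int        # 0-based leftmost reference position
    mismatches: int


class AlignmentIndex:
    """Alignments grouped by chromosome and sorted by start position."""

    def __init__(self, alignments):
        by_chrom = defaultdict(list)
        for aln in alignments:
            by_chrom[aln.chrom].append(aln)
        self._records = {}
        self._starts = {}
        for chrom, alns in by_chrom.items():
            alns.sort(key=lambda a: a.start)
            self._records[chrom] = alns
            # Parallel list of start positions, used for binary search.
            self._starts[chrom] = [a.start for a in alns]

    def summary(self):
        """Summarize: number of alignments per chromosome."""
        return {chrom: len(alns) for chrom, alns in self._records.items()}

    def in_range(self, chrom, lo, hi):
        """Drill in: alignments whose start lies in [lo, hi)."""
        starts = self._starts.get(chrom, [])
        i = bisect_left(starts, lo)
        j = bisect_left(starts, hi)
        return self._records.get(chrom, [])[i:j]


alignments = [
    Alignment("r1", "chr5", 1200, 0),
    Alignment("r2", "chr5", 150, 1),
    Alignment("r3", "chr1", 900, 2),
    Alignment("r4", "chr5", 980, 0),
]
index = AlignmentIndex(alignments)
print(index.summary())
print([a.read_id for a in index.in_range("chr5", 900, 1300)])
```

Because the index is built once, a new question about a different region is a binary search rather than another pass over the raw alignment output.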
BioHDF Project
• NIH STTR
– Geospiza, Seattle WA
– The HDF Group, Urbana/Champaign IL
• Goal: Move bioinformatics problems from organizing and structuring data to asking questions and visualizing data
– Develop data models and tools to work with NGS data in HDF (Hierarchical Data Format)
– Create HDF5 domain-specific extensions and library modules to support the unique aspects of NGS data => BioHDF
– Integrate BioHDF technologies into Geospiza products
• Deliver core BioHDF technologies to the community as open-source software
Performance Advantages
Storage (HDF5 World | Flat File World)
fasta file: 143 MB - compressed, random access | 609 MB - no random access
Bowtie Alignments = fasta + alignment: 284 MB - index, 374 MB + index | 1033 MB - no random access
Export Alignments (HDF5 World | Flat File World)
chr5, ~1 M alignments: 1100 ms | 1470 ms
100 Mbase region, 450000 alignments: 540 ms | 735 ms
10 Mbase region, 44000 alignments: 62 ms | 735 ms
1 Mbase region, 4000 alignments: 19 ms | 735 ms
0.1 Mbase region, 600 alignments: 15 ms | 735 ms
Development (HDF5 World | Flat File World)
Days, Weeks - write I/O code - parsers, loaders, and access methods | > Months - develop file formats, indices, access libraries, and debug to make efficient
HDF improves storage, access, and development efficiency and does not add to computational overhead
Test Case: 9.3 million GA reads aligned to HG build 36.1 (4-core 3 GHz Intel Xeon)
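The figures on this slide are easier to compare as ratios. The short calculation below is a sketch that assumes the first column of sizes and timings belongs to HDF5 and the second to flat files, which is how the column header reads; the dictionary keys are labels chosen here for illustration.

```python
# Sizes in MB and export times in ms, copied from the slide.
# Assumption: first column = HDF5, second column = flat files.
storage_mb = {
    "fasta file": (143, 609),
    "Bowtie alignments, HDF5 without index": (284, 1033),
    "Bowtie alignments, HDF5 with index": (374, 1033),
}
export_ms = {
    "chr5": (1100, 1470),
    "100 Mbase region": (540, 735),
    "10 Mbase region": (62, 735),
    "1 Mbase region": (19, 735),
    "0.1 Mbase region": (15, 735),
}


def flat_over_hdf5(table):
    """Return {label: flat / hdf5} rounded to one decimal place."""
    return {label: round(flat / hdf5, 1) for label, (hdf5, flat) in table.items()}


storage_ratio = flat_over_hdf5(storage_mb)
export_ratio = flat_over_hdf5(export_ms)

for label, ratio in storage_ratio.items():
    print(f"{label}: flat file is {ratio}x larger")
for label, ratio in export_ratio.items():
    print(f"{label}: flat file export takes {ratio}x as long")
```

Under that reading, the flat-file export time stays at 735 ms for every region below chromosome scale while the HDF5 time shrinks with the region, so the ratio grows as the query gets narrower.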
Value of Development Time
Focus on Science:
• Working with 100s of millions of reads for 100s of samples
• Measuring gene expression
• Identifying isoforms
• Observing sequence and structural variation
• Drilling into details from summaries
Instead of Software / IT:
• Developing and debugging low-level infrastructures to support “novel” binary data formats
• Optimizing high-end hardware systems
• Tuning and redesigning RDBMS and other implementations
[Figure: exons observed, Sample A vs Sample B; labels: Splice, Index, ID]
Enables a Different Approach
[Diagram labels: Series of sources; Sample 1, Sample 2, … Sample (n), each with an HDF file (HDF file1, HDF file2, … HDF file(n)) containing reads, ref data, and alignments; annotations; queries; Data by range; formatters; .wig and .bed files; Sample 1 (exon-crossing), Sample 2 (exon-crossing); Base comp.]
Integrate samples / annotations
Integrate data across platforms
Integrate between systems
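The "data by range" and "formatters" boxes in this diagram can be pictured as two small functions: one selects alignments that overlap a region, the other writes them out in a standard interchange format. The sketch below emits six-column BED lines (chrom, start, end, name, score, strand, tab-separated, 0-based half-open coordinates). The `Alignment` tuple and both function names are hypothetical and do not come from the BioHDF tools.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical alignment record; coordinates are 0-based, half-open."""
    read_id: str
    chrom: str
    start: int
    end: int
    strand: str


def select_range(alignments, chrom, lo, hi):
    """Data by range: alignments on `chrom` that overlap [lo, hi)."""
    return [a for a in alignments
            if a.chrom == chrom and a.start < hi and a.end > lo]


def to_bed(alignments):
    """Formatter: render alignments as BED6 text, sorted by position."""
    lines = []
    for a in sorted(alignments, key=lambda a: (a.chrom, a.start)):
        # Score is fixed at 0 here; a real exporter might use mapping quality.
        fields = [a.chrom, str(a.start), str(a.end), a.read_id, "0", a.strand]
        lines.append("\t".join(fields))
    return "\n".join(lines)


sample_1 = [
    Alignment("r1", "chr5", 100, 136, "+"),
    Alignment("r2", "chr5", 5000, 5036, "-"),
    Alignment("r3", "chr1", 120, 156, "+"),
    Alignment("r4", "chr5", 90, 126, "-"),
]
bed_text = to_bed(select_range(sample_1, "chr5", 0, 1000))
print(bed_text)
```

Keeping the selection separate from the formatter means a .wig exporter or a report generator could reuse the same range query.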
Explore Data with Different Questions
One alignment step, different questions
[Diagram labels: Sources (Adapter/Primers, rRNA/mtDNA, miRBase, Transcripts, Genome, Exon junctions); HDF file1, Sample 1 containing reads and alignments; annotations; queries; formatters]
Small RNA Analysis
“Subtractive” question
“Biology” question
Examine match quality
Splicing / Exon Analysis
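The idea of one alignment step serving several questions can be sketched with a single table of hits that is queried three ways. The tuple layout, the source labels, and the function names below are assumptions made for illustration; the grouping of adapter and rRNA hits under the "subtractive" question and miRBase hits under the "biology" question is one plausible reading of the diagram.

```python
from collections import Counter

# One set of hits from a single alignment run against several sources.
# Hypothetical layout: (read_id, source, target, mismatches)
hits = [
    ("r1", "adapter", "primer_A", 0),
    ("r2", "miRBase", "let-7a", 0),
    ("r3", "rRNA", "18S", 1),
    ("r4", "miRBase", "let-7a", 1),
    ("r5", "miRBase", "miR-21", 0),
    ("r1", "miRBase", "miR-21", 2),
]

CONTAMINANT_SOURCES = {"adapter", "rRNA"}


def contaminant_reads(hits):
    """Subtractive question: which reads match adapters or rRNA?"""
    return {read for read, source, _, _ in hits if source in CONTAMINANT_SOURCES}


def mirna_counts(hits, exclude):
    """Biology question: reads per miRNA, ignoring excluded reads."""
    return Counter(target for read, source, target, _ in hits
                   if source == "miRBase" and read not in exclude)


def mismatch_profile(hits, source):
    """Match quality: how many hits to `source` have 0, 1, 2 ... mismatches?"""
    return Counter(mm for _, src, _, mm in hits if src == source)


removed = contaminant_reads(hits)
counts = mirna_counts(hits, exclude=removed)
profile = mismatch_profile(hits, "miRBase")
print(sorted(removed), dict(counts), dict(profile))
```

None of the three queries reruns the aligner; each one is a filter over hits that were computed once.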
Why HDF?
Arrays, rich data types, groups accommodate every kind of data
Store any combination of data objects in one container.
Performance: fast random access and efficient, scalable storage
Portability, data sharing: platform independent, self describing, common data models
Tools for viewing, analysis: HDFView, MATLAB, others
Widespread: used in academia, govt, industry - MATLAB, IDL, NASA-Earth Observing System
A platform for creating software to work with many kinds of scientific data
HDF5: 20 Years in Physical Sciences
HDF Software
HDF I/O Library
Tools, Applications, Libraries
HDF File
Command Line Tools
Library Extensions
Modifications
BioHDF
Benefits
• Separates the model, implementation, and view of the data
• Combines data from multiple samples
• Compression, chunking and other performance advantages
• Rapid prototyping environment
• Significant reduction in development time
• Approach NGS analysis differently
Only had to define the data model, write data import and export tools
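Why chunking matters for compressed data can be shown without HDF5 itself. The sketch below uses Python's `zlib` to compress a sequence in fixed-size chunks and then reads a sub-range by decompressing only the chunks that overlap it. It illustrates the general technique of chunked compression with random access; it is not HDF5's implementation, and `CHUNK_SIZE` and the function names are arbitrary choices for the example.

```python
import zlib

CHUNK_SIZE = 1000  # bases per chunk (illustrative value)


def compress_chunks(sequence, chunk_size=CHUNK_SIZE):
    """Split the sequence into fixed-size chunks and compress each one."""
    data = sequence.encode("ascii")
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_range(chunks, start, end, chunk_size=CHUNK_SIZE):
    """Return bases [start, end), decompressing only the overlapping chunks."""
    if start >= end:
        return ""
    first = start // chunk_size
    last = (end - 1) // chunk_size
    data = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = first * chunk_size
    return data[start - offset:end - offset].decode("ascii")


sequence = "ACGTTGCA" * 1250          # 10,000 bases of toy data
chunks = compress_chunks(sequence)
compressed_size = sum(len(c) for c in chunks)

print(len(chunks), compressed_size < len(sequence))
print(read_range(chunks, 2995, 3005) == sequence[2995:3005])
```

A single compressed stream would have to be decompressed from the beginning to reach position 2995; with chunks, only two of the ten pieces are touched.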
Getting Involved
• BioHDF is being built on existing, available, and proven HDF5 technology
• Import, export tools will be open source
• Geospiza and The HDF Group are seeking collaborations
– Bioinformatics pipeline developers
– Algorithm developers
• Funding - NIH STTR HG003792
• Interested? Contact [email protected]
Geospiza Products
GeneSifter™ Laboratory and Analysis Software Systems
From Samples to Results™
• For: Core, Service, Data Production Labs and Research Scientists
• Working with: Sanger Sequencing, Microarray, Next Generation Sequencing, and (or) other platforms
• GeneSifter supports: Laboratory operations, Data Management, Multiple Levels of Data Analysis
• Deployment: cost effective hosted or on-site models.
BioHDF at ISMB:
Monday June 29th - 6pm - Poster U57 - BioHDF Poster Session
Wednesday July 1st - 1pm - Room T1 - BioHDF hack-a-thon / BOF
BioHDF at BOSC:
Today - 5:30pm - here: BioHDF hack-a-thon / BOF
Getting BioHDF software: http://www.hdfgroup.org/projects/bioinformatics/