BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST MOVING,
HETEROGENEOUS DATASETS
Adina Howe
Argonne National Laboratory
Michigan State University
EPA Air Sensors 2013: Data Quality and Applications
March 19, 2013
Introduction – My perspective
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
Engineering
Microbial Ecology
Bioinformatics
THE DATA DELUGE
An exponential landscape
Next-generation sequencing growth outpacing computational resources
Stein, Genome Biology, 2010
Log scale!
Effects of low cost sequencing…
1995: First free-living bacterium sequenced for billions of dollars and years of analysis
A personal genome can be mapped in a few days for hundreds to a few thousand dollars
Effects of low cost sequencing on research
Sboner et al., Genome Biology, 2011
RETHINKING
What it takes to deliver
Technology
Core competency
Value added
Technical obstacles in the big data deluge
• Access to the data and its value
• Access to the resources
Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000”
Root causes:
• Data volume and velocity “clog”
• Data is very heterogeneous
• Previous efforts are difficult to integrate
• Innovation is necessary but hard
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
Social obstacles are the most difficult.
• A shift in costs does not mean a shift in expectations
• “Give me the answer so I can get back to work.”
• A culture of sharing (data, time, and tools)
• Evolution of necessary training
• Creating teams that can communicate across domains
• Incentives are not strong enough
• Patterns for success (useful data sharing and collaboration) are not apparent or well understood.
POSSIBLE SOLUTIONS
Common solutions: been there, done that
http://xkcd.com/927/
What would an ideal solution look like?
• Flexible access to data, tools, and resources
• Cost effective, consistent, reusable (scalable)
• Rapid exploration
• Incentives to participate, share, communicate
• Community sandbox (vs lab-specific)
• Painless
Platform which supports an “ecology” of databases, interfaces, and analysis software.
The success of organization: Amazon
• > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.
• Continually changing/updating data sets.
• Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
• For example, the Amazon.com website is itself built from over 150 independent services…
• Amazon routinely deploys new services and functionality.
http://highscalability.com/amazon-architecture
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
Amazon development guideline:
Colloquially said, “You should eat your own dogfood.”
Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone.
To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
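A minimal sketch of the dogfooding guideline, under stated assumptions: `Sample`, `SampleService`, and `total_reads` are hypothetical names invented for this illustration and do not come from Amazon or any real system. The point it tries to show is that the in-house analysis calls only the public method that outside users would also get, and never reaches into the private storage.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical record type used only for this illustration."""
    sample_id: str
    site: str
    read_count: int


class SampleService:
    """Hypothetical data service; its public methods are the only supported interface."""

    def __init__(self) -> None:
        # Private storage: neither internal nor external code should touch this directly.
        self._samples: Dict[str, Sample] = {}

    def add(self, sample: Sample) -> None:
        self._samples[sample.sample_id] = sample

    def by_site(self, site: str) -> List[Sample]:
        return [s for s in self._samples.values() if s.site == site]


def total_reads(service: SampleService, site: str) -> int:
    """In-house analysis written against the same public call external users get."""
    return sum(s.read_count for s in service.by_site(site))


service = SampleService()
service.add(Sample("s1", "prairie", 1200))
service.add(Sample("s2", "prairie", 800))
service.add(Sample("s3", "corn", 500))
print(total_reads(service, "prairie"))
```

If `total_reads` ever needs something `by_site` cannot provide, the guideline suggests extending the public interface rather than reading `_samples` directly, so that every user benefits from the change.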
If the “customers” aren’t integrated into the development loop:
http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg
DOE Knowledgebase (KBase)
• Emerging software and data environment to enable researchers
• Service-oriented architecture in which biological data is integrated into a single data model, with KBase services loosely coupled to achieve various functions
• Open development environments for community contribution (public data, services, software)
• Provides robust and scalable infrastructure (with some level of support)
https://kbase.us
KBase uses service-oriented architecture
http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf
Higher level functions
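The idea of loosely coupled services over a single data model can be sketched roughly as follows. This is not KBase's actual API: `SequenceRecord`, `REGISTRY`, `register`, and `summarize` are hypothetical names chosen for illustration. Each service depends only on the shared record type, and the higher-level function is composed from whatever services happen to be registered.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical shared data model that every service reads."""
    record_id: str
    sequence: str


# A service is any callable from the shared model to a number.
Service = Callable[[SequenceRecord], float]

# The registry is the only coupling point between services.
REGISTRY: Dict[str, Service] = {}


def register(name: str) -> Callable[[Service], Service]:
    """Decorator that adds a service to the registry under the given name."""
    def wrap(func: Service) -> Service:
        REGISTRY[name] = func
        return func
    return wrap


@register("length")
def length(record: SequenceRecord) -> float:
    return float(len(record.sequence))


@register("gc_content")
def gc_content(record: SequenceRecord) -> float:
    seq = record.sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def summarize(record: SequenceRecord) -> Dict[str, float]:
    """Higher-level function composed only from registered services."""
    return {name: service(record) for name, service in REGISTRY.items()}


record = SequenceRecord("example", "ATGCGC")
print(summarize(record))
```

Under this arrangement a community contributor could add a new service with one decorated function, without editing `summarize` or any existing service.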
DOE KBase Investment
“…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.”
Free tutorials and workshops are provided for the community.
Advice for the next round…
Data generator:
• Managing expectations and value
Developer:
• “Eat your own dogfood”
Data analyzer:
• Analyze with reproducibility in mind
Access
Training
Communication
Platform / Teams
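One possible reading of "analyze with reproducibility in mind" is sketched below: the analysis returns its own provenance (input checksum, parameters, interpreter version) alongside the result. The function `run_analysis`, its parameters, and the toy subsample-mean calculation are assumptions made for illustration, not a method described in the talk.

```python
import hashlib
import json
import platform
import random


def run_analysis(data: bytes, seed: int, sample_size: int) -> dict:
    """Toy analysis (mean of a random subsample) that reports its own provenance."""
    rng = random.Random(seed)  # local generator, so the seed alone determines the draw
    values = list(data)        # each byte becomes an integer in 0..255
    picked = rng.sample(values, sample_size)
    return {
        "result": sum(picked) / sample_size,
        "provenance": {
            "input_sha256": hashlib.sha256(data).hexdigest(),
            "parameters": {"seed": seed, "sample_size": sample_size},
            "python_version": platform.python_version(),
        },
    }


data = bytes(range(100))
first = run_analysis(data, seed=42, sample_size=10)
second = run_analysis(data, seed=42, sample_size=10)

# The provenance record is what a collaborator would need to repeat the run.
print(json.dumps(first["provenance"], indent=2))
# Same input, same parameters, same interpreter: the two runs should match.
print(first == second)
```

Storing the provenance next to the result, rather than in a lab notebook, keeps the record attached to the data when it is shared.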
Big data is a community problem and solution
Resources
• Amazon interviews
http://highscalability.com/amazon-architecture
• Titus Brown’s blog post on heterogeneous data integration
http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html
• KBase website
http://www.kbase.us
• Software carpentry – “helping scientists build better software”
http://software-carpentry.org
Thanks!
Please feel free to contact me:
http://adina.github.com
http://cheezburger.com/6983817216