Post on 17-Jul-2015
transcript
WHAT TO DO IN THE
EVENT OF A DATA
DELUGE
EEMiS Workshop, Öland, Sweden,
12/1/2014
Adina Howe
germslab.org (Genomics and
Environmental Research in
Microbial Systems)
Iowa State University, Ag &
Biosystems Engr (January)
Slides available at
www.slideshare.com/adinachuanghowe
NGS SEQUENCING
HOW DID WE GET HERE
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. coli)
Function
(e.g., degrades cellulose)
Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000) by year, 1990 to 2012]
Rapidly decreasing costs with
NGS Sequencing
Stein, Genome Biology, 2010
Next Generation Sequencing
4,500,000 bp (E. coli, $200, presently)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000) by year, 1990 to 2012]
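The two E. coli price points quoted on these slides can be turned into a cost-per-base comparison. This is a minimal arithmetic sketch using only the slide's figures (4,500,000 bp, $4.5M in 1992, $200 presently); the function name is illustrative.

```python
GENOME_BP = 4_500_000  # E. coli genome size quoted on the slide


def bp_per_dollar(genome_bp, cost_usd):
    """Bases sequenced per dollar for one genome at a given total cost."""
    return genome_bp / cost_usd


then = bp_per_dollar(GENOME_BP, 4_500_000)  # 1992: $4.5M for the genome
now = bp_per_dollar(GENOME_BP, 200)         # "presently": $200 for the genome
fold = now / then

print(f"1992:      {then:,.0f} bp per $")
print(f"presently: {now:,.0f} bp per $")
print(f"fold change in bp per $: {fold:,.0f}")
```

On these figures the same genome goes from one base per dollar to tens of thousands of bases per dollar.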
Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
A personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware
(doubling time 14 months)
Sanger Sequencing
(doubling time 19 months)
NGS (Shotgun) Sequencing
(doubling time 5 months)
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $) by year, 1990 to 2012, log scales]
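The doubling times quoted on this slide can be compared directly by asking how much each quantity grows over the same period. A minimal sketch, assuming steady exponential growth at the quoted doubling times; the five-year window is an arbitrary choice for illustration.

```python
# Doubling times in months, as quoted on the slide.
DOUBLING_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}


def growth_factor(doubling_months, elapsed_months):
    """Fold increase after elapsed_months, assuming steady doubling."""
    return 2 ** (elapsed_months / doubling_months)


ELAPSED = 60  # five years, an arbitrary window for illustration
for name, months in DOUBLING_MONTHS.items():
    print(f"{name}: {growth_factor(months, ELAPSED):,.1f}x in {ELAPSED} months")
```

Under that assumption, sequencing output per dollar outpaces hardware by orders of magnitude within a few years, which is the gap the rest of the talk addresses.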
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
THE DIRT ON SOIL
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
MAGNIFICENT BIODIVERSITY
THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL
DYNAMIC
THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
I. Methods to tackle metagenomic datasets
Computational
Experimental
II. Bottlenecks for microbiologists
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
A Slight Digression: Decisions for the new
microbial ecologist
Getting the most out of your data
Complex Samples
16S rRNA amplicon sequencing
Pros:
1) Commonly used approach
2) Deep characterization
Cons:
1) Limited knowledge
2) Resolution remains low
Shotgun sequencing
Assembly based
Pros:
1) Large contigs
2) Positional information
3) Most direct method to identify novel orgs/genes
Cons:
1) Computational resource intensive
2) Assembling difficulties
• Sequencing error
• Genomic redundancy, chimeras
Read-based / Mapping Methods
Pros:
1) Massive data
2) Identity and abundance answered simultaneously
3) Look at all data**
Cons:
1) Massive data (short + with errors)
2) Lack of specificity due to false positives (FPs) from genomic redundancy
3) Difficult to detect novel genomes; must infer
Reference database size reduction
Faster search algorithms
Query size reduction
a. Selection of marker genes
b. Identification of signatures (k-mers)
a. Exact match
b. K-mer based search
c. Improved algorithm
a. Clustering
b. Pattern matching
a. Assembly / clustering
b. Unique k-mer
ID, Abundance, Function
Patrick Chain
Example #1: Data compression
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) -> Computational algorithms -> Informative genes / genomes
Metagenome assembly…a scaling
problem.
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
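One way to see why sequencing errors make assembly harder is to count k-mers (substrings of length k) across the reads above. This is a minimal sketch: the reads are copied from the slide, while the choice of k = 8 and the helper name are assumptions made for illustration. True sequence is sampled by several overlapping reads, so its k-mers recur; a k-mer spanning an error is usually seen only once.

```python
from collections import Counter

# Reads copied from the slide; each contains one substituted character.
READS = [
    "It was the Gest of times, it was the wor",
    ", it was the worst of timZs, it was the",
    "isdom, it was the age of foolisXness",
    ", it was the worVt of times, it was the",
    "mes, it was Ahe age of wisdom, it was th",
    "It was the best of times, it Gas the wor",
    "mes, it was the age of witdom, it was th",
    "isdom, it was tIe age of foolishness",
]


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


K = 8  # illustrative k-mer size, not taken from the talk
counts = count_kmers(READS, K)
singletons = [kmer for kmer, n in counts.items() if n == 1]

print(f"distinct {K}-mers: {len(counts)}")
print(f"seen only once:   {len(singletons)}")
print(f"'it was t' seen {counts['it was t']} times")
```

Not every singleton is an error (read ends are also sampled thinly), but every k-mer spanning the substituted "Z", "X", or "V" is a singleton, which is the kind of signal error-aware assembly pipelines can exploit.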
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
supercomputer
Assembly of 300 Gbp (70,000
genomes worth) can be done with
any assembly program in less
than 14 GB RAM and less than
24 hours.
50 Gbp = 10,000 genomes
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Overkill
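The "overkill" on this slide follows from simple coverage arithmetic: sequencing deeply enough to sample a rare organism 10x samples the abundant ones far more than needed. A minimal sketch, in which the community composition and genome size are hypothetical numbers chosen for illustration, not values from the talk.

```python
GENOME_BP = 5_000_000  # assumed genome size for every organism (hypothetical)

# Hypothetical relative abundances in a three-member community.
ABUNDANCE = {"abundant": 0.90, "intermediate": 0.09, "rare": 0.01}


def expected_coverage(total_bp, abundance, genome_bp=GENOME_BP):
    """Average number of times each base of a genome is sampled."""
    return total_bp * abundance / genome_bp


def bp_needed(target_coverage, abundance, genome_bp=GENOME_BP):
    """Total sequencing needed for one organism to reach target_coverage."""
    return target_coverage * genome_bp / abundance


# Sequence until the rare organism reaches 10x.
total_bp = bp_needed(10, ABUNDANCE["rare"])
print(f"total sequencing needed: {total_bp / 1e9:.1f} Gbp")
for name, fraction in ABUNDANCE.items():
    print(f"{name}: {expected_coverage(total_bp, fraction):.0f}x")
```

With these made-up abundances, reaching 10x on the rare member leaves the abundant member sampled hundreds of times over, and every extra read of it adds errors without adding information.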
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Scales down datasets for assembly by up to 95% with the same assembly
outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
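The core idea of digital normalization, as described in Brown et al., 2012, is to stream through the reads once and keep a read only if the median abundance of its k-mers, among reads kept so far, is below a cutoff. The sketch below is a toy version under stated assumptions: it uses an exact in-memory counter instead of a memory-efficient probabilistic one, and the values of k and the cutoff are small illustrative choices, not recommended settings.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=5):
    """Keep a read only while its median k-mer count is below cutoff.

    Toy version: counts are exact and held in memory (an assumption
    made for clarity, not how large datasets would be handled).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# One highly abundant sequence sampled 100 times, one rare sequence sampled once.
reads = ["ACGTTGCA"] * 100 + ["GGGGCCCCAA"]
kept = digital_normalization(reads, k=4, cutoff=5)
print(f"kept {len(kept)} of {len(reads)} reads")
```

The redundant copies of the abundant sequence are discarded once it has been seen often enough, while the rare sequence is retained, so the dataset shrinks without losing low-abundance members.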
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
The reality?
More like…
Source: Chuck Haney
Howe et al., 2014, PNAS
What we learned from deeply sequencing
soil
Grand Challenge effort –
10% of soil biodiversity
sampled
Incredible soil biodiversity
(estimate required 10
Tbp/sample)
“To boldly go where no man
has gone before”: >60%
Unknown
[Figure: total count of KO (0 to 400) by functional category, for genes found in corn and prairie, corn only, and prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]
Howe et al, 2014, PNAS
Managed agricultural soils exhibit less
diversity, likely because of their history of
cultivation.
Megahit
Example #2: Experimental partitioning
We are supraorganisms
Is the gut community going viral?
Bacterial cells
Bacterial cells infected with bacteriophage
Viruses (Bacteriophage)
Prophage
Who is in the gut microbiome?
Novel experimental design with high
throughput sequencing
Faecal samples (nondestructive sampling)
I. Total DNA (bacteria + prophage + viruses): TOT
II. Virus-like particles (free-living viruses): VLP
III. Induced prophage: IND
Separation by size
Separation by density
Chemically separate
+ Evolutionary biomarker (taxonomy analysis): 16S rRNA gene from total DNA
Is the gut community going viral?
Reyes et al., Nature Reviews Microbiology, 2012
Research Questions
What are the impacts of different diets on gut
microbiome response?
What are the impacts of viruses in the gut
microbiome (rapid alteration and resilient
response)?
Multidisciplinary approach combining
novel experimental targeting of both bacterial and viral
communities
metagenomic-based sequencing to characterize
community
Two diets (with a perturbation)
Low-fat (LF) baseline diet: LF / 10% Fat / Complex Carbs
Milk-fat (MF) baseline diet: MF / 37% Fat / Simple Sugars
Age (wk): 4 5 6 7 8 9 10 11 12 13 14
Phases: Baseline, Diet Switch, Washout (Return to Baseline)
Timeline labels: MF, LF, MF, LF
Weight of mice
Total community function: TOT metagenomic sequencing at weeks 8, 11, 14
Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14
Bacterial (only) community structure: 16S rRNA at weeks 8 through 14
Fecal Samples
Outcomes?
For The Win:
Qualitative and Quantitative Measurements
“Combat Zone” of phage-host
interactions
Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in
which there is not a rapid recovery of viral communities. Instability of the gut
microbiome? Availability of genes through viral infection?
Is more data better?
Bottlenecks for the emerging
microbiologists
Technical obstacles in the big data
deluge
Access to the data
Access to the resources
Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)
Data volume and velocity
Previous efforts are difficult to integrate
Innovation is necessary
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Data intensive microbiology
Social obstacles – the main
challenge
A shift in costs does not mean a shift in
expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than
the time it took you to do
your experiment to
analyze the data. Please
do not write me for
results within 24 hours of
your sequences
becoming available.
- Adina
Culture of sharing
http://www.heathershumaker.com/
Metagenomic Datasets
Training / Incentives
Emails between collaborators don’t contain as
much “science” as I’d like:
All analysis: accessible,
reproducible, and automated
To reproduce analysis in a publication,
1. Rent an Amazon EC2 computer
2. Clone the GitHub repository containing data and scripts
3. Open the IPython notebook and execute
To run same analysis on different dataset,
1. Replace data files with your own data, execute notebook.
2. Tweak scripts as needed.
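The clone-and-execute steps above could be scripted along these lines. This is only a sketch under assumptions: the repository URL and notebook name are placeholders, not the talk's actual repository, and it uses `jupyter nbconvert --execute` (the current successor to the IPython notebook tooling named on the slide) to run a notebook non-interactively. The code only builds and prints the commands; running them is left to the reader.

```python
import shlex


def reproduction_commands(repo_url, notebook, workdir="analysis"):
    """Build the shell commands to clone a repository and execute a notebook.

    repo_url and notebook are supplied by the caller; nothing is run here.
    """
    return [
        ["git", "clone", repo_url, workdir],
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         f"{workdir}/{notebook}"],
    ]


# Placeholder values for illustration only.
commands = reproduction_commands(
    "https://github.com/example-user/example-paper.git",
    "analysis.ipynb",
)
for cmd in commands:
    print(shlex.join(cmd))
```

To run the same analysis on a different dataset, the data files in the cloned directory would be replaced before the second command is executed.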
Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center