Date posted: 16-Jul-2015
Category: Engineering
Uploaded by: adina-chuang-howe
RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY
IOWA STATE UNIVERSITY
MARCH 12, 2014
Adina Howe, PhD
Outline of talk
My multi-discipline career
Biological sequencing: a game changer
Research – computational focus:
How to handle “big data” in biology
Research – biological focus:
The gut microbiome’s role in obesity?
Future research:
A flexible toolbox in a big playground
Background
Purdue University, BSME,
Mechanical Engineering
Purdue University, MS,
Environmental Engineering
(Sustainability)
University of Iowa, PhD,
Environmental Engineering
(Microbiology/Bioremediation)
Michigan State University
NSF Postdoc Math and Biology Fellow (cross-training)
Microbial Ecology (Jim Tiedje)
Bioinformatics (Titus Brown)
Computational Biologist
Microbiology / Microbial Ecology
Our shared challenges
Climate Change
Energy Supply
Human Health
Image sources: USGCRP 2009; www.alutiiq.com; http://guardianlv.com/
An understanding
of microbial ecology
Environmental continuum
MICROBES
IN
ECOSYSTEMS
NATURE
AIR
WATER
SOIL
MICROBIOMES
HUMANS/ANIMAL
ENGINEERED
BIOREACTORS
WASTEWATER
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. coli)
Function
(e.g., degrades cellulose)
Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome: 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000) versus year, 1990 to 2012]
Rapidly decreasing costs with NGS
Stein, Genome Biology, 2010
Next Generation Sequencing
E. coli genome: 4,500,000 bp ($200, presently)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000) versus year, 1990 to 2012]
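As a rough worked example, the two E. coli price points quoted on these slides (4,500,000 bp for $4.5M in 1992, and for $200 at the time of the talk) can be turned into a per-base cost and a fold change. The variable names are illustrative and not from the slides.

```python
# Worked arithmetic for the two E. coli price points quoted on the slides.
GENOME_BP = 4_500_000            # E. coli genome size as given on the slide

cost_1992 = 4_500_000            # dollars (slide: $4.5M, 1992)
cost_now = 200                   # dollars (slide: $200, "presently")

per_bp_1992 = cost_1992 / GENOME_BP    # dollars per base pair in 1992
per_bp_now = cost_now / GENOME_BP      # dollars per base pair at the time of the talk
fold_cheaper = cost_1992 / cost_now    # how many times cheaper the genome became

print(f"1992: ${per_bp_1992:.2f} per bp")
print(f"now:  ${per_bp_now:.6f} per bp")
print(f"fold decrease: {fold_cheaper:,.0f}x")
```

Under these figures the cost falls from $1 per base pair to well under a hundredth of a cent, a 22,500-fold decrease.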
Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
A personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware (doubling time 14 months)
Sanger Sequencing (doubling time 19 months)
NGS (Shotgun) Sequencing (doubling time 5 months)
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $), log scale, versus year, 1990 to 2012]
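To make the gap between these doubling times concrete, here is a minimal sketch that converts a doubling time into a growth factor over a fixed period. The five-year horizon and the function name are assumptions chosen for illustration; only the three doubling times come from the slide.

```python
# Growth implied by a fixed doubling time: factor = 2 ** (elapsed / doubling_time).
DOUBLING_MONTHS = {
    "computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Return how many times a quantity multiplies over elapsed_months."""
    return 2 ** (elapsed_months / doubling_months)


ELAPSED = 60  # five years, an arbitrary horizon chosen for illustration
for name, doubling in DOUBLING_MONTHS.items():
    print(f"{name}: x{growth_factor(ELAPSED, doubling):,.0f} in {ELAPSED} months")
```

Over 60 months, a 5-month doubling time gives 2^12 = 4,096-fold growth, while a 14-month doubling time gives roughly 20-fold. That gap is why sequencing output outpaces the hardware used to analyze it.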
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
Flexibility towards embracing change.
How to survive a data
deluge?
Experiment
Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
Reducing data volume:
Assembly of Metagenomic
Sequences
MSU: C. Titus Brown and James Tiedje
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) -> Computational algorithms -> Informative genes / genomes
Metagenome assembly…a scaling
problem.
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
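The idea of rebuilding a sentence from overlapping fragments can be sketched with a toy greedy assembler. This is an illustration only: it assumes error-free reads (unlike the slide's reads, which contain deliberate errors) and a short, non-repetitive target. It is not the method used by any production assembler, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    # Drop duplicates and reads fully contained in another read.
    unique = list(dict.fromkeys(reads))
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["the best of", "It was the be", "best of times"]
print(greedy_assemble(reads))  # ['It was the best of times']
```

The full quotation repeats "it was the" several times, and repeats like that are exactly what makes greedy merging unreliable. Every pair of reads is compared on each pass, which also hints at why assembly becomes a scaling problem.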
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of “computer crunching” on a supercomputer
Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
Natural community characteristics
Diverse
Many organisms (genomes)
Variable abundance
Most abundant organisms, sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
[Figure panels: Sample 1x; Sample 10x; Overkill]
Digital normalization
Brown et al., 2012, arXiv
Howe et al., PNAS, 2014
Reduces dataset size for assembly by up to 95% with the same assembly
outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
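A minimal sketch of the digital normalization idea as described in Brown et al. (2012): estimate how well a read is already covered from the median count of its k-mers, and keep the read only if that estimate is below a cutoff. The exact `Counter`, the toy values `k=4` and `cutoff=2`, and the function names are assumptions for illustration. The published software relies on a memory-efficient probabilistic counting structure instead.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while the median count of its k-mers is below the cutoff."""
    counts = Counter()   # exact counts; a stand-in for a memory-efficient counting structure
    kept = []
    for read in reads:
        words = list(kmers(read, k))
        if not words:
            continue                      # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)          # only kept reads contribute to the counts
    return kept


reads = ["AGTCAGTT"] * 10 + ["AGAAAGTC"]  # one abundant sequence, one rare sequence
print(digital_normalization(reads))
```

In this toy run the ten copies of the abundant read are reduced to two, while the single rare read is retained. The data shrinks, but the information needed for assembly is kept.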
Partitioning (khmer software)
Pell et al, 2012, PNAS
Howe et al., 2014, PNAS
Separates metagenomes by species
Parallel computing possible
Largest known published soil metagenome and assembly
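Partitioning can be pictured as finding groups of reads that are connected through shared k-mers, so that each group can be assembled separately and in parallel. The sketch below uses exact sets and a union-find structure purely as an assumption for illustration. It is not the khmer implementation, and the names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def partition_reads(reads, k=4):
    """Group reads into connected components, linking reads that share any k-mer."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups fast
            x = parent[x]
        return x

    first_seen = {}                          # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for word in kmers(read, k):
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])   # union the two components
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


reads = ["AGTCAGTT", "AGAAAGTC", "CCGGCCGG", "GGCCGGAA"]
for part in partition_reads(reads):
    print(part)
```

The first two reads share the word AGTC and the last two share CCGG, so the four reads fall into two independent partitions.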
Tackling Soil Biodiversity
Source: Chuck Haney
Grand Challenge effort: 10% of soil biodiversity sampled
Incredible soil biodiversity (estimated requirement: 10 Tbp/sample)
“To boldly go where no man has gone before”: >60% unknown
[Figure: total KO count (0 to 400) by functional category: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility. Legend: corn and prairie; corn only; prairie only]
Howe et al, 2014, PNAS
Big data combined with microbiology will
change lives.
The health and stability of the gut
microbiome (in response to diet change)
University of Chicago: Daina Ringus, PhD & Eugene Chang, MD
Experiment
Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
We are supraorganisms
Interactions between the
microbiome and the environment
Source: Zhao, 2013
Obesity
Intestinal inflammation
IBD
Diet has a greater potential than host genetics to shape the
structure and function of the gut microbiome.
Direct influence on health state
How resilient is the microbiome?
In humans, the microbiome rapidly and reproducibly recovers within 2 days (2013)
In mice, rapid recovery from a long-term shift to an obesity-inducing diet (2012)
Is the gut community going viral?
Reyes et al., Nature Reviews Microbiology, 2012
Bacterial cells
Bacterial cells infected with bacteriophage
Viruses (bacteriophage)
Vary by individual (Minot et al., 2011)
Altered by diet and co-vary with bacteria (Minot et al., 2011)
Long-term stable (Minot et al., 2013)
Largely temperate (Reyes et al., 2013)
Prophage
Who is in the gut microbiome?
Research Questions
What are the impacts of different diets on gut
microbiome response?
What are the impacts of viruses in the gut
microbiome (rapid alteration and resilient
response)?
Multidisciplinary approach combining
novel experimental targeting of both bacterial and viral
communities
metagenomic-based sequencing to characterize
community
Novel experimental design – targeted
sampling of community fractions
I. Total DNA (bacteria + prophage + viruses): TOT
II. Virus-like particles (free-living viruses): VLP
III. Induced prophage: IND
Separation
by density
Chemically
separate
Separation
by size
Microbiome through faecal matter (non-destructive sampling)
Two baseline diets (with a
perturbation)
Low-fat (LF) baseline diet
Milk-fat (MF) baseline diet
Age (wk): 4 5 6 7 8 9 10 11 12 13 14
Phases: Baseline, Diet Switch, Washout (Return to Baseline)
Total community function: TOT metagenomic sequencing at weeks 8, 11, 14
Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14
Weight of mice and count of VLPs with microscopy
Taxonomy analysis (16S rRNA gene only) every week from week 8 to 14
Diets: LF / 10% Fat / Complex Carbs; MF / 37% Fat / Simple Sugars
[Diagram labels: MF; LF; MF; LF; Fecal Samples]
Outcomes?
Qualitative and Quantitative Measurements:
Who is there? What are they doing?
How much?
How does the community change over time?
Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination)
[Figure: distance from baseline across Baseline, Intervention, and Washout; panels: Altered-Recovery, Altered-Altered, No Change]
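Reducing a gene abundance profile to one "distance from baseline" number can be sketched as follows. The slides do not name the distance metric, so Bray-Curtis dissimilarity is used here as an assumption, and the three-gene profiles are invented purely for illustration.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical, 1 = no overlap)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical gene abundance profiles (three genes instead of 200,000+).
profiles = {
    "Baseline": [10, 5, 0],
    "Intervention": [2, 5, 8],
    "Washout": [9, 6, 0],
}

baseline = profiles["Baseline"]
for phase, profile in profiles.items():
    print(f"{phase}: {bray_curtis(baseline, profile):.2f}")
```

In this invented example the distance rises during the intervention and falls back near zero after washout, which corresponds to the "Altered-Recovery" pattern. A distance that stays high after washout would correspond to "Altered-Altered".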
Rapid and resilient bacterial gut response after diet alteration
[Figure: distance from baseline across Baseline, Intervention, and Washout; annotated ***]
Diet-specific functional total community recovery (mostly bacterial)
[Figure: distance from baseline (0.00 to 0.10) across Baseline, Diet Perturbed, and Washout]
Free-living viruses in MF baseline are significantly altered without recovery.
[Figure: distance from baseline (0.0 to 0.3) across Baseline, Diet Perturbed, and Washout; annotated ***]
Prophages in MF baseline are significantly altered without recovery.
[Figure: distance from baseline (0.0 to 0.3) across Baseline, Diet Perturbed, and Washout; annotated ***]
“Combat Zone” as diets change
Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in
which there is not a rapid recovery of viral communities
Viral functions significantly
changed during the milk-fat
baseline diet
Decreases in
Phage-related (p=0.01)
Iron acquisition (p<0.01)
Nucleotide metabolism (p=0.02)
Carbohydrate metabolism (p=0.01)
Motility and chemotaxis (p=0.03)
Virulence and defense (p=0.03)
[Figure panels: Phage; Iron; Nucleotide; Carbs; Flagella. x-axis: Baseline, Change, Washout]
Bacteroides (Bacteroidetes)
Clostridium (Firmicutes)
Eubacterium (Firmicutes)
Significant decrease in genes associated with MF baseline viruses
Ratio of Firmicutes to
Bacteroidetes associated with
obesity
Turnbaugh, 2008
Image sources: Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic
Turnbaugh, 2009
Viromes potentially critical in gut
microbiome response.
Members of the gut microbiome community do not
have co-occurring responses.
Loss of viral population and diversity is diet
specific (related to a milk-fat to low-fat diet
transition)
Ability to redirect structure and function of the
microbiome makes viromes pivotal drivers of health and
disease
Reyes et al., Nature Reviews Microbiology, 2012
Virome directly causes host response
Germ-free 11-week-old mice (n = 3)
Diet: Standard chow
3-week conventionalization
A “standard control” microbiome: uniform cecal content of standard chow mice
Experimentally introduced viruses
Mouse Treatment 1: Low-fat baseline VLP
Mouse Treatment 2: Milk-fat baseline VLP
Control: Buffer
Significant decrease of intestinal
inflammation in LF VLP treatments
Pro-inflammatory cytokines in mucosal scrapings: TNF-α, IFN-γ
[Figure: proximal colon TNF-α (axis 0 to 15) and IFN-γ (ng/g, axis 0 to 30) for Control, LF VLPs, and MF VLPs; * marked on the IFN-γ panel]
Conclusions
Gut microbiome has reproducible and distinct responses to diet.
Viruses have a unique response to diet perturbations and do not co-occur with bacteria.
Viruses observed to cause inflammation in infected germ free mice.
Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome
Future work
Data-discovery is a national
investment.
Data-driven biological
investigations
MICROBES
IN
ECOSYSTEMS
NATURE
WATER
SOIL
MICROBIOMES
HUMANS/ANIMAL
ENGINEERED
WASTEWATER
High Throughput Frameworks:
Metagenomic
Metatranscriptomic
Metaproteomic
More relevant model
systems
Improved biomarkers
Scaling approaches
Big data computation
Data driven discovery
Core research values
Research that matters
Developing scientific frameworks that enable
open-science initiatives (reproducible science)
Computational and experimental integration
Scale and power to multi-disciplinary
approaches
Team value
Flexibility
Going viral: The role of the human gut
phageome in inflammatory bowel disease
Objectives:
Define and compare core phageomes associated with healthy and diseased gut microbiomes
Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease)
NIH, National Institute of Diabetes and Digestive and
Kidney Diseases; National Institute of Allergy and Infectious
Diseases ($3-5M)
Source: Nature.com
What is the role of host-phage
dynamics in the development of
intestinal diseases?
Integration of multiple datasets
Improved model systems and
biomarkers
Microbial drivers of carbon metabolism and
warming
DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016)
Source: Oak Ridge National Laboratory
Contributions:
• Omic-based characterization of carbon cycling microorganisms
in the soil
• Novel approaches to target carbon cycling subsets of
community
• Improved soil genomic databases to enable future carbon
studies
How do microbes contribute to
carbon cycling models?
Big data scaling
Integration of multiple
datasets
Improved model systems
Large-scale characterization of global dark
matter proteins in complex biological
environments
NIH – Development of Software and Analysis Methods for Biomedical
Big Data in Targeted Areas of High Need
(~$1M/3 years)
Gordon and Betty Moore – Data Driven Discovery Investigator Awards
($1.5M / 5 years)
Novel extension of current software tools:
• Integration of growing volumes of global public datasets with scalable
data-mining analysis
• Lightweight data architecture to compare abundance and co-
occurrence of sequencing patterns across multiple samples and
associated metadata to elucidate information
How do we access the novelty observed in metagenomic datasets?
Big data scaling
Integration of datasets
From field to food: The origin and
fate of our microbiomes
USDA Agriculture and Food Research Initiative ($1-2.5M)
• Identify and characterize under-
researched foodborne microbial hazards
and effective control strategies
• Elucidate fate and dissemination of
foodborne microbial hazards associated
with produce production and processing
Source: aboretum.umn.edu
Where do harmful microbes in our food come
from and how do we protect ourselves from
them?
Integration of multiple datasets
Improved model systems and
biomarkers
Acknowledgements
Funding:
DOE Microbial Carbon Cycling Grant
NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center
Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant
My Awesome INTER-DISCIPLINARY Team:
C. Titus Brown (MSU) + lab (Bioinformatics)
James Tiedje (MSU) + lab (Microbial Ecology)
Daina Ringus (UC) (Microbiology / Mice)
Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)
Eugene Chang (UC)
Folker Meyer (ANL)
Questions?
Reducing data, not information.
More efficient data storage and mining.
Big data scaling approaches
Storage of biological big data
What other sequences are connected to
Sequence X?
Data broken into words of length “k” (k-mers)
Overlap (for assembly) = shared “word”
Pell, PNAS, 2012
Howe, PNAS, 2014
AGTCAGTT
Into its 4-mers:
AGTC
GTCA
TCAG
CAGT
AGTT
AGAAAGTC
Into its 4-mers:
AGAA
GAAA
AAAG
AAGT
AGTC
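The decomposition into words can be reproduced with a few lines of code. This is a minimal sketch; the function name `kmers` is an illustrative choice and not taken from any particular library.

```python
def kmers(seq: str, k: int):
    """Return every overlapping word of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


a = kmers("AGTCAGTT", 4)
b = kmers("AGAAAGTC", 4)
shared = set(a) & set(b)   # words present in both sequences

print(a)       # ['AGTC', 'GTCA', 'TCAG', 'CAGT', 'AGTT']
print(b)       # ['AGAA', 'GAAA', 'AAAG', 'AAGT', 'AGTC']
print(shared)  # {'AGTC'}
```

The single shared word, AGTC, is the overlap that connects the two sequences for assembly.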
How do we store “big data” words?
Bloom filter data structure
Efficient storage
Do I have mail?
Mailbox analogy
A-G H-R S-Z
Is Sequence A connected to Sequence B?
Mailbox analogy – Efficient storage of information
A-G H-R S-Z
A-G* H-R S-Z
No mail for Howe, 100% sure.
A-G H-R* S-Z
Possibly mail for Howe.
Do I have mail?
A-G H-R S-Z
A-G H-R* S-Z
G-N* A-F; O-T U-Z
D-H* A-C; I-O P-Z
Howe mail status:
Mail possibility higher.
Do I have mail?
A-G H-R S-Z
A-G H-R* S-Z
G-N* A-F; O-T U-Z
D-H A-C; I-O P-Z
Howe mail status:
No mail, 100% sure.
Do I have mail?
Bloom filter data structure
“Probabilistic” data structure
Decrease of false positive rate with multiple
Bloom filters – “More likely I have mail”
No false negatives – “No mail. 100% sure”
For the win: both detects and counts presence
of sequences (k-mers) and their connectivity
efficiently
Is sequence A connected to sequence B?
Pell, PNAS, 2012
Howe, PNAS, 2014
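The mailbox analogy maps onto a small Bloom-filter-style sketch: several tables of different sizes play the role of the differently partitioned mailboxes, and a word counts as "possibly present" only if its slot is marked in every table. The class name, the table sizes, and the SHA-256-based hashing are assumptions made for illustration. This sketch records presence only, uses one byte per slot for readability, and is not the implementation from the cited papers.

```python
import hashlib


class BloomFilter:
    """Several tables of different sizes; a word is 'present' only if set in every table."""

    def __init__(self, table_sizes=(1009, 1013, 1019)):   # distinct primes, chosen arbitrarily
        self.tables = [bytearray(size) for size in table_sizes]

    def _slots(self, word: str):
        # One deterministic hash value, reduced modulo each table size.
        digest = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        return [digest % len(table) for table in self.tables]

    def add(self, word: str) -> None:
        for table, slot in zip(self.tables, self._slots(word)):
            table[slot] = 1

    def __contains__(self, word: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(table[slot] for table, slot in zip(self.tables, self._slots(word)))


bf = BloomFilter()
for kmer in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:
    bf.add(kmer)

print("AGTC" in bf)   # True: inserted words are always found (no false negatives)
print("GGGG" in bf)   # usually False; a True here would be a false positive
```

Adding more tables lowers the chance that an absent word collides in all of them, which is the "more likely I have mail" step of the analogy. Memory use is fixed by the table sizes, not by the number of words stored.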