Sweden_eemis_big_data

Date post: 17-Jul-2015
Upload: adina-chuang-howe
Transcript
Page 1: Sweden_eemis_big_data

WHAT TO DO IN THE EVENT OF A DATA DELUGE

EEMiS Workshop, Öland, Sweden, 12/1/2014

Adina Howe

germslab.org (Genomics and Environmental Research in Microbial Systems)

Iowa State University, Ag & Biosystems Engr (January)

Slides available at www.slideshare.com/adinachuanghowe

NGS SEQUENCING

Page 2: Sweden_eemis_big_data

HOW DID WE GET HERE

Page 4: Sweden_eemis_big_data

Understanding community dynamics

Who is there?

What are they doing?

How are they doing it?

Kim Lewis, 2010

Page 5: Sweden_eemis_big_data

Gene / Genome Sequencing

Collect samples

Extract DNA

Sequence DNA

“Analyze” DNA to identify its content and origin

Taxonomy (e.g., pathogenic E. coli)

Function (e.g., degrades cellulose)

Page 6: Sweden_eemis_big_data

Cost of Sequencing

Stein, Genome Biology, 2010

E. coli genome: 4,500,000 bp ($4.5M, 1992)

[Figure: DNA Sequencing, Mbp per $ (log scale) versus Year, 1990 to 2012]

Page 7: Sweden_eemis_big_data

Rapidly decreasing costs with NGS Sequencing

Stein, Genome Biology, 2010

Next Generation Sequencing

E. coli genome: 4,500,000 bp ($200, presently)

[Figure: DNA Sequencing, Mbp per $ (log scale) versus Year, 1990 to 2012, with Next Generation Sequencing marked]
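
The two E. coli price points quoted on these slides can be turned into a cost per base pair. The following is a minimal sketch that uses only the figures from the slides (4,500,000 bp; $4.5M in 1992; $200 at the time of the talk). The function and variable names are illustrative.

```python
# Cost per base pair for the E. coli genome at the two price points quoted above.
GENOME_BP = 4_500_000  # E. coli genome size from the slides

def cost_per_bp(total_cost_usd: float, genome_bp: int = GENOME_BP) -> float:
    """Return the sequencing cost in dollars per base pair."""
    return total_cost_usd / genome_bp

cost_1992 = cost_per_bp(4_500_000)  # $4.5M in 1992
cost_ngs = cost_per_bp(200)         # $200 with next generation sequencing
fold_drop = cost_1992 / cost_ngs    # how many times cheaper

print(f"1992: ${cost_1992:.2f} per bp")
print(f"NGS:  ${cost_ngs:.6f} per bp")
print(f"Fold decrease: {fold_drop:,.0f}x")
```

With these inputs the cost falls from $1 per bp to roughly $0.00004 per bp, a 22,500-fold decrease.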

Page 8: Sweden_eemis_big_data

Effects of low cost sequencing…

First free-living bacterium sequenced for billions of dollars and years of analysis

Personal genome can be mapped in a few days and for hundreds to a few thousand dollars

Page 9: Sweden_eemis_big_data

The experimental continuum

Single Isolate

Pure Culture

Enrichment

Mixed Cultures

Natural systems

Page 10: Sweden_eemis_big_data

The era of big data in biology

Stein, Genome Biology, 2010

Computational Hardware (doubling time 14 months)

Sanger Sequencing (doubling time 19 months)

NGS (Shotgun) Sequencing (doubling time 5 months)

[Figure: Disk Storage, Mb/$ and DNA Sequencing, Mbp per $ (log scales) versus Year, 1990 to 2012]
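
The gap between the three doubling times is easier to see as a fold increase in output per dollar over a fixed period. The sketch below takes the doubling times from the slide and assumes they stay constant; the five-year horizon is an arbitrary choice for illustration, not a figure from the talk.

```python
# Growth implied by a constant doubling time: factor = 2 ** (elapsed / doubling_time).
DOUBLING_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}

def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Fold increase after elapsed_months given a fixed doubling time."""
    return 2 ** (elapsed_months / doubling_months)

YEARS = 5  # illustrative horizon, not from the slides
for name, months in DOUBLING_MONTHS.items():
    print(f"{name}: {growth_factor(12 * YEARS, months):,.0f}x in {YEARS} years")
```

Over five years, a 5-month doubling time compounds to a 4,096-fold increase (2^12), against roughly 20-fold for a 14-month doubling time. That difference is the reason sequencing output outruns the hardware available to analyze it.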

Page 11: Sweden_eemis_big_data

Postdoc experience with data

2003-2008 Cumulative sequencing in PhD = 2000 bp

2008-2009 Postdoc Year 1 = 50 Gbp

2009-2010 Postdoc Year 2 = 450 Gbp

2014 = 50 Tbp

2015 = 500 Tbp budgeted

Page 12: Sweden_eemis_big_data
Page 13: Sweden_eemis_big_data

THE DIRT ON SOIL

Biodiversity in the dark, Wall et al., Nature Geoscience, 2010

Jeremy Burgress

MAGNIFICENT BIODIVERSITY

Page 14: Sweden_eemis_big_data

THE DIRT ON SOIL

SPATIAL HETEROGENEITY

http://www.fao.org/ www.cnr.uidaho.edu

Page 15: Sweden_eemis_big_data

THE DIRT ON SOIL

DYNAMIC

Page 16: Sweden_eemis_big_data

THE DIRT ON SOIL

INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES

Philippot, 2013, Nature Reviews Microbiology

Page 17: Sweden_eemis_big_data

I. Methods to tackle metagenomic datasets

Computational

Experimental

II. Bottlenecks for microbiologists

Page 18: Sweden_eemis_big_data

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

Page 19: Sweden_eemis_big_data

A Slight Digression: Decisions for the new microbial ecologist

Page 22: Sweden_eemis_big_data

Getting the most out of your data

Complex Samples

16S rRNA amplicon sequencing

Pros:

1) Commonly used approach

2) Deep characterization

Cons:

1) Limited knowledge

2) Resolution remains low

Shotgun sequencing: Assembly based

Pros:

1) Large contigs

2) Positional information

3) Most direct method to identify novel orgs/genes

Cons:

1) Computational resource intensive

2) Assembling difficulties

• Sequencing error

• Genomic redundancy - chimeras

Shotgun sequencing: Read-based / Mapping Methods

Pros:

1) Massive data

2) Identity and abundance answered simultaneously

3) Look at all data**

Cons:

1) Massive data (short + with errors)

2) Lack of specificity due to FPs from genomic redundancy

3) Difficult to detect novel genomes – must infer

Reference database size reduction

Faster search algorithms

Query size reduction

a. Selection of marker genes

b. Identification of signatures (k-mers)

a. Exact match

b. K-mer based search

c. Improved algorithm

a. Clustering

b. Pattern matching

a. Assembly / clustering

b. Unique k-mer

ID, Abundance, Function

Patrick Chain
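
Several of the read-based options above rest on k-mers (all substrings of length k in a sequence). As a minimal sketch of what a "k-mer based search" over "unique k-mers" could look like, the code below indexes the k-mers that occur in exactly one reference and assigns a read to the reference whose signature k-mers it hits most often. The reference names, the sequences, and the value of k are invented for illustration, and this is not the method of any particular tool.

```python
from collections import Counter, defaultdict

def kmers(seq: str, k: int):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_signature_index(references: dict, k: int) -> dict:
    """Map each k-mer found in exactly one reference to that reference's name."""
    owners = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    return {kmer: next(iter(names)) for kmer, names in owners.items() if len(names) == 1}

def classify(read: str, index: dict, k: int):
    """Return the reference with the most signature k-mer hits, or None."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    return hits.most_common(1)[0][0] if hits else None

# Hypothetical toy references (not real genomes).
references = {
    "org_A": "ATGGCGTACGTTAGC",
    "org_B": "ATGGCGTTTTCCGGA",
}
K = 5
index = build_signature_index(references, K)
print(classify("GTACGTTAG", index, K))  # read drawn from org_A
print(classify("TTTTCCGG", index, K))   # read drawn from org_B
print(classify("ATGGCGT", index, K))    # read from the region both references share
```

A read that falls entirely in the shared region gets no assignment, which is the "lack of specificity from genomic redundancy" problem in miniature.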

Page 23: Sweden_eemis_big_data

Example #1: Data compression

http://siliconangle.com/files/2010/09/image_thumb69.png

Page 24: Sweden_eemis_big_data

de novo assembly

Compresses dataset size significantly

Improved data quality (longer sequences, gene order)

Reference not necessary (novelty)

Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genomes

Page 25: Sweden_eemis_big_data

Metagenome assembly…a scaling problem.

Page 26: Sweden_eemis_big_data

Shotgun sequencing and de novo assembly

It was the Gest of times, it was the wor

, it was the worst of timZs, it was the

isdom, it was the age of foolisXness

, it was the worVt of times, it was the

mes, it was Ahe age of wisdom, it was th

It was the best of times, it Gas the wor

mes, it was the age of witdom, it was th

isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
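
The idea on this slide can be sketched with a toy greedy assembler: repeatedly merge the two sequences that share the longest suffix-to-prefix overlap. This is a simplification under stated assumptions. The reads here are error-free (the fragments on the slide contain deliberate errors), the read length, spacing, and minimum overlap are illustrative choices, and real assemblers use graph-based methods instead of this quadratic search.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len: int = 25):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

SENTENCE = ("It was the best of times, it was the worst of times, "
            "it was the age of wisdom, it was the age of foolishness")
READ_LEN, STEP = 40, 10  # illustrative values, not from the slides
starts = list(range(0, len(SENTENCE) - READ_LEN, STEP)) + [len(SENTENCE) - READ_LEN]
reads = [SENTENCE[s:s + READ_LEN] for s in starts]

contigs = greedy_assemble(reads)
print(len(reads), "reads ->", len(contigs), "contig(s)")
print(contigs[0])
```

The minimum overlap is set above the longest repeated phrase in the sentence ("st of times, it was the ", 24 characters). Shorter overlaps would let the repeated clauses be joined in the wrong order, which is the text analogue of a chimeric contig.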

Page 28: Sweden_eemis_big_data

Practical Challenges: Intensive computing

Howe et al., 2014, PNAS

Months of “computer crunching” on a supercomputer

Assembly of 300 Gbp (70,000 genomes worth) can be done with any assembly program in less than 14 GB RAM and less than 24 hours.

50 Gbp = 10,000 genomes

Page 32: Sweden_eemis_big_data

Natural community characteristics

Diverse

Many organisms (genomes)

Variable abundance

Most abundant organisms, sampled more often

Assembly requires a minimum amount of sampling

More sequencing, more errors

[Figure labels: Sample 1x, Sample 10x, Overkill]
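
The "Sample 1x, Sample 10x, Overkill" point can be made with a back-of-the-envelope coverage calculation: the expected coverage of a genome is the number of sequenced bases drawn from it divided by its length. The sketch below assumes each organism contributes DNA in proportion to a fixed fraction of the sample. The community, genome sizes, and target coverage are invented for illustration.

```python
def expected_coverage(total_bp: float, dna_fraction: float, genome_bp: float) -> float:
    """Expected fold coverage of one genome given its share of the sequenced DNA."""
    return total_bp * dna_fraction / genome_bp

def depth_for_target(target_cov: float, dna_fraction: float, genome_bp: float) -> float:
    """Total bp to sequence so that one genome reaches target_cov."""
    return target_cov * genome_bp / dna_fraction

# Hypothetical community: (fraction of sequenced DNA, genome size in bp).
community = {
    "abundant": (0.90, 5_000_000),
    "moderate": (0.09, 5_000_000),
    "rare": (0.01, 5_000_000),
}

total_bp = depth_for_target(10, *community["rare"])  # enough for 10x of the rare genome
print(f"Sequencing needed: {total_bp / 1e9:.1f} Gbp")
for name, (fraction, size) in community.items():
    print(f"{name}: {expected_coverage(total_bp, fraction, size):,.0f}x")
```

In this toy community, sequencing deeply enough to reach 10x on the rare genome leaves the abundant one at about 900x, far more than assembly needs.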

Page 38: Sweden_eemis_big_data

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Scales down datasets for assembly by up to 95%, with the same assembly outputs.

Genomes, mRNA-seq, metagenomes (soils, gut, water)
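
Digital normalization, as described in Brown et al. (2012), streams through the reads, estimates each read's coverage as the median abundance of its k-mers among the reads kept so far, and discards the read once that median reaches a cutoff. A minimal sketch follows. It keeps exact k-mer counts in a Python dictionary, whereas the published implementation uses a memory-efficient probabilistic counting structure, and the values of k and the cutoff are toy values chosen for the example.

```python
from collections import Counter
from statistics import median

def kmers(seq: str, k: int):
    """Return every substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k: int = 4, cutoff: int = 3):
    """Keep a read only if the median count of its k-mers so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept

# Toy input: one sequence sampled 10 times, another sampled once.
abundant = "ATGGCGTACGTTAGC"
rare = "TTGACCAGTCAGGAT"
reads = [abundant] * 10 + [rare]

kept = digital_normalize(reads)
print(f"kept {len(kept)} of {len(reads)} reads")
```

The abundant sequence is capped at the cutoff while the rare one is retained, so the data shrinks without losing the low-coverage content that assembly still needs.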

Page 40: Sweden_eemis_big_data

The reality?

Page 41: Sweden_eemis_big_data

More like…

Source: Chuck Haney

Howe et al., 2014, PNAS

Page 42: Sweden_eemis_big_data

What we learned from deeply sequencing soil

Grand Challenge effort: 10% of soil biodiversity sampled

Incredible soil biodiversity (estimated 10 Tbp/sample required)

“To boldly go where no man has gone before”: >60% Unknown

[Figure: bar chart of Total Count KO (axis 0 to 400) by functional category, with series "corn and prairie", "corn only", and "prairie only". Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]

Howe et al, 2014, PNAS

Managed agricultural soils exhibit less diversity, likely because of their history of cultivation.

Page 43: Sweden_eemis_big_data

Megahit

Page 44: Sweden_eemis_big_data

Example #2: Experimental partitioning

Page 45: Sweden_eemis_big_data

We are supraorganisms

Page 46: Sweden_eemis_big_data

Is the gut community going viral?

Bacterial cells

Bacterial cells infected with bacteriophage

Viruses (Bacteriophage)

Prophage

Who is in the gut microbiome?

Page 47: Sweden_eemis_big_data

Novel experimental design with high throughput sequencing

Faecal samples (nondestructive sampling)

I. Total DNA (bacteria + prophage + viruses): TOT

II. Virus-like particles (free-living viruses): VLP

III. Induced prophage: IND

Processing steps labelled in the diagram: separation by size, separation by density, chemically separate

+ Evolutionary biomarker (taxonomy analysis): 16S rRNA gene from total DNA

Page 48: Sweden_eemis_big_data

Is the gut community going viral?

Reyes et al., Nature Reviews Microbiology, 2012

Page 49: Sweden_eemis_big_data

Research Questions

What are the impacts of different diets on gut microbiome response?

What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)?

Multidisciplinary approach combining

novel experimental targeting of both bacterial and viral communities

metagenomic-based sequencing to characterize community

Page 50: Sweden_eemis_big_data

Two diets (with a perturbation)

Low-fat (LF) baseline diet: LF / 10% Fat / Complex Carbs

Milk-fat (MF) baseline diet: MF / 37% Fat / Simple Sugars

[Timeline figure: Age (wk) 4 to 14; phases: Baseline, Diet Switch, Washout (Return to Baseline); diet labels LF and MF; fecal samples; weight of mice]

Total community function: TOT metagenomic sequencing at weeks 8, 11, 14

Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14

Bacterial (only) community structure: 16S rRNA at weeks 8 through 14

Page 51: Sweden_eemis_big_data

Outcomes?

Low-fat (LF) baseline diet: LF / 10% Fat / Complex Carbs

Milk-fat (MF) baseline diet: MF / 37% Fat / Simple Sugars

[Timeline figure: Age (wk) 4 to 14; phases: Baseline, Diet Switch, Washout (Return to Baseline); diet labels LF and MF]

For The Win:

Qualitative and Quantitative Measurements

Page 52: Sweden_eemis_big_data

“Combat Zone” of phage-host interactions

Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in which there is not a rapid recovery of viral communities.

Instability of gut microbiome? Availability of genes through viral infection?

Page 53: Sweden_eemis_big_data

Is more data better?

Bottlenecks for emerging microbiologists

Page 54: Sweden_eemis_big_data

Technical obstacles in the big data deluge

Access to the data

Access to the resources

Democratization of both data and resource access

“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)

Data volume and velocity

Previous efforts are difficult to integrate

Innovation is necessary

Page 55: Sweden_eemis_big_data

Software Developers

Computer Scientists

Clinicians

PIs

Data generators

Microbiologists

Data Analyzers

Statisticians

Bioinformaticians

http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html

Data intensive microbiology

Page 56: Sweden_eemis_big_data
Page 57: Sweden_eemis_big_data
Page 58: Sweden_eemis_big_data

Social obstacles: the main challenge

A shift in costs does not mean a shift in expectations

http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/

Dear PI,

It will take longer to analyze the data than it took you to do your experiment. Please do not write me for results within 24 hours of your sequences becoming available.

- Adina

Page 59: Sweden_eemis_big_data

Culture of sharing

http://www.heathershumaker.com/

Metagenomic Datasets

Page 60: Sweden_eemis_big_data

Training / Incentives

Emails between collaborators don’t contain as much “science” as I’d like:

Page 62: Sweden_eemis_big_data

All analysis: accessible, reproducible, and automated

To reproduce analysis in a publication,

1. Rent Amazon EC2 computer

2. Clone GitHub repository containing data and scripts

3. Open IPython notebook and execute

To run same analysis on different dataset,

1. Replace data files with your own data, execute notebook.

2. Tweak scripts as needed.
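
The clone-and-execute steps can themselves be scripted. The sketch below only prints the commands unless `dry_run=False` is passed. The repository URL and notebook name are placeholders, not the author's, and it assumes `git` and `jupyter nbconvert` are installed on the rented machine. The talk mentions IPython notebooks; `jupyter nbconvert --execute` is one present-day way to run a notebook non-interactively.

```python
import subprocess

def reproduction_commands(repo_url: str, workdir: str, notebook: str):
    """Build the commands that clone a repository and execute a notebook in it."""
    return [
        ["git", "clone", repo_url, workdir],
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         f"{workdir}/{notebook}", "--output", "executed.ipynb"],
    ]

def reproduce(repo_url: str, workdir: str, notebook: str, dry_run: bool = True):
    """Print the commands; also run them when dry_run is False."""
    commands = reproduction_commands(repo_url, workdir, notebook)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands

# Placeholder repository and notebook names (hypothetical).
commands = reproduce(
    "https://github.com/example-lab/example-analysis.git",
    "example-analysis",
    "analysis.ipynb",
)
```

To run the same analysis on a different dataset, the data files in the cloned directory would be replaced before the notebook is executed.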

Page 63: Sweden_eemis_big_data

Acknowledgements

C. Titus Brown (MSU)

James Tiedje (MSU)

Daina Ringus (UC)

Folker Meyer (ANL)

Eugene Chang (UC)

NSF Biology Postdoc Fellowship

DOE Great Lakes Bioenergy Research Center