Scalable data mining for functional genomics and metagenomics

Date post: 23-Feb-2016
Page 1: Scalable data mining for functional genomics and metagenomics

Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower

01-06-10 11

Harvard School of Public Health
Department of Biostatistics

Page 2: Scalable data mining for functional genomics and metagenomics

2

What tools enable biological discoveries?

Our job is to create computational microscopes:

To ask and answer specific biomedical questions using

millions of experimental results

Page 3: Scalable data mining for functional genomics and metagenomics

3

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Modeling microbial communities for public health

Page 4: Scalable data mining for functional genomics and metagenomics

4

A computational definition of functional genomics

[Diagram: Genomic data and Prior knowledge, with mappings Data → Function, Function → Function, Gene → Gene, and Gene → Function]

Page 5: Scalable data mining for functional genomics and metagenomics

5

A framework for functional genomics

[Diagram: gene pairs with known labels G1-G2 (+), G4-G9 (+), …, G3-G6 (-), G7-G8 (-), …, and query pair G2-G5 (?)]

[Example dataset rows over those gene pairs:
0.9 0.7 … 0.1 0.2 … 0.8
+ - … - - … +
0.8 0.5 … 0.05 0.1 … 0.6]

[Histograms of Frequency for each dataset: High Correlation vs. Low Correlation; Let. vs. Not let.; Similar vs. Dissim.; High Similarity vs. Low Similarity]

P(G2-G5|Data) = 0.85

100Ms of gene pairs × 1Ks of datasets
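The slide does not name the classifier behind P(G2-G5|Data). One common way to turn per-dataset frequency histograms for labeled pairs into a posterior for an unlabeled pair is a naive Bayes classifier, which is the assumption in this minimal sketch. The function names, the pseudocount, and the toy bins ("high"/"low", "similar"/"dissim") are illustrative only.

```python
from collections import Counter


def train_naive_bayes(labels, datasets, pseudocount=1.0):
    """Learn per-dataset bin frequencies for related (+) and unrelated (-) pairs.

    labels:   list of bools, one per training gene pair
    datasets: list of lists; datasets[d][k] is the discretized value
              of dataset d for training pair k
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    prior = n_pos / len(labels)
    tables = []
    for values in datasets:
        bins = sorted(set(values))
        pos = Counter(v for v, y in zip(values, labels) if y)
        neg = Counter(v for v, y in zip(values, labels) if not y)
        # Smoothed P(bin | related) and P(bin | unrelated) for every bin.
        tables.append({
            b: ((pos[b] + pseudocount) / (n_pos + pseudocount * len(bins)),
                (neg[b] + pseudocount) / (n_neg + pseudocount * len(bins)))
            for b in bins
        })
    return prior, tables


def posterior(prior, tables, observation):
    """P(related | data) for one gene pair; None or unseen bins are skipped."""
    p_pos, p_neg = prior, 1.0 - prior
    for table, value in zip(tables, observation):
        if value is None or value not in table:
            continue  # dataset has no usable measurement for this pair
        like_pos, like_neg = table[value]
        p_pos *= like_pos
        p_neg *= like_neg
    return p_pos / (p_pos + p_neg)


if __name__ == "__main__":
    labels = [True, True, False, False]
    datasets = [["high", "high", "low", "low"],
                ["similar", "similar", "dissim", "dissim"]]
    prior, tables = train_naive_bayes(labels, datasets)
    print(posterior(prior, tables, ["high", "similar"]))
```

Because each dataset contributes an independent likelihood term, adding a dataset only adds one table, which is what lets the approach scale to thousands of datasets.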

Page 6: Scalable data mining for functional genomics and metagenomics

6

Functional network prediction and analysis

[Figure panels: Global interaction network; Carbon metabolism network; Extracellular signaling network; Gut community network]

Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

HEFalMp

Page 7: Scalable data mining for functional genomics and metagenomics

7

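Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$

Simple regression: All datasets are equally accurate

Random effects: Variation within and among datasets and interactions

[Model equations for $y_{e,i}$ under each setting: not recoverable from this transcript]

$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$

$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$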
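As a hedged illustration of the random-effects idea, the sketch below combines one interaction's Fisher z-transformed scores across datasets, using weights of the form 1 / (within-dataset variance + among-dataset variance). The DerSimonian-Laird estimator for the among-dataset variance and the normalization of the weights to sum to one are assumptions; the slide does not say which estimator was used.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects_mean(effects, variances):
    """Combine per-dataset effects y_i with within-dataset variances s_i^2.

    Returns (pooled mean, estimated among-dataset variance tau^2).
    tau^2 is estimated with the DerSimonian-Laird method (an assumption).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]            # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity among datasets.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights: 1 / (s_i^2 + tau^2), normalized to sum to 1.
    w_star = [1.0 / (v + tau2) for v in variances]
    total = sum(w_star)
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / total
    return mean, tau2


if __name__ == "__main__":
    # Correlations of one gene pair in three hypothetical datasets.
    zs = [fisher_z(r) for r in (0.8, 0.5, 0.6)]
    # Var(z) is approximately 1 / (n - 3) for n conditions per dataset.
    variances = [1.0 / (n - 3) for n in (20, 50, 12)]
    print(random_effects_mean(zs, variances))
```

When the datasets agree, the estimated among-dataset variance shrinks to zero and the result reduces to the fixed-effect, inverse-variance weighted mean.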

Page 8: Scalable data mining for functional genomics and metagenomics

8

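Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$

[Figure: individual networks combined (+ … =) into one integrated network]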

Page 9: Scalable data mining for functional genomics and metagenomics

9

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

Page 10: Scalable data mining for functional genomics and metagenomics

10

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

X?

Page 11: Scalable data mining for functional genomics and metagenomics

11

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Modeling microbial communities for public health

Page 12: Scalable data mining for functional genomics and metagenomics

12

What to do with your metagenome?

(x10^10)

Who’s there?

What are they doing?

Diagnostic or prognostic biomarker for host disease

Public health tool monitoring population health and interactions

Comprehensive snapshot of microbial ecology and evolution

Reservoir of gene and protein functional information

What do functional genomic data tell us about microbiomes?

What can our microbiomes tell us about us?*

*Using terabases of sequence and thousands of experimental results

Page 13: Scalable data mining for functional genomics and metagenomics

13

The Human Microbiome Project

2007 - ongoing

• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
  • Skin: ears, inner elbows
  • Nasal cavity
  • Gut: stool
  • Vagina: introitus, mid, fornix
• Reference genomes (~200+800)

All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic resistant infection…

Hamady, 2009

Kolenbrander, 2010

Page 14: Scalable data mining for functional genomics and metagenomics

14

HMP Organisms: Everyone and everywhere is different

[Heatmap: ← Body sites + individuals → by ← Organisms (taxa) →; body sites: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue]

Every microbiome is surprisingly different

Most organisms are rare in most places

Even common organisms vary tremendously in abundance among individuals

Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants

There are few, if any, organismal biotypes in health

Page 15: Scalable data mining for functional genomics and metagenomics

15

HMP: Metabolic reconstruction

[Pipeline diagram: WGS reads; Genes (KOs); Pathways (KEGGs); Pathways/modules]

Functional seq.: KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS…

300 subjects, 1-3 visits/subject, ~6 body sites/visit

10-200M reads/sample, 100bp reads

BLAST → Genes

[Equation for gene abundance $c(g)$ from BLAST hits: not recoverable from this transcript]

Genes → Pathways: MinPath (Ye 2009)

Taxonomic limitation: Rem. paths in taxa < ave.

Smoothing: Witten-Bell

$c(g) = \begin{cases} \frac{T}{V-T} \cdot \frac{N}{N+T} & c(g) = 0 \\ c(g)\,\frac{N}{N+T} & \text{otherwise} \end{cases}$

Gap filling: c(g) = max( c(g), median )

Xipe: Distinguish zero/low (Rodriguez-Mueller in review)
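The smoothing and gap-filling steps named on this slide can be sketched as follows. This is not the HUMAnN implementation. It assumes the usual Witten-Bell quantities (N = total observed count, T = number of genes with nonzero count, V = total number of genes considered), so that observed counts are scaled by N / (N + T) and each unobserved gene receives an equal share of the remaining mass. It also assumes that the "median" in gap filling is taken over the genes of a single pathway. Gene and pathway identifiers are made up.

```python
from statistics import median


def witten_bell(counts):
    """Witten-Bell smoothing of gene abundances; keeps the total equal to N.

    counts: dict mapping every gene in the vocabulary to its observed
            abundance (0 for genes that were not observed).
    """
    V = len(counts)                                   # vocabulary size
    T = sum(1 for c in counts.values() if c > 0)      # observed gene types
    N = sum(counts.values())                          # total abundance
    if T == 0 or T == V:
        # Nothing observed, or nothing unobserved: leave counts unchanged.
        return dict(counts)
    scale = N / (N + T)
    unseen = (T / (V - T)) * scale
    return {g: (c * scale if c > 0 else unseen) for g, c in counts.items()}


def gap_fill(counts, pathway_genes):
    """Raise each gene in one pathway to at least the pathway's median.

    Taking the median over the pathway's own genes is an assumption.
    """
    med = median(counts[g] for g in pathway_genes)
    return {g: max(counts[g], med) for g in pathway_genes}


if __name__ == "__main__":
    observed = {"K1": 6, "K2": 2, "K3": 0, "K4": 0}   # hypothetical KO counts
    smoothed = witten_bell(observed)
    print(smoothed)
    print(gap_fill(smoothed, ["K1", "K2", "K3"]))
```

Smoothing redistributes a small amount of abundance to unobserved genes without changing the sample total, while gap filling guards against a single poorly detected gene pulling down an otherwise well-supported pathway.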

Page 16: Scalable data mining for functional genomics and metagenomics

16

HMP: Metabolic reconstruction

[Figure panels: Pathway coverage; Pathway abundance]

Page 17: Scalable data mining for functional genomics and metagenomics

17

HMP: Metabolic reconstruction

[Heatmap: Pathway abundance; ← Samples → by ← Pathways →]

Page 18: Scalable data mining for functional genomics and metagenomics

18

HMP: Metabolic reconstruction

[Heatmap: Pathway coverage; ← Samples → by ← Pathways →]

Aerobic body sites

Gastrointestinal body sites

All body sites (“core”)

Page 19: Scalable data mining for functional genomics and metagenomics

19

Metagenomic biomarker discovery

Gene expression, SNP genotypes

Healthy/IBD, BMI, Diet

Taxa & pathways

Batch effects? Population structure?

Niches & Phylogeny

Test for correlates

Multiple hypothesis correction

Feature selection

p >> n

Confounds/stratification/environment

Cross-validate

Biological story?

Independent sample

Intervention/perturbation
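The slide lists multiple hypothesis correction as a step after testing each taxon or pathway for correlates, but does not name a method. As one possible instance, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a list of p-values; the choice of this procedure and the example values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values, one per tested feature (taxon or pathway).
    raw = [0.01, 0.04, 0.03, 0.5]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f}  q={q:.3f}")
```

With p >> n, thousands of features may be tested against a few hundred samples, so an unadjusted 0.05 threshold would be expected to pass many features by chance alone.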

Page 20: Scalable data mining for functional genomics and metagenomics

20

LEfSe: Metagenomic class comparison and explanation

LEfSe

http://huttenhower.sph.harvard.edu/lefse

Nicola Segata

LDA + Effect Size

Page 21: Scalable data mining for functional genomics and metagenomics

21

LEfSe: The TRUC murine colitis microbiota

With Wendy Garrett

Page 22: Scalable data mining for functional genomics and metagenomics

22

MetaHIT: The gut microbiome and IBD

[Pipeline diagram: WGS reads; Taxa: Phymm (Brady 2009); Genes (KOs); Pathways (KEGGs); Pathways/modules]

124 subjects: 99 healthy, 21 UC + 4 CD

Qin 2010

ReBLASTed against KEGG since published data obfuscates read counts

With Ramnik Xavier, Joshua Korzenik

Page 23: Scalable data mining for functional genomics and metagenomics

23

MetaHIT: Taxonomic CD biomarkers

Firmicutes

Enterobacteriaceae

Up in CD

Down in CD

UC

Page 24: Scalable data mining for functional genomics and metagenomics

24

MetaHIT: Functional CD biomarkers

Motility Transporters Sugar metabolism

Down in CD

Up in CD

Subset of enriched modules in CD patients

Subset of enriched pathways in CD patients

Growth/replication

Page 25: Scalable data mining for functional genomics and metagenomics

25

MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome

Transporters

Growth/replication

Motility

Sugar metabolism

Down in CD

Up in CD

Inferred metabolites

Enzyme families

Page 26: Scalable data mining for functional genomics and metagenomics

26

Outline

1. Data mining: Integrating very large genomic data compendia

• Network framework for scalable data integration
• HEFalMp: human data integration
• Meta-analysis for unsupervised functional network integration

2. Metagenomics: Modeling microbial communities for public health

• HMP: microbiome in health, 18 body sites in 300 subjects
• HUMAnN: metagenomic metabolic and functional pathway reconstruction
• LEfSe: biologically relevant community differences

Page 27: Scalable data mining for functional genomics and metagenomics

27

Thanks!

Jacques Izard, Wendy Garrett

Pinaki Sarder, Nicola Segata

Levi Waldron, Larisa Miropolsky

http://huttenhower.sph.harvard.edu

Interested? We’re recruiting students and postdocs!

Human Microbiome Project

HMP Metabolic Reconstruction

George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor

Sahar Abubucker, Yuzhen Ye

Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng

Mathangi Thiagarajan, Brandi Cantarel

Maria Rivera, Barbara Methe

Bill Klimke, Daniel Haft

Ramnik Xavier, Dirk Gevers

Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi

Sarah Fortune

http://huttenhower.sph.harvard.edu/sleipnir

Page 28: Scalable data mining for functional genomics and metagenomics
Page 29: Scalable data mining for functional genomics and metagenomics

29

Functional network prediction from diverse microbial data

486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species

307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions

Integrated functional interaction networks in 15 species

[Plot: E. coli integration; ← Precision ↑, Recall ↓]

Page 30: Scalable data mining for functional genomics and metagenomics

30

Predicting gene function

Cell cycle genes

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

Page 31: Scalable data mining for functional genomics and metagenomics

31

Predicting gene function

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

Cell cycle genes

Page 32: Scalable data mining for functional genomics and metagenomics

32

Predicting gene function

Cell cycle genes

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
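One hedged way to turn this statement into a number is to compare a gene's average edge confidence to the genes already known in the process against its average edge confidence to all other genes. The ratio used below is an assumption chosen to capture "specifically"; it is not taken from the slides, and the gene names and weights are invented.

```python
def edge(network, a, b):
    """Confidence of the predicted relationship between genes a and b."""
    return network.get(frozenset((a, b)), 0.0)


def participation_score(network, gene, process_genes, all_genes):
    """Mean edge weight to the process divided by mean edge weight overall.

    Values above 1 mean the gene is linked to the process more strongly
    than to the genomic background.
    """
    inside = [edge(network, gene, g) for g in process_genes if g != gene]
    everyone = [edge(network, gene, g) for g in all_genes if g != gene]
    if not inside or not everyone:
        return 0.0
    background = sum(everyone) / len(everyone)
    if background == 0.0:
        return 0.0
    return (sum(inside) / len(inside)) / background


if __name__ == "__main__":
    # Hypothetical network: edge weights are predicted confidences.
    network = {
        frozenset(("C", "A")): 0.9,
        frozenset(("C", "B")): 0.7,
        frozenset(("C", "D")): 0.1,
    }
    cell_cycle = {"A", "B"}
    genes = {"A", "B", "C", "D"}
    print(participation_score(network, "C", cell_cycle, genes))
```

Dividing by the background keeps highly connected "hub" genes from scoring well for every process.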

Page 33: Scalable data mining for functional genomics and metagenomics

33

Comprehensive validation of computational predictions

Genomic data

Prior knowledge

Computational Predictions of Gene Function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory Experiments: Petite frequency; Growth curves; Confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Page 34: Scalable data mining for functional genomics and metagenomics

34

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations

135 Under-annotations

82 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Page 35: Scalable data mining for functional genomics and metagenomics

35

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations

95 Under-annotations

40 Confirmed Under-annotations

80 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Page 36: Scalable data mining for functional genomics and metagenomics

36

Functional mapping: mining integrated networks

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

Page 37: Scalable data mining for functional genomics and metagenomics

37

Functional mapping: mining integrated networks

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

Chemotaxis

Page 38: Scalable data mining for functional genomics and metagenomics

38

Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

[Edge legend: predicted relationships between genes, from High Confidence to Low Confidence]

Chemotaxis
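Both functional-mapping quantities can be illustrated with simple averages over an integrated network: cohesiveness as the mean edge weight among genes within one process, and association as the mean edge weight between genes of two processes. Using plain means is an assumption made for this sketch, and the gene sets and weights are hypothetical.

```python
from itertools import combinations


def edge(network, a, b):
    """Confidence of the predicted relationship between genes a and b."""
    return network.get(frozenset((a, b)), 0.0)


def cohesiveness(network, process):
    """Mean edge weight over all gene pairs inside one process."""
    pairs = list(combinations(sorted(process), 2))
    if not pairs:
        return 0.0
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs)


def association(network, process_a, process_b):
    """Mean edge weight over gene pairs spanning two processes."""
    pairs = [(a, b) for a in process_a for b in process_b if a != b]
    if not pairs:
        return 0.0
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    network = {
        frozenset(("cheA", "cheB")): 0.9,
        frozenset(("fliC", "fliD")): 0.8,
        frozenset(("cheA", "fliC")): 0.6,
        frozenset(("cheB", "fliC")): 0.4,
        frozenset(("cheA", "fliD")): 0.2,
        frozenset(("cheB", "fliD")): 0.2,
    }
    chemotaxis = {"cheA", "cheB"}
    flagellar_assembly = {"fliC", "fliD"}
    print(cohesiveness(network, chemotaxis))
    print(association(network, chemotaxis, flagellar_assembly))
```

Either value becomes easier to interpret when compared with the same average computed over randomly chosen gene pairs, which plays the role of a genomic background.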

Page 39: Scalable data mining for functional genomics and metagenomics

39

Functional mapping: Associations among processes

[Legend] Edges: associations between processes, from Very Strong to Moderately Strong

[Network nodes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance]

Page 40: Scalable data mining for functional genomics and metagenomics

40

Functional mapping: Associations among processes

[Legend] Edges: associations between processes, from Very Strong to Moderately Strong

[Legend] Borders: data coverage of processes, from Well Covered to Sparsely Covered

[Network nodes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance]

Page 41: Scalable data mining for functional genomics and metagenomics

41

Functional mapping: Associations among processes

[Legend] Edges: associations between processes, from Very Strong to Moderately Strong

[Legend] Nodes: cohesiveness of processes, from Below Baseline through Baseline (genomic background) to Very Cohesive

[Legend] Borders: data coverage of processes, from Well Covered to Sparsely Covered

[Network nodes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance]

Page 42: Scalable data mining for functional genomics and metagenomics

42

Functional mapping: Associations among processes

[Legend] Edges: associations between processes, from Very Strong to Moderately Strong

[Legend] Nodes: cohesiveness of processes, from Below Baseline through Baseline (genomic background) to Very Cohesive

[Legend] Borders: data coverage of processes, from Well Covered to Sparsely Covered

