Predicting function from sequence Peer Bork EMBL & MDC Heidelberg & Berlin bork@embl-heidelberg.de

Post on 23-Dec-2015


Predicting function from sequence

Peer Bork

EMBL & MDC

Heidelberg & Berlin

bork@embl-heidelberg.de

http://www.bork.embl-heidelberg.de/

Bioinformatics

- Generation of information (biophysics)
- Storage and retrieval of information (informatics for biodatabases)
- Translation of information into knowledge (computational biology)

Chance of deducing structural and functional features by homology

Many homologues, an increasing number of predictable folds, but tough times for automatic function prediction


Function prediction from sequence

- Function and domain prediction
- Function prediction by gene context
- Quality and heterogeneity of data
- Prediction accuracy: 70% hurdle

Quality and heterogeneity of data

Challenges despite highly similar sequences due to sequencing errors and other artefacts

Challenges due to low sequence similarity, paralogy and multiple domains

Algorithmic challenges versus data quality and biological diversity


Number of human genes in time

[Figure: estimated number of human genes (in thousands, axis 0-120) between Mar00 and Apr01; series: HGS, Incyte and co; textbooks, public opinion; Celera; HGP; others. Value labels as extracted: 38, 32, 52, 39, 27, 24, 21]

Nature 304, 16. November 2000

Heterogeneous data from large scale approaches

- Gene expression (correlation to proteins poor)
- Yeast two hybrid (8% overlap with each other)
- Many others...

Mycoplasma pneumoniae predictions

[Figure: bar chart (axis 0-100) of function and structure predictions for 1995 and 1999; legend: homology, twilight, context, folds, fold twilight]

Dandekar et al., 2000 NAR Sep

Mycoplasma pneumoniae re-annotation: 1995 vs 1999

- ORFs: +12 -1 = 688
- RNAs: +9 = 42
- ORFs with functions: +105 = 458
- ORFs changed: 16 extended, 8 shortened
- Function changed: 30 more, 18 less specific

57% of all entries were re-annotated!

Prediction accuracy: 70% hurdle

70% prediction accuracy is great!

Prediction of | acc*cov | % acc | % cov of reference set | reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% 'high confidence' in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
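The acc*cov column in the table above is the product of accuracy and coverage, each taken as a fraction. A minimal sketch of that arithmetic for three of the rows follows; the function name `acc_times_cov` is an illustrative assumption, not something from the slides.

```python
def acc_times_cov(accuracy_pct, coverage_pct):
    """Return accuracy x coverage as a fraction, rounded to two decimals."""
    return round((accuracy_pct / 100) * (coverage_pct / 100), 2)


# (accuracy %, coverage %) copied from three rows of the table
rows = {
    "Human promoters": (50, 70),
    "Signal peptides (only presence)": (90, 100),
    "Functional features by homology": (90, 70),
}

for name, (acc, cov) in rows.items():
    print(f"{name}: {acc_times_cov(acc, cov):.2f}")
```

A method can look strong on accuracy alone and still score low on the product when it only covers a small part of the reference set.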

Clear homology via Blast; yet, misleading annotation hampers automatic function prediction

Phylogenetic tree of Blast hits reveals that no function prediction is possible

Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence

Henikoff et al. 1997 Science 278, 609

Function and domain prediction

Dotplot to reveal residue conservation

[Figure: dotplot of Rsp5 from Yeast against Ned4 from Human; both axes carry the domain architecture C2, WW (repeated) and HECTc; annotations: repeat pattern, conserved domain]
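A dotplot marks every pair of positions at which two sequences match, so conserved domains show up as diagonals and repeats as parallel diagonals. The sketch below is a minimal windowed version, assuming simple identity scoring; the two sequences are made-up toy strings (not Rsp5 or its human counterpart), and the names `dotplot`, `render`, `window` and `min_matches` are illustrative.

```python
def dotplot(seq_a, seq_b, window=3, min_matches=3):
    """Return (i, j) pairs whose windows share at least min_matches identical positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dot matrix as text: rows follow seq_a, columns follow seq_b."""
    lines = []
    for i in range(len(seq_a)):
        lines.append("".join("*" if (i, j) in dots else "." for j in range(len(seq_b))))
    return "\n".join(lines)


# Toy sequences, invented for illustration only
a = "MKWWPLGWWT"
b = "AWWPLGKWWT"
dots = dotplot(a, b)
print(render(a, b, dots))
```

Real tools usually score windows with a substitution matrix rather than plain identity; that refinement is left out here.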

Domain insertion

Function prediction for disease genes

Breast cancer gene BRCA1

- Positionally cloned 1994 (Miki et al. Science 266, pp66)
- Features originally deduced from the 1857aa sequence: contains a RING finger (30aa, usually binds diverse proteins)
- Function unknown, even localization unclear

Localization experiments on BRCA1

Title/Journal | Conclusion
A strong candidate for the breast and ovarian cancer susceptibility gene, Science 266, 1994 | ?
Aberrant subcellular localization of BRCA1 in breast cancer, Science 270, Nov. 1995 | Cytoplasmic (nuclear)
Growth retardation and tumor inhibition by BRCA1, Nature Genet. 12, March 1996 | Nuclear ?
BRCA1 is secreted and exhibits properties of a granin, Nature Genet. 12, March 1996 | Extracellular
Location of BRCA1 in breast and ovarian cancer, Science 272, April 1996 | Nuclear and cytoplasmic

Domain discovery in BRCA1

Domain discovery in disease genes

gene/protein | disease | domains | reference
dystrophin | Muscular dystrophy | WW | Bork & Sudol: TIBS 19 (94) 531
X11 | Friedreich's ataxia (c) | PI/PTB+PDZ | Bork & Margolis: Cell 80 (95) 693
PKD1 | Polycystic kidney | many (PKD1) | Int. PKD1 consortium: Cell 81 (95) 298
HD | Huntington's | HEAT repeats | Andrade & Bork: Nat. Genet. 11 (95) 115
BRCA2 | Breast cancer | BRC repeats | Bork et al.: Nat. Genet. 13 (96) 22
BRCA1 | Breast cancer | BRCT | Koonin et al.: Nat. Genet. 13 (96) 266
dsh | DiGeorge syndrome | DEP | Ponting & Bork: TIBS 21 (96) 245
X25 (FRDA) | Friedreich's ataxia | CyaY | Gibson et al.: TINS 19 (96) 465
beige/CH | Chediak-Higashi | BEACH | Nagle et al.: Nat. Genet. 14 (96) 307
RB | Retinoblastoma | BRCT | Bork et al.: FASEB J. 11 (97) 68
9 incl. HML1 | Colon cancer | HSP90 | Mushegian et al.: PNAS 94 (97) 5831
TSG101 | Breast cancer | UBC | Ponting, Cai & Bork: JMM 75 (97) 467
WRN/BLM | Werner + Bloom syn. | HRDC | Morozov et al.: TIBS 22 (97) 417
2 inc pyrin | Mediterranean fever | SPRY | Schultz et al.: PNAS 95 (98) 5857
p73 | various tumors? | SAM | Bork & Koonin: Nat. Genet. 18 (98) 313
mahogany | Obesity | PSI | Nagle et al.: Nature 398 (99) 148
Parkin | AP-J Parkinsonism | IBR | Morett & Bork: TIBS 24 (99) 229

SMART: Blast-like input

- ID or AC sufficient

- Access to different databases

- Domain annotation

www.smart.embl-heidelberg.de

SMART: Digested output

- signal sequence
- transmembrane regions
- comparison of domain context

Non-globular functional features in protein sequences

- Transmembrane regions
- signal sequences
- GPI anchors
- coiled coils
- other compositionally biased regions (short internal repeats)

SMART: Blast with "in between" regions

- automatically cuts respective region
- cut and paste for other programs
- some specific output features
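Cutting out the "in between" regions amounts to taking the stretches of a sequence that no annotated domain covers, so they can be searched separately. The following is a minimal sketch of that idea, not SMART's actual implementation; the function name, the `min_length` cutoff and the domain coordinates are all assumptions made for illustration.

```python
def in_between_regions(seq_length, domains, min_length=1):
    """Return 1-based inclusive (start, end) regions not covered by any domain."""
    regions = []
    position = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - position >= min_length:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if seq_length - position + 1 >= min_length:
        regions.append((position, seq_length))
    return regions


# Hypothetical domain hits on a 400-residue protein (two of them overlap)
domains = [(20, 120), (200, 260), (250, 300)]
print(in_between_regions(400, domains, min_length=30))
```

The `min_length` cutoff drops short linkers that are unlikely to hold an undiscovered domain; the value 30 is arbitrary.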


SMART: Domain annotation

- multiple alignment
- consensus features
- residue annotation
- search options
- description

SMART: Species distribution

- total occurrence
- taxonomic break down
- model organisms
- protein and domain statistics

Domain architecture of C35B8.2 C. elegans

Query: VAV H. sapiens

Reconstructed structure of C35B8.2

Annotation improvement using domain correlation

SH3

Find closest hit: selective SMART

Evaluate correlation; scan genome region


Domain organization of TAP

[Figure: domain organization of TAP (619aa, scale bar 100aa); labels: RNA-binding, LRR (four repeats), NTF2-like, UBA, p15-binding, np-bind.; regions probed by directed mutagenesis and random mutagenesis are marked]

Collaboration with Elisa Izaurralde

Directed mutagenesis confirms predicted TAP/p15 interaction

[Figure: NTF2-like domain of TAP with p15. Red - loss of binding; Blue - no effect on binding; Gray - alanine scan]

Human genome reveals whole TAP family

Independent duplications in fly, worm and human

In 90% of the human genome: 6 homologues, but of these 1-2 pseudogenes

Sequenced eukaryotic genomes

Bork and Copley, Nature 409 (01) 818

History of signaling domain discovery: Novel nuclear and cytoplasmic domains

[Figure: bar chart of the number of novel cytoplasmic domains and nuclear domains (axis 0-35) per period; tick labels as extracted: <1985, 87/88, 91/92, 95/96, 99/20]

Systematic approach by searching 'in between' regions

Top 10 domains* in human

Domain | man | fly | worm | yeast | cress
Immunoglobulin | 765(381) | 140 | 64 | 0 | 1
C2H2 zinc finger | 706(607) | 357 | 151 | 48 | 115
Protein kinase | 575(501) | 319 | 437 | 121 | 1049
Rhod.-like GPCR | 569(616) | 97 | 358 | 0 | 16
P-loop NTPase | 433 | 198 | 183 | 97 | 331
Rev. transcriptase | 350 | 10 | 50 | 6 | 80
RRM (RNA-binding) | 300(224) | 157 | 96 | 54 | 255
WD40 (G-protein) | 277(136) | 162 | 102 | 91 | 210
Ankyrin repeat | 276(145) | 105 | 107 | 19 | 120
Homeobox | 267(160) | 148 | 109 | 9 | 118
Total no genes | 26500(26500) | 13300 | 18200 | 6100 | 25700

*Only no of genes given, no of domains higher; note that only around 90% is sequenced

Nature 409 (01) 860; Science 291 (01) 1304

Top 10 mobile domains in human

Domain | man | fly | worm | yeast | cress
C2H2 zinc finger | 5653 | 1778 | 587 | 104 | 255
Immunoglobulin | 1364 | 457 | 530 | 0 | 2
EGF | 1207 | 466 | 539 | 1 | 53
WD40 (G-protein) | 894 | 678 | 488 | 340 | 1022
Ankyrin repeat | 714 | 363 | 344 | 38 | 261
Cadherin domain | 622 | 201 | 113 | 0 | 0
Protein kinases | 586 | 259 | 462 | 122 | 1054
Fibronectin type 3 | 557 | 217 | 212 | 2 | 6
RRM (RNA-binding) | 443 | 242 | 183 | 94 | 460
CCP/sushi/SCR | 277 | 64 | 80 | 0 | 0
Total no genes | 26500 | 13300 | 18200 | 6100 | 25700

Only no of domains given, no of proteins lower; note that only around 90% is sequenced

SMART analysis of 31700 predicted human ORFs

Correlation between domains

[Figure: correlation between domains; labels as extracted: extra, intra, nuclear, other, Marker, PX]

Function prediction by gene context

Phenotypic features do not coincide with species evolution...

...but gene content does

Orthology vs paralogy ... within homology

[Figure: history of a gene that gives rise to gene 1 and gene 2; genome A carries gene A1 and gene A2, genome B carries gene B1 and gene B2; the relationships between these genes are marked as orthology or paralogy]

Differential Genome Display

[Figure: H. influenzae genome]

Huynen et al., 1997 Trends Genet 13, 389
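Differential genome display compares the gene content of genomes to find what one organism has and another lacks. The sketch below shows only the set arithmetic behind that comparison, assuming every gene has already been mapped to a shared family identifier (in practice this mapping comes from sequence comparison). The function name and the family identifiers are hypothetical.

```python
def differential_genome_display(query_genome, reference_genome):
    """Split the gene families of the query genome by presence in the reference.

    Both arguments are sets of gene family identifiers.
    Returns (shared, query_only).
    """
    shared = query_genome & reference_genome
    query_only = query_genome - reference_genome
    return shared, query_only


# Hypothetical family identifiers, for illustration only
query = {"fam1", "fam2", "fam3", "fam7"}
reference = {"fam1", "fam2", "fam4", "fam5"}

shared, query_only = differential_genome_display(query, reference)
print("shared:", sorted(shared))
print("query only:", sorted(query_only))
```

Families found only in the query genome are the candidates for organism-specific traits.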

Exploiting the absence of genes

Huynen et al., 1998, FEBS Lett 426, 1-5

Predicting functional interactions between proteins by the co-occurrence of their genes in genomes

Distribution of four M.genitalium genes among 25 genomes

MG299 (pta)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA) 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1

Using the mutual information between genes as a scoring heuristic for their co-occurrence.

M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
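The three scores above are consistent with mutual information computed over the 25 presence/absence values using the natural logarithm. The sketch below assumes that plain frequency-based estimator; the function name is illustrative, and the profiles are copied from the table.

```python
from collections import Counter
from math import log


def mutual_information(profile_a, profile_b):
    """Mutual information (in nats) between two equal-length presence/absence profiles."""
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    # sum over observed (x, y) of p(x,y) * ln( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


# Presence (1) / absence (0) of each gene in the 25 genomes, as listed above
PROFILES = {
    "pta":  "0001100001101011000101111",
    "ackA": "0001100001101011000101111",
    "dnaJ": "0011111101111011100111111",
    "dnaK": "0011111101111011100111111",
}

for a, b in [("pta", "ackA"), ("dnaJ", "dnaK"), ("dnaJ", "ackA")]:
    print(f"M({a}, {b}) = {mutual_information(PROFILES[a], PROFILES[b]):.2f}")
```

Two identical profiles score their own entropy, which is why the pta/ackA pair (present in about half of the genomes) scores higher than the dnaJ/dnaK pair (present in most of them).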

Phylogenetic distribution of iron-sulfur cluster assembly proteins

[Figure: presence of iron-sulfur cluster assembly genes across genomes. Gene labels as extracted: hscB Jac1, hscA, ssq1, Nfu1, iscA Isa1-2, fdx Yah1, Arh1, ORF1, ORF2, ORF3, iscS Nfs1, iscU Isu1-2, cyaY Yfh1 (frataxin). Genomes: A.aeolicus, Synechocystis, B.subtilis, M.genitalium, M.tuberculosis, D.radiodurans, R.prowazekii, C.crescentus, M.loti, N.meningitidis, X.fastidiosa, P.aeruginosa, Buchnera, V.cholerae, H.influenzae, P.multocida, E.coli, A.pernix, M.jannaschii, A.thaliana, S.cerevisiae, C.jejuni, C.albicans, S.pombe, H.sapiens, C.elegans, H. pylori, D.melan.]

The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.

Huynen et al., Hum. Mol. Genet. 2001

Function prediction via gene context information

Genomic context information:

- Conserved gene neighborhood in genomes
- Gene fusion as distinct neighborhood subset
- Conserved co-occurrence of genes in species ('phylogenetic profile', 'COG pattern')
- Surrounding and shared regulatory elements

Knowledge-based context information:

- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature
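Conserved gene neighborhood, the first genomic context signal listed above, can be approximated by counting in how many genomes two genes sit directly next to each other. The following is a minimal sketch under strong simplifications: genes are already mapped to shared identifiers, strand and distance are ignored, and the gene orders are invented toy data. The names `conserved_neighbors` and `min_genomes` are hypothetical.

```python
from collections import Counter


def conserved_neighbors(genomes, min_genomes=2):
    """Return gene pairs that are directly adjacent in at least min_genomes genomes.

    genomes maps a genome name to its ordered list of gene identifiers.
    """
    counts = Counter()
    for gene_order in genomes.values():
        # A set, so that each pair is counted once per genome, in either orientation
        pairs = {
            frozenset((left, right))
            for left, right in zip(gene_order, gene_order[1:])
            if left != right
        }
        counts.update(pairs)
    return {tuple(sorted(pair)): n for pair, n in counts.items() if n >= min_genomes}


# Invented gene orders, for illustration only
toy_genomes = {
    "genome1": ["a", "b", "c", "d"],
    "genome2": ["c", "b", "a", "e"],
    "genome3": ["b", "c", "x"],
}
print(conserved_neighbors(toy_genomes))
```

Pairs that stay adjacent across distantly related genomes are the ones worth inspecting for a shared pathway or complex.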

Evolution of genome organization

Dotplot to reveal gene order conservation

Conservation of gene neighborhood

[Figure: pairwise comparison of 20 prokaryotic genomes; axis labels (time) and (log); marked genome pairs: EC-HI, MG-MP]

Nucleotide salvage/degradation pathway in gram-positive bacteria

TCA cycle in evolution

Huynen et al., 1999, Trends Microb. 7, 281

Varying gene neighborhood within ribosomal operons

Pathway prediction using context information

... is essential part of the hemolysin export

[Figure: gene neighborhoods in M. pneumo., M. tubercul., E.coli, T. maritima and B. subtilis, showing the genes exp (exporter), hyp, tlyC (hemolysin), era (GTPase) and phoL (PhoH-like) in varying arrangements]

STRING server for context retrieval

[Figure: STRING output for tryptophan biosynthesis]

www.bork.embl-heidelberg.de/STRING

Snel et al. NAR 28 (00) 3442

Homology vs context methods: M. genitalium as benchmark

[Figure: Venn diagram of M. genitalium genes. MG total: 480 genes; homology-based function: 368 genes; context-based function: 238 genes; region labels as extracted: 333328, 26; annotations: hypothetical, additional information]

Martijn (NL)

Frank* (D) Yan (C) Peer (D)

Tobias (D)

Luis* (E)

Jörg* (D), Berend (NL), Warren (US)

Miguel (E)

Shamil (RU)

Birgit* (D) Mikita (J)

Richard (UK)

Vassily (RU), Ina* (D)

Gert* (D)

+Thomas* (D), David (E), Ivica (Hr), Carolina (E), Steffen (D), Francesca (I), Jan (D)

*left EMBL