Motif -...

Post on 19-Jan-2020

Motif inference

Dispersed repeat motifs or motifs common to a set of strings

Motif search versus motif inference

Search: a known motif + a text => positions in the text where the motif is "found"

Inference: a set of properties + a text => motifs satisfying the properties
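
To make the "search" side concrete, here is a minimal sketch that reports every position of a text where a known motif is "found". The function name `find_motif` and the `max_mismatches` parameter are illustrative assumptions, not something defined in the talk; only substitutions are counted.

```python
def find_motif(text: str, motif: str, max_mismatches: int = 0) -> list[int]:
    """Return the 0-based start positions where `motif` occurs in `text`
    with at most `max_mismatches` substitutions (no insertions/deletions)."""
    k = len(motif)
    positions = []
    for start in range(len(text) - k + 1):
        window = text[start:start + k]
        # Count the positions where the window and the motif disagree.
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            positions.append(start)
    return positions


text = "GGTATAATGCTATTATCC"
exact = find_motif(text, "TATAAT")                      # exact occurrences only
approx = find_motif(text, "TATAAT", max_mismatches=1)   # one substitution allowed
```

Inference is the harder, reverse problem: the motif itself is unknown and only the properties it must satisfy are given.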

Motif search (very briefly)

What is the best way of representing a motif?

pattern
position weight matrix or profile HMM
30-40% false negatives
45-60% false positives
neural networks
better: more examples

Example of a position weight matrix: positions 3 to 9 of the CRP binding site

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Corresponding frequency and log-likelihood position weight matrices

Frequency matrix

pos.    3       4       5       6       7       8       9
A     0.35    0.043   0       0.043   0.13    0.83    0.26
C     0.17    0.087   0.043   0.043   0       0.043   0.3
G     0.13    0       0.78    0       0.83    0.043   0.17
T     0.35    0.87    0.17    0.91    0.043   0.087   0.26

Log-likelihood position weight matrix

pos.    3       4       5       6       7       8       9
A     0.48   -2.5    -inf    -2.5    -0.94    1.7     0.061
C    -0.52   -1.5    -2.5    -2.5    -inf    -2.5     0.28
G    -0.94   -inf     1.6    -inf     1.7    -2.5    -0.52
T     0.48    1.8    -0.52    1.9    -2.5    -1.5     0.061
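
The two matrices can be recomputed from the 23 aligned sites. The sketch below assumes that the log-likelihood entries are log2(frequency / 0.25), that is, a uniform background over A, C, G, T; this is inferred from the numbers on the slide, not stated in the talk. The function names are illustrative.

```python
import math

# The 23 CRP sites (positions 3 to 9), as listed on the slide.
SITES = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG", "ATGCAAA",
    "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT", "ACGTGAT", "ATGTGAG",
    "TTGTGAG", "CTGTAAC", "CTGTGAA", "TTGTGAC", "GCCTGAC", "TTGTGAT",
    "TTGTGAT", "GTGTGAA", "CTGTGAC", "ATGAGAC", "TTGTGAG",
]
ALPHABET = "ACGT"


def frequency_matrix(sites):
    """Per-letter list of column frequencies."""
    n, length = len(sites), len(sites[0])
    return {a: [sum(s[i] == a for s in sites) / n for i in range(length)]
            for a in ALPHABET}


def log_odds_matrix(freq, background=0.25):
    """log2(f / background); a zero frequency gives minus infinity."""
    return {a: [math.log2(f / background) if f > 0 else float("-inf")
                for f in row]
            for a, row in freq.items()}


def score(word, pwm):
    """Sum of the matrix entries selected by the letters of `word`."""
    return sum(pwm[c][i] for i, c in enumerate(word))


freq = frequency_matrix(SITES)
pwm = log_odds_matrix(freq)
```

In practice pseudocounts are often added before taking logarithms so that no entry is minus infinity; that refinement is not part of the slide and is left out here.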

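Example of a profile HMM

[Figure: profile HMM with a begin state (b), an end state (e), match states m1 to m3, insert states i0 to i3 and delete states d1 to d3, built from an alignment whose three match columns are CCCCC (1), AGDVK (2) and FWYFY (3).]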

Different HMM or HMM-related architectures

[Figure: state diagrams of four architectures: BLOCKS, META-MEME, profile HMM, HMMER2 "Plan 7".]

Motif inference: set of properties

Main property: motif of interest = "conserved" element

Various possible measures for "conservation"
  conservation at the sequence level?
  conservation at the level of physico-chemical properties of the nucleotide sequences?

In this talk:

"letter" conservation: TATAAT, TTGNCA, RFXCP
physico-chemical conservation: run of pyrimidines, run of hydrophilic amino acids
or a mixture of both: TA[AT]N[AT]T, [ILMV][ASG]XXC[ILMV]H[FYW]P
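
Patterns of the "mixture" kind map directly onto regular expressions. The sketch below is one possible translation; the helper `pattern_to_regex` and its parameters are assumptions made for illustration. The wildcard letter (N for DNA, X for proteins) is expanded to the whole alphabet, and bracketed classes are kept as they are.

```python
import re


def pattern_to_regex(pattern: str, wildcard: str, alphabet: str) -> re.Pattern:
    """Translate a motif such as TA[AT]N[AT]T into a compiled regular expression."""
    out = []
    in_class = False
    for ch in pattern:
        if ch == "[":
            in_class = True
            out.append(ch)
        elif ch == "]":
            in_class = False
            out.append(ch)
        elif ch == wildcard and not in_class:
            # A wildcard outside brackets stands for any letter of the alphabet.
            out.append(f"[{alphabet}]")
        else:
            out.append(re.escape(ch))
    return re.compile("".join(out))


dna = pattern_to_regex("TA[AT]N[AT]T", wildcard="N", alphabet="ACGT")
protein = pattern_to_regex("[ILMV][ASG]XXC[ILMV]H[FYW]P", wildcard="X",
                           alphabet="ACDEFGHIKLMNPQRSTVWY")

# Start positions of (non-overlapping) matches in a short text.
hits = [m.start() for m in dna.finditer("CCTATAATGG")]
```

Note that `finditer` reports non-overlapping matches only; a position-by-position scan would be needed to list overlapping occurrences.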

"Statistical" conservation measure

GTTTTTCT
CTGCATCT
GTGTAACC
GGGTATGT
TTGTCTCT
GCTTATCT
ATGTCTCT
GAGTATCA
GTGTAGGT
GTGAATCA

A   1  1  0  1  7  1  0  2
C   1  1  0  1  2  0  8  1
G   7  1  8  0  0  1  2  0
T   1  7  2  8  1  8  0  7

"Most surprising" sets of words

$$\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_{\alpha}}$$   (relative entropy)

This is a weighted average of the log-likelihood (the weights are the frequencies).

Examples: each row lists the letters observed at one of the 5 positions, followed by the contribution of that position.

$f_{\alpha} = 1/4$
A G T C   0
G A C T   0
T G C A   0
C G A T   0
G C A T   0
total: 0

$f_{\alpha} = 1/4$
A A A A   2
T T T T   2
C C C C   2
G G G G   2
C C C C   2
total: 10

$f_{A} = 1/16$
A A A A   4
A A A A   4
A A A A   4
A A A A   4
A A A A   4
total: 20

$f_{A} = 3/4$
A A A A   0.4
A A A A   0.4
A A A A   0.4
A A A A   0.4
A A A A   0.4
total: 2
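
The relative entropy of a set of words can be checked numerically. The sketch below is an illustration under stated assumptions: the names `relative_entropy` and `from_columns` are made up here, each input row is read as the letters seen at one position, and the background frequencies are passed in explicitly.

```python
import math
from collections import Counter


def from_columns(columns):
    """Turn per-position letter lists (one string per position) into words."""
    return ["".join(letters) for letters in zip(*columns)]


def relative_entropy(words, background):
    """Sum over positions i and letters a of f_ia * log2(f_ia / f_a)."""
    total = 0.0
    for i in range(len(words[0])):
        counts = Counter(word[i] for word in words)
        for letter, count in counts.items():
            f = count / len(words)
            total += f * math.log2(f / background[letter])
    return total


uniform = {a: 0.25 for a in "ACGT"}
mixed = from_columns(["AGTC", "GACT", "TGCA", "CGAT", "GCAT"])
conserved = from_columns(["AAAA", "TTTT", "CCCC", "GGGG", "CCCC"])
all_a = from_columns(["AAAA"] * 5)

e_mixed = relative_entropy(mixed, uniform)           # every column is uniform
e_conserved = relative_entropy(conserved, uniform)   # every column is conserved
e_rare_a = relative_entropy(all_a, {"A": 1 / 16})    # A is rare in the background
e_common_a = relative_entropy(all_a, {"A": 3 / 4})   # A is common in the background
```

The last two cases show why the background matters: the same fully conserved run of A's is very surprising when A is rare and hardly surprising when A is common.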

"Deterministic" conservation measure

"Model": GTGTATCT

Occurrences, each with its number of differences from the model:

GTTTTTCT   2
CTGCATCT   2
GTGTAACC   2
GGGTATGT   2
TTGTCTCT   2
GCTTATCT   2
ATGTCTCT   2
GAGTATCA   2
GTGTAGGT   2
GTGAATCA   2

Model

A motif: a word written over the same alphabet as the text, or over a degenerate (physico-chemical) alphabet

A number of specific properties:
  minimum number of occurrences the motif must have (quorum)
  for each occurrence, maximum number of differences allowed in relation to the motif (substitutions only, or substitutions and indels)
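
A model of this kind can be checked against a set of sequences in a few lines. The sketch below handles substitutions only; the function names and the way the quorum is expressed (a minimum number of sequences) are assumptions for illustration. The data are the ten words and the model GTGTATCT from the "deterministic" conservation slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two words of equal length differ."""
    return sum(x != y for x, y in zip(a, b))


def count_supporting(model: str, sequences, max_diff: int) -> int:
    """Number of sequences containing at least one window within
    `max_diff` substitutions of `model`."""
    k = len(model)
    support = 0
    for seq in sequences:
        if any(hamming(model, seq[i:i + k]) <= max_diff
               for i in range(len(seq) - k + 1)):
            support += 1
    return support


def satisfies(model, sequences, max_diff, quorum) -> bool:
    """True if the model reaches the quorum under the allowed differences."""
    return count_supporting(model, sequences, max_diff) >= quorum


WORDS = ["GTTTTTCT", "CTGCATCT", "GTGTAACC", "GGGTATGT", "TTGTCTCT",
         "GCTTATCT", "ATGTCTCT", "GAGTATCA", "GTGTAGGT", "GTGAATCA"]

with_two = count_supporting("GTGTATCT", WORDS, max_diff=2)
with_one = count_supporting("GTGTATCT", WORDS, max_diff=1)
```

With two differences allowed the model is supported by all ten words, with one difference by none of them, which is why the quorum and the difference threshold have to be chosen together.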

In fact, the two are not so different

Except perhaps for:

CTGTATCG
CTGATTCG
CTGAGACG
GTGCATCG
CTCGCTCG
CTGCGTCG
CTGTCTCG
CTGCTTCG
CTGTCTCG
CTGGATCG

A   0   0   0   2   3   1   0   0
C   9   0   1   3   3   0  10   0
G   1   0   9   2   2   0   0  10
T   0  10   0   3   2   9   0   0

Which may, at the current time, perhaps be better captured by:

"Model": CTGNNTCG

Occurrences, each with its number of differences from the model:

CTGTATCG   0
CTGATTCG   0
CTGAGACG   1
GTGCATCG   1
CTCGCTCG   1
CTGCGTCG   0
CTGTCTCG   0
CTGCTTCG   0
CTGTCTCG   0
CTGGATCG   0

Approaches using a statistical conservation measure

Objective
  Find the set of words that is the "most surprising possible"
  It is an optimisation problem, which in general leads to a unique solution

Algorithm
  Only exact approach possible: test all sets of words and, for each of them, calculate the value of the formula
  Too time consuming ($O(n^N k)$), one must therefore use heuristics

"Heuristic"

Three main approaches

Expectation-Maximization
  (Lawrence et al., Proteins, 7:41-51, 1990)
  MEME (Bailey et al., Machine Learn., 21:51-80)

Gibbs sampling
  (Lawrence et al., Sci., 262:208-214, 1993)

Greedy algorithm
  (w)consensus (Hertz et al., Bioinfo., 15:563-577, 1999)

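Gibbs sampling

[Figure: one string is scanned. For all positions p, the value of the formula, $F_p$, is computed. The new position is either the one with max $F_p$ or (stochastic) one chosen with probability $F_p / \sum_p F_p$. We then start again (with another string), until convergence.]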
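
The loop described above can be sketched as follows. This is a simplified illustration under several assumptions, not the sampler of Lawrence et al.: the weight F_p is taken to be the likelihood ratio of the word at p under a profile built from the other strings (with pseudocounts) against a uniform background, and a fixed number of sweeps replaces a convergence test. All names are illustrative.

```python
import random


def profile_without(seqs, starts, k, skip, alphabet="ACGT", pseudo=1.0):
    """Column frequencies built from every current word except the one
    in sequence number `skip` (pseudocounts keep every entry positive)."""
    counts = [{a: pseudo for a in alphabet} for _ in range(k)]
    for idx, (seq, s) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for i, c in enumerate(seq[s:s + k]):
            counts[i][c] += 1
    total = len(seqs) - 1 + pseudo * len(alphabet)
    return [{a: col[a] / total for a in alphabet} for col in counts]


def gibbs(seqs, k, iterations=200, seed=0, background=0.25):
    """Return one start position per sequence after `iterations` sweeps."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for skip, seq in enumerate(seqs):
            prof = profile_without(seqs, starts, k, skip)
            # F_p for every candidate position p of the string left out.
            weights = []
            for p in range(len(seq) - k + 1):
                w = 1.0
                for i, c in enumerate(seq[p:p + k]):
                    w *= prof[i][c] / background
                weights.append(w)
            # Stochastic choice: position p with probability F_p / sum_p F_p.
            starts[skip] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["ACGTTGACAGGT", "CCTTGACATTAG", "GGATTGACACCA", "TTGACAGGCATC"]
starts = gibbs(seqs, k=6, iterations=50)
```

Replacing `rng.choices(...)` by the position of the largest weight gives the deterministic "max F_p" variant mentioned on the slide. Being a heuristic, the sampler may settle on a local optimum, so several runs with different seeds are usually compared.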

Approaches using a "deterministic" conservation measure

Objective
  Given a model (alphabet for the motifs and properties such as quorum and maximum difference rate allowed), find all motifs which satisfy the properties
  It is an enumeration problem, which in general produces various (often a great number of) solutions

Algorithm
  An exhaustive approach is possible
  Time complexity depends on the properties
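
As an illustration of the enumeration problem, the naive sketch below tries every word of length k over the alphabet and keeps those that reach the quorum. It is a brute-force stand-in written for this note, not the algorithm of any tool cited in the talk; its cost grows as |alphabet|^k, and only substitutions are handled. All names are assumptions.

```python
from itertools import product


def occurs(model: str, seq: str, max_diff: int) -> bool:
    """True if some window of `seq` is within `max_diff` substitutions of `model`."""
    k = len(model)
    return any(
        sum(a != b for a, b in zip(model, seq[i:i + k])) <= max_diff
        for i in range(len(seq) - k + 1)
    )


def enumerate_motifs(seqs, k, max_diff, quorum, alphabet="ACGT"):
    """All (model, support) pairs whose support reaches the quorum,
    where support is the number of sequences with an occurrence."""
    found = []
    for letters in product(alphabet, repeat=k):
        model = "".join(letters)
        support = sum(occurs(model, s, max_diff) for s in seqs)
        if support >= quorum:
            found.append((model, support))
    return found


seqs = ["ACGTACGT", "TTACGTTT", "GGGACGAG"]
exact_all = enumerate_motifs(seqs, k=4, max_diff=0, quorum=3)
exact_two = enumerate_motifs(seqs, k=4, max_diff=0, quorum=2)
one_diff_all = enumerate_motifs(seqs, k=4, max_diff=1, quorum=3)
```

Unlike the optimisation setting, the output is a list: loosening the quorum or the number of differences typically makes it grow quickly.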

How does the algorithm work?

It does not matter since the algorithm is exact!

However, this has in general to be followed by a STATISTICAL EVALUATION of the models found, to classify them according to how SURPRISING they are given the remainder of the sequences

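One can make the models more complex: motif inference with differences and a non-transitive relation

Alphabet of models corresponds to groups of amino acids

[Figure: overlapping groups of amino acids, plus a wild card. Residues shown: M, F, W, Y, H, K, R, D, Q, E, N, T, L, I, V, C, S, A, G, P.]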

One can make the models more complex: motif inference with differences and a non-transitive relation

Example: models written over a physico-chemical alphabet

[AST][ILMV]XX[FYW][HKR]X[PG]C

Occurrences:
0 difference     AIAGWHAPC
1 substitution   ATTAYHSPC
1 deletion       SVMLFLPC
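
The three occurrences can be verified with a small edit-distance computation in which each model position is a set of allowed residues. The sketch below is an illustration; `parse_model` and `differences` are names chosen here, and X is assumed to stand for any of the 20 standard amino acids.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_model(pattern: str):
    """Turn '[AST][ILMV]XX...' into a list of sets of allowed residues."""
    classes, i = [], 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)
            classes.append(set(pattern[i + 1:j]))
            i = j + 1
        else:
            classes.append(AMINO_ACIDS if pattern[i] == "X" else {pattern[i]})
            i += 1
    return classes


def differences(model, word: str) -> int:
    """Minimum number of substitutions, insertions and deletions needed
    for `word` to fit `model` (classic dynamic programming, row by row)."""
    n = len(word)
    prev = list(range(n + 1))
    for i in range(1, len(model) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            fit = 0 if word[j - 1] in model[i - 1] else 1
            cur[j] = min(prev[j - 1] + fit,  # match or substitution
                         prev[j] + 1,        # model position left unmatched
                         cur[j - 1] + 1)     # extra letter in the word
        prev = cur
    return prev[n]


model = parse_model("[AST][ILMV]XX[FYW][HKR]X[PG]C")
d_exact = differences(model, "AIAGWHAPC")
d_subst = differences(model, "ATTAYHSPC")
d_del = differences(model, "SVMLFLPC")
```

Because residue groups may overlap, "matching the same model position" is not a transitive relation between letters, which is the point made in the slide title.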

One can make the models more complex: structured models

Smile (Marsan et al., JCB, 7:345-362, 2000)

A structured model: an ordered collection of p boxes, p maximum rates of differences, p - 1 intervals of distances (between successive boxes in the collection)

Model: TTGACA (e1 = 2), d = 17 ± 1, TATAAT (e2 = 1); quorum = 3/4

Occurrences (first box, distance, second box):
TTGACT   18        TAAAAT
TTGACA   17        TATAAA
TTGCCA   too far   TATTAT
TTGTCT   17        TATAAT

One can make the models more complex: structured models

A structured model: an ordered collection of p boxes, p maximum rates of differences, p - 1 intervals of distances (between successive boxes in the collection)

Model: TTGACA (e1 = 2), d = 17 ± 1, TATAAT (e2 = 1); quorum = 3/4

Occurrences (first box, distance, second box):
TTGACT   18        TAAAAT
TTGACA   16        TATAAA
TTGCCA   too far   TATTAT
TTGTCT   17        TATAAT
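
A two-box structured model can be searched for with a direct scan. The sketch below is an illustration, not the Smile algorithm: it handles substitutions only, and it assumes that the distance d counts the letters between the end of the first box and the start of the second. All names are illustrative.

```python
def box_hits(box: str, seq: str, max_diff: int):
    """Start positions where `box` occurs with at most `max_diff` substitutions."""
    k = len(box)
    return [i for i in range(len(seq) - k + 1)
            if sum(a != b for a, b in zip(box, seq[i:i + k])) <= max_diff]


def structured_occurrences(seq, box1, e1, box2, e2, d_min, d_max):
    """Pairs (start1, start2) such that the gap between the end of box 1
    and the start of box 2 lies in [d_min, d_max]."""
    second = set(box_hits(box2, seq, e2))
    pairs = []
    for s1 in box_hits(box1, seq, e1):
        end1 = s1 + len(box1)
        for gap in range(d_min, d_max + 1):
            if end1 + gap in second:
                pairs.append((s1, end1 + gap))
    return pairs


# Toy promoter-like strings: the two boxes separated by 17 or 20 letters.
good = "CC" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
too_far = "CC" + "TTGACA" + "A" * 20 + "TATAAT" + "GG"

found = structured_occurrences(good, "TTGACA", 2, "TATAAT", 1, 16, 18)
missed = structured_occurrences(too_far, "TTGACA", 2, "TATAAT", 1, 16, 18)
```

A sequence counts towards the quorum as soon as it yields at least one such pair.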

A few applications

"Experimental" set
  Escherichia coli: 441 sequences, 35115 nucleotides
  Bacillus subtilis: 131 sequences, 13099 nucleotides

"Genomic" set
  Escherichia coli: 1062 sequences, 196736 nucleotides
  Bacillus subtilis: 1148 sequences, 226928 nucleotides

"Experimental" set: MEME

Escherichia coli

MOTIF 1   width = 46   sites = 185.2
Information content: 10.0 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
AAATAAAAGTTGACATTTTTTGGAGTAAATGGTATAATGCGCCCCC
Lower levels (fragments; column alignment not preserved):
CTTATTTCT TGACAACGCGCCCAATTTGTT A C T CGGGGA
C CTA C CACGAATGTCCGCC A A T GGC A C T C

"Experimental" set: MEME

Bacillus subtilis

MOTIF 1   width = 30   sites = 121.0
Information content: 11.6 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
TTGACATTATTTTAAAAATATGATATAATA
Lower levels (fragments; column alignment not preserved):
TTATAATAAAATTTTGT G A G
C CC AG T

"Experimental" set: Combinatorial algorithm (1 box)

           Escherichia coli                    Bacillus subtilis
Family 1   ATAATGCGG    34    3.90   24       TATAATA   94   48.06   32
           TATAATGCGC   23    1.60   19       GTATAAT   74   34.34   24
           ATAATGCGC    30    5.75   17       TGTTATA   66   34.96   15
           TGTGTATA     47   15.85   16       TTTTACA   76   45.96   13
           ACAATGCGC    24    3.85   15       ATAATAT   82   52.52   13

Family 2   GTTGACAC     36   10.80   14       GTGACA    68   39.76   12
           TCACACTT     36   11.10   13       TTTACAA   75   48.56   10
           TGACACTT     38   12.35   13       GTTGAC    66   40.10   10
           GCTGACA      64   31.55   12       TTGACA    92   66.34   10
           ACACTTAT     41   14.95   12       ATGATA    10   80.26   10
           TTGACACT     37   13.75   11

Family 3   TTACGCTG     39   12.80   14
           TGTTACGC     39   14.45   12
           TTTACGCT     44   17.85   11

Family 4   TTTTTTTTTC   23    5.40   11

Family 5   GCGCCCC      44   18.85   10

"Experimental" set: Combinatorial algorithm (2 boxes)

Escherichia coli, Bacillus subtilis

[Figure, two panels (a, b): Χ2 plotted against the interval of distances between two parts of a model, for intervals from [4,6] to [24,26]. Models labelled in the plots: TTATTC_TATAAT, TTGACT_ATAATG, TTGACA_TATAAT, TTGACT_TAAAAT.]

"Genomic" set: MEME

Escherichia coli

MOTIF 1   width = 30   sites = 111.4
Information content: 12.3 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
AATTTTAAATTGTGATCTAAATCACATATT
Lower levels (fragments; column alignment not preserved):
CGAAGATTTA C AGTGT G ATAA
G G TAGT G

"Genomic" set: MEME

Escherichia coli

MOTIF 2   width = 39   sites = 128.7
Information content: 12.0 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
TAATTAATATACACAATTTTTTTTTTATTTTCATGATTT
Lower levels (fragments; column alignment not preserved):
AC AATTATCTAGTTAAAACAAGAATAAAAT TCAAA
C C CGTA C GG G A C T C C

"Genomic" set: MEME

Escherichia coli

MOTIF 3   width = 12   sites = 181.4
Information content: 6.2 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
CGCCCTGTTTGC
Lower levels (fragments; column alignment not preserved):
T GACTCCGTG
AGG ACT

"Genomic" set: MEME

Bacillus subtilis

MOTIF 1   width = 12   sites = 308.7
Information content: 5.6 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
AAAAAAAGGAGG
Lower levels (fragments; column alignment not preserved):
TGG ACGAA
CT T T

"Genomic" set: MEME

Bacillus subtilis

MOTIF 2   width = 22   sites = 54.4
Information content: 8.3 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
GGCAGCAGCCCGTGCAGAGCGA
Lower levels (fragments; column alignment not preserved):
C T C AAA GAATACCGAG
G T A CTAAC

"Genomic" set: MEME

Bacillus subtilis

MOTIF 3   width = 43   sites = 173.2
Information content: 11.3 bits
[Plot: information content per position, in bits, on a scale from 0.0 to 2.2]

Multilevel consensus sequence:
TTTTTTCATAATTTTTTTTTTTTTCTTTTTTTATTTAATATTT
Lower levels (fragments; column alignment not preserved):
CCCCCAACACCAACCACACACCCTCA ACCTCAACTATAGA
CTT TT CA G C G T C C

"Genomic" set: Combinatorial algorithm (1 box)

           Escherichia coli                 Bacillus subtilis
Family 1   CCTGAC    573   424.60   39     TATGATA    627   407.05   91
           CTGACG    587   439.70   38     TATCATA    615   403.00   84
           CTGACA    701   557.00   36     TATAATAA   445   277.95   58
           TCCTGA    671   538.70   30     TTATTATA   439   273.85   57
           CCCTGA    575   446.80   28     TACTATA    491   325.70   54

Family 2   GTCAGG    576   412.10   47     ATGATAA    617   477.10   36
           TGTCAG    702   555.00   37     ATGAGAA    500   377.15   29
           CATCAG    711   574.60   32     TGAGAAA    520   417.85   19
           CGTCAG    580   443.60   32
           ATCAGG    689   553.30   32

Family 3   TTTTCTG   553   419.20   31     TGACAAA    510   405.20   21
           CTCTTTT   464   348.50   25
           TTTCTGT   469   357.05   23
           TTTTCAG   531   416.40   23
           CTGATTT   498   384.60   23

Family 4   CAGAAAA   539   410.55   29     CCTTTTT    638   413.05   95
           CTGAAAA   525   407.75   24     CCTTTTC    496   291.10   84
           GAGAAAA   460   359.50   19     CTCTTTT    600   391.80   81
           AGATAAA   512   415.60   16     CTTTTCT    613   410.90   77
           GTGAAAA   509   414.75   16     CTTTTTC    652   451.20   76

etc.

"Genomic" set: Combinatorial algorithm (2 boxes)

Escherichia coli, Bacillus subtilis

[Figure, two panels (a, b): Χ2 plotted against the interval of distances between two parts of a model, for intervals from [4,6] to [24,26]. Models labelled in the plots: TTGACA_TATAAT, GAAAAA_TTTTTC, ATTGAC_TATAAT, TTGTGA_TCACAT, TGTGAT_ACATTT, TGTGAT_TCACAT.]

"Noise" in the data

Approaches using a statistical conservation measure
  Do not select an occurrence in a sequence if the score obtained is below a given threshold for all p

Approaches using a deterministic conservation measure
  Quorum

Variable length of the motifs

Approaches using a statistical conservation measure
  Problem: the relative entropy is always positive and can only increase
  Two possible solutions
    Normalize the entropy by the matrix length
    Estimate a "p-value"

Approaches using a deterministic conservation measure
  No problem

Various different families of motifs in the same sequence dataset

Approaches using a statistical conservation measure
  Various matrices are kept

Approaches using a deterministic conservation measure
  No problem (on the contrary)

"Too many" motifs found by the approaches using a deterministic conservation measure

A posteriori statistical evaluation of the motifs found

Careful! Different in general from the statistics employed by Gibbs
  A priori probability of a given motif (word or set of words)
  Same probability but estimated by simulation
  Application of methods such as Gibbs on the motifs initially found by an exhaustive search
  Comparison with what is observed on "counter-example data"

Other constraints

Palindromic or repeated motifs
  Quite a few approaches may consider such motifs

Position in relation to a biological landmark in the sequence
  Some approaches (van Helden et al., NAR, 28:1808-1818, 2000 in particular) take this into account (during the identification step or at printing time)

New approaches

Inference from a set of phylogenetically related sequences ("Phylogenetic footprinting")
  Simple way of constructing a set of molecular sequences that is reduced in size and potentially contains less "noise"
  Motif conservation measures which take into account the phylogeny of the organisms (Blanchette et al., ISMB 2000, http://ismb00.sdsc.edu/technical-program.html)

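Phylogenetic footprinting: main idea

A set of phylogenetically related sequences

[Figure: tree whose nodes are labelled TTCG, ATCG, AACG, ATGG, TTCG, above the sequences ...AACG......AATG... and ...TACG......TTCG...; the edges carry the costs 1, 1, 1, 0, 0, 1, 1, 0; total: 5.]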

Phylogenetic footprinting: a hint of the difficulties

possibly evolutionarily unrelated sequences

"our" motifs (motifs)

[Figure: star-tree over the words TATA, AAAT, AAAT, AATA, AAAT, TAAA.]

"ancestor" of a star-tree (but not the most parsimonious)

Phylogenetic footprinting: a hint of the difficulties

evolutionarily related sequences (orthologs)

the motifs we should seek

motif: "ancestor" of the "true" evolutionary tree (under parsimony) for the species concerned

Phylogenetic footprinting: a hint of the difficulties

evolutionarily related sequences (orthologs)
plus other evolutionarily related (in a different way) sequences (paralogs)
plus evolutionarily unrelated sequences

the motifs we should seek: ????

How to model such "multi-dimensional conservation"?

A special use of phylogenetic footprinting: gene finding by "pure homology"

A very elementary view of a eukaryotic gene

[Figure: a gene drawn from 5' to 3': 5' UTR, start codon, exons separated by introns (each intron starting with GT, the donor site, and ending with AG, the acceptor site), stop codon, 3' UTR. Splicing leads from the gene to the protein.]

3 nucleotides (codon) -> 1 amino acid

5' UTR and 3' UTR: transcribed into RNA but not translated

Gene finding: a few generalities

Detection by signal
  Promoter sequence (very difficult)
  Splicing (donor and acceptor) sites
  PolyA signal

Detection by difference of composition
  The most common: different k-mer counts (often k = 6)

Detection by homology with a "known" (stored in a database) sequence
  (Observe that homology is "stronger" than similarity)
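
Detection by difference of composition rests on counting k-mers. The sketch below shows the counting step and one simple, hypothetical way of turning two count tables into a score (a sum of log-ratios with a pseudocount); the function names and the scoring rule are assumptions for illustration, not a method described in the talk.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer of `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_score(seq, coding_counts, noncoding_counts, k=6, pseudo=1.0):
    """Sum over the k-mers of `seq` of log2(coding frequency / non-coding
    frequency); positive values point towards coding-like composition."""
    n_coding = sum(coding_counts.values())
    n_noncoding = sum(noncoding_counts.values())
    vocabulary = 4 ** k  # number of possible DNA k-mers
    score = 0.0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        f_c = (coding_counts[word] + pseudo) / (n_coding + pseudo * vocabulary)
        f_n = (noncoding_counts[word] + pseudo) / (n_noncoding + pseudo * vocabulary)
        score += math.log2(f_c / f_n)
    return score


counts = kmer_counts("ATGATGATG", k=3)
coding = kmer_counts("ATGGCCATGGCCATGGCC", k=3)
noncoding = kmer_counts("TTTTATTTTATTTTATTT", k=3)
coding_like = composition_score("ATGGCC", coding, noncoding, k=3)
```

In a real setting the two tables would be estimated from large sets of known coding and non-coding sequences of the organism studied.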

A complicated case of gene finding: "Orphan genes"

An orphan gene is a gene for which no homology (in the sense here of "strong enough" similarity) has been detected with the sequences stored in the databases

Main idea around the problem

An orphan may have "parents", that is, other orphans like itself which are its HOMOLOGS (having a common ancestor)
  ORTHOLOGS, possibly more similar
  PARALOGS, possibly having diverged more
  in both cases, having possibly different gene structures (i.e. a different number of exons)

Additional hypothesis (important, but is it always justified?)

Exons are "better conserved" or, more accurately, "differently conserved" than introns or 5' UTR or 3' UTR or intergenic sequence

Flavour of method

Find structure by comparing orphans which are homologs, using as information only the bare essentials (method by "pure homology")

Using a dynamic programming approach with a few twists
  Sequences are composed of coding and noncoding regions
  There are two potential types of "errors"
    Nature's (gaps, substitutions)
    Man's (sequencing reading errors = "frameshifts")

Utopia (Blayo et al., accepted TCS)

Objective

Find the best assembly of exons which satisfies the "bare essentials" gene model, where "best" means the highest scoring assemblage of exons

Why a combinatorial, "bare essentials" type of approach?

It does not substitute for other, statistical in particular, approaches
It cannot (perhaps) even compete with them (it was not meant to)

BUT

It is GENERIC
It allows, indeed obliges us, to think over again our notions of "conservation" and, in particular, of the non-conservation of non-coding regions
It is independent of what can be learned from specific characteristics of known examples of genes

Preliminary applications (1)

13 ADH protein genes of plants (among them Arabidopsis thaliana)
  dicotyledons and monocotyledons
  one paralog and 12 orthologs
  of different gene structures and lengths
    5 to 10 exons
    from 942 bp to 1046 bp

[Figure: X12733 (9 exons). X12733 compared with 11 related sequences with pam120, intronIndel 20. Specif: 97%, Sensit: 98%. Curves: annot, D84240, M36469, M59082, U36586, U53701, U63931, U65972, X02915, X04050, X54106, Z24755. Horizontal axis from 0 to 2500, vertical axis from -4500 to 0.]

Sensitivity and specificity

Sensitivity

$$\text{sensitivity} = \frac{\text{number of correctly predicted items}}{\text{number of actual items}} = \frac{TP}{TP + FN}$$

Specificity

$$\text{specificity} = \frac{\text{number of correctly predicted items}}{\text{number of predicted items}} = \frac{TP}{TP + FP}$$
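
Both measures follow directly from the sets of predicted and actual items. The sketch below assumes the items are nucleotide positions labelled as coding, which is one common choice but is not stated on the slide; the function names are illustrative. Note that specificity as defined here, TP / (TP + FP), is the quantity often called precision or positive predictive value elsewhere.

```python
def confusion(predicted: set, actual: set):
    """Return (TP, FP, FN) for two sets of items."""
    tp = len(predicted & actual)   # predicted and really there
    fp = len(predicted - actual)   # predicted but not there
    fn = len(actual - predicted)   # there but not predicted
    return tp, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the actual items that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Fraction of the predicted items that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# Toy example: annotated exon at positions 100..199, predicted at 110..209.
actual = set(range(100, 200))
predicted = set(range(110, 210))
tp, fp, fn = confusion(predicted, actual)
sens = sensitivity(tp, fn)
spec = specificity(tp, fp)
```

In this toy case the prediction is shifted by ten positions, so ten actual positions are missed and ten predicted positions are wrong, and both measures come out equal.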

Preliminary applications (2)

7 genes from a multigene family in Arabidopsis thaliana
  of unknown function
  going by the name of MYST
  of different gene structures and lengths
    13 to 15 exons
    from 1848 bp to 2040 bp

[Figure: MYST1 (15 exons). MYST1 compared with 6 related sequences with pam120, intronIndel 20. Specif: 93%, Sensit: 97%. Curves: annot, MYST2, MYST3, MYST4, MYST5, MYST6, MYST7. Horizontal axis from 0 to 7000, vertical axis from -7000 to 0.]

[Figure: MYST2 (13 exons). MYST2 compared with 6 related sequences with pam120, intronIndel 20. Specif: 93%, Sensit: 98%. Curves: annot, MYST1, MYST3, MYST4, MYST5, MYST6, MYST7. Horizontal axis from 0 to 3500, vertical axis from -7000 to 0.]

Main idea (currently being explored)

Putting together motif inference and gene detection by multiple comparison