
Combining RNA and Protein selection models

Date post: 07-Jan-2016
Page 1: Combining RNA and Protein selection models

Combining RNA and Protein selection models

The Central Idea in Comparative Molecular Biology & Genomics

Three basic applications

Protein secondary structure

RNA secondary structure

Gene structure

Combining Evolution Constraints

Protein-Protein

RNA-Protein

Combining Structure Descriptions

Page 2: Combining RNA and Protein selection models

Modelling Sequence Evolution

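$P_{i,j}(t)$: a continuous-time Markov chain on the state space {A, C, G, T}.

[Figure: biological setup. Branch times t1 and t2, sequence labels CCA and TGGTTTCGTA, ancestral state a unknown.]

The rates are the limits of the transition probabilities over a short time $\varepsilon$:

$$\lim_{\varepsilon \to 0} \frac{P_{i,j}(\varepsilon)}{\varepsilon} = q_{ij} \quad (i \neq j), \qquad \lim_{\varepsilon \to 0} \frac{P_{i,i}(\varepsilon) - 1}{\varepsilon} = q_{ii}$$

$$P(t) = \exp(tQ) = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \dots = \sum_{i=0}^{\infty} \frac{(tQ)^i}{i!}$$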
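The series for P(t) can be evaluated directly by truncating it. The sketch below is illustrative only: the function names, the number of terms and the rate value `ALPHA` are assumptions, and the rate matrix used is a fully symmetric one (every off-diagonal rate equal) chosen because its exact solution is known.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=60):
    """Approximate P(t) = exp(tQ) by the first `terms` terms of the series."""
    n = len(q)
    tq = [[t * q[i][j] for j in range(n)] for i in range(n)]
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # I
    term = [row[:] for row in p]  # holds (tQ)^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, tq)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return p


ALPHA = 1.0  # hypothetical rate, not taken from the slides
Q = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]
P = transition_matrix(Q, 0.1)

# Closed form for this symmetric matrix, for comparison.
EXACT_SAME = 0.25 * (1 + 3 * math.exp(-4 * ALPHA * 0.1))
```

Each row of `P` should sum to one, and `P[0][0]` should agree with `EXACT_SAME` up to rounding.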

Page 3: Combining RNA and Protein selection models

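Jukes-Cantor 69: Total Symmetry

Rate matrix, R:

                 TO
            A     C     G     T
FROM   A  -3α     α     α     α
       C    α   -3α     α     α
       G    α     α   -3α     α
       T    α     α     α   -3α

Transition probabilities after time t, with a = α*t:

$$P(\text{equal}) = \tfrac{1}{4}\left(1 + 3e^{-4a}\right) \approx 1 - 3a$$

$$P(\text{diff.}) = \tfrac{1}{4}\left(1 - e^{-4a}\right) \approx a \ \text{for each specific different nucleotide, i.e.}\ \tfrac{3}{4}\left(1 - e^{-4a}\right) \approx 3a \ \text{in total}$$

Stationary distribution: (1,1,1,1)/4.

Probability of a pair of sequences that agree at two of five sites:

$$P(s_1, s_2) = \prod_{i} P(s_{1,i})\, P(s_{1,i} \to s_{2,i}) = \left(\tfrac{1}{4}\right)^5 P(T \to T)\,P(C \to G)\,P(G \to G)\,P(G \to T)\,P(A \to T) = \left(\tfrac{1}{4}\right)^5 \left(\tfrac{1}{4}\right)^5 \left(1 + 3e^{-4a}\right)^2 \left(1 - e^{-4a}\right)^3$$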
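As a rough illustration of the Jukes-Cantor calculation, the following sketch computes the probability of an aligned pair of sequences and the value of a that maximises it. The two sequences are illustrative (two matching and three mismatching sites), and the function names are assumptions rather than anything defined on the slides.

```python
import math


def jc69_prob(x, y, a):
    """Jukes-Cantor probability that nucleotide x is y after distance a."""
    e = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * e) if x == y else 0.25 * (1.0 - e)


def pair_likelihood(s1, s2, a):
    """P(s1, s2): stationary probability 1/4 per site times the transition."""
    like = 1.0
    for x, y in zip(s1, s2):
        like *= 0.25 * jc69_prob(x, y, a)
    return like


def jc69_mle(s1, s2):
    """Closed-form maximiser: solve 3/4 (1 - e^{-4a}) = observed mismatch rate."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)  # requires p < 3/4


S1, S2 = "TCGGA", "TGGTT"  # illustrative sequences: 2 equal, 3 different sites
A_HAT = jc69_mle(S1, S2)
```

For these sequences the mismatch fraction is 3/5, so `A_HAT` equals -ln(0.2)/4.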

Page 4: Combining RNA and Protein selection models

Comparison of Evolutionary Objects.

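[Figure: evolutionary objects classified as observable or unobservable, illustrated with a small RNA structure on the bases U, C, G, A.]

$$P(\text{Structure} \mid \text{Sequence}) = \frac{P(\text{Sequence} \mid \text{Structure})\, P(\text{Structure})}{P(\text{Sequence})}$$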

Goldman, Thorne & Jones, 96

Knudsen & Hein, 99

Eddy & others

Pedersen & Hein, 03

Haussler & others

Pedersen, Meyer, Forsberg, Hein,…

Multiple levels of selection

Protein-protein

RNA-protein

Page 5: Combining RNA and Protein selection models

Structure Description: Grammars

Finite Set of Rules Generating Strings

i. A starting symbol.

ii. A set of substitution rules applied to variables in the present string.

A string is finished when it contains no variables.

Regular: Protein secondary structure, Gene structure

Context Free: RNA secondary structure

Page 6: Combining RNA and Protein selection models

Simple String Generators

Terminals (lowercase) --- Non-Terminals (capital)

i. Start with S. S --> aT | bS, T --> aS | bT | (empty)

One sentence (odd number of a's): S --> aT --> aaS --> aabS --> aabaT --> aaba

ii. S --> aSa | bSb | aa | bb

One sentence (even-length palindromes): S --> aSa --> abSba --> abaaba
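The two generators can be turned into membership tests. This is a minimal sketch with hypothetical function names. It assumes that the first grammar lets T terminate (T --> empty), which the derivation aabaT --> aaba requires.

```python
def odd_number_of_as(s):
    """Regular grammar S -> aT | bS, T -> aS | bT | (empty).

    The current variable acts as the state of a two-state automaton:
    reading 'a' switches between S and T, reading 'b' keeps the state.
    """
    state = "S"
    for ch in s:
        if ch not in "ab":
            return False
        if ch == "a":
            state = "T" if state == "S" else "S"
    return state == "T"  # only T can be erased to finish the string


def even_palindrome(s):
    """Context-free grammar S -> aSa | bSb | aa | bb."""
    if len(s) < 2 or s[0] != s[-1] or s[0] not in "ab":
        return False
    if len(s) == 2:
        return True  # rules S -> aa and S -> bb
    return even_palindrome(s[1:-1])  # rules S -> aSa and S -> bSb
```

The regular grammar needs only a fixed amount of memory (the current variable), whereas the palindrome grammar has to match the two ends of the string, which is what makes it context free.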

Page 7: Combining RNA and Protein selection models

Stochastic Grammars

The grammars above classify every string as belonging to the language or not.

Every variable has a finite set of substitution rules. Assigning probabilities to the use of each rule assigns probabilities to the strings in the language.

If a string has a unique derivation, its probability is the product of the probabilities of the applied rules.

i. Start with S. S --> (0.3)aT | (0.7)bS, T --> (0.2)aS | (0.4)bT | (0.2)(empty)

S --> aT --> aaS --> aabS --> aabaT --> aaba

Probability: 0.3 * 0.2 * 0.7 * 0.3 * 0.2 = 0.00252

ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb

S --> aSa --> abSba --> abaaba

Probability: 0.3 * 0.5 * 0.1 = 0.015
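The product rule for a unique derivation can be checked mechanically. In this sketch the dictionary layout and function names are assumptions. The rule probabilities are copied as listed on the slide, including the T rules, which as listed sum to 0.8; the sketch does not renormalise them.

```python
# Rule probabilities keyed by (variable, right-hand side), as listed above.
RULES_I = {
    ("S", "aT"): 0.3, ("S", "bS"): 0.7,
    ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2,
}
RULES_II = {
    ("S", "aSa"): 0.3, ("S", "bSb"): 0.5, ("S", "aa"): 0.1, ("S", "bb"): 0.1,
}


def apply_derivation(start, steps):
    """Rewrite the leftmost occurrence of each variable in turn."""
    s = start
    for variable, rhs in steps:
        s = s.replace(variable, rhs, 1)
    return s


def derivation_probability(rules, steps):
    """Product of the probabilities of the applied rules."""
    prob = 1.0
    for step in steps:
        prob *= rules[step]
    return prob


STEPS_I = [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")]
STEPS_II = [("S", "aSa"), ("S", "bSb"), ("S", "aa")]
```

Here `apply_derivation("S", STEPS_I)` rebuilds the string aaba and `derivation_probability(RULES_I, STEPS_I)` multiplies the five rule probabilities used along the way.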

Page 8: Combining RNA and Protein selection models

Gene Describers

Simple Prokaryotic Genes:

Simple Eukaryotic Genes:

Page 9: Combining RNA and Protein selection models

Secondary Structure Generators

S --> LS (0.869) | L (0.131)

F --> dFd (0.788) | LS (0.212)

L --> s (0.895) | dFd (0.105)

Knudsen & Hein, 99
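To see what kind of structures this grammar generates, one can sample from it. The sketch below is an illustration under assumptions: it writes an unpaired position s as "." and a paired position d...d as "(" and ")", and it uses an explicit stack instead of recursion. The function name is hypothetical.

```python
import random

# Each variable has two rules: (probability of the first rule, first, second).
RULES = {
    "S": (0.869, ["L", "S"], ["L"]),
    "F": (0.788, ["(", "F", ")"], ["L", "S"]),
    "L": (0.895, ["."], ["(", "F", ")"]),
}


def sample_structure(rng):
    """Sample one secondary structure in dot-bracket notation."""
    stack = ["S"]  # symbols still to be expanded, leftmost symbol on top
    out = []
    while stack:
        symbol = stack.pop()
        if symbol in RULES:
            p_first, first, second = RULES[symbol]
            rhs = first if rng.random() < p_first else second
            stack.extend(reversed(rhs))  # keep left-to-right order
        else:
            out.append(symbol)  # terminal: '.', '(' or ')'
    return "".join(out)


structure = sample_structure(random.Random(1))
```

Because F can only end in LS, every stem encloses at least two positions, so the sampler never emits an empty hairpin "()".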

Page 10: Combining RNA and Protein selection models

Structure Dependent Evolution Models

1. Protein Secondary Structure Dependent (Goldman, Thorne & Jones)

α-helix, β-sheet & loop each have their own mutation rate matrix (20×20): R_α, R_β & R_loop

2. RNA Secondary Structure Dependent

i. R singlet,singlet (4×4)

ii. R doublet,doublet (16×16) (base pair conserving relative to R singlet,singlet × R singlet,singlet)

3. Gene Structure Dependent

i. R non-coding {ATG-->GTG}

ii. R coding {ATG-->GTG}

iii. Other structural categories, regulatory signals, ...

Page 11: Combining RNA and Protein selection models

The Genetic Code

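3 classes of sites:

4: all four nucleotides at the site give the same amino acid

2-2: transitions are synonymous, transversions are not

1-1-1-1: every change alters the amino acid

Problems:

i. Not all sites fit into those categories.

ii. A change in one site can change the status of another.

[Figure: genetic code table with annotated examples: i. 4 (3rd) and 1-1-1-1 (3rd); ii. TA (2nd).]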
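The site classes can be read off the standard genetic code. The sketch below is a hedged illustration: the function name and the label "other" for sites that fit none of the three classes are assumptions, and a stop codon is treated as its own category.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def site_class(codon, pos):
    """Classify position pos (0, 1 or 2) of codon as 4, 2-2, 1-1-1-1 or other."""
    groups = {}
    for base in BASES:
        variant = codon[:pos] + base + codon[pos + 1:]
        groups.setdefault(CODE[variant], set()).add(base)
    if len(groups) == 1:
        return "4"
    if len(groups) == 4:
        return "1-1-1-1"
    # 2-2: purines (A, G) code for one amino acid, pyrimidines (C, T) another.
    if sorted(map(sorted, groups.values())) == [["A", "G"], ["C", "T"]]:
        return "2-2"
    return "other"
```

For example, the third position of ATA falls into "other" (ATT, ATC and ATA give Ile, ATG gives Met), which illustrates the first problem listed above.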

Page 12: Combining RNA and Protein selection models

Kimura’s 2 parameter model & Li’s Model.

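[Figure: substitutions away from a start nucleotide.]

Selection on the 3 kinds of sites, as (transition, transversion) rates (α, β):

1-1-1-1   (f*α, f*β)

2-2       (α, f*β)

4         (α, β)

Probabilities, with a = α*t and b = β*t:

$$P(\text{identity}) = \tfrac{1}{4}\left(1 + e^{-4b} + 2e^{-2(a+b)}\right)$$

$$P(\text{transition}) = \tfrac{1}{4}\left(1 + e^{-4b} - 2e^{-2(a+b)}\right)$$

$$P(\text{each transversion}) = \tfrac{1}{4}\left(1 - e^{-4b}\right)$$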
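A small sketch of these probabilities, assuming the standard Kimura two-parameter forms and the scaling of rates by a selection factor f described above. Function names are hypothetical.

```python
import math


def k80_probabilities(a, b):
    """Return (identity, transition, each transversion) for a = alpha*t, b = beta*t."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    identity = 0.25 * (1.0 + e1 + 2.0 * e2)
    transition = 0.25 * (1.0 + e1 - 2.0 * e2)
    transversion = 0.25 * (1.0 - e1)  # one of the two possible transversions
    return identity, transition, transversion


def site_class_probabilities(site_class, a, b, f):
    """Apply the selection factor f according to the class of the site."""
    scaled = {
        "1-1-1-1": (f * a, f * b),  # every change is selected against
        "2-2": (a, f * b),          # only transversions are selected against
        "4": (a, b),                # no change is selected against
    }[site_class]
    return k80_probabilities(*scaled)
```

The identity, the transition and the two transversions are the four possible outcomes, so `identity + transition + 2 * transversion` equals one, and with a equal to b the formulas reduce to the Jukes-Cantor case.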

Page 13: Combining RNA and Protein selection models

Sites     Total   Conserved     Transitions   Transversions
1-1-1-1   274     246 (.8978)   12 (.0438)    16 (.0584)
2-2       77      51 (.6623)    21 (.2727)    5 (.0649)
4         78      47 (.6026)    16 (.2051)    15 (.1923)

With a = α*t and b = β*t:

$$X(a, b) = \tfrac{1}{4}\left[1 + e^{-4b} + 2e^{-2(a+b)}\right] \quad \text{(identity)}$$

$$Y(a, b) = \tfrac{1}{4}\left[1 + e^{-4b} - 2e^{-2(a+b)}\right] \quad \text{(transition)}$$

$$Z(a, b) = \tfrac{1}{2}\left[1 - e^{-4b}\right] \quad \text{(transversion, either of the two)}$$

$$L(\text{observations}, a, b, f) = C(429; 274, 77, 78) \cdot \{X(af, bf)^{246}\, Y(af, bf)^{12}\, Z(af, bf)^{16}\} \cdot \{X(a, bf)^{51}\, Y(a, bf)^{21}\, Z(a, bf)^{5}\} \cdot \{X(a, b)^{47}\, Y(a, b)^{16}\, Z(a, b)^{15}\}$$

Estimated parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

          Transitions     Transversions
1-1-1-1   a*f = 0.0500    2*b*f = 0.0622
2-2       a = 0.3004      2*b*f = 0.0622
4         a = 0.3004      2*b = 0.3741

Expected number of replacement substitutions: 35.49; synonymous: 75.93

Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72

Silent sites: 429 - 314.72 = 114.28

Ks = .6644, Ka = .1127

alpha-globin from rabbit and mouse:

Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile
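The likelihood can be evaluated on the log scale at the reported estimates. This sketch rests on assumptions: it uses the standard Kimura two-parameter forms, takes Y as the transition probability and Z as the total transversion probability (as the exponents in the likelihood suggest), and drops the combinatorial constant C, which does not depend on the parameters. Ks and Ka are simply recomputed from the expected counts and site numbers quoted on the slide.

```python
import math

# (conserved, transitions, transversions) per site class, from the table.
COUNTS = {"1-1-1-1": (246, 12, 16), "2-2": (51, 21, 5), "4": (47, 16, 15)}


def xyz(a, b):
    """Identity, transition and total transversion probabilities."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    return (0.25 * (1 + e1 + 2 * e2),
            0.25 * (1 + e1 - 2 * e2),
            0.5 * (1 - e1))


def log_likelihood(a, b, f):
    """Log-likelihood of the counts, without the constant C."""
    scaled = {"1-1-1-1": (a * f, b * f), "2-2": (a, b * f), "4": (a, b)}
    total = 0.0
    for site_class, (conserved, ts, tv) in COUNTS.items():
        x, y, z = xyz(*scaled[site_class])
        total += conserved * math.log(x) + ts * math.log(y) + tv * math.log(z)
    return total


A, B, F = 0.3003, 0.1871, 0.1663  # estimates reported on the slide
LOGLIK = log_likelihood(A, B, F)

# Ks and Ka from the expected substitution counts and site numbers above.
KS = 75.93 / 114.28
KA = 35.49 / 314.72
```

At these estimates the predicted proportions for the 1-1-1-1 class, `xyz(A * F, B * F)`, come out close to the observed .8978, .0438 and .0584.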

Page 14: Combining RNA and Protein selection models

Three Questions

What is the probability of the data?

What is the most probable "hidden" configuration?

What is the probability of a specific "hidden" state?

HMM/Stochastic Regular Grammar:

[Figure: hidden states H1, H2, H3 over the observations O1 ... O10.]

Stochastic Context Free Grammars:

[Figure: a variable W spanning positions i to j of a sequence of length L, split into WL and WR at positions i' and j'.]
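For the HMM case, the three questions correspond to the forward algorithm, the Viterbi algorithm and forward-backward posteriors. The sketch below uses a made-up two-state model; every probability, state name and function name in it is hypothetical and only serves to make the example runnable.

```python
from itertools import product

STATES = ("H1", "H2")  # hypothetical hidden states
INIT = {"H1": 0.6, "H2": 0.4}
TRANS = {"H1": {"H1": 0.7, "H2": 0.3}, "H2": {"H1": 0.4, "H2": 0.6}}
EMIT = {"H1": {"a": 0.9, "b": 0.1}, "H2": {"a": 0.2, "b": 0.8}}


def forward(obs):
    """f[k][s] = P(obs[0..k], state at k is s)."""
    f = [{s: INIT[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][o] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f


def backward(obs):
    """b[k][s] = P(obs[k+1..] | state at k is s)."""
    b = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][o] * nxt[r] for r in STATES)
                     for s in STATES})
    return b


def data_probability(obs):
    """Question 1: probability of the data."""
    return sum(forward(obs)[-1].values())


def viterbi(obs):
    """Question 2: (probability, path) of the most probable hidden path."""
    v = {s: (INIT[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for o in obs[1:]:
        cur = {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r][0] * TRANS[r][s])
            cur[s] = (v[best][0] * TRANS[best][s] * EMIT[s][o], v[best][1] + [s])
        v = cur
    return max(v.values(), key=lambda pair: pair[0])


def posterior(obs, k, state):
    """Question 3: P(state at position k | data)."""
    return forward(obs)[k][state] * backward(obs)[k][state] / data_probability(obs)


def joint(obs, path):
    """P(data, path), used for brute-force checks on short sequences."""
    p = INIT[path[0]] * EMIT[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
    return p


def brute_force_probability(obs):
    """Sum P(data, path) over every hidden path."""
    return sum(joint(obs, path) for path in product(STATES, repeat=len(obs)))
```

On a short sequence the dynamic-programming answers can be compared with explicit enumeration of all hidden paths, which is what `brute_force_probability` is for. The stochastic context free case replaces these recursions over positions with recursions over subsequences i..j.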

Page 15: Combining RNA and Protein selection models

Comparative Gene Finding

Jakob Skou Pedersen & Hein, 2004

Page 16: Combining RNA and Protein selection models

Knudsen & Hein, 99

Page 17: Combining RNA and Protein selection models

From Knudsen & Hein (1999)

Page 18: Combining RNA and Protein selection models

Knudsen and Hein, 2003

Page 19: Combining RNA and Protein selection models

Why combine RNA & Protein Models?

Short Term/Long Term Evolution Discrepancies

Separating Selective Effects

Analyzing one level without interference from the other level

Predicting gene structure and RNA structure better.

Annotation of Viral Genomes

Page 20: Combining RNA and Protein selection models

Combining Levels of Selection.

Protein-Protein

Hein & Støvlbæk, 1995: Codon Nucleotide Independence Heuristic

Assume multiplicativity: fA,B = fA*fB

Jensen & Pedersen, 2001: Contagious Dependence

Protein-RNA

Doublets / Singlet

Contagious Dependence

Page 21: Combining RNA and Protein selection models

Overlapping Coding Regions

Hein & Stoevlbaek, 95

(Transition, transversion) rates for a site, by its class in the 1st reading frame (columns, selection factor f1) and in the 2nd reading frame (rows, selection factor f2):

            1-1-1-1           2-2             4
1-1-1-1     (f1f2a, f1f2b)    (f2a, f1f2b)    (f2a, f2b)
2-2         (f1a, f1f2b)      (a, f1f2b)      (a, f2b)
4           (f1a, f1b)        (a, f1b)        (a, b)

Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.

Example: Gag & Pol from HIV

Site counts by class in the two reading frames (Gag and Pol):

            1-1-1-1   2-2   4
1-1-1-1     64        31    34
2-2         40        7     0
4           27        2     0

MLE: a=.084 b= .024 a+2b=.133 fgag=.403 fpol=.229
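The rate table follows from one rule applied once per reading frame. The sketch below assumes that rule: a 1-1-1-1 site scales both rates by that frame's selection factor, a 2-2 site scales only the transversion rate, and a 4 site scales neither. The function name and argument order are hypothetical.

```python
def overlap_rates(class1, class2, a, b, f1, f2):
    """(transition, transversion) rates for a site in two overlapping frames.

    class1 and class2 are the site classes in the first and second frame;
    f1 and f2 are the selection factors of the two genes.
    """
    ts, tv = a, b
    for site_class, f in ((class1, f1), (class2, f2)):
        if site_class == "1-1-1-1":
            ts *= f  # transitions change the amino acid
            tv *= f  # transversions change the amino acid
        elif site_class == "2-2":
            tv *= f  # only transversions change the amino acid
        elif site_class != "4":
            raise ValueError("unknown site class: " + site_class)
    return ts, tv


# Example call with the estimates quoted above; which gene plays the role
# of the first frame is an assumption made only for illustration.
A, B, F_GAG, F_POL = 0.084, 0.024, 0.403, 0.229
rates = overlap_rates("1-1-1-1", "2-2", A, B, F_GAG, F_POL)
```

With both frames at a 2-2 site the transition rate stays at a, while the transversion rate is scaled by both factors.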

Page 22: Combining RNA and Protein selection models

HIV2 Analysis

Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t     β*t     A       C       G       T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003

Selection Factors

GAG   0.385 (s.d. 0.030)
POL   0.220 (s.d. 0.017)
VIF   0.407 (s.d. 0.035)
VPR   0.494 (s.d. 0.044)
TAT   1.229 (s.d. 0.104)
REV   0.596 (s.d. 0.052)
VPU   0.902 (s.d. 0.079)
ENV   0.889 (s.d. 0.051)
NEF   0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194

Page 23: Combining RNA and Protein selection models

Evolution under double constraints

Codon Nucleotide Independence Heuristic

Singlet: $R_{i,j} = f \cdot q_{i,j}$

Doublet: $R_{(i_1,i_2),(j_1,j_2)} = f_1 \cdot f_2 \cdot q_{(i_1,i_2),(j_1,j_2)}$

Page 24: Combining RNA and Protein selection models

Structure Prediction: Hepatitis C Analysis

[Figure: predicted RNA secondary structure for Hepatitis C, drawn as stem-loops of paired bases (G-C, A-U, G-U); the layout of the drawing is not recoverable from the transcript.]

Page 25: Combining RNA and Protein selection models

Evolution Models: A hierarchy of hypotheses

[Figure: codon positions 1 2 3 repeated along the sequence, with doublet and singlet positions marked.]

Model   Codon factors        ts/tv ratio                   Doublet/singlet ratio   Duplet distortion   # parameters   Likelihood
0       --                   ts/tv=2.00                                            -                   4              L = 1.0531 × 10^-25927
1       --                                                 0.173                   -                   5              L = 2.0596 × 10^-25797
2       (f1:0.24,f2:0.14)                                  0.415                   -                   7              L = 1.3104 × 10^-21569
3       (f1:0.24,f2:0.14)    3 (ts/tv)=1.50,1.26,3.05      0.415                   -                   9              L = 2.5006 × 10^-21513
4       (f1:0.24,f2:0.14)    3 (ts/tv, equil.)             0.414                   -                   15             L = 4.5739 × 10^-21484
5       (f1:0.24,f2:0.14)    3 (ts/tv, equil.)             0.292                   +                   17             L = 2.1155 × 10^-21473
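One common way to compare models in such a hierarchy is a likelihood ratio statistic, twice the difference in log-likelihood, set against the number of extra parameters. The sketch below assumes that the six models are nested in the order listed and works on the log scale, since the likelihoods themselves are far too small for floating point. Names are hypothetical.

```python
import math

# (number of parameters, mantissa, base-10 exponent) for models 0 to 5.
MODELS = [
    (4, 1.0531, -25927),
    (5, 2.0596, -25797),
    (7, 1.3104, -21569),
    (9, 2.5006, -21513),
    (15, 4.5739, -21484),
    (17, 2.1155, -21473),
]


def log_likelihood(mantissa, exponent):
    """Natural log of mantissa * 10**exponent, without forming the number."""
    return (math.log10(mantissa) + exponent) * math.log(10.0)


def lrt(simple, rich):
    """Return (2 * log-likelihood difference, extra parameters)."""
    stat = 2.0 * (log_likelihood(*rich[1:]) - log_likelihood(*simple[1:]))
    return stat, rich[0] - simple[0]


COMPARISONS = [lrt(MODELS[i], MODELS[i + 1]) for i in range(len(MODELS) - 1)]
```

Each entry of `COMPARISONS` pairs the statistic with the number of added parameters; the first step (model 0 to model 1) adds one parameter for a statistic of roughly 600.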

Page 26: Combining RNA and Protein selection models

Combined RNA & Protein Structure

Gene Structure Fixed, RNA Structure Stochastic

Presently being implemented with viral analysis in mind.

Both RNA & Gene Structure Stochastic

Would imply gene finding as well.

A grammar for overlapping genes would be a new phenomenon.

Gene Structure Stochastic, RNA Structure Fixed

An atypical situation.

A challenge for the future: structure evolution.

Page 27: Combining RNA and Protein selection models

Open Problems

Combining with Alignment

Stacking Substitution Models

Other Sets of Constraints: Regulatory Signals

In principle a 4^4 × 4^4 matrix (65,536 entries!) is needed, but proper parametrisation and symmetries could reduce this substantially.

[Figure: stacked nucleotides N1, N2, N3, N4.]

Page 28: Combining RNA and Protein selection models

References.

Hein, J. & J. Stoevlbaek (1995) “A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames” J.Mol.Evol. 40.181-189.

Jensen, JL & Pedersen (2001) “Probabilistic models of DNA sequence evolution with context dependent rates of substitution” Adv.Appl.Prob. 32.499-517.

Katz and Burge (2003) “Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes” Genome Research 13.2042-51.

Kirby, AK, SV Muse & W. Stephan (1995) “Maintenance of pre-mRNA secondary structure by epistatic selection” PNAS 92.9047-51.

Knudsen and Hein (1999) “Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution” Bioinformatics 15.6.446-454.

Knudsen and Hein (2003) “Pfold: RNA secondary structure prediction using stochastic context-free grammars” Nucleic Acids Research 31.13.3423-28.

New Influenza gene article???

Meyer and Durbin (2002) “Comparative Ab Initio prediction of Gene Structure using pair HMMs” Bioinformatics 18.10.1309-18.

Moulton, V., Zuker, M., Steel, M., Penny, D. and Pointon, R. (2000) “Metrics on RNA Structures” J. Computational Biology 7.1.277-292.

Pedersen, AMK & JL Jensen (2001) “A Dependent-Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames” Mol.Biol.Evol. 18.5.763-76.

Pedersen, JS & J. Hein (2003) “Gene finding with a Hidden Markov Model of genome structure and evolution” Bioinformatics.

Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “An evolutionary model for protein coding regions with RNA secondary structure” Manuscript in preparation.

Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “Structure Models” Manuscript in preparation.

Schadt, E. & K. Lange (2002) “Codon and Rate Variation Models in Molecular Phylogeny” Mol.Biol.Evol. 19.9.1534-49.

Savill, NJ et al. (2001) “RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods” Genetics 157.399-411.

Simmonds, P. and DB Smith (July 1999) “Structural Constraints on RNA Virus Evolution” J. of Virology 5787-94.

Tillier, ERM & RA Collins (1998) “High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA” Genetics 149.1993-2001.

Yang, Z. et al. (1995) “Molecular Evolution of the Hepatitis B Virus Genome” J.Mol.Evol. 41.587-96.

Page 29: Combining RNA and Protein selection models

Acknowledgements

1. Comparative RNA Structure - Bjarne Knudsen

2. Comparative Gene Structure - Jakob Skou Pedersen

3. Integrating Levels of Selection & Structure:

Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg

[Photos: Irmtraud Meyer, Roald Forsberg, Jakob Skou Pedersen, Bjarne Knudsen]

