Combining RNA and Protein selection models

The Central Idea in Comparative Molecular Biology & Genomics

Three basic applications

Protein secondary structure

RNA secondary structure

Gene structure

Combining Evolution Constraints

Protein-Protein

RNA-Protein

Combining Structure Descriptions

Modelling Sequence Evolution

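P_{i,j}(t): continuous-time Markov chain on the state space {A, C, G, T}.

\lim_{\epsilon \to 0} \frac{P_{i,j}(\epsilon)}{\epsilon} = q_{i,j} \quad (i \neq j), \qquad \lim_{\epsilon \to 0} \frac{P_{i,i}(\epsilon) - 1}{\epsilon} = q_{i,i}

[Figure: sequences CCA and TGGTTTCGTA, with times t1 and t2.]

a - unknown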

Biological setup

P(t) = \exp(tQ) = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \dots = \sum_{i=0}^{\infty} \frac{(tQ)^i}{i!}
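The series for P(t) can be checked numerically. The following is a minimal sketch, assuming a Jukes-Cantor rate matrix with an illustrative rate ALPHA = 0.1 and time t = 2.0 (both values are hypothetical, not taken from the slides); it sums a truncated series and can be compared with the closed-form Jukes-Cantor probability.

```python
import math

def mat_mul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, terms=60):
    """Approximate P(t) = exp(tQ) by the first `terms` terms of the series."""
    n = len(q)
    tq = [[t * x for x in row] for row in q]
    # term holds (tQ)^i / i!, starting with the identity for i = 0
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for i in range(1, terms):
        term = [[x / i for x in row] for row in mat_mul(term, tq)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

# Hypothetical Jukes-Cantor rate matrix: every off-diagonal rate equals ALPHA.
ALPHA = 0.1
Q_JC = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]
P = transition_matrix(Q_JC, t=2.0)

# Closed form for comparison, with a = ALPHA * t
closed_form_same = 0.25 * (1 + 3 * math.exp(-4 * ALPHA * 2.0))
```

Each row of P should sum to one, and P[0][0] should agree with closed_form_same up to truncation error.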

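Jukes-Cantor 69: Total Symmetry

Rate matrix R (rows: FROM, columns: TO; states A, C, G, T): all off-diagonal rates equal \alpha.

Transition probabilities after time t, with a = \alpha t:

P(equal) = \frac{1}{4}(1 + 3e^{-4a}) \approx 1 - 3a, \qquad P(diff.) = \frac{3}{4}(1 - e^{-4a}) \approx 3a

Stationary distribution: (1, 1, 1, 1)/4.

Probability of a pair of aligned sequences s1, s2 of length n with k conserved sites:

P(s1, s2) = \prod_{i} P(s1_i) P_{s1_i \to s2_i}(a) = \left(\frac{1}{4}\right)^{n} \left(\frac{1}{4}\right)^{n} (1 + 3e^{-4a})^{k} (1 - e^{-4a})^{n-k}

[The numerical exponents for the example sequences are not recoverable from the extracted slide.]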
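As an illustration of the pairwise Jukes-Cantor likelihood, here is a small sketch. The first sequence is the one shown on the slide; the second sequence is a hypothetical partner invented for this example, and the function names are likewise illustrative. The closed-form estimate follows from setting the expected proportion of differing sites, 3/4 (1 - e^{-4a}), equal to the observed proportion.

```python
import math

def jc69_log_likelihood(s1, s2, a):
    """Log-probability of an ungapped pair under Jukes-Cantor, a = alpha * t."""
    same = 0.25 * (1 + 3 * math.exp(-4 * a))  # P(x -> x)
    diff = 0.25 * (1 - math.exp(-4 * a))      # P(x -> y) for one specific y != x
    ll = 0.0
    for x, y in zip(s1, s2):
        # stationary probability 1/4 for the first base, then the substitution
        ll += math.log(0.25) + math.log(same if x == y else diff)
    return ll

def jc69_mle_distance(s1, s2):
    """Closed-form maximum-likelihood estimate of a (needs fewer than 75% differences)."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    return -0.25 * math.log(1 - 4 * p / 3)

S1 = "TGGTTTCGTA"   # sequence from the slide
S2 = "TCGTATCGTA"   # hypothetical second sequence (2 differences), for illustration only
a_hat = jc69_mle_distance(S1, S2)
ll_hat = jc69_log_likelihood(S1, S2, a_hat)
```

Evaluating jc69_log_likelihood slightly above and below a_hat gives lower values, which is a quick sanity check that the closed form is the maximiser.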

Comparison of Evolutionary Objects.

[Figure: objects classified as Observable or Unobservable, illustrated with an RNA sequence and its structure.]

P(Structure | Sequence) = \frac{P(Sequence | Structure) \, P(Structure)}{P(Sequence)}

Goldman, Thorne & Jones, 96

Knudsen & Hein, 99

Eddy & others

Pedersen & Hein, 03

Haussler & others

Pedersen, Meyer, Forsberg, Hein,…

Multiple levels of selection

Protein-protein

RNA-protein

Structure Description: Grammars

Finite Set of Rules Generating Strings

i. A starting symbol.

ii. A set of substitution rules applied to the variables in the present string.

A string is finished when it contains no variables.

Regular: protein secondary structure, gene structure.

Context free: RNA secondary structure.

Simple String Generators

Terminals (small) --- Non-Terminals (capital)

i. Start with S. S --> aT | bS    T --> aS | bT | ε

One sentence (odd number of a's): S --> aT --> aaS --> aabS --> aabaT --> aaba

ii. S --> aSa | bSb | aa | bb

One sentence (even-length palindromes): S --> aSa --> abSba --> abaaba
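The two example grammars can be turned into membership tests. This is a sketch under the assumption that grammar i also allows T to terminate (T --> empty), as its example derivation suggests; the function names are illustrative.

```python
def odd_number_of_as(s):
    """Grammar i (regular): track the current variable S or T while reading s."""
    state = "S"
    for ch in s:
        if ch not in "ab":
            return False
        if ch == "a":                 # rules S --> aT and T --> aS swap the variable
            state = "T" if state == "S" else "S"
        # rules S --> bS and T --> bT keep the variable unchanged
    return state == "T"               # only T may be erased to finish the string

def even_palindrome(s):
    """Grammar ii (context free): S --> aSa | bSb | aa | bb."""
    if len(s) < 2 or len(s) % 2 == 1:
        return False
    if s[0] != s[-1] or s[0] not in "ab":
        return False
    return len(s) == 2 or even_palindrome(s[1:-1])

examples = {
    "aaba": odd_number_of_as("aaba"),
    "abaaba": even_palindrome("abaaba"),
}
```

The regular grammar needs only a fixed amount of memory (the current variable), whereas the palindrome grammar has to match the two ends of the string, which is the nesting that also appears in RNA base pairing.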

Stochastic Grammars

The grammars above classify every string as belonging to the language or not.

Every variable has a finite set of substitution rules. Assigning a probability to each rule assigns probabilities to the strings in the language.

If a string has a unique derivation, its probability is the product of the probabilities of the applied rules.

i. Start with S. S --> (0.3)aT | (0.7)bS    T --> (0.2)aS | (0.4)bT | (0.2)ε

S --> aT --> aaS --> aabS --> aabaT --> aaba, with probability 0.3 * 0.2 * 0.7 * 0.3 * 0.2

ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb

S --> aSa --> abSba --> abaaba, with probability 0.3 * 0.5 * 0.1
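The product rule for derivation probabilities can be written out directly. The sketch below uses the rule probabilities as listed on the slide (note that the probabilities listed for T add up to 0.8 there, which is kept as is); the dictionary layout and function name are assumptions made for this example.

```python
from math import prod

# Rule probabilities copied from the slide; "" stands for the empty right-hand side.
RULES_I = {("S", "aT"): 0.3, ("S", "bS"): 0.7,
           ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2}
RULES_II = {("S", "aSa"): 0.3, ("S", "bSb"): 0.5,
            ("S", "aa"): 0.1, ("S", "bb"): 0.1}

def derivation_probability(rules, steps):
    """Probability of a derivation = product of the probabilities of its rules."""
    return prod(rules[step] for step in steps)

# S -> aT -> aaS -> aabS -> aabaT -> aaba
p_aaba = derivation_probability(
    RULES_I, [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")])

# S -> aSa -> abSba -> abaaba
p_abaaba = derivation_probability(
    RULES_II, [("S", "aSa"), ("S", "bSb"), ("S", "aa")])
```

For an ambiguous grammar the probability of a string would instead be the sum over all of its derivations.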

Gene Describers

Simple Prokaryotic Genes: [figure not preserved]

Simple Eukaryotic Genes: [figure not preserved]

Secondary Structure Generators

S --> LS  | L     (.869, .131)
F --> dFd | LS    (.788, .212)
L --> s   | dFd   (.895, .105)

Knudsen & Hein, 99
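To see what kind of structures the secondary structure grammar generates, one can sample from it. This is an illustrative sketch, assuming that s denotes an unpaired position (written ".") and that the two d's in dFd denote a base pair (written "(" and ")"); the dot-bracket output and the function names are choices made for this example, not part of the slides.

```python
import random

def sample_structure(rng):
    """Sample one dot-bracket string from S-->LS|L, F-->dFd|LS, L-->s|dFd."""
    out = []
    stack = ["S"]                     # symbols still to expand; leftmost symbol on top
    while stack:
        sym = stack.pop()
        if sym in "().":
            out.append(sym)           # terminal symbol: emit it
        elif sym == "S":
            if rng.random() < 0.869:
                stack.extend(["S", "L"])        # S --> LS (L is expanded first)
            else:
                stack.append("L")               # S --> L
        elif sym == "F":
            if rng.random() < 0.788:
                stack.extend([")", "F", "("])   # F --> dFd: extend the stem
            else:
                stack.extend(["S", "L"])        # F --> LS: stem ends, loop begins
        else:  # sym == "L"
            if rng.random() < 0.895:
                stack.append(".")               # L --> s: unpaired position
            else:
                stack.extend([")", "F", "("])   # L --> dFd: start a new stem
    return "".join(out)

def is_balanced(structure):
    """Check that every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in structure:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

sample = sample_structure(random.Random(1))
```

An explicit stack is used instead of recursion so that long samples do not hit Python's recursion limit.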

Structure Dependent Evolution Models

1. Protein Secondary Structure Dependent (Goldman, Thorne & Jones)

α-helix, β-sheet & loop each have their own mutation rate matrix (20 x 20): R_α, R_β & R_loop

2. RNA Secondary Structure Dependent

i. R singlet, singlet (4 x 4)

ii. R doublet, doublet (16 x 16) (base pair conserving relative to R singlet, singlet x R singlet, singlet)

3. Gene Structure Dependent

i. R non-coding {ATG --> GTG}

ii. R coding {ATG --> GTG}

iii. Other structural categories, regulatory signals, ...

The Genetic Code

3 classes of sites:

4

2-2

1-1-1-1

Problems:

i. Not all sites fit into those categories.

ii. A change in one site can change the status of another.

[Annotations on the code table: i. 4 (3rd), 1-1-1-1 (3rd); ii. TA (2nd)]

Kimura’s 2 parameter model & Li’s Model.

Selection on the 3 kinds of sites, rates (transition, transversion):

1-1-1-1    (f*α, f*β)
2-2        (α, f*β)
4          (α, β)

Probabilities, with a = αt and b = βt:

transition:    \frac{1}{4}(1 + e^{-4b} - 2e^{-2(a+b)})
transversion:  \frac{1}{4}(1 - e^{-4b})
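The Kimura two-parameter probabilities and the three site classes can be combined in a few lines. This is a sketch under the standard parametrisation (transition rate α, rate β to each of the two transversion targets, a = αt, b = βt); the numerical values of a, b and f below are illustrative only, and the function names are hypothetical.

```python
import math

def k80_probabilities(a, b):
    """Return (identity, transition, one specific transversion) probabilities."""
    e4b = math.exp(-4 * b)
    e2ab = math.exp(-2 * (a + b))
    identity = 0.25 * (1 + e4b + 2 * e2ab)
    transition = 0.25 * (1 + e4b - 2 * e2ab)
    transversion = 0.25 * (1 - e4b)          # for each of the two transversion targets
    return identity, transition, transversion

def site_class_probabilities(a, b, f):
    """Apply the selection factor f to the rates of each class of codon site."""
    scaled_rates = {
        "1-1-1-1": (f * a, f * b),   # every change is a replacement
        "2-2": (a, f * b),           # only transversions are replacements
        "4": (a, b),                 # every change is silent
    }
    return {cls: k80_probabilities(*rates) for cls, rates in scaled_rates.items()}

# Illustrative values only
probs = site_class_probabilities(a=0.30, b=0.19, f=0.17)
```

For every class the identity probability, the transition probability and twice the transversion probability add up to one.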

Sites      Total   Conserved      Transitions   Transversions
1-1-1-1    274     246 (.8978)    12 (.0438)    16 (.0584)
2-2         77      51 (.6623)    21 (.2727)     5 (.0649)
4           78      47 (.6026)    16 (.2051)    15 (.1923)

Z(αt,βt) = .50[1 + exp(-2βt) - 2exp(-t(α+β))]  (transition)
Y(αt,βt) = .25[1 - exp(-2βt)]  (transversion)
X(αt,βt) = .25[1 + exp(-2βt) + 2exp(-t(α+β))]  (identity)

L(observations, a, b, f) = C(429; 274, 77, 78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}

where a = αt and b = βt.

Estimated parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

           Transitions     Transversions
1-1-1-1    a*f = 0.0500    2*b*f = 0.0622
2-2        a   = 0.3004    2*b*f = 0.0622
4          a   = 0.3004    2*b   = 0.3741

Expected number of substitutions: replacement 35.49, synonymous 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644    Ka = .1127
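The final step, from expected substitution counts and site counts to Ks and Ka, is plain arithmetic. The sketch below simply re-traces it with the numbers stated on the slide; the variable names are chosen for this example.

```python
# Values copied from the slide
replacement_substitutions = 35.49
synonymous_substitutions = 75.93
total_sites = 429
replacement_sites = 314.72

silent_sites = total_sites - replacement_sites          # 114.28
ks = synonymous_substitutions / silent_sites            # substitutions per silent site
ka = replacement_substitutions / replacement_sites      # substitutions per replacement site
ka_over_ks = ka / ks
```

The ratio ka_over_ks is well below one here, which is the usual signature of selection against amino acid replacements.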

alpha-globin from rabbit and mouse.

Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

Three Questions

What is the probability of the data?

What is the most probable ”hidden” configuration?

What is the probability of specific ”hidden” state?
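For a hidden Markov model the three questions are answered by the forward algorithm, the Viterbi algorithm and the forward-backward posterior respectively. The following is a minimal sketch on a toy two-state model; the states, all probabilities and the observed sequence are hypothetical and only serve to make the example runnable.

```python
STATES = ("H1", "H2")
START = {"H1": 0.5, "H2": 0.5}
TRANS = {"H1": {"H1": 0.9, "H2": 0.1}, "H2": {"H1": 0.2, "H2": 0.8}}
EMIT = {"H1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "H2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def forward(obs):
    """f[i][s] = P(obs[0..i], state at i is s)."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][o] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f

def backward(obs):
    """b[i][s] = P(obs[i+1..] | state at i is s)."""
    b = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][o] * nxt[r] for r in STATES)
                     for s in STATES})
    return b

def viterbi(obs):
    """Most probable hidden path."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    pointers = []
    for o in obs[1:]:
        ptr, new_v = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            ptr[s] = best
            new_v[s] = v[best] * TRANS[best][s] * EMIT[s][o]
        pointers.append(ptr)
        v = new_v
    path = [max(STATES, key=lambda s: v[s])]
    for ptr in reversed(pointers):       # trace the best path backwards
        path.append(ptr[path[-1]])
    return path[::-1]

obs = "AATCGGCGTA"
f, b = forward(obs), backward(obs)
p_data = sum(f[-1].values())                                  # question 1
best_path = viterbi(obs)                                      # question 2
posterior = [{s: f[i][s] * b[i][s] / p_data for s in STATES}  # question 3
             for i in range(len(obs))]
```

For stochastic context-free grammars the analogous roles are played by the inside algorithm, the CYK algorithm and the inside-outside algorithm.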

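HMM/Stochastic Regular Grammar:

[Figure: hidden states H1, H2, H3 above observations O1 ... O10.]

Stochastic Context Free Grammars:

[Figure: variable W spanning positions i to j of a sequence 1 ... L, split into WL and WR at i’, j’.]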

Comparative Gene Finding

Jakob Skou Pedersen & Hein, 2004

Knudsen & Hein, 99

From Knudsen & Hein (1999)

Knudsen and Hein, 2003

Why combine RNA & Protein Models?

Short Term/Long Term Evolution Discrepancies

Separating Selective Effects

Analyzing one level without interference from the other level

Predicting gene structure and RNA structure better.

Annotation of Viral Genomes

Combining Levels of Selection.

Protein-Protein

Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001

Contagious Dependence

Assume multiplicativity: fA,B = fA*fB

Protein-RNA

Doublets / Singlet

Contagious Dependence

Overlapping Coding Regions

Hein & Stoevlbaek, 95

Rates (transition, transversion) by site class in the 1st and 2nd reading frame:

                  1st: 1-1-1-1      1st: 2-2        1st: 4
2nd: 1-1-1-1      (f1f2a, f1f2b)    (f2a, f1f2b)    (f2a, f2b)
2nd: 2-2          (f1a, f1f2b)      (a, f1f2b)      (a, f2b)
2nd: 4            (f1a, f1b)        (a, f1b)        (a, b)
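The rates for overlapping reading frames follow from applying the single-gene rule once per frame and multiplying the factors. Below is a sketch of that rule, assuming that in frame k a transition is scaled by f_k only at 1-1-1-1 sites and a transversion at 1-1-1-1 and 2-2 sites; the function name and the numerical values are illustrative.

```python
def overlap_rates(class_1st, class_2nd, a, b, f1, f2):
    """Return (transition rate, transversion rate) for a site in two reading frames."""
    ts, tv = a, b
    for site_class, f in ((class_1st, f1), (class_2nd, f2)):
        if site_class == "1-1-1-1":
            ts *= f          # every change alters the amino acid in this frame
            tv *= f
        elif site_class == "2-2":
            tv *= f          # only transversions alter the amino acid
        elif site_class != "4":
            raise ValueError("unknown site class: " + site_class)
    return ts, tv

# Illustrative parameter values
a, b, f1, f2 = 0.084, 0.024, 0.403, 0.229
classes = ("1-1-1-1", "2-2", "4")
rate_table = {(c1, c2): overlap_rates(c1, c2, a, b, f1, f2)
              for c1 in classes for c2 in classes}
```

Under this multiplicative assumption a site that is 4-fold degenerate in both frames evolves at the neutral rates (a, b), and a site that is 1-1-1-1 in both frames is slowed by f1*f2.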

Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.

Example: Gag & Pol from HIV

Site counts by class in gag and pol (rows and columns as on the slide):

             1-1-1-1    2-2    4
1-1-1-1      64         31     34
2-2          40          7      0
4            27          2      0

MLE: a = .084, b = .024, a + 2b = .133, f_gag = .403, f_pol = .229

HIV2 Analysis

Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t      β*t      A        C        G        T
0.350    0.105    0.361    0.181    0.236    0.222
0.015    0.005    0.004    0.003    0.003

Selection Factors

GAG    0.385 (s.d. 0.030)
POL    0.220 (s.d. 0.017)
VIF    0.407 (s.d. 0.035)
VPR    0.494 (s.d. 0.044)
TAT    1.229 (s.d. 0.104)
REV    0.596 (s.d. 0.052)
VPU    0.902 (s.d. 0.079)
ENV    0.889 (s.d. 0.051)
NEF    0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194

Evolution under double constraints

Codon Nucleotide Independence Heuristic

Singlet

R_{i,j} = f * q_{i,j}

Doublet

R_{(i1,i2),(j1,j2)} = f1 * f2 * q_{(i1,i2),(j1,j2)}

Structure Prediction: Hepatitis C Analysis

[Figure: predicted RNA secondary structure (stem-loops with base pairs such as G-C, A-U, G-U) for Hepatitis C; the drawing is not recoverable from the extracted text.]

Evolution Models: A hierarchy of hypotheses

[Figure: singlet and doublet positions shown against codon positions 1 2 3.]

Model    # parameters    Likelihood
0        4               L = 1.0531 x 10^-25927
1        5               L = 2.0596 x 10^-25797
2        7               L = 1.3104 x 10^-21569
3        9               L = 2.5006 x 10^-21513
4        15              L = 4.5739 x 10^-21484
5        17              L = 2.1155 x 10^-21473

Transition/transversion ratio: ts/tv = 2.00; 3 (ts/tv) = 1.50, 1.26, 3.05; 3 (ts/tv, equil.) in two models.

Doublet/singlet ratio (values as listed): 0.173, 0.415, 0.415, 0.414, 0.292.

Codon factors: (f1: 0.24, f2: 0.14) in four models; not included (--) in two.

Duplet distortion: absent (-) in five models, present (+) in one.
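A hierarchy of nested hypotheses is usually compared step by step with likelihood ratio statistics. The sketch below assumes that the six models are nested in the order listed and were fitted to the same data (the slides do not state this explicitly); it converts the reported likelihoods to natural logarithms and computes 2 Δ ln L together with the number of added parameters for each step.

```python
import math

# (number of parameters, mantissa, power of ten) as reported for models 0..5
MODELS = [
    (4, 1.0531, -25927),
    (5, 2.0596, -25797),
    (7, 1.3104, -21569),
    (9, 2.5006, -21513),
    (15, 4.5739, -21484),
    (17, 2.1155, -21473),
]

def log_likelihood(mantissa, exponent10):
    """Natural log of mantissa * 10**exponent10, avoiding floating-point underflow."""
    return math.log(mantissa) + exponent10 * math.log(10)

def likelihood_ratio_steps(models):
    """For each consecutive pair return (added parameters, 2 * delta ln L)."""
    steps = []
    for (k0, m0, e0), (k1, m1, e1) in zip(models, models[1:]):
        stat = 2 * (log_likelihood(m1, e1) - log_likelihood(m0, e0))
        steps.append((k1 - k0, stat))
    return steps

STEPS = likelihood_ratio_steps(MODELS)
```

Under the usual regularity conditions each statistic would be compared with a chi-square distribution whose degrees of freedom equal the number of added parameters.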

Combined RNA & Protein Structure

Gene Structure Fixed, RNA Structure Stochastic

Presently being implemented with viral analysis in mind.

Both RNA & Gene Structure Stochastic

Would imply Gene Finding as well.

Grammar for overlapping genes: a new phenomenon

Gene Structure Stochastic, RNA Structure Fixed

An atypical situation

A challenge for the future: structure evolution.

Open Problems

Combining with Alignment

Stacking Substitution Models

Other Sets of Constraints: Regulatory Signals

In principle a 4^4 x 4^4 matrix (65,536 entries!) is needed, but proper parametrisation and symmetries could reduce this substantially.

[Figure: four neighbouring nucleotides N1, N2, N3, N4, with example contexts A TGC, TA TGC, T TGC.]

References.

Hein, J. & J. Stoevlbaek (1995) “A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames” J.Mol.Evol. 40.181-189.

Jensen, JL & Pedersen (2001) “Probabilistic models of DNA sequence evolution with context dependent rates of substitution” Adv.Appl.Prob. 32.499-517.

Katz and Burge (2003) “Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes” Genome Research 13.2042-51.

Kirby, AK, SV Muse & W. Stephan (1995) “Maintenance of pre-mRNA secondary structure by epistatic selection” PNAS 92.9047-51.

Knudsen and Hein (1999) “Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution” Bioinformatics 15.6.446-454.

Knudsen and Hein (2003) “Pfold: RNA secondary structure prediction using stochastic context-free grammars” Nucleic Acids Research 31.13.3423-28.

New Influenza gene article???

Meyer and Durbin (2002) “Comparative Ab Initio prediction of Gene Structure using pair HMMs” Bioinformatics 18.10.1309-18.

Moulton, V., Zuker, M., Steel, M., Penny, D. and Pointon, R. (2000) “Metrics on RNA Structures” J. Computational Biology 7.1.277-292.

Pedersen, AMK & JL Jensen (2001) “A Dependent-Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames” Mol.Biol.Evol. 18.5.763-76.

Pedersen, JS & J. Hein (2003) “Gene finding with a Hidden Markov Model of genome structure and evolution” Bioinformatics.

Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “An evolutionary model for protein coding regions with RNA secondary structure” Manuscript in Preparation.

Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “Structure Models” Manuscript in Preparation.

Schadt, E. & K. Lange (2002) “Codon and Rate Variation Models in Molecular Phylogeny” Mol.Biol.Evol. 19.9.1534-49.

Savill, NJ et al. (2001) “RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods” Genetics 157.399-411.

Simmonds, P. and DB Smith (July 1999) “Structural Constraints on RNA Virus Evolution” J. of Virology 5787-94.

Tillier, ERM & RA Collins (1998) “High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA” Genetics 149.1993-2001.

Yang, Z. et al. (1995) “Molecular Evolution of the Hepatitis B Virus Genome” J.Mol.Evol. 41.587-96.

Acknowledgements

1. Comparative RNA Structure - Bjarne Knudsen

2. Comparative Gene Structure - Jakob Skou Pedersen

3. Integrating Levels of Selection & Structure:

Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg

