Combining RNA and Protein selection models
The Central Idea in Comparative Molecular Biology & Genomics
Three basic applications
Protein secondary structure
RNA secondary structure
Gene structure
Combining Evolution Constraints
Protein-Protein
RNA-Protein
Combining Structure Descriptions
Modelling Sequence Evolution
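P_{i,j}(t): continuous-time Markov chain on the state space {A, C, G, T}.

Rates are defined from the transition probabilities over a short time ε:

$$q_{i,j} = \lim_{\epsilon \to 0} \frac{P_{i,j}(\epsilon)}{\epsilon} \quad (i \neq j), \qquad q_{i,i} = \lim_{\epsilon \to 0} \frac{P_{i,i}(\epsilon) - 1}{\epsilon}$$

[Figure: Biological setup. Aligned nucleotide sequences separated by times t1 and t2; a is unknown.]

Transition probabilities from the rate matrix Q:

$$P(t) = \exp(tQ) = \sum_{i=0}^{\infty} \frac{(tQ)^i}{i!} = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \dots$$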
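A minimal Python sketch of the series P(t) = exp(tQ), truncated after a fixed number of terms and compared with the Jukes-Cantor closed form. The function names, the truncation at 30 terms, and the values α = 0.1 and t = 2 are illustrative assumptions and do not come from the slides.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(Q, t, terms=30):
    """Approximate exp(tQ) by I + tQ + (tQ)^2/2! + ... (truncated series)."""
    n = len(Q)
    tQ = [[t * q for q in row] for row in Q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (tQ)^i / i!
    for i in range(1, terms):
        term = [[x / i for x in row] for row in mat_mul(term, tQ)]
        result = [[r + x for r, x in zip(rr, xr)]
                  for rr, xr in zip(result, term)]
    return result

def jc_rate_matrix(alpha):
    """Jukes-Cantor rates: alpha off the diagonal, -3*alpha on it (assumed form)."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

a = 0.2                                        # a = alpha * t
P = expm_series(jc_rate_matrix(0.1), 2.0)
print(P[0][0], 0.25 * (1 + 3 * math.exp(-4 * a)))   # same nucleotide
print(P[0][1], 0.25 * (1 - math.exp(-4 * a)))       # one specific different nucleotide
```

Each printed pair should agree to many decimal places, and every row of P sums to one, since the rows of Q sum to zero.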
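Rate matrix, R (rows: FROM, columns: TO):

FROM \ TO     A      C      G      T
A           -3α      α      α      α
C             α    -3α      α      α
G             α      α    -3α      α
T             α      α      α    -3α

Transition probabilities after time t, with a = α*t:
P(equal) = ¼(1 + 3e^{-4a}) ≈ 1 - 3a
P(diff.) = ¾(1 - e^{-4a}) ≈ 3a, that is ¼(1 - e^{-4a}) ≈ a for each specific different nucleotide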
Stationary Distribution: (1,1,1,1)/4.
Jukes-Cantor 69: Total Symmetry
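Likelihood of two aligned sequences s1 and s2 of length 5 at distance a:

$$P(s_1 \to s_2 \mid a) = \prod_{i=1}^{5} P(s_{1,i})\, P(s_{1,i} \to s_{2,i}) = \left(\tfrac{1}{4}\right)^5 P(T \to T)\, P(C \to G)\, P(G \to G)\, P(G \to T)\, P(A \to T)$$

$$= \left(\tfrac{1}{4}\right)^5 \left(\tfrac{1}{4}\right)^5 \left(1 + 3e^{-4a}\right)^2 \left(1 - e^{-4a}\right)^3$$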
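The Jukes-Cantor likelihood of a pairwise alignment can be sketched as follows. The two sequences are hypothetical stand-ins with two identical and three different sites, and the function names are assumptions for illustration. The closed-form estimate of a follows from maximising (1 + 3e^{-4a})^k (1 - e^{-4a})^m over a.

```python
import math

def jc_prob(x, y, a):
    """JC69 probability of nucleotide y given x at distance a = alpha * t."""
    e = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * e) if x == y else 0.25 * (1.0 - e)

def jc_likelihood(s1, s2, a):
    """Product over sites of pi(s1_i) * P(s1_i -> s2_i), with pi = 1/4."""
    lik = 1.0
    for x, y in zip(s1, s2):
        lik *= 0.25 * jc_prob(x, y, a)
    return lik

def jc_mle_distance(s1, s2):
    """Closed-form maximiser of the JC69 likelihood (needs p < 3/4)."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)

S1, S2 = "TCGGA", "TGGTT"          # hypothetical example sequences
a_hat = jc_mle_distance(S1, S2)
print(a_hat, jc_likelihood(S1, S2, a_hat))
```

Evaluating `jc_likelihood` slightly to either side of `a_hat` gives a smaller value, which is a quick sanity check on the estimate.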
Comparison of Evolutionary Objects.
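[Figure: comparison of observable and unobservable evolutionary objects, illustrated with an RNA sequence and its base-paired structure.]

$$P(\text{Structure} \mid \text{Sequence}) = \frac{P(\text{Sequence} \mid \text{Structure})\, P(\text{Structure})}{P(\text{Sequence})}$$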
Goldman, Thorne & Jones, 96
Knudsen & Hein, 99
Eddy & others
Pedersen & Hein, 03
Haussler & others
Pedersen, Meyer, Forsberg, Hein,…
Multiple levels of selection
Protein-protein
RNA-protein
Structure Description: Grammars
Finite Set of Rules Generating Strings
i. A starting symbol:
ii. A set of substitution rules applied to variables in the present string. The derivation is finished when no variables remain.

Regular grammars: protein secondary structure, gene structure
Context-free grammars: RNA secondary structure
Simple String Generators
Non-terminals (capital) --- Terminals (small)
i. Start with S.   S --> aT | bS    T --> aS | bT | ε
One sentence (odd # of a's): S -> aT -> aaS -> aabS -> aabaT -> aaba
ii. S --> aSa | bSb | aa | bb
One sentence (even-length palindromes): S -> aSa -> abSba -> abaaba
Stochastic Grammars
The grammars above classify every string as belonging to the language or not.
Every variable has a finite set of substitution rules. Assigning a probability to each rule assigns probabilities to the strings in the language.
If a string has a unique derivation, its probability is the product of the probabilities of the rules applied.
i. Start with S.   S --> (0.3)aT | (0.7)bS    T --> (0.2)aS | (0.4)bT | (0.2)ε
S -> aT -> aaS -> aabS -> aabaT -> aaba
     *0.3   *0.2   *0.7    *0.3     *0.2
ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb
S -> aSa -> abSba -> abaaba
     *0.3    *0.5     *0.1
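The product rule for derivation probabilities can be sketched in Python for the two example grammars. The rule probabilities are used exactly as listed, with the last rule for T read as "stop" (an assumption); the function names are illustrative. Both grammars give every string at most one derivation, so a single product suffices.

```python
# Grammar i (regular): each (variable, terminal) pair fixes the next variable.
STEP = {("S", "a"): ("T", 0.3), ("S", "b"): ("S", 0.7),
        ("T", "a"): ("S", 0.2), ("T", "b"): ("T", 0.4)}
STOP = {"S": 0.0, "T": 0.2}        # probability of ending in each variable

def regular_prob(string):
    """Probability of a string over {a, b} under grammar i."""
    state, prob = "S", 1.0
    for ch in string:
        state, p = STEP[(state, ch)]
        prob *= p
    return prob * STOP[state]

# Grammar ii (context free): even-length palindromes.
PAL = {"aSa": 0.3, "bSb": 0.5, "aa": 0.1, "bb": 0.1}

def palindrome_prob(string):
    """Probability of a string over {a, b} under grammar ii (0 if not derivable)."""
    if len(string) < 2 or string[0] != string[-1]:
        return 0.0
    if len(string) == 2:
        return PAL[string]
    return PAL[string[0] + "S" + string[0]] * palindrome_prob(string[1:-1])

print(regular_prob("aaba"))        # product 0.3 * 0.2 * 0.7 * 0.3 * 0.2
print(palindrome_prob("abaaba"))   # product 0.3 * 0.5 * 0.1
```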
Gene Describers
Simple Prokaryotic Genes:
Simple Eukaryotic Genes:
Secondary Structure Generators
S --> LS (.869) | L (.131)
F --> dFd (.788) | LS (.212)
L --> s (.895) | dFd (.105)
Knudsen & Hein, 99
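A minimal sketch of sampling a secondary structure from the stochastic grammar with rules S --> LS | L, F --> dFd | LS, L --> s | dFd and the probabilities .869/.131, .788/.212, .895/.105. It assumes that s denotes an unpaired base (written "." below) and that d...d denotes a base pair (written "(" and ")"); the dot-bracket output and the function names are illustrative choices.

```python
import random

# Each variable maps to a list of (probability, right-hand side).
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["."]), (0.105, ["(", "F", ")"])],
}

def sample_structure(rng):
    """Expand the start symbol S left to right until only terminals remain."""
    stack, out = ["S"], []
    while stack:
        sym = stack.pop()
        if sym not in RULES:
            out.append(sym)                    # terminal: emit it
            continue
        r, acc = rng.random(), 0.0
        for p, rhs in RULES[sym]:
            acc += p
            if r < acc:
                break                          # otherwise the last rule is kept
        stack.extend(reversed(rhs))            # leftmost symbol ends up on top
    return "".join(out)

def is_balanced(structure):
    """Check that every '(' has a matching ')'."""
    depth = 0
    for ch in structure:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

rng = random.Random(1)
for _ in range(3):
    print(sample_structure(rng))
```

An explicit stack is used instead of recursion so that long structures do not hit Python's recursion limit.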
Structure Dependent Evolution Models
1. Protein Secondary Structure Dependent (Goldman, Thorne & Jones)
α-helix, β-sheet & loop each have their own (20,20) mutation rate matrix: R_α, R_β & R_loop
2. RNA Secondary Structure Dependent
i. R singlet,singlet (4,4)
ii. R doublet,doublet (16,16)
(base pair conserving relative to R singlet,singlet X R singlet,singlet)
3. Gene Structure Dependent
i. R non-coding {ATG-->GTG}
ii. R coding {ATG-->GTG}
iii. Other structural categories, regulatory signals, ...
The Genetic Code
i. 3 classes of sites:
4
2-2
1-1-1-1
Problems:
i. Not all sites fit into those categories.
ii. A change at one site can change the status of another.
[Figure: the genetic code table. Labels: 4 (3rd), 1-1-1-1 (3rd), TA (2nd).]
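The three site classes can be illustrated with a short Python sketch that classifies a codon position under the standard genetic code. It assumes the usual reading of the classes: "4" when all three changes are synonymous, "2-2" when only the transition is synonymous, and "1-1-1-1" when no change is synonymous. Anything else is reported as "other", which corresponds to the first problem listed. The function name and return labels are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def site_class(codon, pos):
    """Classify position pos (0, 1 or 2) of a codon as 4, 2-2, 1-1-1-1 or other."""
    aa = CODE[codon]

    def synonymous(base):
        return CODE[codon[:pos] + base + codon[pos + 1:]] == aa

    partner = TRANSITION[codon[pos]]
    ts = synonymous(partner)
    tv = [synonymous(b) for b in BASES if b not in (codon[pos], partner)]
    if ts and all(tv):
        return "4"
    if ts and not any(tv):
        return "2-2"
    if not ts and not any(tv):
        return "1-1-1-1"
    return "other"

for codon, pos in [("GGG", 2), ("GAG", 2), ("ATG", 2), ("ATA", 2), ("TTA", 0)]:
    print(codon, pos + 1, site_class(codon, pos))
```

The third position of ATA falls outside the three classes (ATT and ATC stay Ile, ATG becomes Met), which is one concrete case of a site that does not fit.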
Kimura’s 2 parameter model & Li’s Model.
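Selection on the 3 kinds of sites: (α, β) --> (?, ?)
1-1-1-1   (f*α, f*β)
2-2       (α, f*β)
4         (α, β)

[Figure: Kimura's 2-parameter model. From the start nucleotide, one arrow leads to the transition partner and one to each of the two transversion partners.]

Rates and probabilities, with a = α*t and b = β*t:
identity:                      .25(1 + e^{-4b} + 2e^{-2(a+b)})
transition:                    .25(1 + e^{-4b} - 2e^{-2(a+b)})
each of the two transversions: .25(1 - e^{-4b})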
Sites      Total   Conserved      Transitions    Transversions
1-1-1-1    274     246 (.8978)    12 (.0438)     16 (.0584)
2-2        77      51 (.6623)     21 (.2727)     5 (.0649)
4          78      47 (.6026)     16 (.2051)     15 (.1923)
X(αt,βt) = .25[1 + exp(-2βt) + 2exp(-t(α+β))]   (identity)
Y(αt,βt) = .25[1 + exp(-2βt) - 2exp(-t(α+β))]   (transition)
Z(αt,βt) = .50[1 - exp(-2βt)]                   (transversion, either of the two)
L(observations, a, b, f) = C(429; 274, 77, 78)
  * {X(a*f, b*f)^246 * Y(a*f, b*f)^12 * Z(a*f, b*f)^16}
  * {X(a, b*f)^51 * Y(a, b*f)^21 * Z(a, b*f)^5}
  * {X(a, b)^47 * Y(a, b)^16 * Z(a, b)^15}
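A hedged sketch of the likelihood kernel above. It assumes that X is the probability of identity, Y the probability of a transition (prefactor .25) and Z the total probability of a transversion (prefactor .50), so that X + Y + Z = 1. The constant C(...) is omitted because it does not depend on the parameters. Function names are illustrative.

```python
import math

def xyz(a, b):
    """Identity, transition and total transversion probabilities (assumed forms)."""
    E = math.exp(-2.0 * b)
    F = math.exp(-(a + b))
    x = 0.25 * (1.0 + E + 2.0 * F)
    y = 0.25 * (1.0 + E - 2.0 * F)
    z = 0.50 * (1.0 - E)
    return x, y, z

# (conserved, transitions, transversions) per site class, from the table.
COUNTS = {"1-1-1-1": (246, 12, 16), "2-2": (51, 21, 5), "4": (47, 16, 15)}

def log_likelihood(a, b, f):
    """Log of the likelihood kernel, without the combinatorial constant."""
    rates = {"1-1-1-1": (a * f, b * f), "2-2": (a, b * f), "4": (a, b)}
    total = 0.0
    for cls, (n_same, n_ts, n_tv) in COUNTS.items():
        x, y, z = xyz(*rates[cls])
        total += n_same * math.log(x) + n_ts * math.log(y) + n_tv * math.log(z)
    return total

print(sum(xyz(0.3, 0.2)))                 # the three probabilities sum to one
print(log_likelihood(0.3, 0.2, 0.2))      # illustrative parameter values
```

Such a function could be handed to any numerical optimiser over (a, b, f); no claim is made here that it reproduces the reported estimates.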
where a = αt and b = βt.
Estimated parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

Sites      Transitions      Transversions
1-1-1-1    a*f = 0.0500     2*b*f = 0.0622
2-2        a = 0.3004       2*b*f = 0.0622
4          a = 0.3004       2*b = 0.3741

Expected number of replacement substitutions: 35.49; synonymous: 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644, Ka = .1127
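The reported Ks and Ka are consistent with dividing the expected number of substitutions by the corresponding number of sites. A small check, using only the figures quoted above (variable names are illustrative):

```python
replacement_subs = 35.49      # expected replacement substitutions
synonymous_subs = 75.93       # expected synonymous substitutions
replacement_sites = 314.72
silent_sites = 429 - replacement_sites

ks = synonymous_subs / silent_sites
ka = replacement_subs / replacement_sites
print(round(silent_sites, 2), ks, ka, ka / ks)
```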
alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile
Three Questions
What is the probability of the data?
What is the most probable ”hidden” configuration?
What is the probability of a specific ”hidden” state?
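HMM/Stochastic Regular Grammar:
[Figure: hidden states H1, H2, H3 above observations O1 ... O10.]
Stochastic Context Free Grammars:
[Figure: variable W generating positions i to j of a sequence 1 ... L, split into WL (i to i') and WR (j' to j).]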
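For the HMM case, the three questions correspond to the forward algorithm (probability of the data), the Viterbi algorithm (most probable hidden configuration) and forward-backward posterior decoding (probability of a specific hidden state). The sketch below uses a hypothetical two-state model over the alphabet {a, b}; all states, probabilities and function names are illustrative assumptions.

```python
STATES = ("H1", "H2")
START = {"H1": 0.6, "H2": 0.4}
TRANS = {"H1": {"H1": 0.7, "H2": 0.3}, "H2": {"H1": 0.4, "H2": 0.6}}
EMIT = {"H1": {"a": 0.9, "b": 0.1}, "H2": {"a": 0.2, "b": 0.8}}

def forward(obs):
    """f[k][s] = P(obs[0..k], hidden state at k is s)."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][o] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f

def backward(obs):
    """b[k][s] = P(obs[k+1..] | hidden state at k is s)."""
    b = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][o] * nxt[r] for r in STATES)
                     for s in STATES})
    return b

def data_probability(obs):
    """Question 1: probability of the data, summed over all hidden paths."""
    return sum(forward(obs)[-1].values())

def viterbi(obs):
    """Question 2: (probability, path) of the most probable hidden path."""
    v = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for o in obs[1:]:
        cur = {}
        for s in STATES:
            p, path = max(((v[r][0] * TRANS[r][s], v[r][1]) for r in STATES),
                          key=lambda item: item[0])
            cur[s] = (p * EMIT[s][o], path + [s])
        v = cur
    return max(v.values(), key=lambda item: item[0])

def posterior(obs, k, state):
    """Question 3: P(hidden state at position k is `state` | obs)."""
    return forward(obs)[k][state] * backward(obs)[k][state] / data_probability(obs)

obs = "aab"
print(data_probability(obs))
print(viterbi(obs))
print(posterior(obs, 1, "H1"))
```

For stochastic context-free grammars the same three questions are answered by the inside, CYK and inside-outside algorithms, which work on subsequences i..j rather than prefixes.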
Comparative Gene Finding
Jakob Skou Pedersen & Hein, 2004
Knudsen & Hein, 99
From Knudsen & Hein (1999)
Knudsen and Hein, 2003
Why combine RNA & Protein Models?
Short Term/Long Term Evolution Discrepancies
Separating Selective Effects
Analyzing one level without interference from the other level
Predicting gene structure and RNA structure better.
Annotation of Viral Genomes
Combining Levels of Selection.
Protein-Protein
Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic
Jensen & Pedersen, 2001
Contagious Dependence
Assume multiplicativity: fA,B = fA*fB
Protein-RNA
Doublets, Singlet
Contagious Dependence
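Overlapping Coding Regions
Hein & Stoevlbaek, 95
Rates (transition, transversion) for a site, by its class in the 1st gene (columns) and in the 2nd gene (rows):

2nd \ 1st    1-1-1-1            2-2              4
1-1-1-1      (f1f2a, f1f2b)     (f2a, f1f2b)     (f2a, f2b)
2-2          (f1a, f1f2b)       (a, f1f2b)       (a, f2b)
4            (f1a, f1b)         (a, f1b)         (a, b)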
Ziheng Yang has an alternative model, where sites are lumped into the same category if they have the same configuration of positions and reading frames.
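Under the multiplicativity assumption, the rate pair for a site in two overlapping genes follows from its class in each gene. A minimal sketch, assuming that a 1-1-1-1 site scales both rates by the gene's selection factor, a 2-2 site scales only the transversion rate, and a 4 site scales neither; the function names are illustrative.

```python
def class_factors(site_class, f):
    """(transition factor, transversion factor) that one gene imposes on a site."""
    if site_class == "1-1-1-1":
        return (f, f)
    if site_class == "2-2":
        return (1.0, f)
    if site_class == "4":
        return (1.0, 1.0)
    raise ValueError("unknown site class: " + site_class)

def overlapping_rates(class1, class2, a, b, f1, f2):
    """(transition rate, transversion rate) for a site in both reading frames."""
    ts1, tv1 = class_factors(class1, f1)
    ts2, tv2 = class_factors(class2, f2)
    return (ts1 * ts2 * a, tv1 * tv2 * b)

# Example with arbitrary values a = 2, b = 3, f1 = 0.5, f2 = 0.25.
for c1 in ("1-1-1-1", "2-2", "4"):
    for c2 in ("1-1-1-1", "2-2", "4"):
        print(c1, c2, overlapping_rates(c1, c2, 2.0, 3.0, 0.5, 0.25))
```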
Example: Gag & Pol from HIV
[Figure: overlapping gag and pol reading frames.]
Number of sites by class in the two genes (Gag, Pol):

             1-1-1-1    2-2    4
1-1-1-1      64         31     34
2-2          40         7      0
4            27         2      0

MLE: a = .084, b = .024, a+2b = .133, fgag = .403, fpol = .229
Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t      β*t      A        C        G        T
0.350    0.105    0.361    0.181    0.236    0.222
0.015    0.005    0.004    0.003    0.003

Selection Factors
GAG   0.385 (s.d. 0.030)
POL   0.220 (s.d. 0.017)
VIF   0.407 (s.d. 0.035)
VPR   0.494 (s.d. 0.044)
TAT   1.229 (s.d. 0.104)
REV   0.596 (s.d. 0.052)
VPU   0.902 (s.d. 0.079)
ENV   0.889 (s.d. 0.051)
NEF   0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194
HIV2 Analysis
Evolution under double constraints
Codon Nucleotide Independence Heuristic
Singlet
Ri,j =f* qi,j
Doublet
R(i1,i2),(j1,j2) = f1 * f2 * q (i1,i2),(j1,j2)
Structure Prediction: Hepatitis C Analysis
[Figure: predicted RNA secondary structures (stem-loops) in the hepatitis C virus genome, drawn with G-C, A-U and G-U base pairs.]
Evolution Models: A hierarchy of hypotheses
[Figure: doublet and singlet positions along a coding sequence, with codon positions 1 2 3 marked.]

Model    # parameters    Likelihood
0        4               1.0531 × 10^-25927
1        5               2.0596 × 10^-25797
2        7               1.3104 × 10^-21569
3        9               2.5006 × 10^-21513
4        15              4.5739 × 10^-21484
5        17              2.1155 × 10^-21473

Entries for the remaining columns (codon factors; transition/transversion ratio; doublet distortion, doublet/singlet ratio):
ts/tv = 2.00
3 (ts/tv) = 1.50, 1.26, 3.05
3 (ts/tv, equil.)
3 (ts/tv, equil.)
0.173, 0.415, 0.415, 0.414, 0.292
(f1: 0.24, f2: 0.14) in four models
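One way to read a hierarchy of hypotheses is through likelihood-ratio statistics between successive models. The sketch below assumes that the six likelihoods and parameter counts pair up in the order listed and that each model is nested in the next; neither assumption is stated on the slide. The likelihoods are far too small for floating point, so they are handled as log-likelihoods.

```python
import math

# model index -> (number of parameters, mantissa, power of ten)
MODELS = {
    0: (4, 1.0531, -25927),
    1: (5, 2.0596, -25797),
    2: (7, 1.3104, -21569),
    3: (9, 2.5006, -21513),
    4: (15, 4.5739, -21484),
    5: (17, 2.1155, -21473),
}

def log_likelihood(model):
    """Natural log of mantissa * 10**exponent, computed without underflow."""
    _, mantissa, exponent = MODELS[model]
    return math.log(mantissa) + exponent * math.log(10.0)

def lrt_statistic(simple, rich):
    """Return (2 * difference in log-likelihood, difference in parameter count)."""
    stat = 2.0 * (log_likelihood(rich) - log_likelihood(simple))
    return stat, MODELS[rich][0] - MODELS[simple][0]

for m in range(5):
    stat, df = lrt_statistic(m, m + 1)
    print(m, "->", m + 1, "statistic", round(stat, 2), "extra parameters", df)
```

Under the usual asymptotic argument each statistic would be compared with a chi-square distribution with `df` degrees of freedom, provided the nesting assumption holds.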
Combined RNA & Protein Structure
Gene Structure Fixed, RNA Structure Stochastic
Presently being implemented with viral analysis in mind
Both RNA & Gene Structure Stochastic
Would imply Gene Finding as well.
Grammar for overlapping genes: a new phenomenon
Gene Structure Stochastic, RNA Structure Fixed
An atypical situation
A challenge for the future: structure evolution.
Open Problems
Combining with Alignment
Stacking Substitution Models
Other Sets of Constraints: Regulatory Signals
In principle a 4^4 times 4^4 matrix (65,536 entries!!) is needed, but proper parametrisation and symmetries could reduce this substantially.
[Figure: four neighbouring nucleotides N1, N2, N3, N4.]
References.
Hein, J & J. Stoevlbaek (1995) “A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames” J.Mol.Evol. 40.181-189.
Jensen, JL & Pedersen (2001) “Probabilistic models of DNA sequence evolution with context dependent rates of substitution” Adv. Appl.Prob. 32.499-517.
Katz and Burge (2003) “Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes” Genome Research 13.2042-51.
Kirby, AK, SV Muse & W.Stephan (1995) “Maintenance of pre-mRNA secondary structure by epistatic selection” PNAS. 92.9047-51.
Knudsen, Hein 99 “Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution” Bioinformatics 15.6.446-454.
Knudsen and Hein (2003) “Pfold: RNA secondary structure prediction using stochastic context-free grammars” Nucleic Acid Research 31.13.3423-28.
New Influenza gene article???
Meyer and Durbin (2002) “Comparative Ab Initio prediction of Gene Structure using pair HMMs” Bioinformatics 18.10.1309-18.
Moulton, V., Zuker, M. Steel, M., Penny, D. and Pointon, R. “Metrics on RNA Structures”. J. Computational Biology, 7 (1): 277-292, (2000).
Pedersen, AMK & JL Jensen (2001) “A Dependent – Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames” Mol.Biol.Evol. 18.5.763-76.
Pedersen JS & J. Hein 2003 – “Gene finding with a Hidden Markov Model of genome structure and evolution” Bioinformatics
Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “An evolutionary model for protein coding regions with RNA secondary structure” Manuscript in Preparation
Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “Structure Models” Manuscript in Preparation
Schadt, E. & K.Lange (2002) “Codon and Rate Variation Models in Molecular Phylogeny” Mol.Biol.Evol. 19.9.1534-49
Savill, NJ et al (2001) “RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods” Genetics 157.399-411.
Simmonds, P. and DB Smith (July1999) “Structural Constraints on RNA Virus Evolution” J.of Virology 5787-94
Tillier ERM & RA Collins (1998) “High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA” Genetics 149.1993-2001.
Yang, Z. et al. (1995) “Molecular Evolution of the Hepatitis B Virus Genome” J.Mol.Evol. 41.587-96
Acknowledgements
1. Comparative RNA Structure - Bjarne Knudsen
2. Comparative Gene Structure - Jakob Skou Pedersen
3. Integrating Levels of Selection & Structure:
Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg