Date post: | 21-Dec-2015 |
RNA structure analysis
Jurgen Mourik &
Richard Vogelaars
Utrecht University
Overview
• Introduction to RNA
• RNA secondary structure prediction
  – Nussinov folding algorithm
  – Zuker folding algorithm
• Demonstration
• Questions
Introduction to RNA (1)
• Ribonucleic acid
• To many people:
  – “RNA is the passive intermediary messenger between DNA genes and the protein translation machinery”
• But:
  – Many non-coding RNAs exist
    • Adopt sophisticated 3D structures
    • Catalyse biochemical reactions
Introduction to RNA (2)
• Three major types of RNA
• Messenger RNA (mRNA)
  – Serving as a temporary copy of genes that is used as a template for protein synthesis.
• Transfer RNA (tRNA)
  – Functioning as adaptor molecules that decode the genetic code.
• Ribosomal RNA (rRNA)
  – Catalyzing the synthesis of proteins.
RNA world hypothesis
• RNA is the only biological polymer that serves as both a catalyst (like proteins) and as information storage (like DNA).
• For this reason some people think that an RNA-like molecule was the basis of life early in evolution.
Terminology of RNA (1)
• Four nucleotides:
  – Adenine
  – Cytosine
  – Guanine
  – Uracil
• Canonical base pairs:
  – G-C
  – A-U
• Non-canonical base pairs:
  – G-U
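The pairing rules on this slide can be written down as a small lookup. This is only an illustrative sketch: the function name `classify_pair` and the returned labels are made up for this example.

```python
# Unordered base pairs as listed on the slide.
CANONICAL = {frozenset("GC"), frozenset("AU")}
NON_CANONICAL = {frozenset("GU")}


def classify_pair(a, b):
    """Classify two RNA bases as a canonical pair, a non-canonical pair, or no pair."""
    pair = frozenset((a, b))
    if pair in CANONICAL:
        return "canonical"
    if pair in NON_CANONICAL:
        return "non-canonical"
    return "none"
```

For example, `classify_pair("U", "G")` falls into the non-canonical group, while `classify_pair("A", "C")` is not a pair at all under these rules.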
Terminology of RNA (2)
• Base pairs are approximately coplanar and almost always stacked onto other base pairs in an RNA structure
  – Contiguous stacked base pairs are called stems
  – In 3D, RNA stems generally form a regular double helix
RNA secondary structure
• Unlike DNA, RNA is typically produced as a single stranded molecule which then folds intramolecularly to form a number of short base-paired stems. This base-paired structure is called the secondary structure of the RNA.
Elements of an RNA secondary structure (1)
• Loop: single stranded subsequence bounded by base pairs
• Hairpin loop: a loop at the end of a stem
• Bulge (loop): single stranded bases occurring within a stem
• Interior loop: single stranded bases interrupting both sides of a stem
• Multi-branched loop: a loop from which three or more stems radiate
Elements of an RNA secondary structure (2)
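[Figure: example RNA secondary structure with the 5' and 3' ends labelled, showing a stem of stacked base pairs (G●C, G●C, U●A, A●U, C●G) and adjoining loops]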
Pseudoknots (1)
• Base pairs almost always occur in a nested fashion in RNA secondary structure
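• A base pair between positions i and j and a base pair between positions i' and j' are nested if and only if:
$$i < i' < j' < j \quad \text{or} \quad i' < i < j < j'$$
• Non-nested base pairs that cross each other (for example $i < i' < j < j'$) are called pseudoknots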
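The nesting condition can be checked directly. The sketch below is illustrative only: the function names `is_nested` and `crosses` are invented here. Pairs that lie side by side (i < j < i' < j') satisfy neither test; they are neither nested nor pseudoknotted.

```python
def is_nested(pair_a, pair_b):
    """True if one base pair lies entirely inside the other."""
    i, j = sorted(pair_a)
    k, l = sorted(pair_b)
    return (i < k < l < j) or (k < i < j < l)


def crosses(pair_a, pair_b):
    """True if the two base pairs interleave, which is the pseudoknot case."""
    i, j = sorted(pair_a)
    k, l = sorted(pair_b)
    return (i < k < j < l) or (k < i < l < j)
```

With positions numbered along the sequence, `is_nested((1, 10), (3, 7))` holds, whereas `crosses((1, 5), (3, 8))` flags a pseudoknot.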
Pseudoknots (2)
• None of the dynamic programming algorithms discussed here, including the Nussinov and Zuker RNA folding algorithms, can deal with pseudoknots.
• Pseudoknots occur in many important RNAs:
  – The algorithms ignore biologically important information.
• For database searching for RNA homologues, it is acceptable to sacrifice the information in pseudoknots.
RNA sequence evolution
• The sequence evolution of RNA is constrained by the structure.
• It is possible to have two different RNA sequences with the same secondary structure.
• Drastic changes in sequence can often be tolerated as long as compensatory mutations maintain base-pairing complementarity.
RNA sequence evolution (2)
• Suppose we want to search a nucleotide sequence for occurrences of the consensus R17 coat protein binding site:
  – It is useless to use standard sequence alignment
• R17 coat protein binds and represses translation of its replicase:
  – It blinds most of the primary sequence positions
[Figure: consensus structure of the R17 coat protein binding site]
Legend:
N = {A, C, G, U}
R = {A, G}
Y = {C, U}
N' = complement of base N
RNA sequence evolution (3)
• How to solve this problem?
  – RNA pattern-matching program (RNAMOT).
• Searches for deterministic (non-stochastic) motifs but with secondary structure constraints as extra terms.
• Works fine for small, well-defined patterns but is somewhat insensitive and problematic for finding matches to less well conserved structures.
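To make the idea of a deterministic motif with a secondary structure constraint concrete, here is a minimal sketch. It is not RNAMOT's actual algorithm or descriptor format. The function name `find_hairpins` and its parameters are assumptions chosen for illustration, and only Watson-Crick pairs are accepted in the stem.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def find_hairpins(seq, stem_len=3, loop_min=3, loop_max=6):
    """Return (start, loop_len) for every stem-loop-stem window in seq.

    The motif has no fixed bases: the only constraint is that the 3' arm
    is the reverse complement of the 5' arm.
    """
    hits = []
    n = len(seq)
    for start in range(n):
        for loop_len in range(loop_min, loop_max + 1):
            end = start + 2 * stem_len + loop_len  # one past the motif
            if end > n:
                break
            left = seq[start:start + stem_len]
            right = seq[end - stem_len:end]
            # Compare the 5' arm with the 3' arm read backwards.
            if all(COMPLEMENT.get(a) == b for a, b in zip(left, reversed(right))):
                hits.append((start, loop_len))
    return hits
```

Calling `find_hairpins("GGGAAAUCCC")` reports a single hit at position 0 with a loop of four bases. A search like this finds exact matches only, which is why it becomes insensitive for less well conserved structures.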
Inferring structure by comparative sequence analysis
• In a structurally correct multiple alignment of RNAs, conserved base pairs are often revealed by the presence of frequent correlated compensatory mutations
• RNA secondary structure prediction method: comparative sequence analysis
• The accepted consensus structures of most well-studied RNAs have been derived by comparative analysis.
How does comparative sequence analysis work? (1)
• Inferring the correct structure by comparative analysis requires knowing a structurally correct alignment
• Inferring a structurally correct multiple alignment requires knowing the correct structure
Problem!
How does comparative sequence analysis work? (2)
• Solution: make use of an iterative refinement process of:
  – Guessing the structure based on the current best guess of the alignment
  – Realigning based on the new guess at the structure
• The sequences to be compared must be:
  – Sufficiently similar to start the process
  – Sufficiently dissimilar that a number of co-varying substitutions can be detected
Mutual information (1)
• A quantitative measure of pairwise sequence covariation
• Given two aligned columns i, j, the mutual information is given by:
$$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$
  – $f_{x_i}$: the frequency of one of the four bases observed in column i.
  – $f_{x_i x_j}$: the joint (pairwise) frequency of one of the sixteen possible base pairs observed in columns i, j.
• $M_{ij}$ varies between 0 and 2 bits
Mutual information (2)
• Mij tells us how much information we get about the identity of the residue in one position if we are told the identity of the residue in the other position
  – If i and j always form Watson-Crick pairs and you know that i is a G, the uncertainty about j collapses from four different possibilities to just one (C): 2 bits of information
  – If i and j are uncorrelated, the mutual information is zero
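A minimal sketch of the mutual information calculation for two alignment columns follows. The function name `mutual_information` and the representation of a column as a string (one character per aligned sequence) are assumptions made for this example; gaps are not handled.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information M_ij in bits for two aligned columns of equal length."""
    n = len(col_i)
    f_i = Counter(col_i)                # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))   # joint counts of observed base pairs
    m = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        m += p_ab * log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return m
```

Two perfectly covarying columns such as `"GCAU"` and `"CGUA"` reach the 2 bit maximum. A pair of fully conserved columns such as `"GGGG"` and `"CCCC"` gives zero, because a base pair that never changes carries no covariation signal.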
RNA secondary structure prediction (1)
• Many plausible secondary structures can be drawn for a sequence
• But: the number of secondary structures increases exponentially with sequence length
  – An RNA of 200 bases has over $10^{50}$ possible base-paired structures
• Goal: distinguish the biologically correct structure from all the incorrect structures.
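The growth in the number of structures can be explored with a small counting recursion. This is a sketch under stated assumptions: only Watson-Crick pairs are allowed, structures are nested (no pseudoknots), no minimum hairpin loop length is enforced, and the function name `count_structures` is made up for this example.

```python
from functools import lru_cache

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def count_structures(seq):
    """Count nested secondary structures of seq, including the fully unpaired one."""

    @lru_cache(maxsize=None)
    def count(i, j):
        if i >= j:  # empty or single-base interval: exactly one structure
            return 1
        total = count(i + 1, j)  # base i stays unpaired
        for k in range(i + 1, j + 1):  # base i pairs with base k
            if (seq[i], seq[k]) in PAIRS:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)
```

Even the four-base sequence `"GCGC"` already has seven structures under these assumptions, and the count multiplies quickly as the sequence gets longer, which is why the structures cannot simply be enumerated and scored one by one.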
RNA secondary structure prediction (2)
• We need:
  – A function that assigns the correct structure the highest score
  – An algorithm for evaluating the scores of all possible structures
• Two methods:
  – Nussinov folding algorithm
  – Zuker folding algorithm
Need a break?
Well here it is!
Nussinov folding algorithm (1)
• Goal: Find the structure with the most base pairs
• Nussinov introduced an efficient dynamic programming algorithm for this problem
• A recursive algorithm that
  – calculates the best structure for small subsequences and
  – works its way outwards to larger and larger subsequences
Nussinov folding algorithm (2)
• Key idea of recursion:
  – There are only four possible ways of getting the best structure for i,j from the best structures of the smaller subsequences
• Two stages:
  – Fill stage of the algorithm
  – Traceback stage of the algorithm
Nussinov folding algorithm (3)
• The four possible ways:
  1. Add unpaired position i onto the best structure for subsequence i+1,j
  2. Add unpaired position j onto the best structure for subsequence i,j-1
  3. Add i,j pair onto the best structure found for subsequence i+1,j-1
  4. Combine two optimal substructures i,k and k+1,j
Nussinov folding algorithm (4)
• Formal description of the algorithm:
  – Given a sequence x of length L with symbols $x_1, \dots, x_L$
  – Let $\delta(i,j) = 1$ if $x_i$ and $x_j$ are a complementary base pair, else $\delta(i,j) = 0$
  – Recursively calculate scores $\gamma(i,j)$, which are the maximum number of base pairs that can be formed for subsequence $x_i, \dots, x_j$
Nussinov algorithm: fill stage
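– Initialisation:
$$\gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L$$
$$\gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L$$
– Recursion: starting with all subsequences of length 2, to length L:
$$\gamma(i,j) = \max \begin{cases} \gamma(i+1, j), \\ \gamma(i, j-1), \\ \gamma(i+1, j-1) + \delta(i,j), \\ \max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,j) \right]. \end{cases}$$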
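The fill stage can be sketched in a few lines of Python. The names `delta` and `nussinov_fill` are chosen for this example, only G-C and A-U pairs score, and the matrix is padded so that the 1-based indices of the recursion can be used unchanged.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def delta(a, b):
    """1 if bases a and b are complementary, else 0."""
    return 1 if (a, b) in WATSON_CRICK else 0


def nussinov_fill(seq):
    """Return the gamma matrix (1-based indices) for the Nussinov fill stage."""
    L = len(seq)
    # Padding rows and columns keep gamma[i][i-1] and gamma[i+1][j] in range.
    # Initialisation (all zeros) covers gamma(i, i-1) and gamma(i, i).
    gamma = [[0] * (L + 2) for _ in range(L + 2)]
    for span in range(1, L):  # j - i, i.e. subsequence length span + 1
        for i in range(1, L - span + 1):
            j = i + span
            best = max(
                gamma[i + 1][j],                                     # i unpaired
                gamma[i][j - 1],                                     # j unpaired
                gamma[i + 1][j - 1] + delta(seq[i - 1], seq[j - 1])  # i, j pair
            )
            for k in range(i + 1, j):                                # bifurcation
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma
```

For the example sequence `GGGAAAUCC`, `nussinov_fill("GGGAAAUCC")[1][9]` is 3, the maximum number of base pairs. The three nested loops make the fill stage cubic in the sequence length.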
Example sequence: GGGAAAUCC
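j       1  2  3  4  5  6  7  8  9
i       G  G  G  A  A  A  U  C  C
1  G    0  .  .  .  .  .  .  .  .
2  G    0  0  .  .  .  .  .  .  .
3  G    .  0  0  .  .  .  .  .  .
4  A    .  .  0  0  .  .  .  .  .
5  A    .  .  .  0  0  .  .  .  .
6  A    .  .  .  .  0  0  .  .  .
7  U    .  .  .  .  .  0  0  .  .
8  C    .  .  .  .  .  .  0  0  .
9  C    .  .  .  .  .  .  .  0  0
("." marks a cell that is not yet filled or not used)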
Example sequence: GGGAAAUCC
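j       1  2  3  4  5  6  7  8  9
i       G  G  G  A  A  A  U  C  C
1  G    0  0  .  .  .  .  .  .  .
2  G    0  0  0  .  .  .  .  .  .
3  G    .  0  0  0  .  .  .  .  .
4  A    .  .  0  0  0  .  .  .  .
5  A    .  .  .  0  0  0  .  .  .
6  A    .  .  .  .  0  0  1  .  .
7  U    .  .  .  .  .  0  0  0  .
8  C    .  .  .  .  .  .  0  0  0
9  C    .  .  .  .  .  .  .  0  0
("." marks a cell that is not yet filled or not used)
A-U is a base pair:
$$\gamma(i+1, j-1) + \delta(i,j) = \gamma(7,6) + \delta(6,7) = 0 + 1 = 1$$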
Example sequence: GGGAAAUCC
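j       1  2  3  4  5  6  7  8  9
i       G  G  G  A  A  A  U  C  C
1  G    0  0  0  0  0  0  1  2  3
2  G    0  0  0  0  0  0  1  2  3
3  G    .  0  0  0  0  0  1  2  2
4  A    .  .  0  0  0  0  1  1  1
5  A    .  .  .  0  0  0  1  1  1
6  A    .  .  .  .  0  0  1  1  1
7  U    .  .  .  .  .  0  0  0  0
8  C    .  .  .  .  .  .  0  0  0
9  C    .  .  .  .  .  .  .  0  0
("." marks a cell that is not used)
The value $\gamma(1,9) = 3$ in the top-right corner gives the maximum number of base pairs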
Nussinov algorithm: traceback stage
• Initialisation: Push (1,L) onto the stack.
• Recursion: Repeat until stack is empty:
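pop (i, j)
if i >= j: continue
else if $\gamma(i+1, j) = \gamma(i, j)$: push (i+1, j)
else if $\gamma(i, j-1) = \gamma(i, j)$: push (i, j-1)
else if $\gamma(i+1, j-1) + \delta(i,j) = \gamma(i, j)$:
  record i, j base pair
  push (i+1, j-1)
else for k = i+1 to j-1: if $\gamma(i,k) + \gamma(k+1,j) = \gamma(i,j)$:
  push (k+1, j)
  push (i, k)
  break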
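The traceback can be sketched with an explicit stack, as in the pseudocode. To keep the example self-contained, a compact version of the fill stage is included; the names `delta`, `fill` and `traceback` are chosen for this example and only G-C and A-U pairs score.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def delta(a, b):
    """1 if bases a and b are complementary, else 0."""
    return 1 if (a, b) in WATSON_CRICK else 0


def fill(seq):
    """Nussinov fill stage; returns gamma with 1-based indices."""
    L = len(seq)
    gamma = [[0] * (L + 2) for _ in range(L + 2)]
    for span in range(1, L):
        for i in range(1, L - span + 1):
            j = i + span
            best = max(gamma[i + 1][j], gamma[i][j - 1],
                       gamma[i + 1][j - 1] + delta(seq[i - 1], seq[j - 1]))
            for k in range(i + 1, j):
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


def traceback(seq, gamma):
    """Recover one optimal set of base pairs as sorted (i, j) tuples, 1-based."""
    pairs = []
    stack = [(1, len(seq))]                    # initialisation: push (1, L)
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if gamma[i + 1][j] == gamma[i][j]:     # i unpaired
            stack.append((i + 1, j))
        elif gamma[i][j - 1] == gamma[i][j]:   # j unpaired
            stack.append((i, j - 1))
        elif gamma[i + 1][j - 1] + delta(seq[i - 1], seq[j - 1]) == gamma[i][j]:
            pairs.append((i, j))               # record i, j base pair
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):          # bifurcation
                if gamma[i][k] + gamma[k + 1][j] == gamma[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)
```

For `GGGAAAUCC` this returns the pairs (2, 9), (3, 8) and (6, 7). Other structures with three base pairs may exist; the order of the tests in the traceback decides which one is reported.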
Example sequence: GGGAAAUCC
j       1  2  3  4  5  6  7  8  9
i       G  G  G  A  A  A  U  C  C
1  G    0  0  0  0  0  0  1  2  3
2  G    0  0  0  0  0  0  1  2  3
3  G    .  0  0  0  0  0  1  2  2
4  A    .  .  0  0  0  0  1  1  1
5  A    .  .  .  0  0  0  1  1  1
6  A    .  .  .  .  0  0  1  1  1
7  U    .  .  .  .  .  0  0  0  0
8  C    .  .  .  .  .  .  0  0  0
9  C    .  .  .  .  .  .  .  0  0
("." marks a cell that is not used)
Initialisation: push (1, L)
Example sequence: GGGAAAUCC
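j       1  2  3  4  5  6  7  8  9
i       G  G  G  A  A  A  U  C  C
1  G    0  0  0  0  0  0  1  2  3
2  G    0  0  0  0  0  0  1  2  3
3  G    .  0  0  0  0  0  1  2  2
4  A    .  .  0  0  0  0  1  1  1
5  A    .  .  .  0  0  0  1  1  1
6  A    .  .  .  .  0  0  1  1  1
7  U    .  .  .  .  .  0  0  0  0
8  C    .  .  .  .  .  .  0  0  0
9  C    .  .  .  .  .  .  .  0  0
("." marks a cell that is not used)
Recursion:
$$\gamma(2,9) = \gamma(1,9)$$
else if $\gamma(i+1, j) = \gamma(i, j)$: push (i+1, j)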
Example sequence: GGGAAAUCC
j       6  7  8  9
i       A  U  C  C
1  G    0  1  2  3
2  G    0  1  2  3
3  G    0  1  2  2
4  A    0  1  1  1
5  A    0  1  1  1
6  A    0  1  1  1
7  U    0  0  0  0
Resulting structure:
G
G ● C
G ● C
A
A
A ● U
SCFG version of the Nussinov algorithm
• Stochastic Context-Free Grammars
  – Will be discussed next Wednesday
• Makes use of production rules:
  – S → aS | cS | gS | uS (i unpaired)
• Every production rule has an associated probability parameter.
• The maximum probability parse is equivalent to the maximum probability secondary structure.
Needed terminology
• The inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs.
• The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum probability alignment of the SCFG to the sequence, just as the Viterbi algorithm does for HMMs.
• Chomsky normal form: all context-free grammar production rules are of the form S → SS or S → a
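CYK for Nussinov-style RNA SCFG (1)
• Initialisation:
$$\gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L$$
$$\gamma(i, i) = \max \begin{cases} \log p(S \to x_i S), \\ \log p(S \to S x_i) \end{cases} \quad \text{for } i = 1 \text{ to } L$$
• Recursion:
$$\gamma(i,j) = \max \begin{cases} \gamma(i+1, j) + \log p(S \to x_i S); \\ \gamma(i, j-1) + \log p(S \to S x_j); \\ \gamma(i+1, j-1) + \log p(S \to x_i S x_j); \\ \max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,j) + \log p(S \to SS) \right]. \end{cases}$$
An addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.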
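A sketch of the CYK fill stage for this grammar follows. The production probabilities are hypothetical values invented for illustration, not trained parameters; non-complementary pairs are given probability zero. The names `cyk_fill` and `log_p_pair` are likewise chosen for this example.

```python
from math import inf, log

# Hypothetical production probabilities (illustrative only, not trained).
P_LEFT = 0.05    # p(S -> x S), taken to be the same for every base x
P_RIGHT = 0.05   # p(S -> S x), taken to be the same for every base x
P_PAIR = {("G", "C"): 0.1, ("C", "G"): 0.1, ("A", "U"): 0.1, ("U", "A"): 0.1}
P_BIF = 0.1      # p(S -> S S)


def log_p_pair(a, b):
    """log p(S -> a S b); minus infinity for pairs the grammar does not produce."""
    return log(P_PAIR[(a, b)]) if (a, b) in P_PAIR else -inf


def cyk_fill(seq):
    """Return the gamma matrix of best log probabilities (1-based indices)."""
    L = len(seq)
    # Initialisation: gamma(i, i-1) = -inf everywhere, then the diagonal.
    gamma = [[-inf] * (L + 2) for _ in range(L + 2)]
    for i in range(1, L + 1):
        gamma[i][i] = max(log(P_LEFT), log(P_RIGHT))
    for span in range(1, L):
        for i in range(1, L - span + 1):
            j = i + span
            best = max(
                gamma[i + 1][j] + log(P_LEFT),
                gamma[i][j - 1] + log(P_RIGHT),
                gamma[i + 1][j - 1] + log_p_pair(seq[i - 1], seq[j - 1]),
            )
            for k in range(i + 1, j):
                best = max(best, gamma[i][k] + gamma[k + 1][j] + log(P_BIF))
            gamma[i][j] = best
    return gamma
```

The structure of the loops is the same as in the Nussinov fill stage; the base-pair count is replaced by a sum of log probabilities. With these example values, `cyk_fill("GAC")[1][3]` corresponds to pairing G with C around the unpaired A, because that parse is more probable than leaving all three bases unpaired.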
CYK for Nussinov-style RNA SCFG (2)
• The value $\gamma(1, L)$ is the log likelihood $\log P(x, \hat{\pi} \mid \theta)$ of the optimal structure $\hat{\pi}$ given the SCFG model $\theta$
• The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm
CYK for Nussinov-style RNA SCFG (3)
• Good starting example (10.2), but it is too simple to be an accurate RNA folder
• The algorithm does not consider important structural features like preferences for certain:
  – Loop lengths
  – Nearest neighbours in the structure caused by stacking interactions between neighbouring base pairs in a stem.
Zuker folding algorithm (1)
• Most sophisticated secondary structure prediction method for single RNAs
  – An energy minimisation algorithm which assumes that the correct structure is the one with the lowest equilibrium free energy ($\Delta G$)
• The $\Delta G$ of an RNA secondary structure is approximated as the sum of individual contributions from loops, base pairs and other secondary structure elements.
Zuker folding algorithm (2)
• Difference with the Nussinov folding algorithm:
  – Energies of stems are calculated by adding stacking contributions for the interface between neighbouring base pairs instead of individual contributions for each pair.
• Advantage:
  – Better fit to experimentally observed $\Delta G$ values for RNA structures, but it complicates the dynamic programming algorithm
Zuker folding algorithm (3)
Freier energy rules
• The energies in the tables are from the older ‘Freier rules’ at 37 °C.
• For more information see the article “Improved free-energy parameters for predictions of RNA duplex stability” by Freier et al.
Zuker folding algorithm (4)
Zuker folding algorithm (5)
Zuker folding algorithm (6)
• The minimum energy structure can be calculated recursively by a dynamic programming algorithm very similar to the way the maximum base-paired structure was calculated in the Nussinov algorithm.
• Now we keep two matrices
  – W(i,j) is the energy of the best structure on i,j
  – V(i,j) is the energy of the best structure on i,j given that i,j are paired.
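The two-matrix idea can be illustrated with a heavily simplified toy model. This is not the Zuker algorithm as implemented in real folding programs: all energies below are hypothetical constants rather than Freier or Turner parameters, multi-branched loops inside a base pair are left out, and the function names are invented for this sketch. The inner search over enclosed pairs also makes it much slower than a practical implementation.

```python
from math import inf

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}

# Hypothetical energies in kcal/mol, chosen only for illustration.
STACK = -2.0       # two directly stacked base pairs
HAIRPIN = 3.0      # closing a hairpin loop
MIN_HAIRPIN = 3    # minimum number of unpaired bases in a hairpin loop


def loop_energy(unpaired):
    """Energy between two nested pairs separated by `unpaired` free bases."""
    return STACK if unpaired == 0 else 1.0 + 0.5 * unpaired


def fold_energy(seq):
    """Toy minimum energy on the whole sequence, using 0-based W and V matrices."""
    n = len(seq)
    if n == 0:
        return 0.0
    V = [[inf] * n for _ in range(n)]   # best energy given that i, j are paired
    W = [[0.0] * n for _ in range(n)]   # best energy on i..j (unfolded = 0)
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) in PAIRS:
                best = HAIRPIN if j - i - 1 >= MIN_HAIRPIN else inf
                # (i, j) closes a stack, bulge or interior loop around (p, q).
                for p in range(i + 1, j):
                    for q in range(p + 1, j):
                        if V[p][q] < inf:
                            unpaired = (p - i - 1) + (j - q - 1)
                            best = min(best, V[p][q] + loop_energy(unpaired))
                V[i][j] = best
            w = min(W[i + 1][j], W[i][j - 1], V[i][j])
            for k in range(i, j):       # two independent substructures
                w = min(w, W[i][k] + W[k + 1][j])
            W[i][j] = w
    return W[0][n - 1]
```

In this toy model a stem only pays off when its stacking energy outweighs the hairpin penalty: `fold_energy("GGGAAAACCC")` comes out at one hairpin plus two stacks, while a sequence that cannot form any pair stays at the unfolded energy of zero.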
Suboptimal RNA folding
(CYK algorithm will be explained next Wednesday)
• The original Zuker algorithm finds only the optimal structure.
• The biologically correct structure is often not the calculated optimal structure.
• Zuker introduced a suboptimal folding algorithm.
  – It is similar to running the CYK algorithm in both inside and outside directions.
• The algorithm samples one base pair suboptimally.
• The rest of the structure is the optimal structure given that base pair.
Demonstration
RNAstructure
By David H. Mathews, Michael Zuker and Douglas H. Turner
Demo: RNAstructure (1)
• The core of RNAstructure is a dynamic programming algorithm to predict RNA or DNA secondary structures from sequence based on the principle of minimizing free energy.
Demo: RNAstructure (2)
• The prediction of a secondary structure is based on the Zuker algorithm for free energy minimization using the nearest neighbour parameters of Doug Turner and co-workers.
• A recursive algorithm is used that generates an optimal structure and a series of structures that are called sub-optimal structures (structures with free energy similar to the lowest free energy structure).
Demo: RNAstructure (3)
• The number of sub-optimal structures generated is controlled by two parameters entered by the user:
  – Max % Energy Difference: Sets the maximum percent difference from the lowest free energy allowed for the structures output. For example, if the lowest free energy structure is -100 kcal/mol and the Max % Energy Difference is 10, any structure with an energy of -90 kcal/mol or higher is rejected (higher means less negative).
  – Max number of structures: Sets an absolute upper limit on the number of structures that can be generated. A maximum of 1000 structures can be generated.
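The effect of the two parameters can be sketched as a simple filter over a list of free energies. This is not RNAstructure's code or API: the function name `filter_suboptimal` and its arguments are hypothetical, and the cutoff follows the wording of the slide (an energy exactly at the cutoff is rejected).

```python
def filter_suboptimal(energies, max_percent_diff=10.0, max_structures=1000):
    """Keep energies within max_percent_diff of the lowest one, best first."""
    if not energies:
        return []
    lowest = min(energies)
    # For lowest = -100 and 10 percent, the cutoff is -90 kcal/mol.
    cutoff = lowest + abs(lowest) * max_percent_diff / 100.0
    # The optimal structure itself is always kept.
    kept = sorted(e for e in energies if e == lowest or e < cutoff)
    return kept[:max_structures]
```

With the energies -100, -95, -90 and -80 kcal/mol and the default 10 percent, only -100 and -95 remain; lowering `max_structures` to 1 keeps just the optimal structure.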
Demo: RNAstructure (4)
• A third parameter entered is Window Size. This controls how different the sub-optimal structures must be from each other. A small window size allows very similar structures to be generated while a larger window size requires them to be more different
Demonstration
Questions?