RNA Secondary Structure Prediction
RNA structure prediction methods
Base-Pair Maximization
Context-Free Grammar Parsing.
Free Energy Methods
Covariance Models
The Nussinov-Jacobson Algorithm
Sequence (positions 1 to 9): A C A G U U G C A; q = 9
Fill matrix (row i, column j; rows 2 to 9 start at column i-1, and cell (1, 9) is not yet filled):
       1  2  3  4  5  6  7  8  9
       A  C  A  G  U  U  G  C  A
1  A   0  0  0  1  2  2  2  3
2  C   0  0  0  1  1  1  2  2  3
3  A      0  0  0  1  1  1  2  3
4  G         0  0  0  0  0  1  2
5  U            0  0  0  0  1  2
6  U               0  0  0  1  2
7  G                  0  0  1  1
8  C                     0  0  0
9  A                        0  0
SCFG Version
• The Nussinov algorithm can be converted to
a stochastic context-free grammar:
• S → W
• W → aW | cW | gW | uW
• W → Wa | Wc | Wg | Wu
• W → aWu | cWg | uWa | gWc
• W → WW
SCFGs
• Stochastic Context Free Grammars (SCFGs) have also been used to model RNA secondary structure
• Examples:
– tRNAscan-SE
– a program created to find snoRNAs
• Grammars are built from a training set of sequences and are then applied to candidate sequences to test whether they belong to the language
SCFGs
• SCFGs allow the detection of
sequences belonging to a family
– tRNAs
– group I introns
– snoRNAs
– snRNAs
SCFGs
• Any RNA structure can be reduced to an
SCFG (see Durbin et al., pp. 278-279)
Transformational Grammars
• First described by the linguist Noam
Chomsky in the 1950s.
– (Yes, the same Noam Chomsky who has
expressed various dissident political views
throughout the years!)
13 June 2006 9
Transformational Grammars
• Very important in computer science,
most notably in compiler design
• Covered in detail in compiler and
automaton classes
Transformational Grammars
• Idea: take a set of outputs (sentence, RNA structure) and determine if it can be produced using a set of rules
• Consist of a set of symbols and production rules
• The symbols can be terminal (emitting) symbols or non-terminal symbols
Grammar for Palindromes
• Consider palindromic DNA sequences
• Five possible terminal symbols: {a, c, g,
t, ε} (ε represents the blank terminal
symbol)
Grammar for Palindromes
• Production Rules, where S and W are
non-terminal symbols:
• S → W
• W → aWa | cWc | gWg | tWt
• W → a | c | g | t | ε
Derivation of Sequences
• Using these production rules, a
derivation of the palindromic sequence
acttgttca follows:
• S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒
acttWttca ⇒ acttgttca
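The derivation above can be replayed mechanically. The following is a minimal sketch, assuming the palindrome grammar from the previous slide; the function name `derive_palindrome` is hypothetical and chosen only for illustration.

```python
def derive_palindrome(target: str) -> list[str]:
    """Return the sentential forms S, W, aWa, ... that derive `target`.

    Assumes the rules S -> W, W -> xWx (x in a, c, g, t) and
    W -> a | c | g | t | epsilon. Raises ValueError if `target`
    is not a palindrome over {a, c, g, t}.
    """
    if any(ch not in "acgt" for ch in target):
        raise ValueError("only a, c, g, t are terminal symbols")
    steps = ["S", "W"]
    left, right = 0, len(target) - 1
    while left < right:
        if target[left] != target[right]:
            raise ValueError(f"{target!r} is not a palindrome")
        left += 1
        right -= 1
        # Apply W -> xWx: one more matching symbol on each side of W.
        steps.append(target[:left] + "W" + target[right + 1:])
    # Finish with W -> x (odd length) or W -> epsilon (even length).
    steps.append(target)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
```

Each loop iteration corresponds to one application of a pairwise-emitting rule, which is the same mechanism the RNA grammars use for base pairs.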
SCFGs for RNA
• Base-paired columns are modeled by pairwise-emitting nonterminals
– aWu; uWa; gWc; cWg; ...
• Single-stranded columns are modeled by leftwise-emitting nonterminals (when possible)
– aW; cW; gW; uW; ...
Parse Trees
• A context-free grammar can be aligned to a sequence using a parse tree
• Root of the tree is the non-terminal start symbol, S
• Leaves are terminal symbols
• Internal nodes are the nonterminals
• Leaves can be parsed from left to right to view the results of production
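The bullets above can be sketched as a small data structure. This is an illustrative example only: the `Node` class, the `leaves` helper, and the example tree (for the palindrome "acgca" under the palindrome grammar) are assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A parse-tree node: a grammar symbol plus its ordered children."""
    symbol: str
    children: list = field(default_factory=list)


def leaves(node):
    """Collect the terminal symbols by reading the leaves left to right.

    A node without children is treated as a terminal leaf.
    """
    if not node.children:
        return [node.symbol]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Root is the start symbol S; internal nodes are W; leaves are terminals.
tree = Node("S", [
    Node("W", [
        Node("a"),
        Node("W", [Node("c"), Node("W", [Node("g")]), Node("c")]),
        Node("a"),
    ])
])

print("".join(leaves(tree)))
```

Reading the leaves from left to right recovers the sequence that the productions generated.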
Parse Tree
[Figure: parse tree with root S above a chain of nested W nodes; the leaf labels, as extracted, read: atta c cg t t]
CYK (Cocke-Younger-Kasami)
Parsing Algorithm
Seyed Mohammad Hossein Moattar
Natural Language Processing
Amirkabir University of Technology
Faculty of Computer Engineering
Parsing Algorithms
• CFGs are the basis for describing the (syntactic) structure of NL sentences
• Thus, parsing algorithms are the core of NL analysis systems
• Recognition vs. parsing:
– Recognition: deciding membership in the language
– Parsing: recognition plus producing a parse tree for the input
• Is parsing more “difficult” than recognition? (time complexity)
• Ambiguity: an input may have exponentially many parses
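The ambiguity bullet can be made concrete with a deliberately ambiguous toy grammar, S → S S | a. This grammar is an assumption for illustration and does not come from these slides. The sketch counts the distinct parse trees of the string a^n by summing over every split point.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(n: int) -> int:
    """Number of parse trees of the string a^n under S -> S S | a."""
    if n == 1:
        return 1  # only S -> a applies
    # S -> S S: the left S covers k letters, the right S covers n - k.
    return sum(count_parses(k) * count_parses(n - k) for k in range(1, n))


for n in range(1, 11):
    print(n, count_parses(n))
```

The counts follow the Catalan numbers, which grow exponentially in n. This is why a parser stores shared sub-results in a table instead of enumerating the trees.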
CYK (Cocke-Younger-Kasami)
• One of the earliest recognition and parsing algorithms
• The standard version of CYK operates only on context-free grammars in Chomsky Normal Form (CNF)
• The CYK algorithm can also be extended to handle some grammars that are not in CNF
– Harder to understand
• Based on a “dynamic programming” approach:
– Build solutions compositionally from sub-solutions
– Store sub-solutions and re-use them whenever necessary
• Recognition version: decide whether S ⇒* w
CYK Algorithm
• The CYK algorithm for the membership problem is as follows:
– Let the input string be a sequence of n letters a1 ... an.
– Let the grammar contain r terminal and nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
– Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
– For each i = 1 to n
  • For each unit production Rj -> ai, set P[i,1,j] = true.
– For each i = 2 to n -- Length of span
  • For each j = 1 to n-i+1 -- Start of span
    – For each k = 1 to i-1 -- Partition of span
      » For each production RA -> RB RC
      » If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
– If P[1,n,1] is true
  • Then string is member of language
  • Else string is not member of language
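The steps above translate almost line by line into code. The following is a minimal sketch under some assumptions: start positions are 0-based while span lengths stay 1-based, rules are passed as plain tuples, and the names `cyk_recognize`, `UNIT_RULES`, `BINARY_RULES` and `SYMBOLS` are hypothetical. The example grammar is the one used in the worked aaabbb example of this deck, without its ε production, so the empty string is simply rejected here.

```python
def cyk_recognize(word, unit_rules, binary_rules, symbols, start):
    """Boolean-array CYK membership test for a grammar in CNF."""
    n = len(word)
    if n == 0:
        return False  # epsilon productions are not handled in this sketch
    index = {sym: r for r, sym in enumerate(symbols)}
    # P[j][i][a] is True if the substring of length i starting at
    # position j (0-based) can be generated from symbols[a].
    P = [[[False] * len(symbols) for _ in range(n + 1)] for _ in range(n)]
    for j, letter in enumerate(word):
        for lhs, terminal in unit_rules:
            if terminal == letter:
                P[j][1][index[lhs]] = True
    for i in range(2, n + 1):            # length of span
        for j in range(0, n - i + 1):    # start of span
            for k in range(1, i):        # partition of span
                for lhs, (b, c) in binary_rules:
                    if P[j][k][index[b]] and P[j + k][i - k][index[c]]:
                        P[j][i][index[lhs]] = True
    return P[0][n][index[start]]


SYMBOLS = ["S", "T", "X", "A", "B"]
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", ("A", "B")), ("S", ("X", "B")),
    ("T", ("A", "B")), ("T", ("X", "B")),
    ("X", ("A", "T")),
]

print(cyk_recognize("aaabbb", UNIT_RULES, BINARY_RULES, SYMBOLS, "S"))
```

The three nested loops over span length, start and partition give the cubic running time in n, multiplied by the number of binary productions.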
CYK Pseudocode
On input x = x1x2 … xn :
for (i = 1 to n) // create middle diagonal
  for (each var. A)
    if (A → xi)
      add A to table[i-1][i]
for (d = 2 to n) // d’th diagonal
  for (i = 0 to n-d)
    for (k = i+1 to i+d-1) // split point inside the span from i to i+d
      for (each var. A)
        for (each var. B in table[i][k])
          for (each var. C in table[k][i+d])
            if (A → BC)
              add A to table[i][i+d]
return S ∈ table[0][n] ? ACCEPT : REJECT
CYK Algorithm
• This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence of length j starting at i can be generated from Rk.
• Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on.
• For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production P -> Q R such that Q matches the first part and R matches the second part. If so, it records P as matching the whole subsequence.
• Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.
CYK Algorithm for Deciding Context-Free Languages
Q: Consider the grammar G given by
S → ε | AB | XB
T → AB | XB
X → AT
A → a
B → b
1. Is x = aaabbb in L(G)?
CYK Algorithm for Deciding Context-Free Languages
Now look at aaabbb, using the grammar
S → ε | AB | XB
T → AB | XB
X → AT
A → a
B → b
and building up from the shortest substrings to the whole string:
1) Write variables for all length 1 substrings: each a gives A, each b gives B.
2) Write variables for all length 2 substrings: only ab (letters 3 to 4) is derivable, giving S,T.
3) Write variables for all length 3 substrings: aab (letters 2 to 4) gives X.
4) Write variables for all length 4 substrings: aabb (letters 2 to 5) gives S,T.
5) Write variables for all length 5 substrings: aaabb (letters 1 to 5) gives X.
6) Write variables for all length 6 substrings: aaabbb gives S,T.
S is included so
aaabbb accepted!
CYK Algorithm for Deciding Context-Free Languages
Can also use a table for same purpose. Rows give the position a substring of aaabbb starts at (0 to 5), columns the position it ends at (1 to 6); "-" marks a substring that no variable derives. The table is filled one diagonal at a time:
1. Variables for length 1 substrings.
2. Variables for length 2 substrings.
3. Variables for length 3 substrings.
4. Variables for length 4 substrings.
5. Variables for length 5 substrings.
6. Variables for aaabbb. ACCEPTED!
start \ end   1     2     3     4     5     6
0             A     -     -     -     X     S,T
1                   A     -     X     S,T   -
2                         A     S,T   -     -
3                               B     -     -
4                                     B     -
5                                           B
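The table can be reproduced with a set-valued variant of CYK that follows the start/end indexing of these slides. This is a minimal sketch; the function name `cyk_table` and the tuple encoding of the rules are assumptions, and the ε production of S is left out because it only matters for the empty string.

```python
def cyk_table(word, unit_rules, binary_rules):
    """Return {(start, end): set of variables deriving word[start:end]}."""
    n = len(word)
    table = {}
    # Length 1 substrings: the first diagonal.
    for i, letter in enumerate(word):
        table[(i, i + 1)] = {lhs for lhs, t in unit_rules if t == letter}
    # Longer substrings, one diagonal (length d) at a time.
    for d in range(2, n + 1):
        for i in range(0, n - d + 1):
            j = i + d
            cell = set()
            for k in range(i + 1, j):  # split word[i:j] at k
                for lhs, (b, c) in binary_rules:
                    if b in table[(i, k)] and c in table[(k, j)]:
                        cell.add(lhs)
            table[(i, j)] = cell
    return table


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", ("A", "B")), ("S", ("X", "B")),
    ("T", ("A", "B")), ("T", ("X", "B")),
    ("X", ("A", "T")),
]

table = cyk_table("aaabbb", UNIT_RULES, BINARY_RULES)
for start in range(6):
    row = [",".join(sorted(table[(start, end)])) or "-"
           for end in range(start + 1, 7)]
    print(start, row)
```

The string is accepted when the start symbol S appears in the cell for start 0 and end 6.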
Parsing results
• We keep the results for every substring w_ij in a
table.
• Note that we only need to fill in entries
up to the diagonal: the longest
substring starting at i is of length n-i+1
Constructing parse tree
• We need to construct parse trees for
string w:
• Idea:
– Keep back-pointers to the table entries that
we combine
– At the end - reconstruct a parse from the
back-pointers
• This allows us to find all parse trees
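The back-pointer idea can be sketched as follows. This is an illustrative implementation, not the one from the slides: the names `cyk_parse` and `tree_yield` are hypothetical, trees are nested tuples, and only the first recorded back-pointer is followed, so a single parse tree is returned even though every alternative is stored.

```python
def cyk_parse(word, unit_rules, binary_rules, start):
    """Return one parse tree as nested tuples, or None if word is rejected."""
    n = len(word)
    back = {}  # back[(i, j)][A] = list of ways to derive word[i:j] from A
    for i, letter in enumerate(word):
        back[(i, i + 1)] = {lhs: [letter] for lhs, t in unit_rules if t == letter}
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            cell = {}
            for k in range(i + 1, j):
                for lhs, (b, c) in binary_rules:
                    if b in back[(i, k)] and c in back[(k, j)]:
                        # Back-pointer: split point and the two child variables.
                        cell.setdefault(lhs, []).append((k, b, c))
            back[(i, j)] = cell

    def build(symbol, i, j):
        entry = back[(i, j)][symbol][0]  # follow the first alternative
        if isinstance(entry, str):
            return (symbol, entry)       # unit production A -> a
        k, b, c = entry
        return (symbol, build(b, i, k), build(c, k, j))

    if n == 0 or start not in back[(0, n)]:
        return None
    return build(start, 0, n)


def tree_yield(tree):
    """Concatenate the terminal leaves of a nested-tuple tree."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return tree[1]
    return "".join(tree_yield(child) for child in tree[1:])


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", ("A", "B")), ("S", ("X", "B")),
    ("T", ("A", "B")), ("T", ("X", "B")),
    ("X", ("A", "T")),
]

print(cyk_parse("aaabbb", UNIT_RULES, BINARY_RULES, "S"))
```

Enumerating all parse trees would mean iterating over every stored alternative in `build` instead of taking the first one.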
The Nussinov-Jacobson Algorithm
[Figure: the sequence A C A G U U G C A (positions 1 to 9) with the fill matrix shown earlier; the interval from i to j is split between positions q-1 and q, where i < q ≤ j]
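• Co-terminus foldings: positions i and j pair with each other
• Partitionable foldings: the interval from i to j splits into two parts that fold independently
[Figure: example foldings drawn on A U C A U G G C A U and on A C A G U U G C A (positions 1 to 9)]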
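Another way to write the
Nussinov-Jacobson recursion
• Initialization:
\[ \gamma(i, i-1) = 0 \ \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = 0 \ \text{for } i = 1 \text{ to } L \]
• Recursion:
\[ \gamma(i, j) = \max \begin{cases} \gamma(i+1, j); \\ \gamma(i, j-1); \\ \gamma(i+1, j-1) + \mathrm{BasePairScore}(i, j); \\ \max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) \right]. \end{cases} \]
– First two cases: two special cases of Partitionable Folding
– Third case: Co-Terminus Folding
– Fourth case: Partitionable Folding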
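The fill stage of this recursion can be sketched in a few lines. The sketch makes assumptions the slides do not state: BasePairScore is taken to be 1 for the Watson-Crick pairs A-U and G-C and 0 otherwise, no minimum loop length is enforced, and the name `nussinov_fill` is hypothetical. Under these assumptions the values agree with the fill matrix shown for A C A G U U G C A.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov_fill(seq, pairs=WATSON_CRICK):
    """Return gamma as a 1-based table; gamma[i][j] = max pairs in seq[i..j]."""
    L = len(seq)
    # Zero-filled table covers the initialization
    # gamma(i, i-1) = gamma(i, i) = 0.
    gamma = [[0] * (L + 2) for _ in range(L + 2)]
    for span in range(1, L):                 # span = j - i
        for i in range(1, L - span + 1):
            j = i + span
            score = 1 if (seq[i - 1], seq[j - 1]) in pairs else 0
            best = max(
                gamma[i + 1][j],              # i unpaired
                gamma[i][j - 1],              # j unpaired
                gamma[i + 1][j - 1] + score,  # i, j pair (co-terminus)
            )
            for k in range(i + 1, j):         # bifurcation (partitionable)
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


gamma = nussinov_fill("ACAGUUGCA")
for i in range(1, 10):
    print(i, gamma[i][max(i - 1, 1):10])
```

The entry gamma[1][L] is the maximum number of base pairs for the whole sequence.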
SCFG version of the
Nussinov-Jacobson algorithm
• Stochastic Context-Free Grammars
• Makes use of production rules:
– W → aW | cW | gW | uW (i unpaired)
• Every production rule has an associated
probability parameter.
• The maximum probability parse is
equivalent to the maximum probability
secondary structure.
SCFG Version of Nussinov-
Jacobson Algorithm
• The algorithm can be converted to a stochastic context-free grammar:
• S → W
• W → aW | cW | gW | uW
• W → Wa | Wc | Wg | Wu
• W → aWu | cWg | uWa | gWc
• W → WW
Needed terminology
• The inside-outside (recursive dynamic
programming) algorithm for SCFGs in
Chomsky normal form is the natural
counterpart of the forward-backward
algorithm for HMMs.
• The best-path variant of the inside-outside
algorithm is the Cocke-Younger-Kasami
(CYK) algorithm. It finds the maximum
probability alignment of the SCFG to the
sequence.
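CYK for Nussinov-style
RNA SCFG
• Initialization:
\[ \gamma(i, i-1) = -\infty \ \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = \max \{ \log p(W \to x_i W),\ \log p(W \to W x_i) \} \ \text{for } i = 1 \text{ to } L \]
• Recursion:
\[ \gamma(i, j) = \max \begin{cases} \gamma(i+1, j) + \log p(W \to x_i W); \\ \gamma(i, j-1) + \log p(W \to W x_j); \\ \gamma(i+1, j-1) + \log p(W \to x_i W x_j); \\ \max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) + \log p(W \to WW) \right]. \end{cases} \]
– The log p terms are the addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.
– First two cases: two special cases of Partitionable Folding
– Third case: Co-Terminus Folding
– Fourth case: Partitionable Folding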
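A minimal sketch of this fill stage follows. The probability values in `LOGP` are invented for illustration (they sum to 1 over the 13 productions of W but were not trained on any data), the function name `scfg_cyk` is hypothetical, and the nonterminal is written W throughout. Pair productions that the grammar lacks, such as aWa, are given log probability minus infinity.

```python
import math

NEG_INF = float("-inf")

# Hypothetical parameters: 4 pair rules at 0.15, 4 left rules at 0.05,
# 4 right rules at 0.025, and the bifurcation rule W -> WW at 0.1.
LOGP = {
    "left": {x: math.log(0.05) for x in "ACGU"},    # W -> xW
    "right": {x: math.log(0.025) for x in "ACGU"},  # W -> Wx
    "pair": {p: math.log(0.15)                      # W -> xWy
             for p in [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]},
    "bifurcation": math.log(0.1),                   # W -> WW
}


def scfg_cyk(seq, logp=LOGP):
    """Return gamma(1, L): the best log score under the recursion above."""
    L = len(seq)
    # Filling with -inf covers the initialization gamma(i, i-1) = -inf.
    gamma = [[NEG_INF] * (L + 2) for _ in range(L + 2)]
    for i in range(1, L + 1):
        x = seq[i - 1]
        gamma[i][i] = max(logp["left"][x], logp["right"][x])
    for span in range(1, L):
        for i in range(1, L - span + 1):
            j = i + span
            xi, xj = seq[i - 1], seq[j - 1]
            best = max(
                gamma[i + 1][j] + logp["left"][xi],
                gamma[i][j - 1] + logp["right"][xj],
                gamma[i + 1][j - 1] + logp["pair"].get((xi, xj), NEG_INF),
            )
            for k in range(i + 1, j):
                best = max(best,
                           gamma[i][k] + gamma[k + 1][j] + logp["bifurcation"])
            gamma[i][j] = best
    return gamma[1][L]


print(scfg_cyk("ACAGUUGCA"))
```

Working in log space turns products of rule probabilities into sums, so the loop structure is identical to the base-pair maximization version and only the scores change.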
CYK for Nussinov-style
RNA SCFG (2)
• The value \(\gamma(1, L)\) is the log likelihood
\(\log P(x, \hat{\pi} \mid \theta)\)
of the optimal structure \(\hat{\pi}\) given the
SCFG model \(\theta\)
• The traceback to find the secondary
structure corresponding to the best
score is performed analogously to the
traceback in the Nussinov algorithm
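For reference, the Nussinov traceback that this slide refers to can be sketched as below. The sketch assumes a base-pair score of 1 for Watson-Crick pairs and 0 otherwise, uses an explicit stack, and returns one optimal structure among possibly several; the names `fill` and `traceback` are hypothetical. The SCFG version would compare log-probability sums instead of pair counts.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def fill(seq):
    """Nussinov fill stage with a 1-based, zero-initialized table."""
    L = len(seq)
    g = [[0] * (L + 2) for _ in range(L + 2)]
    for span in range(1, L):
        for i in range(1, L - span + 1):
            j = i + span
            d = 1 if (seq[i - 1], seq[j - 1]) in PAIRS else 0
            best = max(g[i + 1][j], g[i][j - 1], g[i + 1][j - 1] + d)
            for k in range(i + 1, j):
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g


def traceback(seq, g):
    """Recover one optimal set of base pairs (1-based positions)."""
    pairs = []
    stack = [(1, len(seq))]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        d = 1 if (seq[i - 1], seq[j - 1]) in PAIRS else 0
        if g[i + 1][j] == g[i][j]:                      # i unpaired
            stack.append((i + 1, j))
        elif g[i][j - 1] == g[i][j]:                    # j unpaired
            stack.append((i, j - 1))
        elif d == 1 and g[i + 1][j - 1] + 1 == g[i][j]:  # i, j pair
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:                                           # bifurcation
            for k in range(i + 1, j):
                if g[i][k] + g[k + 1][j] == g[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)


seq = "ACAGUUGCA"
structure = traceback(seq, fill(seq))
print(structure)
```

Each popped interval is replaced by strictly shorter intervals, so the loop terminates after a number of steps linear in the sequence length.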
Example of RNA Structure
SCFG
• The RNA structure produced by MFOLD for the following sequence (5’ to 3’) can be constructed with the grammar:
• GCUUACGACCAUAUCACGUUGAAUGCAC
GCCAUCCCGUCCGAUCUGGCAAGUUAAG
CAACGUUGAGUCCAGUUAGUACUUGGAU
CGGAGACGGCCUGGGAAUCCUGGAUGU
UGUAAGCU
Example Construction
• S
• W
• Wu
• gWcu
• gcWgcu
• gcuWagcu
• gcuuWaagcu
• gcuuaWuaagcu
• gcuuacWguaagcu
• gcuuacgWuguaagcu
• gcuuacgaWuuguaagcu
• gcuuacgacWguuguaagcu
• gcuuacgaccWguuguaagcu
• gcuuacgaccaWguuguaagcu....
CYK for Nussinov-style
RNA SCFG
• Good starting example, but it is too
simple to be an accurate RNA folder
• The algorithm does not consider
important structural features like
preferences for certain:
– Loop lengths
– Nearest neighbours in the structure caused
by stacking interactions between
neighbouring base pairs in a stem.