
Post on 16-Jul-2020


RNA Secondary Structure Prediction

2

RNA structure prediction methods

Base-Pair Maximization

Context-Free Grammar Parsing.

Free Energy Methods

Covariance Models

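The Nussinov-Jacobson Algorithm

[Figure: sequence A C A G U U G C A, positions 1 to 9, with q = 9 marked.]

Fill matrix (row i, column j). Row i starts at the entry for (i, i-1); the entry for (1, 9) is left blank.

     1  2  3  4  5  6  7  8  9
     A  C  A  G  U  U  G  C  A
1 A  0  0  0  1  2  2  2  3
2 C  0  0  0  1  1  1  2  2  3
3 A     0  0  0  1  1  1  2  3
4 G        0  0  0  0  0  1  2
5 U           0  0  0  0  1  2
6 U              0  0  0  1  2
7 G                 0  0  1  1
8 C                    0  0  0
9 A                       0  0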

SCFG Version

• The Nussinov algorithm can be converted to a stochastic context-free grammar:

• S → W

• W → aW | cW | gW | uW

• W → Wa | Wc | Wg | Wu

• W → aWu | cWg | uWa | gWc

• W → WW

SCFGs

• Stochastic context-free grammars (SCFGs) have also been used to model RNA secondary structure

• Examples:

– tRNAscan-SE

– a program created to find snoRNAs

• Grammars are built from a training set of data and then applied to candidate sequences to see whether they belong to the language

SCFGs

• SCFGs allow the detection of sequences belonging to a family

– tRNAs

– group I introns

– snoRNAs

– snRNAs

SCFGs

• Any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279)

Transformational Grammars

• First described by the linguist Noam Chomsky in the 1950s.

– (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

13 June 2006 9


Transformational Grammars

• Very important in computer science, most notably in compiler design

• Covered in detail in compiler and automata theory courses

12

Transformational Grammars

• Idea: take a set of outputs (sentence, RNA structure) and determine if it can be produced using a set of rules

• Consist of a set of symbols and production rules

• The symbols can be terminal (emitting) symbols or non-terminal symbols


Grammar for Palindromes

• Consider palindromic DNA sequences

• Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol)

18

Grammar for Palindromes

• Production rules, where S and W are non-terminal symbols:

• S → W

• W → aWa | cWc | gWg | tWt

• W → a | c | g | t | ε

19

Derivation of Sequences

• Using these production rules, a derivation of the palindromic sequence acttgttca follows:

• S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca
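The derivation for the palindrome grammar can be reproduced mechanically. The following is a minimal sketch, assuming the grammar S → W, W → xWx, W → x or the blank symbol, over the alphabet {a, c, g, t}; the function name `derive_palindrome` is hypothetical and not part of the slides.

```python
def derive_palindrome(seq, alphabet="acgt"):
    """Return the derivation steps S => W => ... => seq, or None if seq
    is not generated by the palindrome grammar (illustrative helper)."""
    steps = ["S", "W"]          # S -> W is always the first production
    left, right, core = "", "", seq
    # Apply W -> xWx while the two outermost symbols match.
    while len(core) >= 2 and core[0] == core[-1] and core[0] in alphabet:
        left += core[0]
        right = core[-1] + right
        core = core[1:-1]
        steps.append(left + "W" + right)
    # Finish with W -> x (single symbol) or W -> blank (empty core).
    if core == "" or (len(core) == 1 and core in alphabet):
        steps.append(left + core + right)
        return steps
    return None                 # mismatched ends: not a palindrome


if __name__ == "__main__":
    print(" => ".join(derive_palindrome("acttgttca")))
```

A sequence whose outermost symbols differ at some stage, such as acgt, has no derivation and the sketch returns None.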


SCFGs for RNA

• Base-paired columns are modeled by pairwise-emitting nonterminals

– aWu; uWa; gWc; cWg; ...

• Single-stranded columns are modeled by leftwise-emitting nonterminals, when possible

– aW; cW; gW; uW; ...

23

Parse Trees

• A context-free grammar can be aligned to a sequence using a parse tree

• Root of the tree is the non-terminal start symbol, S

• Leaves are terminal symbols

• Internal nodes are the nonterminals

• Leaves can be parsed from left to right to view the results of production


Parse Tree

[Figure: parse tree with root S above five nested W nodes; the leaves, read from left to right, spell a c t t g t t c a.]
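The relationship between a parse tree and the sequence it produces can be sketched in a few lines. This is only an illustration under stated assumptions: a node is represented as a (symbol, children) tuple, a leaf is a terminal string, and the helper names are hypothetical.

```python
def build_palindrome_tree(seq):
    """Build a parse tree for a palindrome under S -> W, W -> xWx | x | blank.
    Assumes seq really is a palindrome (no validation is done here)."""
    def expand_w(s):
        if len(s) <= 1:                       # W -> x or W -> blank
            return ("W", [s])
        return ("W", [s[0], expand_w(s[1:-1]), s[-1]])   # W -> xWx
    return ("S", [expand_w(seq)])


def leaves(node):
    """Collect the terminal symbols of the tree from left to right."""
    if isinstance(node, str):
        return [node] if node else []         # the blank symbol emits nothing
    _symbol, children = node
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


def count_symbol(node, symbol):
    """Count the internal nodes labelled with the given nonterminal."""
    if isinstance(node, str):
        return 0
    name, children = node
    return (name == symbol) + sum(count_symbol(c, symbol) for c in children)


if __name__ == "__main__":
    tree = build_palindrome_tree("acttgttca")
    print("".join(leaves(tree)), count_symbol(tree, "W"))
```

Reading the leaves from left to right recovers the sequence, and the internal nodes are exactly the nonterminals used in the derivation.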


CYK (Cocke-Younger-Kasami) Parsing Algorithm

Seyed Mohammad Hossein Moattar

Natural Language Processing

Amirkabir University of Technology

Faculty of Computer Engineering

Parsing Algorithms

• CFGs are the basis for describing the (syntactic) structure of NL sentences

• Thus, parsing algorithms are the core of NL analysis systems

• Recognition vs. parsing:

– Recognition: deciding whether a string is a member of the language

– Parsing: recognition plus producing a parse tree for the string

• Is parsing more “difficult” than recognition? (time complexity)

• Ambiguity: an input may have exponentially many parses

CYK (Cocke-Younger-Kasami)

• One of the earliest recognition and parsing algorithms

• The standard version of CYK operates only on context-free grammars given in Chomsky Normal Form (CNF).

• It is also possible to extend the CYK algorithm to handle some grammars which are not in CNF

– Harder to understand

• Based on a “dynamic programming” approach:

– Build solutions compositionally from sub-solutions

– Store sub-solutions and re-use them whenever necessary

• Recognition version: decide whether S ⇒* w

CYK Algorithm

• The CYK algorithm for the membership problem is as follows:

– Let the input string be a sequence of n letters a1 ... an.

– Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.

– Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.

– For each i = 1 to n

  • For each unit production Rj -> ai, set P[i,1,j] = true.

– For each i = 2 to n -- Length of span

  • For each j = 1 to n-i+1 -- Start of span

    – For each k = 1 to i-1 -- Partition of span

      » For each production RA -> RB RC

      » If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true

– If P[1,n,1] is true

  • Then string is member of language

  • Else string is not member of language
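The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name and the rule encoding (lists of tuples) are assumptions, and positions are 0-based here, so P[start][length][symbol] corresponds to P[start+1, length, symbol] in the description.

```python
def cyk_recognize(word, nonterminals, unit_rules, binary_rules):
    """Decide membership for a grammar in Chomsky Normal Form.

    nonterminals : list of names; index 0 is the start symbol
    unit_rules   : list of (A, terminal) for productions A -> terminal
    binary_rules : list of (A, B, C) for productions A -> B C
    (This encoding is hypothetical, chosen only for the sketch.)
    """
    n = len(word)
    if n == 0:
        return False            # the empty string is not handled by this sketch
    index = {name: k for k, name in enumerate(nonterminals)}
    r = len(nonterminals)
    # P[start][length][symbol]; length runs from 1 to n, index 0 is unused.
    P = [[[False] * r for _ in range(n + 1)] for _ in range(n)]

    # Spans of length 1: unit productions.
    for start, letter in enumerate(word):
        for A, terminal in unit_rules:
            if terminal == letter:
                P[start][1][index[A]] = True

    # Longer spans: try every partition into two shorter spans.
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            for split in range(1, length):
                for A, B, C in binary_rules:
                    if (P[start][split][index[B]]
                            and P[start + split][length - split][index[C]]):
                        P[start][length][index[A]] = True

    return P[0][n][0]           # does the start symbol cover the whole string?
```

The three nested loops over length, start and split give the cubic dependence on the string length, multiplied by the number of binary productions.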

CYK Pseudocode

On input x = x1x2 … xn :

for (i = 1 to n) // create middle diagonal
    for (each var. A)
        if (A → xi)
            add A to table[i-1][i]

for (d = 2 to n) // d’th diagonal
    for (i = 0 to n-d)
        for (k = i+1 to i+d-1) // split point between i and i+d
            for (each var. A)
                for (each var. B in table[i][k])
                    for (each var. C in table[k][i+d])
                        if (A → BC)
                            add A to table[i][i+d]

return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm

• This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence of length j starting at position i can be generated from Rk.

• Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on.

• For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production A -> B C such that B matches the first part and C matches the second part. If so, it records A as matching the whole subsequence.

• Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.

CYK Algorithm for Deciding Context-Free Languages

Q: Consider the grammar G given by

S → ε | AB | XB

T → AB | XB

X → AT

A → a

B → b

1. Is x = aaabbb in L(G)?

Now look at aaabbb and write, for each substring, the variables that generate it:

1) Length 1 substrings: a, a, a, b, b, b are generated by A, A, A, B, B, B.

2) Length 2 substrings: only ab (letters 3 to 4) is generated, by S and T.

3) Length 3 substrings: only aab (letters 2 to 4) is generated, by X.

4) Length 4 substrings: only aabb (letters 2 to 5) is generated, by S and T.

5) Length 5 substrings: only aaabb (letters 1 to 5) is generated, by X.

6) Length 6 substring: aaabbb is generated by S and T. S is included, so aaabbb is accepted!

CYK Algorithm for Deciding Context-Free Languages

Can also use a table for the same purpose. The entry in row "start at i", column "end at j" lists the variables that generate the part of aaabbb between positions i and j. The table is filled one diagonal at a time: variables for length 1 substrings first, then length 2, and so on up to length 6.

start at \ end at   1     2     3     4     5     6
0                   A     -     -     -     X     S,T
1                         A     -     X     S,T   -
2                               A     S,T   -     -
3                                     B     -     -
4                                           B     -
5                                                 B

The entry for the whole string (start at 0, end at 6) contains S, so aaabbb is ACCEPTED!
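The table for aaabbb can be reproduced with a short script. This is a sketch under assumptions: the grammar G is hard-coded as two hypothetical lookup dictionaries, the ε production of S is left out because it only matters for the empty string, and table[i][j] holds the variables generating the letters between positions i and j.

```python
# Grammar G, encoded for lookup (encoding is illustrative only).
UNIT = {"a": {"A"}, "b": {"B"}}                      # A -> a, B -> b
BINARY = {                                           # right-hand side -> left-hand sides
    ("A", "B"): {"S", "T"},                          # S -> AB, T -> AB
    ("X", "B"): {"S", "T"},                          # S -> XB, T -> XB
    ("A", "T"): {"X"},                               # X -> AT
}


def cyk_table(x):
    """Fill table[i][j] with the variables that generate x[i:j]."""
    n = len(x)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                        # middle diagonal
        table[i - 1][i] = set(UNIT.get(x[i - 1], ()))
    for d in range(2, n + 1):                        # d'th diagonal
        for i in range(0, n - d + 1):
            for k in range(i + 1, i + d):            # split point
                for B in table[i][k]:
                    for C in table[k][i + d]:
                        table[i][i + d] |= BINARY.get((B, C), set())
    return table


if __name__ == "__main__":
    t = cyk_table("aaabbb")
    for i in range(6):
        cells = [",".join(sorted(t[i][j])) or "-" for j in range(i + 1, 7)]
        print(i, "  ".join(cells))
    print("ACCEPT" if "S" in t[0][6] else "REJECT")
```

Looking up the pair (B, C) directly replaces the loop over every variable A in the pseudocode; the resulting table entries are the same.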

Parsing results

• We keep the results for every substring w_ij in a table.

• Note that we only need to fill in entries up to the diagonal – the longest substring starting at i is of length n-i+1

Constructing parse tree

• We need to construct parse trees for the string w

• Idea:

– Keep back-pointers to the table entries that we combine

– At the end, reconstruct a parse from the back-pointers

• This allows us to find all parse trees
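The back-pointer idea can be sketched as follows. This is an illustration under assumptions: the example grammar is hard-coded without its ε production, the names `cyk_with_backpointers` and `trees` are hypothetical, and a parse tree is represented as nested tuples.

```python
from collections import defaultdict

UNIT = {"a": ["A"], "b": ["B"]}                       # A -> a, B -> b
BINARY = [("S", "A", "B"), ("S", "X", "B"),
          ("T", "A", "B"), ("T", "X", "B"),
          ("X", "A", "T")]                            # (A, B, C) means A -> B C


def cyk_with_backpointers(x):
    """Map (i, j, A) to every way A can generate x[i:j].

    An entry is either a terminal string (unit production) or a
    (k, B, C) triple pointing at the two table entries combined."""
    n = len(x)
    back = defaultdict(list)
    for i, letter in enumerate(x):
        for A in UNIT.get(letter, []):
            back[(i, i + 1, A)].append(letter)
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for A, B, C in BINARY:
                    if (i, k, B) in back and (k, j, C) in back:
                        back[(i, j, A)].append((k, B, C))
    return back


def trees(back, i, j, A):
    """Yield every parse tree for x[i:j] rooted at A."""
    for entry in back.get((i, j, A), []):
        if isinstance(entry, str):
            yield (A, entry)
        else:
            k, B, C = entry
            for left in trees(back, i, k, B):
                for right in trees(back, k, j, C):
                    yield (A, left, right)


if __name__ == "__main__":
    back = cyk_with_backpointers("aabb")
    for tree in trees(back, 0, 4, "S"):
        print(tree)
```

Because every combination is stored, an ambiguous input yields one tree per distinct parse; storing only the first back-pointer per entry would recover a single parse instead.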

References

• Hopcroft and Ullman, “Introduction to Automata Theory, Languages, and Computation”, Section 6.3, pp. 139-141

• “CYK algorithm”, Wikipedia, the free encyclopedia

• A presentation by Zeph Grunschlag

The Nussinov-Jacobson Algorithm

[Figure: the fill matrix for A C A G U U G C A (positions 1 to 9, q = 9) shown again in three steps, with the subsequence from i to j split at a position q, i < q ≤ j, into a part ending at q-1 and a part starting at q.]

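[Figure: co-terminus foldings and partitionable foldings, illustrated on the sequences A U C A U G G C A U and A C A G U U G C A (positions 1 to 9).]

Another way to write the Nussinov-Jacobson recursion

• Initialization:

$$\gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L.$$

• Recursion:

$$\gamma(i,j) = \max \begin{cases}
\gamma(i+1, j); \\
\gamma(i, j-1); \\
\gamma(i+1, j-1) + \mathrm{BasePairScore}(i, j); \\
\max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1, j) \right].
\end{cases}$$

Here γ(i, j) is the best score for the subsequence from position i to position j. The first two cases are two special cases of partitionable folding, the third case is co-terminus folding, and the last case is partitionable folding.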
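The fill stage of the Nussinov-Jacobson recursion can be sketched in Python. The scoring used here is an assumption: a score of 1 for A-U and G-C pairs, 0 otherwise, and no minimum loop length, which is consistent with the recoverable entries of the matrix shown for ACAGUUGCA. Indices are 0-based, so gamma[i][j] corresponds to the entry (i+1, j+1).

```python
def nussinov_fill(seq, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Return the matrix of maximum base-pair counts for every subsequence.

    The pair set and the unit score per pair are illustrative assumptions."""
    n = len(seq)
    gamma = [[0] * n for _ in range(n)]      # gamma(i, i) = 0

    def g(i, j):
        # gamma(i, i-1) = 0: an empty subsequence contributes nothing.
        return gamma[i][j] if i <= j else 0

    for span in range(2, n + 1):             # subsequence length
        for i in range(n - span + 1):
            j = i + span - 1
            score = 1 if seq[i] + seq[j] in pairs else 0
            best = max(
                g(i + 1, j),                 # i unpaired
                g(i, j - 1),                 # j unpaired
                g(i + 1, j - 1) + score,     # i and j paired (co-terminus)
            )
            for k in range(i + 1, j):        # partition, i < k < j
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


if __name__ == "__main__":
    gamma = nussinov_fill("ACAGUUGCA")
    print(gamma[0][8])                       # best score for the whole sequence
```

For ACAGUUGCA the best score for the whole sequence comes from the partition case: two pairs in positions 1 to 5 plus two pairs in positions 6 to 9.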

SCFG version of the Nussinov-Jacobson algorithm

• Stochastic Context-Free Grammars

• Make use of production rules:

– W → aW | cW | gW | uW (i unpaired)

• Every production rule has an associated probability parameter.

• The maximum probability parse is equivalent to the maximum probability secondary structure.

SCFG Version of Nussinov-Jacobson Algorithm

• The algorithm can be converted to a stochastic context-free grammar:

• S → W

• W → aW | cW | gW | uW

• W → Wa | Wc | Wg | Wu

• W → aWu | cWg | uWa | gWc

• W → WW

Needed terminology

• The inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs.

• The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum probability alignment of the SCFG to the sequence.

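CYK for Nussinov-style RNA SCFG

• Initialization:

$$\gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i,i) = \max \begin{cases} \log p(W \to x_i W); \\ \log p(W \to W x_i) \end{cases} \quad \text{for } i = 1 \text{ to } L.$$

• Recursion:

$$\gamma(i,j) = \max \begin{cases}
\gamma(i+1, j) + \log p(W \to x_i W); \\
\gamma(i, j-1) + \log p(W \to W x_j); \\
\gamma(i+1, j-1) + \log p(W \to x_i W x_j); \\
\max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,j) \right] + \log p(W \to WW).
\end{cases}$$

The log-probability terms are the addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model. The first two cases are two special cases of partitionable folding, the third case is co-terminus folding, and the last case is partitionable folding.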
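The CYK fill for the Nussinov-style SCFG can be sketched by adding log-probability terms to the Nussinov fill. Everything about the parameters here is an assumption made for illustration: the 13 productions of W are given equal probability 1/13, and pair productions other than a-u, u-a, g-c and c-g are treated as having probability zero.

```python
import math

NEG_INF = float("-inf")
BASES = "acgu"
PAIRS = [("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")]


def uniform_log_params():
    """Hypothetical parameters: every production of W has probability 1/13."""
    lp = math.log(1.0 / 13.0)
    return {
        "left": {b: lp for b in BASES},      # W -> x W
        "right": {b: lp for b in BASES},     # W -> W x
        "pair": {p: lp for p in PAIRS},      # W -> x W y
        "bifurcate": lp,                     # W -> W W
    }


def cyk_nussinov_scfg(seq, params):
    """Return gamma(1, L): the log probability of the best parse of seq."""
    x = seq.lower()
    n = len(x)
    if n == 0:
        return NEG_INF
    gamma = [[NEG_INF] * n for _ in range(n)]

    def g(i, j):
        # gamma(i, i-1) = -infinity for the empty subsequence.
        return gamma[i][j] if i <= j else NEG_INF

    for i in range(n):                       # initialization: gamma(i, i)
        gamma[i][i] = max(params["left"][x[i]], params["right"][x[i]])

    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            best = max(
                g(i + 1, j) + params["left"][x[i]],
                g(i, j - 1) + params["right"][x[j]],
                g(i + 1, j - 1) + params["pair"].get((x[i], x[j]), NEG_INF),
            )
            for k in range(i + 1, j):        # bifurcation, i < k < j
                best = max(best,
                           gamma[i][k] + gamma[k + 1][j] + params["bifurcate"])
            gamma[i][j] = best
    return gamma[0][n - 1]


if __name__ == "__main__":
    print(cyk_nussinov_scfg("gaaac", uniform_log_params()))
```

With uniform parameters the best parse is simply the one that uses the fewest productions, so a base pair is preferred over emitting its two bases separately; trained parameters would change that trade-off.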

CYK for Nussinov-style RNA SCFG (2)

• The value γ(1, L) is the log likelihood log P(x, π̂ | θ) of the optimal structure π̂ given the SCFG model θ

• The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm

Example of RNA Structure SCFG

• A derivation of the RNA structure that MFOLD produces for the following sequence can be constructed (5’ to 3’):

• GCUUACGACCAUAUCACGUUGAAUGCAC

GCCAUCCCGUCCGAUCUGGCAAGUUAAG

CAACGUUGAGUCCAGUUAGUACUUGGAU

CGGAGACGGCCUGGGAAUCCUGGAUGU

UGUAAGCU

75

Example Construction

• S

• W

• Wu

• gWcu

• gcWgcu

• gcuWagcu

• gcuuWaagcu

• gcuuaWuaagcu

• gcuuacWguaagcu

• gcuuacgWuguaagcu

• gcuuacgaWuuguaagcu

• gcuuacgacWguuguaagcu

• gcuuacgaccWguuguaagcu

• gcuuacgaccaWguuguaagcu....

76

CYK for Nussinov-style

RNA SCFG

• Good starting example, but it is too simple to be an accurate RNA folder

• The algorithm does not consider important structural features, such as preferences for certain:

– Loop lengths

– Nearest neighbours in the structure, caused by stacking interactions between neighbouring base pairs in a stem