RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center for Bioinformatics, University of Leipzig Institute for Theoretical Chemistry, Univ. of Vienna (external faculty) The Santa Fe Institute (external faculty) CSSS, June 2006

RNA Bioinformatics

Peter F. Stadler

Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center for Bioinformatics, University of Leipzig

Institute for Theoretical Chemistry, Univ. of Vienna (external faculty)
The Santa Fe Institute (external faculty)

CSSS, June 2006


Overview

PART 1: RNA Structures and How to Compute Them

PART 2: RNA Landscapes

PART 3: The Modern RNA World


PART 1: Why RNA?

Until relatively recently: the Central Dogma of Molecular Biology
DNA → RNA → Protein
DNA = “genetic memory”, RNA = working copy, proteins do the work

Around 1980: discovery of catalytic RNAs (Nobel Prize for Tom Cech and Sidney Altman); nevertheless long considered “exotic” remnants of the ancient RNA world

Around 2000: the structure of the ribosome shows that the ribosome is an “RNA enzyme”

Around 2000: microRNAs are discovered as a large class of regulatory RNAs that inhibit translation of proteins

2006: the ENCODE project shows that human gene expression is quite different from textbook knowledge


RNA Bioinformatics

RNA Secondary Structures are an appropriate level of description

explain the thermodynamics of RNA Structures

often highly conserved in evolution

can be computed efficiently


Many Functional RNAs are Structured

(a) Group I intron P4-P6 domain
(b) Hammerhead ribozyme
(c) HDV ribozyme

(d) Yeast tRNA^Phe

(e) L1 domain of 23S rRNA

Hermann & Patel, JMB 294, 1999


The RNA Model

[Figure: cloverleaf secondary structure drawing of the tRNA sequence given below]

GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA


Formal Definition

A secondary structure on a sequence s is a collection Ω of pairs (i, j) with i < j such that

Base pairing rules are respected, i.e., (i, j) ∈ Ω implies that (s_i, s_j) form an allowed pair (GC, CG, AU, UA, GU, UG)

Each base is involved in at most one pair, i.e., Ω is a matching: (i, j), (i, k) ∈ Ω implies j = k, and (i, k), (j, k) ∈ Ω implies i = j.

(i, j) ∈ Ω implies |j − i| > 3 (steric constraint)

No-crossing rule: (i, j), (k, l) ∈ Ω and i < k implies either i < k < l < j or i < j < k < l. This excludes so-called pseudoknots.
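
The four conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_secondary_structure` and the representation of Ω as a list of 1-based `(i, j)` tuples are assumptions made for illustration.

```python
# Hypothetical helper: checks the four rules of the formal definition.
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}

def is_secondary_structure(seq, pairs, min_gap=3):
    """Return True if `pairs` (1-based (i, j) tuples) is a valid structure on `seq`."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        # Rule 1: only allowed base pairs
        if seq[i - 1] + seq[j - 1] not in ALLOWED_PAIRS:
            return False
        # Rule 3: steric constraint |j - i| > 3
        if j - i <= min_gap:
            return False
        # Rule 2: each base occurs in at most one pair (matching)
        if i in used or j in used:
            return False
        used.update((i, j))
    # Rule 4: no crossing pairs. After sorting, i < k for every later pair (k, l).
    ordered = sorted(pairs)
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if k < j < l:  # i < k < j < l is a pseudoknot
                return False
    return True
```

For example, on `GGAAAACC` the nested pairs (1, 8), (2, 7) are accepted, while (1, 7), (2, 8) cross and are rejected.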


Let’s count the structures . . .

Counting secondary structures. Given a sequence of length n:
Π_kl = 1 if sequence positions k, l can form a pair (GC, CG, AU, UA, GU, UG) and Π_kl = 0 otherwise.
N_kl = number of structures of the subsequence from k to l.
Basic recursion: position k is either unpaired, or it pairs with some position j:

\[ N_{kl} = N_{k+1,l} + \sum_{j=k+m}^{l} \Pi_{kj}\, N_{k+1,j-1}\, N_{j+1,l} \]
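
The counting recursion translates directly into a memoized function. This is a sketch under stated assumptions: the slide does not give the value of m, so `min_dist=4` is chosen here to match the steric constraint |j − i| > 3, and the function name `count_structures` is hypothetical. Indices are 0-based.

```python
from functools import lru_cache

ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def count_structures(seq, min_dist=4):
    """Number of secondary structures compatible with `seq` (0-based indices)."""

    @lru_cache(maxsize=None)
    def N(k, l):
        # Intervals too short to hold a pair admit only the open chain.
        if l - k < min_dist:
            return 1
        total = N(k + 1, l)  # position k unpaired
        for j in range(k + min_dist, l + 1):
            if (seq[k], seq[j]) in ALLOWED:  # Pi_kj = 1
                total += N(k + 1, j - 1) * N(j + 1, l)
        return total

    return N(0, len(seq) - 1)
```

On `GGAAACC` the function counts the open chain, four single-pair structures, and one two-pair stack, six structures in total.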


RNA Folding in a nutshell

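[Figure: decomposition of the interval [i, j]: either i is unpaired, or i pairs with some k, which splits [i, j] into [i+1, k−1] and [k+1, j]]

\[ N_{ij} = N_{i+1,j} + \sum_{k:\,(i,k)\ \text{pair}} N_{i+1,k-1}\, N_{k+1,j} \]

\[ E_{ij} = \min\Big\{ E_{i+1,j},\ \min_{k:\,(i,k)\ \text{pair}} \big( E_{i+1,k-1} + E_{k+1,j} + \varepsilon_{ik} \big) \Big\} \]

\[ Z_{ij} = Z_{i+1,j} + \sum_{k:\,(i,k)\ \text{pair}} Z_{i+1,k-1}\, Z_{k+1,j}\, \exp(-\varepsilon_{ik}/RT) \]

Partition function: \( Z = \sum_{\Omega} \exp(-E(\Omega)/RT) \)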
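
The energy minimization and the partition function share the same decomposition, which a short sketch makes explicit. The assumptions are loud ones: a single constant `pair_energy` stands in for ε_ik, `RT = 0.6` is only a rough value in kcal/mol near 37 °C, `min_dist=4` encodes the steric constraint, and `fold_simple` is a hypothetical name. This is the simple pair-energy model of this slide, not the realistic loop-based energy model.

```python
import math
from functools import lru_cache

ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def fold_simple(seq, pair_energy=-1.0, RT=0.6, min_dist=4):
    """Return (minimum energy, partition function) for a constant pair energy."""

    def can_pair(i, k):
        return k - i >= min_dist and (seq[i], seq[k]) in ALLOWED

    @lru_cache(maxsize=None)
    def E(i, j):
        if j - i < min_dist:
            return 0.0
        best = E(i + 1, j)  # i unpaired
        for k in range(i + min_dist, j + 1):
            if can_pair(i, k):
                best = min(best, E(i + 1, k - 1) + E(k + 1, j) + pair_energy)
        return best

    @lru_cache(maxsize=None)
    def Z(i, j):
        if j - i < min_dist:
            return 1.0
        total = Z(i + 1, j)  # i unpaired
        weight = math.exp(-pair_energy / RT)  # Boltzmann factor of one pair
        for k in range(i + min_dist, j + 1):
            if can_pair(i, k):
                total += Z(i + 1, k - 1) * Z(k + 1, j) * weight
        return total

    n = len(seq)
    return E(0, n - 1), Z(0, n - 1)
```

Setting `pair_energy=0.0` makes every Boltzmann factor equal to 1, so Z reduces to the number of structures N.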


A word on the Partition Function

The partition function is the link between the combinatorics of the structures (in general: states in an ensemble) and the thermodynamic properties of the physical ensemble, e.g.:

Free energy: \( G = -RT \ln Z \)

Expected energy: \( \langle E \rangle = RT^2\, \partial \ln Z / \partial T \)

Heat capacity: \( C_p = -T\, \partial^2 G / \partial T^2 \)

[Figure: heat capacity C(T) in kcal·(mol·K)^-1 versus temperature T in °C (0 to 100) for the sequence CUGUAUUGUUGUAUAGCCCGUGUGGUAAUAUGG]


Realistic Energy Model

[Figure: loop decomposition of a secondary structure into hairpin loop, stacking pair, bulge, interior loop, and multi-loop, each delimited by a closing base pair and its interior base pairs]

Parameters from a large number of melting experiments by Douglas Turner, David Mathews, John SantaLucia, and others


Recursions for Linear RNAs

[Figure: diagrammatic decomposition of the arrays F, C, M, and M1 on the interval [i, j], with the hairpin, interior loop, and multiloop cases for C]


Recursions for Linear RNAs

F_ij: free energy of the optimal substructure on the subsequence x[i, j].

C_ij: free energy of the optimal substructure on the subsequence x[i, j] subject to the constraint that i and j form a base pair.

M_ij: free energy of the optimal substructure on the subsequence x[i, j] subject to the constraint that x[i, j] is part of a multiloop and has at least one component, i.e., a sub-sequence that is enclosed by a base pair.

M1_ij: free energy of the optimal substructure on the subsequence x[i, j] subject to the constraint that x[i, j] is part of a multiloop and has exactly one component, which has the closing pair (i, h) for some h satisfying i < h ≤ j.


Recursions for Linear RNAs

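\[ F_{ij} = \min\Big\{ F_{i+1,j},\ \min_{i<k\le j} \big( C_{ik} + F_{k+1,j} \big) \Big\} \]

\[ C_{ij} = \min\Big\{ H(i,j),\ \min_{i<k<l<j} \big( C_{kl} + I(i,j;k,l) \big),\ \min_{i<u<j} \big( M_{i+1,u} + M^{1}_{u+1,j-1} + a \big) \Big\} \]

\[ M_{ij} = \min\Big\{ \min_{i<u<j} \big( (u-i-1)\,c + C_{u+1,j} + b \big),\ \min_{i<u<j} \big( M_{i,u} + C_{u+1,j} + b \big),\ M_{i,j-1} + c \Big\} \]

\[ M^{1}_{ij} = \min\Big\{ M^{1}_{i,j-1} + c,\ C_{ij} + b \Big\} \]

Here H(i, j) is the energy of the hairpin loop closed by (i, j), I(i, j; k, l) is the energy of the interior loop between the pairs (i, j) and (k, l), and a, b, c are the multiloop parameters (closing the multiloop, each component, each unpaired base).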


Backward Recursion: Base Pairing Probabilities

\[ p_{ij} = \frac{Z_{1,i-1}\, Z_{i,j}\, Z_{j+1,n}}{Z_{1,n}} + \sum_{k<i,\ l>j} p_{kl}\, \Xi_{ij,kl} . \]

Ξ_{ij,kl} is a ratio of two partition functions:

Z_{ij,kl} ... both (i, j) and (k, l) pair

Z_{kl} ... (k, l) pair.

Simplest case:

\[ Z_{ij,kl} = Z_{k+1,i-1}\, Z_{ij}\, Z_{j+1,l-1}\, \zeta_{kl} \]

where \( \zeta_{kl} = \exp(-\beta_{kl}/RT) \) is the Boltzmann factor of the pairing energy.


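Backward recursion: full model

\[ \begin{aligned}
P_{kl} = P_{kl} + \sum_{p<k;\ q>l} P_{pq}\, \frac{Z^{B}_{k,l}}{Z^{B}_{p,q}} \Big\{ & e^{-I(p,q,k,l)} \\
& + \Big( \sum_{p<u<k} Z^{M}_{p+1,u}\, Z^{M1}_{u+1,k-1} \Big)\, e^{-(a+(q-l-1)c)} \\
& + \Big( \sum_{l<u<q} Z^{M}_{l+1,u}\, Z^{M1}_{u+1,q-1} \Big)\, e^{-(a+(k-p-1)c)} \\
& + Z^{M}_{p+1,k-1}\, Z^{M}_{l+1,q-1} \Big\}
\end{aligned} \]

[Figure: base pair (k, l) enclosed by base pair (p, q)]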


Single-Stranded Circular RNAs

Viroid RNA

Hepatitis Delta Virus Genome

Cryptic by-products of splicing formed from intronic sequences

Circularized C/D box snoRNAs were recently reported in Pyrococcus furiosus

Synthetic constructs for in vitro selection


Circular, Linear, and Interacting RNAs

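In the maximum matching case ⟹ same algorithm for all three cases

[Figure: three panels, CIRCULAR FOLDING, LINEAR FOLDING, and BINARY COFOLDING, each showing a base pair (i, j) relative to the sequence ends 1 and n (and 1′, n′ for the second strand)]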


Linear versus Circular Folding

Linear folding: energy contributions inside a pair (i, j) only.
Circular folding: additional contribution for the loop spanning [n, 1].

[Figure: linear case, where the external loop makes no energy contribution, versus circular case, where there is no external loop and the loop through positions n and 1 makes an extra contribution]


Strategy 1 (e.g. Michael Zuker’s mfold)
For each pair (i, j): compute energy both inside and outside the pair
⇒ doubles memory requirements

Strategy 2 (Vienna RNA Package)
First compute linear folding energies. Then compute energies for the loop spanning [n, 1].

[Figure: the loop spanning [n, 1] as a hairpin loop, an interior loop or bulge, or a multi-branch loop]


Implementing Circular Folding

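Relative to linear folding, only the loop containing the cut has to be re-evaluated.
Three cases: cut in hairpin, interior, or multi-loop

\[ F = \min\{ F^{H}, F^{I}, F^{M} \} \]

[Figure: diagrammatic decomposition of the circular folding energy into the hairpin, interior loop, and multi-loop cases, the latter built from M and M2, where M2 on [k, n] splits into two M1 components at u, u+1]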


Exterior Hairpin.

\[ F^{H} = \min_{p<q} \big( C_{pq} + H(q,p) \big) \]

Exterior Interior Loop.

\[ F^{I} = \min_{k<l<p<q} \big( C_{pq} + C_{kl} + I(q,p,l,k) \big) \]

Exterior Multi-Loop.

Modified decomposition: one or more components M_{1,k} plus exactly two components M2_{k+1,n}

\[ M^{2}_{kn} = \min_{k<u<n} \big( M^{1}_{ku} + M^{1}_{u+1,n} \big) \]

\[ F^{M} = \min_{1<k<n} \big( M_{1,k} + M^{2}_{k+1,n} + a \big) \]

Folding energy: \( F = \min\{ F^{H}, F^{I}, F^{M} \} \)


Applications of Circular Folding

It does make a difference

[Figure: structure distance (0 to 50) and energy difference (−6 to 4) between the circular and the linear fold as a function of the starting point of the linearized sequence (positions 1 to 284), together with the circular secondary structure of Citrus viroid IV]

RNAfold -circ in the Vienna RNA Package


Local structures

Idea: restrict the recursion to base pairs (i, j) with j − i < L.
Special interest in robust structures:

Z^{u,L}_{ij} ... partition function of sub-sequence [i, j] when sequence window [u, u + L] is folded

p^{u,L}_{ij} ... probability that i and j form a base pair when window [u, u + L] is folded.

\[ Z^{u,L}_{ij} = \begin{cases} Z_{ij} & \text{if } [i,j] \subseteq [u,u+L] \\ 0 & \text{otherwise} \end{cases} \]

\[ \begin{aligned}
p^{u,L}_{ij} &= \frac{Z^{u,L}_{1,i-1}\, Z^{u,L}_{i,j}\, Z^{u,L}_{j+1,n}}{Z^{u,L}_{u,u+L}} + \sum_{k<i,\ l>j} p^{u,L}_{kl}\, \Xi^{u,L}_{ij,kl} \\
&= \frac{Z_{u,i-1}\, Z_{i,j}\, Z_{j+1,u+L}}{Z_{u,u+L}} + \sum_{k<i,\ l>j} p^{u,L}_{kl}\, \Xi_{ij,kl} .
\end{aligned} \]


Robust local structures

Average probability of an (i, j) pair over all folding windows containing the sequence interval [i, j]:

\[ \pi^{L}_{ij} = \frac{1}{L-(j-i)+1} \sum_{u=j-L}^{i} p^{u,L}_{ij} . \]

Direct recursion:

\[ \begin{aligned}
\pi^{L}_{ij} &= \underbrace{\frac{1}{L-(j-i)+1} \sum_{u=j-L}^{i} \frac{Z^{u,L}_{1,i-1}\, \widehat{Z}^{u,L}_{i,j}\, Z^{u,L}_{j+1,n}}{Z^{u,L}_{1,n}}}_{\pi^{*L}_{ij}} + \frac{1}{L-(j-i)+1} \sum_{u=j-L}^{i} \sum_{k<i} \sum_{l>j} p^{u,L}_{kl}\, \Xi_{ij,kl} \\
&= \pi^{*L}_{ij} + \sum_{k=j-L}^{i-1} \sum_{l=j+1}^{i+L} \sum_{u=l-L}^{k} \frac{p^{u,L}_{kl}\, \Xi_{ij,kl}}{L-(j-i)+1} \\
&= \pi^{*L}_{ij} + \sum_{k=j-L}^{i-1} \sum_{l=j+1}^{i+L} \frac{L-(l-k)+1}{L-(j-i)+1}\, \pi^{L}_{kl}\, \Xi_{ij,kl} .
\end{aligned} \tag{1} \]

[Figure: dot plot of local base pairing probabilities along human chromosome X (minus strand), positions 133029828 to 133029088, covering mir-92-2, mir-19b-2, mir-20b, mir-18b, and mir-106a]

Local structures (L=100) in a 740 nt region of human X chromosome


Cofold: How to deal with Concentration?

Algorithmically the same as linear folding, with a special energy contribution for the “loop with the cut”

Additional energy contribution for forming the duplex

At least 5 molecular species need to be taken into account (Dimitrov & Zuker, 2005): A, B, A2, B2, AB.

Their folding energies and partition functions are easily computed


Cofold

[Figure: dot plot and mfe structure drawing for the concatenated sequences AUGAAGAUGA and CUGUCUGUCUUGAGACA]

Dot plot (left) and mfe structure representation (right) of the cofolding structure of the two RNA molecules AUGAAGAUGA (red) and CUGUCUGUCUUGAGACA.


Cofold: Concentration dependencies

\[ Q = V^{n}\, a!\, b! \times \frac{(Z'_A)^{n_A}\, (Z'_{AA})^{n_{AA}}\, (Z'_{AB})^{n_{AB}}\, (Z'_{BB})^{n_{BB}}\, (Z'_B)^{n_B}}{n_A!\; n_B!\; 2\,n_{AA}!\; 2\,n_{BB}!\; n_{AB}!} \]

where a = n_A + 2n_AA + n_AB. The system minimizes the free energy −kT ln Q.
Solving this optimization problem yields the equilibria:
[AA] = K_AA [A]^2, [BB] = K_BB [B]^2, [AB] = K_AB [A][B],
with [A] = 6.023 × 10^23 n_A, etc., and

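\[ K_{AA} = \frac{Z'_{AA}}{(Z_A)^2} = \frac{\big( Z_{AA} - (Z_A)^2 \big)\, e^{-\Theta_I/RT}/2}{(Z_A)^2} = \frac{1}{2}\, e^{-\Theta_I/RT} \left( \frac{Z_{AA}}{(Z_A)^2} - 1 \right) \]

\[ K_{BB} = \frac{1}{2}\, e^{-\Theta_I/RT} \left( \frac{Z_{BB}}{(Z_B)^2} - 1 \right) \]

\[ K_{AB} = e^{-\Theta_I/RT} \left( \frac{Z_{AB}}{Z_A Z_B} - 1 \right) \]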
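
Given the three equilibrium constants, the species concentrations follow from the equilibria together with mass conservation. The sketch below is an illustration under assumptions: it uses a = [A] + 2[AA] + [AB] as on the slide and the analogous balance b = [B] + 2[BB] + [AB] for the second strand, it requires all quantities in consistent units, and the function name `dimer_equilibrium` is hypothetical. It solves for [A] by bisection, using the fact that the total amount of A bound or free grows monotonically with [A].

```python
import math

def dimer_equilibrium(a, b, K_AA, K_BB, K_AB):
    """Solve [AA]=K_AA[A]^2, [BB]=K_BB[B]^2, [AB]=K_AB[A][B] with mass balance."""

    def free_B(A):
        # Positive root of 2*K_BB*B^2 + (1 + K_AB*A)*B - b = 0
        lin = 1.0 + K_AB * A
        if K_BB == 0.0:
            return b / lin
        return (-lin + math.sqrt(lin * lin + 8.0 * K_BB * b)) / (4.0 * K_BB)

    def excess(A):
        # Total A implied by free concentration A, minus the available total a
        return A + 2.0 * K_AA * A * A + K_AB * A * free_B(A) - a

    lo, hi = 0.0, a
    for _ in range(200):  # bisection: excess(0) <= 0 <= excess(a)
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    A = 0.5 * (lo + hi)
    B = free_B(A)
    return {"A": A, "B": B, "AA": K_AA * A * A, "BB": K_BB * B * B, "AB": K_AB * A * B}
```

With all constants set to zero the monomers stay free; with only K_AB = 1 and a = b = 1, the free concentration satisfies A² + A − 1 = 0.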


[Figure: concentration in nmol (0 to 20) of the species A.si, A.A, A, si and A′.si′, A′.A′, A′, si′ versus total siRNA concentration b in nmol (1 to 10)]

Example for the concentration dependency for two mRNA-siRNA binding experiments.


RNAup: Small RNAs Binding to Large Ones

RNA folding excludes pseudoknots, i.e., non-outerplanar graphs

cofold thus does not allow small RNAs to bind into loop regions of large ones

... but this happens in reality

Remedy: compute the energy/partition function, and hence the probability

\[ P_u[i,j] = \underbrace{\frac{Z[1,i-1] \times 1 \times Z[j+1,N]}{Z}}_{\text{exterior}} + \underbrace{\sum_{p,q:\ p<i\le j<q} P_{pq} \times \frac{Z_{pq}[i,j]}{Z^{b}[p,q]}}_{\text{enclosed}} \]

that subsequence [i, j] is unpaired, and the energy of binding a short molecule in this location


RNAup

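[Figure: the cases (a), (b′), (b″), (c), (d), (e) for an unpaired interval [i, j] inside the loop closed by the pair (p, q)]

\[ \begin{aligned}
Z_{pq}[i,j] ={} & \underbrace{\exp(-\beta H(p,q))}_{(a)} + \underbrace{\sum_{\substack{p<i\le j<k \ \text{or} \\ l<i\le j<q}} Z^{b}[k,l]\, e^{-\beta I(p,q;k,l)}}_{(b)} \\
& + \underbrace{\sum_{p<i\le j<q} Z^{m2}[p+1,i-1]\, e^{-\beta c (q-i)}}_{(c)} \\
& + \underbrace{\sum_{p<i\le j<q} Z^{m}[p+1,i-1]\, Z^{m}[j+1,q-1]\, e^{-\beta c (j-i+1)}}_{(d)} \\
& + \underbrace{\sum_{p<i\le j<q} Z^{m2}[j+1,q-1]\, e^{-\beta c (j-p)}}_{(e)}
\end{aligned} \]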


RNAup: Interaction part

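\[ Z^{I}[i,j,i^*,j^*] = \sum_{\substack{i<k<j \\ i^*>k^*>j^*}} Z^{I}[i,k,i^*,k^*]\, e^{-\beta I(k,k^*;j,j^*)} \]

\[ Z^{*}[i,j] = P_u[i,j] \sum_{i^*>j^*} Z^{I}[i,j,i^*,j^*] ; \]

\[ P^{*}[i,j] = Z^{*}[i,j] \Big/ \sum_{k<l} Z^{*}[k,l] \]

[Figure: interaction between the interval [i, j] of the long RNA (1 (5′) to n (3′)) and the interval [j*, i*] of the short RNA (1 (5′) to m (3′)), with an intermediate pair (k, k*)]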


RNAup: Application

[Figure: four panels (VR1 straight, VR1 HP5_16, VR1 HP5_11, VR1 HP5_6) showing probabilities (0 to 1) and ∆G_i in kcal/mol (−25 to 0) versus sequence position, with measured expression (0 to 0.4) below]

Binding of siRNAs to VR mRNA.
P_u[i, i] (dashed line), P*_i (thick black line), ∆G_i (thick red line).
Below: activity of siRNA


Alternative Approach

Consider RNA Folding as a Machine Learning Problem
Context Free Grammar + probabilities for production rules
⇒ Stochastic Context Free Grammars
see work by Sean Eddy, Jotun Hein, and collaborators


Folding Kinetics

RNA molecules may have kinetic traps which prevent them from reaching equilibrium within the lifetime of the molecule. Long molecules are often trapped in such meta-stable states during transcription.
Possible solutions are

Stochastic folding simulations can predict folding pathways and final structures. Computationally expensive, few programs available.

Predicting structures for growing fragments of the sequence can show whether large scale re-folding will occur during transcription. Cheap but inaccurate.

Analysis of the energy landscape based on complete suboptimal folding can identify possible traps (local minima).


Kinetic Folding Algorithm

Simulate folding kinetics by a Monte-Carlo type algorithm:
Generate all neighbors using the move-set
Assign rates to each move, e.g.

\[ P_i = \min\left\{ 1,\ \exp\left( -\frac{\Delta E}{kT} \right) \right\} \]

Select a move with probability proportional to its rate
Advance clock by \( 1/\sum_i P_i \).

[Figure: a structure with its outgoing moves labeled by their rates P1 to P8]
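
One iteration of this scheme can be sketched in a few lines. The names are assumptions for illustration: `neighbors` maps each state to the list of states reachable by one move of the move set, `energy` maps states to energies, and `kinetic_step` is a hypothetical function; real secondary structures would replace the abstract states.

```python
import math
import random

def metropolis_rate(delta_e, kT):
    """Rate min{1, exp(-dE/kT)} for a move with energy change delta_e."""
    if delta_e <= 0.0:
        return 1.0  # downhill or neutral moves get rate 1
    return math.exp(-delta_e / kT)

def kinetic_step(state, neighbors, energy, kT, rng):
    """Pick one move with probability proportional to its rate; return (new_state, dt)."""
    moves = neighbors[state]
    rates = [metropolis_rate(energy[m] - energy[state], kT) for m in moves]
    total = sum(rates)
    new_state = rng.choices(moves, weights=rates, k=1)[0]
    return new_state, 1.0 / total  # clock advances by 1 / sum of rates
```

Repeating `kinetic_step` from an open chain and accumulating the returned time increments yields one stochastic folding trajectory; averages over many trajectories approximate the kinetics.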


Characterization of Landscapes

A landscape consists of a configuration space V, a move set within that configuration space, and an energy function f : V → R.
Simplest move set for secondary structures: opening and closing of base pairs.
Speed of optimization depends on the roughness of the landscape.
Measures of roughness suggested in the literature:

Number of local optima

Correlation lengths (e.g. along a random walk)

Lengths of adaptive walks

Folding temperature vs. glass temperature, T_f / T_g

Energy barriers between the local optima, especially the maximum barrier height (“depth” in the simulated annealing literature)


Energy barriers

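\[ E[s,w] = \min\Big\{ \max\big[ f(z) \,\big|\, z \in p \big] \,\Big|\, p : \text{path from } s \text{ to } w \Big\} , \]

\[ B(s) = \min\Big\{ E[s,w] - f(s) \,\Big|\, w : f(w) < f(s) \Big\} \]

Depth and Difficulty (borrowed from simulated annealing theory)

\[ D = \max\Big\{ B(s) \,\Big|\, s \text{ is not a global minimum} \Big\} \]

\[ \psi = \max\Big\{ \frac{B(s)}{f(s) - f(\min)} \,\Big|\, s \text{ is not a global minimum} \Big\} \]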
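
On a small, explicitly enumerated landscape these quantities can be computed directly. The sketch below is not the method used by any particular program; it uses a Dijkstra-style search in which the cost of a path is its highest point, which yields E[s, w] for all w at once. It assumes a connected landscape given as a dictionary `f` of energies and a dictionary `neighbors` of adjacency lists; all function names are hypothetical.

```python
import heapq

def saddle_energy(start, neighbors, f):
    """E[start, w] for all w: the lowest achievable maximum of f along a path."""
    best = {start: f[start]}
    heap = [(f[start], start)]
    while heap:
        height, x = heapq.heappop(heap)
        if height > best[x]:
            continue  # stale heap entry
        for y in neighbors[x]:
            h = max(height, f[y])
            if h < best.get(y, float("inf")):
                best[y] = h
                heapq.heappush(heap, (h, y))
    return best

def barrier(s, neighbors, f):
    """B(s): lowest barrier from s to any state of lower energy (None if none)."""
    E = saddle_energy(s, neighbors, f)
    lower = [E[w] - f[s] for w in E if f[w] < f[s]]
    return min(lower) if lower else None

def depth_and_difficulty(neighbors, f):
    """Depth D and difficulty psi of a connected landscape."""
    fmin = min(f.values())
    D = psi = 0.0
    for s in f:
        if f[s] == fmin:
            continue  # global minima are excluded
        B = barrier(s, neighbors, f)
        D = max(D, B)
        psi = max(psi, B / (f[s] - fmin))
    return D, psi
```

For a chain of five states with energies 0.0, 2.0, 1.0, 3.0, 0.5, the local minimum at energy 0.5 must climb to 3.0 to reach the global minimum, so D = 2.5 and ψ = 2.5 / 0.5 = 5.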


Energy Barriers and Barrier Trees

Some topological definitions:
A structure is a

local minimum if its energy is lower than the energy of all neighbors

local maximum if its energy is higher than the energy of all neighbors

saddle point if there are at least two local minima that can be reached by a downhill walk starting at this point


Calculating barrier trees

[Figure: flooding of a landscape with local minima M1, M2, M3 and saddle points S12, S23, shown step by step together with the growing barrier tree]

The flooding algorithm:
Read conformations in energy-sorted order.
For each conformation x we have three cases:

(a) x is a local minimum if it has no neighbors we have already seen

(b) x belongs to basin B(s) if all known neighbors belong to B(s)

(c) if x has neighbors in several basins B(s1), ..., B(sk), then it is a saddle point that merges these basins. Basins B(s1), ..., B(sk) are then united and assigned to the deepest local minimum.
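
The three cases map onto a short union-find loop. This is a sketch on an abstract landscape, assuming distinct energies and an explicit neighbor list; it is not the code of the barriers program, and the name `flood` and the tuple layout of the merge records are made up for illustration.

```python
def flood(neighbors, f):
    """Flooding algorithm: returns (local minima, merge events).

    Each merge event is (absorbed_minimum, surviving_minimum, saddle, barrier_height).
    """
    basin = {}   # state -> local minimum it was assigned to when first seen
    parent = {}  # union-find forest over local minima

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    minima, merges = [], []
    for x in sorted(f, key=f.get):  # energy-sorted order
        seen = {find(basin[y]) for y in neighbors[x] if y in basin}
        if not seen:
            # case (a): no neighbor seen yet, x is a new local minimum
            parent[x] = x
            basin[x] = x
            minima.append(x)
            continue
        deepest = min(seen, key=f.get)
        # case (c): x touches several basins and is a saddle merging them
        for m in seen - {deepest}:
            parent[m] = deepest
            merges.append((m, deepest, x, f[x] - f[m]))
        # cases (b) and (c): x joins the (possibly united) basin
        basin[x] = deepest
    return minima, merges
```

The merge events are exactly the internal nodes of the barrier tree: each records which minimum is absorbed, at which saddle, and at what barrier height.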


Information from the Barrier Trees

Local minima
Saddle points
Barrier heights
Gradient basins
Partition functions and free energies of (gradient) basins
Depth and Difficulty of the landscape

N.B.: A gradient basin is the set of all initial points from which a

gradient walk (steepest descent) ends in the same local minimum.


Energy Landscape of a Toy Sequence

[Figure: eleven structures (1 to 11) of a toy sequence along a folding path, their energy profile (energy in kcal/mol, −8 to 2, versus steps in arbitrary units), and the barrier tree of the landscape with its barrier heights]


Folding Kinetics

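Transition rates from x to y:

\[ r_{yx} = r_0\, e^{-\frac{E^{\neq}_{yx} - E(x)}{RT}} \quad \text{for } x \neq y, \qquad r_{xx} = -\sum_{y \neq x} r_{yx} \]

Kinetics as a Markov process:

\[ \frac{dp_x}{dt} = \sum_{y \in X} r_{xy}\, p_y(t) . \]

Transition states:

\[ E^{\neq}_{yx} = \max\{ E(x), E(y) \} \]

or more complex models (Tacker et al. 1994, Schmitz et al. 1996)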
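
For a handful of states the rate matrix and the master equation can be written out explicitly. The sketch below assumes the simple transition-state model max{E(x), E(y)}, states numbered 0 to n−1, and a plain explicit Euler integrator chosen only for brevity (a matrix exponential would be the more usual tool); `rate_matrix` and `integrate` are hypothetical names.

```python
import math

def rate_matrix(energy, neighbors, RT=1.0, r0=1.0):
    """R[y][x] = rate from x to y; diagonal entries make each column sum to zero."""
    n = len(energy)
    R = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in neighbors[x]:
            e_ts = max(energy[x], energy[y])  # transition state energy
            R[y][x] = r0 * math.exp(-(e_ts - energy[x]) / RT)
        R[x][x] = -sum(R[y][x] for y in neighbors[x])
    return R

def integrate(R, p0, dt, steps):
    """Explicit Euler integration of dp_x/dt = sum_y R[x][y] * p_y."""
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        dp = [sum(R[x][y] * p[y] for y in range(n)) for x in range(n)]
        p = [p[x] + dt * dp[x] for x in range(n)]
    return p
```

Because downhill moves have rate r0 and uphill moves are damped by the Boltzmann factor, the stationary distribution of this process is the Boltzmann distribution over the states.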


Reduced Description of the Folding Dynamics

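Macrostates = classes of a partition of the state space.
Partition function for a macrostate:

\[ Z_\alpha = \sum_{x \in \alpha} \exp(-E(x)/RT) \]

Free energy of a macrostate:

\[ G(\alpha) = -RT \ln Z_\alpha \]

Transition rates between macrostates:

\[ r_{\beta\alpha} = \sum_{y \in \beta} \sum_{x \in \alpha} r_{yx}\, \mathrm{Prob}[x \mid \alpha] = \frac{1}{Z_\alpha} \sum_{y \in \beta} \sum_{x \in \alpha} r_{yx}\, e^{-E(x)/RT} \quad \text{for } \alpha \neq \beta \]

The rates r_βα can be computed on the fly while executing the barriers program.
Transition state free energy:

\[ G^{\neq}_{\beta\alpha} = -RT \ln \sum_{y \in \beta} \sum_{x \in \alpha} e^{-E^{\neq}_{yx}/RT} \]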
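
The macrostate partition functions and free energies are simple sums over the member states. A minimal sketch, assuming the states and their assignment to macrostates are given as dictionaries (`energy`, `macrostate_of`) and using the hypothetical name `macrostate_free_energies`:

```python
import math

def macrostate_free_energies(energy, macrostate_of, RT=0.6):
    """Return {macrostate: G} with G = -RT ln Z and Z summed over member states."""
    Z = {}
    for x, e in energy.items():
        alpha = macrostate_of[x]
        Z[alpha] = Z.get(alpha, 0.0) + math.exp(-e / RT)
    return {alpha: -RT * math.log(z) for alpha, z in Z.items()}
```

A macrostate made of two states with energy 0 has Z = 2 and therefore G = −RT ln 2, lower than the energy of either member: the free energy accounts for the multiplicity of states.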


[Figure: barrier tree of the model sequence “lilly” (a simple model sequence) with 14 local minima and their barrier heights, and three panels of population probability (0 to 1) versus time (10^-1 to 10^3) for the mfe structure and the local minima 2 to 6]


[Figure: two panels of population probability (0 to 1) versus time on a logarithmic axis (10^3 to 10^9 and 10^1 to 10^8) for the mfe structure and selected local minima]

Refolding of a tRNA molecule.


Summary I:

RNA structures can be computed efficiently by means of dynamic programming

Computations are based on a set of carefully measured energy parameters and an additive energy model

Algorithms exist for ground state energy and structure, full partition functions, density of states, interacting structures, . . .

The folding kinetics of a given RNA sequence can also be investigated at the level of secondary structures

VIENNA RNA PACKAGE


PART II: How Do RNAs Evolve?

Basic Assumption

Selection acts on secondary structures, mutation acts on the underlying sequences
⇒ We need to understand the sequence-to-structure map of RNAs
(hang on, we’ll discuss the empirical evidence for that a bit later)


Sewall Wright’s Fitness Landscapes

[Figure: sketch of a fitness landscape, fitness plotted over phenotype]

What do realistic fitness landscapes look like?


Biological Landscapes

The RNA case is a special case of a very general paradigm:

genotype ↦ phenotype ↦ fitness

What is the relationship between genotype and phenotype?

Central topic in any theory of evolution because:
* Selection acts on the phenotype
* Mutation/recombination acts on the genotype
Biopolymers as the simplest model:
The molecule is both genotype (sequence) and phenotype (structure).

The map from genotype to phenotype is determined by physical chemistry:

⇐⇒ folding problem


Computational Analysis of the RNA Map

There are many more sequences than structures.
(.)-string: 3 letters (with constraints)

=⇒ fewer than 3^n structures

but 4^n sequences.

=⇒ Redundancy

How are sequences folding into the same structure distributed in sequence space?
Neutral set: S(ψ) = { x ∈ Q^n_α | f(x) = ψ }


Sensitivity and Neutrality

[Figure: two secondary structure drawings of a tRNA sequence before and after one substitution. Caption: Effect of a single point mutation]

[Figure: frequency (10^-4 to 10^0, logarithmic) versus structure distance (0 to 300). Caption: Distribution of structure distances]

Page 53: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

The Random Graph Model

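Approach: Model $S(\psi)$ as a random induced subgraph $\Gamma$ with a given value

$$\lambda = \frac{\langle \#\text{neutral neighbors} \rangle}{(\alpha - 1)\,n}$$

Threshold value:

$$\lambda^* = 1 - \left(\frac{1}{\alpha}\right)^{\frac{1}{\alpha - 1}}$$

Theorem [Reidys, Stadler, Schuster]. If $\lambda > \lambda^*$ then $\Gamma$ is a.s. dense and connected; if $\lambda < \lambda^*$ then $\Gamma$ is a.s. neither dense nor connected.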
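To get a feeling for the numbers, the threshold can be evaluated for a few alphabet sizes. This is a small sketch of the two formulas on this slide; the function names are illustrative assumptions, and the neutral-neighbor counts passed to `mean_neutrality` would have to come from folding computations that are not shown here.

```python
def neutrality_threshold(alpha: int) -> float:
    """Critical neutrality lambda* = 1 - (1/alpha)**(1/(alpha - 1))."""
    if alpha < 2:
        raise ValueError("alphabet size must be at least 2")
    return 1.0 - (1.0 / alpha) ** (1.0 / (alpha - 1))


def mean_neutrality(neutral_neighbors: list[int], alpha: int, n: int) -> float:
    """lambda = <#neutral neighbors> / ((alpha - 1) * n).

    `neutral_neighbors` holds, for each sampled sequence on the neutral set,
    the number of one-error mutants that fold into the same structure.
    """
    average = sum(neutral_neighbors) / len(neutral_neighbors)
    return average / ((alpha - 1) * n)


if __name__ == "__main__":
    for alpha in (2, 4, 6):
        print(alpha, round(neutrality_threshold(alpha), 4))
```

For the four-letter alphabet the threshold is about 0.37, so in the random graph model a neutral set is connected once somewhat more than a third of all one-error mutants are neutral.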

Page 54: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

A complication: Base Pairing Rules

Unpaired bases: alphabet $\mathcal{A} = \{A, U, G, C\}$

Paired bases: 5’ and 3’ side correlated: alphabet $\mathcal{B} = \{AU, UA, GC, CG, GU, UG\}$

Thus consider only the set of compatible sequences $C(\psi)$:

$$S(\psi) \subseteq C(\psi) \equiv Q^{n_u}_4 \times Q^{n_p}_6$$

=⇒ Two neutrality parameters $\lambda_u$ and $\lambda_p$
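Compatibility is easy to test mechanically. The sketch below parses a dot-bracket string into its base pairs, checks a sequence against the six allowed pairs, and computes the size $4^{n_u} \cdot 6^{n_p}$ of the compatible set. All function names are illustrative assumptions; pseudoknots are not handled.

```python
ALLOWED_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_table(structure: str) -> list[tuple[int, int]]:
    """Return the base pairs (i, j), i < j, of a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def is_compatible(sequence: str, structure: str) -> bool:
    """True if every pair of the structure is one of the six allowed pairs."""
    if len(sequence) != len(structure):
        return False
    return all(sequence[i] + sequence[j] in ALLOWED_PAIRS
               for i, j in pair_table(structure))


def compatible_count(structure: str) -> int:
    """Size of C(psi): 4 choices per unpaired base, 6 per base pair."""
    n_p = len(pair_table(structure))
    n_u = len(structure) - 2 * n_p
    return 4 ** n_u * 6 ** n_p
```

Note that compatibility only says the sequence can form the structure; whether it actually folds into it is decided by the energy model.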

Page 55: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Connected Components of Neutral Networks

[Figure: connected components of neutral networks in the ($\lambda_u$, $\lambda_p$) plane; both axes run from 0.0 to 1.0.]

Legend:
gray: many small components
red: 1 connected component
green: 2 equal-sized components
yellow: 3 components, size 2:1
blue: 4 equal-sized components

Explanation for this deviation from the random graph model in terms of the energy model: some structures can be made only with a significant bias in the G/C ratio.

Page 56: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

[Figure: distribution $P(d)$ of the length $d$ of neutral paths (horizontal axis 0 to 100, vertical axis 0.00 to 0.25), and covering radius as a function of chain length $n$ (0 to 120), with curves from enumeration, from inverse folding, and a lower bound. Panel titles: length of neutral paths; distance to target structure; covering radius.]

Page 57: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Closest Approach

Intersection Theorem. For any two secondary structures $\phi$ and $\psi$,

$$C(\phi) \cap C(\psi) \neq \emptyset$$

What is the distance between neutral networks?

$$\delta(\phi,\psi) = \min\{\, d(x,y) \mid f(x) = \phi \text{ and } f(y) = \psi \,\}$$

Random graph theory: if $\lambda > \lambda^*$ then $\delta(\phi,\psi) \approx 2$.

Computer simulations: upper bounds on $\delta(\phi,\psi)$:

n      GC     AU     AUGC
50     5.6    2.6    2.1
70     9.3    4.6    3.4
100    13.0   7.8    5.6

Page 58: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Accessibility (Fontana & Schuster 1998)

Idea: The “interface” between two structures is large if they are “similar”.

More precisely: structure ψ is accessible from φ if a sequence x ∈ S(φ) is likely to have a neighbor (mutant) x′ ∈ S(ψ).

Structural characterization of “easy” (continuous) transitions:

Shortening of stacks

Elongation of stacks

Opening of constrained stacks

Closing of constrained stacks

Page 59: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

SUMMARY: Sequence-Structure Map of RNA

1. Redundancy: Many more sequences than structures

2. Sensitivity: Small changes in the sequences may lead to large changes in the structure

3. Neutrality: A substantial fraction of mutations does not alter the structure.

4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).

Implications:

1. Neutral Networks: S(ψ) forms a connected “percolating” network in sequence space for all “common” structures.

2. Shape Space Covering: Almost all structures can be found in a relatively small neighborhood of almost every sequence.

3. Mutual Accessibility: The neutral networks of any two structures almost touch each other somewhere in sequence space.

Page 60: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Simulated Trajectories

[Figure: simulated trajectories; horizontal axis: time (arbitrary units), vertical axes: structure distance and projection coordinate.]

Punctuated equilibria = diffusion on neutral networks + constant rate of innovation + exponential selection of rare mutants

Proc.Natl.Acad.Sci. 93: 397-401 (1996)

Page 61: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Diffusion Constant

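. . . can be deduced from the Moran model:

$$D = \lambda\,\frac{6Anp}{3 + 4Np(1 + 1/N)}, \qquad \frac{6Anp}{3 + 4Np(1 + 1/N)} \sim \begin{cases} \frac{3}{2}\,A\,\frac{n}{N} & p \gg 0 \text{ or } N \gg 1 \\ 2Anp & p \ll 1 \end{cases}$$

A . . . replication rate
n . . . sequence length
N . . . population size
p . . . mutation rate
λ . . . neutrality of network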
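The two limiting regimes can be verified numerically. The sketch below evaluates the diffusion constant and compares it with the two asymptotic expressions; the function name and the parameter values are illustrative assumptions, not values from the simulations.

```python
def diffusion_constant(lam: float, A: float, n: int, N: int, p: float) -> float:
    """D = lam * 6*A*n*p / (3 + 4*N*p*(1 + 1/N))."""
    return lam * 6.0 * A * n * p / (3.0 + 4.0 * N * p * (1.0 + 1.0 / N))


if __name__ == "__main__":
    A, n = 1.0, 100

    # Large N*p: the fraction approaches (3/2) * A * n / N.
    N, p = 10 ** 6, 0.5
    print(diffusion_constant(1.0, A, n, N, p), 1.5 * A * n / N)

    # Small p: the fraction approaches 2 * A * n * p.
    N, p = 100, 1e-9
    print(diffusion_constant(1.0, A, n, N, p), 2.0 * A * n * p)
```

In the first regime the diffusion constant no longer depends on the mutation rate, in the second it no longer depends on the population size.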

Page 62: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Dynamics of Interacting Replicators

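$$I_k + I_j \longrightarrow I_l + I_k + I_j$$

With mutation:

$$\dot{x}_k = x_k \left( \sum_j A_{kj} x_j - \sum_{i,j} A_{ij} x_i x_j \right) + \sum_{l,j} \left( Q_{kl} A_{lj} x_j x_l - Q_{lk} A_{kj} x_k x_j \right)$$

where

$$Q_{kl} = (1 - p)^{n - d(k,l)} \left( \frac{p}{\alpha - 1} \right)^{d(k,l)}$$

How does this behave in sequence space?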
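The mutation matrix can be built explicitly for short sequences. This sketch enumerates all sequences of length $n$, takes $d(k,l)$ to be the Hamming distance, and checks that each row of $Q$ sums to one, as it must for a matrix of mutation probabilities. The helper names are illustrative assumptions, and full enumeration is only feasible for very small $n$.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutation_matrix(n: int, p: float, alphabet: str = "AUGC"):
    """Q[k][l] = (1 - p)**(n - d) * (p / (alpha - 1))**d with d = d(k, l)."""
    alpha = len(alphabet)
    sequences = ["".join(letters) for letters in product(alphabet, repeat=n)]
    matrix = [
        [(1.0 - p) ** (n - hamming(k, l)) * (p / (alpha - 1)) ** hamming(k, l)
         for l in sequences]
        for k in sequences
    ]
    return sequences, matrix


if __name__ == "__main__":
    sequences, Q = mutation_matrix(n=2, p=0.01)
    print(len(sequences), min(sum(row) for row in Q), max(sum(row) for row in Q))
```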

Page 63: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Simplest case: $A_{kl} = A_0\,(1 - d(I_k, I_l)/n)$:

[Figure: diversity (0.0 to 0.8) as a function of time (0 to 5000), and $g(\tau)$ (0 to 60) as a function of the time lag $\tau$ (0 to $8 \times 10^5$).]

$$g(\tau) = \frac{1}{T_2 - T_1 + 1} \sum_{t = T_1}^{T_2} \left\| p(t + \tau) - p(t) \right\|^2$$

B.M.R. Stadler, Adv. Complex Syst. (2003)
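The quantity $g(\tau)$ is a time-averaged displacement of the population. A minimal sketch of its evaluation is given below; it assumes that $p(t)$ is stored as a list of coordinate tuples (for example the population centroid at each time step) and that the norm is squared. Both are assumptions about details the slide does not spell out, and the function name is illustrative.

```python
def mean_squared_displacement(trajectory, tau: int, t1: int, t2: int) -> float:
    """g(tau) = 1/(T2 - T1 + 1) * sum_{t=T1}^{T2} ||p(t + tau) - p(t)||^2."""
    if tau < 0 or t1 < 0 or t2 < t1 or t2 + tau >= len(trajectory):
        raise ValueError("time window does not fit into the trajectory")
    total = 0.0
    for t in range(t1, t2 + 1):
        later, earlier = trajectory[t + tau], trajectory[t]
        total += sum((a - b) ** 2 for a, b in zip(later, earlier))
    return total / (t2 - t1 + 1)


if __name__ == "__main__":
    # Ballistic toy trajectory p(t) = (t, 2t): g(tau) = 5 * tau**2.
    toy = [(float(t), 2.0 * t) for t in range(20)]
    print(mean_squared_displacement(toy, tau=3, t1=0, t2=10))
```

For diffusive motion $g(\tau)$ grows linearly in $\tau$, and the slope gives an estimate of the diffusion constant.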

Page 64: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

[Figure: two plots with logarithmic axes, $D$ versus $p$ and $D$ versus $N/n$.]

Left: Diffusion coefficient $D$ as a function of the mutation rate for $N = 10, 20, 30, 40, 80$ and $n = 10, 20, 30, 40, 80$ such that $N/n = 1$, after equilibration for $10^5$ timesteps. Right: Dependence of the ratio $D/p$ on $N/n$.

Page 65: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

An RNA-Based Model in the Plane

Target hypercycle with 8 members.

Model: Hypercyclically coupled species; each sequence has a function that depends on its structure.

Page 66: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Spatial Extension: CA Model

[Figure legend: possible catalysts, actual catalysts, possible replicators.]

Rules of replication. For each of the neighbors (•) of the empty cell (marked by a bold outline), the replication rate ρz is computed, taking into account their neighbors in the direction of the replication as potential catalysts. The neighbor with the largest value of ρz invades the empty position. In this example, for the chosen replicator, only three of its neighbors are catalysts according to the hypercycle topology.

Page 67: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Spirals formed after 3000 generations in an evolution experiment started with 300 random sequences in the absence of parasites.
See also Boerlijst & Hogeweg (1993).

Page 68: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Diffusion in Sequence Space

[Figure: g(tau) as a function of the time lag tau (2000 to 10000), and diffusion constant (0.002 to 0.004) as a function of the mutation rate (0.0005 to 0.0020).]

Page 69: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Summary

Neutrality of the Sequence-Structure Map implies diffusion/drift-like motion in sequence space, independent of the details of the selection/mutation mechanisms and of whether spatial extension is taken into account or not.

=⇒ The basic assumption of molecular phylogenetics, namely a dominating influence of drift in sequence evolution, holds true even when phenotypic evolution is dominated by interactions (co-evolution).

TODO: Development of a rigorous mathematical theory describing the motion in sequence space of a population with strong interactions.

Page 70: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Evolutionary histories of some structured RNAs

Ribosomal RNAs (rRNAs) are the most frequently used sequence data for reconstructing phylogenies from molecular data.

How does that work? In a nutshell:
(1) compute evolutionary distances from the sequence data
(2) “fit” an additive tree to the distances

(In reality, there are other methods such as maximum parsimony and maximum likelihood approaches, but the basic idea is the same.)

Observation: all tRNAs have more or less the same clover-leaf structure.
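Step (1) can be illustrated with the simplest distance correction. The slides do not say which distance measure is used; the sketch below uses the Jukes-Cantor formula $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$ as a stand-in, where $p$ is the fraction of differing positions in two aligned, gap-free sequences. Function names are illustrative assumptions.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions in two aligned, gap-free sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(a: str, b: str) -> float:
    """Jukes-Cantor distance; infinite when the sequences are saturated."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences: list[str]) -> list[list[float]]:
    """Pairwise distances, the input for fitting an additive tree in step (2)."""
    return [[jukes_cantor(a, b) for b in sequences] for a in sequences]
```

The resulting matrix is what a distance-based tree reconstruction method would take as input in step (2).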

Page 71: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

MicroRNAs

processed from precursor hairpins

short (∼22 nt) RNA molecules

highly conserved

Function

bind to 3’UTRs of mRNA targets

suppress expression of this mRNA

mark mRNA molecule for degradation

in plants involved in DNA methylation

[Figure: secondary structure drawing of a microRNA precursor hairpin.]

Page 72: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

MicroRNAs — processing and function

MicroRNAs ...

transcribed in longer transcripts (primary miRNA)

in some cases: polycistronic “clusters”

Drosha processing → precursor miRNA

export to cytoplasm: Exportin-5 pathway

Dicer processing → mature miRNA

Page 73: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Evolution of microRNA Families: mir-17 clusters

Many miRNAs are transcribed from polycistronic transcripts.
Most spectacular example: human mir-17 clusters

[Figure: genomic organization of the human mir-17 clusters (scale bar: 100 nt) on Chr-13, Chr-X and Chr-7. Cluster types and members as labeled:
I-1: 91, 17, 18, 19a, 20, 19b, 92
I-X: 106a, 18X, 20X, 19b, 92
II-3: 106b, 93, 25]

J. Mol. Biol. 339: 327-335 (2004)

Page 74: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Case Study: mir-17 clusters

[Figure: secondary structure drawing of the cluster transcript with the hairpins X-106a, X-18X, X-20X, X-19b-2 and X-92-2 labeled.]

. . . . . . . . ( ( ( ( ( ( ( ( ( ( ( ( ( ( . . . . ( ( ( ( ( . ( ( ( ( ( ( . . . ) ) ) . . ) ) ) ) ) ) ) ) . . . . ) ) ) ) ) ) ) ) . ) ) ) ) ) ) ( ( ( . . . . ) ) ) . . . ( ( ( . ( ( ( ( ( . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( ( . ( ( ( . . . ( ( ( . . . . . . . . ( ( ( ( ( . . ( ( ( ( . . . . . . . . ( ( ( . . . ( ( ( ( ( ( ( ( ( ( . ( ( ( ( . ( ( ( . ( ( ( ( ( . . . . . ( ( ( . . . ) ) ) . . . . . . . ) ) ) ) ) . ) ) ) . ) ) ) ) . ) ) ) ) . . ) ) ) ) ) ) ) ) ) ( ( ( . . . . . . . . . . . . ( ( ( ( ( ( ( . . . . . . . . . . . . . . . . . . . . ) ) ) ) ) ) ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( ( ( ( ( ( ( . . . . . . . . . . . ) ) ) ) ) ) ) ) ) . . . . . . . . . . . . . . . . . . . . ) ) ) . . . . . ) ) ) ) . . ) ) ) ) ) . . . ( ( ( . . . . ( ( ( ( ( . . ( ( ( ( ( ( ( ( ( ( ( . ( ( ( ( ( . ( ( ( . . . . . . . . . . . . . ) ) ) ) ) ) ) ) . ) ) ) ) ) ) ) ) ) ) ) . . . ) ) ) ) ) . . . . ) ) ) . . . . ) ) ) . . . ) ) ) . ) ) ) ) . . ( ( ( . . . . . . . ) ) ) . . . . . ) ) ) ) ) ) ) ) . . . . . . . ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( . . . ( ( ( ( . . . . . . . . . . . . . . . . . ) ) ) ) ) ) ) ) ) ) ) ) . . ) ) ) ) ) ) ) ) ) ) ) . . . . . . . . . . . ( ( ( ( . . . . . . . ) ) ) ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( . . . . ) ) ) . . . ( ( ( . ( ( ( . ( ( ( ( ( . ( ( ( ( ( . ( ( ( ( . . . . ( ( ( ( ( ( ( ( ( . . . ) ) ) ) . ) ) ) ) ) . . . ) ) ) ) . ) ) ) ) ) . ) ) ) ) ) . ) ) ) . ) ) ) . . . . . . . .

[Figure, panels (a) and (b): axis with sequence positions 0 to 700; hairpins X-106a, X-18X, X-20X, X-19b-2 and X-92-2 labeled.]

Structure of the pri-pre-mir-17 at the human X chromosome.

Page 75: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Construction of Gene Trees from concatenated sequences in the cluster

[Figure: two gene trees with branch support values (up to 1000) and scale bars of 0.1. Leaves of the first tree: HsX, PtX, MmX, RnX, CcX, Hs1, Pt1, Mm1, Rn1, Xt, TnA, TnB, TnC, TrA, TrB, TrC, DrA, DrB, Dr14. Leaves of the second tree: Hs-II, Pt-II, Mm-II, Rn-II, Xt-II, Dr-II-D, Tr-II-D.]

Page 76: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Distant Homologies with unreliable Alignments

How to quantify sequence similarity when we cannot get a good alignment?

measure pairwise sequence similarity $s(x, y)$

compare to the distribution of similarity values of alignments of shuffled sequences

define a z-score

$$z(x, y) = \frac{s(x, y) - \langle s(\pi(x), \pi'(y)) \rangle_{\pi,\pi'}}{\sqrt{\operatorname{var}_{\pi,\pi'}\big(s(\pi(x), \pi'(y))\big)}}$$

use $z(x, y)$ as similarity measure in WPGMA clustering
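The shuffling z-score can be sketched as follows. The slides do not specify the similarity measure $s$; as a stand-in this example uses a global alignment score with unit match, mismatch and gap costs, and it estimates mean and variance from a fixed number of random permutations. All names, the scoring scheme, and the sample size are illustrative assumptions.

```python
import random
import statistics


def alignment_score(x: str, y: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) with linear gap costs."""
    previous = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        current = [i * gap]
        for j in range(1, len(y) + 1):
            diagonal = previous[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def shuffled(sequence: str, rng: random.Random) -> str:
    """Random permutation of the letters; preserves base composition."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def similarity_z_score(x: str, y: str, samples: int = 100, seed: int = 0) -> float:
    """z(x, y) = (s(x, y) - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)
    background = [alignment_score(shuffled(x, rng), shuffled(y, rng))
                  for _ in range(samples)]
    mean = statistics.fmean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        raise ValueError("shuffled scores have zero variance")
    return (alignment_score(x, y) - mean) / spread
```

Because both sequences are shuffled, the background distribution has the same lengths and compositions as the real pair, so a high z-score reflects the order of the letters and not just composition.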

Page 77: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Gene Tree of mir-17 cluster members

[Figure: tree of the individual microRNAs from the mir-17 clusters. Leaves are labeled by species, cluster and microRNA (for example Hs1-17, MmX-106a, Hs3-25, DrB-19a, TnA-92), together with Dm-92a, Dm-92b and Ce-235; axis: z-score from 0.0 to 22.0.]

Page 78: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Collapsed tree of microRNA subgroups

[Figure: collapsed tree with leaves I-17, I-18, I-19a, I-20, I-19b, I-92, II-106b, II-93, II-19b, II-25, Ce-92 and Dm-92, grouped into the mir-17 group, the mir-92 group and the mir-19 group; axis: z-score from 0.0 to 18.0.]

obtained by collapsing vertebrate, insect, and nematode species trees to single vertices

next step: combine gene trees and synteny information into a duplication history

Page 79: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

ancestral mir-17 cluster probably contained mir-17, mir-19 and mir-92

[Figure: duplication history diagram of the mir-17 family; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

Page 80: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

first detectable duplication event: branch mir-17 and mir-18

[Figure: duplication history diagram of the mir-17 family; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

19b is copy of 19

18 is copy of 17

Page 81: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

series of duplications: branch mir-19 and mir-19b, mir-17 and mir-93

[Figure: duplication history diagram of the mir-17 family; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

19b is copy of 19

18 is copy of 17

93 is copy of 17

Page 82: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

genome wide duplication: duplication of whole cluster and loss of individual miRNAs

[Figure: duplication history diagram of the mir-17 family; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Page 83: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

independent miRNA duplications in type I cluster

[Figure: duplication history diagram of the mir-17 family with cluster types I and II; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Page 84: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Scenario for the evolution of the mir-17 family

split of teleosts and mammalia

teleost specific genome duplication

[Figure: duplication history diagram of the mir-17 family with cluster types I and II and the mammalian clusters I-1, I-X and II; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Page 85: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

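Scenario for the evolution of the mir-17 family

split of teleosts and mammalia

teleost specific genome duplication

[Figure: duplication history diagram of the mir-17 family with cluster types I and II, the mammalian clusters I-1, I-X and II, and the teleost clusters I-A, I-B, I-C and II-D; labels: 17, 18, 19, 19b, 19-3, 20, 25, 92, 93, 106b, deletion.]

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17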

Page 86: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

History of the mir-17 cluster: updated data

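[Figure: updated duplication history of the mir-17 clusters in Tetrapoda and Teleostei. Cluster labels: I, II, I-1, I-X, II, II-D, I-A, I-B, I-C. microRNA labels: 17/106a, 18, 19, 19b, 19b-II, 20, 25, 92, 93, 106b. Events: cluster duplication, deletion.]

20 is copy of 17

18 is copy of 17

93 is copy of 17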

Page 87: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Further Examples: let-7 family

[Figure: evolutionary history of the let-7 family. Members shown: a1, f1, d, f2, mir98, a3/c2, k, g, i, c1, e, a2, j, b. Annotations: ancestral Eubilaterian; ancestral vertebrate; Tetrapoda only; lost in mammals; loss of mir100 or mir125 in one paralog in most species; teleost genome duplication; Drosophila; Nematodes. Legend, mammalian transcription from: intron of coding sequence, intron of non-coding sequence, exon.]

Page 88: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Further Examples: mir-1 and mir-30

[Figure: gene trees of mir-1 and mir-30, each with a scale bar of 0.1. mir-1: paralogs 1-1 and 1-2 with teleost and tetrapod subtrees, plus Nematoda, Urochordata, Arthropoda and sea urchin; labeled leaves include Xtr and Gga-1b. mir-30: paralogs a, b, c1, c2, d and e with teleost and tetrapod subtrees; labeled leaf Dr.]

Page 89: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Further Examples: mir-9, mir-23, mir130/301

[Figure, three panels.
mir-9: tree (scale bar 0.1) with tetrapod and teleost paralogs (labels 9-2, 9-3, 9-4), Diptera (mir-9a, mir-9b, mir-9c, mir-79, 306), Apis, sea urchin, Nematoda and Schistosoma (?).
mir-23 cluster: members 23a, 27a, 24.2 and 23b, 27b, 24.1 (Hs19, Hs9); teleost loci Dr22(6.1) / Tn1, Dr22(8.8M) / Tn23, Dr11 / Tn3, Dr8 / Tn12, Dr22(10.1); annotation: lost in Tn/Tr.
mir-130 cluster: mir-130a, mir-130b and mir-301 in Gg, Md, Xt, Rn, Mm, Cf, Pt, Hs, Dr, Fr and Tv.]

Page 90: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Expansion of the Metazoan MicroRNA Repertoire

[Figure: numbers of miRNA innovations, non-local duplications and local duplications mapped onto the metazoan phylogeny; scale 0 to 80. Species: Rn, Mm, Hs, Bt, Cf, Pt, Md, Gg, Xt, Dr, Tn, Tr, Cs, Ci, Od, Sp, Ag, Bm, Tc, Am, Cb, Ce, D.sp., Sm.]

Page 91: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Similar Situation: snoRNA

snoRNAs direct chemical modification of other RNAs (mostly rRNA, snRNA, and (some?) messenger RNAs)

two classes: box H/ACA and box C/D

known in eukaryotes and archaea, not in eubacteria

Page 92: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

H/ACA box snoRNAs in Vertebrates

[Figure: three gene trees, labeled E1, E2 and E3. Leaves are labeled by species, chromosome and copy (for example Hs_1_E1_1, Mm_9_E2_1, Dr_25_E3_2); branch support values are given in percent; scale bars run from 0.00 to 0.10. Clade labels: Mammals, Chick, Xenopus, Teleosts; Tetrapoda-1, Tetrapoda-2; Mammals-1, Mammals-2, Teleosts.]

Page 93: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

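Vertebrate Y RNAs

[Figure: gene tree of vertebrate Y RNAs with the clades Y1, Y3, Y4 and Y5 and an outgroup. Labeled sequences include HsY1, HsY3, HsY4, HsY5, MmY3, CfY4, XlYa, XlY3, XlY4, OmY1, OcY1 and ApY3, plus sequences labeled only by species (Rn, Mm, Gg, Dr, Fr, Tn, Tr).]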

Page 94: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Summary

The genotype-phenotype map of RNA is characterized by an interplay of “ruggedness” and neutrality

Selection plus drift results in diffusion on neutral networks

Many non-coding RNAs have highly constrained (i.e., evolutionarily very well conserved) structures but fairly rapidly evolving sequences

Drift of sequences is independent of the details of the selection mechanism

Ongoing research: elucidate the evolutionary histories of structured ncRNAs

Page 95: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

PART III: The Modern RNA World

[Figure: overview of RNA classes and processing pathways in the eukaryotic cell. Compartments: nucleus, nucleolus, Cajal body. RNAs and complexes shown: pre-mRNA, mRNA, introns, spliceosome, snRNA, scaRNA, pre-tRNA, tRNA, RNAse P, MRP, pre-rRNA, rRNA, snoRNA, ribosome, pri-miRNA, pre-miRNA, miRNA, Drosha, Dicer, RISC, RNAi, translational inhibition, 7SL + proteins (SRP), Y RNA + proteins (Ro RNP), vRNA + proteins (vRNP), proteins.]

Page 96: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Multiple Origins of ncRNAs

[Figure: phylogenetic tree with the origins of ncRNA families marked. Taxa: LUCA, Bacteria, Archaea, Eukarya, green plants, other protists, Rhodophyta, Stramenopiles, Alveolates, Microsporidia, Fungi, Choanoflagellata, Metazoa, Protostomia, Hemichordata, Echinodermata, Chordata, Urochordata, Cephalochordata, Craniata. ncRNA families marked: rRNA, tRNA, RNAse P, 7SL/SRP; tmRNA; small bacterial RNAs; snoRNA C/D, snoRNA H/ACA; RNAse MRP, telomerase RNA, most snRNAs, vault RNAs (?); 7SK (?); U7; Y RNA; gRNA (in Kinetoplastids only); microRNA (in multicellular animals and plants only?).]

Page 97: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Surveys for noncoding RNAs

> 5% of the human genome is under stabilizing selection (from man/mouse comparison); less than 1/3 of this codes for protein

Virtually the entire genome is transcribed as primary nuclear transcripts in at least one direction (ENCODE Genes&Transcripts group, unpublished data)

∼ 80% of the ENCODE regions are transcribed as parts of protein coding transcripts, including introns and UTRs

Only a tiny part of the primary transcripts is protein coding

Large fraction of apparently non-protein-coding cDNAs

The functions of most of these transcripts are unclear.

“There is need for reliable experimental and computational methods for comprehensive identification of non-coding RNAs.”

International Human Genome Sequencing Consortium, Nature 431, p. 943, October 2004

Page 98: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

The ENCODE Project

ENCyclopedia Of DNA Elements

Public research consortium launched by NHGRI in 2003

Purpose: “testing and comparing existing methods to rigorously analyze a defined portion of the human genome sequence”.

Focus: specified 30 megabases (about 1% of genome) in more than 20 species

Informally organized in subgroups: Sequencing Technology, Comparative Genomics, Genes and Transcripts, Genetic Variation, ...

Results from 1st phase currently under review

Phase 2: scale-up to complete genome

Page 99: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Highlights from ENCODE Genes and Transcripts Analysis Group

(Data presented by Tom Gingeras in Bethesda, Jan 12 2006)

Only a fraction of processed RNA transcripts correspond to GeneCode annotated transcripts:
70% correlate with annotated (m)RNAs
52% correlate with annotated protein coding sequences

Substantial fraction of transcription is specific to cellular conditions:
only 2.6% of transfrags are common to all 11 cell lines.

The same genomic sequence may be processed into multiple RNA sequences with different fates

Virtually the entire genome is transcribed as primary nuclear transcript in at least one direction.

Transcriptional output is MUCH more extensive AND much more complex than previously thought.

Page 100: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Recall: Sequence-Structure Map of RNA

1. Redundancy: Many more sequences than structures

2. Sensitivity: Small changes in the sequences may lead to large changes in the structure

3. Neutrality: A substantial fraction of mutations does not alter the structure.

4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).

Implications:

1. Neutral Networks: S(ψ) forms a connected “percolating” network in sequence space for all “common” structures.

2. Shape Space Covering: Almost all structures can be found in a relatively small neighborhood of almost every sequence.

3. Mutual Accessibility: The neutral networks of any two structures almost touch each other somewhere in sequence space.

Proc. Roy. Soc. B 255: 279-284 (1994), Proc. Natl. Acad. Sci. USA 93: 397-401 (1996), Bull. Math. Biol. 59: 339-397 (1997), RNA 7: 254-265 (2000).

Page 101: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

[Figure: two workflows for detecting conserved RNA secondary structures.
Minimum energy: RNA sequences; multiple sequence alignment (CLUSTAL W); minimum energy folding (Vienna RNA Package); secondary structures; mountain representations; aligned structures; detect conserved sub-structures; check: compensatory mutations; confirmed conserved sub-structures.
Base pairing probabilities: RNA sequences; multiple sequence alignment (CLUSTAL W); McCaskill’s algorithm; dot plots; sequence and pairing probability; combined pair table; reduce pair list; credibility ranking; check: compensatory mutations; conserved sub-structures.]

Nucl. Acids Res. 26: 3825-3836 (1998), Comp. & Chem. 23: 401-414 (1999)

Page 102: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Examples: HIV-1 TAR-hairpin

. . . . . (((((((((((. (((((. . . ((((. . . . . . )))))))))))))))))))).

[Figure: conserved secondary structure of the HIV-1 TAR hairpin, shown as a structure drawing and as a plot over sequence positions 0 to 50.]

Flaviviridae: Nucl. Acids Res. 29: 5079-5089 (2001); Picornaviridae: J. Gen. Virol. 85: 1113-1124 (2004); broad survey: Bioinformatics 20: 1495-1499 (2004)

Page 103: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Examples: Picornaviridae: Cis-acting Replication Element (CRE)

The function of the CRE probably involves the initiation of the synthesis of the negative-sense strand template RNA during virus replication.

[Figure: predicted CRE hairpin structures for the seven groups listed below.]

Group:   Aphthovirus  Enterovirus  Cardiovirus  HRV-A  HRV-B  Teschov.  Hepatov.
Region:  2C           2C           1B           2A     1B     2C        2C

Aphto   ~~~~CGAC-GGUU------ACA-CCAAGCA------GACCGUCG~~~~~
Entero  CAUACAGU-UCAAG--------UCCAAAU-GCCGUAUUGAACCUGUAUG
Cardio  ~~~~~ACG-GCCA---CAAACACCCAAUCAACUGU-UGGCCGU~~~~~~
HRV-A   ~~~AUCAUAUACCGAACAAACA---------CUAUAGGUGAUGAU~~~~
HRV-B   GAAGUCAU-CGUUGAGAAAACG---AAACA------GACGGUGGCCUC~
Tescho  ~~~~~~AC-GGCU--ACAAACA-----ACA------AGCUGU~~~~~~~
Hepato  UUUUGCAU-UUUG---CAAA--------------UUCAAGAUGUAGAG~
        ~~~(((((-((((.......................)))))))))~~~~
        1.......10........20........30........40.........

Predicted in Nucl. Acids Res. 29: 5079-5089 (2001); experimentally detected by Gerber, Wimmer & Paul, J. Virol. 75: 10979-10990 (2001).

Page 104: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

A Method for Large Genomes: RNAz

∗ Two ingredients: Thermodynamic Stability & Structure Conservation

Measuring thermodynamic stability of ncRNAs

Do naturally occurring structured RNAs have a lower folding energy than random sequences of the same size and base composition?

1. Calculate the native MFE m.
2. Calculate the mean µ and standard deviation σ of the MFEs of a large number of shuffled random sequences.
3. Express significance in standard deviations from the mean as a z-score:

$$ z = \frac{m - \mu}{\sigma} $$

Negative z-scores indicate that the native RNA is more stable than the random RNAs.
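The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RNAz implementation: `toy_mfe` is a hypothetical stand-in that scores a sequence by minus its maximum number of nested base pairs (a Nussinov-style recursion), whereas a real analysis would call a thermodynamic folding program such as RNAfold. The shuffle preserves mononucleotide composition only.

```python
import random
import statistics

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_mfe(seq, min_loop=3):
    """Toy energy (assumption): minus the maximum number of nested base pairs."""
    n = len(seq)
    if n == 0:
        return 0.0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case: j stays unpaired
            for k in range(i, j - min_loop):       # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + best[k + 1][j - 1] + 1)
            best[i][j] = score
    return -float(best[0][n - 1])


def z_score(seq, n_shuffles=200, energy=toy_mfe, seed=0):
    """z = (m - mu) / sigma over shuffled versions of seq."""
    rng = random.Random(seed)
    m = energy(seq)                                # step 1: native energy
    samples = []
    for _ in range(n_shuffles):                    # step 2: shuffled controls
        letters = list(seq)
        rng.shuffle(letters)
        samples.append(energy("".join(letters)))
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    if sigma == 0:
        return 0.0
    return (m - mu) / sigma                        # step 3: z-score
```

A designed hairpin such as `GGGGGGGGAAAACCCCCCCC` gives a negative z-score under this toy model, because almost no shuffled arrangement can form as many pairs as the native one.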

Page 105: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Efficient calculation of stability z-scores

The mean µ and standard deviation σ of random samples of a given sequence are functions of the length and the base composition:

$$ \mu, \sigma \left( \text{length}, \frac{GC}{AT}, \frac{G}{C}, \frac{A}{T} \right) $$

Calculating z-scores is thus a 5-dimensional regression problem.

The regression problem is solved using a Support Vector Machine regression algorithm.

The SVM was trained on 10,000 synthetic sequences spaced evenly in the variable space.

The regression calculation is of the same accuracy as the sampling procedure.
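The input variables of such a regression can be computed directly from a sequence. The sketch below reads the slide's notation literally, as (G+C)/(A+T), G/C and A/T; the exact feature definitions used by RNAz may differ, and the function name `composition_features` is hypothetical.

```python
def composition_features(seq):
    """Length and base-composition ratios of a sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    counts = {base: seq.count(base) for base in "ACGT"}

    def ratio(numerator, denominator):
        # None marks an undefined ratio (zero denominator).
        return numerator / denominator if denominator else None

    return {
        "length": len(seq),
        "GC/AT": ratio(counts["G"] + counts["C"], counts["A"] + counts["T"]),
        "G/C": ratio(counts["G"], counts["C"]),
        "A/T": ratio(counts["A"], counts["T"]),
    }
```

A trained regression model would map such a feature vector to µ and σ, replacing the explicit shuffling and folding of many random sequences.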

[Figure: scatter plot comparing sampled z-scores with calculated z-scores; both axes range from −8 to 3.]

Page 106: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

z-scores of known ncRNAs

ncRNA Type                  No. of Seqs.   Mean z-score
tRNA                        579            −1.84
5S rRNA                     606            −1.62
Hammerhead ribozyme III     251            −3.08
Group II catalytic intron   116            −3.88
SRP RNA                     73             −3.37
U5 spliceosomal RNA         199            −2.73

Functional RNAs are clearly more stable than random sequences.

However: The scores are too small to discriminate reliably in genome-wide screens, since the z-score distributions have heavy tails.

Page 107: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Consensus folding using RNAalifold

RNAalifold uses the same algorithms and energy parameters as RNAfold

Energy contributions of the single sequences are averaged

Covariance information (e.g. compensatory mutations) is incorporated in the energy model.

It calculates a consensus MFE consisting of an energy term and a covariance term.

J.Mol.Biol. 319:1059-1066 (2002)
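To illustrate what covariance information means for a pair of alignment columns, the sketch below counts how many different canonical pair types occur at two columns and how many sequences cannot pair there. It is a simplified, hypothetical measure, not the covariance term that RNAalifold actually uses; the function name and the example alignment are made up.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def column_pair_support(alignment, i, j):
    """Return (number of distinct pair types, number of incompatible sequences)."""
    pair_types = set()
    incompatible = 0
    for seq in alignment:
        pair = seq[i] + seq[j]
        if "-" in pair:
            continue                  # gaps are ignored in this sketch
        if pair in CANONICAL:
            pair_types.add(pair)      # e.g. GC in one sequence, UA in another
        else:
            incompatible += 1         # this sequence cannot form the pair
    return len(pair_types), incompatible


# Hypothetical toy alignment: columns 0 and 4 covary, columns 1 and 3 do not.
toy_alignment = ["GAAAC", "CAAAG", "UAAAA", "GAAAU"]
support = column_pair_support(toy_alignment, 0, 4)
```

Several distinct pair types with no incompatible sequences indicate compensatory mutations, which is the kind of evidence that favours a base pair in the consensus structure.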

Page 108: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

The structure conservation index

The SCI is an efficient and convenient measure for secondary structure conservation.
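A minimal sketch of the index, assuming the definition given in the RNAz paper cited later in the talk (Proc. Natl. Acad. Sci. USA 102: 2454-2459): the consensus MFE of the alignment divided by the mean MFE of the individual sequences. The function name is hypothetical, and the energies are expected to come from external tools such as RNAalifold and RNAfold.

```python
def structure_conservation_index(consensus_mfe, single_mfes):
    """SCI = consensus MFE / mean of the single-sequence MFEs."""
    if not single_mfes:
        raise ValueError("need at least one single-sequence MFE")
    mean_single = sum(single_mfes) / len(single_mfes)
    if mean_single == 0:
        return 0.0   # no sequence folds at all, so nothing is conserved
    return consensus_mfe / mean_single
```

Values near 0 mean that no common structure is found, while values near 1 mean that the consensus structure is about as stable as the individual folds.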

Page 109: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Separation of native ncRNAs from random controls in two dimensions

[Figure: six scatter-plot panels (5S rRNA, tRNA, signal recognition particle RNA, RNase P, U2 spliceosomal RNA, U5 spliceosomal RNA) plotting the structure conservation index (0 to 1.2) against the z-score.]

Page 110: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Classification based on both scores

Page 112: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Implementation and availability

The approach is implemented in ANSI C in the program RNAz.

The z-score regression is limited to 400 nucleotides.

The classification model is currently limited to alignments of six sequences.

At least an order of magnitude faster than other programs.

RNAz is freely available: download from www.tbi.univie.ac.at/~wash/RNAz

Proc. Natl. Acad. Sci. USA 102: 2454-2459 (2005)

Page 113: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Screening the human genome

Large-scale comparative screen including: human, mouse, rat, dog, chicken, fugu, zebrafish

Reduction of the ≈ 3,095 MB human genome:
Take ≈ 5% of the best conserved regions
Remove all annotated coding exons
Only take alignments strictly conserved in all 4 mammals.

→ 438,788 alignments covering 82.64 MB

Page 114: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

[Figure, panels a to d: genome-browser views (coordinate ranges 92.0M to 98.0M, 90801000 to 90801500, and 93104k to 93108k; chromosome labels Chr. 13, Chr. 13, Chr. 11) with the tracks: most conserved noncoding regions (present in at least human/mouse/rat/dog), RNAz structural RNAs (P>0.5), RNAz structural RNAs (P>0.9), RefSeq genes, miRNAs (mir-17, mir-18, mir-19a, mir-20, mir-19b-1, mir-92-1), H/ACA snoRNAs (ACA25, ACA32, ACA1, ACA8, ACA18, ACA40), and C/D-box snoRNAs (mgh28S-2412, mgh28S-2410). One panel shows the consensus structure and alignment of a miRNA precursor:]

           (((((..((((((..((((((((.((.(((((...(((........)))...))))).)).))))))))...))))))....)))))
Human      GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATGT-GCATCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-GTGAC
Mouse      GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATGTGT-GCATCTACTGCAGTGAGGGCACTTGTAGCATTA-TG-CTGAC
Rat        GTCAGGATAATGTCAAAGTGCTTACAGTGCAGGTAGTGGTGTGT-GCATCTACTGCAGTGAAGGCACTTGTGGCATTG-TG-CTGAC
Chicken    GTCAGAGTAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATATAGAACCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-TTGAC
Zebrafish  GTCAATGTATTGTCAAAGTGCTTACAGTGCAGGTAGTATTATGGAATATCTACTGCAGTGGAGGCACTTCTAGCAATA-CACTTGAC
Fugu       GTCTGTGTATTGCCAAAGTGCTTACAGTGCAGGTAGTTCTATGTGACACCTACTGCAATGGAGGCACTTACAGCAGTACTC-TTGAC

Page 115: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Results of Human Genome Screen

                            Genome coverage           Alignments   RNAz hits (p > 0.9)
                            Size (MB)   Fraction (%)  Number       Size (MB)   Fraction of input (%)   Number
Human genome                3,095.02    100.00        –
PhastCons most conserved    137.85      4.81          1,601,903
without coding regions      110.04      3.84          1,291,385
without alignments < 50nt   103.83      3.33          564,455
Set 1: 4 Mammals            82.64       2.88          438,788      5.46        6.62                    35,985
Set 2: + Chicken            24.00       0.85          104,266      1.34        5.50                    8,802
Set 3: + Fugu or zebrafish  6.86        0.24          30,896       0.14        2.03                    996

Nature Biotechn. 23: 1383-1390 (2005)

Page 116: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Pictures instead of Numbers

[Figure: bar chart of observed versus expected numbers of structural elements (axis 0 to 90,000) at P > 0.5 and P > 0.9 for three sets (4 mammals; 4 mammals + chicken; all vertebrates), with pie charts showing the shares of structural RNA (labels 6.6% and 15.1%), estimated false positives, and other conserved noncoding elements. Value labels in the chart: 91,676; 35,958; 20,391; 8,802; 2,916; 996; 5,661; 6,898; 2,281; 26,508; 208; 795.]

Page 117: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Distribution related to known protein gene annotation

[Figure: pie charts of RNAz hits by genomic context. Categories: known gene; < 10 kb from nearest gene; > 10 kb from nearest gene; intron of coding region; 3'-UTR (exon or intron); 5'-UTR (exon or intron). Value labels: 15380, 16860, 3745, 2866, 2830, 11205.]

Page 118: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Sensitivity on known classes of ncRNAs

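[Figure: sensitivity charts for known ncRNA classes, with segments for: detected (P > 0.9), detected (0.5 < P < 0.9), not detected, and not in input set. Segment values, grouped by class total: microRNA (207): 150, 45, 7, 5; C/D snoRNA (256): 129, 77, 26, 24; H/ACA snoRNA (86): 41, 22, 14, 9.]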

Page 119: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Not all ncRNAs have conserved secondary structures!

[Figure: genome-browser view of the HOXA cluster on chr7 (26.90m to 27.05m) with the tracks RNAz_set1_50, EvoFold, sno/miRNA, Conservation, RepeatMasker, Affy Transcription, GENCODE, GENCODE putative, mRNA, and alternative splicing. Annotated genes: HOXA1, HOXA2, HOXA3, HOXA4, HOXA5, HOXA6, HOXA7, HOXA9, HOXA10, HOXA11, hoxa11-as, HOXA13, EVX1. RNAz hits shown: chr7.279, chr7.283, chr7.287, chr7.290, chr7.295.]

Page 120: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Other RNAz Screens

Urochordates: Ciona intestinalis & Ciona savignyi

only a few conserved RNAs with Oikopleura dioica

Bioinformatics 21(S2): i77-i78

Nematodes: Caenorhabditis elegans & Caenorhabditis briggsae

JEZ:MDE 2006 epub

Teleost fishes: Danio rerio, Takifugu rubripes, Tetraodon nigroviridis, Oryzias latipes (partial) (in progress)

Trypanosomatids: Trypanosoma and Leishmania species

Yeasts (joint work with Kay Nieselt and Stephan Steigele)

Page 121: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Summary

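Predicted structured RNAs (RNAz predictions, p > 0.9)

[Figure: diagram of the screened groups (yeasts, nematodes, insects, urochordates, teleosts, mammals, trypanosomatids) annotated with approximate numbers of predicted structured RNAs. Legible labels: 36000; >10000; 4000; 2500; 2000; 1000; 500; <200; "?" (twice); "rRNA, tRNA, snoRNA, snRNA ..."; "unknown"; "conserved".]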

Page 122: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Novel Human ncRNA Candidates

[Figure: predicted secondary structures of novel human ncRNA candidates, drawn along a position axis from 0 to 120.]

Page 123: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Novel ncRNA Candidates in Caenorhabditis

[Figure: predicted consensus secondary structures of six candidates. First group: CeN23 (UM1), unknown; CeN74 (UM3), sb-RNA; CeN77 (UM3), sb-RNA. Second group: 513253 (UM2); 515948 (UM3); 513590 (UM1).]

Page 124: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Efforts to Annotate the RNAz Results

ongoing effort

Large number of microRNA candidates
approximately 30-40 good H/ACA-box snoRNAs
only 6% of hits (comparable to estimated false positive rate) overlap with predicted coding regions
few clusters of signals with high sequence similarity

work in progress: structure-based clustering (joint work with Rolf Backofen's lab in Freiburg)

BOTTOM LINE: most signals still unclassified. We need MUCH better methods to recognize members of known RNA classes

Page 125: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

RNAmicro: A classifier for microRNA Precursors

Input: Multiple sequence alignment

Preprocessing: non-restrictive check for almost-hairpin structure. Some known microRNA precursors, notably some let-7 family members, have small branches!

SVM classification with few descriptors:

Property                  #   Descriptors
Structure                 2   l_s, l_h
Sequence composition      1   G+C
Sequence conservation     4   S_5', S_3', S_0, S_min
Thermodynamic stability   4   E, ε, η, z
Structure conservation    1   E_cons

ISMB 2006, in press
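A non-restrictive hairpin check on a dot-bracket structure could look like the sketch below. The criterion is an assumption made for illustration, not RNAmicro's actual preprocessing rule: the structure counts as an almost-hairpin when all but a few base pairs lie on the deepest nesting path, so that small side branches are tolerated. The function name and the threshold `max_branch_pairs` are hypothetical.

```python
def is_almost_hairpin(structure, max_branch_pairs=2):
    """True if at most max_branch_pairs base pairs lie off the main stem."""
    depth = 0
    max_depth = 0      # number of pairs stacked along the deepest path
    total_pairs = 0
    for char in structure:
        if char == "(":
            depth += 1
            total_pairs += 1
            max_depth = max(max_depth, depth)
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
    if depth != 0:
        raise ValueError("unbalanced structure")
    if total_pairs == 0:
        return False   # an unstructured sequence is not a hairpin
    return total_pairs - max_depth <= max_branch_pairs
```

A plain stem-loop has all of its pairs on one nesting path and passes trivially; a structure with two hairpins of equal size fails, while a long stem carrying a two-pair side branch still passes.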

Page 126: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Results: Caenorhabditis

[Figure: Venn diagram comparing RNAz hits, RNAmicro predictions (P > 0.5 and P > 0.9), miRNA registry 7.1, Grad et al. 2003, and other RNAs in Caenorhabditis.]

Page 127: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Results: Mammals

[Figure: Venn diagram comparing RNAz hits, RNAmicro predictions (P > 0.5 and P > 0.9), miRNA registry 7.1, and Berezikov et al. 2005 in mammals.]

Page 128: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Clustering

[Figure: schematic of structure-based clustering. Dot plots (dot.ps) of five sequences A to E are compared pairwise (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE) and the sequences are merged step by step into a tree (AB, DE, C).]

Page 129: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Proof of Concept: tRNAs in Ciona intestinalis

[Figure: cluster tree of tRNAs with a distance axis from 0.0 to 0.4. Leaf labels: Arg/Asn 2 2 Arg 2 Thr 2 ~ Ile 2 Phe Cys 2 2 Lys 2 Gln 3 Gly 2 2 Val Glu 2 Pro ~ 2 Met Arg Met Gln Ala ~~~ 4 2 Ile 4 Ser Leu Tyr]

Page 130: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Summary

Some classes of ncRNAs, namely the structured ones, can be found efficiently by means of comparative genomics

There are tens of thousands of structured RNAs of unknown function in the human genome

Some of them probably act, like microRNAs and snoRNAs, by binding to other RNAs. These could be investigated using RNA cofolding approaches (ongoing research).

So far, we know only the proverbial tip of the iceberg of the complexity of cellular regulation

& RNA bioinformatics is a really cool research topic ...

Page 131: RNA Bioinformatics - Dartmouth Computer Sciencerockmore/CSSS2006/stadler...RNA Bioinformatics Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center

Acknowledgments: It’s not my fault . . .

Leipzig: Kristin Missal, Dominic Rose, Jana Hertel, Manja Lindemeyer, Matthias Kruspe, Sonja J. Prohaska, Claudia Fried, Roman Stocsits, Axel Mosig, Bettina Müller (FH Weihenstephan), Katrin Sameith (U Jena)

Vienna: Stefan Washietl, Ivo L. Hofacker, Christoph Flamm, Andrea Tanzer, Stefan Bernhart, Susanne Rauscher, Caroline Thurner, Christina Witwer, Ingrid Abfalter, and many others

Harvard: Walter Fontana

Beijing: Xiaopeng Zhu, Wei Deng, Geir Skogerbø, Runsheng Chen

Tübingen: Stephan Steigele, Kay Nieselt

Copenhagen: Jan Gorodkin, Stefan Seemann

Freiburg: Rolf Backofen, Sebastian Will

