Graduate School ETD Form 9 (Revised 12/07)
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By: Kejie Li
Entitled: A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships
For the degree of: Doctor of Philosophy
Is approved by the final examining committee:
Wen Jiang
Michael Gribskov
Daisuke Kihara
Dabao Zhang
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.
Approved by Major Professor(s): Michael Gribskov
Approved by: Peter J. Hollenbeck, Head of the Graduate Program. Date: 07/25/2011
Graduate School Form 20 (Revised 9/10)
PURDUE UNIVERSITY GRADUATE SCHOOL
Research Integrity and Copyright Disclaimer
Title of Thesis/Dissertation: A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships
For the degree of: Doctor of Philosophy
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.
Printed Name and Signature of Candidate: Kejie Li
Date (month/day/year): 07/25/2011
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
A GRAPH THEORETIC APPROACH FOR IDENTIFYING RNA STRUCTURE AND FUNCTION RELATIONSHIPS
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Kejie Li
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
August 2011
Purdue University
West Lafayette, Indiana
Dedicated to my beloved wife Juan Liao, my great father Changgui Li, and my dear
mother Hanfang He.
以此献给我心爱的妻子:廖娟,我伟大的父亲:李常贵,以
及我亲爱的母亲:何汉芳。
ACKNOWLEDGEMENTS
I would like to express my greatest and most sincere gratitude to my major advisor,
Dr. Michael Gribskov, for his support, patience, understanding and encouragement
during my graduate study at Purdue. I thank him for giving me the freedom and support to
explore the research areas I am interested in. He is always ready to help and has been a
superb mentor throughout the development of my projects. I sincerely appreciate the
time he spent improving my writing of the annual progress reports, the graduate thesis
and publications. To me, he is like a huge library that holds all kinds of sources: I can
retrieve the information I need at any time, often in less time than it would take me to
search for it on Google.
I would like to extend my sincere appreciation to the members of my PhD advisory
committee: Dr. Wen Jiang, Dr. Daisuke Kihara and Dr. Dabao Zhang. I truly appreciate all
their support, advice and guidance throughout my graduate study.
Special thanks go to Reazur Rahman and Aditi Gupta, who are my lab mates as well as
RNA group mates. We worked closely together, and I greatly enjoyed our teamwork.
Extra thanks go to the past and current members of the Gribskov lab: Hao Jiang, Ying Li,
Damion Junk, Doug Yatcilla, Omer Ijaz, Prasad Siddavatam, Greg Ziegler, Emre Demirors,
James Hengenius, Qiong Wu, Minming Li, Jiajie Huang, Biaobin Jiang, and Junhui Wang
(in the order in which I met them) for their help and advice on my projects. I
would also like to thank Nina Robinson for her full support of our lab activities. She is
priceless to us.
Finally, to my mom and dad, Hanfang He and Changgui Li: without their true love and
support, I would not have been able to reach this stage of my life. To my beloved wife, Juan
Liao, who is the most important part of my life: with your trust and belief, we will
always hold hands tightly and stay together on our life journey.
TABLE OF CONTENTS
Page
LIST OF TABLES ..................................................................................................................... x
LIST OF FIGURES .................................................................................................................. xi
ABSTRACT ........................................................................................................................... xv
CHAPTER 1 INTRODUCTION ................................................................................................ 1
1.1 RNA’s double life ....................................................................................................... 1
1.2 Importance of Pseudoknots ...................................................................................... 4
1.2.1 Definition of pseudoknot ................................................................................... 4
1.2.2 Pseudoknot functional significance ................................................................... 5
1.3 Representations of RNA secondary structure .......................................................... 6
1.3.1 Simple representations ...................................................................................... 6
1.3.2 Graph theoretical representations .................................................................... 8
1.4 Methods for studying RNA structure ...................................................................... 11
1.4.1 Experimental approaches ................................................................................ 12
1.4.2 Computational approaches ............................................................................. 13
1.5 Decision making in novel molecule design ............................................................. 22
1.6 Summary/organization of this work ....................................................................... 24
1.7 Data sources ........................................................................................................... 26
1.7.1 Manually curated dataset ................................................................................ 26
1.7.2 STRAND dataset ............................................................................................... 28
CHAPTER 2 PATTERN MATCHING IN RNA STRUCTURES ................................................... 42
2.1 Introduction ............................................................................................................ 42
2.2 XIOS RNA graphs ..................................................................................................... 45
2.2.1 Definition ......................................................................................................... 45
2.2.2 Training data .................................................................................................... 46
2.2.3 DFS Lexicographical ordering ........................................................................... 46
2.2.4 Enumeration N-stem structures ...................................................................... 48
2.3 Greatest conserved structures ............................................................................... 48
2.3.1 Extension of the gSpan algorithm .................................................................... 48
2.3.2 Graph matching algorithm (similar to gSpan) ................................................. 50
2.3.3 Greatest conserved structure(s) in a set of RNAs ........................................... 50
2.3.4 Characteristics of biological graphs ................................................................. 51
2.4 Future directions ..................................................................................................... 52
2.4.1 Graph preprocessing ........................................................................................ 52
2.4.2 Reduction of graph complexity ........................................................................ 53
2.4.3 Adding labels .................................................................................................... 54
2.4.4 Motif identification tool .................................................................................. 54
2.4.5 Database search tool ....................................................................................... 55
CHAPTER 3 RNA STRUCTURAL FINGERPRINT.................................................................... 68
3.1 Enumeration of XIOS graphs ................................................................................... 68
3.2 Structural motif library construction ...................................................................... 70
3.3 RNA structural fingerprint ...................................................................................... 73
3.3.1 Background ...................................................................................................... 73
3.3.2 Definition of RNA structural fingerprint .......................................................... 73
3.3.3 Fingerprint searching algorithms ..................................................................... 75
3.3.4 Possible applications ........................................................................................ 85
CHAPTER 4 MATCHING UNKNOWN RNA STRUCTURES .................................................. 103
4.1 Introduction .......................................................................................................... 103
4.2 Methods and dataset ............................................................................................ 105
4.2.1 XIOS Graph ..................................................................................................... 105
4.2.2 Dataset ........................................................................................................... 105
4.2.3 Indexing and searching .................................................................................. 105
4.2.4 Scoring function ............................................................................................. 106
4.3 Results ................................................................................................................... 108
4.3.1 Validation using known biological structures ................................................ 108
4.3.2 Size only graph database search .................................................................... 109
4.3.3 Embedding simulation ................................................................................... 109
4.3.4 Blast search .................................................................................................... 111
4.4 Discussion ............................................................................................................. 111
CHAPTER 5 IDENTIFICATION OF TOPOLOGICAL FEATURES THAT DISCRIMINATE BETWEEN
RNA CLASSES ................................................................................................................... 121
5.1 Introduction .......................................................................................................... 121
5.1.1 RNA importance, RNA function determined by RNA structure..................... 121
5.1.2 Contribution ................................................................................................... 123
5.2 Material and Methods .......................................................................................... 123
5.2.1 Reverse cIndex basic feature selection on RNA fingerprints ......................... 123
5.2.2 RNA structure classification ........................................................................... 125
5.2.3 RNA structure datasets .................................................................................. 126
5.3 Results ................................................................................................................... 126
5.3.1 Feature selection on the fingerprint generated ............................................ 126
5.3.2 Top unique feature selected (in the same order as weights): ...................... 127
5.3.3 Validation of RNA structure classification ..................................................... 127
5.4 Discussion ............................................................................................................. 129
5.5 Future directions ................................................................................................... 131
LIST OF REFERENCES ....................................................................................................... 143
VITA ................................................................................................................................. 154
PUBLICATIONS ................................................................................................................. 155
LIST OF TABLES
Table .............................................................................................................................. Page
Table 1.1 Manually curated structures. ............................................................................ 36
Table 1.2 RNaseP Sequences Used. .................................................................................. 37
Table 1.3 Group I Intron Sequences Used. ....................................................................... 39
Table 1.4 tmRNA Sequences Used .................................................................................... 40
Table 1.5 Dataset collected from STRAND database. ....................................................... 41
Table 2.1 Brief description of RNA datasets. .................................................................... 66
Table 2.2 Number of possible RNA topologies for different numbers of stems, N. ......... 67
Table 3.1 Design of the NH index array. ......................................................................... 102
Table 4.1 Kolmogorov-Smirnov test results .................................................................... 119
Table 4.2 Statistics of embedding simulation ................................................................. 120
Table 5.1 Statistics of the selected structural features for four RNA families from two
datasets. .......................................................................................................................... 139
Table 5.2 Classification performance.............................................................................. 140
Table 5.3 Leave one out cross validation result ............................................................. 141
Table 5.4 Classification test ............................................................................................ 142
LIST OF FIGURES
Figure ............................................................................................................................ Page
Figure 1.1 Common types of Pseudoknots ....................................................................... 29
Figure 1.2 Common RNA secondary structure representations. ...................................... 30
Figure 1.3 Rooted-labeled tree ......................................................................................... 32
Figure 1.4 Dot plots ........................................................................................................... 33
Figure 1.5 RNA tree graph, RNA dual graph and RNA digraph representations. ............. 35
Figure 2.1 XIOS definition. ................................................................................................ 58
Figure 2.2 tRNA 3D structure and corresponding XIOS graph representation. ............... 59
Figure 2.3 Unique three-stem XIOS graphs, including pseudoknots. ............................... 61
Figure 2.4 Identification of the common structure in S. cerevisiae and H. sapiens RNase P
RNA. .................................................................................................................................. 63
Figure 2.5 Correlation between number of stems and sequence length. ........................ 64
Figure 2.6 Length of RNA stem structures in biological RNAs .......................................... 65
Figure 3.1 Enumeration of XIOS graphs. ........................................................................... 87
Figure 3.2 RNA secondary structure visualization. ........................................................... 89
Figure 3.3 Flow of generating fingerprint. ........................................................................ 90
Figure 3.4 RNA structural fingerprint tRNA example. ...................................................... 91
Figure 3.5 Architecture comparison of CPU and GPU. ..................................................... 92
Figure 3.6 Comparison of CPU and GPU. .......................................................................... 93
Figure 3.7 Prefix tree structure stores structural motif library for efficient subgraph
isomorphism check ........................................................................................................... 94
Figure 3.8 Neighborhood indexing (NH indexing). ........................................................... 95
Figure 3.9 Triangle descriptors. ........................................................................................ 97
Figure 3.10 Full List of Mathematically Possible Triangle Descriptors. .......................... 101
Figure 4.1 Positive hit ratio (PHR) in a Blast search. ....................................................... 113
Figure 4.2 NH database search result. ............................................................................ 114
Figure 4.3 Size only database search result. ................................................................... 115
Figure 4.4 Embedding simulation. .................................................................................. 116
Figure 4.5 Embedding simulation database search result. ............................................. 117
Figure 4.6 Blast search result .......................................................................................... 118
Figure 5.1 Idea of graph containment search. ................................................................ 133
Figure 5.2 Graph feature matrix. .................................................................................... 134
Figure 5.3 Feature selection weight vs. iteration ........................................................... 135
Figure 5.4 Selected top unique structural features in dataset Table 1.1. ...................... 136
Figure 5.5 Selected top unique structural features in dataset Table 1.5. ...................... 137
Figure 5.6 Link from structure to function...................................................................... 138
LIST OF ABBREVIATIONS
3D Three Dimensional
CGI Common Gateway Interface
CM Covariance Model
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DDT dichlorodiphenyltrichloroethane
DFS Depth First Search
DOS Diversity Oriented Synthesis
DP Dynamic Programming
GPGPU General Purpose computing on GPUs
GPU Graphics Processing Unit
HMM Hidden Markov Model
HTS High Throughput Screening
LOOCV Leave One Out Cross Validation
LWP The World-Wide Web Library for Perl
MFE Minimum Free Energy
MSA Multiple Sequence Alignment
ncRNA non-coding RNA
NGS Next-Generation Sequencing
NH Neighboring Index
PDB Protein Data Bank
PHR Positive Hit Ratio
QSAR Quantitative Structure-Activity Relationships
RAG RNA As Graphs
SCFG Stochastic Context Free Grammars
STRAND The RNA secondary STRucture and statistical Analysis Database
TOS Target Oriented Synthesis
UID Unique ID
VARNA Visualization Applet for RNA
XIOS eXclusive, Included, Overlap and Serial
ABSTRACT
Li, Kejie. Ph.D., Purdue University, August 2011. A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships. Major Professor: Michael Gribskov.
Understanding of structure-function mapping is crucial to the study of the nature of
biopolymers. This mapping can be used to extract information to aid in the prediction of
molecular function based on structural topological patterns. This study presents a graph
theoretical approach for understanding RNA structural topological features, and
revealing the mapping from biological RNA structural topological features to biological
functions. We have built a package that represents ensembles of suboptimal RNA
structures as a graph, the XIOS graph, for easy structural comparison and analysis by an
extended version of the gSpan algorithm. In order to detect structural similarities, the
Neighbor Indexing algorithm has been extended with additional RNA structure-specific
information, and the concept of an RNA structural fingerprint has been introduced, from
a structural descriptor point of view, to represent the topological information of
ensembles of RNA structures. Based on the cIndex feature selection strategy, I have
developed and applied a new feature selection approach for RNA structures which
reveals important structural topological patterns that provide specific information about
the functional class of RNAs. This information can be used to relate RNA structural
patterns to function. In addition, I have developed a novel structure indexing and
database searching method for finding RNAs with similar characteristics (topological
modules).
It is remarkable that even without using RNA primary sequence information RNA
structures can be classified into the correct classes. By combining information from both
sequence and topology, unclassified or misclassified RNAs can be correctly classified and
categorized with high confidence. The structure-based classification described here is
significantly better than sequence-based classification using Blast (Kolmogorov-Smirnov
test).
CHAPTER 1 INTRODUCTION
1.1 RNA’s double life
The central dogma of molecular biology (2) says that RNA is the biopolymer copied from
DNA (transcription) that serves as the template during protein synthesis (translation). In
other words, RNA is the intermediate message interpreter from heritable DNA genetic
information to various biological functions. This view assumes that genes
encode proteins, and that proteins play most of the important biological roles, such as
catalytic and regulatory functions. From that point of view, biological complexity is
determined by the number of protein-coding genes.
Two great figures initiated the study of evolution in vitro, the so-called RNA evolution in
the test tube: Sol Spiegelman and Manfred Eigen. Spiegelman’s serial transfer
experiments with the Qβ assay (3-5) and Eigen’s extensive kinetic study of the
mechanism of Qβ RNA replication (6-8) revealed that the primary sequence and spatial
structure of the same RNA molecule are its genotype and phenotype, respectively. RNA
molecules have to satisfy structural requirements in order to be recognized and
replicated by enzymes (9). Structure alone is not sufficient to infer function, but a
complete understanding of the functional molecule requires information about its
organization in space. It has been suggested that the spatial structure of RNA is a crucial
factor in determining its function.
In the meantime, Carl Woese discovered that RNA forms complex secondary structures,
which suggested, for the first time, that RNA could act as a catalyst (10). Later in the
1980s, Thomas R. Cech (11-13) and Sidney Altman (14,15) separately discovered the
catalytic properties of RNA molecules, making proteins no longer the only biopolymers
with catalytic function. An RNA molecule, therefore, is not just a chemical entity that
carries genetic information and is chemically very similar to DNA; remarkably, it also
possesses catalytic activity and can execute functions itself. Although catalytic RNA
molecules, or ribozymes (13), have a limited catalytic repertoire compared with proteins,
it is more than sufficient to process genetic information and (self-)replicate in a
pre-biotic environment. This gave rise to the RNA world hypothesis: the idea that RNA
molecules could have played a bridging role between the lifeless pre-biotic environment
and the beginning of life (16-18). The RNA world provides a possible answer to the
long-standing question of the origin of life. It suggests that versatile RNA, with its abilities for both
storing information like DNA and catalyzing enzymatic reactions like proteins, came first.
RNA-encoded proteins evolved after RNA but before DNA (19).
The traditional definition of RNA secondary structure is based on base-pairing
interactions within an RNA molecule: Watson-Crick base pairs (A∷U, G⋮⋮C) (20) and
wobble base pairs (G::U, I::U, I::A and I::C) (21). The basic structural elements are
stacked base pairs (or stems), hairpin loops, interior loops, bulges and multi-loops
(Figure 1.2 A). In classical secondary structures there are only nested structural
elements, which means that one base-pairing region must lie completely within the loop of
the other base-pairing region. Early studies (22,23) have found that the catalytic cores of many
ribozymes have uniquely shaped conserved RNA secondary structures that allow them
to perform their catalytic function. More recent studies show that small ribozymes
exhibit a broad range of catalytic activities (24-31) and that RNA catalysis plays essential
roles in the metabolism of cells (32-34). The involvement of RNA in such diverse
catalytic functions gives further support to the RNA world hypothesis.
Traditionally, the genome was believed to be a simple collection of separate genes,
one gene → one protein. Most gene transcripts were thought to be protein-coding,
and only rarely non-coding RNAs (ncRNAs); tRNA and rRNA, which make up the bulk of
cellular RNA, were the exceptions. ncRNAs are RNA molecules that perform biological
functions without being translated into proteins. With the recent rapid development of
high-throughput techniques, comprehensive large-scale transcriptome studies (35-37)
across species as diverse as plants, bacteria and mammals (38-46) have changed our
understanding of RNA. New functions of ncRNAs have been discovered, and ncRNAs and
RNA-based biological processes are now known to exist in all life forms. Gradually, RNA has
been recognized as a central player in cellular regulation (47). In particular, ncRNAs play
active roles in multiple regulatory layers, from transcription, RNA maturation and
RNA modification to translational regulation (47). The current view is that transcripts are
potentially overlapping and bidirectional, and non-coding transcripts are abundant. In
spite of the importance and ubiquity of ncRNAs, we still know relatively little about
them (48).
1.2 Importance of Pseudoknots
It is widely accepted that RNA functions are mainly determined by RNA structures.
A relationship of this kind demands comprehensive study and analysis of RNA
structures, in order to better understand RNA catalytic and regulatory functions.
1.2.1 Definition of pseudoknot
With regard to RNA structure analysis, I have to mention an important structural
element called a pseudoknot. In contrast to nested RNA secondary structures,
pseudoknots are base-paired regions that are only partially nested: bases in the
loop region of one base-paired region pair with bases in a region outside that
base-paired region. Let us consider two stems, S1 and S2, where S1 is formed by
base-paired regions A and B, and S2 is formed by regions C and D. The sequential order of
those base-paired regions is A, C, B, D in the RNA sequence. S1 and S2 form a
pseudoknot structure because region C of stem S2 lies inside the “loop” of stem S1, and
region D of S2 is outside of that loop. Such knotted structures were first discovered in
yellow mosaic virus in 1982 (24), and they occur frequently in RNA functional sites and
catalytic cores, often being directly involved in RNA catalytic and regulatory functions.
There are many types of pseudoknots; simple and common ones include the H-type
pseudoknot (classic pseudoknot), kissing pseudoknot, simple recursive pseudoknot and
hairpin-bulge pseudoknot (three-stemmed pseudoknot) (Figure 1.1).
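The A, C, B, D ordering in the definition above can be checked mechanically. The following is a minimal Python sketch, not part of the software described in this work; the `Stem` type, the `forms_pseudoknot` function and the interval coordinates are hypothetical names and values chosen only for illustration. Each stem is assumed to be given as two half-stem intervals of sequence positions.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    """A stem as two base-paired regions, each a (start, end) interval.

    Hypothetical illustration type; positions are sequence indices.
    """
    left: tuple[int, int]   # 5' half, e.g. region A or C
    right: tuple[int, int]  # 3' half, e.g. region B or D


def forms_pseudoknot(s1: Stem, s2: Stem) -> bool:
    """Return True if the half-stems interleave in the order A, C, B, D."""
    # Order the two stems by the start of their 5' halves.
    first, second = sorted((s1, s2), key=lambda s: s.left[0])
    a, b = first.left, first.right
    c, d = second.left, second.right
    # C lies inside the loop of the first stem; D lies beyond its 3' half.
    return a[1] < c[0] and c[1] < b[0] and b[1] < d[0]


s1 = Stem(left=(1, 5), right=(20, 24))                         # regions A and B
print(forms_pseudoknot(s1, Stem((10, 14), (30, 34))))  # A, C, B, D: True
print(forms_pseudoknot(s1, Stem((8, 10), (14, 16))))   # nested:     False
print(forms_pseudoknot(s1, Stem((30, 34), (40, 44))))  # serial:     False
```

Nested and serial arrangements fail the test because, in both cases, the two halves of the second stem fall on the same side of region B.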
1.2.2 Pseudoknot functional significance
Pseudoknots are well known to play diverse fundamental roles in cells (49). The diversity
of pseudoknots corresponds to their various functions. The formation of H-type
pseudoknots, for instance, can lead to compact structures. Site-directed mutagenesis
kinetic studies showed that slight sequence changes can destabilize pseudoknot
structures. It has further been suggested that pseudoknotted structures are just slightly
more stable than the corresponding non-knotted hairpin structures (by less than 2 kcal/mol)
(50,51). This small free energy difference allows the knotted structure to fold (compact
structure) and unfold (less compact structure) without having to cross high energy
barriers. This suggests a possible role for pseudoknots as conformational switches or
control elements.
Pseudoknots fold locally in RNA molecules, so their positions in the sequence may also
reflect their function (52). In several viruses, replicase expression is controlled by
ribosomal frameshifting (53-57) or in-frame stop codon readthrough (58). In these cases,
pseudoknot formation is necessary, which suggests that pseudoknots near the 5' end of
mRNAs may be involved in translational control. tRNA-like motifs at the 3' end of several
groups of plant viral RNA genomes also contain pseudoknot structures (52), and in
tobacco mosaic virus, pseudoknots before the tRNA-like domain were shown to be
required to substitute for a poly(A) tail, stabilizing the mRNA and increasing gene
expression (up to 100-fold) (59). This evidence shows that 3' end pseudoknots have roles in
preserving replication signals. In catalytic and regulatory RNA molecules, pseudoknots
are often found at the core of the tertiary structures (60). In addition to these functions,
evolutionarily conserved pseudoknots are also involved in self-splicing (49).
Overall, pseudoknots are biologically important due to their appearance in critical
regions of functional RNAs. Among all the fundamental RNA structural elements,
pseudoknots are possibly the most important.
1.3 Representations of RNA secondary structure
1.3.1 Simple representations
There are many ways of representing RNA secondary structures. The most common are
stem-loop diagrams, circle plots, dome plots, mountain plot, linear dot-bracket
representations, dotPlots and tree graphs.
One of the most common representations is the biological stem-loop or “squiggle”
diagram (Figure1.2 A and Figure1.2 B). It describes RNA secondary structure in terms of
stem-loop structural elements. Stem-loop diagrams are not rotationally invariant and
therefore very similar structures may appear very differently.
Nussinov suggested a new representation called a circle plot in 1978 (61) (Figure1.2 C
and E). The backbone nucleotides of RNA are arranged along a circle, and base pairs are
drawn as arcs linking the paired bases. With this representation, different length
sequences can look quite different.
Dome plots can be considered as a simple alternative representation (Figure1.2 D and F).
Instead of organizing the RNA sequence along a circle, nucleotides are placed along a
line. Base pairs are still drawn as arcs. In contrast to circle plots, dome plots look similar
even when sequence lengths differ. Both circle plots and dome plots can be used to
show classical stem-loop secondary structures as well as pseudoknots.
Hogeweg and Konings (62) developed a graph called the mountain representation, or
mountain plot, to compare RNA secondary structures (Figure 1.2 G). The x-axis of a
mountain plot corresponds to the RNA sequence, and the y-axis shows the number of
base-pairs within which a specific nucleotide is nested. While mountain plots have the
advantage of being rotationally invariant, they cannot be used to show pseudoknots.
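As a minimal sketch of how the y-values of a mountain plot could be computed, assume the base pairs are given as 0-based (i, j) tuples and that a nucleotide k counts as nested in a pair when i < k < j. The function name and the hairpin example are hypothetical and are not taken from (62).

```python
def mountain_heights(length, pairs):
    """Return, for each position k, the number of base pairs (i, j) with i < k < j."""
    heights = [0] * length
    for i, j in pairs:
        # Every position strictly between the two paired bases is nested in this pair.
        for k in range(i + 1, j):
            heights[k] += 1
    return heights


# Hypothetical hairpin: 10 nucleotides, a 3-pair stem closing a 4-nucleotide loop.
hairpin_pairs = [(0, 9), (1, 8), (2, 7)]
print(mountain_heights(10, hairpin_pairs))
```

The plateau in the middle of the returned list corresponds to the hairpin loop, and the rising and falling flanks correspond to the two strands of the stem.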
The linear dot-bracket representation (or Vienna notation) (Figure 1.2 H) is both
common and useful. It is a string of dots and brackets, where a dot indicates an unpaired
position, and a pair of matching brackets at positions i and j indicates a base-pair
(i, j) between the bases at those positions. Pseudoknots can be included in this
representation by using different types of brackets to distinguish the corresponding
base-paired regions.
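The notation can be decoded with one stack per bracket type. The sketch below is an illustration only: the set of bracket symbols it accepts and the function name are assumptions, and positions are 0-based.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of 0-based base pairs (i, j) from a dot-bracket string."""
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    # One stack per bracket type, so that crossing (pseudoknotted) regions
    # written with different brackets are matched independently.
    stacks = {opener: [] for opener in closers.values()}
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))          # nested hairpin
print(parse_dot_bracket("((..[[..))..]]"))  # two crossing stems
```

Using a separate stack for each bracket type is what lets the second example, in which the square-bracket stem crosses the round-bracket stem, be read without ambiguity.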
The dot-bracket representation can be translated into a rooted, labeled tree (63). In this
representation, the interior nodes of the tree correspond to base-pairs, and the leaf
nodes are unpaired nucleotides. An additional dummy node is added as the root of the
tree graph; it serves as the parent of all top-level nodes, which ensures that structures with
free end(s) are not represented by a forest (disconnected trees). Stems appear as chains
of interior nodes (rope-like) and multi-loops appear as bush-like branching centers in
this representation (Figure 1.3). Similar representations have been proposed by other
groups (64,65). Trees can capture all the pairwise interactions of an RNA secondary
structure but not other tertiary interactions, such as pseudoknots and base triples (66).
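The translation from a pseudoknot-free dot-bracket string to such a tree can be sketched as follows. The dictionary layout of the nodes and both function names are assumptions chosen for illustration; they are not the data structures used in (63).

```python
def dot_bracket_to_tree(structure):
    """Build a rooted tree: interior nodes are base pairs, leaves are unpaired bases."""
    root = {"pair": None, "children": []}  # dummy root, parent of all top-level nodes
    open_nodes = [root]
    open_positions = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            node = {"pair": None, "children": []}
            open_nodes[-1]["children"].append(node)
            open_nodes.append(node)
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            # Closing bracket: the innermost open node becomes the pair (i, j).
            open_nodes.pop()["pair"] = (open_positions.pop(), pos)
        elif symbol == ".":
            open_nodes[-1]["children"].append({"unpaired": pos})
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return root


def render(node):
    """Write a tree back out as a dot-bracket string (inverse of the function above)."""
    if "unpaired" in node:
        return "."
    inner = "".join(render(child) for child in node["children"])
    return inner if node["pair"] is None else "(" + inner + ")"


tree = dot_bracket_to_tree(".((..)).")
print(render(tree))
```

Because only one bracket type is accepted, a pseudoknotted structure cannot be expressed in this sketch, which mirrors the limitation of the tree representation itself.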
Dot plots represent structures as a two-dimensional matrix in which a dot is placed at
the (i, j) position of each base-pair. There are two common types of dot plots. One is the
energy dot plot, a visualization produced by Zuker’s mfold algorithm for an RNA
sequence (67,68). For each possible pair (i, j), position (i, j) of the plot shows the lowest
free energy of any structure that contains base pair (i, j). It provides a picture of all
alternative secondary structures within a specific free energy increment from the lowest
free energy structure (Figure 1.4). When many dots are displayed with a small energy
increment, the secondary structure prediction is less well determined (i.e., many
competing structures with similar energy are present). The other is the partition function
dot plot. The Vienna RNA package implements McCaskill’s partition function algorithm (69),
whose output is a base-pairing probability matrix (Figure 1.4). This dot plot is a
visualization of the thermodynamic ensemble of
structures.
1.3.2 Graph theoretical representations
1.3.2.1 Basics of graph theory
A graph is a collection of vertices (nodes) and edges linking the vertices. Graph theory,
which uses graphs to model relationships of natural and artificial objects, has been used
in many different areas, such as communication networks, network flow and data
organization, molecular structures in chemistry and physics, and social network analysis
in sociology. A graph is normally represented by a connectivity matrix, which describes the
connectivity of vertices by different types of edges. In this connectivity matrix, the
value in the i-th row and j-th column describes the edge (such as its directionality, weight
or edge type) that links vertices i and j in the graph. In the simplest case, the values
are either zero (no edge between vertices i and j) or one (an edge exists between i
and j).
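For the simplest case described above (undirected, unweighted edges), a connectivity matrix can be built from an edge list as in the following sketch. The function name and the three-vertex example are hypothetical.

```python
def connectivity_matrix(num_vertices, edges):
    """Return a symmetric 0/1 matrix for an undirected, unweighted graph."""
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        # An undirected edge is recorded in both (i, j) and (j, i).
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Path graph 0 - 1 - 2
for row in connectivity_matrix(3, [(0, 1), (1, 2)]):
    print(row)
```

Directed or typed edges would replace the 0/1 entries with other values and drop the symmetric assignment.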
To find the differences between graphs, methods for graph comparison are required. An
important concept related to graph comparison is graph isomorphism, which describes a
structure-preserving bijection between the sets of vertices of two graphs. If an
isomorphism exists between two graphs, the two graphs are called isomorphic. Put more
simply, if two graphs are isomorphic, they are structurally
equivalent (have equivalent vertices linked by equivalent edges) even though they may have
different layouts of the vertices and edges. Considerable work has been done on the
graph isomorphism problem, i.e., how to determine if two graphs are structurally
equivalent or isomorphic. One way to screen for isomorphism is to compare the
eigenvalues of the connectivity matrices of the graphs. Isomorphic graphs always have
identical eigenvalues, so graphs whose eigenvalues differ cannot be isomorphic. The
converse does not hold: non-isomorphic (cospectral) graphs can share the same eigenvalues,
so identical eigenvalues are a necessary but not a sufficient condition for isomorphism.
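Agreement of eigenvalues is best treated as a screen rather than a proof of isomorphism. The sketch below compares spectra without computing eigenvalues explicitly: for a symmetric matrix A with n rows, the traces of A, A^2, ..., A^n determine the multiset of eigenvalues, so two graphs on n vertices have the same spectrum exactly when these traces agree. The helper names and the two example graphs (a star on five vertices, and a four-cycle plus an isolated vertex) are choices made for illustration.

```python
def adjacency(num_vertices, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def power_traces(adj):
    """Return [tr(A), tr(A^2), ..., tr(A^n)], the closed-walk counts of each length."""
    n = len(adj)
    power = [row[:] for row in adj]
    traces = []
    for _ in range(n):
        traces.append(sum(power[i][i] for i in range(n)))
        power = [[sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return traces


def degree_sequence(adj):
    return sorted(sum(row) for row in adj)


star = adjacency(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
cycle_plus_isolated = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 0)])

same_spectrum = power_traces(star) == power_traces(cycle_plus_isolated)
same_degrees = degree_sequence(star) == degree_sequence(cycle_plus_isolated)
print(same_spectrum, same_degrees)
```

The two graphs pass the spectral screen yet have different degree sequences, so they cannot be isomorphic; a spectral match therefore needs to be confirmed by an explicit vertex mapping.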
Graph similarity is related to another problem called subgraph isomorphism, which
involves determining whether graph G contains a subgraph that is isomorphic
to graph H. The subgraph isomorphism problem is NP-complete (70).
Given a method for measuring the similarity of graphs, available clustering and
classification methods can be applied, and novel analyses and predictions can
be made.
In chemistry, chemical structure graphs are used to model molecules. The vertices
represent atoms and bonds are represented by edges. Chemical structure graphs have
been used to identify similar chemical structures, for structure search and for function
identification (71,72). One specific area, called quantitative structure-activity
relationships (QSAR), quantitatively studies the correlation of small molecule structure
and function by using structural determinants to generate graph models (73). Linear and
nonlinear relationships between structure and activity are considered. Such
relationships allow predictive models of synthetic small molecule activity based on
existing knowledge of small molecules.
1.3.2.2 Using graph theory in RNA study
Similar to chemical studies, secondary structures of RNA molecules can also be
represented as planar graphs; examples are shown in section 1.4.1. RNA graphs use
vertices to represent nucleotides or structural elements; edges between vertices
correspond to relationships such as bonds, ordering or connectivity.
The RNA tree graph (see also section 1.4.1) was introduced by the Schlick group (74)
(Figure 1.5). It was used to represent the connectivity of secondary structure elements
(Figure 1.2 A). Vertices in the graph represent loops and the edges correspond to stems.
This graph models topological, rather than geometric, aspects of RNA molecules. For
example, it does not record stem lengths, loop lengths and so on. It simply provides a
coarse-resolution image of the actual secondary structure. One advantage of this
representation is that there are existing tree enumeration theorems (75,76). This is
useful since enumeration can be used to estimate the size of the RNA structural space.
In addition, candidate new RNA folds can be identified by enumerating graphs that do not
correspond to any known structure. However, RNA tree graphs cannot represent
pseudoknotted structures. Later,
the Schlick group proposed a new representation called a dual graph to include
pseudoknots (74) (Figure 1.5). In the dual graph, vertices correspond to stems and loops
are represented as arcs. Like RNA tree graphs, the dual graph captures the
topological characteristics of a folded RNA. One issue with the dual graph
representation is that different RNA topologies can share the same dual graph. The digraph
representation is an RNA dual graph with directed edges (Figure 1.5). The direction of
the edges can resolve some of the ambiguity in representing RNA topologies.
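Both the tree graph and the dual graph start from the same coarse-graining step: runs of stacked base pairs are collapsed into stems, which then become edges (tree graph) or vertices (dual graph). The sketch below illustrates only that step, under the simplifying assumption that a stem is a run of directly stacked pairs (i, j), (i+1, j-1), ..., so that any bulge starts a new stem. The function name is hypothetical and this is not the construction used in (74).

```python
def collapse_into_stems(pairs):
    """Group 0-based base pairs into stems of directly stacked pairs."""
    stems = []
    for i, j in sorted(pairs):
        # (i, j) extends the current stem only if it stacks on the previous pair.
        if stems and stems[-1][-1] == (i - 1, j + 1):
            stems[-1].append((i, j))
        else:
            stems.append([(i, j)])
    return stems


# Two crossing stems (a simple pseudoknot) give two stem groups.
print(collapse_into_stems([(0, 9), (1, 8), (4, 13), (5, 12)]))
```

Stem and loop lengths are deliberately discarded after this step, which is what makes the resulting graphs topological rather than geometric descriptions.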
1.4 Methods for studying RNA structure
As mentioned before, biological complexity has been measured in terms of the number
of protein-coding genes. At the molecular level, the spatial structure of molecules is the
phenotype. All biological functions are properties of phenotypes (77), and the mapping
from genotypes into phenotypes is the key to understanding information and complexity in
biology (77). Cracking the code that relates molecular structures to biological
phenomena is the greatest current challenge for life science (77).
The genetic code, the relationship between the nucleotide sequences of DNA or RNA
and the amino acid sequences of proteins (2), is well accepted as only one part of the
highly complex language of evolution. We have clear knowledge about how the
biological machinery processes genetic information according to the central dogma. The
language linking sequences and three-dimensional structures of biopolymers has been
called the second half of the genetic code (78). Coarse grained notions of structures of
biopolymers, like RNA secondary structures, can be considered to be phenotypes (77).
Thus the connection between RNA structure and biological function could be considered
to be a third part or extension of the language of evolution.
1.4.1 Experimental approaches
Several experimental methods have been developed to determine the location of base-paired and
single-stranded regions in RNA. Nuclease protection assays use ribonucleases which
specifically cleave single-stranded RNA regions (or specifically cleave double-stranded
RNA regions), leaving the double-stranded regions intact (or leaving the single-stranded
regions intact). Many ribonucleases can be used; examples include V1 ribonuclease,
which cleaves phosphodiester bonds 3′ of double-stranded RNA regions, and S1
nuclease, which specifically cleaves 3′ of single-stranded RNA regions. The limitation of
this kind of approach is that it is time consuming and costly.
Pioneering efforts also have been made by projects such as Doudna’s structural
genomics of RNA project, which sought to decipher the biological and functional
properties of RNA molecules by determining their molecular three-dimensional
structures (79). The Puglisi group is trying to understand RNA function in terms of
molecular structure and dynamics using biophysical tools. They use nuclear magnetic
resonance (NMR) spectroscopy to determine the structures of RNA and RNA-protein
complexes, in order to determine the role of RNA in cellular processes and diseases.
These approaches could lead to novel therapeutic strategies targeting processes
involving RNA. NMR spectroscopy is a powerful biophysical tool for solving RNA
structures, but it is more difficult to apply to RNA than to proteins due to
the intrinsic biophysical and biochemical properties of RNA. One important reason is
that the lower proton density of RNA molecules (compared to proteins) results in
fewer restraints for structure calculations. These projects are likely to be time
consuming, but by integrating the results, we could have a better picture of the true life
of RNA molecules.
1.4.2 Computational approaches
Today, there are many publicly accessible genomic sequences, RNA secondary
structures, RNA three-dimensional (3D) structures, protein sequences and 3D structures.
The main challenge in biological studies is now computational. Data mining and computational
predictions are prevalent. Compared with experimental RNA structure determination,
computational methods aim to speed up the annotation of RNA structures by using the
power of computer models and predictions, at much lower cost. Many bioinformatics
tools help to extract important information related to RNA from the vast amounts of
sequence data generated by high throughput techniques, such as whole genome and
transcriptome sequencing. However, accurate annotation of RNAs is still an extremely
difficult problem.
1.4.2.1 The comparative covariance approach
For a given RNA sequence, computer programs can enumerate all possible RNA
secondary structures based on thermodynamic and free energy minimization
considerations. But the selection of the true structure out of the large set of predicted
structures requires more information, such as the experimental evidence mentioned in
section 1.4.1.
Alternatively, a comparative approach can be used to solve this problem more
efficiently. The assumption is that the evolution of RNA structures is slower than the
evolution of RNA sequences. In order to maintain the same biological function, RNA
molecules should have similar structures even if they do not share high sequence
similarity. A single mutation in a base pair can break the pairing, yet many
RNA structures in the same family share similar structural patterns. This means that changes in
sequence are coordinated so as to conserve structure (base-paired bases), a phenomenon called
covariance or co-evolution. Covariance ensures that base-pairs are maintained and RNA
structure is conserved. The presence of the same stable tRNA structures in different
organisms is a good example. The idea of the comparative approach is that when
homologous RNA sequences are available, the consensus secondary structure can be
inferred from the multiple sequences. Covariance models (CM) have been used in
sequence analysis to search for regions sharing similar base complementarity patterns
(66). Similar base pairing patterns indicate similar structures. The CM is a generalization
of the hidden Markov model (HMM). A CM uses the primary sequence consensus and pairwise
covariations with respect to the consensus secondary structure to describe an RNA multiple
sequence alignment (MSA) and to represent the consensus secondary structure of a
set of RNA sequences. Put simply, a CM is a probabilistic
representation of the shared structure of a set of RNAs. With more sequences, the inferred structure
becomes less ambiguous due to the additional sequence diversity and support for the
pairwise covariations. However, there are only four nucleotides, and the random chance
of finding complementary bases is high. The lack of primary sequence conservation,
which makes it difficult to obtain a reliable MSA for divergent RNA families, is the
primary limitation of CMs.
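Covariation between two alignment columns can be illustrated with a simple statistic. The sketch below computes the mutual information, in bits, between two columns of a hypothetical set of pre-aligned sequences. Mutual information is used here only as an illustrative stand-in; it is not the covariance model itself, and the function name and example columns are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Mutual information (bits) between two equally long alignment columns."""
    if len(column_a) != len(column_b) or not column_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(column_a)
    freq_a = Counter(column_a)
    freq_b = Counter(column_b)
    freq_ab = Counter(zip(column_a, column_b))
    total = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        # Compare the observed joint frequency with the product of the marginals.
        total += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return total


# Hypothetical columns from four aligned sequences.
covarying = mutual_information("GCAU", "CGUA")   # every change is compensated
conserved = mutual_information("GGGG", "CCCC")   # no variation, so no covariation signal
print(covarying, conserved)
```

A fully conserved base pair carries no mutual information even though it is structurally important, which is one reason practical methods combine covariation with sequence conservation.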
1.4.2.2 The thermodynamic energy based models
Calorimetric data on the thermodynamic stability of RNA secondary structures are
used to build a base-pair energy model. There are four main categories of methods in this area:
minimum free energy RNA structure prediction (80-89); RNA secondary
structure prediction including pseudoknots (90-96); RNA secondary structure prediction
including suboptimal structures (97-100); and RNA secondary structure prediction with
suboptimal structures and pseudoknots (101).
In the 1970s, Tinoco et al. first studied the thermodynamics of RNA folding (base-pairing)
by using short oligonucleotides (102,103). They measured the free energy decrease
between the denatured RNA molecule and its native state. In the Tinoco model, the
total free energy difference is the sum of the contributions of independent
elements in the structure. If we assume that the
interactions in the secondary structure are stronger than those in the tertiary structure, the
sum of the free energies of the secondary structure elements is a good approximation of the total
free energy of the RNA molecule. Stacked base-paired regions and hydrogen bonding
are generally considered the source of stabilization of RNA molecules, while loops
generally destabilize the RNA molecular structure due to the entropic cost they introduce.
Two factors that make it difficult to find solutions to biological problems are the size of
biological data and high computational complexity. We are always trying to find the
“true” solution. Sometimes computational models do not exactly mirror the physical
world, or sometimes we are forced for computational reasons to simplify the problem.
Instead of searching for the “true” solution of the problem, focusing on the optimal
solution is often a good alternative. In biology, we usually do not know the exact
answer for which we are looking. What biologists do is try to find optimal solutions
that closely resemble the true solution. For example, genome sequence assembly tries
to use all the sequence reads from a genome, and assemble them in the correct order
and location to find the correct sequence of the whole genome. In the process of
assembly, there are problems such as repeat sequences. There are many choices for
how to place the repeat sequences in the genome. Without further experimental results,
such as genome walking, it is hard to know the correct answer. But computer programs
either leave repeats out of the assembled sequence or try to calculate the likelihood of
possible locations for these repeat sequences; these are optimal approximations to the true
solution. Understanding the mechanism of the effect of a certain gene is another
example of how the lack of biological knowledge prevents our models from exactly
describing the biology. Until recently, there was no way to find all of the regulatory
factors related to a gene X. However, with the support of microarray or RNA-seq
experiments, it is now possible to find a set of genes, even if not a complete set, that are
correlated with X. From here, biologists can construct the regulatory network to gain
some understanding of the pathways in which X is involved, and can propose possible hypotheses
about the mechanisms of the effect of gene X. At this point, the solution is optimal given
the available data, but
it can get closer to the true answer with further experimental support or technology
that can provide more detail.
In computer science, a popular technique called dynamic programming (DP) was
developed by Bellman in the 1950s to solve optimization problems (104). DP combines
sub-problem solutions to solve the overall problem. It calculates the
solution for each sub-problem only once, and stores the solutions in a matrix. By doing so, it
avoids recalculating the answer when the same sub-problem is encountered again during
the calculation. However, the limitation of DP is that it only deals with problems with
recursively nested sub-problems. In RNA research, we know RNA function is determined
by its spatial structure, and DP has been the main approach used to predict minimum
free energy (MFE) RNA structures (95). Nussinov et al. tried to computationally predict
RNA secondary structures by maximizing the number of base-pairs (61). Later, Nussinov
and Jacobson introduced DP to RNA structure prediction by implementing a simple
recurrence based on the decomposition of structure into base-paired elements (105). It
is most likely for RNA molecules to fold into the MFE structures (not necessarily unique),
which are the most stable conformations. Zuker et al. proposed a method to predict
RNA secondary structures by looking for the base-pair combinations with the minimum
free energy (83). It uses the same recurrence principle as Nussinov’s approach, and also
decomposes structures into base-paired elements. Zuker pointed out that the free
energy of a structure is associated with the regions between bonds (base stacking
regions), rather than with the hydrogen bonds themselves, as assumed by Nussinov et al. (61). Zuker’s
energy function is based on free energy and takes experimental thermodynamic data
into consideration, including loop energies (Zuker treated multiloops as interior loops, and
the positive energy of loops is usually modeled as dependent on the log of the loop
length). Stacking regions have stabilizing effects, while various loops have destabilizing
effects. The Zuker algorithm uses DP to perform a forward recursion that completely
fills a matrix with the lowest energies of the admissible structures ending at each base-pair.
The last number to be computed is the overall MFE of an admissible structure. One can
recursively follow the path that produced the MFE value, tracing back the MFE structure
from the optimal free energy. Zuker’s algorithm has computational complexity O(n^4) and
O(n^2) memory requirement. By limiting the interior loop size to a constant value, for
example 30, its computational complexity can be further reduced to O(n^3).
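The base-pair maximization recurrence and its traceback can be sketched as follows. This is a minimal illustration of a Nussinov-style recursion, not Zuker's energy-based algorithm: the minimum hairpin loop size of 3, the inclusion of G-U wobble pairs, the function name and the example sequence are all assumptions made for the example.

```python
def nussinov(sequence, min_loop=3):
    """Return (max_pairs, pairs) for a maximum set of nested base pairs."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(sequence)
    # best[i][j] = maximum number of pairs in the subsequence i..j (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (sequence[k], sequence[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: revisit the choices that produced the optimal score.
    pairs = []
    todo = [(0, n - 1)]
    while todo:
        i, j = todo.pop()
        if j - i <= min_loop:
            continue  # interval too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (sequence[k], sequence[j]) in allowed:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    pairs.append((k, j))
                    todo.append((i, k - 1))
                    todo.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), sorted(pairs)


print(nussinov("GGGAAAUCC"))
```

The triple loop over i, j and k gives cubic time and quadratic memory. Each sub-interval is solved once and reused, and the recursion only ever splits an interval into nested sub-intervals, which is why crossing (pseudoknotted) pairs cannot be produced by it.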
The prediction of single optimal structures has received the most attention, but it is
worthwhile to explore a broader energy and conformational landscape. RNA structures
are dynamic. In slightly different environments, such as varying temperature and
salt concentration, the conformations of RNA molecules vary widely. Even in a uniform, stable
environment, RNA molecules exist as an ensemble of structures instead of as a single
specific conformation. This ensemble is a distribution of structures, in
which the expected frequency of each structure is determined by its free energy.
McCaskill described this ensemble of structures in terms of a partition function (69), which
allows the calculation of structural melting curves, base-pair binding probabilities and
frequencies of possible structures at thermodynamic equilibrium. Basically, it relates
the RNA folding problem to the Boltzmann distribution. This ensemble structural
partition-function algorithm can be derived from the MFE algorithm by substituting the
minimum operations by summations and additions by multiplications. Suboptimal
structure prediction involves searching the near optimal conformational space for
structures with low, but greater than minimum, free energies. There are some
algorithms that compute MFE and near optimal (suboptimal) structures. Some of them
include pseudoknot prediction. Hofacker et al. implemented both MFE and partition
function algorithms in the Vienna RNA package (85). Similar to the Vienna RNA package,
UNAFOLD (100) also provides suboptimal structure predictions. Both packages identify an
MFE structure using DP, and identify base-pairs that have the potential to form
pseudoknots.
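The relationship between free energy and expected frequency in the ensemble can be illustrated for a small, explicitly listed set of structures. In the sketch below the three free energies are hypothetical values in kcal/mol, the temperature of 310.15 K (37 °C) is an assumption, and the partition function is a plain sum over the listed structures rather than McCaskill's dynamic programming recursion.

```python
from math import exp

GAS_CONSTANT = 1.987e-3  # kcal / (mol K)


def boltzmann_probabilities(energies, temperature=310.15):
    """Equilibrium probability of each structure from its free energy (kcal/mol)."""
    rt = GAS_CONSTANT * temperature
    lowest = min(energies)
    # Shifting by the lowest energy leaves the probabilities unchanged
    # and avoids very large exponentials.
    weights = [exp(-(energy - lowest) / rt) for energy in energies]
    partition = sum(weights)  # partition function of the (shifted) ensemble
    return [weight / partition for weight in weights]


# Hypothetical free energies for an MFE structure and two suboptimal structures.
probabilities = boltzmann_probabilities([-10.0, -9.4, -8.0])
print(probabilities)
```

Even a difference of well under 1 kcal/mol leaves a suboptimal structure with a substantial share of the ensemble, which is the motivation for looking beyond the single MFE structure.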
Pseudoknot prediction is computationally challenging for such MFE structure prediction
algorithms because pseudoknot base-paired regions are not nested in linear order along the
sequence, which breaks the basic recursion. The standard DP recursion cannot deal with this
kind of non-nested
problem. Pseudoknots can be predicted by heuristic approaches; however, some of these
approaches are restricted to certain types of
pseudoknots, and the overall performance of such heuristic approaches is good only when
the RNA sequence is short. Rivas and Eddy proposed a new approach using DP to predict
single MFE structures with pseudoknots (95). More recently, Reeder and Giegerich
reduced the computational complexity by restricting the prediction to only certain types
of pseudoknots (which are easy to compute) (94,106). Ren et al. applied Zuker’s
algorithm to predict substructures and identify pseudoknots. Among those
substructures, they search for the energetically favorable pseudoknots (90).
Sperschneider et al. used base pair probability dot plots, produced by RNAfold from
the Vienna RNA package, to identify candidate substructures and predict pseudoknots
(107,108).
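What makes a structure pseudoknotted can be stated compactly: two base pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below lists all crossing pairs in a structure given as 0-based (i, j) tuples; the function name and examples are hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def crossing_pairs(pairs):
    """Return all pairs of base pairs ((i, j), (k, l)) with i < k < j < l."""
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    crossings = []
    for index, (i, j) in enumerate(ordered):
        # Later pairs start after i, so only k < j < l needs to be checked.
        for k, l in ordered[index + 1:]:
            if k < j < l:
                crossings.append(((i, j), (k, l)))
    return crossings


nested = [(0, 9), (1, 8), (2, 7)]
knotted = [(0, 9), (1, 8), (4, 13), (5, 12)]
print(len(crossing_pairs(nested)), len(crossing_pairs(knotted)))
```

A structure is pseudoknot-free exactly when this list is empty, which is the condition the nested DP recursions rely on.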
Because the ensemble of structures captures the dynamics of RNA molecules, I believe it could
be used to enhance the identification and classification of RNAs. Pseudoknots can
potentially be predicted with higher confidence using dynamic information. The XIOS
framework I propose in this thesis provides the ability to specifically include multiple
structures as well as pseudoknots in a single graph representation.
The predictions of thermodynamic energy model approaches mainly depend on the
thermodynamic parameters. But there are doubts about the reliability of the optimization
approaches and about the precision of the energetic parameters. A better understanding of
the thermodynamic parameters would provide better predictions, but there are some
limitations. The Watson-Crick and wobble base pairs are not the only base pair
interactions in RNA. Hoogsteen base pairs and pyrimidine-pyrimidine base-pairs are
frequently found in hairpin loops and the GNRA-tetraloops. Tandem A-G pairs and
non-Watson-Crick A-U pairs are also present in the three-way junction of the hammerhead
ribozyme. Trans Hoogsteen/sugar edge base-pairing in RNA plays an important role in
stabilizing folded RNA molecules. Those noncanonical (non-Watson-Crick) base-paired
regions are not considered in existing prediction programs. The presence of such
noncanonical regions will lead to incorrect local or global predictions. Without
considering noncanonical interactions and possible RNA protein binding interactions, we
cannot get a reliable energy model for RNA structure prediction, which is also true for
protein structure prediction.
In short, in the computational biology area, efficient RNA secondary structure prediction
algorithms have been available since the 1980s. The accuracy of these algorithms,
handicapped by biochemical and computational limitations, is still not very satisfactory.
Currently, the accuracy of RNA secondary structure prediction is around 80% (60-80%
for pseudoknot structure predictions). Those programs can be used to obtain a general
scaffold of RNA structure, which still needs further validation and support. Meanwhile,
there is plenty of room to improve the prediction accuracy, for example by incorporating
noncanonical interactions into the prediction models. In my thesis work, I use existing
programs to predict RNA secondary structures of RNA families (if there is no structure
available). Despite their unsatisfactory performance, at least some of the structure
prediction results still capture some true aspects of the conserved structure patterns. By
combining all the predicted structures together, my method is still able to find the
important structure patterns conserved across an RNA family and achieve the goal of my
study.
1.5 Decision making in novel molecule design
In the modern age, with the advances in biology, chemistry, laboratory technologies and
equipment, we understand and realize that many small natural products can benefit
human beings. One well known example is the discovery of penicillin antibiotics. It was
the first drug against many previously serious diseases and infections. The average
human life span was increased 8 years due to the introduction of antibiotics. Also, we
have been designing and synthesizing many kinds of substances to accomplish desirable
tasks, such as the famous synthetic pesticide dichlorodiphenyltrichloroethane (DDT).
With the discoveries of the effects of natural products and chemical intuition about
them, synthetic chemists use synthetic organic techniques to synthesize compounds
with desirable effects. We use synthetic products as a basic part of our daily life: in food
products, commodities and drugs. In cancer research, scientists would like to find drugs
that target cancer cells in order to cure cancer. Virologists are also looking for ways to
inhibit viral effects on hosts.
But the big problem is that we still do not have a deep understanding of these chemicals.
The need to have ways to describe molecules in terms of mathematical or logical rules
has never been greater. Without them there is no means to efficiently design new
molecules with novel effects, and progress would be slow and costly. Structure
determines function. The understanding of molecular structures and knowledge of their
structural diversity are essential in aiding the novel synthetic compound design process.
There are now two main strategies in modern drug discovery: target oriented synthesis
(TOS) and diversity oriented synthesis (DOS) (109). Both of them involve small molecule
high throughput screening (HTS) for molecules that can bind to specific targets. Planning of
the small molecule library is the most important step related to the efficiency of the
drug discovery process and result. With predictive models and classification methods,
based on structure to function correlations, one could provide practical insights into
construction, selection, and screening of small molecule libraries. Chemists can prioritize
the synthetic process and focus on the small molecules with higher likelihood of having
higher binding specificity to the target.
Similarly in the case of RNA, with strategies such as in vitro evolution and the knowledge
of the catalytic functions of various RNA molecules, it is possible to make RNA molecules
for predefined purposes, such as RNA catalysts. This has been called evolutionary
biotechnology or applied molecular evolution (110-113), and it applies both to the
pharmaceutical drug engineering described above and to biotechnology related applications. Nowadays, the
standard procedure is to perform selection experiments on a large pool of partially or
completely randomly synthesized nucleotide sequences for desired functions (114-118).
The strategy is similar to DOS in organic chemical synthesis, and is expensive and
inefficient without guidance from predictive models.
Computationally, understanding the link between RNA structure and its function would
give a solid foundation for computational predictive models, which calculate the
likelihood of having a specific desired function for a given RNA sequence.
1.6 Summary/organization of this work
RNA related research is the shining star of the booming next-generation sequencing
(NGS) industry. Research focus is shifting from protein to protein-RNA complexes and
RNA molecules. At the current stage, high throughput techniques have enabled
researchers to collect enormous amounts of genomic and transcriptomic data at an ever
increasing speed. It is not feasible for human beings to analyze and annotate data on this
scale manually. This is the perfect time for bioinformatics and computational
biology to aid the hypothesis generating process.
In Chapter 2, Pattern matching in RNA structures, I present a new graph theoretic RNA
secondary structure representation, XIOS, which describes RNA secondary structures on
a topological basis, and includes pseudoknots. XIOS also includes ensembles of
structures in a single graph in order to capture the dynamics of the RNA molecule. It
allows analysis of RNA structures using graph matching approaches, such as similarity
comparison and classification.
Chapter 3, RNA structural fingerprint. This chapter describes a new concept, the RNA
structural fingerprint, that describes the topological characteristics of an ensemble of
RNA suboptimal structures. A small structural motif library has been constructed by
graph enumeration, and an advanced graph indexing technique is modified and
improved to speed up the fingerprint searching process.
Chapter 4, Matching unknown RNA structures. This chapter demonstrates a database
searching tool which identifies topologically similar structures based on a query
structure. It uses the XIOS representation as a core framework and applies a modified
advanced graph indexing technique to provide fast structure searching. The
search result provides insight into the possible function of the query structure.
Chapter 5 describes an RNA structure topological feature selection study, based on the RNA
structural fingerprint, in which a feature selection method is applied to study the
correlation between topological features and function. It reveals important structural-
topological patterns that provide information that can be used to relate RNA structural
topology to RNA function. I demonstrate that topological patterns can be used to
classify known types of RNAs.
1.7 Data sources
1.7.1 Manually curated dataset
For this thesis, we have identified a diverse set of RNA sequences of varied lengths
which are involved in a variety of biological processes: tRNA, tmRNA, RNase P RNA and
Group I Intron RNA (Table 1.1) from different reliable resources (119-122). Each dataset
contains only non-redundant sequences with low sequence similarity (< 50%). Further
steps taken in curating datasets for each category are detailed below.
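The redundancy filter applied to each dataset can be pictured as a greedy pass that keeps a sequence only if it is at most 50% identical to every sequence already kept. The sketch below is an illustration under strong assumptions: the identity measure is a toy position-by-position comparison without alignment, the function names and example sequences are hypothetical, and the procedure actually used to purge the datasets is not specified here.

```python
def fraction_identical(seq_a, seq_b):
    """Toy identity: matching positions divided by the longer length (no alignment)."""
    longest = max(len(seq_a), len(seq_b)) or 1
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / longest


def purge_redundant(sequences, threshold=0.5, identity=fraction_identical):
    """Greedily keep sequences whose identity to all kept sequences is <= threshold."""
    kept = []
    for sequence in sequences:
        if all(identity(sequence, other) <= threshold for other in kept):
            kept.append(sequence)
    return kept


# Hypothetical input: the second sequence is nearly identical to the first.
print(purge_redundant(["GGGGAAAA", "GGGGAAAU", "CCCCUUUU"]))
```

The result of a greedy pass depends on input order, so in practice the sequences would be sorted (for example by annotation quality) before filtering, and the toy identity function would be replaced by one based on a pairwise alignment.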
1.7.1.1 tRNA
Sixteen non-redundant tRNA sequences with crystal structures (resolution < 3 Å) were obtained from the
Protein Data Bank (PDB) (120). Base pairing information was extracted by RNAView
(123); single base-pairs were not included. The PDB IDs are: 1C0A, 1F7U, 1GAX, 1H4S, 1QF6,
1QTQ, 1QU2, 1TTT, 2BTE, 2CSX, 2DXI, 2FMT, 2ZM5, 2ZUF, 2ZZM, 3EPH. Note that
non-Watson-Crick base-pairs are often formed adjacent to the stems of the RNA cloverleaf
secondary structure, giving rise to at least one pseudoknot in most tRNA tertiary
structures.
1.7.1.2 RNase P RNA
Representative RNase P RNA sequences were selected from classes as defined by Ellis
and Brown (124). In each case, all sequences of a class were retrieved from the RNaseP
database (125) and purged to remove sequences with greater than 50% sequence
identity. Stems and pseudoknots were then reviewed manually and adjusted for
consistent labeling. A total of 39 curated structures were obtained from the following
classes: A1 (5), A2 (2), A3 (1), A4 (2), A5 (2), AX (2), B1 (7), B2 (1), B3 (3), BX (1), C (1),
Archaeal type A (11), and Archaeal type M (3). Secondary structures and pseudoknots
were assigned according to Ellis and Brown (124) and folding diagrams in the RNase P
database entries (125). A complete list of sources is given in Table 1.2.
1.7.1.3 Group I Intron RNA
152 sequences were downloaded from the RNA STRAND database (121). The shortest
and longest 10% of the sequences were removed on the assumption that these were
most likely to be incomplete or poorly annotated. Sequences with greater than 50%
sequence identity were purged leaving 36 sequences ranging from 240 to 602 bases in
length. With one exception, PDB structure 1L8V, for which the RNA structure is assigned
by RNAView (126), stems and pseudoknots were assigned according to expert curation
in the CRW database (127). A complete list of the sequences is given in Table 1.3.
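The length-based trimming step described above can be sketched as follows. This is a minimal illustration, not the script used for the thesis; the function name `trim_length_extremes` and the choice to round the 10% cut-off down are assumptions.

```python
def trim_length_extremes(sequences, fraction=0.10):
    """Drop the shortest and the longest `fraction` of the sequences."""
    ordered = sorted(sequences, key=len)
    # Assumption: the number removed at each end is rounded down.
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


# Toy usage with placeholder sequences of lengths 1..20.
toy = ["A" * n for n in range(1, 21)]
kept = trim_length_extremes(toy)
print(len(kept), len(kept[0]), len(kept[-1]))
```

Under these assumptions, a starting set of 152 sequences would lose 15 sequences at each end before the sequence-identity purge is applied.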
1.7.1.4 tmRNA
632 complete sequences (from 514 species) with structural assignments were obtained
from the tmRNA website (128). Sequences were purged to remove sequences with >40%
sequence identity, leaving 165 sequences. 48 of these sequences contained asterisks,
indicating the absence of some bases, and were removed. The final dataset consists of
117 sequences, with sequence lengths ranging from 230 to 393 bases. The structure
assignments in the tmRNA database are used as the curated structures. A complete list
of sequences is given in Table 1.4.
1.7.2 STRAND dataset
RNA STRAND (the RNA secondary STRucture and statistical Analysis Database) is a
database with a comprehensive collection of known RNA secondary structures
(experimentally solved and computationally predicted) from different organisms. The
datasets collected from STRAND correspond to the four RNA families in our manually
curated dataset: tRNA, RNAseP, Group I intron RNA and tmRNA (Table 1.5).
Compared to the manually curated dataset, the dataset from STRAND is a mixture of
reliable structure data, partial structures, noise, and even misclassified entries.
The manually curated dataset is clean and of high quality. The STRAND dataset contains
structure data of all kinds, good and bad, and represents the average quality of
other RNA structural databases.
Figure 1.1 Common types of Pseudoknots
Figure 1.2 Common RNA secondary structure representations.
A. Stem-loop diagram depicts RNA secondary structure elements: S stacking base pair (or stem), H hairpin loop, I interior loop, B bulge and M multiple loop. B. Stem-loop diagram with Pseudoknot. C. Circle Plot. The backbone nucleotides of RNA are arranged along a circle, and base pairs are drawn as arcs. D. Dome Plot. The backbone nucleotides of RNA are placed in a line, and base pairs are drawn as arcs. E. Circle plot including a Pseudoknot. Pseudoknot structure is indicated by arcs crossing each other. F. Dome plot including a Pseudoknot. Pseudoknot structure is again indicated by the arcs crossing each other. G. Mountain Plot. The x-axis of the Mountain plot corresponds to the RNA sequence, and the y-axis shows the number of base pairs in which a specific nucleotide is enclosed. H. Primary sequence (X means any one of the four nucleotides) and its linear dot-bracket representation. A dot indicates a non-base-paired position, and a pair of matching brackets at positions i and j indicates there is a base-pair (i, j) between positions i and j.
Figure 1.2
Figure 1.3 Rooted-labeled tree
Closed circles represent base-pairs, and leaf nodes represent unpaired nucleotides. The root of the tree, box, is a dummy node added as the root of the tree graph which serves as the parent of all nodes in the tree to ensure structures with free end(s) are not represented by a forest (disconnected trees). Stems are rope-like and loops are bush-like.
Figure 1.4 Dot plots
Two types of dot plots are shown here. This is a tRNA example. Center is the stem-loop diagram showing the familiar tRNA cloverleaf. Left. Partition function dot plot is a base pairing binding probability matrix. It is a visualization of the thermodynamics of an ensemble of structures. The color of the dot reflects the negative log of base pair binding probability of the base pair. The order from red to brownish, to green and then to blue is the decreasing order of the probability. Right. Energy dot plot shows, for a specific base pair, the lowest free energy for a structure that ends at this base pair. The order from red to brownish, to green and then to blue is the increasing order of the free energy. Four stem regions are highlighted by the colored boxes. Dot plots are generated by the RNAStructure software (1).
Figure 1.4
Figure 1.5 RNA tree graph, RNA dual graph and RNA digraph representations.
A is the tRNA secondary structure in its squiggle notation. B is its tree graph representation. Each vertex of the tree graph is a loop region (the 3’ and 5’ ends of a stem are also considered a loop region), and edges represent stems. C is its dual graph representation. Vertices represent stems, and edges are loop regions (the 3’ and 5’ ends of a stem are not a loop in this representation). D is the digraph representation. This is an RNA dual graph with directed edges. The direction of the edges can resolve some ambiguity in representing RNA topologies.
Table 1.1 Manually curated structures.
Dataset             tRNA       RNAseP                 Group I intron  tmRNA
Sample size         16         40                     36              117
Source              PDB (120)  RNAseP database (122)  STRAND (121)    The tmRNA Website (119)
Min graph size      3          11                     9               4**
Max graph size      6          26                     25              22
Average graph size  5.25       19.55                  16.11           16.65
** tmRNA has an outlier, tmRNA/BaciPhage_G; this structure has only 4 stems.
Table 1.2 RNaseP Sequences Used.
RNaseP stems and pseudoknots were assigned based on expert curation in the RNAseP database, and structural types were assigned according to Ellis and Brown (2009). All structural assignments were manually reviewed; in some cases minor adjustments had to be made to the RNaseP database structures to make the labeling of stems consistent across all structures.
Species                                 Length (bases)  Type
Buchnera APS (a)                        376             A1
Carboxydothermus hydrogenoformans       331             A1
Neisseria meningitidis                  360             A1
Pseudomonas fluorescens                 354             A1
Serratia marcescens                     378             A1
Cupriavidus necator                     341             A2
Nitrosomas europaea                     285             A2
Chlamydia pneumoniae                    406             A3
Anacyctis nidulans                      385             A4
Pseudoanabaena PCC6903 (b)              450             A4
Bacillus pertussis                      414             A5
Chlrobium tepidum                       381             A5
Mycobacterium leprae                    423             AX (d)
Streptomyces lividans                   405             AX (d)
Bacillus anthracis                      408             B1
Bacillus magaterium                     408             B1
Entercoccus faecalis                    389             B1
Mycoplasma capricolum                   356             B1
Staphylococcus epidermis                401             B1
Staphyloccus gordonii                   382             B1
Ureaplasma urealyticum                  370             B1
Mycoplasma flocculare                   412             B2
Mycoplasma fermentans                   302             B3
Mycoplasma pneunomiae                   369             BX (c)
Thermomicrobium roseum                  350             C
Aeropyrum pernix                        330             Archaeal type A
Halobacterium cutirubrum                375             Archaeal type A (e)
Halococcus morrhuae                     475             Archaeal type A (f)
Metallosphaera sedula                   304             Archaeal type A
Methanobacterium thermoautotrophicum    293             Archaeal type A
Methanosarcina barkeri                  371             Archaeal type A
Natronobacterium gregoryi               474             Archaeal type A
Pyrococcus abyssi                       330             Archaeal type A
Sulfolobus acidocaldarius               315             Archaeal type A
Sulfolobus solfataricus                 311             Archaeal type A (h)
Thermoplama volcanum (g)                305             Archaeal type A
Archeoglobus fulgidus                   229             Archaeal type M
Methanococcus jannaschii                252             Archaeal type M
Methanococcus maripaludus               233             Archaeal type M
a Also known as Ralstonia or Alcaligenes eutrophus
b Labeling of stems does not fall easily into standard scheme due to second stem coming off of L15
c This structure is midway between B1 and B2 having stem 10.1 and P9, but lacking P19. Also has an extra pseudoknot between the L9 and the region before P20.
d Clearly A type due to presence of P6, P13 and P14 and lack of P15.1. Has additional stem (annotated in this work as P16.1) coming off of L15.
e This structure is difficult to label due to three stems branching from P12. In this work these were annotated as P12.1 - P12.3. The structure given for the P15-P17 region may not be correct.
f RNaseP database annotated structure may not be entirely correct.
g RNaseP database diagram labelled T. volvanum
h No RNAML file available, structure annotated based on .ct file and structure diagram.
Table 1.3 Group I Intron Sequences Used.
RNA Strand ID  CRW ID (a)                                       Genbank Accession
CRW_00009      b.I1.e.C.hypophloia.E.SSU.989.bpseq              AF015912
CRW_00012      b.I1.e.H.rubra.1.C1.SSU.1506.bpseq               L19345
CRW_00015      b.I1.e.M.anisopliae.2.E.LSU.2066.bpseq           AF197123
CRW_00608      a.I1.b.Dermocarpa.sp.ATCC29371.C3.tMET.bpseq     U10480
CRW_00609      a.I1.b.P.hollandica.1.C3.trnL.bpseq              U29955
CRW_00619      a.I1.c.N.tabacum.C3.tLEU.bpseq                   M16898, Z00044
CRW_00626      a.I1.e.A.adeninivorans.C1.LSU.2449.bpseq         Z50840
CRW_00633      a.I1.e.B.ciliata.JCM6865.C1.SSU.1506.bpseq       D38233
CRW_00634      a.I1.e.B.ciliata.JCM6865.C1.SSU.943.bpseq        D38233
CRW_00637      a.I1.e.C.botrytis.C1.SSU.1506.bpseq              X77453
CRW_00639      a.I1.e.C.ellipsoidea.IAMC-87.C1.SSU.1506.bpseq   D13324
CRW_00641      a.I1.e.C.grayi.UNK.SSU.1046.bpseq                Z14026
CRW_00642      a.I1.e.C.grayi.UNK.SSU.1516.bpseq                Z14026
CRW_00643      a.I1.e.C.luteoviridis.B.C1.SSU.1052.bpseq        X73998
CRW_00645      a.I1.e.C.merochlorophea.UNK.SSU.1210.bpseq       Z14025
CRW_00651      a.I1.e.C.saxonicum.C1.SSU.1506.bpseq             X79497
CRW_00652      a.I1.e.C.sorokiniana.C1.SSU.323.bpseq            X73993
CRW_00653      a.I1.e.D.parva.C1.SSU.1512.bpseq                 M62998
CRW_00656      a.I1.e.E.dermatitidis.C1.SSU.943.bpseq           Z75304
CRW_00657      a.I1.e.G.planctonica.C1.SSU.943.bpseq            Z28970
CRW_00658      a.I1.e.G.spirotaenia.C1.SSU.1506.bpseq           X74753
CRW_00659      a.I1.e.L.dispersa.UNK.SSU.1046.bpseq             L37734
CRW_00662      a.I1.e.L.dispersa.UNK.SSU.1516.bpseq             L37734
CRW_00664      a.I1.e.L.dispersa.UNK.SSU.516.bpseq              L37734
CRW_00673      a.I1.e.P.clavariiformis.C1.SSU.943.bpseq         AB003945
CRW_00687      a.I1.e.S.paniceum.UNK.SSU.1052.bpseq             D49657
CRW_00689      a.I1.e.S.paniceum.UNK.SSU.1210.bpseq             D49657
CRW_00690      a.I1.e.S.paniceum.UNK.SSU.1506.bpseq             D49657
CRW_00692      a.I1.e.S.sclerotiorum_1837.C1.LSU.798.bpseq      AJ226089
CRW_00694      a.I1.e.Staurastrum.sp.M753.C1.SSU.1506.bpseq     X77452
CRW_00703      a.I1.m.M.grisea.B2.ND1.bpseq                     X96412
CRW_00715      a.I1.m.S.luteus.A1.LSU.2504.bpseq                L47586
CRW_00717      a.I1.m.S.luteus.B4.LSU.1923                      L47586
CRW_00721      a.I1.m.S.pombe.B1.OX1.bpseq                      M15669
CRW_00814      b.I1.e.B.fuscopurpurea.7.C1.SSU.516.bpseq        AF172557
PDB_00140 (a)  PDB 1L8V
a the final structure is from PDB not CRW
Table 1.4 tmRNA Sequences Used
Sequence ID  Length (Bases)   Sequence ID  Length (Bases)   Sequence ID  Length (Bases)
Acibm_capsu  339              Fibro_succi  346              Propi_acne2  354
Acthi_ferr2  353              Frnkia_EAN1  367              Psalt_atlan  340
Alkph_metal  334              Fusob_nucl2  329              Psmon_aerug  337
Aquif_aeoli  336              Geobl_stear  334              Rbrbr_xylan  343
Artbr_sFB24  354              Geobr_metal  340              Rhopi_balti  366
Bacil_liche  340              Graci_tenui  342              Rumin_albus  339
BaciPhage_G  264              Guill_theta  267              Salbr_ruber  355
Bdell_bacte  333              Helcb_hepat  349              Solib_usita  343
Bifid_long2  385              Helcb_muste  343              Stapc_aure3  347
Bloch_flori  375              Helcb_pylo2  365              Stmyc_averm  372
Bloch_penns  374              Kineo_radio  348              Strpc_zooep  330
Borde_pertu  371              Lacbl_casei  343              Sulfh_azore  336
Borre_garin  349              Lacbl_plan2  350              Syntr_acidi  341
Brevi_linen  364              Laccc_lact1  335              Tanne_forsy  380
Buchn_aphi1  352              Lactb_aciph  347              Thala_pseud  254
Buchn_aphi3  351              Lawso_intra  364              Thana_tengc  338
Bvora_marin  374              Leifso_xyli  379              Thdsb_commu  339
Caldc_sacch  354              Lepta_inter  336              Therv_yello  338
Campbr_lari  340              Leptm_grup2  348              Thmic_cruno  340
Chlbm_tepid  393              Leuno_mesen  339              Thmic_roseu  339
Chlfl_auran  345              Magcocc_MC1  343              Thmus_ther2  333
Chrbm_viol2  351              Marbr_aquae  350              Thtog_neapo  343
Clavi_michi  360              Mecoc_capsu  339              Trepo_denti  339
Clost_aceto  340              Mesos_virid  298              Trepo_palli  342
Clost_diffi  331              Micbu_degra  373              Troph_whipp  347
Clost_perf2  344              Moore_therm  342              Uncul_bone3  343
Clost_therm  374              Mycpl_arthr  371              Uncul_farm3  352
Copth_prote  337              Mycpl_mobil  355              Uncul_farm7  346
Coxie_burne  348              Mycpl_pneum  368              Uncul_flatw  280
Crbth_hydro  344              Mycpl_pulmo  354              Uncul_phako  356
Cyanp_parad  231              Myxoc_xanth  351              Uncult_HF10  323
Cytph_hutch  389              Nephr_oliva  264              Verru_spino  336
Dehco_ethen  339              Nostoc_7120  374              Wiggl_gloss  344
Deinc_radio  336              Ntsco_ocean  345              Wolin_succi  349
Desfm_aceto  336              Ocebl_iheyi  343              Xanmo_camp1  380
Destl_pschr  347              Odont_sinen  281
Dicgl_therm  337              Oenoco_oeni  333
Diche_nodos  335              Paeni_larva  345
Emiln_huxle  230              Phobm_profu  351
Entco_faecm  348              Porpa_purpu  266
Exiguoba_sp  346              Prevo_inter  387
Table 1.5 Dataset collected from STRAND database.
The datasets collected here correspond to the four RNA families in our manually curated dataset: tRNA, RNAseP, Group I intron RNA and tmRNA.
Dataset             tRNA (> 50 nt)  RNAseP (100 - 300 nt)  Group I intron (100 - 300 nt)  tmRNA (100 - 300 nt)
Sample size         601             36                     21                             30
Min graph size      3               5                      7                              8
Max graph size      6               22                     15                             23
Average graph size  4.13            11.47                  11.05                          16.71
CHAPTER 2 PATTERN MATCHING IN RNA STRUCTURES1
2.1 Introduction
RNA molecules perform a variety of important biological functions in addition to
carrying information from the chromosome to the ribosome, or acting as structural
scaffolds. Catalytic RNAs play key roles in translation, RNA processing and splicing, and
gene regulation (36). Motifs that are important for RNA function are structural and
correspond to base-paired regions of secondary structure, which in turn, provide the
scaffold for the three-dimensional fold of the RNA (129,130). RNA sequences that have
the same structural motifs may have sequences that are impossible to align because
they have no detectable sequence similarity.
While programs that predict RNA secondary structure have been available since the
1980s, RNA structure prediction is handicapped by both biochemical and computational
limitations. Firstly, RNA exists as an ensemble of rapidly interconverting structures.
Protein structures (usually) show relatively minor fluctuations from a single minimum
free-energy state. The case is much different for RNA where there are usually many
1 This is the paper published in the Proceeding of 2008 International Symposium on Bioinformatics Research and
Applications (ISBRA2008). My contributions were participating in the algorithm and experiment design, implementing the algorithm, analyzing the data and results, making figures and writing the manuscript.
Full reference: Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.
structures with similar free-energies; these structures may be distinctly different in
terms of base-pairing (67,97). Secondly, while we know that pseudoknot structures are
very important in RNA structure and catalytic function (49), it remains difficult to
reliably predict pseudoknotted structures. This is due both to our incomplete
understanding of the energetics of pseudoknot formation, as well as to the
computational time complexity. The most efficient pseudoknot prediction algorithms,
e.g., pknotRG, run in O(n^4) time for certain classes of RNAs (94), but achieve this by
placing significant limitations on which structures can be found. Memory complexity of
RNA structure prediction is O(n^2), where n is the length of the RNA sequence, which
usually ranges from 10,000 to 100,000 bases for primary RNA transcripts.
In biology, functionally important features can often be recognized because they are
conserved over evolutionary time. A common approach is to obtain a set of sequences
using some biological criterion (such as similarity of regulation), and use pattern
recognition methods to identify unusually conserved features. Searching for sequence
motifs (approximately common substrings) in this way has been a powerful tool for
analysis of DNA and proteins; this approach does not work as effectively with RNA
because conserved RNA structures may have no detectable sequence similarity. And
while great progress has been made, it remains difficult to accurately predict MFE
structures for RNA sequences. To further complicate the picture, RNAs exist as
ensembles of structures, in addition to the MFE structure, that are constantly
interconverting and fluctuating. The biologically important structures (those that are
conserved over evolutionary time) may be present only transiently, or as minor
components of this structural ensemble. The problem is further complicated by the fact
that biology is messy; one can rarely get completely clean sets of sequence data in
which every sequence actually contains the structure of interest. This makes many
approaches infeasible. In addition, in biological systems, conservation is only
approximate; no set of structures will match exactly.
We are building a system that allows one to find the greatest approximately conserved
structure(s) in a set of RNA sequences, in the presence of extraneous sequences that do
not share a common structure. This conserved common structure can then be used as
the basis for hypotheses about the importance of the structure in the biological
functioning of the RNA. These hypotheses can be tested either experimentally or by
further computational work.
We convert RNA structures to a graph representation that specifically includes
pseudoknots and is capable of representing an ensemble of RNA structures in a single
graph. Computationally, finding conserved structures corresponds to finding the
greatest approximately isomorphous subgraphs in a set of graphs, where each graph
represents a single RNA sequence. We use modifications of existing maximal subgraph
isomorphism algorithms to identify the similar portions of the graphs, and propose to
combine this with constrained MFE structure prediction tools (131), and a database
search capability.
Graph theoretical approaches have previously been applied to RNA structures (74,132),
but our approach differs significantly. The XIOS approach introduces the ability to
represent ensembles of structures, and emphasizes the topology of stems. Our
approach is most similar to that of Gan et al., but focuses on stem topologies rather
than the topology of loops and bulges (74). The XIOS approach also allows structural
motifs to be exactly matched without using heuristics (132).
2.2 XIOS RNA graphs
In this section, we describe the graph framework that we have developed to represent
ensembles of RNA structural topologies. We introduce the XIOS RNA graph
representation for RNAs, and discuss extensions to existing subgraph isomorphism
algorithms as they apply to XIOS RNA graphs.
2.2.1 Definition
XIOS RNA graphs represent ensembles of RNA structural topologies. In XIOS graphs,
each base-paired stem is represented by a vertex, and the edges connecting the vertices
indicate the topological relationship between the stems. Topologically, two stems can
be eXclusive (X, i.e., both cannot form simultaneously because they use the same
sequence ranges), Included (I, i.e., one is nested within the loop of the other; I
indicates the direction of the edge with respect to the higher numbered vertex and J
indicates the opposite), Overlapping (O, i.e., the stems have a pseudoknot relationship)
or Serial (S, i.e., adjacent, non-overlapping stem and loop structures) (Figure 2.1).
Each pair of vertices is related by one and only one X, I, O or S relationship.
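The four relationships can be illustrated with a short sketch. This is not the thesis's Perl implementation: the `Stem` coordinate layout, the function names, and the example coordinates (an idealized cloverleaf, not taken from a specific structure) are assumptions made for illustration, and the I/J direction is deliberately left out.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    # Hypothetical layout: the 5' half spans left_start..left_end and the
    # 3' half spans right_start..right_end (1-based, inclusive positions).
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def _overlap(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def relation(p: Stem, q: Stem) -> str:
    """Classify two stems as X, I, O or S (direction of I is ignored)."""
    halves_p = [(p.left_start, p.left_end), (p.right_start, p.right_end)]
    halves_q = [(q.left_start, q.left_end), (q.right_start, q.right_end)]
    # eXclusive: the stems compete for the same sequence positions.
    if any(_overlap(*hp, *hq) for hp in halves_p for hq in halves_q):
        return "X"
    first, second = (p, q) if p.left_start < q.left_start else (q, p)
    if first.right_end < second.left_start:
        return "S"  # Serial: first stem closes before the second opens
    if second.right_end < first.right_start:
        return "I"  # Included: second stem lies inside the loop of the first
    return "O"      # Overlapping: the half stems interleave (pseudoknot)


acceptor = Stem(1, 7, 66, 72)
d_stem = Stem(10, 13, 22, 25)
anticodon = Stem(27, 31, 39, 43)
print(relation(acceptor, d_stem), relation(d_stem, anticodon))
```

Once the X case has been excluded, the remaining three branches cover every possible ordering of the four half stems, which is why exactly one label is returned for each pair.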
2.2.2 Training data
We have developed Perl packages that translate Vienna RNA format (85) and the
MFOLD (83) connect format into XIOS graphs. Because the predicted MFE structure is
only one structure in a structural ensemble, we enumerate all energetically favorable
short stems and label the entire set as X, I, O, and S, as described above. The graph is
therefore an image of the entire structural ensemble. Our test datasets are described in
Table 2.1. Highly similar sequences with sequence identity >40% are removed from the
dataset to avoid selection bias.
2.2.3 DFS Lexicographical ordering
DFS (Depth-First Search) lexicographical ordering was originally developed by Yan et al.
(133,134) in their gSpan algorithm for identifying common chemical structures in
chemical datasets. In the chemical structure case, both the vertices (atoms) and edges
(bonds: single, double and triple) are labeled, and all edges are undirected. gSpan is a
powerful search algorithm that reduces the search space for isomorphous subgraphs
using a clever DFS preordered search tree.
The traversal order of edges and vertices in the DFS of a graph can be canonically
ordered. This is called the DFS tree, or when serialized, the DFS code. Yan et al. proved
that graphs with the same DFS code are, by definition isomorphous. Lexicographic rules
provide an unambiguous best order to the canonical DFS code (133).
The direct path from the first traversed vertex (root) to the most recently added vertex
(right-most vertex) is the right-most path. The extension of DFS graphs by edge growth
is restricted to extension from the rightmost path, similarly to the approach of
TreeminerV (135). Graphs are extended in the following order: edges to existing vertices
(backward edges), edges to new vertices extending from the right-most vertex, and
extension from internal vertices on the right-most path. An intrinsic property of the DFS
lexicographical ordering is that it creates a preorder that can be used to efficiently
explore the search tree when searching for isomorphous subgraphs. Isomorphic forms
of a graph fall in different positions in the search tree, but the canonical DFS
representation of a particular isomorph is guaranteed to be found first. Hence, the
lexicographically first instance of an isomorph in the search tree is its minimum
representation or canonical labeling and other instances can be efficiently pruned. Each
edge in the DFS code is described by a 3-tuple, (v_i, v_j, l_{i,j}), where v_i and v_j are
two connected vertices and l_{i,j} is the label of the edge. Figure 2.3 shows how the
canonical labeling can easily be identified using lexicographic rules even though many
different DFS codes are possible. There are two additional rules that prune the search
space. Firstly, if the initial edge of a minimum DFS code is type e_0, then no following
edge can have a lexically smaller edge label, and secondly, for any backward edge growth
to v_j, an edge cannot be lexically smaller than any edge that is already connected to
v_j or v_rightmost (133). Each distinct mapping of vertices to a DFS code is a support
for that potential solution. Since many such mappings are possible, each graph may have
multiple supports for a DFS code. As a simple example, Figure 2.2 shows the XIOS graph
for a tRNA, according to the experimentally determined 3-dimensional structure (PDB ID:
1EHZ).
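The idea that a canonical code identifies isomorphous graphs can be illustrated with a brute-force stand-in. The sketch below is explicitly not gSpan's DFS code or its right-most extension: it simply tries every vertex renumbering and keeps the lexicographically smallest list of (v_i, v_j, label) tuples, which is only practical for very small graphs. The edge dictionary format and the convention that an I/J label is read from the lower-numbered vertex are assumptions.

```python
from itertools import permutations

# Reversing the reading direction of an edge swaps I and J; the
# undirected labels X, O and S are unchanged.
FLIP = {"I": "J", "J": "I", "X": "X", "O": "O", "S": "S"}


def edge_code(edges, order):
    """Serialize the edges after renumbering vertices with `order`."""
    code = []
    for (u, v), label in edges.items():
        a, b = order[u], order[v]
        if a > b:  # keep the lower-numbered vertex first
            a, b, label = b, a, FLIP[label]
        code.append((a, b, label))
    return tuple(sorted(code))


def canonical_code(n_vertices, edges):
    """Smallest code over all renumberings (factorial cost, toy sizes only)."""
    return min(
        edge_code(edges, dict(zip(range(n_vertices), perm)))
        for perm in permutations(range(n_vertices))
    )


# Two numberings of the same three-stem graph, plus a different graph.
g1 = {(0, 1): "I", (0, 2): "I", (1, 2): "S"}
g2 = {(0, 2): "J", (1, 2): "J", (0, 1): "S"}
g3 = {(0, 1): "I", (0, 2): "I", (1, 2): "O"}
print(canonical_code(3, g1) == canonical_code(3, g2))
print(canonical_code(3, g1) == canonical_code(3, g3))
```

The DFS lexicographic order reaches the same kind of unique representative without enumerating all renumberings, which is what makes pruning of the search tree possible.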
2.2.4 Enumeration of N-stem structures
Every RNA structure can be represented by a XIOS graph. For n stems, the upper bound
on the number of possible structures2, N, can be calculated by Equation (1),
$$ N = \frac{(2n)!}{2^{n} \cdot n!} \qquad (1) $$
For example, there is only one possible one-stem structure, two possible two-stem
structures, and 10 possible three-stem structures, but only eight unique structures
(Table 2.2). Figure 2.3 shows the XIOS graphs for the eight unique structures that can be
formed from three stems. The other two three-stem structures are either redundant or
physically impossible.
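As a quick check of the upper bound, the closed form N = (2n)!/(2^n · n!) can be compared with the product of odd numbers (2n-1)(2n-3)…3·1 given in the accompanying footnote. The sketch below is illustrative only and the function names are hypothetical.

```python
from math import factorial


def upper_bound_factorial(n):
    """Closed form: (2n)! / (2^n * n!)."""
    return factorial(2 * n) // (2 ** n * factorial(n))


def upper_bound_odd_product(n):
    """Footnote form: (2n-1) * (2n-3) * ... * 3 * 1."""
    product = 1
    for odd in range(2 * n - 1, 0, -2):
        product *= odd
    return product


for n in range(1, 6):
    print(n, upper_bound_factorial(n), upper_bound_odd_product(n))
```

The two forms agree, giving bounds of 1, 3 and 15 for one, two and three stems. These are upper limits on the number of pairings of half stems, so the structure counts quoted in the text are at or below them.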
2.3 Greatest conserved structures
2.3.1 Extension of the gSpan algorithm
XIOS graphs have several differences from the chemical structure graphs considered by
Yan and Han. XIOS graphs
• have both directed and undirected edges. I edges are directed because it is
highly important whether a stem is nested within or outside another stem. X, O,
and S edges are undirected.
• do not have vertex labels. Because every vertex is simply an anonymous
elemental stem, no labels are available.
2 For the n-stem case, there are 2n half stems. We assign integer labels to each half stem from 1 to 2n-1. By definition,
the first half stem is labeled 1, and there are 2n-1 possible half stems that can pair with the first half stem; the third half stem has only one possible label (2 or 3), and there are 2n-3 possible half stems that can pair with this half stem, and so on. The upper bound on the number of possible n-stem structures is therefore: (2n-1)*(2n-3)*(2n-5)*…*5*3*1.
The use of unlabeled vertices with the gSpan algorithm is fairly straightforward, but
results in a decreased ability to rapidly prune the search tree. Directed edges are a little
more difficult to accommodate because the direction of the edge depends on the vertex
from which one looks. The simplest approach is to label the edge as either I or J from the
point of view of the lowest numbered vertex. I and J are treated as lexicographically
distinguishable edges.
In the original application of gSpan to chemical structures, Yan and Han were interested
in identifying frequently occurring chemical substructures. In their case, structures that
occur many times in a single graph are equally interesting. The case of RNA differs;
motifs that occur in multiple graphs (molecules), rather than many times in a single
graph (molecule), are considered more important. In addition, the presence of
incorrectly classified sequences, i.e., sequences that have no common structure, means
that not all graphs will support the biologically relevant subgraph. For XIOS graphs,
therefore, support is calculated as the number of graphs that contain a subgraph,
rather than the total count of matching subgraphs.
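The difference between the two notions of support can be shown with a toy tally. The dictionary of per-graph match counts and the function names below are hypothetical; the counts are invented purely for illustration.

```python
def graph_support(matches_per_graph):
    """Number of graphs that contain the subgraph at least once."""
    return sum(1 for count in matches_per_graph.values() if count > 0)


def total_matches(matches_per_graph):
    """Total number of matching subgraphs across all graphs."""
    return sum(matches_per_graph.values())


# Invented example: one RNA matches three times, one never, one once.
counts = {"rna_1": 3, "rna_2": 0, "rna_3": 1}
print(graph_support(counts), total_matches(counts))
```

With graph-level support, a motif repeated many times inside a single molecule counts no more than a motif seen once, so a threshold below the number of input graphs tolerates misclassified sequences.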
2.3.2 Graph matching algorithm (similar to gSpan3)
begin: For a XIOS graph G with edges eG
  I.  Sort edges in eG by edge type, eG ∈ {X, I, O, S}
  II. For each edge type E
      1. Find all lexicographically minimal one-edge subgraphs, S, from the given XIOS graphs;
      2. For each edge e in S
      3.     Do Subgraph_mining(G, S, e):
               i.   If the graph is NOT a minimum graph according to DFS lexicographical order, return;
               ii.  Generate all potential children with one edge growth, enew
               iii. If support for each child is above threshold
                      Recursively call Subgraph_mining with updated edge list (G, S+, enew)
      4. Remove all edges of edge type E from G after all descendants have been searched
      5. If eG = Ø, break;
end.
2.3.3 Greatest conserved structure(s) in a set of RNAs
Many computational approaches use pairwise or multiple DNA or protein sequence
alignments to find conserved motifs, but this approach is generally impossible with RNA
sequences because of their lack of conserved sequences, and because of the difficulty of
obtaining unambiguously correct alignments. However, secondary and higher order
structures in RNA are conserved, so matching the topology of two RNA structures with a
graph matching approach can identify conserved motifs that cannot be seen in the
sequences. The pre-ordered DFS search approach of gSpan provides an effective
approach to this problem.
3 Adapted, with minor modification, from (133): Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International
Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.
The time complexity for the worst case of this algorithm is suggested to be O(kn)
(133,134), where k is the maximum number of subgraph isomorphisms existing between
the two graphs and n is the size of the greatest common match. Figure 2.4 shows the
application of the XIOS graph approach to the structure of S. cerevisiae and H. sapiens
RNase P.
2.3.4 Characteristics of biological graphs
The graph isomorphism approach is limited by the size of the graphs. We examined
sequences from snoRNA, 5S rRNA, microRNA, tRNA, and RNase P (See Appendix for
details) to determine how the number of stems varies with sequence length in biological
RNAs. The sequences were obtained from online databases (Table 2.1) and their
predicted MFE structures were obtained using the RNAsubopt program of the Vienna
RNA package (97). Predicted MFE structures were also obtained for random sequences
in a similar fashion. Random sequences were obtained by randomizing the order of
bases in the corresponding biological sequences, thus preserving the base composition
and sequence length.
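The randomization described here, shuffling the order of bases so that composition and length are preserved, can be sketched as below. The function name and the optional seed are assumptions; this is not the script used to produce the reported results.

```python
import random


def shuffled_sequence(sequence, seed=None):
    """Return a random permutation of the bases in `sequence`."""
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


original = "GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGA"
randomized = shuffled_sequence(original, seed=1)
print(len(randomized) == len(original))
print(sorted(randomized) == sorted(original))
```

Because only the order of bases changes, any difference in the number of predicted stems between a biological sequence and its shuffled counterpart cannot be attributed to base composition or length.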
Figure 2.5 indicates the overall trend of linear increase in number of stems as a function
of sequence length. This rapid increase in the number of stems is due to the intricately
folded structures of the RNAs. This observation further necessitates the development of
an efficient system for searching biologically relevant structural patterns in RNA. It is
notable that the biological RNAs and random RNAs have very similar numbers of structures.
As one can see in Figure 2.6, stem structures in biological RNAs are predominantly fewer
than ten base-pairs long.
2.4 Future directions
The number of stem structures in an RNA MFE structure can be very large (Figure 2.5);
the total number of possible stems, however, grows quadratically with the length of the
sequence. If one assumes that stem-loop structures require on average 24 bases, the
number of possible stems would be something like (SequenceLength/24)^2. For a relatively
short 10kb mRNA sequence this would lead to graphs with over 150,000 vertices. Our
ultimate goal is to analyze 10-20 sequences of much longer length (many biological
RNAs are over 100,000 bases long), a daunting problem. There are a number of
approaches that can be used to reduce the size of the problem. These include
preprocessing the structure to include only the most interesting stems (rather than all
possible stems), the application of graph contraction methods, and the introduction of
vertex labels.
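The rough size estimate above can be reproduced in a few lines. The function name is hypothetical, and the formula is only the back-of-the-envelope approximation stated in the text, not a count of actual stems.

```python
def estimated_stem_count(sequence_length, bases_per_stem_loop=24):
    """Approximate number of possible stems: (SequenceLength / 24)^2."""
    return (sequence_length / bases_per_stem_loop) ** 2


for length in (1_000, 10_000, 100_000):
    print(length, round(estimated_stem_count(length)))
```

For a 10,000-base sequence the estimate is roughly 174,000 vertices, consistent with the figure of over 150,000 quoted above, and a tenfold longer sequence multiplies the estimate by one hundred.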
2.4.1 Graph preprocessing
While the most biologically interesting RNA structure need not be the minimum free
energy (MFE) structure, it is likely that the important structures are close to the MFE
(136). This follows from the Boltzmann relationship, which indicates that the relative
frequency of a given structure in the structural ensemble depends on its energy. Rather
than identifying all short energetically favorable stems, we can greatly reduce the size of
the problem by including only stems that participate in a structure within some energy
interval, ∂, from the predicted MFE structure. The total number of stems can be
controlled by altering ∂; ∂=0 produces the MFE structure.
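A minimal sketch of the energy-window idea follows, assuming free energies in kcal/mol and a temperature of 37 °C. The structure names, energies, and function names are invented for illustration; in practice the energies would come from a suboptimal-folding program.

```python
import math

# Gas constant (kcal/(mol*K)) times 310.15 K, about 0.616 kcal/mol.
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def within_energy_window(energies, delta):
    """Keep structures whose energy is within `delta` of the MFE."""
    mfe = min(energies.values())
    return {name: e for name, e in energies.items() if e <= mfe + delta}


def boltzmann_frequencies(energies, rt=RT_KCAL_PER_MOL):
    """Relative frequency of each structure, restricted to the listed set."""
    mfe = min(energies.values())
    weights = {name: math.exp(-(e - mfe) / rt) for name, e in energies.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Invented energies for three hypothetical structures.
energies = {"s1": -30.0, "s2": -29.5, "s3": -25.0}
print(sorted(within_energy_window(energies, 0.0)))
print(sorted(within_energy_window(energies, 1.0)))
print(boltzmann_frequencies(energies))
```

Setting the interval to zero keeps only the MFE structure, and widening it admits progressively less populated structures, so the interval acts as a direct control on graph size.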
2.4.2 Reduction of graph complexity
Graph contraction reduces graph complexity by pruning irrelevant vertices and edges.
There are a number of different approaches one can take to pruning XIOS graphs. Firstly,
as we pointed out above, one can simply discard the S edges; since there are exactly
four edge types and each pair of vertices has exactly one edge, only three edge types
need be used. Secondly, we can place limits on the construction of edges of other types,
especially of I edges. One of the advantages of the XIOS representation is that nested
stems, represented by I edges, have an edge with every other stem in which they are
included. This embedding can be many levels deep, generating a huge number of highly
connected vertices. This is a great advantage because it obviates the need for
introducing gaps (137) which make the matching problem much more complex (and ad
hoc since there is no way to determine correct gap parameters). We postulate that we
would lose little matching power if the depth of I edge nesting was limited to a fixed
depth such as four. This would still permit extraneous stems to be easily omitted but
greatly reduce the number of edges in the graphs. Finally, because we can enumerate all
possible XIOS structures with a fixed number of stems, we can create a dictionary of
these substructures and condense the graphs to a smaller number of vertices based on
this dictionary, at the same time converting the unlabelled vertices to labeled vertices
(the labels then correspond to the dictionary structures).
2.4.3 Adding labels
The dictionary strategy, described above, faces difficulties since the isomorphous
structure of interest is buried in a huge field of random noise. If the dictionary based
labels are dominated by the non-matching (noise) portion of the graph, the re-encoded
graph will lose the information needed to match to other graphs (e.g., if the dictionary
structures overlap but do not exactly correspond to the interesting conserved
structures). A similar strategy, unique to the XIOS graph, is to examine all three-vertex
triangles, of which there are a strictly limited number of types due to the limitations
both of the graph and of the biochemistry of RNA, and replace each triangle with a
corresponding labeled vertex. Triangles may share one or two edges which can be
incorporated as an extended set of edge labels. Such graphs would be modestly smaller,
but much more heavily labeled, greatly increasing the search speed. At the same time,
little information is lost since the original graph can be almost completely reconstructed
from the triangle-condensed graph.
2.4.4 Motif identification tool
RNAs that interact with specific molecules, such as proteins, generally have common
topological motifs. For example, in alternative splicing the donor, acceptor, and branch
point all have specific conserved structures important in recognition and catalysis. Such
conserved structures, when identified in molecules of unknown function, immediately
generate experimentally testable hypotheses. Once motifs are identified, they can be
used to search for additional sequences that could form the same structure. This
provides a means for both statistically evaluating the significance of the structural motif,
as well as for validating matches by examining them for biological similarities, e.g., by
comparing the GO annotations (138) of the sequences. A number of approaches may be
suitable for this, including stochastic context free grammars (SCFG) (139) which are
frequently used to identify RNA structures based on biological knowledge (140).
2.4.5 Database search tool
For searching of large databases, SCFGs are likely to be too slow. We are developing a
fast database search tool for RNA motifs. Since we can enumerate all possible XIOS
graphs for structures of up to 7 or 8 stems (hundreds of thousands), we believe that
we can use the enumerated structures to prescreen graphs in much the same way that
BLAST (141) uses identically matching words. This is closely related to the dictionary
concept introduced above. Because matching to the enumerated structures in the
dictionary can be precalculated, we plan to develop a fast system based on the
observation that one need not do the complete isomorphous subgraph search if two
sequences share no dictionary motifs, and that if they do, the isomorphism search can
be seeded by the matching motifs. Such a search tool would allow users to both extend
and validate motifs found through subgraph isomorphism matching, and would also
provide a means to functionally classify unknown RNAs. RNA is still rather poorly
understood and such an approach will be of great use in identifying novel structural and
functional motifs.
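The prescreening idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each RNA is represented by the set of dictionary motif identifiers it contains, and the names prescreen, query and database are hypothetical. A database entry that shares no motif with the query is skipped, and the shared motifs of the remaining entries are returned as seeds for the full isomorphism search.

```python
from typing import Dict, List, Set, Tuple


def prescreen(query: Set[str],
              database: Dict[str, Set[str]]) -> List[Tuple[str, Set[str]]]:
    """Return (name, shared motifs) for entries sharing a motif with the query."""
    hits = []
    for name, motifs in database.items():
        seeds = query & motifs  # dictionary motifs present in both graphs
        if seeds:               # no shared motif: skip the expensive search
            hits.append((name, seeds))
    return hits


database = {
    "rna_a": {"motif_1", "motif_7"},
    "rna_b": {"motif_3"},
}
print(prescreen({"motif_7", "motif_9"}, database))
```

Because the motif sets can be precalculated for every database entry, the cost per comparison in this sketch is a set intersection rather than a subgraph search.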
Because RNA structures are relatively degenerate, it is likely that a post-processing
system will be needed to identify the most interesting possible structures out of a large
number of possibilities. This issue is similar to the problem of relevance ranking in web
indexing. In sequence comparisons, statistical probability calculations are commonly
used as a relevance ranking mechanism, and this may be possible in the XIOS system;
we anticipate that the distribution of maximal matching structures will follow an
extreme value distribution. Any two large RNAs, however, will have common structures
that are almost completely trivial: they will match as a long series of serial stems. This is
generally not biologically interesting, suggesting that there is a notion of biological
complexity which can be used as a relevance ranking function. This biological notion of
complexity may or may not correspond to mathematical notions of graph complexity
(142). Another possible relevance function would be to choose only structural motifs
that can form near-MFE predicted structures using a constrained folding approach
(motif stems are constrained to base-pair in the predicted structure) such as are
available in MFOLD and the Vienna RNA package.
The XIOS graph representation has great promise for identifying biologically interesting
structural motifs in RNA based on sequence alone. Constructing a sufficiently fast motif
search system will allow RNA studies to take advantage of the same bootstrap process
that is commonly used for DNA and protein sequences, namely 1) identify biologically
related sequences, 2) identify statistically significant structural motifs, 3) use structural
motifs to identify additional candidate sequences (iterating to convergence), and 4) use
the structural motif as a basis for laboratory experiments.
Figure 2.1 XIOS definition.
Relationships (edges) are defined as X (exclusive), I (included), O (overlapping), and S (serial). I indicates the direction of I edges with respect to the higher numbered vertex and J indicates the opposite.
Figure 2.2 tRNA 3D structure and corresponding XIOS graph representation.
I.A, 3-D structure of tRNA (PDB ID, 1EHZ). I.B, the simple three-leaf clover shape of tRNA is shown, where the acceptor stem, D-arm, anticodon-arm, and T-arm are represented by vertices 0, 3, 2 and 5 respectively. Vertex 1 represents an interaction between the D-loop and a region between the D-arm and acceptor-arm, and vertex 4 represents an interaction between the D-loop region and the region between anticodon-arm and T-arm. In the XIOS representation (I.C), vertex 1 is included in the acceptor stem and overlaps with the D-arm, vertex 4 overlaps with the D-arm, and the anticodon-arm is included in vertex 4. II a, b, and c show the sequential extension of the DFS graph, and II d shows the minimum DFS tree and corresponding DFS code. At each stage of graph extension, all the possible extensions are shown in dotted lines. For each edge extension, only the canonical graph (shown by dotted ellipse) is used in the next stage.
Figure 2.3 Unique three-stem XIOS graphs, including pseudoknots.
Fifteen XIOS graphs with three vertices are possible, three of them in the first row are not true three-stem topologies (at least one of the stems has only S relationships with other stems); the other four three-stem structures are either redundant or physically impossible.
Figure 2.4 Identification of the common structure in S. cerevisiae and H. sapiens RNase P RNA.
Left panel (top) shows the secondary structure of the S. cerevisiae RNase P RNA. Each stem is labeled with a capital letter A-L. Left panel (bottom) shows the XIOS graph. I edges are shown as single lines and O edges as double lines. Right panel shows the secondary structure (A-N) and XIOS graphs for a single human RNase P RNA. In both panels, matching secondary structures are enclosed by boxes and the uniquely matching part of the XIOS graphs is shown in dark lines. Dotted lines in the XIOS graphs indicate where there are multiple mappings between stems H and I of the S. cerevisiae structure and the human structure; these multiply mapped stems are also indicated by arrows in the secondary structure diagrams. The right panel shows two of the mappings as an example.
Figure 2.5 Correlation between number of stems and sequence length.
Number of stems in biological (♦) and randomized (×) RNA sequences versus sequence length. The number of stems increases roughly linearly with sequence length. Each biological sequence was permuted to generate a corresponding random sequence, preserving the sequence length and base
[Figure 2.5 plot. Panels: miRNA, snoRNA, RNaseP, tRNA, 5S rRNA. x-axis: Sequence Length (bases); y-axis: Number of Stems.]
Figure 2.6 Length of RNA stem structures in biological RNAs
[Figure 2.6 plot. x-axis: Stem Length (bins 0-9, 10-19, ..., 90-99, 100-199, 200-299, >300); y-axis: Frequency (0 to 4500).]
Table 2.1 Brief description of RNA datasets.
Formats are: A, alignment; C, MFOLD connect; S, sequence; V, Vienna RNA package.
Type of RNA      Database or Program       Format   Link
microRNA         miRNA                     S        http://microrna.sanger.ac.uk/sequences/index.shtml
5S rRNA          Database                  S        http://biobases.ibch.poznan.pl/5SData/
rRNA             RDP II                    A, S     http://rdp.cme.msu.edu/index.jsp
RNase P          RNase P Database          C        http://www.mbio.ncsu.edu/RNaseP/
snoRNA           snoRNABase                S        http://www-snorna.biotoul.fr/
snoRNA           Plant snoRNA Database     A, S     http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home
snoRNA           Human snoRNA Database     S        http://www.trex.uqam.ca/~snorna/Seqs.html
tRNA             GtRNAdb                   V        http://lowelab.ucsc.edu/GtRNAdb/
tmRNA            tmRNA                     A, S     http://www.indiana.edu/~tmrna/
Noncoding RNA    ncRNA Database            S        http://biobases.ibch.poznan.pl/ncRNA/
All              Pseudobase                V        http://biology.leidenuniv.nl/~batenburg/PKB.html
All              RNAbase                   S        http://www.rnabase.org/
All              Rfam                      A, S     http://www.sanger.ac.uk/Software/Rfam/index.shtml
All              RNAfold/MFOLD             C, V     Installed on local server
Table 2.2 Number of possible RNA topologies for different numbers of stems, N.
In the enumerated graph results, there are isomorphic graphs (redundant structures).
Unique topologies are the remaining graphs after removing isomorphic graphs.
N    Total     Unique Topologies    % Unique
1    1         1                    100
2    2         2                    100
3    10        8                    80
4    78        46                   58.97
5    746       368                  49.33
6    8566      3914                 45.69
7    114834    51390                44.75
CHAPTER 3 RNA STRUCTURAL FINGERPRINT
3.1 Enumeration of XIOS graphs
In order to understand and explore the RNA topology space, I have developed a
systematic way to efficiently enumerate all small XIOS graphs which are physically
possible. An n-vertex XIOS graph corresponds to an n-stem structure. Based on the
results shown in Figure 2.5 and Figure 2.6, the average size of RNA stems is ~20 nt (not
counting loop regions).
For an n-stem structure, there are 2n half-stems (each stem has two base-paired regions,
and each region is called a half-stem). In enumerating the possible structures, we assign
labels to the 2n half-stems such that the label on the left half-stem of each stem is lower
than all labels on the left half-stems to its right, and the label on each right half-stem is
higher than all labels on the right half-stems to its left. By definition, the label of the first
half-stem is 1, and there are 2n-1 possible regions that can pair with half-stem 1. The
half-stem chosen to pair with half-stem 1 is labeled 2. At this point, one stem (1, 2) is
formed, with the two half-stems 1 and 2 paired. Here I have defined the sequence
direction from labeled half-stem 1 to half-stem 2 as the positive direction, and the
opposite direction is the negative. For the third half-stem, there are three possible
locations it can be placed relative to the position of half-stem 1 and half-stem 2: A. in
the negative direction from half-stem 1, B. between half-stem 1 and half-stem 2 and C.
in the positive direction from half-stem 2. The directionality chosen here is arbitrary.
Cases A and C are symmetric, which means they lead to redundant structures. From
now on, we just consider the positive direction cases (B and C). If the third half-stem is
between half-stem 1 and half-stem 2, by definition, the third half-stem should be
labeled as half-stem 2 (the half-stem previously labeled as half-stem 2 is now assigned
label 3). Otherwise, the third half-stem is in the positive direction from half-stem 2, and
the third half-stem is then labeled 3. As a result, in a unique structure the third half-
stem could only have one possible label (2 or 3), and there are 2n-3 possible half-stems
that can pair with this half-stem, and so on (Figure 3.1 A).
As described in section 2.2.4, the upper bound of the number of possible n-stem
structures is therefore $(2n-1)(2n-3)(2n-5)\cdots 5\cdot 3\cdot 1 = \frac{(2n)!}{2^{n}\,n!}$. The final structure is
stored in a format I call paired format (Figure 3.1 B). Each pair of matching parentheses
in the paired format represents a stem. The labels of the two half-stems associated with
the stem are shown inside the parentheses.
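The counting argument above can be checked with a short Python sketch that pairs the first unpaired half-stem with each remaining half-stem in turn and recurses on what is left. The function names and the textual rendering produced by paired_format are assumptions for illustration; the exact layout of the paired format in Figure 3.1 B may differ, and this sketch does not remove the symmetric or isomorphic structures discussed in the text.

```python
from math import factorial
from typing import Iterator, List, Tuple

Pairing = List[Tuple[int, int]]


def enumerate_pairings(positions: List[int]) -> Iterator[Pairing]:
    """Yield every way to pair up the given half-stem labels into stems."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in enumerate_pairings(remaining):
            yield [(first, partner)] + tail


def upper_bound(n: int) -> int:
    """(2n-1)(2n-3)...(3)(1), computed as (2n)! / (2^n n!)."""
    return factorial(2 * n) // (2 ** n * factorial(n))


def paired_format(pairing: Pairing) -> str:
    """Hypothetical rendering: one parenthesised label pair per stem."""
    return "".join(f"({a},{b})" for a, b in pairing)


n = 3
structures = list(enumerate_pairings(list(range(1, 2 * n + 1))))
print(len(structures), upper_bound(n))
print(paired_format(structures[0]))
```

For n = 3 the enumeration produces as many pairings as the closed-form bound, since the first half-stem has 2n-1 possible partners, the next unpaired half-stem has 2n-3, and so on.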
This enumeration method only guarantees the structure can form physically. In the
enumerated graph results, there are isomorphic graphs (redundant structures). All the
paired format representations are converted into graphs and then into their minimum
DFS code. Yan et al. proved that two graphs are isomorphic if their minimum DFS codes
are the same (133). In order to purge the redundant graphs, all the minimum DFS codes
were built into a Perl hash data structure, and only unique minimum DFS codes were
kept in the final set. After these steps, this set contains only unique XIOS graphs for
further research.
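The purging step can be illustrated with the Python sketch below. Computing a true minimum DFS code is beyond a short example, so the sketch substitutes a brute-force canonical code (the smallest label string over all vertex orderings), which is only practical for very small graphs. The graph layout, the names canonical_code and purge_redundant, and the use of J as the reverse reading of an I edge are assumptions for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical layout: (u, v) -> edge label read in the direction u -> v.
# Pairs that are not stored are treated as serial (S).
Graph = Dict[Tuple[int, int], str]
FLIP = {"I": "J", "J": "I", "O": "O", "X": "X", "S": "S"}


def edge_label(graph: Graph, u: int, v: int) -> str:
    """Label of the pair (u, v), flipping I/J when the edge is stored as (v, u)."""
    if (u, v) in graph:
        return graph[(u, v)]
    return FLIP[graph.get((v, u), "S")]


def canonical_code(graph: Graph, n: int) -> str:
    """Stand-in for the minimum DFS code: smallest label string over all orderings."""
    best = None
    for perm in permutations(range(n)):
        code = "".join(edge_label(graph, perm[a], perm[b])
                       for a in range(n) for b in range(a + 1, n))
        if best is None or code < best:
            best = code
    return best


def purge_redundant(graphs: List[Graph], n: int) -> List[Graph]:
    """Keep one graph per canonical code, using a dict as the hash."""
    unique: Dict[str, Graph] = {}
    for graph in graphs:
        unique.setdefault(canonical_code(graph, n), graph)
    return list(unique.values())


g1 = {(0, 1): "I", (1, 2): "O"}
g2 = {(1, 2): "J", (0, 1): "O"}  # g1 with vertices 0 and 2 exchanged
g3 = {(0, 1): "I", (1, 2): "I"}
print(len(purge_redundant([g1, g2, g3], 3)))
```

The dictionary plays the role of the Perl hash: isomorphic graphs map to the same key, so only the first graph seen for each key survives.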
In my thesis work, I have tried to enumerate as many small XIOS graphs as possible. One
observation I have made is that once the graph size (vertex number) reaches 8 or 9, the
total number of possible graphs becomes very large. In my experiments, I have
enumerated the entire set of possible 1 to 10 stem XIOS graphs following the steps
described in Figure 3.1 A. However, as a result of the large number and size of the graphs,
this took over 7 TB of hard drive storage space, not to mention the time-consuming steps of
generating the minimum DFS code for each of the structures in the dataset and purging the
redundant structures. The take-home message is that our approach can, in principle,
enumerate as many graphs as one wants, but because of hardware limitations and
intractable computation time, one needs to decide where to stop. I
chose to keep all unique 1 to 7 stem XIOS graphs for later work in this thesis. The
non-redundant set of 1-7 stem graphs comprises 55,728 graphs in total, as described in
Table 2.2.
3.2 Structural motif library construction
As mentioned above, all 1 to 7 stem small XIOS graphs were enumerated. I define a
concept called the structural motif, which represents the building blocks used in
biological RNA XIOS graphs. I further define the collection of 55,728 enumerated small
XIOS graphs as the RNA structural motif library. This library can be extended when more
graphs are enumerated and built into this collection. My assumption here is that
different RNA structures contain different structural motifs which correspond to their
functional differences. The RNA structural motifs embedded in a RNA XIOS graph are its
intrinsic properties.
An N-vertex graph has a maximum of N*(N-1)/2 edges. In the structural motif library,
the graphs may have up to 21 edges, namely 21 spatial RNA stem-stem relationships. I
clustered all the 55,728 graphs into groups based on the number of edges. The rationale
behind this clustering criterion is that each edge in the XIOS graph represents one
topological relationship between a pair of stems, and it is reasonable to group
structures with the same number of pairwise stem relationships. Initially, I clustered the
graphs by stem number. However,
some graphs have strong correlations with each other, and small graphs appear within
bigger graphs. In chapters four and five, which focus on RNA classification and
identification, the stem number-based graph-clustering strategy does not work well. As
an alternative, I developed the following edge-based clustering strategy. Each
graph has a pre-generated minimum DFS code associated with it (calculated in the
redundant graph purge step). Within each N-edge graph group, all the graphs are sorted
based on the alphabetical order of their minimum DFS codes. By doing so, a unique ID
(UID) was assigned to each of the graphs of the form N_row_motif_X, where N is the
number of edges in the graph and X is the rank of the graph in the sorted order. I call
these row motifs because the clustering was based on their edge numbers, which in
terms of the DFS code is also the number of rows of code.
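A minimal Python sketch of this naming scheme follows. It assumes each minimum DFS code is available as a list of rows (one row per edge) and that ranks start at 1; the row strings shown are hypothetical placeholders and not the actual DFS code syntax.

```python
from collections import defaultdict
from typing import Dict, List


def assign_uids(codes: List[List[str]]) -> Dict[str, List[str]]:
    """Group codes by row (edge) count, sort each group, and name by rank."""
    groups: Dict[int, List[List[str]]] = defaultdict(list)
    for rows in codes:
        groups[len(rows)].append(rows)
    uids: Dict[str, List[str]] = {}
    for n_edges in sorted(groups):
        # Lists of strings compare row by row, giving an alphabetical order.
        for rank, rows in enumerate(sorted(groups[n_edges]), start=1):
            uids[f"{n_edges}_row_motif_{rank}"] = rows
    return uids


codes = [["0 1 O"], ["0 1 I"], ["0 1 I", "1 2 O"]]
for uid, rows in assign_uids(codes).items():
    print(uid, rows)
```

Because the sort is deterministic, rebuilding the library from the same set of codes reproduces the same identifiers.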
In order to manage this big set of graphs, a MySQL database was set up to store them
and provide easy access. A table called row_motif was created with the graph UID as a
primary key and the minimum DFS codes stored as a column in the table.
Each structural motif is represented by its minimum DFS code, which is an abstract
concept. To make structural motifs easier to visualize and understand, I
first wrote a Perl CGI (Common Gateway Interface) script, row_motif_check.cgi,
to render them as RNA stem-loop diagrams using the minimum DFS code as input. This
script uses the LWP (World-Wide Web library for Perl) package to interact with
the PseudoViewer3 web service (143) and retrieve the result. PseudoViewer3 is an excellent
visualization tool, and was the first one developed for the automatic drawing of RNA
with any type of pseudoknot as a planar graph (Figure 3.2). Despite its useful features,
we experienced many internet connection difficulties and inconsistencies since the server
is located in Korea. As an alternative, I added another visualization application
called VARNA (Visualization Applet for RNA) (144), a lightweight Java applet
that draws RNA secondary structures, to the row_motif_check.cgi script. VARNA runs
locally from our server, which guarantees its speed and availability. One drawback is
that VARNA cannot produce as nice a layout of pseudoknotted structures as
PseudoViewer3 (Figure 3.2). Since we do not have a strict requirement about the layout,
VARNA is sufficient to perform the visualization task.
3.3 RNA structural fingerprint
3.3.1 Background
In the bioinformatics research related to DNA and protein function, it is the constraint
that function places on mutational change that gives rise to the observed sequence
conservation. Traditionally bioinformatics tools rely on sequence conservation;
sequence similarity often translates into functional similarity. But in the case of RNA,
function is often only weakly linked to sequence, while the main player is the structure.
Our RNA XIOS graph representation captures the dynamic topological characteristics of
a folded RNA molecule, and it can be thought of as a coarse-resolution picture of the actual
structure without details such as sequence and stem length, and loop size. The XIOS
graph framework is topology based, compared to sequence based (145) or shape based
(146-148) frameworks. I present a tool which can use such structural topological
features to identify functionally related RNA molecules.
3.3.2 Definition of RNA structural fingerprint
With the RNA structural motif library constructed, based on the assumption that
different RNA structures contain different RNA structural motifs, I propose a new
concept called the RNA structural fingerprint (simply referred to as fingerprint in the
later part of the thesis). The definition of the RNA structural fingerprint is a list of
structural motifs (defined in section 3.2) that are found in a specific biological RNA
structure. This comprehensive list of structural motifs summarizes the spatial
relationships between the stems in an RNA structure.
The fingerprint idea is simple and straightforward. Figure 3.3 shows the workflow of
generating fingerprints for RNA structures or sequences. If one has an RNA sequence,
the RNA folding program, UNAFOLD, is used to predict its suboptimal structures (up to 5%
above its MFE). Note that UNAFOLD can only predict non-pseudoknotted structures.
Our lab developed a strategy to compare suboptimal structures and identify possible
pseudoknotted structures (149), representing the ensemble of structures as a single
XIOS graph. Alternatively, if one has an RNA secondary structure, it can be directly converted into a
XIOS graph by using the XIOS package described in Chapter 2. The XIOS graph is used as a
query to search against the RNA structural motif library using a subgraph matching
process. This search identifies all the structural motifs that are found in the RNA
XIOS graph; this is the fingerprint of the RNA XIOS graph. The fingerprint can be treated as a feature
vector: each element corresponds to one specific RNA structural motif in our library, and the value of
that element is the corresponding count.
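The feature vector can be sketched in Python as follows. The function name fingerprint_vector and the example identifiers are hypothetical; the sketch assumes the subgraph matching step has already produced a list of matched motif identifiers, with one entry per occurrence.

```python
from collections import Counter
from typing import Iterable, List


def fingerprint_vector(library_uids: List[str],
                       matched_uids: Iterable[str]) -> List[int]:
    """One element per library motif, holding how often that motif was matched."""
    counts = Counter(matched_uids)
    return [counts.get(uid, 0) for uid in library_uids]


library = ["1_row_motif_1", "1_row_motif_2", "2_row_motif_1"]
matches = ["1_row_motif_1", "2_row_motif_1", "1_row_motif_1"]
print(fingerprint_vector(library, matches))
```

Fixing the order of the library identifiers gives every RNA a vector of the same length, so fingerprints of different molecules can be compared element by element.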
The concept of an RNA structural fingerprint is slightly abstract and an example of tRNA
is given in Figure 3.4 to better illustrate it. This example shows the actual structural
motifs that comprise a tRNA structural fingerprint. The tRNA XIOS graph is surrounded
by 1 to 3 row structural motifs with their corresponding XIOS graphs. There are
additional, bigger structural motifs embedded in the tRNA XIOS graph but they are not
shown here for simplicity. The colors of the stems represent one of their possible
corresponding stems found in tRNA XIOS graph. The tRNA secondary structure, 3D
structure and XIOS graph are colored using the same color scheme to better highlight
the corresponding structures.
3.3.3 Fingerprint searching algorithms
The fingerprint generating process, which requires searching against the RNA structural
motif library, would be time consuming if a brute-force method were used. The
task mainly involves determining whether a query XIOS
graph contains a subgraph that is isomorphic to a specific structural motif XIOS graph. If
the query RNA structure is compared with every structure in the library, the
computational complexity of such a search is O(nmm), where n is the number of
structural motifs to be compared and m is the number of edges in the query XIOS graph.
For the structural motif library, n is 55,728; it could be even bigger in a more complete
motif library. This subgraph isomorphism problem is known to be NP-complete (70).
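To make the cost of the brute-force check concrete, the Python sketch below tests a single motif against a query by trying every injective assignment of motif vertices to query vertices. The graph layout and the names edge_label and contains_motif are assumptions for illustration, as is the choice to require only the motif's non-serial edges to be matched.

```python
from itertools import permutations
from typing import Dict, Tuple

# Hypothetical layout: (u, v) -> edge label read in the direction u -> v.
# Pairs that are not stored are treated as serial (S).
Graph = Dict[Tuple[int, int], str]
FLIP = {"I": "J", "J": "I", "O": "O", "X": "X", "S": "S"}


def edge_label(graph: Graph, u: int, v: int) -> str:
    """Label of the pair (u, v), flipping I/J when the edge is stored as (v, u)."""
    if (u, v) in graph:
        return graph[(u, v)]
    return FLIP[graph.get((v, u), "S")]


def contains_motif(query: Graph, n_query: int,
                   motif: Graph, n_motif: int) -> bool:
    """True if some injective vertex mapping preserves every motif edge label."""
    for image in permutations(range(n_query), n_motif):
        if all(edge_label(query, image[u], image[v]) == label
               for (u, v), label in motif.items()):
            return True
    return False


query = {(0, 1): "I", (0, 2): "I", (1, 2): "O"}
print(contains_motif(query, 3, {(0, 1): "O"}, 2))
print(contains_motif(query, 3, {(0, 1): "X"}, 2))
```

The number of candidate mappings grows factorially with the number of query vertices, and a brute-force fingerprint search repeats this check for every motif in the library.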
It is inefficient to scan the whole library to match structural motifs one by one. An
efficient strategy is needed to make this fingerprint search faster. A filter and
verification method is a common approach to speed up the search efficiency of
subgraph isomorphism checking over large sets of graphs. The filtering step, which
omits graphs that do not satisfy restraints defined by the user, is the key to improve
search efficiency, since the efficiency is largely determined by the number of graphs left
to be checked in the verification step (the fewer graphs left, the faster the search is).
Therefore, many approaches have proposed using indexing techniques to speed up the
filtering (150-158). Here I am going to describe the strategies I used in my fingerprint
search, including CUDA GPU programming and two indexing techniques.
3.3.3.1 CUDA GPU programming
The graphics processing unit (GPU) is a specialized circuit designed to efficiently
manipulate computer graphics. GPUs are normally embedded in a graphics card, or
integrated on the motherboard. The highly parallel, multithreaded, multi-core processor
structure of the GPU makes it more powerful than a central processing unit (CPU) when
processing large blocks of data in parallel (Figure 3.5). A simple comparison of floating
point operations per second (GFLOP/s) and memory bandwidth (GB/s) between CPU
and GPU is shown in Figure 3.6.
NVIDIA Corporation, a major supplier of graphics cards, released the Compute Unified
Device Architecture (CUDA), which is an extended C/C++ language, for general
purpose computing on GPUs (GPGPU). GPGPU is becoming an active area of computational
research, with promise to advance computationally challenging problems in areas
such as large database searching, protein folding, and molecular dynamics simulation.
The RNA structural fingerprint search process includes thousands of independent
subgraph isomorphism checks. GPGPU was used to parallelize the searching process and
improve its searching efficiency. We used the NVIDIA GeForce 9800 GTX+ graphics card
(16 multiprocessors, 128 streaming cores) to implement CUDA code and perform the
search. While it seems that the subgraph isomorphism problem is suitable for
implementation on GPU, my fingerprint search requires a reasonably large amount of
memory. I stopped implementing the search code in CUDA due to the graphics card
memory limitation (512MB). GPUs with larger memory could be used to solve this
problem and speed up the search process.
3.3.3.2 Prefix tree search
Binary tree searching has O(nlogn) computational complexity, which is far better than
O(nmm). In Chapter 2, I described using a graph sequentialization method, the gSpan
algorithm DFS coding (133,134), to translate a graph into its minimum DFS code which
can be considered a canonical labeling of the graph. Yan and Han showed that if two
graphs are isomorphic, they must share the same minimum DFS code (133). I proposed
a prefix tree data structure to efficiently store and retrieve structural motifs in the
library (Figure 3.7). As mentioned before, the minimum DFS codes were pre-calculated
for each structural motif, and those codes were stored in this prefix tree. The prefix tree
stores all n row motifs in level n. Each node of the tree only holds one row of DFS code.
In order to retrieve the complete minimum DFS code of a structural motif, it is necessary
to trace from root node to the leaf node representing the last row of the DFS code for
the structure and retrieve the code. Node X is a parent of node Y if and only if the DFS
code from the root node to X is a prefix of the DFS code from the root node to Y.
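The storage scheme described above can be sketched in Python as follows. The class and function names and the row strings are hypothetical; each node holds a single row of DFS code, and the complete code of a motif is recovered by following parent links back to the root and reversing the result, which yields the same rows as tracing from the root down to that node.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of the prefix tree; it stores a single row of DFS code."""

    def __init__(self, row: Optional[str] = None,
                 parent: Optional["TrieNode"] = None) -> None:
        self.row = row
        self.parent = parent
        self.children: Dict[str, "TrieNode"] = {}
        self.uid: Optional[str] = None  # motif identifier, if one ends here


def insert(root: TrieNode, rows: List[str], uid: str) -> TrieNode:
    """Insert a code row by row, so that an n-row motif ends at depth n."""
    node = root
    for row in rows:
        if row not in node.children:
            node.children[row] = TrieNode(row, node)
        node = node.children[row]
    node.uid = uid
    return node


def full_code(node: TrieNode) -> List[str]:
    """Rebuild the complete code by walking up to the root and reversing."""
    rows: List[str] = []
    while node.parent is not None:
        rows.append(node.row)
        node = node.parent
    return rows[::-1]


root = TrieNode()
short = insert(root, ["0 1 I"], "1_row_motif_1")
longer = insert(root, ["0 1 I", "1 2 O"], "2_row_motif_1")
print(full_code(longer), longer.parent is short)
```

Motifs that share a prefix share the corresponding nodes, so the first row of the two example codes is stored only once.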
Most subgraph isomorphism checking methods used on large-scale graph sets follow the
filter-and-verification style: they first filter out graphs that do not satisfy restraints
defined by the user, and then perform isomorphism checking on the remaining graphs. My
prefix tree strategy employs a verification-and-filter style. In contrast to
filter-and-verification, this style starts isomorphism checking with a small graph. If this small
graph fails the isomorphism check, the program filters out the large number of graphs
that are extensions of it. The details of the prefix tree
search are as follows: a fingerprint search for a query XIOS graph starts
from the root of the prefix tree. The root node contains just the “zero” row motif graph,
which has only serial stem-to-stem topological relationships among all the vertices, for
example, the first graph in Figure 2.3. There are two one-row motifs that are children of
this root node, which are the 2nd and 3rd graphs in the first row of Figure 2.3. From here
on, a depth-first search strategy is used. One of the two motifs is chosen
for subgraph isomorphism checking with respect to the query XIOS graph. If this motif
passes the check (matches), then one of its child node motifs is retrieved and used for
subgraph isomorphism checking with respect to the query XIOS graph. This search
process is repeated until a motif M which fails the subgraph isomorphism checking is
found. Since this is a prefix tree data structure, all the motifs represented by the child
nodes must contain M as a subgraph. If M is not a subgraph of the query XIOS graph,
then all its child nodes represent motifs that cannot be subgraphs of the query XIOS
graph, because they are basically extensions from M.
The filtering power of this strategy is that once the subgraph isomorphism checking fails
at a specific node, the whole child branch of this node no longer needs to be checked,
since this node is a prefix of all its child nodes.
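The pruning behaviour can be illustrated with the following Python sketch. The tree is a nested dictionary keyed by DFS code rows, every node is assumed to correspond to a motif in the library, and the matches callback stands in for the subgraph isomorphism check; all names and the toy predicate are hypothetical.

```python
from typing import Callable, List, Tuple

Code = Tuple[str, ...]


def build_trie(codes: List[List[str]]) -> dict:
    """Nested dictionaries: each key is one row of DFS code."""
    root: dict = {}
    for rows in codes:
        node = root
        for row in rows:
            node = node.setdefault(row, {})
    return root


def search(trie: dict, matches: Callable[[Code], bool],
           prefix: Code = ()) -> List[Code]:
    """Depth-first search that descends only below motifs that match."""
    hits: List[Code] = []
    for row, child in trie.items():
        code = prefix + (row,)
        if matches(code):
            hits.append(code)
            hits.extend(search(child, matches, code))
        # On failure the whole branch below this node is skipped.
    return hits


checked: List[Code] = []


def toy_check(code: Code) -> bool:
    """Toy stand-in for the isomorphism test; records which codes were tested."""
    checked.append(code)
    return set(code) <= {"a", "b", "e"}


trie = build_trie([["a"], ["a", "b"], ["a", "b", "c"], ["d"], ["d", "e"]])
print(search(trie, toy_check))
print(checked)
```

In this toy run the code ("d", "e") is never tested, because its parent ("d",) already failed; this is the saving that the verification-and-filter style relies on.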
During the test of this approach, I experienced some inconsistency in searching speeds.
In spite of this, the overall searching speed was faster than the brute-force method.
After carefully looking at the layout of the tree in the memory, I found that the order in
which the tree nodes are allocated in memory is very important. In Perl,
one does not have full control of memory allocation, and the order in which the nodes
of the tree were constructed was the cause of the inconsistency. We found that a memory paging
problem occurs whenever the tree is large enough to span more than one page. If the
distance between the address of a parent node and the address of its child node exceeds
the page size, the memory access time is far higher. One take-home message is that an
efficient layout of the tree in memory is important and can potentially save a lot of
computational time.
3.3.3.3 NH indexing
Prefix tree searching improved the fingerprint generating speed, but it was still not
satisfactory. Inspired by the Neighborhood indexing (NH indexing) method (159)
(Figure 3.8), I have developed a modified version of the NH indexing strategy to speed up
the RNA structure database searching process. The main idea of the NH indexing
strategy is that a vertex plays a role proportional to its significance when we are
matching two graphs. The neighbors of the vertex and its degree can be used to
determine the significance of a vertex in the matching process. This information is used
in the indexing as well as in the query search process. Besides the neighbor and degree
information, I also define triangle descriptors (Figure 3.9) that describe vertex properties.
For a given vertex i, vertices j and k are its neighbors (connected to i by edges). A triangle can
be formed by i, j and k. Depending on the edges linking these three vertices, 36 different
triangles are possible (Figure 3.10).
XIOS graphs are further separated into connected components (modules), i.e., distinct
subgraphs that have only serial (non-nested) relationships with each other. Modules
represent independent pieces found in biological RNA structures. Every vertex of each
module in the database is indexed by the modified NH indexing strategy (159).
Besides the triangle descriptors shown in Figure 3.9, there are cases that cannot
physically form, even though they are mathematically feasible. A complete list of all of
the 36 mathematically possible triangle descriptors can be found in Figure 3.10. I use a
list called the NH index array to store the vertex properties for each vertex. The design
of the NH index array is shown in Table 3.1.
The NH index array has 42 elements. It includes the counts of all 36 triangle descriptors
listed in Figure 3.10, the number of I, J, O, and X edges that extend from the vertex, its
degree (d) and the number of edges between its neighbor vertices (nc). The details of
generating the NH index array for a specific vertex are described in Algorithm 3.1.
Algorithm 3.1 build_NH_index_array
Input: graph vertex ni
Output: NH index array NH(ni)
0: Initialize NH index array of vertex ni with zeroes
1: Find all neighbors (vertices connected to ni) of vertex ni, and put them into neighbor list NB
2: FOR each vertex nj in neighbor list NB
3:   FOR each vertex nk ≠ nj in neighbor list NB
4:     FOR each triangle descriptor Tl
5:       IF vertices ni, nj and nk match triangle Tl
6:         increment the count of triangle descriptor Tl by 1
7:     eij is the type of edge between vertex ni and nj, eij ∈ {I, J, O, X}
8:     eik is the type of edge between vertex ni and nk, eik ∈ {I, J, O, X}
9:     Increment the count of eij and eik in NH(ni)
10:   END FOR
11: END FOR
12: RETURN NH index array of vertex ni
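A Python sketch of such an array is given below as an illustration only. The keying of triangles (the two edge types leaving the vertex plus the edge type between the two neighbors, ignoring the order of the neighbors) and the ordering of the 36 slots are my assumptions and do not reproduce the numbering of Figure 3.10; this keying does, however, yield exactly 36 classes. Unlike the listing above, the sketch visits each unordered neighbor pair once and counts each incident edge once.

```python
from itertools import combinations, product
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: (u, v) -> edge label read in the direction u -> v;
# serial (S) relationships are simply not stored.
Graph = Dict[Tuple[int, int], str]
FLIP = {"I": "J", "J": "I", "O": "O", "X": "X"}
EDGE_TYPES = "IJOX"


def edge_label(graph: Graph, u: int, v: int) -> Optional[str]:
    """Edge type read from u to v, or None if u and v are only serial."""
    if (u, v) in graph:
        return graph[(u, v)]
    if (v, u) in graph:
        return FLIP[graph[(v, u)]]
    return None


def triangle_key(a: str, b: str, c: str) -> Tuple[str, str, str]:
    """Key for edges i->j (a), i->k (b), j->k (c), independent of j/k order."""
    return min((a, b, c), (b, a, FLIP[c]))


# All distinct keys in a fixed (assumed) order; there are 36 of them.
DESCRIPTORS = sorted({triangle_key(a, b, c)
                      for a, b, c in product(EDGE_TYPES, repeat=3)})


def build_nh_index_array(graph: Graph, vertices: List[int], i: int) -> List[int]:
    """36 triangle counts, then the I, J, O, X edge counts, then d and nc."""
    neighbors = [v for v in vertices if v != i and edge_label(graph, i, v)]
    triangles = [0] * len(DESCRIPTORS)
    nc = 0  # number of edges between neighbors of i
    for j, k in combinations(neighbors, 2):
        c = edge_label(graph, j, k)
        if c is None:
            continue
        nc += 1
        key = triangle_key(edge_label(graph, i, j), edge_label(graph, i, k), c)
        triangles[DESCRIPTORS.index(key)] += 1
    edge_counts = [sum(edge_label(graph, i, v) == t for v in neighbors)
                   for t in EDGE_TYPES]
    return triangles + edge_counts + [len(neighbors), nc]


graph = {(0, 1): "I", (0, 2): "I", (1, 2): "O"}
array = build_nh_index_array(graph, [0, 1, 2], 0)
print(len(array), array[36:])
```

The last six elements of the printed array are the I, J, O and X edge counts followed by the degree d and the neighbor connection count nc.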
Each RNA structure is converted to a XIOS graph, and further separated into its
XIO-edge-connected components (modules). For each module, I calculate the NH index
array for every vertex of the graph, and create a database of structure vertices indexed
by triangle descriptors. For example, if vertex n of graph S and its neighbors can form
triangle descriptors T0 and T18, features T0 and T18 are used as keys to index vertex n
and graph S. After the vertices in the entire structure database have been indexed, it is
easy and fast to look up all the vertices and graphs in the database that are associated
with a specific triangle descriptor feature.
Algorithm 3.2 NH indexing
Input: all database XIOS graphs S
Output: index D, a list of database graph vertex ids containing each feature Dk
1: FOR each graph Si in the XIOS graph database S
2:   separate graph Si into its module list*
3:   FOR each module in the module list
4:     FOR each graph vertex ni in the module
5:       CALL build_NH_index_array function with vertex ni, it returns array NH(ni)
6:       FOR each of the 36 possible triangle descriptors T0 to T35
7:         IF vertex ni is associated with descriptor Tk THEN
8:           Append vertex ni to index entry Dk
9:         END IF
10:       END FOR
11:     END FOR
12:   END FOR
13: END FOR
14: RETURN index D
* X, I, O edge-connected components
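The index built by Algorithm 3.2 is essentially an inverted index from descriptor to vertex ids. The sketch below assumes the NH index arrays have already been computed and are available as plain dictionaries; the function name `build_nh_index` and the (graph, module, vertex) tuple used as a vertex id are hypothetical choices for the example.

```python
from collections import defaultdict


def build_nh_index(nh_arrays):
    """Build the inverted index: descriptor -> list of database vertex ids.

    nh_arrays: dict mapping a database vertex id to a dict of
               descriptor -> count (the NH index array of that vertex).
    """
    index = defaultdict(list)
    for vertex_id, nh in nh_arrays.items():
        for descriptor, count in nh.items():
            # A vertex is associated with a descriptor only if the count is non-zero.
            if count > 0:
                index[descriptor].append(vertex_id)
    return dict(index)


# Hypothetical NH arrays for three database vertices.
nh_arrays = {
    ("S1", 0, "n1"): {"T0": 2, "T18": 1},
    ("S1", 0, "n2"): {"T0": 1, "T18": 0},
    ("S2", 0, "n1"): {"T18": 3},
}
index = build_nh_index(nh_arrays)
```

Looking up `index["T18"]` then returns every database vertex that takes part in at least one T18 triangle, together with the graph and module it belongs to.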
The searching method uses the NH indexing aided database search as well as complete
subgraph matching. The database comprises a set of XIOS graphs derived from biological
RNA structures. Search queries are indexed in the same way as the database and each
query vertex is compared to the database index in order to find topologically similar
vertices in the database as candidates/seeding vertices. This is the NH indexing
screening step. My search strategy does not require searching against the whole
database, but just the graphs that contain the seeding vertices found by the NH indexing
screening step. The performance of the search is greatly improved due to the smaller
searching space. Algorithm 3.3 describes the search process step by step.
Algorithm 3.3 NH indexing search
Input: query XIOS graph Q and index D
Output: search hit list H, where each Hi indicates a database module that matches the query
1: Separate query XIOS graph Q into its connected-components module list
2: FOR each module ml in the module list
3:   FOR each graph vertex ni in module ml
4:     CALL build_NH_index_array function with vertex ni, it returns list NH(ni)*
5:     FOR each of the 36 triangle descriptors
6:       IF vertex ni is not associated with descriptor Tk THEN
7:         Put all the vertices from the list Dk into the non-candidate list NC(ni)
8:       END IF
9:     END FOR
10:     FOR each of the 36 triangle descriptors
11:       IF vertex ni is associated with descriptor Tk THEN
12:         FOR each vertex nj in list Dk
13:           IF nj is not in the non-candidate list NC(ni) THEN
14:             Append nj to the candidate list C(ni)
15:           END IF
16:         END FOR
17:         FOR each vertex nc in the candidate list C(ni)
18:           FOR each k-th value in list NH(ni): NH(ni)[k], where 1<=k<=42
19:             IF k-th value in list NH(nc): NH(nc)[k] is larger than NH(ni)[k] THEN
20:               skip to the next vertex in the candidate list C(ni)
21:             END IF
22:           END FOR
23:           lookup the database graph module mhit that contains vertex nc
24:           Append mhit into hit list H
25:         END FOR
26:     END FOR
27:   END FOR
28:   FOR each graph module hit mhit from hit list H
29:     IF query module ml has an equal or bigger number of vertices than graph module hit mhit THEN
30:       IF simple_subgraph_isomorphism_check(ml, mhit)** is NOT true THEN
31:         Remove this graph hit mhit from the hit list H
32:       END IF
33:     END IF
34:   END FOR
35: END FOR
36: RETURN search hit list H
* The detail of the function is described in Algorithm 3.1. It returns a list of the count of each of the 36 triangle descriptors associated with the vertex, the edge types going out from this vertex and the degree of the vertex.
** This function takes two graphs as input and performs a complete subgraph match test.
Lines 2-27: graph vertex filtering. According to the graph containment search exclusive
logic (155), if a feature f is not embedded in query graph Q, any graph Gi in the database
that has feature f should not be a matching candidate. First, I find which triangle
descriptors are not associated with the query structure vertex ni. From the database
index, I identify all the vertices that contain any triangle descriptor feature
not associated with ni and push them onto the non-candidate vertex list NC(ni). All the vertices that
contain any of the query vertex triangle descriptor features are included in the
candidate vertex list C(ni). The intersection of NC(ni) and C(ni), NC(ni) ∩ C(ni), is then
removed from the candidate list C(ni). Further screening is done by
checking the triangle descriptor feature counts. The k-th count (1 ≤ k ≤ 42) of the
query structure vertex, NH(ni)[k], should not be smaller than the k-th count NH(nc)[k],
where nc is the database vertex. If NH(ni)[k] < NH(nc)[k], nc is removed from the
candidate list C(ni). Later, a module list mhit is built from the candidate list C(ni) by
looking up the vertex-to-module association in the index. At this point, a list of
candidate vertices C(ni) and a graph module list mhit are available for the next step.
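The vertex filtering step can be sketched as follows. The sketch assumes that the index and the NH arrays are plain dictionaries and that a database vertex is kept only when none of its counts exceeds the corresponding count of the query vertex, in line with the requirement that database modules be contained in the query module. The function name `screen_candidates` and the toy data are hypothetical.

```python
def screen_candidates(query_nh, index, db_nh):
    """Return database vertices that pass both filters for one query vertex.

    query_nh: dict descriptor -> count for the query vertex ni.
    index:    dict descriptor -> list of database vertex ids (index D).
    db_nh:    dict database vertex id -> dict descriptor -> count.
    """
    present = {d for d, count in query_nh.items() if count > 0}

    # Exclusion step: any database vertex carrying a descriptor that the
    # query vertex lacks goes onto the non-candidate list NC(ni).
    non_candidates = set()
    for descriptor, vertex_ids in index.items():
        if descriptor not in present:
            non_candidates.update(vertex_ids)

    # Candidate step: vertices sharing at least one descriptor, minus NC(ni).
    candidates = []
    for descriptor in sorted(present):
        for vertex_id in index.get(descriptor, []):
            if vertex_id not in non_candidates and vertex_id not in candidates:
                candidates.append(vertex_id)

    # Count screening: a database vertex may not require more of any
    # feature than the query vertex provides.
    return [
        vertex_id
        for vertex_id in candidates
        if all(
            count <= query_nh.get(key, 0)
            for key, count in db_nh[vertex_id].items()
        )
    ]


# Hypothetical index and NH arrays for three database vertices.
toy_index = {"T0": ["a", "b"], "T18": ["a", "c"], "T5": ["c"]}
toy_db_nh = {
    "a": {"T0": 1, "T18": 1},
    "b": {"T0": 3},
    "c": {"T18": 1, "T5": 1},
}
survivors = screen_candidates({"T0": 2, "T18": 1}, toy_index, toy_db_nh)
```

In this toy case vertex c is excluded because it carries T5, which the query vertex lacks, and vertex b is dropped because it needs three T0 triangles while the query vertex has only two.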
Lines 28-35: module size screening and simple subgraph isomorphism check. This step
efficiently rules out candidates from the list mhit, leaving a small number of candidates
for the more accurate, and most time-consuming, test. In order to perform a specific
biological function, an RNA needs to have a certain set of topological modules, and each of the
modules needs to be complete. That is to say, the query structure topological module
needs to have the same or bigger size (number of vertices) as a database module. This
simple module size test filters out false positive matches very quickly. If the size
requirement is not met, that database module is discarded from the list mhit. The next
step is the most computationally expensive step of the searching process, a simple
subgraph isomorphism check. Here, an exact, complete subgraph containment search is
performed, using the query XIOS graph module to search against all the database module
hits in the module list mhit. The search loops through all candidate vertex combinations
until the first complete match is found; in the worst case it goes through all combinations.
It looks for database modules that are the same size as the query
module or completely nested in the query module. The false positives are omitted from
the result hit list H.
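A brute-force version of the containment test can be sketched as below. It is only an illustration of the idea of trying vertex assignments until the first complete match: the argument layout differs from the two-argument call in Algorithm 3.3, the edge-dictionary representation is an assumption, and the sketch only requires that every edge of the database module is reproduced with the same type in the query module (serial pairs in the database module are left unconstrained).

```python
from itertools import permutations

# Reading an edge in the opposite direction turns I into J and vice versa.
REVERSE = {"I": "J", "J": "I", "O": "O", "X": "X"}


def edge_type(edges, u, v):
    """Relation from u to v; 'S' (serial) when no edge is stored."""
    if (u, v) in edges:
        return edges[(u, v)]
    if (v, u) in edges:
        return REVERSE[edges[(v, u)]]
    return "S"


def simple_subgraph_isomorphism_check(query_vertices, query_edges,
                                      hit_vertices, hit_edges):
    """True if the database module (hit) can be embedded in the query module."""
    hit_vertices = list(hit_vertices)
    if len(hit_vertices) > len(query_vertices):
        return False  # size screening: the hit must not be bigger than the query
    # Try every injective assignment of hit vertices to query vertices and
    # stop at the first one that preserves all hit edge types.
    for image in permutations(query_vertices, len(hit_vertices)):
        mapping = dict(zip(hit_vertices, image))
        if all(
            edge_type(query_edges, mapping[u], mapping[v]) == t
            for (u, v), t in hit_edges.items()
        ):
            return True
    return False


# Hypothetical query module with three stems and two small database modules.
q_vertices = [1, 2, 3]
q_edges = {(1, 2): "I", (1, 3): "I", (2, 3): "O"}
found = simple_subgraph_isomorphism_check(q_vertices, q_edges, ["a", "b"], {("a", "b"): "J"})
missing = simple_subgraph_isomorphism_check(q_vertices, q_edges, ["a", "b"], {("a", "b"): "X"})
```

The factorial number of assignments is what makes this the most expensive step, which is why the index screening and the size test are applied first.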
3.3.4 Possible applications
It is intuitive that RNA molecules contain different structural motifs, but members of the
same RNA family share more common motifs than RNAs from different families. For a
given biological RNA, we represent its ensemble of suboptimal secondary structures
(predicted by UNAFOLD) by a XIOS graph and describe its topological features by a
fingerprint. Comparison of the fingerprints of a set of RNAs can give a clue about
their relationships and functional similarities. It is a natural extension of this work to
index experimentally determined or computationally predicted RNA structures from
different publicly accessible data sources by their fingerprints. This approach allows
the construction of an RNA topology database with all RNA topological information.
Furthermore, a database search utility can be developed to perform RNA topological
similarity searches. With the aid of the RNA topology database, feature selection and
classification methods can be used to identify important topological features
corresponding to specific biological functions.
Figure 3.1 Enumeration of XIOS graphs.
A. The steps of assigning half-stem labels. For the n-stem case, there are 2n half-stems. We assign integer labels to each half-stem from 1 to 2n. By definition, the first half-stem is labeled 1, and there are 2n-1 possible regions that can pair with the first half-stem. Here we define the sequence direction from half-stem 1 to half-stem 2 as the positive direction; the opposite direction is negative. The third half-stem has only one possible label (2 or 3), and there are 2n-3 possible half-stems that can pair with this half-stem, and so on. The upper bound on the number of possible n-stem structures is therefore (2n-1)*(2n-3)*(2n-5)*…*5*3*1. B. tRNA example. This result resembles the tRNA structure (three-leaf cloverleaf structure). The code below is the paired format representation of this structure. Each pair of matching parentheses represents a stem. Each stem has two half-stems associated with it. Their labels are inside the parentheses.
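The upper bound in panel A is the product of the odd numbers up to 2n-1, which is also the number of ways to pair up 2n labeled half-stems. The short sketch below evaluates the product and checks it against a brute-force count of pairings for small n; both function names are illustrative and not taken from the thesis.

```python
def pairing_upper_bound(n):
    """Return (2n-1)*(2n-3)*...*3*1 for a structure with n stems."""
    bound = 1
    for odd in range(2 * n - 1, 0, -2):
        bound *= odd
    return bound


def count_pairings(labels):
    """Brute-force count of the ways to pair up all half-stem labels."""
    if not labels:
        return 1
    rest = labels[1:]
    total = 0
    # The first label can pair with any remaining label; recurse on what is left.
    for i in range(len(rest)):
        total += count_pairings(rest[:i] + rest[i + 1:])
    return total


bounds = [pairing_upper_bound(n) for n in range(1, 5)]
brute_force = [count_pairings(list(range(1, 2 * n + 1))) for n in range(1, 5)]
```

For one to four stems the bound evaluates to 1, 3, 15 and 105, and the brute-force count agrees, which shows how quickly the enumeration space grows with the number of stems.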
Figure 3.2 RNA secondary structure visualization.
The top of the figure shows an example structure in dot-bracket representation. A. RNA structure visualized by PseudoViewer3. B. RNA structure visualized by VARNA.
Figure 3.3 Flow of generating fingerprint.
Figure 3.4 RNA structural fingerprint tRNA example.
This example shows actual structural motifs listed in the tRNA structural fingerprint. A. The tRNA XIOS graph is located in the center, and it is surrounded by 1 to 3 row structure motifs with their XIOS graphs. There are bigger structural motifs embedded in the tRNA XIOS graph, but they are not shown here for simplicity. The color of each motif stem indicates one of its possible corresponding stems in the tRNA XIOS graph. B. tRNA 3D structure. C. tRNA secondary structure stem-loop diagram. Note that the tRNA secondary structure, 3D structure and XIOS graph use the same color scheme to distinguish different stems.
Figure 3.5 Architecture comparison of CPU and GPU.
Adopted from CUDA C Programming Guide 4.0
Figure 3.6 Comparison of CPU and GPU.
Left: floating-point operations per second. Right: memory bandwidth. Adopted from CUDA C Programming Guide 4.0
Figure 3.7 Prefix tree structure stores structural motif library for efficient subgraph isomorphism check.
Figure 3.8 Neighborhood indexing (NH indexing).
Left panel: the open circle in the center is the vertex we are focusing on. All the squares are its direct neighbors, and the diamond is not its neighbor. Information such as the number of I, J, O and X edges that extend from the vertex, the degree of the vertex (d) and the number of connections between its neighbors (nc) is considered a property of the vertex. A list of all this information is called the NH index array of the vertex. Right panel: with the help of the NH index array, some vertices can be easily anchored from the query graph (smaller graph on the left) to the database graph (bigger graph on the right). Those vertices serve as seeds (closed circles) for the initial step of graph matching. Extending to their neighbor vertices (open circles) leads to the maximum subgraph shared by the two graphs more efficiently.
Figure 3.9 Triangle descriptors.
All physically possible triangle descriptors are shown. A complete list of all mathematically possible triangle descriptors can be found in Figure 3.10. Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if there is no edge shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk, and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means there are two possible triangles that form descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth first search) code associated with it; see (160) for more detail.
Figure 3.10 Full List of Mathematically Possible Triangle Descriptors.
Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if there is no edge shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk, and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means there are two possible triangles that form descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth first search) code associated with it; see (160) for more detail.
Table 3.1 Design of the NH index array.
There are 42 elements in this array. Basic elements of the array include: counts of the 36 mathematically possible triangle descriptors (T0 to T35), counts of I, J, O and X edges (I, J, O, X), degree of the vertex (d) and number of edges connecting its neighbor vertices (nc).
T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 T19 T20 T21 T22 T23 T24 T25 T26 T27 T28 T29 T30 T31 T32 T33 T34 T35 I J O X d nc
CHAPTER 4 MATCHING UNKNOWN RNA STRUCTURES
4.1 Introduction
Graph theoretical approaches have been used to identify chemical moieties associated
with functional properties (71,72,153,161). Chemical structures have been represented
by molecular graphs in quantitative structure-activity relationships (QSAR) studies, using
structural determinants to model and predict physicochemical and biological properties
(73). Graphical representation of chemical structures has been used to compare
structure similarity and to identify function (71,72). The correlation between chemical
properties and function can be used to predict the function of novel molecules.
ChemIDplus (161), PubChem (162), ChemBank (163) and BindingDB (164) are examples
of databases using graph theoretical approaches for chemical structure search and
comparison.
While increasing attention has been drawn to the important biological roles of RNA,
RNA functional annotation remains difficult. Approaches based on RNA primary
sequence alone have been extensively studied and implemented (66,140,146,165,166),
but there are many cases where structurally similar RNAs have little or no detectable
sequence similarity; in these cases sequence-based approaches fail to correctly identify
and classify the RNAs. RNA function is dependent on RNA tertiary folding, and tertiary
structure, in turn, is largely determined by base-paired secondary structures and
pseudoknotted structures, which are not secondary structure. The RNA-XIOS database
provides a means to link RNA secondary structure, including pseudoknots, to its
biological function and physicochemical properties by associating topological patterns
with the functions of currently known RNA families.
Similar to graph database studies in chemical informatics, the RNA structure XIOS graph
database provides extensive RNA-secondary-structure topological information, including
pseudoknots, and ensembles of suboptimal RNA secondary structures (160). For a given
query RNA structure, it can quickly identify topologically similar RNAs for further
analysis, such as function identification.
Several RNA motif databases based on graph theory are currently available (167), but
they do not provide an RNA structural topology search service that includes
pseudoknot topologies. Additionally, techniques for efficiently identifying structural
similarity between RNAs are not well developed. Our approach is similar to the
RNAshapes approach of Giegerich et al. (98), but we also consider pseudoknots and
suboptimal structural ensembles. It is also similar to the RAG (RNA-As-Graphs) approach (168),
which describes RNA structures as graphs. However, some RAG structures that are
mathematically possible cannot form in the physical world. In our approach we only
enumerate physically possible graphs, greatly reducing the search space for topological
similarity.
4.2 Methods and dataset
4.2.1 XIOS Graph
We have developed a framework, XIOS, which represents an ensemble of RNA
secondary structures in a single graph; pseudoknots and suboptimal structures are
specifically included (160). XIOS graphs are constructed based on base-pairing in actual
and/or predicted stems. Each vertex in the XIOS graph represents an RNA stem and each
edge/link corresponds to one of the four spatial stem-stem relationships: exclusive (X),
included (I), overlapping (O) and serial (S). The reverse of the included (I)
relationship is denoted as J (Figure 2.1). As this is a complete list of possible relationships
between base-paired regions in an RNA, the XIOS approach can be used to enumerate all
physically possible graphs. The XIOS graph is then converted into a minimum depth first
search (minimum DFS) code for fast RNA structure comparison (133). In contrast to
traditional sequence-based approaches, XIOS is a topology-based approach, which
allows comprehensive and efficient exploration of the RNA structure space.
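How two stems could be assigned one of these relationships can be sketched from their sequence coordinates. The sketch rests on assumptions that are not spelled out in this section: each stem is given by the position ranges of its two half-stems, X is taken to mean that the stems share bases and so cannot coexist, and the I label is read from the enclosing stem toward the enclosed one (with J as its reverse). The function names are hypothetical.

```python
def ranges_overlap(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def xios_relation(s, t):
    """Relation from stem s to stem t.

    Each stem is ((l_start, l_end), (r_start, r_end)): the position ranges
    of its left and right half-stems, with the left half-stem first.
    """
    s_left, s_right = s
    t_left, t_right = t
    # X: the stems compete for the same bases (assumed meaning of exclusive).
    if any(ranges_overlap(a, b) for a in s for b in t):
        return "X"
    # S: one stem lies entirely before the other.
    if s_right[1] < t_left[0] or t_right[1] < s_left[0]:
        return "S"
    # I: t sits entirely between the two half-stems of s (s encloses t).
    if s_left[1] < t_left[0] and t_right[1] < s_right[0]:
        return "I"
    # J: the reverse case, t encloses s.
    if t_left[1] < s_left[0] and s_right[1] < t_right[0]:
        return "J"
    # O: the half-stems interleave, as in a pseudoknot.
    return "O"


# Hypothetical stems, positions chosen only for illustration.
outer = ((1, 3), (20, 22))
inner = ((5, 7), (12, 14))
knot_a = ((1, 3), (10, 12))
knot_b = ((5, 7), (15, 17))
```

With non-overlapping half-stems there are only six possible orderings of the four ranges, and the four branches after the X test cover all of them, which mirrors the statement that the relationship list is complete.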
4.2.2 Dataset
The data set used here is the manually curated dataset described in Table 1.1.
4.2.3 Indexing and searching
The basics of the database search and the fingerprint search are the same as described in
Chapter 3. The indexing algorithm is the same as Algorithm 3.2, and the search algorithm
is the same as Algorithm 3.3. The difference lies in the database: in the fingerprint
search it is the structural motif library (see section 3.2), while here it is the manually
curated dataset (Table 1.1). Real biological RNA structures are indexed in the RNA
structural database considered in this chapter. The structures I consider here are
significantly bigger than the enumerated structural motifs and more biologically
relevant. The large size and complexity of real biological RNAs are challenging aspects of
this database search study.
4.2.4 Scoring function
The similarity of RNA structures is evaluated based on the number of indexed subgraphs
they have in common. For a query structure, each query module has a true candidate
database module list generated by Algorithm 3.3. By examining the combinations of the lists
of all graph modules, the database structures with the most module hits, as well as the
closest size matches, can be found. A combination of these terms is used to define the
best matching structure.
For a specific query structure, i.e., a XIOS graph with a certain number of modules, large
graphs in the database tend to have a higher number of module hits and larger
matches. This is because large graphs have larger modules and more kinds of small
modules, regardless of their true similarity to the query. Large database graph modules
therefore tend to have larger matches to query graph modules. For example,
consider a query graph module with N (N≥2) vertices, i.e., of size N. Suppose there are
two database graph modules A and B (where the size of A is N and the size of B is less
than that of A), and that both A and B match the query. Module A would tend to have a
larger match with the query module, since A is bigger than B. In general, larger database
modules tend to have larger matches regardless of the query. We penalize
the unmatched regions of a database graph match to correct for this effect. The bigger
the unmatched region, the higher the penalty it receives. Among the best database hits
with the same match size and number of hits, higher scores are given to database
modules that are the most similar in size to the query graph.
The database search result is affected by two factors: the structure overlap size
(the number of vertices that can be mapped between query and hit), and the difference
between the query and hit module sizes. We added penalties to the scoring function (eq 4.1) to penalize
the unmatched part of the structures. This helps to promote structures of similar
size to the query graph to the top of the result hit list.
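$$\mathrm{score} = \frac{\mathrm{overlap\_size}}{\dfrac{\mathrm{query\_size}}{\mathrm{hit\_size}} + \dfrac{\mathrm{hit\_size}}{\mathrm{query\_size}}} \qquad \text{(eq 4.1)}$$
where score is in the range (0, overlap_size/2]. The denominator of eq 4.1 adjusts for size differences between the query and database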
modules. If the query and hit sizes are the same, the hit score reaches its maximum
value. If the query and hit size are very different, say query size >> hit size or query size
<< hit size, the hit score reaches its minimum value which asymptotically approaches
zero. For all other cases, the hit score would lie between the two extreme values.
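The behavior described above can be checked numerically with a small sketch. It assumes that the score divides the overlap size by the sum of the two size ratios (query/hit and hit/query), a form chosen because it is consistent with the stated range and limits; the function name `hit_score` and the argument names are illustrative.

```python
def hit_score(overlap_size, query_size, hit_size):
    """Size-penalized hit score (assumed form, see the lead-in).

    The denominator equals 2 when the sizes match and grows without bound
    as the sizes diverge, so the score lies in (0, overlap_size / 2].
    """
    if query_size <= 0 or hit_size <= 0:
        raise ValueError("module sizes must be positive")
    size_penalty = query_size / hit_size + hit_size / query_size
    return overlap_size / size_penalty


equal_sizes = hit_score(6, 6, 6)       # maximum: overlap_size / 2
query_larger = hit_score(6, 12, 6)     # penalized: the hit covers half of the query
very_different = hit_score(2, 1000, 2)  # approaches zero
```

Because the penalty is symmetric in the two sizes, a hit that is much larger than the query is penalized in the same way as one that is much smaller.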
4.3 Results
4.3.1 Validation using known biological structures
An NH indexing database search was conducted using each XIOS graph in the manually
curated structure dataset (Table 1.1) to search against the whole database of manually
curated structures. Performance of the database search is evaluated by the positive hit
ratio (PHR), which is the number of correctly labeled hits divided by the total number of
hits. The label is the known family of its best match in the database. A sample PHR
calculation for a BLAST search is shown in Figure 4.1. In this example, the black line
represents a query search sequence, red lines are the true positive hits, and blue lines
are false positive hits. The positive hit ratio for this search is 3/5. Figure 4.2 shows that
we correctly identified and classified RNA structures using topological criteria across
four distantly related RNA families: tRNA, Group I Intron RNA, RNAseP RNA and tmRNA.
The Y-axis of the charts represents the percentage of NH index searches that achieved a
certain PHR. For example, in our dataset, I performed 16 separate searches, one for each of the 16
tRNAs. In 10 of the 16 searches I observed 100% PHR, so the fraction of tRNA searches with
100% PHR is 0.625 (62.5%). Over 75% of Group I Intron RNA, RNAseP RNA and tmRNA
queries retrieve 100% PHR in the top 5 hits, while over 55% of tRNA queries have 100%
positive hit ratio. We also examined the top 10 scoring hits for each RNA family. In this
case, Group I Intron RNA and tmRNA queries still show high recall, while RNaseP RNA
queries rank somewhat lower.
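The PHR arithmetic used in this section is simple enough to state as code. The sketch below is illustrative only: the function names and the example hit lists are hypothetical, and family labels are compared as plain strings.

```python
def positive_hit_ratio(query_family, hit_families, top_n=None):
    """Fraction of the (top_n) hits whose family matches the query family."""
    hits = hit_families if top_n is None else hit_families[:top_n]
    if not hits:
        raise ValueError("no hits to evaluate")
    correct = sum(1 for family in hits if family == query_family)
    return correct / len(hits)


def fraction_reaching(phr_values, threshold=1.0):
    """Fraction of queries whose PHR is at least `threshold`."""
    return sum(1 for p in phr_values if p >= threshold) / len(phr_values)


# Three correct hits out of five, as in the Figure 4.1 example.
phr = positive_hit_ratio("tRNA", ["tRNA", "tmRNA", "tRNA", "tRNA", "RNAseP"])

# Ten of sixteen queries with a perfect PHR, as in the tRNA example.
perfect_fraction = fraction_reaching([1.0] * 10 + [0.8] * 6)
```

The second function gives the quantity plotted on the vertical axis of the result figures: the share of queries in a family that reach a given PHR.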
Classification accuracy is lowest for tRNA queries. There are two possible reasons. First,
tRNA XIOS graphs are small, and such small motifs may be found in many larger, but
unrelated, graphs in the database. Second, the number of tRNAs in the database is
relatively small compared to the other groups (16 structures collected from the PDB
database). If a couple of tRNAs match other RNA families, the affected fraction is
relatively big in comparison to the other three RNA families, which have more instances
in the database. Overall, this result confirms that our database search is able to identify
RNAs with similar structure and function based on topological matching.
4.3.2 Size only graph database search
Many RNAs within a functional class have very similar lengths. This raises the possibility
that the classification shown in Figure 4.2 was simply due to matching between
structures with similar sizes (the structure (graph) size is the number of vertices in the
XIOS graph). To eliminate this possibility, we implemented a function that scored the
RNAs based only on their XIOS graph sizes. Figure 4.3 shows that matching by size alone
achieves only about 20% classification accuracy, much lower than the level
achieved by topological matching (the Kolmogorov-Smirnov test result is shown in Table
4.1). This shows that graph size is not the main factor contributing to the correct RNA
classification.
4.3.3 Embedding simulation
Another key issue in matching to a structural database is whether a topologically and
functionally similar structure can be found even when it is embedded in a larger
structure. Such embedding could occur either in a biologically meaningful sense (a true
relationship), or be due to misassembly or misannotation of the source sequence. For
each of the structures in our dataset, we performed an embedding simulation that
automatically mutated the sequence of the query RNA structure graph, generating
unique structures that are bigger (in size) than the input structure and have the input structure
embedded in them (Figure 4.4). We call these bigger structures extended structures. Table
4.2 lists the statistics of the embedding simulation.
For each of the original query structures in the database, we generated about 15
extended structures (Table 1.1). An NH index database search is then performed using
the extended structures as queries to see if the original graph and its related graphs can
still be found. This embedding simulation (Figure 4.5) clearly shows that the NH index
database search is able to find the embedded structure and its related structures,
compared with the results shown in Figure 4.2 (the Kolmogorov-Smirnov test result is shown in
Table 4.1). It further suggests that the NH indexing XIOS database search acts more like a
local similarity search than a global similarity search, since it can identify the local graph
patterns within a larger overall graph pattern. Indeed, the performance of the extended
query search is even better than that of the original queries, most likely because the bigger the
structure is, the more likely it is to have hits to its family. The topological structure
database search is therefore robust, and it successfully classified RNA structures into
their correct RNA families.
4.3.4 BLAST search
The NH indexing database search was compared with a BLAST search; the result is
shown in Figure 4.6. BLAST gives good results for RNAseP and tmRNA, but not for
tRNA and Group_I_Intron RNA. The Kolmogorov-Smirnov test (Table 4.1) shows that the
NH indexing database search achieves similar performance to BLAST for RNAseP and
tmRNA, where both searches perform well. It also shows that the NH indexing database
search results for Group_I_Intron RNA and tRNA are statistically significantly better
than those of the BLAST search.
4.4 Discussion
Identification of conserved sequences has played an important role in establishing
sequence-structure-function relationships in proteins, but has been less useful with RNA
because it is the folded structure rather than the sequence that is most closely related
to function. We have developed a structure database searching algorithm, the NH indexing
database search, that can identify and classify topologically, and probably functionally,
similar RNA structures. This knowledge can be used to build experimentally testable
hypotheses about the function of the query RNA.
The NH-indexing database-search algorithm can accurately classify RNA structures
without using primary sequence information. Integrating primary sequence information
into the framework would improve performance for RNAs that have significant
sequence similarity to others in their functional class. Combining both sequence and
topological information should improve the classification of unclassified or misclassified
sequences with low sequence identity but relatively close structural/topological
distances. Any significant sequence similarity should improve the ability to assign the
novel member to the correct family.
The NH-indexing database-search algorithm can also be used as a topological distance
measure of RNA structure similarity. Identifying conserved sequence motifs associated
with unknown functions using multiple alignments and HMMs has been highly useful
in identifying and classifying proteins according to their function (169-171), and should
be similarly useful for RNA.
The current design of the database search requires that the database modules be
smaller than or equal in size to the query module. This work can be extended to implement a
subgraph similarity search that would allow a certain amount of mismatch in order to
maximize the searching power of our approach. If mismatches are allowed, more
database structures would be included in the search result, but the results, presumably,
would include more false positives. Thus more information can be used to classify and
identify query structures, but at the expense of increased noise. Another concern is
whether the current module size correction is reasonable. Currently, if the structure is
small, fewer results are found in the database search, since fewer indexed graph
modules are found in small XIOS graphs. It is likely that if we add smaller, family-specific
motifs to the database, the search performance would be better for small RNA
query structures. Such family-specific motifs can be identified and obtained by applying
feature selection methods to specific RNA family structure datasets.
Figure 4.1 Positive hit ratio (PHR) in a Blast search.
The positive hit ratio is the number of true positives divided by the total number of hits in the result. In this example, the black line represents a query search sequence, red lines are the true positive hits, and blue lines are false positive hits. The positive hit ratio for this search is 3/5.
Figure 4.2 NH database search result.
Topological criteria can be used to correctly identify and classify RNA structures across 4 distantly related RNA families: tRNA, Group I Intron RNA, RNAseP RNA and tmRNA. The x-axis shows the Positive hit ratio, which is calculated as the count of correct hits over total number of hits. For the top 5 hits case, the total number of hits considered is 5. For top 10 hits case, the total number of hits considered is 10. The y-axis is the percentage of queries showing the specified positive hit ratio.
[Plot panels: "Top 5 hits" and "Top 10 hits" (fraction with penalty). x-axis: Positive hit ratio (0 to 1); y-axis: Percentage (0 to 1). Series: All, tRNA, group1, RNAseP, tmRNA.]
Figure 4.3 Size only database search result.
The horizontal axis shows the PHR, and the vertical axis the fraction of queries reaching the specified level. The results here show distributions of close to random matching, indicating that matching between structures based on size alone is not the main factor contributing to the classification result shown in Figure 4.2.
[Plot panels: "Top 5 hits" and "Top 10 hits" (fraction, graph size only). Both axes range from 0 to 1. Series: All, tRNA, group1, RNAseP, tmRNA.]
Figure 4.4 Embedding simulation.
A. Sequence embedding. A given sequence (blue) is embedded between two flanking sequences (yellow and orange), resulting in a longer sequence. B. Graph embedding. Using the same idea as in sequence embedding, graph embedding is applied to a graph by adding extra vertices (orange) and edges to form a bigger graph. In our study, we implemented a Perl script to automatically mutate the base pairing of the input RNA structure graph at the sequence level and generate unique XIOS graphs that are bigger (in number of vertices) than the input graph and have the input graph embedded as a subgraph.
Figure 4.5 Embedding simulation database search result.
The x-axis shows the positive hit ratio, and the vertical axis shows the fraction of queries in each family achieving the specified PHR. For the top 5 hits case, the total number of hits considered is 5. For the top 10 hits case, the total number of hits considered is 10.
[Figure 4.5 panels: "Top 5 hits" and "Top 10 hits" (fraction with penalty). x-axis: Positive hit ratio (0 to 1); y-axis: Percentage (0 to 1); series: All, tRNA, group1, RNAseP, tmRNA.]
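As an illustration of how curves of this kind can be derived, the following minimal Python sketch computes a positive hit ratio for one query and the fraction of queries reaching a given PHR level. It assumes that the PHR is the number of top-k hits belonging to the query's family divided by k, and that "reaching" a level means PHR >= level. The function names positive_hit_ratio and fraction_reaching are hypothetical and are not taken from the original scripts.

```python
from typing import Sequence


def positive_hit_ratio(query_family: str, hit_families: Sequence[str], top_k: int) -> float:
    """Fraction of the top_k ranked hits that share the query's family (assumed PHR definition)."""
    top = hit_families[:top_k]
    positives = sum(1 for family in top if family == query_family)
    # The denominator is the number of hits considered (5 or 10 in the figures).
    return positives / top_k


def fraction_reaching(phr_values: Sequence[float], level: float) -> float:
    """Fraction of queries whose PHR is at least the given level."""
    return sum(1 for p in phr_values if p >= level) / len(phr_values)
```

For example, a tRNA query whose five best hits are tRNA, tmRNA, tRNA, tRNA, and RNaseP has a PHR of 3/5 under this definition.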
Figure 4.6 Blast search result
A BLAST search using the RNA sequences was performed on the manually curated dataset.
[Figure 4.6 panels: "Blast Search Top 5 hits" and "Blast Search Top 10 hits". x-axis: Positive hit ratio (0 to 1); y-axis: Percentage (0 to 1); series: All, tRNA, RNaseP, group1, tmRNA.]
Table 4.1 Kolmogorov-Smirnov test results
Kolmogorov-Smirnov test p-values
                    vs. blast search      vs. size only         vs. embedment
                    top 5     top 10      top 5     top 10      top 5     top 10
tRNA_M              0.0657    0.0657      0.999     0.0071      0.8998    0.3544
RNaseP_M            0.2307    0.005       0         0.0009      1         1
Group_I_Intron_M    0.0059    0.001       0         0           1         0.5769
tmRNA_M             0.866     0.058       0         0           0.9631    0.9919
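For readers unfamiliar with the test, the following minimal Python sketch computes the two-sample Kolmogorov-Smirnov statistic D, the largest absolute difference between two empirical cumulative distributions. It assumes that the comparisons in Table 4.1 are two-sample tests between PHR distributions; the function name ks_statistic is hypothetical, and the conversion of D to a p-value (normally done by a statistics package) is omitted.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    xs, ys = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those is sufficient.
    for value in xs + ys:
        cdf_a = bisect_right(xs, value) / len(xs)
        cdf_b = bisect_right(ys, value) / len(ys)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Identical samples give D = 0, and completely separated samples give D = 1; a small D (large p-value) indicates that the two PHR distributions are not distinguishable.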
Table 4.2 Statistics of embedding simulation
                            tRNA     RNAseP   group1   tmRNA    All
Simulated graph number      237      567      631      1778     3213
Simulation per structure    14.81    15.75    15.78    15.20    15.37
Min graph size              4        6        12       5
Max graph size              12       29       33       28
Average graph size          7.80     15.88    22.42    19.48
CHAPTER 5 IDENTIFICATION OF TOPOLOGICAL FEATURES THAT DISCRIMINATE BETWEEN RNA CLASSES
5.1 Introduction
5.1.1 RNA importance, RNA function determined by RNA structure
Like proteins, RNAs perform important cellular functions, and our understanding of
this fact is increasing rapidly (172,173). As our knowledge about RNA grows, the
large-scale characterization and analysis of RNA structures and functions, namely the
structural genomics of RNA, becomes increasingly important (148,174). The core focus
of the structural genomics of RNA is to find all unique structural motifs and 3D folds,
because molecular structure determines function. Current RNA function prediction is mostly
based on finding conserved sequence motifs, similar to what is done with proteins. To
identify sequence motifs, multiple sequence alignments have to be generated.
In principle, conserved motifs can be identified from the alignments and the function of
the RNAs predicted. The problem is that, compared to proteins, few RNA
classes are currently known, and RNA sequences sharing the same structural motifs may
have no detectable primary sequence similarity, making it impossible to align them. RNA
secondary structure prediction can help the prediction and identification of conserved
tertiary structure, but accurately predicting RNA secondary structures from sequence
information alone is not trivial.
As an alternative approach, graphs have been used to represent RNA secondary
structures. Notable efforts in this direction include the RNAshapes approach of the
Giegerich group (98), but RNAshapes does not include pseudoknots.
Another graphical approach is RAG (RNA-As-Graphs) of the Schlick group (168). However,
RAG enumerates all mathematically possible graphs, and thus the search space is very
large.
Reliable RNA secondary structural information is needed to solve the RNA function
prediction problem because experimental determination of large RNA structures is very
difficult. While there are many RNA secondary structure prediction programs, few of
them can predict the key elements called pseudoknots. Pseudoknots are a common
structural motif in many RNA classes, such as self-splicing introns and telomerase. They
play important roles in the catalytic functions of RNA, such as forming the catalytic core of
ribozymes, and in altering gene expression by inducing ribosomal frameshifting in many
viruses (175). The ability to annotate novel RNA secondary structures can give insight
into the possible different functions and roles of RNA. In addition, the ability to find
novel RNA secondary structures can help with the design of pharmaceuticals by
providing an accurate target site for drug recognition.
5.1.2 Contribution
Biological RNAs have different topological features than random RNAs (98,168,174). By
examining RNA families, selecting the most important features, and filtering out
common features shared by different families, one should be able to identify the unique
features contributing to the unique functions of different RNA classes. This would
constitute a mapping between topological features and function. Such a
topology-function mapping would lead to a better understanding of RNA structural patterns
and to more efficient engineering and design of RNA-based drugs and complexes with
specific functions and effects. In addition, identifying the important topological
features in biological RNA molecules would help us further refine our motif library to
contain the most discriminatory structural motifs.
5.2 Material and Methods
5.2.1 Reverse cIndex basic feature selection on RNA fingerprints
For each specific RNA family, the members of the family should share similar structural
patterns (topological features) because they perform similar biological functions. On the
other hand, different RNA families should contain a lower fraction of similar structural
patterns because they play different roles in the biological processes. Feature selection,
that is, identification of the features that most powerfully discriminate between
different classes of RNAs, based on RNA fingerprints should provide useful information
about which structural motifs contribute to a specific RNA family and, implicitly, to
specific RNA biological functions.
Our feature selection strategy is based on the cIndex idea (155) (Figure 5.1), but uses the
opposite selection order. The cIndex strategy selects features from the most frequent to the
least frequent; our strategy selects features from the rarest to the most common.
We refer to this algorithm as cIndex-Basic-Min. Algorithm 5.1 outlines its
pseudocode. A graph feature matrix is used to show the containment relationship
between features and graphs (Figure 5.2). Its (i,j) value gives the count of feature i found
in graph j. The support of a feature f in the graph feature matrix is the number of graphs in
the matrix that contain f. For a given graph feature matrix M, the algorithm first
selects a feature with minimum support (greater than or equal to 1), then removes this
feature (row) and all graphs that have this feature (columns). This process is repeated until
the graph feature matrix is empty. The rationale for reversing the cIndex feature
selection order is that, in our study, we have many small structural motifs in the library.
Those small motifs appear randomly in every structure. If we used the original
cIndex algorithm, those small motifs would be selected as important features in the first
few iterations. The graph feature matrix would then be empty, and the algorithm would fail
to capture the truly important structural motif features.
Algorithm 5.1 cIndex-Basic-Min
Input: Graph feature matrix M.
Output: Selected feature list F.
  Set the selected feature list F as an empty list { }
  FOR each feature f in M
    IF support*(f) > 0
      Index f by its support(f) value
    END IF
  END FOR
  REPEAT
    Feature is the list of all features with the minimum support, support(f), from the index
    Append the features from the list Feature to F
    Record the iteration number iter and support(f)
    FOR each feature from the list Feature
      Find the corresponding row and delete all columns with non-zero values (remove feature hits) in M
      Delete the corresponding row in M
    END FOR
  UNTIL matrix M is empty
  RETURN selected feature list F
* The support of a feature f is the number of graphs in the graph feature matrix that contain f.
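A minimal Python sketch of the rarest-first selection is given below for illustration. It is not the implementation used in this work. It assumes that the graph feature matrix is stored as a dictionary mapping each feature to its per-graph counts, that support is recomputed on the remaining graphs at every iteration, and that the matrix is considered empty once no remaining feature has support of at least 1. The function name cindex_basic_min and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def cindex_basic_min(matrix: Dict[str, Dict[str, int]]) -> List[Tuple[str, int, int]]:
    """Select features from rarest to most common.

    matrix maps feature id -> {graph id: count}. Returns (feature, iteration,
    support) triples in selection order.
    """
    # For each feature, the set of graphs that contain it (count > 0).
    hits: Dict[str, Set[str]] = {
        f: {g for g, count in row.items() if count > 0} for f, row in matrix.items()
    }
    selected: List[Tuple[str, int, int]] = []
    iteration = 0
    while True:
        # Features with no remaining graphs have support 0 and are dropped.
        hits = {f: graphs for f, graphs in hits.items() if graphs}
        if not hits:
            break
        iteration += 1
        min_support = min(len(graphs) for graphs in hits.values())
        batch = sorted(f for f, graphs in hits.items() if len(graphs) == min_support)
        covered: Set[str] = set()
        for f in batch:
            selected.append((f, iteration, min_support))
            covered |= hits.pop(f)  # delete the feature row
        for f in hits:
            hits[f] -= covered  # delete the graph columns hit by the batch
    return selected
```

In a toy matrix where a small motif occurs in every graph, the rarer features are selected first and cover all graphs, so the ubiquitous motif is never selected. This is the behavior that motivates reversing the cIndex order.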
5.2.2 RNA structure classification
With the information gathered by cIndex-Basic-Min feature selection, we have
developed a classification method to classify RNA structures.
The classification scoring function is based on the following four factors. 1) feature
number (the number of features in each RNA family); 2) iteration number (for a feature
selected in iteration n of the cIndex-Basic-Min procedure, n is the iteration number); 3)
feature support (number of structures containing this feature); 4) feature size (number
of stems in the feature graph).
1. Iteration function: $\mathrm{iter}(i) = -[\ldots] + [\ldots]$, where variable $i$ is the iteration number.
2. Support function: $\mathrm{support}(s, i) = s \times \mathrm{iter}(i)$, where variable $s$ is the support.
Support is the number of graphs in the graph feature matrix that contain the specific feature. The support function is a combination of the support and the iteration function, so features with the same support do not necessarily get the same weight. More weight is given to common features.
3. Feature size function: $\mathrm{size}(fs) = fs^2$, where variable $fs$ is the feature size.
We give higher weights to bigger structural motif features.
Overall classification scoring function:
$\mathrm{score} = \mathrm{iter}(i) \times \mathrm{support}(s, i) \times \mathrm{size}(fs) / \mathrm{feature\_number}$.
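The way the four factors combine can be sketched in Python as follows. This is a minimal sketch under two stated assumptions: the support function is taken to be the product of the support and the iteration-function value (the text describes it as a combination of the two), and the value of the iteration function is passed in precomputed as iter_value, because its exact form is not specified here. The name feature_weight and its parameter names are hypothetical.

```python
def feature_weight(iter_value: float, support: int, feature_size: int, feature_number: int) -> float:
    """Weight of one selected feature for one RNA family (illustrative only).

    iter_value     -- value of the iteration function for this feature (assumed given)
    support        -- number of graphs containing the feature
    feature_size   -- number of stems in the feature graph
    feature_number -- number of features selected for the RNA family
    """
    support_term = support * iter_value  # assumed form of the support function
    size_term = feature_size ** 2        # feature size function: fs squared
    return iter_value * support_term * size_term / feature_number
```

Under these assumptions, a 3-stem feature with support 4 and an iteration value of 0.5, in a family with 10 selected features, receives a weight of 0.5 x 2 x 9 / 10 = 0.9.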
The classification scoring function associates each selected structural motif feature with
an RNA family, and gives each feature a specific weight for the later classification
calculation.
The RNA structural classification is based on the features found in the query and their
corresponding weights (scores). The query RNA structure receives a score for each RNA
family in the database. The query structure is assigned to the RNA family with the highest
score.
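The classification step can be illustrated with the following minimal Python sketch. It assumes that a query's score for a family is the sum of the weights of that family's selected features found in the query; the aggregation rule is an assumption, and the names classify and family_weights are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def classify(
    query_features: Iterable[str],
    family_weights: Dict[str, Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Score a query against every family and return the best family and all scores."""
    present = set(query_features)
    scores = {
        family: sum(weight for feature, weight in weights.items() if feature in present)
        for family, weights in family_weights.items()
    }
    # The query is assigned to the family with the highest score.
    best_family = max(scores, key=scores.get)
    return best_family, scores
```

Returning the full score dictionary alongside the winning family makes it possible to inspect how close the competing families were for a given query.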
This feature selection process finds the features that are important to a group of RNA
structures. These features can be associated with functions in known families and can aid in
forming hypotheses about the function of novel RNAs.
5.2.3 RNA structure datasets
The datasets used in this work are a manually curated dataset (Table 1.1) and a dataset
collected from the STRAND database (Table 1.5).
5.3 Results
5.3.1 Feature selection on the fingerprint generated
Using the cIndex-Basic-Min feature selection strategy, we have identified
features that are important to specific RNA families in our manually curated dataset
(Table 1.1), as well as in a dataset downloaded from the STRAND database (Table 1.5).
Each dataset contains only non-redundant sequences with low sequence similarity (<
50%). The manually curated dataset contains more reliable RNA secondary structures
(the curation process is described in section 1.7.1). Because the STRAND dataset is collected
from many different sources, noise and partial structures are likely to be more common in
this dataset.
Table 5.1 shows the statistics of the selected structural motif features. Figure 5.3 shows
that in our feature selection strategy, higher weights are given to the features that are
neither too rare nor too common. This agrees with the well-accepted observation that such
features are the most informative (176).
5.3.2 Top unique features selected (in order of weight)
Figure 5.4 and Figure 5.5 show the top four structural features found to be the most
important in discriminating each of the RNA families in our datasets (Table 1.1 and
Table 1.5). These top structural motif features highlight characteristics known to be
important in the RNA families (Figure 5.6). Conceptually, this provides us with a means to
decode the relationship between structure and function. In other words, we can
understand the mapping from structural motif features to RNA biological functions. One
can ask whether these features are specific enough, since the motif sizes in the
structural motif library range only from 1 to 7 stems. From the result in Figure 5.6, we can
see that they are sufficient to describe the RNA family structure.
5.3.3 Validation of RNA structure classification
Table 5.2 shows the RNA structure classification result. The performance of the
classification is good for most of the RNA families included here. Only RNaseP (STRAND)
performs somewhat worse (0.78). This reflects the poorer curation of the structures in the
STRAND database; the RNaseP dataset we downloaded from the STRAND database contains many
small or partial structures, which can easily be misclassified.
To further evaluate the classification performance, we performed leave-one-out cross
validation (LOOCV). The LOOCV classification rates are shown in Table 5.3. All the RNA
families show a high LOOCV classification rate, with the exception of RNaseP STRAND (0.58),
discussed above. This is expected, since that dataset contains small or partial structures
and the performance on that family is not as good as on the other families.
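The LOOCV procedure can be sketched generically in Python as follows. This is a minimal sketch, not the evaluation code used in this work: it assumes that the model (for example, the selected features and their weights) is rebuilt from all remaining structures each time one structure is held out. The names loocv_rate, train, and predict are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def loocv_rate(
    samples: Sequence[Tuple[Sample, str]],
    train: Callable[[List[Tuple[Sample, str]]], Model],
    predict: Callable[[Model, Sample], str],
) -> float:
    """Fraction of samples classified correctly when each one is held out in turn."""
    correct = 0
    for i, (item, label) in enumerate(samples):
        # Train on every labelled sample except the held-out one.
        training = [s for j, s in enumerate(samples) if j != i]
        model = train(training)
        if predict(model, item) == label:
            correct += 1
    return correct / len(samples)
```

Passing train and predict as callables keeps the loop independent of how features are selected and scored.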
Finally, we classified RNA structures collected from the STRAND source by using the
features selected from the manually curated dataset, and vice versa (Table 5.4).
When features selected from the manually curated dataset are used to classify STRAND
structures, the overall performance is weaker than the results above: the correct
classification rates for RNaseP STRAND and Group I intron STRAND are 0.58 and 0.57,
respectively. Again, we note that the STRAND dataset contains partial structures and
misclassified structures. These
structures are likely to be misclassified, either because they actually belong to other
classes, or because their small size interrupts or truncates the selected features on
which the classification is based.
On the other hand, when features learned from the STRAND dataset are used to classify the
manually curated structures, the performance is very good, with one exception that we need
to emphasize: when the manually curated tRNA structures are classified using the features
selected from tRNA STRAND, only 1 of 16 structures is correctly classified. This is what we
expect to see, because the manually curated tRNA data are based on three-dimensional
structures in PDB and therefore contain pseudoknots. Conversely, when classifying
tRNA STRAND data using the features selected from the manually curated tRNA data, the
performance (0.93) is nearly as good as when using the features selected from tRNA STRAND
data (0.97); STRAND data are based on secondary structure and do not contain pseudoknots.
This indirectly shows the robustness of our feature selection method.
5.4 Discussion
This feature selection process should be able to find the structural features that are
important to different RNA families. Identification of these features, in turn, should
help create a mapping between structural features and biological function.
The maximum structural motif size used in this study is only 21 edges (7 stems), which is
relatively small compared with the large, complicated graphs seen in biological RNA
structures. This limitation can be overcome by enumerating subgraphs of biological
structures, which would generate bigger motifs that are subgraphs of large
biological structures. As we collect more biological RNA structure data, we will be able
to add bigger structural motifs to the motif library. That would provide bigger,
biologically relevant structural motifs. The feature selection result would be more
specific and, possibly, more biologically meaningful.
However, our results show that the 1 to 7 stem structural motifs are already
sufficient to describe RNA family structures well. As the size of the structural motifs
increases, a match to a structure becomes more specific. In our situation, we would
like the matching of the structural motifs to be neither too specific nor too
random (176). More work is needed to understand what structural motif size
achieves the best performance.
From our classification test (Table 5.4), we found that features learned from our reliable
dataset (the manually curated dataset) perform poorly against more poorly annotated
datasets. Interestingly, features learned from the lower-quality dataset (the STRAND
dataset) can give very good classification results. We believe
that the features learned from a low-quality dataset (such as STRAND) still tend
to include features that are representative of the dominant family of structures in the
dataset, which can represent the specific RNA family, as well as extraneous features
arising from misclassified structures. This feature set can be considered a superset
of the features that are important to a specific RNA family. Because the superset includes
all the good features, the reliable RNA structures can still be correctly classified. This
suggests that our approach can detect the poorer quality of the uncurated dataset.
Another use of our feature selection framework is to clean up
RNA structure data in publicly accessible databases. Obtaining a gold-standard structure
dataset is currently a tedious task for RNA structure analysts, and our
strategy provides a potential solution.
We are continuously retrieving RNA sequence and structure information from available
public databases (NCBI, RFAM, RNASPE database, etc.), and building a comprehensive
RNA fingerprint database containing the structural features, sequence, and function for
each entry. The sequence and function information for the known structures are
available to aid in assignment of function to novel RNA queries.
In the pharmaceutical industry, the design of inhibitory or therapeutic RNAs often begins
with randomly generated RNA sequences of specific lengths, which are then tested for
specific function(s). This random approach is both time
consuming and costly because of the combinatorial search space. With the help of our feature
selection approach, we can identify structural motifs from a relevant biological
sequence pool that are likely to perform the desired functions (based on similarity
to known molecules). For researchers, this significantly reduces the search space for
identifying functional RNA molecules of therapeutic use and saves time and expense.
Overall, by exploiting molecular structure, our application could
benefit biologists in many ways. Our web service is freely accessible to the public
at http://xios.genomics.purdue.edu.
5.5 Future directions
The RNA structure classification problem is complicated. Other feature
selection and extraction methods, such as PCA, SVM, cosine-distance clustering,
machine learning methods (e.g., Naive Bayes), and semi-supervised statistical methods, could
be applied. In the end, we would like to find the "sequence to structure N to 1 mapping" as
well as the "structure to function N to N mapping". With this knowledge, RNA family
classification, annotation, and RNA structural and functional prediction would greatly
benefit from our approach.
Figure 5.1 Idea of graph containment search.
Modified from (155).
Figure 5.2 Graph feature matrix.
This matrix describes which graphs (g) in the database contain a given feature (f).
Figure 5.3 Feature selection weight vs. iteration
[Figure 5.3 panels: "Weight/Score vs iteration" for the manually curated dataset (series: tRNA, group1, RNAseP, tmRNA) and for the STRAND dataset (series: tRNA STRAND, group1 STRAND, RNAseP STRAND, tmRNA STRAND). x-axis: iteration number (0 to 1); y-axis: weight/score (0 to 1).]
tRNA manually curated:
5_stem_motif_253 4_stem_motif_5 4_stem_motif_26 3_stem_motif_1
RNAseP manually curated:
7_stem_motif_38580 7_stem_motif_13161 7_stem_motif_6225 7_stem_motif_11091
Group I intron manually curated:
6_stem_motif_2232 5_stem_motif_41 5_stem_motif_35 4_stem_motif_20
tmRNA manually curated:
7_stem_motif_13849 7_stem_motif_31245 7_stem_motif_9944 7_stem_motif_14051
Figure 5.4 Selected top unique structural features in dataset Table 1.1.
Structural features selected by using Algorithm 5.1 for RNA families in the manually curated dataset. The features appear in order of their weight for the specific RNA family.
tRNA STRAND:
4_stem_motif_27 5_stem_motif_211 5_stem_motif_180 3_stem_motif_1
RNAseP STRAND:
7_stem_motif_36950 7_stem_motif_49000 7_stem_motif_11091 7_stem_motif_18408
Group I intron STRAND:
7_stem_motif_27347 7_stem_motif_38147 7_stem_motif_37624 7_stem_motif_45173
tmRNA STRAND:
7_stem_motif_13161 7_stem_motif_32198 7_stem_motif_36615 7_stem_motif_15157
Figure 5.5 Selected top unique structural features in dataset Table 1.5.
Structural features selected by using Algorithm 5.1 for RNA families in the STRAND dataset. The features appear in order of their weight for the specific RNA family.
Figure 5.6 Link from structure to function.
Left, selected features for RNaseP from the manually curated and STRAND datasets (blue frame), together with the RNaseP secondary structure. Right, selected features for tRNA from the manually curated and STRAND datasets (blue frame), together with the tRNA secondary structure and the pseudoknots found in the 3D structure.
Table 5.1 Statistics of the selected structural features for four RNA families from two datasets.
Number of selected features per RNA family
Dataset             tRNA    RNaseP   Group I intron   tmRNA
Manually curated    25      1307     131              276
STRAND              15      753      85               176
Table 5.2 Classification performance
RNA family                         Classification performance
tRNA manually curated              16 out of 16 (1.00)
RNaseP manually curated            40 out of 40 (1.00)
Group I intron manually curated    36 out of 36 (1.00)
tmRNA manually curated             115 out of 117 (0.98)
tRNA STRAND                        585 out of 601 (0.97)
RNaseP STRAND                      28 out of 36 (0.78)
Group I intron STRAND              21 out of 21 (1.00)
tmRNA STRAND                       30 out of 30 (1.00)
Table 5.3 Leave one out cross validation result
RNA family                         LOOCV    Sample size
tRNA manually curated              0.94     16
RNaseP manually curated            1.00     40
Group I intron manually curated    1.00     36
tmRNA manually curated             0.99     117
tRNA STRAND                        0.97     601
RNaseP STRAND                      0.58     36
Group I intron STRAND              1.00     21
tmRNA STRAND                       1.00     30
Table 5.4 Classification test
Manually curated features tested on STRAND data
RNA family                         Classification performance
tRNA STRAND                        558 out of 601 (0.93)
RNaseP STRAND                      21 out of 36 (0.58)
Group I intron STRAND              12 out of 21 (0.57)
tmRNA STRAND                       30 out of 30 (1.00)
STRAND features tested on manually curated data
RNA family                         Classification performance
tRNA manually curated              1 out of 16 (0.06)
RNaseP manually curated            40 out of 40 (1.00)
Group I intron manually curated    34 out of 36 (0.94)
tmRNA manually curated             115 out of 117 (0.98)
LIST OF REFERENCES
1. Reuter, J.S. and Mathews, D.H. (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129-137.
2. Crick, F.H. (1958) On protein synthesis. Symp Soc Exp Biol, 12, 138-163.
3. Mills, D.R., Peterson, R.L. and Spiegelman, S. (1967) An extracellular Darwinian experiment with a self-duplicating nucleic acid molecule. Proc Natl Acad Sci U S A, 58, 217-224.
4. Spiegelman, S. (1971) An approach to the experimental analysis of precellular evolution. Q Rev Biophys, 4, 213-253.
5. Kramer, F.R., Mills, D.R., Cole, P.E., Nishihara, T. and Spiegelman, S. (1974) Evolution in vitro: sequence and phenotype of a mutant RNA resistant to ethidium bromide. Journal of molecular biology, 89, 719-736.
6. Eigen, M. (1971) Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58, 465-523.
7. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1983) Kinetics of RNA replication. Biochemistry, 22, 2544-2559.
8. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1985) Kinetics of RNA replication: competition and selection among self-replicating RNA species. Biochemistry, 24, 6550-6560.
9. Biebricher, C.K. (1987) Replication and evolution of short-chained RNA species replicated by Q beta replicase. Cold Spring Harb Symp Quant Biol, 52, 299-306.
10. Woese, C. (1967) The Genetic Code: The Molecular Basis for Genetic Expression. Harper.
11. Cech, T. (1986) RNA as an enzyme. Scientific American, 255, 64-75.
12. Cech, T.R. (1990) Self-splicing of group I introns. Annu Rev Biochem, 59, 543-568.
13. Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E. and Cech, T.R. (1982) Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31, 147-157.
14. Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N. and Altman, S. (1983) The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35, 849-857.
15. Guerrier-Takada, C. and Altman, S. (1984) Catalytic activity of an RNA molecule prepared by transcription in vitro. Science, 223, 285-286.
16. Gilbert, W. (1986) Origin of life: The RNA world. Nature, 319, 618-618.
17. Joyce, G.F. (1989) RNA evolution and the origins of life. Nature, 338, 217-224.
18. Joyce, G.F. (1991) The rise and fall of the RNA world. New Biol, 3, 399-407.
19. Freeland, S.J., Knight, R.D. and Landweber, L.F. (1999) Do Proteins Predate DNA? Science, 286, 690-692.
20. Watson, J.D. and Crick, F.H. (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171, 737-738.
21. Crick, F.H. (1966) Codon--anticodon pairing: the wobble hypothesis. Journal of molecular biology, 19, 548-555.
22. Pyle, A.M., Murphy, F.L. and Cech, T.R. (1992) RNA substrate binding site in the catalytic core of the Tetrahymena ribozyme. Nature, 358, 123-128.
23. Cate, J.H., Gooding, A.R., Podell, E., Zhou, K., Golden, B.L., Kundrot, C.E., Cech, T.R. and Doudna, J.A. (1996) Crystal Structure of a Group I Ribozyme Domain: Principles of RNA Packing. Science, 273, 1678-1685.
24. Unrau, P.J. and Bartel, D.P. (1998) RNA-catalysed nucleotide synthesis. Nature, 395, 260-263.
25. Illangasekare, M. and Yarus, M. (1999) A tiny RNA that catalyzes both aminoacyl-RNA and peptidyl-RNA synthesis. RNA, 5, 1482-1489.
26. Lee, N., Bessho, Y., Wei, K., Szostak, J.W. and Suga, H. (2000) Ribozyme-catalyzed tRNA aminoacylation. Nat Struct Biol, 7, 28-33.
27. Johnston, W.K., Unrau, P.J., Lawrence, M.S., Glasner, M.E. and Bartel, D.P. (2001) RNA-catalyzed RNA polymerization: accurate and general RNA-templated primer extension. Science, 292, 1319-1325.
28. Baskerville, S. and Bartel, D.P. (2002) A ribozyme that ligates RNA to protein. Proceedings of the National Academy of Sciences of the United States of America, 99, 9154-9159.
29. Joyce, G.F. (2002) The antiquity of RNA-based evolution. Nature, 418, 214-221.
30. Serganov, A. and Patel, D.J. (2007) Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat Rev Genet, 8, 776-790.
31. Strobel, S.A. and Cochrane, J.C. (2007) RNA catalysis: ribozymes, ribosomes, and riboswitches. Curr Opin Chem Biol, 11, 636-643.
32. Jeffares, D.C., Poole, A.M. and Penny, D. (1998) Relics from the RNA world. J Mol Evol, 46, 18-36.
33. Moore, P.B. and Steitz, T.A. (2002) The involvement of RNA in ribosome function. Nature, 418, 229-235.
34. Doudna, J.A. and Cech, T.R. (2002) The chemical repertoire of natural ribozymes. Nature, 418, 222-228.
35. Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engstrom, P.G., Lenhard, B., Aturaliya, R.N., Batalov, S., Beisel, K.W. et al. (2006) Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet, 2, e62.
36. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.
37. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res, 16, 11-19.
38. Majdalani, N., Chen, S., Murrow, J., St John, K. and Gottesman, S. (2001) Regulation of RpoS by a novel small RNA: the characterization of RprA. Mol Microbiol, 39, 1382-1394.
39. Havilio, M., Levanon, E.Y., Lerman, G., Kupiec, M. and Eisenberg, E. (2005) Evidence for abundant transcription of non-coding regions in the Saccharomyces cerevisiae genome. BMC Genomics, 6, 93-100.
40. David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A, 103, 5320-5325.
41. Manak, J.R., Dike, S., Sementchenko, V., Kapranov, P., Biemar, F., Long, J., Cheng, J., Bell, I., Ghosh, S., Piccolboni, A. et al. (2006) Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature genetics, 38, 1151-1158.
42. Miura, F., Kawaguchi, N., Sese, J., Toyoda, A., Hattori, M., Morishita, S. and Ito, T. (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc Natl Acad Sci U S A, 103, 17846-17851.
43. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Research, 16, 11-19.
44. He, H., Wang, J., Liu, T., Liu, X.S., Li, T., Wang, Y., Qian, Z., Zheng, H., Zhu, X., Wu, T. et al. (2007) Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Res, 17, 1471-1477.
45. Li, D., Willkomm, D.K., Schon, A. and Hartmann, R.K. (2007) RNase P of the Cyanophora paradoxa cyanelle: a plastid ribozyme. Biochimie, 89, 1528-1538.
46. Wilhelm, B.T., Marguerat, S., Watt, S., Schubert, F., Wood, V., Goodhead, I., Penkett, C.J., Rogers, J. and Bahler, J. (2008) Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453, 1239-1243.
47. Bompfunewerer, A.F., Flamm, C., Fried, C., Fritzsch, G., Hofacker, I.L., Lehmann, J., Missal, K., Mosig, A., Muller, B., Prohaska, S.J. et al. (2005) Evolutionary patterns of non-coding RNAs. Theory Biosci, 123, 301-369.
48. Caetano-Anollés, G. (2010) Evolutionary Genomics and Systems Biology. Wiley-Blackwell.
49. Staple, D.W. and Butcher, S.E. (2005) Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3, e213.
50. Puglisi, J.D., Wyatt, J.R. and Tinoco, I. (1991) RNA pseudoknots. Accounts of Chemical Research, 24, 152-158.
51. Mans, R.M., Van Steeg, M.H., Verlaan, P.W., Pleij, C.W. and Bosch, L. (1992) Mutational analysis of the pseudoknot in the tRNA-like structure of turnip yellow mosaic virus RNA. Aminoacylation efficiency and RNA pseudoknot stability. Journal of molecular biology, 223, 221-232.
52. Mans, R.M., Pleij, C.W. and Bosch, L. (1991) tRNA-like structures. Structure, function and evolutionary significance. Eur J Biochem, 201, 303-324.
53. Brierley, I., Rolley, N.J., Jenner, A.J. and Inglis, S.C. (1991) Mutational analysis of the RNA pseudoknot component of a coronavirus ribosomal frameshifting signal. Journal of molecular biology, 220, 889-902.
54. Tzeng, T.H., Tu, C.L. and Bruenn, J.A. (1992) Ribosomal frameshifting requires a pseudoknot in the Saccharomyces cerevisiae double-stranded RNA virus. J Virol, 66, 999-1006.
55. Chamorro, M., Parkin, N. and Varmus, H.E. (1992) An RNA pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger RNA. Proc Natl Acad Sci U S A, 89, 713-717.
56. ten Dam, E.B., Pleij, C.W. and Bosch, L. (1990) RNA pseudoknots: translational frameshifting and readthrough on viral RNAs. Virus Genes, 4, 121-136.
57. Dinman, J.D., Icho, T. and Wickner, R.B. (1991) A -1 ribosomal frameshift in a double-stranded RNA virus of yeast forms a gag-pol fusion protein. Proc Natl Acad Sci U S A, 88, 174-178.
58. Wills, N.M., Gesteland, R.F. and Atkins, J.F. (1991) Evidence that a downstream pseudoknot is required for translational read-through of the Moloney murine leukemia virus gag stop codon. Proceedings of the National Academy of Sciences, 88, 6991-6995.
59. Gallie, D.R., Feder, J.N., Schimke, R.T. and Walbot, V. (1991) Functional analysis of the tobacco mosaic virus tRNA-like structure in cytoplasmic gene regulation. Nucleic acids research, 19, 5031-5036.
60. Westhof, E. and Jaeger, L. (1992) RNA pseudoknots. Current Opinion in Structural Biology, 2, 327-333.
61. Nussinov, R., Pieczenik, G., Griggs, J.R. and Kleitman, D.J. (1978) Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics, 35, 68-82.
62. Konings, D.A. and Hogeweg, P. (1989) Pattern analysis of RNA secondary structure similarity and consensus of minimal-energy folding. Journal of molecular biology, 207, 597-614.
63. Le, S.-Y., Nussinov, R. and Maizel, J.V. (1989) Tree graphs of RNA secondary structures and their comparisons. Computers and Biomedical Research, 22, 461-473.
64. Zuker, M. and Sankoff, D. (1984) RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46, 591-621.
65. Shapiro, B.A. and Zhang, K. (1990) Comparing multiple RNA secondary structures using tree comparisons. Computer applications in the biosciences : CABIOS, 6, 309-318.
66. Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic acids research, 22, 2079-2088.
67. Zuker, M. (1989), Science, Vol. 244, pp. 48-52.
68. Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization. Methods Mol. Biol, 25, 267-294.
69. McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105-1119.
70. Cook, S.A. (1971), Proceedings of the third annual ACM symposium on Theory of computing. ACM, Shaker Heights, Ohio, United States, pp. 151-158.
71. Hagadone, T.R. (1992) Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 32, 515-521.
72. Willett, P., Barnard, J.M. and Downs, G.M. (1998) Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 38, 983-996.
73. Hansch, C., Muir, R.M., Fujita, T., Maloney, P.P., Geiger, F. and Streich, M. (1963) The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. Journal of the American Chemical Society, 85, 2817-2824.
74. Gan, H.H., Pasquali, S. and Schlick, T. (2003) Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic acids research, 31, 2926-2943.
75. Harary, F. and Prins, G. (1959) The number of homeomorphically irreducible trees, and other species. Acta Mathematica, 101, 141-162.
76. Harary, F. (1969) Graph Theory. Addison-Wesley, Reading, MA.
77. Schuster, P. (1997) Genotypes with phenotypes: adventures in an RNA toy world. Biophys Chem, 66, 75-110.
78. Gierasch, L.M. and King, J. (eds.) (1990) Protein Folding: Deciphering the Second Half of the Genetic Code. American Association for the Advancement of Science.
79. Doudna, J.A. (2000) Structural genomics of RNA. Nat Struct Biol, 7 Suppl, 954-956.
80. Ogurtsov, A.Y., Shabalina, S.A., Kondrashov, A.S. and Roytberg, M.A. (2006) Analysis of internal loops within the RNA secondary structure in almost quadratic time. Bioinformatics (Oxford, England), 22, 1317-1324.
81. Do, C.B., Woods, D.A. and Batzoglou, S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (Oxford, England), 22, e90-98.
82. Flamm, C., Fontana, W., Hofacker, I.L. and Schuster, P. (2000) RNA folding at elementary step resolution. RNA, 6, 325-338.
83. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9, 133-148.
84. Ying, X., Luo, H., Luo, J. and Li, W. (2004) RDfolder: a web server for prediction of RNA secondary structure. Nucleic acids research, 32, W150-153.
85. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M. and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly, 125, 167-188.
86. Hofacker, I.L. and Stadler, P.F. (2006) Memory efficient folding algorithms for circular RNA secondary structures. Bioinformatics (Oxford, England), 22, 1172-1176.
87. Danilova, L.V., Pervouchine, D.D., Favorov, A.V. and Mironov, A.A. (2006) RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J Bioinform Comput Biol, 4, 589-596.
88. Ding, Y., Chan, C.Y. and Lawrence, C.E. (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic acids research, 32, W135-141.
89. Dawson, W., Fujiwara, K., Kawai, G., Futamura, Y. and Yamamoto, K. (2006) A method for finding optimal RNA secondary structures using a new entropy model (vsfold). Nucleosides Nucleotides Nucleic Acids, 25, 171-189.
90. Ren, J., Rastegari, B., Condon, A. and Hoos, H.H. (2005) HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11, 1494-1504.
91. Huang, C.H., Lu, C.L. and Chiu, H.T. (2005) A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics (Oxford, England), 21, 3501-3508.
92. Xayaphoummine, A., Bucher, T. and Isambert, H. (2005) Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic acids research, 33, W605-610.
93. Zadeh, J.N., Steenberg, C.D., Bois, J.S., Wolfe, B.R., Pierce, M.B., Khan, A.R., Dirks, R.M. and Pierce, N.A. (2011) NUPACK: Analysis and design of nucleic acid systems. J Comput Chem, 32, 170-173.
94. Reeder, J. and Giegerich, R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104-115.
95. Rivas, E. and Eddy, S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of molecular biology, 285, 2053-2068.
96. Huang, X. and Ali, H. (2007) High sensitivity RNA pseudoknot prediction. Nucleic acids research, 35, 656-663.
97. Wuchty, S., Fontana, W., Hofacker, I.L. and Schuster, P. (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49, 145-165.
98. Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J. and Giegerich, R. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics (Oxford, England), 22, 500-503.
99. Clote, P. (2005) RNALOSS: a web server for RNA locally optimal secondary structures. Nucleic acids research, 33, W600-W604.
100. Markham, N.R. and Zuker, M. (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol, 453, 3-31.
101. Shapiro, B.A., Kasprzak, W., Grunewald, C. and Aman, J. (2006) Graphical exploratory data analysis of RNA secondary structure dynamics predicted by the massively parallel genetic algorithm. J Mol Graph Model, 25, 514-531.
102. Tinoco, I., Jr., Borer, P.N., Dengler, B., Levin, M.D., Uhlenbeck, O.C., Crothers, D.M. and Bralla, J. (1973) Improved estimation of secondary structure in ribonucleic acids. Nat New Biol, 246, 40-41.
103. Tinoco, I., Jr., Uhlenbeck, O.C. and Levine, M.D. (1971) Estimation of secondary structure in ribonucleic acids. Nature, 230, 362-367.
104. Bellman, R. (1952) On the Theory of Dynamic Programming. Proc Natl Acad Sci U S A, 38, 716-719.
105. Nussinov, R. and Jacobson, A.B. (1980) Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA. Proc. Natl. Acad. Sci. U. S. A., 77, 6309-6313.
106. Reeder, J., Steffen, P. and Giegerich, R. (2007) pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Res., 35, W320-W324.
107. Sperschneider, J., Datta, A. and Wise, M.J. (2011) Heuristic RNA pseudoknot prediction including intramolecular kissing hairpins. RNA, 17, 27-38.
108. Sperschneider, J. and Datta, A. (2010) DotKnot: pseudoknot prediction using the probability dot plot under a refined energy model. Nucleic acids research, 38, e103.
109. Schreiber, S.L. (2000) Target-Oriented and Diversity-Oriented Organic Synthesis in Drug Discovery. Science, 287, 1964-1969.
110. Joyce, G.F. (1992) Directed molecular evolution. Sci Am, 267, 90-97.
111. Kauffman, S.A. (1986) Autocatalytic sets of proteins. J Theor Biol, 119, 1-24.
112. Kauffman, S.A. (1992) Applied molecular evolution. J Theor Biol, 157, 1-7.
113. Eigen, M. and Gardiner, W.C. (1984) Evolutionary molecular engineering based on RNA replication. Pure Appl. Chem., 56, 967-978.
114. Horwitz, M.S., Dube, D.K. and Loeb, L.A. (1989) Selection of new biological activities from random nucleotide sequences: evolutionary and practical considerations. Genome, 31, 112-117.
115. Ellington, A.D. and Szostak, J.W. (1990) In vitro selection of RNA molecules that bind specific ligands. Nature, 346, 818-822.
116. Bartel, D.P. and Szostak, J.W. (1993) Isolation of new ribozymes from a large pool of random sequences. Science, 261, 1411-1418.
117. Chapman, K.B. and Szostak, J.W. (1994) In vitro selection of catalytic RNAs. Curr Opin Struct Biol, 4, 618-622.
118. Lorsch, J.R. and Szostak, J.W. (1994) In vitro evolution of new ribozymes with polynucleotide kinase activity. Nature, 371, 31-36.
119. The tmRNA Website. http://www.indiana.edu/~tmrna/.
120. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic acids research, 28, 235-242.
121. Andronescu, M., Bereg, V., Hoos, H.H. and Condon, A. (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9, 340.
122. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol, 6, 362-369.
123. Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic acids research, 31, 3450-3460.
124. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol., 6, 362-369.
125. Brown, J.W. (1999) The Ribonuclease P Database. Nucleic Acids Res., 27, 314-314.
126. Yang, H.W., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res., 31, 3450-3460.
127. Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y.S., Feng, B., Lin, N., Madabusi, L.V., Muller, K.M. et al. (2002) The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
128. Williams, K.P. (2002) The tmRNA Website: invasion by an intron. Nucleic Acids Res., 30, 179-182.
129. Zarrinkar, P.P. and Williamson, J.R. (1996) The kinetic folding pathway of the Tetrahymena ribozyme reveals possible similarities between RNA and protein folding. Nat Struct Biol, 3, 432-438.
130. Doherty, E.A. and Doudna, J.A. (1997) The P4-P6 domain directs higher order folding of the Tetrahymena ribozyme core. Biochemistry, 36, 3159-3169.
131. Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A, 101, 7287-7292.
132. Kim, N., Shiffeldrim, N., Gan, H.H. and Schlick, T. (2004) Candidates for novel RNA topologies. Journal of molecular biology, 341, 1129-1144.
133. Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.
134. Yan, X. and Han, J. (2003), Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Washington, D.C.
135. Zaki, M.J. (2002), Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, Edmonton, Alberta, Canada.
136. Jaeger, J.A., Turner, D.H. and Zuker, M. (1989) Improved predictions of secondary structures for RNA. Proc Natl Acad Sci U S A, 86, 7706-7710.
137. Wang, Z. and Zhang, K. (2001), Proceedings of the 26th International Symposium on Mathematical Foundations of Computer Science. Springer-Verlag, pp. 690-702.
138. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25, 25-29.
139. Grate, L., Herbster, M., Hughey, R., Haussler, D., Mian, I.S. and Noller, H. (1994) RNA modeling using Gibbs sampling and stochastic context free grammars. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB, 2, 138-146.
140. Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research, 25, 955-964.
141. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of molecular biology, 215, 403-410.
142. Pudlák, P., Rödl, V. and Savický, P. (1988) Graph complexity. Acta Informatica, 25, 515-535.
143. Byun, Y. and Han, K. (2009) PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics (Oxford, England), 25, 1435-1437.
144. Darty, K., Denise, A. and Ponty, Y. (2009) VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics (Oxford, England), 25, 1974-1975.
145. Janssen, S., Reeder, J. and Giegerich, R. (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinformatics, 9, 131.
146. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. and Eddy, S.R. (2003) Rfam: an RNA family database. Nucleic acids research, 31, 439-441.
147. Weinberg, Z. and Ruzzo, W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics (Oxford, England), 20, i334-i341.
148. Weinberg, Z. and Ruzzo, W.L. (2006) Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics (Oxford, England), 22, 35-39.
149. Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA Structural Ensembles Including Pseudoknots. Submitted.
150. Eppstein, D. (1995), Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, San Francisco, California, United States, pp. 632-640.
151. Kukluk, J.P., Holder, L.B. and Cook, D.J. (2004) Algorithm and experiments in testing planar graphs for isomorphism. Journal of Graph Algorithms and Applications, 8, 313-356.
152. Yan, X., Yu, P.S. and Han, J. (2004), Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, Paris, France, pp. 335-346.
153. Yan, X., Yu, P.S. and Han, J. (2005), Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, Baltimore, Maryland, pp. 766-777.
154. Yan, X., Yu, P.S. and Han, J. (2005) Graph indexing based on discriminative frequent structure analysis. ACM Trans. Database Syst., 30, 960-993.
155. Chen, C., Yan, X., Yu, P.S., Han, J., Zhang, D.-Q. and Gu, X. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 926-937.
156. Williams, D.W., Huan, J. and Wang, W. (2007), Proceedings of 23rd International Conference on Data Engineering. IEEE, Istanbul, Turkey, pp. 976-985.
157. Zhao, P., Yu, J.X. and Yu, P.S. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 938-949.
158. Shang, H., Zhang, Y., Lin, X. and Yu, J.X. (2008) Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow., 1, 364-375.
159. Tian, Y. and Patel, J.M. (2008), Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 963-972.
160. Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.
161. ChemIDplus. http://chem.sis.nlm.nih.gov/chemidplus.
162. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J. and Bryant, S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37, W623-633.
163. Seiler, K.P., George, G.A., Happ, M.P., Bodycombe, N.E., Carrinski, H.A., Norton, S., Brudz, S., Sullivan, J.P., Muhlich, J., Serrano, M. et al. (2008) ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic acids research, 36, D351-D359.
164. Liu, T., Lin, Y., Wen, X., Jorissen, R.N. and Gilson, M.K. (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic acids research, 35, D198-201.
165. Eddy, S.R. (2002) Computational genomics of noncoding RNA genes. Cell, 109, 137-140.
166. Eddy, S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.
167. Fera, D., Kim, N., Shiffeldrim, N., Zorn, J., Laserson, U., Gan, H.H. and Schlick, T. (2004) RAG: RNA-As-Graphs web resource. BMC Bioinformatics, 5, 88.
168. Gan, H.H., Fera, D., Zorn, J., Shiffeldrim, N., Tang, M., Laserson, U., Kim, N. and Schlick, T. (2004) RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics (Oxford, England), 20, 1285-1291.
169. Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A. and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic acids research, 26, 320-322.
170. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D. and Sonnhammer, E.L. (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic acids research, 27, 260-262.
171. Sigrist, C.J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A. and Bucher, P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3, 265-274.
172. Washietl, S., Hofacker, I.L., Lukasser, M., Huttenhofer, A. and Stadler, P.F. (2005) Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol, 23, 1383-1390.
173. Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E.S., Kent, J., Miller, W. and Haussler, D. (2006) Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2, e33.
174. Giegerich, R., Voss, B. and Rehmsmeier, M. (2004) Abstract shapes of RNA. Nucleic acids research, 32, 4843-4851.
175. Hofacker, I.L., Fekete, M. and Stadler, P.F. (2002) Secondary structure prediction for aligned RNA sequences. Journal of molecular biology, 319, 1059-1066.
176. Zipf, G. (1949) Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley Press, Oxford, England.
VITA
Kejie Li
Department of Biological Sciences, Purdue University
Education
B.S., Biological Sciences, 2004, Sichuan University, Chengdu, Sichuan, P.R. China
Ph.D., Biological Sciences, 2011, Purdue University, West Lafayette, Indiana
Kejie Li was born in Chengdu, Sichuan Province, P.R. China, on June 4th, 1982. Kejie grew up in his hometown and entered Sichuan University in 2000. At Sichuan University, Kejie was selected for an educational exchange program and spent his junior year at the University of Washington, Seattle, USA, as a visiting student. Kejie graduated from Sichuan University in 2004 with a Bachelor’s Degree in Biological Sciences. In the fall of the same year, Kejie was admitted to the Bioinformatics master’s program at Wageningen University, Wageningen, Netherlands. In the fall semester of 2005, Kejie was admitted to the Ph.D. program in the Department of Biological Sciences at Purdue University, West Lafayette, USA, and joined the laboratory of Dr. Michael Gribskov. His research focuses on understanding RNA structure and function relationships. Kejie completed his Ph.D. studies and received his degree in August 2011. Kejie will pursue postdoctoral studies at the Broad Institute, Boston, USA.
PUBLICATIONS
Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) RNA structure topological pattern study reveals link between topology and function. (In preparation)
Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) Matching unknown RNA structures: RNA XIOS topological pattern database. (In preparation)
Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA Structural Ensembles Including Pseudoknots. (Submitted)
Banks, J.A., Nishiyama, T., Hasebe, M., Bowman, J.L., Gribskov, M., Li, K. et al. (2011) The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science, 332, 960-963.
Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.