A REVIEW OF
ARTIFICIAL INTELLIGENCE TECHNIQUES
APPLIED TO PROTEIN STRUCTURE PREDICTION
Jiang Ye
B.Sc., University of Ottawa, 2003
A PROJECT SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in the School
of
Computing Science
© Jiang Ye 2007
SIMON FRASER UNIVERSITY
Spring 2007
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name:
Degree:
Title of project:
Jiang Ye
Master of Science
A Review of Artificial Intelligence Techniques Applied to Pro-
tein Structure Prediction
Examining Committee: Dr. Diana Cukierman
Chair
Date Approved:
Dr. Veronica Dahl, Senior Supervisor
Dr. Kay C. Wiese, Supervisor
Dr. Alma Barranco-Mendoza, Examiner,
Assistant Professor of Computing Science,
Trinity Western University, Langley
SIMON FRASER UNIVERSITY Library
DECLARATION OF PARTIAL COPYRIGHT LICENCE
The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the "Institutional Repository" link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Revised: Fall 2006
Abstract
Protein structure prediction (PSP) is a significant, yet difficult problem that attracts attention from both the biology and computing worlds. The problem is to predict a protein's native structure from its primary sequence using computational means. It remains largely unsolved because no comprehensive theory of protein folding is available and a global search in the conformational space is intractable. This is why AI techniques have been effective in tackling some aspects of this problem.
This survey report reviews biologically inspired AI techniques that have been applied to the PSP problem. We focus on evolutionary computation and ANNs. Evolutionary computation is used as a population-based search technique, mainly in the ab initio prediction approach. ANNs are most successful in secondary structure prediction, learning meaningful relations between primary sequence and secondary structure from datasets. The report also reviews L-systems, a new generative encoding scheme for capturing protein structure on lattice models.
Keywords: Protein structure prediction, evolutionary computation, artificial neural
networks, L-systems.
Acknowledgments
My sincere gratitude goes to Dr. Veronica Dahl for the initiation of the project, for her
guidance, support and always being there to listen and to give advice. It has been my
greatest pleasure getting to know her and learning from her during my graduate studies.
I would also like to thank Dr. Kay C. Wiese for his constructive suggestions and com-
ments.
Special thanks to my parents for their unconditional love and to my son for always being
a caring boy.
Contents
Approval ii
Abstract iii
Acknowledgments iv
Contents v
1 Introduction 1
1.1 Protein Structure 3
1.1.1 Protein basics 3
1.1.2 Protein Structure Hierarchy 5
1.1.3 Experimental Methods 6
1.2 Algorithmic Processing of Evolution 7
1.3 Protein Structure Databases 8
1.4 Evaluation of Prediction Methods 9
2 Problem Overview 12
2.1 The Significance 12
2.2 The Challenges 13
2.3 Representation of Protein Structure 13
2.3.1 All-atom Model 14
2.3.2 Simplified Models 14
2.3.3 HP Lattice Model 15
2.4 Potential Energy Functions 16
2.5 Measure of Prediction Accuracy 18
2.6 Related Problems 18
3 Prediction Approaches Overview 21
3.1 Knowledge-based Prediction 22
3.1.1 Homology (Comparative) Modeling 22
3.1.2 Fold Recognition (Threading) 23
3.2 Ab initio Prediction 24
3.2.1 Dynamic Modeling 25
3.2.2 Energy Minimization 26
3.3 Structural Features Prediction 28
3.3.1 Secondary Structure Prediction 28
4 AI Techniques for PSP 30
4.1 Evolutionary Computation 31
4.1.1 Introduction to Evolutionary Algorithms 32
4.1.2 Evolutionary Algorithms for PSP 36
4.1.3 Discussion 47
4.2 L-systems 51
4.2.1 Introduction to L-systems 52
4.2.2 L-system-based Encoding for Protein Structure 53
4.2.3 Discussion 56
4.3 Artificial Neural Networks 57
4.3.1 Introduction to ANNs 58
4.3.2 A Basic ANN Scheme for Predicting Structural Features 61
4.3.3 Secondary Structure Prediction 62
4.3.4 Other Structural Features Prediction 70
4.3.5 Discussion 71
5 Summary 76
Bibliography
Chapter 1
Introduction
"Biology easily has 500 years of exciting problems for computer science." (Donald Knuth,
2001) Protein structure prediction (PSP) is definitely one such problem.
Proteins form the very basis of life. They perform a variety of essential functions in
organisms, from replication of the genetic code to transporting oxygen, from making up our
cell skeleton to catalyzing chemical reactions that make life possible. Proteins are formed
by joining amino acids into a linear chain. In water, the solvent environment in cells, the
chain folds up into a unique three-dimensional structure. Determining this structure is the
key to understanding how proteins work, thus is essential for our understanding of biological
processes and our ability to enhance the quality and span of our lives.
Currently, the structures of fewer than 30,000 proteins have been determined
through experimental methods [98]. This is in contrast to more than a million protein
sequences known as a result of the explosion of genome sequencing projects. The sequence-
structure gap has dramatically increased. Since the costly and time-consuming experimental
methods for structure determination cannot keep pace with sequencing speed, we need effec-
tive computational tools which are able to translate the sequence into an accurate structure.
Unfortunately, despite the growth of computing power and several decades of research
effort, the problem of predicting protein structure from sequence remains largely unsolved
and has therefore been the "holy grail" in computational biology for many years. The main
reason for this, as indicated in [59], is that no comprehensive theory of protein folding is
available and a global search in the conformational space of proteins is intractable.
The bright side, however, is that more and more research attention has been drawn
to tackle this problem and there have been some promising results in some aspects of this
CHAPTER 1. INTRODUCTION 2
problem. The structure prediction community is growing rapidly. In the first CASP (Critical
Assessment of Structure Prediction) contest in 1994, 35 research groups submitted 100
predictions for 33 protein targets, while in CASP6 in 2004, there were 230 groups submitting
more than 41,000 predictions for the 76 targets. Also, the databases that keep various protein
structural information and the web servers/programs for the task have greatly increased.
There are two major categories of approaches to this problem: 'knowledge-based' and
'ab initio'. Some knowledge-based methods have achieved rather accurate predictions for
a limited number of proteins. A third category of approaches predicts structural features
such as secondary structure and has become very useful in the general prediction problem.
All three categories of approaches have been attracting active research.
This report is intended to give an overview of computational approaches to the PSP
problem. The focus is on applications of some interesting biologically inspired AI techniques.
The role of computers has been dramatically enhanced in all areas of biological and
medical research with the exponential growth of biological data. At the same time, biological
systems have been inspiring advances in computing science with new concepts, including
genetic algorithms, artificial neural networks, artificial life, and DNA computing. When
humans try to solve problems, it is always exciting to look at Nature's amazing solutions.
"When looking for the most powerful natural problem solver, there are two rather
straightforward candidates: the human brain (that created 'the wheel, New York, wars
and so on'), and the evolutionary process (that created the human brain)" [26]. Trying to
design problem solvers based on the human brain leads to the field of neuro-computing. The
evolutionary process forms the basis of evolutionary computing. In this report, we will examine
and review how evolutionary algorithms and artificial neural networks are applied in PSP.
We will also introduce Lindenmayer systems as a novel protein structure representation
scheme. Although not as powerful as evolutionary computation and ANNs, L-systems, also
inspired by natural systems, have found many applications in the computing world and have
recently been applied to encode lattice protein conformations.
This report has five chapters. Chapter one provides a basic introduction to protein struc-
ture and the important resources in the protein structure prediction research field. Chapter
two serves as a problem overview through discussing several issues in the problem domain.
Chapter three is a general introduction to various approaches to the problem. This chapter
serves as the big picture of protein structure prediction and helps to show where individual
computational techniques or methods fit. Chapter four is the main chapter
of this report. It reviews and analyzes three biologically inspired computing techniques:
evolutionary algorithms, L-systems, and ANNs, and their applications to the PSP problem.
The report concludes with a short summary.
1.1 Protein Structure
This section introduces basic ideas about proteins, protein structure, and the current ex-
perimental methods to determine protein structure.
1.1.1 Protein basics
A protein is a chain of amino acids, also referred to as residues. A single amino acid, shown
in the diagram below, always has: a central carbon atom (Cα), an amino group (-NH2), a
carboxyl group (-COOH), a hydrogen atom (-H), and a chemical group or side chain (-R).
Figure 1.1: Single amino acid structure
There are 20 different amino acids commonly found in proteins, each coded using
one English letter. For example, the amino acid cysteine is coded as 'C'. All the 20 amino acids
have the same general structure as shown in Figure 1.1, but their side chains (Rs) vary
in composition and structure, thus in properties like size, shape, charge etc. It is the side
chain that determines the identity of a particular amino acid. One useful classification of
the amino acids divides them into two kinds: the polar (or hydrophilic) amino acids have
side chains that interact with water, while those of the hydrophobic amino acids do not.
Amino acids can be linked when the carboxyl group of one amino acid reacts with the
amino group of the next amino acid, releasing a water molecule and forming a peptide bond
between the two amino acids, as shown in Figure 1.2.
Using peptide bonds, long sequences of amino acids (polypeptide chains, or proteins)
are generated. Most proteins are a few hundred residues long, although some are shorter
than 100 residues and others exceed 1,000 residues. Relative to the
Figure 1.2: Two amino acids reacting to form a peptide bond
side chains, the sequence of three repeating groups: amino group, Cα atom, and carboxyl
group is called the protein backbone or main chain. The two ends of a polypeptide chain are
chemically different: the end carrying the amino group is the N-terminal, and that carrying
the carboxyl group is the C-terminal. Conventionally, the amino acid sequence of a protein
is always presented in the N to C direction.
The peptide bond itself (the CO-NH group indicated with a rectangle) is planar, but
there is flexibility for rotation around the N-Cα bond and around the Cα-C bond, forming
two dihedral angles, φ and ψ, on each side of the Cα atom, as shown in Figure 1.3. These
two dihedral angles are the main degrees of freedom in forming the 3-d polypeptide chain.
Figure 1.3: Dihedral angles of the backbone
Although the values of the dihedral angles are restricted to small regions in natural
proteins, it is this freedom that allows a protein to fold into a specific 3-dimensional
structure, or conformation.
Valid values are specified on the so-called 'Ramachandran plot'.
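As a small aside (not from the original text), the dihedral angle defined by four consecutive backbone atoms can be computed directly from their 3-d coordinates using standard vector algebra. The sketch below is an illustrative implementation; the function name and the test coordinates are invented for the example:

```python
import math

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, e.g. the
    backbone atoms C(i-1), N, Ca, C for phi, or N, Ca, C, N(i+1) for psi."""
    def sub(a, b):
        return [a[k] - b[k] for k in range(3)]
    def dot(a, b):
        return sum(a[k] * b[k] for k in range(3))
    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)          # normals of the two planes
    u2 = [c / math.sqrt(dot(b2, b2)) for c in b2]  # unit vector along b2
    x, y = dot(n1, n2), dot(cross(n1, u2), n2)
    return math.degrees(math.atan2(y, x))

# Planar zigzag examples: 0 degrees (cis-like) and 180 degrees (trans-like)
cis = dihedral([0, 1, 0], [0, 0, 0], [1, 0, 0], [1, 1, 0])
trans = dihedral([0, 1, 0], [0, 0, 0], [1, 0, 0], [1, -1, 0])
```

Applied to real coordinates, the φ/ψ pair of each residue obtained this way is exactly what is plotted on a Ramachandran plot.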
1.1.2 Protein Structure Hierarchy
Conventionally, protein structure is represented with four levels of description: primary,
secondary, tertiary, and quaternary structure.
• Primary: the ordered sequence of amino acid residues. Formally, it can be modeled
as a string over a finite alphabet Σ where |Σ| = 20 (there are 20 amino acids).
Protein sequences differ in length from 30 to over 30,000 amino acids, but most are a
few hundred residues long [74].
• Secondary: the local arrangement of amino acids over a short range of the protein chain.
Only main-chain atoms are involved in secondary structure. There are two main
secondary structure patterns: α-helix (H) and β-sheet (E). They may be connected by
loop regions or coils (C).
An α-helix is a tightly coiled, rodlike structure. It is built up from one continuous
region in the protein sequence through the formation of hydrogen bonds between the C=O
group of the residue in position i and the NH group of residue i + 4. A β-sheet is formed by
two or more β-strands hydrogen-bonded side by side. A β-strand is just a fragment
of consecutive residues, but the different β-strands forming a pleated β-sheet are usually
distant in sequence. Coils have no fixed regular shape. At a slightly higher level,
we can define motifs, or supersecondary structures, which are commonly found
secondary-structure arrangements, such as helix-loop-helix.
Every amino acid in the sequence belongs to one of the three structural types (sometimes
a finer classification of eight structural types is used), thus protein secondary
structure can be flattened and represented by a string over the alphabet {H, E, C}
with the same length as the primary structure. Take an example from [59]; a fragment
of protein primary and secondary structure is as follows:
... P Y E L A M S P T I M C K D N W M A L E M L T ... ← Primary structure
... C C H H H H C E E E E E E E E H H H H H C C C ... ← Secondary structure
• Tertiary: the three-dimensional conformation that results from secondary structures folding
together. Interactions of amino acid side chains are the predominant drivers of tertiary
structure [2]. This tertiary structure into which a protein naturally folds is also known
as its native structure. Normally, the interior of a (folded) protein molecule tends to
be hydrophobic, while the exterior of a protein is largely composed of hydrophilic
residues, which are able to bond with water molecules. This allows a protein to have
greater water solubility.
• Quaternary: results from the interactions of multiple independent polypeptide chains.
This level of structure will not be discussed in this report.
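The string encodings of the first two levels are easy to make concrete in code. The following sketch (illustrative only; the helper name check_annotation is invented) validates a primary sequence over the 20-letter alphabet together with its three-state secondary structure string, using the fragment from [59]:

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 one-letter residue codes
SS_SYMBOLS = set("HEC")                    # helix, strand (sheet), coil

def check_annotation(primary, secondary):
    """Check that a secondary structure string is a valid per-residue
    annotation of a primary sequence (same length, known symbols)."""
    if len(primary) != len(secondary):
        raise ValueError("annotation length must equal sequence length")
    if not set(primary) <= AMINO_ACIDS:
        raise ValueError("unknown amino acid code in sequence")
    if not set(secondary) <= SS_SYMBOLS:
        raise ValueError("unknown secondary structure symbol")
    return True

# The fragment from [59]:
primary = "PYELAMSPTIMCKDNWMALEMLT"
secondary = "CCHHHHCEEEEEEEEHHHHHCCC"
check_annotation(primary, secondary)
```

This flattened view is precisely the input/output format used by the secondary structure predictors reviewed later: a sequence over the 20-letter alphabet in, an equal-length string over {H, E, C} out.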
1.1.3 Experimental Methods
Protein structures are determined by two main experimental methods: X-ray crystallogra-
phy and nuclear magnetic resonance (NMR). In X-ray crystallography, the target protein
must first be isolated and highly purified. Then a series of procedures are required to grow
a crystal which is then exposed to X-rays. From the diffraction pattern recorded, the 3-d
structure could be solved. Using X-ray crystallography depends on successfully obtaining
protein crystals, which is sometimes a major difficulty. (Some proteins do not crystallize.)
It often takes months to solve even a single protein structure by X-ray methods,
although recently this process has been sped up by some high throughput techniques. An-
other drawback of the x-ray crystallography techniques is that the crystallization process
may cause the protein to assume a structure other than its native conformation. In the
second method, NMR, the 3-d structure is constructed from pairwise distances estimated by
exciting nuclei and measuring the coupling effects on their neighboring nuclei. Generally,
NMR has been successful for only small proteins and compared with X-ray crystallography,
the resolution is poorer.
Protein tertiary structures solved by X-ray crystallography or NMR are deposited in
Protein Data Bank [98] and are used to evaluate how accurate the computer prediction
models are. But it should be noted that both X-ray and NMR are indirect methods and
they have their limitations. Protein structures solved by them may not represent the native,
active conformation of the protein. For the time being, they are the best data available to
be used as a test for computer models. In the future, however, with more understanding
about protein's native conformation, the standards by which the predicted models are being
judged may be altered.
1.2 Algorithmic Processing of Evolution
Homology is an important concept in protein structure prediction. It is defined as similar-
ity in structure, physiology, development and evolution of organisms based upon common
genetic factors [7]. Evolution at molecular level is commonly modeled as a process in which
currently observed sequences have diverged from a common ancestor sequence. This process
involves such events as: mutations, deletions, and insertions of amino acids in a sequence
and selection of those having environmental advantages. In general, 3-d structures, and hence
functions, are more conserved than sequences [14, 73].
Usually, two proteins are considered to be homologous when they have identical amino
acid residues in a significant number of positions, thus resulting in similar structures; i.e., the
essential fold of the two proteins is identical, while details such as additional loop regions may vary.
However, it is frequently found that two proteins with low sequence identity can also have
similar structures. Sequence similarity can be observed by optimal alignment algorithms
which usually employ dynamic programming techniques [55, 83]. If a pair-wise alignment
shows sequence identity above some threshold, e.g. 25-30%, it is generally assumed that the
two sequences have diverged from the same ancestor, and therefore they are likely to share
a similar structure. If it is below the threshold, there are two possibilities. Either the two
proteins have diverged from the same ancestor (but their sequences are too divergent for
their homology to be detectable) or the two proteins are unrelated. Also, there are multiple
alignment algorithms comparing multiple sequences.
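As an illustration of the dynamic programming idea (a toy sketch, not from the report: it uses an invented unit scoring scheme, whereas real tools use substitution matrices such as BLOSUM and affine gap penalties), the following computes a global alignment and the resulting percent identity:

```python
def align_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment of two sequences by dynamic
    programming; returns the percent identity over aligned columns."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # DP score matrix
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + d,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Trace back, counting identical residues in aligned columns.
    i, j, ident, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            ident += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return 100.0 * ident / cols
```

The returned percentage is the quantity compared against the 25-30% homology threshold mentioned above.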
The evolutionary information in a multiple alignment of N sequences and L positions
can be expressed using a profile, a 20 x L Position Specific Scoring Matrix (PSSM) that lists
the frequencies of each amino acid in each position. This evolutionary information is often
considered when designing computational tools. Suppose we want to predict the structure
of a protein sequence s: besides exploiting the information contained in s directly, if
we can find a set of sequences similar to s, we can expect that this set contains more
structural information than s itself. The success of the most effective predictive systems is
largely based upon this empirical argument and on their ability to process the information
provided by multiple alignments of similar sequences.
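A minimal sketch of such a profile, built from an already-aligned set of sequences, might look as follows (illustrative only: real PSSMs typically store log-odds scores computed against background amino acid frequencies, with pseudocounts, and those refinements are omitted here; the toy aligned sequences are invented):

```python
from collections import Counter

def profile(alignment):
    """Column-wise amino acid frequencies of an already-aligned set of
    equal-length sequences: the columns of a 20 x L profile (a PSSM
    before any log-odds transformation). Gap characters are ignored."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment)
    columns = []
    for j in range(length):
        counts = Counter(s[j] for s in alignment if s[j] != "-")
        total = sum(counts.values())
        columns.append({aa: c / total for aa, c in counts.items()})
    return columns

# Three toy "aligned" sequences of length L = 4:
p = profile(["ACDA", "ACEA", "GCDA"])
# Column 0 mixes A and G; columns 1 and 3 are fully conserved.
```

A highly conserved column (frequency near 1.0 for one residue) signals a structurally or functionally important position, which is exactly the kind of evolutionary information the predictors exploit.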
1.3 Protein Structure Databases
The development of computational tools is undoubtedly crucial in increasing prediction
accuracy. Another very important aspect is the growth of protein sequence and structure
databases. Except for some ab initio approaches, most structure prediction methods are
dependent on detecting homologies with structures existing in the databases. Thus the
more protein structures deposited in databases, the more likely we can predict a novel
protein structure accurately. In a broader sense, having sizable databases of sequences and
structures provides raw data of evolution. It is the use of evolutionary information and
finding patterns in the information that has pushed forward the field of bioinformatics,
and subsequently the sub-field of protein structure prediction.
The most important protein structure database is the Protein Data Bank (PDB) [98].
It has existed for three decades and is a primary database that contains all experimentally
determined biological macro-molecular structures, mainly proteins. The PDB is updated
frequently and as of December 2006, about 37,300 protein structures have been deposited in
it. The availability of this large quantity of protein structures allows many analytical studies
to be carried out. Also, there are several structure classification databases that are derived
from the PDB, two of which are SCOP [100] and CATH [96]. They are both hierarchical
databases of protein structure. SCOP (the Structural Classification Of Proteins) divides
the world of protein structures to reflect both structural and evolutionary relatedness. The
major levels in the hierarchy are family, superfamily, and fold. For example, proteins placed in
the same family are clearly evolutionarily related, while proteins in the same fold category may
not have a common evolutionary origin but share some structural similarity. CATH clusters
proteins at four major levels: Class, Architecture, Topology and Homologous superfamily.
It provides a slightly different view on clustering structures. Both databases are
widely used in structure prediction, in particular in the fold recognition approach.
In addition to structure databases, protein sequences databases, e.g. GenBank and
SwissProt, are also important in structure prediction. With the recent enormous growth of
these databases, some powerful sequence alignment tools, such as PSI-BLAST, can detect
extremely remote homologous relationships between proteins. The evolutionary information
detected is valuable in structure prediction.
1.4 Evaluation of Prediction Methods
A large number of approaches or methods have been applied to the PSP problem. There is
a need to evaluate the effectiveness of the different prediction methods. In this section, we briefly
introduce three world-wide experimental competitions in the protein structure prediction field.
CASP
Critical Assessment of Structure Prediction (CASP) is a world-wide protein structure
prediction contest initiated in 1994. It has been held every two years since then, and the most
recent one was CASP6 in December 2004. During each prediction season, CASP provides
participants with the amino acid sequences of proteins whose structures are close to being
determined experimentally but not yet released to the public. The participants then work on
the blind prediction of structures of these target proteins and submit their structure models
generated by computer programs. (Often the models are produced by a combination of
computer programs and human intervention). CASP assessors then compare the predicted
models with experimentally determined structures. Each CASP contest will conclude with
a meeting to discuss the results.
Work in protein structure prediction is very complex and computationally intensive.
CASP provides the PSP research community with an assessment of the various approaches
and a critical review of the field. With the growth of the field, the number of participants
and the extent of prediction have greatly increased. In the first CASP contest, 35
research groups submitted 100 predictions for 33 protein targets, while ten years later, in
the recent CASP6, there were 230 groups submitting more than 41,000 predictions for the 76
targets [93].
Figure 1.4 shows two prediction results of protein TM0919. A good prediction and a
not-so-accurate prediction are shown.
Despite the enormous value of the CASP experiments, they do have some limitations.
In [28], some limitations are discussed: the assessment is carried out by humans and thus bears
the issue of subjectivity; the number of targets is relatively small, so the results
may not always be significant; the assessments cover only proteins determined in a period
of about four months every two years; and users cannot always reproduce CASP predictions,
because the computer programs or the required human expertise are often not available.
CAFASP
In contrast to CASP in which human intervention is allowed in the prediction process,
Figure 1.4: (a) Crystal structure of TM0919, one of the 76 target proteins of CASP6. (b) Comparison of a successful prediction (red) for TM0919 with the crystal structure. (c) Comparison of a less successful prediction. (The image was taken from [93].)
Critical Assessment of Fully Automated Structure Prediction (CAFASP) aims to assess the
performance of fully automatic structure prediction servers. Thus, what is measured is the
capability of the computer program itself, rather than the capability of prediction groups
aided with the programs. This is of significance to biologists who just want to choose a
better prediction tool.
The benefits of an assessment of fully automated methods are listed in [50]. First, the
nonspecialist users can choose which is the best method for them to use on their prediction
targets. Second, users can evaluate and better interpret the results they obtain from the
various prediction programs. And last, fully automated predictions are reproducible, unlike
the cases where human intervention is part of the model-building process.
The CAFASP results demonstrated that although in most cases human intervention
resulted in better predictions, several programs could already independently produce rea-
sonable models.
LiveBench
Like CAFASP, LiveBench also evaluates automatic servers only, but it is carried out
in a continuous fashion and uses a larger number of prediction targets. Each week the
Protein Data Bank is checked for new entries. Proteins with low sequence similarity to
other proteins of known structure are chosen as prediction targets and are immediately
submitted via Internet to the participating servers. After a few months, a large collection
of prediction targets is thus obtained, and the predicted models can be evaluated.
Chapter 2
Problem Overview
In this chapter, we discuss several issues that will help understand the problem domain.
2.1 The Significance
Knowledge about the structure of a protein is essential in understanding its biological func-
tion. It helps us to understand substrate and ligand binding, devise intelligent protein
engineering experiments with improved specificity and stability, perform structure-based
drug design, and design novel proteins. Thus, being able to predict the 3-d structure of a
protein from its amino acid sequence would greatly benefit molecular biology research. It
would provide educated guesses about the function of newly discovered proteins without
the time and cost required to perform x-ray crystallography and NMR. Indeed, if structure
prediction is good enough, it may remove the need for lab experiments altogether. At least,
in many situations, even a crude or approximate model can greatly help experimental de-
termination of protein structure. Thus, even though most of the current approaches cannot
produce accurate results yet, prediction of structures is of great value. Also, structure pre-
diction is important for the progress of protein engineering as it would enable changes to be
made in the amino acid sequence with some expectation of how the change will affect the
structure.
On the other hand, the study of the protein structure prediction problem drives the
development of computing techniques; e.g., this problem on simplified models is a good test
problem for developing and evaluating evolutionary algorithms.
CHAPTER 2. PROBLEM OVERVIEW 13
2.2 The Challenges
Protein structure prediction is a very difficult problem. We have not even come close to
solving it. In [79], David Searls outlined some major challenges around the problem:
• The physical basis of protein structural stability is not fully understood. Although
Anfinsen [1] experimentally showed that the primary sequence plus thermodynamic
principles should suffice to completely account for the native structure of a protein,
what exactly those principles are and the best way to apply them are still not certain.
• The search space of the problem is too huge because of the vast range of possible
conformations of even relatively short polypeptides.
• The primary sequence may not fully specify the tertiary structure. "There are no
rules without exception in biology."
To illustrate the second challenge a little more, take a small protein of 100 amino
acids as an example. Even with a very modest estimate of three possible structural
arrangements for each amino acid, the total number of conformations for this small protein
is 3^100 ≈ 5 × 10^47, a number which is far beyond the computing capability of modern computers.
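The arithmetic behind this explosion can be sketched in a few lines; the three-states-per-residue figure and the one-conformation-per-nanosecond enumeration rate below are illustrative assumptions, not measured values:

```python
# Back-of-envelope illustration of the conformational explosion for a
# 100-residue protein, assuming 3 structural states per residue and an
# (optimistic) enumeration rate of one conformation per nanosecond.

n_residues = 100
states_per_residue = 3

conformations = states_per_residue ** n_residues   # 3**100, about 5.2e47
seconds = conformations * 1e-9                     # at 1 ns per conformation
years = seconds / (3600 * 24 * 365)

print(f"{conformations:.2e} conformations, ~{years:.1e} years to enumerate")
```

Even at this absurdly generous rate, exhaustive enumeration would take on the order of 10^31 years, which is why intelligent search is unavoidable.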
While the other two challenges need to be addressed primarily by biophysicists and biochemists
who study and model protein folding processes, the second one is a rich source
of interesting and challenging computational problems in the AI field: e.g., what are some
intelligent ways to explore the conformation space? Why is nature so efficient and
accurate with respect to protein folding, and what can we learn from it?
2.3 Representation of Protein Structure
When we approach this computational problem, probably the first task is to represent
protein structure in the problem space. Because protein structure can be specified at different
levels of the hierarchy, and each level may be viewed at different levels of detail, there
are various ways of representing protein structures. Meanwhile, due to the complexity of the
¹Despite some new discoveries, e.g. chaperones, a special type of proteins whose function is to assist other proteins in achieving proper folding, this argument largely remains valid in the PSP community and is the fundamental principle underlying all prediction methods.
problem, in practice, further simplified or restrained models are often used to accommodate
limited computing resources.
Roughly, protein structure representations fall into two categories: the all-atom model and
simplified models. Choosing a suitable representation not only makes the problem space
explicit but may also help to find solutions more efficiently and effectively.
2.3.1 All-atom Model
In the Protein Data Bank [98], protein structures are represented by lists of 3-d coordinates of
all atoms in a protein. Although an accurate all-atom model is desirable for structure
prediction, it incurs too great a computational overhead even for very small proteins. Besides,
it is difficult to identify similar sub-structures across different proteins using an all-atom
coordinate representation; consequently, it is difficult to carry out generalization and abstraction.
Thus, for the PSP problem, various simplified models and representations are used.
2.3.2 Simplified Models
Since the all-atom model is not feasible, at least currently, it is attractive to explore simplified
structure models to see if they are good enough to at least allow approximate solutions,
which are useful either directly or as initial models for further improvement.
Simplified models range from very abstract models, such as the HP lattice model (see
2.3.3), to almost realistic models in which proteins are represented by a geometric description
of the main-chain atoms and a rotamer library of side chains. Roughly, simplified models
can be classified into lattice models and off-lattice models. Lattice models adopt a lattice
environment, a grid in which structural elements are positioned only at grid intersections,
whereas off-lattice models position structural elements in a continuous space.
Lattice models involve two kinds of simplification: each amino acid is modeled as a
single "bead" without considering the different atoms in the amino acid, and the beads are
restricted to a rigid lattice rather than being able to take any position in space. Thus, in a
legal conformation on a lattice, one residue occupies one vertex and adjacent residues in
the sequence must be adjacent in the lattice; a legal conformation is actually a self-avoiding
path on the lattice. These lattices may be two-dimensional, e.g. square or triangular, or
three-dimensional, e.g. cubic or diamond.
There is debate about the 'physical reality' of a lattice protein. For example, reference
[35] addressed the issue and suggested that simplified lattice models do not contain
the biological information necessary to solve the protein structure prediction problem. Other
researchers, e.g. N. Krasnogor in [48], argue that simple lattice models can capture many
global aspects of protein structures and, besides being inexpensive to use, make it possible
to design test problems for which the best conformational structure is known (for small
protein sequences).
Off-lattice models represent protein structure in various ways.
Depending on the level of detail at which the polypeptide chain composition is represented,
there are models considering:
- individual residues, often represented by the central Cα atoms;
- all backbone atoms;
- backbone atoms and side-chain centroids;
- all heavy atoms.
Depending on how the positions of structural units are represented, there are models:
- using dihedral angles;
- using coordinates directly, either absolute or relative;
- using a distance matrix.
These different representations carry more or less information about the protein structure.
Some structure prediction approaches use multiple representations and move among them
for different purposes.
2.3.3 HP Lattice Model
The 2-d HP lattice model is perhaps the simplest lattice protein model. It was proposed
by Dill [25] and is widely studied for ab initio prediction. This is the main model we use in
Chapter four when we discuss evolutionary algorithms and Lindenmayer generative encoding
systems.
In this model, the 20 amino acids are classified into only two classes: hydrophobic (H)
and hydrophilic (P), according to their interaction with water molecules. Thus a protein
sequence s is reduced to s ∈ {H, P}+. In addition, the sequence is assumed to be embedded
in a certain 2-d lattice. The free energy of a conformation is inversely proportional to
the number of H-H contacts. An H-H contact is a hydrophobic non-local bond; it occurs
when two H-residues occupy adjacent vertices in the lattice but are not consecutive in
the sequence. Thus, the more H-H contacts there are, the lower the free energy of the
conformation. Forming the lowest-energy conformation results in the H-residues forming
a hydrophobic core surrounded by the P-residues that interface with the environment.
This concept is normally quantified by giving a value e = -1 to every H-H contact and
trying to maximize the total number of H-H contacts.
²For a protein sequence of n residues, the corresponding distance matrix contains n × n entries, each representing the distance between the Cα atoms of a pair of residues.
The following figure shows examples of two HP lattice models. H-residues are represented
by dark circles and P-residues by white circles. H-H contacts are highlighted, in (b) only,
with curved lines:
Figure 2.1: HP models in (a) square lattice and (b) triangular lattice.
The embedding of an HP sequence in a lattice may be represented in two ways: the
location of each residue on the lattice is specified independently, or relative to the previous
residue. In the latter case, the structure is specified as a sequence of moves (e.g. up, down,
left, right) taken on the lattice from one residue to the next.
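As a concrete illustration, the move-string representation and the H-H contact count described above can be implemented directly. This is a minimal sketch of our own for the 2-d square lattice; the function and move names are not from any cited system:

```python
# A minimal sketch of the 2-d square-lattice HP model: a conformation is a
# string of relative moves, and the energy is -1 per H-H contact between
# non-consecutive residues occupying adjacent lattice vertices.

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def hp_energy(sequence, moves):
    """Energy of an HP conformation; None if the path is not self-avoiding."""
    assert len(moves) == len(sequence) - 1
    pos = [(0, 0)]
    for m in moves:                      # build the walk from relative moves
        dx, dy = MOVES[m]
        x, y = pos[-1]
        pos.append((x + dx, y + dy))
    if len(set(pos)) != len(pos):        # legal conformations are self-avoiding
        return None
    index = {p: i for i, p in enumerate(pos)}
    energy = 0
    for i, (x, y) in enumerate(pos):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES.values():
            j = index.get((x + dx, y + dy))
            # j > i + 1 counts each contact once and skips chain neighbours
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy
```

For example, hp_energy("HPPH", "RUL") folds the chain into a unit square, giving one H-H contact and energy -1, while a self-intersecting move string returns None.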
Although the degrees of freedom, and thus the amount of computation, are greatly reduced
in the HP lattice model, the PSP problem for the HP model has been shown to be
NP-hard on both the 2-d square lattice [19] and the 3-d cubic lattice [6]. This justifies the
use of intelligent search techniques, e.g. evolutionary algorithms, to tackle the problem.
2.4 Potential Energy Functions
During the prediction process, we need energy functions to provide information on what
conformations of a protein are better or worse. Clearly, energy functions are very
important to the prediction result. A poorly defined energy function may render an energy
hyper-surface that has little correlation with a protein's true conformation. An energy
function is needed in almost all computational approaches.
A wide variety of energy functions have been used in protein structure prediction. These
range from the very simple hydrophobic potential in HP lattice protein to energy models
based on more detailed molecular mechanics, such as CHARMM (Chemistry at HARvard
Macromolecular Mechanics) package [95]. Current energy functions can be roughly classified
into three categories: physical potential functions, mean force potentials, and simplified
potentials.
Physical potential functions take into account the bonded and non-bonded potentials
between atoms, such as torsion (bonded) and electrostatics (non-bonded), and typically have
the form

E = E_bonded(R) + E_non-bonded(R)

where R is the vector representing the conformation of the protein, typically in Cartesian
coordinates or torsion angles. A popular example in this category is CHARMM; the v.27
CHARMM energy function adds up seven energy terms [95].
Mean force potentials are derived from databases of known protein structures. They can
be based on statistics of the frequencies of contacts between amino acids or, at a finer grain,
between functional groups. For example, the amino acid pair R and D is frequently found
a short distance apart relative to random expectation, which indicates that such
an interaction is favorable. Mean force potentials are quite successful in the fold recognition
approach, but generally they are not accurate enough due to their crude representation and
their statistical nature.
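A common way such contact statistics are turned into a potential is the inverse Boltzmann relation, E(a, b) = -kT ln(f_obs / f_exp). The sketch below uses made-up frequencies purely for illustration:

```python
# A sketch of a mean-force contact potential via the inverse Boltzmann
# relation E(a, b) = -kT * ln(f_obs / f_exp). The frequencies below are
# invented purely for illustration.

import math

KT = 0.593  # kcal/mol at ~298 K (a common convention)

def contact_potential(observed, expected):
    """Map residue-pair contact frequencies to pseudo-energies."""
    return {pair: -KT * math.log(observed[pair] / expected[pair])
            for pair in observed}

# Hypothetical data: R-D contacts observed twice as often as expected,
# so the derived energy is negative, i.e. the interaction is favorable.
pot = contact_potential({("R", "D"): 0.02}, {("R", "D"): 0.01})
```

A pair seen more often than chance gets a negative (favorable) pseudo-energy; pairs seen less often than chance get a positive one.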
Simplified empirical energy functions are often tied to simplified protein models,
e.g. the hydrophobic potential in the simple HP lattice protein. In these potentials, the
emphasis is on computational efficiency and ease of use rather than accuracy.
The potential energy function is a very important factor in the accuracy of structure
prediction. An energy function should be sufficiently close to the true potential of the native
state; otherwise, the lowest-energy state will not correlate with the native conformation.
Development of energy functions is a very active research area; new models are frequently
published and tested. Often these new functions combine atomic forces with
statistical properties taken from observed protein structures. Currently, potential functions
are still not accurate enough.
2.5 Measure of Prediction Accuracy
How do we measure the accuracy of a predicted result, assuming we know the real native
structure of the protein? In the literature, the most popular metric is the 'root mean square
deviation' (RMSD). It measures the average distance between corresponding atoms after the
predicted and the real structures have been optimally superimposed on each other. This
distance is usually measured in Angströms (Å); one Angström equals 10^-10 m, i.e. one
hundred-millionth of a centimeter. RMSD is given by the formula:

RMSD(a, b) = sqrt( (1/n) Σ_{i=1..n} |r_ai - r_bi|² )

where r_ai and r_bi are the positions of atom i in structures a and b, respectively.
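Assuming the two structures have already been optimally superimposed, the formula is straightforward to compute; here is a small sketch with toy coordinates (the superposition step itself, e.g. the Kabsch algorithm, is omitted):

```python
# RMSD between two structures that are assumed to be already superimposed;
# each structure is an n x 3 array of atom coordinates (toy values below).

import numpy as np

def rmsd(a, b):
    """Root mean square deviation between corresponding atoms of a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    assert a.shape == b.shape and a.shape[1] == 3
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Two 3-atom toy structures offset by 1 Å along x: the RMSD is exactly 1.0
a = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
b = [[1, 0, 0], [2, 0, 0], [3, 0, 0]]
```

Note that in practice the offset in this toy example would be removed by the optimal superposition, so a real comparison pipeline must superimpose first and only then apply this formula.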
In general, a prediction with an RMSD of about 6 Å is considered non-random but not
useful; RMSDs of 4-6 Å are meaningful but not accurate; and RMSDs below 4 Å are
considered good [33]. Of course, the required accuracy also depends on the purpose of the
prediction. For example, identifying the overall fold for understanding the function of a
given protein requires less precision than designing an inhibitor for a protein.
RMSD is widely used for structure comparison. The major problem with this metric is
that the two structures have to be appropriately superimposed. Finding the best superposition
is itself a hard problem, and the best alignment does not always yield the minimal
RMSD. When all equivalent parts of the proteins cannot be simultaneously superimposed,
RMSD is not a good measure. Another problem with the RMSD metric is that the significance
of an RMSD value depends on the size and type of the protein.
Another metric for measuring the accuracy of a predicted structure is the Distance
Matrix Error (DME), but we do not discuss it here.
2.6 Related Problems
The field of protein structure prediction has grown and diversified greatly since the first
attempts. Initially, researchers focused on understanding physical and chemical principles
and using these principles to simulate the folding process and obtain protein structure. While
this has not yet yielded a solution, with more and more experimental data available, researchers
now try to derive empirical rules from the data and predict new protein structures accordingly. On
the other hand, because protein structure can be viewed at different levels and different types
of proteins possess different structure features, the general problem of structure prediction
may be simplified or varied to address different prediction tasks. Here I briefly introduce
three closely related problems, which may help in understanding the problem of PSP.
• Protein folding
The PSP problem attempts to predict the native structure of a protein given its pri-
mary structure, while the protein folding problem consists in predicting the folding
process or pathways to reach the native structure. Both problems explore protein
structure, but with complementary aims. Studies of protein folding are mainly
concerned with fundamental physicochemical principles and less concerned about pro-
ducing accurate 3-d structure models. A solution to the protein folding problem will
provide a solution to the PSP problem, but knowing the final structure does not solve
the folding problem. In this sense, the protein folding problem is more complex than the
PSP problem.
Progress on the protein folding problem is definitely helping the PSP problem, because a
better understanding of the physicochemical principles of protein folding will help in
developing more appropriate energy functions for the PSP problem.
When talking about protein folding, we cannot ignore a distributed computing project:
folding@home [97]. It was launched on October 1, 2000, and is managed by the Pande
Group at Stanford University. It is designed to perform the intensive computations of
protein folding simulation. As of February 2006, more than 210,000 CPUs world-wide
were actively participating, with a total of over 1,600,000 CPUs registered with
the project.
• Secondary structure prediction
Predicting protein secondary structure is sometimes considered a sub-problem of
PSP, although it can stand on its own. The term 'protein structure prediction' in
early research (back in the 80s) often actually referred to secondary structure prediction.³
Given a protein sequence, if the secondary structure is known, the 3-d structure problem
becomes one of arranging the known secondary structure elements into the correct 3-d
structure. Some other uses of secondary structure prediction are fold recognition,
genome annotation, and predicting regions of a protein that are likely to undergo
³In old literature, the two terms are sometimes not distinguished.
structural changes. In the next chapter, we will address this very important task using
ANNs.
• Protein design problem
This problem is to identify the amino acid sequences that fold into a given native
conformation; thus, it can be considered the inverse problem of PSP. Unlike PSP, which
has only one desired solution (the native structure), the inverse problem is likely to
have many solutions, because it has been recognized that different protein sequences
may fold into very similar structures. For example, it was reported in [49] that two
non-homologous proteins, the third domain of ovomucoid and the C-terminal fragment
of the ribosomal L7/L12 protein, have very similar structures while possessing completely
different sequences.
The protein design problem on simplified lattice models has also been shown to be
NP-hard [63]. The problem is attracting active research, and researchers are asking
whether the (partial) success of the various AI techniques that have been applied to the
PSP problem could be replicated in the inverse problem.
Chapter 3
Prediction Approaches Overview
Many computational techniques have been employed for the PSP problem, to name a
few: artificial neural networks, evolutionary computation, and Monte Carlo search techniques.
To see the big picture of where and how these individual techniques are applied in the
landscape of protein structure prediction, it is useful to introduce the two main categories
of approaches, namely knowledge-based prediction and ab initio prediction. Knowledge-based
approaches rely on the existence and detection of homologous proteins
with known structure that serve as templates to model the target protein structure.
Overall, it is estimated that knowledge-based approaches can be applied to less than
half of novel proteins [74]. In many cases, given a novel protein sequence, there is no
homologous protein with known structure available in existing databases. Thus its structure
has to be modeled ab initio, which means we have to do a direct prediction based on the
sequence alone, plus known physical-chemical principles. Ab initio structure prediction
is arguably more useful than knowledge-based prediction because it can be applied more
generally. But currently ab initio prediction is very difficult and less successful. Both
knowledge-based and ab initio approaches try to predict a 3-d model of protein
structure directly, although sometimes only a simplified model. A third category of approaches
to protein structure prediction focuses on predicting intermediate structures or values
of structural features, such as secondary structure, residue distances, or contact maps,
which provide important information for aiding 3-d prediction. Compared with a full 3-d model,
predicting these 1-d or 2-d features is more tractable, and various AI techniques have been
applied with good results.
This chapter will only cover general ideas of how the above-mentioned approaches work
to find protein native structures; it is organized by the different categories of approaches.
Analysis of selected AI techniques involved in these approaches will be discussed in the next
chapter. PSP is a complex problem, and the classification of approaches is not itself the focus of
this chapter; others might reasonably classify some approaches differently, as many overlap
or share characteristics. Our intent is to structure the presentation to give the big picture
for the individual techniques discussed in Chapter four.
3.1 Knowledge-based Prediction
Homology (comparative) modeling and fold recognition (threading) are the two major
knowledge-based approaches. In these approaches, we do not have to care about the folding
mechanics of a protein; we make use of the large amount of available sequence and structure
data by comparing, analyzing, and inferring from it. This is an example of a scientific problem
that can be (partially) solved in practice without first obtaining a complete understanding
of the protein folding process as it occurs in nature.
The major difference between comparative modeling and threading lies in whether a
homologue of the target protein can be found through mere sequence alignment. If
sequence comparison cannot find a template, the threading approach has to be tried.
3.1.1 Homology (Comparative) Modeling
Based on the principle that significant sequence similarity implies similarity in 3-d structure,
homology modeling first identifies a protein evolutionarily related to the target protein
through sequence alignment, then builds the 3-d model of the target protein using the known
structure of the related protein as a template. The basic assumption of homology modelling
is that the target and the template have identical backbones. The task is then to correctly
place the side chains of the target and build loop regions. To build side chains, molecular
dynamics simulations or other techniques can be applied. In more detail, homology modeling
comprises the following four steps.
1. Select the template. This is facilitated by searching databases with programs like
BLAST, FASTA, etc. If no such template exists, homology modeling is not applicable
and other approaches need to be used.
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 23
2. Construct a sequence alignment of the target protein and the template protein. The
aim of this step is to match each residue in the target sequence to its corresponding
residue in the template structure, allowing for insertions and deletions.
3. Build the model based on the target-template alignment. When the sequence alignment
is good, use the template structure directly as the target structure, replacing
the side chains of the residues that differ; a subsequent optimization step then
takes care of the side-chain interactions. When the target-template sequence similarity
is low, first build the backbone, then place the side chains, and finally optimize the
entire structure. Some of these techniques need a large amount of computational time
and user expertise.
4. Refine the model. Additional adjustments may be needed. Various methods exist for
this optimization stage, such as packing and energy calculations.
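Step 2 rests on pairwise sequence alignment. As an illustration of the underlying dynamic programming idea (not the actual BLAST/FASTA machinery, and with a deliberately crude +1/-1 scoring scheme instead of real substitution matrices and affine gap penalties), a minimal Needleman-Wunsch global aligner can be sketched as:

```python
# A minimal Needleman-Wunsch global alignment, the dynamic programming idea
# behind target-template alignment. Scoring: match +1, mismatch -1, gap -1.

def align(s, t, match=1, mismatch=-1, gap=-1):
    n, m = len(s), len(t)
    # F[i][j] = best score aligning s[:i] with t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # substitution
                          F[i - 1][j] + gap,       # gap in t
                          F[i][j - 1] + gap)       # gap in s
    # traceback to recover one optimal alignment
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), F[n][m]
```

For example, align("GAT", "GCAT") returns ("G-AT", "GCAT", 2), inserting one gap to match the three identical residues.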
The accuracy of homology modelling clearly depends on the degree of target-template
sequence identity. With high levels of identity (70%), homology-derived models can be as
accurate as experimentally derived ones. But if the identity is only about 30% or less, the
model built on the alignment would probably be completely wrong.
So far comparative modeling is still the most accurate approach in solving PSP, but it
is limited by the absolute need for a related template structure.
3.1.2 Fold Recognition (Threading)
If a highly similar sequence with known structure cannot be found, a new protein may
still be structurally similar to some protein with known structure; in this case, the two
proteins are said to be remotely homologous. Fold recognition is aimed at identifying the
remote homologue from a collection of candidate folds. If such a fold template exists,
threading is used to provide a sequence-structure alignment between target sequence and
template structure, rather than mere sequence alignment as in homology modeling. In
actual operation, the two tasks are usually handled together: given a collection of potential
fold templates, for each template, the query sequence is threaded onto the known structure
template. It then follows an assessment of how well the query sequence fits each structure
template (sequence-structure compatibility) using some scoring function. Threading can
be no-gap or gapped, where gapped threading allows gaps in the match of sequence to
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 24
fold. The scoring function can be either amino acid structural propensities 1321 or mean-
force (statistical) potentials [42]. To speed up the process, other techniques have also been
proposed. In [43], profile-based sequence alignments are used to align the query sequence
and the sequence of the candidate template. Feed-forward neural networks are then used to
score the structural similarity of the two proteins. Kernel methods have also been applied
to detect remote homology, with good results [43].
Various fold recognition methods generally share four components:
• A library of possible structural templates.
• A scoring function that distinguishes better threadings.
• An efficient algorithm that searches all possible alignments of the target sequence with
every possible fold in the library. Computing the optimal gapped alignment is an NP-complete
problem if the scoring function takes pair interactions into consideration; in
these cases, approximations or heuristics need to be used.
• A method to assess significance when selecting the best template candidate.
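To make the scoring component concrete, here is a toy sketch of gapless threading: the query sequence is placed directly onto a template's contact map and scored with a pairwise contact potential. The two-letter potential, the contact map, and the candidate sequences are invented for illustration; real methods use gapped alignments and far richer scoring functions.

```python
# Toy gapless threading: score sequence-structure compatibility by summing
# a pairwise contact potential over a template's contact map.

def thread_score(sequence, template_contacts, potential):
    """Sum pairwise energies for residues landing on template contacts."""
    return sum(potential.get((sequence[i], sequence[j]), 0.0)
               for i, j in template_contacts)

# Two-letter (H/P) potential: only H-H contacts are favorable.
potential = {("H", "H"): -1.0}
contacts = [(0, 3), (1, 2)]           # assumed template contact map

# Pick the best-scoring (lowest-energy) threading among candidate sequences:
best = min(["HPPH", "HHHH", "PPPP"],
           key=lambda s: thread_score(s, contacts, potential))
```

Here "HHHH" wins with a score of -2.0, since both template contacts become favorable H-H pairs; in a real system the score would also feed a significance assessment before a template is accepted.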
Success of this approach also depends on the degree of similarity between the known
and modeled structures.
3.2 Ab initio Prediction
Ab initio prediction approaches are those that do not rely on known 3-d structures; rather,
they are based on Anfinsen's "thermodynamic hypothesis" [1], which asserts that the native
structure of a protein corresponds to its minimum free energy state. Accordingly, many ab
initio prediction methods are formulated as optimization problems and are computationally
intensive. If this category of methods works well, it can identify not only the in vivo structures
of natural proteins but also the structures of arbitrary polypeptides in arbitrary environments.
Therefore ab initio prediction is significant not only for the new proteins that cannot be
modeled with knowledge-based methods, but also for drug design. However,
compared with knowledge-based methods, ab initio prediction is less successful, and the models
produced are not very useful yet - limited to short proteins and coarse models.
In the ab initio prediction category, there are roughly four major approaches: dynamic
modeling, energy minimization, specific protein structure prediction, and other approaches.
In this section we briefly discuss dynamic modeling and energy minimization. Specific
protein structure prediction refers to the structure prediction of specific types of proteins,
e.g. transmembrane proteins; it generally needs more specific domain knowledge. "Other
approaches" refers to hybrid approaches or those that are hard to classify. Many
of them achieve good prediction results, e.g. the building block approach [88] and the Rosetta
program, but like specific protein structure prediction, they lack generality. We do not cover
them here.
Most research on ab initio approaches focuses on improving the energy function and the
search techniques to achieve faster or more accurate prediction; examples can be found
in [85, 57]. More about energy functions and search techniques will be discussed in Chapter
four.
3.2.1 Dynamic Modeling
Dynamic modeling uses molecular dynamics (MD) simulation to obtain the protein native
structure. Assuming our description of all forces at the atomic level is accurate, then given
any conformational state of a protein system, we should be able to calculate the forces that
the atoms in the system exert on each other and where each atom is moving. Following the
trajectory of the system, eventually the system will rest in its lowest energy state, which
corresponds to the native conformation of the protein.
However, there are two problems with this approach. Firstly, we do not have an accurate
description of all forces at the atomic level. There are approximate models available, e.g.
empirical potentials or quantum-mechanical formulas, but they are not accurate enough.
Secondly, dynamic modeling often encounters the limits of computational power. In the
dynamic system, while one atom moves under the influence of all the other atoms, the other
atoms are also in motion; in other words, the force fields are constantly changing. Thus, we
need to constantly recalculate the forces between each pair of atoms and their positions in
very small timesteps. In principle, this requires n² calculations in each timestep, where n is
the number of atoms in the protein and its surrounding environment. Because the timestep
must be chosen small enough to avoid discretization errors (usually on the order of 10^-14
s, which is the same timescale as bond formation) and the number of timesteps, and thus
the simulation time, must be large enough to capture the effect, the calculation becomes
huge. In fact, the need to recalculate the forces is the main bottleneck of this method. So
far, we can only simulate a very short span of this dynamic process, on the order of
nanoseconds to microseconds, which is far from enough because proteins fold on the
timescale of milliseconds or longer [82].
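The scale of this bottleneck is easy to estimate; the atom count, timestep, and folding time below are illustrative assumptions only:

```python
# Rough arithmetic behind the MD bottleneck: ~n^2 pair forces per timestep,
# with timesteps of ~1e-14 s needed over a full folding timescale.
# All three input values are illustrative assumptions.

n_atoms = 10_000        # protein plus surrounding solvent (assumed)
timestep = 1e-14        # seconds, small enough to avoid discretization error
folding_time = 1e-3     # seconds, a typical folding timescale

steps = folding_time / timestep          # number of timesteps needed (~1e11)
force_evals = steps * n_atoms ** 2       # pairwise force evaluations (~1e19)

print(f"{steps:.0e} timesteps, {force_evals:.0e} pairwise force evaluations")
```

Even under these modest assumptions, on the order of 10^19 pairwise force evaluations are needed, which is why brute-force MD cannot yet reach folding timescales.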
Normally dynamic modeling simulations require a full atomic description of the protein
and a detailed energy function.
3.2.2 Energy Minimization
Because it is believed that the native state of a protein corresponds to its minimum free
energy state, if we can find the minimum energy state on the energy landscape, we can
obtain the native conformation. The energy landscape of a protein is the variation of its
free energy as a function of its conformation, owing to the interactions between the amino
acid residues. As shown in Figure 3.1, this energy landscape usually has a funneled shape
which leads towards the native state. For a realistic-sized protein, the energy landscape is
very complicated because it has many parameters and an enormous number of local
minima.
Figure 3.1: A hypothetical energy landscape exhibiting a folding funnel
In general, energy minimization approaches comprise the following three components
[58]: a representation of protein geometry, a potential energy function that can distinguish
between favorable and non-favorable structures, and a search technique to explore
the conformational space. In each of the components, large approximations are required
because of the complexity of the problem; different computational approaches differ in which
simplifications are made. A brief discussion of each component follows. More details
about protein representation and energy functions can be found in Chapters two and four.
For protein representation, because an all-atom model of a protein is computationally
expensive, a simplified protein representation is often adopted. Simplifications include
methods using one or a few atoms per residue, as well as lattice representations of proteins.
Computational analyses of the PSP problem have shown that it is intractable even
on the simplest 2-d HP lattice models [6]. Simplified models cannot actually give 3-d structure
predictions of real proteins, but they are inexpensive to use while capturing many global
aspects of protein structures; thus current research in the energy minimization approach mainly
focuses on simplified models.
Formulating a good energy function is always important, yet difficult. Approximate
energy functions include atom-based potentials from molecular mechanics packages such
as CHARMM [51] or AMBER [17], statistical potentials of mean force derived from many
known structures of proteins, and simplified potentials based on chemical intuition.
Given an energy function, many intelligent search techniques have been applied to
improve the sampling and the convergence of the search, such as Monte Carlo methods, simulated
annealing, and evolutionary computation. Take the Monte Carlo method as an example. To
minimize a given energy function, take a small conformational step and calculate the free energy
of the new conformation. If the free energy is reduced compared to the old conformation
(i.e. a downhill move), then the new conformation is accepted, and the search continues
from there. If the free energy increases, then a nondeterministic decision is made: the new
conformation is accepted if the Metropolis test is positive. These search methods are sometimes
coupled with the use of other structural information or multiple processors to achieve
better results. For example, an interesting approach [81] uses a Monte Carlo optimization
of a statistical energy function to assemble the whole protein model from relatively short
building blocks. These candidate blocks are obtained from known protein structures using
energetic, geometric, or sequence similarity filters. While the energy function issue needs to
be addressed primarily by biochemists, the search for an optimal or near-optimal solution
attracts the research attention of computing scientists.
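The Metropolis test mentioned above accepts an uphill move with probability exp(-ΔE/kT). A minimal sketch on a toy one-dimensional energy function follows; for a real protein, the random step and the quadratic energy would be replaced by conformational moves and a potential energy function:

```python
# Monte Carlo minimization with the Metropolis acceptance rule, applied to
# a toy 1-d quadratic "energy landscape" (illustration only).

import math, random

def metropolis_minimize(energy, x0, steps=10_000, step_size=0.1, kT=1.0, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)  # small trial step
        e_new = energy(x_new)
        delta = e_new - e
        # downhill moves always accepted; uphill with probability exp(-dE/kT)
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy energy landscape with its minimum at x = 2
x_min, e_min = metropolis_minimize(lambda x: (x - 2.0) ** 2, x0=10.0)
```

The occasional acceptance of uphill moves is what lets the search escape local minima, which a pure downhill (greedy) minimizer cannot do on a rugged landscape.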
To conclude, to use the energy minimization approach to investigate protein structure
prediction, we need to pick a representation of protein geometry, an appropriate energy function,
and a search technique. Not all combinations of choices will work well. For
example, the commercial energy packages CHARMM and AMBER are not suitable as
fitness functions for evolutionary algorithms, partly because they examine atomic interactions
and the energy computation per generation is too expensive [33]. General problems with
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 28
this category of approaches are: expensive computation; energy functions that may lack
a strong physical basis; and failure to converge to the correct result. Note also that,
since most research in this approach focuses on simplified models, the results are more
conceptual than capable of producing working 3-d structure models.
3.3 Structural Features Prediction
Structural features of a protein include secondary structure, inter-residue distances, disulfide
bond formation, etc. Structural feature prediction maps these structurally measurable
features onto an amino acid sequence. The predicted elements can be used to provide
constraints for tertiary structure prediction methods or as part of the prediction process. For
example, the results of secondary structure prediction have been integrated into many tertiary
structure prediction approaches. Compared with predicting a complete 3-d structure, structural
feature prediction is smaller in scale and difficulty, and its results contribute increasingly
significantly to the final goal of predicting the full tertiary structure.
When predicting structural features, statistical or empirical approaches are normally
adopted. Examples of sequences and their corresponding known structural features are
collected from existing databases. Techniques from statistics or AI are then used to
derive meaningful relationships, which may take the form of a neural network, a set of
rules, or an analytical relationship. These are then applied to sequences of unknown
structure to predict their structural features. Among the many techniques applied,
artificial neural networks are more recent and more successful; they are discussed in
detail in Chapter four.
3.3.1 Secondary Structure Prediction
Secondary structure is a very important feature when examining tertiary structure. If
the secondary structure of a given protein sequence is known, the 3-d problem becomes
arranging the known secondary structure elements into the correct 3-d structure. In this
sense, secondary structure prediction can be considered as a sub-problem of PSP.
This prediction problem can be viewed as classifying each amino acid in a sequence into
one of the three classes of secondary structure: H (helix), B (strand), and C (coil).
Among the many different techniques used in secondary structure prediction, ANNs have
proven successful. One of the first attempts to achieve over 70% prediction accuracy was
PHD [70], using a sliding window and a standard 3-layer neural network that was trained
on a carefully selected set of proteins. ANNs have also been used successfully in PSI-
PRED [43] and in [64]. ANN-based methods can now achieve a Q3 (discussed in Section
4.3.3) accuracy of almost 80%.
More discussion on neural networks applied in secondary structure prediction will be
provided in Section 4.4. Recently, Kernel methods have been applied and also perform well
in accuracy.
Chapter 4
AI Techniques for PSP
While Chapter three gives a big picture of various approaches to the PSP problem, this
chapter focuses on selected AI techniques involved in those approaches. As discussed in
the previous chapters, protein structure prediction is a very complex problem and the search
space is not fully understood, so it cannot be addressed purely analytically. This is why
many AI techniques have long been applied to it. Among them, I am particularly interested
in those inspired by biological systems, especially evolutionary computation,
artificial neural networks and Lindenmayer systems.
When humans try to solve problems, looking at Nature's solutions has always been
a source of inspiration. Two powerful natural problem solvers are the human brain and
the evolutionary process [26]. Trying to design problem solvers based on human brains
leads to the field of neuro-computing; the evolutionary process forms the basis of evolutionary
computing. Although not as powerful as ANNs and evolutionary computing, Lindenmayer
systems, also biologically inspired, have found many applications in the computing world.
For the PSP problem, evolutionary computation is used as a population-based search
technique, mainly in the ab initio prediction approach. It represents an intelligent way of
searching for an optimal solution, and it is generally applicable: whenever there is some
reasonable method for scoring candidate solutions to a problem, evolutionary computation
can be applied. Lindenmayer systems, a novel generative encoding scheme for capturing
protein structures in lattice models, have been tested in evolutionary algorithms, but further
research is needed to investigate their applicability to the PSP problem. Artificial neural
networks are most successful in secondary structure prediction. For humans, a large memory
of stored examples can serve as the basis for intelligent inference. For the PSP problem,
ANNs infer meaningful relations
between primary sequences and secondary structures from a selected dataset.
4.1 Evolutionary Computation
Evolution and intelligence are closely related; Evolutionary Computation (EC) is considered
a subfield of Computational Intelligence by the IEEE Computational Intelligence Society. If a
system can adapt its behavior and evolve itself to meet certain goals in certain environments,
it is an intelligent system. By imitating the evolutionary process on computers, EC mimics
the intelligence associated with the problem-solving capabilities of evolution. In
real life, evolution creates very robust organisms; on computers, EC often produces good
solutions to hard problems.
Broadly speaking, EC refers to any biologically inspired, population-based search
technique that involves iterative development of solutions, such as ant colony optimization.
Narrowly speaking, EC refers to Evolutionary Algorithms (EAs), a family of
computational models inspired by Darwin's theory of evolution. EAs solve hard
computational problems by simulating the evolutionary processes of inheritance, mutation,
recombination and selection to evolve a good solution to a problem. As such, an EA
represents an intelligent way of searching for a near-optimal solution.
For ab initio prediction of protein structure, even for the simple HP lattice model,
the problem is proven computationally intractable [19]. Consequently, there is
much interest in effective techniques that can discover reasonably good solutions within an
acceptable time. Evolutionary computation was first applied to the PSP problem in the early
90s, with noticeable success. Beyond the PSP problem, the basic technique is both broadly
applicable and easily tailored to many bioinformatics problems; [33] is a good reference
for evolutionary computation in bioinformatics in general. In the late 90s and early 00s, some
researchers began to pursue multi-objective evolutionary approaches to the PSP problem, as
in [20, 24].
In this section, we first give an introduction to EAs in general. We then discuss how
EAs have been used in various approaches to the PSP problem, followed by a discussion of
some important issues that arise when applying EAs to the PSP problem.
4.1.1 Introduction to Evolutionary Algorithms
How does an evolutionary algorithm work? Generally, an EA manipulates a population of
individuals, each representing a single possible solution to the problem under
investigation. The EA starts with an initial population of n randomly generated solutions,
and a fitness value is then calculated for each solution using the fitness function of the
problem. Individuals with better fitness scores represent better solutions to the problem. After
this initialization, the main iterative cycle of the algorithm begins. Using certain variation
operators, the n individuals in the current population produce a number of children. The
children are then assigned fitness scores as well. Then, according to some selection criteria,
a new population of n individuals is selected from the current population and their children.
This new population becomes the current population and the iterative cycle is repeated
until some condition is met.
The above generic framework can be summarized as follows:
1. Initialize population of candidate solutions and evaluate each of them.
2. Select some of the population to be parents.
3. Apply variation operators to the parents to produce children.
4. Evaluate the children and include them into the population.
5. Repeat from step 2 until some termination condition is met.
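The five steps above can be sketched as a minimal, self-contained EA. The problem (maximize the number of 1-bits in a binary string, the classic "OneMax"), the operators (binary tournament, one-point crossover, point mutation) and all parameter values here are illustrative choices, not prescriptions from any study cited in this chapter.

```python
import random

def evolve(fitness, length=20, n=30, generations=100, p_mut=0.05):
    # Step 1: initialize a random population (fitness is evaluated on demand).
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        children = []
        for _ in range(n):
            # Step 2: parent selection by binary tournament.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            # Step 3: variation -- one-point crossover, then point mutation.
            cut = random.randrange(1, length)
            child = [1 - g if random.random() < p_mut else g
                     for g in p1[:cut] + p2[cut:]]
            children.append(child)
        # Step 4: evaluate the children and merge; survivor selection keeps
        # the n fittest of parents + children (elitist truncation).
        pop = sorted(pop + children, key=fitness, reverse=True)[:n]
    # Step 5 is the loop bound: stop after a fixed number of generations.
    return max(pop, key=fitness)

# OneMax: the fitness of a bit string is simply its number of 1-bits.
best = evolve(sum)
```

Every design choice in this skeleton (selection scheme, crossover point, mutation rate, stopping rule) is one of the factors discussed in the following subsections.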
While the basic computational framework is quite simple, it is the design and implementation
details that significantly affect the performance of EAs. There are no general
guidelines for choosing a specific design or implementation for different problems; recent
theory suggests the search for an "all-purpose" algorithm may be fruitless [26]. Thus the
choice of implementation is often based on experience or on trial and error. Some important
factors determining performance are the representation of individuals, the variation and
selection operators, and the fitness evaluation.
Representation
In evolutionary computing, representation is the translation of the problem space into
encodings that can be used for evolution, i.e., representing individual candidate solutions in a manner
that can be manipulated by evolutionary operators. Some commonly used representations
are: binary representations (gray coding can be used to ensure that consecutive integers
always have Hamming distance one); integer representations; real-valued or floating-point
representation; and permutation representation.
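The gray coding property mentioned above can be illustrated with the standard binary-reflected construction, g = b xor (b >> 1); a small sketch:

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count('1')

# Consecutive integers always have Hamming distance one in Gray code,
# so a one-bit mutation never has to jump across a large numeric gap.
assert all(hamming(to_gray(i), to_gray(i + 1)) == 1 for i in range(256))
```

This is why Gray coding is preferred over plain binary when small genotypic changes should mean small phenotypic changes.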
Traditionally, different types of EAs have been associated with different representations.
For example, Genetic Algorithms (GAs), the most widely known type of EAs, often use
fixed-length binary strings, while the finite state machine representation is often associated
with Evolutionary Programming. But there is no restriction on what representation to
use for a particular problem or algorithm. For example, since binary encoding is often
inappropriate, current GAs also use non-binary representations such as integer
strings, or even more general representations such as tree and matrix structures.
Thus, the best strategy is to choose a representation to suit the problem under
investigation and then choose variation operators to suit the representation. Selection
operators use only fitness and are therefore independent of representation.
Not only do individuals need to be represented; so does the population. Different types
of population representation can be seen in the literature. Two popular ones are the
single population and structured populations. In a single population, any individual may be
mated with any other. In a structured population, the population is decentralized into many
sub-populations, and thus the algorithm is decentralized. Greater performance is often achieved
using structured populations, but the implementation complexity is also greater.
Variation
Variation operators act on one or two parent individuals to produce offspring. They create
the necessary diversity of the population and heavily influence how effectively the algorithm
explores the search space.
Two types of variation operators are mutation and recombination. Mutation can be
viewed as single-parent production: a new individual is created by a random and slight
change from one parent. Thus mutation is always stochastic. Recombination, also called
"crossover" in evolutionary computing, can be viewed as two-parent production (or more
than two parents): each pair of parents selected is recombined to produce (a pair of) children.
Variation operators obviously have to match the given representation; e.g., binary and
real-valued representations have different variation techniques applied to them. For specific
problems, standard operators can be considered, but it may be more beneficial to design
operators that take advantage of domain knowledge.
Variation operators often have probability rates associated with them. These probabilities
are parameters of the algorithm and must be set beforehand; in actual problems,
we often need to tune them to find a reasonable setting for the problem under
investigation. A very small mutation rate may lead to premature convergence in a local
optimum, while a mutation rate that is too high may lead to loss of good solutions. There are
theoretical, but not yet practical, upper and lower bounds that can help guide the tuning of
these parameters.
Selection
As in natural selection, the selection operator in evolutionary computing applies evolutionary
pressure and is responsible for driving improvement of the population. As opposed to
variation operators, which act on individuals, selection operators work at the population level. In
EAs, selection is based on fitness scores and is applied either when choosing individuals to
breed children (parent selection) or when choosing individuals to form a new population
(survivor selection).
There are different selection methods and selection can be deterministic or probabilistic.
Because selection only considers fitness information, it works independently from the actual
representation. Therefore, selection methods are universally applicable to different problems
and representations. Popular and well-studied selection methods include roulette wheel
selection and tournament selection. In roulette wheel parent selection, each individual is
assigned a sector of a roulette wheel proportional to its fitness, and the wheel is spun
to select a parent. Tournament selection, by contrast, does not require global knowledge of
the fitness of the population; it requires only an ordering relation that can rank any two
individuals, and thus looks at relative rather than absolute fitness.
Most selection schemes are designed to allow a small portion of less fit solutions to be
selected, which helps maintain the diversity of the population and prevents premature
convergence on a local optimum.
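Both selection methods can be sketched directly. `pop` and `fit` are placeholder names for a population and its fitness function; the roulette wheel sketch assumes non-negative fitness values.

```python
import random

def roulette_wheel_select(pop, fit):
    """Spin a wheel whose sectors are proportional to fitness."""
    total = sum(fit(ind) for ind in pop)
    spin = random.uniform(0, total)
    acc = 0.0
    for ind in pop:
        acc += fit(ind)
        if acc >= spin:
            return ind
    return pop[-1]          # guard against floating-point round-off

def tournament_select(pop, fit, k=2):
    """Pick k individuals at random and return the fittest; this needs only
    an ordering between individuals, never whole-population fitness."""
    return max(random.sample(pop, k), key=fit)
```

Note the structural difference: the roulette wheel needs the total fitness of the population, while the tournament only compares the k sampled individuals.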
Fitness evaluation
The fitness of each individual is evaluated by a fitness function. A fitness function can be
viewed as a particular type of objective function that quantifies the goodness of a solution, or
in terms of a fitness landscape that shows the fitness of each possible individual.
An ideal fitness function correlates closely with the algorithm's goal and yet can be
computed quickly. Speed of execution is very important, since the evolutionary cycle must
be iterated many times before producing a usable result for a non-trivial problem.
EA variants
Some specific versions of EAs often addressed in the literature are listed as follows.

• Genetic algorithm (GA) - Initially proposed as an adaptive search technique [38], the GA
is the most widely known type of EA. Typically, candidate solutions are represented
by binary strings called chromosomes. The operators used in GAs reflect those found
in natural reproduction, namely mutation and crossover.

• Evolution strategy - Individuals are often represented as tuples of real values which,
compared to GAs, are closer to the natural problem representation. The main variation
operator used is mutation; mutations are usually introduced as Gaussian perturbations.
Evolution strategies have been successfully applied to many engineering applications.

• Evolutionary programming - Looks at evolving computer programs. Fogel [34] proposed
using the processes present in natural evolution to design intelligent agents,
these agents taking the form of computer programs, which in turn were represented
as finite state automata. These agents could then be used for prediction, control, or
classification tasks.

• Genetic programming - Individuals are in the form of computer programs, and their
fitness is measured by their ability to solve a computational problem.
These variations of EAs share an underlying framework but differ in the nature
of the particular problems to which they are applied and in implementation details.
Given their similarity, implementation details such as representation
and variation operators are often borrowed from one type of EA by another, so there is no
clear distinction between them. In the literature on computational biology problems, however,
GAs are used much more frequently than other EAs.
4.1.2 Evolutionary Algorithms for PSP
Evolutionary algorithms were first applied to the PSP problem in the early 90s, when Dandekar
and Argos conducted a series of studies [21, 22, 23]. Since then, many researchers have used
EC techniques in various approaches to the problem. Most commonly, EAs, usually genetic
algorithms, are applied in the ab initio prediction approach. EAs have also been applied to
secondary structure prediction; one example is [91], in which a GA was used to
supervise an artificial neural network predicting secondary structures.
In this section, we mainly discuss EAs applied in the ab initio prediction approach, in which
the PSP problem is cast as an optimization problem: the conformational space is
searched for the structure with the lowest free energy.
As discussed in the general settings of EAs above, designing an EA for the
ab initio PSP problem involves decisions on the following major issues:

• a protein representation;

• mutation and recombination operators for effective exploration of the conformation space;

• individual selection policies;

• a molecular interaction model (energy function) with which individual fitness will be measured.
In the following sections, we discuss these issues and survey common practices dealing
with these issues. Of course, for a full specification of an EA used in PSP problem, other
issues, for instance, the population size, termination criteria, the probability rates of muta-
tion and crossover, etc., have to be considered and specified to produce an executable EA.
We will briefly discuss some of these issues in the "Discussion" section. But largely, our
discussion focuses on the major issues, mostly at a conceptual rather than an executable
level.
Representation
In the literature, EAs have been applied to both off-lattice and lattice models. For each
type of model, structure representations can be further categorized as follows:

    Off-lattice: dihedral angles; distance matrix
    Lattice: Cartesian coordinates; internal coordinates (absolute direction, relative direction)

Figure 4.1: Classification of (direct) structure representations
All these representations use direct encoding of the folded chain, i.e., how each amino acid
(or other structural unit) along the protein chain is arranged in space is directly described.
Recently, some researchers proposed a completely different representation scheme for lattice
proteins: L-systems [27]. L-systems do not encode protein structures directly, but they can
generate directly encoded structures; thus they are a generative encoding scheme, discussed
in the next section. Here we give an overview of the direct representations that
can be used with EAs.
Off-lattice representation For EAs used on an off-lattice protein model, an individual
solution can be encoded in the dihedral angle representation, as in [22]. Because the
main degrees of freedom determining a protein's 3-d conformation are the two dihedral angles
φ and ψ on either side of each Cα atom, a protein conformation can be represented as a
vector of these angle pairs along the main chain: [(φ1, ψ1), (φ2, ψ2), ..., (φn, ψn)]. This
representation can be easily converted to Cartesian coordinates of Cα atoms; the conversion
formula can be found in [35]. The dihedral angle representation also has the advantage of
preserving well-predicted local segments, since local fragments of the structure are encoded
contiguously: when the crossover operator is applied, well-predicted secondary structure
segments are more likely to be kept and inherited by the next generation. The dihedral
angles themselves can take real-number values. Alternatively, because dihedral angles are
restricted to certain ranges of values, they can be discretized and each discrete dihedral
angle encoded as an integer, or as a bit string as in [22]. In practice, the range of these
angles can be further bounded through preprocessing to reduce
the size of the conformational space.
Another type of off-lattice representation usable with EAs was given in [62], which
introduced a distance matrix representation of residue positions. A distance matrix
contains the distance between every residue pair, and the Cartesian coordinates can be
inferred from the distance matrix.
Lattice representation For EAs used on lattice models, individual structures can be
represented using Cartesian coordinates [89] or, more commonly, an internal coordinates
representation [60, 18, 77].
In the Cartesian coordinates representation, each vertex in the lattice has a set of
coordinates; thus a protein conformation on a 2-d lattice can be encoded as a vector of
coordinates [(x1, y1), (x2, y2), ..., (xn, yn)], where (xi, yi) is the Cartesian coordinates of
the vertex occupied by the ith amino acid. A 3-d lattice requires three coordinates per
amino acid.
In internal coordinates representation, the location of one amino acid is specified in terms
of its previous one on the protein sequence. Thus, a protein conformation can be represented
by a direction list expressing a sequence of moves. Obviously this representation depends
on the particular lattice topology considered. Internal coordinates representation can be
further classified into two major schemes: absolute and relative.
The absolute scheme, as studied in [48], uses an absolute direction reference system with
respect to which moves are specified. Take the 2-d square lattice as an example (the
extension to other lattices is straightforward): the four absolute directions North, South,
East and West are the natural reference system. Using this reference
system, a conformation can be expressed as a sequence S ∈ {N, S, E, W}^(n-1), where n is
the length of the protein sequence (the location of the first amino acid is fixed). Thus the
very simple 6-residue conformation shown in Figure 4.2(a) below can be expressed as
S_absolute = ENESE. In the relative direction scheme [60, 77], the reference system is not
fixed; each move is specified relative to the direction of the previous move rather than
relative to the absolute axes defined by the lattice. Again taking the 2-d square lattice as
an example, three directions, Forward, Right-turn and Left-turn, suffice to specify each new
move relative to the previous one; thus a conformation can be expressed as a sequence
S ∈ {F, R, L}^(n-1) (the first move is always Forward). The example structure in
Figure 4.2(a) is then expressed, in this reference system, as S_relative = FLRRL.
Figure 4.2: (a): A very simple 6-residue conformation is represented in absolute direction as ENESE, in relative direction as FLRRL. (b) and (c) show two possible arrangements after a point mutation at the 3rd residue position.
The relative direction representation scheme has the advantage of guaranteeing that all
solutions are at least 1-step self-avoiding, since there is no "back" move. Self-avoidance (no
clash between chain elements) is the basic condition for a valid lattice conformation.
A comparative study [48] shows that this representation scheme is
almost always better than the absolute encoding of directions for the square and cubic
lattices.
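The two encodings can be made concrete with small decoders on the 2-d square lattice. Positions are held as complex numbers (real part = x, imaginary part = y), and the initial heading for the relative scheme is taken to be East; these conventions and all names are illustrative, not from the cited studies.

```python
ABSOLUTE_MOVES = {'E': 1, 'W': -1, 'N': 1j, 'S': -1j}

def decode_absolute(s, start=0):
    """Turn an absolute-direction string into a list of lattice positions."""
    pos = [start]
    for move in s:
        pos.append(pos[-1] + ABSOLUTE_MOVES[move])
    return pos

def decode_relative(s, start=0, heading=1):
    """Turn a relative-direction string (F/L/R) into lattice positions."""
    pos = [start]
    for move in s:
        if move == 'L':
            heading *= 1j     # rotate heading 90 degrees left
        elif move == 'R':
            heading *= -1j    # rotate heading 90 degrees right
        pos.append(pos[-1] + heading)
    return pos

def self_avoiding(pos):
    """Valid conformations occupy each lattice site at most once."""
    return len(pos) == len(set(pos))

# The 6-residue example of Figure 4.2(a): both encodings give the same shape,
# and the conformation is self-avoiding.
assert decode_absolute('ENESE') == decode_relative('FLRRL')
assert self_avoiding(decode_relative('FLRRL'))
```

A decoder like this is also where the collision check mentioned above would hook in: any conformation failing `self_avoiding` must be repaired or rejected.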
One problem when using these representations is that some mechanism needs to be in
place to ensure the encoded structure is collision-free, which means the representation has
to observe geometrical constraints to be valid. More discussion about general constraint
handling in EAs for PSP is given later (see 4.1.2 - Other design issues).
Variation operators
When designing variation operators, it is obvious that they have to match the protein
representation.
We first discuss variation operators of EAs on 2-d lattice models. An early study of the
use of EAs on the 2-d square lattice model was [89], in which Genetic Algorithms were
investigated and protein conformations were encoded as actual lattice coordinates. In this
study, mutations were implemented by a rotation of the structure around a randomly selected
coordinate. Unlike most GAs applied to other problems, in which the mutation rate is kept
low, they found that, for protein structure prediction on simple lattice models, a higher rate
of mutation is beneficial. Crossover was implemented by swapping a pair of selected parent
structures at randomly selected cutting points. On a square lattice, there are three possible
orientations by which two fragment structures can be joined. All three possibilities were
tested in order to find a valid, collision-free one. In the study, a quality control mechanism
was introduced to the recombination process by requiring the fitness value of the child con-
formation to be not worse than the average fitness of its parents. This was implemented
by performing a Metropolis test comparing the energy of the child to the average energy of
its parents. If the child conformation was rejected, new parents had to be selected. This
study also demonstrated that the performance of the EA approach, at least on simple models,
was better than that of Monte Carlo-based approaches.
If protein conformations are encoded not as actual lattice coordinates but in internal
coordinates, the effect of mutation operators depends on the specific representation used.
Consider the effect of a one-point mutation on the structure in Figure 4.2(a). We know from
the previous section that, in the relative direction representation, this 6-residue lattice
conformation is S_relative = FLRRL. A mutation at the 3rd position could produce either
S'_relative = FLFRL or S''_relative = FLLRL, shown in Figure 4.2(b) and (c)
respectively. However, if the structure in (a) is expressed in the absolute direction
representation as S_absolute = ENESE, then to produce the same conformations as in (b) and
(c), all three position values from the 3rd position onward have to be mutated; the
corresponding representations are S'_absolute = ENNEN and S''_absolute = ENWNW respectively.
We can see from this example that a one-point mutation in the relative direction repre-
sentation produces a rotation effect in the structure at the mutated point. To produce the
same effect in the absolute direction representation, a multiple-point mutation is needed,
i.e., all the position values beyond the mutation point need to be simultaneously mutated to
produce the same change in the structure. Conversely, a one-point mutation in an absolute
direction representation leaves the orientation of the rest of the structure unchanged;
to achieve the same effect in the relative representation, changes at two subsequent position
values are needed.
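This asymmetry can be checked directly with small lattice decoders (repeated here so the sketch is self-contained; the complex-number convention and the East initial heading are illustrative assumptions, as before):

```python
ABSOLUTE_MOVES = {'E': 1, 'W': -1, 'N': 1j, 'S': -1j}

def decode_absolute(s, pos=0):
    out = [pos]
    for move in s:
        out.append(out[-1] + ABSOLUTE_MOVES[move])
    return out

def decode_relative(s, pos=0, heading=1):
    out = [pos]
    for move in s:
        if move == 'L':
            heading *= 1j     # left turn
        elif move == 'R':
            heading *= -1j    # right turn
        out.append(out[-1] + heading)
    return out

# A one-point relative mutation (3rd symbol of FLRRL) corresponds to an
# absolute-encoding change at every position from the 3rd onward:
assert decode_relative('FLFRL') == decode_absolute('ENNEN')   # Figure 4.2(b)
assert decode_relative('FLLRL') == decode_absolute('ENWNW')   # Figure 4.2(c)
```

The assertions confirm the worked example: a single relative-direction flip rotates the whole tail of the chain.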
As for the crossover operation, most studies use a cut-and-paste type, but reference [68]
presented an interesting deviation. They investigated GAs on lattice-based models. The
mutation was introduced as a Monte Carlo step, in which each move changed the local
arrangement of short (2-8 residue) segments of the protein chain. The crossover operation
was performed by averaging two selected parents: first the parents were superimposed on
each other to ensure a common frame of reference and then the locations of corresponding
structural elements in each parent were averaged to produce a child structure that lay in
the middle of the two parents. A refitting step was then required in order to place the child
structure back within lattice coordinates. In the study, this new GA implementation was
compared to Monte Carlo search and to a standard GA, and shown to be more effective
than standard GA implementations; the superiority of both GA methods over MC
search was also demonstrated.
The above discussion is on lattice proteins. For dihedral angle off-lattice representation,
a simple way to introduce a mutation is to change the value of a single dihedral angle. This
can be done in two ways: allowing only a small change in the value, or allowing complete
random assignment of the dihedral angle value for a single amino acid. As in the relative
direction representation in lattice models, one change in a dihedral value may have a large
effect on the overall structure, because it rotates the entire arm of the structure
beyond the mutated dihedral angle, which may cause collisions between many atoms.
The crossover operator is mostly implemented as a cut-and-paste operation over the lists
of dihedral angles, as in [22]; the child structure thus contains part of each
parent's structure. As with mutation, this may also lead to collisions. Since detecting
collisions in off-lattice models is much more difficult than in lattice models, almost every
implementation needs to carefully address this issue and devise a way to handle it.
Even when the child structure resulting from the crossover operator is collision-free, it may
have another problem: being too open (not compact enough to be globular) and thus not
likely to be a good candidate for further modification. To overcome these problems, many
implementations include explicit quality control procedures applied after the
variation operators. These procedures may include several rounds of energy minimization
to relieve collisions, loose conformations, etc.
While some ordinary implementations of variation operators are shared by many studies,
the manner and order in which they are applied differ for each actual algorithm. Beyond
the regular operators mentioned above, many special operators have been devised in the
literature. We have already given the example of [68], in which a Cartesian-space
operator is used for recombination in a GA. Two more examples follow. In [77], a
specially devised operator named "partial optimization" was employed on lattice proteins.
The idea is to randomly select two non-consecutive residues of the protein,
fix their positions in the lattice, and then place some intermediate residues by evaluating
all the different possibilities for them. The conformation that gives the
best fitness is kept. The number of intermediate residues to be permuted is a user-defined
parameter named the partial optimization size. Another example is a rotation operator,
designed in [48], which is actually a mutation operator that flips a part of the folded chain
along a certain symmetry axis.
Fitness functions
It cannot be over-emphasized how important the fitness function is to the prediction result.
The fitness of each solution must be an accurate reflection of the problem or else the evolu-
tionary process will find the right solution for the wrong problem. Defining an appropriate
fitness function can be challenging in any evolutionary algorithm.
In almost all EA approaches to the PSP problem, the fitness function adopts some form
of potential energy function. This makes the design of the EA fitness function easier,
because many energy functions already exist, but it also makes it hard to distinguish the
performance of the energy function from that of the EA itself. The
wide variety of energy functions used in EAs ranges from the hydrophobic
potential of the HP lattice model to much more detailed energy models such as CHARMM (see
2.4). Because it is very easy to incorporate and modify energy functions within the
framework of an EA, many researchers develop their own energy function
terms to suit their specific needs; thus the energy functions used in EAs are very varied. In
this section, we survey some typical energy functions used in EAs, with emphasis
on the simple HP model. Further discussion of the dilemma of energy functions
used as fitness functions is given in the Discussion section 4.1.3, 'More on energy function'.
For lattice models, the simplest energy function is that of the HP model, in which every direct hydrophobic-hydrophobic (HH) amino acid contact is rewarded, as shown in the table:

Table 4.1: Energy potential pij for the HP evaluation function

    pij    H    P
    H     -1    0
    P      0    0

The optimal structure is the one with the largest number of HH contacts for a given protein
sequence. Figure 2.1(b) shows a sequence embedded in a triangular lattice with HH contacts highlighted with curved lines. Given that each HH contact has a value of -1, as specified in Table 4.1, the conformation in Figure 2.1(b) has an energy of -4. Many EAs working on the HP lattice model use this simple energy potential to measure the fitness of individual solutions, yet it is too coarse in some cases. For instance, examine the two conformations in Figure 4.3:
Figure 4.3: (a) and (b) are different conformations but have equal energy values.
Conformation (a) is obviously closer to forming the optimal conformation than (b). But because only direct HH contacts are rewarded, the two conformations receive equal energy values under the simple energy function of Table 4.1. In other words, this function cannot effectively distinguish between some individual solutions in an EA, and will thus cause many plateaus in the energy landscape and trap the search.
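As an illustration, the simple contact potential of Table 4.1 can be sketched as follows. This is our own hypothetical helper, not code from the cited studies; it assumes a 2-d square lattice with unit spacing and a conformation given as residue coordinates:

```python
def hp_energy(sequence, coords):
    """Contact energy of a lattice conformation under the simple HP model.

    sequence -- string over {'H', 'P'}, one character per residue
    coords   -- list of (x, y) lattice vertices, one per residue

    Each pair of H residues that are adjacent on the lattice but not
    consecutive in the chain contributes -1 (Table 4.1); all other
    pairs contribute 0.
    """
    energy = 0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 2, n):                      # skip chain neighbours
            if sequence[i] == 'H' == sequence[j]:
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # adjacent vertices
                    energy -= 1
    return energy
```

For a four-residue chain folded into a unit square, an H at each end forms a single non-consecutive HH contact, giving an energy of -1.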
There are ways to avoid the trap. One remedy is to augment the energy function with a distance-dependent HH potential, as proposed in [48]. Since the distances between amino acids form a countable set, it is possible to construct a distance-dependent potential that preserves the ranking of conformations in the standard HP model while enabling a finer distinction between conformations with the same number of HH contacts. For example, if dij is the distance between two hydrophobic amino acids Hi and Hj, reference [48] gave a modified energy potential as follows:

where NH is the number of hydrophobic amino acids in the sequence, and k = 4 for the square lattice and k = 5 for the triangular and cubic lattices. It was also suggested that the modified energy formulation is especially effective for hybrid EAs that use a local search method.
Another remedy was proposed in [77]. The "radius of gyration" (RG) is used to estimate the compactness of a set of amino acids: the more compact a conformation is, the smaller its radius of gyration. The hope is that, by integrating RG into the fitness function, the fitness landscape changes so that more compact conformations with the same number of HH bonds are rewarded, bringing the evaluation closer to reality.
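The radius of gyration itself is straightforward to compute. The sketch below is our own illustration (a 2-d coordinate layout is assumed); a fitness function could, for example, add a small multiple of this value to the contact energy, though the weighting is a design choice and not prescribed by [77]:

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the residues from their centroid.

    A more compact conformation has a smaller radius of gyration, so
    adding a (small) multiple of this value to the HP contact energy
    rewards compactness among conformations with equal HH counts.
    """
    n = len(coords)
    cx = sum(x for x, _ in coords) / n          # centroid x
    cy = sum(y for _, y in coords) / n          # centroid y
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in coords) / n)
```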
The above simple energy function for the HP model can be extended in various ways, either to fit more complicated lattice models or to account for more detailed energy terms. In [69], the charge property of amino acids is taken into consideration: amino acids are classified into four types (hydrophobic, positively charged, negatively charged, or neutral) rather than just two, and the energy potential table expands to 4 x 4 accordingly. In addition, different degrees of polarity or hydrophobicity for different amino acids can be used to make the energy function more detailed, in the hope of yielding conformations closer to the native ones. Examples of such functions can be found in [35].
For off-lattice models, a very simple energy function is an adaptation of the lattice HP function to off-lattice environments. The energy function can take into account only the distances between interacting residues, which can be calculated using the empirical mean distance between consecutive residues in proteins. An optimal interaction potential equivalent to the lattice interaction potential for neighboring hydrophobic residues occurs at unit distance¹. Smaller distances are penalized to enforce steric constraints, i.e., to avoid residue clashes. In [35], one version of the total-energy calculation is provided as:

where E is the total energy of a conformation, eij is the energy potential between residues i and j, dij is the distance between residues i and j, γ and ε are constant parameters, and pij is the interaction potential according to Table 4.1.

¹This distance is roughly 3.8 Å and can be set as the unit distance. The distance between a pair of interacting residues can then be calculated using this distance and angular values.

For the dihedral-angle off-lattice representation, the total energy is generally calculated as the sum of several energy potentials, typically of the form shown in Section 2.4, in which various bonded and non-bonded potentials are calculated. The popular
CHARMM force field is in this category of energy functions. Another example is the function used in [22], in which small helical proteins were successfully folded using a GA. The fitness (energy) function took into account bad clashes, secondary structure formation, tertiary structure formation, hydrophobic burial, and hydrogen bonding. Normally, this category of energy function is a linear sum of several energy terms. In the interesting energy function used in [61], however, the terms were normalized and then multiplied rather than added. This ensures that all the terms have reasonable values, since even one bad term can significantly affect the total score.
One more special type of energy function adopted in EAs for PSP uses empirically derived contact potentials for amino acid interactions. A contact potential describes the energy between two residues close enough to each other (typically ≤ 6.5 Å). In [53], such a contact potential was determined for all pairs of amino acid types using 1168 known structures. These potentials are then used in a function similar to that of Section 2.4 to calculate the total energy.
Other design issues
Prevention of premature convergence on undesired solutions: These undesired solutions are often local minima. It is common that, during successive generations, one or a very few solutions take over the population. Once this happens, the rate of evolution drops dramatically: crossover becomes meaningless and advances are achieved only by mutations, at a very slow rate. Several approaches have been suggested to avoid this situation, including temporarily increasing the mutation rate until the diversity of the population is regained; isolating unrelated sub-populations and allowing them to interact with each other whenever a given sub-population becomes frozen; and rejecting new solutions that are too similar to solutions already in the population.
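The last of these mechanisms can be sketched as a crowding-style acceptance test. This is an illustrative fragment of our own, assuming conformations encoded as equal-length move strings; the function name and threshold are hypothetical:

```python
def too_similar(candidate, population, min_distance):
    """Diversity test: reject a candidate (encoded as a move string)
    when its Hamming distance to any existing population member
    falls below min_distance. Equal-length encodings are assumed."""
    for member in population:
        if sum(a != b for a, b in zip(candidate, member)) < min_distance:
            return True
    return False
```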
Geometrical constraints: Like many practical problems, the PSP problem is constrained. Two types of constraints must be enforced to define a feasible conformation: the connectivity of the chain and a collision-free conformation.
Many implementations use internal-coordinates representations to implicitly handle the first constraint (the off-lattice dihedral-angle representation is in fact a kind of internal-coordinates representation). As for the second constraint, for off-lattice models it means that some torsion-angle ranges are not allowed and residues must not collide; for lattice models it means that the conformational path has to form a self-avoiding walk in the lattice. Thus, not all possible individuals represent valid solutions. From one perspective, this provides extra information the EA can use to narrow down the search space; from another, it adds extra dimension(s) to an already high-dimensional problem, and may make the search more difficult to handle.
Generally speaking, constraint handling in EAs is not straightforward, because the variation operators (mutation and recombination) are typically "blind" to constraints. That is, even if the parents satisfy some constraints, there is no guarantee that the offspring will satisfy them as well. In [26], several ways of handling constraints in EAs are introduced at the conceptual level:
- Use penalty functions to reduce the fitness of infeasible solutions; the fitness may be reduced in proportion to the number of constraints violated, or to the distance from the feasible region.
- Use mechanisms that take infeasible solutions and "repair" them to the closest feasible ones.
- Use a problem-specific representation alphabet, plus suitable initialization, recombination, and mutation operators, such that the feasibility of a solution is always ensured.
These constraint-handling methods have all been employed in various EAs for the PSP problem. In [44] and [77], penalty functions are used to measure the extent to which the constraints are violated. Infeasible solutions are allowed, but they are assigned a lower fitness value through a penalizing term. In [18], an alternative was explored: a repair procedure maps infeasible solutions to feasible conformations, and the evolutionary operators are designed to be closed in the feasible space. There are also techniques designed for particular representations; e.g., to enforce the collision-free constraint on a lattice model with the absolute-coordinates representation, one simple way is to mark lattice vertices as free or occupied.
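A minimal penalty term of this kind might simply count multiply-occupied lattice vertices. The sketch below is our own illustration; the weight is a hypothetical tunable parameter, not one taken from [44] or [77]:

```python
def collision_penalty(coords, weight=2.0):
    """Penalty proportional to the number of violated self-avoidance
    constraints, i.e. lattice vertices occupied by more than one
    residue. Added to the energy of an infeasible conformation so
    that it remains in the population with reduced fitness."""
    collisions = len(coords) - len(set(coords))   # duplicated vertices
    return weight * collisions
```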
Human intervention in EAs: How much human intervention should be involved in assisting the algorithm? This is a question for any EA. One can choose to preset only some probability parameters and leave all other aspects of the evolving process to random decisions, or one can incorporate more domain knowledge to guide and assist the algorithm. For the PSP problem, in practice, domain knowledge is often incorporated into the algorithm to improve prediction accuracy. One way is to first predict secondary or supersecondary structures, then use the results as constraints during the EA search; e.g., rather than choosing crossover points totally at random, the EA can choose hot spots selected on the basis of preserving secondary structure. Another way is to include experimentally derived structural information, such as the existence of S-S bonds or conserved hydrophobic residues, in the prediction scheme to improve prediction quality. For example, in [5], distance constraints derived from NMR experiments were used to help a genetic algorithm calculate protein structure.
4.1.3 Discussion
In this section, we discuss some general and conceptual issues raised by the use of EC for the PSP problem.
Suitability of EA in PSP problem
Evolutionary computation, according to [33], is both an effective and a computationally efficient search strategy. It has the advantages of ease of use, general applicability, and success in finding good solutions to difficult high-dimensional problems. In particular, EAs are useful when: 1) the problem search space is large, complex, or poorly understood; 2) domain knowledge is scarce or difficult to encode to narrow the search space; 3) no mathematical analysis is available; or 4) traditional search methods fail. Except for the second case, the PSP problem falls into all of these categories. Moreover, many studies have demonstrated that, as a general search method, EAs do show superiority over other methods such as Monte Carlo search. This suggests that the PSP problem is suited to EAs. This is interesting, since an EA works at the population level, i.e., many individuals mix and interact to evolve a good individual, whereas a protein molecule folds individually at the single-molecule level, not by mixing different proteins at the population level. Reference [90] gave an explanation, suggesting an interesting view of EAs as compatible with the protein folding pathway: although EAs do not simulate the actual folding pathway of a single molecule, we can regard the many solutions in the EA system not as different molecules but as different conformations of the same molecule. Each individual solution can then be considered a point on the folding pathway of the single molecule, which examines and evolves itself using the variation and selection operators.
Adaptive and dynamic nature of EAs
Evolutionary computation is, by nature, a dynamic and adaptive process. Thus, when applying EAs to practical problems, this nature should be given due consideration, on three levels.
First, the essence of the EA's adaptive nature should be taken into consideration when modeling the problem. Initially, the GA, the most popular form of EA, was conceived by Holland as a means of studying adaptive behavior, as suggested by the title of the book collecting his early research, "Adaptation in Natural and Artificial Systems". In later studies, however, perhaps because EAs generally perform well in searching for optimal solutions, they have largely been treated as optimization methods. In fact, there are many ways to view EAs, as pointed out in [26]: not only as problem solvers, but also as a basis for competent machine learning, as creative computational models, or as a guiding philosophy. So far, EAs have been applied to the PSP problem only as an optimization search tool. Perhaps future research on the PSP problem will model it differently and combine macro-level evolution and micro-level protein folding in a creative way.
On the second level, when we consider the EA as an effective search tool for the PSP problem, we should bear in mind that EAs are adaptive and there is no single best EA across all problems [33]. The PSP problem can be formulated differently or focused on different types of proteins. Algorithm components should therefore be developed so that they are tuned to the formulation at hand, rather than simply forcing the problem into a particular version of an EA.
On the third level, setting algorithm parameters for a particular EA, it is suggested in [26] that using rigid parameters whose values do not change during the run of the EA goes against its adaptive and dynamic nature. Globally, there are two major forms of setting parameter values: parameter tuning and parameter control. Parameter tuning is the commonly practised approach in which the values of parameters (population size, mutation rate, etc.) are set before the run of the algorithm and remain fixed during the run. However, a run of an EA is an intrinsically dynamic, adaptive process. It is intuitively obvious, and has been empirically and theoretically demonstrated in [26], that different parameter values may be optimal at different stages of the evolutionary process. For instance, large mutation steps can be good in the early generations, helping the full exploration of the search space, while small mutation steps may be needed in the late generations to locate the desired global optimum. Thus we need dynamic parameter control. For the mutation example, one possible solution is to allow a range of mutation step sizes, from small to large, during the evolutionary process and let the EA control its own parameters. This leads to the idea of self-adaptation. Self-adaptation can be achieved by associating each individual with an additional vector that provides instructions on how best to mutate it; it is also natural to use two EAs: one for problem solving and another for tuning the first. But there is not much research along these lines for the PSP problem yet.
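The simplest alternative to fixed parameters is deterministic parameter control, where the schedule depends only on the generation counter. The linear schedule and the values below are our own illustration, not taken from [26]:

```python
def mutation_step(generation, max_generations, start=1.0, end=0.05):
    """Deterministic parameter control: the mutation step shrinks
    linearly from `start` (broad exploration in early generations)
    to `end` (fine-grained exploitation near the end of the run)."""
    fraction = generation / max_generations
    return start + (end - start) * fraction
```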
Variants of EAs applied to PSP
Among the variants of EAs, genetic algorithms are still the predominant form used for the PSP problem. But it was pointed out in [33] that crossover, the main variation operator in GAs, is largely ineffective for protein structure prediction, and that other variants, especially evolution strategies, which emphasize mutation, should be more extensively investigated.
In the literature, memetic algorithms have also been applied to the PSP problem. A memetic algorithm is a hybrid evolutionary approach that uses a standard EA in conjunction with local search. The additional localized searches conducted in a memetic algorithm generally result in a significant improvement in the fitness of the best solution found.
Another research direction is the multi-objective formulation of the PSP problem. Historically, ab initio prediction has been approached as a single-objective optimization problem, but recently some researchers have reformulated it as a multi-objective optimization problem. An early example is [24], in which a multi-objective evolutionary algorithm (MOfmGA) was used for the structure prediction of two small proteins (5 and 14 residues, respectively). Following this idea, Cutello investigated medium-size proteins (46-70 residues) with promising results, and further conjectured, and partially verified by experiments, that the PSP problem is better modeled as a multi-objective optimization problem [20]. Their approach considers the local interactions (bond energy) and non-local interactions (non-bond energy) among atoms to be the main forces directing the formation of the protein native state, and is based on the intuition (or fact) that the two kinds of interaction are in conflict: the typical characteristic of a multi-objective optimization problem.
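Under such a formulation, selection compares individuals by Pareto dominance over the two energy objectives rather than by a single scalar. A minimal dominance test (our own sketch, for minimization of (bond, non-bond) energy pairs) is:

```python
def dominates(a, b):
    """Pareto dominance for minimization: energy vector `a`, e.g.
    (bond energy, non-bond energy), dominates `b` when it is no
    worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))
```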
More on energy function
In ab initio structure prediction, the two key aspects of the problem, the energy function that must discriminate the native structure from the many non-native ones, and the search algorithm that must identify the conformation with the lowest energy, are both fraught with difficulties [90]. Furthermore, difficulties in each aspect reduce progress in the other. Until we have a search method that can identify the lowest-energy solutions for a given energy function, we cannot determine whether the conformation with the minimal calculated energy coincides with the native conformation. On the other hand, until we develop an optimized energy function, we cannot verify that a particular search method is capable of finding the minimum of that specific function. That is, evaluating the performance of the search tool and evaluating the performance of the associated energy function are entangled, and making a distinction between them is hard. This is a dilemma in PSP research. When discussing EAs for PSP, the same problem arises, and to make things worse, in almost all EA implementations the energy function is also used as the fitness function of the EA, making the distinction between the energy function and the search algorithm even more difficult. It was suggested in [90] that, at least for algorithmic design and analysis purposes, it is possible to detach the issues of the search from those of the energy function, by using a simple model where the optimal conformation is known through full enumeration of all conformations, or by tailoring the energy function to specifically prefer a given conformation. But there is not much research along these lines yet.
Another issue concerning energy functions is that complex energy models can be parallelized for more efficient calculation. This is often done in knowledge-based approaches to the PSP problem, as well as for EAs in ab initio prediction. A significant reduction in convergence time can be achieved either by distributing a single evolving population over a number of machines or by allowing different machines to compute independently evolving populations. Many practical EA implementations for PSP have adopted parallel computation. Conceptually, this matches the nature of evolution, because evolution itself is a parallel process.
Possible future improvements
Despite the conceptual and technical suitability of EAs for the PSP problem, their success has been moderate, and most research focuses on lattice models. What kinds of improvements might be made to EA methods to improve their performance? One obvious aspect is improving the energy function. While this is a common problem for all prediction methods, an interesting possibility to explore within the EA framework is to distinguish between the fitness function used to guide the production of the emerging solution and the energy function used to select the final structure. In this way it might be possible to emphasize different aspects of the fitness function in different stages of folding.
Another possibility, as suggested in [90], is to introduce explicit "memory" into the emerging substructures, such that substructures that have been advantageous to the structures harboring them gain some level of immunity from change. This can be achieved by biasing the selection of crossover points to respect the integrity of successful substructures, or by making mutations less likely in these regions. It seems that the PSP problem is too difficult for a naive "pure" implementation of EAs; the direction to go is to take advantage of the ability of the EA approach to incorporate various types of considerations when attacking this problem.
GAs are still the predominant EA used for the PSP problem. It was pointed out in [33] that crossover, the primary reproduction mechanism used in GAs, is largely ineffective for protein structure prediction, and it was suggested that evolution strategies and evolutionary programming, which place emphasis on mutation as a reproduction mechanism, should be explored for the PSP problem.
Finally, a long-term effort should be made to better integrate the adaptive and dynamic nature of evolutionary computing at the various levels of approaching the PSP problem: in modeling the problem, in developing algorithm components, and in setting algorithm parameters. Both conceptual models and technical implementations need to be explored.
4.2 L-system Representation of Protein Structure

As discussed before, ab initio prediction approaches to the PSP problem often use simplified lattice models to study protein structure. On 2-d or 3-d lattices, the folded structures are usually represented using a direct encoding of the coordinates of every residue on the folded chain (see 4.1.2 - Representation).
Recently, a few researchers have proposed using Lindenmayer systems (L-systems) to capture protein structures [27, 56]. After David Searls laid the groundwork for using generative grammars in biosequence analysis [78], this is a novel and interesting practice for representing folded protein structures on lattice models.
In this section, we give a short introduction to L-systems, then introduce and discuss the L-system-based encoding for lattice proteins in current research.
4.2.1 Introduction to L-systems
L-systems were developed by Aristid Lindenmayer in the late 1960s. Originally they were devised to provide a formal description of the growth patterns of simple multicellular organisms. Later, the system was extended to describe higher plants and complex branching structures.
L-systems are commonly defined as a tuple ⟨V, C, ω, P⟩, where V (variables) is a set of symbols that can be replaced; C (constants) is a set of symbols that remain fixed; ω (axiom) is a string of symbols from V + C defining the initial state of the system; and P (productions, or rewriting rules) is a set of rules defining the way variables can be replaced with combinations of constants and other variables. In addition, we use alphabet to refer to the set V + C and symbol to refer to any element of V or C.
As an example, Lindenmayer's original L-system for modelling the growth of algae is as follows. The algae consists of cells, each of which takes one of two values, a or b.

variables: a, b
constants: none
axiom: a
rules: a → ab, b → a

Successive derivations produce: a, ab, aba, abaab, abaababa, ... This pattern of growth fairly closely matched the growth patterns of the algae that Lindenmayer was studying.
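The parallel rewriting just described is easy to state in code. The following is our own sketch of a D0L derivation; symbols without a production are treated as constants and copied unchanged:

```python
def derive(axiom, rules, steps):
    """Parallel rewriting for a D0L-system: at each step, every
    symbol of the current word is replaced simultaneously according
    to `rules` (a dict mapping variable -> replacement string);
    symbols with no production are constants and pass through."""
    word = axiom
    for _ in range(steps):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word
```

For the algae system, `derive("a", {"a": "ab", "b": "a"}, 4)` yields `abaababa`.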
An L-system is context-free if each production rule has only one variable on its left-hand side. If a rule refers not only to a single variable but also to a combination of that variable and certain neighbours, the system is termed context-sensitive. An L-system is deterministic if there is exactly one production for each variable; if there are several, each chosen with a certain probability at each iteration, it is a stochastic L-system. Finally, L-systems are parametric if numerical parameters are associated with the symbols or productions. A deterministic context-free L-system is the simplest form of L-system and is popularly called a D0L-system.
Compared with traditional formal-language grammars, the major difference lies in the way production rules are applied. In formal languages, productions are applied sequentially, while in L-systems they are applied in parallel, simultaneously replacing all variables in a given word. This difference reflects the biological motivation of L-systems: productions are intended to capture cell divisions in multicellular organisms, where many divisions may occur at the same time. Another difference is that L-systems do not necessarily have non-terminals as traditional grammars do. The variables of some L-systems constitute valid words in the languages of those L-systems; in this case, although they are replaceable, the variables behave more like the terminals of traditional grammars.
4.2.2 L-system-based Encoding for Protein Structure
L-systems have been investigated for encoding lattice protein conformations only very recently [27, 56]. In this research, evolutionary algorithms are used as the inference procedure for discovering L-systems that represent target protein structures on simple lattice models. At this stage, the problem being solved is essentially: given a target structure expressed in "internal coordinates" (see Figure 4.1), find an L-system that, once evaluated, would produce the original target structure or a close match. The authors used EAs to search the space of L-systems and produced promising results for short sequences. However, there is still a long way to go before L-system-based structure representation can be used for the PSP problem or its inverse problem. We discuss this point in more detail in the discussion section.
Why a grammatical encoding?
As discussed in section 4.1.2 - Representation, protein structures on lattice models are usually represented by a direct encoding of the folded chain. One commonly used direct encoding is "internal coordinates", which represents the structure by a list of moves on the lattice. The moves can be absolute or relative; under the relative scheme, each move is specified relative to the direction of the previous move. On a 2-d square lattice, e.g., a structure S is encoded as a string of n − 1 symbols from {Forward, turnRight, turnLeft}, where n is the number of residues. See Figure 4.2(a) for an encoding example.
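The relative scheme can be made concrete with a small decoder. This is our own sketch; it assumes the first residue sits at the origin with the initial heading along the positive x-axis, and uses F/L/R for the three moves:

```python
def decode_relative(moves):
    """Decode a relative internal-coordinates string over {F, L, R}
    into 2-d square-lattice vertices. Each symbol optionally turns
    the current heading, then advances one vertex."""
    x, y = 0, 0
    dx, dy = 1, 0                       # initial heading: +x
    coords = [(x, y)]
    for move in moves:
        if move == 'L':                 # rotate heading 90 deg counter-clockwise
            dx, dy = -dy, dx
        elif move == 'R':               # rotate heading 90 deg clockwise
            dx, dy = dy, -dx
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords
```

A conformation is a valid self-avoiding walk exactly when the decoded vertices are all distinct.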
However, the length of the encoded string is essentially the same as that of the protein sequence, which makes search techniques using this type of encoding hard to scale. An L-system is a generative, rule-based scheme that specifies how to construct the structure rather than encoding the structure directly, and can thus achieve greater scalability. But this raises the question: are lattice protein structures suitable for grammatical encoding? The researchers provide their reasoning in [27], which can be summarized as: proteins exhibit regularity and repeated substructures, which is consistent with the recursive nature of L-systems, where rewriting rules lead to modular, self-similar structures. However, the researchers did not investigate to what degree proteins exhibit regularity, or whether the regularity shown in protein structures is sufficient for them to be modelled by L-systems in general. We comment on this point further in the Discussion (Section 4.2.3). Another advantage of grammatical encoding is that it is more compact and parts of the encoding are more easily reused. Specifically for evolutionary algorithms, a grammatical encoding of individuals is more suitable for crossover and for transferring building blocks between individuals.
L-system-based encoding
In this section, we briefly introduce how lattice protein structures are encoded by L-systems, based on the methods discussed in [27] and [56].

The L-system's alphabet depends on the lattice and coordinate system used. For the 2-d square lattice and relative internal coordinates, the D0L-system specification chosen in [27] is: variable set V = {0, 1, 2, ..., m − 1}, with each numeric element representing one rewriting rule; constant set C = {F, L, R}, representing the three moves Forward, Left-turn and Right-turn in the relative coordinates; the axiom ω can be any string of symbols from V + C. The number of production rules equals the size of the variable set, and each rule takes the form n → w, where n ∈ V and w ∈ (V + C)+, the set of all nonempty words over V + C.

An example L-system encoding the short lattice protein structure

RFRRLLRLRRFRLLRRFR

is as follows, with its derivation process shown in Figure 4.4.
axiom = 31
rules = {0 → 3LL2; 1 → R0RL; 2 → RRF; 3 → RFR1}

Figure 4.4: A derivation process example. Starting from the axiom, the rules are applied in parallel for several steps; a post-processing pass then removes the remaining variables, yielding the move string RFRRLLRLRRFRLLRRFR.
The maximum lengths of the axiom and rules, as well as the number of rules, are parameters of the inference algorithm (in [27], an EA) and depend on the length of the protein.
In further studies [56], knowledge of secondary structures is incorporated into the L-system-based encoding in the form of predesigned production rules. In the HP 2-d square model, a right-oriented α-helix is encoded as RRLL (represented by the variable A); a left-oriented α-helix is encoded as LLRR (represented by the variable H); a β-sheet is represented as a string of Fs (at most four). Moreover, these L-systems are parametric: numerical parameters are associated with the symbols. For example, if a structure segment in the relative-coordinates encoding is FFFF, then in the parametric L-system encoding it can be written as F4. As another instance, the 2-d lattice folding RFRRLLRLRRFRLLRRFR in relative coordinates can be rewritten in the parametric L-system as RFARLR2FRHFR, where R2 denotes RR. Thus the parametric L-system has only five symbols in its alphabet, {F, R, L, A, H}, and its rules are fixed and implicit compared with the D0L-systems discussed above.
Evolving L-system-encoded structures

The ability of L-system-based encoding to capture protein native conformations in the 2-d HP lattice model can be tested using EAs. Given a target structure in direct encoding, the EA explores the space of L-systems and evolves a set of rules that, once derived, produces a conformation closely matching the target.
The following general description of the EA used to test L-systems is based on [27]; the approach is close to grammatical evolution. Each individual L-system in the population is determined by its axiom and rewriting rules. The maximum number of rules and the string lengths for the axiom and rules are preset as parameters. For initialization, both the axiom and the rules of an individual L-system are randomly generated strings of the maximum lengths, where each symbol is selected with uniform probability from the alphabet. The recombination operator resembles uniform crossover, where the rules are interchanged. During recombination, if a selected rule in an offspring refers to a variable symbol (rule) not defined in the offspring L-system, a repair operator is used to change that variable. The mutation operators are addition, deletion or modification of a single symbol in either the axiom or the rewriting rules of an individual. For selection, linear ranking selection and elitism can be used, and a mate-selection strategy that chooses less similar parents can increase population diversity. To evaluate an individual's fitness, its L-system is derived and the Hamming distance is computed between the derived structure and the target structure. During the evolutionary process, L-systems that produce illegal (non-self-avoiding) lattice conformations are allowed, but will not be accepted as final solutions.
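The evolutionary loop just described can be sketched as follows. This is a minimal, mutation-only sketch: the recombination and repair operators of [27], as well as the self-avoidance check, are omitted for brevity, and all names and parameter values are illustrative:

```python
# Minimal EA sketch for evolving an L-system whose derived string
# matches a target folding in direct encoding. An individual is an
# axiom plus a fixed number of rewriting rules over {F, R, L, 0..3}.
import random

ALPHABET = "FRL0123"
TARGET = "RFRRLLRLRRFRLLRRFR"   # target structure in direct encoding

def derive(axiom, rules, steps=3):
    word = axiom
    for _ in range(steps):
        word = "".join(rules.get(s, s) for s in word)
    return "".join(s for s in word if s in "FRL")   # post-processing

def hamming(a, b):
    """Position-wise mismatches, counting length difference as errors."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def random_individual(n_rules=4, max_len=4):
    axiom = "".join(random.choice(ALPHABET) for _ in range(2))
    rules = {str(v): "".join(random.choice(ALPHABET)
                             for _ in range(random.randint(1, max_len)))
             for v in range(n_rules)}
    return axiom, rules

def mutate(ind):
    axiom, rules = ind
    key = random.choice(list(rules))
    body = list(rules[key])
    body[random.randrange(len(body))] = random.choice(ALPHABET)
    return axiom, {**rules, key: "".join(body)}

def evolve(pop_size=30, generations=50):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: hamming(derive(*ind), TARGET))
        elite = pop[: pop_size // 2]          # truncation + elitism
        pop = elite + [mutate(random.choice(elite)) for _ in elite]
    return min(pop, key=lambda ind: hamming(derive(*ind), TARGET))
```

The Hamming-distance fitness is the one described above; a full implementation would additionally reject non-self-avoiding conformations as final solutions.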
4.2.3 Discussion
L-systems are recursive in nature, which makes them very suitable for describing fractal-like structures. Are L-systems suitable for describing protein structures? The preliminary research [56] seems to give a positive answer by asserting that "Results confirmed the suitability of the proposed (L-systems) representation". However, experiments have also shown that some protein instances are more difficult than others to evolve an adequate L-system for [27], and that instances with high frequencies of α-helices and β-sheets have a clear advantage in their suitability to be encoded by L-systems [56]. These results show that the suitability of the new encoding scheme depends heavily on the occurrence of sub-structures and their regularity.
Although it is known that protein structures indeed exhibit some regularity and repeated
sub-structures which can be captured by L-systems, to what degree do protein structures
show regularity? And generally, is the level of modularity and repetition within protein
structures high enough for L-systems to be suitable to encode them? Current research has
not explicitly addressed these questions yet.
It is also worth noting that for 2-d lattice proteins, the proposed L-system-based encoding is not independent of direct encodings. This dependence is twofold: the alphabet of the L-systems includes all the symbols used in the direct encoding, and an L-system needs to be derived to its direct-encoding form before the structure it encodes can be evaluated. Also note that a given target structure may have various direct-encoding representations, and that various distinct L-systems could produce the same direct-encoding word [27]. Therefore, if L-systems are actually used in the PSP problem under this scheme, the advantages of grammatical encoding have to be weighed against the cost of adding a layer of complication to the encoding system.
L-system grammars have been used in many applications of evolutionary algorithms to problems in biology, engineering, and computer graphics. One example of L-systems as a powerful encoding is investigated in [46], where they represent the blood circulation of the human retina. Using L-systems to encode lattice protein conformations, as reviewed here, is very recent research: it is limited to short proteins on the 2-d square lattice model and has not been integrated into any approach to the PSP problem. However, it is a very interesting protein conformation representation scheme, and more research in this line is needed to investigate its possible application to PSP and the inverse PSP problem.
4.3 Artificial Neural Networks
As introduced in Chapter 3, protein structural feature prediction is an important category of PSP. Examples of structural features include secondary structure, residue solvent accessibility, and trans-membrane strands and helices. Although these features do not represent 3-d structure, accurately predicting them is an important step toward 3-d prediction.
²The experiment analysis in [27] shows that some sub-strings that appear several times in the folded chain (e.g. RFR) are also present as part of the evolved rules. This supports the idea that L-systems capture the naturally occurring sub-structures in lattice proteins.
³Only applying to absolute internal coordinates.
For instance, predicted secondary structures can be regarded as rigid bodies, simplifying molecular dynamics simulations; or, in ab initio prediction approaches, these predicted features represent additional information that can help guide the conformational search.
The prediction of structural features is often modeled as inferring a mapping from input amino acid sequences to some kind of output sequences. The output sequence has the same length as the input sequence, and each symbol in the output sequence describes the structural property of the residue at the same position in the input sequence. This way of modeling the problem enables the application of automatic learning methods, such as artificial neural networks (ANNs). These networks are capable of mapping between protein sequence and structure, of classifying types of structures, and of identifying similar structural features in a database. Neural networks have the advantage of making decisions from a large number of competing variables without explicit understanding of the problem. This is particularly important for the PSP problem, where the principles governing protein structure formation are complex and not yet fully understood. So far, neural network models are among the most successful approaches to predicting protein structural features, especially in secondary structure prediction.
In this section, we first give an introduction to ANNs and a basic ANN scheme for predicting structural features in general. We then focus on secondary structure prediction, to illustrate and review how ANNs are applied in this important category of prediction, and briefly introduce a few other types of structural predictions made by ANNs. We conclude by discussing some important issues raised by using ANNs in the PSP problem.
4.3.1 Introduction to ANNs
Artificial neural networks are inspired by biological neural networks, which consist of billions of biological neurons. Neurons are the basic computing units of the brain. Each neuron gathers, processes and evaluates its input signals; if the evaluated result exceeds some threshold, an action potential fires and propagates to become the neuron's output signal. Before this output signal becomes the input to the next neuron, it undergoes some processing that determines how the signal is transmitted from the output neuron to the next input neuron.
This rather simplified model of the biological neuron serves as the basis of the artificial neurons (nodes) from which ANNs are constructed. A simple scheme of a generic ANN node is shown as follows.
Figure 4.5: A generic scheme of an artificial neuron (weighted inputs pass through a threshold/activation function to produce the outputs)
In this computation scheme, a weight controls how much influence a previous node has on this node. Suppose there are n previous nodes connecting to this node; then a vector x = (x1, x2, ..., xn) represents the n inputs from the corresponding n nodes, and w = (w1, w2, ..., wn) is the corresponding weight vector. Inside the node, a weighted linear combination of all the inputs, w · x, is calculated; a threshold may then be subtracted, and the result is passed through an activation function to produce the output to the nodes connected to it. Activation functions can be of different types; the most commonly used is the sigmoid function F(x) = 1/(1 + e^(−x)).
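The computation of a single node can be sketched directly from this description. The function name and default threshold are illustrative:

```python
# A single artificial neuron as described above: weighted sum of the
# inputs, minus a threshold (bias), passed through a sigmoid activation.
import math

def neuron(x, w, threshold=0.0):
    """x and w are equal-length input and weight vectors."""
    activation = sum(xi * wi for xi, wi in zip(x, w)) - threshold
    return 1.0 / (1.0 + math.exp(-activation))      # sigmoid F
```

With a zero weighted sum the sigmoid returns 0.5; large positive or negative sums saturate toward 1 or 0.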
ANNs can take various architectures. Normally, nodes are arranged into layers. The
inter-layer connections can be divided into two kinds: feed-forward and feed-back. A feed-
forward network has only unidirectional connections and signals propagate only forward
from the input layer to the output layer. In feed-back networks, a layer can be connected
to the next layer or any of the previous layers, thus signals can travel in both directions,
causing loops in the network. Feed-back networks are dynamic and very powerful, but can
get very complicated.
While the connections are hardwired, the weights between nodes can be adjusted by the network during the training process. The idea of network training is to find, or learn, the weights that fit the training data, so that the learned network can be applied to new data. There are two learning paradigms: supervised and unsupervised. Supervised learning is the method commonly used in structural feature prediction.
In supervised learning, the ANN is repeatedly presented with a set of training samples with known results; its task is to adjust the weights based on these samples. First, the network takes the input values of one sample and computes an output using initially random weights; then the observed output value is compared with the known value for that sample, and an error adjustment is back-propagated (see the following subsection) to the weights, so that the next time the sample is presented, the observed output is closer to the desired output. Repeating this for all samples in the training set constitutes one epoch. The process is then repeated for a second epoch, a third epoch, and so on, until the network manages to reduce the output error for all samples to an acceptably low value. At this point the training is stopped and all weights are settled; the trained ANN can be applied to new data or, if needed, a test phase can begin to determine the validity, or prediction accuracy, of the network.
In unsupervised learning, the network is not presented with the desired output. It must
learn the weights without being able to measure its result and minimize its error. In such an
unsupervised scheme, nodes compete for the opportunity to update their weights, resulting
in self-organization. Generally, unsupervised ANNs are used for finding interesting clusters
within the data.
Error minimization
In supervised learning, the weights have to be adjusted so that the error between the desired output and the actual output is reduced. The best-known algorithm for this weight optimization is back-propagation. For node i, the difference between the observed output o_i and the desired output d_i is called the error,

e_i = d_i − o_i.

The sum of squared errors is then

E = (1/2) Σ_i (d_i − o_i)²,

where i runs over all output nodes. By calculating the gradient of the error function, the adjustment of a weight w_ij is

Δw_ij = −η ∂E/∂w_ij,

where η is the learning rate. Then each weight is updated as

w_ij ← w_ij + Δw_ij.
During this process, the weights are adjusted to minimize the errors. One way of conceiving
this error minimization process is to consider each individual weight as a dimension in space.
If we could plot the value of the error for each combination of weights, we would obtain an "error surface" in multidimensional space. On one hand, the objective of network training is to find the lowest point on this error surface; as with searching for the minimum free energy on the energy landscape in ab initio prediction approaches, no algorithm can guarantee to locate the global minimum. On the other hand, network training should also avoid over-training: if an ANN is trained for too many cycles to minimize the errors, it will overfit the training data, leading to larger errors on test data.
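The gradient-descent weight update described above can be sketched for the simplest case of a single sigmoid output node and no hidden layer (the delta rule). The function names, the bias-as-first-input convention, and the learning rate are illustrative assumptions:

```python
# Sketch of the gradient-descent weight update for one sigmoid output
# node. For E = (d - o)^2 / 2, the chain rule gives the update
# w_i <- w_i + eta * (d - o) * o * (1 - o) * x_i.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_epoch(samples, w, eta=0.5):
    """One pass over (inputs, desired) pairs, updating weights in place.
    By convention here, the first input of each sample is a constant 1
    so that w[0] acts as the (negative) threshold."""
    for x, d in samples:
        o = sigmoid(sum(xi * wi for xi, wi in zip(x, w)))
        delta = (d - o) * o * (1.0 - o)      # -dE/da
        for i, xi in enumerate(x):
            w[i] += eta * delta * xi
    return w
```

Repeating `train_epoch` over many epochs drives the outputs toward the desired values for linearly separable training sets, illustrating both the epoch loop and the error-surface descent described above.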
4.3.2 A Basic ANN Scheme for Predicting Structural Features
To apply ANNs to structural feature prediction, a common approach is a multi-layer (often three-layer) feed-forward network. The following figure provides a basic scheme of these structural predictors.
Figure 4.6: A basic scheme of ANN predictors adopted from [59]
As the figure shows, the network is moved along the input sequence and computes an output vector encoding the structural class of the amino acid at the current position (Y in the figure). As it is generally assumed that the structural properties of a residue are greatly affected by its local context (neighboring residues), the input is a window of a certain number of residues centered at the currently inspected position. The architectural parameters of the network include
the number of output nodes, the number of hidden layers and nodes, the input encoding and the window size. The number of output nodes depends on the specific prediction task; for secondary structure prediction, e.g., there are often three output nodes, representing the three secondary structures: helix, strand and coil. The input encoding refers to the encoding of each input amino acid. There are two main types: orthogonal and profile-based. Orthogonal encoding takes each amino acid in the sequence as it is and usually encodes it as a binary string. Since there are 20 amino acids, each one is represented by a 20-bit binary string consisting of 19 0s and one 1, where the position of the 1 identifies the amino acid. Profile-based encoding uses the 20-dimensional profile extracted from the PSSM of a multiple alignment; more about this type of input and its advantages is discussed in the next section (for PSSMs and multiple sequence alignment, refer to Section 1.2). The input window size controls how much local context information to consider in the prediction; it usually takes an odd length so that the amino acid at the center of the window is the prediction target. Ideally, one might expect that the larger the window, the more information given to the predictor, and hence the better the performance. Unfortunately, increasing the window size also increases the possible noise, and it is observed that beyond some threshold size the signal-to-noise ratio decreases. Typical window sizes range from 9 to 25 residues [92].
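The orthogonal encoding with a sliding window can be sketched as follows. The function names and the zero-vector treatment of positions past the sequence ends are illustrative choices:

```python
# Sketch of the sliding-window orthogonal (one-hot) input encoding
# described above: each residue in an odd-sized window becomes a
# 20-bit vector, and the window is centred on the prediction target.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def one_hot(residue):
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def window_input(sequence, centre, window=13):
    """Concatenated one-hot vectors for a window centred at `centre`.
    Positions falling outside the sequence are encoded as all zeros."""
    half = window // 2
    vec = []
    for pos in range(centre - half, centre + half + 1):
        if 0 <= pos < len(sequence):
            vec.extend(one_hot(sequence[pos]))
        else:
            vec.extend([0] * len(AMINO_ACIDS))
    return vec
```

A window of size 13 thus yields a 260-dimensional input vector, which matches the scale of input layers discussed later in this chapter.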
4.3.3 Secondary Structure Prediction
The general hypothesis adopted when attempting to predict secondary structure (SS) is, firstly, that an amino acid intrinsically has certain conformational preferences due to its chemical properties; secondly, that these preferences may be modulated by the locally surrounding amino acids; and thirdly, that long-range interactions between amino acids may also play a role in forming SS. Various approaches focusing on different factors have been designed to predict an amino acid's secondary structure given the sequence context within which it is placed.
Before ANNs were first applied to SS prediction in [67], prediction methods mainly used statistical information, as in [15], or physico-chemical properties of amino acids, as in [66], to investigate amino acids' conformational preferences. These methods make predictions based only on information coming from a single residue, and the average accuracy achieved was limited to 60%. Then came many years of fruitful research on ANN-based approaches, which take the local context of each amino acid into account and have achieved 80% with the help of evolutionary profiles. While ANN-based research is still ongoing, other techniques, including Hidden Markov Models [50] and Support Vector Machines [41], have recently been applied to SS prediction, but they have not yet out-performed ANN-based methods in terms of prediction accuracy.
In the following subsections, we first introduce performance measures commonly used in SS prediction, then review the different ANN-based methods applied to this problem. These ANNs are categorized into four groups: feed-forward networks based on amino acid local interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers.
Performance measures and testing
The performance of prediction methods can be evaluated in terms of four measures: sensitivity, specificity, Matthews' correlation coefficient and the Segment Overlap score.
For the overall sensitivity measure, the most commonly used is the three-state per-residue accuracy Q3. It is defined as the percentage of correctly predicted residues out of the total number of residues, counting all three secondary conformational states: helix, strand and coil. This measure can also be used for a single conformational state, giving three other forms: Q_helix, Q_strand and Q_coil, respectively the percentage of correctly predicted helix, strand and coil residues. Note that this accuracy measure does not convey many useful types of information: e.g., it does not say where the errors are, or in what way the prediction failed. Nevertheless, it is commonly used to compare the performance of SS predictors.
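These definitions translate directly into code. The state symbols H, E and C and the function names are illustrative conventions:

```python
# The three-state per-residue accuracy Q3 and its single-state
# variants, as defined above, computed from equal-length predicted
# and observed strings over the states H (helix), E (strand), C (coil).
def q3(predicted, observed):
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def q_state(predicted, observed, state):
    """Percentage of residues observed in `state` predicted correctly."""
    in_state = [(p, o) for p, o in zip(predicted, observed) if o == state]
    return 100.0 * sum(p == o for p, o in in_state) / len(in_state)
```

For example, the prediction HHEC against the observed HHCC scores Q3 = 75%, while Q_coil = 50%, illustrating how the per-state measures localize the errors that Q3 alone hides.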
The Q index is based on individual residues: the measure of the prediction of one residue is relatively independent of that of its neighbors. But secondary structure is composed of a segment, or a collection of segments, of consecutive residues. To reflect this nature of protein structure, measures should concentrate on how well entire secondary structure elements are predicted rather than individual residues. Thus, the SOV (Segment Overlap) measure was proposed by Rost et al. in [71]; on the web site [101], this measure was modified and fully described.
Another useful measure of prediction accuracy for each of the three types of secondary structure can be calculated using Matthews' correlation coefficient [67]. For the α-helix, e.g., it is:

Coefficient = (pn − uo) / sqrt((p + u)(p + o)(n + u)(n + o))

with p being the number of residues which are true positives (correctly positively predicted), n the number of true negatives, o the number of false positives, and u the number of false negatives. The correlation coefficients range from +1 (totally correlated) to −1 (totally anti-correlated), and the values for the three types of secondary structure can be combined into a single figure by calculating their geometric mean.
Moreover, systematic testing of performance is needed, often done by cross-validation. In k-fold cross-validation, the original samples are partitioned into k subsets. Of the k subsets, one is retained as the validation set for testing the model, and the remaining k − 1 subsets are used as training samples. The cross-validation process is repeated k times, with each of the k subsets used exactly once as the validation data. The error on the cross-validation set can then be used to stop the training when it begins to increase. What is a good value for k? According to [76], the exact value of k is not important provided that the test set is representative and comprehensive and the cross-validation results are not misused to again change parameters. The requirements for the cross-validation process are also addressed in [76].
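The k-fold partitioning can be sketched as follows; the round-robin assignment of samples to folds is an illustrative choice (stratified or random assignment is also common):

```python
# Sketch of k-fold cross-validation as described above: partition the
# samples into k subsets, then rotate which subset is held out for
# validation while the other k-1 are used for training.
def k_fold_splits(samples, k):
    folds = [samples[i::k] for i in range(k)]      # k disjoint subsets
    for i in range(k):
        validation = folds[i]
        training = [s for j, f in enumerate(folds) if j != i for s in f]
        yield training, validation
```

Each sample appears in exactly one validation set across the k splits, which is the property the cross-validation estimate relies on.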
ANNs based on local interactions
The early ANNs are basically feed-forward networks taking into account local interactions of amino acids by means of a sliding input window with orthogonal encoding. The pioneering work was [67], in which 62.7% Q3 accuracy was reported. Their network architecture is very similar to the template given in the previous section: three fully connected layers, with the output layer consisting of three sigmoidal units representing the three SS classes. The input amino acids are encoded by 21-bit binary strings (the 21st bit specifying a gap). This sparse encoding increases the number of network parameters needed, but it has the advantage of not imposing an artificial ordering on the input data. Other network parameters, including the number of input and hidden nodes and the window size, were explored thoroughly in their work. One feasible arrangement could be: 357 input nodes, 5 hidden nodes, and 3 output nodes, resulting in 1,808 weights. The 3 output nodes correspond to the 3 types of secondary structure; the 357 input nodes allow for a segment of 17 amino acids, i.e., an input window size of 17.
The number of connections, and thus the number of weights, depends mainly on the number of hidden nodes. One interesting point noted in [67] is that the performance of the network is almost independent of the number of hidden nodes: they experimented with different numbers of hidden nodes, from 0 to 40, and the test results do not show much difference in performance.
Although the accuracy reported in [67] was not much of an improvement over other prediction methods, this early work led to subsequent years of successful research on ANNs in SS prediction.
This type of ANN, based on single sequences and local windows, seemed to achieve a prediction accuracy of at most 65-69%. Increasing the size of the window does not lead to improvements, due to the overfitting problem associated with large networks. However, some improvement was obtained by cascading the previous architecture with a second network that cleans up the output of the lower network; more on this is introduced in the subsection 'ANNs as filters and combining predictors' below.
Besides general prediction accuracy, another major difficulty of ANNs based on a window of local context is in predicting β-strands, because β-strands are determined by comparatively long-range interactions. This has been taken to suggest that about 65% of secondary structure depends on local interactions.
ANNs based on evolutionary information
The next generation of ANNs for SS prediction considers not only the information contained in the local context of the input sequence, but also information coming from homologous sequences. The rationale behind this approach is that structural features, including secondary structures, within a family of evolutionarily related proteins are more conserved than the sequences themselves. This information is processed by first doing a PSI-BLAST⁴ search for homologous sequences in databases and computing a multiple alignment of them, then extracting a matrix of profiles, the PSSM, indicating the frequency of each amino acid at each position. Each residue is thus encoded by the matrix column at the corresponding position, a vector of 20 real-number frequencies.
PHD [70, 71, 73] was one of the first ANN methods using profile-based inputs, going beyond 70% in accuracy; its researchers suggested at the same time that the power of neural networks should be fully exploited for the PSP problem. The PHD system is composed of cascading networks. The first one is that of Figure 4.6. A second one
⁴PSI-BLAST is a web-based search tool for identifying biologically relevant sequence similarities in databases. Other local alignment algorithms would also do for this task.
takes as input a window sliding over the previous outputs and refines the output of the first network. A final stage takes a jury decision, averaging the outputs from independently trained models. Although a number of techniques, including early stopping and ensembles of different networks, are used, most of the improvements achieved by PHD seem to result from the use of evolutionary profiles [73]. In [12], it was claimed that the most accurate SS prediction methods would be found using ANNs, and a system involving two neural networks was developed that achieves an accuracy of 75%. Another example of an evolutionary profile-based ANN method is PSI-PRED [43], which uses two neural networks to analyze profiles. At present, almost all profile-based ANN predictors can achieve an accuracy of about 76-78%.
Prediction using recurrent networks
Human brains are recurrent neural nets: networks of neurons with feedback connections. Recurrent networks are considered computationally more powerful than feed-forward networks. For SS prediction, although the formation of SS is mainly driven by local interactions of residues, which justifies the success of feed-forward networks with evolutionary profiles as inputs, many researchers suggest that possible long-range interactions between different regions of a sequence should also be taken into account to further improve prediction accuracy. Thus there has recently been research into recurrent architectures applied to the PSP problem.
Recurrent networks permit the state of the hidden (or output) units at the previous time step to be part of the input at the next time step, as shown in Figure 4.7. This provides the network with some memory of previous inputs, which can be used when processing current inputs. Recurrent networks are useful for modeling time-series data and the acquisition of grammar. One feature shared by protein structure and sentence structure is the inherently sequential nature of the structures: as sentence structure is based on sequential characters, protein structure is based on a primary sequence that begins at the N-terminal and ends at the C-terminal. The other common feature is the possible non-sequential long-distance dependencies existing in the structure; one example in protein structures is the formation of a β-sheet by several strands located far apart along the sequence. Feed-forward networks can hardly capture this long-range dependency, which is why the prediction accuracy for β-sheets using feed-forward networks is generally lower than that for helices.
Figure 4.7: Sketch of a recurrent network (the hidden units at time t−1 feed back, together with the inputs, into the hidden units at time t)
In [4], a bidirectional recurrent neural network (BRNN) architecture was proposed, which was further refined in [64] to predict protein secondary structure at an accuracy of about 76%. In this architecture, the prediction for the residue at position t is determined by three components. First, there is a central component associated with the local window at position t, as in standard feed-forward networks for SS prediction. The two other components are two similar recurrent networks associated with the central component. These two recurrent networks act as two "wheels" rolling along the protein chain, one from the N-terminal and the other from the C-terminal, exploiting upstream and downstream context in the sequence all the way to the point of prediction. This bidirectional recurrent network is trained with a generalized back-propagation algorithm. But because the algorithm is essentially gradient descent, the error propagation in both the forward and backward chains is subject to exponential decay, so the learning of remote information is not efficient. For SS prediction, the BRNN can use information within about ±15 residues around the residue of interest, and can hardly discover relevant information contained in more distant portions. Nevertheless, the researchers in [64] claim to have developed new algorithmic ideas that begin to address the problem of long-range dependencies in SS prediction.
There is more research based on the BRNN. In [13], segmented-memory recurrent networks were proposed to replace the standard recurrent networks in the BRNN architecture. The idea of segmented memory is based on the observation that when trying to memorize a long sequence, humans tend to break it into smaller segments first and then cascade them to form the final sequence. It is thus believed that RNNs are more capable of capturing long-term dependencies if they have segmented memory and imitate this way of human memorization. Experiments applying this idea to refine the BRNN for SS prediction indicate a moderate improvement in prediction accuracy [13].
In another research paper [9], bidirectional recurrent networks are used as filtering networks to correct the output coming from the first-stage prediction, by trying to capture valid segments of SS. In this approach, an early-stopping mechanism was used to control overfitting during the training process. The experiments showed that this approach reached good accuracy and a very high value of SOV.
Despite some good results, recurrent networks have not been fully explored for the PSP problem, because most research is based on the bidirectional recurrent architecture proposed in [4], and other network architectures or implementations for the PSP problem can hardly be found in the literature.
ANNs as filters and combining predictors
Besides serving as direct SS prediction methods, ANNs are also used as filters and for combining results from different prediction methods into a consensus meta-predictor.
Filtering examines the final predictions to make them more realistic by removing bad predictions. It is now standard in secondary structure prediction and is used in many successful methods. There are various filtering techniques, such as using if-then rewrite rules found through the machine learning method CART [99]. One of the rules specifies:

[!a, *, *, a, c] → c

with a = α-helix, c = coil, * = any, ! = not. This rule says that if the pattern on the left is matched in a prediction, then the fourth secondary structure symbol on the left is rewritten as the secondary structure on the right of the rule. Thus a predicted SS segment [b, b, b, a, c], after filtering, will be rewritten as [b, b, b, c, c].
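A rewrite-rule filter of this kind can be sketched as follows, using '!' for negation and '*' for a wildcard; the function names and the (pattern, rewrite position, new state) rule representation are illustrative:

```python
# Sketch of a rewrite-rule filter. A pattern is a list of tests over a
# window of predicted states: a literal state, '*' for any state, or
# '!x' for "not x". Where the pattern matches, the state at the
# rewrite position within the window is replaced.
def matches(pattern, window):
    for test, state in zip(pattern, window):
        if test == "*":
            continue
        if test.startswith("!"):
            if state == test[1:]:
                return False
        elif state != test:
            return False
    return True

def apply_rule(prediction, pattern, rewrite_pos, new_state):
    states = list(prediction)
    for i in range(len(states) - len(pattern) + 1):
        if matches(pattern, states[i:i + len(pattern)]):
            states[i + rewrite_pos] = new_state
    return states

# The example rule rewrites the fourth position of the window to coil:
RULE = (["!a", "*", "*", "a", "c"], 3, "c")
```

Applying `RULE` to the predicted segment [b, b, b, a, c] yields [b, b, b, c, c], reproducing the worked example above.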
The more widely used filtering method in SS prediction is to use ANNs. As early as [67], a second, structure-structure network was used to filter the outputs from the first, sequence-structure network. The inputs to the second network were a window of vectors produced by the first network, each vector being the frequencies of the three types of SS at a residue position:

Inputs: ... (0.6, 0.1, 0.4) (0.8, 0.2, 0.2) (0.5, 0.6, 0.2) ...

The structure-structure network has only three inputs per residue, which allows a much larger window size for the same number of weights as a sequence-structure network, which has to admit 20 inputs per residue. In [67], a 2% improvement in prediction accuracy was reported from using a filtering network. Adding a filtering network has now become a common approach, believed to improve both Q3 and SOV; the best performance so far achieves an accuracy of 78% and an SOV of 73.5% [9]. Generally, the filtering networks are feed-forward, but in [9] the filtering network used was a bidirectional recurrent network (BRNN). This filtering BRNN has a much simpler architecture than one based on a BRNN with profiles as input, yet when tested on the predictions of both ANN and SVM predictors, its performance on the Q3 and SOV indices is equivalent to the latter [9].
ANNs are also useful for combining predictions from several, or many, networks. In the PHD method [70], for example, a third-level network combined the predictions from 10 separate neural network systems that vary in training data and encoding schemes. The output is the prediction resulting from the arithmetic average of the 10 ensemble predictions. The network also outputs a reliability index that indicates how many of the independently trained networks agree on the prediction. A 2% improvement in predictive performance was reported. The PSI-PRED method likewise averages the output from up to four separate neural networks to increase prediction accuracy. The study of neural network ensembles, which is closely linked to the development of Bayesian neural networks, is a potential area that may further improve SS prediction accuracy.
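The jury decision and reliability index can be sketched as follows; the function name and the agreement-count form of the reliability index are illustrative simplifications of the schemes used in PHD and PSI-PRED:

```python
# Sketch of jury-decision combining: average the three-state output
# vectors of several independently trained networks, predict the state
# with the highest mean score, and report how many ensemble members
# agree with that prediction as a reliability index.
STATES = ("H", "E", "C")

def combine(member_outputs):
    """member_outputs: one (h, e, c) score triple per network."""
    n = len(member_outputs)
    mean = [sum(out[i] for out in member_outputs) / n for i in range(3)]
    winner = max(range(3), key=lambda i: mean[i])
    agree = sum(max(range(3), key=lambda i: out[i]) == winner
                for out in member_outputs)
    return STATES[winner], agree
```

A high agreement count flags positions where the ensemble members concur, which is the information the PHD reliability index conveys to the user.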
Other issues on SS prediction
Limits of accuracy: Currently, the best accuracy of secondary structure prediction is close to 80% [76]. It is arguable whether this accuracy will ever be significantly improved. There are probably three reasons for this doubt: 1) given the 3-d structure of a protein, there is no complete agreement on how to assign SS to each amino acid, especially for the amino acids located at the beginning or the end of a SS element. This is largely because secondary structure does not represent a clear-cut category of structure in nature; rather, it is a useful piece of terminology; 2) some regions of SS are not solely determined
by the local sequence, but may also be influenced by long-range interactions. Thus, without a full understanding of tertiary structure, secondary structure cannot be expected to be accurately predicted. This is why some researchers use tertiary structure information in constructing ANNs to predict secondary structure and gain some improvement in accuracy, e.g. [52]; but this approach reverses the objective of SS prediction; 3) usually in SS prediction, 3 classes of secondary structure are adopted, but the secondary structure database, the DSSP, describes 8 structure classes. A mapping from 8 to 3 classes is then needed to reduce the feature space and enable efficient computation. However, by imposing this coarser set of classes, we may impose a limit on the accuracy of SS prediction.
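One commonly used 8-to-3 reduction is sketched below. Note that studies differ in exactly how the borderline DSSP classes are assigned; the variant shown here, in which the rarer helices join H and the isolated bridge joins E, is typical but not universal.

```python
# A common mapping from the eight DSSP classes to the three-state
# alphabet. Conventions vary between studies; this is the variant in
# which G and I are merged into H, and B into E.

DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10, and pi helices
    "E": "E", "B": "E",             # extended strand, isolated bridge
    "T": "C", "S": "C", "-": "C",   # turn, bend, and other -> coil
}

def to_three_state(dssp_string):
    """Reduce a DSSP assignment string to the 3-class alphabet."""
    return "".join(DSSP_TO_3.get(c, "C") for c in dssp_string)
```

Because the prediction targets are produced by this lossy reduction, two methods trained on differently reduced labels are not strictly comparable, which is part of the accuracy-limit argument above.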
From secondary to tertiary structures: Suppose we can accurately predict secondary
structure of a protein, how would it help in constructing the protein's tertiary structure?
This problem is not trivial because the SS elements do not uniquely define the 3-d structure; other information, such as their relative distances, is needed. There are methods that attempt to derive distance constraints between amino acids on the basis of a multiple sequence alignment of proteins of the same family, like the one discussed in Section 4.3.4, but these methods are not reasonably effective yet. Even in NMR experiments, not all atomic distances can be measured, and the uncertainty of the measured values can be rather high. Thus, there is still a long way to go in reconstructing tertiary structure from secondary structure. Encouragingly, however, the accuracy of SS prediction methods has reached a respectable level for further research on this problem.
4.3.4 Other Structural Features Prediction
Besides secondary structure, there are other structural features that can help us understand and predict protein tertiary structure, such as residue solvent accessibility, cysteine bonding state, and residue long-range contacts. Since for most structural features the prediction problem can be modeled as a mapping that relates each residue in the protein sequence to a symbol describing a certain property, it is not surprising that ANNs, as automatic learning methods, find application and success in predicting many other structural features. In this section, we briefly survey a few of them.
Residue solvent accessibility (RSA) describes the relative degree to which a residue
interacts with solvent molecules. It can be described in several ways. The simplest is a two-state description: residues with greater RSA are considered exposed, and residues with lower RSA are considered buried. ANN methods have long been applied to the prediction
of RSA. As in SS prediction, the first attempts took only a single sequence as input, as in [37],
and later evolutionary profiles were used, as in [72]. Then in [65], ensembles of bidirectional
recurrent neural networks, similar to those employed in SS prediction, were investigated
and obtained good performance, showing again the ability of ANNs to exploit structural
features.
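The two-state RSA description can be made concrete with a minimal sketch. The 25% cutoff below is a typical but not universal choice, and the function name and labels are assumptions for the example.

```python
# Minimal sketch of the two-state RSA description: residues whose
# relative solvent accessibility exceeds a chosen cutoff (here 25%)
# are labelled exposed ("e"), the rest buried ("b").

def two_state_rsa(rsa_values, cutoff=0.25):
    """rsa_values: relative accessibilities in [0, 1], one per residue.
    Returns a string of per-residue labels."""
    return "".join("e" if r > cutoff else "b" for r in rsa_values)
```

Richer descriptions simply use more thresholds (e.g. a three- or ten-state discretization), turning the same regression-like quantity into a finer classification target.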
Cysteine is one of the twenty amino acids and can occur in either of two forms: oxidized or reduced. Two oxidized cysteines can pair to form a disulphide bridge, a type of covalent bond important for protein folding and stabilization. Identifying oxidized cysteines can thus help predict disulphide bridges, and the problem can be cast as a binary classification task: for each cysteine in a given protein, predict whether or not it is in a disulphide bridge. Both feed-forward and recurrent networks have been applied to this task. The program CYSPRED developed in [29] uses a neural network with no hidden nodes, fed by a window of residue positions centered at the target cysteine; an evolutionary profile is used as the input for each residue position. This method achieved 79% accuracy. In [10], an SVM method using additional domain knowledge was investigated, achieving 84% accuracy. Based on this method, the authors further added a global refinement stage using bidirectional recurrent networks, reaching 88% accuracy.
For predicting long-range contacts of residues, the basic hypothesis is that residues in
contact in a protein structure tend to mutate in a covariant manner. Thus detecting residues
mutating in a correlated manner can be taken as an indication of probable physical contact
in 3-d. There are various methods for this problem. One approach is to train neural networks
using different encoding systems for multiple sequence alignments [30]. For example, each residue pair in the protein sequence can be coded as an input vector containing 210 elements (20 x (20 + 1)/2), representing all possible unordered couples of residue types (each residue couple and its symmetric counterpart are coded in the same way), and a single output unit can code for contact versus non-contact.
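The symmetric 210-element pair encoding just described can be written down directly; the indexing formula below is one standard way of enumerating the upper triangle of the 20 x 20 pair matrix (the helper names are ours, not from [30]).

```python
# Sketch of the symmetric pair encoding described above: an unordered
# pair of residue types from a 20-letter alphabet maps to one of
# 20*(20+1)/2 = 210 vector positions, so (a, b) and (b, a) light up
# the same input element.

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acid codes

def pair_index(a, b):
    """Index of the unordered pair (a, b) in the 210-element vector."""
    i, j = sorted((AA.index(a), AA.index(b)))
    # offset of row i in the upper triangle (diagonal included),
    # plus the column offset within that row
    return i * 20 - i * (i - 1) // 2 + (j - i)

def encode_pair(a, b):
    """One-hot 210-element input vector for a residue pair."""
    v = [0] * 210
    v[pair_index(a, b)] = 1
    return v
```

Profiles generalize this immediately: instead of a one-hot entry, each of the 210 positions can hold the frequency of that residue couple across the columns of a multiple sequence alignment.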
4.3.5 Discussion
What has been described here is the application of ANNs to protein structural feature prediction, a subproblem of PSP, and in particular secondary structure prediction. This category of prediction problem can be described as a mapping problem in which we relate a sequence over an alphabet of twenty letters to a sequence over some alphabet representing structural features. This way of posing the problem enables the use of
ANNs, as well as other automatic learning techniques, to infer the relationships between
sequences and structural features by learning from known cases. ANNs perform quite suc-
cessfully in this task. The general ideas of how ANNs are applied in this task have been
introduced in previous sections. In this section, we discuss a few more issues. Some of the
issues are specific to protein structural feature prediction, while others are general problems
of ANNs.
The problem of over-fitting
The purpose of training an ANN is not to learn the training set to the highest degree of
accuracy. Rather, the aim is to generate a network that has the ability to generalize to
other unseen data. Thus a network should avoid being over-trained; otherwise, it will fit the training data perfectly while generalizing poorly, like focusing so closely on particular trees that we miss the forest. Another problem with over-training is that training data normally contain noise: if a network is over-trained, it will learn the noisy details of the training set and is unlikely to be optimal from the perspective of generalization.
Several factors have been identified that affect whether an ANN generalizes well. Examples are: 1) the ratio of network parameters to training examples, which should not be too large; 2) the number of hidden nodes. Although structural feature prediction does not necessarily require hidden nodes, most ANN designs in the literature for proteins of realistic length have hidden layers. The number of hidden nodes is usually determined by experiment: too few leave the network unable to learn, while too many lead to poor generalization; 3) the number of training iterations. With too few training iterations, the network will be unable to extract important features from the training set; with too many, it will begin to learn details of the training set so closely that it fails to abstract general features.
In practice, the above-mentioned factors can be handled in different ways. For example, the popular SS prediction method PHD [70] uses two methods to address the over-fitting problem. One is early stopping. The other is to use ensemble averages by training different networks independently, using different input information and learning procedures. Cross-validation techniques (see Section 4.3.3) are also commonly used during training to control over-training; they are effective against over-fitting but computationally expensive.
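The early-stopping idea can be made concrete with a schematic training loop. All names below are hypothetical; the point is only the control flow: training stops once validation error has not improved for a fixed number of epochs, and the best weights seen so far are kept rather than the final ones.

```python
# Schematic early stopping: halt training when validation error has
# stopped improving for `patience` consecutive epochs, and return the
# best-so-far weights rather than the weights at the last epoch.

def train_with_early_stopping(train_step, val_error, max_epochs=100,
                              patience=5):
    """train_step(epoch) -> weights; val_error(weights) -> float."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_step(epoch)
        err = val_error(weights)
        if err < best_err:
            best_err, best_weights, bad_epochs = err, weights, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # validation error no longer improving
    return best_weights, best_err
```

The validation set thus acts as a proxy for generalization: training error keeps falling, but the loop stops at the point where unseen-data error starts to rise.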
Effects of evolutionary information
The fact that proteins are evolutionarily related affects the application of ANNs to structural feature prediction in two ways. First, evolutionary information has proven useful in improving prediction accuracy: making use of it during the prediction process contributes a significant improvement [9]. This evolutionary information mainly takes the form of multiple alignment profiles. Secondly, because evolutionarily related proteins often exhibit very similar secondary structures, during network training we have to ensure that no protein homologous to those in the training set is present in the validation and test sets; otherwise the evaluation of the network is bound to be incorrect, because the network may "learn" to recognize homologous proteins and give the same answer for them, rather than recognize the features of the sequence. Thus, in practice, the protein sequences in the training, validation, and test sets normally have to be inspected to make sure that no pair shares significant similarity. Usually a threshold of about 25% sequence identity is used for this purpose.
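The redundancy filtering described above can be sketched greedily. This is a deliberately crude illustration: a real pipeline would compute identity over a proper alignment (e.g. with BLAST), whereas the naive position-by-position comparison below only makes sense for pre-aligned or equal-length sequences.

```python
# Crude sketch of redundancy filtering: greedily keep a sequence only
# if its naive (ungapped, position-by-position) identity to every
# sequence already kept is below the threshold. Real pipelines use
# alignment-based identity instead.

def identity(s, t):
    """Fraction of matching positions over the shorter sequence."""
    n = min(len(s), len(t))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(s, t)) / n

def nonredundant(seqs, threshold=0.25):
    kept = []
    for s in seqs:
        if all(identity(s, t) < threshold for t in kept):
            kept.append(s)
    return kept
```

Applying such a filter before splitting into training, validation, and test sets is what prevents the network from scoring well merely by recognizing homologues.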
About data sets
The application of ANNs depends on the use of data sets for training, validation, and testing. For ANNs used in the PSP problem, some issues concerning these data sets are worth noting.
First, as pointed out in [12], increasing the number of non-homologous proteins in the
data sets improves the prediction accuracy, because more biological information improves
the network's ability to discriminate between different types of structures, and the risk of
over-fitting is reduced. An example was given in [12]: a 4% improvement in the Q3 index was achieved using a data set of 318 non-homologous protein sequences, compared with Qian and Sejnowski's network [67], which used a data set about one third the size. This suggests that as the number of solved non-homologous protein structures increases over time, prediction based on larger data sets will be more accurate. Although it is hard to find further evidence for this conjecture in the literature, subsequent ANN approaches to SS prediction usually use larger data sets. For example, the data set in [41], published in 2001, contains 513 protein chains with low similarity, while the data set in [9], published two years later, contains 969 chains and almost 184,000 amino acids.
Secondly, it is not only larger data sets themselves that contribute to improved prediction; the data pool from which the data sets are drawn is also growing, and this contributes as well. This contribution comes from two aspects: first, as discussed before, the use of evolutionary information increases prediction accuracy, and obtaining evolutionary information is directly tied to database size and database search tools; second, larger data pools make it easier to select good-quality protein data for ANN methods.
A final issue about ANN data sets is not specific to the PSP problem but general to the ANN method: an ANN trained on certain data may produce predictions different from another ANN trained on different data. This poses problems for prediction accuracy, and it has not been addressed in the PSP setting. PSP researchers do pay attention to choosing data sets, but their attention goes to choosing mutually non-homologous proteins rather than to this issue.
Opening the black-box
While ANNs have been used successfully for the PSP problem, one major complaint about ANN predictors, especially from biologists, is that there is no explanation of why a protein structure is predicted as it is. A trained ANN is like a protein folding machine, fed with protein sequences and producing folded structural features; but this machine is a black box, an unknown function of the amino acid sequence. A trained ANN has evidently learned meaningful relationships in the training data, but these relationships are encoded as weight vectors within the network, which are difficult to interpret. Is it possible to see inside the black box? Is it possible to "fit the curve" to the data points and thus empirically derive the corresponding function from sequence to structure? If the problem is cast as fitting a function to data, many techniques from mathematics and computing science are applicable, but our discussion here focuses on extracting rules from neural networks so that these networks can be more than mere "black boxes".
Rule extraction from neural networks has been an active research topic in recent years.
Many methods have been proposed [87]. If the feed-forward net used for SS prediction has no hidden layers, the values of the weights chosen during training for each residue type and window location are themselves instructive. But most networks used for the PSP problem are multi-layer. For multi-layer networks, or other network types such as recurrent networks, rule-extraction methods vary and depend on network
architecture, training, and activation functions, but they can be roughly categorized as 'decompositional' or 'pedagogical' approaches, according to [87]. Decompositional approaches 'look inside' the network and analyze the weights between units to extract rules; some of these approaches require specialized restricted weight-modification algorithms, while others require specialized network architectures, such as an extra hidden layer of units with staircase activation functions. Pedagogical approaches do not examine the weights inside the black box, but extract rules by observing the relationship between the network's inputs and outputs; they are thus general purpose in nature and can be applied to any feed-forward network architecture.
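A toy example conveys the pedagogical idea. Everything below is hypothetical, not a published rule-extraction algorithm: the trained predictor is treated as an opaque function, and we probe it by flipping one window position at a time to see which positions actually drive the output.

```python
# Toy illustration of the 'pedagogical' approach: derive knowledge
# about a black-box predictor purely from input/output behaviour, by
# counting how often mutating a single window position changes the
# predicted class. Positions with high counts matter most to the
# implicit rules.

def position_sensitivity(predict, windows, alphabet):
    """predict: black-box function window -> class label.
    windows: list of equal-length tuples of symbols."""
    width = len(windows[0])
    flips = [0] * width
    for w in windows:
        base = predict(w)
        for pos in range(width):
            for sym in alphabet:
                if sym != w[pos]:
                    mutated = w[:pos] + (sym,) + w[pos + 1:]
                    if predict(mutated) != base:
                        flips[pos] += 1
    return flips
```

More sophisticated pedagogical methods build explicit symbolic rules (e.g. decision trees) from such query/response pairs, but the principle is the same: only the input-output mapping is consulted, never the weights.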
For the PSP problem, the rules extracted from ANN solutions should be applicable to most protein sequences and consistent with the laws of chemistry and physics. Overall, however, there has not been much research in this line yet. In [86], rules were extracted by specific modulation of the training procedure; the attempt did not improve performance, but it showed that the rules extracted from ANNs are more complicated than those available from statistical analysis. Because ANNs outperform other methods in prediction accuracy, it is worthwhile trying to extract rules from black-box ANNs: this improves the comprehensibility of the solutions without losing the accuracy of the black boxes.
Chapter 5
Summary
In order to understand the function of a protein, it is important to know its structure. This
report deals with the determination of protein structure using computational methods, especially AI techniques inspired by biological systems.
The structure of a protein may be described at four major levels. Protein structure
prediction operates primarily at the level of the secondary and tertiary structure. The
fundamental principle underlying all the methods is Anfinsen's experimentally justified hypothesis that the protein sequence contains sufficient information to specify the final 3-d structure [1].
While the problem remains largely unsolved, researchers have made good progress by resorting to various simplified models and trying various approaches. Some common simplifications are: focusing only on the residues rather than on all the atoms in the protein; reducing the number of residue types by grouping residues based on physical properties such as hydrophobicity, as in HP models; and reducing the number of spatial degrees of freedom of the atoms or residues, for instance by restricting residue locations to lattices. Predicting secondary structure can also be seen as simplifying the 3-d problem by projecting the 3-d structure onto a 1-d string of secondary structural assignments, one per residue.
The various approaches to the problem can be classified into three categories: knowledge-
based - building the structure based on knowledge of a good template structure; ab initio -
building the structure from scratch using first principles; and structural features prediction.
Each category has sub-divisions of approaches. A particular approach is chosen depending
on the protein in question and the amount of data available, or on the research interest of the
research group. In practice, knowledge-based prediction tools are more successful. Hybrid
approaches also perform well and are becoming a trend in PSP research. Currently, most ab initio methods work on simplified models and at the residue level; strictly speaking, they are therefore not practical full tertiary structure prediction methods. They are, however, important in the sense that a true solution to ab initio prediction would permit the rational design of novel proteins with novel functions.
AI techniques have been applied in many approaches to the problem. The most notable are evolutionary computation in ab initio prediction and artificial neural networks in secondary structure prediction. In this report, we reviewed and analyzed the applications of three biologically inspired AI techniques to the PSP problem: evolutionary computing, ANNs, and L-systems. For each of these techniques, we presented a general framework of how it can be used for PSP, either directly or by discussing its important components, along with the rationale for why it is suitable for protein structure prediction. We also discussed and compared significant studies published in recent years.
Evolutionary algorithms are effective and generally applicable search techniques for hard
problems for which analytical methods or good heuristics are not available. PSP problem,
when formulated as a searching-for-optimal-conformation problem, is a good candidate for
using EAs. EAs explore an energy landscape for a minimal energy conformation which
is believed to correspond to the native state. Three crucial components were addressed: a representation of structure geometry that translates the problem space into encodings that can be used for evolution; a potential energy function that can distinguish between favorable and unfavorable structures; and the specific variation and selection operators that explore the conformational space. For structure representation and the energy function, large approximations are required because of the complexity of the problem. We addressed this issue and sampled the research literature to show how various approximations are handled.
Lindenmayer systems were presented as a novel generative encoding scheme for capturing protein structure in lattice models. We introduced and analyzed recent research in this line. L-system-based encoding has been tested in evolutionary algorithms with good preliminary results, but further research is needed to investigate its applicability to the PSP problem.
For humans, a large memory of stored examples can serve as the basis for intelligent inference. For the PSP problem, ANNs infer meaningful relations between primary sequence and secondary structure from selected data sets. The learned relationships, although in a hidden form, are then used to predict the structures of new sequences, with promising results. From the point of view of pattern recognition, secondary structure prediction can be seen as a classification task that assigns to each residue one of three (sometimes more) classes of conformational states. Various kinds of ANNs have been used for this task. We examined feed-forward networks based on local amino acid interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers.
Among the many AI techniques that have been applied to the PSP problem, I have sampled only a few of those inspired by biological systems. Nature is still 'smarter' than humans; perhaps eventually we can successfully apply what we have learned from Nature to biological problems themselves.
Bibliography
[1] C.B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223-230, 1973.
[2] J. Augen. Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine. Addison Wesley, 2004.
[3] P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. The MIT Press, 2001.
[4] P. Baldi, S. Brunak, P. Frasconi, G. Soda and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15:937-946, 1999.
[5] M. J. Bayley, G. Jones, P. Willett and M.P. Williamson. GENFOLD: a genetic algorithm for folding protein structures using NMR restraints. Protein Sci, 7:491-499, 1998.
[6] B. Berger and T. Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J. Comp. Bio., 5:27-40, 1998.
[7] C. Branden and J. Tooze. Introduction to protein structure. Garland Publishing Inc., 2nd edition, 1999.
[8] R. Casadio, E. Capriotti, M. Compiani, P. Fariselli, I. Jacoboni, P. Luigi, I. Rossi and G. Tasco. Neural networks and the prediction of protein structure. In Artificial intelligence and heuristic methods in bioinformatics, P. Frasconi and R. Shamir (eds.), IOS Press, 2003.
[9] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. A combination of support vector machines and bidirectional recurrent neural networks for protein secondary structure prediction. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 2003.
[10] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Predicting the disulfide bonding state of cysteines with combinations of kernel machines. Journal of VLSI Signal Processing, 35:287-295, 2003.
[11] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Cysteine bonding state: local prediction and global refinement using a combination of kernel machines and bidirectional recurrent neural networks. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 2003.
[12] J. Chandonia and M. Karplus. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Science, 5:768-774, 1996.
[13] J. Chen and N.S. Chaudhari. Capturing long-term dependencies for protein secondary structure prediction. In Advances in Neural Networks: Lecture Notes in Computer Science, Vol. 3174, Springer Verlag, 2004.
[14] C. Chothia and A. Lesk. Relationship between the divergence of sequence and structure in proteins. EMBO Journal, 5:823-827, 1986.
[15] P.Y. Chou and U.D. Fasman. Prediction of protein conformation. Biochemistry, 13:211- 215, 1974.
[16] J. Cohen. Bioinformatics-an introduction for computer scientists. ACM Computing Surveys, 36: 122-158, 2004.
[17] W.D. Cornell et al. A second generation force field for the simulation of proteins and nucleic acids. J. Am. Chem. Soc., 117:5179-5197, 1995.
[18] C. Cotta. Protein structure prediction using evolutionary algorithms hybridized with backtracking. Artificial Neural Nets Problem Solving Methods, Lecture Notes in Computer Science, 2687:321-328.
[19] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, and M. Yannakakis. On the complexity of protein folding. J. Comp. Bio., 5: 409-422, 1998.
[20] V. Cutello, G. Narzisi and G. Nicosia. A multi-objective evolutionary approach to the protein structure prediction problem. J. R. Soc. Interface, doi:10.1098, 2005.
[21] T. Dandekar and P. Argos. Potential of genetic algorithms in protein folding and protein engineering simulations. Protein Eng., 5: 637-645, 1992.
[22] T. Dandekar and P. Argos. Folding the main chain of small proteins with the genetic algorithm. Journal of Molecular Biology, 236: 844-861, 1994.
[23] T. Dandekar and P. Argos. Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and ex- tended criteria specific for strand regions. Journal of Molecular Biology, 256: 645-660, 1996.
[24] R. Day, J. Zydallis and G. Lamont. Solving the protein structure prediction problem through a multiobjective genetic algorithm. In Proc Computational Nanoscience and Nanotechnology Conference, 2002.
[25] K.A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24:1501, 1985.
[26] A.E. Eiben and J.E. Smith. Introduction to evolutionary computing. Springer, 2003.
[27] G. Escuela, G. Ochoa and N. Krasnogor. Evolving L-systems to capture protein structure native conformations. In Proc Genetic Programming: 8th European Conference, 74-84, 2005.
[28] V.A. Eyrich, M.A. Marti-Renom, D. Przybylski, M.S. Madhusudhan, A. Fiser, F. Pazos, A. Valencia, A. Sali and B. Rost. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17:1242-1243, 2001.
[29] P. Fariselli, P. Riccobelli, and R. Casadio. Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36:340-346, 1999.
[30] P. Fariselli and R. Casadio. Neural network based predictor of residue contact in pro- teins. Protein Engineering, 12: 15-21, 1999.
[31] D. Fischer, D. Baker and J. Moult. We need both computer models and experiments (correspondence). Nature, 409: 558, 2001.
[32] D. Fischer, D. Eisenberg. Fold recognition using sequence derived properties. Protein Science, 5:947-955, 1996.
[33] G.B. Fogel and D.W. Corne. Evolutionary Computation in Bioinformatics. Elsevier, 2003.
[34] D.B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelli- gence. IEEE Press, 1995.
[35] J. Gamalielsson and B. Olsson. Evaluating protein structure prediction models with evolutionary algorithms. In Information Processing with Evolutionary Algorithms, M. Grana, R. Duro, A. d'Anjou and P. Wang (eds.), Springer, 2005.
[36] I. Halperin, B. Ma, H. Wolfson and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47: 409-443, 2002.
[37] S.R. Holbrook, S.M. Muskal and S.H. Kim. Predicting surface exposure of amino acids from protein sequence. Protein Engineering, 3:659-665, 1990.
[38] J.H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
[39] B. Honig. Protein folding: from the Levinthal paradox to structure prediction. Journal of Molecular Biology, 293:283-293, 1999.
[40] G. Hornby and J. Pollack. The advantages of generative grammatical encodings for physical design. In Proc Congress on Evolutionary Computation, 2001.
[41] S. Hua and Z. Sun. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. Journal of Molecular Biology, 308:397-407, 2001.
[42] D.T. Jones, W.R. Taylor and J.M. Thornton. A new approach to protein fold recognition. Nature, 358:86-89, 1992.
[43] D.T. Jones. GenThreader: an efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology 287: 797-815, 1999.
[44] M. Khimasia and P. Coveney. Protein structure prediction as a hard optimization prob- lem: the genetic algorithm approach. Molecular Simulation, 19: 205-226, 1997.
[45] R. King and M. Sternberg. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5: 2298-2310, 1996.
[46] G. Kokai, Z. Toth and R. Vanyi. Modeling blood vessels of the eye with parametric L-systems using evolutionary algorithms. In Proc Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, 1999.
[47] N. Krasnogor, D. Pelta, P.E. Lopez, and E. Canal. Genetic algorithm for the protein folding problem, a critical view. In Proc of Engineering of Intelligent Systems, 1998.
[48] N. Krasnogor, W. Hart, J. Smith and D. Pelta. Protein structure prediction with evolu- tionary algorithms. In Proc Genetic and Evolutionary Computation Conference, 1999.
[49] D.V. Laurents, S. Subbiah and M. Levitt. Different protein sequence can give rise to highly similar folds through different stabilizing interactions. Protein Science, 3: 1938- 1944, 1994.
[50] K. Lin, V. Simossis, W. Taylor and J. Heringa. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21(2):152-159, 2005.
[51] A.D. MacKerell et al. All-atom empirical potential for molecular modeling and dynam- ics studies of proteins. J. Phys. Chem. B, 102:3586-3616, 1998.
[52] J. Meiler and D. Baker. Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci, 100(21):12105-12110, 2003.
[53] S. Miyazawa and R.L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. Journal of Molecular Biology, 256:623-644, 1996.
[54] A. Narayanan, E.C. Keedwell and B. Olsson. Artificial intelligence techniques for Bioinformatics. Applied Bioinformatics, 1(4):191-222, 2003.
[55] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453, 1970.
[56] G. Ochoa, G. Escuela and N. Krasnogor. Incorporating knowledge of secondary struc- tures in a L-system-based encoding for protein folding. In Proc Artificial Evolution Conference, 2005.
[57] A.R. Ortiz, A. Kolinski and J. Skolnick. Nativelike topology assembly of small proteins using predicted restraints in Monte Carlo folding simulations. Proceedings of the National Academy of Sciences, 95:1020-1025, 1998.
[58] D.J. Osguthorpe. Ab initio protein folding. Current Opinion in Structural Biology, 10:146-152, 2000.
[59] A. Passerini and A.Vullo. Machine learning in structural genomics. 2004.
[60] A. Patton, W. Punch III and E. Goodman. A standard GA approach to native protein conformation prediction. In Proc 6th Intl Conf Genetic Algorithms, 574-581, 1995.
[61] K. Petersen and W.R.Taylor. Modelling zinc-binding proteins with GADGET: genetic algorithm and distance geometry for exploring topology. Journal of Molecular Biology, 325: 1039-1059, 2003.
[62] A. Piccolboni and G. Mauri. Application of evolutionary algorithms to protein folding prediction. In Proc ICONIP, 1997.
[63] N.A. Pierce and E. Winfree. Protein Design is NP-Hard. Protein Engineering, 15: 779- 782, 2002.
[64] G. Pollastri, D. Przybylski, B. Rost and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228-235, 2002.
[65] G. Pollastri, P. Baldi, P. Fariselli and R. Casadio. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142-153, 2002.
[66] O.B. Ptitsyn and A.V. Finkelstein. Theory of protein secondary structure and algorithm of its prediction. Biopolymers, 22:15-25, 1983.
[67] N. Qian and T. Sejnowski. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202: 865-884, 1988.
[68] A.A. Rabow and H.A. Scheraga. Improved genetic algorithm for the protein folding problem by use of a cartesian combination operator. Protein Science, 5:1800-1815, 1996.
[69] A. Renner and E. Bornberg-Bauer. Exploring the fitness landscapes of lattice proteins. Pacific Symposium on Biocomputing, 2:361-372, 1997.
[70] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232:584-599, 1993.
[71] B. Rost, C. Sander and R. Schneider. Redefining the goals of protein secondary structure prediction. Journal of Molecular Biology, 235:13-26, 1994.
[72] B. Rost and C. Sander. Conservation and prediction of solvent accessibility in protein families. Proteins, 20:55-72, 1994.
[73] B. Rost. PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266:525-539, 1996.
[74] B. Rost, J. Liu, D. Przybylski, R. Nair, H. Bigelow, K. Wrzeszczynski and Y. Ofran. Predict protein structure through evolution. In Chemoinformatics, J. Gasteiger and T. Engel (eds.), Wiley, 2003.
[75] B. Rost. Neural networks predict protein structure: hype or hit? In Artificial intelligence and heuristic methods in bioinformatics, P. Frasconi and R. Shamir (eds.), IOS Press, 2003.
[76] B. Rost. Rising accuracy of protein secondary structure prediction. In Protein structure determination, analysis, and modeling for drug discovery, D. Chasman (ed), New York: Dekker, 2003.
[77] M.P. Scapin and H.S. Lopes. Protein structure prediction using an enhanced genetic algorithm for the 2D HP model. III Brazilian Workshop on Bioinformatics, 183-186, 2004.
[78] D.B. Searls. The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology, L. Hunter (ed.), AAAI Press, 1993.
[79] D.B. Searls. Grand challenges in computational biology. In Computational Methods in Molecular Biology, S. L. Salzberg, D. B. Searls and S. Kasif (eds.), Elsevier, 1998.
[80] N. Siew and D. Fischer. Convergent evolution of protein structure prediction and computer chess tournaments: CASP, Kasparov and CAFASP. IBM Systems Journal, 40(2):410-425, 2001.
[81] K.T. Simons, C. Kooperberg, E. Huang and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 268:209-225, 1997.
[82] J. Skolnick and A. Kolinski. Simulations of the folding of a globular protein. Science, 250:1121-1125, 1990.
[83] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[84] S. Sun, P.D. Thomas and K.A. Dill. A simple protein folding algorithm using a binary code and secondary structure constraints. Protein Engineering, 8:769-778, 1995.
[85] Z. Sun, X. Xia, O. Guo and D. Xu. Protein structure prediction in a 210-type lattice model: parameter optimization in the genetic algorithm using orthogonal array. Journal of Protein Chemistry, 18:39-46, 1999.
[86] I. Tchoumatchenko, F. Vissotsky and J.G. Ganascia. How to make explicit a neural network trained to predict protein secondary structure. Proceedings of CEDEX 05, 1993.
[87] A. Tickle, F. Maire, G. Bologna, R. Andrews and J. Diederich. Lessons from past, current issues, and future research directions in extracting knowledge embedded in artificial neural networks. In Hybrid Neural Systems, S. Wermter and R. Sun (eds.), Springer-Verlag, Berlin, 2000.
[88] C.J. Tsai, B. Ma, Y.Y. Sham and S. Kumar. A hierarchical building-block-based computational scheme for protein structure prediction. IBM Journal of Research and Development, 45, 2001.
[89] R. Unger and J. Moult. Genetic algorithms for protein folding simulations. Journal of Molecular Biology, 231:75-81, 1993.
[90] R. Unger. The genetic algorithm approach to protein structure prediction. Structure and Bonding, 110:153-175, 2004.
[91] F. Vivarelli, G. Giusti, M. Campanini, M. Compiani and R. Casadio. LGANN: a parallel system combining a local genetic algorithm and neural networks for the prediction of secondary structure of proteins. Computer Applications in the Biosciences, 11:253-260, 1995.
[92] A. Vullo. On the role of machine learning in protein structure determination. AI*IA Notizie (journal of the Italian Association for Artificial Intelligence), XV(3):22-30, 2002.