A REVIEW OF
ARTIFICIAL INTELLIGENCE TECHNIQUES
APPLIED TO PROTEIN STRUCTURE PREDICTION
Jiang Ye
B.Sc., University of Ottawa, 2003
A PROJECT SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in the School
of
Computing Science
© Jiang Ye 2007
SIMON FRASER UNIVERSITY
Spring 2007
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name:
Degree:
Title of project:
Jiang Ye
Master of Science
A Review of Artificial Intelligence Techniques Applied to Pro-
tein Structure Prediction
Examining Committee: Dr. Diana Cukierman
Chair
Date Approved:
Dr. Veronica Dahl, Senior Supervisor
Dr. Kay C. Wiese, Supervisor
Dr. Alma Barranco-Mendoza, Examiner,
Assistant Professor of Computing Science,
Trinity Western University, Langley
SIMON FRASER UNIVERSITY Library
DECLARATION OF PARTIAL COPYRIGHT LICENCE
The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the "Institutional Repository" link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Revised: Fall 2006
Abstract
Protein structure prediction (PSP) is a significant, yet difficult problem that attracts attention from both the biology and computing worlds. The problem is to predict a protein's native structure from its primary sequence using computational means. It remains largely unsolved because no comprehensive theory of protein folding is available and a global search in the conformational space is intractable. This is why AI techniques have been effective in tackling some aspects of this problem.
This survey report reviews biologically inspired AI techniques that have been applied to the PSP problem. We focus on evolutionary computation and ANNs. Evolutionary computation is used as a population-based search technique, mainly in the ab initio prediction approach. ANNs are most successful in secondary structure prediction, learning meaningful relations between primary sequence and secondary structure from datasets. The report also reviews L-systems, a new generative encoding scheme for capturing protein structure on lattice models.
Keywords: Protein structure prediction, evolutionary computation, artificial neural
networks, L-systems.
Acknowledgments
My sincere gratitude goes to Dr. Veronica Dahl for the initiation of the project, for her
guidance, support and always being there to listen and to give advice. It has been my
greatest pleasure getting to know her and learning from her during my graduate studies.
I would also like to thank Dr. Kay C. Wiese for his constructive suggestions and com-
ments.
Special thanks to my parents for their unconditional love and to my son for always being
a caring boy.
Contents
Approval ii
Abstract iii
Acknowledgments iv
Contents v
1 Introduction 1
1.1 Protein Structure 3
1.1.1 Protein basics 3
1.1.2 Protein Structure Hierarchy 5
1.1.3 Experimental Methods 6
1.2 Algorithmic Processing of Evolution 7
1.3 Protein Structure Databases 8
1.4 Evaluation of Prediction Methods 9
2 Problem Overview 12
2.1 The Significance 12
2.2 The Challenges 13
2.3 Representation of Protein Structure 13
2.3.1 All-atom Model 14
2.3.2 Simplified Models 14
2.3.3 HP Lattice Model 15
2.4 Potential Energy Functions 16
2.5 Measure of Prediction Accuracy 18
2.6 Related Problems 18
3 Prediction Approaches Overview 21
3.1 Knowledge-based Prediction 22
3.1.1 Homology (Comparative) Modeling 22
3.1.2 Fold Recognition (Threading) 23
3.2 Ab initio Prediction 24
3.2.1 Dynamic Modeling 25
3.2.2 Energy Minimization 26
3.3 Structural Features Prediction 28
3.3.1 Secondary Structure Prediction 28
4 AI Techniques for PSP 30
4.1 Evolutionary Computation 31
4.1.1 Introduction to Evolutionary Algorithms 32
4.1.2 Evolutionary Algorithms for PSP 36
4.1.3 Discussion 47
4.2 L-systems 51
4.2.1 Introduction to L-systems 52
4.2.2 L-system-based Encoding for Protein Structure 53
4.2.3 Discussion 56
4.3 Artificial Neural Networks 57
4.3.1 Introduction to ANNs 58
4.3.2 A Basic ANN Scheme for Predicting Structural Features 61
4.3.3 Secondary Structure Prediction 62
4.3.4 Other Structural Features Prediction 70
4.3.5 Discussion 71
5 Summary 76
Bibliography
Chapter 1
Introduction
"Biology easily has 500 years of exciting problems for computer science." (Donald Knuth,
2001) Protein structure prediction (PSP) is definitely one such problem.
Proteins form the very basis of life. They perform a variety of essential functions in
organisms, from replication of the genetic code to transporting oxygen, from making up our
cell skeleton to catalyzing chemical reactions that make life possible. Proteins are formed
by joining amino acids into a linear chain. In water, the solvent environment in cells, the
chain folds up into a unique three-dimensional structure. Determining this structure is the
key to understanding how proteins work, thus is essential for our understanding of biological
processes and our ability to enhance the quality and span of our lives.
Currently, the structures of fewer than 30,000 proteins have been determined
through experimental methods [98]. This is in contrast to more than a million protein
sequences known as a result of the explosion of genome sequencing projects. The sequence-
structure gap has dramatically increased. Since the costly and time-consuming experimental
methods for structure determination cannot keep pace with sequencing speed, we need effec-
tive computational tools which are able to translate the sequence into an accurate structure.
Unfortunately, despite the growth of computing power and several decades of research
effort, the problem of predicting protein structure from sequence remains largely unsolved
and has therefore been the "holy grail" in computational biology for many years. The main
reason for this, as indicated in [59], is that no comprehensive theory of protein folding is
available and a global search in the conformational space of proteins is intractable.
The bright side, however, is that more and more research attention has been drawn
to tackle this problem and there have been some promising results in some aspects of this
CHAPTER 1. INTRODUCTION 2
problem. The structure prediction community is growing rapidly. In the first CASP (Critical
Assessment of Structure Prediction) contest in 1994, 35 research groups submitted 100
predictions for 33 protein targets, while in CASP6 in 2004, there were 230 groups submitting
more than 41,000 predictions for the 76 targets. Also, the databases that keep various protein
structural information and the web servers/programs for the task have greatly increased.
There are two major categories of approaches to this problem: 'knowledge-based' and
'ab initio'. Some knowledge-based methods have achieved rather accurate predictions for
a limited number of proteins. A third category of approaches predicts structural features
such as secondary structure and has become very useful in the general prediction problem.
All three categories of approaches have been attracting active research.
This report is intended to give an overview of computational approaches to the PSP
problem. The focus is on applications of some interesting biologically inspired AI techniques.
The role of computers has been dramatically enhanced in all areas of biological and
medical research with the exponential growth of biological data. At the same time, biological
systems have been inspiring advances in computing science with new concepts, including
genetic algorithms, artificial neural networks, artificial life, and DNA computing. When
humans try to solve problems, it is always exciting to look at Nature's amazing solutions.
"When looking for the most powerful natural problem solver, there are two rather
straightforward candidates: the human brain (that created 'the wheel, New York, wars
and so on'), and the evolutionary process (that created the human brain)" [26]. Trying to
design problem solvers based on the human brain leads to the field of neuro-computing. The
evolutionary process forms the basis of evolutionary computing. In this report, we will examine
and review how evolutionary algorithms and artificial neural networks are applied in PSP.
We will also introduce Lindenmayer systems as a novel protein structure representation
scheme. Although not as powerful as evolutionary computation and ANNs, L-systems, also
inspired by natural systems, have found many applications in the computing world and have
recently been applied to encode lattice protein conformations.
This report has five chapters. Chapter one provides a basic introduction to protein struc-
ture and the important resources in the protein structure prediction research field. Chapter
two serves as a problem overview through discussing several issues in the problem domain.
Chapter three is a general introduction to various approaches to the problem. This chapter
serves as the big picture of protein structure prediction and helps to show where individual
computational techniques or methods fit. Chapter four is the main chapter
of this report. It reviews and analyzes three biologically inspired computing techniques:
evolutionary algorithms, L-systems, and ANNs, and their applications to the PSP problem.
The report concludes with a short summary.
1.1 Protein Structure
This section introduces basic ideas about proteins, protein structure, and the current ex-
perimental methods to determine protein structure.
1.1.1 Protein basics
A protein is a chain of amino acids, also referred to as residues. A single amino acid, shown
in the diagram below, always has: a central carbon atom (Cα), an amino group (-NH2), a
carboxyl group (-COOH), a hydrogen atom (-H), and a chemical group or side chain (-R).
Figure 1.1: Single amino acid structure
There are 20 different amino acids commonly found in proteins, each coded using
one English letter. For example, the amino acid cysteine is coded as 'C'. All the 20 amino acids
have the same general structure as shown in Figure 1.1, but their side chains (Rs) vary
in composition and structure, thus in properties like size, shape, charge etc. It is the side
chain that determines the identity of a particular amino acid. One useful classification of
the amino acids divides them into two kinds: the polar (or hydrophilic) amino acids have
side chains that interact with water, while those of the hydrophobic amino acids do not.
Amino acids can be linked when the carboxyl group of one amino acid reacts with the
amino group of the next amino acid, releasing a water molecule and forming a peptide bond
between the two amino acids, as shown in Figure 1.2.
Using peptide bonds, long sequences of amino acids (polypeptide chains, or proteins)
are generated. Most proteins are a few hundred residues long, although some are shorter
than 100 residues and others exceed 1,000 residues. Relative to the
Figure 1.2: Two amino acids reacting to form a peptide bond
side chains, the sequence of three repeating groups: amino group, Cα atom, and carboxyl
group is called the protein backbone or main chain. The two ends of a polypeptide chain are
chemically different: the end carrying the amino group is the N-terminal, and that carrying
the carboxyl group is the C-terminal. Conventionally, the amino acid sequence of a protein
is always presented in the N to C direction.
The peptide bond itself (the CO-NH group indicated with a rectangle) is planar, but
there is flexibility for rotation around the N-Cα bond and around the Cα-C bond, forming
two dihedral angles, φ and ψ, on each side of the Cα atom, as shown in Figure 1.3. These
two dihedral angles are the main degrees of freedom in forming the 3-d polypeptide chain.
Figure 1.3: Dihedral angles of the backbone
Although the values of the dihedral angles are restricted to small regions in natural
proteins, it is this freedom that allows a protein to fold into a specific 3-dimensional
structure, or conformation.
Valid values are specified on the so-called 'Ramachandran plot'.
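As a small aside (not from the original text), the dihedral angle defined by four consecutive backbone atoms can be computed directly from their 3-d coordinates using standard vector algebra. The sketch below is an illustrative implementation; the function name and the test coordinates are invented for the example:

```python
import math

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, e.g. the
    backbone atoms C(i-1), N, Ca, C for phi, or N, Ca, C, N(i+1) for psi."""
    def sub(a, b):
        return [a[k] - b[k] for k in range(3)]
    def dot(a, b):
        return sum(a[k] * b[k] for k in range(3))
    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)          # normals of the two planes
    u2 = [c / math.sqrt(dot(b2, b2)) for c in b2]  # unit vector along b2
    x, y = dot(n1, n2), dot(cross(n1, u2), n2)
    return math.degrees(math.atan2(y, x))

# Planar zigzag examples: 0 degrees (cis-like) and 180 degrees (trans-like)
cis = dihedral([0, 1, 0], [0, 0, 0], [1, 0, 0], [1, 1, 0])
trans = dihedral([0, 1, 0], [0, 0, 0], [1, 0, 0], [1, -1, 0])
```

Applied to real coordinates, the φ/ψ pair of each residue obtained this way is exactly what is plotted on a Ramachandran plot.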
1.1.2 Protein Structure Hierarchy
Conventionally, protein structure is represented with four levels of description: primary,
secondary, tertiary, and quaternary structure.
• Primary: the ordered sequence of amino acid residues. Formally, it can be modeled
as a string over a finite alphabet Σ where |Σ| = 20 (there are 20 amino acids).
Protein sequences differ in length from 30 to over 30,000 amino acids, but most are a
few hundred residues long [74].
• Secondary: the local arrangement of amino acids over a short range of the protein chain.
Only main-chain atoms are involved in secondary structure. There are two main
secondary structure patterns: α-helix (H) and β-sheet (E). They may be connected by
loop regions or coils (C).
An α-helix is a tightly coiled, rodlike structure. It is built up from one continuous
region in the protein sequence through the formation of hydrogen bonds between the C=O
group of the residue in position i and the NH group of residue i + 4. A β-sheet is formed by
two or more β-strands hydrogen-bonded side by side. A β-strand is just a fragment
of consecutive residues, but the different β-strands forming a pleated β-sheet are usually
distant in sequence. Coils have no fixed regular shape. At a slightly higher level,
we can define motifs, or supersecondary structures, which are commonly found
secondary-structure arrangements, such as helix-loop-helix.
Every amino acid in the sequence belongs to one of the three structural types (sometimes
a finer classification of eight structural types is used), thus protein secondary
structure can be flattened and represented by a string over the alphabet {H, E, C}
with the same length as the primary structure. Take an example from [59]; a fragment
of protein primary and secondary structure is as follows:
... P Y E L A M S P T I M C K D N W M A L E M L T ... ← Primary structure
... C C H H H H C E E E E E E E E H H H H H C C C ... ← Secondary structure
• Tertiary: the three-dimensional conformation that results from secondary structures folding
together. Interactions of amino acid side chains are the predominant drivers of tertiary
structure [2]. This tertiary structure into which a protein naturally folds is also known
as its native structure. Normally, the interior of a (folded) protein molecule tends to
be hydrophobic, while the exterior of a protein is largely composed of hydrophilic
residues, which are able to bond with water molecules. This allows a protein to have
greater water solubility.
• Quaternary: results from the interactions of multiple independent polypeptide chains.
This level of structure will not be discussed in this report.
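The string encodings of the first two levels are easy to make concrete in code. The following sketch (illustrative only; the helper name check_annotation is invented) validates a primary sequence over the 20-letter alphabet together with its three-state secondary structure string, using the fragment from [59]:

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 one-letter residue codes
SS_SYMBOLS = set("HEC")                    # helix, strand (sheet), coil

def check_annotation(primary, secondary):
    """Check that a secondary structure string is a valid per-residue
    annotation of a primary sequence (same length, known symbols)."""
    if len(primary) != len(secondary):
        raise ValueError("annotation length must equal sequence length")
    if not set(primary) <= AMINO_ACIDS:
        raise ValueError("unknown amino acid code in sequence")
    if not set(secondary) <= SS_SYMBOLS:
        raise ValueError("unknown secondary structure symbol")
    return True

# The fragment from [59]:
primary = "PYELAMSPTIMCKDNWMALEMLT"
secondary = "CCHHHHCEEEEEEEEHHHHHCCC"
check_annotation(primary, secondary)
```

This flattened view is precisely the input/output format used by the secondary structure predictors reviewed later: a sequence over the 20-letter alphabet in, an equal-length string over {H, E, C} out.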
1.1.3 Experimental Methods
Protein structures are determined by two main experimental methods: X-ray crystallogra-
phy and nuclear magnetic resonance (NMR). In X-ray crystallography, the target protein
must first be isolated and highly purified. Then a series of procedures are required to grow
a crystal which is then exposed to X-rays. From the diffraction pattern recorded, the 3-d
structure could be solved. Using X-ray crystallography depends on successfully obtaining
protein crystals, which is sometimes a major difficulty. (Some proteins do not crystallize.)
It often takes months to solve even a single protein structure by X-ray methods,
although recently this process has been sped up by some high throughput techniques. An-
other drawback of the x-ray crystallography techniques is that the crystallization process
may cause the protein to assume a structure other than its native conformation. In the
second method, NMR, the 3-d structure is constructed from pairwise distances estimated by
exciting nuclei and measuring the coupling effects on their neighboring nuclei. Generally,
NMR has been successful for only small proteins and compared with X-ray crystallography,
the resolution is poorer.
Protein tertiary structures solved by X-ray crystallography or NMR are deposited in
Protein Data Bank [98] and are used to evaluate how accurate the computer prediction
models are. But it should be noted that both X-ray and NMR are indirect methods and
they have their limitations. Protein structures solved by them may not represent the native,
active conformation of the protein. For the time being, they are the best data available to
be used as a test for computer models. In the future, however, with more understanding
about protein's native conformation, the standards by which the predicted models are being
judged may be altered.
1.2 Algorithmic Processing of Evolution
Homology is an important concept in protein structure prediction. It is defined as similar-
ity in structure, physiology, development and evolution of organisms based upon common
genetic factors [7]. Evolution at molecular level is commonly modeled as a process in which
currently observed sequences have diverged from a common ancestor sequence. This process
involves such events as: mutations, deletions, and insertions of amino acids in a sequence
and selection of those having environmental advantages. In general, 3-d structures, and hence
functions, are more conserved than sequences [14, 73].
Usually, two proteins are considered to be homologous when they have identical amino
acid residues in a significant number of positions, thus resulting in similar structures; i.e., the
essential fold of the two proteins is identical, while details such as additional loop regions may vary.
However, it is frequently found that two proteins with low sequence identity can also have
similar structures. Sequence similarity can be observed by optimal alignment algorithms
which usually employ dynamic programming techniques [55, 83]. If a pair-wise alignment
shows sequence identity above some threshold, e.g. 25-30%, it is generally assumed that the
two sequences have diverged from the same ancestor, and therefore they are likely to share
a similar structure. If it is below the threshold, there are two possibilities. Either the two
proteins have diverged from the same ancestor (but their sequences are too divergent for
their homology to be detectable) or the two proteins are unrelated. Also, there are multiple
alignment algorithms comparing multiple sequences.
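As an illustration of the dynamic programming idea (a toy sketch, not from the report: it uses an invented unit scoring scheme, whereas real tools use substitution matrices such as BLOSUM and affine gap penalties), the following computes a global alignment and the resulting percent identity:

```python
def align_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment of two sequences by dynamic
    programming; returns the percent identity over aligned columns."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # DP score matrix
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + d,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Trace back, counting identical residues in aligned columns.
    i, j, ident, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            ident += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return 100.0 * ident / cols
```

The returned percentage is the quantity compared against the 25-30% homology threshold mentioned above.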
The evolutionary information in a multiple alignment of N sequences and L positions
can be expressed using a profile, a 20 x L Position Specific Scoring Matrix (PSSM) that lists
the frequencies of each amino acid in each position. This evolutionary information is often
considered when designing computational tools. Suppose we want to predict the structure
of a protein sequence s: besides exploiting the information contained in s directly, if
we can find a set of sequences similar to s, we can expect that this set contains more
structural information than s itself. The success of the most effective predictive systems is
largely based upon this empirical argument and on their ability to process the information
provided by multiple alignments of similar sequences.
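A minimal sketch of such a profile, built from an already-aligned set of sequences, might look as follows (illustrative only: real PSSMs typically store log-odds scores computed against background amino acid frequencies, with pseudocounts, and those refinements are omitted here; the toy aligned sequences are invented):

```python
from collections import Counter

def profile(alignment):
    """Column-wise amino acid frequencies of an already-aligned set of
    equal-length sequences: the columns of a 20 x L profile (a PSSM
    before any log-odds transformation). Gap characters are ignored."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment)
    columns = []
    for j in range(length):
        counts = Counter(s[j] for s in alignment if s[j] != "-")
        total = sum(counts.values())
        columns.append({aa: c / total for aa, c in counts.items()})
    return columns

# Three toy "aligned" sequences of length L = 4:
p = profile(["ACDA", "ACEA", "GCDA"])
# Column 0 mixes A and G; columns 1 and 3 are fully conserved.
```

A highly conserved column (frequency near 1.0 for one residue) signals a structurally or functionally important position, which is exactly the kind of evolutionary information the predictors exploit.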
1.3 Protein Structure Databases
The development of computational tools is undoubtedly crucial in increasing prediction
accuracy. Another very important aspect is the growth of protein sequence and structure
databases. Except for some ab initio approaches, most structure prediction methods are
dependent on detecting homologies with structures existing in the databases. Thus the
more protein structures deposited in databases, the more likely we can predict a novel
protein structure accurately. In a broader sense, having sizable databases of sequences and
structures provides raw data of evolution. It is the use of evolutionary information and
finding patterns in the information that has pushed forward the field of bioinformatics,
and subsequently the sub-field of protein structure prediction.
The most important protein structure database is the Protein Data Bank (PDB) [98].
It has existed for three decades and is a primary database that contains all experimentally
determined biological macro-molecular structures, mainly proteins. The PDB is updated
frequently and as of December 2006, about 37,300 protein structures have been deposited in
it. The availability of this large quantity of protein structures allows many analytical studies
to be carried out. Also, there are several structure classification databases that are derived
from the PDB, two of which are SCOP [100] and CATH [96]. They are both hierarchical
databases of protein structure. SCOP (the Structural Classification Of Proteins) divides
the world of protein structures to reflect both structural and evolutionary relatedness. The
major levels in the hierarchy are family, superfamily, and fold. For example, proteins placed in
the same family are clearly evolutionarily related, while proteins in the same fold category may
not have a common evolutionary origin but share some structural similarity. CATH clusters
proteins at four major levels: Class, Architecture, Topology and Homologous superfamily.
It provides a slightly different view on clustering structures. Both databases are
widely used in structure prediction, in particular in the fold recognition approach.
In addition to structure databases, protein sequences databases, e.g. GenBank and
SwissProt, are also important in structure prediction. With the recent enormous growth of
these databases, some powerful sequence alignment tools, such as PSI-BLAST, can detect
extremely remote homologous relationships between proteins. The evolutionary information
detected is valuable in structure prediction.
1.4 Evaluation of Prediction Methods
A large number of approaches or methods have been applied to the PSP problem. There is
a need to evaluate the effectiveness of the different prediction methods. In this section, we briefly
introduce three world-wide experimental competitions in the protein structure prediction field.
CASP
Critical Assessment of Structure Prediction (CASP) is a world-wide protein structure
prediction contest initiated in 1994. It has been held every two years since then, and the most
recent one was CASP6 in December 2004. During each prediction season, CASP provides
participants with the amino acid sequences of proteins whose structures are close to being
determined experimentally but not yet released to the public. The participants then work on
the blind prediction of structures of these target proteins and submit their structure models
generated by computer programs. (Often the models are produced by a combination of
computer programs and human intervention). CASP assessors then compare the predicted
models with experimentally determined structures. Each CASP contest will conclude with
a meeting to discuss the results.
Work in protein structure prediction is very complex and computationally intensive.
CASP provides the PSP research community with an assessment of the various approaches
and a critical review of the field. With the growth of the field, the number of participants
and the extent of prediction have greatly increased. In the first CASP contest, 35
research groups submitted 100 predictions for 33 protein targets, while ten years later, in
the recent CASP6, there were 230 groups submitting more than 41,000 predictions for the 76
targets [93].
Figure 1.4 shows two prediction results of protein TM0919. A good prediction and a
not-so-accurate prediction are shown.
Despite the enormous value of the CASP experiments, they do have some limitations.
In [28], some limitations are discussed: the assessment is carried out by humans and thus bears
the issue of subjectivity; the number of targets is relatively small, so the results
may not always be significant; the assessments cover only proteins determined in a period
of about four months every two years; and users cannot always reproduce CASP predictions,
because the computer programs or the required human expertise are often not available.
CAFASP
In contrast to CASP in which human intervention is allowed in the prediction process,
Figure 1.4: (a) Crystal structure of TM0919, one of the 76 target proteins of CASP6. (b) Comparison of a successful prediction (red) for TM0919 with the crystal structure. (c) Comparison of a less successful prediction. (The image was taken from [93].)
Critical Assessment of Fully Automated Structure Prediction (CAFASP) aims to assess the
performance of fully automatic structure prediction servers. Thus, what is measured is the
capability of the computer program itself, rather than the capability of prediction groups
aided with the programs. This is of significance to biologists who just want to choose a
better prediction tool.
The benefits of an assessment of fully automated methods are listed in [50]. First, the
nonspecialist users can choose which is the best method for them to use on their prediction
targets. Second, users can evaluate and better interpret the results they obtain from the
various prediction programs. And last, fully automated predictions are reproducible, unlike
the cases where human intervention is part of the model-building process.
The CAFASP results demonstrated that although in most cases human intervention
resulted in better predictions, several programs could already independently produce rea-
sonable models.
LiveBench
Like CAFASP, LiveBench also evaluates automatic servers only, but it is carried out
in a continuous fashion and uses a larger number of prediction targets. Each week the
Protein Data Bank is checked for new entries. Proteins with low sequence similarity to
other proteins of known structure are chosen as prediction targets and are immediately
submitted via Internet to the participating servers. After a few months, a large collection
of prediction targets is thus obtained, and the predicted models can be evaluated.
Chapter 2
Problem Overview
In this chapter, we discuss several issues that will help understand the problem domain.
2.1 The Significance
Knowledge about the structure of a protein is essential in understanding its biological func-
tion. It helps us to understand substrate and ligand binding, devise intelligent protein
engineering experiments with improved specificity and stability, perform structure-based
drug design, and design novel proteins. Thus, being able to predict the 3-d structure of a
protein from its amino acid sequence would greatly benefit molecular biology research. It
would provide educated guesses about the function of newly discovered proteins without
the time and cost required to perform x-ray crystallography and NMR. Indeed, if structure
prediction is good enough, it may remove the need for lab experiments altogether. At least,
in many situations, even a crude or approximate model can greatly help experimental de-
termination of protein structure. Thus, even though most of the current approaches cannot
produce accurate results yet, prediction of structures is of great value. Also, structure pre-
diction is important for the progress of protein engineering as it would enable changes to be
made in the amino acid sequence with some expectation of how the change will affect the
structure.
On the other hand, the study of the protein structure prediction problem drives the
development of computing techniques; e.g., this problem on simplified models is a good test
problem for developing and evaluating evolutionary algorithms.
CHAPTER 2. PROBLEM OVERVIEW 13
2.2 The Challenges
Protein structure prediction is a very difficult problem. We have not even come close to
solving it. In [79], David Searls outlined some major challenges around the problem:
• The physical basis of protein structural stability is not fully understood. Although
Anfinsen [1] experimentally showed that the primary sequence plus thermodynamic
principles should suffice to completely account for the native structure of a protein,
what exactly those principles are and the best way to apply them are still not certain.
• The search space of the problem is too huge because of the vast range of possible
conformations of even relatively short polypeptides.
• The primary sequence may not fully specify the tertiary structure. "There are no
rules without exception in biology."
To illustrate the second challenge a little more, take a small protein of 100 amino
acids as an example. Even with a very modest estimate of three possible structural
arrangements for each amino acid, the total number of conformations for this small protein
is 3^100 ≈ 5 × 10^47, a number which is far beyond the computing capability of modern computers.
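The arithmetic behind this explosion can be sketched in a few lines; the three-states-per-residue figure and the one-conformation-per-nanosecond enumeration rate below are illustrative assumptions, not measured values:

```python
# Back-of-envelope illustration of the conformational explosion for a
# 100-residue protein, assuming 3 structural states per residue and an
# (optimistic) enumeration rate of one conformation per nanosecond.

n_residues = 100
states_per_residue = 3

conformations = states_per_residue ** n_residues   # 3**100, about 5.2e47
seconds = conformations * 1e-9                     # at 1 ns per conformation
years = seconds / (3600 * 24 * 365)

print(f"{conformations:.2e} conformations, ~{years:.1e} years to enumerate")
```

Even at this absurdly generous rate, exhaustive enumeration would take on the order of 10^31 years, which is why intelligent search is unavoidable.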
While the other two challenges need to be addressed primarily by biophysicists and biochemists
who study and model protein folding processes, the second one is a rich source
of interesting and challenging computational problems in the AI field: e.g., what are some
intelligent ways to explore the conformation space? Why is nature so efficient and
accurate with respect to protein folding, and what can we learn from it?
2.3 Representation of Protein Structure
When we approach this computational problem, probably the first task is to represent
protein structure in the problem space. Because protein structure can be specified at different
levels of the hierarchy, and each level may be viewed at different levels of detail, there
are various ways of representing protein structures. Meanwhile, due to the complexity of the
¹Despite some new discoveries, e.g. chaperones, a special type of proteins whose function is to assist other proteins in achieving proper folding, this argument largely remains valid in the PSP community and is the fundamental principle underlying all prediction methods.
problem, in practice, further simplified or restrained models are often used to accommodate
limited computing resources.
Roughly, protein structure representations fall into two categories: the all-atom model and
simplified models. Choosing a suitable representation not only makes the problem space
explicit but may also help to find solutions more efficiently and effectively.
2.3.1 All-atom Model
In the Protein Data Bank [98], protein structures are represented by lists of 3-d coordinates of
all atoms in a protein. Although an accurate all-atom model is desirable for structure
prediction, it incurs too great a computational overhead even for very small proteins. Besides,
it is difficult to identify similar sub-structures across different proteins using an all-atom
coordinate representation; consequently, it is difficult to carry out generalization and abstraction.
Thus, for the PSP problem, various simplified models and representations are used.
2.3.2 Simplified Models
Since the all-atom model is not feasible, at least currently, it is attractive to explore simplified
structure models to see if they are good enough to at least allow approximate solutions,
which are useful either directly or as initial models for further improvement.
Simplified models range from very abstract models, such as the HP lattice model (see
2.3.3), to almost realistic models in which proteins are represented by a geometric description
of the main-chain atoms and a rotamer library of side chains. Roughly, simplified models
can be classified into lattice models and off-lattice models. Lattice models adopt a lattice
environment, a grid in which structural elements are positioned only at grid intersections,
whereas off-lattice models position structural elements in a continuous space.
Lattice models involve two kinds of simplification: each amino acid is modeled as a
single "bead" without considering the different atoms in the amino acid, and the beads are
restricted to a rigid lattice rather than being able to take any position in space. Thus, in a
legal conformation on a lattice, one residue occupies one vertex and adjacent residues in
the sequence must be adjacent in the lattice; a legal conformation is actually a self-avoiding
path on the lattice. These lattices may be two-dimensional, e.g. square or triangular, or
three-dimensional, e.g. cubic or diamond.
There is debate about the 'physical reality' of a lattice protein. For example, reference
[35] addressed the issue and suggested that simplified lattice models do not contain
the biological information necessary to solve the protein structure prediction problem. Other
researchers, e.g. N. Krasnogor in [48], argue that simple lattice models can capture many
global aspects of protein structures and, besides being inexpensive to use, make it possible
to design test problems for which the best conformational structure is known (for small
protein sequences).
Off-lattice models represent protein structure in various ways.
Depending on the level of detail at which the polypeptide chain composition is represented,
there are models considering:
- individual residues, often represented by the central Cα atoms;
- all backbone atoms;
- backbone atoms and side-chain centroids;
- all heavy atoms.
Depending on how the positions of structural units are represented, there are models:
- using dihedral angles;
- using coordinates directly, either absolute or relative;
- using a distance matrix.
These different representations carry more or less information about the protein structure.
Some structure prediction approaches use multiple representations and move among them
for different purposes.
2.3.3 HP Lattice Model
The 2-d HP lattice model is perhaps the simplest lattice protein model. It was proposed
by Dill [25] and is widely studied for ab initio prediction. This is the main model we use in
Chapter four when we discuss evolutionary algorithms and Lindenmayer generative encoding
systems.
In this model, the 20 amino acids are classified into only two classes: hydrophobic (H)
and hydrophilic (P), according to their interaction with water molecules. Thus a protein
sequence s is reduced to s ∈ {H, P}+. In addition, the sequence is assumed to be embedded
in a certain 2-d lattice. The free energy of a conformation is inversely proportional to
the number of H-H contacts. An H-H contact is a hydrophobic non-local bond; it occurs
when two H-residues occupy adjacent vertices in the lattice but are not consecutive in
the sequence. Thus, the more H-H contacts there are, the lower the free energy of the
conformation. Forming the lowest-energy conformation results in the H-residues forming
a hydrophobic core surrounded by the P-residues that interface with the environment.
This concept is normally quantified by giving a value e = -1 to every H-H contact and
trying to maximize the total number of H-H contacts.
²For a protein sequence of n residues, the corresponding distance matrix contains n × n entries, each representing the distance between the Cα atoms of a pair of residues.
The following figure shows examples of two HP lattice models. H-residues are represented
by dark circles and P-residues by white circles. H-H contacts are highlighted, in (b) only,
with curved lines:
Figure 2.1: HP models in (a) square lattice and (b) triangular lattice.
The embedding of an HP sequence in a lattice may be represented in two ways: the
location of each residue on the lattice is specified independently, or relative to the previous
residue. In the latter case, the structure is specified as a sequence of moves (e.g. up, down,
left, right) taken on the lattice from one residue to the next.
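As a concrete illustration, the move-string representation and the H-H contact count described above can be implemented directly. This is a minimal sketch of our own for the 2-d square lattice; the function and move names are not from any cited system:

```python
# A minimal sketch of the 2-d square-lattice HP model: a conformation is a
# string of relative moves, and the energy is -1 per H-H contact between
# non-consecutive residues occupying adjacent lattice vertices.

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def hp_energy(sequence, moves):
    """Energy of an HP conformation; None if the path is not self-avoiding."""
    assert len(moves) == len(sequence) - 1
    pos = [(0, 0)]
    for m in moves:                      # build the walk from relative moves
        dx, dy = MOVES[m]
        x, y = pos[-1]
        pos.append((x + dx, y + dy))
    if len(set(pos)) != len(pos):        # legal conformations are self-avoiding
        return None
    index = {p: i for i, p in enumerate(pos)}
    energy = 0
    for i, (x, y) in enumerate(pos):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES.values():
            j = index.get((x + dx, y + dy))
            # j > i + 1 counts each contact once and skips chain neighbours
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy
```

For example, hp_energy("HPPH", "RUL") folds the chain into a unit square, giving one H-H contact and energy -1, while a self-intersecting move string returns None.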
Although the degrees of freedom, and thus the amount of computation, are greatly reduced
in the HP lattice model, the PSP problem for the HP model has been shown to be
NP-hard on both the 2-d square lattice [19] and the 3-d cubic lattice [6]. This justifies the
use of intelligent search techniques, e.g. evolutionary algorithms, to tackle the problem.
2.4 Potential Energy Functions
During the prediction process, we need energy functions to provide information on what
conformations of a protein are better or worse. Clearly, energy functions are very
important to the prediction result. A poorly defined energy function may render an energy
hyper-surface that has little correlation with a protein's true conformation. An energy
function is needed in almost all computational approaches.
A wide variety of energy functions have been used in protein structure prediction. These
range from the very simple hydrophobic potential in HP lattice protein to energy models
based on more detailed molecular mechanics, such as CHARMM (Chemistry at HARvard
Macromolecular Mechanics) package [95]. Current energy functions can be roughly classified
into three categories: physical potential functions, mean force potentials, and simplified
potentials.
Physical potential functions take into account the bonded and non-bonded potentials
between atoms, such as torsion (bonded) and electrostatics (non-bonded), and typically have
the form

E = E_bonded(R) + E_non-bonded(R)

where R is the vector representing the conformation of the protein, typically in Cartesian
coordinates or torsion angles. A popular example in this category is CHARMM; the v.27
CHARMM energy function adds up seven energy terms [95].
Mean force potentials are derived from databases of known protein structures. They can
be based on statistics of the frequencies of contacts between amino acids or, at a finer grain,
between functional groups. For example, the amino acid pair R and D is frequently found
a short distance apart relative to random expectation, which indicates that such
an interaction is favorable. Mean force potentials are quite successful in the fold recognition
approach, but generally they are not accurate enough due to their crude representation and
their statistical nature.
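A common way such contact statistics are turned into a potential is the inverse Boltzmann relation, E(a, b) = -kT ln(f_obs / f_exp). The sketch below uses made-up frequencies purely for illustration:

```python
# A sketch of a mean-force contact potential via the inverse Boltzmann
# relation E(a, b) = -kT * ln(f_obs / f_exp). The frequencies below are
# invented purely for illustration.

import math

KT = 0.593  # kcal/mol at ~298 K (a common convention)

def contact_potential(observed, expected):
    """Map residue-pair contact frequencies to pseudo-energies."""
    return {pair: -KT * math.log(observed[pair] / expected[pair])
            for pair in observed}

# Hypothetical data: R-D contacts observed twice as often as expected,
# so the derived energy is negative, i.e. the interaction is favorable.
pot = contact_potential({("R", "D"): 0.02}, {("R", "D"): 0.01})
```

A pair seen more often than chance gets a negative (favorable) pseudo-energy; pairs seen less often than chance get a positive one.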
Simplified empirical energy functions are often tied to simplified protein models,
e.g. the hydrophobic potential in the simple HP lattice protein. In these potentials, the
emphasis is on computational efficiency and ease of use rather than accuracy.
The potential energy function is a very important factor in the accuracy of structure
prediction. An energy function should be sufficiently close to the true potential of the native
state; otherwise, the lowest-energy state will not correlate with the native conformation.
Development of energy functions is a very active research area; new models are frequently
published and tested. Often these new functions combine atomic forces with
statistical properties taken from observed protein structures. Currently, potential functions
are still not accurate enough.
2.5 Measure of Prediction Accuracy
How do we measure the accuracy of a predicted result, assuming we know the real native
structure of the protein? In the literature, the most popular metric is the 'root mean square
deviation' (RMSD). It measures the average distance between corresponding atoms after the
predicted and the real structures have been optimally superimposed on each other. This
distance is usually measured in Angströms (Å); one Angström equals 10^-10 m, i.e. one
hundred-millionth of a centimeter. RMSD is given by the formula:

RMSD(a, b) = sqrt( (1/n) Σ_{i=1..n} |r_ai - r_bi|² )

where r_ai and r_bi are the positions of atom i in structures a and b, respectively.
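Assuming the two structures have already been optimally superimposed, the formula is straightforward to compute; here is a small sketch with toy coordinates (the superposition step itself, e.g. the Kabsch algorithm, is omitted):

```python
# RMSD between two structures that are assumed to be already superimposed;
# each structure is an n x 3 array of atom coordinates (toy values below).

import numpy as np

def rmsd(a, b):
    """Root mean square deviation between corresponding atoms of a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    assert a.shape == b.shape and a.shape[1] == 3
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Two 3-atom toy structures offset by 1 Å along x: the RMSD is exactly 1.0
a = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
b = [[1, 0, 0], [2, 0, 0], [3, 0, 0]]
```

Note that in practice the offset in this toy example would be removed by the optimal superposition, so a real comparison pipeline must superimpose first and only then apply this formula.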
In general, a prediction with an RMSD of about 6 Å is considered non-random but not
useful; RMSDs of 4-6 Å are meaningful but not accurate; and RMSDs below 4 Å are
considered good [33]. Of course, the required accuracy also depends on the purpose of the
prediction. For example, identifying the overall fold for understanding the function of a
given protein requires less precision than designing an inhibitor for a protein.
RMSD is widely used for structure comparison. The major problem with this metric is
that the two structures have to be appropriately superimposed. Finding the best superposition
is itself a hard problem, and the best alignment does not always yield the minimal
RMSD. When all equivalent parts of the proteins cannot be simultaneously superimposed,
RMSD is not a good measure. Another problem with the RMSD metric is that the significance
of an RMSD value depends on the size and type of the protein.
Another metric for measuring the accuracy of a predicted structure is the Distance
Matrix Error (DME), but we do not discuss it here.
2.6 Related Problems
The field of protein structure prediction has grown and diversified greatly since the first
attempts. Initially, researchers focused on understanding physical and chemical principles
and using these principles to simulate the folding process and obtain protein structure. While
this has not yet yielded a solution, with more and more experimental data available, researchers
now try to derive empirical rules from the data and predict new protein structures accordingly. On
the other hand, because protein structure can be viewed at different levels and different types
of proteins possess different structure features, the general problem of structure prediction
may be simplified or varied to address different prediction tasks. Here I briefly introduce
three closely related problems, which may help in understanding the problem of PSP.
• Protein folding
The PSP problem attempts to predict the native structure of a protein given its pri-
mary structure, while the protein folding problem consists in predicting the folding
process or pathways to reach the native structure. Both problems explore protein
structure, but with complementary aims. Studies of protein folding are mainly
concerned with fundamental physicochemical principles and less concerned about pro-
ducing accurate 3-d structure models. A solution to the protein folding problem will
provide a solution to the PSP problem, but knowing the final structure does not solve
the folding problem. In this sense, the protein folding problem is more complex than the
PSP problem.
Progress on the protein folding problem is definitely helping the PSP problem, because a
better understanding of the physicochemical principles of protein folding will help in
developing more appropriate energy functions for the PSP problem.
When talking about protein folding, we cannot ignore a distributed computing project:
folding@home [97]. It was launched on October 1, 2000, and is managed by the Pande
Group at Stanford University. It is designed to perform the intensive computations of
protein folding simulation. As of February 2006, more than 210,000 CPUs world-wide
were actively participating, with a total of over 1,600,000 CPUs registered with
the project.
• Secondary structure prediction
Predicting protein secondary structure is sometimes considered a sub-problem of
PSP, although it can stand on its own. The term 'protein structure prediction' in
early research (back in the 80s) often actually referred to secondary structure prediction.³
Given a protein sequence, if the secondary structure is known, the 3-d structure problem
becomes one of arranging the known secondary structure elements into the correct 3-d
structure. Some other uses of secondary structure prediction are fold recognition,
genome annotation, and predicting regions of a protein that are likely to undergo
³In old literature, the two terms are sometimes not distinguished.
structural changes. In the next chapter, we will address this very important task using
ANNs.
• Protein design problem
This problem is to identify the amino acid sequences that fold into a given native
conformation; thus, it can be considered the inverse problem of PSP. Unlike PSP, which
has only one desired solution (the native structure), the inverse problem is likely to
have many solutions, because it has been recognized that different protein sequences
may fold into very similar structures. For example, it was reported in [49] that two
non-homologous proteins, the third domain of ovomucoid and the C-terminal fragment
of the ribosomal L7/L12 protein, have very similar structures while possessing completely
different sequences.
The protein design problem on simplified lattice models has also been shown to be
NP-hard [63]. The problem is attracting active research, and researchers are asking
whether the (partial) success of the various AI techniques that have been applied to the
PSP problem could be replicated in the inverse problem.
Chapter 3
Prediction Approaches Overview
Many computational techniques have been employed for the PSP problem, to name a
few: artificial neural networks, evolutionary computation, and Monte Carlo search techniques.
To see the big picture of where and how these individual techniques are applied in the
landscape of protein structure prediction, it is useful to introduce the two main categories
of approaches, namely knowledge-based prediction and ab initio prediction. Knowledge-based
approaches rely on the existence and detection of homologous proteins
with known structure that serve as templates to model the target protein structure.
Overall, it is estimated that knowledge-based approaches can be applied to less than
half of novel proteins [74]. In many cases, given a novel protein sequence, there is no
homologous protein with known structure available in existing databases. Thus its structure
has to be modeled ab initio, which means we have to do a direct prediction based on the
sequence alone, plus known physical-chemical principles. Ab initio structure prediction
is arguably more useful than knowledge-based prediction because it can be applied more
generally. But currently ab initio prediction is very difficult and less successful. Both
knowledge-based and ab initio approaches try to predict a 3-d model of protein
structure directly, although sometimes only a simplified model. A third category of approaches
to protein structure prediction focuses on predicting intermediate structures or values
of structural features, such as secondary structure, residue distances, or contact maps,
which provide important information for aiding 3-d prediction. Compared with a full 3-d model,
predicting these 1-d or 2-d features is more tractable, and various AI techniques have been
applied with good results.
This chapter will only cover general ideas of how the above-mentioned approaches work
to find protein native structures; it is organized by the different categories of approaches.
Analysis of selected AI techniques involved in these approaches will be discussed in the next
chapter. PSP is a complex problem, and the classification of approaches is not itself the focus of
this chapter; others might reasonably classify some approaches differently, as many overlap
or share characteristics. Our intent is to structure the presentation to give the big picture
for the individual techniques discussed in Chapter four.
3.1 Knowledge-based Prediction
Homology (comparative) modeling and fold recognition (threading) are the two major
knowledge-based approaches. In these approaches, we do not have to care about the folding
mechanics of a protein; we make use of the large amount of available sequence and structure
data by comparing, analyzing, and inferring from it. This is an example of a scientific problem
that can be (partially) solved in practice without first obtaining a complete understanding
of the protein folding process as it occurs in nature.
The major difference between comparative modeling and threading lies in whether a
homologue of the target protein can be found through mere sequence alignment. If
sequence comparison cannot find a template, the threading approach has to be tried.
3.1.1 Homology (Comparative) Modeling
Based on the principle that significant sequence similarity implies similarity in 3-d structure,
homology modeling first identifies a protein evolutionarily related to the target protein
through sequence alignment, then builds the 3-d model of the target protein using the known
structure of the related protein as a template. The basic assumption of homology modelling
is that the target and the template have identical backbones. The task is then to correctly
place the side chains of the target and build loop regions. To build side chains, molecular
dynamics simulations or other techniques can be applied. In more detail, homology modeling
comprises the following four steps.
1. Select the template. This is facilitated by searching databases with programs like
BLAST, FASTA, etc. If no such template exists, homology modeling is not applicable
and other approaches need to be used.
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 23
2. Construct a sequence alignment of the target protein and the template protein. The
aim of this step is to match each residue in the target sequence to its corresponding
residue in the template structure, allowing for insertions and deletions.
3. Build the model based on the target-template alignment. When the sequence alignment
is good, use the template structure directly as the target structure, replacing
the side chains of the residues that differ; a subsequent optimization step then
takes care of the side-chain interactions. When the target-template sequence similarity
is low, first build the backbone, then place the side chains, and finally optimize the
entire structure. Some of these techniques need a large amount of computational time
and user expertise.
4. Refine the model. Additional adjustments may be needed. Various methods exist for
this optimization stage, such as packing and energy calculations.
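Step 2 rests on pairwise sequence alignment. As an illustration of the underlying dynamic programming idea (not the actual BLAST/FASTA machinery, and with a deliberately crude +1/-1 scoring scheme instead of real substitution matrices and affine gap penalties), a minimal Needleman-Wunsch global aligner can be sketched as:

```python
# A minimal Needleman-Wunsch global alignment, the dynamic programming idea
# behind target-template alignment. Scoring: match +1, mismatch -1, gap -1.

def align(s, t, match=1, mismatch=-1, gap=-1):
    n, m = len(s), len(t)
    # F[i][j] = best score aligning s[:i] with t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # substitution
                          F[i - 1][j] + gap,       # gap in t
                          F[i][j - 1] + gap)       # gap in s
    # traceback to recover one optimal alignment
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), F[n][m]
```

For example, align("GAT", "GCAT") returns ("G-AT", "GCAT", 2), inserting one gap to match the three identical residues.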
The accuracy of homology modelling clearly depends on the degree of target-template
sequence identity. With high levels of identity (70%), homology-derived models can be as
accurate as experimentally derived ones. But if the identity is only about 30% or less, the
model built on the alignment would probably be completely wrong.
So far comparative modeling is still the most accurate approach in solving PSP, but it
is limited by the absolute need for a related template structure.
3.1.2 Fold Recognition (Threading)
If a highly similar sequence with known structure cannot be found, a new protein may
still be structurally similar to some protein with known structure; in this case, the two
proteins are said to be remotely homologous. Fold recognition is aimed at identifying the
remote homologue from a collection of candidate folds. If such a fold template exists,
threading is used to provide a sequence-structure alignment between target sequence and
template structure, rather than mere sequence alignment as in homology modeling. In
actual operation, the two tasks are usually handled together: given a collection of potential
fold templates, for each template, the query sequence is threaded onto the known structure
template. It then follows an assessment of how well the query sequence fits each structure
template (sequence-structure compatibility) using some scoring function. Threading can
be no-gap or gapped, where gapped threading allows gaps in the match of sequence to
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 24
fold. The scoring function can be either amino acid structural propensities 1321 or mean-
force (statistical) potentials [42]. To speed up the process, other techniques have also been
proposed. In [43], profile-based sequence alignments are used to align the query sequence
and the sequence of the candidate template. Feed-forward neural networks are then used to
score the structural similarity of the two proteins. Kernel methods have also been applied
to detect remote homology, with good results [43].
Various fold recognition methods generally share four components:
• A library of possible structural templates.
• A scoring function that distinguishes better threadings.
• An efficient algorithm that searches all possible alignments of the target sequence with
every possible fold in the library. Computing the optimal gapped alignment is an NP-complete
problem if the scoring function takes pair interactions into consideration; in
these cases, approximations or heuristics need to be used.
• A method to assess significance when selecting the best template candidate.
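To make the scoring component concrete, here is a toy sketch of gapless threading: the query sequence is placed directly onto a template's contact map and scored with a pairwise contact potential. The two-letter potential, the contact map, and the candidate sequences are invented for illustration; real methods use gapped alignments and far richer scoring functions.

```python
# Toy gapless threading: score sequence-structure compatibility by summing
# a pairwise contact potential over a template's contact map.

def thread_score(sequence, template_contacts, potential):
    """Sum pairwise energies for residues landing on template contacts."""
    return sum(potential.get((sequence[i], sequence[j]), 0.0)
               for i, j in template_contacts)

# Two-letter (H/P) potential: only H-H contacts are favorable.
potential = {("H", "H"): -1.0}
contacts = [(0, 3), (1, 2)]           # assumed template contact map

# Pick the best-scoring (lowest-energy) threading among candidate sequences:
best = min(["HPPH", "HHHH", "PPPP"],
           key=lambda s: thread_score(s, contacts, potential))
```

Here "HHHH" wins with a score of -2.0, since both template contacts become favorable H-H pairs; in a real system the score would also feed a significance assessment before a template is accepted.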
Success of this approach also depends on the degree of similarity between the known
and modeled structures.
3.2 Ab initio Prediction
Ab initio prediction approaches are those that do not rely on known 3-d structures; rather,
they are based on Anfinsen's "thermodynamic hypothesis" [1], which asserts that the native
structure of a protein corresponds to its minimum free energy state. Accordingly, many ab
initio prediction methods are formulated as optimization problems and are computationally
intensive. If this category of methods works well, it can identify not only the in vivo structures
of natural proteins but also the structures of arbitrary polypeptides in arbitrary environments.
Therefore ab initio prediction is significant not only for the new proteins that cannot be
modeled with knowledge-based methods, but also for drug design. However,
compared with knowledge-based methods, ab initio prediction is less successful, and the models
produced are not very useful yet - limited to short proteins and coarse models.
In the ab initio prediction category, there are roughly four major approaches: dynamic
modeling, energy minimization, specific protein structure prediction, and other approaches.
In this section we briefly discuss dynamic modeling and energy minimization. Specific
protein structure prediction refers to the structure prediction of specific types of proteins,
e.g. transmembrane proteins; it generally needs more specific domain knowledge. "Other
approaches" refers to hybrid approaches or those that are hard to classify. Many
of them achieve good prediction results, e.g. the building block approach [88] and the Rosetta
program, but like specific protein structure prediction, they lack generality. We do not cover
them here.
Most research on ab initio approaches focuses on improving the energy function and the
search techniques to achieve faster or more accurate prediction; examples can be found
in [85, 57]. More about energy functions and search techniques will be discussed in Chapter
four.
3.2.1 Dynamic Modeling
Dynamic modeling uses molecular dynamics (MD) simulation to obtain the protein native
structure. Assuming our description of all forces at the atomic level is accurate, then given
any conformational state of a protein system, we should be able to calculate the forces that
the atoms in the system exert on each other and where each atom is moving. Following the
trajectory of the system, eventually the system will rest in its lowest energy state, which
corresponds to the native conformation of the protein.
However, there are two problems with this approach. Firstly, we do not have an accurate
description of all forces at the atomic level. There are approximate models available, e.g.
empirical potentials or quantum-mechanical formulas, but they are not accurate enough.
Secondly, dynamic modeling often encounters the limits of computational power. In the
dynamic system, while one atom moves under the influence of all the other atoms, the other
atoms are also in motion; in other words, the force fields are constantly changing. Thus, we
need to constantly recalculate the forces between each pair of atoms and their positions in
very small timesteps. In principle, this requires n² calculations in each timestep, where n is
the number of atoms in the protein and its surrounding environment. Because the timestep
must be chosen small enough to avoid discretization errors (usually on the order of 10^-14
s, which is the same timescale as bond formation) and the number of timesteps, and thus
the simulation time, must be large enough to capture the effect, the calculation becomes
huge. In fact, the need to recalculate the forces is the main bottleneck of this method. So
far, we can only simulate a very short span of this dynamic process, on the order of
nanoseconds to microseconds, which is far from enough because proteins fold on the
timescale of milliseconds or longer [82].
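The scale of this bottleneck is easy to estimate; the atom count, timestep, and folding time below are illustrative assumptions only:

```python
# Rough arithmetic behind the MD bottleneck: ~n^2 pair forces per timestep,
# with timesteps of ~1e-14 s needed over a full folding timescale.
# All three input values are illustrative assumptions.

n_atoms = 10_000        # protein plus surrounding solvent (assumed)
timestep = 1e-14        # seconds, small enough to avoid discretization error
folding_time = 1e-3     # seconds, a typical folding timescale

steps = folding_time / timestep          # number of timesteps needed (~1e11)
force_evals = steps * n_atoms ** 2       # pairwise force evaluations (~1e19)

print(f"{steps:.0e} timesteps, {force_evals:.0e} pairwise force evaluations")
```

Even under these modest assumptions, on the order of 10^19 pairwise force evaluations are needed, which is why brute-force MD cannot yet reach folding timescales.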
Normally dynamic modeling simulations require a full atomic description of the protein
and a detailed energy function.
3.2.2 Energy Minimization
Because it is believed that the native state of a protein corresponds to its minimum free
energy state, if we can find the minimum energy state on the energy landscape, we can
obtain the native conformation. The energy landscape of a protein is the variation of its
free energy as a function of its conformation, owing to the interactions between the amino
acid residues. As shown in Figure 3.1, this energy landscape usually has a funneled shape
which leads towards the native state. For a realistic-sized protein, the energy landscape is
very complicated because it has many parameters and an enormous number of local
minima.
Figure 3.1: A hypothetical energy landscape exhibiting a folding funnel
In general, energy minimization approaches comprise the following three components
[58]: a representation of protein geometry, a potential energy function that can distinguish
between favorable and non-favorable structures, and a search technique to explore
the conformational space. In each of the components, large approximations are required
because of the complexity of the problem; different computational approaches differ in which
simplifications are made. A brief discussion of each component follows. More details
about protein representation and energy functions can be found in Chapters two and four.
For protein representation, because an all-atom model of a protein is computationally
expensive, a simplified protein representation is often adopted. Simplifications include
methods using one or a few atoms per residue, as well as lattice representations of proteins.
Computational analyses of the PSP problem have shown that it is intractable even
on the simplest 2-d HP lattice models [6]. Simplified models cannot actually give 3-d structure
predictions of real proteins, but they are inexpensive to use while capturing many global
aspects of protein structures; thus current research in the energy minimization approach mainly
focuses on simplified models.
Formulating a good energy function is always important, yet difficult. Approximate
energy functions include atom-based potentials from molecular mechanics packages such
as CHARMM [51] or AMBER [17], statistical potentials of mean force derived from many
known structures of proteins, and simplified potentials based on chemical intuition.
Given an energy function, many intelligent search techniques have been applied to
improve the sampling and the convergence of the search, such as Monte Carlo methods, simulated
annealing, and evolutionary computation. Take the Monte Carlo method as an example. To
minimize a given energy function, take a small conformational step and calculate the free energy
of the new conformation. If the free energy is reduced compared to the old conformation
(i.e. a downhill move), then the new conformation is accepted, and the search continues
from there. If the free energy increases, then a nondeterministic decision is made: the new
conformation is accepted if the Metropolis test is positive. These search methods are sometimes
coupled with the use of other structural information or multiple processors to achieve
better results. For example, an interesting approach [81] uses a Monte Carlo optimization
of a statistical energy function to assemble the whole protein model from relatively short
building blocks. These candidate blocks are obtained from known protein structures using
energetic, geometric, or sequence similarity filters. While the energy function issue needs to
be addressed primarily by biochemists, the search for an optimal or near-optimal solution
attracts the research attention of computing scientists.
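The Metropolis test mentioned above accepts an uphill move with probability exp(-ΔE/kT). A minimal sketch on a toy one-dimensional energy function follows; for a real protein, the random step and the quadratic energy would be replaced by conformational moves and a potential energy function:

```python
# Monte Carlo minimization with the Metropolis acceptance rule, applied to
# a toy 1-d quadratic "energy landscape" (illustration only).

import math, random

def metropolis_minimize(energy, x0, steps=10_000, step_size=0.1, kT=1.0, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)  # small trial step
        e_new = energy(x_new)
        delta = e_new - e
        # downhill moves always accepted; uphill with probability exp(-dE/kT)
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy energy landscape with its minimum at x = 2
x_min, e_min = metropolis_minimize(lambda x: (x - 2.0) ** 2, x0=10.0)
```

The occasional acceptance of uphill moves is what lets the search escape local minima, which a pure downhill (greedy) minimizer cannot do on a rugged landscape.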
To conclude, to use the energy minimization approach to investigate protein structure
prediction, we need to pick a representation of protein geometry, an appropriate energy function,
and a search technique. Not all combinations of choices will work well. For
example, the commercial energy packages CHARMM and AMBER are not suitable as
fitness functions for evolutionary algorithms, partly because they examine atomic interactions
and the energy computation per generation is too expensive [33]. General problems with
CHAPTER 3. PREDICTION APPROACHES OVERVIEW 28
this category of approaches are: expensive computation; energy functions that may lack
a strong physical basis; and failure to converge to the correct result. Note also that,
since most research in this approach focuses on simplified models, the results are more
conceptual than capable of producing working 3-d structure models.
3.3 Structural Features Prediction
Structural features of a protein include secondary structure, inter-residue distances, disulfide
bond formation, etc. Structural feature prediction maps these structurally measurable
features onto an amino acid sequence. The predicted elements can be used to provide
constraints for tertiary structure prediction methods or as part of the prediction process. For
example, the results of secondary structure prediction have been integrated into many tertiary
structure prediction approaches. Compared with predicting a complete 3-d structure, structural
feature prediction is smaller in scale and difficulty, and its results contribute increasingly
significantly to the final goal of predicting the full tertiary structure.
When predicting structural features, statistical or empirical approaches are normally
adopted. Examples of sequences and their corresponding known structural features are
collected from existing databases. Techniques from statistics or AI are then used to
derive meaningful relationships, which may take the form of a neural network, a set of
rules, or an analytical relationship. These are then applied to sequences of unknown
structure to predict their structural features. Among the many techniques applied,
artificial neural networks are more recent and more successful; they are discussed in
detail in Chapter four.
3.3.1 Secondary Structure Prediction
Secondary structure is a very important feature when examining tertiary structure. If
the secondary structure of a given protein sequence is known, the 3-d problem becomes
arranging the known secondary structure elements into the correct 3-d structure. In this
sense, secondary structure prediction can be considered as a sub-problem of PSP.
This prediction problem can be viewed as classifying each amino acid in a sequence into
one of the three classes of secondary structure: H (helix), B (strand), and C (coil).
Among the many different techniques used in secondary structure prediction, ANNs have
proven successful. One of the first attempts to achieve over 70% prediction accuracy was
PHD [70], using a sliding window and a standard 3-layer neural network that was trained
on a carefully selected set of proteins. ANNs have also been used successfully in PSI-
PRED [43] and in [64]. ANN-based methods can now achieve a Q3 (discussed in Section
4.3.3) accuracy of almost 80%.
More discussion on neural networks applied in secondary structure prediction will be
provided in Section 4.4. Recently, Kernel methods have been applied and also perform well
in accuracy.
Chapter 4
AI Techniques for PSP
While Chapter three gives a big picture of various approaches to the PSP problem, this
chapter focuses on selected AI techniques involved in those approaches. As discussed in
the previous chapters, protein structure prediction is a very complex problem and the search
space is not fully understood, so it cannot be addressed purely analytically. This is why
many AI techniques have long been applied to it. Among them, I am particularly interested
in those inspired by biological systems, especially evolutionary computation,
artificial neural networks and Lindenmayer systems.
When humans try to solve problems, looking at Nature's solutions has always been
a source of inspiration. Two powerful natural problem solvers are the human brain and
the evolutionary process [26]. Trying to design problem solvers based on human brains
leads to the field of neuro-computing; the evolutionary process forms the basis of evolutionary
computing. Although not as powerful as ANNs and evolutionary computing, Lindenmayer
systems, also biologically inspired, have found many applications in the computing world.
For the PSP problem, evolutionary computation is used as a population-based search
technique, mainly in the ab initio prediction approach. It represents an intelligent way of
searching for an optimal solution, and it is generally applicable: whenever there is some
reasonable method for scoring candidate solutions to a problem, evolutionary computation
can be applied. Lindenmayer systems, a novel generative encoding scheme for capturing
protein structures in lattice models, have been tested in evolutionary algorithms, but further
research is needed to investigate their applicability to the PSP problem. Artificial neural
networks are most successful in secondary structure prediction. For humans, a large memory
of stored examples can serve as the basis for intelligent inference. For the PSP problem,
ANNs infer meaningful relations
between primary sequences and secondary structures from a selected dataset.
4.1 Evolutionary Computation
Evolution and intelligence are closely related; Evolutionary Computation (EC) is considered
a subfield of Computational Intelligence by the IEEE Computational Intelligence Society. If a
system can adapt its behavior and evolve itself to meet certain goals in certain environments,
it is an intelligent system. By imitating the evolutionary process on computers, EC mimics
the intelligence associated with the problem-solving capabilities of evolution. In
real life, evolution creates very robust organisms; on computers, EC often produces good
solutions to hard problems.
Broadly speaking, EC refers to any biologically inspired, population-based search
technique that involves iterative development of solutions, such as ant colony optimization.
Narrowly speaking, EC refers to Evolutionary Algorithms (EAs), a family of
computational models inspired by Darwin's theory of evolution. EAs solve hard
computational problems by simulating the evolutionary processes of inheritance, mutation,
recombination and selection to evolve a good solution to a problem. As such, an EA
represents an intelligent way of searching for a near-optimal solution.
For ab initio prediction of protein structure, even for the simple HP lattice model,
the problem is proven computationally intractable [19]. Consequently, there is
much interest in effective techniques that can discover reasonably good solutions within an
acceptable time. Evolutionary computation was first applied to the PSP problem in the early
90s, with noticeable success. Beyond the PSP problem, the basic technique is both broadly
applicable and easily tailored to many bioinformatics problems; [33] is a good reference
for evolutionary computation in bioinformatics in general. In the late 90s and early 00s, some
researchers began to pursue multi-objective evolutionary approaches to the PSP problem, as
in [20, 24].
In this section, we first give an introduction to EAs in general. We then discuss how
EAs have been used in various approaches to the PSP problem, followed by a discussion of
some important issues that arise when applying EAs to the PSP problem.
4.1.1 Introduction to Evolutionary Algorithms
How does an evolutionary algorithm work? Generally, an EA manipulates a population of
individuals, each representing a single possible solution to the problem under
investigation. The EA starts with an initial population of n randomly generated solutions,
and a fitness value is then calculated for each solution using the fitness function of the
problem. Individuals with better fitness scores represent better solutions to the problem. After
this initialization, the main iterative cycle of the algorithm begins. Using certain variation
operators, the n individuals in the current population produce a number of children. The
children are then assigned fitness scores as well. Then, according to some selection criteria,
a new population of n individuals is selected from the current population and their children.
This new population becomes the current population and the iterative cycle is repeated
until some condition is met.
The above generic framework can be summarized as follows:
1. Initialize population of candidate solutions and evaluate each of them.
2. Select some of the population to be parents.
3. Apply variation operators to the parents to produce children.
4. Evaluate the children and include them into the population.
5. Repeat from step 2 until some termination condition is met.
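The five steps above can be sketched as a minimal, self-contained EA. The problem (maximize the number of 1-bits in a binary string, the classic "OneMax"), the operators (binary tournament, one-point crossover, point mutation) and all parameter values here are illustrative choices, not prescriptions from any study cited in this chapter.

```python
import random

def evolve(fitness, length=20, n=30, generations=100, p_mut=0.05):
    # Step 1: initialize a random population (fitness is evaluated on demand).
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        children = []
        for _ in range(n):
            # Step 2: parent selection by binary tournament.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            # Step 3: variation -- one-point crossover, then point mutation.
            cut = random.randrange(1, length)
            child = [1 - g if random.random() < p_mut else g
                     for g in p1[:cut] + p2[cut:]]
            children.append(child)
        # Step 4: evaluate the children and merge; survivor selection keeps
        # the n fittest of parents + children (elitist truncation).
        pop = sorted(pop + children, key=fitness, reverse=True)[:n]
    # Step 5 is the loop bound: stop after a fixed number of generations.
    return max(pop, key=fitness)

# OneMax: the fitness of a bit string is simply its number of 1-bits.
best = evolve(sum)
```

Every design choice in this skeleton (selection scheme, crossover point, mutation rate, stopping rule) is one of the factors discussed in the following subsections.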
While the basic computational framework is quite simple, it is the design and implementation
details that significantly affect the performance of EAs. There are no general
guidelines for choosing a specific design or implementation for different problems; recent
theory suggests the search for an "all-purpose" algorithm may be fruitless [26]. Thus the
choice of implementation is often based on experience or on trial and error. Some important
factors determining performance are the representation of individuals, the variation and
selection operators, and the fitness evaluation.
Representation
In evolutionary computing, representation is the translation of the problem space into
encodings that can be used for evolution, i.e., representing individual candidate solutions in a manner
that can be manipulated by evolutionary operators. Some commonly used representations
are: binary representations (gray coding can be used to ensure that consecutive integers
always have Hamming distance one); integer representations; real-valued or floating-point
representation; and permutation representation.
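The gray coding property mentioned above can be illustrated with the standard binary-reflected construction, g = b xor (b >> 1); a small sketch:

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count('1')

# Consecutive integers always have Hamming distance one in Gray code,
# so a one-bit mutation never has to jump across a large numeric gap.
assert all(hamming(to_gray(i), to_gray(i + 1)) == 1 for i in range(256))
```

This is why Gray coding is preferred over plain binary when small genotypic changes should mean small phenotypic changes.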
Traditionally, different types of EAs have been associated with different representations.
For example, Genetic Algorithms (GAs), the most widely known type of EAs, often use
fixed-length binary strings, while the finite state machine representation is often associated
with Evolutionary Programming. But there is no restriction on what representation to
use for a particular problem or algorithm. For example, since binary encoding is often
inappropriate, current GAs also use non-binary representations such as integer
strings, or even more general representations such as tree and matrix structures.
Thus, the best strategy is to choose a representation to suit the problem under
investigation and then choose variation operators to suit the representation. Selection
operators use only fitness and are therefore independent of representation.
Not only do individuals need to be represented; so does the population. Different types
of population representation can be seen in the literature. Two popular ones are the
single population and structured populations. In a single population, any individual may be
mated with any other. In a structured population, the population is decentralized into many
sub-populations, and thus the algorithm is decentralized. Greater performance is often achieved
using structured populations, but the implementation complexity is also greater.
Variation
Variation operators act on one or two parent individuals to produce offspring. They create
the necessary diversity of the population and heavily influence how effectively the algorithm
explores the search space.
Two types of variation operators are mutation and recombination. Mutation can be
viewed as single-parent production: a new individual is created by a random and slight
change from one parent. Thus mutation is always stochastic. Recombination, also called
"crossover" in evolutionary computing, can be viewed as two-parent production (or more
than two parents): each pair of parents selected is recombined to produce (a pair of) children.
Variation operators obviously have to match the given representation; e.g., binary and
real-valued representations have different variation techniques applied to them. For specific
problems, standard operators can be considered, but it may be more beneficial to design
operators that take advantage of domain knowledge.
Variation operators often have probability rates associated with them. These probabilities
are parameters of the algorithm and must be set beforehand; in actual problems,
we often need to tune them to find a reasonable setting for the problem under
investigation. A very small mutation rate may lead to premature convergence in a local
optimum, while a mutation rate that is too high may lead to loss of good solutions. There are
theoretical, but not yet practical, upper and lower bounds that can help guide the tuning of
these parameters.
Selection
As in natural selection, the selection operator in evolutionary computing applies evolutionary
pressure and is responsible for driving improvement of the population. As opposed to
variation operators, which act on individuals, selection operators work at the population level. In
EAs, selection is based on fitness scores and is applied either when choosing individuals to
breed children (parent selection) or when choosing individuals to form a new population
(survivor selection).
There are different selection methods and selection can be deterministic or probabilistic.
Because selection only considers fitness information, it works independently from the actual
representation. Therefore, selection methods are universally applicable to different problems
and representations. Popular and well-studied selection methods include roulette wheel
selection and tournament selection. In roulette wheel parent selection, each individual is
assigned a sector of a roulette wheel proportional to its fitness, and the wheel is spun
to select a parent. Tournament selection, by contrast, does not require global knowledge of
the fitness of the population; it requires only an ordering relation that can rank any two
individuals, and thus looks at relative rather than absolute fitness.
Most selection schemes are designed to allow a small portion of less fit solutions to be
selected, which helps maintain the diversity of the population and prevents premature
convergence on a local optimum.
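Both selection methods can be sketched directly. `pop` and `fit` are placeholder names for a population and its fitness function; the roulette wheel sketch assumes non-negative fitness values.

```python
import random

def roulette_wheel_select(pop, fit):
    """Spin a wheel whose sectors are proportional to fitness."""
    total = sum(fit(ind) for ind in pop)
    spin = random.uniform(0, total)
    acc = 0.0
    for ind in pop:
        acc += fit(ind)
        if acc >= spin:
            return ind
    return pop[-1]          # guard against floating-point round-off

def tournament_select(pop, fit, k=2):
    """Pick k individuals at random and return the fittest; this needs only
    an ordering between individuals, never whole-population fitness."""
    return max(random.sample(pop, k), key=fit)
```

Note the structural difference: the roulette wheel needs the total fitness of the population, while the tournament only compares the k sampled individuals.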
Fitness evaluation
The fitness of each individual is evaluated by a fitness function. A fitness function can be
viewed as a particular type of objective function that quantifies the goodness of a solution, or
in terms of a fitness landscape that shows the fitness of each possible individual.
An ideal fitness function correlates closely with the algorithm's goal and yet can be
computed quickly. Speed of execution is very important, since the evolutionary cycle must
be iterated many times before producing a usable result for a non-trivial problem.
EA variants
Some specific versions of EAs often addressed in the literature are listed as follows.

• Genetic algorithm (GA) - Initially proposed as an adaptive search technique [38], the GA
is the most widely known type of EA. Typically, candidate solutions are represented
by binary strings called chromosomes. The operators used in GAs reflect those found
in natural reproduction, namely mutation and crossover.

• Evolution strategy - Individuals are often represented as tuples of real values which,
compared to GAs, are closer to the natural problem representation. The main variation
operator used is mutation; mutations are usually introduced as Gaussian perturbations.
Evolution strategies have been successfully applied to many engineering applications.

• Evolutionary programming - Looks at evolving computer programs. Fogel [34] proposed
using the processes present in natural evolution to design intelligent agents,
these agents taking the form of computer programs, which in turn were represented
as finite state automata. These agents could then be used for prediction, control, or
classification tasks.

• Genetic programming - Individuals are in the form of computer programs, and their
fitness is measured by their ability to solve a computational problem.
These variations of EAs share an underlying framework but differ in the nature
of the particular problems to which they are applied and in implementation details.
Given their similarity, implementation details such as representation
and variation operators are often borrowed from one type of EA by another, so there is no
clear distinction between them. In the literature on computational biology problems, however,
GAs are used much more frequently than other EAs.
4.1.2 Evolutionary Algorithms for PSP
Evolutionary algorithms were first applied to the PSP problem in the early 90s, when Dandekar
and Argos conducted a series of studies [21, 22, 23]. Since then, many researchers have used
EC techniques in various approaches to the problem. Most commonly, EAs, usually genetic
algorithms, are applied in the ab initio prediction approach. EAs have also been applied to
secondary structure prediction; one example is [91], in which a GA was used to
supervise an artificial neural network predicting secondary structures.
In this section, we mainly discuss EAs applied in the ab initio prediction approach, in which
the PSP problem is cast as an optimization problem: the conformational space is
searched for the structure with the lowest free energy.
As discussed in the general settings of EAs above, designing an EA for the
ab initio PSP problem involves decisions on the following major issues:

• a protein representation;

• mutation and recombination operators for effective exploration of the conformation space;

• individual selection policies;

• a molecular interaction model (energy function) with which individual fitness will be measured.
In the following sections, we discuss these issues and survey common practices dealing
with these issues. Of course, for a full specification of an EA used in PSP problem, other
issues, for instance, the population size, termination criteria, the probability rates of muta-
tion and crossover, etc., have to be considered and specified to produce an executable EA.
We will briefly discuss some of these issues in the "Discussion" section. But largely, our
discussion focuses on the major issues, mostly at a conceptual rather than an executable
level.
Representation
In the literature, EAs have been applied to both off-lattice and lattice models. For each
type of model, structure representations can be further categorized as follows:

    Off-lattice: dihedral angles; distance matrix
    Lattice: Cartesian coordinates; internal coordinates (absolute direction, relative direction)

Figure 4.1: Classification of (direct) structure representations
All these representations use direct encoding of the folded chain, i.e., how each amino acid
(or other structural unit) along the protein chain is arranged in space is directly described.
Recently, some researchers proposed a completely different representation scheme for lattice
proteins: L-systems [27]. L-systems do not encode protein structures directly, but they can
generate directly encoded structures; thus they are a generative encoding scheme, discussed
in the next section. Here we give an overview of the direct representations that
can be used with EAs.
Off-lattice representation For EAs used on an off-lattice protein model, an individual
solution can be encoded in the dihedral angle representation, as in [22]. Because the
main degrees of freedom determining a protein's 3-d conformation are the two dihedral angles
φ and ψ on either side of each Cα atom, a protein conformation can be represented as a
vector of these angle pairs along the main chain: [(φ1, ψ1), (φ2, ψ2), ..., (φn, ψn)]. This
representation can be easily converted to Cartesian coordinates of Cα atoms; the conversion
formula can be found in [35]. The dihedral angle representation also has the advantage of
preserving well-predicted local segments, since local fragments of the structure are encoded
contiguously: when the crossover operator is applied, well-predicted secondary structure
segments are more likely to be kept and inherited by the next generation. The dihedral
angles themselves can take real-number values. Alternatively, because dihedral angles are
restricted to certain ranges of values, they can be discretized and each discrete dihedral
angle encoded as an integer, or as a bit string as in [22]. In practice, the range of these
angles can be further bounded through preprocessing to reduce
the size of the conformational space.
Another type of off-lattice representation usable with EAs was given in [62], which
introduced a distance matrix representation of residue positions. A distance matrix
contains the distance between every residue pair, and the Cartesian coordinates can be
inferred from the distance matrix.
Lattice representation For EAs used on lattice models, individual structures can be
represented using Cartesian coordinates [89] or, more commonly, an internal coordinates
representation [60, 18, 77].
In the Cartesian coordinates representation, each vertex in the lattice has a set of
coordinates; thus a protein conformation on a 2-d lattice can be encoded as a vector of
coordinates [(x1, y1), (x2, y2), ..., (xn, yn)], where (xi, yi) is the Cartesian coordinates of
the vertex occupied by the ith amino acid. A 3-d lattice requires three coordinates per
amino acid.
In internal coordinates representation, the location of one amino acid is specified in terms
of its previous one on the protein sequence. Thus, a protein conformation can be represented
by a direction list expressing a sequence of moves. Obviously this representation depends
on the particular lattice topology considered. Internal coordinates representation can be
further classified into two major schemes: absolute and relative.
The absolute scheme, as studied in [48], uses an absolute direction reference system with
respect to which moves are specified. Take the 2-d square lattice as an example (the
extension to other lattices is straightforward): the four absolute directions North, South,
East and West are the natural reference system. Using this reference
system, a conformation can be expressed as a sequence S ∈ {N, S, E, W}^(n-1), where n is
the length of the protein sequence (the location of the first amino acid is fixed). Thus the
very simple 6-residue conformation shown in Figure 4.2(a) below can be expressed as
S_absolute = ENESE. In the relative direction scheme [60, 77], the reference system is not
fixed; each move is specified relative to the direction of the previous move rather than
relative to the absolute axes defined by the lattice. Again taking the 2-d square lattice as
an example, three directions, Forward, Right-turn and Left-turn, suffice to specify each new
move relative to the previous one; thus a conformation can be expressed as a sequence
S ∈ {F, R, L}^(n-1) (the first move is always Forward). The example structure in
Figure 4.2(a) is then expressed, in this reference system, as S_relative = FLRRL.
Figure 4.2: (a): A very simple 6-residue conformation is represented in absolute direction as ENESE, in relative direction as FLRRL. (b) and (c) show two possible arrangements after a point mutation at the 3rd residue position.
The relative direction representation scheme has the advantage of guaranteeing that all
solutions are at least 1-step self-avoiding, since there is no "back" move. Self-avoidance (no
clash between chain elements) is the basic condition for a valid lattice conformation.
A comparative study [48] shows that this representation scheme is
almost always better than the absolute encoding of directions for the square and cubic
lattices.
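The two encodings can be made concrete with small decoders on the 2-d square lattice. Positions are held as complex numbers (real part = x, imaginary part = y), and the initial heading for the relative scheme is taken to be East; these conventions and all names are illustrative, not from the cited studies.

```python
ABSOLUTE_MOVES = {'E': 1, 'W': -1, 'N': 1j, 'S': -1j}

def decode_absolute(s, start=0):
    """Turn an absolute-direction string into a list of lattice positions."""
    pos = [start]
    for move in s:
        pos.append(pos[-1] + ABSOLUTE_MOVES[move])
    return pos

def decode_relative(s, start=0, heading=1):
    """Turn a relative-direction string (F/L/R) into lattice positions."""
    pos = [start]
    for move in s:
        if move == 'L':
            heading *= 1j     # rotate heading 90 degrees left
        elif move == 'R':
            heading *= -1j    # rotate heading 90 degrees right
        pos.append(pos[-1] + heading)
    return pos

def self_avoiding(pos):
    """Valid conformations occupy each lattice site at most once."""
    return len(pos) == len(set(pos))

# The 6-residue example of Figure 4.2(a): both encodings give the same shape,
# and the conformation is self-avoiding.
assert decode_absolute('ENESE') == decode_relative('FLRRL')
assert self_avoiding(decode_relative('FLRRL'))
```

A decoder like this is also where the collision check mentioned above would hook in: any conformation failing `self_avoiding` must be repaired or rejected.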
One problem when using these representations is that some mechanism needs to be in
place to ensure the encoded structure is collision-free, which means the representation has
to observe geometrical constraints to be valid. More discussion about general constraint
handling in EAs for PSP is given later (see 4.1.2 - Other design issues).
Variation operators
When designing variation operators, it is obvious that they have to match the protein
representation.
We first discuss variation operators of EAs on 2-d lattice models. An early study of the
use of EAs on the 2-d square lattice model was [89], in which Genetic Algorithms were
investigated and protein conformations were encoded as actual lattice coordinates. In this
study, mutations were implemented by a rotation of the structure around a randomly selected
coordinate. Unlike most GAs applied to other problems, in which the mutation rate is kept
low, they found that, for protein structure prediction on simple lattice models, a higher rate
of mutation is beneficial. Crossover was implemented by swapping a pair of selected parent
structures at randomly selected cutting points. On a square lattice, there are three possible
orientations by which two fragment structures can be joined. All three possibilities were
tested in order to find a valid, collision-free one. In the study, a quality control mechanism
was introduced to the recombination process by requiring the fitness value of the child con-
formation to be not worse than the average fitness of its parents. This was implemented
by performing a Metropolis test comparing the energy of the child to the average energy of
its parents. If the child conformation was rejected, new parents had to be selected. This
study also demonstrated that the performance of the EA approach, at least on simple models,
was better than that of Monte Carlo-based approaches.
If protein conformations are encoded not as actual lattice coordinates but in internal
coordinates, the effect of mutation operators depends on the specific representation used.
Consider the effect of a one-point mutation on the structure in Figure 4.2(a). We know from
the previous section that, in the relative direction representation, this 6-residue lattice
conformation is S_relative = FLRRL. A mutation at the 3rd position could produce either
S'_relative = FLFRL or S''_relative = FLLRL, shown in Figure 4.2(b) and (c)
respectively. However, if the structure in (a) is expressed in the absolute direction
representation as S_absolute = ENESE, then to produce the same conformations as in (b) and
(c), all three position values from the 3rd position onward have to be mutated; the
corresponding representations are S'_absolute = ENNEN and S''_absolute = ENWNW respectively.
We can see from this example that a one-point mutation in the relative direction repre-
sentation produces a rotation effect in the structure at the mutated point. To produce the
same effect in the absolute direction representation, a multiple-point mutation is needed,
i.e., all the position values beyond the mutation point need to be simultaneously mutated to
produce the same change in the structure. Conversely, a one-point mutation in an absolute
direction representation leaves the orientation of the rest of the structure unchanged;
to achieve the same effect in the relative representation, changes at two subsequent position
values are needed.
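This asymmetry can be checked directly with small lattice decoders (repeated here so the sketch is self-contained; the complex-number convention and the East initial heading are illustrative assumptions, as before):

```python
ABSOLUTE_MOVES = {'E': 1, 'W': -1, 'N': 1j, 'S': -1j}

def decode_absolute(s, pos=0):
    out = [pos]
    for move in s:
        out.append(out[-1] + ABSOLUTE_MOVES[move])
    return out

def decode_relative(s, pos=0, heading=1):
    out = [pos]
    for move in s:
        if move == 'L':
            heading *= 1j     # left turn
        elif move == 'R':
            heading *= -1j    # right turn
        out.append(out[-1] + heading)
    return out

# A one-point relative mutation (3rd symbol of FLRRL) corresponds to an
# absolute-encoding change at every position from the 3rd onward:
assert decode_relative('FLFRL') == decode_absolute('ENNEN')   # Figure 4.2(b)
assert decode_relative('FLLRL') == decode_absolute('ENWNW')   # Figure 4.2(c)
```

The assertions confirm the worked example: a single relative-direction flip rotates the whole tail of the chain.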
As for the crossover operation, most studies use a cut-and-paste type, but reference [68]
presented an interesting deviation. They investigated GAs on lattice-based models. The
mutation was introduced as a Monte Carlo step, in which each move changed the local
arrangement of short (2-8 residue) segments of the protein chain. The crossover operation
was performed by averaging two selected parents: first the parents were superimposed on
each other to ensure a common frame of reference and then the locations of corresponding
structural elements in each parent were averaged to produce a child structure that lay in
the middle of the two parents. A refitting step was then required in order to place the child
structure back within lattice coordinates. In the study, this new GA implementation was
compared to Monte Carlo search and to a standard GA, and shown to be more effective
than standard GA implementations; the superiority of both GA methods over MC
search was also demonstrated.
The above discussion is on lattice proteins. For dihedral angle off-lattice representation,
a simple way to introduce a mutation is to change the value of a single dihedral angle. This
can be done in two ways: allowing only a small change in the value, or allowing complete
random assignment of the dihedral angle value for a single amino acid. As in the relative
direction representation in lattice models, one change in a dihedral value may have a large
effect on the overall structure, because it rotates the entire arm of the structure
beyond the mutated dihedral angle, which may cause collisions between many atoms.
The crossover operator is mostly implemented as a cut-and-paste operation over the lists
of dihedral angles, as in [22]; the child structure thus contains part of each
parent's structure. As with mutation, this may also lead to collisions. Since detecting
collisions in off-lattice models is much more difficult than in lattice models, almost every
implementation needs to carefully address this issue and devise a way to handle it.
Even when the child structure resulting from the crossover operator is collision-free, it may
have another problem: being too open (not compact enough to be globular) and thus not
likely to be a good candidate for further modification. To overcome these problems, many
implementations include explicit quality control procedures applied after the
variation operators. These procedures may include several rounds of energy minimization
to relieve collisions, loose conformations, etc.
While some ordinary implementations of variation operators are shared by many studies,
the manner and order in which they are applied differ for each actual algorithm. Beyond
the regular operators mentioned above, many special operators have been devised in the
literature. We have already given the example of [68], in which a Cartesian-space
operator is used for recombination in a GA. Two more examples follow. In [77], a
specially devised operator named "partial optimization" was employed on lattice proteins.
The idea is to randomly select two non-consecutive residues of the protein,
fix their positions in the lattice, and then place some intermediate residues by evaluating
all the different possibilities for them. The conformation that gives the
best fitness is kept. The number of intermediate residues to be permuted is a user-defined
parameter named the partial optimization size. Another example is a rotation operator,
designed in [48], which is actually a mutation operator that flips a part of the folded chain
along a certain symmetry axis.
Fitness functions
It cannot be over-emphasized how important the fitness function is to the prediction result.
The fitness of each solution must be an accurate reflection of the problem or else the evolu-
tionary process will find the right solution for the wrong problem. Defining an appropriate
fitness function can be challenging in any evolutionary algorithm.
In almost all EA approaches to the PSP problem, the fitness function adopts some form
of potential energy function. This makes the design of the EA fitness function easier,
because many energy functions already exist, but it also makes it hard to distinguish the
performance of the energy function from that of the EA itself. The
wide variety of energy functions used in EAs ranges from the hydrophobic
potential of the HP lattice model to much more detailed energy models such as CHARMM (see
2.4). Because it is very easy to incorporate and modify energy functions within the
framework of an EA, many researchers develop their own energy function
terms to suit their specific needs; thus the energy functions used in EAs are very varied. In
this section, we survey some typical energy functions used in EAs, with emphasis
on the simple HP model. Further discussion of the dilemma of energy functions
used as fitness functions is given in the Discussion section 4.1.3, 'More on energy function'.
For lattice models, the simplest energy function is that of the HP model, in which every direct hydrophobic-hydrophobic (HH) amino acid contact is rewarded, as shown in the table:

Table 4.1: Energy potential pij for the HP evaluation function

    pij    H    P
    H     -1    0
    P      0    0

The optimal structure is the one with the largest number of HH contacts for a given protein
sequence. Figure 2.1(b) shows a sequence embedded in a triangular lattice with HH contacts highlighted with curved lines. Given that each HH contact has a value of -1, as specified in Table 4.1, the conformation in Figure 2.1(b) has an energy of -4. Many EAs working on the HP lattice model use this simple energy potential to measure the fitness of individual solutions, yet it is too coarse in some cases. For instance, examine the two conformations in Figure 4.3:
Figure 4.3: (a) and (b) are different conformations but have equal energy values.
Conformation (a) is obviously closer to forming the optimal conformation than (b). But because only direct HH contacts are rewarded, the two conformations receive equal energy values under the simple energy function of Table 4.1. In other words, this function cannot effectively distinguish between some individual solutions in an EA, and will thus cause many plateaus in the energy landscape and trap the search.
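As an illustration, the simple contact potential of Table 4.1 can be sketched as follows. This is our own hypothetical helper, not code from the cited studies; it assumes a 2-d square lattice with unit spacing and a conformation given as residue coordinates:

```python
def hp_energy(sequence, coords):
    """Contact energy of a lattice conformation under the simple HP model.

    sequence -- string over {'H', 'P'}, one character per residue
    coords   -- list of (x, y) lattice vertices, one per residue

    Each pair of H residues that are adjacent on the lattice but not
    consecutive in the chain contributes -1 (Table 4.1); all other
    pairs contribute 0.
    """
    energy = 0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 2, n):                      # skip chain neighbours
            if sequence[i] == 'H' == sequence[j]:
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # adjacent vertices
                    energy -= 1
    return energy
```

For a four-residue chain folded into a unit square, an H at each end forms a single non-consecutive HH contact, giving an energy of -1.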
There are ways to avoid the trap. One remedy is to augment the energy function with a distance-dependent HH potential, as proposed in [48]. Since the distances between amino acids form a countable set, it is possible to construct a distance-dependent potential that preserves the ranking of conformations in the standard HP model while enabling a finer distinction between conformations with the same number of HH contacts. For example, if dij is the distance between two hydrophobic amino acids Hi and Hj, reference [48] gave a modified energy potential as follows:

where NH is the number of hydrophobic amino acids in the sequence, and k = 4 for the square lattice and k = 5 for the triangular and cubic lattices. It was also suggested that the modified energy formulation is especially effective for hybrid EAs that use a local search method.
Another remedy was proposed in [77]. The "radius of gyration" (RG) is used to estimate the compactness of a set of amino acids: the more compact a conformation is, the smaller its radius of gyration. The hope is that, by integrating RG into the fitness function, the fitness landscape changes so that more compact conformations with the same number of HH bonds are rewarded, bringing the evaluation closer to reality.
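The radius of gyration itself is straightforward to compute. The sketch below is our own illustration (a 2-d coordinate layout is assumed); a fitness function could, for example, add a small multiple of this value to the contact energy, though the weighting is a design choice and not prescribed by [77]:

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the residues from their centroid.

    A more compact conformation has a smaller radius of gyration, so
    adding a (small) multiple of this value to the HP contact energy
    rewards compactness among conformations with equal HH counts.
    """
    n = len(coords)
    cx = sum(x for x, _ in coords) / n          # centroid x
    cy = sum(y for _, y in coords) / n          # centroid y
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in coords) / n)
```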
The above simple energy function for the HP model can be extended in various ways, either to fit more complicated lattice models or to account for more detailed energy terms. In [69], the charge property of amino acids is taken into consideration: amino acids are classified into four types (hydrophobic, positively charged, negatively charged, or neutral) rather than just two, and the energy potential table expands to 4 x 4 accordingly. In addition, different degrees of polarity or hydrophobicity for different amino acids can be used to make the energy function more detailed, in the hope of yielding conformations closer to the native ones. Examples of such functions can be found in [35].
For off-lattice models, a very simple energy function is an adaptation of the lattice HP function to off-lattice environments. The energy function can take into account only the distances between interacting residues, which can be calculated using the empirical mean distance between consecutive residues in proteins. An optimal interaction potential equivalent to the lattice interaction potential for neighboring hydrophobic residues occurs at unit distance¹. Smaller distances are penalized to enforce steric constraints, i.e., to avoid residue clashes. In [35], one version of the total-energy calculation is provided as:

where E is the total energy of a conformation, eij is the energy potential between residues i and j, dij is the distance between residues i and j, γ and ε are constant parameters, and pij is the interaction potential according to Table 4.1.

¹This distance is roughly 3.8 Å and can be set as the unit distance. The distance between a pair of interacting residues can then be calculated using this distance and angular values.

For the dihedral-angle off-lattice representation, the total energy is generally calculated as the sum of several energy potentials, typically of the form shown in Section 2.4, in which various bonded and non-bonded potentials are calculated. The popular
CHARMM force field is in this category of energy functions. Another example is the function used in [22], in which small helical proteins were successfully folded using a GA. The fitness (energy) function took into account bad clashes, secondary structure formation, tertiary structure formation, hydrophobic burial, and hydrogen bonding. Normally, this category of energy function is a linear sum of several energy terms. In the interesting energy function used in [61], however, the terms were normalized and then multiplied rather than added. This ensures that all the terms have reasonable values, since even one bad term can significantly affect the total score.
One more special type of energy function adopted in EAs for PSP uses empirically derived contact potentials for amino acid interactions. A contact potential describes the energy between two residues close enough to each other (typically ≤ 6.5 Å). In [53], such a contact potential was determined for all pairs of amino acid types using 1168 known structures. These potentials are then used in a function similar to that of Section 2.4 to calculate the total energy.
Other design issues
Prevention of premature convergence on undesired solutions: These undesired solutions are often local minima. It is common that, during successive generations, one or a very few solutions take over the population. Once this happens, the rate of evolution drops dramatically: crossover becomes meaningless and advances are achieved only by mutations, at a very slow rate. Several approaches have been suggested to avoid this situation, including temporarily increasing the mutation rate until the diversity of the population is regained; isolating unrelated sub-populations and allowing them to interact with each other whenever a given sub-population becomes frozen; and rejecting new solutions that are too similar to solutions already in the population.
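The last of these mechanisms can be sketched as a crowding-style acceptance test. This is an illustrative fragment of our own, assuming conformations encoded as equal-length move strings; the function name and threshold are hypothetical:

```python
def too_similar(candidate, population, min_distance):
    """Diversity test: reject a candidate (encoded as a move string)
    when its Hamming distance to any existing population member
    falls below min_distance. Equal-length encodings are assumed."""
    for member in population:
        if sum(a != b for a, b in zip(candidate, member)) < min_distance:
            return True
    return False
```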
Geometrical constraints: Like many practical problems, the PSP problem is constrained. Two types of constraints must be enforced to define a feasible conformation: the connectivity of the chain and a collision-free conformation.
Many implementations use internal-coordinates representations to implicitly handle the first constraint (the off-lattice dihedral-angle representation is in fact a kind of internal-coordinates representation). As for the second constraint, for off-lattice models it means that some torsion-angle ranges are not allowed and residues must not collide; for lattice models it means that the conformational path has to form a self-avoiding walk in the lattice. Thus, not all possible individuals represent valid solutions. From one perspective, this provides extra information the EA can use to narrow down the search space; from another, it adds extra dimension(s) to an already high-dimensional problem, and may make the search more difficult to handle.
Generally speaking, constraint handling in EAs is not straightforward, because the variation operators (mutation and recombination) are typically "blind" to constraints. That is, even if the parents satisfy some constraints, there is no guarantee that the offspring will satisfy them as well. In [26], several ways of handling constraints in EAs are introduced at the conceptual level:
- Use penalty functions to reduce the fitness of infeasible solutions; the fitness may be reduced in proportion to the number of constraints violated, or to the distance from the feasible region.
- Use mechanisms that take infeasible solutions and "repair" them to the closest feasible ones.
- Use a problem-specific representation alphabet, plus suitable initialization, recombination, and mutation operators, such that the feasibility of a solution is always ensured.
These constraint-handling methods have all been employed in various EAs for the PSP problem. In [44] and [77], penalty functions are used to measure the extent to which the constraints are violated. Infeasible solutions are allowed, but they are assigned a lower fitness value through a penalizing term. In [18], an alternative was explored: a repair procedure maps infeasible solutions to feasible conformations, and the evolutionary operators are designed to be closed in the feasible space. There are also techniques designed for particular representations; e.g., to enforce the collision-free constraint on a lattice model with the absolute-coordinates representation, one simple way is to mark lattice vertices as free or occupied.
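A minimal penalty term of this kind might simply count multiply-occupied lattice vertices. The sketch below is our own illustration; the weight is a hypothetical tunable parameter, not one taken from [44] or [77]:

```python
def collision_penalty(coords, weight=2.0):
    """Penalty proportional to the number of violated self-avoidance
    constraints, i.e. lattice vertices occupied by more than one
    residue. Added to the energy of an infeasible conformation so
    that it remains in the population with reduced fitness."""
    collisions = len(coords) - len(set(coords))   # duplicated vertices
    return weight * collisions
```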
Human intervention in EAs: How much human intervention should be involved in assisting the algorithm? This is a question for any EA. One can choose to preset only some probability parameters and leave all other aspects of the evolving process to random decisions, or one can incorporate more domain knowledge to guide and assist the algorithm. For the PSP problem, in practice, domain knowledge is often incorporated into the algorithm to improve prediction accuracy. One way is to first predict secondary or supersecondary structures, then use the results as constraints during the EA search; e.g., rather than choosing crossover points totally at random, the EA can choose hot spots selected on the basis of preserving secondary structure. Another way is to include experimentally derived structural information, such as the existence of S-S bonds or conserved hydrophobic residues, in the prediction scheme to improve prediction quality. For example, in [5], distance constraints derived from NMR experiments were used to help a genetic algorithm calculate protein structure.
4.1.3 Discussion
In this section, we discuss some general and conceptual issues raised by the use of EC for the PSP problem.
Suitability of EA in PSP problem
Evolutionary computation, according to [33], is both an effective and a computationally efficient search strategy. It has the advantages of ease of use, general applicability, and success in finding good solutions to difficult high-dimensional problems. In particular, EAs are useful when: 1) the problem search space is large, complex, or poorly understood; 2) domain knowledge is scarce or difficult to encode to narrow the search space; 3) no mathematical analysis is available; or 4) traditional search methods fail. Except for the second case, the PSP problem falls into all of these categories. Moreover, many studies have demonstrated that, as a general search method, EAs do show superiority over other methods such as Monte Carlo search. This suggests that the PSP problem is suited to EAs. This is interesting, since an EA works at the population level, i.e., many individuals mix and interact to evolve a good individual, whereas a protein molecule folds individually at the single-molecule level, not by mixing different proteins at the population level. Reference [90] gave an explanation, suggesting an interesting view of EAs as compatible with the protein folding pathway: although EAs do not simulate the actual folding pathway of a single molecule, we can regard the many solutions in the EA system not as different molecules but as different conformations of the same molecule. Each individual solution can then be considered a point on the folding pathway of the single molecule, which examines and evolves itself using the variation and selection operators.
Adaptive and dynamic nature of EAs
Evolutionary computation is, by nature, a dynamic and adaptive process. Thus, when applying EAs to practical problems, this nature should be given due consideration, on three levels.
First, the essence of the EA's adaptive nature should be taken into consideration when modeling the problem. Initially, the GA, the most popular form of EA, was conceived by Holland as a means of studying adaptive behavior, as suggested by the title of the book collecting his early research, "Adaptation in Natural and Artificial Systems". In later studies, however, perhaps because EAs generally perform well in searching for optimal solutions, they have largely been treated as optimization methods. In fact, there are many ways to view EAs, as pointed out in [26]: not only as problem solvers, but also as a basis for competent machine learning, as creative computational models, or as a guiding philosophy. So far, EAs have been applied to the PSP problem only as an optimization search tool. Perhaps future research on the PSP problem will model it differently and combine macro-level evolution and micro-level protein folding in a creative way.
On the second level, when we consider the EA as an effective search tool for the PSP problem, we should bear in mind that EAs are adaptive and there is no single best EA across all problems [33]. The PSP problem can be formulated differently or focused on different types of proteins. Algorithm components should therefore be developed so that they are tuned to the formulation at hand, rather than simply forcing the problem into a particular version of an EA.
On the third level, setting algorithm parameters for a particular EA, it is suggested in [26] that using rigid parameters whose values do not change during the run of the EA goes against its adaptive and dynamic nature. Globally, there are two major forms of setting parameter values: parameter tuning and parameter control. Parameter tuning is the commonly practised approach in which the values of parameters (population size, mutation rate, etc.) are set before the run of the algorithm and remain fixed during the run. However, a run of an EA is an intrinsically dynamic, adaptive process. It is intuitively obvious, and has been empirically and theoretically demonstrated in [26], that different parameter values may be optimal at different stages of the evolutionary process. For instance, large mutation steps can be good in the early generations, helping the full exploration of the search space, while small mutation steps may be needed in the late generations to locate the desired global optimum. Thus we need dynamic parameter control. For the mutation example, one possible solution is to allow a range of mutation step sizes, from small to large, during the evolutionary process and let the EA control its own parameters. This leads to the idea of self-adaptation. Self-adaptation can be achieved by associating each individual with an additional vector that provides instructions on how best to mutate it; it is also natural to use two EAs: one for problem solving and another for tuning the first. But there is not much research along these lines for the PSP problem yet.
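The simplest alternative to fixed parameters is deterministic parameter control, where the schedule depends only on the generation counter. The linear schedule and the values below are our own illustration, not taken from [26]:

```python
def mutation_step(generation, max_generations, start=1.0, end=0.05):
    """Deterministic parameter control: the mutation step shrinks
    linearly from `start` (broad exploration in early generations)
    to `end` (fine-grained exploitation near the end of the run)."""
    fraction = generation / max_generations
    return start + (end - start) * fraction
```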
Variants of EAs applied to PSP
Among the variants of EAs, genetic algorithms are still the predominant form used for the PSP problem. But it was pointed out in [33] that crossover, the main variation operator in GAs, is largely ineffective for protein structure prediction, and that other variants, especially evolution strategies, which emphasize mutation, should be more extensively investigated.
In the literature, memetic algorithms have also been applied to the PSP problem. A memetic algorithm is a hybrid evolutionary approach that uses a standard EA in conjunction with local search. The additional localized searches conducted in a memetic algorithm generally result in a significant improvement in the fitness of the best solution found.
Another research direction is the multi-objective formulation of the PSP problem. Historically, ab initio prediction has been approached as a single-objective optimization problem, but recently some researchers have reformulated it as a multi-objective optimization problem. An early example is [24], in which a multi-objective evolutionary algorithm (MOfmGA) was used for the structure prediction of two small proteins (5 and 14 residues, respectively). Following this idea, Cutello investigated medium-size proteins (46-70 residues) with promising results, and further conjectured, and partially verified by experiments, that the PSP problem is better modeled as a multi-objective optimization problem [20]. Their approach considers the local interactions (bond energy) and non-local interactions (non-bond energy) among atoms to be the main forces directing the formation of the protein native state, and is based on the intuition (or fact) that the two kinds of interaction are in conflict: the typical characteristic of a multi-objective optimization problem.
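Under such a formulation, selection compares individuals by Pareto dominance over the two energy objectives rather than by a single scalar. A minimal dominance test (our own sketch, for minimization of (bond, non-bond) energy pairs) is:

```python
def dominates(a, b):
    """Pareto dominance for minimization: energy vector `a`, e.g.
    (bond energy, non-bond energy), dominates `b` when it is no
    worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))
```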
More on energy function
In ab initio structure prediction, the two key aspects of the problem, the energy function that must discriminate the native structure from the many non-native ones, and the search algorithm that must identify the conformation with the lowest energy, are both fraught with difficulties [90]. Furthermore, difficulties in each aspect reduce progress in the other. Until we have a search method that can identify the lowest-energy solutions for a given energy function, we cannot determine whether the conformation with the minimal calculated energy coincides with the native conformation. On the other hand, until we develop an optimized energy function, we cannot verify that a particular search method is capable of finding the minimum of that specific function. That is, evaluating the performance of the search tool and evaluating the performance of the associated energy function are entangled, and making a distinction between them is hard. This is a dilemma in PSP research. When discussing EAs for PSP, the same problem arises, and to make things worse, in almost all EA implementations the energy function is also used as the fitness function of the EA, making the distinction between the energy function and the search algorithm even more difficult. It was suggested in [90] that, at least for algorithmic design and analysis purposes, it is possible to detach the issues of the search from those of the energy function, by using a simple model where the optimal conformation is known through full enumeration of all conformations, or by tailoring the energy function to specifically prefer a given conformation. But there is not much research along these lines yet.
Another issue concerning energy functions is that complex energy models can be parallelized for more efficient calculation. This is often done in knowledge-based approaches to the PSP problem, as well as for EAs in ab initio prediction. A significant reduction in convergence time can be achieved either by distributing a single evolving population over a number of machines or by allowing different machines to compute independently evolving populations. Many practical EA implementations for PSP have adopted parallel computation. Conceptually, this matches the nature of evolution, because evolution itself is a parallel process.
Possible future improvements
Despite the conceptual and technical suitability of EAs for the PSP problem, their success has been moderate, and most research focuses on lattice models. What kinds of improvements might be made to EA methods to improve their performance? One obvious aspect is improving the energy function. While this is a common problem for all prediction methods, an interesting possibility to explore within the EA framework is to distinguish between the fitness function used to guide the production of the emerging solution and the energy function used to select the final structure. In this way it might be possible to emphasize different aspects of the fitness function in different stages of folding.
Another possibility, as suggested in [90], is to introduce explicit "memory" into the emerging substructures, such that substructures that have been advantageous to the structures harboring them gain some level of immunity from change. This can be achieved by biasing the selection of crossover points to respect the integrity of successful substructures, or by making mutations less likely in these regions. It seems that the PSP problem is too difficult for a naive "pure" implementation of EAs; the direction to go is to take advantage of the ability of the EA approach to incorporate various types of considerations when attacking this problem.
GAs are still the predominant EA used for the PSP problem. It was pointed out in [33] that crossover, the primary reproduction mechanism used in GAs, is largely ineffective for protein structure prediction, and it was suggested that evolution strategies and evolutionary programming, which place emphasis on mutation as a reproduction mechanism, should be explored for the PSP problem.
Finally, a long-term effort should be made to better integrate the adaptive and dynamic nature of evolutionary computing at the various levels of approaching the PSP problem: in modeling the problem, in developing algorithm components, and in setting algorithm parameters. Both conceptual models and technical implementations need to be explored.
4.2 L-system Representation of Protein Structure

As discussed before, ab initio prediction approaches to the PSP problem often use simplified lattice models to study protein structure. On 2-d or 3-d lattices, the folded structures are usually represented using a direct encoding of the coordinates of every residue on the folded chain (see 4.1.2 - Representation).
Recently, a few researchers have proposed using Lindenmayer systems (L-systems) to capture protein structures [27, 56]. After David Searls laid the groundwork for using generative grammars in biosequence analysis [78], this is a novel and interesting practice for representing folded protein structures on lattice models.
In this section, we give a short introduction to L-systems, then introduce and discuss the L-system-based encoding for lattice proteins in current research.
4.2.1 Introduction to L-systems
L-systems were developed by Aristid Lindenmayer in the late 1960s. Originally they were devised to provide a formal description of the growth patterns of simple multicellular organisms. Later, the system was extended to describe higher plants and complex branching structures.
L-systems are commonly defined as a tuple ⟨V, C, ω, P⟩, where V (variables) is a set of symbols that can be replaced; C (constants) is a set of symbols that remain fixed; ω (axiom) is a string of symbols from V + C defining the initial state of the system; and P (productions, or rewriting rules) is a set of rules defining the way variables can be replaced with combinations of constants and other variables. In addition, we use alphabet to refer to the set V + C and symbol to refer to any element of V or C.
As an example, Lindenmayer's original L-system for modelling the growth of algae is as follows. The algae consists of cells, each of which takes one of two values, a or b.

variables: a, b
constants: none
axiom: a
rules: a → ab, b → a

Successive derivations produce: a, ab, aba, abaab, abaababa, ... This pattern of growth fairly closely matched the growth patterns of the algae that Lindenmayer was studying.
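The parallel rewriting just described is easy to state in code. The following is our own sketch of a D0L derivation; symbols without a production are treated as constants and copied unchanged:

```python
def derive(axiom, rules, steps):
    """Parallel rewriting for a D0L-system: at each step, every
    symbol of the current word is replaced simultaneously according
    to `rules` (a dict mapping variable -> replacement string);
    symbols with no production are constants and pass through."""
    word = axiom
    for _ in range(steps):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word
```

For the algae system, `derive("a", {"a": "ab", "b": "a"}, 4)` yields `abaababa`.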
An L-system is context-free if each production rule has only one variable on its left-hand side. If a rule refers not only to a single variable but also to a combination of that variable and certain neighbours, the system is termed context-sensitive. An L-system is deterministic if there is exactly one production for each variable; if there are several, each chosen with a certain probability at each iteration, it is a stochastic L-system. Finally, L-systems are parametric if numerical parameters are associated with the symbols or productions. A deterministic context-free L-system is the simplest form of L-system and is popularly called a D0L-system.
Compared with traditional formal-language grammars, the major difference lies in the way production rules are applied. In formal languages, productions are applied sequentially, while in L-systems they are applied in parallel, simultaneously replacing all variables in a given word. This difference reflects the biological motivation of L-systems: productions are intended to capture cell divisions in multicellular organisms, where many divisions may occur at the same time. Another difference is that L-systems do not necessarily have non-terminals as traditional grammars do. The variables of some L-systems constitute valid words in the languages of those L-systems; in this case, although they are replaceable, the variables behave more like the terminals of traditional grammars.
4.2.2 L-system-based Encoding for Protein Structure
L-systems have been investigated for encoding lattice protein conformations only very recently [27, 56]. In this research, evolutionary algorithms are used as the inference procedure for discovering L-systems that represent target protein structures on simple lattice models. At this stage, the problem being solved is essentially: given a target structure expressed in "internal coordinates" (see Figure 4.1), find an L-system that, once evaluated, would produce the original target structure or a close match. The authors used EAs to search the space of L-systems and produced promising results for short sequences. However, there is still a long way to go before L-system-based structure representation can be used for the PSP problem or its inverse problem. We discuss this point in more detail in the discussion section.
Why a grammatical encoding?
As discussed in section 4.1.2 - Representation, protein structures on lattice models are usually represented by a direct encoding of the folded chain. One commonly used direct encoding is "internal coordinates", which represents the structure by a list of moves on the lattice. The moves can be absolute or relative; under the relative scheme, each move is specified relative to the direction of the previous move. On a 2-d square lattice, e.g., a structure S is encoded as a string of n − 1 symbols from {Forward, turnRight, turnLeft}, where n is the number of residues. See Figure 4.2(a) for an encoding example.
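The relative scheme can be made concrete with a small decoder. This is our own sketch; it assumes the first residue sits at the origin with the initial heading along the positive x-axis, and uses F/L/R for the three moves:

```python
def decode_relative(moves):
    """Decode a relative internal-coordinates string over {F, L, R}
    into 2-d square-lattice vertices. Each symbol optionally turns
    the current heading, then advances one vertex."""
    x, y = 0, 0
    dx, dy = 1, 0                       # initial heading: +x
    coords = [(x, y)]
    for move in moves:
        if move == 'L':                 # rotate heading 90 deg counter-clockwise
            dx, dy = -dy, dx
        elif move == 'R':               # rotate heading 90 deg clockwise
            dx, dy = dy, -dx
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords
```

A conformation is a valid self-avoiding walk exactly when the decoded vertices are all distinct.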
However, the length of the encoded string is essentially the same as that of the protein sequence, which makes search techniques using this type of encoding hard to scale. An L-system is a generative, rule-based scheme that specifies how to construct the structure rather than encoding the structure directly, and can thus achieve greater scalability. But this raises the question: are lattice protein structures suitable for grammatical encoding? The researchers provide their reasoning in [27], which can be summarized as: proteins exhibit regularity and repeated substructures, which is consistent with the recursive nature of L-systems, where rewriting rules lead to modular, self-similar structures. However, the researchers did not investigate to what degree proteins exhibit regularity, or whether the regularity shown in protein structures is sufficient for them to be modelled by L-systems in general. We comment on this point further in the Discussion (Section 4.2.3). Another advantage of grammatical encoding is that it is more compact and parts of the encoding are more easily reused. Specifically for evolutionary algorithms, a grammatical encoding of individuals is more suitable for crossover and for transferring building blocks between individuals.
L-system-based encoding
In this section, we briefly introduce how lattice protein structures are encoded by L-systems, based on the methods discussed in [27] and [56].

The L-system's alphabet depends on the lattice and coordinate system used. For the 2-d square lattice and relative internal coordinates, the D0L-system specification chosen in [27] is: variable set V = {0, 1, 2, ..., m − 1}, with each numeric element representing one rewriting rule; constant set C = {F, L, R}, representing the three moves Forward, Left-turn and Right-turn in the relative coordinates; the axiom ω can be any string of symbols from V + C. The number of production rules equals the size of the variable set, and each rule takes the form n → w, where n ∈ V and w ∈ (V + C)+, the set of all nonempty words over V + C.

An example L-system encoding the short lattice protein structure

RFRRLLRLRRFRLLRRFR

is as follows, with its derivation process shown in Figure 4.4.
axiom = 31
rules = {0 → 3LL2; 1 → R0RL; 2 → RRF; 3 → RFR1}

Figure 4.4: A derivation process example. Starting from the axiom, the rules are applied in parallel for several steps; a post-processing pass then removes the remaining variables, yielding the move string RFRRLLRLRRFRLLRRFR.
The maximum lengths of the axiom and rules, as well as the number of rules, are parameters of the inference algorithm (in [27], an EA) and depend on the length of the protein.
In further studies [56], knowledge of secondary structures is incorporated into the L-system-based encoding in the form of predesigned production rules. In the HP 2-d square model, a right-oriented α-helix is encoded as RRLL (represented by the variable A); a left-oriented α-helix is encoded as LLRR (represented by the variable H); a β-sheet is represented as a string of Fs (at most four). Moreover, these L-systems are parametric: numerical parameters are associated with the symbols. For example, if a structure segment in the relative-coordinates encoding is FFFF, then in the parametric L-system encoding it can be written as F4. As another instance, the 2-d lattice folding RFRRLLRLRRFRLLRRFR in relative coordinates can be rewritten in the parametric L-system as RFARLR2FRHFR, where R2 denotes RR. Thus the parametric L-system has only five symbols in its alphabet, {F, R, L, A, H}, and its rules are fixed and implicit compared with the D0L-systems discussed above.
Evolving L-system-encoded structures

The ability of L-system-based encoding to capture protein native conformations in the 2-d HP lattice model can be tested using EAs. Given a target structure in direct encoding, the EA explores the space of L-systems and evolves a set of rules that, once derived, produces a conformation closely matching the target.
The following general description of the EA used to test L-systems is based on [27]; the approach is close to grammatical evolution. Each individual L-system in the population is determined by its axiom and rewriting rules. The maximum number of rules and the string lengths for the axiom and rules are preset as parameters. For initialization, both the axiom and the rules of an individual L-system are randomly generated strings of the maximum lengths, where each symbol is selected with uniform probability from the alphabet. The recombination operator resembles uniform crossover, where the rules are interchanged. During recombination, if a selected rule in an offspring refers to a variable symbol (rule) not defined in the offspring L-system, a repair operator is used to change that variable. The mutation operators are addition, deletion or modification of a single symbol in either the axiom or the rewriting rules of an individual. For selection, linear ranking selection and elitism can be used, and a mate-selection strategy that chooses less similar parents can increase population diversity. To evaluate an individual's fitness, its L-system is derived and the Hamming distance is computed between the derived structure and the target structure. During the evolutionary process, L-systems that produce illegal (non-self-avoiding) lattice conformations are allowed, but will not be accepted as final solutions.
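The evolutionary loop just described can be sketched as follows. This is a minimal, mutation-only sketch: the recombination and repair operators of [27], as well as the self-avoidance check, are omitted for brevity, and all names and parameter values are illustrative:

```python
# Minimal EA sketch for evolving an L-system whose derived string
# matches a target folding in direct encoding. An individual is an
# axiom plus a fixed number of rewriting rules over {F, R, L, 0..3}.
import random

ALPHABET = "FRL0123"
TARGET = "RFRRLLRLRRFRLLRRFR"   # target structure in direct encoding

def derive(axiom, rules, steps=3):
    word = axiom
    for _ in range(steps):
        word = "".join(rules.get(s, s) for s in word)
    return "".join(s for s in word if s in "FRL")   # post-processing

def hamming(a, b):
    """Position-wise mismatches, counting length difference as errors."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def random_individual(n_rules=4, max_len=4):
    axiom = "".join(random.choice(ALPHABET) for _ in range(2))
    rules = {str(v): "".join(random.choice(ALPHABET)
                             for _ in range(random.randint(1, max_len)))
             for v in range(n_rules)}
    return axiom, rules

def mutate(ind):
    axiom, rules = ind
    key = random.choice(list(rules))
    body = list(rules[key])
    body[random.randrange(len(body))] = random.choice(ALPHABET)
    return axiom, {**rules, key: "".join(body)}

def evolve(pop_size=30, generations=50):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: hamming(derive(*ind), TARGET))
        elite = pop[: pop_size // 2]          # truncation + elitism
        pop = elite + [mutate(random.choice(elite)) for _ in elite]
    return min(pop, key=lambda ind: hamming(derive(*ind), TARGET))
```

The Hamming-distance fitness is the one described above; a full implementation would additionally reject non-self-avoiding conformations as final solutions.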
4.2.3 Discussion
L-systems are recursive in nature, which makes them very suitable for describing fractal-like structures. Are L-systems suitable for describing protein structures? The preliminary research [56] seems to give a positive answer by asserting that "Results confirmed the suitability of the proposed (L-systems) representation". However, experiments have also shown that some protein instances are more difficult than others to evolve an adequate L-system for [27], and that instances with high frequencies of α-helices and β-sheets have a clear advantage in their suitability to be encoded by L-systems [56]. These results show that the suitability of the new encoding scheme depends heavily on the occurrence of sub-structures and their regularity.
Although it is known that protein structures indeed exhibit some regularity and repeated
sub-structures which can be captured by L-systems, to what degree do protein structures
show regularity? And generally, is the level of modularity and repetition within protein
structures high enough for L-systems to be suitable to encode them? Current research has
not explicitly addressed these questions yet.
It is also worth noting that for 2-d lattice proteins, the proposed L-system-based encoding is not independent of direct encodings. This dependence is twofold: the alphabet of the L-systems includes all the symbols used in the direct encoding, and an L-system needs to be derived to its direct-encoding form before the structure it encodes can be evaluated. Also note that a given target structure may have various direct-encoding representations, and that various distinct L-systems could produce the same direct-encoding word [27]. Therefore, if L-systems are actually used in the PSP problem under this scheme, the advantages of grammatical encoding have to be weighed against the cost of adding a layer of complication to the encoding system.
L-system grammars have been used in many applications of evolutionary algorithms to problems in biology, engineering, and computer graphics. One example of L-systems as a powerful encoding is investigated in [46], where they represent the blood circulation of the human retina. Using L-systems to encode lattice protein conformations, as reviewed here, is very recent research: it is limited to short proteins on the 2-d square lattice model and has not been integrated into any approach to the PSP problem. However, it is a very interesting protein conformation representation scheme, and more research in this line is needed to investigate its possible application to PSP and the inverse PSP problem.
4.3 Artificial Neural Networks
As introduced in Chapter 3, protein structural feature prediction is an important category of PSP. Examples of structural features include secondary structure, residue solvent accessibility, and trans-membrane strands and helices. Although these features do not represent 3-d structure, accurately predicting them is an important step toward 3-d prediction.
²The experiment analysis in [27] shows that some sub-strings that appear several times in the folded chain (e.g. RFR) are also present as part of the evolved rules. This supports the idea that L-systems capture the naturally occurring sub-structures in lattice proteins.
³Only applying to absolute internal coordinates.
For instance, predicted secondary structures can be regarded as rigid bodies, simplifying molecular dynamics simulations; or, in ab initio prediction approaches, these predicted features represent additional information that can help guide the conformational search.
The prediction of structural features is often modeled as inferring a mapping from input amino acid sequences to some kind of output sequences. The output sequence has the same length as the input sequence, and each symbol in the output sequence describes the structural property of the residue at the same position in the input sequence. This way of modeling the problem enables the application of automatic learning methods, such as artificial neural networks (ANNs). These networks are capable of mapping between protein sequence and structure, of classifying types of structures, and of identifying similar structural features in a database. Neural networks have the advantage of making decisions from a large number of competing variables without explicit understanding of the problem. This is particularly important for the PSP problem, where the principles governing protein structure formation are complex and not yet fully understood. So far, neural network models are among the most successful approaches to predicting protein structural features, especially in secondary structure prediction.
In this section, we first give an introduction to ANNs and a basic ANN scheme for predicting structural features in general. We then focus on secondary structure prediction, to illustrate and review how ANNs are applied in this important category of prediction, and briefly introduce a few other types of structural predictions made by ANNs. We conclude by discussing some important issues raised by using ANNs in the PSP problem.
4.3.1 Introduction to ANNs
Artificial neural networks are inspired by biological neural networks, which consist of billions of biological neurons. Neurons are the basic computing units of the brain. Each neuron gathers, processes and evaluates its input signals; if the evaluated result exceeds some threshold, an action potential fires and propagates to become the neuron's output signal. Before this output signal becomes the input to the next neuron, it undergoes some processing that determines how the signal is transmitted from the output neuron to the next input neuron.
This rather simplified model of the biological neuron serves as the basis of the artificial neurons (nodes) from which ANNs are constructed. A simple scheme of a generic ANN node is shown as follows.
Figure 4.5: A generic scheme of an artificial neuron (weighted inputs pass through a threshold/activation function to produce the outputs)
In this computation scheme, a weight controls how much influence a previous node has on this node. Suppose there are n previous nodes connecting to this node; then a vector x = (x1, x2, ..., xn) represents the n inputs from the corresponding n nodes, and w = (w1, w2, ..., wn) is the corresponding weight vector. Inside the node, a weighted linear combination of all the inputs, w · x, is calculated; a threshold may then be subtracted, and the result is passed through an activation function to produce the output to the nodes connected to it. Activation functions can be of different types; the most commonly used is the sigmoid function F(x) = 1/(1 + e^(−x)).
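The computation of a single node can be sketched directly from this description. The function name and default threshold are illustrative:

```python
# A single artificial neuron as described above: weighted sum of the
# inputs, minus a threshold (bias), passed through a sigmoid activation.
import math

def neuron(x, w, threshold=0.0):
    """x and w are equal-length input and weight vectors."""
    activation = sum(xi * wi for xi, wi in zip(x, w)) - threshold
    return 1.0 / (1.0 + math.exp(-activation))      # sigmoid F
```

With a zero weighted sum the sigmoid returns 0.5; large positive or negative sums saturate toward 1 or 0.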
ANNs can take various architectures. Normally, nodes are arranged into layers. The
inter-layer connections can be divided into two kinds: feed-forward and feed-back. A feed-
forward network has only unidirectional connections and signals propagate only forward
from the input layer to the output layer. In feed-back networks, a layer can be connected
to the next layer or any of the previous layers, thus signals can travel in both directions,
causing loops in the network. Feed-back networks are dynamic and very powerful, but can
get very complicated.
While the connections are hardwired, the weights between nodes can be adjusted by the network during the training process. The idea of network training is to find, or learn, the weights that fit the training data, so that the learned network can be applied to new data. There are two learning paradigms: supervised and unsupervised. Supervised learning is the method commonly used in structural feature prediction.
In supervised learning, the ANN is repeatedly presented with a set of training samples with known results; its task is to adjust the weights based on these samples. First, the network takes the input values of one sample and computes an output using initially random weights; then the observed output value is compared with the known value for that sample, and an error adjustment is back-propagated (see the following subsection) to the weights, so that the next time the sample is presented, the observed output is closer to the desired output. Repeating this for all samples in the training set constitutes one epoch. The process is then repeated for a second epoch, a third epoch, and so on, until the network manages to reduce the output error for all samples to an acceptably low value. At this point the training is stopped and all weights are settled; the trained ANN can be applied to new data or, if needed, a test phase can begin to determine the validity, or prediction accuracy, of the network.
In unsupervised learning, the network is not presented with the desired output. It must
learn the weights without being able to measure its result and minimize its error. In such an
unsupervised scheme, nodes compete for the opportunity to update their weights, resulting
in self-organization. Generally, unsupervised ANNs are used for finding interesting clusters
within the data.
Error minimization
In supervised learning, the weights have to be adjusted so that the error between the desired output and the actual output is reduced. The best-known algorithm for this weight optimization is back-propagation. For node i, the difference between the observed output o_i and the desired output d_i is called the error,

e_i = d_i − o_i.

The sum of squared errors is then

E = (1/2) Σ_i (d_i − o_i)²,

where i runs over all output nodes. By calculating the gradient of the error function, the adjustment of a weight w_ij is

Δw_ij = −η ∂E/∂w_ij,

where η is the learning rate. Then each weight is updated as

w_ij ← w_ij + Δw_ij.
During this process, the weights are adjusted to minimize the errors. One way of conceiving
this error minimization process is to consider each individual weight as a dimension in space.
If we could plot the value of the error for each combination of weights, we would obtain an "error surface" in multidimensional space. On one hand, the objective of network training is to find the lowest point on this error surface; as with searching for the minimum free energy on the energy landscape in ab initio prediction approaches, no algorithm can guarantee to locate the global minimum. On the other hand, network training should also avoid over-training: if an ANN is trained for too many cycles to minimize the errors, it will overfit the training data, leading to larger errors on test data.
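The gradient-descent weight update described above can be sketched for the simplest case of a single sigmoid output node and no hidden layer (the delta rule). The function names, the bias-as-first-input convention, and the learning rate are illustrative assumptions:

```python
# Sketch of the gradient-descent weight update for one sigmoid output
# node. For E = (d - o)^2 / 2, the chain rule gives the update
# w_i <- w_i + eta * (d - o) * o * (1 - o) * x_i.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_epoch(samples, w, eta=0.5):
    """One pass over (inputs, desired) pairs, updating weights in place.
    By convention here, the first input of each sample is a constant 1
    so that w[0] acts as the (negative) threshold."""
    for x, d in samples:
        o = sigmoid(sum(xi * wi for xi, wi in zip(x, w)))
        delta = (d - o) * o * (1.0 - o)      # -dE/da
        for i, xi in enumerate(x):
            w[i] += eta * delta * xi
    return w
```

Repeating `train_epoch` over many epochs drives the outputs toward the desired values for linearly separable training sets, illustrating both the epoch loop and the error-surface descent described above.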
4.3.2 A Basic ANN Scheme for Predicting Structural Features
To apply ANNs to structural feature prediction, a common approach is a multi-layer (often three-layer) feed-forward network. The following figure provides a basic scheme of these structural predictors.
Figure 4.6: A basic scheme of ANN predictors adopted from [59]
As the figure shows, the network is moved along the input sequence and computes an output vector encoding the structural class of the amino acid at the current position (Y in the figure). As it is generally assumed that the structural properties of a residue are greatly affected by its local context (neighboring residues), the input is a window of a certain number of residues centered at the currently inspected position. The architectural parameters of the network include
the number of output nodes, the number of hidden layers and nodes, the input encoding and the window size. The number of output nodes depends on the specific prediction task; for secondary structure prediction, e.g., there are often three output nodes, representing the three secondary structures: helix, strand and coil. The input encoding refers to the encoding of each input amino acid. There are two main types: orthogonal and profile-based. Orthogonal encoding takes each amino acid in the sequence as it is and usually encodes it as a binary string. Since there are 20 amino acids, each one is represented by a 20-bit binary string consisting of 19 0s and one 1, where the position of the 1 identifies the amino acid. Profile-based encoding uses the 20-dimensional profile extracted from the PSSM of a multiple alignment; more about this type of input and its advantages is discussed in the next section (for PSSMs and multiple sequence alignment, refer to Section 1.2). The input window size controls how much local context information to consider in the prediction; it usually takes an odd length so that the amino acid at the center of the window is the prediction target. Ideally, one might expect that the larger the window, the more information given to the predictor, and hence the better the performance. Unfortunately, increasing the window size also increases the possible noise, and it is observed that beyond some threshold size the signal-to-noise ratio decreases. Typical window sizes range from 9 to 25 residues [92].
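The orthogonal encoding with a sliding window can be sketched as follows. The function names and the zero-vector treatment of positions past the sequence ends are illustrative choices:

```python
# Sketch of the sliding-window orthogonal (one-hot) input encoding
# described above: each residue in an odd-sized window becomes a
# 20-bit vector, and the window is centred on the prediction target.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def one_hot(residue):
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def window_input(sequence, centre, window=13):
    """Concatenated one-hot vectors for a window centred at `centre`.
    Positions falling outside the sequence are encoded as all zeros."""
    half = window // 2
    vec = []
    for pos in range(centre - half, centre + half + 1):
        if 0 <= pos < len(sequence):
            vec.extend(one_hot(sequence[pos]))
        else:
            vec.extend([0] * len(AMINO_ACIDS))
    return vec
```

A window of size 13 thus yields a 260-dimensional input vector, which matches the scale of input layers discussed later in this chapter.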
4.3.3 Secondary Structure Prediction
The general hypothesis adopted when attempting to predict secondary structure (SS) is, firstly, that an amino acid intrinsically has certain conformational preferences due to its chemical properties; secondly, that these preferences may be modulated by the locally surrounding amino acids; and thirdly, that long-range interactions between amino acids may also play a role in forming SS. Various approaches focusing on different factors have been designed to predict an amino acid's secondary structure given the sequence context within which it is placed.
Before ANNs were first applied to SS prediction in [67], prediction methods mainly used statistical information, as in [15], or physico-chemical properties of amino acids, as in [66], to investigate amino acids' conformational preferences. These methods make predictions based only on information coming from a single residue, and the average accuracy achieved was limited to 60%. Then came many years of fruitful research on ANN-based approaches, which take the local context of each amino acid into account and have achieved 80% with the help of evolutionary profiles. While ANN-based research is still ongoing, other techniques, including Hidden Markov Models [50] and Support Vector Machines [41], have recently been applied to SS prediction, but they have not yet out-performed ANN-based methods in terms of prediction accuracy.
In the following subsections, we first introduce performance measures commonly used in SS prediction, then review the different ANN-based methods applied to this problem. These ANNs are categorized into four groups: feed-forward networks based on amino acid local interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers.
Performance measures and testing
The performance of prediction methods can be evaluated in terms of four measures: sensitivity, specificity, Matthews' correlation coefficient and the Segment Overlap score.
For the overall sensitivity measure, the most commonly used is the three-state per-residue accuracy Q3. It is defined as the percentage of correctly predicted residues out of the total number of residues, counting all three secondary conformational states: helix, strand and coil. This measure can also be used for a single conformational state, giving three other forms: Q_helix, Q_strand and Q_coil, respectively the percentage of correctly predicted helix, strand and coil residues. Note that this accuracy measure does not convey many useful types of information: e.g., it does not say where the errors are, or in what way the prediction failed. Nevertheless, it is commonly used to compare the performance of SS predictors.
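These definitions translate directly into code. The state symbols H, E and C and the function names are illustrative conventions:

```python
# The three-state per-residue accuracy Q3 and its single-state
# variants, as defined above, computed from equal-length predicted
# and observed strings over the states H (helix), E (strand), C (coil).
def q3(predicted, observed):
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def q_state(predicted, observed, state):
    """Percentage of residues observed in `state` predicted correctly."""
    in_state = [(p, o) for p, o in zip(predicted, observed) if o == state]
    return 100.0 * sum(p == o for p, o in in_state) / len(in_state)
```

For example, the prediction HHEC against the observed HHCC scores Q3 = 75%, while Q_coil = 50%, illustrating how the per-state measures localize the errors that Q3 alone hides.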
The Q index is based on individual residues: the measure of the prediction of one residue is relatively independent of that of its neighbors. But secondary structure is composed of a segment, or a collection of segments, of consecutive residues. To reflect this nature of protein structure, measures should concentrate on how well entire secondary structure elements are predicted rather than individual residues. Thus, the SOV (Segment Overlap) measure was proposed by Rost et al. in [71]; on the web site [101], this measure was modified and fully described.
Another useful measure of prediction accuracy for each of the three types of secondary structure can be calculated using Matthews' correlation coefficient [67]. For the α-helix, e.g., it is:

Coefficient = (pn − uo) / sqrt((p + u)(p + o)(n + u)(n + o))

with p being the number of residues which are true positives (correctly positively predicted), n the number of true negatives, o the number of false positives, and u the number of false negatives. The correlation coefficients range from +1 (totally correlated) to −1 (totally anti-correlated), and the values for the three types of secondary structure can be combined into a single figure by calculating their geometric mean.
Moreover, systematic testing of performance is needed, often done by cross-validation. In k-fold cross-validation, the original samples are partitioned into k subsets. Of the k subsets, one is retained as the validation set for testing the model, and the remaining k − 1 subsets are used as training samples. The cross-validation process is repeated k times, with each of the k subsets used exactly once as the validation data. The error on the cross-validation set can then be used to stop the training when it begins to increase. What is a good value for k? According to [76], the exact value of k is not important provided that the test set is representative and comprehensive and the cross-validation results are not misused to again change parameters. The requirements for the cross-validation process are also addressed in [76].
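The k-fold partitioning can be sketched as follows; the round-robin assignment of samples to folds is an illustrative choice (stratified or random assignment is also common):

```python
# Sketch of k-fold cross-validation as described above: partition the
# samples into k subsets, then rotate which subset is held out for
# validation while the other k-1 are used for training.
def k_fold_splits(samples, k):
    folds = [samples[i::k] for i in range(k)]      # k disjoint subsets
    for i in range(k):
        validation = folds[i]
        training = [s for j, f in enumerate(folds) if j != i for s in f]
        yield training, validation
```

Each sample appears in exactly one validation set across the k splits, which is the property the cross-validation estimate relies on.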
ANNs based on local interactions
The early ANNs are basically feed-forward networks taking into account local interactions of amino acids by means of a sliding input window with orthogonal encoding. The pioneering work was [67], in which 62.7% Q3 accuracy was reported. Their network architecture is very similar to the template given in the previous section: three fully connected layers, with the output layer consisting of three sigmoidal units representing the three SS classes. The input amino acids are encoded by 21-bit binary strings (the 21st bit specifying a gap). This sparse encoding increases the number of network parameters needed, but it has the advantage of not imposing an artificial ordering on the input data. Other network parameters, including the number of input and hidden nodes and the window size, were explored thoroughly in their work. One feasible arrangement could be: 357 input nodes, 5 hidden nodes, and 3 output nodes, resulting in 1,808 weights. The 3 output nodes correspond to the 3 types of secondary structure; the 357 input nodes allow for a segment of 17 amino acids, i.e., an input window size of 17.
The number of connections, and thus the number of weights, depends mainly on the number of hidden nodes. One interesting point noted in [67] is that the performance of the network is almost independent of the number of hidden nodes: they experimented with different numbers of hidden nodes, from 0 to 40, and the test results do not show much difference in performance.
Although the accuracy reported in [67] was not much of an improvement over other prediction methods, this early work led to subsequent years of successful research on ANNs in SS prediction.
This type of ANN, based on single sequences and local windows, seemed to achieve a prediction accuracy of at most 65-69%. Increasing the size of the window does not lead to improvements, due to the overfitting problem associated with large networks. However, some improvement was obtained by cascading the previous architecture with a second network that cleans up the output of the lower network; more on this is introduced in the subsection 'ANNs as filters and combining predictors' below.
Besides general prediction accuracy, another major difficulty of ANNs based on a window of local context is in predicting β-strands, because β-strands are determined by comparatively long-range interactions. This has been taken to suggest that about 65% of secondary structure depends on local interactions.
ANNs based on evolutionary information
The next generation of ANNs for SS prediction considers not only the information contained in the local context of the input sequence, but also information coming from homologous sequences. The rationale behind this approach is that structural features, including secondary structures, within a family of evolutionarily related proteins are more conserved than the sequences themselves. This information is processed by first doing a PSI-BLAST⁴ search for homologous sequences in databases and computing a multiple alignment of them, then extracting a matrix of profiles, the PSSM, indicating the frequency of each amino acid at each position. Each residue is thus encoded by the matrix column at the corresponding position, a vector of 20 real-number frequencies.
PHD [70, 71, 73] was one of the first ANN methods using profile-based inputs, going beyond 70% in accuracy; its researchers suggested at the same time that the power of neural networks should be fully exploited for the PSP problem. The PHD system is composed of cascading networks. The first one is that of Figure 4.6. A second one
⁴PSI-BLAST is a web-based search tool for identifying biologically relevant sequence similarities in databases. Other local alignment algorithms would also do for this task.
takes as input a window sliding over the previous outputs and refines the output of the first network. A final stage takes a jury decision, averaging the outputs from independently trained models. Although a number of techniques, including early stopping and ensembles of different networks, are used, most of the improvements achieved by PHD seem to result from the use of evolutionary profiles [73]. In [12], it was claimed that the most accurate SS prediction methods would be found using ANNs, and a system involving two neural networks was developed that achieves an accuracy of 75%. Another example of an evolutionary profile-based ANN method is PSI-PRED [43], which uses two neural networks to analyze profiles. At present, almost all profile-based ANN predictors can achieve an accuracy of about 76-78%.
Prediction using recurrent networks
Human brains are recurrent neural nets: networks of neurons with feedback connections. Recurrent networks are considered computationally more powerful than feed-forward networks. For SS prediction, although the formation of SS is mainly driven by local interactions of residues, which justifies the success of feed-forward networks with evolutionary profiles as inputs, many researchers suggest that possible long-range interactions between different regions of a sequence should also be taken into account to further improve prediction accuracy. Thus there has recently been research into recurrent architectures applied to the PSP problem.
Recurrent networks permit the state of the hidden (or output) units at the previous time step to be part of the input at the next time step, as shown in Figure 4.7. This provides the network with some memory of previous inputs, which can be used when processing current inputs. Recurrent networks are useful for modeling time-series data and the acquisition of grammar. One feature shared by protein structure and sentence structure is the inherently sequential nature of the structures: as sentence structure is based on sequential characters, protein structure is based on a primary sequence that begins at the N-terminal and ends at the C-terminal. The other common feature is the possible non-sequential long-distance dependencies existing in the structure; one example in protein structures is the formation of a β-sheet by several strands located far apart along the sequence. Feed-forward networks can hardly capture this long-range dependency, which is why the prediction accuracy for β-sheets using feed-forward networks is generally lower than that for helices.
Figure 4.7: Sketch of a recurrent network (the hidden units at time t−1 feed back, together with the inputs, into the hidden units at time t)
In [4], a bidirectional recurrent neural network (BRNN) architecture was proposed, which was further refined in [64] to predict protein secondary structure at an accuracy of about 76%. In this architecture, the prediction for the residue at position t is determined by three components. First, there is a central component associated with the local window at position t, as in standard feed-forward networks for SS prediction. The two other components are two similar recurrent networks associated with the central component. These two recurrent networks act as two "wheels" rolling along the protein chain, one from the N-terminal and the other from the C-terminal, exploiting upstream and downstream context in the sequence all the way to the point of prediction. This bidirectional recurrent network is trained with a generalized back-propagation algorithm. But because the algorithm is essentially gradient descent, the error propagation in both the forward and backward chains is subject to exponential decay, so the learning of remote information is not efficient. For SS prediction, the BRNN can use information within about ±15 residues around the residue of interest, and can hardly discover relevant information contained in more distant portions. Nevertheless, the researchers in [64] claim to have developed new algorithmic ideas that begin to address the problem of long-range dependencies in SS prediction.
There is more research based on the BRNN. In [13], segmented-memory recurrent networks were proposed to replace the standard recurrent networks in the BRNN architecture. The idea of segmented memory is based on the observation that when trying to memorize a long sequence, humans tend to break it into smaller segments first and then cascade them to form the final sequence. It is thus believed that RNNs are more capable of capturing long-term dependencies if they have segmented memory and imitate this way of human memorization. Experiments applying this idea to refine the BRNN for SS prediction indicate a moderate improvement in prediction accuracy [13].
In another research paper [9], bidirectional recurrent networks are used as filtering networks to correct the output coming from the first-stage prediction, by trying to capture valid segments of SS. In this approach, an early-stopping mechanism was used to control overfitting during the training process. The experiments showed that this approach reached good accuracy and a very high value of SOV.
Despite some good results, recurrent networks have not been fully explored for the PSP problem, because most research is based on the bidirectional recurrent architecture proposed in [4], and other network architectures or implementations for the PSP problem can hardly be found in the literature.
ANNs as filters and combining predictors
Besides serving as direct SS prediction methods, ANNs are also used as filters and for combining results from different prediction methods into a consensus meta-predictor.
Filtering examines the final predictions to make them more realistic by removing bad predictions. It is now standard in secondary structure prediction and is used in many successful methods. There are various filtering techniques, such as using if-then rewrite rules found through the machine learning method CART [99]. One of the rules specifies:

[!a, *, *, a, c] → c

with a = α-helix, c = coil, * = any, ! = not. This rule says that if the pattern on the left is matched in a prediction, then the fourth secondary structure symbol on the left is rewritten as the secondary structure on the right of the rule. Thus a predicted SS segment [b, b, b, a, c], after filtering, will be rewritten as [b, b, b, c, c].
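A rewrite-rule filter of this kind can be sketched as follows, using '!' for negation and '*' for a wildcard; the function names and the (pattern, rewrite position, new state) rule representation are illustrative:

```python
# Sketch of a rewrite-rule filter. A pattern is a list of tests over a
# window of predicted states: a literal state, '*' for any state, or
# '!x' for "not x". Where the pattern matches, the state at the
# rewrite position within the window is replaced.
def matches(pattern, window):
    for test, state in zip(pattern, window):
        if test == "*":
            continue
        if test.startswith("!"):
            if state == test[1:]:
                return False
        elif state != test:
            return False
    return True

def apply_rule(prediction, pattern, rewrite_pos, new_state):
    states = list(prediction)
    for i in range(len(states) - len(pattern) + 1):
        if matches(pattern, states[i:i + len(pattern)]):
            states[i + rewrite_pos] = new_state
    return states

# The example rule rewrites the fourth position of the window to coil:
RULE = (["!a", "*", "*", "a", "c"], 3, "c")
```

Applying `RULE` to the predicted segment [b, b, b, a, c] yields [b, b, b, c, c], reproducing the worked example above.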
The more widely used filtering method in SS prediction is to use ANNs. As early as [67], a second, structure-structure network was used to filter the outputs from the first, sequence-structure network. The inputs to the second network were a window of vectors produced by the first network, each vector being the frequencies of the three types of SS at a residue position:

Inputs: ... (0.6, 0.1, 0.4) (0.8, 0.2, 0.2) (0.5, 0.6, 0.2) ...

The structure-structure network has only three inputs per residue, which allows a much larger window size for the same number of weights as a sequence-structure network, which has to admit 20 inputs per residue. In [67], a 2% improvement in prediction accuracy was reported from using a filtering network. Adding a filtering network has now become a common approach, believed to improve both Q3 and SOV; the best performance so far achieves an accuracy of 78% and an SOV of 73.5% [9]. Generally, the filtering networks are feed-forward, but in [9] the filtering network used was a bidirectional recurrent network (BRNN). This filtering BRNN has a much simpler architecture than one based on a BRNN with profiles as input, yet when tested on the predictions of both ANN and SVM predictors, its performance on the Q3 and SOV indices is equivalent to the latter [9].
ANNs are also useful for combining predictions from several, or many, networks. In the PHD method [70], for example, a third-level network combined the predictions from 10 separate neural network systems that vary in training data and encoding schemes. The output is the prediction resulting from the arithmetic average of the 10 ensemble predictions. The network also outputs a reliability index that indicates how many of the independently trained networks agree on the prediction. A 2% improvement in predictive performance was reported. The PSI-PRED method likewise averages the output from up to four separate neural networks to increase prediction accuracy. The study of neural network ensembles, which is closely linked to the development of Bayesian neural networks, is a potential area that may further improve SS prediction accuracy.
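The jury decision and reliability index can be sketched as follows; the function name and the agreement-count form of the reliability index are illustrative simplifications of the schemes used in PHD and PSI-PRED:

```python
# Sketch of jury-decision combining: average the three-state output
# vectors of several independently trained networks, predict the state
# with the highest mean score, and report how many ensemble members
# agree with that prediction as a reliability index.
STATES = ("H", "E", "C")

def combine(member_outputs):
    """member_outputs: one (h, e, c) score triple per network."""
    n = len(member_outputs)
    mean = [sum(out[i] for out in member_outputs) / n for i in range(3)]
    winner = max(range(3), key=lambda i: mean[i])
    agree = sum(max(range(3), key=lambda i: out[i]) == winner
                for out in member_outputs)
    return STATES[winner], agree
```

A high agreement count flags positions where the ensemble members concur, which is the information the PHD reliability index conveys to the user.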
Other issues on SS prediction
Limits of accuracy: Currently, the best accuracy of secondary structure prediction is close to 80% [76]. It is arguable whether this accuracy will ever be significantly improved. There are probably three reasons for this doubt: 1) given the 3-d structure of a protein, there is no complete agreement on how to assign SS to each amino acid, especially for the amino acids located at the beginning or the end of a SS element. This is largely because secondary structure does not represent a clear-cut category of structure in nature; rather, it is a useful piece of terminology; 2) some regions of SS are not solely determined
by the local sequence, but may also be influenced by long-range interactions. Thus, without a full understanding of tertiary structure, secondary structure cannot be expected to be accurately predicted. This is why some researchers use tertiary structure information in constructing ANNs to predict secondary structure and gain some improvement in accuracy, e.g. [52]; but this approach reverses the objective of SS prediction; 3) usually in SS prediction, 3 classes of secondary structure are adopted, but the secondary structure database, the DSSP, describes 8 structure classes. A mapping from 8 to 3 classes is then needed to reduce the feature space and enable efficient computation. However, by imposing this coarser set of classes, we may impose a limit on the accuracy of SS prediction.
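One commonly used 8-to-3 reduction is sketched below. Note that studies differ in exactly how the borderline DSSP classes are assigned; the variant shown here, in which the rarer helices join H and the isolated bridge joins E, is typical but not universal.

```python
# A common mapping from the eight DSSP classes to the three-state
# alphabet. Conventions vary between studies; this is the variant in
# which G and I are merged into H, and B into E.

DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10, and pi helices
    "E": "E", "B": "E",             # extended strand, isolated bridge
    "T": "C", "S": "C", "-": "C",   # turn, bend, and other -> coil
}

def to_three_state(dssp_string):
    """Reduce a DSSP assignment string to the 3-class alphabet."""
    return "".join(DSSP_TO_3.get(c, "C") for c in dssp_string)
```

Because the prediction targets are produced by this lossy reduction, two methods trained on differently reduced labels are not strictly comparable, which is part of the accuracy-limit argument above.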
From secondary to tertiary structures: Suppose we can accurately predict secondary
structure of a protein, how would it help in constructing the protein's tertiary structure?
This problem is not trivial because the SS elements do not uniquely define the 3-d structure; other information, such as their relative distances, is needed. There are methods that attempt to derive distance constraints between amino acids on the basis of a multiple sequence alignment of proteins of the same family, like the one discussed in Section 4.3.4, but these methods are not reasonably effective yet. Even in NMR experiments, not all atomic distances can be measured, and the uncertainty of the measured values can be rather high. Thus, there is still a long way to go in reconstructing tertiary structure from secondary structure. Encouragingly, however, the accuracy of SS prediction methods has reached a respectable level for further research on this problem.
4.3.4 Other Structural Features Prediction
Besides secondary structure, there are other structural features that can help us understand and predict protein tertiary structure, such as residue solvent accessibility, cysteine bonding state, and residue long-range contacts. Since for most structural features the prediction problem can be modeled as a mapping that relates each residue in the protein sequence to a symbol describing a certain property, it is not surprising that ANNs, as automatic learning methods, find application and success in predicting many other structural features. In this section, we briefly survey a few of them.
Residue solvent accessibility (RSA) describes the relative degree to which a residue
interacts with solvent molecules. It can be described in several ways. The simplest is a two-state description: residues with greater RSA are considered exposed, and residues with lower RSA are considered buried. ANN methods have long been applied to the prediction
of RSA. As in SS prediction, the first attempts took only a single sequence as input, as in [37],
and later evolutionary profiles were used, as in [72]. Then in [65], ensembles of bidirectional
recurrent neural networks, similar to those employed in SS prediction, were investigated
and obtained good performance, showing again the ability of ANNs to exploit structural
features.
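The two-state RSA description can be made concrete with a minimal sketch. The 25% cutoff below is a typical but not universal choice, and the function name and labels are assumptions for the example.

```python
# Minimal sketch of the two-state RSA description: residues whose
# relative solvent accessibility exceeds a chosen cutoff (here 25%)
# are labelled exposed ("e"), the rest buried ("b").

def two_state_rsa(rsa_values, cutoff=0.25):
    """rsa_values: relative accessibilities in [0, 1], one per residue.
    Returns a string of per-residue labels."""
    return "".join("e" if r > cutoff else "b" for r in rsa_values)
```

Richer descriptions simply use more thresholds (e.g. a three- or ten-state discretization), turning the same regression-like quantity into a finer classification target.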
Cysteine is one of the twenty amino acids and can occur in either of two forms: oxidized or reduced. Two oxidized cysteines can pair to form a disulphide bridge, a type of covalent bond important for protein folding and stabilization. Identifying oxidized cysteines can thus help predict disulphide bridges, and the problem can be cast as a binary classification task: for each cysteine in a given protein, predict whether or not it is in a disulphide bridge. Both feed-forward and recurrent networks have been applied to this task. The program CYSPRED developed in [29] uses a neural network with no hidden nodes, fed by a window of residue positions centered at the target cysteine; an evolutionary profile is used as the input for each residue position. This method achieved 79% accuracy. In [10], an SVM method using additional domain knowledge was investigated, achieving 84% accuracy. Based on this method, the authors further added a global refinement stage using bidirectional recurrent networks, reaching 88% accuracy.
For predicting long-range contacts of residues, the basic hypothesis is that residues in
contact in a protein structure tend to mutate in a covariant manner. Thus detecting residues
mutating in a correlated manner can be taken as an indication of probable physical contact
in 3-d. There are various methods for this problem. One approach is to train neural networks
using different encoding systems for multiple sequence alignments [30]. For example, each residue pair in the protein sequence can be coded as an input vector containing 210 elements (20 x (20 + 1)/2), representing all possible unordered couples of residue types (each residue couple and its symmetric counterpart are coded in the same way), and a single output unit can code for contact versus non-contact.
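The symmetric 210-element pair encoding just described can be written down directly; the indexing formula below is one standard way of enumerating the upper triangle of the 20 x 20 pair matrix (the helper names are ours, not from [30]).

```python
# Sketch of the symmetric pair encoding described above: an unordered
# pair of residue types from a 20-letter alphabet maps to one of
# 20*(20+1)/2 = 210 vector positions, so (a, b) and (b, a) light up
# the same input element.

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acid codes

def pair_index(a, b):
    """Index of the unordered pair (a, b) in the 210-element vector."""
    i, j = sorted((AA.index(a), AA.index(b)))
    # offset of row i in the upper triangle (diagonal included),
    # plus the column offset within that row
    return i * 20 - i * (i - 1) // 2 + (j - i)

def encode_pair(a, b):
    """One-hot 210-element input vector for a residue pair."""
    v = [0] * 210
    v[pair_index(a, b)] = 1
    return v
```

Profiles generalize this immediately: instead of a one-hot entry, each of the 210 positions can hold the frequency of that residue couple across the columns of a multiple sequence alignment.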
4.3.5 Discussion
What has been described here is the application of ANNs to protein structural feature prediction, a subproblem of PSP, and in particular secondary structure prediction. This category of prediction problem can be described as a mapping problem in which we relate a sequence over an alphabet of twenty letters to a sequence over some alphabet representing structural features. This way of posing the problem enables the use of
ANNs, as well as other automatic learning techniques, to infer the relationships between
sequences and structural features by learning from known cases. ANNs perform quite suc-
cessfully in this task. The general ideas of how ANNs are applied in this task have been
introduced in previous sections. In this section, we discuss a few more issues. Some of the
issues are specific to protein structural feature prediction, while others are general problems
of ANNs.
The problem of over-fitting
The purpose of training an ANN is not to learn the training set to the highest degree of
accuracy. Rather, the aim is to generate a network that has the ability to generalize to
other unseen data. Thus a network should avoid being over-trained; otherwise, it will fit the training data perfectly while generalizing poorly, like focusing so closely on particular trees that we miss the forest. Another problem with over-training is that training data normally contain noise: if a network is over-trained, it will learn the noisy details of the training set and is unlikely to be optimal from the perspective of generalization.
Several factors have been identified that affect whether an ANN generalizes well. Examples are: 1) the ratio of network parameters to training examples, which should not be too large; 2) the number of hidden nodes. Although structural feature prediction does not necessarily require hidden nodes, most ANN designs in the literature for proteins of realistic length have hidden layers. The number of hidden nodes is usually determined by experiment: too few leave the network unable to learn, while too many lead to poor generalization; 3) the number of training iterations. With too few training iterations, the network will be unable to extract important features from the training set; with too many, it will begin to learn details of the training set so closely that it fails to abstract general features.
In practice, the above-mentioned factors can be handled in different ways. For example, the popular SS prediction method PHD [70] uses two methods to address the over-fitting problem. One is early stopping. The other is to use ensemble averages by training different networks independently, using different input information and learning procedures. Cross-validation techniques (see Section 4.3.3) are also commonly used during training to control over-training; they are effective against over-fitting but computationally expensive.
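The early-stopping idea can be made concrete with a schematic training loop. All names below are hypothetical; the point is only the control flow: training stops once validation error has not improved for a fixed number of epochs, and the best weights seen so far are kept rather than the final ones.

```python
# Schematic early stopping: halt training when validation error has
# stopped improving for `patience` consecutive epochs, and return the
# best-so-far weights rather than the weights at the last epoch.

def train_with_early_stopping(train_step, val_error, max_epochs=100,
                              patience=5):
    """train_step(epoch) -> weights; val_error(weights) -> float."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_step(epoch)
        err = val_error(weights)
        if err < best_err:
            best_err, best_weights, bad_epochs = err, weights, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # validation error no longer improving
    return best_weights, best_err
```

The validation set thus acts as a proxy for generalization: training error keeps falling, but the loop stops at the point where unseen-data error starts to rise.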
Effects of evolutionary information
The fact that proteins are evolutionarily related affects the application of ANNs to structural feature prediction in two ways. First, evolutionary information has proven useful in improving prediction accuracy: making use of it during the prediction process contributes a significant improvement [9]. This evolutionary information mainly takes the form of multiple alignment profiles. Secondly, because evolutionarily related proteins often exhibit very similar secondary structures, during network training we have to ensure that no protein homologous to those in the training set is present in the validation and test sets; otherwise the evaluation of the network is bound to be incorrect, because the network may "learn" to recognize homologous proteins and give the same answer for them, rather than recognize the features of the sequence. Thus, in practice, the protein sequences in the training, validation, and test sets normally have to be inspected to make sure that no pair shares significant similarity. Usually a threshold of about 25% sequence identity is used for this purpose.
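The redundancy filtering described above can be sketched greedily. This is a deliberately crude illustration: a real pipeline would compute identity over a proper alignment (e.g. with BLAST), whereas the naive position-by-position comparison below only makes sense for pre-aligned or equal-length sequences.

```python
# Crude sketch of redundancy filtering: greedily keep a sequence only
# if its naive (ungapped, position-by-position) identity to every
# sequence already kept is below the threshold. Real pipelines use
# alignment-based identity instead.

def identity(s, t):
    """Fraction of matching positions over the shorter sequence."""
    n = min(len(s), len(t))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(s, t)) / n

def nonredundant(seqs, threshold=0.25):
    kept = []
    for s in seqs:
        if all(identity(s, t) < threshold for t in kept):
            kept.append(s)
    return kept
```

Applying such a filter before splitting into training, validation, and test sets is what prevents the network from scoring well merely by recognizing homologues.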
About data sets
The application of ANNs depends on the use of data sets for training, validation, and testing. For ANNs used in the PSP problem, some issues concerning these data sets are worth noting.
First, as pointed out in [12], increasing the number of non-homologous proteins in the
data sets improves the prediction accuracy, because more biological information improves
the network's ability to discriminate between different types of structures, and the risk of
over-fitting is reduced. An example was given in [12]: a 4% improvement in the Q3 index was achieved using a data set of 318 non-homologous protein sequences, compared with Qian and Sejnowski's network [67], which used a data set about one third the size. This suggests that as the number of solved non-homologous protein structures increases over time, prediction based on larger data sets will be more accurate. Although it is hard to find further evidence for this conjecture in the literature, subsequent ANN approaches to SS prediction usually use larger data sets. For example, the data set in [41], published in 2001, contains 513 protein chains with low similarity, while the data set in [9], published two years later, contains 969 chains and almost 184,000 amino acids.
Secondly, it is not only larger data sets themselves that contribute to improved prediction; the data pool from which the data sets are drawn is also growing, and this contributes as well. This contribution comes from two aspects: first, as discussed before, the use of evolutionary information increases prediction accuracy, and obtaining evolutionary information is directly tied to database size and database search tools; second, larger data pools make it easier to select good-quality protein data for ANN methods.
A final issue about ANN data sets is not specific to the PSP problem but general to the ANN method: an ANN trained on certain data may produce predictions different from another ANN trained on different data. This poses problems for prediction accuracy, and it has not been addressed in the PSP setting. PSP researchers do pay attention to choosing data sets, but their attention goes to choosing mutually non-homologous proteins rather than to this issue.
Opening the black-box
While ANNs have been used successfully for the PSP problem, one major complaint about ANN predictors, especially from biologists, is that there is no explanation of why a protein structure is predicted as it is. A trained ANN is like a protein folding machine, fed with protein sequences and producing folded structural features; but this machine is a black box, an unknown function of the amino acid sequence. A trained ANN has evidently learned meaningful relationships in the training data, but these relationships are encoded as weight vectors within the network, which are difficult to interpret. Is it possible to see inside the black box? Is it possible to "fit the curve" to the data points and thus empirically derive the corresponding function from sequence to structure? If the problem is cast as fitting a function to data, many techniques from mathematics and computing science are applicable, but our discussion here focuses on extracting rules from neural networks so that these networks can be more than mere "black boxes".
Rule extraction from neural networks has been an active research topic in recent years.
Many methods have been proposed [87]. If the feed-forward net used for SS prediction has no hidden layers, the values of the weights chosen during training for each residue type and window location are themselves instructive. But most networks used for the PSP problem are multi-layer. For multi-layer networks, or other network types such as recurrent networks, rule-extraction methods vary and depend on network
architecture, training, and activation functions, but they can be roughly categorized as 'decompositional' or 'pedagogical' approaches, according to [87]. Decompositional approaches 'look inside' the network and analyze the weights between units to extract rules; some of these approaches require specialized restricted weight-modification algorithms, while others require specialized network architectures, such as an extra hidden layer of units with staircase activation functions. Pedagogical approaches do not examine the weights inside the black box, but extract rules by observing the relationship between the network's inputs and outputs; they are thus general purpose in nature and can be applied to any feed-forward network architecture.
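A toy example conveys the pedagogical idea. Everything below is hypothetical, not a published rule-extraction algorithm: the trained predictor is treated as an opaque function, and we probe it by flipping one window position at a time to see which positions actually drive the output.

```python
# Toy illustration of the 'pedagogical' approach: derive knowledge
# about a black-box predictor purely from input/output behaviour, by
# counting how often mutating a single window position changes the
# predicted class. Positions with high counts matter most to the
# implicit rules.

def position_sensitivity(predict, windows, alphabet):
    """predict: black-box function window -> class label.
    windows: list of equal-length tuples of symbols."""
    width = len(windows[0])
    flips = [0] * width
    for w in windows:
        base = predict(w)
        for pos in range(width):
            for sym in alphabet:
                if sym != w[pos]:
                    mutated = w[:pos] + (sym,) + w[pos + 1:]
                    if predict(mutated) != base:
                        flips[pos] += 1
    return flips
```

More sophisticated pedagogical methods build explicit symbolic rules (e.g. decision trees) from such query/response pairs, but the principle is the same: only the input-output mapping is consulted, never the weights.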
For the PSP problem, the rules extracted from ANN solutions should be applicable to most protein sequences and consistent with the laws of chemistry and physics. Overall, however, there has not been much research in this line yet. In [86], rules were extracted by specific modulation of the training procedure; the attempt did not improve performance, but it showed that the rules extracted from ANNs are more complicated than those available from statistical analysis. Because ANNs outperform other methods in prediction accuracy, it is worthwhile trying to extract rules from black-box ANNs: this improves the comprehensibility of the solutions without losing the accuracy of the black boxes.
Chapter 5
Summary
In order to understand the function of a protein, it is important to know its structure. This
report deals with the determination of protein structure using computational methods, especially AI techniques inspired by biological systems.
The structure of a protein may be described at four major levels. Protein structure
prediction operates primarily at the level of the secondary and tertiary structure. The
fundamental principle underlying all the methods is Anfinsen's experimentally justified hypothesis that the protein sequence contains sufficient information to specify the final 3-d structure [1].
While the problem remains largely unsolved, researchers have made good progress by resorting to various simplified models and trying various approaches. Some common simplifications are: focusing only on the residues rather than on all the atoms in the protein; reducing the number of residue types by grouping residues based on physical properties such as hydrophobicity, as in HP models; and reducing the number of spatial degrees of freedom of the atoms or residues, for instance by restricting residue locations to lattices. Predicting secondary structure can also be seen as simplifying the 3-d problem by projecting the 3-d structure onto a 1-d string of secondary structural assignments, one per residue.
The various approaches to the problem can be classified into three categories: knowledge-
based - building the structure based on knowledge of a good template structure; ab initio -
building the structure from scratch using first principles; and structural features prediction.
Each category has sub-divisions of approaches. A particular approach is chosen depending
on the protein in question and the amount of data available, or on the research interest of the
research group. In practice, knowledge-based prediction tools are more successful. Hybrid
approaches also perform well and are becoming a trend in PSP research. Currently, most ab initio methods work on simplified models and at the residue level; strictly speaking, they are therefore not practical full tertiary structure prediction methods. They are, however, important in the sense that a true solution to ab initio prediction would permit the rational design of novel proteins with novel functions.
AI techniques have been applied in many approaches to the problem. The most notable are evolutionary computation in ab initio prediction and artificial neural networks in secondary structure prediction. In this report, we reviewed and analyzed the applications of three biologically inspired AI techniques to the PSP problem: evolutionary computing, ANNs, and L-systems. For each of these techniques, we presented a general framework of how it can be used for PSP, either directly or by discussing its important components, along with the rationale for why it is suitable for protein structure prediction. We also discussed and compared significant studies published in recent years.
Evolutionary algorithms are effective and generally applicable search techniques for hard
problems for which analytical methods or good heuristics are not available. PSP problem,
when formulated as a searching-for-optimal-conformation problem, is a good candidate for
using EAs. EAs explore an energy landscape for a minimal energy conformation which
is believed to correspond to the native state. Three crucial components were addressed: a representation of structure geometry that translates the problem space into encodings that can be used for evolution; a potential energy function that can distinguish between favorable and unfavorable structures; and the specific variation and selection operators that explore the conformational space. For structure representation and the energy function, large approximations are required because of the complexity of the problem. We addressed this issue and sampled the research literature to show how various approximations are handled.
Lindenmayer systems were presented as a novel generative encoding scheme for capturing protein structure in lattice models. We introduced and analyzed recent research in this line. L-system-based encoding has been tested in evolutionary algorithms with good preliminary results, but further research is needed to investigate its applicability to the PSP problem.
For humans, a large memory of stored examples can serve as the basis for intelligent inference. For the PSP problem, ANNs infer meaningful relations between primary sequence and secondary structure from selected data sets. The learned relationships, although in a hidden form, are then used to predict the structures of new sequences, with promising results. From the point of view of pattern recognition, secondary structure prediction can be seen as a classification task that assigns to each residue one of three (sometimes more) classes of conformational states. Various kinds of ANNs have been used for this task. We examined feed-forward networks based on local amino acid interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers.
Among the many AI techniques that have been applied to the PSP problem, I have sampled only a few of those inspired by biological systems. Nature is still 'smarter' than humans; perhaps eventually we can successfully apply what we have learned from Nature to biological problems themselves.
Bibliography
[1] C.B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223-230, 1973.
[2] J. Augen. Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine. Addison Wesley, 2004.
[3] P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. The MIT Press, 2001.
[4] P. Baldi, S. Brunak, P. Frasconi, G. Soda and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15:937-946, 1999.
[5] M. J. Bayley, G. Jones, P. Willett and M.P. Williamson. GENFOLD: a genetic algorithm for folding protein structures using NMR restraints. Protein Sci, 7:491-499, 1998.
[6] B. Berger and T. Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J. Comp. Bio., 5:27-40, 1998.
[7] C. Branden and J. Tooze. Introduction to protein structure. Garland Publishing Inc., 2nd edition, 1999.
[8] R. Casadio, E. Capriotti, M. Compiani, P. Fariselli, I. Jacoboni, P. Luigi, I. Rossi and G. Tasco. Neural networks and the prediction of protein structure. In Artificial intelligence and heuristic methods in bioinformatics, P. Frasconi and R. Shamir (eds.), IOS Press, 2003.
[9] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. A combination of support vector machines and bidirectional recurrent neural networks for protein secondary structure prediction. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 2003.
[10] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Predicting the disulfide bonding state of cysteines with combinations of kernel machines. Journal of VLSI Signal Processing, 35:287-295, 2003.
[11] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Cysteine bonding state: local prediction and global refinement using a combination of kernel machines and bidirectional recurrent neural networks. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 2003.
[12] J. Chandonia and M. Karplus. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Science, 5:768-774, 1996.
[13] J. Chen and N.S. Chaudhari. Capturing long-term dependencies for protein secondary structure prediction. In Advances in Neural Networks: Lecture Notes in Computer Science, Vol. 3174, Springer Verlag, 2004.
[14] C. Chothia and A. Lesk. Relationship between the divergence of sequence and structure in proteins. EMBO Journal, 5:823-827, 1986.
[15] P.Y. Chou and U.D. Fasman. Prediction of protein conformation. Biochemistry, 13:211- 215, 1974.
[16] J. Cohen. Bioinformatics-an introduction for computer scientists. ACM Computing Surveys, 36: 122-158, 2004.
[17] W.D. Cornell et al. A second generation force field for the simulation of proteins and nucleic acids. J. Am. Chem. Soc., 117:5179-5197, 1995.
[18] C. Cotta. Protein structure prediction using evolutionary algorithms hybridized with backtracking. Artificial Neural Nets Problem Solving Methods, Lecture Notes in Computer Science, 2687:321-328.
[19] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, and M. Yannakakis. On the complexity of protein folding. J. Comp. Bio., 5: 409-422, 1998.
[20] V. Cutello, G. Narzisi and G. Nicosia. A multi-objective evolutionary approach to the protein structure prediction problem. J. R. Soc. Interface, doi:10.1098, 2005.
[21] T. Dandekar and P. Argos. Potential of genetic algorithms in protein folding and protein engineering simulations. Protein Eng., 5: 637-645, 1992.
[22] T. Dandekar and P. Argos. Folding the main chain of small proteins with the genetic algorithm. Journal of Molecular Biology, 236: 844-861, 1994.
[23] T. Dandekar and P. Argos. Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and ex- tended criteria specific for strand regions. Journal of Molecular Biology, 256: 645-660, 1996.
[24] R. Day, J. Zydallis and G. Lamont. Solving the protein structure prediction problem through a multiobjective genetic algorithm. In Proc Computational Nanoscience and Nanotechnology Conference, 2002.
[25] K.A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24:1501, 1985.
[26] A.E. Eiben and J.E. Smith. Introduction to evolutionary computing. Springer, 2003.
[27] G. Escuela, G. Ochoa and N. Krasnogor. Evolving L-systems to capture protein structure native conformations. In Proc Genetic Programming: 8th European Conference, 74-84, 2005.
[28] V.A. Eyrich, M.A. Marti-Renom, D. Przybylski, M.S. Madhusudhan, A. Fiser, F. Pazos, A. Valencia, A. Sali and B. Rost. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17:1242-1243, 2001.
[29] P. Fariselli, P. Riccobelli, and R. Casadio. Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36:340-346, 1999.
[30] P. Fariselli and R. Casadio. Neural network based predictor of residue contact in pro- teins. Protein Engineering, 12: 15-21, 1999.
[31] D. Fischer, D. Baker and J. Moult. We need both computer models and experiments (correspondence). Nature, 409: 558, 2001.
[32] D. Fischer, D. Eisenberg. Fold recognition using sequence derived properties. Protein Science, 5:947-955, 1996.
[33] G.B. Fogel and D.W. Corne. Evolutionary Computation in Bioinformatics. Elsevier, 2003.
[34] D.B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelli- gence. IEEE Press, 1995.
[35] J. Gamalielsson and B. Olsson. Evaluating protein structure prediction models with evolutionary algorithms. In Information Processing with Evolutionary Algorithms, M. Grana, R. Duro, A. d'Anjou and P. Wang (eds.), Springer, 2005.
[36] I. Halperin, B. Ma, H. Wolfson and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47: 409-443, 2002.
[37] S.R. Holbrook, S.M. Muskal and S.H. Kim. Predicting surface exposure of amino acids from protein sequence. Protein Engineering, 3:659-665, 1990.
[38] J.H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
[39] B. Honig. Protein folding: from the Levinthal paradox to structure prediction. Journal of Molecular Biology, 293:283-293, 1999.
[40] G. Hornby and J. Pollack. The advantages of generative grammatical encodings for physical design. In Proc Congress on Evolutionary Computation, 2001.
[41] S. Hua and Z. Sun. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. Journal of Molecular Biology, 308:397-407, 2001.
[42] D.T. Jones, W.R. Taylor and J.M. Thornton. A new approach to protein fold recognition. Nature, 358:86-89, 1992.
[43] D.T. Jones. GenThreader: an efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology 287: 797-815, 1999.
[44] M. Khimasia and P. Coveney. Protein structure prediction as a hard optimization prob- lem: the genetic algorithm approach. Molecular Simulation, 19: 205-226, 1997.
[45] R. King and M. Sternberg. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5: 2298-2310, 1996.
[46] G. Kokai, Z. Toth and R. Vanyi. Modeling blood vessels of the eye with parametric L-systems using evolutionary algorithms. In Proc Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, 1999.
[47] N. Krasnogor, D. Pelta, P.E. Lopez, and E. Canal. Genetic algorithm for the protein folding problem, a critical view. In Proc of Engineering of Intelligent Systems, 1998.
[48] N. Krasnogor, W. Hart, J. Smith and D. Pelta. Protein structure prediction with evolu- tionary algorithms. In Proc Genetic and Evolutionary Computation Conference, 1999.
[49] D.V. Laurents, S. Subbiah and M. Levitt. Different protein sequence can give rise to highly similar folds through different stabilizing interactions. Protein Science, 3: 1938- 1944, 1994.
[50] K. Lin, V. Simossis, W. Taylor and J. Heringa. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21(2):152-159, 2005.
[51] A.D. MacKerell et al. All-atom empirical potential for molecular modeling and dynam- ics studies of proteins. J. Phys. Chem. B, 102:3586-3616, 1998.
[52] J. Meiler and D. Baker. Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci, 100(21):12105-12110, 2003.
[53] S. Miyazawa and R.L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. Journal of Molecular Biology, 256:623-644, 1996.
[54] A. Narayanan, E.C. Keedwell and B. Olsson. Artificial intelligence techniques for Bioinformatics. Applied Bioinformatics, 1(4):191-222, 2003.
[55] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453, 1970.
[56] G. Ochoa, G. Escuela and N. Krasnogor. Incorporating knowledge of secondary struc- tures in a L-system-based encoding for protein folding. In Proc Artificial Evolution Conference, 2005.
[57] A.R. Ortiz, A. Kolinski and J. Skolnick. Nativelike topology assembly of small proteins using predicted restraints in Monte Carlo folding simulations. Proceedings of the National Academy of Sciences, 95:1020-1025, 1998.
[58] D.J. Osguthorpe. Ab initio protein folding. Current Opinion in Structural Biology, 10:146-152, 2000.
[59] A. Passerini and A.Vullo. Machine learning in structural genomics. 2004.
[60] A. Patton, W. Punch III and E. Goodman. A standard GA approach to native protein conformation prediction. In Proc 6th Intl Conf Genetic Algorithms, 574-581, 1995.
[61] K. Petersen and W.R.Taylor. Modelling zinc-binding proteins with GADGET: genetic algorithm and distance geometry for exploring topology. Journal of Molecular Biology, 325: 1039-1059, 2003.
[62] A. Piccolboni and G. Mauri. Application of evolutionary algorithms to protein folding prediction. In Proc ICONIP, 1997.
[63] N.A. Pierce and E. Winfree. Protein Design is NP-Hard. Protein Engineering, 15: 779- 782, 2002.
[64] G. Pollastri, D. Przybylski, B. Rost and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228-235, 2002.
[65] G. Pollastri, P. Baldi, P. Fariselli and R. Casadio. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142-153, 2002.
[66] O.B. Ptitsyn and A.V. Finkelstein. Theory of protein secondary structure and algorithm of its prediction. Biopolymers, 22:15-25, 1983.
[67] N. Qian and T. Sejnowski. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202: 865-884, 1988.
[68] A.A. Rabow and H.A. Scheraga. Improved genetic algorithm for the protein folding problem by use of a cartesian combination operator. Protein Science, 5:1800-1815, 1996.
[69] A. Renner and E. Bornberg-Bauer. Exploring the fitness landscapes of lattice proteins. Pacific Symposium on Biocomputing, 2:361-372, 1997.
[70] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232:584-599, 1993.
[71] B. Rost, C. Sander and R. Schneider. Redefining the goals of protein secondary structure prediction. Journal of Molecular Biology, 235:13-26, 1994.
[72] B. Rost and C. Sander. Conservation and prediction of solvent accessibility in protein families. Proteins, 20:55-72, 1994.
[73] B. Rost. PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266:525-539, 1996.
[74] B. Rost, J. Liu, D. Przybylski, R. Nair, H. Bigelow, K. Wrzeszczynski and Y. Ofran. Predict protein structure through evolution. In Chemoinformatics, J. Gasteiger and T. Engel (eds.), Wiley, 2003.
[75] B. Rost. Neural networks predict protein structure: hype or hit? In Artificial intelligence and heuristic methods in bioinformatics, P. Frasconi and R. Shamir (eds.), IOS Press, 2003.
[76] B. Rost. Rising accuracy of protein secondary structure prediction. In Protein structure determination, analysis, and modeling for drug discovery, D. Chasman (ed), New York: Dekker, 2003.
[77] M.P. Scapin and H.S. Lopes. Protein structure prediction using an enhanced genetic algorithm for the 2D HP model. III Brazilian Workshop on Bioinformatics, 183-186, 2004.
[78] D.B. Searls. The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology, L. Hunter (ed.), AAAI Press, 1993.
[79] D.B. Searls. Grand challenges in computational biology. In Computational Methods in Molecular Biology, S. L. Salzberg, D. B. Searls and S. Kasif (eds.), Elsevier, 1998.
[80] N. Siew and D. Fischer. Convergent evolution of protein structure prediction and computer chess tournaments: CASP, Kasparov and CAFASP. IBM Systems Journal, 40(2):410-425, 2001.
[81] K.T. Simons, C. Kooperberg, E. Huang and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 268:209-225, 1997.
[82] J. Skolnick and A. Kolinski. Simulations of the folding of a globular protein. Science, 250:1121-1125, 1990.
[83] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[84] S. Sun, P.D. Thomas and K.A. Dill. A simple protein folding algorithm using a binary code and secondary structure constraints. Protein Engineering, 8:769-778, 1995.
[85] Z. Sun, X. Xia, O. Guo and D. Xu. Protein structure prediction in a 210-type lattice model: parameter optimization in the genetic algorithm using orthogonal array. Journal of Protein Chemistry, 18:39-46, 1999.
[86] I. Tchoumatchenko, F. Vissotsky and J.G. Ganascia. How to make explicit a neural network trained to predict protein secondary structure. Proceedings of CEDEX 05, 1993.
[87] A. Tickle, F. Maire, G. Bologna, R. Andrews and J. Diederich. Lessons from past, current issues, and future research directions in extracting knowledge embedded in artificial neural networks. In Hybrid Neural Systems, S. Wermter and R. Sun (eds.), Springer-Verlag, Berlin, 2000.
[88] C.J. Tsai, B. Ma, Y.Y. Sham and S. Kumar. A hierarchical building-block-based computational scheme for protein structure prediction. IBM Journal of Research and Development, 45, 2001.
[89] R. Unger and J. Moult. Genetic algorithms for protein folding simulations. Journal of Molecular Biology, 231:75-81, 1993.
[90] R. Unger. The genetic algorithm approach to protein structure prediction. Structure and Bonding, 110:153-175, 2004.
[91] F. Vivarelli, G. Giusti, M. Campanini, M. Compiani and R. Casadio. LGANN: a parallel system combining a local genetic algorithm and neural networks for the prediction of secondary structure of proteins. Computer Applications in the Biosciences, 11:253-260, 1995.
[92] A. Vullo. On the role of machine learning in protein structure determination. AI*IA Notizie (journal of the Italian Association for Artificial Intelligence), XV(3):22-30, 2002.