[IEEE 2012 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)...

PARTICLE SWARM OPTIMIZATION WITH BACKTRACKING IN PROTEIN STRUCTURE PREDICTION PROBLEM

Nanda Dulal Jana1 and Jaya Sil2

1 Department of Information Technology, National Institute of Technology, Durgapur, West Bengal, India

2 Department of Computer Science and Technology, Bengal Engineering and Science University, Shibpur, West Bengal, India

ABSTRACT Several population-based search algorithms have been developed to predict the native state of a protein from its primary sequence. This paper aims at predicting the native conformation of proteins in the lattice model using a PSO-based search method. However, getting stuck at local minima and generating illegal conformations are the main drawbacks of applying such a search algorithm to protein structure prediction. Adaptive Polynomial Mutation (APM) is performed to escape local minima, while illegal conformations are repaired using a backtracking method. Benchmark sequences of different lengths are used to verify the proposed algorithm, which shows better results compared to the earlier approaches.

Index Terms— Protein Structure Prediction (PSP), 2D HP lattice model, Particle Swarm Optimization (PSO), Adaptive Polynomial Mutation, Backtracking.

1. INTRODUCTION

The prediction of protein structure from its amino acid sequence is one of the most prominent problems in computational biology. The amino acid sequence of its polypeptide chains is known as the primary structure of a protein. Under certain physiological conditions, the primary structure of a protein spontaneously folds into a precise three-dimensional tertiary structure, the native state. The protein structure prediction (PSP) problem is to find the native state of a protein by developing an efficient computational method. Solving PSP instances to optimality is a difficult combinatorial optimization problem because of the exponential number of potential conformations. Simplified models like Dill's HP lattice model [1] have been used for investigating the general properties of protein folding. In the HP model, a protein is simplified to a two-letter alphabet, namely H (hydrophobic residue) and P (polar residue). In this model, the energy function is the total number of hydrophobic interactions between the amino acids, and the goal is to find a lattice conformation with minimum energy, i.e., the maximum number of H-H contacts. It has been proved that finding the minimum energy in the HP lattice model is an NP-complete problem [2]. Therefore, a deterministic approach is not suitable to address the problem. Many population-based evolutionary algorithms (EA) [3, 4, 5] have been developed to solve the PSP problem. However, the genetic search of evolutionary algorithms leads to illegal conformations.

Simple PSO has shown good search ability in many optimization problems, with fast convergence. However, due to the lack of diversity in the population, it is easily trapped in local optima. PSO variants have also been applied to the lattice model [16] and the off-lattice model [17] to predict protein structure, but the drawbacks of simple PSO were not highlighted there. This paper aims to apply simple PSO with a local improvement strategy (i.e., PSO-APM) to solve the PSP problem. A further problem is that PSO generates illegal conformations; these are repaired using the proposed backtracking method. The paper is organized as follows. Section 2 presents the preliminaries of the 2D HP lattice model and briefly describes particle swarm optimization and adaptive polynomial mutation (PSO-APM). The methodology for applying the PSO-APM algorithm to the PSP problem is described in Section 3. Section 4 presents the parameters considered in our experiments. In Section 5, the experimental results are compared against other known algorithms. Finally, conclusions and future directions are summarized in Section 6.

978-1-4673-2193-8/12/$31.00 ©2012 IEEE


2. BACKGROUND

2.1. The HP Lattice Model

The HP model [1] is a simple abstraction that captures the essential concepts of protein structure prediction. In this model, amino acids are divided into two categories: hydrophobic (H) and hydrophilic (P). The primary sequence of a protein is therefore $S \in \{H, P\}^+$. With this simplification, optimization models can be developed that aim to maximize interactions between adjacent pairs of H amino acids. Adjacency is considered only in the cardinal directions of the lattice upon which the sequence is embedded. In an HP lattice, vertices represent amino acids and edges represent connecting bonds. Black squares at the vertices indicate H, while white squares indicate P amino acids. A lattice can be 2D or 3D and either square, cubic or triangular. The H-H contacts are the basis for the evaluation function. Every pair of H residues that is adjacent on the lattice and not consecutive in the primary sequence is awarded a value ε (usually -1). Such a pair is known as a topological contact between amino acids on the lattice. The sum of all such values gives the energy of the conformation.
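The energy definition above can be illustrated with a minimal Python sketch. It assumes a conformation is given as a string of absolute moves over N, S, E, W (the encoding introduced later in Section 3.1) starting at the origin; the names `fold` and `hp_energy` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Unit steps for the four absolute moves on the 2D square lattice.
MOVES: Dict[str, Tuple[int, int]] = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def fold(moves: str) -> List[Tuple[int, int]]:
    """Return the lattice coordinates of every residue, starting at the origin."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def hp_energy(sequence: str, moves: str, epsilon: int = -1) -> int:
    """Sum epsilon over H-H pairs that are lattice neighbours but not chain neighbours."""
    coords = fold(moves)
    if len(coords) != len(sequence):
        raise ValueError("a chain of n residues needs n-1 moves")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only east and north so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy += epsilon
    return energy
```

For example, the four-residue chain HPPH folded into a unit square with the moves E, S, W brings its two H residues next to each other, giving one topological contact and an energy of -1.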

Fig. 1. Optimal H-H topological contacts for the sequence HPHPPHHPHPPHPHHPPHPH in the 2D square lattice

In Fig. 1, 'S' and 'E' represent the starting and ending amino acids, while dotted lines represent H-H topological contacts. The conformation has 9 H-H topological contacts, which is the maximum number of contacts for the given sequence.

2.2. Particle Swarm Optimization (PSO)

PSO searches for the best solution by simulating the movement and flocking of birds. It was introduced by Kennedy and Eberhart in 1995 to study social and cognitive behavior [6, 7]. It is similar to evolutionary algorithms, but has lower computational complexity and requires fewer lines of code. PSO algorithms use a population of individuals called particles. Each particle has its own position and velocity with which it moves around the search space. The position vector and the velocity vector of the ith particle in the D-dimensional search space can be represented by $X_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{iD})$ and $V_i = (v_{i1}, v_{i2}, v_{i3}, \ldots, v_{iD})$, respectively. Particles have memory, and each particle keeps track of its previous best position and the corresponding fitness. The previous best value is called pbest, i.e., the best position of each particle up to that time. It is represented by $P_i = (p_{i1}, p_{i2}, p_{i3}, \ldots, p_{iD})$. The fittest particle found so far in the swarm is called gbest and is represented by $P_g = (p_{g1}, p_{g2}, p_{g3}, \ldots, p_{gD})$. The new velocities and positions of the particles for the next fitness evaluation are calculated using the following two equations:

$$V_{id}(t+1) = \omega V_{id}(t) + c_1 r_1 (P_{id} - X_{id}(t)) + c_2 r_2 (P_{gd} - X_{id}(t)) \qquad (1)$$

$$X_{id}(t+1) = X_{id}(t) + V_{id}(t+1) \qquad (2)$$

where ω is the inertia constant, $c_1$ and $c_2$ are two acceleration coefficients, and $r_1$ and $r_2$ are uniformly distributed random numbers in [0, 1]. The search space is specified by $(X_{min}, X_{max})^D$, where D is the dimension of the space. The velocity is constrained within $(V_{min}, V_{max})^D$: when updating the velocity of a particle, if a component is greater than $V_{max}$ it is set back to $V_{max} \cdot r$, and if it is less than $V_{min}$ it is set back to $V_{min} \cdot r$. Generally, $V_{min} = -V_{max}$, and $V_{max}$ is set to 10%-50% of the search space range $X_{max}$.

2.3. Adaptive Polynomial Mutation (APM)

Particle swarm optimization algorithms converge rapidly during the initial stages of the search, but often slow considerably and can get trapped at local optima. When particles converge to the global best particle, the current position, the personal best $P_{id}$ and the global best $P_{gd}$ coincide, so the last two terms of Eq. (1) become zero and the resulting equation becomes

$$V_{id}(t+1) = \omega V_{id}(t) \qquad (3)$$

As t tends to infinity, $V_{id}(t+1) \to 0$ because $0 < \omega < 1$. In this circumstance, particles cannot move further in the search space. Mutation strategies [8, 9] are used in PSO to avoid stagnation at a local optimum through long jumps made by the mutation. The mutation operator can be applied in three ways: to individual particles, to particles along with their velocities, and to the global best particle. In the proposed method, mutation is applied to the global best particle. When dealing with the local optima problem, a long jump is very useful when the global best is far away from the global optimum. But when the global best is trapped in a local optimum near the global optimum, a long jump may move it into the infeasible solution space or drive it towards other, better local optima that are far away from the global optimum. In that case, mutation becomes ineffective. A controlled mutation size (adaptive polynomial mutation [18]) has therefore been employed on the global best, where the mutation size decreases with increasing iterations. Polynomial mutation is based on the polynomial probability distribution [10, 11].
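As an illustration of the update rules in Eqs. (1) and (2) and of the stagnation argument behind Eq. (3), a single PSO step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `pso_step` and the default `v_max` are hypothetical, the factor r used when clamping is assumed to be a fresh uniform random number in [0, 1), and the position is not clipped to the search range.

```python
import random
from typing import List, Optional


def pso_step(x: List[float], v: List[float], pbest: List[float], gbest: List[float],
             w: float = 0.72984, c1: float = 1.49445, c2: float = 1.49445,
             v_max: float = 1.0, rng: Optional[random.Random] = None) -> None:
    """Update the velocity (Eq. 1) and position (Eq. 2) of one particle in place."""
    rng = rng if rng is not None else random.Random()
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        v[d] = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        # Velocity clamping with V_min = -V_max: a violating component is reset
        # to a random fraction of the bound (assumed reading of "V_max * r").
        if v[d] > v_max:
            v[d] = v_max * rng.random()
        elif v[d] < -v_max:
            v[d] = -v_max * rng.random()
        x[d] += v[d]
```

When the position, pbest and gbest coincide, both attraction terms vanish and the step reduces to Eq. (3): a velocity of 0.5 shrinks to 0.72984 × 0.5 = 0.36492 after one update, which is the geometric decay that motivates the mutation operator.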

In polynomial mutation, a variable $x_j$ is updated as

$$x_j(t+1) = x_j(t) + (x_j^u - x_j^l)\,\delta_j \qquad (4)$$

where $x_j^u$ is the upper bound and $x_j^l$ is the lower bound of $x_j$. The parameter $\delta_j$ is calculated from the polynomial probability distribution

$$P(\delta) = 0.5\,(\eta_m + 1)\,(1 - |\delta|)^{\eta_m} \qquad (5)$$

which leads to Eq. (6), where $\eta_m$ is the polynomial distribution index. The property of $\eta_m$ is such that by varying its value, the perturbation in the mutated solution can be varied. If the value of $\eta_m$ is large, a small perturbation in the value of a variable is achieved. To achieve a gradually decreasing perturbation in the mutated solutions, the value of $\eta_m$ is gradually increased. The following rule achieves this adaptation and is known as adaptive polynomial mutation:

$$\eta_m = 80 + t \qquad (7)$$

where t is the current iteration. In this work, we apply adaptive polynomial mutation to the global best solution in PSO using the following equation:

$$mP_{gj}(t) = P_{gj}(t) + (x_j^u - x_j^l)\,\delta_j \qquad (8)$$

where $x_j^u$ is the upper bound and $x_j^l$ is the lower bound of $x_j$ in $P_{gj}$. If the mutated global best $mP_g$ is better than $P_g$, then $P_g$ is replaced by $mP_g$.
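The adaptive polynomial mutation of the global best, Eqs. (6) to (8), can be sketched as follows. Several details are assumptions made for illustration and are not stated in the paper: every component of the global best is mutated, mutated values are clipped to the bounds, all components share the same bounds, and "better" means a larger fitness value. The names `polynomial_delta` and `mutate_gbest` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def polynomial_delta(r: float, eta_m: float) -> float:
    """Perturbation in [-1, 1) for a uniform random number r in [0, 1) (Eq. 6)."""
    if r < 0.5:
        return (2.0 * r) ** (1.0 / (eta_m + 1.0)) - 1.0
    return 1.0 - (2.0 * (1.0 - r)) ** (1.0 / (eta_m + 1.0))


def mutate_gbest(gbest: Sequence[float], fitness: Callable[[List[float]], float],
                 t: int, lower: float, upper: float,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Return the mutated global best if it is fitter, otherwise a copy of the old one."""
    rng = rng if rng is not None else random.Random()
    eta_m = 80.0 + t  # Eq. (7): the distribution index grows with the iteration count
    mutant = []
    for g in gbest:
        delta = polynomial_delta(rng.random(), eta_m)
        value = g + (upper - lower) * delta        # Eq. (8)
        mutant.append(min(max(value, lower), upper))  # assumption: clip to the bounds
    # Assumption: larger fitness is better (e.g. number of H-H contacts).
    return mutant if fitness(mutant) > fitness(list(gbest)) else list(gbest)
```

Because the exponent 1/(η_m + 1) shrinks as t grows, the same random number r yields a smaller step at later iterations, which is the controlled mutation size described above.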

3. PSO-APM FOR PROTEIN STRUCTURE PREDICTION

In this section, we adapt the PSO-APM algorithm for solving the PSP problem in the 2D HP lattice model.

3.1. Conformation Representation

A particle is a conformation of a protein represented by an array of length n-1, where n is the number of amino acids in the protein. In this paper, absolute internal coordinates [12] are used for representing protein conformations on the HP lattice model. In absolute internal coordinates, a given amino acid moves according to the axes of the lattice. Conformations under this scheme are coded as a sequence in $\{N, S, E, W\}^{n-1}$, whose letters correspond to North, South, East and West, for a protein sequence of length n in the 2D lattice. In PSO, every particle/individual is a vector of real numbers that is decoded into a specific conformation of a protein on the 2D square lattice. Therefore, an adaptation concept is necessary for encoding and decoding the sequence of movements of a protein on the lattice. The adaptation concept proposed in [13] is used in this paper. With absolute internal coordinates in the 2D square HP lattice model, the movements are North, South, East and West. Therefore, the phenotypical representation of a solution is defined over the alphabet {N, S, E, W}, while the genotypical representation is still a real-valued vector. Let $x_{ij}$ be the jth element of particle $x_i$, let P be the string representing the sequence of movements of the conformation, and let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ be arbitrary constants in $\mathbb{R}$. The genotype-phenotype mapping is defined as follows:

$$P_j = \begin{cases} N, & Y_1 < x_{ij} \le Y_2 \\ E, & Y_2 < x_{ij} \le Y_3 \\ S, & Y_3 < x_{ij} \le Y_4 \\ W, & Y_4 < x_{ij} \le Y_5 \end{cases} \qquad (9)$$

Initially, the swarm is populated with a set of N conformations that are randomly generated with the absolute internal coordinates encoding scheme. Using the genotype-phenotype mapping defined above, we convert each conformation string to real values, because in the PSO algorithm every individual is real-valued. All velocities are initially set to 0.

3.2. Repair Algorithm

A collision occurs when two or more amino acids overlap at the same point on the 2D lattice. A particle in which such a collision occurs is treated as an invalid particle, or illegal conformation. Invalid particles are not accepted in our proposed algorithm but are repaired using a backtracking method. This method takes an invalid particle as input and returns the repaired one (i.e., a valid conformation) as output. The backtracking method detects a collision and tries to repair it by finding an alternative empty location for the amino acid that caused the collision. If no empty location is available, it backtracks to the previous amino acid whose location can be modified, repeating this as often as necessary until an empty location is found, and finally a valid conformation is returned. The flowchart of the proposed algorithm is shown in Fig. 2. In this flowchart, the conformation in the absolute encoding scheme is initially stored in 'S', and the respective coordinates of the amino acids are stored in 'M'. Each amino acid has a value 'back' which stores its number of invalid movements. Whenever the back value becomes greater than 3, it causes a backtrack from the current amino acid. The pointer 'i' keeps track of the amino acid currently being processed. When a backtrack occurs, the value of 'i' is decreased by 1 and the back value of that amino acid is set to 0. If a particular movement is not available, a fixed strategy is followed to assign the amino acid a new movement. The strategy is simply clockwise, i.e., if 'East' is not available go to 'South', and likewise from 'South' to 'West' and from 'West' to 'North'. The number of attempts to place a particular amino acid with respect to a particular coordinate is 4. When the value of 'i' equals the length of the protein sequence, the repaired conformation is returned.

3.3. Acceptance Criterion of New Particle

In this improved version of PSO, a new particle replaces the old particle if its energy value is greater than or equal to the energy of the old particle.
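The genotype-phenotype mapping of Eq. (9) and the acceptance rule of Section 3.3 can be sketched together in Python. The thresholds are the values listed in Section 4.3. Three points are assumptions for illustration only: values outside (Y1, Y5] are mapped to the nearest interval, the encoder places each move at the midpoint of its interval, and the "energy value" compared in the acceptance rule is taken to be the number of H-H contacts. The names `decode`, `encode` and `accept` are hypothetical.

```python
from typing import List, Sequence

# Thresholds Y1..Y5 from Section 4.3.
Y = (-4.0, -2.0, 0.0, 2.0, 4.0)
LETTERS = "NESW"  # interval order of Eq. (9): N, E, S, W


def decode(particle: Sequence[float]) -> str:
    """Map a real-valued particle to a move string using Eq. (9)."""
    moves = []
    for x in particle:
        k = 0
        # Advance while x lies above the upper edge of interval k.
        # Assumption: out-of-range values fall into the first or last interval.
        while k < 3 and x > Y[k + 1]:
            k += 1
        moves.append(LETTERS[k])
    return "".join(moves)


def encode(moves: str) -> List[float]:
    """Map a move string to real values (assumption: the interval midpoints)."""
    return [(Y[LETTERS.index(m)] + Y[LETTERS.index(m) + 1]) / 2.0 for m in moves]


def accept(new_contacts: int, old_contacts: int) -> bool:
    """Section 3.3, assuming the compared value is the number of H-H contacts."""
    return new_contacts >= old_contacts
```

Since each interval is closed on the right, the boundary values -2, 0, 2 and 4 decode to N, E, S and W respectively, and `decode(encode(s))` returns `s` for any move string `s`.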

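The perturbation $\delta_j$ of Section 2.3 is given by

$$\delta_j = \begin{cases} (2 r_j)^{1/(\eta_m + 1)} - 1, & r_j < 0.5 \\ 1 - [2(1 - r_j)]^{1/(\eta_m + 1)}, & r_j \ge 0.5 \end{cases} \qquad (6)$$

where $r_j$ is a uniformly distributed random number in [0, 1].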


Fig. 2. Flowchart of the proposed backtracking algorithm
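The repair procedure of Section 3.2 can be approximated by the following Python sketch. It follows the textual description (a per-residue `back` counter, clockwise retries, stepping back when all four directions have failed) and is not a transcription of the flowchart itself; the function name `repair` and the choice to start the chain at the origin are assumptions.

```python
from typing import List, Tuple

CLOCKWISE = "NESW"  # clockwise retry order: N -> E -> S -> W -> N
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def repair(moves: str) -> str:
    """Return a self-avoiding move string that stays close to the input."""
    n = len(moves)
    preferred = [CLOCKWISE.index(m) for m in moves]
    back = [0] * n                            # directions already tried per move
    coords: List[Tuple[int, int]] = [(0, 0)]  # 'M': coordinates placed so far
    result: List[str] = []                    # repaired copy of 'S'
    i = 0
    while i < n:
        placed = False
        while back[i] < 4:
            # Try the preferred move first, then rotate clockwise.
            d = CLOCKWISE[(preferred[i] + back[i]) % 4]
            back[i] += 1
            x, y = coords[-1]
            nxt = (x + STEP[d][0], y + STEP[d][1])
            if nxt not in coords:
                coords.append(nxt)
                result.append(d)
                placed = True
                break
        if placed:
            i += 1
        else:
            if i == 0:
                raise ValueError("no valid conformation found")
            # All four directions failed: reset this counter and undo the
            # previous placement so that it resumes with its next direction.
            back[i] = 0
            i -= 1
            coords.pop()
            result.pop()
    return "".join(result)
```

For instance, the closed loop E, S, W, N collides with the first residue on its last move; the sketch rotates that move clockwise until it finds the free cell to the south and returns E, S, W, S. A chain that walks into a dead end, such as E, E, N, N, W, W, S, E, N, forces one backtracking step and is repaired to E, E, N, N, W, W, S, W, N. The search is exhaustive in the worst case, so its running time can grow exponentially with chain length.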

4. PARAMETER SETTINGS

4.1. Benchmark Sequences

Nine different HP protein instances are used in our experimental studies. These instances are not real-world proteins but benchmarks for the 2D HP square lattice model. In Table I, Hi, Pi and (HP)i denote i repetitions of the respective amino acids, E* represents the maximum number of H-H topological contacts known to date, and size represents the length of the protein instance.

4.2. PC Configuration

• System: Windows XP
• CPU: 2.26 GHz (Core 2 Duo)
• RAM: 2 GB
• Software: MATLAB 2010b

4.3. Parameters of PSO

• Population size (N): 100
• Number of generations: expressed in terms of FEs, the number of function evaluations permitted
• Initial distribution index for polynomial mutation: 100
• c1 = c2 = 1.49445
• ω = 0.72984
• Y1 = -4, Y2 = -2, Y3 = 0, Y4 = 2 and Y5 = 4

5. RESULT AND DISCUSSION

Table II presents the results of the PSO-APM algorithm applied to the benchmark sequences and compares them with other evolutionary methods. In this table, the first three columns show the sequence number, the size of the sequence and the maximum known number (E*) of H-H topological contacts, respectively. The following columns report the best number of contacts obtained using Genetic Algorithms (GA) [14], the Differential Evolution (DE) approach [5] and hybrid DE [15]. However, they did not all consider the full benchmark set (S9), so the corresponding positions in Table II are left empty. For the proposed approach, the maximum number of H-H topological contacts obtained is listed, and the number of runs, out of 30 independent runs, in which this maximum was reached is given in parentheses. The average over these runs is superior or comparable to that of the DE approach. The results of the proposed approach are better than or equal to those of the GA technique for all the sequences. For S7 and S9, we obtained better results than the other evolutionary algorithms. We also observed that, using a small number of particles in the swarm, PSO is capable of exploring the search space and finds slightly better conformations with lower energy than genetic algorithms, which normally require a large population.

6. CONCLUSION AND FUTURE WORK

This paper proposes adaptive polynomial mutation on the global best solution in Particle Swarm Optimization and applies it to well-known benchmark instances. Mutation is applied to the global best particle so that it can jump out of local optima. Using the backtracking method, invalid conformations are fully repaired to produce valid conformations. The performance of our algorithm is evaluated by comparing it with previous evolutionary algorithms, and it produces better or equal results on the same set of benchmark instances. Our future work will be directed at finding a new mapping concept for the encoding and at implementing the method in the 3D HP model and the triangular lattice model.

TABLE I. PROTEIN HP INSTANCES USED IN THE EXPERIMENTS

| Seq. No. | Sequence | Size |
|---|---|---|
| S1 | HPHP2H2PHP2HPH2P2HPH | 20 |
| S2 | H2P2HP2HP2HP2HP2HP2HP2H2 | 24 |
| S3 | P2HP2H2P4H2P4H2P4H2 | 25 |
| S4 | P3H2P2H2P5H7P2H2P4H2P2HP2 | 36 |
| S5 | P2HP2H2P2H2P5H10P6H2P2H2P2HP2H5 | 48 |
| S6 | H2PHPHPHPH4PHP3HP3HP4HP3HP3HPH4PHPHPHPH2 | 50 |
| S7 | P2H3PH8P3H10PHP3H12P4H6PH2PHP | 60 |
| S8 | H12PHPHP2H2P2H2P2HP2H2P2H2P2HP2H2P2H2P2HPHPH12 | 64 |
| S9 | H4P4H12P6H12P3H12P3H12P3HP2H2P2H2P2HPH | 85 |
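The compact notation of Table I, in which a number after a letter is a repetition count, can be expanded with a short Python helper. This is a sketch for the letter-plus-count form that appears in the table as printed here; parenthesised groups such as (HP)i are not handled, and the name `expand_hp` is hypothetical.

```python
import re

# Table I in compact notation, with the stated sizes.
BENCHMARKS = {
    "S1": ("HPHP2H2PHP2HPH2P2HPH", 20),
    "S2": ("H2P2HP2HP2HP2HP2HP2HP2H2", 24),
    "S3": ("P2HP2H2P4H2P4H2P4H2", 25),
    "S4": ("P3H2P2H2P5H7P2H2P4H2P2HP2", 36),
    "S5": ("P2HP2H2P2H2P5H10P6H2P2H2P2HP2H5", 48),
    "S6": ("H2PHPHPHPH4PHP3HP3HP4HP3HP3HPH4PHPHPHPH2", 50),
    "S7": ("P2H3PH8P3H10PHP3H12P4H6PH2PHP", 60),
    "S8": ("H12PHPHP2H2P2H2P2HP2H2P2H2P2HP2H2P2H2P2HPHPH12", 64),
    "S9": ("H4P4H12P6H12P3H12P3H12P3HP2H2P2H2P2HPH", 85),
}


def expand_hp(compact: str) -> str:
    """Expand run-length notation such as 'H2P3' into 'HHPPP'."""
    tokens = re.findall(r"([HP])(\d*)", compact.replace(" ", ""))
    # A letter without a count stands for a single residue.
    return "".join(letter * int(count or 1) for letter, count in tokens)
```

Expanding S1 reproduces the 20-residue sequence HPHPPHHPHPPHPHHPPHPH shown in Fig. 1, and the expanded length of every entry matches the size listed in Table I.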

TABLE II. COMPARISON OF MAXIMUM NUMBER OF H-H TOPOLOGICAL CONTACTS OBTAINED BY DIFFERENT ALGORITHMS ON THE BENCHMARK INSTANCES

7. REFERENCES

[1] K. A. Dill, “Theory for the folding and stability of globular proteins,” Biochemistry, vol. 24, no. 6, pp. 1501-1509, 1985.

[2] B. Berger and T. Leighton, “Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete,” J. Comput. Biol., vol. 5, no. 1, pp. 27-40, 1998.

[3] R. Unger and J. Moult, “A Genetic Algorithm for Three Dimensional Protein Folding Simulations,” in Proc. of the 5th Annual International Conference on Genetic Algorithms, pp. 581-588, 1993.

[4] J. T. Pedersen and J. Moult, “Protein Folding Simulations with Genetic Algorithms and a Detailed Molecular Description,” J. Mol. Biol., vol. 269, no. 2, pp. 240-259, 1997.

[5] R. Bitello and H. S. Lopes, “A differential evolution approach for protein folding,” in Proc. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1-5, 2006.

[6] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proc. of the Sixth International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Oct. 1995.

[7] R. Eberhart and J. Kennedy, “Particle swarm optimization,” IEEE International Conference on Neural Networks, vol. 4, no. 27, pp. 1942-1948, 1995.

[8] J. Tang and X. Zhao, “Particle Swarm Optimization with Adaptive Mutation,” WASE International Conference on Information Engineering, pp. 234-237, 2009.

[9] X. Wu and M. Zhong, “Particle Swarm Optimization Based on Power Mutation,” ISECS International Colloquium on Computing, Communication, Control, and Management, pp. 464-467, 2009.

[10] A. Saha, R. Datta and K. Deb, “Hybrid Gradient Projection based Genetic Algorithms for Constrained Optimization,” IEEE Congress on Evolutionary Computation (CEC), pp. 1-8, 2010.

[11] M. M. Raghuwanshi and O. G. Kakde, “Survey on multiobjective evolutionary and real coded genetic algorithms,” Complexity International, vol. 11, 2005.

[12] N. Krasnogor, W. E. Hart, J. Smith and D. A. Pelta, “Protein structure prediction with evolutionary algorithms,” in Proc. Int. Genetic and Evolutionary Computation Conf., pp. 1596-1601, 1999.

[13] N. D. Jana and J. Sil, “Protein Structure Prediction in 2D HP lattice model using differential evolutionary algorithm,” in S. C. Satapathy et al. (Eds.), Proc. of the Incon INDIA2012, AISC 132, pp. 281-290, 2012.

[14] R. Unger and J. Moult, “Genetic Algorithms for protein folding simulations,” Journal of Molecular Biology, vol. 231, no. 1, pp. 75-81, 1993.

[15] J. Santos and M. Dieguez, “Differential Evolution for protein structure prediction using the HP Model,” in Proc. of IWINAC 2011, Part I, LNCS 6686, pp. 323-333, 2011.

[16] A. Bautu and H. Luchian, “Protein Structure prediction in lattice models with particle swarm optimization,” in M. Dorigo et al. (Eds.), ANTS 2010, LNCS 6234, pp. 512-519, 2010.

[17] H. Zhu, C. Pu, X. Lin, J. Gu, S. Zhang and M. Su, “Protein structure prediction with EPSO in toy model,” in 2nd International Conference on Intelligent Networks and Intelligent Systems, pp. 673-676, 2009.

[18] T. Si, N. D. Jana and J. Sil, “Particle Swarm Optimization with Adaptive Polynomial Mutation,” in Proc. WICT 2011, Mumbai, India, pp. 69-73, 2011.

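| Seq. No. | Size | E* | GA [14] | Hybrid DE [15] | DE [5] Max | DE [5] Avg. | Our Approach Max | Our Approach Avg. |
|---|---|---|---|---|---|---|---|---|
| S1 | 20 | 9 | 9 | 9 | 9 | 9.00 | 9(30) | 9.00 |
| S2 | 24 | 9 | 9 | 9 | 9 | 9.00 | 9(30) | 9.00 |
| S3 | 25 | 8 | 8 | 8 | 8 | 8.00 | 8(30) | 8.00 |
| S4 | 36 | 14 | 14 | 14 | 14 | 13.96 | 14(30) | 14.00 |
| S5 | 48 | 23 | 22 | 23 | 23 | 23.00 | 23(30) | 23.00 |
| S6 | 50 | 21 | 21 | 21 | 21 | 21.00 | 21(30) | 21.00 |
| S7 | 60 | 36 | 34 | 35 | 35 | 34.79 | 36(25) | 35.80 |
| S8 | 64 | 42 | 37 | 42 | 42 | 41.87 | 42(25) | 41.83 |
| S9 | 85 | 53 | -- | -- | 52 | 51.38 | 53(15) | 52.4 |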
