Neural Network for Protein Secondary Structure Prediction

Posted on 15-Jul-2015


Jakub Paś

Introduction

Applications of secondary structure prediction:

• Protein folding research

• Fold recognition

• Protein classification

• Homology modeling

• Ab initio methods

Evolution Of Prediction Accuracy

1978 Chou & Fasman 57%

1989 Qian & Sejnowski 66%

1993 Rost & Sander 71%

Present – up to 78%?

Accuracy of current secondary structure prediction methods

Method    Q3    CorrH  CorrE  CorrC
PROF      77.0  0.67   0.65   0.56
PSIPRED   76.6  0.66   0.64   0.56
SSpro     76.3  0.67   0.64   0.56
JPred2    75.2  0.65   0.63   0.54
PHDpsi    75.1  0.64   0.62   0.53
PHD       71.9  0.59   0.59   0.59

Copenhagen – 78%

Neural Networks

• Neural networks are trained rather than programmed to carry out chosen information processing tasks

• Training a neural network means adjusting the network so that it produces the desired output for each pattern in a given set of input patterns

• Since the desired outputs are known in advance, training a feed-forward network is supervised learning

• Backpropagation algorithm – each error at the output triggers a backward correction of the weights (the parameters of the activation functions)

Three-layer Network

[Diagram: input nodes → hidden nodes → output node]

Computing Output

• The output of each node is the weighted sum of its inputs

• A squashing function (monotonic and differentiable) is then applied to map the values into the range 0 to 1

$$a_{p,j} = \sum_i w_{j,i}\, y_{p,i}$$

$$y_{p,j} = f(a_{p,j}), \qquad f(x) = \frac{1}{1 + e^{-x}}$$

Training a feed forward net

• Training was performed using the SNNS (Stuttgart Neural Network Simulator) package

• Network architecture and weights were exported using the snns2c program from the SNNS package

• Our own Perl programs were used to prepare the data and benchmark the network

Training a feed forward net

• Feed-forward nets are trained using a set of patterns known as the training set; the desired outputs are known in advance

• Every pattern has the same number of elements as there are input nodes

• The goal of on-line training is to find the number of epochs that is optimal for generalisation: the net is well trained on the training set and still recognises the test set well
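Picking the generalisation-optimal epoch can be sketched as below. The per-epoch Q3 values in the test are made up for illustration; in the real setup they would come from benchmarking the network on the test set after every epoch.

```c
/* Early stopping: given the test-set Q3 measured after each epoch,
 * return the epoch (0-based) at which Q3 peaked. */
int best_epoch(const double q3[], int n_epochs) {
    int best = 0;
    for (int e = 1; e < n_epochs; e++)
        if (q3[e] > q3[best])
            best = e;
    return best;
}
```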

Network architecture

• Number of input nodes (window size) – 11 aa

|VLSPADKTNVK|AAWGKVGAHAGEYGAEALERMFLSFP → predicted state: C

|00000VLSPAD|KTNVKAAWGKVGAHAGEYGAEALERM → predicted state: C

11 × 20 = 220 input nodes

Profile values for one window position (20 values, one per amino acid):

-101 -186 -387 -302 -165 -403 -387 341 -302 36 15 -387 -302 -302 -387 -286 -101 545 -387 -186
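A minimal C sketch of building the 220-element input pattern for one residue, assuming simple one-hot encoding; the real method feeds normalized PSI-BLAST profile values into the same 11 × 20 layout, and positions beyond the chain ends (the '0' padding shown above) stay all-zero.

```c
#include <string.h>

#define WIN   11   /* window size in residues */
#define N_AA  20   /* amino acid alphabet size */
static const char AA[] = "ARNDCQEGHILKMFPSTWYV";

/* Fill pattern[WIN * N_AA] for the window centred at position pos. */
void encode_window(const char *seq, int len, int pos,
                   double pattern[WIN * N_AA]) {
    memset(pattern, 0, sizeof(double) * WIN * N_AA);
    for (int k = 0; k < WIN; k++) {
        int p = pos - WIN / 2 + k;       /* sequence index of window cell k */
        if (p < 0 || p >= len) continue; /* zero padding outside the chain */
        const char *hit = strchr(AA, seq[p]);
        if (hit)
            pattern[k * N_AA + (int)(hit - AA)] = 1.0;
    }
}
```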

Network Architecture

Input  Hidden  Output  Q3
300    20      3       71.277
260    20      3       72.742
220    20      3       73.428
180    20      3       70.083

Q3 – percentage of correctly predicted residues

MSE – mean squared error:

$$\mathrm{MSE}(t) = \frac{1}{n}\sum_{i=1}^{n}\bigl(f_i(x_t) - p_i(x_t)\bigr)^2$$

where $f_i(x_t)$ is the network output and $p_i(x_t)$ the desired output for pattern $x_t$.
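A direct C translation of the mean squared error over the output nodes (f is the network output vector, p the desired output vector):

```c
/* MSE = (1/n) * sum_i (f_i - p_i)^2 over the n output nodes. */
double mse(const double f[], const double p[], int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        double d = f[i] - p[i];
        s += d * d;
    }
    return s / n;
}
```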

[Plot: MSE vs. training epoch for input window sizes 180, 220, 260, 300]

Network Architecture

Input  Hidden  Output  Q3
220    40      3       71.567
220    30      3       71.466
220    20      3       73.428
220    10      3       70.345

[Plot: MSE vs. training epoch for 10, 20, 30, 40 hidden nodes]

Preparing data

• Two sets of 118 non-homologous proteins

• Each sequence was used as a query for PSI-BLAST (E-value cut-off 0.001, database nr)

• The resulting profile was normalized using the squashing function

• Secondary structure information for training comes from the PDB database

• DSSP classes were collapsed to three target states:

H, G – helix

E – strand

B, I, S, T, e, g, h – coil

• Two sets – training and test – were built, with about 20000 patterns for the 220/20/3 network architecture
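The collapse of DSSP classes to the three target states can be sketched as:

```c
/* Map an uppercase DSSP class to the three-state target alphabet
 * used for training: H,G -> helix; E -> strand; everything else
 * (B, I, S, T, blank, ...) -> coil. */
char collapse_dssp(char dssp) {
    switch (dssp) {
        case 'H': case 'G': return 'H';  /* helix  */
        case 'E':           return 'E';  /* strand */
        default:            return 'C';  /* coil   */
    }
}
```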

Results

• A C program, using a PSI-BLAST or fake profile, generates the secondary structure prediction in three-letter code or in PSIPRED format (e.g. 0.234 0.135 0.679)

• PHP-based web service (not benchmarked yet)

• Percentage of correctly predicted residues: Q3 = 73.428
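Converting the three output-node values (the PSIPRED-style line above) into a one-letter state amounts to taking the largest value; the H/E/C ordering of the outputs here is an assumption.

```c
/* Pick the predicted state from the three output-node values
 * (assumed order: helix, strand, coil). Ties go to the earlier state. */
char predict_state(double h, double e, double c) {
    if (h >= e && h >= c) return 'H';
    if (e >= c)           return 'E';
    return 'C';
}
```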

Strategies to increase accuracy

• Adding new types of biological information

• Change the way that information is presented to the network

• Post-process the network predictions

• Change the network architecture

Strategies to increase accuracy

Biological Information:

• Hydrophobicity, charge, backbone properties

• Length of chain – additional input

• Distance to the N- and C-terminal residues

• Non-local information (all-alpha, all-beta, etc.)

Strategies to increase accuracy

Pattern level information:

• Increase the number of sequences

• Balance the data (e.g. equal numbers of helices and strands)

• Use window information from other parts of the amino acid sequence

Strategies to increase accuracy

Post processing and filtering:

• Second-level network: structure-to-structure – in development

• Simple voting among multiple networks with different parameters – in development

Strategies to increase accuracy

Changing the architecture:

• Number of hidden nodes – done

• Size of the input window – done

• Other learning algorithms

To do

• Increase prediction accuracy by more than 4%

• Benchmark the method using the external EVA evaluation service

• Implement the method in ORFEUS