Date post: | 21-Dec-2015 |
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction
Rajkumar Bondugula,
Ognen Duzlevski and Dong Xu
Digital Biology Laboratory, Dept. of Computer Science
University of Missouri – Columbia, MO 65211, USA
Outline
Introduction
  Protein secondary structure prediction
  Popular methods
  K-Nearest Neighbor method
  Fuzzy K-Nearest Neighbor method
Methods
  Filtering the prediction
Results and discussion
Summary and Future work
Introduction
Goal: Given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into.
CASP convention: {H, G, I} → H, {B, E} → E, {C, S, T} → C
Example:
Amino Acid           VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH
Secondary Structure  CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
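The eight-to-three state reduction can be sketched in a few lines of Python. This is only an illustration of the CASP convention listed above; the names `EIGHT_TO_THREE` and `reduce_states` are hypothetical and do not come from the authors' software.

```python
# CASP convention from the slide: {H,G,I} -> H, {B,E} -> E, {C,S,T} -> C.
# The dictionary and function names are illustrative assumptions.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "B": "E", "E": "E",
    "C": "C", "S": "C", "T": "C",
}


def reduce_states(eight_state: str) -> str:
    """Map a string of eight-state labels to the three-state alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in eight_state)


print(reduce_states("CSTHGIBEC"))  # CCCHHHEEC
```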
Importance of Secondary Structure
An intermediate step in 3D structure prediction
  Structure → function
Classification
  Ex: α, β, α/β, α+β
Helps in protein folding pathway determination
Existing Methods
Popular Methods
Neural Network methods
  Ex: PSIPRED, PHD
Nearest Neighbor methods
  Ex: NNSSP
Hidden Markov Model methods
Why K-Nearest Neighbors method?
Methods based on Neural Networks and Hidden Markov Models
  Perform well if the query protein has many homologs in the sequence database
  Not easily expandable
The asymptotic error rate of the 1-Nearest Neighbor rule is bounded above by twice the optimal Bayes error rate [Keller et al., 1985]
K-NN should work better as more structures are solved
K-Nearest Neighbor Algorithm
Advantages of Nearest Neighbor methods
  Simple and transparent model
  New structures can be added without re-training
  Linear complexity
Disadvantage
  Slower compared to other models, as processing is delayed until prediction is needed
Why Fuzzy K-NN?
Disadvantages of Crisp K-NN
  Atypical examples are given as much weight as those that truly represent a particular class
  Once an instance is assigned to a class, there is no indication of the "strength" of its membership in that class
Position Specific Scoring Matrix
[Figure: PSI-BLAST converts the query sequence (. . . N L G A G N S G L N L G H V A L T F . . .) into a position-specific scoring matrix with one row per residue (length of protein, l) and 20 columns, one per amino acid: A R N D C Q E G H I L K M F P S T W Y V.]
Why Profile-FKNN?
Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods
An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods
Methods
Calculate profiles using PSI-BLAST
  The popular Rost and Sander database of 126 representative proteins (<25% sequence identity)
Find K-Nearest Neighbors
Calculate the membership values of the neighbors
Calculate the membership values of the current residue
Assign classes
Filter the output
Profile Calculation
The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST
Parameters for PSI-BLAST
  Expectation value (e) = 0.1
  Maximum number of passes (j) = 3
  E-value threshold for inclusion in multi-pass model (h) = 5
  Default values for the rest of the parameters
K-Nearest Neighbors
For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database.
The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors
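$$ d = \sum_{j=1}^{W} \sum_{i=1}^{20} \left| p_{ij}^{\mathrm{Query}} - p_{ij}^{\mathrm{Database}} \right| \cdot \max\{1,\ \min\{j,\ W - j + 1\}\} $$
where j runs over the W positions of the window and i over the 20 profile columns.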
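A minimal Python sketch of the position-weighted absolute distance and the selection of the K nearest profile-windows is given here. It assumes the position weight has the form max(1, min(j, W - j + 1)), which is one reading of the formula on the slide and may differ from the authors' implementation. The function names and the `(window, label)` pairing are hypothetical.

```python
import heapq


def position_weight(j: int, w: int) -> int:
    """Weight of window position j (1-based) in a window of width w.

    Assumed form: larger weights toward the center of the window.
    """
    return max(1, min(j, w - j + 1))


def window_distance(query, database):
    """Position-weighted absolute distance between two profile windows.

    Each window is a list of W rows (one per residue); each row holds
    the profile scores for that residue (20 values in a real PSSM).
    """
    w = len(query)
    total = 0
    for j in range(1, w + 1):
        row_q, row_d = query[j - 1], database[j - 1]
        diff = sum(abs(a - b) for a, b in zip(row_q, row_d))
        total += position_weight(j, w) * diff
    return total


def k_nearest(query, candidates, k):
    """Return the k (distance, label) pairs with the smallest distances.

    `candidates` is an iterable of (window, label) pairs; the label is
    whatever the caller wants to keep, e.g. the window's structure string.
    """
    scored = ((window_distance(query, win), label) for win, label in candidates)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])
```

Because every database window is scored for every query window, the cost grows linearly with the size of the database, which matches the "linear complexity" and "slower at prediction time" remarks made earlier.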
Distance Calculation
[Figure, repeated over several animation frames: a profile window (. . . N L G A G N S G L T F . . .) is compared against a longer profile (. . . N L G A G N S G L N L G H V A L T F . . .) to illustrate the distance calculation.]
Membership Values of the Neighbors
The memberships of the nearest neighbors are assigned based on their corresponding secondary structures in various positions in the window
The residues near the center are weighted more than the residues that are farther away
Example for one neighbor window (N L G A G N S) with secondary structure C C E E E C C; the central residue A is in state E.

Position weight  0.067  0.133  0.20  0.20  0.20  0.133  0.067
Structure        C      C      E     E     E     C      C
H                0      0      0     0     0     0      0
E                0      0      1     1     1     0      0
C                1      1      0     0     0     1      1

H = 0
E = 0.20×1 + 0.20×1 + 0.20×1 = 0.6
C = 0.067×1 + 0.133×1 + 0.133×1 + 0.067×1 = 0.4
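The worked example can be reproduced with a short Python sketch. It assumes the position weights are the integers 1, 2, 3, 3, 3, 2, 1 normalized to sum to one, which is inferred from the rounded values 0.067, 0.133 and 0.20 on the slide. The function name is hypothetical.

```python
def neighbor_memberships(window_structure, weights):
    """Sum the position weights of each state found in a neighbor window."""
    memberships = {"H": 0.0, "E": 0.0, "C": 0.0}
    for state, weight in zip(window_structure, weights):
        memberships[state] += weight
    return memberships


# Assumed integer weights; dividing by their sum (15) gives 0.067, 0.133, 0.20.
raw = [1, 2, 3, 3, 3, 2, 1]
weights = [x / sum(raw) for x in raw]

m = neighbor_memberships("CCEEECC", weights)
print({state: round(value, 3) for state, value in m.items()})
```

For the window C C E E E C C this gives memberships of 0 for Helix, 0.6 for Sheet and 0.4 for Coil, in agreement with the numbers on the slide.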
Membership Value
The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm
Each residue is assigned to the class in which it has the highest membership value
Helix = . . . 15 22 61 91 95 96 26 21 23 18 29 30 24 17 5 8 . . .
Sheet = . . . 22 28 13 1 1 2 8 8 12 11 42 44 46 29 14 10 . . .
Coil = . . . 63 50 26 8 4 2 65 71 65 71 29 26 31 53 81 82 . . .
Final = . . . C C H H H H C C C C E E E C C C . . .
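The class assignment in the table above amounts to taking, at each position, the class with the largest membership value. A small Python sketch follows, using the numbers from the slide. The function name is hypothetical, and the tie-breaking order (Helix, then Sheet, then Coil) is an assumption, since the slides do not say how ties are resolved.

```python
def assign_classes(helix, sheet, coil):
    """Pick the class with the highest membership at each position.

    Ties go to the first class in the order H, E, C (an assumption).
    """
    labels = []
    for h, e, c in zip(helix, sheet, coil):
        best = max((("H", h), ("E", e), ("C", c)), key=lambda pair: pair[1])
        labels.append(best[0])
    return "".join(labels)


helix = [15, 22, 61, 91, 95, 96, 26, 21, 23, 18, 29, 30, 24, 17, 5, 8]
sheet = [22, 28, 13, 1, 1, 2, 8, 8, 12, 11, 42, 44, 46, 29, 14, 10]
coil = [63, 50, 26, 8, 4, 2, 65, 71, 65, 71, 29, 26, 31, 53, 81, 82]

print(assign_classes(helix, sheet, coil))  # CCHHHHCCCCEEECCC
```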
Fuzzy K-Nearest neighbor Algorithm
BEGIN
  Initialize i = 1.
  DO UNTIL (r assigned membership in all classes)
    Compute u_i(r) using
      $$ u_i(r) = \frac{\sum_{j=1}^{K} u_{ij} \left( 1 / d(r, r_j)^{2/(m-1)} \right)}{\sum_{j=1}^{K} \left( 1 / d(r, r_j)^{2/(m-1)} \right)} $$
    Increment i.
  END DO UNTIL
END
Where,
u_i(r) = membership value of residue 'r' in class 'i'
u_ij = membership value of the jth neighbor in class 'i'
i = Helix, Sheet or Coil
d(r, r_j) = distance between the query window centered on residue 'r' and its jth neighbor
m = 2 (fuzzifier)
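The membership update can be sketched in Python as an inverse-distance weighted average of the neighbor memberships, following the Fuzzy K-NN rule of Keller et al. The function name and the small `eps` guard against a zero distance are assumptions added for illustration; the slides do not say how exact matches are handled.

```python
def fuzzy_knn_memberships(neighbor_memberships, distances, m=2.0, eps=1e-9):
    """Membership of the query residue in each class.

    neighbor_memberships: one list per neighbor, e.g. [u_H, u_E, u_C].
    distances: distance from the query window to each neighbor.
    m: fuzzifier (m = 2 on the slide, so the exponent 2/(m-1) is 2).
    eps: assumed lower bound on distances to avoid division by zero.
    """
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / max(d, eps) ** exponent for d in distances]
    total = sum(weights)
    n_classes = len(neighbor_memberships[0])
    return [
        sum(w * u[i] for w, u in zip(weights, neighbor_memberships)) / total
        for i in range(n_classes)
    ]


# Two neighbors: a pure helix at distance 1 and a pure sheet at distance 2.
print(fuzzy_knn_memberships([[1, 0, 0], [0, 1, 0]], [1.0, 2.0]))
```

With m = 2 the closer neighbor receives four times the weight of the one twice as far away, so the example yields memberships of 0.8 for Helix, 0.2 for Sheet and 0 for Coil.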
Structure Filtration
In the basic setting, the secondary structure state is the class with the highest membership value
Unrealistic structures may be present
Popular methods of structure filtration
  Neural Network
  Heuristic based
Heuristic Filter
1. Smooth the membership values: $m_n = 0.25\,m_{n-1} + 0.5\,m_n + 0.25\,m_{n+1}$
2. Filter unrealistic structures: Helix > 3 amino acids, β-sheet > 2 amino acids
3. Calculate the thresholds to filter noise
4. Mark the possible Helix and Sheet regions; resolve conflicts based on the average membership value in the overlap region
5. Fill the rest of the structure with Coil
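The smoothing in step 1 is a three-point weighted average along the sequence. A minimal Python sketch is shown here; the function name is hypothetical, and leaving the first and last positions unchanged is an assumption, since the slides do not describe the boundary handling.

```python
def smooth(memberships):
    """Three-point smoothing: 0.25 * previous + 0.5 * current + 0.25 * next.

    The two end positions are copied unchanged (assumed boundary rule).
    """
    n = len(memberships)
    out = list(memberships)
    for i in range(1, n - 1):
        out[i] = (
            0.25 * memberships[i - 1]
            + 0.5 * memberships[i]
            + 0.25 * memberships[i + 1]
        )
    return out


print(smooth([0.0, 1.0, 0.0, 0.0]))  # [0.0, 0.5, 0.25, 0.0]
```

Applied separately to the Helix, Sheet and Coil membership tracks, this damps isolated single-residue spikes before the length and threshold filters are applied.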
Filter: Final Structure
Unfiltered  CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
Target      CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
Filtered    CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC
Metrics
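Seven commonly used metrics
$$ Q_3 = \frac{\text{Number of correctly predicted residues}}{\text{Total number of residues}} \times 100 $$
$$ Q_{\langle H,E,C \rangle} = \frac{\text{Number of } \langle\text{helix, sheet, coil}\rangle \text{ residues correctly predicted}}{\text{Total number of residues in } \langle\text{helix, sheet, coil}\rangle} \times 100 $$
Matthews Correlation Coefficient
$$ MCC_{\langle H,E,C \rangle} = \frac{pn - uo}{\sqrt{(n+u)(n+o)(p+u)(p+o)}} $$
where, p = true positives, n = true negatives, u = false negatives, o = false positives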
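These metrics are straightforward to compute from a predicted and a target structure string of equal length. The Python sketch below is illustrative only; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention not stated on the slides.

```python
import math


def q3(predicted, target):
    """Percentage of residues whose predicted state matches the target."""
    correct = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * correct / len(target)


def q_class(predicted, target, state):
    """Percentage of target residues in `state` that are predicted correctly."""
    total = sum(t == state for t in target)
    correct = sum(p == t == state for p, t in zip(predicted, target))
    return 100.0 * correct / total if total else 0.0


def mcc(predicted, target, state):
    """Matthews correlation coefficient for one state (H, E or C)."""
    p = n = u = o = 0
    for pred, true in zip(predicted, target):
        if pred == state and true == state:
            p += 1  # true positive
        elif pred != state and true != state:
            n += 1  # true negative
        elif true == state:
            u += 1  # false negative (missed)
        else:
            o += 1  # false positive (over-predicted)
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0


print(q3("HHEC", "HHCC"), q_class("HHEC", "HHCC", "C"), mcc("HHEC", "HHCC", "H"))
```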
Results
            Q3(%)  QH(%)  QE(%)  QC(%)  MCC_H  MCC_E  MCC_C
Unfiltered  74.0   69.6   55.8   79.9   0.58   0.61   0.54
Filtered    76.2   68.1   66.1   80.4   0.64   0.64   0.56
Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES [1] server
1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
Relative Performance
Method     Accuracy
MBR [1]    66.40
NN [2]     68.00
NNSSP [3]  72.20
PFKNN      76.20
1. X. Zhang, J. P. Mesirov and D. L. Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992.
2. Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993.
3. A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995.
Summary
A novel approach for PSSP
  Evolutionary information
  K-Nearest Neighbor algorithm
  Fuzzy set theory
Most accurate KNN approach to date
Easily expandable
Accuracy increases with new structures
Average computing time < 1 min on a single CPU machine
Future Work
System with faster search capabilities
  Efficient search for neighbors
Accurate prediction system
Acknowledgements
Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm
Oak Ridge National Laboratory for providing the supercomputing facilities
Members of Digital Biology Laboratory for their support
Software
The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to