Protein Secondary Structure Classifier Fusion
By: Majid Kazemian
Advisor: Prof. Behzad Moshiri
Co-Advisor: Prof. Caro Lucas
University of Tehran, College of Engineering
School of Electrical and Computer Engineering
April 2006
Agenda
• Introduction
• Problem statement
• Expertness assessment of protein secondary structure prediction engines
• Meta classifier
– Definition and general architecture
– Classifier types
– Combining approaches
– Decision profile
– Accuracy measures
– Results
• Data fusion simulation tool (SDF tool)
• Conclusion and further research
Introduction
• Protein
– A large, complex molecule composed of amino acids.
– Proteins are essential to the structure, function, and regulation of the body.
– Examples are hormones, enzymes, and antibodies.
• Amino acid
– The fundamental building block of a protein molecule.
– There are 20 different amino acids commonly found in proteins.
[Figure: general structure of an amino acid: a central Cα atom bonded to a hydrogen atom, an amino group, a carboxyl group, and a side chain (R)]
Introduction (cont.)
• Proteins are complex macromolecules whose structure can be described at four levels:
– Primary (the sequence of amino acids linked together)
– Secondary (alpha helices, beta sheets, random coils)
– Tertiary (the way the random coils, alpha helices, and beta sheets fold with respect to each other)
– Quaternary (packing of the protein subunits to form the final protein complex)
• Structure → Function
• The accumulation of protein sequences, especially since the start of whole-genome sequencing projects, is in stark contrast to the much slower experimental determination of protein structures.
Problem statement
• A variety of approaches exist for protein structure prediction, each with its own strengths and weaknesses.
• One of the most useful approaches is to determine the secondary structural elements as a first step toward 3D structure determination. There are three main classes of secondary structural elements in proteins: alpha helices (H), beta strands (E), and irregular structures (turns or coils, denoted C), so prediction engines can be treated as structural classifiers.
• Goals of combining these classifiers:
– Provide unified access to data for users (biologists).
– Achieve a higher overall accuracy than any of the individual classifiers.
• A standard tool for simulating and prototyping fusion systems
Expertness assessment of protein secondary structure prediction engines
• The first level
– Deals with the prediction accuracy over the whole secondary structure content of proteins, regardless of the prediction accuracy for each secondary structure class
– The Q3 criterion is one of the most important criteria at this level
– This level is known as the global measurement level
• The second level
– Describes the prediction accuracy for each secondary structure class separately
– Allows prediction methods to be analyzed in more detail than the first level
– The QH, QE, and QC criteria are introduced at this level
Expertness assessment of protein secondary structure prediction engines (cont.)
• The third level
– Evaluates the prediction accuracy for each secondary structure class per amino acid
– Reveals more detail about prediction accuracy than the second level

$$Q_{index}^{aa} = \frac{\text{number of correctly predicted amino acid } aa \text{ in structure } index}{\text{number of amino acid } aa \text{ in structure } index} \times 100$$

where $index \in \{\text{Helix}, \text{Beta strand}, \text{Coil}\}$ and $aa \in$ list of 20 amino acids.

CASP (Critical Assessment of Techniques for Protein Structure Prediction) is commonly described as a competition for protein structure prediction that takes place every two years. CASP gives research groups an opportunity to assess the quality of their methods for predicting protein structure from the primary structure of the protein.
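The third-level measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `per_residue_accuracy` and the toy sequences are invented for the example, and it assumes the amino acid sequence, the observed states, and the predicted states are aligned strings of equal length.

```python
from collections import Counter

def per_residue_accuracy(sequence, observed, predicted):
    """Return {(amino_acid, state): Q} where Q is a percentage.

    Q = 100 * (residues of this amino acid observed in this state and
               predicted in the same state)
            / (residues of this amino acid observed in this state)
    """
    totals, correct = Counter(), Counter()
    for aa, obs, pred in zip(sequence, observed, predicted):
        totals[(aa, obs)] += 1
        if obs == pred:
            correct[(aa, obs)] += 1
    return {key: 100.0 * correct[key] / n for key, n in totals.items()}

# Toy data, invented for illustration only.
seq = "AAGAV"
obs = "HHCHE"
pred = "HCCHE"
q = per_residue_accuracy(seq, obs, pred)
print(q[("A", "H")])  # alanine in helix: 2 of 3 residues predicted correctly
```

Pairs that never occur in the observed data are simply absent from the result, which avoids a division by zero.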
Expertness assessment of protein secondary structure prediction engines (cont.)
• Single global propensity
– Single Global Propensity (SGP) shows the tendency or hindrance of a certain amino acid to occur in a certain secondary structure.
– A large SGP value of an amino acid for a specific secondary structure means that the amino acid tends to adopt that structure rather than the other secondary structures.

$$SGP_{index}^{aa_i} = \frac{n_{index}^{aa_i} / n_{index}^{aa_{all}}}{n_{dataset}^{aa_i} / n_{dataset}^{aa_{all}}}$$

To calculate the single global propensity, we used the Hobohm and Sander representative set of protein chains with homology below 25%. These structures were resolved by X-ray crystallography at a resolution better than 3 Å.
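A propensity of this kind can be sketched as follows. The sketch assumes that the n terms are residue counts: amino acid i within a structure class, all residues within that class, amino acid i in the whole data set, and all residues in the data set. The function name and the toy data are hypothetical.

```python
from collections import Counter

def single_global_propensity(sequence, structure):
    """Return {(amino_acid, state): SGP} for aligned sequence/structure strings.

    SGP = (count of aa in state / count of all residues in state)
        / (count of aa in data set / count of all residues in data set)
    """
    aa_in_state = Counter(zip(sequence, structure))
    state_total = Counter(structure)
    aa_total = Counter(sequence)
    n_all = len(sequence)
    return {
        (aa, state): (count / state_total[state]) / (aa_total[aa] / n_all)
        for (aa, state), count in aa_in_state.items()
    }

# Toy data, invented for illustration only.
seq = "AAAAGGGG"
ss = "HHHCCCCH"
sgp = single_global_propensity(seq, ss)
print(sgp[("A", "H")])  # > 1: alanine is over-represented in helix here
```

A value above 1 indicates a tendency toward the structure and a value below 1 a hindrance, under the counting assumption stated above.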
Expertness index assessment
[Figure: three bar-chart panels, one per secondary structure class (H, E, C). x-axis: the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y); y-axis: 0 to 1. Series: engines 385, 258, 150, 111, and 31, plus the SGP value for the class (SGP H, SGP E, SGP C)]

CASP4 (2000)
Engine index | Name | Method | CASP4 Q3
031 | BioInfo.PL | BASIC | 79.73
111 | SAM-T99 | HMM | 78.14
150 | Chandonia-Cohen | NN+F | 72.63
151 | Pred2ary/Chandonia | NN | 72.74
258 | PSIPRED | NN | 78.49
385 | Prof-server | NN | 75.66
Expertness index assessment (cont.)
[Figure: bar-chart panels for the secondary structure classes (H, E, C). x-axis: the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y). Series: engines 1, 20, 23, 68, 72, 100, 108, 357, 393, and 517, plus the SGP value for the class (SGP H, SGP E, SGP C)]

CASP5 (2002)
Engine index | Name | Method | CASP5 Q3
001 | SAM-T02-human | NNHMM | 79.91
020 | Bujnicki-Janusz | Consensus | 81.79
023 | Baldi-SSpro | RNN | 79.61
068 | Jones-NewFold | NN+F | 79.83
072 | PSIPRED | NN | 79.96
100 | SBI | NN | 79.6
108 | CaspIta | Consensus | 81.04
357 | PROF/Rost | NN | 77.72
393 | MacCallum | GP+NN | 79.97
517 | GeneSilico | Consensus | 80.77
Analysis results
[Figure: diagram of the amino acids grouped as hydrophobic and polar (A, V, I, L, M, F, C, T, Y, W, H, K, R, E, Q, D, N, S, P, G)]
• Lower accuracy in beta strand prediction
• Agreement between SGP values and prediction accuracy
• Different prediction behavior

Structure | CASP4 average | CASP4 std. dev. | CASP5 average | CASP5 std. dev.
H | 0.7768 | 0.0974 | 0.8287 | 0.0812
E | 0.5916 | 0.1296 | 0.6973 | 0.1285
C | 0.7578 | 0.0801 | 0.7654 | 0.0680
Meta Classifiers
• A meta classifier combines the results of multiple classifiers at once, returning more accurate and less confusing results.
• A meta classifier can improve classification quality in several ways:
– Accuracy and efficiency
– A single classification result (one-click paradigm)
• Data fusion deals with the synergistic combination of information made available by various knowledge sources, such as sensors and databases, in order to provide a better understanding of a given scene or environment.
• How to combine the individual classifiers?
Architecture of meta-classifier
[Figure: block diagrams of the meta-classifier architecture, including a simplified version. Blocks shown: user, user interface (query, display, feedback), dispatcher, classifiers C1, C2, …, CN, pre-processing, fusion, post-processing, display, biological insights, statistics and expertness, personalized knowledge]
• Current environment: data is hard to find, hard to understand, hard to reconcile, and hard to analyze.

DSSP* states and the classes used
DSSP* | Used
H | H
G | H
I | H
E | E
B | E
T | L
S | L
' ' | L
* Dictionary of Secondary Structure of Proteins
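The reduction in the table above amounts to a character-by-character lookup. The following is a minimal sketch; the names `DSSP_TO_3STATE` and `reduce_dssp` are made up for the example. The table writes the third class as L, whereas other slides write coil as C, so the letter can be swapped if a single alphabet is needed.

```python
# Eight DSSP states mapped to the three classes listed in the table.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "L", "S": "L", " ": "L",   # turns, bends, and unassigned residues
}

def reduce_dssp(dssp_string):
    """Translate a DSSP assignment string into the three-state alphabet."""
    return "".join(DSSP_TO_3STATE[state] for state in dssp_string)

reduced = reduce_dssp("HGIEBTS ")
print(reduced)
```

An unknown state character raises a `KeyError`, which makes unexpected input visible instead of silently mapping it to coil.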
Types of Classifiers & Combination Approaches
• Type I: the classification is a simple statement of the class (abstract-level outputs) → majority voting.
• Type II: the classification is a ranked list of probable classes (rank-level outputs) → Borda voting.
• Type III: the classification assigns a probability to each class (measurement-level outputs) → any fusion technique.
• Most protein secondary structure classifiers are Type I classifiers.
• How can measurement-level outputs be extracted from abstract-level classifications?
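As a baseline for the Type I case, majority voting over per-residue labels can be sketched as below. The function names and the three toy predictions are hypothetical, and the vote is a plurality vote: the most frequent label wins even without an absolute majority, with ties resolved in favor of the label seen first.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def fuse_sequences(predictions):
    """Fuse aligned prediction strings position by position."""
    return "".join(majority_vote(column) for column in zip(*predictions))

# Three toy abstract-level predictions for a four-residue chain.
fused = fuse_sequences(["HHEC", "HEEC", "CHEC"])
print(fused)
```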
Measurement level from Abstract level Classifications
• Confusion matrix (a posteriori probabilities of each classification)
• A confusion matrix is a matrix in which the actual class of a datum under test is represented by the matrix row, and the classification of that datum is represented by the matrix column.
• When classifier Dj outputs class i as its classification label, the normalized values of column i of Dj's confusion matrix are taken as the decision vector of Dj.

CONFUSION MATRIX FOR A CLASSIFIER
 | Classified as H | Classified as E | Classified as C
Actual class H | 1600 | 100 | 300
Actual class E | 200 | 1200 | 600
Actual class C | 100 | 500 | 1400
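The column normalization described above can be illustrated with the example matrix from this slide. This is a sketch under the stated reading (rows are actual classes, columns are assigned classes); the function name `decision_vector` is an assumption of the example.

```python
CLASSES = ("H", "E", "C")

# Rows: actual class; columns: class assigned by the classifier.
confusion = [
    [1600, 100, 300],
    [200, 1200, 600],
    [100, 500, 1400],
]

def decision_vector(matrix, predicted_label):
    """Normalize the column of the predicted label so that it sums to one.

    The result estimates the support for each actual class (H, E, C)
    given that the classifier answered `predicted_label`.
    """
    col = CLASSES.index(predicted_label)
    column = [row[col] for row in matrix]
    total = sum(column)
    return [value / total for value in column]

support_given_e = decision_vector(confusion, "E")
print(support_given_e)  # support for H, E, C when the classifier says E
```

In this way an abstract-level (Type I) answer is turned into a measurement-level (Type III) vector that the fusion operators can consume.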
Decision Profile
• Let D1, D2, D3, …, DL be the set of our protein secondary structure classifiers and let Ω = {H, E, C} be the set of class labels, which are the main secondary structural elements. The output of each secondary structure classifier can be treated as a three-dimensional vector:

$$D_i(x) = [d_{i,H}(x),\ d_{i,E}(x),\ d_{i,C}(x)], \quad d_{i,j}(x) \in [0, 1]$$

• Here $d_{i,j}(x)$ is the degree of support given by classifier Di to the hypothesis that x comes from class j. The decision profile is the matrix of soft labels (shown for five classifiers):

$$DP(x) = \begin{bmatrix}
d_{1,H}(x) & d_{1,E}(x) & d_{1,C}(x) \\
d_{2,H}(x) & d_{2,E}(x) & d_{2,C}(x) \\
d_{3,H}(x) & d_{3,E}(x) & d_{3,C}(x) \\
d_{4,H}(x) & d_{4,E}(x) & d_{4,C}(x) \\
d_{5,H}(x) & d_{5,E}(x) & d_{5,C}(x)
\end{bmatrix}$$

• Row 3 of DP(x) is the output of classifier D3(x).
• The E column of DP(x) is the support from classifiers D1, D2, …, D5 for class E.
Paradigms for designing a classifier combination system
[Figure: classifiers 1, …, i, …, L feeding a combiner, with the four design paradigms marked]
• A. Different combination schemes
• B. Different classifier models
• C. Different feature subsets
• D. Different training sets
Classifier fusion methods
[Figure: taxonomy of classifier fusion methods by type of output]
• Class labels
– Plurality [Hansen 1990]
– Majority [Lam 1997]
– Unanimity
– Naïve Bayes [Xu 1992]
– BKS [Huang 1995]
– Wernecke [Wernecke 1992]
• Rank
– Borda voting
– Weighted Borda voting
• Decision profile, class indifferent
– Brute force [Kuncheva 2000b]
– Stacked generalization [Wolpert 1992]
– Dempster-Shafer [Rogova 1994]
– Decision templates [Kuncheva 2001]
• Decision profile, class conscious
– Order statistics: Minimum, Maximum, Median, OWA [Kuncheva 1997a]
– Nonlinear models: Product, Geometric mean, Fuzzy integral [Verikas 1999], Neural networks [Gader 1996]
– Linear models: fixed weights; data-dependent weights (soft weights, crisp weights), with Classifier selection [Kuncheva 2000a], Mixture of Experts [Jordan 1995], [Chibelushi 1999], [Hashem 1997], [Tresp 1995, Verikas 1999]
Techniques of Data Fusion
• Conventional approaches
– OWA
– Bayesian approaches
– Dempster-Shafer method
– Kalman filter
• Knowledge-based systems & intelligent approaches
– Neural network
– Fuzzy logic (integral operators)
[Figure: two information sources, Source 1 and Source 2, providing redundant information and complementary information]
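Ordered Weighted Averaging
• Introduced by Yager (1988) as a mapping F: R^n → R:

$$OWA(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{n} w_j \, x_{\sigma(j)}$$

• σ is a permutation that orders the elements: $x_{\sigma(1)} \ge x_{\sigma(2)} \ge \ldots \ge x_{\sigma(n)}$
• The weights are all non-negative ($w_i \ge 0$) and their sum equals one: $\sum_{i=1}^{n} w_i = 1$
• Provides a parameterized family of aggregation operators:
– Maximum, minimum
– Median, k-order statistics
– Arithmetic mean
– …
• The result always lies between the smallest and the largest argument:

$$Min(x_1, x_2, \ldots, x_n) \le OWA(x_1, x_2, \ldots, x_n) \le Max(x_1, x_2, \ldots, x_n)$$

• A parameterized way to go from the min to the max, measured by the orness:

$$orness(w_1, w_2, \ldots, w_n) = \frac{1}{n-1} \sum_{i=1}^{n} (n - i) \, w_i$$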
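A small Python sketch of the OWA operator and its orness follows. It assumes the descending-order convention, in which the first weight applies to the largest argument; under that convention orness is 1 for the maximum and 0 for the minimum. The function names are chosen for the example.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to the values sorted descending."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, ordered))

def orness(weights):
    """orness(w) = 1/(n-1) * sum_{i=1..n} (n - i) * w_i, for n >= 2."""
    n = len(weights)
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)

x = [0.2, 0.9, 0.5]
owa_max = owa(x, [1.0, 0.0, 0.0])    # all weight on the largest value
owa_min = owa(x, [0.0, 0.0, 1.0])    # all weight on the smallest value
owa_mean = owa(x, [1 / 3] * 3)       # equal weights give the arithmetic mean
print(owa_min, owa_mean, owa_max)
```

The three weight vectors show how a single operator covers the range from minimum to maximum.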
Exponential class of OWA
• Proposed by Filev (1998)
• Satisfies a given degree of maxness
• Optimistic weights:

$$w_1 = \alpha;\ w_2 = \alpha(1-\alpha);\ w_3 = \alpha(1-\alpha)^2;\ \ldots;\ w_{n-1} = \alpha(1-\alpha)^{n-2};\ w_n = (1-\alpha)^{n-1}$$

• Pessimistic weights:

$$w_1 = \alpha^{n-1};\ w_2 = (1-\alpha)\alpha^{n-2};\ w_3 = (1-\alpha)\alpha^{n-3};\ \ldots;\ w_{n-1} = (1-\alpha)\alpha;\ w_n = (1-\alpha)$$

• The parameter α belongs to the unit interval [0, 1] and is related to the orness value for a given number of sources.
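The two exponential weight families can be generated directly from their definitions. The sketch below is illustrative only; the function names are hypothetical, and n is assumed to be at least 2.

```python
def optimistic_weights(alpha, n):
    """w_i = alpha * (1 - alpha)**(i - 1) for i < n, and w_n = (1 - alpha)**(n - 1)."""
    weights = [alpha * (1.0 - alpha) ** i for i in range(n - 1)]
    weights.append((1.0 - alpha) ** (n - 1))
    return weights

def pessimistic_weights(alpha, n):
    """w_1 = alpha**(n - 1), and w_i = (1 - alpha) * alpha**(n - i) for i >= 2."""
    weights = [alpha ** (n - 1)]
    weights += [(1.0 - alpha) * alpha ** (n - i) for i in range(2, n + 1)]
    return weights

w_opt = optimistic_weights(0.3, 5)
w_pes = pessimistic_weights(0.3, 5)
print(sum(w_opt), sum(w_pes))  # both families sum to one up to rounding
```

With α = 1 the optimistic family puts all weight on the first (largest) element, and with α = 0 the pessimistic family puts all weight on the last (smallest) element.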
Alpha-Orness
[Figure: relation between the parameter α and the orness value]
OWA operator for classifier fusion
(1) Pick L OWA coefficients such that

$$b = [b_1, \ldots, b_L]^T, \quad \sum_{i=1}^{L} b_i = 1$$

(2) For k = 1, 2, …, c
a. Sort $d_{i,k}(x)$, i = 1, 2, …, L, in descending order, so that

$$a_1 = \max_i d_{i,k}(x) \quad \text{and} \quad a_L = \min_i d_{i,k}(x)$$

b. Calculate the support for class $\omega_k$:

$$\mu_k(x) = \sum_{i=1}^{L} b_i \, a_i$$

where L = number of classifiers
Kuncheva 2001: Combining classifiers: Soft computing solutions
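The two steps can be sketched as follows, with the decision profile given as a list of rows (one row per classifier, one column per class). The profile values and coefficients are invented for illustration, and `owa_fuse` is a hypothetical name.

```python
CLASSES = ("H", "E", "C")

def owa_fuse(decision_profile, coefficients):
    """Return the OWA support for each class (one value per column)."""
    if abs(sum(coefficients) - 1.0) > 1e-9:
        raise ValueError("OWA coefficients must sum to one")
    supports = []
    for k in range(len(CLASSES)):
        # Step 2a: sort the k-th column in descending order.
        column = sorted((row[k] for row in decision_profile), reverse=True)
        # Step 2b: weighted sum of the sorted supports.
        supports.append(sum(b * a for b, a in zip(coefficients, column)))
    return supports

# Toy decision profile for three classifiers (rows) and classes H, E, C.
dp = [
    [0.7, 0.1, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]
mu = owa_fuse(dp, [0.5, 0.3, 0.2])
label = CLASSES[mu.index(max(mu))]
print(mu, label)
```

The predicted class is the one with the largest fused support.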
Fuzzy Integral Operator
[Figure: relations between various aggregation operators and the fuzzy integral. Operators shown: Sugeno integral, Choquet integral, OWA, weighted sum, arithmetic mean, order statistic, min, med, max]
Fuzzy integral for classifier fusion
(1) Fix the L fuzzy densities $g^1, g^2, \ldots, g^L$, e.g., by setting $g^j$ to the estimated probability of correct classification of Dj.
(2) Calculate λ from the equation

$$\lambda + 1 = \prod_{j=1}^{L} (1 + \lambda g^j), \quad \lambda \ne 0$$

(3) For a given x, sort the kth column of DP(x) in descending order to obtain

$$[d_{i_1,k}(x),\ d_{i_2,k}(x),\ \ldots,\ d_{i_L,k}(x)]^T$$

(4) Sort the fuzzy densities correspondingly, i.e., $g^{i_1}, g^{i_2}, \ldots, g^{i_L}$, and set $g(1) = g^{i_1}$.
(5) For t = 2 to L, calculate recursively

$$g(t) = g^{i_t} + g(t-1) + \lambda \, g^{i_t} \, g(t-1)$$

(6) Calculate the final degree of support for class $\omega_k$ by

$$\mu_k(x) = d_{i_L,k}(x) + \sum_{j=2}^{L} \left[ d_{i_{j-1},k}(x) - d_{i_j,k}(x) \right] g(j-1)$$

Kuncheva 2001: Combining classifiers: Soft computing solutions
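The procedure can be sketched in plain Python for one class column. Several points are assumptions of this sketch rather than statements from the slides: λ is found by bisection, the densities are assumed to lie strictly between 0 and 1, and the Choquet integral is evaluated in the form Σ_t d_(t) [g(t) − g(t−1)] with the supports sorted in descending order and g(0) = 0. Function names are hypothetical.

```python
from math import prod

def solve_lambda(densities):
    """Non-zero root (> -1) of prod(1 + lam * g_j) = 1 + lam, by bisection."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already additive; no non-zero root is needed

    def f(lam):
        return prod(1.0 + lam * g for g in densities) - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-12, -1e-9      # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0                # root lies in (0, inf)
        while f(hi) < 0.0:
            hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (f(lo) < 0.0) == (f(mid) < 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def choquet_support(supports, densities):
    """Choquet integral of one decision-profile column w.r.t. the lambda-measure."""
    lam = solve_lambda(densities)
    order = sorted(range(len(supports)), key=lambda i: supports[i], reverse=True)
    mu, g_prev = 0.0, 0.0
    for i in order:
        g_t = densities[i] + g_prev + lam * densities[i] * g_prev
        mu += supports[i] * (g_t - g_prev)
        g_prev = g_t
    return mu

lam = solve_lambda([0.4, 0.4])
mu_k = choquet_support([0.9, 0.1], [0.4, 0.4])
print(lam, mu_k)
```

For two classifiers with densities 0.4 each, the defining equation reduces to 0.16λ² − 0.2λ = 0, so the non-zero root is λ = 1.25 and the measure of both classifiers together is 1.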
Dempster-Shafer Evidential Reasoning
• For k = 1, 2, …, c
– Calculate the support for class $\omega_k$ from the Dempster combination rule:

$$\mu_k(x) = \frac{\prod_{i=1}^{L} d_{i,k}(x)}{\sum_{k'} \prod_{i=1}^{L} d_{i,k'}(x)}$$
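When every classifier assigns its support to single classes only, the Dempster combination reduces to a normalized product over classifiers, which the following sketch computes. The decision profile is toy data and `dempster_fuse` is a hypothetical name.

```python
from math import prod

def dempster_fuse(decision_profile):
    """Normalized product of the supports in each class column."""
    n_classes = len(decision_profile[0])
    products = [prod(row[k] for row in decision_profile) for k in range(n_classes)]
    total = sum(products)
    if total == 0.0:
        raise ValueError("total conflict: every class has zero combined support")
    return [p / total for p in products]

# Toy decision profile: three classifiers (rows), classes H, E, C (columns).
dp = [
    [0.7, 0.1, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]
mu = dempster_fuse(dp)
print(mu)
```

A single zero in a column removes that class entirely, so in practice the supports are often kept strictly positive.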
Protein Secondary Structure classifiers
Some of the common classifiers
Server | Location | Prediction method
APSSP2 | Institute of Microbial Technology, India | EBL + neural network
PROFSEC | Columbia University, USA | Profile-based neural network
PSIPRED | University College London, UK | Neural network
SAMT99_SEC | University of California, Santa Cruz, USA | Hidden Markov model
SSPRO2 | University of California, Irvine, USA | Recurrent neural network

Dataset statistics
Dataset name | Date of publication | Number of proteins | Number of residues | Web address
EVA: Common subset 1 | Cumulative results from 1999 to 2002/10 | 30 | More than 4000 | http://cubic.bioc.columbia.edu/eva/sec_2002_10/common.html#common_set1
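Meta Classifier Architecture
[Figure: the data set is passed to the classifiers (Classifier 1, Classifier 2, …, Classifier N; here APSSP2, PROFSEC, PSIPRED, SAMT99_SEC, and SSPRO2), whose outputs form the rows DP(1) to DP(5) of the decision profile. A fusion operator produces the fused DP (confidence H, confidence E, confidence C), and argmax over it gives the predicted class (H, E, or C)]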
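Some of the Accuracy Measures

$$Q_3 = \frac{1}{N_{res}} \sum_{i=1}^{3} M_{ii} \times 100$$

$$Q_i^{\%obs} = \frac{M_{ii}}{obs_i} \times 100$$

$M_{ij}$ = number of residues observed in state i and predicted in state j, where i, j ∈ {H, E, C}

Confusion matrix (rows: real class; columns: classified class; both ordered H, E, C)

$$\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}$$

[Figure: the confusion matrix partitioned into TP, FN, FP, and TN regions for one class]

$$Sensitivity = \frac{TP}{TP + FN}, \qquad Specificity = \frac{TN}{TN + FP}$$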
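These measures can be computed from a 3×3 confusion matrix as sketched below. The matrix reuses the example values from the confusion-matrix slide; the function names are hypothetical, and obs_i is read as the row total of class i (the number of residues observed in that state), which is an assumption of the sketch.

```python
CLASSES = ("H", "E", "C")

# M[i][j]: residues observed in state i and predicted in state j.
M = [
    [1600, 100, 300],
    [200, 1200, 600],
    [100, 500, 1400],
]

def q3(matrix):
    """Percentage of residues predicted in their observed state."""
    n_res = sum(sum(row) for row in matrix)
    return 100.0 * sum(matrix[i][i] for i in range(len(matrix))) / n_res

def q_obs(matrix, i):
    """Percentage of residues observed in state i that are predicted as i."""
    return 100.0 * matrix[i][i] / sum(matrix[i])

def sensitivity_specificity(matrix, i):
    """One-vs-rest sensitivity and specificity for class i."""
    n_res = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp
    fp = sum(row[i] for row in matrix) - tp
    tn = n_res - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

sens_h, spec_h = sensitivity_specificity(M, CLASSES.index("H"))
print(q3(M), q_obs(M, 0), sens_h, spec_h)
```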
The results of EVA1 dataset prediction by common selected engines
Engine | Q3 | QH %obs | QE %obs | QC %obs | Specificity | Sensitivity
APSSP2 | 74.49 | 78.00 | 65.65 | 77.01 | 87.04 | 74.08
PROFSEC | 74.71 | 75.38 | 74.48 | 74.05 | 87.12 | 74.24
PSIPRED | 74.78 | 78.53 | 68.25 | 75.67 | 87.18 | 74.36
SAMT99_SEC | 74.63 | 82.60 | 63.12 | 75.06 | 87.13 | 74.27
SSPRO2 | 73.58 | 78.14 | 62.79 | 76.45 | 86.60 | 73.21
[Figure: bar chart of accuracy, specificity, and sensitivity (y-axis 70 to 90) for PSIPRED, APSSP2, PROFSEC, SAMT99_SEC, and SSPRO2]
Results
[Figure: bar chart of sensitivity, specificity, and accuracy (y-axis 0.7 to 0.9) for Choquet, OWA, Dempster-Shafer, majority voting, APSSP2, PROFSEC, PSIPRED, SAMT99_SEC, and SSPRO2]
Method | Sensitivity | Specificity | Accuracy
Choquet | 0.767152256 | 0.883576128 | 0.7702
OWA | 0.759633459 | 0.879816729 | 0.7647
Dempster-Shafer | 0.757518797 | 0.878759398 | 0.7642
Majority voting | 0.757753759 | 0.87887688 | 0.7568
APSSP2 | 0.740836466 | 0.870418233 | 0.7449
PROFSEC | 0.742481203 | 0.871240602 | 0.7471
PSIPRED | 0.743656015 | 0.871828008 | 0.7478
SAMT99_SEC | 0.742716165 | 0.871358083 | 0.7463
SSPRO2 | 0.732142857 | 0.866071429 | 0.7358
Results (cont.)
[Figure: bar chart of Q3, H, E, and C (y-axis 65 to 85) for Choquet, OWA, Dempster-Shafer, majority voting, and PSIPRED]
Method | Q3 | H | E | C
Choquet | 77.02 | 84.43 | 72.58 | 73.43
OWA | 76.47 | 84.31 | 73.08 | 71.92
Dempster-Shafer | 76.42 | 81.12 | 71.52 | 75.81
Majority voting | 75.68 | 80.17 | 66.37 | 77.68
PSIPRED | 74.78 | 78.53 | 68.25 | 75.67
Sensor/Data Fusion Design Pattern
• Today it is difficult to imagine system development without the help of simulation.
• Simulation is a very important tool in all fields of science and engineering.
• It provides a better understanding of how a system functions, supports testing and finding critical states and regions of operation, and enables fast optimization of the system and its control.
• Patterns help you build on the collective experience of skilled system engineers.
• They capture existing, well-proven experience in system development and help to promote good design practice.
• Every pattern deals with a specific, recurring problem in the design or implementation of a system.
Design patterns
• System-level design requirements
– Describe system architecture
– Model different levels
– Model different components
– Simulate and test
– Re-use design
Data Fusion Design Pattern & Simulation Tool
• Motivations
– Lack of standardization
– Various common modules found in fusion algorithms
– Reduce programming effort
– Save time and energy
• Intent
– Development of high-level graphical system modeling using powerful modeling environments
• Functionalities offered by this view
– Rapid and accurate system design
– Modularity, which means that new components can be easily added
– Reusability and ease of use, which are the most important aims of design patterns
Data Fusion Design Pattern & Simulation Tool (cont.)
• Applicability
– Defense, robotics, medical engineering, air and space
– Soft computing, agriculture, process control, fire control systems
– Biology, bioinformatics
• Implementation
– Simulink platform using the MATLAB S-function technique
• Simulink is an interactive software package to model, simulate, and analyze dynamic systems.
• It is a graphical, mouse-driven program that allows systems to be modeled by drawing a block diagram on the screen.
[Figure: MATLAB and Simulink]
Data Fusion Design Pattern & Simulation Tool (cont.)
• Inter-relationship
– Each block has one output, an input vector, and a set of parameters
– Each block converts the input vector to an output
– Each block's output can be used as an input of other blocks
• Interactive link (wrapper) between VTB and Matlab/Simulink
Data Fusion Design Pattern & Simulation Tool (cont.)
• Structure
– 10 functional blocks, each performing specific tasks related to data fusion
• Participants
– Continuous output fusion subsystem
– Label output fusion subsystem
• Consequences
– A general framework for designing, developing, and applying data fusion algorithms
– Substantially reduces the amount of time and energy needed for algorithm implementation
Conclusion
• Analysis
– A more accurate view of the weak points of secondary structure prediction methods can help us improve these methods more objectively.
– The results showed that the third level of the expertness study gives such a view and exposes more hidden facts about prediction methods than the earlier levels.
• Fusion
– This approach is a step toward a meta-classifier or meta-predictor concept, which can obtain better results than each individual classifier.
– It has been shown that the performance of a meta-classifier system can be better than that of each individual classifier; such systems can also provide unified access to data for users.
• SDF Tool
– Provides a convenient and intuitive framework for designing and developing data fusion algorithms
– Substantially reduces the amount of time needed for algorithm implementation and helps avoid undetected human errors
Further research
• Analysis
– The only two non-hydrophobic, non-polar amino acids, Proline and Glycine, showed unusual prediction behavior. Extending the current evaluation criteria with the physico-chemical properties of amino acids could give a more objective view of prediction methods.
– The weaker results for beta strand prediction, compared with the other secondary structure classes, suggest that if there were a suitable engine focused on this specific class, fusing it with the existing engines could improve the overall results.
– Moreover, we can develop a prediction algorithm focused on Proline and Glycine and then perform decision-level fusion between the results of this engine and the other engines to obtain better overall results.
• Fusion
– Classifier selection, in which the most expert classifier is selected to contribute the decision
– Provide better identifiers to convert Type I classification into Type III classification
– Include the confidence of each decision separately (WOWA)
– Publish the meta-classifiers as an open-access web service
• SDF Tool
– Extending the fusion algorithms and blocks
– Developing some standard models for data fusion
– Deriving the hardware design from the highest level of design (e.g. UML)

THANK YOU FOR YOUR ATTENTION