
Protein Secondary Structure Classifier Fusion

By: Majid Kazemian

Advisor: Prof. Behzad Moshiri

Co-Advisor: Prof. Caro Lucas

University of Tehran, College of Engineering

School of Electrical and Computer Engineering

April 2006

Agenda

• Introduction
• Problem statement
• Expertness assessment of protein secondary structure prediction engines
• Meta classifier
  – Definition and general architecture
  – Classifier types
  – Combining approaches
  – Decision profile
  – Accuracy measures
  – Results
• Data fusion simulation tool (SDF tool)
• Conclusion and further research

Introduction

• Protein
  – A large, complex molecule composed of amino acids.
  – Proteins are essential to the structure, function, and regulation of the body.
  – Examples are hormones, enzymes, and antibodies.
• Amino acid
  – The fundamental building block of a protein molecule.
  – There are 20 different amino acids commonly found in proteins.

[Figure: general structure of an amino acid. Labels: Cα, H, amino group (NH), carboxyl group (C, O, OH), side chain (R).]

Introduction (cont.)

• Proteins are complex macromolecules whose structure can be described at four levels:
  – Primary (the sequence of amino acids that are linked together)
  – Secondary (alpha helices, beta sheets, random coils)
  – Tertiary (the way the random coils, alpha helices, and beta sheets fold with respect to each other)
  – Quaternary (the packing of the protein subunits to form the final protein complex)

[Figure: Structure and Function]

The accumulation of protein sequences, especially since the start of whole-genome sequencing projects, is in stark contrast to the much slower experimental determination of protein structures.

Problem statement

• A variety of approaches exist for protein structure prediction, each with its own strengths and weaknesses.
• One of the most useful approaches is to determine the secondary structural elements as a first step toward 3D structure determination. There are three main classes of secondary structural elements in proteins: alpha helices (H), beta strands (E), and irregular structures (turns or coils, denoted C). Prediction engines can therefore be treated as structural classifiers, and combining them aims to:
  – Provide unified access to data for users (biologists).
  – Achieve a higher overall accuracy than any of the individual classifiers.
• A standard tool for simulating and prototyping fusion systems

Expertness assessment of protein secondary structure prediction engines

• The first level
  – Deals with the prediction accuracy over the whole secondary structure content of proteins, regardless of the prediction accuracy for each secondary structure class.
  – The Q3 criterion is one of the most important criteria at this level.
  – This level is known as the global measurement level.
• The second level
  – Describes the prediction accuracy for each secondary structure class separately.
  – This level allows prediction methods to be analyzed in more detail than the previous level.
  – The QH, QE, and QC criteria are introduced at this level.

Expertness assessment of protein secondary structure prediction engines (cont.)

• The third level
  – Evaluates the prediction accuracy of each secondary structure class for each amino acid.
  – Reveals more detail about prediction accuracy than the second level.

$$Q^{aa}_{index} = \frac{\text{number of correctly predicted amino acid } aa \text{ in structure } index}{\text{number of amino acid } aa \text{ in structure } index} \times 100$$

where $index \in \{\text{Helix}, \text{Beta Strand}, \text{Coil}\}$ and $aa \in \{\text{list of 20 amino acids}\}$.

CASP (Critical Assessment of Techniques for Protein Structure Prediction) is commonly described as a competition for protein structure prediction that takes place every two years. CASP gives research groups an opportunity to assess the quality of their methods for predicting protein structure from the primary structure of the protein.
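The third-level measure, the percentage of correctly predicted residues for each amino acid within each observed structure class, can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study; the function name and the toy sequences are assumptions made up for the example.

```python
from collections import defaultdict


def per_residue_accuracy(sequence, observed, predicted):
    """Return {(amino_acid, structure): percent correctly predicted}.

    All three arguments are equal-length strings: one amino acid letter,
    one observed class and one predicted class per residue.
    """
    totals = defaultdict(int)   # residues of this amino acid observed in this class
    correct = defaultdict(int)  # of those, how many were predicted correctly
    for aa, obs, pred in zip(sequence, observed, predicted):
        totals[(aa, obs)] += 1
        if obs == pred:
            correct[(aa, obs)] += 1
    return {key: 100.0 * correct[key] / totals[key] for key in totals}


# Toy data, invented for illustration only.
seq = "AAGAVVG"
obs = "HHCHEEC"
pred = "HHCCEHC"
q = per_residue_accuracy(seq, obs, pred)
```

Only (amino acid, class) pairs that occur in the observed data appear in the result, so a pair that never occurs is simply absent rather than reported as zero.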

Expertness assessment of protein secondary structure prediction engines (cont.)

• Single global propensity
  – Single Global Propensity (SGP) shows the tendency or hindrance of a certain amino acid to occur in a certain secondary structure.
  – A high SGP value of an amino acid for a specific secondary structure means that the amino acid tends to adopt that structure rather than the other secondary structures.

$$SGP^{aa_i}_{index} = \frac{n^{aa_i}_{index} / n^{aa_{all}}_{index}}{n^{aa_i}_{dataset} / n^{aa_{all}}_{dataset}}$$

where $n^{aa_i}_{index}$ is the number of residues of amino acid $aa_i$ in structure $index$, $n^{aa_{all}}_{index}$ is the number of all residues in structure $index$, and $n^{aa_i}_{dataset}$ and $n^{aa_{all}}_{dataset}$ are the corresponding counts over the whole dataset.

To calculate the single global propensity, we used the Hobohm and Sander representative set of protein chains with homology less than 25%. These structures were resolved by X-ray crystallography with a resolution better than 3 Å.
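A propensity of this kind can be sketched in a few lines. The sketch assumes SGP is the ratio of an amino acid's frequency inside a structure class to its frequency in the whole dataset; the function name and toy strings are hypothetical and are not taken from the Hobohm and Sander set.

```python
from collections import Counter


def single_global_propensity(sequence, structure):
    """Return {(amino_acid, structure_class): propensity} for observed pairs."""
    aa_total = Counter(sequence)                    # count of each amino acid in the dataset
    struct_total = Counter(structure)               # count of all residues in each class
    pair_total = Counter(zip(sequence, structure))  # count of each amino acid in each class
    n_all = len(sequence)                           # count of all residues in the dataset
    sgp = {}
    for (aa, idx), n in pair_total.items():
        frac_in_structure = n / struct_total[idx]
        frac_in_dataset = aa_total[aa] / n_all
        sgp[(aa, idx)] = frac_in_structure / frac_in_dataset
    return sgp


# Toy data, invented for illustration only.
sgp = single_global_propensity("AAGAVVG", "HHCHEEC")
```

Values above 1 indicate that the amino acid is over-represented in that class relative to the dataset as a whole; values below 1 indicate the opposite.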

Expertness index assessment

[Figure: per-amino-acid values (axis from 0 to 1) for the H, E, and C classes over the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y). Legend: engines 385, 258, 150, 111, 31, and SGP H, SGP E, SGP C.]

CASP4 (2000)

Engine index   Name                 Method   CASP4 Q3
031            BioInfo.PL           BASIC    79.73
111            SAM-T99              HMM      78.14
150            Chandonia-Cohen      NN+F     72.63
151            Pred2ary/Chandonia   NN       72.74
258            PSIPRED              NN       78.49
385            Prof-server          NN       75.66

Expertness index assessment

[Figure: per-amino-acid values for the H, E, and C classes. Legend: engines 1, 20, 23, 68, 72, 100, 108, 357, 393, 517, and SGP H, SGP E, SGP C.]

CASP5 (2002)

Engine index   Name              Method      CASP5 Q3
001            SAM-T02-human     NNHMM       79.91
020            Bujnicki-Janusz   Consensus   81.79
023            Baldi-SSpro       RNN         79.61
068            Jones-NewFold     NN+F        79.83
072            PSIPRED           NN          79.96
100            SBI               NN          79.6
108            CaspIta           Consensus   81.04
357            PROF/Rost         NN          77.72
393            MacCallum         GP+NN       79.97
517            GeneSilico        Consensus   80.77

Analysis results

[Figure: diagram of the amino acids (A, V, I, L, M, F, C, T, Y, W, H, K, R, E, Q, D, N, S, P, G) grouped as polar or hydrophobic.]

• Lower accuracy in beta strand prediction
• Agreement between SGP values and prediction accuracy
• Different prediction behavior

Structure   CASP 4 average   CASP 4 std. dev.   CASP 5 average   CASP 5 std. dev.
H           0.7768           0.0974             0.8287           0.0812
E           0.5916           0.1296             0.6973           0.1285
C           0.7578           0.0801             0.7654           0.0680

Meta Classifiers

• A meta classifier combines the results of multiple classifiers at once, returning more accurate and less confusing results.
• A meta classifier can improve classification quality in several ways:
  – Accuracy and efficiency
  – A single classification result (one-click paradigm)

Data fusion deals with the synergistic combination of information made available by various knowledge sources, such as sensors and databases, in order to provide a better understanding of a given scene or environment.

How should the individual classifiers be combined?

Architecture of meta-classifier

Current environment: data is hard to find, hard to understand, hard to reconcile, and hard to analyze.

[Figure: current environment and meta-classifier architecture. Blocks shown: User, User Interface, Query, Dispatcher, Pre-Processing, BSE 1, BSE 2, …, C 1, C 2, …, C N, Fusion, Post-Processing, Display, Feedback, Biological Insights, Statistics & Expertness, Personalized Knowledge. A simplified version is also shown.]

DSSP* class   Class used
H             H
G             H
I             H
E             E
B             E
T             L
S             L
' '           L

* Dictionary of Secondary Structure of Proteins
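The reduction from DSSP classes to the three classes used here is a plain lookup, which can be sketched as follows. The dictionary mirrors the mapping listed above; the names `DSSP_TO_3STATE` and `reduce_dssp` are assumptions chosen for this example.

```python
# Mapping copied from the DSSP reduction listed on the slide.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helix types -> H
    "E": "E", "B": "E",             # strand and bridge -> E
    "T": "L", "S": "L", " ": "L",   # turn, bend, blank -> L
}


def reduce_dssp(dssp_string):
    """Convert a string of DSSP codes into the reduced three-class string."""
    return "".join(DSSP_TO_3STATE[code] for code in dssp_string)


reduced = reduce_dssp("HGIEBTS ")
```

A code outside the table raises `KeyError`, which makes unexpected DSSP symbols visible instead of silently mapping them to a default class.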

Types of Classifiers & Combination Approaches

• Type I classification is a simple statement of the class (abstract-level outputs); it can be combined by majority voting.
• Type II classification is a ranked list of probable classes (rank-level outputs); it can be combined by Borda voting.
• Type III classification assigns probabilities to each class (measurement-level outputs); it can be combined by any fusion technique.
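The two label-based combiners named above can be sketched briefly. This is a minimal illustration under stated assumptions: the vote is implemented as a plurality vote (the most frequent label wins, with ties going to the label seen first), and the Borda score gives n-1 points to the top rank down to 0 for the last. Function names and sample votes are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Combine Type I outputs: return the most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def borda_vote(rankings):
    """Combine Type II outputs: each ranking lists classes, most probable first."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position  # top rank earns n-1 points
    return max(scores, key=scores.get)


# Invented votes from five Type I classifiers and three Type II classifiers.
type1_result = majority_vote(["H", "E", "H", "C", "H"])
type2_result = borda_vote([["H", "E", "C"], ["E", "H", "C"], ["E", "C", "H"]])
```

In the ranked example, E collects 5 points against 3 for H and 1 for C, so Borda voting selects E even though one classifier ranked H first.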

Most protein secondary structure classifiers are Type I classifiers.

How can measurement-level outputs be extracted from abstract-level classifications?

Measurement level from Abstract level Classifications

• Confusion matrix (a posteriori probabilities of each classification)
• A confusion matrix is a matrix in which the actual class of a datum under test is represented by the matrix row, and the classification of that datum is represented by the matrix column.
• When classifier Dj outputs class i as its classification label, the normalized values of column i of Dj's confusion matrix are taken as the decision vector of Dj.

CONFUSION MATRIX FOR A CLASSIFIER

                 Classified as H   Classified as E   Classified as C
Actual Class H   1600              100               300
Actual Class E   200               1200              600
Actual Class C   100               500               1400
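The conversion from a hard label to a soft decision vector can be sketched with the confusion matrix shown above. The matrix values are the ones from the slide; the function name `decision_vector` and the variable names are assumptions made for this illustration.

```python
CLASSES = ("H", "E", "C")

# Rows: actual class, columns: classified as (values from the slide).
CONFUSION = [
    [1600, 100, 300],
    [200, 1200, 600],
    [100, 500, 1400],
]


def decision_vector(confusion, predicted_label, classes=CLASSES):
    """Normalize the column of the predicted label so that it sums to 1."""
    col = classes.index(predicted_label)
    column = [row[col] for row in confusion]
    total = sum(column)
    return [value / total for value in column]


# Soft supports for (H, E, C) when this classifier says "H".
vector_h = decision_vector(CONFUSION, "H")
```

When this classifier outputs H, the resulting vector is roughly (0.842, 0.105, 0.053): of the 1900 residues it labelled H, 1600 were actually H, 200 were E, and 100 were C.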

Decision Profile

• Let $D_1, D_2, D_3, \ldots, D_L$ be our set of protein secondary structure classifiers and let $\{H, E, C\}$ be the set of class labels, which are the main secondary structural elements. The output of each secondary structure classifier can be treated as a three-dimensional vector:

$$D_i(x) = [d_{i,H}(x),\; d_{i,E}(x),\; d_{i,C}(x)], \qquad d_{i,j}(x) \in [0, 1]$$

• Here $d_{i,j}(x)$ is the degree of support given by classifier $D_i$ to the hypothesis that $x$ comes from class $j$. The decision profile is the matrix of soft labels:

$$DP(x) = \begin{bmatrix}
d_{1,H}(x) & d_{1,E}(x) & d_{1,C}(x) \\
d_{2,H}(x) & d_{2,E}(x) & d_{2,C}(x) \\
d_{3,H}(x) & d_{3,E}(x) & d_{3,C}(x) \\
d_{4,H}(x) & d_{4,E}(x) & d_{4,C}(x) \\
d_{5,H}(x) & d_{5,E}(x) & d_{5,C}(x)
\end{bmatrix}$$

The third row of $DP(x)$ is the output of classifier $D_3(x)$; the E column holds the support from classifiers $D_1, D_2, \ldots, D_5$ for class E.
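A decision profile is simply a small matrix with one row per classifier and one column per class, and the two readings of it (one row as a classifier's output, one column as the support for a class) can be sketched like this. The numeric values and helper names are invented for illustration and do not come from any real predictor.

```python
CLASSES = ("H", "E", "C")

# Hypothetical decision profile for one residue x: five classifiers, three classes.
DP = [
    [0.84, 0.11, 0.05],
    [0.10, 0.70, 0.20],
    [0.75, 0.15, 0.10],
    [0.06, 0.67, 0.27],
    [0.13, 0.26, 0.61],
]


def classifier_output(dp, i):
    """Return row i (1-based): the soft-label vector D_i(x)."""
    return dp[i - 1]


def class_support(dp, label, classes=CLASSES):
    """Return one column: the support every classifier gives to `label`."""
    k = classes.index(label)
    return [row[k] for row in dp]


d3 = classifier_output(DP, 3)
support_e = class_support(DP, "E")
```

Class-conscious fusion operators work on one such column at a time, while class-indifferent ones use the whole matrix.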

Paradigms for designing a classifier combination system

[Figure: panels showing Classifier 1, …, Classifier i, …, Classifier L feeding a Combiner.]

A. Different combination schemes
B. Different classifier models
C. Different feature subsets
D. Different training sets

Classifier fusion methods

[Figure: taxonomy of classifier fusion methods by type of outputs (class labels, rank, decision profile). Decision-profile methods are divided into class indifferent and class conscious. Further labels in the figure: linear models, nonlinear models, order statistics, fixed weights, data dependent weights, soft weights, crisp weights.]

Methods and citations named in the figure:
• Plurality [Hansen 1990]
• Majority [Lam 1997]
• Unanimity
• Naïve Bayes [Xu 1992]
• BKS [Huang 1995]
• Wernecke [Wernecke 1992]
• Brute force [Kuncheva 2000b]
• Stacked generalization [Wolpert 1992]
• Dempster-Shafer [Rogova 1994]
• Decision templates [Kuncheva 2001]
• Minimum
• Maximum
• Median
• OWA [Kuncheva 1997a]
• Product
• Geometric mean
• Fuzzy integral [Verikas 1999]
• Neural networks [Gader 1996]
• Classifier selection [Kuncheva 2000a]
• Mixture of Experts [Jordan 1995]
• [Chibelushi 1999]
• [Hashem 1997]
• [Tresp 1995, Verikas 1999]
• Borda voting
• Weighted Borda voting

Techniques of Data Fusion

• Conventional Approaches
  – OWA
  – Bayesian Approaches
  – Dempster-Shafer Method
  – Kalman Filter
• Knowledge-based Systems & Intelligent Approaches
  – Neural Network
  – Fuzzy Logic (Integral Operators)

[Figure: Source 1 and Source 2, with redundant information and complementary information.]

Ordered Weighted Averaging

Yager 1988: a mapping $F: R^n \to R$

$$OWA(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{n} w_j \, x_{\sigma(j)}$$

where σ is a permutation that orders the elements:

$$x_{\sigma(1)} \ge x_{\sigma(2)} \ge \ldots \ge x_{\sigma(n)}$$

The weights are all non-negative ($w_i \ge 0$) and their sum equals one:

$$\sum_{i=1}^{n} w_i = 1$$

• Provides a parameterized family of aggregation operators
  – Maximum, minimum
  – Median, k-order statistics
  – Arithmetic mean
  – …
• A parameterized way to go from the min to the max:

$$Min(x_1, x_2, \ldots, x_n) \le OWA(x_1, x_2, \ldots, x_n) \le Max(x_1, x_2, \ldots, x_n)$$

$$orness(w_1, w_2, \ldots, w_n) = \frac{1}{n-1} \sum_{i=1}^{n} (n - i)\, w_i$$

Exponential class of OWA

• Proposed by Filev 1998
• Satisfies a given degree of maxness

Optimistic weights:

$$w_1 = \alpha;\quad w_2 = \alpha(1-\alpha);\quad w_3 = \alpha(1-\alpha)^2;\quad \ldots;\quad w_{n-1} = \alpha(1-\alpha)^{n-2};\quad w_n = (1-\alpha)^{n-1}$$

Pessimistic weights:

$$w_1 = \alpha^{n-1};\quad w_2 = (1-\alpha)\alpha^{n-2};\quad w_3 = (1-\alpha)\alpha^{n-3};\quad \ldots;\quad w_{n-1} = (1-\alpha)\alpha;\quad w_n = (1-\alpha)$$

where the parameter α belongs to the unit interval [0, 1] and is related to the orness value, depending on the number of sources.
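The two exponential weight families can be generated and checked numerically. The sketch below assumes the usual orness measure, the sum of (n - i) times w_i divided by (n - 1), with weights applied to values sorted in descending order; the function names are hypothetical.

```python
def optimistic_weights(alpha, n):
    """w_i = alpha * (1 - alpha)**(i - 1) for i < n, and w_n = (1 - alpha)**(n - 1)."""
    weights = [alpha * (1 - alpha) ** i for i in range(n - 1)]
    weights.append((1 - alpha) ** (n - 1))
    return weights


def pessimistic_weights(alpha, n):
    """w_1 = alpha**(n - 1), and w_i = (1 - alpha) * alpha**(n - i) for i >= 2."""
    weights = [alpha ** (n - 1)]
    weights += [(1 - alpha) * alpha ** (n - i) for i in range(2, n + 1)]
    return weights


def orness(weights):
    """Degree of maxness of an OWA weight vector (requires at least two weights)."""
    n = len(weights)
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)


opt = optimistic_weights(0.5, 3)   # [0.5, 0.25, 0.25]
pes = pessimistic_weights(0.5, 3)  # [0.25, 0.25, 0.5]
```

Both families sum to one for any α in [0, 1]; for the same α, the optimistic weights place more mass on the largest inputs and therefore have the higher orness.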

Alpha-Orness

[Figure: Alpha-Orness plot]

OWA operator for classifier fusion

(1) Pick L OWA coefficients such that

$$b = [b_1, \ldots, b_L]^T, \qquad \sum_{i=1}^{L} b_i = 1$$

(2) For k = 1, 2, …, c

a. Sort $d_{i,k}(x)$, i = 1, 2, …, L, in descending order, so that

$$a_1 = \max_i d_{i,k}(x) \quad \text{and} \quad a_L = \min_i d_{i,k}(x)$$

b. Calculate the support for class k

$$\mu_k(x) = \sum_{i=1}^{L} b_i \, a_i$$

where L = number of classifiers

Kuncheva 2001: Combining classifiers: Soft computing solutions
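The OWA fusion steps can be sketched directly on a decision profile: sort each class column in descending order and take the weighted sum with the chosen coefficients. The decision profile values and the function name are invented for this example.

```python
def owa_fuse(decision_profile, coefficients):
    """Return the fused support for each class (one value per column)."""
    n_classes = len(decision_profile[0])
    supports = []
    for k in range(n_classes):
        # Step (2a): supports for class k, largest first.
        column = sorted((row[k] for row in decision_profile), reverse=True)
        # Step (2b): weighted sum with the OWA coefficients.
        supports.append(sum(b * a for b, a in zip(coefficients, column)))
    return supports


# Hypothetical decision profile: three classifiers, classes (H, E, C).
DP = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]

as_max = owa_fuse(DP, [1.0, 0.0, 0.0])
as_min = owa_fuse(DP, [0.0, 0.0, 1.0])
as_mean = owa_fuse(DP, [1 / 3, 1 / 3, 1 / 3])
```

The coefficient vector alone decides the behavior: all weight on the first position gives the maximum rule, all weight on the last gives the minimum rule, and equal weights give the arithmetic mean.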

Fuzzy Integral Operator

[Figure: relations between various aggregation operators and the fuzzy integral. Operators shown: weighted sum, arithmetic mean, min, med, max, order statistic, OWA, Sugeno integral, Choquet integral.]

Fuzzy integral for classifier fusion

(1) Fix the L fuzzy densities $g^1, g^2, \ldots, g^L$, e.g., by setting $g^j$ to the estimated probability of correct classification of $D_j$.

(2) Calculate λ from the equation

$$\lambda + 1 = \prod_{j=1}^{L} (1 + \lambda g^j), \qquad \lambda \ne 0$$

(3) For a given x, sort the kth column of DP(x) in descending order to obtain

$$[d_{i_1,k}(x),\; d_{i_2,k}(x),\; \ldots,\; d_{i_L,k}(x)]^T$$

(4) Sort the fuzzy densities correspondingly, i.e., $g^{i_1}, g^{i_2}, \ldots, g^{i_L}$, and set $g(1) = g^{i_1}$.

(5) For t = 2 to L, calculate recursively

$$g(t) = g^{i_t} + g(t-1) + \lambda \, g^{i_t} \, g(t-1)$$

(6) Calculate the final degree of support for class k by (Choquet integral)

$$\mu_k(x) = d_{i_L,k}(x) + \sum_{j=2}^{L} \left[ d_{i_{j-1},k}(x) - d_{i_j,k}(x) \right] g(j-1)$$

Kuncheva 2001: Combining classifiers: Soft computing solutions
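The fuzzy-integral steps can be sketched for a single class column. This sketch makes several assumptions: the final step uses the Choquet form with supports sorted in descending order, λ is found by bisection after dividing out the trivial root at zero, and every density lies strictly between 0 and 1. The function names and sample numbers are hypothetical.

```python
def solve_lambda(densities):
    """Solve lambda + 1 = prod(1 + lambda * g_j) for the nonzero root > -1."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already add up to 1: additive measure

    def h(lam):
        # (prod(1 + lam * g) - (1 + lam)) / lam, i.e. the equation with root 0 removed
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return (prod - (1.0 + lam)) / lam

    if total > 1.0:
        lo, hi = -1.0, 0.0          # h(lo) < 0 < h(0): root lies in (-1, 0)
    else:
        lo, hi = 0.0, 1.0           # h(0) < 0: widen hi until h(hi) > 0
        while h(hi) <= 0.0:
            hi *= 2.0
            if hi > 1e12:
                raise ValueError("no positive root found")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cumulative_measure(sorted_densities, lam):
    """Steps (4)-(5): g(1), g(2), ..., g(L) for densities in sorted order."""
    g_cum = [sorted_densities[0]]
    for g in sorted_densities[1:]:
        g_cum.append(g + g_cum[-1] + lam * g * g_cum[-1])
    return g_cum


def choquet_support(column, densities):
    """Fused support for one class from its decision-profile column."""
    lam = solve_lambda(densities)
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    d = [column[i] for i in order]
    g_cum = cumulative_measure([densities[i] for i in order], lam)
    support = d[-1]
    for j in range(1, len(d)):
        support += (d[j - 1] - d[j]) * g_cum[j - 1]
    return support


mu = choquet_support([0.9, 0.4], [0.2, 0.3])
```

With two classifiers the result depends only on g(1), so the example gives 0.4 + (0.9 - 0.4) × 0.2; in general the fused support always lies between the smallest and the largest value in the column.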

Dempster-Shafer Evidential Reasoning

• For k = 1, 2, …, c
  – Calculate the support for class k from the Dempster combination rule:

$$\mu_k(x) = \frac{\prod_{i=1}^{L} d_{i,k}(x)}{\sum_{j=1}^{c} \prod_{i=1}^{L} d_{i,j}(x)}$$
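A small sketch of this combination step follows, assuming the rule on this slide is the normalized product of the per-class supports, which is the form Dempster's rule takes when every classifier assigns belief only to single classes. The function name and the example decision profile are hypothetical.

```python
def product_rule_fuse(decision_profile):
    """Multiply the supports for each class across classifiers, then normalize."""
    n_classes = len(decision_profile[0])
    products = []
    for k in range(n_classes):
        p = 1.0
        for row in decision_profile:
            p *= row[k]
        products.append(p)
    total = sum(products)  # assumed nonzero: some class is supported by every classifier
    return [p / total for p in products]


# Hypothetical decision profile: two classifiers, classes (H, E, C).
fused = product_rule_fuse([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
])
```

Because the supports are multiplied, a single classifier that gives a class zero support vetoes that class entirely, which is one reason the column-normalized confusion-matrix values are a convenient input: they are rarely exactly zero.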

Protein Secondary Structure classifiers

Some of the common classifiers

Server       Location                                    Prediction method
APSSP2       Institute of Microbial Technology, India    EBL + Neural network
PROFSEC      Columbia University, USA                    Profile-based Neural network
PSIPRED      University College London, UK               Neural network
SAMT99_SEC   University of California, Santa Cruz, USA   Hidden Markov Model
SSPRO2       University of California, Irvine, USA       Recurrent Neural Network

Dataset statistics

Dataset name:          EVA: Common subset 1
Date of publication:   Cumulative results from 1999 to 2002/10
Number of proteins:    30
Number of residues:    More than 4000
Web address:           http://cubic.bioc.columbia.edu/eva/sec_2002_10/common.html#common_set1

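Meta Classifier Architecture

[Figure: the dataset is passed to APSSP2, PROFSEC, PSIPRED, SAMT99_SEC, and SSPRO2, which produce DP(1) to DP(5). A fusion operator combines them into a fused DP holding Confidence H, Confidence E, and Confidence C, and argmax gives the predicted class (H, E, or C). A generic version shows Classifier 1, Classifier 2, …, Classifier N feeding a Fusion block.]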

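Some of the Accuracy Measures

$$Q_3 = 100 \times \frac{1}{N_{res}} \sum_{i=1}^{3} M_{ii}$$

$$Q_i^{\%obs} = 100 \times \frac{M_{ii}}{obs_i}$$

$M_{ij}$ = number of residues observed in state i and predicted in state j, where i, j ∈ {H, E, C}; $N_{res}$ is the total number of residues and $obs_i$ is the number of residues observed in state i.

Confusion Matrix (rows: real class, columns: classified class, both in the order H, E, C)

$$\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix}$$

[Figure: the confusion matrix partitioned into TP, FN, FP, and TN regions for one class.]

$$Sensitivity = \frac{TP}{TP + FN} \qquad\qquad Specificity = \frac{TN}{TN + FP}$$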
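These measures can all be read off a 3×3 confusion matrix, as the following sketch shows. The example matrix reuses the illustrative confusion matrix given earlier in the slides (rows are real classes, columns are classified classes); the function names are assumptions for this example.

```python
CLASSES = ("H", "E", "C")

# Rows: real class, columns: classified class.
M = [
    [1600, 100, 300],
    [200, 1200, 600],
    [100, 500, 1400],
]


def q3(m):
    """Overall percentage of residues predicted in their observed state."""
    n_res = sum(sum(row) for row in m)
    return 100.0 * sum(m[i][i] for i in range(len(m))) / n_res


def q_obs(m, i):
    """Percentage of residues observed in state i that are predicted as i."""
    return 100.0 * m[i][i] / sum(m[i])


def sensitivity(m, i):
    """TP / (TP + FN) for class i."""
    tp = m[i][i]
    fn = sum(m[i]) - tp
    return tp / (tp + fn)


def specificity(m, i):
    """TN / (TN + FP) for class i."""
    n = len(m)
    fp = sum(m[r][i] for r in range(n) if r != i)
    tn = sum(m[r][c] for r in range(n) for c in range(n) if r != i and c != i)
    return tn / (tn + fp)


h = CLASSES.index("H")
summary = (q3(M), q_obs(M, h), sensitivity(M, h), specificity(M, h))
```

For this matrix Q3 is 70%, the observed-helix accuracy is 80%, and the helix class has sensitivity 0.8 and specificity 0.925; per-class sensitivity is the same quantity as the observed-state accuracy expressed as a fraction.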

The results of EVA1 dataset prediction by common selected engines

Engine       Q3      QH %obs   QE %obs   QC %obs   Specificity   Sensitivity
APSSP2       74.49   78.00     65.65     77.01     87.04         74.08
PROFSEC      74.71   75.38     74.48     74.05     87.12         74.24
PSIPRED      74.78   78.53     68.25     75.67     87.18         74.36
SAMT99_SEC   74.63   82.60     63.12     75.06     87.13         74.27
SSPRO2       73.58   78.14     62.79     76.45     86.60         73.21

[Figure: bar chart of accuracy, specificity, and sensitivity (axis from 70 to 90) for PSIPRED, APSSP2, PROFSEC, SAMT99_SEC, and SSPRO2.]

Results

[Figure: bar chart of sensitivity, specificity, and accuracy (axis from 0.7 to 0.9) for Choquet, OWA, Dempster-Shafer, Majority voting, APSSP2, PROFSEC, PSIPRED, SAMT99_SEC, and SSPRO2.]

Method            Sensitivity   Specificity   Accuracy
Choquet           0.767152256   0.883576128   0.7702
OWA               0.759633459   0.879816729   0.7647
Dempster-Shafer   0.757518797   0.878759398   0.7642
Majority voting   0.757753759   0.87887688    0.7568
APSSP2            0.740836466   0.870418233   0.7449
PROFSEC           0.742481203   0.871240602   0.7471
PSIPRED           0.743656015   0.871828008   0.7478
SAMT99_SEC        0.742716165   0.871358083   0.7463
SSPRO2            0.732142857   0.866071429   0.7358

Results

[Figure: bar chart of Q3, H, E, and C (axis from 65 to 85) for Choquet, OWA, Dempster-Shafer, Majority voting, and PSIPRED.]

Method            Q3      H       E       C
Choquet           77.02   84.43   72.58   73.43
OWA               76.47   84.31   73.08   71.92
Dempster-Shafer   76.42   81.12   71.52   75.81
Majority voting   75.68   80.17   66.37   77.68
PSIPRED           74.78   78.53   68.25   75.67

Sensor/Data Fusion Design Pattern

• Today it is difficult to imagine system development without the help of simulation.

• Simulation is an important tool in all fields of science and engineering.

• It provides a better understanding of system function, supports testing and finding critical states and regions of operation, and enables fast optimization of the system and its control.

• Patterns help you build on the collective experience of skilled system engineers.

• They capture existing, well-proven experience in system development and help to promote good design practice.

• Every pattern deals with a specific, recurring problem in the design or implementation of a system.

Design patterns

• System-Level Design Requirements
  – Describe system architecture
  – Model different levels
  – Model different components
  – Simulate and test
  – Re-use design

Data Fusion Design Pattern & Simulation Tool

• Motivations
  – Lack of standardization
  – Various common modules found in fusion algorithms
  – Reduce programming effort
    • Save time and energy
• Intent
  – Development of high-level graphical system modeling using powerful modeling environments
• Functionalities offered by this view
  – Rapid and accurate system design
  – Modularity, which means that new components can be easily added
  – Reusability and ease of use, which are the most important aims of design patterns


• Applicability
  – Defense, Robotics, Medical Engineering, Air and Space
  – Soft computing, Agriculture, Process control, Fire control systems
  – Biology, Bioinformatics
• Implementation
  – Simulink platform using the MATLAB S-function technique

• An interactive software package to model, simulate, and analyze dynamic systems

• A graphical, mouse-driven program that allows systems to be modeled by drawing a block diagram on the screen.

MATLAB

SIMULINK

• Inter-relationship
  – Each block has one output, an input vector, and a set of parameters
  – Each block converts the input vector to an output
  – Each block's output can be used as an input to other blocks

Interactive link (wrapper) between VTB and Matlab/Simulink


• Structure
  – 10 functional blocks, each performing specific tasks related to data fusion
• Participants
  – Continuous output fusion subsystem
  – Label output fusion subsystem
• Consequences
  – A general framework for designing, developing, and applying data fusion algorithms
  – Substantially reduces the amount of time and energy needed for algorithm implementation

Conclusion

• Analysis
  – A more accurate understanding of the weak points of secondary structure prediction methods can help us improve these methods more objectively.
  – The results showed that the third level of expertness study provides this understanding and exposes more hidden facts about prediction methods than the earlier levels.
• Fusion
  – This approach is a step toward a meta-classifier or meta-predictor concept, which can obtain better results than each individual classifier.
  – It has been shown that the performance of a meta-classifier system can be better than that of each individual classifier; such systems can also provide unified access to data for users.
• SDF Tool
  – Provides a convenient and intuitive framework for designing and developing data fusion algorithms
  – Substantially reduces the amount of time needed for algorithm implementation and avoids undetectable human errors

Further research

• Analysis
  – The only two amino acids that are neither hydrophobic nor polar, Proline and Glycine, showed unusual prediction behavior. Extending the current evaluation criteria on the basis of the physico-chemical properties of amino acids could give a more objective view of prediction methods.
  – The weaker results for beta strand prediction compared with the other secondary structure classes suggest that, if a suitable engine focused on this specific class were available, fusing it with the existing engines could improve the overall results.
  – Moreover, a prediction algorithm focused on Proline and Glycine could be developed and then fused at the decision level with the results of the other engines to obtain better overall results.
• Fusion
  – Classifier selection, in which the most expert classifier is selected to make the decision
  – Providing better identifiers to convert Type I classification into Type III classification
  – Including the confidence of each decision separately (WOWA)
  – Publishing the meta-classifiers as an open-access web service
• SDF Tool
  – Extending the fusion algorithms and blocks
  – Developing standard models for data fusion
  – Deriving the hardware design from the highest level of design (e.g. UML)

THANK YOU FOR YOUR ATTENTION
