Page 1: Project Presentation

5/4/05 CS573X Machine Learning Project

Prediction of peptidyl prolyl residues in cis/trans configuration using machine learning algorithms

Jae-Hyung Lee
Genetics, Development and Cell Biology
Bioinformatics and Computational Biology Program

Page 2: Project Presentation

What is a protein?

Page 3: Project Presentation

Proteins are polypeptide chains

Page 4: Project Presentation

Torsional angles

Although the peptide bond is planar and fixed, rotation can and does occur about the two single bonds on either side of the α-carbon:
Φ (phi), the bond between N and Cα
Ψ (psi), the bond between Cα and C

Page 5: Project Presentation

Two different peptide configurations

cis: ω = -20° to 20°
trans: ω = -180° to -160° or ω = 160° to 180°
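These angular ranges translate directly into a classification rule. A minimal sketch (the "twisted" label for angles outside both ranges is an assumption added here for illustration, not from the slides):

```python
def classify_omega(omega_deg: float) -> str:
    """Classify a prolyl peptide bond as cis or trans from its omega angle.

    Ranges follow the slide: cis if -20 <= omega <= 20;
    trans if 160 <= |omega| <= 180.
    Anything else is labeled 'twisted' (an assumed catch-all label).
    """
    if -20.0 <= omega_deg <= 20.0:
        return "cis"
    if 160.0 <= abs(omega_deg) <= 180.0:
        return "trans"
    return "twisted"
```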

Page 6: Project Presentation

Importance of isomerization of the prolyl peptide bond

Peptidyl-prolyl cis/trans isomerization has considerable biological significance.

Peptidyl-prolyl cis/trans isomerization is frequently found to be a rate-limiting step in protein folding.

Prolyl residues play an important role in the final structure of proteins.

They can act as potential regulatory switches involved in cellular functions.

Page 7: Project Presentation

Datasets (1)

Protein Data Bank (PDB) database: protein structure information; 3D coordinates of atoms based on X-ray crystallography or NMR spectroscopy experimental data

Every omega (ω) angle of a prolyl peptide bond was calculated.

In total, the omega angles of 667,230 proline residues were calculated.

To reduce redundancy in the dataset and use more reliable structural information, proteins were removed based on a resolution cutoff (≤ 3.0 Å), an R-factor cutoff (≤ 0.3) and sequence identity (< 30%) using the PISCES web server (60,814 PDB chains -> 4,006 PDB chains).

Finally, a total of 3,268 instances (1,571 cis cases and 1,697 trans cases) were used for constructing and testing classifiers.
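The quality-filtering step can be sketched as a simple predicate over per-chain metadata. The `Chain` record and the example values below are hypothetical; the < 30% sequence-identity culling is performed by PISCES itself and is not reproduced here:

```python
from typing import NamedTuple

class Chain(NamedTuple):
    pdb_id: str
    resolution: float  # Angstroms
    r_factor: float

def passes_quality_cutoffs(chain: Chain,
                           max_resolution: float = 3.0,
                           max_r_factor: float = 0.3) -> bool:
    """Keep only chains meeting the slide's resolution and R-factor cutoffs."""
    return chain.resolution <= max_resolution and chain.r_factor <= max_r_factor

# Hypothetical chains: only the first passes both cutoffs.
chains = [Chain("1ABC", 1.8, 0.21), Chain("2XYZ", 3.5, 0.25), Chain("3DEF", 2.4, 0.35)]
kept = [c for c in chains if passes_quality_cutoffs(c)]
```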

Page 8: Project Presentation

Datasets (2)

9 different window sizes (different local sequences near the proline residue). For example, window size 5:

i-2, i-1, i, i+1, i+2 (5 amino acids in total; the i-th residue is the proline)

Instances were generated both without and with secondary structure (ss) information in their attributes.

In total, 9 (window sizes) x 2 (+/- ss information) = 18 datasets were generated.

a. Window size 15, no ss information:
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, t

b. Window size 15, with ss information:
K, A, I, I, S, E, N, P, C, I, K, H, Y, H, I, E, E, C, C, C, C, C, C, E, E, E, E, E, C, C, t

(amino acid sequence information, then secondary structure information, then the class label)
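Extracting a sequence window around the central proline can be sketched as follows. Padding out-of-range positions with "X" is an assumption; the slides do not say how sequence boundaries are handled:

```python
def sequence_window(seq: str, center: int, size: int, pad: str = "X") -> str:
    """Return the window of `size` residues centered on index `center`.

    `size` is assumed odd; positions falling outside the sequence are
    padded with `pad` (an assumed convention, not from the slides).
    """
    half = size // 2
    out = []
    for i in range(center - half, center + half + 1):
        out.append(seq[i] if 0 <= i < len(seq) else pad)
    return "".join(out)

# Window size 5 around the proline (P) in the slide's example sequence
seq = "KAIISENPCIKHYHI"
print(sequence_window(seq, seq.index("P"), 5))  # ENPCI
```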

Page 9: Project Presentation

Naïve Bayes Classifier (1)

Using Bayes' theorem,

P(h | D) = P(D | h) P(h) / P(D)

the probability of a hypothesis h given a set of data D can be calculated from its prior probability P(h), the probability of observing the data given the hypothesis P(D | h), and the probability of the data P(D).

Page 10: Project Presentation

Naïve Bayes Classifier (2)

Given an instance X with attribute values a1, ..., an, the Bayesian approach to classifying X is to assign it the most probable hypothesis.

Assumption: given the class, the attributes are independent of each other, so

v_NB = argmax_v P(v) ∏_i P(a_i | v)
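The conditional-independence assumption makes the classifier trainable from counts alone. A minimal sketch; add-one smoothing and the 20-letter amino-acid alphabet size are assumptions, not details stated in the slides:

```python
from collections import Counter, defaultdict
import math

def train_nb(instances, labels):
    """Fit a categorical naive Bayes model: class counts for the prior P(v)
    and per-position value counts for the conditionals P(a_i | v)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # keyed by (position, label)
    for x, y in zip(instances, labels):
        for i, a in enumerate(x):
            cond[(i, y)][a] += 1
    return prior, cond

def predict_nb(x, prior, cond, alphabet_size=20):
    """Return argmax_v log P(v) + sum_i log P(a_i | v), with add-one smoothing."""
    total = sum(prior.values())
    best, best_score = None, -math.inf
    for v, nv in prior.items():
        score = math.log(nv / total)
        for i, a in enumerate(x):
            score += math.log((cond[(i, v)][a] + 1) / (nv + alphabet_size))
        if score > best_score:
            best, best_score = v, score
    return best

# Toy 3-residue windows with cis/trans labels (hypothetical data)
prior, cond = train_nb(
    [("A", "P", "G"), ("A", "P", "G"), ("S", "P", "V")],
    ["cis", "cis", "trans"],
)
```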

Page 11: Project Presentation

Support Vector Machine (SVM) (1)

Finds a linear boundary that separates the training data.

Uses a non-linear kernel function to implicitly map the non-separable original n-dimensional pattern space onto a higher-dimensional feature space in which the patterns become separable.

A polynomial function was used as the kernel function.
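The polynomial kernel computes feature-space inner products without ever constructing the mapping explicitly. A minimal sketch; coef0 = 1 is an assumed default, not a value given in the slides:

```python
def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel K(x, y) = (x . y + coef0) ** degree.

    Equals the inner product of the two vectors' images in an implicit
    higher-dimensional feature space, so the mapping itself never has
    to be computed.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree
```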

[Figure: points o and x, non-separable in the input space, are mapped to Φ(o) and Φ(x) in the feature space, where they become separable]

Page 12: Project Presentation

Support Vector Machine (SVM) (2)

A maximal-margin hyperplane, with its support vectors highlighted, in the two-dimensional feature space


The overfitting problem on the training dataset is resolved by selecting, from among all separating hyperplanes, the hyperplane that maximizes the margin of separation between the two classes.

Page 13: Project Presentation

Performance evaluation

Accuracy = (TP + TN) / N

Correlation Coefficient (CC) = (TP·TN - FP·FN) / sqrt((TP+FN)(TP+FP)(TN+FP)(TN+FN))

- TP (true positives) = the number of proline residues predicted to be in the trans configuration that actually are trans.
- TN (true negatives) = the number of proline residues predicted to be in the cis configuration that actually are cis.
- FP (false positives) = the number of proline residues predicted to be trans that actually are cis.
- FN (false negatives) = the number of proline residues predicted to be cis that actually are trans.
- N = TP + TN + FP + FN

5-fold cross-validation was performed on the datasets using the WEKA package.
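Both measures follow directly from the confusion-matrix counts; a small sketch of the slide's definitions:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / N."""
    return (tp + tn) / (tp + tn + fp + fn)

def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient (Matthews CC), per the slide's formula.
    Returns 0.0 when the denominator vanishes (a common convention)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```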

Page 14: Project Presentation

Result – Naïve Bayes Classifier (1)

             no ss information    ss information
window size  Accuracy   CC        Accuracy   CC
3            0.596      0.190     0.628      0.276
5            0.604      0.204     0.635      0.296
7            0.599      0.195     0.637      0.302
9            0.606      0.210     0.629      0.282
11           0.602      0.201     0.625      0.270
13           0.598      0.194     0.618      0.254
15           0.606      0.209     0.614      0.245
17           0.602      0.202     0.610      0.235
19           0.601      0.199     0.609      0.229
21           0.602      0.201     0.610      0.230

Page 15: Project Presentation

Result – Naïve Bayes Classifier (2)

[Figure: accuracy and CC plotted against window size (3 to 21); left panel without ss information, right panel with ss information incorporated]

Page 16: Project Presentation

Result – SVM (1)

             no ss information    ss information
window size  Accuracy   CC        Accuracy   CC
3            0.5826     0.1625    0.63       0.2594
5            0.575      0.1499    0.6233     0.246
7            0.5817     0.1613    0.653      0.305
9            0.5921     0.1815    0.6515     0.3021
11           0.6034     0.2036    0.66       0.3192
13           0.6047     0.2059    0.6631     0.3253
15           0.593      0.1821    0.6662     0.3312
17           0.6001     0.1966    0.6704     0.3399
19           0.6028     0.2027    0.6649     0.3289
21           0.6037     0.2053    0.6634     0.3258

Polynomial kernel: third degree

Page 17: Project Presentation

Result – SVM (2)

[Figure: accuracy and CC plotted against window size (3 to 21); left panel without ss information, right panel with ss information incorporated. Polynomial kernel: third degree]

Page 18: Project Presentation

Result – SVM (3)

[Figure: accuracy and CC plotted against polynomial kernel degree (1 to 9); window size 17 with ss information incorporated]

Page 19: Project Presentation

Discussion (1)

Unbalanced data: in the full dataset, roughly 95% of prolyl peptide bonds are trans and only about 5% are cis.
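This imbalance is why plain accuracy would be a misleading target on the raw data: a trivial predictor that always outputs trans already scores about 0.95 while carrying no information (its CC is 0). A quick illustrative check, using proportional counts rather than real data:

```python
# Assumed proportional counts matching the ~95/5 split quoted on the slide.
n_trans, n_cis = 95, 5
baseline_accuracy = n_trans / (n_trans + n_cis)
print(baseline_accuracy)  # 0.95
```

This motivates evaluating on the roughly class-balanced subset of 1,571 cis and 1,697 trans instances described earlier.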

Page 20: Project Presentation

Discussion (2): a third class, both cis and trans

Native proline isomerization

[Figure: bound ligands, a phosphopeptide and the SH3 domain of ITK]

Page 21: Project Presentation

Summary

Using sequence information alone, performance is not as good as when secondary structure information is also used.

With secondary structure information, the best Naïve Bayes classifier achieved an accuracy of 64% with a CC of 0.302, a specificity for trans of 0.483 and a sensitivity for trans of 0.590.

With secondary structure information, the SVM achieved an accuracy of 67% with a CC of 0.340, a specificity for trans of 0.683 and a sensitivity for trans of 0.657.

The Naïve Bayes classifier did not need a long window to perform well (~7), and larger windows did not improve its performance. The SVM, on the other hand, needed a longer window size (at least 7) to achieve good performance.

For the SVM, the third-degree polynomial kernel gave the best performance.

Page 22: Project Presentation

References

1. Andreotti, A. H. 2003. Native state proline isomerization: an intrinsic molecular switch. Biochemistry 42:9515-24.

2. Baldi, P., S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412-24.

3. Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. 2000. The Protein Data Bank. Nucleic Acids Res 28:235-42.

4. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-637.

5. Mitchell, T. 1997. Machine Learning. McGraw-Hill.

6. Pahlke, D., C. Freund, D. Leitner, and D. Labudde. 2005. Statistically significant dependence of the Xaa-Pro peptide bond conformation on secondary structure and amino acid sequence. BMC Struct Biol 5:8.

7. Vapnik, V. 1998. Statistical Learning Theory. Springer-Verlag, New York.

8. Wang, G., and R. L. Dunbrack, Jr. 2003. PISCES: a protein sequence culling server. Bioinformatics 19:1589-91.

9. Witten, I. H., and E. Frank. 2000. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann.

