Characterization of Secondary Structure of Proteins
using Different Vocabularies
Madhavi K. Ganapathiraju
Language Technologies Institute
Advisors
Raj Reddy, Judith Klein-Seetharaman,
Roni Rosenfeld
2nd Biological Language Modeling Workshop
Carnegie Mellon University
May 13-14 2003
Presentation overview
• Classification of Protein Segments by their Secondary Structure types
• Document Processing Techniques
• Choice of Vocabulary in Protein Sequences
• Application of Latent Semantic Analysis
• Results
• Discussion
Sample Protein: MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITALYSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA…
Secondary Structure of Protein
Application of Text Processing
• Natural language: Letters → Words → Sentences; letter counts in languages, word counts in documents
• Proteins: Residues → Secondary Structure → Proteins → Genomes
• Can unigrams distinguish secondary structure elements from one another?
Unigrams for Document Classification
• Word-document matrix: represents documents in terms of their word unigrams
           Doc-1  Doc-2  Doc-3  Doc-4
clouds       1      1
cell         1      3
drawing     10      1      3
dry          2      1
gene         1      1
graph        1      2      1
. . .
weather      1      1
This is a “bag-of-words” model, since the position of words in the document is not taken into account.
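As a sketch, the bag-of-words word-document matrix can be built in a few lines of Python; the toy documents and vocabulary here are hypothetical, not from the original work.

```python
from collections import Counter

# Hypothetical toy corpus; each string is one "document".
docs = [
    "clouds dry weather clouds",
    "cell gene cell graph",
]

vocab = sorted({w for d in docs for w in d.split()})

# Rows are words, columns are documents, entries are unigram counts;
# word order inside a document is ignored ("bag of words").
counts = [Counter(d.split()) for d in docs]
matrix = [[c[w] for c in counts] for w in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:8s} {row}")
```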
Word-Document Matrix

1 0 2 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 3 2 1 0 0 0 0 0
0 2 1 0 0 0 1 0 1
1 3 0 0 0 1 0 0 0
0 2 0 0 1 0 0 1 0
Document Vectors

1 0 2 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 3 2 1 0 0 0 0 0
0 2 1 0 0 0 1 0 1
1 3 0 0 0 1 0 0 0
0 2 0 0 1 0 0 1 0

• Each column of the word-document matrix is a document vector:
  Doc-1 = (1, 1, 1, 0, 1, 0), Doc-2 = (0, 0, 3, 2, 3, 2), Doc-3 = (2, 0, 2, 1, 0, 0), …, Doc-N = (0, 0, 0, 1, 0, 0)
Document Comparison
• Documents can be compared to one another via the dot product of their document vectors, e.g. Doc-3 and Doc-2:

  (2, 0, 2, 1, 0, 0) .* (0, 0, 3, 2, 3, 2) = (0, 0, 6, 2, 0, 0)

  Summing the element-wise products gives the dot product: 0 + 0 + 6 + 2 + 0 + 0 = 8

• Formal modeling of documents is presented in the next few slides…
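A minimal sketch of this dot-product comparison, using the two document vectors from the example:

```python
# Dot-product comparison of two document vectors:
# Doc-3 = (2,0,2,1,0,0) and Doc-2 = (0,0,3,2,3,2).
doc3 = [2, 0, 2, 1, 0, 0]
doc2 = [0, 0, 3, 2, 3, 2]

elementwise = [a * b for a, b in zip(doc3, doc2)]  # element-wise products
similarity = sum(elementwise)                      # dot product

print(elementwise, similarity)
```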
Vector Space Model Construction
• Document vectors in the word-document matrix are normalized
  – by word counts in the entire document collection
  – by document lengths
• This gives a Vector Space Model (VSM) of the set of documents
• Equations for normalization follow…
Word Count Normalization

  x_ij = (1 − ε_i) · c_ij / n_j

where c_ij is the count of word i in document j (word count in document), n_j is the length of document j (document length), and (1 − ε_i) is a weight that depends on the word's count in the corpus; t_i is the total number of times word i occurs in the corpus.
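A sketch of this normalization, assuming the entropy-based word weight (1 − ε_i) of Bellegarda's LSA formulation (Ref. 3); the exact weighting used in the original work may differ, and the count matrix here is illustrative.

```python
import math

# Illustrative word-document count matrix (rows = words, cols = documents).
counts = [
    [1, 0, 2],   # word 0
    [1, 3, 0],   # word 1
    [0, 2, 1],   # word 2
]
n_docs = len(counts[0])
# n_j: document lengths (column sums)
doc_len = [sum(row[j] for row in counts) for j in range(n_docs)]

normalized = []
for row in counts:
    t_i = sum(row)  # total count of word i in the corpus
    # eps: normalized entropy of word i over documents (assumption:
    # Bellegarda-style weighting; terms with zero count contribute 0)
    eps = -sum((c / t_i) * math.log(c / t_i) for c in row if c) / math.log(n_docs)
    normalized.append([(1 - eps) * c / doc_len[j] for j, c in enumerate(row)])

for row in normalized:
    print([round(x, 3) for x in row])
```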
[Figure: a word-document matrix of raw counts and the corresponding normalized word-document matrix with fractional entries]
Document Vectors after Normalization

[Figure: columns of the normalized word-document matrix, each column a normalized document vector]
Use of Vector Space Model
• A query document is also represented as a vector
• It is normalized by corpus word counts
• Documents related to the query document are identified by measuring the similarity of their document vectors to the query document vector
Protein Secondary Structure
• DSSP (Dictionary of Protein Secondary Structure): annotation of each residue with its structure, based on hydrogen-bonding patterns and geometrical constraints
• 7 DSSP labels for PSS:
  – Helix types: H, G
  – Strand types: E, B
  – Coil types: T, S, I
Example

Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT

Key to DSSP labels: T, S, I, _ = Coil; E, B = Strand; H, G = Helix
Reference Model
• Proteins are segmented into structural segments
• A normalized word-document matrix is constructed from the structural segments
Example

Structural segments are obtained from the given sequence:

Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
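The segmentation step can be sketched as follows: consecutive residues whose DSSP labels map to the same coarse class (helix H/G, strand E/B, coil T/S/I/_, per the key above) are grouped into one structural segment.

```python
from itertools import groupby

# Coarse structure classes for DSSP labels; anything unlisted is coil.
COARSE = {"H": "helix", "G": "helix", "E": "strand", "B": "strand"}

def segments(residues, dssp):
    # Group consecutive positions with the same coarse class into segments.
    out, i = [], 0
    for cls, run in groupby(dssp, key=lambda c: COARSE.get(c, "coil")):
        n = len(list(run))
        out.append((cls, residues[i:i + n]))
        i += n
    return out

residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
segs = segments(residues, dssp)
print(len(segs), "structural segments")
```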
Example

Unigrams in the structural segments. The nine segments obtained from the sequence above are:
Seg-1 PKPPVKFN, Seg-2 RRIFLLNTQNVI, Seg-3 NG, Seg-4 YVKWAI, Seg-5 ND, Seg-6 VSL, Seg-7 ALPPTP, Seg-8 YLGAMK, Seg-9 YNLLH

Amino-acid counts per segment (amino acids not occurring in the sequence are omitted):

     Seg-1 Seg-2 Seg-3 Seg-4 Seg-5 Seg-6 Seg-7 Seg-8 Seg-9
A      .     .     .     1     .     .     1     1     .
D      .     .     .     .     1     .     .     .     .
F      1     1     .     .     .     .     .     .     .
G      .     .     1     .     .     .     .     1     .
H      .     .     .     .     .     .     .     .     1
I      .     2     .     1     .     .     .     .     .
K      2     .     .     1     .     .     .     1     .
L      .     2     .     .     .     1     1     1     2
M      .     .     .     .     .     .     .     1     .
N      1     2     1     .     1     .     .     .     1
P      3     .     .     .     .     .     3     .     .
Q      .     1     .     .     .     .     .     .     .
R      .     2     .     .     .     .     .     .     .
S      .     .     .     .     .     1     .     .     .
T      .     1     .     .     .     .     1     .     .
V      1     1     .     1     .     1     .     .     .
W      .     .     .     1     .     .     .     .     .
Y      .     .     .     1     .     .     .     1     1
Amino-acid Structural-Segment Matrix
• The amino-acid × structural-segment count table is the analogue of the word-document matrix: amino acids play the role of words, and structural segments the role of documents.
Document Vectors and Word Vectors
• Columns of the amino-acid structural-segment matrix are the "document vectors" (one per segment); rows are the "word vectors" (one per amino acid).
Query Vector
• A query segment is represented the same way, e.g. Query-1 with amino-acid counts D: 2, E: 1, H: 1, I: 1, N: 2, Q: 2, V: 1 (all other amino acids 0).
Data Set used for PSSP
• JPred data
  – 513 protein sequences in all
  – <25% homology between sequences
  – residues and corresponding DSSP annotations are given
• We used
  – 50 sequences for model construction (training)
  – 30 sequences for testing
Classification
• Proteins from the test set
  – are segmented into structural elements, called "query segments"
  – segment vectors are constructed
• For each query segment
  – the n most similar reference segment vectors are retrieved
  – the query segment is assigned the same structure as the majority of the retrieved segments*

*k-nearest-neighbour classification
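A sketch of this k-nearest-neighbour step; cosine similarity between segment vectors is an assumption (the slides do not name the measure), and the reference vectors and labels below are illustrative.

```python
from collections import Counter
import math

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(query, references, n=3):
    # references: list of (structure_label, vector) pairs from the model.
    # Retrieve the n most similar reference vectors and take a majority vote.
    nearest = sorted(references, key=lambda r: cosine(query, r[1]), reverse=True)[:n]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]

# Illustrative reference model.
references = [
    ("coil",   [1, 0, 2, 1]),
    ("coil",   [2, 0, 1, 1]),
    ("helix",  [0, 3, 0, 1]),
    ("strand", [0, 1, 3, 0]),
]
print(classify([1, 0, 1, 1], references))
```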
Structure-Type Assignment to Query Vector

[Figure: the query vector is compared for similarity against every reference vector in the model (key: helix, strand, coil), and the 3 most similar reference vectors are retrieved]

• Majority voting among the 3 most similar reference vectors gives Coil
• Hence the structure type assigned to the query vector is Coil
Choice of Vocabulary in Protein Sequences
• Amino acids
• But amino acids are not all distinct; their similarity is primarily due to chemical composition
• So:
  – represent protein segments in terms of "types" of amino acids
  – or in terms of "chemical composition"
Representation in terms of "Types" of Amino Acids
• Classify based on electronic properties:
  – electron donors: D, E, A, P
  – weak electron donors: I, L, V
  – ambivalent: G, H, S, W
  – weak electron acceptors: T, M, F, Q, Y
  – electron acceptors: K, R, N
  – C (by itself, another group)
• Alternatively, use chemical groups
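A sketch of re-encoding a protein segment with these electronic-property classes, shrinking the vocabulary from 20 amino acids to 6 types; the one-letter class codes used below are a hypothetical choice of ours, not from the original work.

```python
# Map each amino acid to its electronic-property class.
# The single-letter class codes (d, w, a, x, e, c) are our own invention.
AA_TYPE = {}
for code, aas in [
    ("d", "DEAP"),   # electron donors
    ("w", "ILV"),    # weak electron donors
    ("a", "GHSW"),   # ambivalent
    ("x", "TMFQY"),  # weak electron acceptors
    ("e", "KRN"),    # electron acceptors
    ("c", "C"),      # cysteine, a group by itself
]:
    for aa in aas:
        AA_TYPE[aa] = code

segment = "YLGAMK"
encoded = "".join(AA_TYPE[aa] for aa in segment)
print(encoded)
```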
Results of Classification with "AA" as Words

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
AA Train   97.8   56.7   91.4  82.7   81.9     99.6   87.6   65.9  77.5   84.3
AA Test    42.7   30.1   83.3  62     52       65.8   67.3   20    40.6   51

Train: leave-one-out testing of reference vectors; Test: unseen query segments. Micro/Macro = micro-average/macro-average.
Results with "Chemical Groups" as Words

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
CW Train   96.7   58.9   92.2  83.6   82.6     99.6   88.3   68.4  79     85.4
CW Test    60     50     90    67.2   66.7     60     60     80    65.9   66.4

• The VSM is built using both reference segments and test segments
  – structure labels of reference segments are known
  – structure labels of query segments are unknown
Modification to the Word-Document Matrix
• Latent Semantic Analysis
• The word-document matrix is transformed by Singular Value Decomposition
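A minimal sketch of the SVD step behind LSA, using NumPy; the count matrix and the number of latent dimensions k are illustrative, not the values used in the original work.

```python
import numpy as np

# Illustrative word-document matrix (rows = words, cols = documents).
W = np.array([
    [1., 0., 2., 0.],
    [1., 0., 0., 0.],
    [1., 3., 2., 1.],
    [0., 2., 1., 0.],
])

# Singular Value Decomposition: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 2  # number of latent dimensions kept (assumption)
# Each column of doc_latent is a document in the k-dimensional latent space.
doc_latent = np.diag(s[:k]) @ Vt[:k]

# The rank-k product approximates the original matrix.
W_k = U[:, :k] @ doc_latent
print(doc_latent.shape)
```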
Results with "AA" as Words, using LSA

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
AA Train   97     60     91.6  83.7   82.8     99.2   87     70.4  79.7   85.5
AA Test    40     50     80    63.6   66.1     70     50     80    63.3   66.8
Results with "Types of AA" as Words, using LSA

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
AA Train   82.7   53.3   75.6  70.6   70.6     96.2   81.4   23.5  67     67
AA Test    90     70     30    60.5   60.5     70     50     70    63.5   63.5
Results with "Chemical Groups" as Words, using LSA

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
CW Train   99.6   66.2   82.7  82.6   80.9     99.6   89     54.2  81     80.9
CW Test    80     50     50    55.7   59.7     40     40     80    64.4   55.1
LSA Results for Different Vocabularies

                 Precision                            Recall
           Helix  Sheet  Coil  Micro  Macro    Helix  Sheet  Coil  Micro  Macro
Amino acids, LSA
  AA Train 97     60     91.6  83.7   82.8     99.2   87     70.4  79.7   85.5
  AA Test  40     50     80    63.6   66.1     70     50     80    63.3   66.8
Types of amino acids, LSA (Micro only)
  Train    82.7   53.3   75.6  70.6            96.2   81.4   23.5  67
  Test     90     70     30    60.5            70     50     70    63.5
Chemical groups, LSA
  CW Train 99.6   66.2   82.7  82.6   80.9     99.6   89     54.2  81     80.9
  CW Test  80     50     50    55.7   59.7     40     40     80    64.4   55.1
Model Construction using All Data

Word-document matrices constructed using both reference and query data:

                   Precision                              Recall
             Helix  Strand  Other  Micro  Macro    Helix  Strand  Other  Micro  Macro
Amino acids
  VSM Train  98.5   56.2    92.3   83.2   82.3     98.9   89.4    64.9   77.1   84.4
  VSM Test   61.3   39.6    78     64.8   59.6     48     65      59.6   58.9   57.5
Chemical groups
  LSA Train  99.2   66.7    88.6   84.8   84.8     99.6   93.2    53     82     82
  LSA Test   50     45      83     67.1   59.4     83.6   50.4    60.6   62     64.8
  VSM Train  99.2   63.2    79.8   80.7   80.7     99.6   87.1    49.2   78.6   78.6
  VSM Test   49.8   45.2    81.5   66.2   58.8     84.2   41.6    61.7   60.3   62.5
Amino-acid types (Micro only)
  LSA Train  85.2   55.2    78.4   72.9            96.2   83      28.8   69.3
  LSA Test   74.4   47      67.1   62.7            82.2   67.1    30.9   60.8
  VSM Train  77.1   57      81.7   72              95.5   80.3    28.8   68.1
  VSM Test   72.5   48.4    77.4   66.1            84.9   71.7    27     61.1

• Matrix models are constructed using both reference and query documents together; this gives better models both for normalization and for construction of the latent semantic model.
Applications
• Complement other methods for protein structure prediction
  – segmentation approaches
• Protein classification as all-alpha, all-beta, alpha+beta, or alpha/beta types
• Automatic assignment of new proteins to SCOP families
References
1. Kabsch, W. and Sander, C., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features", Biopolymers, 1983.
2. Dwyer, D.S., "Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding", J Biomol Struct Dyn, 2001. 18(6): p. 881-92.
3. Bellegarda, J., "Exploiting Latent Semantic Information in Statistical Language Modeling", Proceedings of the IEEE, Vol. 88(8), 2000.
Use of SVD
• Representation of training and test segments is very similar to that in the VSM
• Structure-type assignment goes through the same process, except that it is done with the LSA matrices
Classification of Query Document
• A query document is also represented as a vector
• It is normalized by corpus word counts
• Documents related to the query are identified by measuring the similarity of document vectors to the query document vector
• The query document is assigned the same structure as that of the documents retrieved by the similarity measure
• Majority voting*

*k-nearest-neighbour classification