Transmembrane Protein Prediction
Project Presentation
CMPUT 606
Overview
Transmembrane (TM) protein:
Associated with the plasma membrane
“A protein that has domains exposed on both sides of the membrane” [Genes VII]
Some of the TM proteins that span the lipid layer several times form a hydrophilic channel that permits various ions and molecules to pass through the plasma membrane.
Transmembrane Proteins
Transmembrane Segments
Ion Channels
Transmembrane Domains
Data Sets
Data Set: Brief Description
DB-TMR: Database of TM segments (not fasta). After translation into fasta: DB-TMR40672.fasta (TM segments flanked by 5 amino acids at each end of the segment) and DB-TMR40672onlytm.fasta (only TM segments). Each file contains 40672 protein sequences.
PDB: Database of 3D structures in PDB format. After translation into fasta and removal of all nucleotide sequences: pdb61042.fasta. The file PDBseqsnontm.fasta, which contains 645 globular proteins, constitutes a negative test set. Running the TMHMM predictor on the protein chains extracted from PDB produced a file with the prediction for each of the 61042 sequences, outputTMHMMonPDB61042.txt. Out of the 61042 sequences, 1294 were predicted to be TM. The sequences predicted as TM are stored in fasta format in the file seqOutputTMHMM1294.fasta (each entry is prefixed with “sp|” to mark it as TM for prediction and testing purposes).
PDB_TM: Database of TM proteins in XML format. The site provides lists of PDB IDs representing TM proteins (test_mem.txt) and globular proteins (globpdb.txt). From these initial lists, the following fasta files are generated: tm.fasta with 916 TM protein chains, nontm.fasta with 900 protein chains, and both.fasta with 1816 protein chains. We also have two files obtained from the PDB_TM site, among them pdbtm_all.seq with 1363 chains.
TMHMMset160: The dataset used to train TMHMM. It comprises 160 protein sequences in fasta format. They are all TM proteins and their entries are prefixed with “sp|”.
TMPDB: Database of 302 verified transmembrane protein sequences, together with their TM domain location and number, in SwissProt format. After translation into fasta for all the TM categories, alpha helix non-redundant (231), alpha buried non-redundant (7), and beta non-redundant (15), the three files are concatenated into tmpdb253.fasta, which contains 253 protein sequences in fasta format (“sp|”).
TMbase: Database of transmembrane proteins and their helical membrane-spanning domains. It is mainly based on Swiss-Prot.
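The "sp|" labelling convention described for the fasta files can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (read_fasta, tag_records, count_by_label) and the toy records are hypothetical and are not the scripts used in the project; the "nontm|" prefix is the one that appears in the HMMer output later in the slides.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def tag_records(records, prefix):
    """Prepend a class prefix such as 'sp|' or 'nontm|' to each header."""
    return [(h if h.startswith(prefix) else prefix + h, s) for h, s in records]


def count_by_label(records, tm_prefix="sp|"):
    """Count TM vs. non-TM entries based on the header prefix."""
    tm = sum(1 for h, _ in records if h.startswith(tm_prefix))
    return {"tm": tm, "nontm": len(records) - tm}


# Toy input (hypothetical IDs and sequences, not taken from the data sets).
tm_lines = [">demo1", "LLVLFMTIPGIALF", "YGGLIRG", ">demo2", "AVLLGIVAFL"]
glob_lines = [">demo3", "MKDEKRSTNQ"]

records = tag_records(list(read_fasta(tm_lines)), "sp|")
records += tag_records(list(read_fasta(glob_lines)), "nontm|")
print(count_by_label(records))
```

Keeping the class label in the header means a single mixed file such as both.fasta can carry its own ground truth through prediction and scoring.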
Predictors
ePST bPST TMHMM TMpred HMMTOP HMMer TMDET
Predictors
Predictor: Major/Minor Contribution, Impact
TMHMM
Contribution: Predictor for TM helices. Based on TMHMM predictions, the authors estimated that 20-30% of all genes in most genomes encode membrane proteins.
Impact: In July 2001: rated best for prediction of TM helices. The accuracy reported is 97-98%.
TMpred
Contribution: Predictor for membrane-spanning regions and their orientation. The underlying algorithm is based on the statistical analysis of TMbase. The prediction is made using a combination of several weight matrices for scoring.
Impact: Still a reference comparison for TM protein prediction.
HMMer
Contribution: Searches for homologues of a sequence family. Builds an HMM from the training data and searches a sequence database with it to find homologues. The model is built from an input file on which a multiple sequence alignment (MSA) has been performed.
Impact: Improves upon the methods for sensitive database searches using multiple sequence alignments as queries.
HMMTOP
Contribution: Builds on an HMM architecture. The training model is a regularizer that is estimated from a set of known TM proteins. The prediction model is estimated from the query sequence and is then used to predict the structure of that sequence. The server only accepts one test sequence at a time.
Impact: The accuracy reported is 78%.
TMDET
Contribution: Predictor for transmembrane domains. Based only on the structural (3D) information of the protein. Determines the membrane planes relative to the position of atomic coordinates. A discrimination function separates TM and globular proteins even in cases of low resolution or incomplete structures such as fragments or parts of large multi-chain complexes. First algorithm that uses the 3D structure as input, identifies TM proteins, and determines membrane location. This method can be used to annotate protein structures having TM segments.
Impact: Generates PDB_TM, an automatically updated database of TM proteins from PDB. The algorithm can also construct a globular protein database.
bPST
Contribution: Histories are represented in the tree. Alternative approach for detecting significant patterns in protein sequences based on probabilistic suffix trees (PSTs), without any prior information about the input sequences and without prior alignment of the input sequences.
Impact: The PST model detects many more related sequences than pairwise methods, and it is much faster than an HMM and almost as sensitive.
ePST
Contribution: Training sequences are represented in the tree. Prediction of the probability of a protein sequence function using an efficient PST is possible in linear time.
Impact: Good results for protein function prediction.
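To make the PST idea more concrete, here is a minimal sketch of a suffix-context model: it counts which residue follows each context of up to max_depth residues and, at prediction time, scores each residue using the longest suffix of its history that was seen in training. This is a deliberate simplification and an assumption on our part: it uses a flat count table rather than a pruned tree, it is not the bPST or ePST implementation, and the names SuffixContextModel, max_depth, and pseudocount are hypothetical.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class SuffixContextModel:
    """Toy variable-order model: P(residue | longest seen suffix of history)."""

    def __init__(self, max_depth=3, pseudocount=1.0):
        self.max_depth = max_depth
        self.pseudocount = pseudocount
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i, symbol in enumerate(seq):
                for depth in range(self.max_depth + 1):
                    if i - depth < 0:
                        break
                    self.counts[seq[i - depth:i]][symbol] += 1

    def log_prob(self, history, symbol):
        # Keep at most max_depth residues, then back off to shorter suffixes.
        context = history[-self.max_depth:] if self.max_depth > 0 else ""
        while context and context not in self.counts:
            context = context[1:]
        table = self.counts.get(context, {})
        total = sum(table.values()) + self.pseudocount * len(AMINO_ACIDS)
        return math.log((table.get(symbol, 0) + self.pseudocount) / total)

    def score(self, seq):
        """Per-residue log-probabilities; one pass over the sequence."""
        return [
            self.log_prob(seq[max(0, i - self.max_depth):i], seq[i])
            for i in range(len(seq))
        ]


model = SuffixContextModel(max_depth=2)
model.train(["LLVLLFLL", "ILLVLLAL"])  # toy hydrophobic training segments
print(model.score("LLVK"))
```

Because each residue needs only a bounded number of lookups, scoring is linear in the sequence length, which is the property the ePST row refers to. No alignment of the training sequences is required.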
Predictors Performance: Theoretical Time
TMHMM
Short form prediction
sp_1xqe_A len=418 ExpAA=243.54 First60=39.67 PredHel=11 Topology=o10-32i45-67o98-120i127-149o159-181i193-215o225-247i259-281o285-302i315-337o352-374i
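The short-form line can be split into its key=value fields and the topology string decoded into helix positions. The sketch below assumes the layout seen in the example line (whitespace-separated key=value tokens, with "i"/"o" marking the inside/outside loop that precedes each helix); parse_tmhmm_short is a hypothetical helper name, not part of TMHMM.

```python
import re


def parse_tmhmm_short(line):
    """Parse one TMHMM short-format line into (name, fields, helices)."""
    tokens = line.split()
    name = tokens[0]
    fields = {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    # Each helix is written as <side><start>-<end>, e.g. "o10-32".
    helices = [
        (side, int(start), int(end))
        for side, start, end in re.findall(
            r"([io])(\d+)-(\d+)", fields.get("Topology", "")
        )
    ]
    return name, fields, helices


line = (
    "sp_1xqe_A len=418 ExpAA=243.54 First60=39.67 PredHel=11 "
    "Topology=o10-32i45-67o98-120i127-149o159-181i193-215o225-247"
    "i259-281o285-302i315-337o352-374i"
)
name, fields, helices = parse_tmhmm_short(line)
print(name, len(helices), helices[0], helices[-1])
```

Comparing len(helices) with the PredHel field is a cheap sanity check that a line was not truncated or wrapped.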
TMHMM
TMpred
TMpred
HMMTOP
TMDET
HMMer Flow
HMMer
Scores for complete sequences (score includes all domains):
Sequence      Description          Score  E-value  N
--------      -----------          -----  -------  ---
nontm|1ALO._  OXIDOREDUCTASE       -20.6      4.7  1
nontm|1CDE._  TRANSFERASE(FORMYL)  -26.1      9.9  1
nontm|1AKO._  NUCLEASE             -27.4       10  1
nontm|1ARU._  PEROXIDASE           -37.1       10  1
sp|1pv7_A                          -41.7       10  1
sp|1pw4_A                          -46.0       10  1
sp|1pxs_A                          -48.9       10  1
sp|1xqe_A                          -49.0       10  1
sp|1r2c_L                          -53.2       10  1
nontm|1HSB.B  HISTOCOMPATIBILITY   -61.4       10  1
Parsed for domains:
Sequence      Domain  seq-f  seq-t     hmm-f  hmm-t     score  E-value
--------      ------  -----  -----     -----  -----     -----  -------
nontm|1ALO._  1/1       125    323 ..      1    199 []  -20.6      4.7
nontm|1CDE._  1/1         4    202 ..      1    199 []  -26.1      9.9
nontm|1AKO._  1/1         5    202 ..      1    199 []  -27.4       10
nontm|1ARU._  1/1       112    295 ..      1    199 []  -37.1       10
sp|1pv7_A     1/1       116    314 ..      1    199 []  -41.7       10
sp|1pw4_A     1/1       162    329 ..      1    199 []  -46.0       10
sp|1pxs_A     1/1        51    249 .]      1    199 []  -48.9       10
sp|1xqe_A     1/1        39    226 ..      1    199 []  -49.0       10
sp|1r2c_L     1/1        62    260 ..      1    199 []  -53.2       10
nontm|1HSB.B  1/1         2     99 .]      1    199 []  -61.4       10
HMMer
Total sequences searched: 10
Whole sequence top hits:
tophits_s report: Total hits: 10 Satisfying E cutoff: 9 Total memory: 16K
Domain top hits:
tophits_s report: Total hits: 10 Satisfying E cutoff: 10 Total memory: 22K
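The per-sequence score rows in the HMMer report can be read programmatically to see where TM ("sp|") and non-TM ("nontm|") entries rank. The sketch below is an assumption about how one might do this, not the project's script: parse_hit_line is a hypothetical helper that takes the last three whitespace-separated tokens as score, E-value, and N, and treats everything between the name and those tokens as the (possibly empty) description.

```python
def parse_hit_line(line):
    """Parse one row of the 'Scores for complete sequences' table."""
    tokens = line.split()
    name = tokens[0]
    return {
        "name": name,
        "description": " ".join(tokens[1:-3]),  # empty for the sp| rows
        "score": float(tokens[-3]),
        "evalue": float(tokens[-2]),
        "n_domains": int(tokens[-1]),
        "is_tm": name.startswith("sp|"),
    }


# Rows copied from the report shown on the slide.
rows = """\
nontm|1ALO._ OXIDOREDUCTASE -20.6 4.7 1
nontm|1CDE._ TRANSFERASE(FORMYL) -26.1 9.9 1
nontm|1AKO._ NUCLEASE -27.4 10 1
nontm|1ARU._ PEROXIDASE -37.1 10 1
sp|1pv7_A -41.7 10 1
sp|1pw4_A -46.0 10 1
sp|1pxs_A -48.9 10 1
sp|1xqe_A -49.0 10 1
sp|1r2c_L -53.2 10 1
nontm|1HSB.B HISTOCOMPATIBILITY -61.4 10 1
""".splitlines()

hits = [parse_hit_line(row) for row in rows]
top4 = sorted(hits, key=lambda h: h["score"], reverse=True)[:4]
print(sum(h["is_tm"] for h in hits), [h["name"] for h in top4])
```

In this ten-sequence example the four best-scoring hits are all non-TM entries, which illustrates why ranking by HMMer score alone separates the two classes poorly here.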
ePST Output
TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365
Total # ePST segments = 13
ePST Outputs
# i  char  pos    neg      odds    tot       win       maxwin    regions
0    A     -1.87  -708.40  706.52  706.52    706.52    0.00      -s
1    P     -2.96  -708.40  705.44  1411.96   1411.96   0.00      -s
2    A     -1.87  -708.40  706.52  2118.48   2118.48   0.00      -s
3    V     -0.75  -708.40  707.64  2826.13   2826.13   0.00      -s
4    A     -1.80  -708.40  706.60  3532.72   3532.72   0.00      -s
5    D     -6.47  -708.40  701.92  4234.65   4234.65   0.00      -s
6    K     -3.53  -708.40  704.87  4939.52   4939.52   0.00      -s
7    A     -3.40  -708.40  705.00  5644.51   5644.51   0.00      -s
8    D     -6.47  -708.40  701.92  6346.43   6346.43   0.00      -s
9    N     -5.22  -708.40  703.18  7049.61   7049.61   0.00      -s
10   A     -1.87  -708.40  706.52  7756.14   7756.14   0.00      -s
11   F     -3.91  -708.40  704.49  8460.63   8460.63   0.00      -s
12   M     -3.76  -708.40  704.63  9165.26   9165.26   0.00      -s
13   M     -3.76  -708.40  704.63  9869.89   9869.89   0.00      -s
14   I     -2.06  -708.40  706.34  10576.23  10576.23  0.00      -s
15   C     -4.54  -708.40  703.86  11280.08  10573.56  10573.56  -s
16   T     -2.71  -708.40  705.69  11985.77  10573.81  10573.81  -s
17   A     -2.48  -708.40  705.91  12691.68  10573.20  10573.81  -s
18   L     -4.01  -708.40  704.38  13396.07  10569.94  10573.81  -s
19   V     -1.29  -708.40  707.11  14103.18  10570.45  10573.81  -s
20   L     -0.59  -708.40  707.81  14810.99  10576.34  10576.34  -s
21   F     -1.12  -708.40  707.28  15518.26  10578.75  10578.75  +s
22   M     -3.76  -708.40  704.63  16222.90  10578.39  10578.75  +s
23   T     -3.12  -708.40  705.27  16928.17  10581.74  10581.74  +s
24   I     -0.87  -708.40  707.52  17635.69  10586.08  10586.08  +s
25   P     -0.51  -708.40  707.89  18343.58  10587.44  10587.44  +s
26   G     -2.25  -708.40  706.15  19049.73  10589.11  10589.11  +s
27   I     -1.49  -708.40  706.91  19756.64  10591.38  10591.38  +s
28   A     -1.54  -708.40  706.85  20463.50  10593.61  10593.61  +s
29   L     -4.01  -708.40  704.38  21167.88  10591.65  10593.61  +s
30   F     -1.92  -708.40  706.48  21874.36  10594.27  10594.27  +s
31   Y     -6.07  -708.40  702.33  22576.69  10590.91  10594.27  +s
32   G     -2.25  -708.40  706.15  23282.84  10591.15  10594.27  +s
33   G     -4.38  -708.40  704.02  23986.86  10590.79  10594.27  +s
34   L     -1.54  -708.40  706.85  24693.71  10590.53  10594.27  +s
35   I     -2.06  -708.40  706.34  25400.05  10589.06  10594.27  +s
36   R     -2.75  -708.40  705.65  26105.70  10587.43  10594.27  +s
37   G     -2.25  -708.40  706.15  26811.85  10588.95  10594.27  +
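The numeric columns of the ePST output above appear to follow a simple pattern, which the sketch below reproduces: odds = pos - neg, tot is the running sum of odds, and win is the sum of odds over the last 15 positions once at least 15 earlier positions exist (for example, win at i=15 equals tot at i=15 minus tot at i=0). This reading is inferred from the printed rows, not taken from the ePST source, and window_scores is a hypothetical function name. The rule that sets the +/- region flag is not stated on the slide and is not modelled here.

```python
def window_scores(pos, neg, window=15):
    """Recompute the odds, tot, win and maxwin columns from pos and neg."""
    rows = []
    cumulative = []  # cumulative[i] = sum of odds[0..i]
    total = 0.0
    max_win = 0.0    # stays 0.00 until the first sliding-window value
    for i, (p, n) in enumerate(zip(pos, neg)):
        odds = p - n
        total += odds
        cumulative.append(total)
        if i >= window:
            # Sum of odds over positions i-window+1 .. i
            win = total - cumulative[i - window]
            max_win = win if i == window else max(max_win, win)
        else:
            win = total
        rows.append({"i": i, "odds": odds, "tot": total,
                     "win": win, "maxwin": max_win})
    return rows


# First 16 'pos' values from the table; 'neg' is constant there.
pos = [-1.87, -2.96, -1.87, -0.75, -1.80, -6.47, -3.53, -3.40,
       -6.47, -5.22, -1.87, -3.91, -3.76, -3.76, -2.06, -4.54]
neg = [-708.40] * len(pos)
rows = window_scores(pos, neg, window=15)
print(round(rows[15]["win"], 2), round(rows[15]["maxwin"], 2))
```

Because the table prints values rounded to two decimals, recomputed sums can differ from the printed ones in the last digits.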
ePST Execution Flow
Training Set
Testing Set
ePST
ePST Prediction
Post-processing Scripts
TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365
Total # segments predicted by ePST = 13
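One plausible post-processing step is to turn a per-residue string of +/- flags into (start, end) segments like the ones listed above. The sketch below is an assumption about what such a script could look like, not the project's actual scripts: flags_to_segments and merge_close are hypothetical names, positions are assumed to be 1-based, and the merging step is purely optional (the listed output keeps close segments such as 163-166 and 168-175 separate).

```python
def flags_to_segments(flags, first_index=1):
    """Convert a string of '+'/'-' flags into inclusive (start, end) runs."""
    segments = []
    start = None
    for offset, flag in enumerate(flags):
        position = first_index + offset
        if flag == "+" and start is None:
            start = position
        elif flag != "+" and start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:
        segments.append((start, first_index + len(flags) - 1))
    return segments


def merge_close(segments, max_gap=1):
    """Optionally join segments separated by at most max_gap residues."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


segments = flags_to_segments("--+++--++")
print(segments)
print(merge_close([(163, 166), (168, 175), (199, 201)], max_gap=1))
```

Whether to merge near-adjacent segments is a design choice: merging yields fewer, longer predicted helices, while leaving them split preserves exactly what the predictor flagged.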
HMMer Results for both.fasta
Step Time
CLUSTALW 41.37s (41.05s)
hmmbuild 0.56s (0.25s -f)
hmmcalibrate 5.47s (2.71s -f)
hmmsearch 1.56s (0.73s -f)
HMMer vs. ePST
Predictor | Train Global | Train Local | Test Global | Test Local | Accuracy Global | Accuracy Local
HMMer | Train: 6.03 (2.96) | Test: 1.56 (0.73) | Accuracy: 240/916 = 26%
ePST | 0.33 | 0.23 | 5.59 (fp=fn=288) | 0.82 (fp=fn=333) | 69% | 64%
ePST
Training | Testing | Local Accuracy | Global Accuracy
DBTMR40672 | q.fasta | 100% (W 15) | 100%
DBTMR1000 | q.fasta | 60% (W 15), fp=2; fn=0 | 100%
DBTMR40672 | both.fasta | fp=fn=269, 71% | fp=fn=247, 73%
DBTMR1000 | both.fasta | fp=fn=333, 64% | fp=fn=288, 69%
Set 160 | both.fasta | fp=fn=321, 65% | 72.55% (-), 73% (+)
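The slide does not state how the percentages on both.fasta were computed. They are, however, consistent with the fraction of the 916 TM chains that were not missed, that is (916 - fn) / 916. The short check below makes that arithmetic explicit; this formula is our inference from the numbers, and tm_recall_percent is a hypothetical name.

```python
N_TM = 916  # TM chains in tm.fasta, as listed on the Data Sets slide


def tm_recall_percent(fn, n_tm=N_TM):
    """Percentage of TM chains correctly recovered, given false negatives."""
    return 100.0 * (n_tm - fn) / n_tm


# fp=fn counts reported on the slide for both.fasta.
reported_fn = [269, 247, 333, 288, 321]
computed = [round(tm_recall_percent(fn)) for fn in reported_fn]
print(computed)
```

Since fp equals fn in each of these rows, the same value is also the precision on the TM class, so a single percentage summarizes both.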
Cross-validation (5 folds) - ePST
Data Set | Train Global | Train Local | Test Global | Test Local | Accuracy Global | Accuracy Local
DBTMR40672 | 7.15 | 7.20 | 0.36 | 0.42 | 100% | 100%
DBTMR1000 | 0.13 | 0.13 | 0.01 | 0.01 | 100% | 100%
tm.fasta | 0.60 | 0.60 | 0.01 | 0.01 | 100% | 100%
both.fasta | 0.61 | 0.61 | 1.67 | 1.67 | 99% | 99%
Set 160 | 0.84 | 0.84 | 0.00 | 0.00 | 100% | 100%
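For reference, a 5-fold split of a sequence set can be generated as sketched below. This is a generic illustration of k-fold cross-validation, assuming a shuffled, near-equal partition; it is not the procedure or script used to produce the numbers above, and k_fold_splits and the seed value are hypothetical.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Example: indices standing in for the 160 sequences of Set 160.
splits = list(k_fold_splits(range(160), k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sequence appears in exactly one test fold, so every sequence is predicted once by a model that did not see it during training.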
TMHMM and ePST
Predictor | Testing | Local Accuracy | Global Accuracy
TMHMM | both.fasta | 99.11% (-), 60% (+)
ePST trained on Set 160 | both.fasta | 65% | 72.55% (-), 73% (+)
ePST trained on mix.fasta | both.fasta | 74% (W 15, 20), 78% (W 10, 35), 80% (W 25, 27) | 78%
ePST trained on Set 160 | q.fasta | 100% | 100%
TMHMM | q.fasta | 100%
Scanning PDB
Training: DBTMR40672
Testing: PDB
Threshold 705.37 -> Nrtm = 1665 chains
PDB_TM retrieves 1673 chains
Validation necessary: lack of ground truth
TMH Benchmark
tmeval.fasta: 2247 non-annotated sequences
Script for converting ePST output to TMH submit format
Comparison with other predictors
4 tables
8 evaluation parameters
Window 25, 35, T 10584 - High Resolution
Window 25, 35, T 10584 - Low Resolution
Window 15, T 10588 – High Resolution
Window 15, T 10588 – Low Resolution
Window 15, T 10588 – False Positives
Window 15, T 10588 – Confusion with Signal Peptides
Conclusions
ePST is a competitive predictor
Fast training
Scales well, in contrast with HMMs
ePST does not suffer from poor local minima as HMMs do
ePST does not require an MSA of the sequences
ePST allows more than one test sequence at a time
Future Work
More tuning, use pruning
Applications to other tasks (phosphorylation) involved in signal transduction pathways
Search for a verified data set for training and testing (no consensus in the literature)
Extract features from the sequence
Analyze the false negatives with particular helix topologies (such as 1orq)