Funding
This work has been supported by the Sven och Lilly Lawskis fond för naturvetenskaplig forskning, the Swedish Research Council (VR-NT 2012-5046, VR-M 2010-3555) and the Swedish E-science Research Center.
Knowledge of a protein's correct subcellular localization is necessary for understanding its function. Unfortunately, large-scale experimental studies are limited in their accuracy, and the development of prediction methods has therefore been limited by the amount of accurate experimental data. Recently, however, large-scale experimental studies have provided new data that can be used to evaluate the accuracy of subcellular localization predictions in human cells. Using these data, we examined the performance of state-of-the-art methods and developed SubCons.
References
Salvatore, M. et al. SubCons: a new ensemble method for improved human subcellular localization predictions. Bioinformatics, 2017, 33(16), 2464-2470.
Salvatore, M., Shu, N., Elofsson, A. The SubCons web-server: A user friendly web interface for state-of-the-art subcellular localization prediction. Protein Science, 2017 Sep 13. doi: 10.1002/pro.3297
Workflow of SubCons. SubCons combines predictions from four predictors using a Random Forest classifier. These tools accept either one or more FASTA sequences (CELLO2.5, MultiLoc2 and SherLoc2) or a FASTA sequence plus an MSA profile (LocTree2). The profile is constructed using PRODRES. The predicted localizations are first mapped to a standard three-letter code. Thereafter, a vector of 9 × 4 values (nine classes for each of the four predictors) is used as input to a Random Forest classifier that outputs nine values, one for each class. The value of each class corresponds to the average score of that class over the forest.
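The input and output shapes described in the workflow can be sketched in a few lines of Python. This is only an illustration, not the SubCons implementation: the function names are hypothetical, missing scores are assumed to be 0.0, and because only eight three-letter codes appear in the figures, the ninth class is represented by the placeholder "XXX".

```python
from statistics import mean

# Three-letter codes that appear in the poster's figures. The ninth class is
# not named in this text, so "XXX" is a placeholder (assumption).
CLASSES = ["MIT", "NUC", "LYS", "ERE", "MEM", "GLG", "CYT", "PEX", "XXX"]
PREDICTORS = ["CELLO2.5", "LocTree2", "MultiLoc2", "SherLoc2"]


def build_feature_vector(scores):
    """Flatten per-predictor class scores into one 9 x 4 = 36-value vector.

    `scores` maps predictor name -> {class code -> score}. Classes that a
    predictor does not report are set to 0.0 (an assumption of this sketch).
    """
    return [scores.get(p, {}).get(c, 0.0) for p in PREDICTORS for c in CLASSES]


def forest_class_scores(per_tree_scores):
    """Average each class score over all trees of a forest.

    `per_tree_scores` holds one list of class scores per tree.
    """
    return [mean(column) for column in zip(*per_tree_scores)]


# Toy input: two predictors report scores, the other two report nothing.
example = {
    "CELLO2.5": {"NUC": 0.9, "CYT": 0.1},
    "LocTree2": {"NUC": 1.0},
}
vector = build_feature_vector(example)
print(len(vector))  # 36
```

Keeping the predictor order and the class order fixed matters here, because the classifier sees only positions in the vector, not names.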
Introduction
Conclusions
The SubCons web-server: A user friendly web interface for state-of-the-art subcellular localization prediction
1 Science for Life Laboratory, Stockholm University, 171 21, Solna, Sweden. 2 Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden. 3 Sweden Bioinformatics Infrastructure for Life Sciences (BILS), Stockholm University, Stockholm, Sweden
• SubCons is an ensemble method that combines four predictors using a Random Forest classifier. It is freely available as a web-server at http://subcons.bioinfo.se
• SubCons outperforms earlier methods on a dataset of proteins for which two independent methods confirm the subcellular localization.
• Given nine subcellular localizations, SubCons achieves an F1-score of 0.79, compared to 0.70 for the second-best method. Furthermore, at a false positive rate of 1%, the true positive rate is over 58% for SubCons, compared to less than 50% for the best individual predictor.
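For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single localization class from paired lists of true and predicted labels. The function name and the toy labels are illustrative assumptions, and the poster does not state how per-class scores were combined into the overall value.

```python
def f1_for_class(true_labels, predicted_labels, positive):
    """Return the F1-score of one class, treating it as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not taken from the golden dataset).
truth = ["NUC", "NUC", "CYT", "MIT"]
predicted = ["NUC", "CYT", "CYT", "MIT"]
nuc_f1 = f1_for_class(truth, predicted, "NUC")
```

In this toy case the NUC class has precision 1.0 and recall 0.5, so its F1-score is 2/3.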
Materials
Venn diagram showing the training and golden datasets. The figure shows the three verified experimental datasets used to train SubCons (left). Additionally, it shows the golden dataset used to test SubCons, in which at least two independent methods must confirm the subcellular localization (right).
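The golden-dataset criterion, agreement of at least two independent sources, can be expressed as a simple filter. The sketch below is one possible reading, not the authors' procedure: the function name, the data layout and the protein identifiers are hypothetical, and a protein is kept as soon as two sources agree, even if a third disagrees.

```python
from collections import Counter


def golden_annotations(annotations, min_sources=2):
    """Keep proteins whose localization is reported by >= min_sources sources.

    `annotations` maps protein id -> {source name -> localization code}.
    Returns protein id -> agreed localization code.
    """
    golden = {}
    for protein, by_source in annotations.items():
        counts = Counter(by_source.values())
        if not counts:
            continue
        localization, support = counts.most_common(1)[0]
        if support >= min_sources:
            golden[protein] = localization
    return golden


# Hypothetical annotations from the three sources named on the poster.
toy = {
    "P1": {"Mass-Spec": "MIT", "SLHPA": "MIT"},
    "P2": {"SLHPA": "NUC", "UniProt": "CYT"},
    "P3": {"UniProt": "NUC"},
}
golden = golden_annotations(toy)
print(golden)  # {'P1': 'MIT'}
```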
[Figure: Venn diagrams of the Mass-Spec, SLHPA and UniProt datasets]
Overall performance in the golden dataset. ROC curves showing the performance of the tools benchmarked in the golden dataset over the entire range of sensitivity and specificity (left). Performance of each predictor in the golden dataset in terms of F1-score (right).
Results
Overall performance
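[Figure: bar chart of F1-score per predictor in the golden dataset. x-axis: F1-score (0.0 to 0.8). Predictors shown: MultiLoc2, SherLoc2, WoLF PSORT, Majority Vote, CELLO2.5, LocTree2, YLoc, SubCons. Values as extracted: 0.70, 0.70, 0.53, 0.66, 0.66, 0.70, 0.69, 0.79]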
Performance for different localizations
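[Figure: bar chart of SubCons F1-score per localization in the golden dataset. x-axis: F1-score (0.0 to 0.8). Localizations shown: PEX, CYT, GLG, MEM, ERE, LYS, NUC, MIT. Values as extracted: 0.85, 0.53, 0.85, 0.43, 0.67, 0.56, 0.67, 0.61]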
[Figure: ROC curves in the golden dataset. x-axis: Log10 False Positive Rate (1−Specificity), ticks at 0.001, 0.010, 0.100, 0.500. y-axis: True Positive Rate (Sensitivity), 0.0 to 1.0. Legend: CELLO2.5, LocTree2, Majority vote, MultiLoc2, SherLoc2, SubCons, WoLF PSORT, YLoc]
Performance for different localizations in the golden dataset. Performance for different localizations of each predictor in the golden dataset, in terms of F1-Score (left). Performance of SubCons in the golden dataset in terms of F1-Score for every single localization (right).
[Figure: F1-score of each predictor per localization in the golden dataset. x-axis: MIT, NUC, LYS, ERE, MEM, GLG, CYT, PEX. y-axis: F1-score (0.0 to 0.8). Legend: CELLO2.5, LocTree2, Majority Vote, MultiLoc2, SherLoc2, SubCons, WoLF PSORT, YLoc]
Salvatore Marco 1,2, Warholm Per 1,2, Basile Walter 1,2, Shu Nanjiang 1,2,3 and Elofsson Arne 1,2.
Stockholm University and Science for Life Laboratory Marco Salvatore, PhD student E-mail: [email protected] Website: http://bioinfo.se/members/marco-salvatore/
Venn diagram showing the three experimental datasets. The figure shows the three verified experimental datasets used to train and test SubCons: Mass-Spec (Localization of Organelle Proteins by Isotope Tagging (LOPIT)), Human Protein Atlas (SLHPA) and UniProt.
Train dataset Test (golden) dataset