
STITCH 4: integration of protein–chemical interactions with user data

Michael Kuhn1, Damian Szklarczyk2, Sune Pletscher-Frankild3, Thomas H. Blicher3, Christian von Mering2, Lars J. Jensen3 and Peer Bork4,5

1Biotechnology Center, TU Dresden, 01062 Dresden, Germany, 2Institute of Molecular Life Sciences, University of Zurich and Swiss Institute of Bioinformatics, Winterthurerstrasse 190, 8057 Zurich, Switzerland, 3Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, 2200 Copenhagen N, Denmark, 4European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany and 5Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany

Received September 30, 2013; Revised November 1, 2013; Accepted November 4, 2013

ABSTRACT

STITCH is a database of protein–chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein–chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web interface, an API and downloadable files.

To whom correspondence should be addressed. Tel: +49 6221 387 8526; Fax: +49 6221 387 8517; Email: bork@embl.de
Correspondence may also be addressed to Michael Kuhn. Tel: +49 351 463 40063; Fax: +49 351 463 40061; Email: michael.kuhn@biotec.tu-dresden.de
Correspondence may also be addressed to Lars J. Jensen. Tel: +45 353 25025; Fax: +45 353 25001; Email: lars.juhl.jensen@cpr.ku.dk

Published online 28 November 2013. Nucleic Acids Research, 2014, Vol. 42, Database issue, D401–D407. doi:10.1093/nar/gkt1207

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

Protein–chemical interactions are essential for any biological system; for example, they drive the metabolism of the cell or initiate many signaling cascades and most pharmaceutical interventions. A large collection of such interactions can therefore be used to study a variety of cellular functions and the impact of drug treatment on the cell. For such research, it is important to have data on protein–chemical interactions that are as complete as possible. By treating proteins and chemicals as nodes of a graph, which are linked by edges if they have been found to interact (1), we can adopt a network view that enables us to integrate many different sources. The concept of STITCH ('search tool for interacting chemicals') was from the beginning to combine sources of protein–chemical interactions from experimental databases, pathway databases, drug–target databases, text mining and drug–target predictions into a unified network (2–4). This network abstracts the complexity of the underlying data sources, making large-scale studies possible. At the same time, links to the original sources are retained, making it possible to trace the provenance of the data. The underlying STITCH database can be accessed in multiple ways: via an intuitive web interface, via download files (for large-scale analysis) and via an API (enabling automated access on a small to medium scale).

Here, we present recent improvements to the database and user interface of STITCH. Already in the previous versions, it has been possible to query STITCH using protein or chemical names, InChIKeys and SMILES strings. New in this version is the possibility to upload spreadsheets with chemical descriptors and experimental data that can be directly added to the network, as described later in the text. We also, for the first time, use the evidence transfer algorithm described for the STRING 9.1 database (5) to improve the performance for protein families.

Compared with STITCH 3, we use the same underlying set of proteins, containing 1133 species. We updated the set of chemicals (6) and find interactions with 390 000 distinct chemicals. In human, high-confidence interactions for 172 000 compounds are available in STITCH 4 (Figure 1), compared with 110 000 in STITCH 3 (4). In total, the human protein–chemical interaction network contains 2.2 million interactions (Figure 1). Applying different confidence thresholds, 570 000 interactions are of medium confidence (score cutoff 0.5) and 367 000 interactions are of high confidence (cutoff 0.7).
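The effect of the confidence cutoffs can be illustrated with a minimal sketch. It assumes that each interaction is available as a (chemical, protein, score) record; the record layout, the identifiers and the scores below are invented for illustration and do not reflect the format of the STITCH download files.

```python
from typing import Iterable, NamedTuple, Tuple


class Interaction(NamedTuple):
    chemical: str
    protein: str
    score: float  # combined confidence score between 0 and 1


def count_at_threshold(interactions: Iterable[Interaction], cutoff: float) -> Tuple[int, int]:
    """Return (number of interactions, number of distinct chemicals) with score >= cutoff."""
    kept = [i for i in interactions if i.score >= cutoff]
    return len(kept), len({i.chemical for i in kept})


# Toy network; identifiers and scores are hypothetical.
toy_network = [
    Interaction("chem_A", "prot_1", 0.92),
    Interaction("chem_A", "prot_2", 0.55),
    Interaction("chem_B", "prot_1", 0.71),
    Interaction("chem_C", "prot_3", 0.20),
]

# Cutoffs for medium (0.5) and high (0.7) confidence as quoted in the text.
for label, cutoff in [("medium", 0.5), ("high", 0.7)]:
    n_edges, n_chemicals = count_at_threshold(toy_network, cutoff)
    print(f"{label} (>= {cutoff}): {n_edges} interactions, {n_chemicals} chemicals")
```

Counting distinct chemicals separately from interactions mirrors the two quantities reported for the human network: compounds with at least one interaction above the cutoff, and interactions above the cutoff.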

SOURCES OF INTERACTIONS

Protein–chemical interactions are presented in four different channels: experiments, databases, text mining and predicted interactions. We import the following sources of experimental information: ChEMBL [interactions with reported Ki or IC50 (7)], PDSP Ki Database (8), PDB (9) and, new to STITCH, data from two large-scale studies on kinase–ligand interactions (10,11). From the latter studies, we extracted 74 291 interactions between 229 compounds and 414 human kinases. We converted the reported residual kinase activities (10) and kinase affinities (11) to probabilistic scores, which gave rise to 14 187, 9431 and 5977 interactions of at least low, medium and high confidence, respectively. The second channel is made up of manually curated drug–target databases [DrugBank (12), GLIDA (13), Matador (14), TTD (15) and CTD (16)] and pathway databases [KEGG (17), NCI/Nature Pathway Interaction Database (18), Reactome (19) and BioCyc (20)].

PREDICTION OF INTERACTIONS

STITCH contains verified interactions (from the sources listed earlier in the text) and predicted interactions, based on text mining and other prediction methods. In the text-mining channels, interactions were extracted from the literature using both co-occurrence text mining and Natural Language Processing (21,22). For the first time for STITCH, we use data not only from MEDLINE abstracts and OMIM (23) but also from full-text articles freely available from PubMed Central or publishers' Web sites.

In previous versions, we used Medical Subject Headings (MeSH) terms in text mining and when importing external databases. These terms allowed us to expand concepts like 'alpha adrenergic receptors' to individual proteins. We used to map MeSH terms to proteins using a combination of automatic and manual approaches, which led to errors in some cases. Furthermore, the mapping was only valid for human proteins. We have therefore started to use terms from the Gene Ontology [GO terms (24)] to define groups of proteins. We excluded GO annotations based on mutant phenotypes (IMP) and electronic annotations (IEA). We then checked the coverage of GO annotations for all species in STITCH. We only mapped GO terms to proteins for species where at least 10% of the proteins have been annotated, namely, Drosophila melanogaster, Escherichia coli, Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Schizosaccharomyces pombe.
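The annotation filter and the coverage check described above could look roughly like the following sketch. The tuple layout of the annotations, the species and protein identifiers and the function name are assumptions made for illustration; only the excluded evidence codes (IMP, IEA) and the 10% coverage requirement come from the text.

```python
from collections import defaultdict

EXCLUDED_EVIDENCE = {"IMP", "IEA"}  # mutant phenotype and electronic annotations
MIN_COVERAGE = 0.10                 # fraction of a species' proteins that must be annotated


def species_with_go_mapping(annotations, proteome_sizes):
    """Return the species whose GO annotation coverage reaches MIN_COVERAGE.

    annotations: iterable of (species, protein, go_term, evidence_code) tuples
    proteome_sizes: dict mapping species to its total number of proteins
    """
    annotated = defaultdict(set)
    for species, protein, _go_term, evidence in annotations:
        if evidence not in EXCLUDED_EVIDENCE:
            annotated[species].add(protein)
    return {
        species
        for species, total in proteome_sizes.items()
        if total > 0 and len(annotated[species]) / total >= MIN_COVERAGE
    }


# Hypothetical toy input: "sp1" keeps one usable annotation, "sp2" has none.
toy_annotations = [
    ("sp1", "p1", "GO:term_x", "IDA"),
    ("sp1", "p2", "GO:term_x", "IMP"),
    ("sp2", "q1", "GO:term_x", "IEA"),
]
mapped = species_with_go_mapping(toy_annotations, {"sp1": 5, "sp2": 5})
print(mapped)
```

Coverage is counted per protein rather than per annotation, so a protein with several retained annotations contributes only once.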

Figure 1. Cumulative distribution of scores. For each confidence score threshold, the plot shows the number of chemicals (top) and protein–chemical interactions (bottom) that have at least this confidence score in the human protein–chemical network. For example, there are 172 000 chemicals with a high-confidence interaction (score at least 0.7). As there are many interactions with low confidence scores, we use a minimum score threshold of 0.15. Steps in the data correspond to large numbers of compounds that have the same maximum score in manually curated databases or the ChEMBL database (with different confidence levels).

As the coverage of synonyms is lower than for MeSH terms, we manually added additional synonyms to GO terms to increase the text-mining sensitivity. As one GO term corresponds to multiple proteins, the resulting confidence score for the individual protein–chemical interactions should be down-weighted compared with interactions that are directly associated with a single protein. We therefore determined a correction factor through benchmarking (as a function of the number of member proteins in the GO term). For each channel, we looked at the GO terms that are interacting with chemicals. We then checked if the member proteins that are part of the GO terms are, in turn, interacting with chemicals. For each of these chemicals, we determined the fraction of member proteins that are interacting. For example, if a drug was known to bind two of the three α2-adrenergic receptors, it was added as a data point (x = 3, y = 2/3) to the benchmark data. The data points were then fitted for each channel by the following function:

f xeth THORN frac14 x aeth THORNe

bx

For larger groups, the function approaches x⁻¹ (i.e. interacting with one protein is not predictive for the other proteins).
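The construction of the benchmark data points (x = group size, y = fraction of interacting members) can be sketched as follows. The data structures and names are hypothetical, and skipping chemicals without any interacting member protein is an assumption of this sketch; the fitting step itself is not shown.

```python
from fractions import Fraction


def benchmark_points(group_members, group_hits, protein_hits):
    """Build (x, y) benchmark points for one channel.

    group_members: dict mapping a protein group (e.g. a GO term) to its member proteins
    group_hits: iterable of (group, chemical) pairs found for the group as a whole
    protein_hits: set of (protein, chemical) pairs known for individual proteins
    """
    points = []
    for group, chemical in group_hits:
        members = group_members[group]
        n_interacting = sum((protein, chemical) in protein_hits for protein in members)
        if n_interacting:  # assumption: keep only chemicals with at least one member hit
            points.append((len(members), Fraction(n_interacting, len(members))))
    return points


# Worked example from the text: a drug binding two of three alpha2-adrenergic receptors.
members = {"alpha2_adrenergic_receptors": {"ADRA2A", "ADRA2B", "ADRA2C"}}
hits_on_group = [("alpha2_adrenergic_receptors", "drug_X")]
hits_on_proteins = {("ADRA2A", "drug_X"), ("ADRA2B", "drug_X")}

points = benchmark_points(members, hits_on_group, hits_on_proteins)
print(points)
```

Using exact fractions keeps the worked example readable; a real fit would convert the y values to floating-point numbers.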

In this version of STITCH, we introduced a fourth channel, namely, predicted protein–chemical interactions based on chemical structure. Countless articles on the prediction of drug–target interactions have been published in the last years [e.g. (25–27), reviewed in (28)]. In many cases, however, the actual predictions are not available. We therefore implemented a relatively simple and transparent prediction scheme based on Random Forests (29,30): for each target for which >100 binding partners are known from the ChEMBL database, we attempted to make a prediction. To avoid biases, we first excluded highly similar chemicals, enforcing a maximum Tanimoto similarity of 0.9 [using Algorithm 2 described by Hobohm et al. (31)] on 2D chemical fingerprints calculated with the Chemistry Development Kit (32,33). We then added random chemicals to the training set as non-binders (ten times as many as binders) and used the fingerprints as predictors for all compounds. Using 10-fold cross-validation, we assessed how predictive the model is (by calculating the Pearson correlation coefficient between the training data and the cross-validation results). We used the correlation as a correction factor to decrease the confidence score of the predicted interactions, which were predicted for all compounds occurring in the ChEMBL database. We repeated this procedure three times for each compound and used the median predicted score to decrease the effect of the random negative set. As interactions were predicted from the experimental channel, the prediction and experimental channels are not independent of each other. To compute the combined score (which is shown on the network), we therefore took the higher of the two scores instead of combining the scores in a Bayesian fashion, as is done for the other channels. In total, predictions were made for 767 proteins across 15 species. The median correlation between the training data and the cross-validation prediction was 0.90.
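The redundancy filter applied before training can be sketched as below. This is a minimal illustration that assumes fingerprints are given as sets of 'on' bit positions and that Algorithm 2 is the remove-until-done variant, in which the compound with the most too-similar neighbours is dropped repeatedly; the compound names and fingerprints are invented, and the actual pipeline computes its fingerprints with the Chemistry Development Kit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def remove_until_done(fingerprints, max_similarity=0.9):
    """Drop the compound with the most neighbours above max_similarity
    until no pair of remaining compounds exceeds the threshold."""
    neighbours = {name: set() for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if tanimoto(fingerprints[a], fingerprints[b]) > max_similarity:
            neighbours[a].add(b)
            neighbours[b].add(a)
    kept = set(fingerprints)
    while kept:
        # Sorting makes the choice deterministic when several compounds tie.
        worst = max(sorted(kept), key=lambda name: len(neighbours[name] & kept))
        if not neighbours[worst] & kept:
            break  # no remaining pair is above the threshold
        kept.remove(worst)
    return kept


# Hypothetical fingerprints: c2 is too similar to both c1 and c3; c4 is unrelated.
toy_fingerprints = {
    "c1": frozenset(range(0, 20)),
    "c2": frozenset(range(1, 21)),
    "c3": frozenset(range(2, 22)),
    "c4": frozenset(range(100, 110)),
}
representatives = remove_until_done(toy_fingerprints)
print(sorted(representatives))
```

Removing the most connected compound first tends to keep more representatives than picking compounds in input order, which is the usual motivation for this variant.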

Links between compounds were also extracted from the aforementioned sources, if possible (e.g. chemical reactions from pathway databases or co-mentioned chemicals from text mining). We also predicted shared mechanisms of action from MeSH pharmacological actions, from the Connectivity Map using the DIPS method (34), which tests for similar changes in gene expression on compound treatment, and from screening data from the Developmental Therapeutics Program, NCI/NIH (35). The latter screening data replace our previous analysis of the NCI60 panel. We considered only the 70 of 115 cell lines against which >10 000 compounds have been screened, and centered the negative logarithm of GI50 values with respect to both compounds and cell lines. For the 47 692 compounds in the data set, we calculated all-against-all covariance across cell lines and converted these to probabilistic scores. This resulted in 114 072, 24 889 and 6890 pairs of compounds of at least low, medium and high confidence, respectively.

To account for the fact that many interactions are determined in model species, we transfer interactions between species. Previously, the sequence similarity between two proteins was used to determine the confidence in the transferred score. This had the disadvantage that when transferring evidence from a selective binder (e.g. inhibiting only one subtype of a receptor), all subtypes of the receptor in the target species would receive a similar score. In the new scheme, only the orthologous protein receives the evidence from the specific compound.
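For the screening data discussed earlier in this section, the centering and covariance steps can be sketched as follows. The sketch assumes a complete compounds-by-cell-lines matrix without missing values and uses invented GI50 values; the conversion of covariances to probabilistic scores is not specified in the text and is therefore left out.

```python
import math


def double_center(matrix):
    """Center a complete compounds x cell-lines matrix with respect to rows and columns."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    grand_mean = sum(row_means) / n_rows
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n_cols)]
        for i in range(n_rows)
    ]


def covariance(u, v):
    """Sample covariance of two equally long response profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)


# Hypothetical GI50 values (molar) for three compounds across three cell lines.
gi50 = {
    "cmpd_A": [1e-6, 1e-8, 1e-7],
    "cmpd_B": [1e-5, 1e-7, 1e-6],
    "cmpd_C": [1e-7, 1e-6, 1e-8],
}
names = list(gi50)
neg_log = [[-math.log10(value) for value in gi50[name]] for name in names]
centered = dict(zip(names, double_center(neg_log)))

cov_ab = covariance(centered["cmpd_A"], centered["cmpd_B"])
cov_ac = covariance(centered["cmpd_A"], centered["cmpd_C"])
print(round(cov_ab, 3), round(cov_ac, 3))
```

In this toy data, cmpd_B is uniformly ten times less potent than cmpd_A, so after centering the two profiles coincide and their covariance is positive, whereas cmpd_C shows the opposite selectivity pattern and a negative covariance with cmpd_A.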

INTEGRATION WITH USER DATA

Users can now upload a spreadsheet (e.g. in Microsoft Excel format) with experimental data to STITCH using the 'batch import' functionality (Figure 2). For each compound, the spreadsheet may contain the name of the compound, the chemical structure (as SMILES string, InChI or InChIKey), an internal identifier and a readout value. STITCH uses the name and chemical structure to find the compound in the STITCH database. The name provided by the user can then be shown in the interaction network, and the downloadable files contain both the name and the user's internal identifier (if provided). The readout value may be a numerical value, e.g. the activity of a compound in a screen. The user can then select a palette from the ColorBrewer2 color schemes (36). The palette is used to convert the numerical value into a color, which is then used to highlight the compounds in the network with a colored halo (Figure 3). It is also possible to directly specify colors (in standard hexadecimal notation).
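How a numerical readout might be turned into a halo color can be sketched as follows. The linear binning, the function name and the five-step red-to-blue palette are assumptions made for illustration; the text does not state how STITCH maps values onto the selected palette.

```python
def value_to_color(value, vmin, vmax, palette):
    """Map a numeric readout onto one of the palette's hex colors by linear binning."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp readouts outside [vmin, vmax]
    index = min(int(fraction * len(palette)), len(palette) - 1)
    return palette[index]


# Illustrative red-to-blue palette in hexadecimal notation (values chosen for the example).
RED_BLUE = ["#ca0020", "#f4a582", "#f7f7f7", "#92c5de", "#0571b0"]

# Hypothetical screening readouts scaled to the range 0..1.
readouts = {"compound 1": 0.05, "compound 2": 0.5, "compound 3": 0.97}
halo_colors = {name: value_to_color(v, 0.0, 1.0, RED_BLUE) for name, v in readouts.items()}
print(halo_colors)
```

Clamping keeps outliers at the ends of the gradient instead of raising an error, which seems a reasonable default for screening data with a few extreme values.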

USE CASES

The majority of users access STITCH via the web interface, where networks can be retrieved using single or multiple names of proteins or chemicals. Furthermore, users can query STITCH with protein sequences and chemical structures (in the form of SMILES strings). The networks can then be explored interactively or saved in different formats, including publication-quality images. Proteins and chemicals can be clustered in the interactive network viewer, and enriched GO terms among the proteins can be computed (5,37). The set of all interactions is also available for download under Creative Commons licenses (with separate commercial licensing for a subset). In this way, STITCH can be used to drive large-scale studies. Many research groups have already used STITCH 3 in this way; a few examples illustrating different utilities follow. STITCH has been used to determine which proteins cause side effects during drug treatment (38,39), by combining the STITCH network with data from a side effect database (40). The database has also been instrumental for the identification of druggable proteins to predict polypharmacological treatment of diseases on the basis of network topology features (41). For a method that predicts drug targets based on chemogenetic assays in yeast, STITCH has been chosen as a benchmark set (42). Lastly, STITCH has also been integrated into other tools, for example, ResponseNet2.0 and QuantMap (43,44).

Figure 2. Data upload. The user can use the batch import form to upload a spreadsheet, e.g. from Microsoft Excel (a). STITCH will then show the first five rows of the spreadsheet and ask the user to identify columns that contain the name, chemical structure or a numerical readout (b). Selected columns are highlighted in green. STITCH uses heuristics to suggest which kind of information the columns contain, e.g. by identifying SMILES strings as structural descriptors.

ACKNOWLEDGEMENTS

The authors wish to thank Yan P. Yuan (EMBL) for his outstanding support with the STITCH servers.

FUNDING

Deutsche Forschungsgemeinschaft [DFG KU 27962-1 to M.K.]; Novo Nordisk Foundation Center for Protein Research. Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

Figure 3. User data and the STITCH network. For four compounds that are part of the example data set from Figure 2, interacting proteins are shown. The numerical readout has been converted to a color on a red–blue gradient. Instead of the normal chemical names used by STITCH, the full names provided in the data set are used, enabling the user to easily recognize the studied chemicals.

REFERENCES

1. Barabasi,A.L. and Oltvai,Z.N. (2004) Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 101–113.

2. Kuhn,M., von Mering,C., Campillos,M., Jensen,L.J. and Bork,P. (2008) STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res., 36, D684–D688.

3. Kuhn,M., Szklarczyk,D., Franceschini,A., Campillos,M., von Mering,C., Jensen,L.J., Beyer,A. and Bork,P. (2010) STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res., 38, D552–D556.

4. Kuhn,M., Szklarczyk,D., Franceschini,A., von Mering,C., Jensen,L.J. and Bork,P. (2012) STITCH 3: zooming in on protein-chemical interactions. Nucleic Acids Res., 40, D876–D880.

5. Franceschini,A., Szklarczyk,D., Frankild,S., Kuhn,M., Simonovic,M., Roth,A., Lin,J., Minguez,P., Bork,P., von Mering,C. et al. (2013) STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res., 41, D808–D815.

6. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J. and Bryant,S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633.

7. Gaulton,A., Bellis,L.J., Bento,A.P., Chambers,J., Davies,M., Hersey,A., Light,Y., McGlinchey,S., Michalovich,D., Al-Lazikani,B. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107.

8. Roth,B.L., Lopez,E., Patel,S. and Kroeze,W. (2000) The multiplicity of serotonin receptors: uselessly diverse molecules or an embarrassment of riches? Neuroscientist, 6, 252–262.

9. Rose,P.W., Beran,B., Bi,C., Bluhm,W.F., Dimitropoulos,D., Goodsell,D.S., Prlic,A., Quesada,M., Quinn,G.B., Westbrook,J.D. et al. (2011) The RCSB protein data bank: redesigned web site and web services. Nucleic Acids Res., 39, D392–D401.

10. Anastassiadis,T., Deacon,S.W., Devarajan,K., Ma,H. and Peterson,J.R. (2011) Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nat. Biotechnol., 29, 1039–1045.

11. Davis,M.I., Hunt,J.P., Herrgard,S., Ciceri,P., Wodicka,L.M., Pallares,G., Hocker,M., Treiber,D.K. and Zarrinkar,P.P. (2011) Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol., 29, 1046–1051.

12. Knox,C., Law,V., Jewison,T., Liu,P., Ly,S., Frolkis,A., Pon,A., Banco,K., Mak,C., Neveu,V. et al. (2011) DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res., 39, D1035–D1041.

13. Okuno,Y., Yang,J., Taneishi,K., Yabuuchi,H. and Tsujimoto,G. (2006) GLIDA: GPCR-ligand database for chemical genomic drug discovery. Nucleic Acids Res., 34, D673–D677.

14. Gunther,S., Kuhn,M., Dunkel,M., Campillos,M., Senger,C., Petsalaki,E., Ahmed,J., Urdiales,E.G., Gewiess,A., Jensen,L.J. et al. (2008) SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res., 36, D919–D922.

15. Zhu,F., Shi,Z., Qin,C., Tao,L., Liu,X., Xu,F., Zhang,L., Song,Y., Liu,X., Zhang,J. et al. (2012) Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res., 40, D1128–D1136.

16. Davis,A.P., Murphy,C.G., Johnson,R., Lay,J.M., Lennon-Hopkins,K., Saraceni-Richards,C., Sciaky,D., King,B.L., Rosenstein,M.C., Wiegers,T.C. et al. (2013) The comparative toxicogenomics database: update 2013. Nucleic Acids Res., 41, D1104–D1114.

17. Kanehisa,M., Goto,S., Sato,Y., Furumichi,M. and Tanabe,M. (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., 40, D109–D114.

18. Schaefer,C.F., Anthony,K., Krupa,S., Buchoff,J., Day,M., Hannay,T. and Buetow,K.H. (2009) PID: the pathway interaction database. Nucleic Acids Res., 37, D674–D679.

19. Croft,D., O'Kelly,G., Wu,G., Haw,R., Gillespie,M., Matthews,L., Caudy,M., Garapati,P., Gopinath,G., Jassal,B. et al. (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res., 39, D691–D697.

20. Caspi,R., Altman,T., Dale,J.M., Dreher,K., Fulcher,C.A., Gilham,F., Kaipa,P., Karthikeyan,A.S., Kothari,A., Krummenacker,M. et al. (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 38, D473–D479.

21. Saric,J., Jensen,L.J., Ouzounova,R., Rojas,I. and Bork,P. (2006) Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 22, 645–650.

22. Jensen,L.J., Saric,J. and Bork,P. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet., 7, 119–129.

23. Amberger,J., Bocchini,C.A., Scott,A.F. and Hamosh,A. (2009) McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res., 37, D793–D796.

24. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet., 25, 25–29.

25. Lounkine,E., Keiser,M.J., Whitebread,S., Mikhailov,D., Hamon,J., Jenkins,J.L., Lavan,P., Weber,E., Doak,A.K., Cote,S. et al. (2012) Large-scale prediction and testing of drug activity on side-effect targets. Nature, 486, 361–367.

26. Besnard,J., Ruda,G.F., Setola,V., Abecassis,K., Rodriguiz,R.M., Huang,X.P., Norval,S., Sassano,M.F., Shin,A.I., Webster,L.A. et al. (2012) Automated design of ligands to polypharmacological profiles. Nature, 492, 215–220.

27. Paolini,G.V., Shapland,R.H., van Hoorn,W.P., Mason,J.S. and Hopkins,A.L. (2006) Global mapping of pharmacological space. Nat. Biotechnol., 24, 805–815.

28. Rognan,D. (2013) Towards the next generation of computational chemogenomics tools. Mol. Inf., 32, 1029–1034.

29. Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.

30. Chen,B., Sheridan,R.P., Hornak,V. and Voigt,J.H. (2012) Comparison of random forest and pipeline pilot naïve bayes in prospective QSAR predictions. J. Chem. Inf. Model., 52, 792–803.

31. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.

32. Steinbeck,C., Hoppe,C., Kuhn,S., Floris,M., Guha,R. and Willighagen,E.L. (2006) Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr. Pharm. Des., 12, 2111–2120.

33. Steinbeck,C., Han,Y., Kuhn,S., Horlacher,O., Luttmann,E. and Willighagen,E. (2003) The Chemistry Development Kit (CDK): an open-source java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43, 493–500.

34. Iskar,M., Campillos,M., Kuhn,M., Jensen,L.J., van Noort,V. and Bork,P. (2010) Drug-induced regulation of target expression. PLoS Comput. Biol., 6, e1000925.

35. Grever,M.R., Schepartz,S.A. and Chabner,B.A. (1992) The National Cancer Institute: cancer drug discovery and development program. Semin. Oncol., 19, 622–638.

36. Harrower,M. and Brewer,C.A. (2003) ColorBrewer.org: an online tool for selecting colour schemes for maps. Cartogr. J., 40, 27–37.

37. Szklarczyk,D., Franceschini,A., Kuhn,M., Simonovic,M., Roth,A., Minguez,P., Doerks,T., Stark,M., Muller,J., Bork,P. et al. (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res., 39, D561–D568.

38. Duran-Frigola,M. and Aloy,P. (2013) Analysis of chemical and biological features yields mechanistic insights into drug side effects. Chem. Biol., 20, 594–603.

39. Kuhn,M., Al Banchaabouchi,M., Campillos,M., Jensen,L.J., Gross,C., Gavin,A.C. and Bork,P. (2013) Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol., 9, 663.

40. Kuhn,M., Campillos,M., Letunic,I., Jensen,L.J. and Bork,P. (2010) A side effect resource to capture phenotypic effects of drugs. Mol. Syst. Biol., 6, 343.

41. Vitali,F., Mulas,F., Marini,P. and Bellazzi,R. (2013) Network-based target ranking for polypharmacological therapies. J. Biomed. Inf., 46, 876–881.

42. Heiskanen,M.A. and Aittokallio,T. (2013) Predicting drug-target interactions through integrative analysis of chemogenetic assays in yeast. Mol. Biosyst., 9, 768–779.

43. Basha,O., Tirman,S., Eluk,A. and Yeger-Lotem,E. (2013) ResponseNet2.0: revealing signaling and regulatory pathways connecting your proteins and genes – now with human data. Nucleic Acids Res., 41, W198–W203.

44. Schaal,W., Hammerling,U., Gustafsson,M.G. and Spjuth,O. (2013) Automated QuantMap for rapid quantitative molecular network topology analysis. Bioinformatics, 29, 2369–2370.


