RESEARCH ARTICLE SUMMARY

DISEASE NETWORKS

Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo, Albert-László Barabási*

INTRODUCTION: A disease is rarely a straightforward consequence of an abnormality in a single gene, but rather reflects the interplay of multiple molecular processes. The relationships among these processes are encoded in the interactome, a network that integrates all physical interactions within a cell, from protein-protein to regulatory protein-DNA and metabolic interactions. The documented propensity of disease-associated proteins to interact with each other suggests that they tend to cluster in the same neighborhood of the interactome, forming a disease module, a connected subgraph that contains all molecular determinants of a disease. The accurate identification of the corresponding disease module represents the first step toward a systematic understanding of the molecular mechanisms underlying a complex disease. Here, we present a network-based framework to identify the location of disease modules within the interactome and use the overlap between the modules to predict disease-disease relationships.

RATIONALE: Despite impressive advances in high-throughput interactome mapping and disease gene identification, both the interactome and our knowledge of disease-associated genes remain incomplete. This incompleteness prompts us to ask to what extent the current data are sufficient to map out the disease modules, the first step toward an integrated approach toward human disease. To make progress, we must formulate mathematically the impact of network incompleteness on the identifiability of disease modules, quantifying the predictive power and the limitations of the current interactome.

RESULTS: Using the tools of network science, we show that we can only uncover disease modules for diseases whose number of associated genes exceeds a critical threshold determined by the network incompleteness. We find that disease proteins associated with 226 diseases are clustered in the same network neighborhood, displaying a statistically significant tendency to form identifiable disease modules. The higher the degree of agglomeration of the disease proteins within the interactome, the higher the biological and functional similarity of the corresponding genes. These findings indicate that many local neighborhoods of the interactome represent the observable part of the true, larger and denser disease modules.

If two disease modules overlap, local perturbations causing one disease can disrupt pathways of the other disease module as well, resulting in shared clinical and pathobiological characteristics. To test this hypothesis, we measure the network-based separation of each disease pair, observing a direct relation between the pathobiological similarity of diseases and their relative distance in the interactome. We find that disease pairs with overlapping disease modules display significant molecular similarity, elevated coexpression of their associated genes, and similar symptoms and high comorbidity. At the same time, non-overlapping disease pairs lack any detectable pathobiological relationships. The proposed network-based distance allows us to predict the pathobiological relationship even for diseases that do not share genes.

CONCLUSION: Despite its incompleteness, the interactome has reached sufficient coverage to allow the systematic investigation of disease mechanisms and to help uncover the molecular origins of the pathobiological relationships between diseases. The introduced network-based framework can be extended to address numerous questions at the forefront of network medicine, from interpreting genome-wide association study data to drug target identification and repurposing.

Diseases within the interactome. The interactome collects all physical interactions between a cell's molecular components. Proteins associated with the same disease form connected subgraphs, called disease modules, shown for multiple sclerosis (MS), peroxisomal disorders (PD), and rheumatoid arthritis (RA). Disease pairs with overlapping modules (MS and RA) have some phenotypic similarities and high comorbidity. Non-overlapping diseases, like MS and PD, lack detectable clinical relationships.

The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: alb@neu.edu
Cite this article as J. Menche et al., Science 347, 1257601 (2015). DOI: 10.1126/science.1257601

Read the full article at http://dx.doi.org/10.1126/science.1257601

SCIENCE sciencemag.org 20 FEBRUARY 2015 VOL 347 ISSUE 6224 841

RESEARCH ARTICLE

DISEASE NETWORKS

Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche,1,2,3 Amitabh Sharma,1,2 Maksim Kitsak,1,2 Susan Dina Ghiassian,1,2 Marc Vidal,2,4 Joseph Loscalzo,5 Albert-László Barabási1,2,3,5*

According to the disease module hypothesis, the cellular components associated with a disease segregate in the same neighborhood of the human interactome, the map of biologically relevant molecular interactions. Yet, given the incompleteness of the interactome and the limited knowledge of disease-associated genes, it is not obvious if the available data have sufficient coverage to map out modules associated with each disease. Here we derive mathematical conditions for the identifiability of disease modules and show that the network-based location of each disease module determines its pathobiological relationship to other diseases. For example, diseases with overlapping network modules show significant coexpression patterns, symptom similarity, and comorbidity, whereas diseases residing in separated network neighborhoods are phenotypically distinct. These tools represent an interactome-based platform to predict molecular commonalities between phenotypically related diseases, even if they do not share primary disease genes.

Identifying sequence variations associated with specific phenotypes represents only the first step of a systematic program toward understanding human disease. Indeed, most phenotypes reflect the interplay of multiple molecular components that interact with each other (1-6), many of which do not carry disease-associated variations. Hence, we must view disease-associated mutations in the context of the human interactome, a comprehensive map of all biologically relevant molecular interactions (6-12).

Yet the predictive power of the current network-based approaches to human disease is limited by several conceptual and methodological issues. First, high-throughput methods cover less than 20% of all potential pairwise protein interactions in the human cell (11-16), which means that we seek to discover disease mechanisms relying on interactome maps that are 80% incomplete. Second, the genetic roots of a disease are traditionally captured by the list of disease genes, whose mutations have a causal effect on the respective phenotype. The disease proteins (the products of disease genes) are not scattered randomly in the interactome, but tend to interact with each other, forming one or several connected subgraphs that we call the disease module (Fig. 1A). This agglomeration of disease proteins is supported by a range of biological and empirical evidence (7, 17, 18) and has fueled the development of numerous tools to identify new disease genes and prioritize pathways for disease relevance (8, 9, 19-28). Despite its frequent use, however, the disease module hypothesis lacks a solid mathematical basis. Third, the relationships between distinct phenotypes are currently uncovered by identifying shared components, like disease genes, single-nucleotide polymorphisms (SNPs), pathways, or differentially expressed genes involved in both diseases. This has resulted in the construction of "disease networks," unveiling the common genetic origins of many disease pairs (7, 29). Yet shared genes offer only limited information about the relationship between two diseases. Indeed, mechanistic insights are often carried by the molecular networks through which the gene products associated with the two diseases interact with each other.

The fragmentation of disease modules

We started by compiling 141,296 physical interactions between 13,460 proteins experimentally documented in human cells, including protein-protein and regulatory interactions, metabolic pathway interactions, and kinase-substrate interactions [Fig. 1; see also figs. S1 and S2 and supplementary materials (SM) section 1 for a detailed discussion], representing a blueprint of the human interactome (Fig. 1D). We also compiled a corpus of all 299 diseases defined by the Medical Subject Headings (MeSH) ontology that have at least 20 associated genes in the current Online Mendelian Inheritance in Man (OMIM) and genome-wide association study (GWAS) databases (30, 31), involving 2,436 disease-associated proteins (Fig. 1, B and C, and SM section 1).

Despite the best curation efforts, both the interactome and the disease gene list remain incomplete (6, 11-16) and biased toward much-studied disease genes and disease mechanisms (32, 33). The consequences of this incompleteness are illustrated by multiple sclerosis: Of the 69 genes associated with the disease, only 11 disease proteins form a connected subgraph (observable module, Fig. 1D); the remaining 58 proteins appear to be distributed randomly in the interactome. This pattern holds for all 299 diseases, their observable modules comprising on average only 20% of the respective disease genes (Fig. 1C). Several factors contribute to this fragmentation (Fig. 1A), the main one being data incompleteness: Missing links leave many disease proteins isolated from their disease module (Fig. 1A).

In percolation theory, if only a fraction $p$ of links is available, a connected subgraph (disease module) of $m$ nodes undergoes a phase transition under certain conditions (34, 35): If $p$ is above $p_c^m$, some fraction of nodes continue to form an observable module; if, however, $p$ is below $p_c^m$, the module becomes too fragmented to be observable (Fig. 1E; see also fig. S14 and SM section 6). To quantify this phenomenon, we calculated the minimum network coverage $p_c^m$ required to observe a disease module of original size $m$, finding that $p_c^m \sim 1/m$, valid for an arbitrary degree distribution of the underlying interactome. Figure 1F illustrates a signature of this phenomenon in the interactome: The observable disease module size $S_i$ versus the number of disease genes associated with each disease follows the predicted percolation transition (purple line). Hence, percolation theory predicts that for diseases with fewer than $N_c \approx 25$ genes, the module is too fragmented to be observable in the current interactome; only diseases with $N_d > N_c$ disease genes should have an observable disease module.

1 Center for Complex Networks Research and Department of Physics, Northeastern University, 110 Forsyth Street, 111 Dana Research Center, Boston, MA 02115, USA. 2 Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA. 3 Center for Network Science, Central European University, Nador u. 9, 1051 Budapest, Hungary. 4 Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA. 5 Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 75 Francis Street, Boston, MA 02115, USA.
*Corresponding author. E-mail: alb@neu.edu

To test whether the observed disease modules represent nonrandom disease gene aggregations, for each disease we compared the size $S_i$ of its observable module with the expected $S_i^{\mathrm{rand}}$ if the same number of disease proteins were placed randomly on the interactome. For example, for multiple sclerosis the observed $S_i = 11$ is significantly larger than the random expectation $S_i^{\mathrm{rand}} = 2 \pm 1$ (z score = 5.8; p value = $3.3 \times 10^{-9}$; Figs. 1D and 2A); hence, the observed multiple sclerosis module cannot be attributed to a random agglomeration of disease genes. We also determined for each disease protein the network-based distance $d_s$ to the closest other protein associated with the same disease. Again, for multiple sclerosis, $P(d_s)$ is shifted toward smaller $d_s$ compared to the random expectation $P^{\mathrm{rand}}(d_s)$ (p value = $2.6 \times 10^{-6}$; Fig. 2B), indicating that the disconnected disease proteins agglomerate in the neighborhood of the observable module. Altogether, disease genes associated with 226 of the 299 diseases show a statistically significant tendency to form disease modules based on both $S_i$ and $P(d_s)$ (fig. S4).
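The comparison between the observed module size and its random expectation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes that "placed randomly" means uniform sampling of node sets of equal size, and the ring network, the toy disease set, and the function names are all hypothetical.

```python
import random
from statistics import mean, pstdev


def largest_component_size(adj, nodes):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first search restricted to `nodes`
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def module_z_score(adj, disease_nodes, n_samples=1000, seed=0):
    """Compare the observed module size with uniformly random node sets of equal size."""
    rng = random.Random(seed)
    population = sorted(adj)
    k = len(set(disease_nodes))
    observed = largest_component_size(adj, disease_nodes)
    random_sizes = [
        largest_component_size(adj, rng.sample(population, k))
        for _ in range(n_samples)
    ]
    mu, sigma = mean(random_sizes), pstdev(random_sizes)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    return observed, mu, sigma, z


# Hypothetical toy network: a ring of 50 nodes (not interactome data).
ring = {i: [(i - 1) % 50, (i + 1) % 50] for i in range(50)}
toy_disease = [0, 1, 2, 3, 4, 20, 35]  # five adjacent nodes plus two isolates
observed, mu, sigma, z = module_z_score(ring, toy_disease)
print(f"S = {observed}, S_rand = {mu:.2f} +/- {sigma:.2f}, z = {z:.1f}")
```

A degree-preserving randomization could be swapped in for `rng.sample` if hub bias is a concern; that choice is outside what the text above specifies.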

We also asked if there is a relationship between the tendency of disease proteins to agglomerate in the same interactome neighborhood and their biological similarity (7, 36, 37). We find that as the relative size $s_i \equiv S_i/N_i$ of the observable module increases from 0.1 to 0.8, a sign of increasing agglomeration of the disease genes, the significance of the biological similarity in Gene Ontology (GO) annotations (biological processes, molecular function, and cellular component) increases 10- to 100-fold (Fig. 2, C to E, and fig. S3, a to c), an exceptionally strong effect (see SM sect. 2 for statistical analysis). Similarly, as the mean shortest distance between disease proteins increases from 1 (agglomerated disease proteins) to 3 (scattered disease proteins), we observe a factor of 10 to 100 decrease in the significance of GO term similarity (Fig. 2, F to H, and fig. S3, d to f).

Fig. 1. From the human interactome to disease modules. (A) According to the disease module hypothesis, a disease represents a local perturbation of the underlying disease-associated subgraph. Such perturbations could represent the removal of a protein (e.g., by a nonsense mutation), the disruption of a protein-protein interaction, or modifications in the strength of an interaction. The complete disease module can be identified only in a full interactome map; the disease module observable to us captures a subset of this module, owing to data incompleteness. (B) Distribution of the number of disease-associated genes for 299 diseases. (C) Distribution of the fraction of disease genes within the observable disease module. (D) A small neighborhood of the interactome showing the biological nature of each physical interaction and the origin of the disease-gene associations used in our study (see also SM section 1). Genes associated with multiple sclerosis are shown in red, the shaded area indicating their observable module, a connected subgraph consisting of 11 proteins. (E) Schematic illustration of the predicted size of the observable disease modules (subgraphs) as a function of network completeness. Large modules should be observable even for low network coverage; to discover smaller modules, we need higher network completeness. (F) Size of the observable module as a function of the total number of disease genes. The purple curve corresponds to the percolation-based prediction (SM section 6), indicating that diseases with $N_d < N_c \approx 25$ genes do not have an observable disease module in the current interactome. Each gray point captures one of the 299 diseases.

Taken together, we find that genes associated with the same disease tend to agglomerate in the same neighborhood of the interactome. Indeed, although ~80% of the disease proteins are disconnected from the observable module, these isolates tend to be localized in its network vicinity. This result offers quantitative support to the hypothesis that many local neighborhoods of the interactome represent the observable parts of the true, larger and denser disease modules.
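The percolation argument used earlier in this section can be illustrated with a small simulation. The sketch below is only a toy under strong assumptions that are not taken from the paper: the "true" module is modeled as a fully connected graph of $m$ nodes, and each link is kept independently with probability $p$. For that special case the largest surviving component is known to grow rapidly once $p$ exceeds roughly $1/m$, which mirrors the scaling quoted in the text; the derivation for arbitrary degree distributions (SM section 6) is not reproduced here. The function name `largest_fraction` is hypothetical.

```python
import random
from itertools import combinations


def largest_fraction(m, p, rng):
    """Keep each link of a complete graph on m nodes with probability p and
    return the size of the largest connected component divided by m."""
    parent = list(range(m))
    size = [1] * m

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(range(m), 2):
        if rng.random() < p:  # this link survives in the incomplete map
            ru, rv = find(u), find(v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru
                size[ru] += size[rv]
    return max(size[find(x)] for x in range(m)) / m


# Sweep the coverage p in units of 1/m for a toy module of 20 nodes.
rng = random.Random(1)
m, trials = 20, 200
for factor in (0.5, 1, 2, 4):
    p = min(1.0, factor / m)
    avg = sum(largest_fraction(m, p, rng) for _ in range(trials)) / trials
    print(f"p = {factor}/m: mean largest-component fraction = {avg:.2f}")
```

Running the sweep for several values of `m` shows the transition moving to lower coverage as the module grows, which is the qualitative point of Fig. 1E.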

Relationship between diseases

If two disease modules overlap, local perturbations leading to one disease will likely disrupt pathways involved in the other disease module as well, resulting in shared clinical characteristics. To test the validity of this hypothesis, we introduce the network-based separation of a disease pair, A and B (Fig. 3A; see also figs. S5 to S7), using

$$s_{AB} \equiv \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \tag{1}$$

$s_{AB}$ compares the shortest distances between proteins within each disease, $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$, to the shortest distances $\langle d_{AB} \rangle$ between A-B protein pairs. Proteins associated with both A and B have $d_{AB} = 0$. As discussed in SM section 3.3, the generalization of $s_{AB}$ to account for directed regulatory and signaling interactions does not alter our subsequent findings (fig. S8).

We find that only 7% of disease pairs have overlapping disease neighborhoods with negative $s_{AB}$ (Fig. 3B); the remaining 93% have a positive $s_{AB}$, indicating that their disease modules are topologically separated (Fig. 3C). Because we lack unambiguous true positive and true negative disease relationships that could be used as a reference, we use two complementary null models to evaluate the statistical significance of each disease pair compared to random expectation (see SM section 2.2). At a global false discovery level of 5%, we find that 75% of all disease pairs exhibit significant $s_{AB}$. To determine the degree to which this network-based separation of two diseases is predictive for pathobiological manifestations, we rely on four data sets:

1) Biological similarity: We find that the closer two diseases are in the interactome, the higher the GO annotation-based similarity of the proteins associated with them (Fig. 3, D to F). The effect is strong, resulting in a two-order-of-magnitude decrease in GO term similarity as we move from highly overlapping ($s_{AB} \approx -2$) to well-separated disease pairs ($s_{AB} > 0$).


Fig. 2. Topological localization and biological similarity of disease genes. (A) The size of the largest connected component $S$ of proteins associated with the same disease, shown for multiple sclerosis. The observed module size, $S = 11$, is significantly larger than the random expectation $S^{\mathrm{rand}} = 2 \pm 1$. (B) The distribution of the shortest distance of each disease protein to the next closest disease protein, $d_s$. For multiple sclerosis, $P(d_s)$ is significantly shifted compared to the random expectation, indicating that disease genes tend to agglomerate in each other's network neighborhood. (C to H) The degree of the network-based localization of a disease, as measured by the relative size of its observable module $s_i = S_i/N_d$ and the mean shortest distance $\langle d_s \rangle$, correlates strongly with the significance of the biological similarity of the respective disease genes. Using the GO annotations, we determine for each disease how similar its associated genes are in terms of their biological processes (C and F), molecular function (D and G), and cellular component (E and H). Comparing the resulting values with random expectation, we find that the more localized a disease is topologically (i.e., the larger $s_i$ or the shorter $\langle d_s \rangle$), the higher the significance in the similarity of the associated genes.


Fig. 3. Network separation and disease similarity. (A) A subnetwork of the full interactome highlighting the network-based relationship between disease genes associated with three diseases identified in the legend. (B and C) Distance distributions for disease pairs that have topologically overlapping modules [$s_{AB} < 0$, (B)] or topologically separated modules [$s_{AB} > 0$, (C)]. The plots show $P(d)$ for the disease pairs shown in (A). (D to I) Topological separation versus biomedical similarity. (D to F) GO term similarity, (G) gene coexpression, and (H) symptom similarity for all disease pairs as a function of their topological separation $s_{AB}$. The region of overlapping disease pairs is highlighted in red ($s_{AB} < 0$); the region of the separated disease pairs is shown in blue ($s_{AB} > 0$). For symptom similarity, we show the cosine similarity ($c_{AB} = 0$ if there are no shared symptoms between diseases A and B, and $c_{AB} = 1$ for diseases with identical symptoms). Comorbidity in (I) is measured by the relative risk RR (40). Bars in (D) to (I) indicate random expectation (SM section 1); in (D) to (G), the expected value for a randomly chosen protein pair is shown. In (H) and (I), the mean value of all disease pairs is used. (J to M) The interplay between gene-set overlap and the network-based relationships between disease pairs. (J) The relationship between gene sets A and B is captured by the overlap coefficient $C = |A \cap B| / \min(|A|, |B|)$ and the Jaccard index $J = |A \cap B| / |A \cup B|$. More than half (59%) of the disease pairs do not share genes ($J = C = 0$); hence, their relation cannot be uncovered based on shared genes. (K) Distribution of $s_{AB}$ for disease pairs with no gene overlap. We find that despite having disjoint gene sets, 717 disease pairs have overlapping modules ($s_{AB} < 0$). (L) The distribution of $s_{AB}$ for disease pairs with complete gene overlap ($C = 1$) shows a broad range of network-based relationships, including non-overlapping modules ($s_{AB} > 0$). (M) Fold change of the number of shared genes compared to random expectation versus $s_{AB}$ for all disease pairs. The 59% of all disease pairs without shared genes are highlighted with red background. For 98% of all disease pairs that share at least one gene, the gene-based overlap is larger than expected by chance. Nevertheless, most (87%) of these disease pairs are separated in the network ($s_{AB} > 0$). Conversely, a considerable number of pairs (717) without shared genes exhibit detectable network overlap ($s_{AB} < 0$).
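The set-based and vector-based similarity measures named in the Fig. 3 caption are simple enough to state as code. The sketch below follows the formulas for $C$, $J$, and the cosine similarity $c_{AB}$ as written in the caption; the gene labels, symptom names, and weights are hypothetical placeholders, and representing a symptom profile as a dictionary of weights is an assumption made for illustration.

```python
from math import sqrt


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|); defined here as 0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B|; defined here as 0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(x, y):
    """Cosine similarity of two symptom profiles given as {symptom: weight} dicts."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    norm_x = sqrt(sum(w * w for w in x.values()))
    norm_y = sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Hypothetical gene sets and symptom profiles for two diseases.
genes_a = {"G1", "G2", "G3"}
genes_b = {"G2", "G3", "G4", "G5"}
symptoms_a = {"fatigue": 1.0, "fever": 2.0}
symptoms_b = {"rash": 3.0}

print(overlap_coefficient(genes_a, genes_b))  # 2 shared genes out of min(3, 4)
print(jaccard_index(genes_a, genes_b))        # 2 shared genes out of 5 in the union
print(cosine_similarity(symptoms_a, symptoms_b))
```

Note that $C = 1$ whenever one gene set is contained in the other, which is why panel (L) can show complete gene overlap alongside a wide range of $s_{AB}$ values.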


2) Coexpression: We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for $s_{AB} > 0$.

3) Disease symptoms: We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping ($s_{AB} < 0$) to separated ($s_{AB} > 0$) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity: We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR $\geq 10$ for $s_{AB} < 0$ to the random expectation of RR $\approx 1$ for $s_{AB} > 0$.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter $\langle d_{AA} \rangle$ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated ($s_{AB} > 0$), then the diseases are pathobiologically distinct. If the disease modules topologically overlap ($s_{AB} < 0$), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter $\langle d_{AA} \rangle$ in a three-dimensional (3D) disease space, such that the physical distance $r_{AB}$ between diseases A and B correlates with the observed network-based distance $\langle d_{AB} \rangle$ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with $s_{AB} < 0$ into the "overlapping" disease category and those with $s_{AB} > 0$ into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods ($s_{AB} < 0$; Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 6.18) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = $5 \times 10^{-15}$; Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that, indeed, disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that $s_{AB}$ continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules ($s_{AB} < 0$; Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules ($s_{AB} = -0.24$), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-κB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 2.1] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 2.43, 6.3, and 2.0, respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that, despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity.

Throughout this paper, we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h; SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases.

We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases. For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14f). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_c^m of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature. It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II

SCIENCE sciencemag.org 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 1257601-5

RESEARCH | RESEARCH ARTICLE

Downloaded from http://science.sciencemag.org/ on August 31, 2020


Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and nonoverlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.


diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10^−8. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh), as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3,173 associated genes.

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49); (ii) tissue-specific gene expression data (36); (iii) symptom disease associations (38); (iv) comorbidity data (39); and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins; and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
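As an illustration of the separation measure, the following minimal sketch computes s_AB on a toy network. It is not the source code distributed with the supplementary materials. The function names, the toy path graph, and the convention of pooling A-to-B and B-to-A closest distances when averaging ⟨d_AB⟩ are assumptions made for this example.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency map {node: set(neighbors)} from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def hop_distances(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closest_distances(graph, sources, targets, exclude_self):
    """For each source protein, the distance to its closest target protein."""
    found = []
    for s in sources:
        dist = hop_distances(graph, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s)]
        if reachable:  # assumption: proteins with no reachable partner are skipped
            found.append(min(reachable))
    return found


def mean(values):
    return sum(values) / len(values)


def separation(graph, set_a, set_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2."""
    d_aa = closest_distances(graph, set_a, set_a, exclude_self=True)
    d_bb = closest_distances(graph, set_b, set_b, exclude_self=True)
    # Assumption: <d_AB> pools A-to-B and B-to-A closest distances;
    # a protein that belongs to both sets contributes a distance of 0.
    d_ab = (closest_distances(graph, set_a, set_b, exclude_self=False)
            + closest_distances(graph, set_b, set_a, exclude_self=False))
    return mean(d_ab) - (mean(d_aa) + mean(d_bb)) / 2


# Hypothetical toy path network a-b-c-d-e-f (not interactome data).
toy = build_graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")])
s_far = separation(toy, {"a", "b"}, {"e", "f"})             # separated modules
s_near = separation(toy, {"a", "b", "c"}, {"b", "c", "d"})  # overlapping modules
```

In this toy case the two gene sets at opposite ends of the path give a positive s_AB, whereas the two sets that share proteins give a negative one, in line with the interpretation of the sign given above.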

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. The values of both measures lie in the range [0, 1], with J, C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
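The two overlap measures and a hypergeometric tail probability can be sketched as follows. The function names and the toy gene sets are hypothetical, and the one-sided form of the test (probability of an overlap at least as large as the observed one) is an assumption; the exact procedure is described in SM section 3.

```python
from math import comb


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def hypergeom_pvalue(a, b, n_total):
    """P(overlap >= observed) when |B| genes are drawn at random,
    without replacement, from n_total genes of which |A| belong to A."""
    k, n_a, n_b = len(a & b), len(a), len(b)
    tail = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return tail / comb(n_total, n_b)


# Hypothetical gene sets drawn from a universe of N = 10 genes.
genes_a = {"g1", "g2", "g3", "g4"}
genes_b = {"g3", "g4", "g5"}
c_ab = overlap_coefficient(genes_a, genes_b)
j_ab = jaccard_index(genes_a, genes_b)
p_ab = hypergeom_pvalue(genes_a, genes_b, n_total=10)
```

Exact integer binomials from `math.comb` keep the sketch free of external dependencies; for gene universes of realistic size, a library implementation of the hypergeometric survival function would be the more practical choice.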

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).
2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479
3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703
4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096
5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289
6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525
7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601
8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749
9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530
10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530
11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349
12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956
13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767
14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904
15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861
16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504
17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137
18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631
19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992
20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651
21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930
22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982
23. S. E. Baranzini et al., Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090 (2009). doi: 10.1093/hmg/ddp120; pmid: 19286671
24. C. E. Wheelock et al., Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst. 5, 588–602 (2009). doi: 10.1039/b902356a; pmid: 19462016
25. A. S. Khalil, J. J. Collins, Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379 (2010). doi: 10.1038/nrg2775; pmid: 20395970
26. S. Wuchty et al., Gene pathways and subnetworks distinguish between major glioma subtypes and elucidate potential underlying biology. J. Biomed. Inform. 43, 945–952 (2010). doi: 10.1016/j.jbi.2010.08.011; pmid: 20828632
27. I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, E. M. Marcotte, Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). doi: 10.1101/gr.118992.110; pmid: 21536720
28. U. M. Singh-Blom et al., Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLOS ONE 8, e58977 (2013). doi: 10.1371/journal.pone.0058977; pmid: 23650495
29. A. Rzhetsky, D. Wajngurt, N. Park, T. Zheng, Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U.S.A. 104, 11694–11699 (2007). doi: 10.1073/pnas.0704820104; pmid: 17609372
30. A. Mottaz, Y. L. Yip, P. Ruch, A.-L. Veuthey, Mapping proteins to disease terminologies: From UniProt to MeSH. BMC Bioinformatics 9 (suppl. 5), S3 (2008). doi: 10.1186/1471-2105-9-S5-S3; pmid: 18460185
31. E. M. Ramos et al., Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet. 22, 144–147 (2014). doi: 10.1038/ejhg.2013.96; pmid: 23695286
32. L. Hakes, J. W. Pinney, D. L. Robertson, S. C. Lovell, Protein-protein interaction networks and biology - What's the connection? Nat. Biotechnol. 26, 69–72 (2008). doi: 10.1038/nbt0108-69
33. M. E. Cusick et al., Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009). doi: 10.1038/nmeth.1284; pmid: 19116613
34. R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function (Cambridge Univ. Press, Cambridge, 2010).
35. S. Bornholdt, H. G. Schuster, Eds., Handbook of Graphs and Networks (Wiley Online Library, 2003), vol. 2.
36. A. I. Su et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 101, 6062–6067 (2004). doi: 10.1073/pnas.0400782101; pmid: 15075390
37. T. K. Gandhi et al., Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet. 38, 285–293 (2006). pmid: 16501559


38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666
39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091
40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140
41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390
42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A. Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032
43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714
44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789
45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753
46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871
47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385
48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251
49. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651
50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the y2h data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601


Uncovering disease-disease relationships through the incomplete interactome
Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules
Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.
Science, this issue 10.1126/science.1257601


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.

Copyright © 2015, American Association for the Advancement of Science



RESEARCH ARTICLE

DISEASE NETWORKS

Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche,1,2,3 Amitabh Sharma,1,2 Maksim Kitsak,1,2 Susan Dina Ghiassian,1,2 Marc Vidal,2,4 Joseph Loscalzo,5 Albert-László Barabási1,2,3,5*

According to the disease module hypothesis, the cellular components associated with a disease segregate in the same neighborhood of the human interactome, the map of biologically relevant molecular interactions. Yet, given the incompleteness of the interactome and the limited knowledge of disease-associated genes, it is not obvious if the available data have sufficient coverage to map out modules associated with each disease. Here we derive mathematical conditions for the identifiability of disease modules and show that the network-based location of each disease module determines its pathobiological relationship to other diseases. For example, diseases with overlapping network modules show significant coexpression patterns, symptom similarity, and comorbidity, whereas diseases residing in separated network neighborhoods are phenotypically distinct. These tools represent an interactome-based platform to predict molecular commonalities between phenotypically related diseases, even if they do not share primary disease genes.

Identifying sequence variations associated with specific phenotypes represents only the first step of a systematic program toward understanding human disease. Indeed, most phenotypes reflect the interplay of multiple molecular components that interact with each other (1–6), many of which do not carry disease-associated variations. Hence, we must view disease-associated mutations in the context of the human interactome, a comprehensive map of all biologically relevant molecular interactions (6–12).

Yet the predictive power of the current network-based approaches to human disease is limited by several conceptual and methodological issues. First, high-throughput methods cover less than 20% of all potential pairwise protein interactions in the human cell (11–16), which means that we seek to discover disease mechanisms relying on interactome maps that are 80% incomplete. Second, the genetic roots of a disease are traditionally captured by the list of disease genes, whose mutations have a causal effect on the respective phenotype. The disease proteins (the products of disease genes) are not scattered randomly in the interactome, but tend to interact with each other, forming one or several connected subgraphs that we call the disease module (Fig. 1A). This agglomeration of disease proteins is supported by a range of biological and empirical evidence (7, 17, 18) and has fueled the development of numerous tools to identify new disease genes and prioritize pathways for disease relevance (8, 9, 19–28). Despite its frequent use, however, the disease module hypothesis lacks a solid mathematical basis. Third, the relationships between distinct phenotypes are currently uncovered by identifying shared components, like disease genes, single-nucleotide polymorphisms (SNPs), pathways, or differentially expressed genes involved in both diseases. This has resulted in the construction of "disease networks," unveiling the common genetic origins of many disease pairs (7, 29). Yet shared genes offer only limited information about the relationship between two diseases. Indeed, mechanistic insights are often carried by the molecular networks through which the gene products associated with the two diseases interact with each other.

The fragmentation of disease modules

We started by compiling 141,296 physical interactions between 13,460 proteins, experimentally documented in human cells, including protein-protein and regulatory interactions, metabolic pathway interactions, and kinase-substrate interactions [Fig. 1; see also figs. S1 and S2 and supplementary materials (SM) section 1 for a detailed discussion], representing a blueprint of the human interactome (Fig. 1D). We also compiled a corpus of all 299 diseases defined by the Medical Subject Headings (MeSH) ontology that have at least 20 associated genes in the current Online Mendelian Inheritance in Man (OMIM) and genome-wide association study (GWAS) databases (30, 31), involving 2,436 disease-associated proteins (Fig. 1, B and C, and SM section 1).

Despite the best curation efforts, both the interactome and the disease gene list remain incomplete (6, 11–16) and biased toward much-studied disease genes and disease mechanisms (32, 33). The consequences of this incompleteness are illustrated by multiple sclerosis: Of the 69 genes associated with the disease, only 11 disease proteins form a connected subgraph (observable module; Fig. 1D); the remaining 58 proteins appear to be distributed randomly in the interactome. This pattern holds for all 299 diseases, their observable modules comprising on average only 20% of the respective disease genes (Fig. 1C). Several factors contribute to this fragmentation (Fig. 1A), the main one being data incompleteness: Missing links leave many disease proteins isolated from their disease module (Fig. 1A).

In percolation theory, if only a p fraction of links is available, a connected subgraph (disease module) of m nodes undergoes a phase transition under certain conditions (34, 35). If p is above p_c^m, some fraction of nodes continue to form an observable module; if, however, p is below p_c^m, the module becomes too fragmented to be observable (Fig. 1E; see also fig. S14 and SM section 6). To quantify this phenomenon, we calculated the minimum network coverage p_c^m required to observe a disease module of original size m, finding that p_c^m ∼ 1/m, valid for an arbitrary degree distribution of the underlying interactome. Figure 1F illustrates a signature of this phenomenon in the interactome: The observable disease module size S_i versus the number of disease genes associated with each disease follows the predicted percolation transition (purple line). Hence, percolation theory predicts that for diseases with fewer than N_c ≈ 25 genes, the module is too fragmented to be observable in the current interactome; only diseases with N_d > N_c disease genes should have an observable disease module.

To test whether the observed disease modules represent nonrandom disease gene aggregations, for each disease we compared the size S_i of its observable module with the expected S_i^rand if the same number of disease proteins were placed randomly on the interactome. For example, for multiple sclerosis the observed S_i = 11 is significantly larger than the random expectation S_i^rand = 2 ± 1 (z score = 5.8; p value = 3.3 × 10^−9; Figs. 1D and 2A); hence, the observed multiple sclerosis module cannot be attributed to a random agglomeration of disease genes. We also determined for each disease protein the network-based distance d_s to the closest other protein associated with the same disease. Again, for multiple sclerosis, P(d_s) is shifted toward smaller d_s compared to the random expectation P^rand(d_s) (p value = 2.6 × 10^−6; Fig. 2B), indicating that the disconnected disease proteins agglomerate in the neighborhood of the observable module. Altogether, disease genes associated with 226


1Center for Complex Networks Research and Department of Physics, Northeastern University, 110 Forsyth Street, 111 Dana Research Center, Boston, MA 02115, USA. 2Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA. 3Center for Network Science, Central European University, Nador u. 9, 1051 Budapest, Hungary. 4Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA. 5Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 75 Francis Street, Boston, MA 02115, USA.
*Corresponding author. E-mail: [email protected]


of the 299 diseases show a statistically significant tendency to form disease modules based on both S_i and P(d_s) (fig. S4).
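The comparison between the observed module size and its random expectation can be sketched as follows. The ring-shaped toy network, the function names, and the choice of uniform random placement of proteins are assumptions made for illustration; the random controls actually used are described in SM section 2.

```python
import random
from statistics import mean, pstdev


def largest_component_size(graph, nodes):
    """Size of the largest connected subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nbr in graph.get(node, ()):
                if nbr in nodes and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best


def module_z_score(graph, disease_nodes, n_random=1000, seed=0):
    """Observed module size, mean random size, and z score.

    Assumption: the null model places the same number of proteins
    uniformly at random on the network.
    """
    rng = random.Random(seed)
    population = sorted(graph)
    k = len(set(disease_nodes))
    observed = largest_component_size(graph, disease_nodes)
    random_sizes = [largest_component_size(graph, rng.sample(population, k))
                    for _ in range(n_random)]
    mu, sigma = mean(random_sizes), pstdev(random_sizes)
    if sigma == 0:  # degenerate null distribution
        return observed, mu, (float("inf") if observed > mu else 0.0)
    return observed, mu, (observed - mu) / sigma


# Hypothetical toy network: a ring of 60 proteins, 5 adjacent ones "diseased".
ring = {i: {(i - 1) % 60, (i + 1) % 60} for i in range(60)}
observed, expected, z = module_z_score(ring, [0, 1, 2, 3, 4], n_random=200)
```

A fixed seed keeps the sketch reproducible. In practice the number of random draws would be chosen large enough for the estimated mean and standard deviation to stabilize.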

We also asked if there is a relationship between the tendency of disease proteins to agglomerate in the same interactome neighborhood and their biological similarity (7, 36, 37). We find that as the relative size s_i ≡ S_i/N_i of the observable module increases from 0.1 to 0.8, a sign


Fig. 1. From the human interactome to disease modules. (A) According to the disease module hypothesis, a disease represents a local perturbation of the underlying disease-associated subgraph. Such perturbations could represent the removal of a protein (e.g., by a nonsense mutation), the disruption of a protein-protein interaction, or modifications in the strength of an interaction. The complete disease module can be identified only in a full interactome map; the disease module observable to us captures a subset of this module, owing to data incompleteness. (B) Distribution of the number of disease-associated genes for 299 diseases. (C) Distribution of the fraction of disease genes within the observable disease module. (D) A small neighborhood of the interactome showing the biological nature of each physical interaction and the origin of the disease-gene associations used in our study (see also SM section 1). Genes associated with multiple sclerosis are shown in red, the shaded area indicating their observable module, a connected subgraph consisting of 11 proteins. (E) Schematic illustration of the predicted size of the observable disease modules (subgraphs) as a function of network completeness. Large modules should be observable even for low network coverage; to discover smaller modules, we need higher network completeness. (F) Size of the observable module as a function of the total number of disease genes. The purple curve corresponds to the percolation-based prediction (SM section 6), indicating that diseases with N_d < N_c ≈ 25 genes do not have an observable disease module in the current interactome. Each gray point captures one of the 299 diseases.
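The qualitative behavior sketched in panel E can be reproduced with a small link-removal simulation. This is only a toy illustration and not the calculation of SM section 6: the module is assumed to be fully connected before link removal, which real disease modules are not, and all function and parameter names are hypothetical.

```python
import random
from itertools import combinations


def largest_component_fraction(n_nodes, edges):
    """Fraction of nodes in the largest connected component (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    sizes = {}
    for node in range(n_nodes):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n_nodes


def observable_fraction(m, p, n_trials=200, seed=0):
    """Mean relative size of the largest surviving piece of a module of
    m nodes when each link is retained independently with probability p.

    Assumption: the complete module is a fully connected toy graph.
    """
    rng = random.Random(seed)
    all_links = list(combinations(range(m), 2))
    total = 0.0
    for _ in range(n_trials):
        kept = [link for link in all_links if rng.random() < p]
        total += largest_component_fraction(m, kept)
    return total / n_trials


# Scan the coverage p for a small and a larger hypothetical module.
for m in (10, 40):
    curve = [(p, observable_fraction(m, p, n_trials=50)) for p in (0.02, 0.1, 0.5)]
    print(m, curve)
```

Under this assumption the larger module retains a sizable connected piece at lower coverage than the smaller one, which is the qualitative content of the p_c^m ∼ 1/m scaling.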


of increasing agglomeration of the disease genes, the significance of the biological similarity in Gene Ontology (GO) annotations (biological processes, molecular function, and cellular component) increases 10- to 100-fold (Fig. 2, C to E, and fig. S3, a to c), an exceptionally strong effect (see SM section 2 for statistical analysis). Similarly, as the mean shortest distance between disease proteins increases from 1 (agglomerated disease proteins) to 3 (scattered disease proteins), we observe a factor of 10 to 100 decrease in the significance of GO term similarity (Fig. 2, F to H, and fig. S3, d to f).

Taken together, we find that genes associated with the same disease tend to agglomerate in the same neighborhood of the interactome. Indeed, although ~80% of the disease proteins are disconnected from the observable module, these isolates tend to be localized in its network vicinity. This result offers quantitative support to the hypothesis that many local neighborhoods of the interactome represent the observable parts of the true, larger and denser disease modules.

Relationship between diseases

If two disease modules overlap, local perturbations leading to one disease will likely disrupt pathways involved in the other disease module as well, resulting in shared clinical characteristics. To test the validity of this hypothesis, we introduce the network-based separation of a disease pair, A and B (Fig. 3A; see also figs. S5 to S7), using

$$ s_{AB} \equiv \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \qquad (1) $$

s_AB compares the shortest distances between proteins within each disease, ⟨d_AA⟩ and ⟨d_BB⟩, to the shortest distances ⟨d_AB⟩ between A-B protein pairs. Proteins associated with both A and B have d_AB = 0. As discussed in SM section 3.3, the generalization of s_AB to account for directed regulatory and signaling interactions does not alter our subsequent findings (fig. S8).

We find that only 7% of disease pairs have overlapping disease neighborhoods with negative s_AB (Fig. 3B); the remaining 93% have a positive s_AB, indicating that their disease modules are topologically separated (Fig. 3C). Because we lack unambiguous true positive and true negative disease relationships that could be used as a reference, we use two complementary null models to evaluate the statistical significance of each disease pair compared to random expectation (see SM section 2.2). At a global false discovery level of 5%, we find that 75% of all disease pairs exhibit significant s_AB. To determine the degree to which this network-based separation of two diseases is predictive for pathobiological manifestations, we rely on four data sets:

1) Biological similarity. We find that the closer two diseases are in the interactome, the higher the GO annotation–based similarity of the proteins associated with them (Fig. 3, D to F). The effect is strong, resulting in a two-order-of-magnitude decrease in GO term similarity as we move from highly overlapping (s_AB ≈ –2) to well-separated disease pairs (s_AB > 0).

SCIENCE sciencemag.org 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 1257601-3

Fig. 2. Topological localization and biological similarity of disease genes. (A) The size of the largest connected component S of proteins associated with the same disease, shown for multiple sclerosis. The observed module size, S = 11, is significantly larger than the random expectation, S_rand = 2 ± 1. (B) The distribution of the shortest distance of each disease protein to the next closest disease protein, d_s. For multiple sclerosis, P(d_s) is significantly shifted compared to the random expectation, indicating that disease genes tend to agglomerate in each other's network neighborhood. (C to H) The degree of the network-based localization of a disease, as measured by the relative size of its observable module, s_i = S_i/N_d, and the mean shortest distance ⟨d_s⟩, correlates strongly with the significance of the biological similarity of the respective disease genes. Using the GO annotations, we determine for each disease how similar its associated genes are in terms of their biological processes (C and F), molecular function (D and G), and cellular component (E and H). Comparing the resulting values with random expectation, we find that the more localized a disease is topologically (i.e., the larger s_i or the shorter ⟨d_s⟩), the higher the significance in the similarity of the associated genes.


Fig. 3. Network separation and disease similarity. (A) A subnetwork of the full interactome highlighting the network-based relationship between disease genes associated with three diseases identified in the legend. (B and C) Distance distributions for disease pairs that have topologically overlapping modules (s_AB < 0, B) or topologically separated modules (s_AB > 0, C). The plots show P(d) for the disease pairs shown in (A). (D to I) Topological separation versus biomedical similarity. (D to F) GO term similarity, (G) gene coexpression, and (H) symptom similarity for all disease pairs as a function of their topological separation s_AB. The region of overlapping disease pairs is highlighted in red (s_AB < 0); the region of the separated disease pairs is shown in blue (s_AB > 0). For symptom similarity, we show the cosine similarity (c_AB = 0 if there are no shared symptoms between diseases A and B, and c_AB = 1 for diseases with identical symptoms). Comorbidity in (I) is measured by the relative risk RR (40). Bars in (D) to (I) indicate random expectation (SM section 1); in (D) to (G), the expected value for a randomly chosen protein pair is shown. In (H) and (I), the mean value of all disease pairs is used. (J to M) The interplay between gene-set overlap and the network-based relationships between disease pairs. (J) The relationship between gene sets A and B is captured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. More than half (59%) of the disease pairs do not share genes (J = C = 0); hence, their relation cannot be uncovered based on shared genes. (K) Distribution of s_AB for disease pairs with no gene overlap. We find that despite having disjoint gene sets, 717 disease pairs have overlapping modules (s_AB < 0). (L) The distribution of s_AB for disease pairs with complete gene overlap (C = 1) shows a broad range of network-based relationships, including non-overlapping modules (s_AB > 0). (M) Fold change of the number of shared genes compared to random expectation versus s_AB for all disease pairs. The 59% of all disease pairs without shared genes are highlighted with red background. For 98% of all disease pairs that share at least one gene, the gene-based overlap is larger than expected by chance. Nevertheless, most (87%) of these disease pairs are separated in the network (s_AB > 0). Conversely, a considerable number of pairs (717) without shared genes exhibit detectable network overlap (s_AB < 0).


2) Coexpression. We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for s_AB > 0.

3) Disease symptoms. We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping (s_AB < 0) to separated (s_AB > 0) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity. We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR ≥ 10 for s_AB < 0 to the random expectation of RR ≈ 1 for s_AB > 0.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter ⟨d_AA⟩ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated (s_AB > 0), then the diseases are pathobiologically distinct. If the disease modules topologically overlap (s_AB < 0), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter ⟨d_AA⟩ in a three-dimensional (3D) disease space, such that the physical distance r_AB between diseases A and B correlates with the observed network-based distance ⟨d_AB⟩ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with s_AB < 0 into the "overlapping" disease category and those with s_AB > 0 into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods (s_AB < 0; Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 6.18) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = 5 × 10^−15; Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that indeed disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that s_AB continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules (s_AB < 0; Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules (s_AB = –0.24), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-κB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 2.1] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 24.3, 6.3, and 2.0, respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.
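Two of the clinical similarity measures used in this section, the cosine similarity of symptom profiles and the relative risk of comorbidity, can be sketched as follows. This is a minimal illustration under stated assumptions: symptom profiles are taken to be sparse weight vectors, and the relative risk is assumed to be the ratio of observed co-occurrences to the number expected if the two diseases were independent. The exact definitions used in the paper are given in references (38) and (39) and the SM; the function names and the numbers below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse symptom vectors ({symptom: weight})."""
    dot = sum(w * v.get(symptom, 0.0) for symptom, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_risk(n_both, n_a, n_b, n_total):
    """Observed co-occurrence divided by the expectation under independence.

    Assumed definition for this sketch: RR = n_both * n_total / (n_a * n_b),
    where n_a and n_b count patients with disease A and B, n_both counts
    patients with both, and n_total is the population size.
    """
    return n_both * n_total / (n_a * n_b)


# Hypothetical symptom profiles.
disease_a = {"wheezing": 2.0, "cough": 1.0}
disease_b = {"cough": 1.0, "diarrhea": 2.0}
print(round(cosine_similarity(disease_a, disease_b), 3))

# Hypothetical patient counts: co-occurrence 10 times above independence.
print(relative_risk(n_both=50, n_a=1000, n_b=500, n_total=100_000))
```

With these definitions, c_AB = 0 for diseases without shared symptoms and c_AB = 1 for identical profiles, and RR ≈ 1 corresponds to the random expectation.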

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity. Throughout this paper, we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h; SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases. We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases: For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14f). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_c^m of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature: It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.
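Taking the union of several interaction sources can be sketched as below. This is only an illustration, not the authors' pipeline: it assumes that every interaction is treated as an undirected pair of protein identifiers, that self-interactions are dropped, and that the source of each edge is recorded. The source names and protein identifiers are hypothetical.

```python
def build_interactome(sources):
    """Union undirected interactions from several sources.

    `sources` maps a source name to an iterable of (protein_a, protein_b)
    pairs. Returns a dict from each undirected edge (a frozenset of two
    proteins) to the set of sources reporting it, plus the set of all
    proteins that have at least one interaction.
    """
    edges = {}
    for name, pairs in sources.items():
        for a, b in pairs:
            if a == b:
                continue  # assumption of this sketch: ignore self-interactions
            edges.setdefault(frozenset((a, b)), set()).add(name)
    proteins = set().union(*edges) if edges else set()
    return edges, proteins


# Hypothetical toy input with three of the seven source types.
toy_sources = {
    "binary": [("P1", "P2"), ("P2", "P3")],
    "complexes": [("P2", "P1"), ("P3", "P4")],
    "kinase_substrate": [("P4", "P4")],
}
edges, proteins = build_interactome(toy_sources)
print(len(proteins), "proteins,", len(edges), "interactions")
```

Storing edges as frozensets makes (P1, P2) and (P2, P1) the same interaction, so an edge reported by several sources is counted once.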

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10^−8. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3,173 associated genes.
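The two filtering steps described above can be sketched as follows. The record layout, the function names, and the toy data are hypothetical. The sketch also assumes that the 20-gene threshold is applied after restricting each disease to genes present in the interactome; the text leaves the order of the two filters open, and SM section 1 is the authoritative description.

```python
GWAS_CUTOFF = 5e-8  # genome-wide significance cutoff quoted in the text


def significant_associations(records, cutoff=GWAS_CUTOFF):
    """Keep (disease, gene) pairs whose GWAS p value is at or below the cutoff."""
    return [(d, g) for d, g, p in records if p <= cutoff]


def filter_diseases(disease_genes, interactome_proteins, min_genes=20):
    """Keep diseases with at least `min_genes` genes found in the interactome.

    Assumption of this sketch: genes are mapped to the network first and the
    size threshold is applied to the mapped gene set.
    """
    kept = {}
    for disease, genes in disease_genes.items():
        mapped = set(genes) & set(interactome_proteins)
        if len(mapped) >= min_genes:
            kept[disease] = mapped
    return kept


# Hypothetical toy data; a small threshold keeps the example readable.
gwas = [("asthma", "G1", 1e-9), ("asthma", "G2", 3e-7), ("gout", "G3", 5e-8)]
pairs = significant_associations(gwas)

annotations = {"asthma": {"G1", "G4", "G5"}, "gout": {"G3", "G9"}}
kept = filter_diseases(annotations, {"G1", "G3", "G4", "G5"}, min_genes=2)
print(pairs, sorted(kept))
```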

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49), (ii) tissue-specific gene expression data (36), (iii) symptom-disease associations (38), (iv) comorbidity data (39), and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins, and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ – (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
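The distance-based measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: ⟨d_AA⟩ is computed as the mean distance from each protein of A to its closest other protein of A (the ⟨d_s⟩ described above), and ⟨d_AB⟩ is assumed to average, over all proteins of both diseases, the distance to the closest protein of the other disease, so that shared proteins contribute d_AB = 0. Proteins with no reachable partner are skipped. All function names and the toy path network are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distances(adj, sources, targets, exclude_self):
    """For each node in `sources`, the distance to its closest node in `targets`."""
    targets = set(targets)
    result = []
    for s in sources:
        dist = bfs_distances(adj, s)
        candidates = [
            dist[t] for t in targets
            if t in dist and not (exclude_self and t == s)
        ]
        if candidates:  # skip proteins with no reachable partner
            result.append(min(candidates))
    return result


def mean_within(adj, nodes):
    """<d_AA>: mean distance to the next-closest protein of the same disease."""
    d = closest_distances(adj, nodes, nodes, exclude_self=True)
    return sum(d) / len(d)


def mean_between(adj, a, b):
    """<d_AB>: mean closest distance between the proteins of A and B."""
    d = closest_distances(adj, a, b, False) + closest_distances(adj, b, a, False)
    return sum(d) / len(d)


def separation(adj, a, b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2."""
    return mean_between(adj, a, b) - (mean_within(adj, a) + mean_within(adj, b)) / 2


# Toy network: a path 0-1-2-...-7.
path = {i: set() for i in range(8)}
for i in range(7):
    path[i].add(i + 1)
    path[i + 1].add(i)

print(separation(path, {0, 1, 2}, {5, 6, 7}))  # separated modules: positive
print(separation(path, {0, 1, 2}, {1, 2, 3}))  # overlapping modules: negative
```

On the toy path, the two modules at opposite ends give a positive s_AB, while the two modules sharing proteins 1 and 2 give a negative value, in line with the sign convention of the text.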

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. The values of both measures lie in the range [0, 1], with J, C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
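A minimal sketch of the two overlap measures and of a hypergeometric test is given below. The overlap coefficient and the Jaccard index follow the definitions in the text. The p value is assumed here to be the one-sided upper tail, i.e., the probability of observing at least the given number of shared genes; the exact test used in the paper is described in SM section 3. Function names and the example sets are hypothetical.

```python
from math import comb


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def hypergeom_p_value(n_total, size_a, size_b, shared):
    """P(X >= shared) when |B| genes are drawn at random from n_total genes,
    of which |A| belong to disease A (assumed one-sided upper tail)."""
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(n_total - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return favourable / comb(n_total, size_b)


genes_a = {"G1", "G2", "G3"}
genes_b = {"G2", "G3", "G4", "G5"}
print(overlap_coefficient(genes_a, genes_b))  # 2 shared / min(3, 4)
print(jaccard_index(genes_a, genes_b))        # 2 shared / 5 in the union
print(hypergeom_p_value(10, len(genes_a), len(genes_b), 2))
```

Because C divides by the smaller set, a small disease whose genes all belong to a larger one has C = 1 even though J can be far below 1, which is why the two measures are reported together.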

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749

9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530

10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530

11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349

12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956

13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767

14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904

15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861

16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504

17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137

18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631

19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992

20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651

21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930

22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982

23. S. E. Baranzini et al., Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090 (2009). doi: 10.1093/hmg/ddp120; pmid: 19286671

24. C. E. Wheelock et al., Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst. 5, 588–602 (2009). doi: 10.1039/b902356a; pmid: 19462016

25. A. S. Khalil, J. J. Collins, Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379 (2010). doi: 10.1038/nrg2775; pmid: 20395970

26. S. Wuchty et al., Gene pathways and subnetworks distinguish between major glioma subtypes and elucidate potential underlying biology. J. Biomed. Inform. 43, 945–952 (2010). doi: 10.1016/j.jbi.2010.08.011; pmid: 20828632

27. I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, E. M. Marcotte, Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). doi: 10.1101/gr.118992.110; pmid: 21536720

28. U. M. Singh-Blom et al., Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLOS ONE 8, e58977 (2013). doi: 10.1371/journal.pone.0058977; pmid: 23650495

29. A. Rzhetsky, D. Wajngurt, N. Park, T. Zheng, Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U.S.A. 104, 11694–11699 (2007). doi: 10.1073/pnas.0704820104; pmid: 17609372

30. A. Mottaz, Y. L. Yip, P. Ruch, A.-L. Veuthey, Mapping proteins to disease terminologies: From UniProt to MeSH. BMC Bioinformatics 9 (suppl. 5), S3 (2008). doi: 10.1186/1471-2105-9-S5-S3; pmid: 18460185

31. E. M. Ramos et al., Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet. 22, 144–147 (2014). doi: 10.1038/ejhg.2013.96; pmid: 23695286

32. L. Hakes, J. W. Pinney, D. L. Robertson, S. C. Lovell, Protein-protein interaction networks and biology - What's the connection? Nat. Biotechnol. 26, 69–72 (2008). doi: 10.1038/nbt0108-69

33. M. E. Cusick et al., Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009). doi: 10.1038/nmeth.1284; pmid: 19116613

34. R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function (Cambridge Univ. Press, Cambridge, 2010).

35. S. Bornholdt, H. G. Schuster, Eds., Handbook of Graphs and Networks (Wiley Online Library, 2003), vol. 2.

36. A. I. Su et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 101, 6062–6067 (2004). doi: 10.1073/pnas.0400782101; pmid: 15075390

37. T. K. Gandhi et al., Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet. 38, 285–293 (2006). pmid: 16501559


38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A? Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753

46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871

47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385

48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251

49. M. Ashburner et al.; The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651

50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the y2h data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601


Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules

Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.

Science, this issue 10.1126/science.1257601


Copyright © 2015 American Association for the Advancement of Science


of the 299 diseases show a statistically significant tendency to form disease modules based on both S_i and P(d_s) (fig. S4).

Fig. 1. From the human interactome to disease modules. (A) According to the disease module hypothesis, a disease represents a local perturbation of the underlying disease-associated subgraph. Such perturbations could represent the removal of a protein (e.g., by a nonsense mutation), the disruption of a protein-protein interaction, or modifications in the strength of an interaction. The complete disease module can be identified only in a full interactome map; the disease module observable to us captures a subset of this module, owing to data incompleteness. (B) Distribution of the number of disease-associated genes for 299 diseases. (C) Distribution of the fraction of disease genes within the observable disease module. (D) A small neighborhood of the interactome showing the biological nature of each physical interaction and the origin of the disease-gene associations used in our study (see also SM section 1). Genes associated with multiple sclerosis are shown in red, the shaded area indicating their observable module, a connected subgraph consisting of 11 proteins. (E) Schematic illustration of the predicted size of the observable disease modules (subgraphs) as a function of network completeness. Large modules should be observable even for low network coverage; to discover smaller modules, we need higher network completeness. (F) Size of the observable module as a function of the total number of disease genes. The purple curve corresponds to the percolation-based prediction (SM section 6), indicating that diseases with N_d < N_c ≈ 25 genes do not have an observable disease module in the current interactome. Each gray point captures one of the 299 diseases.

We also asked if there is a relationship between the tendency of disease proteins to agglomerate in the same interactome neighborhood and their biological similarity (7, 36, 37). We find that as the relative size s_i ≡ S_i/N_i of the observable module increases from 0.1 to 0.8, a sign of increasing agglomeration of the disease genes, the significance of the biological similarity in Gene Ontology (GO) annotations (biological processes, molecular function, and cellular component) increases 10- to 100-fold (Fig. 2, C to E, and fig. S3, a to c), an exceptionally strong effect (see SM section 2 for statistical analysis). Similarly, as the mean shortest distance between disease proteins increases from 1 (agglomerated disease proteins) to 3 (scattered disease proteins), we observe a factor of 10 to 100 decrease in the significance of GO term similarity (Fig. 2, F to H, and fig. S3, d to f).

Taken together, we find that genes associated with the same disease tend to agglomerate in the same neighborhood of the interactome. Indeed, although ~80% of the disease proteins are disconnected from the observable module, these isolates tend to be localized in its network vicinity. This result offers quantitative support to the hypothesis that many local neighborhoods of the interactome represent the observable parts of the true, larger and denser disease modules.
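The comparison of an observed module size with its random expectation, such as S = 11 versus S^rand = 2 ± 1 for multiple sclerosis (Fig. 2A), can be illustrated with a minimal sketch. It assumes a null model that draws protein sets of the same size uniformly at random from the network; the controls actually used are described in SM section 2 and may differ. The function names, the number of random draws, and the ring-shaped toy network are hypothetical choices made for illustration only.

```python
import random
from collections import deque
from statistics import mean, pstdev


def largest_component_size(adj, genes):
    """Size S of the largest connected subgraph induced by `genes`."""
    genes = {g for g in genes if g in adj}
    seen, best = set(), 0
    for start in genes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                # Only walk along edges that stay inside the gene set.
                if v in genes and v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        best = max(best, len(component))
    return best


def module_size_zscore(adj, genes, n_random=1000, seed=0):
    """Compare the observed S with same-size random protein sets.

    Assumption: uniform sampling without replacement is the null model.
    Returns (observed S, random mean, random std, z-score).
    """
    rng = random.Random(seed)
    genes = [g for g in genes if g in adj]
    nodes = sorted(adj)
    observed = largest_component_size(adj, genes)
    random_sizes = [
        largest_component_size(adj, rng.sample(nodes, len(genes)))
        for _ in range(n_random)
    ]
    mu, sigma = mean(random_sizes), pstdev(random_sizes)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    return observed, mu, sigma, z


# Toy network: a ring of 30 proteins in which the "disease" occupies
# five consecutive positions, so its observable module has S = 5.
ring = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
observed, mu, sigma, z = module_size_zscore(ring, [0, 1, 2, 3, 4])
```

A positive z-score indicates that the disease proteins form a larger connected module than same-size random sets. A degree-preserving randomization would be a stricter control than the uniform sampling assumed here.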

Relationship between diseases

If two disease modules overlap, local perturbations leading to one disease will likely disrupt pathways involved in the other disease module as well, resulting in shared clinical characteristics. To test the validity of this hypothesis, we introduce the network-based separation of a disease pair, A and B (Fig. 3A; see also figs. S5 to S7), using

$$s_{AB} \equiv \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \qquad (1)$$

s_AB compares the shortest distances between proteins within each disease, ⟨d_AA⟩ and ⟨d_BB⟩, to the shortest distances ⟨d_AB⟩ between A-B protein pairs. Proteins associated with both A and B have d_AB = 0. As discussed in SM section 3.3, the generalization of s_AB to account for directed regulatory and signaling interactions does not alter our subsequent findings (fig. S8).

We find that only 7% of disease pairs have overlapping disease neighborhoods with negative s_AB (Fig. 3B); the remaining 93% have a positive s_AB, indicating that their disease modules are topologically separated (Fig. 3C). Because we lack unambiguous true positive and true negative disease relationships that could be used as a reference, we use two complementary null models to evaluate the statistical significance of each disease pair compared to random expectation (see SM section 2.2). At a global false discovery level of 5%, we find that 75% of all disease pairs exhibit significant s_AB. To determine the degree to which this network-based separation of two diseases is predictive for pathobiological manifestations, we rely on four data sets:

1) Biological similarity. We find that the closer two diseases are in the interactome, the higher the GO annotation–based similarity of the proteins associated with them (Fig. 3, D to F). The effect is strong, resulting in a two-order-of-magnitude decrease in GO term similarity as we move from highly overlapping (s_AB ≈ −2) to well-separated disease pairs (s_AB > 0).


Fig. 2. Topological localization and biological similarity of disease genes. (A) The size of the largest connected component S of proteins associated with the same disease, shown for multiple sclerosis. The observed module size, S = 11, is significantly larger than the random expectation, S^rand = 2 ± 1. (B) The distribution of the shortest distance of each disease protein to the next closest disease protein, d_s. For multiple sclerosis, P(d_s) is significantly shifted compared to the random expectation, indicating that disease genes tend to agglomerate in each other's network neighborhood. (C to H) The degree of the network-based localization of a disease, as measured by the relative size of its observable module, s_i = S_i/N_d, and the mean shortest distance ⟨d_s⟩, correlates strongly with the significance of the biological similarity of the respective disease genes. Using the GO annotations, we determine for each disease how similar its associated genes are in terms of their biological processes (C and F), molecular function (D and G), and cellular component (E and H). Comparing the resulting values with random expectation, we find that the more localized a disease is topologically (i.e., the larger s_i or the shorter ⟨d_s⟩), the higher the significance in the similarity of the associated genes.


Fig. 3. Network separation and disease similarity. (A) A subnetwork of the full interactome highlighting the network-based relationship between disease genes associated with three diseases identified in the legend. (B and C) Distance distributions for disease pairs that have topologically overlapping modules (s_AB < 0, B) or topologically separated modules (s_AB > 0, C). The plots show P(d) for the disease pairs shown in (A). (D to I) Topological separation versus biomedical similarity. (D to F) GO term similarity, (G) gene coexpression, and (H) symptom similarity for all disease pairs as a function of their topological separation s_AB. The region of overlapping disease pairs is highlighted in red (s_AB < 0); the region of the separated disease pairs is shown in blue (s_AB > 0). For symptom similarity, we show the cosine similarity (c_AB = 0 if there are no shared symptoms between diseases A and B, and c_AB = 1 for diseases with identical symptoms). Comorbidity in (I) is measured by the relative risk RR (39). Bars in (D) to (I) indicate random expectation (SM section 1); in (D) to (G), the expected value for a randomly chosen protein pair is shown. In (H) and (I), the mean value of all disease pairs is used. (J to M) The interplay between gene-set overlap and the network-based relationships between disease pairs. (J) The relationship between gene sets A and B is captured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. More than half (59%) of the disease pairs do not share genes (J = C = 0); hence, their relation cannot be uncovered based on shared genes. (K) Distribution of s_AB for disease pairs with no gene overlap. We find that despite having disjoint gene sets, 717 disease pairs have overlapping modules (s_AB < 0). (L) The distribution of s_AB for disease pairs with complete gene overlap (C = 1) shows a broad range of network-based relationships, including non-overlapping modules (s_AB > 0). (M) Fold change of the number of shared genes compared to random expectation versus s_AB for all disease pairs. The 59% of all disease pairs without shared genes are highlighted with red background. For 98% of all disease pairs that share at least one gene, the gene-based overlap is larger than expected by chance. Nevertheless, most (87%) of these disease pairs are separated in the network (s_AB > 0). Conversely, a considerable number of pairs (717) without shared genes exhibit detectable network overlap (s_AB < 0).


2) Coexpression. We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for s_AB > 0.

3) Disease symptoms. We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping (s_AB < 0) to separated (s_AB > 0) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity. We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR ≥ 10 for s_AB < 0 to the random expectation of RR ≈ 1 for s_AB > 0.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter ⟨d_AA⟩ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated (s_AB > 0), then the diseases are pathobiologically distinct. If the disease modules topologically overlap (s_AB < 0), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter ⟨d_AA⟩ in a three-dimensional (3D) disease space, such that the physical distance r_AB between diseases A and B correlates with the observed network-based distance ⟨d_AB⟩ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with s_AB < 0 into the "overlapping" disease category and those with s_AB > 0 into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods (s_AB < 0; Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently, SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 618) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = 5 × 10^−15; Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that, indeed, disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that s_AB continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules (s_AB < 0; Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules (s_AB = −0.24), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-κB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 21] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 243 63 and 20 respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that, despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity.

Throughout this paper, we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h, SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases.

We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases: For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14f). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_c^m of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature: It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically, diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.
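Taking the union of several interaction sources can be sketched as follows. This is an illustration only, not the authors' pipeline: the function name, the source labels, the protein identifiers, and the decisions to treat every interaction as undirected and to drop self-interactions are assumptions introduced here.

```python
def build_interactome(sources):
    """Merge interaction lists into one undirected network.

    `sources` maps a source label to an iterable of (protein_a, protein_b)
    pairs. Returns the set of proteins and a dict that records, for every
    undirected edge, which sources reported it.
    """
    edges = {}
    for label, pairs in sources.items():
        for a, b in pairs:
            if a == b:
                continue  # assumption: self-interactions are ignored
            # frozenset makes (a, b) and (b, a) the same edge
            edges.setdefault(frozenset((a, b)), set()).add(label)
    proteins = set().union(*edges) if edges else set()
    return proteins, edges


# Hypothetical toy input with two of the seven source types.
toy_sources = {
    "binary": [("P1", "P2"), ("P2", "P3")],
    "complexes": [("P2", "P1"), ("P3", "P4")],
}
proteins, edges = build_interactome(toy_sources)
```

Keeping the source labels per edge makes it straightforward to restrict later analyses to one evidence type, for example high-throughput data only.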

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10^−8. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3173 associated genes.
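The filtering steps above can be sketched in a few lines. The data layout and the function name are hypothetical, and the sketch assumes that the 20-gene threshold is applied after restricting each gene set to genes present in the interactome; the text does not state the order of the two filters.

```python
GWAS_P_CUTOFF = 5e-8   # genome-wide significance cutoff from the text
MIN_GENES = 20         # minimum number of associated genes from the text


def select_diseases(curated, gwas, interactome_genes,
                    min_genes=MIN_GENES, p_cutoff=GWAS_P_CUTOFF):
    """Combine curated and GWAS associations, then filter diseases.

    curated: {disease: set of genes} (e.g., OMIM-derived annotations)
    gwas: iterable of (disease, gene, p_value) records
    interactome_genes: set of genes with interaction information
    """
    merged = {disease: set(genes) for disease, genes in curated.items()}
    for disease, gene, p_value in gwas:
        if p_value <= p_cutoff:
            merged.setdefault(disease, set()).add(gene)
    selected = {}
    for disease, genes in merged.items():
        mapped = genes & interactome_genes  # keep genes in the network
        if len(mapped) >= min_genes:        # assumption: filter after mapping
            selected[disease] = mapped
    return selected


# Toy example with a lowered threshold so the effect is visible.
toy_curated = {"D1": {"g1", "g2"}, "D2": {"g3"}}
toy_gwas = [("D1", "g4", 1e-9), ("D2", "g5", 1e-3), ("D2", "g6", 5e-8)]
toy_network = {"g1", "g2", "g3", "g4", "g6"}
toy_selected = select_diseases(toy_curated, toy_gwas, toy_network, min_genes=3)
```

In the toy example, D1 keeps three mapped genes and passes, whereas D2 ends up with only two because its association with g5 misses the significance cutoff.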

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49); (ii) tissue-specific gene expression data (36); (iii) symptom disease associations (38); (iv) comorbidity data (39); and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.
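Two of the measures applied to these data, the symptom cosine similarity c_AB and the comorbidity relative risk RR, can be sketched as follows. The relative-risk formula used here, RR = C_AB N / (P_A P_B), with C_AB the number of patients diagnosed with both diseases, P_A and P_B the numbers diagnosed with each disease, and N the total number of patients, is the definition we assume from the comorbidity study cited as (39). The function names and toy numbers are hypothetical.

```python
from math import sqrt


def cosine_similarity(symptoms_a, symptoms_b):
    """Cosine similarity of two symptom vectors given as {symptom: weight}.

    Returns 0.0 for diseases without shared symptoms and 1.0 for
    identical symptom profiles.
    """
    shared = symptoms_a.keys() & symptoms_b.keys()
    dot = sum(symptoms_a[s] * symptoms_b[s] for s in shared)
    norm_a = sum(w * w for w in symptoms_a.values())
    norm_b = sum(w * w for w in symptoms_b.values())
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / sqrt(norm_a * norm_b)


def relative_risk(n_both, n_a, n_b, n_total):
    """Observed co-occurrence divided by the count expected under
    independence (assumed definition, see lead-in)."""
    return n_both * n_total / (n_a * n_b)


c_identical = cosine_similarity({"fever": 1, "cough": 1},
                                {"fever": 1, "cough": 1})
c_disjoint = cosine_similarity({"fever": 1}, {"rash": 2})
rr_toy = relative_risk(n_both=50, n_a=100, n_b=200, n_total=1000)
```

With the toy counts, 20 shared patients would be expected under independence, so 50 observed cases correspond to RR = 2.5; RR ≈ 1 is the random expectation.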

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins; and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
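The distance-based measures can be illustrated with a minimal sketch on an unweighted, undirected network. It assumes that ⟨d_AB⟩ averages, over every protein of A and of B, the hop distance to the closest protein of the other disease, so that shared proteins contribute d_AB = 0, and that proteins with no reachable partner are skipped. All function names and the toy path network are hypothetical; the full procedure and its controls are given in SM section 2.

```python
from collections import deque
from statistics import mean


def build_adjacency(edges):
    """Undirected adjacency map from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def hop_distances(adj, source):
    """Breadth-first hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distances(adj, sources, targets, exclude_self):
    """For each source, the distance to its closest target protein."""
    found = []
    for s in sources:
        dist = hop_distances(adj, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s)]
        if reachable:  # assumption: unreachable proteins are skipped
            found.append(min(reachable))
    return found


def mean_ds(adj, genes):
    """<d_s>: mean distance to the next-closest protein of the same disease."""
    return mean(closest_distances(adj, genes, genes, exclude_self=True))


def separation(adj, genes_a, genes_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2."""
    d_ab = (closest_distances(adj, genes_a, genes_b, exclude_self=False)
            + closest_distances(adj, genes_b, genes_a, exclude_self=False))
    return mean(d_ab) - (mean_ds(adj, genes_a) + mean_ds(adj, genes_b)) / 2


# Toy network: the path a-b-c-d-e-f.
path = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"),
                        ("d", "e"), ("e", "f")])
s_far = separation(path, {"a", "b"}, {"e", "f"})             # 2.5, separated
s_near = separation(path, {"a", "b", "c"}, {"b", "c", "d"})  # negative, overlapping
```

One breadth-first search per disease protein is sufficient for a sketch; on a network of the size described here, caching the distance maps across disease pairs would avoid repeated searches.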

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. The values of both measures lie in the range [0, 1], with J = C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
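The two overlap measures and a hypergeometric test can be sketched as follows. The one-sided p value, P(X ≥ k) for an observed overlap of k genes, is our assumption about how the hypergeometric model is evaluated (see SM section 3 for the actual procedure); the function names and toy gene sets are hypothetical.

```python
from math import comb


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|) for two non-empty gene sets."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B| for two non-empty gene sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def hypergeom_pvalue(k, n_total, n_a, n_b):
    """P(X >= k) when a set of size n_b is drawn at random from n_total
    genes, of which n_a belong to the other disease."""
    upper = min(n_a, n_b)
    favorable = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i)
                    for i in range(k, upper + 1))
    return favorable / comb(n_total, n_b)


set_a = {"g1", "g2", "g3"}
set_b = {"g2", "g3", "g4", "g5"}
c_ab = overlap_coefficient(set_a, set_b)   # 2/3
j_ab = jaccard_index(set_a, set_b)         # 2/5
p_ab = hypergeom_pvalue(2, n_total=10, n_a=3, n_b=4)  # 70/210 = 1/3
```

For a subset relation such as {g1, g2} and {g1, g2, g3}, the overlap coefficient is 1 while the Jaccard index is 2/3, which is the distinction drawn in the text.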

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749

9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530

10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530

11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349

12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956

13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767

14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904

15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861

16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504

17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137

18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631

19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992

20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651

21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930

22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982

23. S. E. Baranzini et al., Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090 (2009). doi: 10.1093/hmg/ddp120; pmid: 19286671

24. C. E. Wheelock et al., Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst. 5, 588–602 (2009). doi: 10.1039/b902356a; pmid: 19462016

25. A. S. Khalil, J. J. Collins, Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379 (2010). doi: 10.1038/nrg2775; pmid: 20395970

26. S. Wuchty et al., Gene pathways and subnetworks distinguish between major glioma subtypes and elucidate potential underlying biology. J. Biomed. Inform. 43, 945–952 (2010). doi: 10.1016/j.jbi.2010.08.011; pmid: 20828632

27. I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, E. M. Marcotte, Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). doi: 10.1101/gr.118992.110; pmid: 21536720

28. U. M. Singh-Blom et al., Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLOS ONE 8, e58977 (2013). doi: 10.1371/journal.pone.0058977; pmid: 23650495

29. A. Rzhetsky, D. Wajngurt, N. Park, T. Zheng, Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U.S.A. 104, 11694–11699 (2007). doi: 10.1073/pnas.0704820104; pmid: 17609372

30. A. Mottaz, Y. L. Yip, P. Ruch, A.-L. Veuthey, Mapping proteins to disease terminologies: From UniProt to MeSH. BMC Bioinformatics 9 (suppl. 5), S3 (2008). doi: 10.1186/1471-2105-9-S5-S3; pmid: 18460185

31. E. M. Ramos et al., Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet. 22, 144–147 (2014). doi: 10.1038/ejhg.2013.96; pmid: 23695286

32. L. Hakes, J. W. Pinney, D. L. Robertson, S. C. Lovell, Protein-protein interaction networks and biology – What's the connection? Nat. Biotechnol. 26, 69–72 (2008). doi: 10.1038/nbt0108-69

33. M. E. Cusick et al., Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009). doi: 10.1038/nmeth.1284; pmid: 19116613

34. R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function (Cambridge Univ. Press, Cambridge, 2010).

35. S. Bornholdt, H. G. Schuster, Eds., Handbook of Graphs and Networks (Wiley Online Library, 2003), vol. 2.

36. A. I. Su et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 101, 6062–6067 (2004). doi: 10.1073/pnas.0400782101; pmid: 15075390

37. T. K. Gandhi et al., Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet. 38, 285–293 (2006). pmid: 16501559

SCIENCE sciencemag.org 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 1257601-7

RESEARCH | RESEARCH ARTICLE

Downloaded from http://science.sciencemag.org/ on August 31, 2020

38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A? Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753

46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871

47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385

48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251

49. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651

50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the y2h data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601


Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules

Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.

Science, this issue 10.1126/science.1257601

ARTICLE TOOLS: http://science.sciencemag.org/content/347/6224/1257601

SUPPLEMENTARY MATERIALS: http://science.sciencemag.org/content/suppl/2015/02/18/347.6224.1257601.DC1

RELATED CONTENT:
http://stke.sciencemag.org/content/sigtrans/9/439/rs7.full
http://stke.sciencemag.org/content/sigtrans/9/439/pc17.full
http://stke.sciencemag.org/content/sigtrans/2/81/ra39.full
http://stm.sciencemag.org/content/scitransmed/3/114/114ra127.full
http://stm.sciencemag.org/content/scitransmed/4/115/115rv1.full
http://stm.sciencemag.org/content/scitransmed/5/205/205rv1.full
http://stm.sciencemag.org/content/scitransmed/5/206/206ra140.full

REFERENCES: This article cites 99 articles, 20 of which you can access for free: http://science.sciencemag.org/content/347/6224/1257601#BIBL

PERMISSIONS: http://www.sciencemag.org/help/reprints-and-permissions

Use of this article is subject to the Terms of Service.

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.

Copyright © 2015 American Association for the Advancement of Science



of increasing agglomeration of the disease genes, the significance of the biological similarity in Gene Ontology (GO) annotations (biological processes, molecular function, and cellular component) increases 10- to 100-fold (Fig. 2, C to E, and fig. S3, a to c), an exceptionally strong effect (see SM section 2 for statistical analysis). Similarly, as the mean shortest distance between disease proteins increases from 1 (agglomerated disease proteins) to 3 (scattered disease proteins), we observe a factor of 10 to 100 decrease in the significance of GO term similarity (Fig. 2, F to H, and fig. S3, d to f).

Taken together, we find that genes associated with the same disease tend to agglomerate in the same neighborhood of the interactome. Indeed, although ~80% of the disease proteins are disconnected from the observable module, these isolates tend to be localized in its network vicinity. This result offers quantitative support to the hypothesis that many local neighborhoods of the interactome represent the observable parts of the true, larger and denser disease modules.
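As an illustration of the localization test described above, the following Python sketch computes the size S of the largest connected component formed by a set of disease proteins and compares it with the sizes obtained for random protein sets of the same size. The toy ring network, the function names `largest_component_size` and `random_expectation`, and the plain uniform sampling are assumptions made for this example; the random controls actually used are described in SM section 2.

```python
import random
from collections import deque
from statistics import mean, pstdev


def largest_component_size(adj, nodes):
    """Size of the largest connected subgraph induced by `nodes` in `adj`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            # Only walk along edges whose endpoints are both in `nodes`.
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def random_expectation(adj, n_genes, n_trials=1000, seed=0):
    """Mean and standard deviation of S for uniformly sampled protein sets.

    Uniform sampling is an assumption of this sketch, not the paper's control.
    """
    rng = random.Random(seed)
    population = sorted(adj)
    sizes = [
        largest_component_size(adj, rng.sample(population, n_genes))
        for _ in range(n_trials)
    ]
    return mean(sizes), pstdev(sizes)


# Toy interactome (hypothetical): a ring of 30 proteins.
toy_adj = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
disease_proteins = {0, 1, 2, 3, 10, 20}

s_obs = largest_component_size(toy_adj, disease_proteins)
s_rand_mean, s_rand_std = random_expectation(toy_adj, len(disease_proteins))
print(s_obs, s_rand_mean, s_rand_std)
```

In this toy case the four adjacent proteins form the observable module, while the two isolated proteins stay disconnected from it, mirroring the situation described in the text.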

Relationship between diseases

If two disease modules overlap, local perturbations leading to one disease will likely disrupt pathways involved in the other disease module as well, resulting in shared clinical characteristics. To test the validity of this hypothesis, we introduce the network-based separation of a disease pair, A and B (Fig. 3A; see also figs. S5 to S7), using

$$s_{AB} \equiv \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \tag{1}$$

s_AB compares the shortest distances between proteins within each disease, ⟨d_AA⟩ and ⟨d_BB⟩, to the shortest distances ⟨d_AB⟩ between A-B protein pairs. Proteins associated with both A and B have d_AB = 0. As discussed in SM section 3.3, the generalization of s_AB to account for directed regulatory and signaling interactions does not alter our subsequent findings (fig. S8).

We find that only 7% of disease pairs have overlapping disease neighborhoods with negative s_AB (Fig. 3B); the remaining 93% have a positive s_AB, indicating that their disease modules are topologically separated (Fig. 3C). Because we lack unambiguous true positive and true negative disease relationships that could be used as a reference, we use two complementary null models to evaluate the statistical significance of each disease pair compared to random expectation (see SM section 2.2). At a global false discovery level of 5%, we find that 75% of all disease pairs exhibit significant s_AB. To determine the degree to which this network-based separation of two diseases is predictive for pathobiological manifestations, we rely on four data sets:

1) Biological similarity. We find that the closer two diseases are in the interactome, the higher the GO annotation–based similarity of the proteins associated with them (Fig. 3, D to F). The effect is strong, resulting in a two-order-of-magnitude decrease in GO term similarity as we move from highly overlapping (s_AB ≈ −2) to well-separated disease pairs (s_AB > 0).


Fig. 2. Topological localization and biological similarity of disease genes. (A) The size of the largest connected component S of proteins associated with the same disease, shown for multiple sclerosis. The observed module size, S = 11, is significantly larger than the random expectation, S_rand = 2 ± 1. (B) The distribution of the shortest distance of each disease protein to the next-closest disease protein, d_s. For multiple sclerosis, P(d_s) is significantly shifted compared to the random expectation, indicating that disease genes tend to agglomerate in each other's network neighborhood. (C to H) The degree of the network-based localization of a disease, as measured by the relative size of its observable module s_i = S_i/N_d and the mean shortest distance ⟨d_s⟩, correlates strongly with the significance of the biological similarity of the respective disease genes. Using the GO annotations, we determine for each disease how similar its associated genes are in terms of their biological processes (C and F), molecular function (D and G), and cellular component (E and H). Comparing the resulting values with random expectation, we find that the more localized a disease is topologically (i.e., the larger s_i or the shorter ⟨d_s⟩), the higher the significance in the similarity of the associated genes.


Fig. 3. Network separation and disease similarity. (A) A subnetwork of the full interactome highlighting the network-based relationship between disease genes associated with three diseases identified in the legend. (B and C) Distance distributions for disease pairs that have topologically overlapping modules (s_AB < 0, B) or topologically separated modules (s_AB > 0, C). The plots show P(d) for the disease pairs shown in (A). (D to I) Topological separation versus biomedical similarity. (D to F) GO term similarity, (G) gene coexpression, and (H) symptom similarity for all disease pairs as a function of their topological separation s_AB. The region of overlapping disease pairs is highlighted in red (s_AB < 0); the region of the separated disease pairs is shown in blue (s_AB > 0). For symptom similarity, we show the cosine similarity (c_AB = 0 if there are no shared symptoms between diseases A and B, and c_AB = 1 for diseases with identical symptoms). Comorbidity in (I) is measured by the relative risk RR (40). Bars in (D) to (I) indicate random expectation (SM section 1); in (D) to (G), the expected value for a randomly chosen protein pair is shown. In (H) and (I), the mean value of all disease pairs is used. (J to M) The interplay between gene-set overlap and the network-based relationships between disease pairs. (J) The relationship between gene sets A and B is captured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. More than half (59%) of the disease pairs do not share genes (J = C = 0); hence, their relation cannot be uncovered based on shared genes. (K) Distribution of s_AB for disease pairs with no gene overlap. We find that despite having disjoint gene sets, 717 disease pairs have overlapping modules (s_AB < 0). (L) The distribution of s_AB for disease pairs with complete gene overlap (C = 1) shows a broad range of network-based relationships, including non-overlapping modules (s_AB > 0). (M) Fold change of the number of shared genes compared to random expectation versus s_AB for all disease pairs. The 59% of all disease pairs without shared genes are highlighted with red background. For 98% of all disease pairs that share at least one gene, the gene-based overlap is larger than expected by chance. Nevertheless, most (87%) of these disease pairs are separated in the network (s_AB > 0). Conversely, a considerable number of pairs (717) without shared genes exhibit detectable network overlap (s_AB < 0).


2) Coexpression. We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for s_AB > 0.

3) Disease symptoms. We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping (s_AB < 0) to separated (s_AB > 0) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity. We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR ≥ 10 for s_AB < 0 to the random expectation of RR ≈ 1 for s_AB > 0.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter ⟨d_AA⟩ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated (s_AB > 0), then the diseases are pathobiologically distinct. If the disease modules topologically overlap (s_AB < 0), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter ⟨d_AA⟩ in a three-dimensional (3D) disease space, such that the physical distance r_AB between diseases A and B correlates with the observed network-based distance ⟨d_AB⟩ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with s_AB < 0 into the "overlapping" disease category and those with s_AB > 0 into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods (s_AB < 0, Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 6.18) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = 5 × 10^−15; Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that indeed disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that s_AB continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules (s_AB < 0, Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules (s_AB = −0.24), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-kB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 2.1] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 2.43, 6.3, and 2.0, respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that, despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity.

Throughout this paper, we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h, SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases. We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases: For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14f). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_c^m of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature: It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10^−8. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3173 associated genes.
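The final filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `filter_diseases`, the dictionary layout, and the order of the two filters (genes are first restricted to those present in the interactome, then diseases are thresholded) are choices made for this example and are not specified in the text; the toy data use a threshold of 2 instead of 20.

```python
def filter_diseases(disease_genes, network_proteins, min_genes=20):
    """Keep genes found in the interactome, then diseases with enough of them.

    The filter order is an assumption of this sketch.
    """
    network_proteins = set(network_proteins)
    kept = {}
    for disease, genes in disease_genes.items():
        mapped = set(genes) & network_proteins
        if len(mapped) >= min_genes:
            kept[disease] = mapped
    return kept


# Hypothetical toy annotations and interactome node set.
toy_annotations = {
    "disease_1": {"G1", "G2", "G3"},
    "disease_2": {"G4", "G9"},  # G9 has no interaction information
}
toy_network_proteins = {"G1", "G2", "G3", "G4", "G5"}

filtered = filter_diseases(toy_annotations, toy_network_proteins, min_genes=2)
print(filtered)
```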

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49), (ii) tissue-specific gene expression data (36), (iii) symptom disease associations (38), (iv) comorbidity data (39), and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins, and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
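The separation measure can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the authors' released source code: it assumes an unweighted, undirected network stored as an adjacency dictionary, takes ⟨d_AA⟩ as the mean distance from each protein to the next-closest other protein of the same disease, and takes ⟨d_AB⟩ as the mean closest-protein distance pooled over both directions (A to B and B to A). The helper names and the toy path graph are hypothetical.

```python
from collections import deque
from statistics import mean


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distances(adj, sources, targets, exclude_self=False):
    """For each source, the distance to its closest reachable target."""
    targets = set(targets)
    result = []
    for s in sources:
        dist = bfs_distances(adj, s)
        reachable = [
            dist[t] for t in targets
            if t in dist and not (exclude_self and t == s)
        ]
        if reachable:
            result.append(min(reachable))
    return result


def separation(adj, A, B):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2 (pooling convention assumed).

    Each set needs at least two mutually reachable proteins,
    otherwise statistics.mean raises StatisticsError.
    """
    d_AA = mean(closest_distances(adj, A, A, exclude_self=True))
    d_BB = mean(closest_distances(adj, B, B, exclude_self=True))
    # Proteins shared by A and B contribute a distance of 0.
    d_AB = mean(closest_distances(adj, A, B) + closest_distances(adj, B, A))
    return d_AB - (d_AA + d_BB) / 2


# Toy interactome (hypothetical): a path graph 0-1-2-3-4-5-6-7.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}

s_separated = separation(path, {0, 1, 2}, {5, 6, 7})
s_overlapping = separation(path, {0, 1, 2, 3}, {2, 3, 4})
print(s_separated, s_overlapping)
```

On the toy path graph, the two distant gene sets give a positive separation, while the two sets that share proteins give a negative one, matching the interpretation of the sign of s_AB given above.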

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. The values of both measures lie in the range [0, 1], with J = C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
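A short Python sketch of these overlap measures follows. The two coefficients are taken directly from the definitions above; the p value is computed as a one-sided upper tail of the hypergeometric distribution, which is an assumption of this example (the exact test is described in SM section 3). Function names and the toy gene sets are hypothetical.

```python
from math import comb


def overlap_coefficient(A, B):
    """C = |A ∩ B| / min(|A|, |B|)."""
    return len(A & B) / min(len(A), len(B))


def jaccard_index(A, B):
    """J = |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B)


def hypergeom_pvalue(n_shared, size_a, size_b, n_total):
    """P(X >= n_shared) when |B| genes are drawn at random from n_total
    genes, of which |A| are associated with disease A.

    The one-sided upper tail is an assumption of this sketch.
    """
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(n_total - size_a, size_b - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / comb(n_total, size_b)


# Hypothetical toy gene sets in a network of N = 10 genes.
genes_a = {"g1", "g2", "g3", "g4"}
genes_b = {"g3", "g4", "g5"}

c_ab = overlap_coefficient(genes_a, genes_b)
j_ab = jaccard_index(genes_a, genes_b)
p_ab = hypergeom_pvalue(len(genes_a & genes_b), len(genes_a), len(genes_b), 10)
print(c_ab, j_ab, p_ab)
```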

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749

9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530

10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530

11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349

12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956

13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767

14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904

15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861

16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504

17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137

18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631

19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992

20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651

21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930

22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982

23 S E Baranzini et al Pathway and network-based analysis ofgenome-wide association studies in multiple sclerosis HumMol Genet 18 2078ndash2090 (2009) doi 101093hmgddp120pmid 19286671

24 C E Wheelock et al Systems biology approaches and pathwaytools for investigating cardiovascular disease Mol Biosyst 5588ndash602 (2009) doi 101039b902356a pmid 19462016

25 A S Khalil J J Collins Synthetic biology Applications comeof age Nat Rev Genet 11 367ndash379 (2010) doi 101038nrg2775 pmid 20395970

26 S Wuchty et al Gene pathways and subnetworks distinguishbetween major glioma subtypes and elucidate potentialunderlying biology J Biomed Inform 43 945ndash952 (2010)doi 101016jjbi201008011 pmid 20828632

27 I Lee U M Blom P I Wang J E Shim E M MarcottePrioritizing candidate disease genes by network-basedboosting of genome-wide association data Genome Res 211109ndash1121 (2011) doi 101101gr118992110 pmid 21536720

28 U M Singh-Blom et al Prediction and validation of gene-disease associations using methods inspired by social networkanalyses PLOS ONE 8 e58977 (2013) doi 101371journalpone0058977 pmid 23650495

29 A Rzhetsky D Wajngurt N Park T Zheng Probing geneticoverlap among complex human phenotypes Proc Natl AcadSci USA 104 11694ndash11699 (2007) doi 101073pnas0704820104 pmid 17609372

30 A Mottaz Y L Yip P Ruch A-L Veuthey Mapping proteinsto disease terminologies From UniProt to MeSH BMCBioinformatics 9 (suppl 5) S3 (2008) doi 1011861471-2105-9-S5-S3 pmid 18460185

31 E M Ramos et al Phenotype-Genotype Integrator (PheGenI)Synthesizing genome-wide association study (GWAS) data withexisting genomic resources Eur J Hum Genet 22 144ndash147(2014) doi 101038ejhg201396 pmid 23695286

32 L Hakes J W Pinney D L Robertson S C LovellProtein-protein interaction networks and biologymdashWhatrsquos theconnection Nat Biotechnol 26 69ndash72 (2008)doi 101038nbt0108-69

33 M E Cusick et al Literature-curated protein interactiondatasets Nat Methods 6 39ndash46 (2009) doi 101038nmeth1284 pmid 19116613

34 R Cohen S Havlin Complex Networks Structure Robustnessand Function (Cambridge Univ Cambridge 2010)

35 S Bornholdt H G Schuster Eds Handbook of Graphs andNetworks (Wiley Online Library 2003) vol 2

36 A I Su et al A gene atlas of the mouse and humanprotein-encoding transcriptomes Proc Natl Acad Sci USA101 6062ndash6067 (2004) doi 101073pnas0400782101pmid 15075390

37 T K Gandhi et al Analysis of the human protein interactomeand comparison with yeast worm and fly interaction datasetsNat Genet 38 285ndash293 (2006) pmid 16501559

SCIENCE sciencemag.org 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 1257601-7


38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A? Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753

46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871

47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385

48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251

49. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651

50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the y2h data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601


Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules

Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.

Science, this issue 10.1126/science.1257601


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.

Copyright © 2015, American Association for the Advancement of Science


Fig. 3. Network separation and disease similarity. (A) A subnetwork of the full interactome highlighting the network-based relationship between disease genes associated with three diseases identified in the legend. (B and C) Distance distributions for disease pairs that have topologically overlapping modules (s_AB < 0, B) or topologically separated modules (s_AB > 0, C). The plots show P(d) for the disease pairs shown in (A). (D to I) Topological separation versus biomedical similarity. (D to F) GO term similarity, (G) gene coexpression, and (H) symptom similarity for all disease pairs as a function of their topological separation s_AB. The region of overlapping disease pairs is highlighted in red (s_AB < 0); the region of the separated disease pairs is shown in blue (s_AB > 0). For symptom similarity, we show the cosine similarity (c_AB = 0 if there are no shared symptoms between diseases A and B, and c_AB = 1 for diseases with identical symptoms). Comorbidity in (I) is measured by the relative risk RR (40). Bars in (D) to (I) indicate random expectation (SM section 1); in (D) to (G), the expected value for a randomly chosen protein pair is shown. In (H) and (I), the mean value of all disease pairs is used. (J to M) The interplay between gene-set overlap and the network-based relationships between disease pairs. (J) The relationship between gene sets A and B is captured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. More than half (59%) of the disease pairs do not share genes (J = C = 0); hence, their relation cannot be uncovered based on shared genes. (K) Distribution of s_AB for disease pairs with no gene overlap. We find that despite having disjoint gene sets, 717 disease pairs have overlapping modules (s_AB < 0). (L) The distribution of s_AB for disease pairs with complete gene overlap (C = 1) shows a broad range of network-based relationships, including non-overlapping modules (s_AB > 0). (M) Fold change of the number of shared genes compared to random expectation versus s_AB for all disease pairs. The 59% of all disease pairs without shared genes are highlighted with red background. For 98% of all disease pairs that share at least one gene, the gene-based overlap is larger than expected by chance. Nevertheless, most (87%) of these disease pairs are separated in the network (s_AB > 0). Conversely, a considerable number of pairs (717) without shared genes exhibit detectable network overlap (s_AB < 0).
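As a minimal sketch of the symptom similarity c_AB described in the caption, the cosine similarity between two diseases can be computed from their symptom vectors. The dictionary representation, the example weights, and the function name are assumptions made for illustration; how symptoms are actually weighted is described in (38) and the SM, not here.

```python
from math import sqrt


def cosine_similarity(symptoms_a, symptoms_b):
    """Cosine similarity between two symptom vectors.

    Each argument maps a symptom name to a non-negative weight
    (a hypothetical representation chosen for this sketch).
    """
    shared = set(symptoms_a) & set(symptoms_b)
    dot = sum(symptoms_a[s] * symptoms_b[s] for s in shared)
    norm_a = sqrt(sum(w * w for w in symptoms_a.values()))
    norm_b = sqrt(sum(w * w for w in symptoms_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a disease with no recorded symptoms shares none
    return dot / (norm_a * norm_b)
```

With this definition, two diseases with no symptom in common give c_AB = 0, and two diseases with identical symptom vectors give c_AB = 1 (up to floating-point rounding), matching the two limits quoted in the caption.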


2) Coexpression: We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for s_AB > 0.

3) Disease symptoms: We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping (s_AB < 0) to separated (s_AB > 0) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity: We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR ≥ 10 for s_AB < 0 to the random expectation of RR ≈ 1 for s_AB > 0.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter ⟨d_AA⟩ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated (s_AB > 0), then the diseases are pathobiologically distinct. If the disease modules topologically overlap (s_AB < 0), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter ⟨d_AA⟩ in a three-dimensional (3D) disease space, such that the physical distance r_AB between diseases A and B correlates with the observed network-based distance ⟨d_AB⟩ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with s_AB < 0 into the "overlapping" disease category and those with s_AB > 0 into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods (s_AB < 0, Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 618) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = 5 × 10⁻¹⁵, Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that indeed disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that s_AB continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules (s_AB < 0, Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules (s_AB = −0.24), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-κB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 21] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 243, 63, and 20, respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity.

Throughout this paper, we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h, SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases. We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases: For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14f). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_c^m of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature: It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.
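The union of interaction sources can be illustrated with a short sketch. It assumes each source is available as a list of protein pairs and treats all interactions as undirected; the function name, the input format, and the decision to drop self-interactions are assumptions for illustration, not details taken from SM section 1.

```python
def build_interactome(*sources):
    """Merge several edge lists into one undirected network.

    Each source is an iterable of (protein_u, protein_v) pairs.
    Returns (proteins, interactions); every interaction is a frozenset,
    so (u, v) and (v, u) from different sources are counted once.
    """
    interactions = set()
    for source in sources:
        for u, v in source:
            if u != v:  # assumption: self-interactions are ignored
                interactions.add(frozenset((u, v)))
    # Proteins are those that take part in at least one kept interaction.
    proteins = set().union(*interactions)
    return proteins, interactions
```

Storing each interaction as a frozenset makes the union insensitive to the order in which a pair is reported, so duplicates across data sets collapse automatically.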

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot as compiled by (30) with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10⁻⁸. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3173 associated genes.
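The filtering steps above can be sketched as follows. The record layout (disease, gene, p value), the use of `None` to mark curated associations without a p value, and the order in which the filters are applied are assumptions made for this illustration; the actual processing is described in SM section 1.

```python
GWAS_P_CUTOFF = 5e-8  # genome-wide significance cutoff quoted in the text
MIN_GENES = 20        # minimum number of associated genes per disease


def filter_diseases(associations, interactome_genes,
                    min_genes=MIN_GENES, p_cutoff=GWAS_P_CUTOFF):
    """Keep diseases with enough genes that are present in the interactome.

    associations: iterable of (disease, gene, p_value) records, where
    p_value is None for curated (e.g., OMIM) entries (hypothetical layout).
    interactome_genes: set of genes with interaction information.
    """
    genes_by_disease = {}
    for disease, gene, p_value in associations:
        if p_value is not None and p_value > p_cutoff:
            continue  # GWAS hit below genome-wide significance
        if gene not in interactome_genes:
            continue  # no interaction information for this gene
        genes_by_disease.setdefault(disease, set()).add(gene)
    return {disease: genes
            for disease, genes in genes_by_disease.items()
            if len(genes) >= min_genes}
```

In this sketch the gene-count threshold is applied after removing genes that are missing from the interactome, which is one reading of the text; applying it before would keep a slightly different set of diseases.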

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49); (ii) tissue-specific gene expression data (36); (iii) symptom disease associations (38); (iv) comorbidity data (39); and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins; and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
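The two localization measures and the separation s_AB can be sketched on a toy network as follows. This is an illustrative implementation under stated assumptions, not the authors' source code: the network is an unweighted, undirected adjacency map, a protein shared by both diseases is taken to contribute a distance of 0 to ⟨d_AB⟩, and proteins with no reachable partner are skipped. All function names are hypothetical.

```python
from collections import deque
from statistics import mean


def build_adjacency(edges):
    """Undirected adjacency map {node: set(neighbors)} from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def hop_distances(adj, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def observable_module_size(adj, proteins):
    """Size S of the largest connected subgraph formed by the given proteins."""
    proteins = set(proteins)
    visited, largest = set(), 0
    for start in proteins:
        if start in visited:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj.get(node, ()):
                if neighbor in proteins and neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        visited |= component
        largest = max(largest, len(component))
    return largest


def closest_distances(adj, sources, targets, skip_self):
    """For each source, the distance to its closest protein in targets.

    skip_self=True ignores the source itself (distances within one disease);
    skip_self=False lets a shared protein count as distance 0 (assumption).
    Sources with no reachable target are skipped (assumption).
    """
    result = []
    for s in sources:
        dist = hop_distances(adj, s)
        candidates = [dist[t] for t in targets
                      if t in dist and not (skip_self and t == s)]
        if candidates:
            result.append(min(candidates))
    return result


def separation(adj, disease_a, disease_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2."""
    d_aa = mean(closest_distances(adj, disease_a, disease_a, skip_self=True))
    d_bb = mean(closest_distances(adj, disease_b, disease_b, skip_self=True))
    d_ab = mean(closest_distances(adj, disease_a, disease_b, skip_self=False)
                + closest_distances(adj, disease_b, disease_a, skip_self=False))
    return d_ab - (d_aa + d_bb) / 2
```

On a six-node chain 1-2-3-4-5-6, the sets {1, 2} and {5, 6} sit at opposite ends and give s_AB = 2.5 (separated), whereas the interleaved sets {1, 3} and {2, 4} give s_AB = −1 (overlapping). The sketch requires each disease to have at least two mutually reachable proteins; otherwise the mean is undefined.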

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A ∩ B|/min(|A|, |B|) and the Jaccard index J = |A ∩ B|/|A ∪ B|. The values of both measures lie in the range [0, 1], with J, C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
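A minimal sketch of the two overlap measures and of a hypergeometric test is given below. The p value shown is the upper tail P(X ≥ k) of the standard hypergeometric distribution, which is one common way to evaluate such an overlap; the exact procedure used in the paper is specified in SM section 3, and the function names here are hypothetical. Both gene sets are assumed to be non-empty.

```python
from math import comb


def overlap_coefficient(genes_a, genes_b):
    """C = |A ∩ B| / min(|A|, |B|); assumes both sets are non-empty."""
    return len(genes_a & genes_b) / min(len(genes_a), len(genes_b))


def jaccard_index(genes_a, genes_b):
    """J = |A ∩ B| / |A ∪ B|; assumes at least one set is non-empty."""
    return len(genes_a & genes_b) / len(genes_a | genes_b)


def hypergeometric_p_value(n_total, size_a, size_b, n_shared):
    """P(X >= n_shared) when size_b genes are drawn at random, without
    replacement, from n_total genes of which size_a belong to set A."""
    max_shared = min(size_a, size_b)
    total = comb(n_total, size_b)
    tail = sum(comb(size_a, k) * comb(n_total - size_a, size_b - k)
               for k in range(n_shared, max_shared + 1))
    return tail / total
```

For example, with A = {g1, g2, g3} and B = {g2, g3, g4, g5}, the sets share two genes, so C = 2/3 and J = 2/5. Exact integer arithmetic via `math.comb` avoids overflow for network-sized values of N, at the cost of speed for very large gene sets.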

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749


44 A Bauer B Perras S Sufke H-P Horny B KreftMyocardial infarction as an uncommon clinical manifestation

of intravascular large cell lymphoma Acta Cardiol 60551ndash555 (2005) doi 102143AC6052004979pmid 16261789

45 A L Hopkins Network pharmacology The next paradigmin drug discovery Nat Chem Biol 4 682ndash690 (2008)doi 101038nchembio118 pmid 18936753

46 J Mestres E Gregori-Puigjaneacute S Valverde R V SoleacuteThe topology of drug-target interaction networks Implicitdependence on drug properties and target familiesMol Biosyst 5 1051ndash1057 (2009) doi 101039b905821bpmid 19668871

47 M Kuhn et al Systematic identification of proteins that elicitdrug side effects Mol Syst Biol 9 663 (2013) doi 101038msb201310 pmid 23632385

48 A Hamosh A F Scott J S Amberger C A BocchiniV A McKusick Online Mendelian Inheritance in Man (OMIM) aknowledgebase of human genes and genetic disorders NucleicAcids Res 33 D514ndashD517 (2005) doi 101093nargki033pmid 15608251

49 M Ashburner et alThe Gene Ontology Consortium Geneontology Tool for the unification of biology Nat Genet 2525ndash29 (2000) doi 10103875556 pmid 10802651

50 A Subramanian et al Gene set enrichment analysis Aknowledge-based approach for interpreting genome-wide

expression profiles Proc Natl Acad Sci USA 10215545ndash15550 (2005) doi 101073pnas0506580102pmid 16199517

ACKNOWLEDGMENTS

We thank A-R Carvunis S Pevzner and T Rolland forproviding invaluable insights into the y2h data set J Bagrow andF Simini for many discussions on the network methods andG Musella for figure design This work was supported by NIHgrants P50-HG004233 U01-HG001715 and UO1-HG007690 fromNHGRI and PO1-HL083069 R37-HL061795 RC2-HL101543 andU01-HL108630 from NHLBI

SUPPLEMENTARY MATERIALS

wwwsciencemagorgcontent34762241257601supplDC1Materials and MethodsFigs S1 to S16Table S1External Databases S1 to S4Source CodeReferences (51ndash107)

17 June 2014 accepted 15 January 2015101126science1257601

1257601-8 20 FEBRUARY 2015 bull VOL 347 ISSUE 6224 sciencemagorg SCIENCE

RESEARCH | RESEARCH ARTICLEon A

ugust 31 2020

httpsciencesciencemagorg

Dow

nloaded from

Uncovering disease-disease relationships through the incomplete interactome

BarabaacutesiJoumlrg Menche Amitabh Sharma Maksim Kitsak Susan Dina Ghiassian Marc Vidal Joseph Loscalzo and Albert-Laacuteszloacute

DOI 101126science1257601 (6224) 1257601347Science

this issue 101126science1257601Sciencebetween diseases lacking shared disease genes could also be identifiedsimilarities encompassed their protein components gene expression symptoms and morbidity Molecular-level linksdisease pairs that are predicted to have overlapping modules had statistically significant molecular similarity These threshold have identifiable disease modules The network-based distance between two disease modules revealed thatconnections between disease-related proteins) to be observed Only diseases with data coverage that exceeds a specific

formulated the mathematical conditions needed to allow a disease module (a localized region ofet almaps Menche interactomediseases However the analysis of protein-protein interactions has been hampered by the incompleteness of

Shared genes represent a powerful but limited representation of the mechanistic relationship between twoA network approach to finding disease modules

ARTICLE TOOLS httpsciencesciencemagorgcontent34762241257601

MATERIALSSUPPLEMENTARY httpsciencesciencemagorgcontentsuppl2015021834762241257601DC1

CONTENTRELATED

httpstkesciencemagorgcontentsigtrans9439rs7fullhttpstkesciencemagorgcontentsigtrans9439pc17fullhttpstkesciencemagorgcontentsigtrans281ra39fullhttpstmsciencemagorgcontentscitransmed3114114ra127fullhttpstmsciencemagorgcontentscitransmed4115115rv1fullhttpstmsciencemagorgcontentscitransmed5205205rv1fullhttpstmsciencemagorgcontentscitransmed5206206ra140full

REFERENCES

httpsciencesciencemagorgcontent34762241257601BIBLThis article cites 99 articles 20 of which you can access for free

PERMISSIONS httpwwwsciencemagorghelpreprints-and-permissions

Terms of ServiceUse of this article is subject to the

is a registered trademark of AAASScienceScience 1200 New York Avenue NW Washington DC 20005 The title (print ISSN 0036-8075 online ISSN 1095-9203) is published by the American Association for the Advancement ofScience

Copyright copy 2015 American Association for the Advancement of Science

on August 31 2020

httpsciencesciencem

agorgD

ownloaded from


2) Coexpression: We find that the coexpression-based correlation across 70 tissues (36) between genes associated with overlapping diseases is almost twice that of well-separated diseases (Fig. 3G), falling to the random expectation for s_AB > 0.

3) Disease symptoms: We find that symptom similarity, as captured by large-scale medical bibliographic records (38), falls about an order of magnitude as we move from overlapping (s_AB < 0) to separated (s_AB > 0) diseases (Fig. 3H). Non-overlapping diseases share fewer symptoms than expected by chance.

4) Comorbidity: We used the disease history of 30 million individuals aged 65 and older (U.S. Medicare) to determine for each disease pair the relative risk RR of disease comorbidity (39) (Fig. 3I), finding that the relative risk drops from RR ≥ 10 for s_AB < 0 to the random expectation of RR ≈ 1 for s_AB > 0.

Thus, the network-based distance of two diseases indicates their pathobiological and clinical similarity. This result suggests a molecular network model of human disease: Each disease has a well-defined location and a diameter ⟨d_AA⟩ that captures its network-based size (Fig. 3, A to C). If two disease modules are topologically separated (s_AB > 0), then the diseases are pathobiologically distinct. If the disease modules topologically overlap (s_AB < 0), the magnitude of the overlap is indicative of their biological relationship: The higher the overlap, the more significant are the pathobiological similarities between them. We therefore represent each disease by a sphere with diameter ⟨d_AA⟩ in a three-dimensional (3D) disease space, such that the physical distance r_AB between diseases A and B correlates with the observed network-based distance ⟨d_AB⟩ (Fig. 4A; see also fig. S15 and SM section 8). Disease modules that do not overlap in Fig. 4A are predicted to be pathobiologically distinct; for those that overlap, the degree of overlap captures their common pathobiology and phenotypic characteristics.

To test the predictive power of this model, we grouped the disease pairs with s_AB < 0 into the "overlapping" disease category and those with s_AB > 0 into the "non-overlapping" disease category. As Fig. 4, B to G, indicates, all biological and clinical characteristics show statistically highly significant similarity for overlapping diseases, whereas the effects vanish for the non-overlapping disease pairs.

The disease separation allows us to identify unexpected overlapping disease pairs, i.e., those that lack overt pathobiological or clinical association (see table S1 for 12 such examples). For example, we find that asthma, a respiratory disease, and celiac disease, an autoimmune disease of the small intestine, are localized in overlapping neighborhoods (s_AB < 0; Fig. 4N), suggesting shared molecular roots despite their rather different pathobiologies. A closer inspection reveals evidence supporting this prediction: The two diseases share three genes identified via genome-wide associations with genome-wide significance (HLA-DQA1, IL18R1, IL1RL1), and recently SNP rs1464510, previously associated with celiac disease, was also found to be associated with asthma (40). Although the two diseases have few common phenotypic features, they exhibit a remarkably high comorbidity (RR = 6.18) and statistically significant coexpression between their genes (r = 0.32, p value = 0.02). Furthermore, the top enriched pathway in the combined gene set of the two diseases is the immune network for immunoglobulin A (IgA) production (p value = 5 × 10^−15; Fig. 4O), with 48 genes, of which seven are associated with asthma and five with celiac disease. Measuring amounts of an IgA antibody subclass against tissue transglutaminase (ATA) is widely used to screen for and diagnose celiac disease (41). At the same time, the IgA response to allergens in the respiratory tract of asthma patients plays a pathogenic role through eosinophil activation (42).

To determine whether we could have arrived at the same conclusion by identifying diseases with shared genes (7), we quantified the predictive power of gene overlap, finding that indeed disease pairs with large gene overlap tend to be localized in the same network neighborhood (Fig. 3, L and M). Nevertheless, 59% of disease pairs do not share genes; hence, their relationship cannot be resolved based on the shared gene hypothesis (Fig. 3J; see also figs. S9 and S10). We therefore repeated the analysis of Fig. 4, B to G, for all disease pairs without common genes, finding that s_AB continues to predict accurately the biological similarity (or distinctness) of these disease pairs (Fig. 4, H to M, and SM section 3). Overall, we find 717 pairs with overlapping disease modules (s_AB < 0; Fig. 3K), relationships that cannot be predicted based on gene overlap. For example, lymphoma, a cancer, and myocardial infarction, a heart disease, do not share disease genes. Yet they have strongly overlapping modules (s_AB = −0.24), indicating that they are located in the same neighborhood of the interactome. Indeed, we find that SMARCA4, a protein associated with myocardial infarction, interacts with ALK, MYC, and NF-κB2, which are lymphoma disease proteins. Cancer cells frequently depend on chromatin regulatory activities to maintain a malignant phenotype. It has been shown that leukemia cells require the SWI/SNF chromatin remodeling complex, containing the SMARCA4 protein as the catalytic subunit, for their survival and aberrant self-renewal potential (43). The relatedness of the two diseases is further supported by a high comorbidity [relative risk (RR) = 2.1] and the clinical finding that intravascular large cell lymphoma can affect and obstruct the small vessels of the heart (44). Other disease pairs that lack shared genes but are found in the same neighborhood of the interactome include glioma and gout, glioma and myocardial infarction, and myeloproliferative disorders and proteinuria, each pair having high comorbidity (RR = 2.43, 6.3, and 2.0, respectively). A detailed discussion of these and other novel disease-disease relationships predicted by our approach is offered in SM section 10.
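The comorbidity comparison in item 4 relies on a relative risk RR for each disease pair. The text cites (39) for the procedure and does not restate the formula, so the following is only a minimal sketch under the assumption that RR is the observed number of patients with both diseases divided by the number expected if the two diseases occurred independently. The function name and the patient counts are hypothetical.

```python
def relative_risk(n_both, n_a, n_b, n_total):
    """Relative risk of comorbidity for diseases A and B.

    Assumed definition (not restated in the text): observed co-occurrence
    divided by the co-occurrence expected under independence.
    """
    expected_both = n_a * n_b / n_total  # expected patients with both diseases
    return n_both / expected_both


# Hypothetical counts: 1000 patients, 100 with A, 50 with B, 20 with both.
rr = relative_risk(n_both=20, n_a=100, n_b=50, n_total=1000)
print(rr)  # values well above 1 indicate comorbidity beyond chance
```

Under this definition RR ≈ 1 corresponds to the random expectation quoted for separated disease pairs.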

Summary and discussion

A complete and accurate map of the interactome could have tremendous impact on our ability to understand the molecular underpinnings of human disease. Yet such a map is at least a decade away, which makes it currently impossible to evaluate precisely how far a given disease module is from completion. Nevertheless, we showed here that despite its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases. To be specific, we offer quantitative evidence for the identifiability of some disease modules, while showing that for other diseases the identifiability condition is not yet satisfied at the current level of incompleteness of the interactome. Most important, we demonstrated that the relative interactome-based position of two disease modules is a strong predictor of their biological and phenotypic similarity.

Throughout this paper we focused on the impact of network incompleteness, ignoring another limitation of the interactome: It is prone to notable investigative biases (12, 32, 33) (see also fig. S13 and SM section 5). We therefore repeated our analysis relying only on high-throughput data from yeast two-hybrid screens (12) (y2h; SM section 4), finding that the diameter ⟨d_AA⟩ of the observable modules, the distance ⟨d_AB⟩, and the separation s_AB of all disease pairs measured in the full and the unbiased interactome show statistically highly significant correlations. Similarly, OMIM is also prone to selection and investigative biases; hence, we repeated our measurements using only unbiased GWAS-associated disease genes. Comparing gene sets that include OMIM data and those that only contain GWAS associations, we again find highly significant correlations for ⟨d_AA⟩, ⟨d_AB⟩, and s_AB (figs. S11 and S12). Therefore, the disease modules and the overlap between them can be reproduced in the unbiased data as well, indicating that our key findings cannot be attributed to investigative biases.

We estimate the minimal number of associated genes that a disease needs to have in order to be observable to be around 25 for the current interactome. Unbiased high-throughput data alone have not yet reached sufficient coverage to map out putative modules for many diseases: For the y2h network, being a subset of the interactome with a much lower coverage, the respective minimal number is around 350 (N_c^y2h); hence, only a few disease modules can be observed (see fig. S14F). However, this approach can provide valuable insights into the properties of the complete interactome (SM section 6). Indeed, as the current y2h data are expected to represent a uniform subset of the complete y2h network (12), we can use it to derive the minimum coverage p_m^c of the latter. As the coverage of high-throughput maps improves, they will allow us to use the full power of unbiased approaches for disease module identification.

The true value of the developed interactome-based approach is its open-ended multipurpose nature: It offers a platform that can address numerous fundamental and practical issues pertaining to our understanding of human disease. This platform can be used to improve the interpretation of GWAS data (see fig. S16 and SM section 10 for an application to type II diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to ⟨d_AB⟩ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.
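The union step described above can be sketched as follows. This is an illustration only: the source names and protein identifiers are hypothetical, and treating all interactions as undirected and discarding self-interactions are assumptions made here, not choices stated in the text.

```python
def merge_interaction_sources(sources):
    """Union of several interaction lists into one undirected network.

    `sources` maps a source label to an iterable of (protein, protein) pairs.
    Assumptions of this sketch: edges are undirected and self-loops are dropped.
    """
    edges = set()
    for pairs in sources.values():
        for u, v in pairs:
            if u != v:
                # frozenset makes (u, v) and (v, u) the same edge
                edges.add(frozenset((u, v)))
    proteins = set().union(*edges) if edges else set()
    return proteins, edges


# Hypothetical toy sources; an edge reported by two sources is counted once.
toy_sources = {
    "binary": [("P1", "P2"), ("P2", "P3")],
    "complexes": [("P2", "P1"), ("P3", "P4"), ("P4", "P4")],
}
proteins, edges = merge_interaction_sources(toy_sources)
print(len(proteins), "proteins,", len(edges), "interactions")
```

Storing each edge as a `frozenset` keeps the union independent of the order in which a source lists the two partners.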

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10^−8. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3,173 associated genes.
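The filtering described above might look like the following sketch. The record layout, the source labels, and the order of the two filters (interactome membership first, then the 20-gene minimum) are assumptions for illustration; the actual pipeline is documented in SM section 1.

```python
from collections import defaultdict

GWAS_CUTOFF = 5e-8  # genome-wide significance cutoff quoted in the text


def build_disease_gene_sets(records, interactome_genes, min_genes=20):
    """Collect gene sets per disease and apply the filters described above.

    `records` is an iterable of (disease, gene, source, p_value) tuples, a
    hypothetical layout; p_value is only used for records labeled "GWAS".
    """
    gene_sets = defaultdict(set)
    for disease, gene, source, p_value in records:
        if source == "GWAS" and (p_value is None or p_value > GWAS_CUTOFF):
            continue  # association does not reach genome-wide significance
        if gene in interactome_genes:  # keep genes with interaction data
            gene_sets[disease].add(gene)
    # Assumed order: the size threshold is applied after the gene filter.
    return {d: g for d, g in gene_sets.items() if len(g) >= min_genes}


# Hypothetical toy input; min_genes is lowered so the example stays small.
toy_records = [
    ("disease_X", "G1", "GWAS", 1e-12),
    ("disease_X", "G2", "OMIM", None),
    ("disease_X", "G3", "GWAS", 1e-3),   # fails the significance cutoff
    ("disease_X", "G9", "OMIM", None),   # not in the interactome
    ("disease_Y", "G4", "GWAS", 1e-20),  # disease ends up with one gene only
]
toy_result = build_disease_gene_sets(toy_records, {"G1", "G2", "G3", "G4"}, min_genes=2)
print(toy_result)
```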

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49); (ii) tissue-specific gene expression data (36); (iii) symptom-disease associations (38); (iv) comorbidity data (39); and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size S, representing the size of the largest connected subgraph formed by disease proteins; and (ii) shortest distance d_s. For each of the N_d disease proteins, we determine the distance d_s to the next-closest protein associated with the same disease. The average ⟨d_s⟩ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases A and B is measured by comparing the diameters ⟨d_AA⟩ and ⟨d_BB⟩ of the respective diseases to the mean shortest distance ⟨d_AB⟩ between their proteins: s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2. Positive s_AB indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
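The measures above can be illustrated on a toy network. The sketch below is not the authors' implementation (the source code is listed among the supplementary materials); it assumes an unweighted, undirected network, and it assumes that ⟨d_AB⟩ pools, for every protein of one disease, the distance to the closest protein of the other disease, measured in both directions. All function names and the toy graph are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency map from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def largest_module_size(adj, proteins):
    """S: size of the largest connected subgraph formed by the given proteins."""
    proteins, seen, best = set(proteins), set(), 0
    for start in proteins:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj.get(node, ()):
                if nb in proteins and nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        best = max(best, len(component))
    return best


def closest_distances(adj, sources, targets, exclude_self):
    """For each source, the distance to its closest (reachable) target."""
    result = []
    for s in sources:
        dist = bfs_distances(adj, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s)]
        if reachable:  # assumption: proteins without a reachable partner are skipped
            result.append(min(reachable))
    return result


def mean(values):
    return sum(values) / len(values)


def separation(adj, a, b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2 for protein sets a and b."""
    d_aa = mean(closest_distances(adj, a, a, exclude_self=True))
    d_bb = mean(closest_distances(adj, b, b, exclude_self=True))
    # Assumed pooling: closest-partner distances from A to B and from B to A.
    d_ab = mean(closest_distances(adj, a, b, exclude_self=False)
                + closest_distances(adj, b, a, exclude_self=False))
    return d_ab - (d_aa + d_bb) / 2


# Toy network: a simple chain a-b-c-d-e-f.
toy_adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")])
print(separation(toy_adj, {"a", "b"}, {"e", "f"}))  # positive: separated modules
print(separation(toy_adj, {"b", "d"}, {"c", "e"}))  # negative: interleaved modules
```

On the chain, two modules at opposite ends give a positive s_AB, whereas two interleaved modules give a negative value, matching the sign convention stated above.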

Gene-based disease overlap

The overlap between two gene sets A and B is measured by the overlap coefficient C = |A∩B|/min(|A|,|B|) and the Jaccard index J = |A∩B|/|A∪B|. The values of both measures lie in the range [0,1], with J, C = 0 for no common genes. A Jaccard index J = 1 indicates two identical gene sets, whereas the overlap coefficient C = 1 when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all N genes in the network (see SM section 3 for full details).
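A minimal sketch of these overlap measures follows. The two coefficients are taken directly from the definitions above; the p value is computed as the upper tail of a hypergeometric distribution, which is an assumption about how the null model is evaluated (SM section 3 has the full details). Function names and gene labels are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def hypergeom_pvalue(a, b, n_total):
    """P(overlap >= observed) if the genes of B were drawn at random
    from n_total genes, |A| of which belong to A (assumed one-sided test)."""
    a, b = set(a), set(b)
    observed, size_a, size_b = len(a & b), len(a), len(b)
    tail = sum(comb(size_a, i) * comb(n_total - size_a, size_b - i)
               for i in range(observed, min(size_a, size_b) + 1))
    return tail / comb(n_total, size_b)


# Hypothetical gene sets in a network of 10 genes; B is a subset of A.
genes_a = {"g1", "g2", "g3", "g4"}
genes_b = {"g3", "g4"}
print(overlap_coefficient(genes_a, genes_b))  # 1.0, since B lies inside A
print(jaccard_index(genes_a, genes_b))        # 0.5
print(hypergeom_pvalue(genes_a, genes_b, n_total=10))
```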

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749

9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530

10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530

11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349

12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956

13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767

14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904

15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861

16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504

17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137

18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631

19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992

20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651

21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930

22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982

23. S. E. Baranzini et al., Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090 (2009). doi: 10.1093/hmg/ddp120; pmid: 19286671

24. C. E. Wheelock et al., Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst. 5, 588–602 (2009). doi: 10.1039/b902356a; pmid: 19462016

25. A. S. Khalil, J. J. Collins, Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379 (2010). doi: 10.1038/nrg2775; pmid: 20395970

26. S. Wuchty et al., Gene pathways and subnetworks distinguish between major glioma subtypes and elucidate potential underlying biology. J. Biomed. Inform. 43, 945–952 (2010). doi: 10.1016/j.jbi.2010.08.011; pmid: 20828632

27. I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, E. M. Marcotte, Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). doi: 10.1101/gr.118992.110; pmid: 21536720

28. U. M. Singh-Blom et al., Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLOS ONE 8, e58977 (2013). doi: 10.1371/journal.pone.0058977; pmid: 23650495

29. A. Rzhetsky, D. Wajngurt, N. Park, T. Zheng, Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U.S.A. 104, 11694–11699 (2007). doi: 10.1073/pnas.0704820104; pmid: 17609372

30. A. Mottaz, Y. L. Yip, P. Ruch, A.-L. Veuthey, Mapping proteins to disease terminologies: From UniProt to MeSH. BMC Bioinformatics 9 (suppl. 5), S3 (2008). doi: 10.1186/1471-2105-9-S5-S3; pmid: 18460185

31. E. M. Ramos et al., Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet. 22, 144–147 (2014). doi: 10.1038/ejhg.2013.96; pmid: 23695286

32. L. Hakes, J. W. Pinney, D. L. Robertson, S. C. Lovell, Protein-protein interaction networks and biology: What's the connection? Nat. Biotechnol. 26, 69–72 (2008). doi: 10.1038/nbt0108-69

33. M. E. Cusick et al., Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009). doi: 10.1038/nmeth.1284; pmid: 19116613

34. R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function (Cambridge Univ. Press, Cambridge, 2010).

35. S. Bornholdt, H. G. Schuster, Eds., Handbook of Graphs and Networks (Wiley Online Library, 2003), vol. 2.

36. A. I. Su et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 101, 6062–6067 (2004). doi: 10.1073/pnas.0400782101; pmid: 15075390

37. T. K. Gandhi et al., Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet. 38, 285–293 (2006). pmid: 16501559


38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A. Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753

46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871

47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385

48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251

49. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651

50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the y2h data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601

1257601-8 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 sciencemag.org SCIENCE


Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules

Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.

Science, this issue 10.1126/science.1257601


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.

Copyright © 2015 American Association for the Advancement of Science


Fig. 4. Network-based model of disease-disease relationship. (A) To illustrate the uncovered network-based relationship between diseases, we place each disease in a 3D disease space, such that their physical distance to other diseases is proportional to $\langle d_{AB} \rangle$ predicted by the interactome-based analysis. Diseases whose modules (spheres) overlap are predicted to have common molecular underpinnings. The colors capture several broad disease classes, indicating that typically diseases of the same class are located close to each other. There are exceptions, such as cerebrovascular disease, which is separated from other cardiovascular diseases, suggesting distinct molecular roots. (B to G) Biological similarity shown separately for the predicted overlapping and non-overlapping disease pairs (see Fig. 3, D to I, for interpretation). Error bars indicate the SEM. Gray lines show random expectation, either for random protein pairs (B to E, H to K) or for a random disease pair (F, G, L, M); p values denote the significance of the difference of the means according to a Mann-Whitney U test. (H to M) Biological similarity for disease pairs that do not share genes (control set). (N) Three overlapping disease pairs in the disease space. Coronary artery diseases and atherosclerosis, as well as hepatic cirrhosis and biliary tract diseases, are diseases with common classification; hence, their disease modules overlap. Our methodology also predicts several overlapping disease modules of apparently unrelated disease pairs (table S1), illustrated by asthma and celiac disease. (O) A network-level map of the overlapping asthma–celiac disease network neighborhood; also shown is the IgA production pathway (yellow) that plays a biological role in both diseases. We denote genes that are either shared by the two diseases or by the pathway, or that interact across the modules.


diabetes), help us uncover new uses for existing drugs (repurposing) by identifying the disease modules located in the vicinity of each drug target (45–47), and facilitate the discovery of the molecular underpinnings of undiagnosed diseases by exploiting the agglomeration of mutations and expression changes in network neighborhoods associated with well-characterized diseases. In the long run, network-based approaches, relying on an increasingly accurate interactome, are poised to become highly useful in interpreting disease-associated genome variations.

Materials and methods

Interactome construction

We combine several sources of protein interactions: (i) regulatory interactions derived from transcription factors binding to regulatory elements; (ii) binary interactions from several yeast two-hybrid high-throughput and literature-curated data sets; (iii) literature-curated interactions derived mostly from low-throughput experiments; (iv) metabolic enzyme-coupled interactions; (v) protein complexes; (vi) kinase-substrate pairs; and (vii) signaling interactions. The union of all interactions from (i) to (vii) yields a network of 13,460 proteins that are interconnected by 141,296 interactions. For more information on the individual data sets and general properties of the interactome, see SM section 1.
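The union step can be illustrated with a minimal sketch. It assumes each source is available as a list of protein pairs and that interactions are treated as undirected; the source names and the function `build_interactome` are hypothetical, and the actual processing of the individual data sets is described in SM section 1, which is not reproduced here.

```python
def build_interactome(sources):
    """Merge several interaction lists into one undirected network.

    sources: dict mapping a (hypothetical) source name to an iterable of
             (protein_a, protein_b) pairs.
    Returns (adjacency, edges): adjacency maps each protein to the set of
    its neighbors; edges is the set of unique undirected interactions.
    """
    edges = set()
    for pairs in sources.values():
        for a, b in pairs:
            # Sort the pair so that (a, b) and (b, a) count as one interaction.
            edges.add(tuple(sorted((a, b))))

    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency, edges


# Toy input: two sources that report one interaction in common.
toy_sources = {
    "binary": [("A", "B"), ("B", "C")],
    "complexes": [("B", "A"), ("C", "D")],
}
toy_adjacency, toy_edges = build_interactome(toy_sources)
```

In this toy input the interaction between A and B is reported by both sources but is counted once, which is the behavior one would expect from a union of interaction sets.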

Disease-gene associations

We integrate disease-gene annotations from Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) (48) and UniProtKB/Swiss-Prot, as compiled by (30), with GWAS data from the Phenotype-Genotype Integrator database (PheGenI; www.ncbi.nlm.nih.gov/gap/PheGenI) (31), using a genome-wide significance cutoff of p value ≤ 5 × 10⁻⁸. To combine the different disease nomenclatures of the two sources into a single standard vocabulary, we use the Medical Subject Headings ontology (MeSH; www.nlm.nih.gov/mesh) as described in SM section 1. After filtering for diseases with at least 20 associated genes and genes for which we have interaction information, we obtain 299 diseases and 3173 associated genes.
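The filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record formats and the function name `filter_disease_genes` are hypothetical, disease names are assumed to be already mapped to a common vocabulary, and the order of the two filters (restrict to interactome genes first, then count genes per disease) is an assumption, since the text does not specify it.

```python
GWAS_P_CUTOFF = 5e-8  # genome-wide significance cutoff quoted in the text


def filter_disease_genes(curated, gwas, interactome_genes, min_genes=20):
    """Merge curated and GWAS associations, then apply the two filters.

    curated: iterable of (disease, gene) pairs (hypothetical format)
    gwas: iterable of (disease, gene, p_value) triples (hypothetical format)
    interactome_genes: set of genes with interaction information
    """
    disease_to_genes = {}
    for disease, gene in curated:
        disease_to_genes.setdefault(disease, set()).add(gene)
    for disease, gene, p_value in gwas:
        if p_value <= GWAS_P_CUTOFF:
            disease_to_genes.setdefault(disease, set()).add(gene)

    # Assumed order: keep only genes present in the interactome, then
    # require at least min_genes remaining genes per disease.
    kept = {}
    for disease, genes in disease_to_genes.items():
        mapped = genes & interactome_genes
        if len(mapped) >= min_genes:
            kept[disease] = mapped
    return kept


# Toy input with a lowered threshold so the effect is visible.
toy_curated = [("asthma", "G1"), ("asthma", "G2"), ("celiac disease", "G3")]
toy_gwas = [
    ("asthma", "G4", 1e-9),          # passes the cutoff
    ("asthma", "G5", 1e-3),          # fails the cutoff
    ("celiac disease", "G6", 5e-8),  # exactly at the cutoff, kept
]
toy_kept = filter_disease_genes(
    toy_curated, toy_gwas, {"G1", "G2", "G3", "G4", "G6"}, min_genes=3
)
```

With the toy threshold of three genes, only the first disease survives; the default `min_genes=20` corresponds to the cutoff stated in the text.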

Additional disease and gene annotation data

For the analysis of the similarity between genes and diseases, we use (i) Gene Ontology (GO) annotations (49), (ii) tissue-specific gene expression data (36), (iii) symptom disease associations (38), (iv) comorbidity data (39), and (v) pathway annotations from the Molecular Signatures Database (MSigDB) (50). Full details on data sources, processing, and analysis are provided in SM section 1.

Network localization

We use two complementary measures to quantify the degree to which disease proteins agglomerate in specific interactome neighborhoods: (i) observable module size $S$, representing the size of the largest connected subgraph formed by disease proteins; and (ii) shortest distance $d_s$. For each of the $N_d$ disease proteins, we determine the distance $d_s$ to the next-closest protein associated with the same disease. The average $\langle d_s \rangle$ can be interpreted as the diameter of a disease on the interactome. The network-based overlap between two diseases $A$ and $B$ is measured by comparing the diameters $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$ of the respective diseases to the mean shortest distance $\langle d_{AB} \rangle$ between their proteins: $s_{AB} = \langle d_{AB} \rangle - (\langle d_{AA} \rangle + \langle d_{BB} \rangle)/2$. Positive $s_{AB}$ indicates that the two disease modules are separated on the interactome, whereas negative values correspond to overlapping modules. Details on the analysis and the appropriate random controls are presented in SM section 2.
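The measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' released source code. It assumes an unweighted, undirected network stored as an adjacency dictionary, and it reads $\langle d_{AB} \rangle$ as the average, over all proteins of both diseases, of the distance to the closest protein of the other disease (a shared protein contributes distance 0). That reading is an assumption; the exact definition is given in SM section 2. All function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def largest_module_size(adj, proteins):
    """S: size of the largest connected subgraph formed by the proteins."""
    proteins = set(proteins)
    seen, best = set(), 0
    for start in proteins:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                # Only walk along edges that stay inside the protein set.
                if v in proteins and v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        best = max(best, len(component))
    return best


def mean_closest_distance(adj, sources, targets, skip_self):
    """Mean over sources of the distance to the closest target protein."""
    targets = set(targets)
    closest = []
    for s in sources:
        dist = bfs_distances(adj, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (skip_self and t == s)]
        if reachable:  # proteins with no reachable partner are skipped
            closest.append(min(reachable))
    if not closest:
        raise ValueError("no reachable protein pairs")
    return sum(closest) / len(closest)


def separation(adj, a, b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2 under the assumptions above."""
    a, b = set(a), set(b)
    d_aa = mean_closest_distance(adj, a, a, skip_self=True)
    d_bb = mean_closest_distance(adj, b, b, skip_self=True)
    # <d_AB>: pool the closest cross-disease distances from both sides.
    n_a, n_b = len(a), len(b)
    d_ab = (mean_closest_distance(adj, a, b, skip_self=False) * n_a
            + mean_closest_distance(adj, b, a, skip_self=False) * n_b) / (n_a + n_b)
    return d_ab - (d_aa + d_bb) / 2


# Toy network: a path 1-2-3-4-5-6.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
s_far = separation(path, {1, 2}, {5, 6})         # modules at opposite ends
s_near = separation(path, {1, 2, 3}, {2, 3, 4})  # modules sharing two proteins
```

On the toy path, the two modules at opposite ends give a positive separation, and the two modules that share proteins give a negative one, matching the sign convention stated in the text. The pooled average in `separation` weights each side by its number of proteins, which is exact here because every protein has a reachable partner.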

Gene-based disease overlap

The overlap between two gene sets $A$ and $B$ is measured by the overlap coefficient $C = |A \cap B| / \min(|A|, |B|)$ and the Jaccard index $J = |A \cap B| / |A \cup B|$. The values of both measures lie in the range $[0, 1]$, with $J, C = 0$ for no common genes. A Jaccard index $J = 1$ indicates two identical gene sets, whereas the overlap coefficient $C = 1$ when one set is a complete subset of the other. For a statistical evaluation of the observed overlaps, we use a basic hypergeometric model with the null hypothesis that disease-associated genes are randomly drawn from the space of all $N$ genes in the network (see SM section 3 for full details).
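Both overlap measures and a hypergeometric tail probability can be computed with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and the one-sided test (probability of observing at least the given number of shared genes) is an assumed form of the "basic hypergeometric model"; the exact procedure is described in SM section 3.

```python
from math import comb


def overlap_coefficient(a, b):
    """C = |A ∩ B| / min(|A|, |B|); both sets must be non-empty."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """J = |A ∩ B| / |A ∪ B|; the union must be non-empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def hypergeom_pvalue(n_total, size_a, size_b, shared):
    """P(X >= shared) when size_b genes are drawn at random, without
    replacement, from n_total genes of which size_a belong to set A."""
    upper = min(size_a, size_b)
    favorable = sum(
        comb(size_a, k) * comb(n_total - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return favorable / comb(n_total, size_b)


# Toy gene sets drawn from a universe of 10 genes.
genes_a = {"G1", "G2", "G3"}
genes_b = {"G2", "G3", "G4", "G5"}
c_value = overlap_coefficient(genes_a, genes_b)
j_value = jaccard_index(genes_a, genes_b)
p_value = hypergeom_pvalue(10, len(genes_a), len(genes_b),
                           len(genes_a & genes_b))
```

Because `math.comb` works on exact integers, the tail sum stays exact until the final division, which keeps the sketch usable for a universe on the scale of the interactome described above.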

REFERENCES AND NOTES

1. M. Buchanan, G. Caldarelli, P. De Los Rios, Networks in Cell Biology (Cambridge Univ. Press, Cambridge, 2010).

2. T. Pawson, R. Linding, Network medicine. FEBS Lett. 582, 1266–1270 (2008). doi: 10.1016/j.febslet.2008.02.011; pmid: 18282479

3. E. E. Schadt, Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009). doi: 10.1038/nature08454; pmid: 19741703

4. A. Califano, A. J. Butte, S. Friend, T. Ideker, E. Schadt, Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012). doi: 10.1038/ng.2355; pmid: 22836096

5. A. Zanzoni, M. Soler-López, P. Aloy, A network medicine approach to human disease. FEBS Lett. 583, 1759–1765 (2009). doi: 10.1016/j.febslet.2009.03.001; pmid: 19269289

6. A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011). doi: 10.1038/nrg2918; pmid: 21164525

7. K.-I. Goh et al., The human disease network. Proc. Natl. Acad. Sci. U.S.A. 104, 8685–8690 (2007). doi: 10.1073/pnas.0701361104; pmid: 17502601

8. M. Oti, B. Snel, M. A. Huynen, H. G. Brunner, Predicting disease genes using protein-protein interactions. J. Med. Genet. 43, 691–698 (2006). doi: 10.1136/jmg.2006.041376; pmid: 16611749

9. K. Lage et al., Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol. Syst. Biol. 6, 381 (2010). doi: 10.1038/msb.2010.36; pmid: 20571530

10. H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007). doi: 10.1038/msb4100180; pmid: 17940530

11. R. Mosca, T. Pons, A. Céol, A. Valencia, P. Aloy, Towards a detailed atlas of protein-protein interactions. Curr. Opin. Struct. Biol. 23, 929–940 (2013). doi: 10.1016/j.sbi.2013.07.005; pmid: 23896349

12. T. Rolland et al., A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014). doi: 10.1016/j.cell.2014.10.050; pmid: 25416956

13. G. T. Hart, A. K. Ramani, E. M. Marcotte, How complete are current yeast and human protein-interaction networks? Genome Biol. 7, 120 (2006). doi: 10.1186/gb-2006-7-11-120; pmid: 17147767

14. K. Venkatesan et al., An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009). pmid: 19060904

15. M. P. Stumpf et al., Estimating the size of the human interactome. Proc. Natl. Acad. Sci. U.S.A. 105, 6959–6964 (2008). doi: 10.1073/pnas.0708078105; pmid: 18474861

16. M. N. Wass, A. David, M. J. Sternberg, Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382–390 (2011). doi: 10.1016/j.sbi.2011.03.013; pmid: 21497504

17. J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805 (2006). doi: 10.1093/bioinformatics/btl467; pmid: 16954137

18. I. Feldman, A. Rzhetsky, D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. U.S.A. 105, 4323–4328 (2008). doi: 10.1073/pnas.0701722105; pmid: 18326631

19. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, A. Rzhetsky, Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proc. Natl. Acad. Sci. U.S.A. 101, 15148–15153 (2004). doi: 10.1073/pnas.0404315101; pmid: 15471992

20. L. Franke et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). doi: 10.1086/504300; pmid: 16685651

21. S. Köhler, S. Bauer, D. Horn, P. N. Robinson, Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). doi: 10.1016/j.ajhg.2008.02.013; pmid: 18371930

22. Y. Chen et al., Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). doi: 10.1038/nature06757; pmid: 18344982

23. S. E. Baranzini et al., Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090 (2009). doi: 10.1093/hmg/ddp120; pmid: 19286671

24. C. E. Wheelock et al., Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst. 5, 588–602 (2009). doi: 10.1039/b902356a; pmid: 19462016

25. A. S. Khalil, J. J. Collins, Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379 (2010). doi: 10.1038/nrg2775; pmid: 20395970

26. S. Wuchty et al., Gene pathways and subnetworks distinguish between major glioma subtypes and elucidate potential underlying biology. J. Biomed. Inform. 43, 945–952 (2010). doi: 10.1016/j.jbi.2010.08.011; pmid: 20828632

27. I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, E. M. Marcotte, Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). doi: 10.1101/gr.118992.110; pmid: 21536720

28. U. M. Singh-Blom et al., Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLOS ONE 8, e58977 (2013). doi: 10.1371/journal.pone.0058977; pmid: 23650495

29. A. Rzhetsky, D. Wajngurt, N. Park, T. Zheng, Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U.S.A. 104, 11694–11699 (2007). doi: 10.1073/pnas.0704820104; pmid: 17609372

30. A. Mottaz, Y. L. Yip, P. Ruch, A.-L. Veuthey, Mapping proteins to disease terminologies: From UniProt to MeSH. BMC Bioinformatics 9 (suppl. 5), S3 (2008). doi: 10.1186/1471-2105-9-S5-S3; pmid: 18460185

31. E. M. Ramos et al., Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet. 22, 144–147 (2014). doi: 10.1038/ejhg.2013.96; pmid: 23695286

32. L. Hakes, J. W. Pinney, D. L. Robertson, S. C. Lovell, Protein-protein interaction networks and biology – What's the connection? Nat. Biotechnol. 26, 69–72 (2008). doi: 10.1038/nbt0108-69

33. M. E. Cusick et al., Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009). doi: 10.1038/nmeth.1284; pmid: 19116613

34. R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function (Cambridge Univ. Press, Cambridge, 2010).

35. S. Bornholdt, H. G. Schuster, Eds., Handbook of Graphs and Networks (Wiley Online Library, 2003), vol. 2.

36. A. I. Su et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 101, 6062–6067 (2004). doi: 10.1073/pnas.0400782101; pmid: 15075390

37. T. K. Gandhi et al., Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet. 38, 285–293 (2006). pmid: 16501559


38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A? Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753



REFERENCES AND NOTES

1 M Buchanan G Caldarelli P De Los Rios Networks in CellBiology (Cambridge Univ Press Cambridge 2010)

2 T Pawson R Linding Network medicine FEBS Lett 5821266ndash1270 (2008) doi 101016jfebslet200802011pmid 18282479

3 E E Schadt Molecular networks as sensors and drivers ofcommon human diseases Nature 461 218ndash223 (2009)doi 101038nature08454 pmid 19741703

4 A Califano A J Butte S Friend T Ideker E SchadtLeveraging models of cell regulation and GWAS data inintegrative network-based association studies Nat Genet 44841ndash847 (2012) doi 101038ng2355 pmid 22836096

5 A Zanzoni M Soler-Loacutepez P Aloy A network medicineapproach to human disease FEBS Lett 583 1759ndash1765(2009) doi 101016jfebslet200903001 pmid 19269289

6 A-L Barabaacutesi N Gulbahce J Loscalzo Network medicine Anetwork-based approach to human disease Nat Rev Genet12 56ndash68 (2011) doi 101038nrg2918 pmid 21164525

7 K-I Goh et al The human disease network Proc Natl AcadSci USA 104 8685ndash8690 (2007) doi 101073pnas0701361104 pmid 17502601

8 M Oti B Snel M A Huynen H G Brunner Predicting diseasegenes using protein-protein interactions J Med Genet 43691ndash698 (2006) doi 101136jmg2006041376pmid 16611749

9 K Lage et al Dissecting spatio-temporal protein networksdriving human heart development and related disordersMol Syst Biol 6 381 (2010) doi 101038msb201036pmid 20571530

10 H-Y Chuang E Lee Y-T Liu D Lee T Ideker Network-basedclassification of breast cancer metastasis Mol Syst Biol 3140 (2007) doi 101038msb4100180 pmid 17940530

11 R Mosca T Pons A Ceacuteol A Valencia P Aloy Towards a detailedatlas of protein-protein interactions Curr Opin Struct Biol 23929ndash940 (2013) doi 101016jsbi201307005 pmid 23896349

12 T Rolland et al A proteome-scale map of the humaninteractome network Cell 159 1212ndash1226 (2014)doi 101016jcell201410050 pmid 25416956

13 G T Hart A K Ramani E M Marcotte How complete are currentyeast and human protein-interaction networks Genome Biol 7120 (2006) doi 101186gb-2006-7-11-120 pmid 17147767

14 K Venkatesan et al An empirical framework for binaryinteractome mapping Nat Methods 6 83ndash90 (2009)pmid 19060904

15 M P Stumpf et al Estimating the size of the humaninteractome Proc Natl Acad Sci USA 105 6959ndash6964(2008) doi 101073pnas0708078105 pmid 18474861

16 M N Wass A David M J Sternberg Challenges for theprediction of macromolecular interactions Curr Opin StructBiol 21 382ndash390 (2011) doi 101016jsbi201103013pmid 21497504

17 J Xu Y Li Discovering disease-genes by topological featuresin human protein-protein interaction network Bioinformatics22 2800ndash2805 (2006) doi 101093bioinformaticsbtl467pmid 16954137

18 I Feldman A Rzhetsky D Vitkup Network properties of genesharboring inherited disease mutations Proc Natl Acad SciUSA 105 4323ndash4328 (2008) doi 101073pnas0701722105pmid 18326631

19 M Krauthammer C A Kaufmann T C Gilliam A RzhetskyMolecular triangulation Bridging linkage and molecular-networkinformation for identifying candidate genes in Alzheimerrsquosdisease Proc Natl Acad Sci USA 101 15148ndash15153 (2004)doi 101073pnas0404315101 pmid 15471992

20 L Franke et al Reconstruction of a functional human genenetwork with an application for prioritizing positionalcandidate genes Am J Hum Genet 78 1011ndash1025(2006) doi 101086504300 pmid 16685651

21 S Koumlhler S Bauer D Horn P N Robinson Walking theinteractome for prioritization of candidate disease genesAm J Hum Genet 82 949ndash958 (2008) doi 101016jajhg200802013 pmid 18371930

22 Y Chen et al Variations in DNA elucidate molecularnetworks that cause disease Nature 452 429ndash435 (2008)doi 101038nature06757 pmid 18344982

23 S E Baranzini et al Pathway and network-based analysis ofgenome-wide association studies in multiple sclerosis HumMol Genet 18 2078ndash2090 (2009) doi 101093hmgddp120pmid 19286671

24 C E Wheelock et al Systems biology approaches and pathwaytools for investigating cardiovascular disease Mol Biosyst 5588ndash602 (2009) doi 101039b902356a pmid 19462016

25 A S Khalil J J Collins Synthetic biology Applications comeof age Nat Rev Genet 11 367ndash379 (2010) doi 101038nrg2775 pmid 20395970

26 S Wuchty et al Gene pathways and subnetworks distinguishbetween major glioma subtypes and elucidate potentialunderlying biology J Biomed Inform 43 945ndash952 (2010)doi 101016jjbi201008011 pmid 20828632

27 I Lee U M Blom P I Wang J E Shim E M MarcottePrioritizing candidate disease genes by network-basedboosting of genome-wide association data Genome Res 211109ndash1121 (2011) doi 101101gr118992110 pmid 21536720

28 U M Singh-Blom et al Prediction and validation of gene-disease associations using methods inspired by social networkanalyses PLOS ONE 8 e58977 (2013) doi 101371journalpone0058977 pmid 23650495

29 A Rzhetsky D Wajngurt N Park T Zheng Probing geneticoverlap among complex human phenotypes Proc Natl AcadSci USA 104 11694ndash11699 (2007) doi 101073pnas0704820104 pmid 17609372

30 A Mottaz Y L Yip P Ruch A-L Veuthey Mapping proteinsto disease terminologies From UniProt to MeSH BMCBioinformatics 9 (suppl 5) S3 (2008) doi 1011861471-2105-9-S5-S3 pmid 18460185

31 E M Ramos et al Phenotype-Genotype Integrator (PheGenI)Synthesizing genome-wide association study (GWAS) data withexisting genomic resources Eur J Hum Genet 22 144ndash147(2014) doi 101038ejhg201396 pmid 23695286

32 L Hakes J W Pinney D L Robertson S C LovellProtein-protein interaction networks and biologymdashWhatrsquos theconnection Nat Biotechnol 26 69ndash72 (2008)doi 101038nbt0108-69

33 M E Cusick et al Literature-curated protein interactiondatasets Nat Methods 6 39ndash46 (2009) doi 101038nmeth1284 pmid 19116613

34 R Cohen S Havlin Complex Networks Structure Robustnessand Function (Cambridge Univ Cambridge 2010)

35 S Bornholdt H G Schuster Eds Handbook of Graphs andNetworks (Wiley Online Library 2003) vol 2

36 A I Su et al A gene atlas of the mouse and humanprotein-encoding transcriptomes Proc Natl Acad Sci USA101 6062ndash6067 (2004) doi 101073pnas0400782101pmid 15075390

37 T K Gandhi et al Analysis of the human protein interactomeand comparison with yeast worm and fly interaction datasetsNat Genet 38 285ndash293 (2006) pmid 16501559

SCIENCE sciencemagorg 20 FEBRUARY 2015 bull VOL 347 ISSUE 6224 1257601-7

RESEARCH | RESEARCH ARTICLEon A

ugust 31 2020

httpsciencesciencemagorg

Dow

nloaded from

38. X. Zhou, J. Menche, A.-L. Barabási, A. Sharma, Human symptoms-disease network. Nat. Commun. 5, 4212 (2014). doi: 10.1038/ncomms5212; pmid: 24967666

39. C. A. Hidalgo, N. Blumm, A.-L. Barabási, N. A. Christakis, A dynamic network approach for the study of human phenotypes. PLOS Comput. Biol. 5, e1000353 (2009). doi: 10.1371/journal.pcbi.1000353; pmid: 19360091

40. K. A. Hunt et al., Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395–402 (2008). doi: 10.1038/ng.102; pmid: 18311140

41. D. A. van der Windt, P. Jellema, C. J. Mulder, C. M. Kneepkens, H. E. van der Horst, Diagnostic testing for celiac disease among patients with abdominal symptoms: A systematic review. JAMA 303, 1738–1746 (2010). doi: 10.1001/jama.2010.549; pmid: 20442390

42. C. Pilette, S. R. Durham, J.-P. Vaerman, Y. Sibille, Mucosal immunity in asthma and chronic obstructive pulmonary disease: A role for immunoglobulin A? Proc. Am. Thorac. Soc. 1, 125–135 (2004). doi: 10.1513/pats.2306032

43. J. Shi et al., Role of SWI/SNF in acute leukemia maintenance and enhancer-mediated Myc regulation. Genes Dev. 27, 2648–2662 (2013). doi: 10.1101/gad.232710.113; pmid: 24285714

44. A. Bauer, B. Perras, S. Sufke, H.-P. Horny, B. Kreft, Myocardial infarction as an uncommon clinical manifestation of intravascular large cell lymphoma. Acta Cardiol. 60, 551–555 (2005). doi: 10.2143/AC.60.5.2004979; pmid: 16261789

45. A. L. Hopkins, Network pharmacology: The next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690 (2008). doi: 10.1038/nchembio.118; pmid: 18936753

46. J. Mestres, E. Gregori-Puigjané, S. Valverde, R. V. Solé, The topology of drug-target interaction networks: Implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009). doi: 10.1039/b905821b; pmid: 19668871

47. M. Kuhn et al., Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013). doi: 10.1038/msb.2013.10; pmid: 23632385

48. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005). doi: 10.1093/nar/gki033; pmid: 15608251

49. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651

50. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). doi: 10.1073/pnas.0506580102; pmid: 16199517

ACKNOWLEDGMENTS

We thank A.-R. Carvunis, S. Pevzner, and T. Rolland for providing invaluable insights into the Y2H data set; J. Bagrow and F. Simini for many discussions on the network methods; and G. Musella for figure design. This work was supported by NIH grants P50-HG004233, U01-HG001715, and UO1-HG007690 from NHGRI and PO1-HL083069, R37-HL061795, RC2-HL101543, and U01-HL108630 from NHLBI.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/347/6224/1257601/suppl/DC1
Materials and Methods
Figs. S1 to S16
Table S1
External Databases S1 to S4
Source Code
References (51–107)

17 June 2014; accepted 15 January 2015
10.1126/science.1257601


Uncovering disease-disease relationships through the incomplete interactome

Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo and Albert-László Barabási

Science 347 (6224), 1257601. DOI: 10.1126/science.1257601

A network approach to finding disease modules

Shared genes represent a powerful but limited representation of the mechanistic relationship between two diseases. However, the analysis of protein-protein interactions has been hampered by the incompleteness of interactome maps. Menche et al. formulated the mathematical conditions needed to allow a disease module (a localized region of connections between disease-related proteins) to be observed. Only diseases with data coverage that exceeds a specific threshold have identifiable disease modules. The network-based distance between two disease modules revealed that disease pairs that are predicted to have overlapping modules had statistically significant molecular similarity. These similarities encompassed their protein components, gene expression, symptoms, and morbidity. Molecular-level links between diseases lacking shared disease genes could also be identified.

Science, this issue 10.1126/science.1257601


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.

Copyright © 2015 American Association for the Advancement of Science
