Comparative Genomics of the Eukaryotes
Gerald M. Rubin1, Mark D. Yandell3, Jennifer R. Wortman3, George L. Gabor Miklos4, Catherine R. Nelson2, Iswar K. Hariharan5, Mark E. Fortini6, Peter W. Li3, Rolf Apweiler7, Wolfgang Fleischmann7, J. Michael Cherry8, Steven Henikoff9, Marian P. Skupski3, Sima Misra2, Michael Ashburner7, Ewan Birney7, Mark S. Boguski10, Thomas Brody11, Peter Brokstein2, Susan E. Celniker12, Stephen A. Chervitz13, David Coates14, Anibal Cravchik3, Andrei Gabrielian3, Richard F. Galle12, William M. Gelbart15, Reed A. George12, Lawrence S. B. Goldstein16, Fangcheng Gong3, Ping Guan3, Nomi L. Harris12, Bruce A. Hay17, Roger A. Hoskins12, Jiayin Li3, Zhenya Li3, Richard O. Hynes18, S. J. M. Jones19, Peter M. Kuehl20, Bruno Lemaitre21, J. Troy Littleton22, Deborah K. Morrison23, Chris Mungall12, Patrick H. O'Farrell24, Oxana K. Pickeral10, Chris Shue3, Leslie B. Vosshall25, Jiong Zhang10, Qi Zhao3, Xiangqun H. Zheng3, Fei Zhong3, Wenyan Zhong3, Richard Gibbs26, J. Craig Venter3, Mark D. Adams3, and Suzanna Lewis2
1 Howard Hughes Medical Institute, Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA
2 Department of Molecular and Cell Biology, Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA
3 Celera Genomics, Rockville, MD, 20850 USA
4 GenetixXpress, 78 Pacific Road, Palm Beach, Sydney, Australia 2108
5 Massachusetts General Hospital Cancer Center, Building 149, 13th Street, Charlestown, MA 02129 USA
6 Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
7 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
8 Department of Genetics, Stanford University, Palo Alto, CA 94305, USA
9 Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
10 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
11 Neurogenetics Unit, Laboratory of Neurochemistry, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
12 Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
13 Neomorphic, 2612 Eighth Street, Berkeley, CA 94710, USA
14 School of Biology, University of Leeds, Leeds LS2 9JT, UK
15 Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
16 Departments of Cellular and Molecular Medicine and Pharmacology, Howard Hughes Medical Institute, University of California–San Diego, La Jolla, CA 92093, USA
17 Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
18 Howard Hughes Medical Institute, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA
19 Genome Sequence Centre, BC Cancer Research Centre, 600 West 10th Avenue, Vancouver, BC, V5Z 4E6, Canada
20 Molecular and Cell Biology Program, University of Maryland at Baltimore, Baltimore, MD 21201, USA
21 Centre de Génétique Moléculaire, CNRS, 91198 Gif-sur-Yvette, France
22 Center for Learning and Memory, MIT, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
23 Regulation of Cell Growth Laboratory, Division of Basic Sciences, National Cancer Institute–Frederick Cancer Research and Development Center, National Institutes of Health, Frederick, MD 21702, USA
24 Department of Biochemistry and Biophysics, University of California, San Francisco, CA 94143, USA
25 Center for Neurobiology and Behavior, Columbia University, New York, NY 10032, USA
26 Baylor College of Medicine Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
NIH Public Access Author Manuscript
Science. Author manuscript; available in PMC 2009 September 29.
Published in final edited form as: Science. 2000 March 24; 287(5461): 2204–2215.
Abstract
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae (and the proteins they are predicted to encode) was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
With the full genomic sequence of three major model organisms now available, much of our knowledge about the evolutionary basis of cellular and developmental processes will derive from comparisons between protein domains, intracellular networks, and cell-cell interactions in different phyla. In this paper, we begin a comparison of D. melanogaster, C. elegans, and S. cerevisiae. We first ask how many distinct protein families each genome encodes, how the genes encoding these protein families are distributed in each genome, and how many genes are shared among flies, worms, yeast, and mammals. Next we describe the composition and organization of protein domains within the proteomes of fly, worm, and yeast and examine the representation in each genome of a subset of genes that have been directly implicated as causative agents of human disease. Then we compare some fundamental cellular and developmental processes: the cell cycle, cell structure, cell adhesion, cell signaling, apoptosis, neuronal signaling, and the immune system. In each case, we present a summary of what we have learned from the sequence of the fly genome and how the components that carry out these processes differ in other organisms. We end by presenting some observations on what we have learned, the obvious questions that remain, and how knowledge of the sequence of the Drosophila genome will help us approach new areas of inquiry.
The “Core Proteome”
How many distinct protein families are encoded in the genomes of D. melanogaster, C. elegans, and S. cerevisiae (1), and how do these genomes compare with that of a simple prokaryote, Haemophilus influenzae? We carried out an “all-against-all” comparison of protein sequences encoded by each genome using algorithms that aim to differentiate paralogs (highly similar proteins that occur in the same genome) from proteins that are uniquely represented (Table 1). Counting each set of paralogs as a unit reveals the “core proteome”: the number of distinct protein families in each organism. This operational definition does not include posttranslationally modified forms of a protein or isoforms arising from alternate splicing.
In Haemophilus, there are 1709 protein coding sequences, 1247 of which have no sequence relatives within Haemophilus (2). There are 178 families that have two or more paralogs, yielding a core proteome of 1425. In yeast, there are 6241 predicted proteins and a core proteome of 4383 proteins. The fly and worm have 13,601 and 18,424 (3) predicted protein-coding genes, and their core proteomes consist of 8065 and 9453 proteins, respectively. It is remarkable that Drosophila, a complex metazoan, has a core proteome only twice the size of that of yeast. Furthermore, despite the large differences between fly and worm in terms of development and morphology, they use a core proteome of similar size.
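The counting rule described above (each set of paralogs counts once, each protein without relatives counts once) can be sketched in a few lines of Python. The text does not specify the grouping algorithm, so the single-linkage grouping via union-find used here is an assumption, as are the function name `core_proteome` and the toy protein identifiers. Only the Haemophilus figures are taken from the text.

```python
from collections import defaultdict


def core_proteome(proteins, paralog_pairs):
    """Count protein families, treating each set of linked paralogs as one unit.

    Assumption: paralogs are grouped by single linkage (if A~B and B~C,
    then A, B, and C form one family). The paper's actual algorithm may differ.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent pointers to the family representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in paralog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(list)
    for p in proteins:
        families[find(p)].append(p)

    singletons = sum(1 for members in families.values() if len(members) == 1)
    return {
        "singletons": singletons,
        "paralog_families": len(families) - singletons,
        "core_proteome": len(families),
    }


# Toy example with hypothetical identifiers: a, b, c form one family.
toy = core_proteome(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")])

# Haemophilus figures from the text: 1247 unique proteins plus 178 families.
haemophilus_core = 1247 + 178
```

In the toy input, three of the five proteins collapse into a single family, so the core proteome is three units. The Haemophilus arithmetic reproduces the 1425 reported in the text.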
Gene Duplications
Much of the genomes of flies and worms consists of duplicated genes; we next asked how these paralogs are arranged. The frequency of local gene duplications and the number of their constituent genes differ widely between fly and worm, although in both genomes most paralogs are dispersed. The fly genome contains half the number of local gene duplications relative to C. elegans (4), and these gene clusters are distributed randomly along the chromosome arms; in C. elegans there is a concentration of gene duplications in the recombinogenic segments of the autosomal arms (1). In both organisms, approximately 70% of duplicated gene pairs are on the same strand (306 out of 417 for D. melanogaster and 581 out of 826 for C. elegans). The largest cluster in the fly contains 17 genes that code for proteins of unknown function; the next largest clusters both consist of glutathione S-transferase genes, each with 10 members. In contrast, 11 of 33 of the largest clusters in C. elegans consist of genes coding for seven-transmembrane domain receptors, most of which are thought to be involved in chemosensation. Other than these local tandem duplications, genes with similar functional assignment in the Gene Ontology (GO) classification (5) do not appear to be clustered in the genome.
We next compared the large duplicated gene families in fly, worm, and yeast without regard to genomic location. All of the known and predicted protein sequences of these three genomes were pooled, and each protein was compared to all others in the pool by means of the program BLASTP. Among the larger protein families that are found in worms and flies but not yeast are several that are associated with multicellular development, including homeobox proteins, cell adhesion molecules, and guanylate cyclases, as well as trypsin-like peptidases and esterases. Among the large families that are present only in flies are proteins involved in the immune response, such as lectins and peptidoglycan recognition proteins, transmembrane proteins of unknown function, and proteins that are probably fly-specific: cuticle proteins, peritrophic membrane proteins, and larval serum proteins.
Gene Similarities
What fraction of the proteins encoded by these three eukaryotes is shared? Comparative analysis of the predicted proteins encoded by these genomes suggests that nearly 30% of the fly genes have putative orthologs in the worm genome. We required that a protein show significant similarity over at least 80% of its length to a sequence in another species to be considered its ortholog (6). We know that this results in an underestimate, because the length requirement excludes known orthologs, such as homeodomain proteins, which have little similarity outside the homeodomain. The number of such fly-worm pairs does not decrease much as the similarity scores become more stringent (Table 2A), which strongly suggests that we have indeed identified orthologs, which may share molecular function. Nearly 20% of the fly proteins have a putative ortholog in both worm and yeast; these shared proteins probably perform functions common to all eukaryotic cells.
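The length requirement above, and the reason it undercounts proteins such as homeodomain proteins, can be illustrated with a minimal filter. This is a sketch under stated assumptions: the `Hit` record, its field names, the default E-value cutoff of 1e-10, and the example lengths are all hypothetical, chosen only to show how an 80% coverage rule behaves. The exact criteria are given in the paper's note (6), which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise similarity hit (hypothetical record layout)."""
    query: str
    subject: str
    query_length: int    # length of the query protein in residues
    aligned_length: int  # residues of the query covered by the alignment
    evalue: float


def coverage(hit: Hit) -> float:
    """Fraction of the query protein covered by the alignment."""
    return hit.aligned_length / hit.query_length


def is_putative_ortholog(hit: Hit, min_coverage: float = 0.80,
                         max_evalue: float = 1e-10) -> bool:
    """Apply the 80%-of-length rule plus an assumed E-value cutoff."""
    return coverage(hit) >= min_coverage and hit.evalue < max_evalue


# A hit that is similar along most of its length passes the filter.
full_length = Hit("fly_protein_A", "worm_protein_A", 500, 450, 1e-80)

# A homeodomain-style hit: strong similarity, but only over a short domain.
domain_only = Hit("fly_homeobox", "worm_homeobox", 400, 60, 1e-25)

results = {h.query: is_putative_ortholog(h) for h in (full_length, domain_only)}
```

The second hit is rejected even though its E-value is significant, because only 15% of the protein is covered. This is the behaviour the text identifies as the source of the underestimate.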
We also compared the proteins of fly, worm, and yeast to mammalian sequences. Most mammalian sequences are available as short expressed sequence tags (ESTs), so we dispensed with the requirement for similarity over 80% of the length of the proteins. Table 2B presents these data. Half of the fly protein sequences show similarity to mammalian proteins at a cutoff of E < 10^-10 (where E is expectation value), as compared to only 36% of worm proteins. This difference increases as the criteria become more stringent: 25% versus 15% at E < 10^-50 and 12% versus 7% at E < 10^-100. Because many of the comparisons are with short sequences, it is likely that many of these sequence similarities reflect conserved domains within proteins rather than orthology. However, it does suggest that the Drosophila proteome is more similar to mammalian proteomes than are those of worm or yeast.
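As an illustration of how such percentages are tabulated, the sketch below computes the fraction of proteins whose best mammalian hit falls under each expectation-value cutoff. The function name `fraction_with_hit`, the dictionary layout, and the five toy proteins with their E-values are invented for illustration and are not data from Table 2B.

```python
def fraction_with_hit(best_evalues, cutoff):
    """Fraction of proteins whose best hit has an E-value below the cutoff.

    best_evalues maps a protein name to its best E-value, or to None when
    no hit was reported at all.
    """
    passing = sum(1 for e in best_evalues.values() if e is not None and e < cutoff)
    return passing / len(best_evalues)


# Hypothetical best-hit E-values for five proteins.
toy_best_hits = {
    "p1": 1e-120,
    "p2": 1e-60,
    "p3": 1e-12,
    "p4": 1e-3,
    "p5": None,
}

cutoffs = (1e-10, 1e-50, 1e-100)
fractions = [fraction_with_hit(toy_best_hits, c) for c in cutoffs]
```

Because each stricter cutoff keeps only a subset of the proteins kept by the looser one, the fractions can only fall or stay level as stringency rises, as in the 50%, 25%, and 12% series reported for the fly.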
Protein Domains and Families
Proteins are often mosaic, containing two or more different identifiable domains, and domains can occur in different combinations in different proteins. Thus, only a portion of a protein may be conserved among organisms. We therefore performed a comparative analysis of the protein domains composing the predicted proteomes from D. melanogaster, C. elegans, and S. cerevisiae using sequence similarity searches against the SWISS-PROT/TrEMBL nonredundant protein database (7), the BLOCKS database (8), and the InterPro database (9). The 200 most common fly protein families and domains are listed in Table 3, and the 10 most highly represented families in worm and yeast are shown in Table 4. InterPro analyses plus manual data inspection enabled us to assign 7419 fly proteins, 8356 worm proteins, and 3056 yeast proteins to either protein families or domain families. We found 1400 different protein families or domains in all: 1177 in the fly, 1133 in the worm, and 984 in yeast; 744 families or domains were common to all three organisms.
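The summary figures in the last sentence (total distinct families, per-organism counts, and families common to all three) correspond to the union, sizes, and intersection of three sets. A minimal sketch follows; the function name `summarize_families` and the tiny example sets are illustrative assumptions, not the InterPro assignments used in the paper.

```python
def summarize_families(families_by_species):
    """Summarize family sets: total distinct, per-species counts, shared by all."""
    sets = [set(members) for members in families_by_species.values()]
    return {
        "total": len(set().union(*sets)),
        "per_species": {name: len(set(m)) for name, m in families_by_species.items()},
        "common_to_all": len(set.intersection(*sets)),
    }


# Toy family sets (illustrative only, far smaller than the real data).
toy_families = {
    "fly": {"protein kinase", "C2H2 zinc finger", "homeobox", "immunoglobulin"},
    "worm": {"protein kinase", "C2H2 zinc finger", "homeobox"},
    "yeast": {"protein kinase", "C2H2 zinc finger"},
}

summary = summarize_families(toy_families)
```

In the toy data, the two families present in every set play the role of the 744 shared families in the text, and the union plays the role of the 1400 total.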
Many protein families exhibit great disparities in abundance, and only the C2H2-type zinc finger proteins and the eukaryotic protein kinases are among the top 10 protein families common to all three organisms. There are 352 zinc finger proteins of the C2H2 type in the fly but only 138 in the worm; whether this reflects greater regulatory complexity in the fly is not known. The protein kinases constitute approximately 2% of each proteome. Curation of the genomic data revealed that Drosophila has approximately 300 protein kinases and 85 protein phosphatases, around half of which had previously been identified. In contrast, there are approximately 500 kinases and 185 phosphatases in the worm; the difference is largely due to the worm-specific expansion of certain families such as the CK1, FER, and KIN-15 families. There are currently approximately 600 kinases and 130 phosphatases in humans, and it is expected that these figures will rise to 1100 and 300, respectively, when the sequence of the human genome is completed (10). Of the proteins uncovered in this analysis, over 70% exhibit sequence similarity outside the kinase or phosphatase domain to proteins in other species. In the kinase group, approximately 75% are serine/threonine kinases, and 25% are tyrosine or dual-specificity kinases. Over 90% of the newly discovered kinases are predicted to phosphorylate serine/threonine residues; this group includes the first atypical protein kinase C isoforms identified in Drosophila. In addition, we found counterparts of the mammalian kinases CSK, MLK2, ATM, and Peutz-Jeghers syndrome kinase, and additional members of the Drosophila GSK3B, casein kinase I, SNF1-like, and Pak/STE20-like kinase families. In the fly protein phosphatase group, approximately 42% are predicted to be serine/threonine phosphatases; 48% are tyrosine or dual-specificity phosphatases. Among the newly discovered phosphatases, 35% are serine/threonine phosphatases, most of which are related to the protein phosphatase 2C family, and 65% are tyrosine or dual-specificity phosphatases. The fly and worm both contain close relatives to many of the known mammalian lipid kinases and phosphatases; however, no SH2-containing inositol 5′ phosphatase SHIP is apparent. Finally, it has been found that the assembly of kinase signaling complexes in vertebrate cells is aided by the presence of scaffolding and adaptor molecules, many of which contain phosphoprotein binding domains; we found 85 such proteins in the fly, including counterparts to IRS, VAV, SHC, JIP, and MP1.
Two remarkable findings emerge from the peptidase data that may reflect different approaches to growth and development in flies, worms, and humans. The pattern and distribution of peptidase types are similar between the fly and the worm: there are approximately 450 peptidases in the fly and 260 in the worm. The difference is due almost entirely to the expansion or contraction of a single class of trypsin-like (S1) peptidases. C. elegans has seven of this class and yeast has one, but the fly has 199. Of these, 163 are small proteins of approximately 250 amino acids containing single trypsin domains; very few are mosaic proteins. The remainder have either multiple trypsin-like domains or long stretches of amino acids with no readily identifiable motif, usually at the NH2-terminus. In humans, trypsin-like peptidases perform diverse functions in digestion, in the complement cascade, and in several other signaling pathways (11), and flies may have a similarly wide range of uses for these proteins. The extensively characterized members of this family, which include Snake, Easter, Nudel, and Gastrulation-defective, are all key members of a regulatory cascade that controls dorsoventral patterning in the fly (12). In addition, flies have only two members of the M10 class of peptidases, which include the matrix metalloproteases, collagenases, and gelatinases that are essential for tissue remodeling and repair in vertebrates.
The number of identifiable multidomain proteins is similar in the fly and the worm: 2130 and 2261, respectively. Yeast has only 672 (Table 5). Part of this difference is accounted for by proteins with extracellular domains involved in cell-cell and cell-substrate contacts (13), such as the immunoglobulin domain–containing proteins, which are more abundant in flies than in worms (153 versus 70) and are nonexistent in yeast. Two other common extracellular domains, EGF (110 versus 109, respectively) and fibronectin type III (46 versus 43), occur in similar numbers in fly and worm but are rare or absent in yeast. Extracellular regions of proteins often contain a variety of repeated domains (14), and so these proteins may account for our finding that flies have a larger number of proteins with multiple InterPro domains than either worms or yeast (2107 versus 1747 and 525, respectively) (Table 6). Some multidomain proteins of the fly are particularly heterogeneous: two low-density lipoprotein receptor–related proteins have 75 InterPro domains each. Another protein of unknown function has 62 InterPro domains; the most heterogeneous worm and yeast proteins [SWISS-PROT/TrEMBL accession numbers (AC), Q04833 and P32768, respectively] have 61 and 18 InterPro domains, respectively. There can be extensive repetition of the same domain within a protein; for example, an immunoglobulin-like domain is repeated 52 times within one protein of unknown function in the fly. The large worm protein UNC-89 contains 48 immunoglobulin-like domains (SWISS-PROT/TrEMBL AC, Q17362). In contrast, the largest number of repeats in yeast, of a C2H2-type zinc finger domain, occurs nine times in the transcription factor TFIIIA (SWISS-PROT/TrEMBL AC, P39933).
The heterotrimeric GTP-binding protein (G protein)–coupled receptors (GPCRs) are a large protein family in flies, worms, and vertebrates whose members are involved in synaptic function, hormonal physiology, and the regulation of morphological movements during gastrulation and germ band extension (15). There are predicted to be at least 700 GPCRs in the human genome (16) and roughly 1100 GPCRs in C. elegans (17). We found approximately 160 GPCR genes in the Drosophila genome, 57 of which appear to be olfactory receptors. Drosophila, C. elegans, and vertebrates each have diverse families of odorant receptors that, although recognizable as GPCRs, are unrelated by sequence and therefore apparently evolved independently. The number of odorant receptors in vertebrates ranges from around 100 in zebrafish and catfish to approximately 1000 in the mouse; C. elegans also has approximately 1000. In the fly, as in zebrafish and mouse, there is a correlation between the number of odorant receptors and the number of discrete synaptic structures called glomeruli in the olfactory processing centers of the brain (16, 18). In the mouse, each glomerulus is dedicated to receiving axonal input from neurons expressing a particular odorant receptor (16). Therefore, the correlation between number of odorant receptors and number of glomeruli may reflect a conservation in the organizational logic of odor recognition in insect and vertebrate brains. Although the fly odorant receptors are extremely diverse, there are a number of subfamilies whose members share 50 to 65% sequence identity. The distribution of odorant receptor genes is different among these organisms as well. Unlike C. elegans or vertebrate odorant receptors, which are in large linked arrays, the fly odorant receptor genes are distributed as single genes or in arrays of two or three. Vertebrate receptors are encoded by intronless genes, but both fly and worm receptor genes have multiple introns. These distinctions suggest that in addition to differences in the sequences of the odorant receptors of the different organisms, the processes generating the families of receptors may have differed among the lineages that gave rise to flies, worms, and vertebrates.
The data suggest conservation of hormone receptors between flies and vertebrates; nevertheless, there is a greater diversity of hormone receptors in both C. elegans and vertebrates than in Drosophila. Insects are subject to complex hormonal regulation, but no apparent homologs of vertebrate neuropeptide and hormone precursors were identified. However, many receptors with sequence similarity to vertebrate receptors for neurokinin, growth hormone secretagogue, leutotropin (follicle-stimulating hormone and luteinizing hormone), thyroid-stimulating hormone, galanin/allatostatin, somatostatin, and vasopressin were identified. Other GPCRs include a seventh Drosophila rhodopsin and homologs of adenosine, metabotropic glutamate, γ-aminobutyric acid (GABA), octopamine, serotonin, dopamine, and muscarinic acetylcholine receptors. In addition, there are GPCRs that are unique to Drosophila, others with sequence similarity to C. elegans and human orphan receptors, and an insect diuretic hormone receptor that is closely related to vertebrate corticotropin-releasing factor receptor. Finally, we found several atypical seven-transmembrane domain receptors, including 10 Methuselah (MTH)–like proteins and four Frizzled (FZ)–like proteins. A mutation in mth increases the fly's life-span and its resistance to various stresses (19); the FZ-like proteins probably serve as receptors for different members of the Wingless/Wnt family of ligands.
Human Disease Genes
Studies in model organisms have provided important insights into our understanding of genes and pathways that are involved in a variety of human diseases. In order to estimate the extent to which different types of human disease genes are found in flies, worms, and yeast, we compiled a set of 289 genes that are mutated, altered, amplified, or deleted in a diverse set of human diseases and searched for similar genes in D. melanogaster, C. elegans, and S. cerevisiae, as described in the legend to Fig. 1. Of these 289 human genes, 177 (61%) appear to have an ortholog in Drosophila (Fig. 1). Only proteins with similar domain structures were considered to be orthologs; this judgment was made by human inspection of the InterPro domain composition of the fly and human proteins. The importance of human inspection, as well as consideration of published information, is underscored by the fact that some sequences with extremely high similarity scores to proteins encoded by fly genes, such as LCK and Myotonic Dystrophy 1, were judged not to be orthologous, but others with relatively low scores, such as p53 and Rb1, were considered to be orthologs. We attempted this additional level of analysis only for the fly proteins, as the lower overall level of similarity of worm and yeast proteins made these subjective judgments even more difficult. Some of the human disease genes that are absent in Drosophila reflect clear differences in physiology between the two organisms. For instance, none of the hemoglobins, which are mutated in thalassemias, have orthologs in Drosophila. In flies, oxygen is delivered directly to tissues via the tracheal system rather than by circulating erythrocytes. Similarly, several genes required for normal rearrangement of the immunoglobulin genes do not have Drosophila orthologs.
Of the cancer genes surveyed, 68% appear to have Drosophila orthologs. In addition to previously described proteins, these searches identified clear protein orthologs for menin (MEN; multiple endocrine neoplasia type 1), Peutz-Jeghers disease (STK11), ataxia telangiectasia (ATM), multiple exostosis type 2 (EXT2), a second bCL2 family member, a second retinoblastoma family member, and a p53-like protein. Despite its relatively low sequence similarity to the human genes, the Drosophila gene encoding p53 was considered an ortholog because it shows a conserved organization of functional domains, and its DNA binding domain includes many of the same amino acids that appear to be hot spots for mutations in human cancer. Comparison of the fly p53-like protein with the human p53, p63, and p73 proteins suggests that it may represent a progenitor of this entire family. In mammalian cells, levels of p53 protein are tightly regulated in vivo by its interaction with the Mdm2 protein, which in turn binds to p19ARF (20). This mode of regulation, which modulates the activity of p53 but probably not of p63 or p73 (21), may not apply to the Drosophila protein, because we have not been able to identify orthologs of either Mdm2 or p19ARF in Drosophila. Interestingly, likely orthologs of the breast cancer susceptibility genes BRCA1 and BRCA2 were not found in Drosophila. In most instances, cancer genes that have a Drosophila ortholog also have an ortholog in C. elegans, although the extent of sequence similarity to the worm gene is lower. In a minority of instances, a C. elegans ortholog was clearly absent. Cancer genes with orthologs in Drosophila and apparently not in C. elegans include p53 and neurofibromatosis type 1 (22), the two genes implicated in tuberous sclerosis (TSC1 and TSC2) (23), and MEN. The two TSC gene products are thought to bind to each other and may function in a pathway that is conserved between humans and Drosophila but is absent in C. elegans and S. cerevisiae. However, the limitations of this type of analysis are clearly illustrated by our inability to find a bCL2 ortholog in C. elegans using these search parameters. The C. elegans ced-9 gene has been shown to function as a bCL2 homolog, and its protein is 23% identical to the human protein over its entire length (24).
Numerous orthologs of neurological genes are also found in the Drosophila genome. Some, such as Notch (CADASIL syndrome), the beta amyloid protein precursor-like gene, and Presenilin (Alzheimer's disease), were already known from previous studies in the fly. The genome sequencing effort has uncovered several additional genes that are likely to be orthologs of human neurological genes, such as tau (frontotemporal dementia with Parkinsonism), the Best macular dystrophy gene, neuroserpin (familial encephalopathy), genes for limb girdle muscular dystrophy types 2A and 2B, the Friedreich ataxia gene, the gene for Miller-Dieker lissencephaly, parkin (juvenile Parkinson's disease), and the Tay-Sachs and Stargardt's disease genes. Several genes implicated in expanded polyglutamine repeat diseases, including Huntington's and spinal cerebellar ataxia 2 (SCA2), are found in the fruit fly. Most human neurological disease genes surveyed were also detected in C. elegans, and some were even found in yeast, although a few examples are apparently present only in Drosophila, such as the Parkin and SCA2 orthologs.
Among genes implicated in endocrine diseases, those functioning in the insulin pathway are mostly conserved. In contrast, members of pathways involving growth hormone, mineralocorticoids, thyroid hormone, and the proteins that regulate body mass in vertebrates, such as those encoding leptin, do not appear to have Drosophila orthologs. Surprisingly, a protein that shows significant sequence similarity to the luteinizing hormone receptor is present in Drosophila (25). The physiological ligand for this receptor is not known. A number of genes that have been implicated in human renal disorders have orthologs in Drosophila, despite the differences between human kidneys and insect Malpighian tubules. In many instances, these gene products are involved in fluid and electrolyte transport across epithelia. Not surprisingly, most disease genes that function in intracellular metabolic pathways appear to have Drosophila orthologs.
Developmental and Cellular Processes
Developmental strategies in various phyla are overtly very different, from the fixed cell lineage of C. elegans to the syncytial embryogenic development of the fly, to early embryogenesis in amphibians and mammals. A number of major processes (cell division, cell shape, signaling pathways, cell-cell and cell-substrate adhesion, and apoptosis) determine the developmental outcomes of these very different embryos. Although there are many more, such as the processes that determine embryonic gradients, cell polarities, and cell movement, here we examine the first five, beginning with cell cycle components, and examine what new insights have been gained from the genomic data that affect our knowledge of the evolution of developmental processes. We then discuss the processes of neuronal signaling and innate immunity.
Cell cycle
Despite conservation of the mechanisms regulating cell cycle progression, many of the functions governing this progression are encoded by gene families whose individual members are not conserved between vertebrates and yeast. For example, the cyclins of S. cerevisiae can be divided into a G1 class (Cln1, Cln2, and Cln3) and an S/G2 class (Clb1 through Clb6); it is not possible to identify orthologs of individual vertebrate cyclins. Consequently, analysis of the roles of particular vertebrate cell cycle genes benefits from a genetic model in which parallels are more evident. Analysis of the Drosophila genome sequence supports and extends previous suggestions of strong parallels between fly and human cell cycle regulators. Orthologs of the vertebrate cell cycle cyclins, namely cyclin A (CycA), CycB, CycB3, CycE, and CycD, have been identified in Drosophila, as have orthologs of cyclins that appear to have roles in transcription: CycC, CycH, CycK, and CycT. Apparent orthologs of these cyclins can also be found in C. elegans; however, the level of similarity to the vertebrate members is invariably substantially less. Indeed, BLAST comparisons suggest that vertebrate and Drosophila CycA and CycB share more sequence similarity with yeast than with proposed C. elegans orthologs. Examination of other cell cycle regulators confirms that quite precise comparisons can be made between vertebrates and flies; parallels with yeast are looser. For example, like vertebrates, Drosophila uses several different cyclin-dependent kinases (Cdks) to regulate different aspects of the cell cycle; S. cerevisiae and Schizosaccharomyces pombe use only one. Cloning efforts and the genome sequence revealed Drosophila orthologs of vertebrate Cdk1 (cdc2) and Cdk2 (cdc2c), as well as a single Drosophila Cdk (Cdk4/6) with close similarity to both Cdk4 and Cdk6. As in vertebrates, Drosophila has two distinct kinases that add inhibitory phosphate to Cdk1: the previously identified Wee and a recently recognized homolog of Myt1, which was initially identified as a membrane-associated inhibitory kinase in Xenopus (26). C. elegans also has two homologs of these kinases (Wee1.1 and Wee1.3); however, similarity scores do not place these into distinct Wee1 and Myt1 subtypes. Each of these genes appears to be present in a single copy, a factor that simplifies genetic interpretations.
The retinoblastoma gene product pRb is a crucial cell cycle regulator in mammals and is thought to modulate S-phase entry via its interactions with the transcriptional regulator E2F and its dimerization partner (DP). This important mode of regulation is not found in yeast, but many components of the Rb pathway have been identified and studied in Drosophila (27). The sequencing effort uncovered a second Rb-related gene in Drosophila and confirmed the existence of only two E2F family members and a single DP ortholog. C. elegans also has an Rb-related gene, isolated in a genetic screen for mutations affecting cell fate decisions (28), but it has not been shown to play a direct role in cell cycle regulation. Also evident from the sequence are eight skp-like genes and six cullin-related genes. The Skp and Cullin proteins function in a complex that mediates the degradation of specific target proteins during crucial cell cycle transitions. Further exploration of the genome sequence should define orthologs to most vertebrate cell cycle genes and lead to genetic tests of their regulation and function.
Cytoskeleton
A large number of proteins link events at the cell surface with cytoskeletal networks and intracellular messengers (13). We found approximately 230 genes (approximately 2% of the predicted genes) that encode cytoskeletal structural or motor proteins; these represent most major families found in other invertebrates and vertebrates (29). The fraction of the Drosophila genome devoted to cytoskeletal functions appears to be somewhat smaller than that found in C. elegans (5%) (30); whether this reflects a true biological difference or a difference in classification criteria remains to be discovered. Of the Drosophila cytoskeletal genes, 90 encode proteins belonging to the kinesin, dynein, or myosin motor superfamilies, or accessory or regulatory proteins known to interact with the motor protein subunits. Approximately 80 genes encode actin-binding proteins, including proteins belonging to the spectrin/α-actinin/dystrophin superfamily of membrane cytoskeletal and actin–cross-linking proteins. Twenty genes encode proteins that are likely to bind microtubules, based on their similarity to microtubule-binding proteins found in other organisms. Fourteen genes encode members of the actin superfamily, 12 encode members of the tubulin superfamily, and 5 encode septins. Overall, the representation of predicted cytoskeletal protein types and families is similar to what has been found for C. elegans, although Drosophila has many more dyneins, probably because C. elegans lacks motile cilia and flagella.
Among this collection of cytoskeletal genes are several interesting and in some cases long-sought genes. One gene encodes a protein with striking homology to proteins of the tau/MAP2/MAP4 family that share a characteristic repeated microtubule-binding domain. Two encode new tubulins; one appears most closely related to α-tubulin, and the other appears most closely related to β-tubulin, both with approximately 50% identity. Neither new tubulin has greater similarity to the other, more divergent members of the tubulin superfamily, such as γ-, δ-, or ε-tubulin (31). Thus, both Drosophila and C. elegans appear to lack δ- and ε-tubulin, even though δ-tubulin is highly conserved between Chlamydomonas and humans. There are also three new members of the central motor domain family of kinesins that encode nonmotor proteins that regulate microtubule dynamics (32). There are clear homologs of the dystrophin complex and of dystrobrevin. Finally, the fly lacks cytoplasmic intermediate filament proteins, other than nuclear lamins, although other invertebrates, including C. elegans, appear to have genes encoding these (33). Drosophila and C. elegans both also appear to lack a gene encoding kinectin, the proposed receptor for kinesin and cytoplasmic dynein on vesicles and organelles (34). Flies and worms must thus use different proteins to link microtubule motors to vesicles and organelles.
Cell adhesion
Cell-cell adhesion and cell-substrate adhesion molecules have been crucial to the development of multicellular organisms and the evolution of complex forms of embryogenesis (13). The transmembrane extracellular matrix-cytoskeleton linkage via integrins is ancient. There are five α and two β integrins in the fly, two α and one β in C. elegans, and at least 18 α and eight β in vertebrates. Integrin-associated cytoplasmic proteins (talin, vinculin, α-actinin, paxillin, FAK, p130CAS, and ILK) are encoded by single-copy fly genes, as are tensin and syndecan.
Two genes for type IV collagen subunits and genes for the three subunits of laminin were already known in the fly. Analysis of the genome revealed no more laminin genes and only one more collagen, which is closest to types XV and XVIII of vertebrates. A counterpart of this collagen is found in C. elegans, which has on the order of 170 collagens. Most important, it appears that the core components of basement membranes (two type IV collagen subunits, three laminin subunits, entactin/nidogen, and one perlecan), are all present in flies. This constitution of basement membranes was clearly established early in evolution and has been well conserved in metazoans; remarkably, the fly preserves the linked head-to-head organization of vertebrate type-IV collagen genes. In contrast to this conservation, many well-known vertebrate integrin (ECM) ligands are absent from the fly: fibronectin, vitronectin, elastin, von Willebrand factor, osteopontin, and fibrillar collagens are all missing.
The fly has three classic cadherins, two of which are closely linked, but no protocadherins of the type found in vertebrates as clusters with common cytoplasmic domains (35). Vertebrates have three such clusters encoding over 50 protocadherins and close to 20 classical cadherins. The fly has no reelin, an ECM ligand for CNR-type protocadherins in vertebrates (36). However, there are other fly proteins with cadherin repeats, including the previously known Fat, Dachsous, and Starry night, and a new very large protein related to Fat. C. elegans has 15 genes containing cadherin repeats; the number in humans is now 70 and will undoubtedly rise (13).
Cell signaling
Components of known signaling pathways in the fly and worm have largely been uncovered by examinations of developmental systems. It is a tribute to the previous genetic analyses done in these organisms that only a modest number of new components of the known signaling pathways were revealed by analysis of the genomic sequence. The core components defined in flies and worms have been used in modified and expanded forms in vertebrates (37). The predominant pathways, namely the transforming growth factor–β (TGF-β), receptor tyrosine kinase, Wingless/Wnt, Notch/lin-12, Toll/IL1, JAK/STAT/cytokine, and Hedgehog (HH) signaling networks, all have largely conserved fly and vertebrate components. The worm, by contrast, does not appear to possess the HH or Toll/IL1 pathways, nor does it have all of the components of the Notch/lin-12 network (38). Two new proteins of the TGF-β superfamily were identified, bringing the total to seven; all seven are members of the bone morphogenetic protein (BMP) or β-activin subfamilies. We detected no representatives of the other branches of this superfamily, namely the TGF-β, α-inhibin, and Mullerian inhibiting substance (MIS) subfamilies. Three new members of the Wingless/Wnt family were identified, bringing the total to seven. Each of these proteins has sequence similarity to a different vertebrate Wnt protein; this ancient family clearly underwent much of its expansion before the divergence of the arthropod and chordate lineages. There is only one member of the Notch and HH families, in contrast to the many members of these families in vertebrates.
Apoptosis
The core apoptotic machinery of Drosophila shares many features in common with that of mammals. Many apoptosis-inducing signals lead to activation of members of the caspase family of proteases. These proteases function in apoptotic processes as cell death signal transducers and death effectors, and in nonapoptotic processes in flies and mammals (39). Drosophila contains genes encoding 8 caspases, as compared to 4 in the worm and at least 14 in mammals. Three of the fly caspases contain long NH2-terminal prodomains of 100 to 200 amino acids that are characteristic of caspases that function as signal transducers. These prodomains are thought to mediate caspase recruitment into signaling complexes in which activation occurs in response to oligomerization. In one pathway described in mammals but not in worms, death signals cause the release of proteins, including cytochrome c and the apoptosis-inducing factor (AIF), from mitochondria (40). The human protein Apaf-1, in conjunction with cytochrome c, activates CARD domain–containing caspases (41). Drosophila has an Apaf-1 counterpart, a CARD domain–containing caspase, and AIF; Drosophila also has counterparts to the caspase-activated DNAse CAD/CPAN/DFF40, its inhibitor ICAD/DFF45, and the chromatin condensation factor Acinus (42).
Pro- and anti-apoptotic BCL2 family members regulate apoptosis at multiple points (43). Drosophila encodes two BCL2 family proteins, though more divergent family members may exist. Fifteen BCL2 family proteins have been identified in mammals and two in the worm. In addition, inhibitor of apoptosis (IAP) family proteins negatively regulate apoptosis (44). They are defined by the presence of one or more NH2-terminal repeats of a BIR domain, a motif that is essential for death inhibition. Drosophila has four proteins with this motif, as compared to seven identified thus far in mammals. There are several BIR domain–containing proteins in C. elegans and yeast, but none has been implicated in cell death regulation. Reaper (RPR), Wrinkled (W), and Grim are essential Drosophila cell death activators (45). Orthologs have not been identified in other organisms, but they are likely to exist because RPR, W, and Grim induce apoptosis in vertebrate systems and physically interact with apoptosis regulators that include IAPs and the Xenopus protein Scythe (46), for which there is a predicted Drosophila homolog.
Neuronal signaling
The neuronal signaling systems in flies, worms, and vertebrates reveal extensive conservation of some components, as well as extreme divergence, or the total absence, of others. There is no voltage-activated sodium channel in the worm (17); flies and vertebrates generate sodium-dependent action potentials. The fly genome encodes two pore-forming subunits for sodium channels (Para and NaCP60E), and also four voltage-dependent calcium channel α subunits, including one T-type/α1G, one L-type/α1D (Dmca1D), one N-type/α1A (Dmcα1A), and one protein that is more similar to an outlying C. elegans protein than to known vertebrate calcium channels. Additional fly calcium channel subunits include one β, one γ2, and three α2 subunits.
The worm genome encodes over 80 potassium channel proteins (17); the fly genome has only 30. The extent to which these different family sizes contribute to the establishment of unique electrical signatures is unknown. The fly potassium channel family includes five Shaker-like genes (Shaker, Shab, Shal, and two Shaws); a large conductance calcium-activated channel gene (slowpoke); a slack subunit relative; three members of the eag family (eag, sei, and elk); one small conductance calcium-regulated channel gene; one KCNQ channel gene; and four cyclic nucleotide–gated channel genes. In addition, there are 50 TWIK members in the worm, but only 11 fly members of the two-pore/TWIK family with four transmembrane domains. There are also three fly members of the inward rectifier/two transmembrane family. Finally, neither the fly nor the worm has discernible relatives of a number of mammalian channel-associated subunits such as minK and miRP1.
There are also major differences postsynaptically. C. elegans has approximately 100 members of a family of ligand-gated ion channels (17); flies have about 50. The worm has 42 nicotinic acetylcholine receptor subunits and 37 GABA(A)-like receptor subunits; the fly contains only 11 nicotinic receptor subunit genes and 12 GABA(A)/glycine-like receptor subunit genes. In contrast, there are 30 members of the excitatory glutamate receptor family in the fly but only 10 in the worm. These include subtypes of the AMPA, kainate, NMDA, and delta families. In addition, the fly genome contains a large number of PDZ-containing genes, approximately a dozen of which encode proteins that have high sequence similarity to mammalian proteins that interact with specific subsets of ion channels. We also found a number of additional ion channel families, including three voltage-dependent chloride channels, 14 Trp-like channels, 24 amiloride-sensitive/degenerin-like sodium channels, one ryanodine receptor, one IP3 (inositol 1,4,5-trisphosphate) receptor, eight innexins, and two porins. C. elegans is missing a nitric oxide synthase gene, copies of which occur in fly and vertebrate genomes.
A large array of proteins mediates specific aspects of synaptic vesicle trafficking and contributes to the conversion of electrical signals to neurotransmitter release. These components of exocytosis and endocytosis are relatively well conserved with respect to both domain structures and amino acid identities (50 to 90%). The fly has enzymes for the synthesis of the neurotransmitters glutamate, dopamine, serotonin, histamine, GABA, acetylcholine, and octopamine, and a family of conserved transporters is likely to be involved in loading vesicles with these neurotransmitters. The conserved vesicular trafficking proteins, with 50 to 80% amino acid identity, include members of the Munc-18, SCAMP, synaptogyrin, HRS2, tomosyn, cysteine string protein, exocyst (SEC 5, 6, 7, 8, 10, 13, 15, EXO 70, and EXO 84), synapsin, rab-philin-3A, RIM, rab-3, CAPS, Mint, Munc-13, NSF, α and γ SNAP, DOC-2B, latrophilin, Veli, CASK, VAP-33, Snapin, SV2, and complexin families. Generally, there is only one homolog in Drosophila for every three to four isoforms in mammals. However, there are eight fly synaptotagmin-like genes, making this the largest family of vesicle proteins in Drosophila (47). In contrast, there is no homolog of synaptophysin, an early candidate for a vesicle fusion pore, which indicates a nonessential role in exocytosis for this particular protein across phyla.
Membrane trafficking also requires interactions between compartment-specific vesicular and target membrane proteins (v-SNAREs and t-SNAREs, respectively), whose subcellular distribution and combinatorial binding patterns are predicted to define organelle identity and targeting specificity (48). The completed fly genome allows us to address whether there is any correlation between the increased developmental complexity of multicellular organisms and a larger number of SNAREs than that found in unicellular organisms. In the fly, we find six synaptobrevins, three SNAP-25s, 10 syntaxins, and four additional t-SNAREs (membrin, BET1, UFE1, and GOS28), and the number of SNAREs is similar between yeast (49) and Drosophila. Thus, basic subcellular compartmentalization and membrane trafficking to and between these various compartments has not changed dramatically in multicellular versus unicellular organisms. Dynamin, clathrin, the clathrin adapter proteins, amphiphysin, synaptojanin, and a number of additional genes that encode proteins with defined endocytotic motifs are all present.
In contrast to the conservation of the synaptic vesicle trafficking machinery, the few identified proteins present at mammalian active zones, namely aczonin, bassoon, and piccolo, do not have relatives in Drosophila. There are, however, numerous proteins in the fly with combinations of C2 domains, PDZ domains, zinc fingers, and proline-rich domains, indicating that the precise protein composition of active zones is likely to vary among metazoans. In addition, Drosophila contains a neurexin III gene and four neuroligin genes that may be part of a neurexin-neuroligin complex that has been widely proposed to provide a synaptic scaffold for linking pre- and postsynaptic structures in mammals (50). Potential agrin and Musk genes are also present, though the overall sequence similarity is low.
Immunity
Multicellular organisms have elaborate systems to defend against microbial pathogens. Only vertebrates have an acquired immune system, but both vertebrates and invertebrates share a more primitive innate immune system. Innate immunity is based on the detection of common microbial molecules such as lipopolysaccharides and peptidoglycans by a class of receptors known as pattern recognition receptors (51). We identified a large family of genes encoding homologs of receptors that are involved in microbial recognition in other organisms. These include two new homologs of the Drosophila Scavenger Receptors (dSR-CI), nine members of the CD36 family, 11 members of the peptidoglycan recognition protein (PGRP) family, three Gram-negative binding protein (GNBP) homologs, and several lectins (52).
The recognition of infection by immuno-responsive tissues induces a battery of defense genes via Toll/nuclear factor kappa B (NF-κB) pathways in both Drosophila and mammals (53). The Toll receptor was initially discovered as an essential component of the pathway that establishes the dorsoventral axis of the Drosophila embryo. Recent genetic studies now reveal that Toll signaling pathways are key mediators of immune responses to fungi and bacteria in both Drosophila and mice (53). We found seven additional homologs of Toll proteins in Drosophila, all of which are more similar to each other than to their mammalian counterparts. Some of these other Toll proteins, like 18-wheeler, will probably mediate innate immune responses. In Drosophila, infection by at least some microbes induces a proteolytic cascade that leads to the processing of Spaetzle (SPZ), a cytokine-like protein, which then activates Toll (53). We found two proteins related to SPZ with similarities that include most or all of the cysteine residues of SPZ. Given the presence of multiple Toll-like receptors in Drosophila, these new SPZ-like proteins may also function in the immune system. With the exception of the two I-κB kinase homologs and the three rel proteins (Dorsal, Dif, and Relish), the Drosophila genome appears to contain only single copies of the genes encoding intracellular components of the Toll pathway: Tube, Pelle, and Cactus. How do the different Toll receptors trigger specific immune responses using the same intracellular intermediates? One explanation is that additional signaling components remain unidentified; another explanation is crosstalk with other signaling pathways. In contrast, a Toll ortholog has not been identified in C. elegans, although there are some Toll-like receptors. C. elegans, in addition, does not possess homologs of NF-κB/dorsal transcriptional activators that function downstream of Toll. Although it is probable that the worm has retained parts of the innate immunity network, there is no clear evidence of an inducible host defense system in the worm.
One of the most potent innate immune responses in insects is the transcriptional induction of genes encoding antimicrobial peptides (53). In contrast to Metchnikowin, Drosocin, and Defensin peptides, which are encoded by single genes, the sequence data indicate that, like the previously identified cecropin clusters, several antimicrobial peptides are encoded by gene families that are larger than previously suspected. Four genes appear to encode antifungal peptide Drosomycin isoforms, and two genes each code for the antibacterial proteins Attacin and Diptericin. These additional genes may generate peptides with slightly different spectra of antimicrobial activity or may simply amplify the antimicrobial response.
Concluding Remarks
What have we learned about the proteins encoded by the three sequenced eukaryotic genomes? Some information emerges readily from the comparison of the fly, worm, and yeast genomes. First, the core proteome sizes of flies and worms are similar and are only twice the size of that of yeast. This is perhaps counterintuitive, because the fly, a multicellular animal with specialized cell types, complex development, and a sophisticated nervous system, looks more than twice as complicated as single-celled yeast. The lesson is that the complexity apparent in the metazoans is not achieved by sheer number of genes (54). Second, there has been a proliferation of bigger and more complex proteins in the two metazoans relative to yeast, including, not surprisingly, more proteins with extracellular domains involved in cell-cell and cell-substrate interactions. Finally, the population of multidomain proteins is somewhat larger and more diverse in the fly than in the worm. There is presently no practical way to quantify differences in biological complexity between two organisms, however, so it is not possible to correlate this increased domain expansion and diversity in the fly with differences in development and morphology.
The availability of the annotated sequence of the Drosophila genome enhances the fly's usefulness as an experimental organism. By greatly facilitating positional cloning, the genome sequence will increase the efficiency of genetic screens that seek to identify genes underlying many complex processes of cell biology, development, and behavior. Such screens have been the mainstay of Drosophila research and have contributed enormously to our knowledge of metazoan biology. The genome sequencing effort has revealed a number of previously unknown counterparts to human genes involved in cancer and neurological disorders; for example, p53, menin, tau, limb girdle muscular dystrophy type 2B, Friedreich ataxia, and parkin. All of these fly genes are present in a single copy in the genome and can be genetically analyzed without uncertainty about redundant copies. More genetic screens are important in order to uncover interacting network members. Orthologs of these network members can then be sought in the human genome to determine if alterations in any of them predispose humans to the disease in question, an experimental paradigm that has already been successfully executed in several cases. Flies can also play an important role in exploring ways to rectify disease phenotypes. For example, at least 10 human neurodegenerative diseases are caused by expansion of polyglutamine repeats (55). Human proteins containing expanded polyglutamine repeats have been expressed in flies, resulting in the formation of nuclear inclusions that contain the protein as well as other shared components (56), just as in humans. It has been shown that directed expression of the human HSP70 chaperone in the fly can totally suppress neurodegeneration resulting from expression of the human spinocerebellar ataxia type 3 protein (57). The power and speed of this in vivo system are unparalleled, and we anticipate the increased use of such “humanized” fly models.
Knowing the complete genomic sequence also allows new experimental approaches to long-standing problems. For example, it makes it possible to study networks of genes rather than individual genes or pathways. Assaying the level of transcription of every gene in the genome makes it at least theoretically possible to monitor the expression of an entire network of genes simultaneously. One problem that is approachable this way is the combinatorial control of gene transcription. The fly genome appears to encode only about 700 transcription factors, and mutations in over 170 have already been isolated and characterized. The techniques are available to measure the changes in expression of every gene in individual cell types as a consequence of loss or overexpression of each transcription factor. We can look for common sequence elements in the promoters of coregulated genes and perform chromatin immunoprecipitation to identify the in vivo binding sites of individual factors. For the first time, we can envision obtaining the data needed to understand the behavior of a complex regulatory network. Of course, collecting these data is a massive task, and developing methods to analyze the data is even more daunting. But it is no longer ludicrous to try.
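The idea of looking for common sequence elements in the promoters of coregulated genes can be illustrated with a small sketch. The Python below is only a minimal illustration under stated assumptions and is not a method used in this study: it assumes that promoter sequences are already available as plain strings, and the function name `shared_kmers`, the toy sequences in `example_promoters`, and the fixed word length `k` are all hypothetical. It simply reports the k-letter words that occur in at least a chosen fraction of the promoters.

```python
from collections import Counter


def shared_kmers(promoters, k=6, min_fraction=1.0):
    """Return {k-mer: number of promoters containing it} for k-mers that
    occur in at least `min_fraction` of the given promoter sequences.

    `promoters` maps a gene name to its promoter sequence (hypothetical input).
    """
    if not promoters:
        return {}
    counts = Counter()
    for seq in promoters.values():
        seq = seq.upper()
        # Use a set so that each k-mer is counted once per promoter.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    needed = min_fraction * len(promoters)
    return {kmer: n for kmer, n in counts.items() if n >= needed}


# Made-up toy sequences, for illustration only.
example_promoters = {
    "geneA": "TTGACGTCATT",
    "geneB": "CCGACGTCAGG",
    "geneC": "AAGACGTCACC",
}

for kmer, n in sorted(shared_kmers(example_promoters, k=6).items()):
    print(kmer, n)
```

An exact-match word count of this kind is deliberately simple; a realistic analysis would also need to consider degenerate binding sites, both DNA strands, and how often each word is expected by chance.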
How big is the core proteome of humans? Vertebrates have many gene families with three or four members: the HOX clusters, calmodulins, Ezrins, Notch receptors, nitric oxide synthases, syndecans, and NF1 transcription factor genes are some examples (58). This is evidence for two genome doublings during mammalian evolution, superimposed on which were the amplifications and contractions over evolutionary time that uniquely characterize each lineage (59). The human genome, with 80,000 or so genes, is likely to be an amplified version of a very much smaller genome, and its core proteome may not be much larger than that of the fly or worm; that is, the more complex attributes of a human being are achieved using largely the same molecular components. The evolution of additional complex attributes is essentially an organizational one; a matter of novel interactions that derive from the temporal and spatial segregation of fairly similar components.
Finally, approximately 30% of the predicted proteins in every organism bear no similarity to proteins in its own proteome or in the proteomes of other organisms. In other words, sequence similarity comparisons consistently fail to give us information about nearly a third of the components that make every organism uniquely itself. What does this mean with respect to the evolution and function of these proteins? Does each genome contain a sub-population of very rapidly evolving genes? One-third of randomly chosen cDNA clones do not cross-hybridize between D. melanogaster and Drosophila virilis (60). Even though these are distantly related species, they are developmentally and morphologically very similar. Crystallographic data will be needed to determine whether these proteins that have diverged in primary sequence have maintained their three-dimensional structures or have diverged so far that new folds and domains have formed.
Our first look at the annotated fly genome provokes these and other questions. Access to the genomic sequence will help us design the experiments needed to answer them. The relative simplicity and manipulability of the fly genome mean that we can address some of these biological questions much more readily than in vertebrates. That is, after all, what model organisms are for.
References and Notes
1. Adams MD, et al. Science 2000;287:2185. [PubMed: 10731132] C. elegans Sequencing Consortium. Science 1998;282:2012. [PubMed: 9851916] Goffeau A, et al. Science 1996;274:546. [PubMed: 8849441]
2. Fleischmann RD, et al. Science 1995;269:496. [PubMed: 7542800]
3. C. elegans data were taken from A C. elegans Database (ACEDB) release WS8.
4. Local gene duplications were determined by searching for N similar genes within 2N genes on each arm. For example, if three similar genes are found within a region containing six genes, this counts as one cluster of three genes. Genes were judged to be similar if a BLASTP High Scoring Pair (HSP) with a score of 200 or more existed between them. Histone gene clusters were not included. C. elegans data were taken from ACEDB release WS8, containing 18,424 genes.
5. More information about GO is available at http://www.geneontology.org/. The Gene Ontology project provides terms for categorizing gene products on the basis of their molecular function, biological role, and cellular location using controlled vocabularies.
6. Initial results came from an NxN BLASTP analysis performed for each fly, worm, and yeast sequence in a combined data set of these completed proteomes. The databases used are as follows: Celera–Berkeley Drosophila Genome Project (BDGP), 14,195 predicted protein sequences (1/5/2000); WormPep 18, Sanger Centre, 18,576 protein sequences; and Saccharomyces Genome Database (SGD), 6306 protein sequences (1/7/2000). A version of NCBI-BLAST2 was used with the SEG filter and with the effective search space length (Y option) set to 17,973,263. Pairs were formed between every query sequence with a significant BLASTP match to one of the other organisms' sequences. Significance was based on E-value cutoffs and length of match. These pairs were then independently grouped using single linkage clustering (61). Finally, the number of proteins from each proteome was counted. The requirement for 80% alignment of sequences makes this method of defining orthology particularly sensitive to errors that arise from incorrect protein prediction. However, the results comparing yeast and worm are essentially identical to those previously reported (61), even though the effective database size was different, the data sets have changed (Chervitz: yeast 6217 and worm 19,099; this study: yeast 6306, and worm 18,576), and the version of BLAST used is quite different (Chervitz: WashU BLAST 2.0a19MP; this study: NCBI BLAST 2.08).
7. Bairoch A, Apweiler R. Nucleic Acids Res 2000;28:45. [PubMed: 10592178]
8. Henikoff JG, Greene EA, Pietrokovski S, Henikoff S. Nucleic Acids Res 2000;28:228. [PubMed: 10592233]
9. InterPro (Integrated resource for protein domains and functional sites) is a collaborative effort of the SWISS-PROT, TrEMBL, PROSITE, PRINTS, Pfam, and ProDom databases to integrate the different pattern databases into a single resource. The database and a detailed description of the project can be found under http://www.ebi.ac.uk/interpro/. PROSITE is described in Hofmann K, Bucher P, Falquet L, Bairoch A. Nucleic Acids Res 1999;27:215. [PubMed: 9847184]; PFAM is described in Bateman A, et al. Nucleic Acids Res 1999;27:260. [PubMed: 9847196]; and PRINTS is described in Attwood TK, et al. Nucleic Acids Res 1999;27:220. [PubMed: 9847185]
10. Plowman GD, Sudarsanam S, Bingham J, Whyte D, Hunter T. Proc Natl Acad Sci U S A 1999;96:13603. [PubMed: 10570119]
11. Barrett, J.; Rawlings, ND.; Wessner, JF., editors. Handbook of Proteolytic Enzymes. Academic Press; San Diego, CA: 1998.
12. Smith CL, DeLotto R. Nature 1994;368:548. [PubMed: 8139688] Konrad KD, Goralski TJ, Mahowald AP, Marsh JL. Proc Natl Acad Sci U S A 1998;95:6819. [PubMed: 9618496] LeMosy EK, Hong CC, Hashimoto C. Trends Cell Biol 1999;9:102. [PubMed: 10201075]
13. Hynes RO. Trends Cell Biol 1999;9:M33. [PubMed: 10611678]
14. Bork P, Downing AK, Kieffer B, Campbell ID. Quart Rev Biophys 1996;29:119.
15. Vernier P, Cardinaud B, Valdenaire O, Philippe H, Vincent JD. Trends Pharmacol Sci 1995;16:375. [PubMed: 8578606] Colas J, Launay J, Vonesch J, Hickel P, Maroteaux L. Mech Dev 1999;87:77. [PubMed: 10495273] Costa MR, Wilson ET, Wieschaus E. Cell 1994;76:1075. [PubMed: 8137424]
16. Mombaerts P. Science 1999;286:707. [PubMed: 10531047]
17. Bargmann CI. Science 1998;282:2028. [PubMed: 9851919]
18. Clyne PJ, et al. Neuron 1999;22:327. [PubMed: 10069338] Vosshall LB, Amrein H, Morozov PS, Rzhetsky A, Axel R. Cell 1999;96:725. [PubMed: 10089887] Laissue PP, et al. J Comp Neurol 1999;405:543. [PubMed: 10098944]
19. Lin YJ, Seroude L, Benzer S. Science 1998;282:943. [PubMed: 9794765]
20. Zhang Y, Xiong Y, Yarbrough WG. Cell 1998;92:725. [PubMed: 9529249]
21. Jones SN, Roe AE, Donehower LA, Bradley A. Nature 1995;378:206. [PubMed: 7477327]
22. The I, et al. Science 1997;276:791. [PubMed: 9115203]
23. Ito N, Rubin GM. Cell 1999;96:529. [PubMed: 10052455]
24. Hengartner MO, Horvitz HR. Cell 1994;76:665. [PubMed: 7907274]
25. Hauser F, Nothacker HP, Grimmelikhuijzen CJ. J Biol Chem 1997;272:1002. [PubMed: 8995395]
26. Mueller PR, Coleman TR, Kumagai A, Dunphy WG. Science 1995;270:86. [PubMed: 7569953]
27. Dynlacht BD, Brook A, Dembski M, Yenush L, Dyson N. Proc Natl Acad Sci U S A 1994;91:6359. [PubMed: 8022787] Du W, Vidal M, Xie JE, Dyson N. Genes Dev 1996;10:1206. [PubMed: 8675008] Sawado T, et al. Biochem Biophys Res Commun 1998;251:409. [PubMed: 9792788]
28. Lu X, Horvitz HR. Cell 1998;95:981. [PubMed: 9875852]
29. Kreis, T.; Vale, R., editors. Guidebook to the Cytoskeletal and Motor Proteins. Oxford Univ Press; Oxford: 1999.
30. Chang P, Stearns T. Nature Cell Biol 2000;2:30. [PubMed: 10620804]
31. Dutcher SK, Trabuco EC. Mol Biol Cell 1998;9:1293. [PubMed: 9614175]
32. Desai A, Verma S, Mitchison TJ, Walczak CE. Cell 1999;96:69. [PubMed: 9989498]
33. K. Weber, in (29), pp. 291–293.
34. Kumar J, Yu H, Sheetz MP. Science 1995;267:1834. [PubMed: 7892610]
35. Wu Q, Maniatis T. Cell 1999;97:779. [PubMed: 10380929]
36. Senzaki K, Ogawa M, Yagi T. Cell 1999;99:635. [PubMed: 10612399]
37. Belvin MP, Anderson KV. Annu Rev Cell Dev Biol 1996;12:393. [PubMed: 8970732] Hammerschmidt M, Brook A, McMahon AP. Trends Genet 1997;13:14. [PubMed: 9009843] Blaumueller CM, Artavanis-Tsakonas S. Perspect Dev Neurobiol 1997;4:325. [PubMed: 9171446] Hunter T. Philos Trans R Soc London Ser B 1998;353:583. [PubMed: 9602534] Cadigan KM, Nusse R. Genes Dev 1997;11:3286. [PubMed: 9407023] Capdevila J, Belmonte JC. Curr Opin Genet Dev 1999;9:427. [PubMed: 10449357] Engstrom L, Noll E, Perrimon N. Curr Top Dev Biol 1997;35:229. [PubMed: 9292272] Stronach BE, Perrimon N. Oncogene 1999;18:6172. [PubMed: 10557109] Holland PWH, Garcia-Fernandez J, Williams NA, Sidow A. Development 1994;(suppl):125.
38. Ruvkun G, Hobert O. Science 1998;282:2033. [PubMed: 9851920]
39. Earnshaw WC, Martins LM, Kaufmann SH. Annu Rev Biochem 1999;68:383. [PubMed: 10872455] Zeuner A, Eramo A, Peschle C, DeMaria R. Cell Death Diff 1999;6:1075.
40. Liu X, Kim CN, Yang J, Jemmerson R, Wang X. Cell 1996;86:147. [PubMed: 8689682] Susin SA, et al. Nature 1999;397:441. [PubMed: 9989411]
41. Li P, et al. Cell 1997;91:479. [PubMed: 9390557]
42. Park AG. Trends Cell Biol 2000;10:394. Sahara S, et al. Nature 1999;401:168. [PubMed: 10490026]
43. Gross A, McDonnell JM, Korsmeyer SJ. Genes Dev 1999;13:1899. [PubMed: 10444588]
44. Miller LK. Trends Cell Biol 1999;9:323. [PubMed: 10407412]
45. Abrams JM. Trends Cell Biol 1999;9:435. [PubMed: 10511707]
46. Thress K, Henzel W, Shillinglaw W, Kornbluth S. EMBO J 1998;17:6135. [PubMed: 9799223]
47. Littleton JT, Serano TL, Rubin GM, Ganetzky B, Chapman ER. Nature 1999;400:757. [PubMed: 10466723]
48. Solner T, et al. Nature 1993;362:318. [PubMed: 8455717]
49. Jahn R, Sudhof TC. Annu Rev Biochem 1999;68:863. [PubMed: 10872468]
50. Ichtchenko K, et al. Cell 1995;81:435. [PubMed: 7736595]
51. Medzhitov R, Janeway CA Jr. Cell 1997;91:295. [PubMed: 9363937]
52. Pearson A. Current Opin Immunol 1996;8:20. Franc NC, et al. Immunity 1996;4:431. [PubMed: 8630729] Kang D, et al. Proc Natl Acad Sci U S A 1998;95:10078. [PubMed: 9707603] Lee WJ, et al. Proc Natl Acad Sci U S A 1996;93:7888. [PubMed: 8755572]
53. Hoffmann JA, Reichhart JM. Trends Cell Biol 1997;7:309. [PubMed: 17708965] Anderson KV. Curr Opin Immun 2000;12:13.
54. Miklos GLG. J Am Acad Arts Sci 1998;127:197.
55. Perutz M. Trends Biochem Sci 1999;24:58. [PubMed: 10098399]
56. Warrick JM, et al. Cell 1998;93:939. [PubMed: 9635424] Jackson GR, et al. Neuron 1998;21:633. [PubMed: 9768849]
57. Warrick JM, et al. Nature Genet 1999;23:425. [PubMed: 10581028]
58. Spring J. FEBS Lett 1997;400:2. [PubMed: 9000502]
59. Aparicio S. Trends Genet 2000;16:54. [PubMed: 10652527]
60. Schmid KJ, Tautz D. Proc Natl Acad Sci U S A 1997;94:9746. [PubMed: 9275195]
61. Chervitz SA, et al. Science 1998;282:2022. [PubMed: 9851918]
62. See www.sciencemag.org/feature/data/1049664.shl for complete protein domain analysis.
63. Paralogous gene families (Table 1) were identified by running BLASTP. A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263. Each protein was used as a query against a database of all other proteins of that organism. A clustering algorithm was then used to extract protein families from these BLASTP results. Each protein sequence constitutes a vertex; each HSP between protein sequences is an arc, weighted by the BLAST Expect value. The algorithm identifies protein families by first breaking all arcs with an E value greater than some user-defined value (1 × 10^−6 was used for all of the analyses reported here). The resulting graph is then split into subgraphs that contain at least two-thirds of all possible arcs between vertices. The algorithm is “greedy”; that is, it arbitrarily chooses a starting sequence and adds new sequences to the subgraph as long as this criterion is met. An interesting property of this algorithm is that it inherently respects the multidomain nature of proteins: For example, two multidomain proteins may have significant similarity to one another but share only one or a few domains. In such a case, the two proteins will not be clustered if the unshared domains introduce a large number of other arcs.
64. An NxN BLASTP analysis was performed for each fly, worm, and yeast sequence in a combined data set of these completed proteomes. The databases used are as follows: Celera-BDGP, 14,195 predicted protein sequences (1/5/2000); WormPep18, Sanger Centre, 18,424 protein sequences; and SGD, 6246 protein sequences (1/7/2000). BLASTP analysis was also performed against known mammalian proteins (2/1/2000, GenBank nonredundant amino acid, Human, Mouse, and Rat, 75,236 protein sequences), and TBLASTN analysis was performed against a database of mammalian ESTs (2/1/00, GenBank dbEST, Human, Mouse, and Rat). A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263.
65. The many participants from academic institutions are grateful for their various sources of support. Participants from the Berkeley Drosophila Genome Project are supported by NIH grant P50HG00750 (G.M.R.) and grant P41HG00739 (W.M.G.).
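The family-extraction procedure described in note 63 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_families`, the `(protein_a, protein_b, e_value)` input format, and the choice to re-test the two-thirds criterion each time a candidate is added are all hypothetical. Seeds are taken in sorted order only to make the sketch deterministic, whereas the note says the starting sequence is chosen arbitrarily.

```python
def greedy_families(hits, e_cutoff=1e-6):
    """Greedy, density-constrained grouping of proteins into families.

    hits: iterable of (protein_a, protein_b, e_value) tuples; this input
    format is hypothetical and stands in for parsed BLASTP output.
    """
    # Build an undirected graph, dropping arcs whose E value exceeds the cutoff.
    neighbours = {}
    for a, b, e_value in hits:
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if a != b and e_value <= e_cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    unassigned = set(neighbours)
    families = []
    # Sorted seed order is an assumption made for reproducibility.
    for seed in sorted(neighbours):
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        family, arcs = [seed], 0
        grew = True
        while grew:
            grew = False
            candidates = {n for m in family for n in neighbours[m]} & unassigned
            for cand in sorted(candidates):
                new_arcs = arcs + sum(1 for m in family if cand in neighbours[m])
                size = len(family) + 1
                possible = size * (size - 1) // 2
                # Accept only if the subgraph keeps >= 2/3 of all possible arcs
                # (integer arithmetic avoids floating-point edge cases).
                if 3 * new_arcs >= 2 * possible:
                    family.append(cand)
                    unassigned.discard(cand)
                    arcs = new_arcs
                    grew = True
        families.append(family)
    return families


# Made-up hits: A, B, C are mutually similar; D matches only C; E matches only D.
hits = [("A", "B", 1e-30), ("B", "C", 1e-20), ("A", "C", 1e-10),
        ("C", "D", 1e-8), ("D", "E", 1e-9), ("E", "F", 1e-3)]
print(greedy_families(hits))  # [['A', 'B', 'C', 'D'], ['E'], ['F']]
```

In this toy input, D joins the family because 4 of 6 possible arcs are present, while E is left out because adding it would drop the density to 5 of 10. That is the behaviour the note attributes to multidomain proteins whose unshared domains bring in many additional arcs.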
Fig. 1. Fly (F), worm (W), and yeast (Y) genes showing similarity to human disease genes. This collection of human disease genes was selected to represent a cross section of human pathophysiology and is not comprehensive. The selection criteria require that the gene is actually mutated, altered, amplified, or deleted in a human disease, as opposed to having a function deduced from experiments on model organisms or in cell culture. Due to redundancy in gene and protein sequence databases, a single reference sequence for each gene had to be chosen. Most reference sequences represent the longest mRNA of several alternatives in GenBank. Authoritative sources in the literature and electronic databases [Online Mendelian Inheritance in Man (OMIM)] were also consulted. In all, 289 protein sequences met these criteria. These were used as queries to search a database consisting of the sum total of gene products (38,860) found in the complete genomes of fly, worm, and yeast. 12,953 was used as the effective database size (the z parameter in BLAST). BLASTP searches were conducted as described for full genome searches, except for the z parameter. To control for potential frameshift errors in the Drosophila genome sequence, searches against a six-frame translation of the entire genome (using TBLASTN) were also conducted with the disease gene sequences using the z parameter above. Only two cases in which matches to genomic sequence were better than to the predicted protein were found, and these were manually corrected to reflect the better TBLASTN scores in the table. Results are scaled according to various levels of statistical significance, reflecting a level of confidence in either evolutionary homology or functional similarity. White boxes represent BLAST E values >1 × 10^−6, indicating no or weak similarity; light blue boxes represent E values in the range of 1 × 10^−6 to 1 × 10^−40; purple boxes represent E values in the range of 1 × 10^−40 to 1 × 10^−100; and dark blue boxes represent E values <1 × 10^−100, indicating the highest degree of sequence conservation. Actual E values can be found in the Web supplement to this figure (62), where links to OMIM and GenBank may also be found. A plus sign indicates our best estimate that the corresponding Drosophila gene product is the functional equivalent of the human protein, based on degree of sequence similarity, InterPro domain composition, and supporting biological evidence, when available. A minus sign indicates that we were unable to identify a likely functional equivalent of the human protein.
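The color scale in the legend amounts to binning each BLAST E value into one of four classes. A minimal sketch of that binning is given below; the function name `similarity_class` is hypothetical, and the handling of values that fall exactly on a boundary (1 × 10^−6, 1 × 10^−40, 1 × 10^−100) is an assumption, since the legend does not say which bin they belong to.

```python
def similarity_class(e_value):
    """Map a BLAST E value to the box color described in the Fig. 1 legend.

    Boundary handling is an assumption: values exactly on a boundary are
    placed in the less conserved of the two adjacent bins.
    """
    if e_value < 1e-100:
        return "dark blue"   # highest degree of sequence conservation
    if e_value < 1e-40:
        return "purple"
    if e_value <= 1e-6:
        return "light blue"
    return "white"           # no or weak similarity


for e in (1e-3, 1e-20, 1e-50, 1e-150):
    print(e, similarity_class(e))
```

An E value that BLAST reports as 0.0 falls into the dark blue class under this scheme.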
Table 1. Numbers of distinct gene families versus numbers of predicted genes and their duplicated copies in H. influenzae, S. cerevisiae, C. elegans, and D. melanogaster. Row one shows the total number of genes in each species. Row two shows the total number of all genes in each genome that appear to have arisen by gene duplication. Row three is the total number of distinct gene families for each genome. Each proteome was compared to itself using the same parameters as described in (63).
                                H. influenzae  S. cerevisiae  C. elegans  D. melanogaster
Total no. of predicted genes             1709           6241       18424            13601
No. of genes duplicated                   284           1858        8971             5536
Total no. of distinct families           1425           4383        9453             8065
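In the numbers as printed, row three equals row one minus row two for every species. The short check below makes that arithmetic explicit and also derives the percentage of genes scored as duplicated. Whether the authors obtained row three by this subtraction is not stated here, so the relation should be read as an observation about the table. The names `TABLE_1` and `duplicated_percent` are hypothetical.

```python
# Values copied from Table 1: (total predicted genes, genes duplicated, distinct families).
TABLE_1 = {
    "H. influenzae": (1709, 284, 1425),
    "S. cerevisiae": (6241, 1858, 4383),
    "C. elegans": (18424, 8971, 9453),
    "D. melanogaster": (13601, 5536, 8065),
}


def duplicated_percent(total, duplicated):
    """Percentage of predicted genes that appear to have arisen by duplication."""
    return round(100 * duplicated / total, 1)


for species, (total, duplicated, families) in TABLE_1.items():
    # Observed relation in the printed table: families = total - duplicated.
    assert total - duplicated == families
    print(f"{species}: {duplicated_percent(total, duplicated)}% duplicated")
```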
Table 2
Table 2A. Similarity of sequences in predicted proteomes of D. melanogaster, S. cerevisiae, and C. elegans. To be scored as a similarity, each pairwise similarity was required to extend over more than 80% of the length of the query sequence at an E value less than that indicated. For example, in “Fly proteins in Fly-yeast,” the column labeled E < 10^−10 shows the number and percentage of fly proteins that match yeast proteins at this E value or less and for which more than 80% of the length of the fly protein is aligned with the yeast protein. Each set of pairs was analyzed without consideration of the third proteome. The rows labeled “Fly-worm-yeast” report the composition of an independent clustering in which only groups containing a member from all three proteomes were counted. The numbers are slightly higher for the “Fly-worm-yeast” counts than for the “Fly-yeast” or “Worm-yeast” counts because of sequence bridging; that is, not all sequences within a group necessarily have a significant match to all other members of that group. See (6) for details.
                    E < 10^−10     E < 10^−20     E < 10^−50     E < 10^−100
                      (n)   (%)      (n)   (%)      (n)   (%)      (n)   (%)
Fly proteins in:
  Fly-yeast          2345  16.5     1877  13.2     1036   7.3      433   3.1
  Fly-worm           4998  35.2     4212  29.7     2442  17.2     1106   7.8
  Fly-worm-yeast     3303  23.3     2428  17.1     1113   7.8      435   3.1
Worm proteins in:
  Worm-yeast         2184  11.8     1768   9.5      933   5.0      374   2.0
  Fly-worm           4795  25.8     4004  21.6     2403  12.9     1092   5.9
  Fly-worm-yeast     3229  17.4     2439  13.1     1115   6.0      419   2.3
Yeast proteins in:
  Fly-yeast          1856  29.4     1567  24.8      891  14.1      376   6.0
  Worm-yeast         1704  27.0     1425  22.6      802  12.7      335   5.3
  Fly-worm-yeast     1833  29.1     1525  24.2      831  13.2      352   5.6

Table 2B. A comparison of D. melanogaster, C. elegans, and S. cerevisiae protein sequences to each other and to mammalian sequences (64). This table reports the number and percent of fly, worm, or yeast query sequences with similarities less than the indicated E value cutoffs. For example, in the “Fly vs. Yeast” comparison, 3986 or 28.1% of fly proteins have a similarity with a yeast protein with an E value less than 1 × 10^−10. EST E values are not directly comparable to protein E values, because the resulting alignments are shorter.
                    No similarity
                    E > 10^−4      E < 10^−10     E < 10^−20     E < 10^−50     E < 10^−100
                      (n)   (%)      (n)   (%)      (n)   (%)      (n)   (%)      (n)   (%)
Fly vs.
  Yeast              8177  57.6     3986  28.1     2677  18.9     1266   8.9      504   3.6
  Worm               5110  36.0     6743  47.5     5180  36.5     2832  19.9     1197   8.4
  Mammalian          5833  41.1     7032  49.5     5837  41.1     3580  25.2     1772  12.5
  Mammalian ESTs     5386  37.9     7329  51.6     5352  37.7     1775  12.5      110   0.8
Worm vs.
  Yeast             12541  68.0     3582  19.4     2378  12.9     1106   6.0      401   2.2
  Fly                8603  46.7     7138  38.8     5428  29.5     2880  15.6     1229   6.7
  Mammalian         10152  55.1     6550  35.6     4999  27.1     2782  15.1     1211   6.6
  Mammalian ESTs    10354  56.2     6005  32.6     4000  21.7     1170   6.4       68   0.4
Yeast vs.
  Fly                2614  41.9     2564  41.0     1910  30.6     1021  16.4      408   6.5
  Worm               2762  44.2     2358  37.8     1730  27.7      882  14.1      348   5.6
  Mammalian          3230  51.7     2340  37.5     1802  28.9      992  15.9      429   6.9
  Mammalian ESTs     3106  49.7     2319  37.1     1553  24.9      503   8.1       18   0.3
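The "sequence bridging" effect mentioned for Table 2A follows directly from single linkage clustering: two proteins end up in the same group whenever a chain of qualifying pairs connects them, even if they do not match each other directly. The sketch below illustrates this with a union-find structure. It is only an illustration under assumptions: the tuple layout `(query, subject, e_value, aligned_length, query_length)`, the `species:` prefix on identifiers, and the function names are hypothetical, and the 80% coverage test is applied to the query as described in the table caption.

```python
def single_linkage(hits, e_cutoff=1e-10, min_coverage=0.8):
    """Group proteins by single linkage over qualifying BLASTP pairs.

    hits: iterable of (query, subject, e_value, aligned_length, query_length);
    this layout is a hypothetical stand-in for parsed BLASTP output.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, e_value, aligned, query_len in hits:
        # Keep a pair only if it passes both the E-value and the coverage filter.
        if e_value < e_cutoff and aligned / query_len > min_coverage:
            parent[find(query)] = find(subject)

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return list(groups.values())


def count_three_way(groups, species=("fly", "worm", "yeast")):
    """Count proteins per proteome in groups that contain all three proteomes."""
    counts = {s: 0 for s in species}
    for group in groups:
        if {member.split(":")[0] for member in group} == set(species):
            for member in group:
                counts[member.split(":")[0]] += 1
    return counts


# Made-up example: F1 matches W1 and W1 matches Y1, but F1 never matches Y1 directly.
hits = [
    ("fly:F1", "worm:W1", 1e-30, 90, 100),
    ("worm:W1", "yeast:Y1", 1e-25, 85, 100),
    ("fly:F2", "yeast:Y2", 1e-40, 50, 100),   # fails the 80% coverage filter
    ("fly:F3", "worm:W3", 1e-5, 95, 100),     # fails the E-value filter
]
groups = single_linkage(hits)
print(count_three_way(groups))
```

Here F1 is counted in a fly-worm-yeast group even though it has no qualifying fly-yeast pair of its own, which is one way a three-proteome count can exceed a pairwise count.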
Table 3. Number of proteins in D. melanogaster (F), C. elegans (W), and S. cerevisiae (Y) containing the 200 most frequently occurring protein domains in D. melanogaster. Domain identifiers are from InterPro (9), a new database that has begun to integrate the independent databases of localized protein sequence patterns into a single resource. The beta release used includes PROSITE, PRINTS, and PFAM. InterPro considers a signature to be true if its score is above a threshold specified for that signature by the individual database. Results of the InterPro analysis may differ from results obtained based on human curation of protein families, due to the limitations of large-scale automatic classifications. In some instances, different InterPro domains correspond to different features of proteins within the same family; for example, IPR001650 and IPR001410 (26 and 42 in the table). See (62) for live links to the InterPro database.
No.   Acc. no.     F    W    Y  InterPro domain name
  1.  IPR000694  579  398   40  Proline-rich region
  2.  IPR000822  352  138   47  Zinc finger, C2H2 type
  3.  IPR000719  249  388  119  Eukaryotic protein kinase
  4.  IPR001254  199   13    1  Serine proteases, trypsin family
  5.  IPR001314  178    5    0  Chymotrypsin serine protease family (S1)
  6.  IPR001680  167   95   90  G-protein beta WD-40 repeats
  7.  IPR000504  160   92   55  RNA-binding region RNP-1 (RNA recognition motif)
  8.  IPR000495  153   70    0  Immunoglobulins & major histocompatibility complex proteins
  9.  IPR000345  145   17    7  Cytochrome c family heme-binding site
 10.  IPR000379  140  112   38  Esterase/lipase/thioesterase
 11.  IPR002290  138  171  110  Serine/Threonine protein kinases active-site
 12.  IPR002048  130   79   16  EF-hand family
 13.  IPR001356  113   88   10  Homeobox domain
 14.  IPR000561  110  109    0  EGF-like domain
 15.  IPR001611  108   48    7  Leucine-rich repeat
 16.  IPR001841  105  113   35  Zinc finger, C3HC4 type (RING finger)
 17.  IPR002356  100  335    0  G-protein coupled receptors, rhodopsin family
 18.  IPR001066   97   54   46  Sugar transporter
 19.  IPR001128   94   73    3  Cytochrome P450 enzyme
 20.  IPR002110   90   77   19  Ankyrin-repeat
 21.  IPR000618   87    0    0  Insect cuticle protein
 22.  IPR001245   87   63    0  Tyrosine kinase catalytic domain
 23.  IPR001440   82   46   34  TPR repeat
 24.  IPR000130   79   19    8  Neutral zinc metallopeptidases, zinc-binding region
 25.  IPR002380   78   41   22  Transforming protein P21 RAS
 26.  IPR001650   76   66   75  DNA/RNA helicase domain (DEAD/DEAH box)
 27.  IPR001617   72   56   32  ABC transporters family
 28.  IPR001849   71   67   27  PH domain
 29.  IPR001478   69   60    2  PDZ domain (also known as DHR or GLGF)
 30.  IPR001488   69    8    5  Myc-type, helix-loop-helix dimerization domain signature
 31.  IPR001051   67   61   38  ATP-binding transport protein, 2nd P-loop motif
 32.  IPR001993   67   43   35  Mitochondrial energy transfer proteins
 33.  IPR000734   66    9    4  Lipase
 34.  IPR000210   64  103    1  Btb/ttk domain
 35.  IPR000575   63   54   36  ATP/GTP-binding site motif A (P-loop)
 36.  IPR001452   63   55   25  Src homology 3 (SH3) domain
 37.  IPR001092   61   38    8  Helix-loop-helix DNA-binding domain
 38.  IPR002198   61   63   13  Short-chain dehydrogenase/reductase (SDR) superfamily
 39.  IPR002106   58   14   17  Aminoacyl-transfer RNA synthetases class-II
 40.  IPR001806   51   46   23  Ras family
 41.  IPR002347   50   22    1  Glucose/ribitol dehydrogenase family
 42.  IPR001410   46   43   48  DEAD/DEAH box helicase
 43.  IPR001777   46   43    2  Fibronectin type III domain
 44.  IPR000169   43   22    1  Eukaryotic thiol (cysteine) proteases active sites
 45.  IPR000521   42   44    6  Glutathione S-transferase
 46.  IPR001622   42   91    1  Potassium channel
 47.  IPR002557   42    6    0  Chitin binding domain
 48.  IPR000051   40   38   21  SAM (and some other nucleotide) binding motif
 49.  IPR002172   40   32    0  Low density lipoprotein (LDL)-receptor class A (LDLRA) domain
 50.  IPR000063   38   32   12  Thioredoxin family
 51.  IPR001623   38   29   22  DnaJ domain
 52.  IPR002018   38   44    0  Carboxylesterases type-B
 53.  IPR001304   37  165    0  C-type lectin domain
 54.  IPR000387   36   83   12  Tyrosine specific protein phosphatase
 55.  IPR000215   35    9    0  Serpins
 56.  IPR001005   35   16   19  Myb DNA binding domain
 57.  IPR001412   35   15   14  Aminoacyl-transfer RNA synthetases class-I
 58.  IPR001939   35   27   29  AAA-protein (ATPases associated with various cellular activities)
 59.  IPR001965   35   22   16  PHD-finger
 60.  IPR000008   34   34    9  Protein kinase C2 domain
 61.  IPR000608   34   18   16  Ubiquitin-conjugating enzymes
 62.  IPR001781   34   33    4  LIM domain
 63.  IPR000980   33   43    1  Src homology 2 (SH2) domain
 64.  IPR002213   33   59    0  UDP-glucoronosyl & UDP-glucosyl transferases
 65.  IPR000301   32   19    0  Transmembrane 4 family
 66.  IPR000934   31   56   21  Serine/threonine specific protein phosphatase family
 67.  IPR001251   31   16    6  CRAL/TRIO domain
 68.  IPR001881   31   34    0  Calcium-binding EGF-like domain
 69.  IPR002173   31    4    2  PfkB family of carbohydrate kinases
 70.  IPR000194   30    5   22  ATP synthase alpha & beta subunits
 71.  IPR000217   29   22    4  Tubulin family
 72.  IPR000873   29   23   11  AMP-binding domain
 73.  IPR000073   28   17   16  Alpha/beta hydrolase fold
 74.  IPR000152   28   28    0  Aspartic acid & asparagine hydroxylation site
 75.  IPR000408   28    6    3  Regulator of chromosome condensation (RCC1)
 76.  IPR000834   28    9    1  Zinc carboxypeptidases, carboxypeptidase A metalloprotease (M14) family
 77.  IPR001715   28   22    3  Calponin homology (CH) domain
 78.  IPR002086   28   13   13  Aldehyde dehydrogenase family
 79.  IPR002219   28   36    1  Phorbol esters/diacylglycerol binding domain
 80.  IPR000483   27    7    0  Leucine rich repeat C-terminal domain
 81.  IPR000886   27    8   11  Endoplasmic reticulum targeting sequence
 82.  IPR001175   27   81    0  Neurotransmitter-gated ion-channel
 83.  IPR000219   26   17    5  Dbl domain (dbl/cdc24 rhoGRF family)
 84.  IPR000626   26   27    9  Ubiquitin domain
 85.  IPR000629   26   22   20  ATP-dependent helicase, DEAD-box subfamily
 86.  IPR000859   26   55    0  CUB domain
 87.  IPR000958   26   21    6  KH domain
 88.  IPR001752   26   22    6  Kinesin motor domain
 89.  IPR002067   26   11    6  Mitochondrial carrier protein
 90.  IPR000205   25   22   10  NAD binding site
 91.  IPR000299   25   13    0  Band 4.1 family
 92.  IPR000449   25   10    8  Ubiquitin-associated domain
 93.  IPR000910   25   15    8  HMG1/2 (high mobility group) box
 94.  IPR001054   25   32    1  Guanylate cyclase
 95.  IPR001202   25   17    5  WW/rsp5/WWP domain
 96.  IPR000595   24   19    2  Cyclic nucleotide-binding domain
 97.  IPR000832   24   10    0  G-protein coupled receptors family 2 (secretin-like)
 98.  IPR001140   24   30   10  ABC transporter transmembrane region
 99.  IPR001214   24   27    6  SET-domain of transcriptional regulators (TRX, EZ, ASH1 etc)
100.  IPR001871   24   18   15  bZIP (Basic-leucine zipper) transcription factor family
101.  IPR002049   23   16    0  Laminin-type EGF-like (LE) domain
102.  IPR002111   23   21    2  Cation channels, 6TM region (transient receptor potential subtype)
103.  IPR000048   22   16    2  IQ calmodulin-binding domain
104.  IPR001353   22   12   14  Multispecific proteases of the proteasome
105.  IPR001810   22  215   11  F-box domain
106.  IPR002223   22   34    0  Pancreatic trypsin inhibitor (Kunitz) family
107.  IPR000718   21   29    0  Neprilysin metalloprotease (M13) family
108.  IPR000964   21   15    3  Sterile-alpha module (SAM) domain
109.  IPR001311   21   13    0  Solute binding protein/glutamate receptor domain
110.  IPR001394   21   24   18  Ubiquitin carboxyl-terminal hydrolases family 2
111.  IPR001594   21   13    6  DHHC-type Zn-finger
112.  IPR001628   21  224    0  C4-type steroid receptor zinc finger
113.  IPR002017   21   19    3  Spectrin repeat
114.  IPR002113   21    6    4  Adenine nucleotide translocator 1
115.  IPR002126   21   15    0  Cadherin domain
116.  IPR000195   20   17   12  RabGAP/TBC domain
117.  IPR000198   20   19   10  RhoGAP domain
118.  IPR000795   20   17   15  GTP-binding elongation factor
119.  IPR001930   20   11    4  Membrane alanyl dipeptidase, family M1
120.  IPR002422   20   14    7  Permeases for amino acids & related compounds, family II
121.  IPR000166   19   33   16  Histone-fold/TFIID-TAF/NF-Y domain
122.  IPR000690   19    8    7  RNA-binding protein C2H2 Zn-finger domain
123.  IPR001766   19   19    4  Fork head domain
124.  IPR002130   19   17    8  Cyclophilin-type peptidyl-prolyl cis-trans isomerase
125.  IPR002293   19   16   25  Permeases for amino acids & related compounds, family I
126.  IPR000175   18   12    0  Sodium:neurotransmitter symporter family
127.  IPR000330   18   20   17  SNF2 & others N-terminal domain
128.  IPR000742   18    9    0  EGF-like domain, subtype 2
129.  IPR000961   18   24   10  Protein kinase C terminal domain
130.  IPR001173   18   17    4  Glycosyl transferase, family 2
131.  IPR000242   17   76    3  Tyrosine specific protein phosphatases
132.  IPR000467   17   11    4  D111 domain
133.  IPR000636   17   22    1  Cation channels, 6TM region (non-ligand gated)
134.  IPR000717   17   13    8  Domain in components of the proteasome, COP9-complex & eIF3 (PCI)
135.  IPR000953   17   15    2  Chromo domain
136.  IPR001071   17    0    0  Alpha-tocopherol transport protein
137.  IPR001163   17   11   16  Small nuclear ribonucleoprotein (Sm protein)
138.  IPR001327   17    4    4  FAD-dependent pyridine nucleotide reductase
139.  IPR001395   17   11    6  Aldo/keto reductase family
140.  IPR001734   17    3    1  Sodium:solute symporter family
141.  IPR001757   17   22   17  E1-E2 ATPases phosphorylation site
142.  IPR001791   17   16    0  Laminin-G domain
143.  IPR001873   17   22    0  Amiloride-sensitive sodium channel
144.  IPR001969   17    8   42  Eukaryotic & viral aspartyl proteases active site
145.  IPR000087   16  166    0  Collagen triple helix repeat
146.  IPR000253   16    6   16  Forkhead-associated (FHA) domain
147.  IPR000536   16   88    0  Ligand-binding domain of nuclear hormone receptor
148.  IPR001320   16   10    0  Ligand-gated ion channel
149.  IPR001487   16   13   10  Bromodomain
150.  IPR002027   16   11   24  Amino acid permease
151.  IPR002046   16    1    1  SAR1 GTP-binding protein family
152.  IPR000014   15    8    1  Generalized PAS domain
153.  IPR000172   15    1    0  GMC oxidoreductases
154.  IPR000251   15   12    7  ADP-ribosylation factors family
155.  IPR000569   15    5    5  HECT-domain (Ubiquitin-transferase)
156.  IPR000772   15   12    0  Lectin domain of ricin b-chain, 3 copies
157.  IPR001223   15   34    1  Glycosyl hydrolases family 18
158.  IPR001609   15   20    5  Myosin head (motor domain)
159.  IPR001828   15   19    0  Receptor family ligand binding region
160.  IPR002129   15    7    1  Pyridoxal-dependent decarboxylase family
161.  IPR002465   15    1    0  Growth factor & cytokine receptor family signature 2
162.  IPR000159   14   11    2  Ras-associated (RalGDS/AF-6) domain
163.  IPR000225   14    6    2  Armadillo/plakoglobin ARM repeat
164.  IPR000279   14   10    8  Actin
165.  IPR000566   14    6    0  Lipocalin & cytosolic fatty-acid binding protein
166.  IPR000577   14    3    2  Carbohydrate kinase, FGGY family
167.  IPR000746   14    0    0  Pheromone/general odorant binding protein, PBP/GOBP family
168.  IPR000884   14   27    0  Thrombospondin type I domain
169.  IPR001100   14    5    3  Pyridine nucleotide-disulfide oxidoreductase, class I
170.  IPR001159   14    9    2  Double-stranded RNA binding (DsRBD) domain
171.  IPR001199   14    8    5  Cytochrome B5
172.  IPR001357   14   21   11  BRCT domain
173.  IPR001589   14    8    1  Actinin-type actin-binding domain
174.  IPR001753   14   10    3  Enoyl-CoA hydratase/isomerase
175.  IPR001878   14   24    9  Zn-finger CCHC type
176.  IPR001952   14    0    1  Alkaline phosphatase family
177.  IPR002216   14   17    1  Ion transport protein
178.  IPR002464   14    9    8  DEAH-box subfamily ATP-dependent helicase
179.  IPR000107   13    8    3  SPRY domain
180.  IPR000425   13    8    6  MIP family
181.  IPR000508   13    2    3  Signal peptidase
182.  IPR000727   13   14   15  t-SNARE coiled-coil domain
183.  IPR000901   13    6    7  Carbamoyl-phosphate synthase
184.  IPR001461   13   16    7  Pepsin (A1) aspartic protease family
185.  IPR001506   13   36    0  Astacin (Peptidase family M12A) family
186.  IPR001523   13   11    0  ‘Paired box’ domain
187.  IPR001827   13    2    0  ‘Homeobox’ antennapedia-type protein
188.  IPR001876   13    7    1  Zn-finger in ranbp & others
189.  IPR002423   13    8    9  TCP-1 (Tailless complex polypeptide)/cpn60 chaperonin family
190.  IPR002893   13    8    1  MYND finger
191.  IPR000461   12    4    8  Alpha amylase
192.  IPR000798   12    4    0  Ezrin/radixin/moesin family
193.  IPR001023   12   13   14  Heat shock protein hsp70
194.  IPR001508   12    1    0  NMDA receptor
195.  IPR001683   12    9   15  PX (Bem1/NCF1/PI3K) domain
196.  IPR001917   12    6    4  Aminotransferases class-II
197.  IPR001932   12    9    8  Protein phosphatase 2C
198.  IPR000050   11    8    0  Phosphotyrosine interaction domain (PID)
199.  IPR000182   11    9    9  Acetyltransferase (GNAT) family
200.  IPR000243   11    2    7  Proteasome B-type subunit
Table 4. The 10 InterPro protein domains occurring in the largest number of different proteins in S. cerevisiae and C. elegans.
Acc. no. InterPro domain name No. of proteins
S. cerevisiae
IPR000719 Eukaryotic protein kinase 119
IPR001680 G-protein beta WD-40 repeats 90
IPR001650 DNA/RNA helicase domain (DEAD/DEAH box) 75
IPR001138 Fungal transcriptional regulatory protein, N-terminus 60
IPR001042 TYA transposon protein 57
IPR000504 RNA-binding region RNP-1 (RNA recognition motif) 55
IPR001410 DEAD/DEAH box helicase 48
IPR000822 Zinc finger, C2H2 type 47
IPR001066 Sugar transporter 46
IPR001969 Eukaryotic and viral aspartyl proteases active site 42
C. elegans
IPR000168 7-Helix G-protein coupled receptor, nematode (probably olfactory) family 545
IPR000694 Proline-rich region 398
IPR000719 Eukaryotic protein kinase 388
IPR002356 G-protein–coupled receptors, rhodopsin family 335
IPR001628 C4-type steroid receptor zinc finger 224
IPR001810 F-box domain 215
IPR000087 Collagen triple helix repeat 166
IPR001304 C-type lectin domain 165
IPR002900 Domain of unknown function 142
IPR000822 Zinc finger, C2H2 type 138
Table 5. Proteins in D. melanogaster, C. elegans, and S. cerevisiae with more than one InterPro domain. These numbers represent the total number of recognizable domains within a single protein, no matter whether they are multiple copies of the same domain or different domains.
InterPro domains per protein; D. melanogaster (number of proteins); C. elegans (number of proteins); S. cerevisiae (number of proteins)
2 920 1236 410
3 388 458 121
4 219 182 58
5 163 98 26
6 101 72 17
7 92 53 15
8 58 27 7
9 42 25 4
10 22 18 7
11–15 73 43 6
16–20 18 17 1
21–30 22 22 0
31–50 8 5 0
51–75 4 5 0
Table 6. Proteins in D. melanogaster, C. elegans, and S. cerevisiae with multiple different InterPro domains. Individual InterPro domains are counted only once per protein, regardless of how many times they occur in that protein.
Unique InterPro domains per protein; D. melanogaster (number of proteins); C. elegans (number of proteins); S. cerevisiae (number of proteins)
2 1474 1248 402
3 413 335 95
4 156 114 23
5 52 38 4
6 8 9 1
7 or more 4 3 0
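Tables 5 and 6 count the same proteins in two different ways: Table 5 counts every domain occurrence in a protein, whereas Table 6 counts each distinct InterPro accession once. A minimal sketch of the two tallies follows. The function name `domain_histograms` and the protein-to-domain assignments in the example are made up for illustration. Only the accession numbers themselves are taken from the tables above.

```python
from collections import Counter


def domain_histograms(protein_domains):
    """Return (total_counts, unique_counts) histograms of domains per protein.

    protein_domains: dict mapping a protein ID to the list of InterPro
    accessions found in it, with repeats listed once per occurrence.
    Only proteins with more than one domain (Table 5 style) or more than
    one distinct domain (Table 6 style) are tallied.
    """
    total_counts, unique_counts = Counter(), Counter()
    for domains in protein_domains.values():
        if len(domains) > 1:
            total_counts[len(domains)] += 1
        if len(set(domains)) > 1:
            unique_counts[len(set(domains))] += 1
    return total_counts, unique_counts


# Hypothetical proteins, for illustration only.
example = {
    "protein_1": ["IPR000822", "IPR000822", "IPR000822"],  # three copies of one domain
    "protein_2": ["IPR000719", "IPR001245"],               # two different domains
    "protein_3": ["IPR001680"],                            # single domain, not tallied
}
total_counts, unique_counts = domain_histograms(example)
print(dict(total_counts))   # {3: 1, 2: 1}
print(dict(unique_counts))  # {2: 1}
```

A protein with several copies of one domain, such as `protein_1` here, contributes to the Table 5 style tally but not to the Table 6 style tally.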