
Biochem. J. (2009) 417, 621–637 (Printed in Great Britain) doi:10.1042/BJ20082063

REVIEW ARTICLE

The origin, evolution and structure of the protein world

Gustavo CAETANO-ANOLLÉS*1, Minglei WANG*, Derek CAETANO-ANOLLÉS*† and Jay E. MITTENTHAL†

*Department of Crop Sciences, University of Illinois at Urbana-Champaign, 1101 W. Peabody Drive, Urbana, IL 61801, U.S.A., and †Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, 601 S. Goodwin Avenue, Urbana, IL 61801, U.S.A.

Contemporary protein architectures can be regarded as molecular fossils, historical imprints that mark important milestones in the history of life. Whereas sequences change at a considerable pace, higher-order structures are constrained by the energetic landscape of protein folding, the exploration of sequence and structure space, and complex interactions mediated by the proteostasis and proteolytic machineries of the cell. The survey of architectures in the living world that was fuelled by recent structural genomic initiatives has been summarized in protein classification schemes, and the overall structure of fold space explored with novel bioinformatic approaches. However, metrics of general structural comparison have not yet unified architectural complexity using the ‘shared and derived’ tenet of evolutionary analysis. In contrast, a shift of focus from molecules to proteomes and a census of protein structure in fully sequenced genomes were able to uncover global evolutionary patterns in the structure of proteins. Timelines of discovery of architectures and functions unfolded episodes of specialization, reductive evolutionary tendencies of architectural repertoires in proteomes and the rise of modularity in the protein world. They revealed a biologically complex ancestral proteome and the early origin of the archaeal lineage. Studies also identified an origin of the protein world in enzymes of nucleotide metabolism harbouring the P-loop-containing triphosphate hydrolase fold and the explosive discovery of metabolic functions that recapitulated well-defined prebiotic shells and involved the recruitment of structures and functions. These observations have important implications for origins of modern biochemistry and diversification of life.

Key words: evolution, fold superfamily, organismal diversification, protein domain, proteome, tripartite world.

INTRODUCTION

Protein molecules are vital components of life. Together with functional RNA, they are primarily responsible for the many biological activities of the cell. Proteins define the enzymatic chemistries and transport processes characteristic of metabolic pathways, regulate gene expression and many other molecular functions, are involved in signal transduction, and make up the actual molecular and cellular machinery that fuels life. They are highly diverse and embed hierarchically many layers of molecular organization. Their evolution is complex and constrained by aspects of molecular structure, thermodynamics and function. In the present review, we examine the structure of the modern protein world and discuss how evolutionary genomics and structural bioinformatics have helped to dissect the origin and history of modern proteins. We also discuss how the discovery of structure and function in the contemporary protein world has affected the distribution of molecules in proteomes.

PROTEIN STRUCTURE

Polypeptide chains fold into highly ordered architectures that embed protein function. These folds minimize the energy conformations of individual amino acid residues in the chain, maximize hydrogen-bonding of polar groups and form compact and well-packed 3D (three-dimensional) atomic structures that bury hydrophobic residues away from the aqueous environment. Physically, they represent spatial arrangements of more or less wound helices that locally distort the bond geometry of the polypeptide backbone (3₁₀-helix, α-helix, π-helix and polyproline II-helix) and extended chain segments called β-strands that can establish long-range interactions and form β-sheets in parallel and antiparallel arrangements (often curved into open and closed barrel structures). These conformational elements (‘helical’ and ‘sheet’), originally proposed by Pauling and Corey [1] as building blocks of proteins, are defined fundamentally by hydrogen-bonding interactions of closely or distantly related regions of the polypeptide chain and make up approx. two-thirds of protein structure. They are separated by loop regions (turns and coils), which are more or less rigid stretches of the backbone that delimit their direction in space. An example is the β-hairpin, a reverse turn that links two adjacent strands and forms an antiparallel β-sheet.

Linderstrøm-Lang and Schellman in the 1950s [2] realized that protein structure was hierarchical and proposed four levels of structural organization (complexity): (i) primary structure, the sequence of amino acids linked by peptide bonds; (ii) secondary structure, the hydrogen-bonding patterns that give rise to helix and sheet elements in the fold; (iii) tertiary structure, the actual fold of the molecule stabilized mainly by side-chain interactions of elements of secondary structure; and (iv) quaternary structure, the aggregation of separate polypeptide chains into a supramolecular biological unit. Function fully manifests when these four levels of complexity are achieved. The recognition that aspects in the structure of proteins are redundant and modular (see below) led to the addition of new levels: (i) supersecondary structures, recurrent motifs of secondary structure, such as α-α-hairpins, β-β-hairpins

Abbreviations used: CATH, Class, Architecture, Topology, and Homologous superfamily; F, fold; FF, fold family; FSF, fold superfamily; HMM, hidden Markov model; Hsp, heat-shock protein; HSR, heat-shock response; nd, node distance; PDUG, protein domain universe graph; SCOP, Structural Classification of Proteins; 3D, three-dimensional.

1 To whom correspondence should be addressed (email [email protected]).

© The Authors Journal compilation © 2009 Biochemical Society

www.biochemj.org


Figure 1 Levels of molecular organization in the protein world

The hierarchy and complexity of proteins is illustrated with the ATP synthase complex, a highly abundant protein ensemble responsible for ATP synthesis in the cell. The 600 kDa complex can be separated into two subunits, F1 and Fo, which can be studied individually. Transmembrane proton gradients drive rotation of the C-subunit ring of Fo, which then propels rotation of the central stalk and the F1 head of the complex. This rotation causes conformational changes in F1 active sites that result in ATP synthesis/hydrolysis. The α-subunit of the complex has three domains, the central of which has a P-loop hydrolase fold that is highlighted in the Figure. Different levels of structure occur at different levels of resolution (scale in Å, where 1 Å = 0.1 nm) and at different rates in evolution. For example, the discovery of ∼2 × 10²⁴ total sequences (considering homogenization of sequence diversity at population level; see the text) suggests that new genotypes are produced on Earth at intervals of fractions of a microsecond. A similar argument can be used with secondary structure. The average length of a helical segment is 10 ± 2 residues and of a sheet element 5 ± 1 residues, and their average number is 6 ± 2 and 7 ± 3 per protein chain respectively (see Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Then the maximum number of possible permutations of elements of different length (∼7 residues) in groups of five to ten is 2.8 × 10⁸ (i.e. 7¹⁰). If all these permutations are accessible, the rate of maximum discovery approximates 0.1 structural arrangement per year. Rates of discovery at higher structural levels use frequentist arguments that relate the number of known structural arrangements to time. For example, the discovery of the ∼4 × 10⁴ distinct domains indexed in SCOP [82,83] spread uniformly over a ∼4 billion year timeline renders a frequency of discovery of one domain every 10⁵ years. PDB codes used: 1BNF and 1QO1.
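The frequentist rate estimates quoted in the Figure 1 legend can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical, the 4-billion-year timeline is taken from the legend and applied to all three levels, and the Julian year length is an assumption made here for the unit conversion.

```python
# Back-of-the-envelope check of the discovery rates quoted in the Figure 1 legend.
# All names are illustrative; the input numbers come from the legend itself.

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
HISTORY_YEARS = 4e9                     # ~4-billion-year timeline of life


def years_per_discovery(n_items: float, years: float = HISTORY_YEARS) -> float:
    """Average waiting time (years) if n_items are spread uniformly over the timeline."""
    return years / n_items


# Sequence level: ~2 x 10^24 total sequences over the history of life.
seq_interval_s = years_per_discovery(2e24) * SECONDS_PER_YEAR

# Secondary-structure level: 7^10 permutations of structural elements.
permutations = 7 ** 10
perm_rate_per_year = permutations / HISTORY_YEARS

# Domain level: ~4 x 10^4 SCOP domains.
domain_interval_years = years_per_discovery(4e4)

if __name__ == "__main__":
    print(f"one new sequence every {seq_interval_s:.1e} s")
    print(f"{permutations:.1e} permutations, {perm_rate_per_year:.2f} per year")
    print(f"one new domain every {domain_interval_years:.0e} years")
```

Under these assumptions the sequence interval comes out below a tenth of a microsecond, the permutation rate is of the order of 0.1 per year, and the domain interval is 10⁵ years, in line with the orders of magnitude given in the legend.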

and β-α-β-structures, sometimes repeated in tandem (e.g. in leucine-rich repeat proteins) or producing structures with or without internal symmetry (e.g. β-α-β-structures in TIM barrels) [3]; (ii) protein domains, compact units within the fold that act as structural modules and appear singly or in combination with other domains in multidomain proteins [4]; and (iii) multiprotein complexes, heteromultimeric assemblies of functionally related proteins that act as high-order functional units (e.g. molecular machines such as the ribosome, the proteasome and the dynein complex) [5]. Figure 1 describes how hierarchical complexity correlates with degree of molecular detail and time needed to develop each level of organization in evolution. Note, however, that levels of organizational complexity can be blurred by how structural elements are defined and new levels may arise with increased knowledge of protein structure and drivers of protein evolution. Similarly, the rate of change associated with structure is notoriously difficult to estimate. However, and within orders of magnitude, more complex structures arise through accumulation of many changes at lower levels and therefore take considerably more time to arise in evolution.

Linderstrøm-Lang’s laboratory studies of protein denaturation and proteolysis and the accessibility of protein bonds that are buried also helped shape the idea that protein structure was highly dynamic [6]. This ultimately materialized in Anfinsen’s thermodynamic hypothesis of folding that postulates the native structure of a protein results from spontaneous refolding of denatured states [7] into the thermodynamically stable structure [8], linking primary to tertiary structure in proteins. It is now apparent that proteins indeed achieve native structure quickly and efficiently through a myriad of conformational changes that are influenced by the solvent [9–11]. This involves a complex interplay of simple pairwise and co-operative interactions that tend to stabilize protein structure towards its native state, a state in which frustration (conflicting interactions) is minimal and a ground energetic state and funnel topography dominates the local folding landscape (energy surface; Figure 2A). In reality, folding appears to materialize through a progressive organization of an ensemble of partially folded structures that resembles a rugged funnel, with trajectories defined by a series of steps of local optimization that minimize the free energy accessible to the polypeptide chain and conflicting energy contributions from the relative position of individual residues and solvent [9–11]. This complex interplay sometimes occurs in the presence of kinetic traps that complicate the landscape, characteristic of rugged funnels (Figure 2A). Folding should be regarded as a transition of disorder to order in a global optimization process. The ‘zipping and assembly’ hypothesis captures for example the essence of this process, describing microscopic routes of folding that start from a polypeptide sequence and materialize in a time series of smaller and smaller conformation ensembles [12]. These routes involve local metastable structures in the protein chain that progressively assemble into more global ones. The energy landscape has inspired physics-based modelling with semi-empirical atomic force fields that fold molecules in a computer and provide ab initio understanding of the forces and dynamics that govern the folding process. Great progress has been made, for example with helical and β-hairpin peptides and small proteins, using modern force fields, Boltzmann sampling and/or other considerations [12–17]. Pathways of folding and unfolding derived from Molecular Dynamics simulations are now supported by experimentation with analysis of transition states, determination of intermediates with NMR spectroscopy and denatured states. One example is the folding and unfolding of the three-helix bundle protein from the Engrailed homeodomain of Drosophila melanogaster at atomic resolution [18,19].

This multidimensional energy landscape provides a statistical view of the energetics of protein conformations, but also manifests at evolutionary levels, as single-molecule structure variants (and associated conformational ensembles) generated by mutation are culled by natural selection and other evolutionary constraints (e.g. self-organization [20]). This landscape becomes evolutionary when fitness values are assigned to phenotypes (Figure 2B). Here, fitness embodies the advantageous contribution of individual molecules to the reproductive success of organismal lineages. When biopolymers mutate, they embark on an exploration of the space of possible sequences (defined by a Hamming metric


Figure 2 Current paradigms on folding and evolution of proteins

(A) Folding funnel-shaped representation of the energy landscape that describes the transition of protein conformations from disorder to order. The free energy of a protein is displayed as a function of the number of conformations at each energy level (density of states) that are derived from the partition function and describe the topological arrangements of the polypeptide chain in space that are possible. Proteins fold co-operatively by channelling protein folding intermediates (non-native states) downhill into the funnel and achieving the native state at its base, after avoiding the kinetic traps of the rugged landscape. Note, however, that proteins do not remain folded. Native proteins are slightly more stable than their denatured states so that they fold and unfold every few minutes, setting the pace for change in the funnel. (B) Mapping of sequence (genotype) space into structure (phenotype) space and then into fitness. The first map is many-to-one and unfolds by single mutational steps in sequence space. The second map assigns real numbers to structures given some function that distils the phenotype. (C) Evolutionary dynamic representation of protein evolution. The mapping of mutating protein sequences (genotypes) into structures (phenotypes) defines neutral sets in sequence space, ensembles of sequences that fold into a given native structure and neutral networks, subset with sequences that are tightly linked by series of single point mutations. Neutral nets corresponding to four different protein folds are coloured differently in a planar representation of the multidimensional space of sequences. Mutation causes sequences to drift along the neutral nets. However, the search for thermodynamic and kinetic folding optimality (described by a third dimension) tailors evolutionary trajectories keeping them within space attractors for individual folds (illustrated as funnels). When mutational trajectories (paths in the graph) reach new neutral networks, new attractors and new folds are discovered. An animated version of this Figure can be seen at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm.

of elemental mutational moves) and associated structures and functions. This exploration takes the form of adaptive walks in sequence space that optimize thermodynamic, kinetic and mutational features in molecules. The mapping of sequence (genotype) into structure (phenotype) has been shown to be tractable in RNA [21] and also in proteins [22] and has three fundamental properties: (i) there are many more sequences than structures (i.e. the sequence-to-structure map is highly degenerate); (ii) few common, but many rare, structures materialize in structure space; and (iii) extensive neutral networks that percolate sequence space define common structures and structural neighbourhoods [23,24]. Within these neutral networks (anticipated by Maynard Smith [25] and in response to Salisbury [26]), structure is impervious to mutational change at the sequence level and, because the distribution of sequences that fold into the same structure (shape) is approximately random, the mapping has ‘shape space covering’ properties. This means that all structures can materialize (are accessible) within relatively few mutational changes in sequence space. This property is especially true for RNA and has been confirmed experimentally using functional molecular switches that have been engineered by in vitro evolution [27]. The existence of neutral networks and shape space covering has also been predicted for polypeptides [28], paraphrasing conclusions from lattice models with simplified alphabets [29–31]. In these studies, independent adaptive walks in sequence space can produce a given structure despite lacking significant sequence similarity, matching the recurrent observation that sometimes seemingly unrelated sequences can harbour a given fold [32]. At the same time, and because of shape space covering, sequences that fold into completely different structures may differ by a few critical amino acid residues. Consequently, extensive neutral networks enable the efficient exploration of sequence space, whereas shape space covering ensures a constant rate of structural discovery. These properties match, for example, some recent in vitro evolution experiments [33] that show extensive regions in natural proteins exhibit functions refractory to mutational change [34] and demonstrate that discovery of function in random peptide libraries is facile (e.g. [35,36]). However, the sequence-to-structure mapping of proteins is much more complex and its landscape is ‘holey’ when compared with RNA, with proteins folding into native states missing in vast segments of sequence space. Although the neutrality of protein sequence space is much higher than that of RNA (>90% of single amino acid substitutions are neutral [37]), protein structures appear to concentrate in dense clusters [38,39], whereas RNA structures spread through sparsely connected networks [40,41]. Under a ‘superfunnel’ paradigm supported by experimentation [42,43], protein sequences drift along neutral networks and are sometimes trapped into funnels (Figure 2C), defined by sequences that are mutationally more stable (they tolerate the largest number of mutations) and, at the same time, are thermodynamically more stable [37,39]. These ‘attractors’ in neutral space are sometimes replaced by more fit ones through smooth transitions mediated by excited states that tend to occur between similar structural phenotypes and genotypes [44] (Figure 2C). Smooth adaptive walks of this kind [25] explain enzyme promiscuity [45] and reconcile recent experiments that show that proteins optimized for novel function arise before the original function is lost [46,47]. They can also explain gene duplication and divergence and the effect of epistatic thresholds of stability that buffer the effects of deleterious mutations on fitness [48]. The superfunnel paradigm therefore links the energetic landscape of folding with the evolutionary dynamics of molecules in percolating neutral networks.
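The notions of a Hamming metric on sequence space, a many-to-one sequence-to-structure map and neutral single-point mutations can be made concrete with a deliberately crude toy model. The sketch below is not a folding model and is not taken from the studies cited: the four-letter alphabet, the hydrophobic/polar pattern used as a stand-in "phenotype" and all function names are assumptions chosen only to show how neutral neighbours are counted.

```python
# Toy genotype-to-phenotype map (illustrative assumption, not a folding model).
# Alphabet: two hydrophobic residues (A, V) and two polar residues (S, D).
ALPHABET = "AVSD"
HYDROPHOBIC = set("AV")


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def phenotype(seq: str) -> str:
    """Stand-in 'structure': the hydrophobic (H) / polar (P) pattern of the sequence."""
    return "".join("H" if r in HYDROPHOBIC else "P" for r in seq)


def one_mutant_neighbours(seq: str) -> list[str]:
    """All sequences at Hamming distance 1 from seq."""
    return [
        seq[:i] + s + seq[i + 1:]
        for i, r in enumerate(seq)
        for s in ALPHABET
        if s != r
    ]


def neutral_neighbours(seq: str) -> list[str]:
    """One-mutant neighbours that keep the same toy phenotype."""
    target = phenotype(seq)
    return [n for n in one_mutant_neighbours(seq) if phenotype(n) == target]


if __name__ == "__main__":
    seq = "AVSD"
    print(phenotype(seq), len(one_mutant_neighbours(seq)), len(neutral_neighbours(seq)))
```

In this toy map every phenotype of length n is encoded by 2ⁿ sequences, so the map is degenerate by construction, and a walk through neutral neighbours changes the sequence without changing the phenotype. Real sequence-to-structure maps are far less regular, as the text notes when it contrasts the ‘holey’ protein landscape with that of RNA.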

Selection for compact and stable fold architectures is also maintained at higher levels of organization by more complex cellular infrastructure, which adds further evolutionary constraints on protein architecture. For example, the HSR (heat-shock response) is a fundamental cytoplasmic mechanism common to the three domains of life (Archaea, Bacteria and Eukarya) [49]. When subjected to temperature increases, five groups of Hsps (heat-shock proteins) are induced (Hsp100s, Hsp90s, Hsp70s, Hsp60s and small Hsps). These include chaperones, proteases, ATPases and DNA-repair proteins that mend damage and mediate non-covalent folding, unfolding, assembly and disaggregation of proteins. Hsp synthesis is of crucial importance for thermotolerance of organisms such as hyperthermophilic archaea, which appear to exhibit minimal, but highly tailored, protein-folding systems [50]. Within these groups of molecules, chaperonins are megadalton ring assemblies that mediate ATP-dependent protein folding to the native state (e.g. the bacterial chaperonin GroEL and its co-chaperonin GroES) through complex allosteric mechanisms [51]. Prefoldins deliver nascent unfolded proteins to these cytosolic chaperones as they exit the ribosome, establishing specific interactions with actin and tubulin in eukaryotes [52]. Since proteins exhibit a generic tendency to aggregate in the high macromolecular concentrations of intracellular compartments (molecular crowding) [53], proteins that unfold or remain unfolded are tagged and degraded by the ubiquitin–proteasome proteolytic pathway [54]. However, in eukaryotes, more complex systems guarantee the correct folding of a protein. These proteostasis control systems regulate protein concentration, the conformation of folds and complexes, and cell localization [55–57]. They involve interactions between the folding polypeptide and cellular components such as chaperones, co-chaperones, folding enzymes and components of small-molecule metabolism that stabilize the folded state and stress sensors, including the HSR and the UPR (unfolded protein response) [57]. This adds an additional layer of complexity to the already difficult folding problem and additional evolutionary constraints to the discovery of protein architecture.

PROTEIN DIVERSITY AND THE HIERARCHICAL STRUCTURE OF THE PROTEIN WORLD

Proteins are covalently bonded linear heteropolymers made up of 20+ amino acid monomers with a specific primary sequence of side chains spaced at regular intervals. From this perspective, the roughly 10³–10⁵ protein sequences per genome that are encoded in the genomes of the ∼10⁷–10⁸ species that exist on Earth [58], most of which are microbial [59], cover necessarily only a minute fraction (∼10¹⁰–10¹³ variants) of the enormous permutational space defined by amino acid sequence (∼10³²¹–10⁴⁶⁹ possible arrangements in sequence space), given recent estimates of average protein length in genomes [60,61]. In these calculations we assume there is no intraspecies variation, even though it is unlikely that members of a given reproducing population will be identical. In fact, if we consider that the (4–6) × 10³⁰ prokaryotic microbial cells in our planet (which account for ∼70% of life in certain habitats) have turnover rates of ∼8 × 10²⁹ cells per year [59] and that mutations in proteins occur in clock-like fashion at rates of ∼4 × 10⁻⁷ per microbial cell and per generation [62], then we would expect an upper boundary of ∼2 × 10³² total mutational amino acid changes in microbial proteins in the ∼4-billion-year-long history of life, which is still a minute fraction of sequence space (even if these concentrate in sequences that fold successfully).
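The gap between the sequences sampled by life and the size of sequence space is easiest to appreciate on a logarithmic scale. The sketch below reproduces the orders of magnitude quoted above under stated assumptions: a 20-letter alphabet, and chain lengths of 247 and 361 residues, which are hypothetical values back-calculated here to land in the quoted 10³²¹–10⁴⁶⁹ range and are not figures given in the text. Function names are illustrative.

```python
import math

AMINO_ACIDS = 20  # assumption: standard 20-letter alphabet (the text says "20+")


def log10_sequence_space(length: int) -> float:
    """log10 of the number of possible sequences of a given length, i.e. log10(20**length)."""
    return length * math.log10(AMINO_ACIDS)


def log10_sampled_variants(proteins_per_genome: float, species: float) -> float:
    """log10 of the number of protein variants, ignoring intraspecies variation."""
    return math.log10(proteins_per_genome * species)


if __name__ == "__main__":
    # Hypothetical average chain lengths, chosen only to match the quoted exponents.
    for length in (247, 361):
        print(f"length {length}: ~10^{log10_sequence_space(length):.0f} possible sequences")
    # Lower and upper bounds from the text: 10^3-10^5 proteins per genome, 10^7-10^8 species.
    print(f"sampled: 10^{log10_sampled_variants(1e3, 1e7):.0f} to "
          f"10^{log10_sampled_variants(1e5, 1e8):.0f} variants")
```

Working with exponents rather than the numbers themselves avoids overflow and makes the comparison direct: the sampled fraction differs from the full space by more than 300 orders of magnitude under these assumptions.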

This limited molecular exploration of sequence space has nevertheless encountered considerable diversity at higher levels of structural organization, mostly because of the neutral net and shape space covering properties we discussed above. Whereas the rate of discovery of new sequence genotypes on Earth appears to occur at incredible pace and generate considerable sequence diversity, rates at higher levels of protein organization decrease progressively and in a substantial manner (Figure 1). Sequence variants develop within fractions of microseconds. However, secondary structure variants take considerably longer to be discovered, whereas complex 3D structures arise once in hundreds of millions of years. This is an expected outcome. Higher structural levels are linked directly to function and are therefore the subject of natural selection and strong evolutionary constraint [63,64]. Sequence genotypes have a limited alphabet and change constantly by mutation, making them poor repositories of molecular history. In fact, the repeated accumulation of substitutions in nucleotide sites (site saturation) can erase evolutionary history at intermediate and deep evolutionary timescales [65–67]. In contrast, structural phenotypes have more complex alphabets that define function directly or through interactions of substructural, molecular and supramolecular components that are collectively responsible for function (e.g. in molecular ensembles), all of which are often carefully culled by natural selection. The effects of selection are consequently stronger at this level than at the genotype level and structural phenotypes are generally left unchanged over short, intermediate and long timescales. However, proteins evolve at vastly different rates, and recent studies suggest that this is due to differences in expression levels, functional roles and intra- and inter-molecular interactions [43,68–73]. For example, increases in the density of contacts (fraction of buried sites) in domains or entire proteins tend to increase evolutionary rates [73]. In contrast, increases in the number of binding interfaces of multi-interacting proteins tend to decrease rates [71]. Interestingly, positively selected amino acid sites were found preferentially located on the exposed surface of proteins [72]. Within individual proteins, different regions of the molecules are differentially constrained and were found to be quantitatively stable over billions of years of divergence [74]. Most notably, active sites and residues important for structural maintenance tend to evolve slowly and were refractory to mutation. However, the relationship between protein conservation and function is complex, especially when molecular redundancy, strength of natural selection and genome structure are taken into consideration [75].

Advances in comparative and structural genomics offer unprecedented opportunities to understand proteomic complexity and provide insights into the diversity and structure of the protein world [76]. The number of protein sequences and structures has expanded significantly in the last few years and its organization is clearly hierarchical (Figure 3). There are currently (as of November 2008) 875 completely sequenced genomes contributing to ∼6 million sequences. However, only a fraction of sequences are well annotated, and the number of unique entries at lower levels of structural organization continues to increase exponentially; the protein world remains uncharted at these levels [77]. In contrast, the number of new folds that are encountered every year is decreasing considerably, supporting the idea that the repertoire of architectural designs is finite (perhaps ∼1500 folds). A recent attempt to recreate all possible protein folds by ab initio folding from short homopolymeric sequences revealed all constructs matched folds in solved structures, and vice versa; all natural single-domain structures had analogues in the model set [78]. This suggests that our knowledge of single-domain folds is probably complete. In order to make sense of increasing information, a number of bioinformatic strategies of sequence and structural comparison have led to the creation of a wide range of protein classification schemes, all of which aim to group evolutionarily related proteins [79]. These catalogues organize sequences and proteins of known structure (currently described by ∼54000 Protein Data Bank entries) into taxonomies in an attempt to provide a global evolutionary view of the protein world. The first taxonomies described were originally based on the concept of the protein domain [80] and most modern classifications are still organized around this structural level [79]. This is predicated on the premise that domains are compact and

© The Authors Journal compilation © 2009 Biochemical Society

Evolution of the protein world 625

Figure 3 Progress in the experimental discovery of sequences and structures

(A) Protein architectures can be defined at different levels of protein hierarchy, using, for example, taxonomic classifications such as SCOP [82,83], with categories described with alphanumeric labels and identifiers. Currently, sets of approx. 1000 Fs, 1800 FSFs and 3500 FFs describe the world of proteins. (B) The continuous increase of the available numbers of sequences from the highly curated UniProtKB, protein structures from the PDB, and F, FSF and FF architectures in SCOP. The numbers of completely sequenced genomes that have been published (indexed in the Genomes Online Database [198]) have increased continuously from 1997 to 2007. Only the latest data were used if some databases had more than one release available in one year. Note that UniProtKB entries represent only a fraction of the ∼6 million sequences in UniProtKB/TrEMBL. (C) Proteins have physical structures that were designed and constructed by Nature (architecture) defined by the folding of the sequence at F, FSF or other levels of the structural hierarchy (domain structure) and by how domains combine with others (domain organization).

more-or-less independent globular folding elements, establish more abundant intradomain than interdomain residue contacts, and recur in different structural contexts (i.e. they act as modules, appearing singly or in combination with other domains). The recurrence concept is supported by a comparative framework on the basis of homology relationships and is fundamental. It defines the domain as an evolutionary unit of classification. Approx. 30 popular domain classifications based on sequence and/or structure are currently available; they use patterns and/or profiles in sequence to build libraries of domain families or establish distant relationships using structure comparisons. The Pfam database of multiple sequence alignments and HMMs (hidden Markov models), for example, is a comprehensive resource for the identification of domain families, repeats and motifs [81]. It provides two levels of curation, one based on automated domain sequence alignments (Pfam-B) and the other extended by HMM-based profile searches and literature analyses (Pfam-A), which serve as seeds for the iterative construction of HMMs. In contrast, SCOP (Structural Classification of Proteins) is a high-quality taxonomical resource that assigns domain boundaries manually at the structural level and applies the recurrence concept rigorously [82,83]. SCOP domains that are closely related at the sequence level (generally expressing >30% pairwise amino acid residue identities) are pooled into fold families (FFs), FFs sharing functional and structural features suggestive of a common evolutionary origin are unified further into fold superfamilies (FSFs), and FSFs that share similarly arranged and topologically connected secondary structures are grouped further into protein folds (Fs). Fs are then grouped into protein classes according to organization of secondary structure in the fold, defining the major α/β, α + β, all-α, all-β, small and multidomain groups. 
This architectural hierarchy (Figure 3A) to some extent mirrors the relative numbers of sequences and structures that have been discovered (Figure 3B). Unlike SCOP, the CATH (Class, Architecture, Topology, and Homologous superfamily) classification of proteins uses expert systems that automate most steps and classify domains that may or may not have been observed in other structural contexts [84,85]. CATH adds an additional hierarchical level (‘architecture’) over the fold classification (‘topology’) that describes the 3D arrangement of secondary structure but not its connectivity. A final example of structural classification is the DALI Dictionary, a fully automated non-hierarchical structural alignment system that uses domain recurrence to identify domains and provide lists of structural neighbours [86]. Interestingly, a comparative analysis of the SCOP, CATH and DALI taxonomies revealed remarkable agreement of protein assignments at fold (75%) and superfamily (80%) levels, with discrepancies attributed to different thresholds or manual curation [87]. In recent years, the different domain classifications have been consolidated by cross-listing and integration. For example, the InterPro consortium integrates protein classifications (including Pfam, SCOP and CATH) and maps protein families, domains, repeats and identifiable features of known proteins on to sequences in TrEMBL and Swiss-Prot [88].
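The nesting of FFs within FSFs, FSFs within Fs and Fs within classes can be made concrete with a small sketch. It assumes that each domain carries a dotted SCOP identifier of the form class.fold.superfamily.family (for instance c.37.1.1); the function names and the example identifiers below are illustrative only and are not taken from any SCOP software.

```python
from typing import Dict, Iterable, NamedTuple


class ScopLevels(NamedTuple):
    """Identifiers of one domain at four levels of the SCOP hierarchy."""
    protein_class: str  # e.g. "c"
    fold: str           # F,   e.g. "c.37"
    superfamily: str    # FSF, e.g. "c.37.1"
    family: str         # FF,  e.g. "c.37.1.1"


def parse_sccs(sccs: str) -> ScopLevels:
    """Split a dotted identifier into its class, F, FSF and FF labels."""
    parts = sccs.strip().split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    return ScopLevels(
        protein_class=parts[0],
        fold=".".join(parts[:2]),
        superfamily=".".join(parts[:3]),
        family=".".join(parts),
    )


def count_architectures(sccs_ids: Iterable[str]) -> Dict[str, int]:
    """Count the distinct F, FSF and FF labels in a set of domain identifiers."""
    parsed = [parse_sccs(s) for s in sccs_ids]
    return {
        "F": len({p.fold for p in parsed}),
        "FSF": len({p.superfamily for p in parsed}),
        "FF": len({p.family for p in parsed}),
    }


# Hypothetical identifiers chosen for illustration (not a real census).
EXAMPLE_IDS = ["c.37.1.1", "c.37.1.8", "c.1.2.3", "a.4.5.1"]
EXAMPLE_COUNTS = count_architectures(EXAMPLE_IDS)
```

Because every FF label contains its FSF label, and every FSF label its F label, the counts can only grow from F to FSF to FF, which is the pattern that the hierarchy in Figure 3(A) describes.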

Although taxonomies provide the framework needed to understand protein diversity, the definition of structure and associated functions remains challenging [89]. Protein architecture, the ‘fundamental build’ [the αρχι- (archi-) τεκτων (tekton)] of a protein, is modular (Figure 3C). Domains with different 3D structures (domain structures) combine with others in complex arrangements (defined here as domain organization). Domain structures associate with functions that are sometimes carried into the multidomain arrangement to increase enzyme specificity, provide links between other domains or regulate functional activity [90]. However, domain boundaries are difficult to establish and common topological elements that make up the folding core sometimes account for less than half of domain sequence [91]. Moreover, some CATH folds exhibit 3-fold variation in the number of secondary structures, and certain superfamilies show that secondary-structure embellishments often associate with change of function [91]. Peripheral regions of secondary structure can differ in size and conformation,

626 G. Caetano-Anolles and others

‘decorating’ the central folds of domains distinctively. Similarly, accretion of substructures around the core can result in functional diversity, as illustrated with the biochemistries that are linked to enzymes harbouring the thioredoxin-like fold [92]. To complicate matters, a measurement of how often fold substructures are shared by fold architectures (e.g. ‘gregariousness’) suggests some fold categories should be regarded as ‘neighbourhoods’ defined by how much structural overlap exists between them [93,94]. Some regions of the protein fold space therefore represent a continuum for certain architectural arrangements (sometimes linked by supersecondary motifs), whereas, in other regions, clearly distinct non-overlapping (discrete) topologies are observed. These regions can be best represented as a continuous and multidimensional environment [95]. Interestingly, detecting similarities between ligand-binding sites with a new structure–function comparison method tested the notion of a continuous fold space and revealed new evolutionary relationships across an existing discrete representation [96]. Finally, proteins can adopt multiple structures and functions, exhibiting conformational diversity and functional promiscuity. They can display ligand-independent conformational diversity, use structures to ‘moonlight’ different functions without involving their active sites or become promiscuous [45]. Chameleon sequences can adopt a distinct folded conformation under native conditions, and large-scale fold variations can alter topology in proteins [97]. This is complicated by the fluid nature of genome structure, which facilitates the rearrangement of domains [4,98]. These rearrangements are responsible for domains being both functional and structural subgenic modules that are highly plastic.

Remarkably, protein structures are unevenly distributed in the world of proteins [99]. Genome surveys have shown that families and folds in genomes follow power-law distributions and exhibit scale-free properties [100–102]. This behaviour results in a few folds that are highly popular (‘superfolds’ with many families; e.g. TIM barrel folds are widely distributed in metabolism) and many that appear infrequently (‘mesofolds’ and ‘unifolds’) [103]. It also implies a preference for duplication of genes encoding folds that are already common, as summarized in models that account for duplication, acquisition and loss of genes [102] or describe birth–death–innovation processes [104–106]. Interestingly, fold frequency plots for the microbial superkingdoms Archaea and Bacteria have steeper decay slopes than those for Eukarya, showing there is a larger level of architectural redundancy in the proteomes of complex organisms [99,107]. However, folds shared by all superkingdoms and folds shared by Eukarya and Bacteria (generally the most ancient; see below) fitted Gaussian-like distributions characteristic of random graphs, suggesting the spread of folds across superkingdoms is complex [107]. In order to explain the uneven proteomic distribution of structures, a number of phenomenological and physics-based models have been proposed that focus on functional constraints, convergence of sequences into structures (‘designability’), or evolutionary dynamic considerations, some of which invoke evolutionary processes of convergence or divergence and the paradigms described in Figure 2. They have been reviewed recently [108] and will not be discussed in detail here. Statistical mechanical approaches to the evolution of simple lattice model proteins nevertheless provide interesting insights into the workings of real proteins. Most notable is a recent microscopic ab initio model that considers not only the fate of genes, but also the survival of organisms [109]. 
The model is based on the central assumption that the death rate of an organism is determined by the stability of the least stable of its lattice model proteins. Simulations reveal exponential population growth once favourable sequence–structure combinations are discovered and collapse of these precursors into selected fold architectures, which remain stable and abundant at timescales greater than organismal lifetime. Protein families and superfamilies, and power-law distributions that match those of real proteins, arise as emergent properties of the physical model, which suggests new folds result from dominant folds by satisfying energetically favoured native conformations. This is provocative and clearly in line with emerging views of protein folding and evolution.
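The link between preferential duplication of common folds and an uneven distribution of fold abundance can be illustrated with a toy simulation. This is only a minimal sketch: it keeps duplication and innovation but omits gene loss, it is not any of the published birth–death–innovation models or the lattice model discussed above, and the function names, parameters and seed are arbitrary choices made for illustration.

```python
import random
import statistics
from collections import Counter
from typing import Tuple


def simulate_fold_growth(n_steps: int, p_innovation: float, seed: int = 0) -> Counter:
    """Grow a toy genome one gene at a time and return gene counts per fold.

    At each step a new fold is invented with probability p_innovation;
    otherwise a gene picked uniformly at random is duplicated, so folds that
    are already abundant are the most likely to gain another copy.
    """
    rng = random.Random(seed)
    genes = [0]          # fold label of every gene; start with one gene of fold 0
    next_fold = 1
    for _ in range(n_steps):
        if rng.random() < p_innovation:
            genes.append(next_fold)            # innovation: a new fold appears
            next_fold += 1
        else:
            genes.append(rng.choice(genes))    # duplication of an existing gene
    return Counter(genes)


def summarize(fold_sizes: Counter) -> Tuple[int, int, float]:
    """Return (number of folds, size of the largest fold, median fold size)."""
    sizes = list(fold_sizes.values())
    return len(sizes), max(sizes), statistics.median(sizes)


if __name__ == "__main__":
    n_folds, largest, median = summarize(simulate_fold_growth(5000, 0.1, seed=1))
    print(f"{n_folds} folds; largest has {largest} genes; median is {median}")
```

In runs of this kind a handful of early folds typically accumulate most of the genes while the majority remain rare, which is the qualitative signature of the ‘superfold’ versus ‘unifold’ contrast described above.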

PROTEIN EVOLUTION IN FOLD SPACE

Almost 150 years ago, in his seminal book, Charles Darwin established evolution by common descent as the dominant scientific explanation of biological diversity and change [110]. The divergence of species was illustrated with branching histories of inheritance (phylogenies) that allowed inference of ancestral links and tested evolutionary hypotheses. Phylogenetic thinking remains fundamental in evolutionary bioinformatics today and diversity and change are still illustrated with phylogenetic trees, graphical and mathematical representations (with branches and reticulations) that portray how contemporary entities share common ancestry. These trees have been particularly useful in the comparative analysis of nucleic acid and protein sequences and have had an impact on each and every discipline of biology, including genome science and informatics. They seed a holistic future [111].

The evolutionary classification of protein domains has been based on sequence and structural homologies that make use of phylogenetic tools and advanced bioinformatic methods [79]. Protein families group together sequences that share a common ancestry, but generally do so with a low hierarchical granularity; the reliability of comparative methods breaks down when reaching the so-called ‘twilight zone’ of <30% sequence identity. However, change in protein structure is linked directly to change in biological function. This has been recognized by structural genomic initiatives that seek to characterize exhaustively the major building blocks of proteins, and both structure and function have aided phylogenetic analyses when sequences fail to unite distant family relatives. Evolutionary relationships have been inferred directly from the structure of protein molecules, generally using formal methods of phylogenetic reconstruction [112–117]. These methods have been limited to analysis of closely related architectures with backbones and secondary structures that can be more or less superimposed. However, global views of the protein world that establish evolutionary relationships at superfamily or fold level require more involved and systematic approaches of classification. One strategy is to compare all proteins with each other and plot relationships on existing protein fold space, with structural similarities visualized at low dimensional level [118]. For example, Gauss integrals that describe protein backbones as space curves were used to construct a 30-dimensional vector that was then projected on a plane, producing 2D (two-dimensional) maps with fold distributions matching CATH classification [119]. These maps divide structures belonging to α, β and αβ classes of CATH into distinct groups. 
Similarly, matrices of DALI alignment scores in pairwise backbone comparisons of SCOP folds produced 3D representations that clustered folds belonging to the α/β, α + β, all-α and all-β protein classes [120] and allowed construction of a structural map [121]. Note, however, that a simple plot of overall length of helical segments against strand segments was able to dissect these classes without resorting to complicated algorithms (Figure 4A and see Supplementary Figure S1 and Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm and an animated version of Figure 4(A) at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm). Typically, α/β folds have interspersed helix and strand secondary

Figure 4 Phylogenomic analysis and evolution of major structural classes of globular proteins

(A) Grouping of proteins in the all-α, all-β, α/β and α + β classes according to features of secondary structure. The average total length of segments of secondary structure in a peptide chain was calculated using DSSP [199] secondary structure assignments in proteins (61175 peptide chains) from all PDB entries in SCOP version 1.69. These features were calculated from chains belonging to the same SCOP fold for all folds. Plots compared each feature of secondary structure with each other. The Figure shows only comparison of average total length of α-helical and β-strand segments. Averages are described in Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm. An animated version of this Figure can be seen at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm. (B) Universal phylogenomic tree of architectures reconstructed from a genomic census of protein domain structure and organization. A tree of architectures describing the evolution of domains and domain combinations at F level was reconstructed from a protein census in 266 genomes [200]. The census involved identifying domains using advanced HMMs of structural recognition and SCOP as reference. The three evolutionary epochs of the protein world are overlaid on the tree and are labelled with different shades (architectural diversification, light green; superkingdom specification, salmon; organismal diversification, yellow) and follow previous definitions [149]. Terminal leaves are not labelled since they would not be legible. Branches in red delimit the birth of architectures after the appearance of the first architecture unique to a superkingdom (broken line). The Venn diagrams show occurrence of architectures in the three superkingdoms of life. Pie charts show superkingdom distribution of architectures belonging to the four major categories of domain organization. The onset of the big bang of domain combinations is indicated in the tree. 
(C) Cumulative frequency distribution plots describing the appearance of all-α, all-β, α/β and α + β protein classes with only one domain as well as all the domain combinations with two domains or more than two domains along the branches of the tree described in (B). The cumulative number was given as a function of distance in nodes from the hypothetical ancestor (nd). The inset shows details of the accumulation of ancient domains and domain combinations. Information on trees of proteomes and architectures, data matrices and tracing exercises can be found in the MANET (Molecular Ancestry Networks) database [193] (http://manet.uiuc.edu).

structures, α + β folds segregate these elements within the molecule, and all-α and all-β proteins are mostly composed of helical or strand elements respectively. These simple plots reveal that helical segments were generally longer in α/β folds than in their α + β counterparts and shorter in all-β proteins, with strand segments being shorter in all-α proteins, the implication of which will be discussed below. Unfortunately, global views place structures in a continuum space and obscure fundamental architectural differences and heterogeneities that discrete views can capture. Other strategies that lack these shortcomings are therefore useful, including the generation of fold family trees based on rules of structural transformation [122,123], taxonomies based on similarity of secondary structural arrangements [124] and a PDUG (protein domain universe graph) representation of domains based on scores of structural similarity [125,126]. Some of these have captured salient natural features. For example, trees of secondary structures are in agreement with aspects of protein classification and suggest a simple mechanism of evolution that is in accord with a theory of folding based on the energetics of backbone hydrogen bonds [127]. The PDUG network representation of fold space is a graph that connects nodes (domains) with edges (structural similarities) in threshold-delimited clusters, and, similarly, captures the scale-free network topology that is typical of the protein world. However, problems associated with the systematic classification of structure at a topological level make it difficult, if not impossible, to find

a general metric of pairwise comparison that could be used for global analysis and would portray all complexities of structural organization [128]. One solution to this drawback is a ‘periodic table’-like construct that merges the use of rules with the comparative framework [129]. In this approach, proteins are compared and assigned to idealized fold representations, which describe molecules as layered systems of helical and sheet structures (with curl and stagger). The approach shifts the problem to finding appropriate definitions for the idealized constructs and understanding their evolutionary meaning through models of structural transformation.
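The idea of a graph of domains joined by structural similarity and read as threshold-delimited clusters can be sketched in a few lines. The example below is a generic connected-components routine written for illustration, not the published PDUG procedure; the domain labels, the similarity scores and the thresholds are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def threshold_clusters(
    scores: Dict[Tuple[str, str], float], threshold: float
) -> List[List[str]]:
    """Cluster domains by linking every pair whose score reaches the threshold.

    Clusters are the connected components of the resulting graph, returned
    largest first (ties broken alphabetically).
    """
    adjacency = defaultdict(set)
    nodes = set()
    for (a, b), score in scores.items():
        nodes.update((a, b))
        if score >= threshold:          # keep only sufficiently similar pairs
            adjacency[a].add(b)
            adjacency[b].add(a)

    clusters = []
    seen = set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                    # depth-first walk of one component
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return sorted(clusters, key=lambda c: (-len(c), c))


# Invented pairwise similarity scores between five hypothetical domains.
TOY_SCORES = {
    ("d1", "d2"): 9.5,
    ("d2", "d3"): 8.1,
    ("d3", "d4"): 3.0,
    ("d4", "d5"): 7.2,
}
```

With these toy scores a threshold of 7.0 yields two clusters, whereas a threshold of 9.0 leaves only d1 and d2 linked, showing how strongly the apparent partition of fold space depends on the cut-off that is chosen.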

PHYLOGENOMICS AND THE WORLD OF PROTEOMES

One fundamental limitation of most global approaches that try to unify fold space is that they do not embrace the ‘shared and derived’ tenet of evolutionary analysis. They are not truly phylogenetic. At present, there are no reliable procedures that can generate phylogenetic relationships at higher hierarchical levels of protein classification directly from the structure of proteins. Methods cannot yet reconstruct history because knowledge of how the ‘origami’ of protein folding evolves is still lacking. One solution to the conundrum of structure is to shift the focus of study from molecules to proteomes, the repertoire of all proteins of an organism. After all, proteins are encoded in the genomes of the many organisms that populate Earth. The rationale is simple. Proteins with structures that are fit will thrive in evolution. They will propagate in lineages through vertical descent, recruitment and convergent evolutionary processes, and their architectural designs will be used repeatedly in different contexts. Their history should be left imprinted in the actual fold constitution of the proteome, and a simple structural census of this historical repository should unlock the ‘tempo and mode’ of structural discovery. Here, we summarize the exciting findings that this novel approach has revealed.

Structures corresponding to validated crystallographic 3D models, catalogued, for example, in SCOP and CATH, have been assigned effectively to sequences present in proteomes using knowledge from domain classification and sequence and structure comparison tools such as profile-based sequence PSI-BLAST alignments, linear HMMs of structural recognition, and threading techniques [79]. Fold architectures were initially surveyed in a number of genomes [130–134] and this genomic demographic census was then indexed in several popular databases (e.g. PEDANT [135], SUPERFAMILY [136,137], and Gene3D [138,139]). The census is restricted to proteins for which a known structure can be inferred (currently, ∼60% of the proteome), but it is powerful. It allows, for example, identification of SCOP FSF architectures corresponding to individual domains in enzymes of metabolism [140,141] and exploration of arrangements of domains in biological units [142,143]. Studies reveal patterns in both domain structure and domain organization and suggest, for example, pervasive recruitment of structures and functions in biological networks and an extended combinatorial interplay of domain modules in proteomic repertoires. The census also provides indications of how organisms in different superkingdoms make use of architectures, revealing that fold abundance and distribution of folds among genomes are unlinked [144]. Curiously, abundant protein domains occurred in proportion to proteome size in a survey of five eukaryotic genomes, suggesting functional constraints between interacting domains kept domains at specific ratios in evolution [145].

Since protein structure is highly conserved (Figure 1), every instance of discovery or adoption of an F or FSF architecture

by a proteome represents a rare event in the history of the organismal lineage, and globally a rare event in the history of the protein world. These ‘molecular fossils’ are therefore excellent features (characters) for phylogenetic analysis. Gerstein [132] recognized this a decade ago and used fold occurrence in genomes and distance-based methods to build trees of proteomes (see Supplementary Figure S2 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Since then, a number of trees of life of this kind have been reconstructed from the occurrence and abundance of domain structures in proteomes [107,132,134,146–149] and from surveys of domain organization [150,151], matching patterns obtained from other sources of genomic information [152]. In all cases, the three superkingdoms appeared as distinct groups, confirming the tripartite nature of cellular life heralded by the school of Carl Woese [153]. Phylogenomic trees showed patterns that were in agreement with traditional classification, and also tested contentious hypotheses. For example, they backed the controversial grouping of chordates with arthropods (the Coelomata hypothesis), an observation supported by whole-genome trees (e.g. [154]) and recently confirmed by an analysis of the complete collection of phylogenies of gene sequences in the human genome (phylome) [155]. Moreover, some of these phylogenetic methods identified a root for the universal tree [107,149,150] (see Supplementary Figure S2) and suggested that diversified life originated in a proto-eukaryotic organism, a proposal for which there is now an emerging consensus [156] and which is also supported by phylogenetic analysis of the structure of rRNA [157]. It is noteworthy that parsimony considerations based simply on the survey of protein repertoires suggest the ancestor to the three superkingdoms was endowed with a virtual proteome akin to Eukarya [61]. 
A simple Venn diagram shows that two or three superkingdoms share the majority of F or FSF architectures and supports this view (see Supplementary Figure S3 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Most importantly, the fact that phylogenomic trees were able to reconstruct the evolution of life satisfactorily supported the existence of strong phylogenetic signal in the occurrence, abundance and organization of domains in proteomes. Trees of proteomes, however, could not reveal patterns of diversification of protein architecture directly, unless characters were traced along branches of the trees. For example, when domain sequences and architectures from 62 genomes were traced along a universal consensus phylogeny derived for whole-genome trees, convergent evolutionary processes that could not be explained by architectural loss were found to be rare events (∼2%) [158]. A recent study of Pfam domains in 96 genomes confirmed this important observation, although the number of convergent events in protein structural evolution was found to be larger (∼12%) [159]. This suggests that protein structures at high levels of organization diversify mostly by vertical descent, empowering the phylogenetic reconstruction exercise. Tracing domain occurrence patterns in trees of proteomes derived from fold occurrence and abundance [160] or universal trees reconstructed from the small subunit of rRNA [161] also allowed estimation of the relative age of individual folds and the antiquity of protein classes. The latter study assumes, however, that the history of a single (albeit central) RNA molecule and of proteomes is concordant, that there is a single origin of organismal superkingdoms, and that the bacterial outgroup chosen to root the universal tree is appropriate. As we will see below, all of these assumptions can be contentious.
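How fold occurrence can feed a distance-based tree of proteomes may be easier to see with a small numerical sketch. The Jaccard distance used here is an assumption made for illustration and is not necessarily the measure used in the studies cited above; the three proteomes and their fold repertoires are invented.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |shared folds| / |folds present in either proteome|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(
    proteomes: Dict[str, Set[str]]
) -> Dict[FrozenSet[str], float]:
    """Compute the distance between every pair of proteomes."""
    return {
        frozenset((x, y)): jaccard_distance(proteomes[x], proteomes[y])
        for x, y in combinations(sorted(proteomes), 2)
    }


def closest_pair(distances: Dict[FrozenSet[str], float]) -> Tuple[str, ...]:
    """Return the two proteomes with the smallest distance, in sorted order."""
    best = min(distances, key=lambda pair: (distances[pair], sorted(pair)))
    return tuple(sorted(best))


# Invented fold repertoires for three hypothetical proteomes.
TOY_PROTEOMES = {
    "proteome_A": {"c.37", "c.1", "a.4"},
    "proteome_B": {"c.37", "c.1"},
    "proteome_C": {"c.37", "d.58"},
}
TOY_DISTANCES = pairwise_distances(TOY_PROTEOMES)
```

A matrix of this kind is the usual input to distance-based tree-building methods such as neighbour joining, which would first join the closest pair (here proteome_A and proteome_B) and then attach the remaining proteome.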

In search of a direct approach and using a strategy that polarizes characters and builds rooted phylogenetic trees [157], we introduced a new phylogenetic method that generates timelines of architectural discovery and a global phylogenetic view of the protein world [107]. Data matrices that were used to build

trees of proteomes were transposed, normalized and used to reconstruct trees of architectures that were intrinsically rooted [107,149,150,162,163]. Evolution’s arrow was established directly by the evolutionary model, the rationale and assumptions of which have been reviewed recently within a framework of evolution of repertoires of components in living systems [164] and will not be revisited here. Supplementary Figure S3 shows the first tree of F architectures that was reported and examples of trees of F and FSF architectures reconstructed more recently using updated releases of SCOP and information in numerous proteomes. The leaves of the trees (taxa) are, in this case, domains (see Supplementary Figure S3) or domains and domain combinations (Figure 4B) visualized at F or FSF levels of classification. The rooted trees establish by definition evolutionary timelines of architectural discovery, with time measured by a relative distance in nodes from a hypothetical ancestor at their bases (node distance, nd). A timeline showing the rise of protein classes in evolution is described in Figure 4(C). These timelines reveal remarkable historical patterns in the structure of proteins and proteomes, and, as we describe below, define an origin for the modern protein world and illustrate how biological functions were discovered in time. We caution, however, that statements relate only to modern biochemistry, as modern molecules were used to reconstruct the past. Any claims of origin and evolution relate necessarily to the design and complexities of extant molecules, and not to those of predecessors that were perhaps lost in the evolutionary process.
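Two steps in this procedure lend themselves to a short illustration: transposing a proteome-by-architecture matrix so that architectures become the taxa, and reading a relative node distance off a rooted tree. The sketch below makes several assumptions that are not taken from the original method: the tree is written as nested tuples, nd is rescaled to run from 0 (most basal leaf) to 1 (most derived leaf), the normalization step is omitted because its scheme is not described here, and the tree topology and labels F1 to F4 are invented.

```python
from typing import Dict, List, Optional


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """Swap rows and columns, so rows of proteomes become rows of architectures."""
    return [list(column) for column in zip(*matrix)]


def count_nodes_from_root(
    tree: object, depth: int = 0, out: Optional[Dict[str, int]] = None
) -> Dict[str, int]:
    """Count, for each leaf, the internal nodes on the path from the root.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = depth
    else:
        for child in tree:
            count_nodes_from_root(child, depth + 1, out)
    return out


def relative_nd(tree: object) -> Dict[str, float]:
    """Rescale node counts to the interval [0, 1] (assumed definition of nd)."""
    raw = count_nodes_from_root(tree)
    lowest, highest = min(raw.values()), max(raw.values())
    span = highest - lowest
    return {leaf: (n - lowest) / span if span else 0.0 for leaf, n in raw.items()}


# Toy occurrence matrix: 3 proteomes (rows) by 2 architectures (columns).
TOY_MATRIX = [[1, 0], [1, 1], [0, 1]]

# Invented, fully unbalanced rooted tree of four hypothetical architectures.
TOY_TREE = ("F1", ("F2", ("F3", "F4")))
```

On the toy tree the leaf that branches off first (F1) receives nd = 0 and the last pair to diverge (F3 and F4) receives nd = 1, so ordering leaves by nd gives a relative timeline of their appearance.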

THE RISE OF DIVERSIFIED PROTEOMES, MODULARITY AND CELLULAR LIFE

The most notable feature of every tree of architectures that has been generated so far is that F or FSF domains widely distributed in Nature appear at its base and are consequently ancient. They are only found to be lacking in parasitic organisms with highly reduced genomes (e.g. Nanoarchaeum, Mycoplasma and Encephalitozoon), organisms known to have discarded enzymatic and cellular machinery in exchange for resources from their hosts [149]. The first nine F architectures to emerge in evolution are nevertheless common to every genome analysed and include architectures widespread in metabolism [165]; the evolution and structures of the five most basal are illustrated in Figure 5. One likely interpretation of early evolution of ancient architectures using the neutral net paradigm described above (Figure 2C) is given in Figure 5(B). As protein sequences harbouring the primordial fold drift by mutation in sequence space, new neutral nets are discovered that fold sequences into new fold structures, while variants within ancient and new folds continue to be discovered in the original neutral nets in an ongoing exploration of more stable and fit variants. The comparison of trees of F and FSF architectures supports this view, revealing a collection of proteins undergoing divergent, but concomitant, evolutionary processes that translate into patterns of recent (close relationship) or ancient origin (distant relationship) [163]. This is a consequence of the hierarchical nature of protein structure and the limited exploration of sequence and structure space. We expect, as corollary, a correlation between abundance and age of individual architectures and time-lapsed discovery of fold variants. 
Indeed, the distribution of branch lengths (longer towards the base) and the unbalanced shape of phylogenomic trees (Figure 4B and see Supplementary Figure S3) suggest strongly that architectural discovery involved semipunctuated evolutionary processes similar to those recently suggested for substitutional change in nucleic acids [166].

Figure 5 Evolution of the five most ancient folds

(A) Phylogenetic relationships at the base of a phylogenomic tree of domain structures at the F level of structural organization [165] together with examples of the different domain architectures. All ancient architectures share a common design of α-helices and β-strands that form barrel or highly symmetrical structures. The structural models illustrate the 3D arrangement of helices (cyan) and strands (mauve) separated by turns and coils (brown). Structures included are: c.37, P-loop NTP hydrolase fold of adenylate kinase from Methanococcus thermolithotrophicus (PDB code 1KI9), depicting a putative enzymatic origin of metabolism; a.4, DNA/RNA-binding three-helical bundle from the Trp repressor mutant V58I protein (PDB code 1JHG); c.1, TIM β/α barrel fold of inosine 5′-monophosphate dehydrogenase from Borrelia burgdorferi (PDB code 1EEP); c.2, NAD(P)-binding Rossmann fold of glyceraldehyde-3-phosphate dehydrogenase from Escherichia coli (PDB code 1GAD); d.58, ferredoxin-like fold of 7-Fe ferredoxin from Azotobacter vinelandii (PDB code 7FD1). (B) One of many possible interpretations of early evolution of the five most ancient architectures using the neutral net paradigm (Figure 2C). Circles of different colours illustrate neutral nets corresponding to each fold and embedding mutational walks in sequence space responsible for extant structural diversity at F level of hierarchical organization. F neutral nets should map FSF neutral nets and two nets corresponding to two lower levels of hierarchy of protein structure.

When the representation of architectures in organisms in Archaea, Bacteria and Eukarya was traced along the evolutionary timeline, patterns of origin and evolution of our contemporary tripartite world were clearly revealed ([149] and M. Wang, unpublished work). Ancient architectures were multifunctional and were shared by many organisms (free-living or parasitic) in the three superkingdoms [107,149,162,163]. These common architectures defined the so-called ‘architectural diversification’ epoch in protein evolution in which members of an ancestral community of organisms diversified their protein repertoires through differential loss (light-green-shaded area overlapping the tree of domain structure and organization described in Figure 4B). Remarkably, architectural loss occurred preferentially in organisms belonging to the lineages of Archaea, establishing the first organismal divide [149]. This reductive evolutionary strategy was protracted and perhaps induced by adaptation to the extreme physical conditions of early Earth. The early rise of Archaea matches recent evolutionary studies of the structure of tRNA [167] and universal trees of proteomes reconstructed from architectures identified to be ancient in the tree of architectures [149]. These trees of proteomes showed a rooting that was internal (paraphyletic) to the Archaea and was located between the Crenarchaeota and the Euryarchaeota, close to methanogenic archaeal species. This paraphyletic rooting is consistent with a mutational comparative analysis of tRNA paralogues that identified molecular species in the Archaea as slow-evolving and ancient [168] and the existence of ancestral genome characters such as split genes and operon organization [169]. It also has an impact on the interpretation of protein evolutionary tracing exercises that consider superkingdoms as evolutionarily unified groups, as these will identify not a single origin for proteins, but many [161]. A proposed multiple convergent (polyphyletic) origin of genes occurring after lineage diversification involving the modular reorganization of sequence [170] would, in fact, have the same effect. Nevertheless, these and many other lines of evidence suggest that Archaea is the most ancient lineage of the modern living world, an emerging view that is gaining consensus [156].

© The Authors Journal compilation © 2009 Biochemical Society

Figure 6 The architectural and functional complement of the communal ancestor

The complement defines 78 FSFs that appeared before the first architecture that was completely lost in a superkingdom (Archaea) in the tree of FSF architectures (see Supplementary Figure S3B at http://www.BiochemJ.org/bj/417/bj4170621add.htm). FSFs were grouped according to coarse-grained functional SUPERFAMILY [201] categories and subcategories (peripheral pie) and according to major classes of globular proteins in SCOP (central pie).
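The rule used to delimit the complement of the communal ancestor (Figure 6), namely all FSFs that appear before the first architecture completely lost in a superkingdom, can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function name `ancestral_complement` and the example identifiers and ages are hypothetical and are not taken from the published dataset.

```python
from typing import List, Set, Tuple

SUPERKINGDOMS = frozenset({"Archaea", "Bacteria", "Eukarya"})

# One record per FSF: (identifier, relative age, superkingdoms where present).
Record = Tuple[str, float, Set[str]]


def ancestral_complement(timeline: List[Record]) -> List[str]:
    """Return the FSFs that precede the first architecture absent from
    at least one superkingdom, scanning from oldest to youngest."""
    complement = []
    for fsf_id, _age, present in sorted(timeline, key=lambda r: r[1]):
        if not SUPERKINGDOMS <= set(present):
            break  # first architecture completely lost in a superkingdom
        complement.append(fsf_id)
    return complement


# Hypothetical example data (identifiers and ages are made up).
example = [
    ("fsf_A", 0.00, {"Archaea", "Bacteria", "Eukarya"}),
    ("fsf_C", 0.15, {"Bacteria", "Eukarya"}),          # lost in Archaea
    ("fsf_B", 0.05, {"Archaea", "Bacteria", "Eukarya"}),
    ("fsf_D", 0.20, {"Archaea", "Bacteria", "Eukarya"}),
]
complement = ancestral_complement(example)
```

Note that `fsf_D` is excluded even though it is universal, because it appears after the first loss event; the cut-off is defined by position on the timeline and not by present-day distribution alone.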

It is important to note that reductive tendencies in Archaea started at a time when superkingdom-specific architectures and present-day organismal lineages had not developed and life was probably communal [149]. In fact, a substantial portion of the protein world developed during this time and resulted in complex proteomes that were rich in functions and architectures (Figure 6). These results are, for example, in line with profiles from phylogenetic tracing of enzymes linked to bioenergetic processes [171], an architectural census [172] and recent ancestral state reconstruction of the gene content of the universal ancestor [173] that revealed a bioenergetically and functionally complex genome with a gene complement similar in number to that of extant free-living microbes (reviewed in [156]).

Following the architectural diversification epoch, superkingdom-specific and lineage-specific architectures appeared in evolution as the world of organisms expanded [149]. In this new ‘superkingdom specification’ epoch, new reductive tendencies were expressed in Bacteria and the superkingdoms were specified in what we believe was a protracted process (salmon-shaded areas in Figure 4B). Moreover, architectural representation decreased considerably with time until it approached zero, a point at which a large number of new architectures were clustered, each specific to a small number of organisms. Later on, an opposite trend took place, in which architectures that were more specialized and were specific to relatively small sets of organisms increased their representation in proteomes explosively. This architectural ‘big bang’ (paraphrasing that of the universe) involved the multiple combination and rearrangement of domains (Figures 4B and 4C) and the distribution of resulting multidomain proteins among emergent organismal lineages. We will not discuss the evolutionary patterns and processes that underlie these events since they have been discussed recently [98]. They involve, however, preferential additions and deletions of terminal domains and fusion and fission processes that engage (with bias) different domain modules in a combinatorial interplay. The rise of modularity in the protein world defines the ongoing ‘organismal diversification’ epoch (light yellow areas). During this last period, architectural novelties linked to multicellularity appeared massively and quite late both immediately after microbe diversification events (mostly folds common to organismal domains) and during eukaryotic diversification (mostly Eukarya-specific) [149,162]. This included multidomain architectures known to be associated with programmed cell death, adhesion and recognition of cells [162].

Proteome distribution patterns along the timeline have had an impact on the constitution of present-day genomes, with the architectural repertoire being the largest and most diverse in Eukarya, and the smallest and most homogeneous in Archaea, with Bacteria taking an intermediate position (see the pie charts of Figure 4B). Remarkably, the diverse repertoire of the Bacteria superkingdom was by necessity compartmentalized into the small proteomes of individual organisms (L. S. Yafremava and G. Caetano-Anollés, unpublished work).

EVOLUTIONARY TIMELINES AND THE DISCOVERY OF ARCHITECTURES AND FUNCTIONS

There were many remarkable patterns linked to structure in the trees and resulting evolutionary timelines. The most ancestral folds harboured barrels [e.g. the TIM β/α-barrel fold (c.1)] or interleaved β-sheets and α-helical architectures that packed helices to one face [e.g. the ferredoxin-like fold (d.58)] or two faces [e.g. the P-loop-containing NTP hydrolases (c.37) and the NAD(P)-binding Rossmann fold (c.2)] of the central β-sheet arrangement [107,163]. These and other ancient architectures were multifunctional and interacted with organic cofactors [174], especially nucleotide-containing ligands such as ATP, ADP, GDP, NAD and FAD, all of which appear to have originated early in evolution according to a power-law distribution of ligand–protein mapping [175]. Architectures appearing later in the timelines were functionally more specialized and simple, with structures that were increasingly smaller and more compact (e.g. increases in the tilt of strands or the frequency of open barrel structures in the popular β-barrels; [107]). At the same time, structures became more refined, as illustrated with barrel structures harbouring increasingly more complex strand topologies. Many important structural designs were derived in the tree (including polyhedral folds in the all-α class and β-sandwiches, β-propellers and β-prisms in the all-β class) and protein transformation pathways describing likely scenarios of structural evolution [176,177] and other patterns could be traced in the trees [107].

Interestingly, all classes of globular protein architecture appeared very early in evolution and in a defined order, the α/β class being the first, followed by the α + β, the all-α and the all-β classes, and by small and multidomain proteins [107,163]. Patterns of origin and accumulation of protein classes were consistently revealed in all trees analysed, including those derived from a tree of domains and domain combinations (Figure 4). A similar conclusion was reached when tracing fold occurrence along branches of proteome [160] and rRNA trees [161], and when studying the evolution of aminoacyl-tRNA synthetases [178]. We proposed that architectural designs with interspersed α-helical and β-sheet elements were segregated in the course of evolution, first within their structure (α + β class) and then confined to separate molecules (all-α and all-β classes) [107]. This is in accord with the random origin hypothesis of proteins [179]. Several interesting features distinguish the ancient α/β protein class from the rest. For example, topological accessibility measurements describing how easy it is to fold a structure from any point in the polypeptide chain showed a marked asymmetry toward the N-terminus in α/β folds, a property that was mostly confined to selected protein families [180]. Measurements of closeness to the molecular centroid and residue contact distribution also revealed the bias, which was more notable in ancient than in more recent α/β folds [181]. These observations were interpreted as evidence of ancient α/β folds predating chaperone-assisted folding and preserving the bias as a relic [180] or of unmasked co-translational folding in extant proteins [181]. Co-translational folding is the ability of proteins to fold as they exit the ribosome, but the process remains contentious. Interestingly, our evolutionary timelines revealed that FSF architectures linked to chaperone and proteostasis systems in the cell appeared early with the ATPase domain of the Hsp90 chaperone (d.122.1) and throughout the timeline (ndFSF = 0.06–0.86). However, the dominant families that contributed to the N-terminal bias the most appeared earlier (e.g. c.37 and c.2), supporting the idea that the origin of this asymmetry lies in the past. We found that α-helical segments were generally longer in α/β folds (Figure 4A), a trend that was especially notable with the early folds (see the animated version of Figure 4A). This could indicate that these helical segments were overrepresented in the ancient interspersed α/β structures. It is well known that single extended β-sheets are quite effective at burying non-polar surfaces when compared with α-helices [182]. Moreover, a surprising in vitro model study of co-translational protein folding suggests an initial tendency to form misfolded sheets in an all-α protein (apomyoglobin), a tendency that decreases with protein length and underscores the importance of co-translationally active chaperones [183]. Perhaps the length-dependent misfolding tendencies in non-native proteins left behind relics in the ancient α/β proteins that had to fold unassisted, which tried to increase the length of helices to balance the secondary structure repertoire. Interestingly, a survey of hundreds of genomes reveals that domains are longer in very ancient proteins (M. Wang, unpublished work) and not shorter as claimed in a recent phylogenetic tracing study [161]. We therefore propose that longer helical segments provided an advantage in early protein evolution and were then slowly replaced by strands and a reduction in protein length once chaperone systems were in place. This scenario would explain α-to-β tendencies uncovered in the tree of architectures [107].

Tracing biological function along the timeline revealed patterns of origin of fundamental cellular processes (Figure 7), confirming the very early and explosive onset of metabolism [149] and small-molecule-binding chemistries [175]. It appears that proteins were first associated with organic cofactors, but later involved transition metals as ligands, perhaps mediated by the increasing energy demands of the ancient world. Timelines revealed a relatively early rise of metallomes (with the zinc-metallome appearing first) (C. L. Dupont, G. Caetano-Anollés and P. E. Bourne, unpublished work), and the late appearance of oxygenic photosynthesis, which was preceded and followed by the discovery of functions typical of Eukarya (cell adhesion, receptors and chromatin structure, and functions linked to multicellularity). Some of these results are consistent with a proteomic analysis suggesting that shifts in trace metal geochemistry related to the redox state of ancient oceans are imprinted in protein architecture and that prokaryotes evolved in anoxic marine environments, whereas eukaryotes did so in oxic counterparts [184]. The late evolutionary appearance of oxygenic photosynthesis confirms results from a phylogenomic analysis of metabolic networks [149] and is consistent with molecular and geological records that suggest that oxygen entered our atmosphere after major microbial divergences in the tree of life [185].

All functional categories and most subcategories appeared for the first time during the architectural diversification epoch, lending credence to the complex nature of the ‘communal ancestor’ to diversified life [149]. In fact, the functional and structural diversity of its architectural complement (Figure 6) suggests that biological functions were geared fundamentally to metabolic activities, proteostasis and protein degradation, and, as expected, were embodied mostly in α/β protein architectures. Major subcategories pooled transferases, nucleotide metabolism and small-molecule binding enzymes, matching recent metabolic network investigations [165]. Coenzyme, carbohydrate and energy metabolisms also featured prominently. These cells also had architectures involved in an incipient translation apparatus. Nucleic acid processing (DNA replication/repair) was embodied in Nudix (d.113.1) and DNA breaking–rejoining enzyme (d.163.1) FSFs linked to pyrophosphorylase/pyrophosphatase and RNA-decapping activities and integrases and topoisomerases respectively. Only five functional subcategories originated later on in the superkingdom specification epoch and were clearly related to the cellular make-up of organisms; they involved lipid/membrane binding and structural proteins, proteins associated with cell envelope biogenesis and the outer membrane, viral proteins and proteins related to oxygenic photosynthesis. Only one subcategory had its origin in the organismal diversification epoch (blood clotting). These proteins are therefore important markers in the architectural chronology. Similarly, α-solenoids, β-propellers, coiled coils and other architectures linked to the nuclear pore complex [186], a marker for the nuclear envelope in Eukarya and some bacterial lineages (e.g. Planctomycetes and Verrucomicrobia [187]), appeared (together with karyopherins that interact transiently with the complex) very late in evolution (ndFSF = 0.82–1.00). Nuclear pores therefore represent very modern protein complexes that were horizontally transferred or evolved convergently in Eukarya and Bacteria. Of all the main categories, extracellular processes appeared the latest, close to the boundary of the superkingdom specification epoch. These categories involve immune responses, toxin and defence enzymes, and cell adhesion, functions related to definition of self and intercellular interactions (competition and multicellularity). It is logical that these functions would appear at the end of a communal world of organisms.

The appearance of information-related processes and cellular motility has important consequences for origins of modern biochemistry and diversified life. Translation originated quite early and preceded the DNA repair/replication, transcription, RNA processing and chromatin structure subcategories, which developed in the timelines in that order (Figure 7). The early origin of translation was confirmed by tracing architectures of aminoacyl-tRNA synthetases, elongation factors and ribosomal proteins derived from crystallographic models and HMM searches in the trees (D. Caetano-Anollés, unpublished work). Models of amino acid evolution also supported the antiquity of aminoacyl-tRNA synthetases [178]. The observation that modern protein synthesis developed only after metabolic proteinaceous enzymes were in place suggests strongly that the translation apparatus suffered a fundamental revision during evolution of modern proteins. The inception of cell motility also has important consequences. The microbes of the communal world were probably auxotrophic or heterotrophic organisms seeking to improve their survival in the changing environments of early Earth. Cellular motility allowed better tools to seek and ingest food and in some lineages to prey on other members of the community. The development of phagotrophy (a hallmark of Eukarya) and mechanisms of cell motility could have ignited the rise of the tripartite world [188]. Indeed, fundamental FSF architectures associated with a number of important molecules linked to cell movement (e.g. tubulin, moesin, profilin and actin) originated at the end of the architectural diversification epoch (e.g. tubulin nucleotide-binding domain-like and C-terminal domain-like, ndFSF = 0.31) and continued to accumulate, but together with toxins and defence architectures, which could have brought other means of warfare (Figure 7). Whereas many important proteins related to motility developed later in the timeline [e.g. phase 1 flagellin domain (ndFSF = 0.58), profilin (ndFSF = 0.73), moesin tail domain (ndFSF = 0.76), actin-cross-linking and depolymerizing domains (ndFSF = 0.85)], others that were multifunctional and ancient were probably recruited for the task (e.g. actin-like ATPase domain architecture, ndFSF = 0.04). It is therefore quite likely that the world of organisms underwent a transition from communal to competitive during superkingdom specification and that this triggered diversification of life.

THE ORIGIN OF THE PROTEIN WORLD AND THE RISE OF MODERN METABOLISM

It is generally assumed that life originated as an emergent dissipative system with a series of autocatalytic processes that produced primordial metabolites [189–192]. Among these chemicals are the nucleotides and amino acids that are prerequisite for an ancient RNA world and an emergent protein world. As the latter developed, the first reactions available for RNA and protein molecules must have been metabolic reactions. Timelines already suggest that modern metabolism appeared very early on in evolution (Figure 7). However, a detailed phylogenomic tracing analysis of protein architecture in metabolic networks [193] revealed that the nine most ancient architectures were responsible for the explosive appearance of most modern enzymatic functions [165]. In fact, a careful dissection of recruitment patterns indicated that modern metabolism originated in enzymes of nucleotide metabolism harbouring the P-loop-containing NTP hydrolase fold, probably in pathways linked to the purine metabolic subnetwork. This study was complemented recently with a battery of other evolutionary bioinformatic approaches, which revealed a succession of recruitment gateways, each mediated by the discovery of a new primordial fold [194]. These gateways produced a layered system reminiscent of Morowitz’s prebiotic shells [195] describing early evolutionary progressions and take-overs of ancient prebiotic chemistries. The first gateway originated in nucleotide metabolism, involved mostly transferases and was then extended to metabolism of cofactors. It was immediately followed by an ‘energy amphiphile’ lipid–carbohydrate core that provided enzymes for energy and hydrocarbon precursors established primordially in the self-replicating prebiotic entity. The TIM β/α barrel-mediated gateway later introduced amination reactions that converted keto acids into amino acids, mediating the incorporation of nitrogen into a multitude of metabolic processes. These opened new recruitment possibilities and generated explosively the chemical diversity we currently encounter in modern metabolism. Phylogenomics therefore provides for the first time a link between the prebiotic and modern worlds, showing metabolism as a palimpsest that recapitulates prebiotic and perhaps ribozymic chemistries. We note that many of the very ancient architectures were involved in functions associated with ancient genes that were recently identified by physical clustering in bacterial genomes [196]. In this study, the evolutionarily conserved gene core divided into three layers, the first highly connected centred around informational processes (fundamentally the ribosome and translation), a second featuring tRNA synthetases and other processes (e.g. proteolysis), and an outer loosely connected layer (assumed to be more ancestral) linked to metabolism and highlighting metabolism of nucleotides, coenzymes and fatty acid molecules. The overall picture of these studies points clearly to an origin of modern proteins in the synthesis of nucleotides for a world in which RNA was the only encoded catalyst, but also to the coexistence of RNA, proteins and prebiotic chemistries, a concept that is in line with recent prebiotic experiments [192]. The centrality of RNA in the primordial make-up of the early protein-encoding organisms is revealed.

Figure 7 Evolution of biological function in the protein world

The evolutionary timeline shows the discovery of protein FSF architectures associated with different coarse-grained functional SUPERFAMILY categories in each superkingdom, with time measured by a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures (see Supplementary Figure S3B at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Pie charts below bins describe the distribution of architectures that are unique or shared between superkingdoms, and their areas are proportional to the total number of architectures in that bin. Arrowheads indicate the first appearance of architectures associated with functional subcategories that are listed. Details of their individual accumulation can be found in Supplementary Figure S4 at http://www.BiochemJ.org/bj/417/bj4170621add.htm. The three evolutionary epochs and corresponding phases of the protein world are labelled with different shades and follow previous definitions [149].
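The timescale of Figure 7, a relative distance in nodes from the hypothetical ancestor at the base of the tree, can be illustrated with a short sketch. The rescaling to the interval 0–1 below is our assumption about how such a relative distance may be normalized, and the tree encoding, the function name `node_distances` and the example tree are hypothetical.

```python
from typing import Dict, List


def node_distances(children: Dict[str, List[str]], root: str) -> Dict[str, float]:
    """Count the nodes traversed from the root to each leaf and rescale
    the counts so the most basal leaf scores 0 and the most derived 1."""
    depth: Dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, d = stack.pop()
        kids = children.get(node, [])
        if not kids:
            depth[node] = d  # leaf: record number of nodes above it
        for kid in kids:
            stack.append((kid, d + 1))
    lo, hi = min(depth.values()), max(depth.values())
    span = hi - lo
    return {leaf: (d - lo) / span if span else 0.0 for leaf, d in depth.items()}


# Hypothetical unbalanced (comb-like) tree: A branches off first, then B,
# and C and D are the most derived architectures.
tree = {"root": ["A", "n1"], "n1": ["B", "n2"], "n2": ["C", "D"]}
nd = node_distances(tree, "root")
```

On a strongly unbalanced tree, such as those described for architectures, this measure spreads leaves evenly along the timeline, with basal taxa near 0 and recently derived taxa near 1.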

PROSPECTS

Ever since the first crystallographic structure was reported 50 years ago for sperm whale myoglobin (PDB code 1MBN) [197], advances in comparative and structural genomics have continued to provide an increasing number of sequences and crystal structures that are available for the study of the modern protein world. Recent advances in our understanding of protein structure and folding and the construction of powerful classification schemes provide a more thorough description of the hierarchical structure of this world. The linking of molecular evolution and structural biology now provides evolutionary views that are unprecedented. They prompt us to answer important questions. How discrete or continuous is protein space? What are the fundamental processes that drive the evolution of protein structure? What are the tempo and mode of architectural discovery? At what structural resolution do proteomes differ and how does it affect our definition of species? What are the principles that drive the evolutionary mechanics of domain combination in Nature? When and how did individual biological functions originate and evolve?

We have reviewed the remarkable patterns related to the origin, evolution and structure of the protein world and the diversification of life inferred from comparative and phylogenomic analysis of protein structure. History reconstruction exercises unfold timelines of the discovery of architectures and functions and an emergent picture of primordial biochemistries. They uncover episodes of specialization, exemplified by the explosive rise of functionally specialized multidomain proteins. They also reveal patterns of simplification, such as reductive tendencies of protein repertoires in the proteomes of microbial organisms. More importantly, results test long-standing and controversial hypotheses of how life originated and evolved. The gates to the mysteries of how the living world emerged have been opened, and we are expecting a flood of new exciting discoveries.

ACKNOWLEDGMENTS

We thank Professor Steven Huber and Professor Alex Toker for the invitation to write this review, and members of the GCA research group for constructive discussions.

FUNDING

Supported by the National Science Foundation [grant numbers MCB-0343126 and MCB-0749836], the C-FAR Sentinel Program, the United States Department of Agriculture and the Critical Research Initiative of the University of Illinois.

REFERENCES

1 Pauling, L. and Corey, R. B. (1951) The polypeptide-chain configuration in hemoglobin and other globular proteins. Proc. Natl. Acad. Sci. U.S.A. 37, 282–285

2 Linderstrøm-Lang, K. and Schellman, J. A. (1959) Protein structure and enzymatic activity. In The Enzymes, 2nd edn (Lardy, H. and Myrback, K., eds.), pp. 443–510, Academic Press, New York

3 Soding, J. and Lupas, A. N. (2003) More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25, 837–846

4 Vogel, C., Bashton, M., Kerrison, N. D., Chothia, C. and Teichmann, S. A. (2004) Structure, function and evolution of multidomain proteins. Curr. Opin. Struct. Biol. 14, 208–216

5 Pereira-Leal, J. B., Levy, E. D., Kamp, C. and Teichmann, S. A. (2007) Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 8, R51

6 Schellman, J. A. and Schellman, C. G. (1997) Kaj Ulrik Linderstrøm-Lang (1896–1959). Protein Sci. 6, 1092–1100

7 Epstein, C. J., Goldberger, R. F. and Anfinsen, C. B. (1963) The genetic control of tertiary protein structure: model systems. Cold Spring Harbor Symp. Quant. Biol. 28, 439–449

8 Anfinsen, C. B. (1973) Principles that govern the folding of protein chains. Science 181, 223–230

9 Onuchic, J. N. and Wolynes, P. G. (2004) Theory of protein folding. Curr. Opin. Struct. Biol. 14, 70–75

10 Dill, K. A., Ozkan, S. B., Shell, M. S. and Weikl, T. R. (2008) The protein folding problem. Annu. Rev. Biophys. 37, 289–316

11 Englander, S. W., Mayne, L. and Krishna, M. M. G. (2007) Protein folding and misfolding: mechanism and principles. Q. Rev. Biophys. 40, 287–326

12 Ozkan, S. B., Wu, G. H. A., Chodera, J. D. and Dill, K. A. (2007) Protein folding by zipping and assembly. Proc. Natl. Acad. Sci. U.S.A. 104, 11987–11992

13 Duan, Y. and Kollman, P. A. (1998) Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282, 740–744

14 Zagrovic, B., Snow, C. D., Shirts, M. R. and Pande, V. S. (2002) Simulation of folding of a small α-helical protein in atomistic detail using world-wide distributed computing. J. Mol. Biol. 323, 927–937

15 Felts, A. K., Harano, Y., Gallicchio, E. and Levy, R. M. (2004) Free-energy surfaces of β-hairpin and α-helical peptides generated by replica exchange molecular dynamics with the AGBNP implicit solvent models. Proteins 56, 310–321

16 Ołdziej, S., Czaplewski, C., Liwo, A., Chinchio, M., Nanias, M., Vila, J. A., Khalili, M., Arnautova, Y. A., Jagielska, A., Makowski, M. et al. (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc. Natl. Acad. Sci. U.S.A. 102, 7547–7552

17 Lei, H. and Duan, Y. (2007) Ab initio folding of albumin binding domain from all-atom molecular dynamics simulation. J. Phys. Chem. B 111, 5458–5463

18 Mayor, U., Guydosh, N. R., Johnson, C. M., Grossmann, J. G., Sato, S., Jas, G. S., Freund, S. M. V., Alonso, D. O. V., Daggett, V. and Fersht, A. R. (2003) The complete folding pathway of a protein from nanoseconds to microseconds. Nature 421, 863–867

19 Religa, T. L., Markson, J. S., Mayor, U., Freund, S. M. and Fersht, A. R. (2005) Solution structure of a protein denatured state and folding intermediate. Nature 437, 1053–1056

20 Hoelzer, G. A., Smith, E. and Pepper, J. W. (2006) On the logical relationship between natural selection and self-organization. J. Evol. Biol. 19, 1785–1794

21 Schuster, P., Fontana, W., Stadler, P. and Hofacker, I. (1994) From sequences to shapes and back: a case study in RNA secondary structures. Proc. R. Soc. London Ser. B 255, 279–284

22 Babajide, A., Hofacker, I. L., Sippl, M. J. and Stadler, P. F. (1997) Neutral networks in protein space: a computational study based on knowledge-based potential of mean force. Folding Des. 2, 261–269

23 Fontana, W. (2002) Modelling ‘evo-devo’ with RNA. BioEssays 24, 1164–1177

24 Schuster, P. and Stadler, P. F. (2003) Networks in molecular evolution. Complexity 8, 34–42

25 Maynard Smith, J. (1970) Natural selection and the concept of protein space. Nature 225, 563–564

26 Salisbury, F. B. (1969) Natural selection and the complexity of the gene. Nature 224, 342–343

27 Schultes, E. A. and Bartel, D. P. (2000) One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289, 448–452

28 Babajide, A., Farber, R., Hofacker, I. L., Inman, J., Lapedes, A. S. and Stadler, P. F. (2001) Exploring protein sequence space using knowledge based potentials. J. Theor. Biol. 212, 35–46

29 Bornberg-Bauer, E. (1997) How are model protein structures distributed in sequence space? Biophys. J. 73, 2393–2403

30 Bastolla, U., Roman, H. E. and Vendruscolo, M. (1999) Neutral evolution of model proteins: diffusion in sequence space and overdispersion. J. Theor. Biol. 200, 49–64

31 Govindarajan, S. and Goldstein, R. A. (1997) The foldability landscape of model proteins. Biopolymers 42, 427–438

32 Orengo, C. A., Jones, D. T. and Thornton, J. M. (1994) Protein superfamilies and domain superfolds. Nature 372, 631–634

33 Bershtein, S. and Tawfik, D. S. (2008) Advances in laboratory evolution of proteins. Curr. Opin. Chem. Biol. 12, 151–158

34 Martinez, M. A., Pezo, V., Marlere, P. and Wain-Hobson, S. (1997) Exploring the functional robustness of an enzyme by in vitro evolution. EMBO J. 15, 1203–1210

35 Keefe, A. D. and Szostak, J. W. (2001) Functional proteins from a random-sequence library. Nature 410, 715–718

36 Seelig, B. and Szostak, J. W. (2007) Selection and evolution of enzymes from a partially randomized non-catalytic scaffold. Nature 448, 828–831

37 Bornberg-Bauer, E. and Chan, H. S. (1999) Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc. Natl. Acad. Sci. U.S.A. 96, 10689–10694

38 Taverna, D. M. and Goldstein, R. A. (2002) Why are proteins so robust to site mutations? J. Mol. Biol. 315, 479–484

39 Wroe, R., Bornberg-Bauer, E. and Chan, H. S. (2005) Comparing folding codes in simple heteropolymer models of protein evolutionary landscapes: robustness of the superfunnel paradigm. Biophys. J. 88, 118–131

40 Huynen, M. A., Stadler, P. F. and Fontana, W. (1996) Smoothness within ruggedness: the role of neutrality in adaptation. Proc. Natl. Acad. Sci. U.S.A. 93, 397–401

41 van Nimwegen, E., Crutchfield, J. and Huynen, M. (1999) Neutral evolution of mutational robustness. Proc. Natl. Acad. Sci. U.S.A. 96, 9716–9720

42 Cordes, M. H. J., Burton, R. E., Walsh, N. P., McKnight, C. J. and Sauer, R. T. (2000) An evolutionary bridge to a new protein fold: interconversion of two native structures in a single mutant protein. Nat. Struct. Biol. 7, 1129–1132

43 Bloom, J. D., Silberg, J. J., Wilke, C. O., Drummond, D. A., Adami, C. and Arnold, F. H. (2005) Thermodynamic prediction of protein neutrality. Proc. Natl. Acad. Sci. U.S.A. 102, 606–611

44 Wroe, R., Chan, H. S. and Bornberg-Bauer, E. (2007) A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J. 1, 79–87

45 James, L. C. and Tawfik, D. S. (2003) Conformational diversity and protein evolution: a 60-year old hypothesis revisited. Trends Biochem. Sci. 28, 361–368

46 Aharoni, A., Gaidukov, L., Khersonsky, O., Gould, S. M., Roodvelt, C. and Tawfik, D. S. (2005) The ‘evolvability’ of promiscuous protein functions. Nat. Genet. 37, 73–76

47 Amitai, G., Devi-Gupta, R. and Tawfik, D. S. (2007) Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP J. 1, 67–68

48 Bershtein, S., Segal, M., Bekerman, R., Tokuriki, N. and Tawfik, D. S. (2006) Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932

49 Trent, J. D., Gabrielsen, M., Jensen, B., Neuhard, J. and Olsen, J. (1994) Acquired thermotolerance and heat shock proteins in thermophiles from the three phylogenetic domains. J. Bacteriol. 176, 6148–6152

50 Laksanalamai, P., Whitehead, T. A. and Robb, F. T. (2004) Minimal protein-folding systems in hyperthermophilic Archaea. Nat. Rev. Microbiol. 2, 315–324

51 Saibil, H. R. (2008) Chaperone machines in action. Curr. Opin. Struct. Biol. 18, 35–42

52 Vainberg, I. E., Ampe, C., Cowan, N. J., Kleine, H. L., Lewis, H., Rommelaere, J. and Vandekerckhove, J. (1998) Prefoldin, a chaperone that delivers unfolded proteins to cytosolic chaperonin. Cell 93, 863–873

53 Ellis, R. J. and Minton, A. P. (2006) Protein aggregation in crowded environments. Biol. Chem. 387, 485–497

54 Glickman, M. H. and Ciechanover, A. (2002) The ubiquitin–proteasome proteolytic pathway: destruction for the sake of construction. Physiol. Rev. 82, 373–428

55 Balch, W. E., Morimoto, R. I., Dillin, A. and Kelly, J. W. (2008) Adapting proteostasis for disease intervention. Science 319, 916–919

56 Ron, D. and Walter, P. (2007) Signal integration in the endoplasmic reticulum unfolded protein response. Nat. Rev. Mol. Cell Biol. 8, 519–529

57 Wiseman, R. L., Powers, E. T., Buxbaum, J. N., Kelly, J. W. and Balch, W. E. (2007) An adaptable standard for protein export from the endoplasmic reticulum. Cell 131, 809–821

58 Bull, A. T., Goodfellow, M. and Slater, J. H. (1992) Biodiversity as a source of innovation in biotechnology. Annu. Rev. Microbiol. 46, 219–252

59 Whitman, W. B., Coleman, D. C. and Wiebe, W. J. (1998) Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. U.S.A. 95, 6578–6583

60 Brocchieri, L. and Karlin, S. (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33, 3390–3400

61 Kurland, C. G., Canback, B. and Berg, O. G. (2007) The origins of modern proteomes. Biochimie 89, 1454–1563

62 Drake, J. W., Charlesworth, B., Charlesworth, D. and Crow, J. F. (1998) Rates of spontaneous mutation. Genetics 148, 1667–1686

63 Bajaj, M. and Blundell, T. (1984) Evolution and the tertiary structure of proteins. Annu. Rev. Biophys. Bioeng. 13, 453–492

64 Vukmirovic, O. G. and Tilghman, S. M. (2000) Exploring genome space. Nature 405, 820–822

65 Sober, E. and Steel, M. (2002) Testing the hypothesis of common ancestry. J. Theor. Biol. 218, 395–408

66 Penny, D., Hendy, M. D. and Poole, A. M. (2003) Testing fundamental evolutionary hypotheses. J. Theor. Biol. 223, 377–385

67 Mossell, E. (2003) On the impossibility of reconstructing ancestral data and phylogenies. J. Comp. Biol. 10, 669–678

68 Pal, C., Papp, B. and Hurst, L. D. (2001) Highly expressed genes in yeast evolve slowly. Genetics 158, 927–931

69 Wall, D. P., Hirsch, A. E., Fraser, H. B., Kum, J., Giaever, G., Eisen, M. B. and Feldman, M. W. (2005) Functional genomic analysis of the rates of protein evolution. Proc. Natl. Acad. Sci. U.S.A. 102, 5483–5488

70 Drummond, D. A., Raval, A. and Wilke, C. O. (2006) A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23, 327–337

71 Kim, P. M., Lu, L. J., Xia, Y. and Gerstein, M. B. (2006) Relating three-dimensional structures to protein networks provides evolutionary insights. Science 314, 1938–1941

72 Kim, P. M., Korbel, J. A. and Gerstein, M. B. (2007) Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context. Proc. Natl. Acad. Sci. U.S.A. 104, 20274–20279

73 Zhou, T., Drummond, D. A. and Wilke, C. O. (2008) Contact density affects protein evolutionary rate from bacteria to animals. J. Mol. Evol. 66, 395–404

74 Simon, A. L., Stone, E. A. and Sidow, A. (2002) Inference of functional regions in proteins by quantification of evolutionary constraints. Proc. Natl. Acad. Sci. U.S.A. 99, 2912–2917

75 Cooper, G. M. and Brown, C. D. (2008) Qualifying the relationship between sequence conservation and molecular function. Genome Res. 18, 201–205

76 Grant, A., Lee, D. and Orengo, C. (2004) Progress towards mapping the universe of protein folds. Genome Biol. 5, 107

77 Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. and Ouzounis, C. A. (2003) Myriads of protein families, and still counting. Genome Biol. 4, 401

78 Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. and Skolnick, J. (2006) On the origin and highly likely completeness of single-domain protein structures. Proc. Natl. Acad. Sci. U.S.A. 103, 2605–2610

79 Marsden, R. L. and Orengo, C. A. (2008) The classification of protein domains. In Bioinformatics, Volume II: Structure, Function and Applications, vol. 453 (Keith, J. M., ed.), pp. 123–146, Humana Press, Totowa

80 Richardson, J. S. (1981) The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339

81 Finn, R. D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res. 34, D247–D251

82 Murzin, A., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540

83 Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D414–D425

84 Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M. (1997) CATH: a hierarchic classification of protein structure. Structure 5, 1093–1098

85 Greene, L. H., Lewis, T. E., Addou, S., Cuff, A., Dallman, T., Dibley, M., Redfern, O., Pearl, F., Nambudiry, R., Reid, A. et al. (2007) The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35, D291–D297

86 Holm, L. and Sander, C. (1998) Dictionary of recurrent domains in protein structures. Proteins 33, 88–89

87 Hadley, C. and Jones, D. T. (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7, 1099–1112

88 Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R. et al. (2007) New developments in the InterPro database. Nucleic Acids Res. 35, D224–D228

89 Redfern, O. C., Desailly, B. and Orengo, C. A. (2008) Exploring the structure and function paradigm. Curr. Opin. Struct. Biol. 18, 394–402

90 Bashton, M. and Chothia, C. (2007) The generation of new functions by the combination of domains. Structure 15, 85–99

91 Reeves, G. A., Dallman, T. J., Redfern, O. C., Akpor, A. and Orengo, C. A. (2006) Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. 360, 725–741

92 Pan, J. L. and Bardwell, J. C. A. (2006) The origami of thioredoxin-like folds. Protein Sci. 15, 2217–2227

93 Shindyalov, I. N. and Bourne, P. E. (2000) An alternative view of protein fold space. Proteins 38, 247–260

94 Harrison, A., Pearl, F., Mott, R., Thornton, J. and Orengo, C. (2002) Quantifying the similarities within fold space. J. Mol. Biol. 323, 909–926

95 Kolodny, R., Petrey, D. and Honig, B. (2006) Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction. Curr. Opin. Struct. Biol. 16, 393–398

96 Xie, L. and Bourne, P. E. (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc. Natl. Acad. Sci. U.S.A. 105, 5441–5446

97 Andreeva, A. and Murzin, A. G. (2006) Evolution of protein fold in the presence of functional constraints. Curr. Opin. Struct. Biol. 16, 399–408

98 Moore, A. D., Bjorklund, A. K., Ekman, D., Bornberg-Bauer, E. and Elofsson, A. (2008) Arrangements in the modular evolution of proteins. Trends Biochem. Sci. 33, 444–451

99 Koonin, E. V., Wolf, Y. I. and Karev, G. P. (2002) The structure of the protein universe and genome evolution. Nature 420, 218–223

100 Huynen, M. A. and van Nimwegen, E. (1998) The frequency distribution of family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589

101 Rzhetsky, A. and Gomez, S. M. (2001) Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17, 988–996

102 Qian, J., Luscombe, N. M. and Gerstein, M. (2001) Protein family and fold occurrence in genomes: power-law behavior and evolutionary model. J. Mol. Biol. 313, 673–681

103 Coulson, A. F. and Moult, J. A, (2002) A unifold, mesofold and superfold model of protein fold use. Proteins 46, 61–71

104 Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezovskaya, F. S. and Koonin, E. V. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2, 18

105 Karev, G. P., Wolf, Y. I. and Koonin, E. V. (2003) Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics 19, 1889–1900

106 Karev, G. P., Wolf, Y. I., Berezovskaya, F. S. and Koonin, E. V. (2004) Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth–death–innovation models. BMC Evol. Biol. 4, 32

107 Caetano-Anolles, G. and Caetano-Anolles, D. (2003) An evolutionarily structured universe of protein architecture. Genome Res. 13, 1563–1571

108 Goldstein, R. A. (2008) The structure of protein evolution and the evolution of protein structure. Curr. Opin. Struct. Biol. 18, 170–177

109 Zeldovich, K. B., Chen, P., Shakhnovich, B. E. and Shakhnovich, E. I. (2007) A first-principles model of early evolution: emergence of gene families, species and preferred protein folds. PLoS Comput. Biol. 3, 1224–1238

110 Darwin, C. R. (1859) On the Origin of Species by Means of Natural Selection, Murray, London

111 Woese, C. R. (2004) A new biology for a new century. Microbiol. Mol. Biol. Rev. 68, 173–186

112 Eventhoff, W. and Rossmann, M. G. (1975) The evolution of dehydrogenases and kinases. CRC Crit. Rev. Biochem. 3, 111–140

113 Johnson, K. S., Sutcliff, M. J. and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from three-dimensional structures of proteins. J. Mol. Evol. 30, 43–59

114 Bujnicki, J. M. (2000) Phylogeny of restriction endonuclease-like superfamily inferred from comparison of protein sequences. J. Mol. Evol. 50, 39–44

115 Breitling, R., Laubner, D. and Adamski, J. (2001) Structure-based phylogenetic analysis of short-chain alcohol dehydrogenases and reclassification of the 17β-hydroxysteroid dehydrogenase family. Mol. Biol. Evol. 18, 2154–2161

116 O’Donoghue, P. and Luthey-Schulten, Z. (2003) On the evolution of structure in aminoacyl-tRNA synthetases. Microbiol. Mol. Biol. Rev. 67, 550–573

117 Scheef, E. D. and Bourne, P. E. (2005) Structural evolution of the protein kinase-like superfamily. PLoS Comp. Biol. 1, e49

118 Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 223, 123–138

119 Røgen, P. and Fain, B. (2003) Automatic classification of protein structure by using Gauss integrals. Proc. Natl. Acad. Sci. U.S.A. 100, 119–124

120 Hou, J., Sims, G. E., Zhang, C. and Kim, S.-H. (2003) A global representation of the protein fold space. Proc. Natl. Acad. Sci. U.S.A. 100, 2386–2390

121 Hou, J., Jun, S.-H., Zhang, C. and Kim, S.-H. (2005) Global mapping of the protein structure space and application in structure-based inference of protein function. Proc. Natl. Acad. Sci. U.S.A. 102, 3651–3656

122 Efimov, A. V. (1997) Structural trees for protein superfamilies. Proteins 28, 241–260

123 Zhang, C. and Kim, S.-H. (2000) A comprehensive analysis of the Greek key motifs in protein β-barrels and β-sandwiches. Proteins 40, 409–419

124 Przytycka, T., Aurora, R. and Rose, G. D. (1999) A protein taxonomy based on secondary structure. Nat. Struct. Biol. 6, 672–682

125 Dokholyan, N. V., Shakhnovich, B. and Shakhnovich, E. I. (2002) Expanding protein universe and its origin from the biological Big Bang. Proc. Natl. Acad. Sci. U.S.A. 99, 14132–14136

126 Shakhnovich, B. E. (2005) Improving the precision of the structure–function relationship by considering phylogenetic context. PLoS Comput. Biol. 1, e9

127 Rose, G. D., Fleming, P. J., Banavar, J. R. and Maritan, A. (2006) A backbone-based theory of protein folding. Proc. Natl. Acad. Sci. U.S.A. 103, 16623–16663

128 Taylor, W. R. (2007) Evolutionary transitions in protein fold space. Curr. Opin. Struct. Biol. 17, 354–361

129 Taylor, W. R. (2002) A ‘periodic table’ for protein structures. Nature 416, 657–660

130 Gerstein, M. and Levitt, M. (1997) A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 94, 11911–11916

131 Gerstein, M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol. 274, 562–576

132 Gerstein, M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins 33, 518–534

133 Frishman, D. and Mewes, H.-W. (1997) Protein structural classes in five complete genomes. Nat. Struct. Biol. 4, 626–628

134 Wolf, Y. I., Brenner, S. E., Bash, P. A. and Koonin, E. V. (1999) Distribution of protein folds in the three superkingdoms of life. Genome Res. 9, 17–26

135 Frishman, D. and Mewes, H.-W. (1997) PEDANTic genome analysis. Trends Genet. 13, 415–416

136 Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001) Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J. Mol. Biol. 313, 903–991

137 Wilson, D., Madera, M., Vogel, C., Chothia, C. and Gough, J. (2007) The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 35, D308–D313

138 Buchan, D., Pearl, F., Lee, D., Shepherd, A., Rison, S., Thornton, J. M. and Orengo, C. (2002) Gene3-D: structural assignments for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503–514

139 Yeats, C., Lees, J., Reid, A., Kelam, P., Martin, N., Liu, X. and Orengo, C. A. (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36, D414–D418

140 Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. and Chothia, C. (2001) Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol. 19, 482–486

141 Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. and Chothia, C. (2001) The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol. 311, 693–708

142 Apic, G., Gough, J. and Teichmann, S. A. (2001) An insight into domain combinations. Bioinformatics 17 (Suppl. 3), S83–S89

143 Apic, G., Gough, J. and Teichmann, S. A. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325

144 Abeln, S. and Deane, C. M. (2005) Fold usage on genomes and protein fold evolution. Proteins 60, 690–700

145 Malek, J. A. (2001) Abundant protein domains occur in proportion to proteome size. Genome Biol. 2, research0039

146 Lin, J. and Gerstein, M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818

147 Deeds, E. J., Hennessey, H. and Shakhnovich, E. I. (2005) Prokaryotic phylogenies inferred from protein structural domains. Genome Res. 15, 393–402

148 Yang, S., Doolittle, R. F. and Bourne, P. E. (2005) Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. U.S.A. 102, 373–378

149 Wang, M., Yafremava, L. S., Caetano-Anolles, D., Mittenthal, J. E. and Caetano-Anolles, G. (2007) Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17, 1572–1585

150 Wang, M. and Caetano-Anolles, G. (2006) Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23, 2444–2454

151 Fukami-Kobayashi, K., Minezaki, Y., Tateno, Y. and Nishikawa, K. (2007) A tree of life based on protein domain organizations. Mol. Biol. Evol. 24, 1181–1189

152 Doolittle, R. F. (2005) Evolutionary aspects of whole-genome biology. Curr. Opin. Struct. Biol. 15, 248–253

153 Woese, C. R., Kandler, O. and Wheelis, M. L. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Natl. Acad. Sci. U.S.A. 87, 4576–4579

154 Wolf, Y., Rogozin, I. B. and Koonin, E. V. (2004) Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14, 29–36

155 Huerta-Cepas, J., Dopazo, H., Dopazo, J. and Gabaldon, T. (2007) The human phylome. Genome Biol. 8, R109

156 Glansdorff, N., Xu, Y. and Labedan, B. (2008) The Last Universal Common Ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol. Direct 3, 29

157 Caetano-Anolles, G. (2002) Evolved RNA secondary structure and the rooting of the universal tree of life. J. Mol. Evol. 54, 333–345

158 Gough, J. (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471

159 Forslund, K., Henricson, A., Hollich, V. and Sonnhammer, E. L. L. (2008) Domain tree-based analysis of protein architecture evolution. Mol. Biol. Evol. 25, 254–264

160 Winstanley, H. F., Abeln, S. and Deane, C. M. (2005) How old is your fold? Bioinformatics 21, i449–i458

161 Choi, I.-G. and Kim, S.-H. (2006) Evolution of protein structural classes and protein sequence families. Proc. Natl. Acad. Sci. U.S.A. 103, 14056–14061

162 Caetano-Anolles, G. and Caetano-Anolles, D. (2005) Universal sharing patterns in proteomes and evolution of protein fold architecture and life. J. Mol. Evol. 60, 484–498

163 Wang, M., Boca, S. M., Kalelkar, R., Mittenthal, J. E. and Caetano-Anolles, G. (2006) A phylogenomic reconstruction of the protein world based on a genomic census of protein fold architecture. Complexity 12, 27–40

164 Caetano-Anolles, G., Sun, F. J., Wang, M., Yafremava, L. S., Harish, A., Kim, H. S., Knudsen, V., Caetano-Anolles, D. and Mittenthal, J. E. (2008) Origins and evolution of modern biochemistry: insights from genomes and molecular structure. Front. Biosci. 13, 5212–5240

165 Caetano-Anolles, G., Kim, H. S. and Mittenthal, J. E. (2007) The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. U.S.A. 104, 9358–9363

166 Pagel, M., Venditti, C. and Meade, A. (2006) Large punctuational contribution of speciation to evolutionary divergence at the molecular level. Science 314, 119–121

167 Sun, F.-J. and Caetano-Anolles, G. (2008) Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses. PLoS Comput. Biol. 4, e1000018

168 Xue, H., Tong, K. L., Mark, C., Grosjean, M. and Wong, J. T. (2003) Transfer RNA paralogs: evidence for genetic code–amino acid biosynthesis coevolution and an archaeal root of life. Gene 22, 59–66

169 Di Giulio, M. (2007) The tree of life might be rooted in the branch leading to Nanoarchaeota. Gene 401, 108–113

170 Di Giulio, M. (2008) The origin of genes could be polyphyletic. Gene 426, 39–46

171 Castresana, J. (2001) Comparative genomics and bioenergetics. Biochim. Biophys. Acta 1506, 147–162

172 Ranea, J. A. G., Sillero, A., Thornton, J. M. and Orengo, C. A. (2006) Protein superfamily evolution and the Last Universal Common Ancestor (LUCA). J. Mol. Evol. 63, 513–525

173 Ouzounis, C. A., Kunin, V., Darzentas, N. and Goldovsky, L. (2006) A minimal estimate for the gene content of the last universal common ancestor: exobiology from a terrestrial perspective. Res. Microbiol. 157, 57–68

174 Ma, B.-G., Chen, L., Ji, H.-F., Chen, Z.-H., Yang, F.-R., Wang, L., Qu, G., Jiang, Y.-Y., Ji, C. and Zhang, H.-Y. (2008) Characters of very ancient proteins. Biochem. Biophys. Res. Commun. 366, 607–611

175 Ji, H.-F., Kong, D.-X., Shen, L., Chen, L.-L., Ma, B.-G. and Zhang, H.-Y. (2007) Distribution patterns of small-molecule ligands in the protein universe and implications for origin of life and drug discovery. Genome Biol. 8, R176

176 Murzin, A. (1998) How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol. 8, 380–387

177 Grishin, N. V. (2001) Fold change in evolution of protein structures. J. Struct. Biol. 134, 167–185

178 Ji, H.-F. and Zhang, H.-Y. (2007) Protein architecture chronology deduced from structures of amino acid synthases. J. Biomol. Struct. Dyn. 24, 321–323

179 White, S. H. (1994) Global statistics of protein sequences: implications for the origin, evolution, and prediction of structure. Annu. Rev. Biophys. Biomol. Struct. 23, 407–439

180 Taylor, W. R. (2006) Topological accessibility shows a distinct asymmetry in the folds of αβ proteins. FEBS Lett. 580, 5263–5267

181 Deane, C. M., Dong, M., Huard, F. P. E., Lance, B. K. and Wood, G. R. (2007) Cotranslational protein folding: fact or fiction? Bioinformatics 23, i142–i148

182 Chothia, C. (1976) The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105, 1–12

183 Chow, C. C., Chow, C., Raghunathan, V., Huppert, T. J., Kimball, E. B. and Cavagnero, S. (2003) Chain length dependence of apomyoglobin folding: structural evolution from misfolded sheets to native helices. Biochemistry 42, 7090–7099

184 Dupont, C. L., Yang, S., Palenik, B. and Bourne, P. E. (2006) Modern proteomes contain putative imprints of ancient shifts in trace metal geochemistry. Proc. Natl. Acad. Sci. U.S.A. 103, 17822–17827

185 Raymond, J. and Segre, D. (2006) The effect of oxygen on biochemical networks and the evolution of complex life. Science 311, 1764–1767

186 Devos, D., Dokudovskaya, S., Williams, R., Alber, F., Eswar, N., Chait, B. T., Rout, M. P. and Sali, A. (2006) Simple fold composition and molecular architecture of the nuclear pore complex. Proc. Natl. Acad. Sci. U.S.A. 103, 2172–2177

187 Fuerst, J. A. (2005) Intracellular compartmentation in planctomycetes. Annu. Rev. Microbiol. 59, 299–328

188 Kurland, C. G., Collins, L. J. and Penny, D. (2006) Genomics and the irreducible nature of eukaryotic cells. Science 312, 1011–1014

189 Lazcano, A. and Miller, S. L. (1999) On the origin of metabolic pathways. J. Mol. Evol. 49, 424–431

190 Orgel, L. E. (2000) Self-organizing biochemical cycles. Proc. Natl. Acad. Sci. U.S.A. 97, 12503–12507

191 Orgel, L. E. (2000) Some consequences of the RNA world hypothesis. Origin Life Evol. Biosphere 33, 211–218

192 Wachtershauser, G. (2007) On the chemistry and evolution of the pioneer organism. Chem. Biodiversity 4, 584–602

193 Kim, H. S., Mittenthal, J. E. and Caetano-Anolles, G. (2006) MANET: tracing evolution of protein architecture in metabolic networks. BMC Bioinformatics 7, 351

194 Caetano-Anolles, G., Yafremava, L. S., Gee, H., Caetano-Anolles, D., Kim, H. S. and Mittenthal, J. E. (2008) The origin and evolution of modern metabolism. Int. J. Biochem. Cell Biol. 41, 285–297

195 Morowitz, H. (1999) A theory of biochemical organization, metabolic pathways, and evolution. Complexity 4, 39–53

196 Danchin, A., Fang, G. and Noria, S. (2007) The extant core bacterial proteome is an archive of the origin of life. Proteomics 7, 875–889

197 Kendrew, J. C., Bodo, G., Dintzis, H. M., Parrish, R. G., Wycoff, H. W. and Phillips, D. C. (1958) A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature 181, 662–666

198 Liolios, K., Tavernarakis, N., Hugenholtz, P. and Kyrpides, N. C. (2006) The Genomes On Line Database (GOLD) v2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334

199 Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637

200 Wang, M. and Caetano-Anolles, G. (2009) The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure, in the press

201 Vogel, C. and Chothia, C. (2006) Protein family expansions and biological complexity. PLoS Comput. Biol. 2, e48

Received 10 October 2008/11 November 2008; accepted 17 November 2008
Published on the Internet 16 January 2009, doi:10.1042/BJ20082063

Biochem. J. (2009) 417, 621–637 (Printed in Great Britain) doi:10.1042/BJ20082063

SUPPLEMENTARY ONLINE DATA

The origin, evolution, and structure of the protein world

Gustavo CAETANO-ANOLLES*1, Minglei WANG*, Derek CAETANO-ANOLLES*† and Jay E. MITTENTHAL†

*Department of Crop Sciences, University of Illinois at Urbana-Champaign, 1101 W. Peabody Drive, Urbana, IL 61801, U.S.A., and †Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, 601 S. Goodwin Avenue, Urbana, IL 61801, U.S.A.

Table S1 Average length and number of secondary structures in all-α, all-β, α/β and α+β protein classes

Structures: H, α-helix; B, residues in isolated β-bridge; E, β-strands (they participate in β-sheets); G, 3₁₀-helix; I, π-helix; T, hydrogen-bonded turn; S, bend; C, coil. Values are means ± S.D.

| Structural properties of protein class | H | B | E | G | I | T | S | C |
|---|---|---|---|---|---|---|---|---|
| Average length of segments | | | | | | | | |
| All-α | 12.78 (±3.75) | 0.68 (±0.48) | 2.18 (±2.20) | 2.77 (±1.38) | 0.45 (±1.45) | 2.00 (±0.28) | 1.51 (±0.26) | 2.02 (±0.42) |
| All-β | 6.54 (±3.25) | 0.98 (±0.21) | 5.67 (±1.38) | 2.97 (±1.13) | 0.39 (±1.34) | 2.18 (±0.19) | 1.66 (±0.25) | 1.84 (±0.39) |
| α/β | 10.50 (±1.40) | 0.99 (±0.17) | 4.90 (±0.72) | 3.36 (±0.65) | 1.12 (±2.14) | 2.06 (±0.16) | 1.51 (±0.13) | 1.85 (±0.21) |
| α+β | 10.78 (±2.84) | 0.84 (±0.39) | 5.66 (±1.64) | 2.87 (±1.18) | 0.33 (±1.25) | 2.07 (±0.29) | 1.60 (±0.26) | 1.93 (±0.32) |
| Average number of segments | | | | | | | | |
| All-α | 6.82 (±4.46) | 1.15 (±1.81) | 1.60 (±2.93) | 1.52 (±1.70) | 0.013 (±0.090) | 8.01 (±6.40) | 7.72 (±6.48) | 12.79 (±10.31) |
| All-β | 1.95 (±1.96) | 2.46 (±1.76) | 10.43 (±5.42) | 1.53 (±1.21) | 0.00070 (±0.0036) | 8.69 (±4.90) | 10.56 (±6.24) | 19.51 (±9.82) |
| α/β | 9.00 (±4.71) | 2.95 (±2.36) | 9.41 (±3.91) | 3.11 (±2.04) | 0.021 (±0.075) | 14.80 (±7.40) | 15.35 (±7.91) | 26.99 (±12.86) |
| α+β | 4.52 (±2.81) | 2.06 (±2.07) | 6.87 (±3.92) | 1.77 (±1.53) | 0.0090 (±0.076) | 8.53 (±4.88) | 9.48 (±5.58) | 17.03 (±9.56) |
| Average total length of segments | | | | | | | | |
| All-α | 85.14 (±54.96) | 1.77 (±2.05) | 9.83 (±18.90) | 6.19 (±5.56) | 0.45 (±1.45) | 16.18 (±13.14) | 11.95 (±10.22) | 25.57 (±21.00) |
| All-β | 18.80 (±18.58) | 3.08 (±1.74) | 59.13 (±32.88) | 6.38 (±4.02) | 0.39 (±1.34) | 19.16 (±10.78) | 17.68 (±10.31) | 36.31 (±20.93) |
| α/β | 94.36 (±49.95) | 3.52 (±2.29) | 45.99 (±18.33) | 11.31 (±6.92) | 1.12 (±2.15) | 30.58 (±15.83) | 23.22 (±11.86) | 50.53 (±25.66) |
| α+β | 48.19 (±31.24) | 2.57 (±2.04) | 38.80 (±23.09) | 6.82 (±5.10) | 0.33 (±1.25) | 17.95 (±10.62) | 15.16 (±8.77) | 32.99 (±19.46) |

1 To whom correspondence should be addressed (email [email protected]).

Figure S1 Major protein classes of globular proteins grouped according to features of secondary structure

The DSSP program [1], which standardizes secondary structure assignment, was used to calculate the average number (A), average length (B) and average total length (C) of segments of secondary structure in a peptide chain. All PDB files in SCOP version 1.67 were included in the analysis (61175 peptide chains), and features were calculated from chains belonging to the same SCOP fold for all folds. Plots compared each feature of secondary structure with every other feature. The Figure shows only the comparison of average total length of α-helical and β-strand segments for the all-α, all-β, α/β and α+β classes of globular proteins. Averages are described in Supplementary Table S1.
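The three per-chain quantities named in the legend (number, average length and total length of segments) can be illustrated with a minimal sketch. It assumes that the DSSP assignment for one chain is already available as a string of one-letter codes, that a segment is a maximal run of identical codes, and that blank or "-" positions count as coil (C). The function name `segment_stats` and the toy string are hypothetical; the sketch is not the pipeline used for the Figure.

```python
from itertools import groupby

# One-letter DSSP codes listed in the legend of Table S1.
DSSP_CODES = "HBEGITSC"


def segment_stats(dssp_string):
    """Return {code: (number, average length, total length)} for one chain.

    Assumption: a segment is a maximal run of identical codes, and
    unassigned positions (blank or '-') are counted as coil ('C').
    """
    normalized = dssp_string.replace(" ", "C").replace("-", "C")
    run_lengths = {code: [] for code in DSSP_CODES}
    for code, run in groupby(normalized):
        if code in run_lengths:
            run_lengths[code].append(sum(1 for _ in run))
    stats = {}
    for code, runs in run_lengths.items():
        number = len(runs)
        total = sum(runs)
        average = total / number if number else 0.0
        stats[code] = (number, average, total)
    return stats


if __name__ == "__main__":
    # Hypothetical 19-residue chain: two helices and one strand.
    toy_chain = "CHHHHCCEEEECHHHHHHC"
    for code, (number, average, total) in segment_stats(toy_chain).items():
        print(code, number, round(average, 2), total)
```

Averaging these per-chain tuples over all chains assigned to one SCOP fold, and then over folds within a class, would give values of the kind reported in Table S1, although the exact averaging order used by the authors is not stated here.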

Figure S2 Universal phylogenomic trees of proteomes reconstructed from an analysis of protein domains at different architectural levels

Construction of these trees involved a structural census that assigns domain structure to sequences. Trees were obtained from a fold-usage distance-based analysis of the occurrence of 338 F (SCOP version 1.35) in eight [2] and 20 [3] genomes (A), and from a maximum parsimony analysis of the abundance of 507 F (SCOP version 1.59) in 32 genomes [4] (B) and 1259 FSFs (SCOP version 1.67) in 185 genomes [5] (C) respectively. In some cases in (C), terminal leaves are not labelled with organismal names as they would not be legible. Arrowheads indicate the location of the root when using polarized characters. Organism abbreviations: Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Aper, Aeropyrum pernix; Atha, Arabidopsis thaliana; Bbur, Borrelia burgdorferi; Bsub, Bacillus subtilis; Cace, Clostridium acetobutylicum; Cele, Caenorhabditis elegans; Cpne, Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Dmel, Drosophila melanogaster; Drad, Deinococcus radiodurans; Ecol, Escherichia coli; Halo, Halobacterium sp.; Hinf, Haemophilus influenzae; Hpyl, Helicobacter pylori; Mgen, Mycoplasma genitalium; Mjan, Methanococcus jannaschii; Mpne, Mycoplasma pneumoniae; Mthe, Methanobacterium thermoautotrophicum; Mtub, Mycobacterium tuberculosis; Ncra, Neurospora crassa; Phor, Pyrococcus horikoshii; Rpro, Rickettsia prowazekii; Saur, Staphylococcus aureus; Scer, Saccharomyces cerevisiae; Spom, Schizosaccharomyces pombe; Ssol, Sulfolobus solfataricus; Stok, Sulfolobus tokodaii; Syne, Synechocystis sp.; Taci, Thermoplasma acidophilum; Tmar, Thermotoga maritima; Tpal, Treponema pallidum.

Figure S3 Universal phylogenomic trees of architectures reconstructed from a genomic census of protein domain structure

Trees of domain architectures at F (A) and FSF (B and C) levels were reconstructed from a protein domain census in 32, 185 and 584 genomes respectively ([5,6], and M. Wang, unpublished work). In all cases, the census involved identifying domains using PSI-BLAST or advanced HMMs of structural recognition and using different versions of SCOP as reference. The three evolutionary epochs of the protein world are overlaid on the trees and are labelled with different shades (architectural diversification, light green; superkingdom specification, salmon; organismal diversification, yellow) and follow previous definitions [5]. Terminal leaves are not labelled since they would not be legible. Branches in red delimit the birth of architectures after the appearance of the first architecture unique to a superkingdom (broken line). The Venn diagrams show the occurrence of architectures in the three superkingdoms of life. Note the relative decrease in number of FSF architectures corresponding to the organismal specification epoch in the tree of (C) due to newly discovered FSFs described in the last release of SCOP.

REFERENCES

1 Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637

2 Gerstein, M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins 33, 518–534

3 Lin, J. and Gerstein, M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818

4 Caetano-Anolles, G. and Caetano-Anolles, D. (2003) An evolutionarily structured universe of protein architecture. Genome Res. 13, 1563–1571

5 Wang, M., Yafremava, L. S., Caetano-Anolles, D., Mittenthal, J. E. and Caetano-Anolles, G. (2007) Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17, 1572–1585

6 Wang, M., Boca, S. M., Kalelkar, R., Mittenthal, J. E. and Caetano-Anolles, G. (2006) A phylogenomic reconstruction of the protein world based on a genomic census of protein fold architecture. Complexity 12, 27–40

Figure S4 Evolution of biological function in the protein world

The evolutionary timeline shows the discovery of protein FSF architectures associated with different functional SUPERFAMILY subcategories in each superkingdom, with time measured by a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures (Supplementary Figure S3B). The numbers of architectures are given as percentages of the total.
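The "relative distance in nodes" used as a time axis can be sketched as follows. The sketch assumes a rooted tree stored as a parent-to-children dictionary, counts the branching events on the path from the root to each leaf, and divides by the largest such count so that values fall between 0 and 1. This normalization, the function name `node_distances` and the toy tree are assumptions made for illustration; the legend does not specify the exact calculation.

```python
def node_distances(children, root):
    """Map each leaf to (nodes from root) / (largest nodes from root).

    `children` maps a node name to the list of its child names; leaves
    are nodes with no entry (or an empty list). Normalizing by the
    deepest leaf is an assumption of this sketch.
    """
    depth = {root: 0}
    stack = [root]
    leaves = []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaves.append(node)
        for kid in kids:
            depth[kid] = depth[node] + 1
            stack.append(kid)
    deepest = max(depth[leaf] for leaf in leaves)
    if deepest == 0:
        # Degenerate tree consisting of the root only.
        return {leaf: 0.0 for leaf in leaves}
    return {leaf: depth[leaf] / deepest for leaf in leaves}


if __name__ == "__main__":
    # Hypothetical pectinate tree: A branches off first, C and D last.
    toy_tree = {"root": ["A", "n1"], "n1": ["B", "n2"], "n2": ["C", "D"]}
    print(node_distances(toy_tree, "root"))
```

On the toy tree, the architecture that branches off closest to the root receives the smallest value and the most recently derived architectures receive a value of 1, which matches the intent of a relative timeline running from the ancestral architecture to the present.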

