Investigation of terpene diversification across multiple sequenced plant genomes

Alexander M. Boutanaev a,1, Tessa Moses b, Jiachen Zi c, David R. Nelson d, Sam T. Mugford b, Reuben J. Peters c, and Anne Osbourn b,1

a Institute of Basic Biological Problems, Russian Academy of Sciences, Pushchino, Moscow Region, 142290 Russia; b Department of Metabolic Biology, John Innes Centre, Norwich NR4 7UH, United Kingdom; c Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011; and d Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Center, Memphis, TN 38163

Edited by Rodney B. Croteau, Washington State University, Pullman, WA, and approved November 17, 2014 (received for review October 11, 2014)

Plants produce an array of specialized metabolites, including chemicals that are important as medicines, flavors, fragrances, pigments and insecticides. The vast majority of this metabolic diversity is untapped. Here we take a systematic approach toward dissecting genetic components of plant specialized metabolism. Focusing on the terpenes, the largest class of plant natural products, we investigate the basis of terpene diversity through analysis of multiple sequenced plant genomes. The primary drivers of terpene diversification are terpenoid synthase (TS) “signature” enzymes (which generate scaffold diversity), and cytochromes P450 (CYPs), which modify and further diversify these scaffolds, so paving the way for further downstream modifications. Our systematic search of sequenced plant genomes for all TS and CYP genes reveals that distinct TS/CYP gene pairs are found together far more commonly than would be expected by chance, and that certain TS/CYP pairings predominate, providing signals for key events that are likely to have shaped terpene diversity. We recover TS/CYP gene pairs for previously characterized terpene metabolic gene clusters and demonstrate new functional pairing of TSs and CYPs within previously uncharacterized clusters. Unexpectedly, we find evidence for different mechanisms of pathway assembly in eudicots and monocots; in the former, microsyntenic blocks of TS/CYP gene pairs duplicate and provide templates for the evolution of new pathways, whereas in the latter, new pathways arise by mixing and matching of individual TS and CYP genes through dynamic genome rearrangements. This is, to our knowledge, the first documented observation of the unique pattern of TS and CYP assembly in eudicots and monocots.

terpenes | terpenoid synthases | cytochrome P450 | metabolic gene clusters | genome evolution

Significance

The terpenes are the largest class of plant natural products. This major class of compounds represents tremendous chemical diversity of which only a relatively small fraction has so far been accessed and used by industry. The primary drivers of terpene diversification are terpenoid synthases and cytochromes P450, which synthesize and modify terpene scaffolds. Here, focusing on these two gene families, we investigate terpene synthesis and evolution across 17 sequenced plant genomes. Our analyses shed light on the roots of terpene biosynthesis and diversification in plants. They also reveal that different genomic mechanisms of pathway assembly predominate in eudicots and monocots.

Author contributions: A.M.B., T.M., J.Z., S.T.M., R.J.P., and A.O. designed research; A.M.B., T.M., J.Z., and S.T.M. performed research; A.M.B., T.M., J.Z., D.R.N., S.T.M., R.J.P., and A.O. analyzed data; and A.M.B., T.M., J.Z., D.R.N., S.T.M., R.J.P., and A.O. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

1 To whom correspondence may be addressed. Email: [email protected] or boutanaev@mail.ru.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1419547112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1419547112 | PNAS | Published online December 10, 2014 | E81–E88 | PLANT BIOLOGY | PNAS PLUS

Plants produce a rich and diverse array of specialized metabolites (1, 2). These compounds have important ecological functions, providing protection against pests, diseases, UV-B damage and other environmental stresses, and serve as attractants for pollinators and seed dispersal agents. They are exploited by humans as pharmaceutics, agrochemicals, and in a wide variety of other industrial applications. Metabolic diversification in higher plants is likely to have been driven by the need to adapt and survive in different ecological niches (3, 4). Although a considerable proportion of the genes in higher plant genomes are predicted to encode enzymes with roles in metabolism (∼20% in Arabidopsis thaliana; ref. 5), most of these are as yet uncharacterized. The availability of a growing number of sequenced plant genomes now makes it possible to exploit knowledge extracted from multiple diverse species to take a more holistic approach toward understanding mechanisms of metabolic diversification in plants (1, 2).

The terpenes are the largest class of plant-derived natural products, with over 40,000 structures reported to date (6–8). As such they provide an excellent entrée for investigation of mechanisms of metabolic diversification. Terpenes range from simple flavor and fragrance compounds such as limonene and cymene to complex triterpenes, and have numerous potential applications across the food and beverage, pharmaceutical, cosmetic and agriculture industries. They include taxol (one of the most widely prescribed anticancer drugs) and artemisinin (the most potent antimalarial compound). This major class of compounds represents tremendous chemical diversity of which only a relatively small fraction has so far been accessed and used by industry (9). This is because the biosynthetic pathways for the vast majority of these compounds are unknown due to the challenges associated with mining large and complex genomes and establishing the function of genes implicated in specialized metabolism. Many of these genes are divergent members of multigene families, making the delineation of new metabolic pathways extremely difficult (10–13).

The primary drivers of terpene diversification are the terpenoid synthase (TS) “signature” enzymes (which generate scaffold diversity), and the cytochrome P450-dependent monooxygenases (CYPs), which modify and further diversify these scaffolds, also paving the way for subsequent downstream modifications (10–15). TSs are defined as the related superfamily of biosynthetic enzymes involved in construction of the basic backbone structure of terpene natural products (16). As such, this includes the trans-isoprenyl diphosphate synthases and squalene synthases (SSs) that form the basic linear chains, as well as terpene synthases (TPSs) and triterpene cyclases (TTCs) that cyclize and rearrange these (16). Our knowledge of how the genes for terpene biosynthetic pathways are organized in plant genomes is limited, because the genomes of plants that produce some of the best characterized terpenoids (e.g., artemisinin and taxol) have not
yet been sequenced. However, in a number of cases the genes for terpene biosynthetic pathways have been shown to be organized as metabolic gene clusters (14, 17). These include two diterpene clusters from Oryza sativa (rice) [the momilactone and phytocassane clusters (18, 19)], three triterpene clusters [the thalianol and marneral clusters from A. thaliana (20, 21), the avenacin cluster from Avena strigosa (oat) (22, 23)], and clusters for steroidal glycoalkaloids and other terpenes in the Solanaceae (24, 25). Potential new clusters implicated in terpene synthesis have also been reported in A. thaliana (20, 26–28) and cucumber (29). The available evidence indicates that the characterized clusters have arisen within recent evolutionary history by gene duplication, acquisition of new function and genome reorganization, and that they are not products of horizontal gene transfer from microbes (reviewed in refs. 14, 17, and 30). Clustering has also been shown for other classes of plant natural products and is likely to facilitate coinheritance of beneficial gene combinations and also regulation at the level of chromatin (14, 17, 30–32).

TSs and CYPs are the core components of terpene biosynthetic pathways and together are responsible for the generation of a vast array of diverse terpene structures (10–13, 15, 33). Here we have selected these two enzyme superfamilies as markers to investigate the foundations of terpene synthesis and evolution across 17 sequenced plant genomes. Our analyses shed light on the roots of terpene biosynthesis and diversification in plants. They also reveal that different genomic mechanisms of pathway assembly predominate in eudicots and monocots.

Results

Mining the Terpenome of Sequenced Plant Genomes. During the course of this project, assembled genome sequences were available for nine eudicot species (A. thaliana, Brassica rapa, Glycine max, Lotus japonicus, Medicago truncatula, Populus trichocarpa, Solanum lycopersicum, Solanum tuberosum, and Vitis vinifera) (5, 34–41) and four monocot species (Brachypodium distachyon, O. sativa, Sorghum bicolor, and Zea mays) (42–45). In addition, three dicot genomes (Cucumis sativus, Mimulus guttatus, and Ricinus communis) and the monocot Setaria italica genome were available as scaffolds (29, 46–48). Detailed genomic and functional annotations were available for A. thaliana, M. truncatula, O. sativa, P. trichocarpa, and S. bicolor. To investigate the occurrence and distribution of genes for terpene biosynthesis, all TS and CYP genes were identified based on functional annotation (mostly automated, in the case of the newer genomes) and the physical distance between genes computed using gene coordinates. Where suitable annotation was not available (as for B. distachyon, B. rapa, C. sativus, G. max, L. japonicus, M. guttatus, R. communis, S. italica, S. lycopersicum, S. tuberosum, V. vinifera, and Z. mays), we used a stand-alone BLAST database built from genomic sequences, the tBLASTn search engine, and 1,000 plant protein sequences for each of the TS and CYP protein superfamilies (obtained from the UniProt database) (Materials and Methods). The protein sequences were used to search the nucleotide database (e = 0.001), the BLAST output file was parsed, and all necessary information was extracted where the score parameter was ≥100. The output contained the coordinates of alignment of each protein sequence mapped to corresponding exons of predicted genes in each genome. Overlapping coordinates were joined to give possible exons. The predicted TS and CYP gene coordinates were then calculated for each genome. Although less sophisticated, this procedure is similar to conventional gene annotation using known amino acid sequences corresponding to homologous genes. We then searched for TS/CYP gene pairs located within 30-, 50-, 100-, 150-, and 200-kb regions in each genome.

Table 1 summarizes the results for all genomes investigated.
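The coordinate-joining step in the search procedure above (keep hits with score ≥100, then join overlapping alignment coordinates into candidate exons) can be sketched as follows. This is a minimal illustration only, not the authors' script: the function name `merge_hits`, the tuple layout `(seq_id, start, end, score)`, and the example coordinates are all hypothetical, and the sketch assumes that hits on the reverse strand may be reported with start greater than end.

```python
from collections import defaultdict

def merge_hits(hits, min_score=100.0):
    """Join overlapping hit coordinates per sequence into candidate exons.

    hits: iterable of (seq_id, start, end, score) tuples (hypothetical layout).
    Hits scoring below min_score are discarded before merging.
    """
    by_seq = defaultdict(list)
    for seq_id, start, end, score in hits:
        if score >= min_score:
            # Normalise so that start <= end regardless of strand.
            by_seq[seq_id].append((min(start, end), max(start, end)))

    merged = {}
    for seq_id, spans in by_seq.items():
        spans.sort()
        joined = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= joined[-1][1]:
                # Overlaps the previous span: extend it.
                joined[-1][1] = max(joined[-1][1], end)
            else:
                joined.append([start, end])
        merged[seq_id] = [tuple(span) for span in joined]
    return merged

# Made-up coordinates, for illustration only.
example_hits = [
    ("chr1", 100, 250, 180.0),
    ("chr1", 200, 400, 150.0),   # overlaps the first hit
    ("chr1", 900, 1000, 120.0),
    ("chr1", 5000, 5100, 40.0),  # below the score cutoff, dropped
    ("chr2", 700, 500, 300.0),   # reverse-strand style coordinates
]
exons = merge_hits(example_hits)
```

Grouping the resulting exons into gene models would need a further rule (for example a maximum intron length), which the text does not specify and which is therefore left out of the sketch.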

We considered two genes to be associated if the distance between them was ≤50 kb. Increasing the window from 30 kb to 50 kb either had no effect or resulted in only a small increase in the number of TS/CYP gene pairs detected (Table 1). With distances of >50 kb, more TS/CYP gene pairs were detected but intervening genes with no obvious predicted functions in terpene biosynthesis were also included. Although stringent, a 50-kb

Table 1. Comparison of observed and random distributions of clustered TS/CYP gene pairs

Columns give the maximal TS-CYP distance (kb), with observed (Obs) and random (Rand) numbers of gene pairs for each distance.

Species          30 kb              50 kb              100 kb             150 kb             200 kb             P value    Genome size, Mb
                 Obs  Rand          Obs  Rand          Obs  Rand          Obs  Rand          Obs  Rand
A. thaliana      13   3.1 ± 0.1     13   4.9 ± 0.1     13   9.3 ± 0.1     15   13.0 ± 0.2    17   16.4 ± 0.2    1.7e-9     119
B. rapa          9    1.8 ± 0.1     9    3.1 ± 0.1     13   6.1 ± 0.2     15   8.9 ± 0.2     16   11.4 ± 0.2    1.2e-13    284
C. sativus*      2    0.9 ± 0.1     2    1.5 ± 0.1     2    2.9 ± 0.1     3    4.1 ± 0.1     3    5.4 ± 0.1     0.53       203
G. max           2    0.3 ± 0.04    2    0.5 ± 0.1     3    1.0 ± 0.1     3    1.4 ± 0.1     3    1.8 ± 0.1     3.5e-4     973
L. japonicus     0    0.5 ± 0.1     0    1.0 ± 0.1     1    2.0 ± 0.1     3    2.9 ± 0.1     4    3.9 ± 0.1     0.73       500
M. guttatus*     13   0.7 ± 0.1     14   1.1 ± 0.1     16   2.2 ± 0.1     17   3.1 ± 0.1     18   4.0 ± 0.1     5.0e-121   321
M. truncatula    7    1.2 ± 0.1     7    1.9 ± 0.1     8    3.7 ± 0.1     8    5.3 ± 0.2     11   6.8 ± 0.2     1.9e-10    307
P. trichocarpa   7    1.2 ± 0.1     8    2.1 ± 0.1     10   4.3 ± 0.1     13   6.2 ± 0.2     13   8.1 ± 0.2     10.0e-9    417
R. communis*     7    0.4 ± 0.1     9    0.7 ± 0.1     11   1.5 ± 0.1     13   2.1 ± 0.1     14   2.8 ± 0.1     2.3e-67    324
S. lycopersicum  6    0.4 ± 0.1     7    0.7 ± 0.1     12   1.5 ± 0.1     13   2.1 ± 0.1     14   2.8 ± 0.1     7.6e-66    900
S. tuberosum     4    0.9 ± 0.1     7    1.5 ± 0.1     10   2.8 ± 0.1     14   4.2 ± 0.1     16   5.5 ± 0.1     4.3e-19    706
V. vinifera      4    1.6 ± 0.1     4    2.7 ± 0.1     7    5.2 ± 0.1     8    7.4 ± 0.1     9    9.9 ± 0.1     0.28       486
B. distachyon    10   0.5 ± 0.1     10   0.9 ± 0.1     11   2.0 ± 0.1     11   3.1 ± 0.1     14   4.1 ± 0.1     5.2e-76    271
O. sativa        7    1.3 ± 0.1     9    2.1 ± 0.1     11   4.0 ± 0.1     11   5.7 ± 0.2     13   7.5 ± 0.2     1.6e-11    374
S. bicolor       9    0.4 ± 0.04    12   0.7 ± 0.1     15   1.3 ± 0.1     16   2.0 ± 0.1     18   2.7 ± 0.1     2.1e-149   699
S. italica*      4    1.0 ± 0.1     4    1.6 ± 0.1     5    3.2 ± 0.1     6    4.7 ± 0.1     6    6.0 ± 0.2     7.3e-3     405
Z. mays          1    0.2 ± 0.03    3    0.4 ± 0.04    3    0.8 ± 0.1     6    1.1 ± 0.1     7    1.5 ± 0.1     5.6e-14    2065

Numbers of TS/CYP gene pairs located at different distances from each other are shown. TS, terpene synthase; CYP, cytochrome P450-dependent monooxygenase. Random numbers are presented as the average of 200 computer simulations for each genome. Genomes which were available only in scaffolds are marked by asterisks. Genome size numbers were taken from the Plant Genome Database or from the Phytozome (tomato) resources (www.plantgdb.org and www.phytozome.net, respectively). P values were obtained by comparing the observed versus expected values for each species across all five intervals using χ2 tests.
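As a worked illustration of the χ2 comparison described in the table note, the sketch below sums (observed − random)²/random over the five distance intervals for one species and converts the statistic to a P value. It assumes 4 degrees of freedom (five intervals minus one), which the table note does not state; under that assumption the survival function has the closed form exp(−x/2)(1 + x/2), so no statistics library is needed. The function names are hypothetical. With the A. thaliana row the result comes out close to the tabulated 1.7e-9.

```python
import math

def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of freedom.

    For even degrees of freedom the survival function has a closed form;
    for df = 4 it is exp(-x/2) * (1 + x/2).
    """
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def chi2_observed_vs_random(observed, expected):
    """Return (statistic, P value) for observed vs. expected pair counts."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf_df4(stat)

# A. thaliana row of Table 1: 30-, 50-, 100-, 150-, and 200-kb windows.
observed = [13, 13, 13, 15, 17]
expected = [3.1, 4.9, 9.3, 13.0, 16.4]
stat, p_value = chi2_observed_vs_random(observed, expected)
print(f"chi2 = {stat:.2f}, P = {p_value:.1e}")

# C. sativus row, one of the three species with no significant association.
stat_cs, p_cs = chi2_observed_vs_random([2, 2, 2, 3, 3], [0.9, 1.5, 2.9, 4.1, 5.4])
```

The five windows are nested rather than independent, so this should be read as a reconstruction of the arithmetic only, not as a statement about how the test ought to be designed.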


window was therefore considered optimal for our purposes. Comparison of the observed frequencies of TS/CYP gene pairs with those expected by chance at the whole genome level indicated nonrandom association of TS and CYP genes for most of the genomes under investigation (P = 0.007 for S. italica and P < 0.001 for all other species except C. sativus, L. japonicus, and V. vinifera: P = 0.53, 0.73 and 0.28, respectively) (Table 1). The accuracy and completeness of these predictions will, of course, depend on the quality of the assembled genome sequence in question. As sequence quality improves it is likely that more TS/CYP gene pairs will be found, including in C. sativus, L. japonicus, and V. vinifera. Additionally, as might be expected, the numbers of TS/CYP gene pairs decreased with increasing genome size (correlation coefficient, r = −0.63).
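The windowed pair search and the chance expectation it is compared against can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: each gene is reduced to a single coordinate, every TS/CYP combination within the window is counted as one pair, and the random model places genes uniformly on one linear sequence of the given length. The names `count_pairs` and `random_expectation`, and the toy coordinates, are hypothetical; only the window sizes and the figure of 200 simulations come from the text and Table 1.

```python
import bisect
import random

def count_pairs(ts_positions, cyp_positions, window):
    """Count TS/CYP combinations whose coordinates lie within `window` bp."""
    cyp_sorted = sorted(cyp_positions)
    pairs = 0
    for ts in ts_positions:
        # Index range of CYP coordinates inside [ts - window, ts + window].
        lo = bisect.bisect_left(cyp_sorted, ts - window)
        hi = bisect.bisect_right(cyp_sorted, ts + window)
        pairs += hi - lo
    return pairs

def random_expectation(n_ts, n_cyp, genome_length, window, n_sim=200, seed=0):
    """Average pair count over n_sim uniform random placements of the genes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sim):
        ts = [rng.randrange(genome_length) for _ in range(n_ts)]
        cyp = [rng.randrange(genome_length) for _ in range(n_cyp)]
        total += count_pairs(ts, cyp, window)
    return total / n_sim

# Toy coordinates (bp) on a single sequence, for illustration only.
ts_genes = [10_000, 500_000]
cyp_genes = [40_000, 70_000, 900_000]
pairs_50kb = count_pairs(ts_genes, cyp_genes, 50_000)
pairs_100kb = count_pairs(ts_genes, cyp_genes, 100_000)
expected_50kb = random_expectation(len(ts_genes), len(cyp_genes), 1_000_000, 50_000)
```

A real analysis would work per chromosome or scaffold and would have to decide how to treat tandem arrays, where one TS gene may sit near several CYP genes; the uniform single-sequence model above ignores both points.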

Analysis of TS/CYP Gene Pairs. We next carried out a systematic investigation of the sequence similarity of all 120 TS/CYP pairs (≤50 kb; Table 1) to establish whether these were likely to have arisen independently or to share a common origin. We obtained the coding/mRNA sequences for the TS/CYP genes from the NCBI or Phytozome resources according to the gene IDs in SI Appendix, Table S1. The TS genes were classified into the terpene synthase (TPS) family, and into more specific subfamilies as defined (10), or as triterpene cyclases (TTCs), or squalene synthases (SSs). The CYP genes were classified first into clans and then into their constituent families and subfamilies following established convention (49, 50). The complete list of gene pairs can be found in SI Appendix, Table S1. Importantly, we recovered TS/CYP gene pairs for the four previously characterized terpene metabolic gene clusters from A. thaliana and rice (18–21) and also for other previously reported candidate clusters. We also identified previously unknown candidate terpene clusters (SI Appendix, Tables S1 and S2).

A summary of all TS/CYP combinations found is presented in Fig. 1. Representation of genes for the different CYP families in TS/CYP gene pairs differed from that expected by chance. This difference was evident from comparison of the frequency with which these CYP family genes occurred within the TS/CYP gene pairs reported here with the combined full genome complements of assigned CYP genes available for a total of 13 sequenced eudicot and monocot genomes (from the Cytochrome P450 website; ref. 50) (SI Appendix, Table S3). TPS genes were predominantly found in combination with CYP71 clan genes belonging to the CYP71 family in both eudicots and monocots (Fig. 1). However, pairings of TPS genes with members of other CYP families belonging to the CYP71 clan, and with members of other CYP clans were also observed. In eudicots, the TTC genes were found predominantly in pairings with the CYP705 family (CYP71 clan) and CYP716 family (CYP85 clan) genes. Single examples of TTC genes in association with CYP71 clan genes (one each for the CYP71 and CYP99 families) were detected in monocots, along with examples of pairings of TPS and TTC genes with CYP51H genes.

Among the TPS/CYP pairings, there are a significant number of pairs containing TPS-a and CYP71 family members. To verify the functional nature of these pairs, we investigated the pairs found in castor bean (Ricinus communis). Previous work has demonstrated that three of these TPS-a subfamily members are diterpene synthases that react with the general diterpenoid precursor (E,E,E)-geranylgeranyl diphosphate (GGPP), with two producing casbene and the third neo-cembrene (51). Here we found that these are part of a larger gene cluster that also contains another TPS (from the TPS-g subfamily), along with eight CYPs (two from the CYP80C and six from the CYP726A subfamilies; note that the CYP726A subfamily actually falls within the CYP71D subfamily, with the different nomenclature a historical relic), as well as two short-chain alcohol dehydrogenases and an acyl-transferase, suggesting complex biosynthetic functional and evolutionary history (SI Appendix, Table S2). Functional analysis was carried out using a modular metabolic engineering approach, as enabled by the use of CYP genes codon-optimized for expression in E. coli (52). Notably, via this approach it was found that CYP726A17 will hydroxylate the casbene product of the paired TPS-a subfamily member (SI Appendix, Fig. S1A). Moreover, the closely related CYP726A14 also found in this larger cluster carries out not only the same

Fig. 1. Summary of TS/CYP combinations found in sequenced eudicot and monocot genomes. The TS genes are classified as either the terpene synthase (TPS) family and, hence, more specific subfamilies as defined (10), or as triterpene cyclase (TTC) or squalene synthase (SS). The CYP genes are classified into clans and then their constituent families (49, 50). The most abundant pairings are highlighted in red. In eudicots TPS genes were found predominantly in combination with CYP71 clan genes, and within this clan most notably with members of the CYP71 family. In monocots, TPS genes were also found predominantly in association with CYP71 clan genes, primarily from the CYP71 and CYP99 families. TTC genes were found predominantly in pairings with CYP71 clan (CYP705 family) and CYP85 clan (CYP716 family) genes in eudicots. In monocots only three TTC/CYP gene pairs were found. Full sequences can be found in Datasets S3 and S4.


hydroxylation, but further reactions as well (SI Appendix, Fig. S1 B–F). NMR analysis of the CYP726A14 products demonstrated that these corresponded to 5-hydroxy-casbene, which was subsequently oxidized to 5-keto-casbene, with further hydroxylation of this to 5-keto-6-hydroxy-casbene (Fig. 2 and SI Appendix, Fig. S1G and Tables S4–S6). Notably, the production of 5-keto-6-hydroxy-casbene is consistent with the presence of matching oxy groups in a previously isolated castor bean diterpenoid (53). Production of 5-keto-casbene by these CYPs and further investigation of this terpenoid biosynthetic gene cluster also has just been reported elsewhere (54). The ability of CYP71D subfamily members to operate in diterpenoid biosynthesis is consistent with recent work demonstrating functional pairing of CYP71D51 with a diterpene TPS (albeit from the TPS-e/f subfamily) from a tomato (Solanum lycopersicum) terpenoid biosynthetic gene cluster (55). We also selected a TTC/CYP716 pair from A. thaliana and a TTC/CYP81 pair from cucumber (Cucumis sativus) (pairs 13 and 23; SI Appendix, Table S1) for further analysis. These pairs also lie within larger candidate terpene metabolic clusters (Fig. 2 and SI Appendix, Table S2). We used a combination of genome browsing, comparative expression profiling and heterologous expression in Saccharomyces cerevisiae and Nicotiana benthamiana to further define and investigate these regions (SI Appendix, Fig. S2). These experiments demonstrate that coexpression of a TTC/CYP716

[Fig. 2 image, panels A–H: gene cluster maps for Ricinus communis (A), Arabidopsis thaliana chromosome 5 (C), and Cucumis sativus chromosome 6 (D); reaction scheme from casbene to 5-hydroxy-casbene, 5-keto-casbene, and 5-keto-6-hydroxy-casbene (B); RT-PCR panels by tissue (E, G); GC-MS traces, total ion current against retention time in min (F, H).]

Fig. 2. Functional analysis of R. communis TPS-a/CYP726, A. thaliana TTC/CYP716, and C. sativus TTC/CYP81 gene pairs. (A) Gene components of the R. communis diterpenoid cluster region consisting of 16 genes. AT, acyl-transferase; CS, casbene synthase; NCS, neo-cembrene synthase; SDR, short-chain alcohol dehydrogenase/reductase. (B) Reactions catalyzed by CYP726A14 (or CYP726A17) with casbene and potential role in castor bean diterpenoid biosynthesis. (C and D) Gene components of the A. thaliana PEN3 cluster region on chromosome 5 (C) and C. sativus Csa008595 cluster region on chromosome 6 (D). The TTC/CYP716 and TTC/CYP81 gene pairs identified in our bioinformatics analysis are underlined. Arrows boxed in: red, candidate cluster genes; black, gene adjacent to cluster. (E and G) Qualitative RT-PCR analysis of candidate cluster genes, gene adjacent to the cluster and two housekeeping genes in different tissues of A. thaliana (E) and C. sativus (G). Candidate genes showing tissue-specific coexpression are highlighted in red. Gene pairs identified in our bioinformatics analysis are underlined. (F and H) GC-MS analysis of N. benthamiana leaf extracts transiently expressing AT5G36150 with (red) or without (blue) AT5G36110 (F) or Csa008595 with (red) or without (blue) Csa008597 (H), and an untransformed leaf control (gray). Peaks unique to each chromatogram are indicated with arrows. EIC, extracted ion chromatogram. EI-MS corresponding to the peaks are given in the SI Appendix, Fig. S2 D and I. Structures of the TTC cyclization product are given at the bottom.

E84 | www.pnas.org/cgi/doi/10.1073/pnas.1419547112 Boutanaev et al.


pair from A. thaliana results in synthesis and hydroxylation of tirucalla-7,24-dien-3β-ol, whereas a TTC/CYP81 pair from C. sativus is able to synthesize cucurbitadienol and modify it to as yet unknown products. Altogether we have confirmed functional coupling of three previously unidentified TS/CYP combinations and have partially characterized their involvement in previously unknown terpene biosynthetic clusters.

Different Mechanisms of Pathway Assembly Predominate in Eudicots and Monocots. The predominance of particular TS/CYP pairings suggests that either these gene pairs may have evolved from a common ancestral pairing and/or that strong selection has driven the repeated assembly of such gene pairs independently through convergent evolution. To distinguish between these two possibilities, we performed pairwise alignments of all TS coding sequences and (separately) their corresponding CYP coding sequences for the most abundant TS/CYP combinations. We then carried out correlation analysis of the TS/TS and CYP/CYP identity values between different TS/CYP pairs (with the exception of pairs with pseudogene members). Fig. 3A shows data for the frequent eudicot gene pair groupings consisting of TPS/CYP71 clan (circles), TTC/CYP71 clan (triangles), TTC/CYP85 clan (crosses), and TPS/CYP72 clan (squares). This plot demonstrates strong mutual proportional dependency of the two variables (TS/TS vs. CYP/CYP identities) (correlation coefficient, r = 0.68). TPS/CYP71 clan pairs had lower identity values than TTC/CYP85 and TPS/CYP72 clan pairs, whereas TTC/CYP71 clan (CYP705 family) pairs had highest identities among the four combinations, most likely reflecting the evolutionary age of different TS/CYP associations. This finding provides support for a scenario in which many of the eudicot TS/CYP gene pairs have arisen from a common ancestral gene pair by duplication. Thus, much of the diversity in terpene synthesis within the eudicots is likely to have been built on a foundation of these four simple building blocks. This is consistent with a previously proposed model for the evolution of the thalianol and marneral clusters in A. thaliana, in which we postulated that these two triterpene clusters could have been founded by duplication of a common ancestral TTC/CYP71 clan (CYP705 family) gene pair, followed by independent recruitment of additional gene cluster members (20). Five TPS/CYP71 clan gene pairs from A. thaliana, M. guttatus, and R. communis (for which the CYP71 families were CYP705, CYP80, CYP71, CYP81 and CYP89; highlighted in SI Appendix, Table S1) did not follow the trend shown in Fig. 3A (Fig. 3B). The same is true for the TTC/CYP98 (71 clan) pair from G. max (Fig. 3B). These gene combinations may be located in close proximity by chance, or alternatively may represent gene combinations that have arisen independently.

A similar analysis was applied to the identity values obtained by pairwise alignment of TS and CYP sequences from the monocots. In these genomes the majority of TS/CYP pairs also consisted of a TPS gene and a CYP71 clan gene (SI Appendix, Table S1 and Fig. 1). Surprisingly, there was no correlation between TPS/TPS and CYP/CYP (71 clan) identity values among monocot species (Fig. 3C). Similarly, comparisons between all eudicot and all monocot TPS/CYP71 clan gene pairs failed to reveal any correlation (Fig. 3D). The absence of correlations between TS/CYP gene pairs in monocots suggests that either these gene pairs are present by chance and do not share functional connectivity (although this is unlikely because TS/CYP gene pairs are nonrandom in monocots, as in eudicots; Table 1), or that they may have arisen independently as a consequence of dynamic genome rearrangements. There is good evidence that at least some of these gene pairs are likely to form parts of functional terpene pathways because they include genes for the two characterized rice diterpene clusters (SI Appendix, Table S1; refs. 18 and 19). Furthermore, we previously reported that the avenacin cluster for the synthesis of antimicrobial triterpenes in diploid oat contains a TTC gene immediately adjacent to a CYP51H gene, these two genes encoding the enzymes for the first and second steps in the avenacin pathway, respectively (56). Importantly, this cluster has arisen in recent evolutionary time, since the divergence of oats from

[Fig. 3 plots not reproduced. Axes in every panel: TS/TS identity (%) and CYP/CYP identity (%), each with ticks from 40 to 80. Panel titles: (A) Dicots (r = 0.68); (B) Dicots (no correlation, r = -0.08); (C) Monocots (r = 0.14); (D) Dicots vs. monocots (r = 0.11).]

Fig. 3. Correlation analysis of the relatedness of multiple TS/CYP gene pairs from different plant genomes. (A) Correlated TS/CYP gene pairs from eudicot species. Each data point represents a comparison of two TS/CYP pairs based on their TS/TS and CYP/CYP identities. Circles indicate gene pairs consisting of terpene synthases (TPSs) coupled with CYPs belonging to the CYP71 clan; crosses indicate pairs of triterpene cyclases (TTCs) coupled with CYPs belonging to the CYP716 family (CYP85 clan); triangles indicate pairs of TTCs coupled with CYPs belonging to the CYP705 and CYP76 families of the CYP71 clan; squares indicate pairs of TPS coupled with the CYP72 clan members. The high correlation coefficient (r) suggests that these gene pairs are likely to have evolved from a common ancestral pairing. (B) Noncorrelated TPS/CYP gene pairs from eudicots (highlighted in SI Appendix, Table S1). The circles and triangles indicate TPS/CYP71 clan and TTC/CYP71 clan gene pairs, respectively. Out of all eudicot TPS/CYP71 clan nonpseudogene pairs analyzed, only five did not follow the trend shown in (A) (all five shown in B). Similarly, out of eight TTC/CYP71 clan pairs (seven of them correlated, as shown in A), only one pair was not correlated with other such pairs, as shown in B. (C) TPS/TPS and CYP/CYP (71 clan) identity values for monocot gene pairs are not correlated. (D) Similarly, comparison between all eudicot and all monocot TPS/CYP71 clan gene pairs failed to reveal any significant correlation.
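The comparison plotted in Fig. 3 can be illustrated with a short Python sketch. It is not the authors' software: the function names (`percent_identity`, `pearson_r`, `identity_points`), the gene names and the identity values in the toy table are all hypothetical, and the global alignment step (ClustalW2 in the paper) is assumed to have been done already. Each data point compares two TS/CYP pairs, taking the TS/TS identity as one coordinate and the CYP/CYP identity as the other; the correlation coefficient r is then computed over all points.

```python
from itertools import combinations
from math import sqrt


def percent_identity(aligned_a, aligned_b):
    """Percent identity of two globally aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns in which both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if (a, b) != ("-", "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def identity_points(identity, gene_pairs):
    """One (TS/TS identity, CYP/CYP identity) point per comparison of two gene pairs.

    identity: callable(name_a, name_b) -> percent identity.
    gene_pairs: list of (ts_name, cyp_name) tuples.
    """
    return [
        (identity(ts1, ts2), identity(cyp1, cyp2))
        for (ts1, cyp1), (ts2, cyp2) in combinations(gene_pairs, 2)
    ]


# Hypothetical identity values (percent); these are NOT data from the paper.
toy_identity = {
    frozenset(("TS1", "TS2")): 80.0,
    frozenset(("TS1", "TS3")): 50.0,
    frozenset(("TS2", "TS3")): 55.0,
    frozenset(("CYP1", "CYP2")): 75.0,
    frozenset(("CYP1", "CYP3")): 45.0,
    frozenset(("CYP2", "CYP3")): 50.0,
}


def lookup(a, b):
    return toy_identity[frozenset((a, b))]


pairs = [("TS1", "CYP1"), ("TS2", "CYP2"), ("TS3", "CYP3")]
points = identity_points(lookup, pairs)
r = pearson_r([p[0] for p in points], [p[1] for p in points])
print(points, round(r, 3))
```

In practice `lookup` would be replaced by a function that aligns two coding sequences and passes the aligned strings to `percent_identity`. With n gene pairs the analysis yields n(n-1)/2 points, which is why a modest number of pairs produces a dense scatter plot.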

Boutanaev et al. PNAS | Published online December 10, 2014 | E85


other cereals and grasses, by gene duplication, neofunctionalization and relocation (22, 23, 56).

In filamentous fungi, genes for specialized metabolic pathways are commonly organized in clusters that are located close to the ends of chromosomes (57). These regions are highly dynamic and can be regarded as “evolutionary playgrounds” that favor nonallelic recombination, DNA inversion, partial deletions, translocations and other genomic rearrangements. Previously we have shown that the oat avenacin cluster, like the fungal clusters, is subtelomeric (30). In contrast, the A. thaliana thalianol and marneral clusters are not positioned near the ends of chromosomes but are located within dynamic chromosomal regions that are significantly enriched in transposable elements (TEs) (20). We therefore investigated the chromosomal locations of the identified TS/CYP gene pairs and the distribution of TEs in the vicinity of clustered and dispersed TS and CYP genes. In monocots, the TS/CYP gene pairs were found predominantly toward the ends of the chromosomes (Fig. 4A), as for the oat avenacin cluster (30). This, coupled with the lack of correlation in sequence similarity between TS/CYP gene pairs in monocots (Fig. 3C), is consistent with a “mix-and-match” model for terpene pathway assembly in monocots. The two rice diterpene clusters are also believed to have assembled independently of each other (58). The flanking regions of the monocot TS/CYP gene pairs were significantly enriched in TEs relative to the norm for all TS and CYP genes across the sequenced grass genomes (Fig. 4B), suggesting that TE-mediated recombination may provide a mechanism for relocation of TS and CYP genes into proximity. The situation was different in eudicots, where the TS/CYP gene pairs did not show an overall trend toward subtelomeric localization (Fig. 4A). Interestingly, however, there were differences within the eudicots, the Brassicaceae tending toward pericentromeric locations (SI Appendix, Fig. S3). Our previous analysis of the A. thaliana thalianol and marneral clusters indicated that these clusters are in TE-rich regions of the genome, based on analysis of 100-kb windows (20); in the current analysis the immediate flanking regions of the TS and CYP genes (20–50 kb) showed a small, but significant enrichment in TEs when clustered and dispersed genes were compared (Fig. 4B). Altogether, our analyses suggest that different mechanisms of terpene pathway assembly and diversification are favored in eudicots and monocots. TS/CYP microsynteny is more likely to be preserved in eudicots, providing a foundation for evolution of new pathways derived from a common ancestral template, while mixing and matching of novel TS/CYP gene combinations appears to be the norm in monocots, at least for the grasses considered here. TE-mediated recombination may contribute to cluster formation in both cases.
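The two measurements discussed above (position of gene pairs along the chromosome and TE density in gene flanks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the choice of 10 intervals, the use of TE midpoints, and all coordinates in the example are hypothetical. Chromosome ends are also not clipped on the right flank, which a real analysis would need to handle.

```python
from statistics import median


def fractional_bin(position, chrom_length, n_bins=10):
    """Index (0 .. n_bins-1) of the chromosome-length interval containing position."""
    if chrom_length <= 0 or not 0 <= position <= chrom_length:
        raise ValueError("position must lie within the chromosome")
    # A position exactly at the chromosome end is placed in the last interval.
    return min(int(n_bins * position / chrom_length), n_bins - 1)


def bin_counts(locations, n_bins=10):
    """Histogram of gene-pair locations, each given as (position, chromosome_length)."""
    counts = [0] * n_bins
    for position, chrom_length in locations:
        counts[fractional_bin(position, chrom_length, n_bins)] += 1
    return counts


def te_per_kb(gene_start, gene_end, te_midpoints, flank_kb):
    """TEs per kb in the two flanks of width flank_kb on either side of a gene."""
    flank = flank_kb * 1000
    left_lo, left_hi = max(0, gene_start - flank), gene_start
    right_lo, right_hi = gene_end, gene_end + flank
    n_te = sum(
        1
        for m in te_midpoints
        if left_lo <= m < left_hi or right_lo < m <= right_hi
    )
    total_kb = ((left_hi - left_lo) + (right_hi - right_lo)) / 1000
    return n_te / total_kb


# Hypothetical coordinates, for illustration only.
pair_locations = [(50, 1000), (950, 1000), (1000, 1000), (500, 1000)]
print(bin_counts(pair_locations))

genes = [(100_000, 105_000), (300_000, 302_000)]
tes = [95_000, 99_999, 104_000, 110_000, 130_000]
densities = [te_per_kb(start, end, tes, flank_kb=10) for start, end in genes]
print(median(densities))
```

A subtelomeric bias would show up as excess counts in the first and last intervals returned by `bin_counts`, and TE enrichment as a higher median TE/kb for clustered genes than for the full set of TS and CYP genes.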

Discussion

Here we have mined the sequenced genomes of diverse eudicot and monocot plant species, identified and annotated all predicted TS and CYP genes, and defined their gene coordinates. Analysis of the distribution of these genes revealed striking nonrandom association of TS/CYP gene pairs for most of the species examined (Table 1). We recovered TS/CYP gene pairs for the four characterized terpene clusters from A. thaliana and rice and also for other previously reported candidate terpene clusters (SI Appendix, Table S1). We also identified several other previously unknown candidate clusters (SI Appendix, Table S2). Representation of CYP genes within these TS/CYP gene pairs was skewed toward particular CYP families and was not simply a reflection of the relative overall abundance of different types of CYP family genes within plant genomes (Fig. 1 and SI Appendix, Table S3). TPS genes were found primarily in combination with CYP71 family genes in both eudicots and monocots, but also with genes for other CYP71 clan members (e.g., CYP99, CYP76 and CYP81) and other CYP clans. TTC genes were found predominantly with CYP71 clan (CYP705 family) and CYP85 clan (CYP716 family) genes in eudicots. In monocots fewer TTC/CYP gene pairs were found; these included examples of pairings with genes belonging to the CYP71 and CYP99 (CYP71 clan) families and the CYP51H subfamily (CYP51 clan). These pairings resonate with those found in characterized clusters. For example, the thalianol and marneral clusters in A. thaliana both contain TTC genes in combination with CYP705 genes as their core components (20, 21); the rice diterpene clusters contain TPS genes in combination with CYP99 (momilactone synthesis) or CYP71 and CYP76 (phytocassane/oryzalide synthesis) genes, all of which belong to the CYP71 clan (18, 19), and the oat avenacin cluster contains a TTC gene in combination with a CYP51H gene (22, 23, 56). Members of the CYP71 clan have been shown to have functions in terpenoid metabolism in various plant species (59–62). This work has necessarily focused on the angiosperm plants for which high quality genome sequences are currently available. It will be interesting to establish whether functional clustering of TSs and CYPs is also observed in earlier diverging plants. For example, while the currently available genomes are highly fragmented (63), gymnosperm diterpene resin acid biosynthesis has been elucidated and it is possible that the relevant diterpene synthases will be clustered with the CYPs that act immediately following (64, 65). Although this study focuses on the potentially functional pairing of TSs and CYPs in plant genomes, it should be noted that there are also cases where such consecutively acting enzymes are not coclustered. For example, in Arabidopsis the CYP82G1 that acts on geranyl-linalool is found on chromosome 3 (66), while the geranyl-linalool producing TPS is found on chromosome 1 (67). Similarly, in rice CYP701A8 acts on diterpenes produced by TPSs located in disparate locations in the genome, with one of the relevant TPS found clustered with other CYPs (68).

[Fig. 4 graphic not reproduced. Recoverable labels: (A) histograms titled Monocots and Dicots, with axes Chromosome (interval ticks 0.1 to 0.9) and TS/CYP locations; (B) plots for Score ≥ 50 and Score ≥ 100, with axes Gene flanks (kb; 10, 20, 50) and Median of TE/kb; legend entries, for eudicots and for monocots: TS/CYP gene pairs; all TS and CYP genes in genome.]

Fig. 4. Chromosomal location of TS/CYP gene pairs and TE density of flanking DNA. (A) Distribution of TS/CYP pairs along chromosomes in the eudicot and the monocot assembled genomes. The horizontal axis designates intervals of chromosome length expressed as a fraction of 1.0. The vertical axis designates the numbers of TS/CYP pairs located in the intervals. (B) Transposable element (TE) distribution in the vicinity of TS and CYP genes. The numbers of predicted TEs were computed in 10-, 20-, and 50-kb regions from both sides of coupled TS and CYP genes. The same procedure was applied on a global scale to all TS and CYP genes in the genomes under investigation. Data are shown for BLAST score parameters ≥ 50 (Upper) and ≥ 100 (Lower).

The CYP51 clan is one of the most ancient of the CYP clans (11, 12, 49). These enzymes are sterol demethylases with a highly conserved function in synthesis of essential sterols in fungi, animals and plants. The CYP51H that forms part of the avenacin pathway in oat is the first CYP51 enzyme to be shown to have a function in specialized metabolism rather than in primary sterol metabolism. It belongs to a newly defined divergent group of CYP51 enzymes known as the CYP51H subfamily, which appears to be restricted to monocots and which also includes nine members of unknown function from rice (56). We have previously shown, through comprehensive analysis of the rice genome, that there are no candidate gene clusters similar to the oat avenacin cluster in rice (69), a finding that is consistent with the absence of TTC/CYP51 gene pairs in rice in the present study. Interestingly, however, one of the new candidate clusters that we identified in B. distachyon (the closest relative of oat for which a complete genome sequence is available) contained a TTC gene and several CYP51H genes clustered with other genes implicated in specialized metabolism (SI Appendix, Table S2). Comparison of this region with the oat avenacin gene cluster indicated that the candidate B. distachyon cluster was clearly distinct in terms of gene content and organization and it remains to be seen whether this is indeed a functional metabolic gene cluster. However, analysis of the wider genomic region revealed close relatives of four of the five characterized members of the avenacin biosynthetic gene cluster (SI Appendix, Fig. S4 and Table S7). This region is substantially larger than the oat gene cluster and contains many hundreds of interspersed and functionally unrelated genes, but the order of the genes in the region mirrors that of the related genes in the oat gene cluster. One interpretation is that this may reflect an early stage in the evolution of the oat avenacin cluster in a common ancestor, the gaps between the genes in oat being subsequently reduced through a process that we refer to as genome defragmentation (14). However, it is not clear that the genes present in this region of the B. distachyon genome represent true orthologs of the avenacin biosynthetic genes in all cases, and it is possible that the relative positions of some of these genes may have arisen independently in the two lineages.

A surprising discovery was that different genomic mechanisms of assembly of terpene pathway components appear to prevail in eudicots and monocots. This is suggestive of fundamental differences in genome dynamics in these two groups of plants. Our findings show that microsynteny is conserved at the level of small blocks of genes in eudicots, but not in monocots, at least when terpene pathway components are considered. Diversification of individual enzymes can be driven by gene duplication and subsequent mutation followed by natural selection (70). Understanding how entire new pathways emerge is a rather bigger challenge. Through distilling the genetic basis of metabolic diversification from multiple plant genomes it should become possible to trace the evolutionary paths that have led to the diversification of individual enzymes and pathways, and to investigate the genomic mechanisms underpinning metabolic diversification. Such approaches will enable us to access and exploit nature’s chemical toolkit by providing grist for the synthetic biology mill. Our overall strategy also has potential for wider investigations of the organization and evolution of physically clustered functional gene sets within genomes, extending beyond terpene biosynthesis.

Materials and Methods

Databases. Information about databases and the versions of genome assemblies used can be found in the additional online Methods section. Terpenoid synthase (TS), cytochrome P450-dependent monooxygenase (CYP) and transposable element (TE) protein sequences were downloaded from the Protein Knowledgebase (71) (see Datasets S1 and S2 for sequence IDs and Datasets S3 and S4 for full sequences).

Software. Customized software was used to identify TS/CYP gene pairs, calculate transposable elements/kb ratios (TE/kb), and compute identity values between DNA sequences of different TS/CYP pair members. To find candidate TS and CYP genes in plant genomes we took advantage of the tBLASTn search engine. A stand-alone BLAST database of the plant genomes in question was created (72) and TS and CYP protein sequences were piped through this (e = 0.001) to give a BLAST output file with protein/protein alignments of TS and CYP sequences mapped to corresponding exons in each of the plant genomes. The program then parsed the output file and extracted all necessary information, essentially alignment co-ordinates, with score parameter ≥ 100. The same program joined overlapping co-ordinates to give possible exons and computed co-ordinates of suggested genes. Finally, the program identified all TS/CYP gene pairs located within 30, 50, 100, 150, and 200 kb from each other in each genome under investigation. A similar approach was used to find TE sequences (SI Appendix, SI Materials and Methods). To compute the DNA sequence identity values of different TS/CYP pairs, the ClustalW2 computer program was incorporated in our software for global pairwise alignment of DNA sequences. To simulate random distribution of gene pairs on the whole genome scale we used the random number generator built on the basis of the Mitchell-Moore algorithm (73). The results of computer simulation were compared with the observed distributions of TS/CYP gene pairs (further information is provided in SI Appendix, SI Materials and Methods). χ2 tests were used to estimate the significance of the difference between the observed distributions and the correspondent null hypotheses.
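The steps described above (filter alignments by score, join overlapping co-ordinates, find TS/CYP pairs within a distance window, compare with a random placement) can be sketched in Python. This is a hedged illustration only: the authors' customized software is not reproduced here, and every function name and data value below is hypothetical. Three simplifications are assumed: distance is measured between the nearest gene ends, the null model places genes as points on a single linear genome, and Python's built-in generator (Mersenne Twister) stands in for the Mitchell-Moore generator used in the paper. The last function returns only the χ2 statistic, not a P value.

```python
import random


def merge_intervals(intervals):
    """Join overlapping (start, end) alignment co-ordinates into larger blocks."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def find_pairs(ts_genes, cyp_genes, max_gap):
    """TS/CYP pairs on the same chromosome within max_gap bp of each other.

    Each gene is a (chromosome, start, end) tuple.
    """
    pairs = []
    for ts in ts_genes:
        for cyp in cyp_genes:
            if ts[0] != cyp[0]:
                continue
            # Distance between nearest ends; zero or negative if the genes overlap.
            gap = max(ts[1], cyp[1]) - min(ts[2], cyp[2])
            if gap <= max_gap:
                pairs.append((ts, cyp))
    return pairs


def simulate_pair_counts(n_ts, n_cyp, genome_length, max_gap, n_trials, seed=0):
    """Pair counts expected when gene positions are drawn uniformly at random."""
    rng = random.Random(seed)  # assumption: stand-in for the Mitchell-Moore generator
    counts = []
    for _ in range(n_trials):
        ts = [rng.randrange(genome_length) for _ in range(n_ts)]
        cyp = [rng.randrange(genome_length) for _ in range(n_cyp)]
        counts.append(sum(1 for t in ts for c in cyp if abs(t - c) <= max_gap))
    return counts


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for observed vs. expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical tBLASTn hits as (start, end, score); keep those with score >= 100.
hits = [(1, 5, 150), (4, 10, 120), (20, 30, 300), (40, 45, 60)]
exons = merge_intervals([(s, e) for s, e, score in hits if score >= 100])
print(exons)

ts_genes = [("chr1", 100, 200)]
cyp_genes = [("chr1", 500, 600), ("chr2", 150, 250), ("chr1", 50_000, 51_000)]
print(find_pairs(ts_genes, cyp_genes, max_gap=30_000))

null_counts = simulate_pair_counts(20, 100, 10_000_000, 30_000, n_trials=50)
print(sum(null_counts) / len(null_counts))
```

Running `find_pairs` once per window size (30, 50, 100, 150, and 200 kb) reproduces the layout of the search described in the text, and the simulated counts give the expected values against which observed pair counts can be tested.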

Functional Analysis of Clustered Genes. See SI Appendix, SI Materials and Methods.

ACKNOWLEDGMENTS. We thank H.-W. Nützmann, K. Papadopoulou, J. Dicks, P. O’Maille, R. Morris, M. Bibb, and S. Rosser for comments. A.O. was supported by Biotechnology and Biological Sciences Research Council (BBSRC) Institute Strategic Programme Grant ‘Understanding and Exploiting Plant and Microbial Secondary Metabolism’ (BB/J004561/1), the John Innes Foundation, and the Engineering and Physical Sciences Research Council (EP/H019154/1); D.N. was supported by National Science Foundation Grant IOS-12432275; and R.J.P. was supported by National Institutes of Health Grant GM 076324. A.B. thanks N. Gudkov for useful discussion on computer simulation.

1. De Luca V, Salim V, Atsumi SM, Yu F (2012) Mining the biodiversity of plants: A revolution in the making. Science 336(6089):1658–1661.

2. Yonekura-Sakakibara K, Saito K (2009) Functional genomics for plant natural product biosynthesis. Nat Prod Rep 26(11):1466–1487.

3. Banks JA, et al. (2011) The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science 332(6032):960–963.

4. Weng JK, Philippe RN, Noel JP (2012) The rise of chemodiversity in plants. Science 336(6089):1667–1670.

5. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814):796–815.

6. Chappell J (2002) The genetics and molecular genetics of terpene and sterol origami. Curr Opin Plant Biol 5(2):151–157.

7. Croteau R, Kutchan TM, Lewis NG (2000) Natural products. Biochemistry and Molecular Biology of Plants, eds Buchanan B, Gruissem W, Jones R (American Society of Plant Physiologists, Rockville, MD), pp 1250–1318.

8. Gershenzon J, Dudareva N (2007) The function of terpene natural products in the natural world. Nat Chem Biol 3(7):408–414.

9. Bohlmann J, Keeling CI (2008) Terpenoid biomaterials. Plant J 54(4):656–669.

10. Chen F, Tholl D, Bohlmann J, Pichersky E (2011) The family of terpene synthases in plants: A mid-size family of genes for specialized metabolism that is highly diversified throughout the kingdom. Plant J 66(1):212–229.

11. Hamberger B, Bak S (2013) Plant P450s as versatile drivers for evolution of species-specific chemical diversity. Philos Trans R Soc Lond B Biol Sci 368(1612):20120426.

12. Mizutani M (2012) Impacts of diversification of cytochrome P450 on plant metabolism. Biol Pharm Bull 35(6):824–832.

13. Trapp SC, Croteau RB (2001) Genomic organization of plant terpene synthases and molecular evolutionary implications. Genetics 158(2):811–832.

14. Osbourn A (2010) Secondary metabolic gene clusters: Evolutionary toolkits for chemical innovation. Trends Genet 26(10):449–457.

15. Christianson DW (2008) Unearthing the roots of the terpenome. Curr Opin Chem Biol 12(2):141–150.

16. Gao Y, Honzatko RB, Peters RJ (2012) Terpenoid synthase structures: A so far incomplete view of complex catalysis. Nat Prod Rep 29(10):1153–1175.

17. Nützmann HW, Osbourn A (2014) Gene clustering in plant specialized metabolism. Curr Opin Biotechnol 26:91–99.


18. Shimura K, et al. (2007) Identification of a biosynthetic gene cluster in rice for momilactones. J Biol Chem 282(47):34013–34018.

19. Wilderman PR, Xu M, Jin Y, Coates RM, Peters RJ (2004) Identification of syn-pimara-7,15-diene synthase reveals functional clustering of terpene synthases involved in rice phytoalexin/allelochemical biosynthesis. Plant Physiol 135(4):2098–2105.

20. Field B, et al. (2011) Formation of plant metabolic gene clusters within dynamic chromosomal regions. Proc Natl Acad Sci USA 108(38):16116–16121.

21. Field B, Osbourn AE (2008) Metabolic diversification--independent assembly of operon-like gene clusters in different plants. Science 320(5875):543–547.

22. Mugford ST, et al. (2013) Modularity of plant metabolic gene clusters: A trio of linked genes that are collectively required for acylation of triterpenes in oat. Plant Cell 25(3):1078–1092.

23. Qi X, et al. (2004) A gene cluster for secondary metabolism in oat: Implications for the evolution of metabolic diversity in plants. Proc Natl Acad Sci USA 101(21):8233–8238.

24. Itkin M, et al. (2013) Biosynthesis of antinutritional alkaloids in solanaceous crops is mediated by clustered genes. Science 341(6142):175–179.

25. Matsuba Y, et al. (2013) Evolution of a complex locus for terpene biosynthesis in solanum. Plant Cell 25(6):2022–2036.

26. Aubourg S, Lecharny A, Bohlmann J (2002) Genomic analysis of the terpenoid synthase (AtTPS) gene family of Arabidopsis thaliana. Mol Genet Genomics 267(6):730–745.

27. Castillo DA, Kolesnikova MD, Matsuda SP (2013) An effective strategy for exploring unknown metabolic pathways by genome mining. J Am Chem Soc 135(15):5885–5894.

28. Ehlting J, et al. (2008) An extensive (co-)expression analysis tool for the cytochrome P450 superfamily in Arabidopsis thaliana. BMC Plant Biol 8(47):47.

29. Huang S, et al. (2009) The genome of the cucumber, Cucumis sativus L. Nat Genet 41(12):1275–1281.

30. Chu HY, Wegel E, Osbourn A (2011) From hormones to secondary metabolism: The emergence of metabolic gene clusters in plants. Plant J 66(1):66–79.

31. Takos AM, Rook F (2012) Why biosynthetic genes for chemical defense compounds cluster. Trends Plant Sci 17(7):383–388.

32. Wegel E, Koumproglou R, Shaw P, Osbourn A (2009) Cell type-specific chromatin decondensation of a metabolic gene cluster in oats. Plant Cell 21(12):3926–3936.

33. Zerbe P, et al. (2013) Gene discovery of modular diterpene metabolism in nonmodel systems. Plant Physiol 162(2):1073–1091.

34. Lotus japonicus genome sequencing website: www.kazusa.or.jp/lotus/.

35. Jaillon O, et al.; French-Italian Public Consortium for Grapevine Genome Characterization (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449(7161):463–467.

36. Schmutz J, et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463(7278):178–183.

37. Tuskan GA, et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313(5793):1596–1604.

38. Wang X, et al.; Brassica rapa Genome Sequencing Project Consortium (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43(10):1035–1039.

39. Young ND, et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480(7378):520–524.

40. Tomato Genome Consortium (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485(7400):635–641.

41. Xu X, et al.; Potato Genome Sequencing Consortium (2011) Genome sequence and analysis of the tuber crop potato. Nature 475(7355):189–195.

42. International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463(7282):763–768.

43. Paterson AH, et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457(7229):551–556.

44. International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436(7052):793–800.

45. Schnable PS, et al. (2009) The B73 maize genome: Complexity, diversity, and dynamics. Science 326(5956):1112–1115.

46. Mimulus genome browser: www.phytozome.net/mimulus.

47. Chan AP, et al. (2010) Draft genome sequence of the oilseed species Ricinus communis. Nat Biotechnol 28(9):951–956.

48. Zhang G, et al. (2012) Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nat Biotechnol 30(6):549–554.

49. Nelson D, Werck-Reichhart D (2011) A P450-centric view of plant evolution. Plant J 66(1):194–211.

50. Nelson DR (2009) The cytochrome P450 homepage. Hum Genomics 4(1):59–65.

51. Kirby J, et al. (2010) Cloning of casbene and neocembrene synthases from Euphorbiaceae plants and expression in Saccharomyces cerevisiae. Phytochemistry 71(13):1466–1473.

52. Kitaoka N, Lu X, Yang B, Peters RJ (2014) The application of synthetic biology to elucidation of plant mono-, sesqui- and diterpenoid metabolism. Mol Plant, 10.1093/mp/ssu104.

53. Tan Q-G, Cai X-H, Du Z-Z, Luo X-D (2009) Three terpenoids and a tocopherol-related compound from Ricinus communis. Helv Chim Acta 92:2762–2768.

54. King AJ, Brown GD, Gilday AD, Larson TR, Graham IA (2014) Production of bioactive diterpenoids in the euphorbiaceae depends on evolutionarily conserved gene clusters. Plant Cell 26(8):3286–3298.

55. Zi J, et al. (2014) Biosynthesis of lycosantalonol, a cis-prenyl derived diterpenoid. J Am Chem Soc 136(49):16951–16953.

56. Qi X, et al. (2006) A different function for a member of an ancient and highly conserved cytochrome P450 family: From essential sterols to plant defense. Proc Natl Acad Sci USA 103(49):18848–18853.

57. Hoffmeister D, Keller NP (2007) Natural products of filamentous fungi: Enzymes, genes, and their regulation. Nat Prod Rep 24(2):393–416.

58. Swaminathan S, Morrone D, Wang Q, Fulton DB, Peters RJ (2009) CYP76M7 is an ent-cassadiene C11α-hydroxylase defining a second multifunctional diterpenoid biosynthetic gene cluster in rice. Plant Cell 21(10):3315–3325.

59. Collu G, et al. (2001) Geraniol 10-hydroxylase, a cytochrome P450 enzyme involved in terpenoid indole alkaloid biosynthesis. FEBS Lett 508(2):215–220.

60. Lupien S, Karp F, Wildung M, Croteau R (1999) Regiospecific cytochrome P450 limonene hydroxylases from mint (Mentha) species: cDNA isolation, characterization, and functional expression of (-)-4S-limonene-3-hydroxylase and (-)-4S-limonene-6-hydroxylase. Arch Biochem Biophys 368(1):181–192.

61. Ralston L, et al. (2001) Cloning, heterologous expression, and functional characterization of 5-epi-aristolochene-1,3-dihydroxylase from tobacco (Nicotiana tabacum). Arch Biochem Biophys 393(2):222–235.

62. Wang E, et al. (2001) Suppression of a P450 hydroxylase gene in plant trichome glands enhances natural-product-based aphid resistance. Nat Biotechnol 19(4):371–374.

63. Nystedt B, et al. (2013) The Norway spruce genome sequence and conifer genome evolution. Nature 497(7451):579–584.

64. Hall DE, et al. (2013) Evolution of conifer diterpene synthases: Diterpene resin acid biosynthesis in lodgepole pine and jack pine involves monofunctional and bifunctional diterpene synthases. Plant Physiol 161(2):600–616.

65. Keeling CI, Bohlmann J (2006) Genes, enzymes and chemicals of terpenoid diversity in the constitutive and induced defence of conifers against insects and pathogens. New Phytol 170(4):657–675.

66. Lee S, et al. (2010) Herbivore-induced and floral homoterpene volatiles are biosynthesized by a single P450 enzyme (CYP82G1) in Arabidopsis. Proc Natl Acad Sci USA 107(49):21205–21210.

67. Herde M, et al. (2008) Identification and regulation of TPS04/GES, an Arabidopsis geranyllinalool synthase catalyzing the first step in the formation of the insect-induced volatile C16-homoterpene TMTT. Plant Cell 20(4):1152–1168.

68. Wang Q, Hillwig ML, Wu Y, Peters RJ (2012) CYP701A8: A rice ent-kaurene oxidase paralog diverted to more specialized diterpenoid metabolism. Plant Physiol 158(3):1418–1425.

69. Inagaki YS, et al. (2011) Investigation of the potential for triterpene synthesis in rice through genome mining and metabolic engineering. New Phytol 191(2):432–448.

70. Ober D (2010) Gene duplications and the time thereafter - examples from plant secondary metabolism. Plant Biol (Stuttg) 12(4):570–577.

71. UniProt Consortium (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res 40(Database issue):D71–D75.

72. BLAST Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2008-. www.ncbi.nlm.nih.gov/books/NBK1762/.

73. Knuth DE (1981) The Art of Computer Programming: Semi-Numerical Algorithms (Addison-Wesley, Reading, Mass).
