
Plant Physiol. (1994) 106: 1241-1255

Genes Galore: A Summary of Methods for Accessing Results from Large-Scale Partial Sequencing of Anonymous Arabidopsis cDNA Clones

Tom Newman, Frans J. de Bruijn, Pam Green, Ken Keegstra, Hans Kende, Lee McIntosh, John Ohlrogge, Natasha Raikhel, Shauna Somerville, Mike Thomashow, Ernie Retzel, and Chris Somerville*

Arabidopsis Expressed Sequence Tag Project, Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, Michigan 48824 (T.N., F.J.d.B., P.G., K.K., H.K., L.M., J.O., N.R., M.T.); Computational Biology Center, Medical School, University of Minnesota, 1460 Mayo, UMHC 196, 420 Delaware Street S.E., Minneapolis, Minnesota 55455-0312 (E.R.); and Carnegie Institution of Washington, Department of Plant Biology, 290 Panama Street, Stanford, California 94305-4101 (S.S., C.S.)

High-throughput automated partial sequencing of anonymous cDNA clones provides a method to survey the repertoire of expressed genes from an organism. Comparison of the coding capacity of these expressed sequence tags (ESTs) with the sequences in the public data bases results in assignment of putative function to a significant proportion of the ESTs. Thus, the more than 13,400 plant ESTs that are currently available provide a new resource that will facilitate progress in many areas of plant biology. These opportunities are illustrated by a description of the results obtained from analysis of 1500 Arabidopsis ESTs from a cDNA library prepared from equal portions of poly(A+) mRNA from etiolated seedlings, roots, leaves, and flowering inflorescences. More than 900 different sequences were represented, 32% of which showed significant nucleotide or deduced amino acid sequence similarity to previously characterized genes or proteins from a wide range of organisms. At least 165 of the clones had significant deduced amino acid sequence homology to proteins or gene products that have not been previously characterized from higher plants. A summary of methods for accessing the information and materials generated by the Arabidopsis cDNA sequencing projects is provided.

Because of the rapid proliferation of amino acid sequence information deduced from cloned genes of known function and from purified proteins, it is now frequently possible to infer the probable function of a newly isolated gene solely on the basis of nucleotide or deduced amino acid sequence homology to genes or gene products of known function (Pearson, 1991). This fact, in conjunction with the commercial availability of reliable automated DNA sequenators capable of very high throughput (Hunkapiller et al., 1991), has led to large-scale partial sequencing of anonymous cDNA clones (ESTs) from humans and several model organisms (e.g. Adams et al., 1991, 1992; McCombie et al., 1992; Waterston et al., 1992). In one of the first tests of this approach, an average of 397 bp of sequence was obtained from one end of each of 2375 randomly selected clones from several commercially available cDNA libraries of human brain (Adams et al., 1992). In spite of the fact that no effort was made to eliminate redundant sequences, no gene was sequenced more than 16 times (actin), and the total number of 3-fold or greater redundancies was 142 (i.e. <5%). Approximately 17% of all ESTs were assigned a probable function by homology to known sequences from humans or other organisms, including plants (Adams et al., 1992). Structural and metabolic classes comprised about 30% of the ESTs, 25% were involved in regulatory pathways, and the rest were not simply classified. This work and several other contemporaneous experiments demonstrated the high frequency with which an otherwise anonymous cDNA can be assigned probable function by data base searching.

This work was supported in part by grants from the National Science Foundation (BIR9313751) and the U.S. Department of Energy (DE-FG02-90ER20021).

* Corresponding author; fax 1-415-325-6857.

Analysis of ESTs from plants has produced results similar to those obtained with animals. ESTs have been reported in the published literature for 3089 rice cDNAs (Uchimiya et al., 1992; Sasaki et al., 1994), 200 maize cDNAs (Keith et al., 1993), 197 Brassica napus cDNAs (Park et al., 1993), and 1152 Arabidopsis cDNAs (Hofte et al., 1993). The published Arabidopsis ESTs were obtained from five libraries representing mRNA expressed during floral development, embryogenesis, seed maturation, development of etiolated plants, and cell culture. The 1152 sequences contained 895 nonredundant ESTs, 32% of which had significant probable matches to known genes from Arabidopsis or other organisms. As cogently noted by Hofte et al. (1993), these sequences, which are available from the public data bases such as dbEST, represent a valuable resource for the facile identification of plant genes. Indeed, there are more than 8000 Arabidopsis and more than 4300 rice sequences in these data bases at present and the number is expected to grow by tens of thousands during the next several years. Because of the magnitude and rate of growth of these sequence data bases, it is expected that most of the sequences will not be reported in the published literature and will be available only by data base analysis.

Abbreviations: EST, expressed sequence tag; HSP, high-scoring pair; NCBI, National Center for Biotechnology Information; WWW, world wide web.

Downloaded from www.plantphysiol.org on July 29, 2018. Copyright © 1994 American Society of Plant Biologists. All rights reserved.

In this report, we present an analysis of 1500 anonymous cDNA clones from a library composed of mRNA from etiolated seedlings, roots, leaves, and flowering inflorescences of Arabidopsis. As in an earlier report describing high-throughput cDNA sequencing of Arabidopsis clones (Hofte et al., 1993), we have found this approach to be a highly efficient method for the identification of new plant genes. Since producing the 1500 sequences reported here, we have deposited an additional 3700 EST sequences in public data bases. We are currently producing approximately 1000 Arabidopsis ESTs per month and anticipate adding tens of thousands of sequences during the next several years. This article describes the cDNA libraries used, provides an example of the kind of information obtained by EST analysis, outlines some of the pitfalls associated with using ESTs, and summarizes how to exploit these resources.

MATERIALS AND METHODS

cDNA Libraries

All of the clones described here were from the Columbia wild type of Arabidopsis thaliana (L.) Heynh. The clones and the PRL2 library are available from the Arabidopsis Biological Resource Center at The Ohio State University. Most of the clones sequenced were from a λZipLox library designated PRL2, which was constructed specifically for the EST sequencing project as described below. However, during preliminary experiments, a library designated PRL1 and a library obtained from Dr. N.H. Chua (Rockefeller University, New York) were also used.

For the preparation of the PRL1 and PRL2 libraries, RNA was prepared as described (Nagy et al., 1988) from four tissue types of the Columbia wild type of Arabidopsis (Ohio State University Stock Center accession No. Col-0): (a) sterile etiolated seedlings grown on Murashige and Skoog medium (Sigma) for 7 d at 23°C in tinfoil-wrapped Petri dishes; (b) sterile roots grown for 5 to 7 d on the surface of minimal-salts agar-solidified medium in vertically oriented Petri dishes in continuous light at 23°C; (c) rosette plants of various ages grown in soil at 23°C (half of the plants were grown in continuous fluorescent illumination [150 μmol m⁻² s⁻¹ PAR], half were grown in a 16-h photoperiod and harvested in the dark); (d) stems, flowers, and siliques at all stages from floral initiation to mature seeds from plants grown as in (c). Plants were initiated at weekly intervals and when the first batch produced mature seeds, the stems and all associated tissues were harvested from all the plants concurrently.

For the PRL2 library, equal amounts of poly(A+) mRNA from the four RNA preparations were converted to cDNA using a SuperScript kit from Gibco BRL. Reverse transcription of the poly(A+) fraction was primed with a NotI primer (5'-GACTAGTTCTAGATCGCGAGCGGCCGCCC(T)15) and, after second strand synthesis, a SalI adapter (5'-TCGACCCACGCGTCCG) was added (the underlined region was double stranded). The cDNA was digested with NotI, ligated into the SalI-NotI sites of λZipLox (Gibco BRL), packaged, and plated on Escherichia coli Y1090(pZIP). Approximately 1.2 × 10⁶ primary recombinant phage were obtained. λZipLox is a cre-lox vector (Palazzolo et al., 1990) that contains a ColE1-derived plasmid (pZL1) flanked by loxP sites. Infection of strains of E. coli that express the phage P1 cre gene causes site-specific recombination at the loxP sites and excision of the 4307-bp plasmid. The sequence of pZL1 is available from Gibco BRL.

The PRL1 library was custom-made from the same RNA preparation as the PRL2 library by Novagen (Madison, WI). The cDNA was made by oligo(dT) priming and directionally cloned into the EcoRI and HindIII sites of the λSHlox 1 vector (Novagen). The AT-NHC cDNA library was obtained from Dr. N.H. Chua. This library was made by ligating EcoRI-SmaI adaptors to oligo(dT)-primed cDNA and cloning into the EcoRI site of λZAP (Stratagene, La Jolla, CA).

Template Preparation

Plasmid templates for the first 600 ESTs from the λZipLox library were produced by randomly picking single plaques from the primary library onto a lawn of E. coli DH10B(pZIP) F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80dlacZΔM15 ΔlacX74 endA1 recA1 deoR Δ(ara-leu)7697 araD139 galU galK nupG bis6 ind pZIP (P1 ori-kan-cre), which had been plated on Luria Broth agar containing 100 μg/mL ampicillin. The resulting ampicillin-resistant colonies were grown at 37°C in 5-mL cultures of Terrific Broth (Gibco BRL) and plasmids were extracted using Magic Minipreps from Promega (Madison, WI). The use of DH10B(pZIP) for plasmid preparation is essential to overcome the low yield of loxP-containing plasmids in cre+ cells (Palazzolo et al., 1990). This problem has been overcome in the λZipLox system by exploiting the incompatibility between the P1 origin of replication on the plasmid that carries the cre recombinase gene (pZIP) and the P1 incA locus carried on pZL1 (Abeles and Austin, 1991). In brief, after infection of a cell with λZipLox, pZL1 is excised and expression of the incA gene on pZL1 suppresses replication of pZIP, leading to loss of cre.

Sequencing and Data Analysis

Taq polymerase cycle-sequencing reactions were performed by an ABI Catalyst 8000 Molecular Workstation (Applied Biosystems, Foster City, CA) using conditions and reagents provided by the manufacturer and fluorescent T7, M13(-21), and M13-reverse dye primers. The primer used to sequence a particular clone is indicated by the last two letters of the laboratory accession number of the clone (e.g. 32B4T7, 32B4X [or 32B4XP], and 32B4R were sequenced using the T7, M13(-21), and M13-reverse primers, respectively). The sequence ladders were resolved by ABI 373A sequenators (Applied Biosystems).

Sequences were edited manually to remove vector and ambiguous sequences at the ends. The EST nucleotide sequences were compared to the nucleotide sequences in GenBank release 70 by using the BLAST e-mail server provided by NCBI. The six possible deduced amino acid sequences of the ESTs were compared to the nonredundant protein data bases by using the BLASTX e-mail server provided by NCBI. PAM120 scores of >80 were considered to indicate potentially significant homology.
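The "six possible deduced amino acid sequences" compared by BLASTX are the three reading frames of the EST and the three frames of its reverse complement. The following is a minimal sketch of that translation step, not the BLASTX implementation itself. It assumes the standard genetic code, and the function names (`six_frame_translations`, `translate`) are hypothetical names chosen for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement; ambiguous bases (e.g. N) are left as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to deduced amino acid sequences."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons are reported as `*` rather than terminating the translation, because a single-pass EST may contain sequencing errors that introduce spurious stops.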

Data and Plasmid Storage

All EST sequences have been deposited in dbEST, a public-access data base designed specifically for ESTs (Boguski et al., 1993). Information on how to retrieve a sequence can be obtained from [email protected] by placing the word "help" in the body of the message and leaving the subject line blank. An efficient mechanism to determine if an EST for a particular protein of interest is available is to run a TBLASTN search against dbEST using the amino acid sequence of the known protein as the query. Information on how to format such a query can be obtained from [email protected] by placing the word "help" in the body of the message.

The plasmids corresponding to the ESTs have been deposited with the Arabidopsis Biological Resource Center at The Ohio State University, 1735 Neil Avenue, Columbus, OH 43210. DNA may be ordered by mail, fax (1-614-292-0603), on line through the AIMS data base (for help contact [email protected]), or by e-mail from dna@genesys.cps.msu.edu. Use of the AIMS data base is recommended because it contains a record of previous requests for each EST clone.

EST Identifiers

The clone names available in the dbEST report reflect the position of the clone in a 96-well plate (i.e. 49C11T7 is from plate 49, row C, column 11). As noted above, the last two or three letters indicate the primer that was used to produce the sequence.
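The naming scheme can be decoded mechanically. The sketch below splits a clone name into plate, row, column, and primer, assuming a standard 96-well layout (rows A to H, columns 1 to 12) and the suffix-to-primer mapping given under Sequencing and Data Analysis. The function name `parse_clone_name` and the handling of unrecognized suffixes are assumptions made for illustration.

```python
import re

# Plate number, row letter, column number, then a primer suffix that starts
# with a letter (e.g. T7, X, XP, R).
CLONE_NAME = re.compile(r"^(\d+)([A-H])(\d{1,2})([A-Z][A-Z0-9]*)$")

# Suffix-to-primer mapping as described in the text; other suffixes are
# reported as "unknown" rather than guessed.
PRIMERS = {"T7": "T7", "X": "M13(-21)", "XP": "M13(-21)", "R": "M13-reverse"}


def parse_clone_name(name):
    """Split a name such as '49C11T7' into plate, row, column, and primer."""
    match = CLONE_NAME.match(name.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable clone name: {name!r}")
    plate, row, column, suffix = match.groups()
    column = int(column)
    if not 1 <= column <= 12:
        raise ValueError(f"column {column} is outside a 96-well plate")
    return {
        "plate": int(plate),
        "row": row,
        "column": column,
        "primer": PRIMERS.get(suffix, "unknown"),
    }


if __name__ == "__main__":
    print(parse_clone_name("49C11T7"))
```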

RESULTS

cDNA Libraries

During the initial stages of the experiments described here, cDNA clones were picked randomly from several available libraries. During the latter stages of the project, clones were from an oriented λZipLox library, designated PRL2, which was constructed from equal amounts of mRNA from etiolated seedlings, roots, leaves, and shoots of all maturity stages. The primary library contained 1.2 × 10⁶ recombinant phage and, therefore, was considered to have adequate representation of the expressed genes. The quality of the library (with respect to insert size) was assessed by comparing the partial nucleotide sequences obtained for abundant isoforms of catalase and several Chl a/b binding proteins (Fig. 1). Of 12 sequences obtained for catalase, 7 contained the translation initiation codon. Similarly, of 15 Chl a/b binding clones sequenced, the translation initiation ATG was present in 12 clones. Thus, it appears that for mRNA species in the range of 1 to 1.7 kb, a majority of the cDNA clones contain the translational start codon.
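One common way to judge whether a library of a given size gives adequate representation is the relation P = 1 - (1 - f)^N, where N is the number of independent clones and f is the fraction of the mRNA population contributed by a given transcript. The sketch below applies that relation to a library of 1.2 × 10⁶ primary recombinants. The transcript abundance of 1 in 100,000 is purely an illustrative assumption and is not a figure from this report, and the relation itself assumes that clones are sampled independently.

```python
import math


def probability_represented(n_clones, fraction):
    """P(at least one clone) = 1 - (1 - f)^N, assuming independent clones."""
    return -math.expm1(n_clones * math.log1p(-fraction))


def clones_needed(fraction, probability):
    """Smallest N for which a transcript at abundance f is present with probability P."""
    return math.ceil(math.log1p(-probability) / math.log1p(-fraction))


if __name__ == "__main__":
    library_size = 1_200_000      # primary recombinants reported for PRL2
    assumed_abundance = 1e-5      # ASSUMPTION: a rare transcript, 1 in 100,000
    p = probability_represented(library_size, assumed_abundance)
    print(f"P(represented) = {p:.6f}")
    print("clones for P = 0.99:", clones_needed(assumed_abundance, 0.99))
```

Under these assumptions the library is several-fold larger than the size needed for 99% confidence, which is consistent with the statement that representation was considered adequate.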

Since homology between plant and nonplant gene products is frequently not found at the amino-terminal region of the proteins, the assignment of probable function to ESTs by data base analysis is facilitated by the presence in the library of a certain proportion of less than full-length cDNAs so that internal sequences can also be obtained. However, since it eventually may be desirable to obtain the complete sequence of all the ESTs, the use of a library with a high proportion of full-length clones was considered preferable in the long term.

Sequence Analysis

A total of 1518 single-pass nucleotide sequences were obtained from 1477 randomly picked cDNA clones. Forty-one of these clones were sequenced from both ends. For most of the sequences from the oriented libraries (PRL1 and PRL2), the sequences were obtained only from the putative 5' end of the cDNA to enhance the probability of obtaining coding sequence. Each sequence was manually processed to remove vector sequences from the 5' end, to resolve (as far as possible) sequencing ambiguities that were not assigned by the automated sequenator, and to decide where to terminate the sequence. The average EST produced in this way contains approximately 375 bp of sequence. Comparison of 31 ESTs with previously published sequences indicated that the error rate was approximately 0.3% (29/10,500) for the first 300 bp and about 4% for >300 bp. Several of the ESTs obtained during the early stages of the project and deposited in dbEST have subsequently been found to contain vector sequences, which resulted from the sequencing run being longer than the insert in the cDNA clone or for other reasons. For this and related reasons, it is advisable to analyze any EST sequence for homology to the vector and to the current release of the data base before proceeding to use the sequence or the clone for any experimental purposes.
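The recommended check for residual vector sequence can be approximated with a simple exact-match screen. The sketch below flags positions in an EST where a 20-nucleotide window is identical to the vector on either strand. It is a deliberately crude stand-in for a proper alignment search (such as BLASTN against the vector sequence): the window length of 20 and the function names are assumptions, and because only exact matches are detected, vector segments containing sequencing errors may be missed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def vector_kmers(vector, k):
    """Collect every k-mer of the vector, on both strands."""
    vector = vector.upper()
    kmers = set()
    for strand in (vector, reverse_complement(vector)):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def vector_hits(est, vector, k=20):
    """Return EST positions where a k-mer exactly matches the vector."""
    est = est.upper()
    kmers = vector_kmers(vector, k)
    return [i for i in range(len(est) - k + 1) if est[i:i + k] in kmers]


if __name__ == "__main__":
    # Toy sequences for illustration only; not real vector or EST data.
    toy_vector = "ACGTTGCAAGCTTGGATCCGAATTCCTGCAGG"
    toy_est = "A" * 10 + toy_vector[5:30] + "C" * 10
    print("vector-like positions:", vector_hits(toy_est, toy_vector))
```

Any EST with hits would then be trimmed or re-examined before the sequence or clone is used further.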

Each of the edited sequences was deposited in dbEST and compared to the nonredundant nucleotide and protein sequence data bases by BLASTN (nucleotide homology) and BLASTX (deduced amino acid sequence homology) searches (Altschul et al., 1990). Deduced amino acid sequence homology between an EST and a known sequence was deemed significant if the BLASTX PAM120 score was greater than 80. From this analysis 292 of the ESTs had significant homology to 88 previously identified cDNA clones from Arabidopsis (Table I). In most cases the previously identified clones were ESTs reported by Hofte et al. (1993). In many instances the sequence identity between the newly identified ESTs and the previously identified cDNA clones was less than 95% at the nucleotide level, indicating that the clones might represent isoforms of the previously identified gene (or, less likely, an abnormally high sequence error rate). As an example of the biological complexity underlying this apparent redundancy, the current release of dbEST contains 316 Arabidopsis ESTs with homology to known kinases. A preliminary analysis of the number of different kinases represented by this collection indicated that there are dozens of structurally different enzymes represented by this subset of ESTs. Thus, a detailed analysis of the total number of different genes represented by the ESTs is beyond the scope of this article. The list of individual clones with homology to the EST classes in Table I can be obtained from the data bases by following the instructions presented below.
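The 95% nucleotide identity figure used above to suggest possible isoforms can be computed from any pairwise alignment. The sketch below takes two already-aligned sequences of equal length, with gaps written as "-", and reports percent identity over the compared positions. Skipping gapped and ambiguous (N) positions is a design choice made here for illustration, not a rule stated in this report, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical bases over aligned positions without gaps or N."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue  # skip gaps and ambiguous calls (an assumption)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def may_be_isoform(aligned_a, aligned_b, threshold=95.0):
    """Flag pairs below the identity threshold as candidate isoforms."""
    return percent_identity(aligned_a, aligned_b) < threshold


if __name__ == "__main__":
    # Toy alignment for illustration only.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

A pair flagged this way is only a candidate: as noted in the text, a sequencing error rate higher than expected could produce the same result.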

Figure 1. Comparison of EST sequences obtained for catalase (A) and Chl a/b binding proteins (B). The extent of the full-length cDNA sequence for each gene is shown as long arrows representing the single full-length clone for catalase (CAT1) or the three clones available for members of the Chl a/b binding protein family (ATHLHCP3, ATHLHCP2, ATHLHCP1). The direction and extent of the ESTs obtained for each gene are shown as horizontal arrows. The positions of Met codons are shown as small boxes and the translation initiator codons are indicated by solid vertical arrows.

One hundred seventy-seven of the ESTs showed significant deduced amino acid sequence homology to 113 previously identified genes from plants other than Arabidopsis (Table II). In some instances these ESTs appeared to represent new isoforms of the previously identified plant genes. A striking example of this is represented by the isolation of seven distinct Cyt P450 sequences. In view of the relatively low deduced amino acid sequence identity between the Arabidopsis clones and the previously isolated clones from other species, it is possible that each of the corresponding proteins catalyzes a different enzymatic reaction. Thus, even though this class of ESTs corresponds at some level to previously known plant genes, the clones may prove useful starting materials for investigations of the corresponding functions in Arabidopsis. In this respect it should be noted that some of the apparently distinct EST sequences may actually represent different nonoverlapping regions from the same gene.

One hundred eighty-three of the ESTs showed significant homology to 165 previously identified genes from species other than higher plants (Table III). The sources of the homologous genes varied from bacteria to humans. Many of the ESTs showed homology to enzymes from ubiquitous metabolic pathways, structural proteins, and components of the transcriptional or translational apparatus. Others showed homology to proteins involved in functions that are not known to exist in plants. For example, ESTs were identified with apparent homologies to bovine brown fat uncoupling protein, the agglutinin core subunit from yeast, a cyclic nucleotide gated channel from catfish, a fibronectin binding protein, and many other proteins that cannot be immediately assigned probable functions in plants. Many other ESTs corresponded to functions that might correspond to known functions. For instance, a putative clone for an epoxide hydrolase could be involved in cutin synthesis or in carotenoid metabolism. A clone for a putative acyl-CoA binding protein could represent a new lead to the as-yet unresolved problem of how lipids move between intracellular membranes.

Table I. List of previously identified Arabidopsis genes or gene families for which one or more ESTs were identified

14-3-3 protein
2S seed storage protein
RNA binding protein 31 kD
Adenylate translocator
ADP-ribosylation factor
α tubulin
Amino acid permease I
Annexin
APG protein
ATHB2 homeobox protein
Auxin-induced protein
β tubulin
Blue copper binding protein
Chl a/b binding protein
Chl a/b binding protein
Chl a/b binding protein
Calmodulin-like 22-kD protein
Carbonic anhydrase
Carboxypeptidase Y
Catalase
Cdc2
Chalcone synthase
CHL1 gene product
Cor47
CP29
Cruciferin
DNA binding protein
DRT 100 gene product
Elongation factor 1-α
Elongation factor Tu
Enolase
Ethylene-forming enzyme
Eukaryotic initiation factor 5A
Fd
Ferritin
Flavonol 4-sulfotransferase
Fru-bisP aldolase
Glutamine synthetase
Glutathione S-transferase
Glyceraldehyde-3-P dehydrogenase
Gly-rich protein
GTP binding protein
Heat-shock 70-kD cognate
Hydroxymethylglutaryl-CoA reductase
Hypothetical transmembrane protein
Ascorbate peroxidase
Laminin receptor
Leu aminopeptidase
Lti140 gene product
Meri-5
Metallothionein
Lipid transfer protein
Peroxidase
Phosphoribulokinase
PSII 10-K protein
PSII 33-kD protein
Plasma membrane H+ ATPase
Poly(A) binding protein
Polyubiquitin
Protein kinase
Protein kinase C inhibitor
PSII 33-kD oxygen-evolving protein
Pyrophosphate-energized vacuolar proton pump
Receptor-like protein kinase
Ribosomal protein L12
Ribosomal protein L17
Ribosomal protein L19
Ribosomal protein L27
Ribosomal protein L3
Ribosomal protein L9
Ribosomal protein S13
Ribosomal protein S19
Rubisco activase
Rubisco SS 1A
Rubisco SS 1B
Rubisco SS 2B
Rubisco SS 3B
S-Adenosylmethionine synthase
Superoxide dismutase
Thaumatin
Thioglucosidase
Tonoplast intrinsic protein
Topoisomerase
Transmembrane protein
Ubiquitin
Ubiquitin-conjugating enzyme
Ubiquitin extension protein
Vacuolar ATP synthase

DISCUSSION

The importance of high-throughput cDNA sequencing resides in the fact that it is an extremely efficient way of connecting plant biology to nonplant biology. For the past 15 years, biologists who do not work on plants have been producing large amounts of sequence information about proteins and genes of known function from a wide variety of organisms. The availability of a comprehensive collection of Arabidopsis and rice ESTs will facilitate the ability of plant biologists to directly utilize this vast pool of knowledge about proteins and genes from nonplant organisms. Frequently, the products of these nonplant genes exhibit enough homology to the corresponding plant genes so that only a few dozen amino acid residues of sequence information are sufficient to identify a statistically significant match. In other cases, such as the family of Cyt P450s described here, homology between genes of related but different function can be used to identify potentially useful new genes. This example also illustrates an important caveat to the use of homology searching: genes of different function may appear homologous. Thus, results from EST analysis are essentially just hypotheses that must be tested by other criteria. Nevertheless, because of the rapidity with which the data bases of ESTs are currently growing, it is important to know how to access and use this information. The following discussion identifies some of the relevant issues in this respect.

Gene Representation in the EST Data Bases

Based on the rate at which EST sequences are currently being produced in various laboratories, we believe that partial sequence information will be available for the majority of plant genes in the foreseeable future. Based on estimates of


Table II. Inventory of Arabidopsis ESTs with significant homology to genes from other plants

ESTs with homology to Arabidopsis genes are not listed. The EST# is the accession number assigned by dbEST. The numbers in the columns designated ID, Similar, and Overlap refer to the number of identical (ID) or similar (Similar) amino acids in a contiguous region of a particular length (Overlap). The heading Organism refers to the source of the protein that exhibits homology to the Arabidopsis EST. In those cases where more than one EST showed significant homology to a particular protein, the number of "hits" is indicated by a number in parentheses in the Putative Identification column.

EST# | Putative Identification | ID | Similar | Overlap | Score | Organism
34612 | 14-3-3-like protein (2) | 61 | 72 | 81 | 316 | Oenothera hookeri
21240 | ACC oxidase (2) | 51 | 72 | 98 | 293 | Tomato
35217 | Acetyl-CoA carboxylase | 57 | 75 | 102 | 294 | Maize
34874 | Actinidin | 66 | 74 | 100 | 346 | Kiwi
35102 | Adenylate kinase | 92 | 101 | 116 | 469 | Rice
21164 | ADP-Glu pyrophosphorylase | 64 | 74 | 92 | 300 | Potato
34964 | Aleurain | 21 | 35 | 75 | 85 | Barley
20833 | α-Galactosidase | 21 | 28 | 42 | 124 | Cyamopsis tetragonoloba
34893 | Annexin | 22 | 26 | 41 | 99 | Tomato
21138 | Anther-specific protein SF18 | 25 | 33 | 50 | 157 | Sunflower
20775 | ATP synthase 6, mitochondrial | 25 | 31 | 37 | 141 | Sweet potato
21577 | ATP synthase 7, mitochondrial | 28 | 30 | 34 | 131 | Sweet potato
34784 | Auxin down-regulated gene ADR11 | 21 | 30 | 63 | 115 | Soybean
34760 | Auxin-induced protein PCNT107 | 50 | 57 | 72 | 264 | Tobacco
34794 | β-1,3-Glucanase | 17 | 23 | 32 | 93 | Brassica napus
21055 | β-Glucosidase (4) | 55 | 70 | 91 | 309 | Trifolium repens
35173 | β-Ketoacyl-ACP synthase | 97 | 106 | 126 | 494 | Castor
34721 | Chl a/b binding | 73 | 85 | 103 | 380 | Tomato
34621 | Cathepsin B | 22 | 39 | 67 | 106 | Wheat
34711 | Chloroplast inner envelope protein | 37 | 45 | 50 | 212 | Spinach
21035 | Cinnamyl-alcohol dehydrogenase | 72 | 86 | 110 | 395 | Tobacco
21310 | CP24 Chl a/b binding 10B (3) | 57 | 59 | 67 | 325 | Tomato
34914 | Cystatin (2) | 34 | 52 | 77 | 189 | Maize
34891 | Cys synthase | 96 | 100 | 115 | 472 | Spinach
21207 | Cyt B6-F | 28 | 35 | 61 | 104 | Tobacco
20901 | Cyt P450 type I | 19 | 35 | 53 | 102 | Avocado
21629 | Cyt P450 type II | 47 | 63 | 91 | 240 | Avocado
21529 | Cyt P450 type III | 27 | 38 | 56 | 141 | Avocado
20949 | Cyt P450 type IV | 33 | 44 | 64 | 190 | Catharanthus roseus
20831 | Cyt P450 type V | 36 | 54 | 74 | 212 | Avocado
21232 | Cyt P450 type VI | 22 | 40 | 71 | 114 | Avocado
35095 | Cyt P450 type VII | 46 | 62 | 108 | 224 | Avocado
34823 | Dihydroflavonol-4-reductase | 26 | 32 | 45 | 140 | Antirrhinum majus
35104 | Dihydrolipoamide dehydrogenase | 23 | 25 | 26 | 127 | Pea
21239 | Early light-inducible protein | 22 | 30 | 49 | 101 | Barley
21077 | Elongation factor 1α (2) | 57 | 75 | 98 | 308 | Rice
35193 | Endo-1,3-β-glucosidase | 24 | 40 | 58 | 133 | Barley
35018 | Endoplasmin (HSP 90) | 55 | 67 | 103 | 275 | Barley
35046 | ENOD8 | 22 | 28 | 50 | 96 | Medicago sativa
35094 | Ethylene-forming enzyme | 59 | 61 | 73 | 316 | Brassica juncea
21611 | Flower senescence-related protein (6) | 64 | 86 | 104 | 360 | Dianthus caryophyllus
21368 | Fru-bisP aldolase (5) | 99 | 107 | 124 | 489 | Spinach
21405 | Geranylgeranyl PPi synthetase | 24 | 32 | 53 | 107 | Capsicum annuum
21126 | Glutamate synthase | 82 | 89 | 97 | 420 | Maize
21597 | Heat-shock 70-kD, mitochondrial | 95 | 104 | 109 | 483 | Pea
34838 | Hyp-rich glycoprotein | 23 | 28 | 67 | 107 | Zea diploperennis
21489 | Hypothetical 16.5-kD protein (4) | 30 | 48 | 81 | 130 | Tobacco
21545 | Hypothetical protein (6) | 30 | 38 | 51 | 163 | Strawberry
20791 | Initiation factor 4A | 107 | 111 | 112 | 597 | Tobacco
35147 | Initiation factor 5A | 16 | 21 | 24 | 83 | Medicago sativa
35055 | Isocitrate dehydrogenase (NADPH) | 91 | 99 | 113 | 482 | Soybean
34632 | Isopropylmalate dehydrogenase | 75 | 81 | 88 | 386 | Brassica napus
34857 | Jacalin heavy chain (2) | 20 | 31 | 62 | 95 | Jackfruit
34996 | Late embryogenesis abundant protein | 27 | 43 | 83 | 121 | Cotton
34942 | Lectin I | 24 | 41 | 88 | 102 | Medicago truncatula
21110 | Lectin II | 22 | 33 | 63 | 106 | Dolichos biflorus
34625 | Legumin | 15 | 26 | 46 | 90 | Vicia faba


Table II. Continued

EST# | Putative Identification | ID | Similar | Overlap | Score | Organism
34757 | Lupin-specific protein PPLZ02 | 27 | 37 | 50 | 137 | Lupinus polyphyllus
34917 | Major latex protein (2) | 30 | 37 | 72 | 133 | Papaver
21641 | Malate dehydrogenase (NADP) | 56 | 64 | 82 | 293 | Sorghum
34716 | Malate dehydrogenase, glyoxysomal | 34 | 38 | 45 | 171 | Citrullus vulgaris
21075 | Malate synthase, glyoxysomal | 49 | 50 | 57 | 272 | Brassica
21315 | Malic enzyme, NADP-dependent (2) | 105 | 117 | 133 | 544 | Populus trichocarpa
35164 | MAP kinase homolog type I | 38 | 56 | 73 | 220 | Pea
35158 | MAP kinase homolog type II | 16 | 20 | 21 | 88 | Medicago sativa
21468 | Membrane channel, root specific | 121 | 131 | 145 | 640 | Tobacco
21365 | Monodehydroascorbate reductase | 33 | 43 | 46 | 174 | Cucumis sativus
20850 | Multiple stimulus response protein (2) | 87 | 116 | 128 | 533 | Tobacco
20933 | Myb protein 308 | 86 | 89 | 95 | 471 | Antirrhinum
34862 | Myrosinase (3) | 42 | 47 | 49 | 233 | Brassica napus
20842 | Oryzain α chain (2) | 88 | 94 | 108 | 464 | Rice
20954 | Oryzain β chain (2) | 85 | 102 | 119 | 487 | Rice
21588 | Oryzain γ chain | 17 | 23 | 31 | 81 | Rice
34974 | Pathogenesis-related protein | 42 | 46 | 57 | 216 | Tobacco
21072 | Pectate lyase | 68 | 83 | 111 | 366 | Tobacco
21102 | Pectinesterase 2 | 22 | 30 | 46 | 102 | Tomato
21589 | PEP carboxylase (2) | 99 | 106 | 109 | 520 | Sorghum
34999 | Peroxidase I (3) | 52 | 58 | 72 | 277 | Cotton
21612 | Peroxidase II | 43 | 56 | 68 | 245 | Vigna angularis
21504 | Peroxidase III | 42 | 50 | 69 | 211 | Turnip
21256 | Peroxidase, cationic I | 50 | 59 | 74 | 270 | Tomato
35040 | Peroxidase, neutral | 36 | 37 | 44 | 185 | Horseradish
20957 | Phosphate translocator, chloroplast (3) | 94 | 100 | 104 | 498 | Spinach
21163 | Phosphoglycerate kinase | 44 | 45 | 52 | 211 | Spinach
21212 | Phosphoglycerate mutase (2) | 94 | 102 | 114 | 495 | Maize
21151 | Pistil extensin-like protein | 19 | 23 | 55 | 95 | Tobacco
20950 | Polygalacturonase inhibitor (2) | 25 | 38 | 55 | 110 | Pyrus communis
35112 | Profilin I, pollen antigen (4) | 67 | 76 | 100 | 369 | Maize
21171 | Pro-rich protein (2) | 18 | 23 | 35 | 105 | Brassica napus
21490 | PSI reaction center subunit IV | 16 | 17 | 17 | 88 | Barley
21471 | PSI 20-kD protein (2) | 81 | 86 | 90 | 423 | Spinach
21104 | PSI subunit III (2) | 50 | 56 | 61 | 262 | Flaveria trinervia
35162 | PSII 16-kD subunit | 30 | 35 | 42 | 144 | Spinach
20915 | PSII 23-kD protein (8) | 83 | 97 | 111 | 426 | Tomato
21018 | Putative membrane channel protein | 71 | 83 | 92 | 362 | Tobacco
21542 | Pyruvate decarboxylase | 36 | 41 | 47 | 207 | Maize
35083 | RAS-related GTP-binding protein | 39 | 41 | 44 | 160 | Rice
21474 | RAS-related GTP-binding protein | 42 | 52 | 56 | 314 | Pea
34758 | RAS-related GTP-binding protein (2) | 96 | 99 | 104 | 447 | Pea
35216 | Receptor-like protein kinase | 32 | 48 | 100 | 111 | Pyrus communis
35012 | Ribosomal protein L16 | 59 | 61 | 75 | 311 | Spinach
21609 | Ribosomal protein L23 | 90 | 91 | 93 | 501 | Sinapis alba
20962 | Ribosomal protein L24 | 61 | 78 | 106 | 317 | Pea
34693 | Rubber elongation factor | 24 | 38 | 62 | 130 | Hevea brasiliensis
21270 | S-Receptor kinase (2) | 49 | 60 | 83 | 244 | Brassica
34935 | S-Adenosyl-Met decarboxylase | 42 | 54 | 75 | 216 | Potato
21459 | Senescence-related protein DIN1 (3) | 70 | 86 | 110 | 360 | Radish
21066 | Stearoyl-ACP desaturase | 32 | 40 | 52 | 190 | Jojoba
34789 | Stem-specific protein | 24 | 40 | 96 | 93 | Tobacco
34759 | Suc-phosphate synthase | 25 | 30 | 44 | 126 | Potato
34768 | Vacuolar ATP synthase 16-kD subunit (3) | 35 | 36 | 54 | 160 | Avena sativa
35061 | Vacuolar ATP synthase 69-kD subunit | 96 | 98 | 115 | 477 | Carrot


Table III. Inventory of Arabidopsis ESTs with significant homology to nonplant genes

See Table II for an explanation of the column headings.

EST# | Putative Identification | ID | Similar | Overlap | Score | Organism
34717 | 26S protease subunit 4 | 35 | 47 | 61 | 183 | Human
21168 | 4-Nitrophenylphosphatase | 23 | 34 | 50 | 125 | Yeast
21327 | α-Agglutinin core subunit (Ser rich) | 28 | 53 | 97 | 101 | Yeast
21470 | Acyl carrier protein | 38 | 47 | 59 | 202 | Neurospora crassa
20896 | Acyl-CoA binding protein | 40 | 54 | 77 | 206 | Human
21287 | Acyl-CoA oxidase, peroxisomal (2) | 69 | 90 | 130 | 380 | Rat
20853 | ADP/ATP carrier protein | 45 | 68 | 97 | 235 | Rickettsia prowazekii
34961 | ADP/ATP carrier protein, mitochondrial | 27 | 38 | 61 | 141 | Yeast
20862 | Ala aminotransferase | 27 | 39 | 51 | 152 | Human
34824 | Alcohol dehydrogenase | 27 | 33 | 52 | 144 | Chicken
35201 | Aldehyde dehydrogenase type I | 39 | 43 | 56 | 196 | Bovine
34661 | Aldehyde dehydrogenase type II | 24 | 31 | 50 | 104 | Human
21174 | α toxin | 33 | 44 | 69 | 185 | Clostridium perfringens
21148 | α-Glucosidase, lysosomal (3) | 29 | 38 | 54 | 181 | Human
34725 | Aminomethyltransferase | 18 | 21 | 26 | 105 | Bovine
21451 | Ankyrin 2 | 34 | 48 | 88 | 103 | Human
34907 | Apolipoprotein A-IV | 19 | 34 | 72 | 86 | Mouse
20952 | Arsenical pump-driving ATPase | 23 | 32 | 47 | 100 | E. coli
21614 | Aspartic acid rich protein | 11 | 12 | 14 | 92 | Plasmodium falciparum
20849 | ATP synthase B', chloroplast | 37 | 67 | 123 | 147 | Synechococcus PCC6301
21343 | ATP synthase E subunit, vacuolar | 41 | 56 | 78 | 204 | Manduca sexta
35126 | ATP-binding protein | 20 | 24 | 28 | 101 | E. coli
21606 | Bacteriochlorophyll synthase | 23 | 29 | 44 | 85 | Rhodobacter sphaeroides
20782 | Bacterioferritin co-migratory protein | 21 | 29 | 46 | 94 | E. coli
20927 | β-Hydroxybutyryl-CoA dehydrogenase | 41 | 60 | 64 | 282 | Clostridium acetobutylicum
34602 | Brown fat uncoupling protein | 17 | 20 | 26 | 88 | Bovine
21521 | Cathepsin E | 23 | 34 | 61 | 110 | Cavia porcellus
21051 | Cell-division control protein | 88 | 101 | 117 | 405 | Yeast
35111 | Cell-division protein | 73 | 84 | 101 | 383 | E. coli
21250 | Chaperonin-like protein | 77 | 90 | 105 | 406 | Human
34754 | Citrate lyase | 25 | 44 | 92 | 81 | Rat
21019 | Collagen-related protein 2 | 16 | 23 | 28 | 101 | Hydra magnipapillata
21568 | Cyclic nucleotide gated channel | 25 | 49 | 100 | 94 | Channel catfish
21242 | Cysteinyl-tRNA synthetase | 27 | 28 | 39 | 150 | E. coli
21622 | Cyt B561 | 39 | 64 | 113 | 158 | Bovine
21060 | Cytoplasmic protein transport (sec23) | 60 | 80 | 104 | 345 | Yeast
21427 | Diaminopimelate epimerase | 37 | 51 | 79 | 180 | E. coli
20865 | DNA repair protein RAD18 | 14 | 17 | 34 | 85 | Yeast
21064 | Dynamin-1 | 63 | 85 | 117 | 328 | Rat
21350 | Elongation factor 2 | 91 | 100 | 108 | 485 | Chlorella kessleri
34668 | Elongation factor 3 | 19 | 27 | 45 | 94 | Yeast
21342 | Elongation factor 1, gamma (2) | 30 | 39 | 58 | 146 | Artemia
34669 | Elongation factor Tu | 84 | 96 | 116 | 447 | Thermus aquaticus
21515 | Endoglucanase (cellulase) | 25 | 34 | 46 | 142 | Clostridium thermocellum
21030 | Epoxide hydrolase | 41 | 64 | 99 | 265 | Human
21109 | Ferripyochelin binding protein | 24 | 37 | 50 | 143 | Pseudomonas aeruginosa
34634 | Fibronectin binding protein (Pro rich) | 20 | 25 | 51 | 97 | Staphylococcus aureus
35035 | Galactokinase | 33 | 46 | 81 | 165 | Human
34742 | Gephyrin, microtubule-associated protein | 23 | 35 | 52 | 112 | Rat
34945 | Glc derepression factor POP2 | 13 | 27 | 40 | 86 | Yeast
21639 | Glc transport protein | 32 | 63 | 105 | 122 | Synechocystis PCC6803
34666 | Glc-6-phosphate dehydrogenase (2) | 51 | 65 | 87 | 270 | Yeast
21650 | Glutamate decarboxylase (2) | 49 | 68 | 95 | 227 | E. coli
34752 | Glutamate synthase | 22 | 27 | 47 | 91 | Azospirillum brasilense
34655 | Glutaredoxin | 29 | 37 | 59 | 137 | Yeast
21579 | Granaticin polyketide synthase | 25 | 35 | 61 | 126 | Streptomyces violaceoruber
35125 | GTP binding protein | 45 | 62 | 64 | 243 | Yeast
34833 | GTP binding protein | 23 | 30 | 57 | 113 | Yeast
34989 | GTP binding protein | 31 | 37 | 41 | 162 | S. pombe
33953 | GTP cyclohydrolase II | 52 | 60 | 81 | 258 | B. subtilis
34066 | Hemoglobinase | 38 | 50 | 69 | 200 | Schistosoma japonicum


Table III. Continued

EST# | Putative Identification | ID | Similar | Overlap | Score | Organism
21627 | His-rich glycoprotein | 38 | 53 | 98 | 214 | Plasmodium lophurae
21636 | Histone H2A type IV (2) | 63 | 72 | 93 | 270 | Volvox carteri
35084 | Histone H2A type VI | 40 | 48 | 66 | 201 | Chicken
35067 | Hydantoinase | 42 | 57 | 86 | 219 | Pseudomonas putida
21533 | Hydrogenase (2) | 20 | 33 | 61 | 99 | Anabaena cylindrica
21086 | Hydroxylase | 14 | 29 | 44 | 83 | Streptomyces halstedii
35072 | Hypothetical 20-kD open reading frame | 19 | 25 | 37 | 101 | E. coli
35098 | Hypothetical 272-kD protein | 51 | 65 | 90 | 317 | C. elegans
34961 | Hypothetical 96-kD yeast protein YKL525 | 27 | 38 | 61 | 141 | Yeast
21245 | Hypothetical Pro-rich protein | 16 | 16 | 19 | 105 | Owenia fusiformis
21541 | Hypothetical protein | 45 | 56 | 84 | 223 | C. elegans
20970 | Inositol 1,4,5-triphosphate 5-phosphatase | 24 | 34 | 52 | 125 | Human
34828 | Isocitrate dehydrogenase (NAD+) | 36 | 45 | 70 | 190 | Yeast
20940 | Isopentenyl-diphosphate isomerase | 23 | 27 | 33 | 126 | Yeast
21286 | Isopropylmalate dehydratase | 38 | 55 | 102 | 170 | Phycomyces blakesleeanus
21208 | KDEL receptor | 43 | 53 | 63 | 221 | Human
34778 | Keratin type II | 36 | 38 | 74 | 177 | Mouse
34763 | Kinase, casein type I 6 | 29 | 39 | 51 | 165 | Rat
20946 | Kinase, casein type II | 23 | 35 | 50 | 113 | Human
21083 | Kinesin light chain isoform 4 | 24 | 34 | 77 | 89 | Strongylocentrotus purpuratus
21071 | Lactaldehyde dehydrogenase | 17 | 34 | 50 | 96 | E. coli
21098 | Lactoyl glutathione lyase | 17 | 25 | 38 | 85 | Human
21005 | Lipase | 19 | 25 | 36 | 90 | Rhizomucor miehei
35034 | Malate dehydrogenase | 36 | 40 | 55 | 169 | Thermus aquaticus
20789 | Malate dehydrogenase, cytoplasmic | 22 | 29 | 40 | 111 | Pig
21219 | Mei2 gene | 54 | 69 | 98 | 285 | S. pombe
34929 | Met synthase | 53 | 60 | 93 | 263 | E. coli
21169 | Met synthase | 43 | 52 | 88 | 189 | Yeast
21037 | Methylamine oxidase | 49 | 73 | 101 | 291 | Arthrobacter sp.
21603 | MHC class III RD-repeat protein | 21 | 39 | 55 | 135 | Mouse
20839 | MHC H-2K/t-w5-linked open reading frame | 20 | 29 | 40 | 88 | Mouse
34926 | Microtubule-associated protein | 23 | 43 | 98 | 87 | Rat
34745 | Mitosis inducer (protein kinase) | 28 | 33 | 65 | 118 | S. pombe
34909 | MOV34, embryogenesis factor | 65 | 88 | 114 | 352 | Mouse
21175 | Multidrug resistance protein | 44 | 58 | 70 | 230 | Human
21643 | Multiple antibiotic resistance | 29 | 30 | 30 | 153 | E. coli
21565 | NADPH dehydrogenase | 30 | 54 | 128 | 133 | Yeast
35063 | Neurofilament protein H, form H1 | 32 | 40 | 98 | 107 | Rabbit
20928 | Nucleolin | 20 | 37 | 79 | 83 | Chicken
35026 | Oxoglutarate/malate carrier protein | 23 | 31 | 49 | 112 | Human
21032 | Oxoisovalerate dehydrogenase | 77 | 91 | 110 | 423 | Bovine
35166 | Oxysterol binding protein (2) | 18 | 25 | 38 | 98 | Human
21380 | Paired amphipathic helix protein | 18 | 37 | 73 | 102 | Yeast
35003 | Pancreatic tumor-related protein | 29 | 39 | 75 | 130 | Human
34624 | Peptidyl-prolyl cis-trans isomerase | 16 | 27 | 47 | 87 | E. coli
21505 | Phosphogluconate dehydrogenase | 20 | 31 | 39 | 120 | E. coli
21027 | Phospholipase C | 16 | 28 | 37 | 94 | Listeria
34811 | Phosphoprotein phosphatase 2C | 24 | 28 | 40 | 129 | Rat
21266 | Placental protein 15 | 31 | 47 | 69 | 170 | Human
21288 | Poly-pyrimidine tract-binding protein | 19 | 26 | 45 | 88 | Rat
35089 | Pre-mRNA splicing factor | 26 | 39 | 56 | 149 | Human
20904 | Prohibitin | 56 | 79 | 99 | 294 | Human
34783 | Proteasome component C3 | 47 | 55 | 69 | 227 | Human
34679 | Protein kinase C receptor | 54 | 66 | 88 | 285 | Rat
21511 | Protein kinase, G2-specific | 38 | 50 | 74 | 197 | S. pombe
20942 | Proteasome component PUP1 | 37 | 54 | 67 | 187 | Yeast
20958 | Quinone oxidoreductase | 32 | 42 | 65 | 161 | E. coli
34596 | Raf-1 proto oncogene | 15 | 20 | 28 | 80 | Human
21264 | Riboflavin synthetase | 34 | 55 | 71 | 200 | Photobacterium
35180 | Ribosomal protein HS6 | 32 | 53 | 83 | 166 | Haloarcula marismortui
34860 | Ribosomal protein L10 | 23 | 37 | 57 | 111 | Yeast


Table III. Continued

EST# | Putative Identification | ID | Similar | Overlap | Score | Organism
21408 | Ribosomal protein L14A | 40 | 50 | 66 | 206 | Xenopus
35141 | Ribosomal protein L3 | 13 | 20 | 26 | 82 | Cyanophora paradoxa
34777 | Ribosomal protein L36 (2) | 41 | 51 | 80 | 196 | Rat
21421 | Ribosomal protein L38 | 38 | 42 | 45 | 211 | Rat
20797 | Ribosomal protein L4 | 15 | 29 | 38 | 101 | Yeast
21441 | Ribosomal protein L41 | 73 | 77 | 78 | 404 | Candida maltosa
21513 | Ribosomal protein L6 | 45 | 60 | 79 | 246 | Cyanophora paradoxa
34657 | Ribosomal protein L8 | 51 | 68 | 89 | 290 | Rat
21353 | Ribosomal protein S2 | 43 | 53 | 90 | 199 | Drosophila
34892 | Ribosomal protein S3 (2) | 92 | 96 | 116 | 456 | Xenopus
35058 | Ribosomal protein S6 (2) | 47 | 60 | 82 | 251 | Human
21553 | Ribosomal protein S8 | 60 | 80 | 112 | 314 | Rat
35096 | Ribosomal protein URP1 (2) | 42 | 55 | 86 | 218 | Yeast
20909 | Ribosomal protein YL41 | 20 | 22 | 25 | 101 | Yeast
21318 | RNA binding protein | 33 | 42 | 57 | 178 | Drosophila
21113 | RNA helicase (2) | 36 | 57 | 76 | 210 | E. coli
35142 | RNA polymerase I suppressor protein | 22 | 29 | 39 | 114 | Yeast
21224 | S-Adenosyl-Met decarboxylase (2) | 40 | 54 | 92 | 173 | Rat
21293 | Sec14 | 27 | 35 | 62 | 124 | Yeast
21218 | Ser-rich protein | 24 | 29 | 45 | 87 | Plasmodium falciparum
21114 | Single-stranded DNA binding protein | 18 | 28 | 44 | 101 | Human
21158 | Small nuclear ribonucleoprotein | 48 | 59 | 74 | 248 | Drosophila
21397 | Splicing factor U2AF 65-kD subunit | 29 | 45 | 63 | 168 | Mouse
20956 | Stress-inducible protein | 31 | 41 | 52 | 166 | Fusarium oxysporum
21007 | Succinyl CoA synthetase | 16 | 27 | 37 | 90 | Thermus aquaticus
21627 | Sulfated surface glycoprotein (2) | 25 | 26 | 42 | 140 | Volvox carteri
34904 | Surface glycoprotein | 17 | 22 | 27 | 94 | Trypanosoma brucei
21155 | Synaptobrevin | 23 | 50 | 77 | 136 | Yeast
21644 | T-cell specific protein | 20 | 26 | 37 | 115 | Mouse
35122 | Tetrahydrofolate synthase | 32 | 46 | 84 | 156 | Yeast
21026 | Thermophilic factor | 25 | 35 | 74 | 104 | Sulfolobus shibatae
21210 | Thiolase (2) | 17 | 28 | 44 | 93 | Human
21360 | Transaldolase | 26 | 43 | 65 | 136 | Yeast
21430 | Transketolase | 43 | 56 | 82 | 197 | E. coli
20984 | Translation factor Sui1 | 20 | 24 | 32 | 109 | Yeast
34667 | Tyrosine aminotransferase | 41 | 60 | 97 | 222 | Rat
34957 | Vacuolar ATPase 36-kD subunit | 28 | 44 | 67 | 149 | Yeast
21387 | Vacuolar sorting protein VPS35 | 27 | 40 | 62 | 135 | Yeast
34741 | Valosin-containing polypeptide (CDC48) | 70 | 86 | 101 | 363 | Pig
34795 | Vitellogenin (Ser-rich) | 26 | 31 | 57 | 80 | Chicken

the total genome size of Arabidopsis, the average size of a gene, and the average distance between genes, it has been estimated that Arabidopsis has enough DNA to encode only about 25,000 genes at most (assuming 1 kb between adjacent genes) (Gibson and Somerville, 1993) and probably has fewer (Meyerowitz, 1994). Thus, if redundant sequencing could be avoided, one laboratory with several automated sequenators could be expected to obtain sequences of cDNAs for most or all of the genes in Arabidopsis in about 3 years.

As in other EST sequencing projects, we initially relied on randomly chosen cDNA clones as a source of ESTs. Thus, the first few thousand Arabidopsis and rice ESTs are enriched with sequences representing highly abundant cDNAs. Because moderately expressed genes tend to have a higher probability of showing homology to a known gene in the sequence data bases than weakly expressed genes (Green et al., 1993), the relatively high frequency of ESTs that exhibited homology to a known gene may not be sustained upon further sequencing. However, during the sequencing of more than 85,000 human ESTs, there was no significant decline in the frequency with which an EST could be assigned putative function by comparison to the data bases (C. Venter, personal communication).

The rationale for using the single cDNA library described here rather than a collection of libraries prepared from different tissues (Hofte et al., 1993) is that in order to sequence cDNAs for all of the expressed genes in Arabidopsis by sequencing randomly chosen anonymous clones, it will be necessary to develop methods that limit redundant sequencing of abundant cDNAs and simultaneously enhance the frequency of cDNAs that are expressed at very low levels or that are expressed only in a small number of cells or under nonambient conditions. One of the ways in which this may be accomplished is by the construction of a normalized library in which all cDNA clones are represented with similar abundance (Sankhavaram et al., 1991). The use of the λZipLox vector and the decision to pool mRNA from various tissues rather than to sequence tissue-specific libraries were guided by considerations pertaining to library normalization. Since many examples of "tissue-specific gene expression" are quantitative rather than qualitative, effective normalization of tissue-specific libraries would be expected to eliminate much or all of the information associated with the origin of the RNA. By contrast, since many of the same genes would be present in all of the tissue-specific libraries, the use of these libraries would require the sequencing of much larger numbers of ESTs in order to sample at least 95% of all cDNAs at least once. Tissue-specific libraries do offer the advantage that the tissue of origin of a cDNA is known. However, since it is not known if this is the only tissue of expression, this information is of limited value unless a very large number of sequences are available from all tissues.
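The sampling argument can be made concrete with a rough calculation. The sketch below assumes an idealized, perfectly normalized library in which each of N distinct cDNAs is equally likely to be picked, so that a given cDNA is missed after n random picks with probability (1 - 1/N)^n. The function name and the use of the 25,000-gene upper bound are illustrative assumptions, not part of the analysis reported here.

```python
import math


def clones_needed(n_genes: int, coverage: float) -> int:
    """Random picks needed so that any given cDNA has been sampled at least
    once with probability `coverage`, assuming all cDNAs are equally abundant."""
    if n_genes < 2:
        raise ValueError("n_genes must be at least 2")
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    miss_per_pick = 1.0 - 1.0 / n_genes
    # Smallest integer n with miss_per_pick ** n <= 1 - coverage.
    return math.ceil(math.log(1.0 - coverage) / math.log(miss_per_pick))


# Hypothetical use with the upper bound on gene number quoted in the text.
picks_for_95_percent = clones_needed(25_000, 0.95)
```

Under these assumptions the answer is roughly three times the number of genes. Any departure from equal abundance raises the figure, which is the motivation for normalization.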

In summary, we believe that it is feasible to obtain partial sequence information on all cDNAs in the foreseeable future. Thus, many of the problems currently associated with gene isolation in higher plants will increasingly become problems associated with gene identification in data bases.

Accessing the Data Bases

Although various lists of EST homologies are available (e.g. Tables I, II, and III), such lists are rapidly made obsolete because of increases in the information and sequence content of the international data bases. Most of the groups that are currently sequencing large numbers of plant ESTs deposit the sequences in international public data bases on a regular basis but may not publish summaries of the information for extended periods, if at all. Therefore, the only practical method to determine if an EST is available for a particular protein is to search the data bases directly on a regular basis. There are two different ways to search for an EST that differ in the speed, the level of computer resources required, and the ability of the user to control the parameters of the search. Perhaps the simplest method, but also the most powerful, is to directly compare a known gene product with the translation of all six frames of all the ESTs in the public data bases. The second method is to perform a text search of a "preanalyzed" version of the EST data base.
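For readers unfamiliar with six-frame translation, the following minimal sketch shows what the first method computes for a single EST: three reading frames on the given strand and three on its reverse complement. It assumes the standard genetic code and writes any codon containing an ambiguous base as X; the function names are illustrative and are not taken from the BLAST programs.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of `dna`; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[int, str]:
    """Return translations keyed by frame: +1, +2, +3 and -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

A protein query is then compared against each of the six translated strings, which is why a hit is reported together with a frame such as +1 or -3.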

The various steps involved in comparing an amino acid sequence with the six-frame translation of the dbEST data base are best illustrated by example. Suppose the goal is to determine if an Arabidopsis EST corresponding to the KDEL receptor is present in the data base (an EST corresponding to the KDEL receptor was, in fact, identified at an early stage of the EST project and shown to encode a functional homolog of the yeast protein [Lee et al., 1993]). The first step would be to retrieve the sequences of the various KDEL receptors known in other organisms. This could be done by sending an e-mail message with the keyword erd2 (the name of the yeast gene for the KDEL receptor) to the address: [email protected]. The subject of the message could look exactly like the following example (although other optional instructions could be added and data bases other than GenBank could be searched):

DATALIB GENBANK
BEGIN
ERD2

Within several minutes, the RETRIEVE server will send back a listing of the records for all accessions with the keyword erd2. It is important to note that similar information could be obtained by using many other keywords or accession numbers. For instance, the search could also be done using the words KDEL and receptor instead of the word erd2. To obtain detailed instructions for the formatting of searches and other options, send the word help to the same address.

The next step is to use one or more of the protein sequences corresponding to known KDEL receptors to search the dbEST data base by sending an e-mail message to the BLAST server at [email protected]. The BLAST algorithm is a heuristic for finding ungapped, locally optimal sequence alignments, which was developed by the NCBI at the National Library of Medicine (Altschul et al., 1990). The BLAST family of programs employs this algorithm to compare an amino acid query sequence against a protein sequence data base or a nucleotide query sequence against a nucleotide sequence data base, as well as other combinations of protein and nucleic acid. In the example below we request that the yeast erd2 amino acid sequence be compared to the six possible translations of all the nucleotide sequences in the dbEST data base. It is a testament to the power of the current generation of computers that this task requires only a few minutes. Here we have used only the yeast sequence, but in practice it would probably be best to use all known sequences from different classes of organisms. The message could be structured as follows (although other options are available):

PROGRAM TBLASTN
DATALIB DBEST
BEGIN
>TEST OF YEAST ERD2
MNPFRILGDLSHLTSILILIHNIKTTRYIEGISFKTQTLYALVF
ITRYLDLLTFHWVSLYNALMKIFFIVSTAYIVVLLQGSKRTNTI
AYNEMLMHDTFKIQHLLIGSALMSVFFHHKFTFLELAWSFSVWL
ESVAILPQLYMLSKGGKTRSLTVHYIFAMGLYRALYIPNWIWRY
STEDKKLDKIAFFAGLLQTLLYSDFFYIYYTKVIRGKGFKLPK

The fourth line, which begins with the character >, is a comment that serves to identify the search. The amino acid sequence that follows should not have any blanks or characters other than the single-letter code for the amino acid sequence. The result of this search is a listing of the entries in dbEST that exhibit the highest degree of protein and nucleotide similarity to the query sequence. An example of the abbreviated output for this search is shown in Figure 2. The search identified two Arabidopsis ESTs with significant homology to the erd2 protein, as well as a yeast and human EST.
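Because the server expects a sequence free of blanks and stray characters, it can be convenient to assemble the message body programmatically. The sketch below is one hypothetical way to do so for a protein query: it strips whitespace, rejects anything outside the 20 standard single-letter codes, and wraps the sequence at a fixed width. The function name and the 44-character line width are assumptions; only the PROGRAM, DATALIB, BEGIN, and > lines come from the example above. Sending the message is left to the reader's mail program.

```python
AMINO_ACID_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def build_blast_message(program: str, database: str, comment: str,
                        sequence: str, width: int = 44) -> str:
    """Assemble the body of a BLAST e-mail request for a protein query."""
    cleaned = "".join(sequence.split()).upper()  # drop blanks and line breaks
    invalid = sorted(set(cleaned) - AMINO_ACID_CODES)
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {invalid}")
    lines = [
        f"PROGRAM {program}",
        f"DATALIB {database}",
        "BEGIN",
        f">{comment}",
    ]
    # The sequence follows the comment line, wrapped at a fixed width.
    lines.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    return "\n".join(lines)
```

For example, `build_blast_message("TBLASTN", "DBEST", "TEST OF YEAST ERD2", sequence)` would reproduce the layout of the message shown above for a sequence pasted from a formatted listing that still contains spaces and line breaks.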

Query=  TEST OF YEAST ERD2
        (219 letters)

Database:  Database of Expressed Sequence Tags, Release 2.11, May 5, 1994
           34,292 sequences; 10,827,722 total letters.

                                                                      Smallest
                                                                       Poisson
                                                         Reading  High Probability
Sequences producing High-scoring Segment Pairs:            Frame Score  P(N)      N

gnl|dbest|21208 cDNA Lambda-PRL2 A.thaliana Homology:...     +1   165  1.1e-17   1
gnl|dbest|34427 cDNA Lambda-PRL2 A.thaliana Homology:...     +1   165  1.1e-17   1
gnl|dbest|36310 cDNA Rice callus O.sativa Homology: s...     +2   134  1.2e-12   1
gnl|dbest|27594 cDNA Infant Brain, Bento Soares H.sapi...    +2    87  1.1e-06   1
gnl|dbest|26586 cDNA Fetal brain, Stratagene (catH362...     -1    61  0.17      1
gnl|dbest|37898 cDNA Rice root O.sativa Homology: sp|...     +3    58  0.51      1
gnl|dbest|38740 cDNA STRATAGENE Human skeletal muscle...     +1    57  0.56      1
gnl|dbest|28850 Genomic gmbPfHB3.1, G. Roman Reddy P.f...    +1    45  0.57      2
gnl|dbest|44038 cDNA cbsPfHB3.1, Debopam Chakrabarti P...    -3    54  0.87      1
gnl|dbest|33267 cDNA Strasbourg-FA A.thaliana Homolog...     -3    53  0.93      1
gnl|dbest|29657 cDNA Ra147.1 A.thaliana Homology: gb|...     +3    53  0.94      1
gnl|dbest|29383 cDNA MHB3MA Cot8-HAP-Ft H.sapiens Ho...      +2    40  0.97      2
gnl|dbest|21810 cDNA Stratagene cDNA library Human hea...    -1    53  0.97      1
gnl|dbest|4426 cDNA CLONTECH cDNA library CCRF-CEM, c...     -3    40  0.97      2
gnl|dbest|19765 cDNA TEST1, Human adult Testis tissue...     +2    48  0.97      1
gnl|dbest|4042 cDNA CLONTECH cDNA library CCRF-CEM, c...     -2    37  0.98      2
gnl|dbest|30892 cDNA Human pancreatic islet H.sapiens...     +3    43  0.991     1

>gnl|dbest|21208 cDNA Lambda-PRL2 A.thaliana Homology: pir|A42286|A42286
            ERD-2-like protein, ELP-1 - human Score: 221 pVal: 9.5e-27
            Length = 397

  Plus Strand HSPs:

 Score = 165 (77.7 bits), Expect = 1.1e-17, P = 1.1e-17
 Identities = 34/53 (64%), Positives = 39/53 (73%), Frame = +1

Query:     1 MNPFRILGDLSHLTSILILIHNIKTTRYIEGISFKTPTLYALVFITRYLDLLT 53
             MN FR  GD+SHL S+LIL+  I  T+   GIS KTP LYALVF+TRYLDL T
Sbjct:    94 MNIFRFAGDMSHLISVLILLLKIYATKSCAGISLKTPELYALVFLTRYLDLFT 252

Figure 2. Example of the result of a TBLASTN search of dbEST using the amino acid sequence of the yeast erd2 gene product as a query. The output of the search has been abbreviated by omission of additional descriptive information.

The final step is to retrieve the nucleotide sequences for the relevant ESTs from the dbEST data base. Each EST has a variety of numbers associated with it that can be used to retrieve it from the data base. The simplest of these is the dbEST accession number, which we have used for the examples in Tables II and III. In the following example, we retrieved only the top two Arabidopsis ESTs. The retrieval is done by sending the following e-mail message to [email protected]:

RPT 21208 34427
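The accession numbers for such a retrieval request can also be pulled out of the BLAST hit list automatically. The sketch below assumes that each hit occupies one line ending in the four columns named in Figure 2 (reading frame, high score, P(N), and N) and that the identifier has the form gnl|dbest|NUMBER. The function names and the probability cutoff are illustrative choices, not part of the server's documented interface.

```python
def parse_hit(line: str) -> dict:
    """Split one hit line into its description and four trailing columns."""
    description, frame, score, p_value, n = line.rsplit(None, 4)
    return {
        "accession": description.split()[0].split("|")[-1],
        "description": description,
        "frame": int(frame),        # e.g. "+1" -> 1, "-3" -> -3
        "score": int(score),
        "p_value": float(p_value),
        "n": int(n),
    }


def build_retrieve_message(hit_lines, organism: str, max_p: float) -> str:
    """Build an RPT request for hits from `organism` with P(N) <= max_p."""
    hits = [parse_hit(line) for line in hit_lines if line.strip()]
    selected = [
        hit["accession"]
        for hit in hits
        if organism in hit["description"] and hit["p_value"] <= max_p
    ]
    return "RPT " + " ".join(selected)
```

Applied to a hit list with `organism="A.thaliana"` and a stringent cutoff such as `max_p=1e-5`, this keeps only the Arabidopsis ESTs whose similarity is unlikely to have arisen by chance.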

Once the sequences are retrieved, it is worthwhile to complete several additional searches. First, the nucleotide sequences should be compared to the data bases to determine if additional parts of the cDNA clone are represented in the data base. This is probably best done by appending the uninterrupted nucleotide sequence as the fifth and subsequent lines to the following message and sending it to the BLAST server at [email protected]:

PROGRAM BLASTN
DATALIB NR
BEGIN
>COMMENT

It is also useful to compare the deduced amino acid sequence encoded by the six frames of the EST to the protein data bases. In this case the uninterrupted nucleotide sequence is placed behind the following message to the BLAST server:

PROGRAM BLASTX
DATALIB NR
BEGIN
>COMMENT

On-Line Similarity Results

The simplest method to determine if an EST is available for a particular gene product is to use one of the newly developed WWW servers to perform a text search of all the dbEST records. This service exploits the fact that when each new EST is deposited in the dbEST data base, the six-frame translation and the nucleotide sequence of the EST are compared to the known or deduced protein and nucleic acid sequences of all known genes and gene products, and a report of significant homologies is appended to the EST sequence. Each time the sequence data bases are updated, the EST records are also reanalyzed and updated. The WWW servers permit one to search the entire record of all ESTs for the presence or absence of specific combinations of words. A consequence of this is that if one uses the word Arabidopsis to search dbEST, one recovers not only all the Arabidopsis sequences but also all ESTs from other organisms that show homology to an Arabidopsis gene or gene product. However, by using the word cress, only Arabidopsis ESTs are recovered because only primary Arabidopsis records contain the common name of the species (i.e. thale cress).
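The difference between searching with Arabidopsis and searching with cress follows directly from how a whole-record word search behaves. The toy example below illustrates the effect with two made-up records, one a primary Arabidopsis entry and one a rice entry whose appended homology report mentions Arabidopsis. The record texts, names, and matching rule are hypothetical and do not reproduce the server's actual implementation.

```python
import re


def matches_all(record_text: str, terms: list[str]) -> bool:
    """True if every search term occurs as a whole word in the record."""
    words = set(re.findall(r"[a-z0-9]+", record_text.lower()))
    return all(term.lower() in words for term in terms)


def search(records: dict[str, str], terms: list[str]) -> list[str]:
    """Return the names of all records containing every term."""
    return sorted(name for name, text in records.items()
                  if matches_all(text, terms))


# Hypothetical records: sequence annotation followed by a homology report.
RECORDS = {
    "EST-A": "Organism: Arabidopsis thaliana (thale cress). "
             "Homology: protein kinase, tomato",
    "EST-B": "Organism: Oryza sativa (rice). "
             "Homology: Arabidopsis thaliana protein kinase",
}

by_genus_name = search(RECORDS, ["Arabidopsis"])       # both records match
by_common_name = search(RECORDS, ["cress", "kinase"])  # only the primary record
```

Combining terms in this way is also how a query such as cress and kinase narrows the result to Arabidopsis ESTs with kinase homologies.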

The WWW is an information retrieval system that provides on-line "point-and-click" access to documentation and multimedia information through hypertext links. A description of how to access the WWW is beyond the scope of this article. However, interested readers should consult a very intelligible introduction to the WWW that describes how to obtain the appropriate software and some features of some of the other molecular biology servers available on the WWW (Appel et al., 1994). To perform a search of the EST data base dbEST, open the URL HTTP://WWW.NCBI.NLM.NIH.GOV. When the NCBI top page opens, select dbEST by clicking on the hypertext link (the blue-colored word dbEST), then select search dbEST and enter the desired search conditions. For instance, to retrieve the Arabidopsis ESTs with homology to kinases, enter cress and kinase.

In addition to the archival sequence information contained in dbEST, analytical results from similarity searches performed at the University of Minnesota are available via a new WWW server. The information presented in this server represents a subset of the information being developed to support the Arabidopsis sequencing effort at Michigan State University. In this data base project, raw data from the MSU Arabidopsis EST project are uploaded, prepared, and analyzed with a variety of tools, including automatic removal of vector sequences, similarity searches with BLASTN and BLASTX, and the detection and analysis of low-complexity regions of sequence. Reports from these analyses are indexed and hence are searchable by terms found in the descriptions of similarities generated by the BLAST suite of programs. A search term can therefore be entered in the field, and all clones having similarity to this search term in their descriptions will be displayed. A hypertext link from the clone name accesses a complete report for that clone, which includes information as to whether vector sequence was detected and removed, the percentage of ambiguous bases in the sequence, a summary of all hits generated by BLASTN and BLASTX (in addition to the complete reports), whether low-complexity regions were detected, and if so, the report of BLASTP reanalysis of the sequence. Links are also made from the reports to the full data base entry and related information presented by the NCBI WWW server.

In addition to the textual report information, a graphical, nontext representation of these reports is currently under development, and a view of this graphical presentation will be included as a supplement to the textual information. The Arabidopsis/MN WWW server can be accessed at URL HTTP://LENTI.MED.UMN.EDU (page down and click on the Arabidopsis icon to access the EST server). This server is at an early stage of development and is, therefore, undergoing constant modifications. New capabilities will be added with regularity, particularly in the area of new similarity tools and querying abilities.

Interpreting Scores

The BLAST programs report the degree of sequence similarity between an EST and the sequences in the public data bases by three summary numbers: the score, the expect value, and the probability (Fig. 2). The derivation of these numbers and the underlying design of the BLAST programs have been described by Altschul et al. (1990). A nontechnical summary of what the score means can be obtained by sending an e-mail message with the word help to blast@ncbi.nlm.nih.gov. In brief, the BLAST score is calculated by pairwise mapping of each amino acid (or small groups of amino acids) from a segment of the query sequence onto a gap-free segment of the subject sequence. A 20 × 20 substitution matrix is used to assign an integer value to each aligned pair of amino acids based on a model of the probability of exchanging one amino acid for another. The most widely used substitution scores are variations of the PAM matrix, which is a statistical summary of amino acid substitutions observed in groups of closely related proteins from more than 70 superfamilies of proteins (Dayhoff et al., 1979). The score is the sum of the integer values of the aligned amino acids from the region of the sequences with the highest similarity.
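The summation described above can be illustrated with a minimal sketch. The matrix values below are invented for the example and are not PAM values, and `TOY_MATRIX`, `pair_score`, and `segment_score` are hypothetical names; BLAST itself also searches for the highest-scoring segment, which this sketch does not attempt.

```python
# Toy substitution values chosen for illustration; these are NOT PAM scores.
TOY_MATRIX = {
    ("A", "A"): 2, ("L", "L"): 6, ("I", "I"): 5, ("K", "K"): 5, ("R", "R"): 6,
    ("L", "I"): 2, ("K", "R"): 3,
}
DEFAULT_MISMATCH = -2  # assumed score for any pair missing from the toy matrix

def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a symmetric substitution score for one aligned pair of residues."""
    return matrix.get((a, b), matrix.get((b, a), DEFAULT_MISMATCH))

def segment_score(query_segment, subject_segment, matrix=TOY_MATRIX):
    """Sum the pair scores over two gap-free segments of equal length."""
    if len(query_segment) != len(subject_segment):
        raise ValueError("gap-free segments must have equal length")
    return sum(pair_score(q, s, matrix)
               for q, s in zip(query_segment, subject_segment))
```

For example, aligning `ALKA` against `AIRA` under this toy matrix sums one identical pair (A/A), two conservative substitutions (L/I and K/R), and another identical pair.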

The expect value reported for each HSP is the number of times an HSP of equal or greater score is expected to occur by chance alone during the data base search. Thus, the total length of the data base figures into this estimate. The Poisson P-value for any given HSP is a function of its expected frequency of occurrence (due to chance) and the number of HSPs observed against the same data base sequence with scores at least as high. The expect and probability (P) values reported for HSPs are dependent on numerous factors, including the scoring scheme employed, the residue composition of the query sequence, an assumed residue composition for a typical data base sequence, the length of the query sequence, and the total length of the data base.
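The relationship between these numbers can be sketched with the standard Karlin-Altschul form, in which the expect value is E = K·m·n·exp(−λS) and the Poisson probability of seeing at least r HSPs is one minus the sum of the first r Poisson terms. This is a simplified illustration, not the exact computation performed by the BLAST programs; the function names are hypothetical, and the parameters K and λ depend on the scoring scheme and residue compositions, so any values passed in are assumptions.

```python
import math

def expect_value(score, query_len, db_len, k, lam):
    """Expected number of chance HSPs scoring at least `score` (E = K*m*n*exp(-lambda*S))."""
    return k * query_len * db_len * math.exp(-lam * score)

def poisson_p(expect, n_hsps=1):
    """Probability of observing at least `n_hsps` chance HSPs when `expect` are expected."""
    fewer = sum(math.exp(-expect) * expect ** i / math.factorial(i)
                for i in range(n_hsps))
    return 1.0 - fewer
```

Because the data base length enters the expect value as a simple factor, doubling the size of the data base doubles E for the same score, which is why the same alignment becomes less significant as the data bases grow. For small E, the single-HSP probability is approximately equal to E.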

The value of the ESTs resides in the high frequency with which a hypothetical identity can be assigned to an EST by a comparison of the deduced amino acid sequence of the EST with the data base of known proteins. For the 1500 sequences reported here, possible homology to a known protein was obtained for more than 30% of the ESTs using a cutoff score of 80. A key question facing anyone who proposes to investigate an EST that shows similarity to a known protein is what criterion of significance should be used. Empirical studies of this question have generally indicated that BLASTX, BLASTP, or TBLASTN scores of approximately 80 or higher are worth further investigation (Pearson, 1991). However, there are many different ways to generate a score of 80. As noted by Pearson (1991), one often finds that two sequences share a region of 15 to 25 amino acids with 70% identity. This raises the question as to whether such a small region of high similarity is more significant than finding a 50-amino acid region with 35% identity. Although there are no concise rules for resolving these and related questions, Pearson (1991) outlines some of the additional computational criteria that may be applied in the analysis of apparent similarity.

www.plantphysiol.orgon July 29, 2018 - Published by Downloaded from Copyright © 1994 American Society of Plant Biologists. All rights reserved.

Sources of Potential Problems

To be cost effective, high-throughput sequencing is inevitably prone to certain problems that demand attention by users of ESTs. When an EST clone is received from the stock center, a partial nucleotide sequence should be obtained from the ends of the clone to verify that the correct clone has been received. Since some of the tissues used for construction of the cDNA libraries were not sterile, some of the cDNA clones may be from contaminating organisms. This can be checked by using the clone to probe an Arabidopsis genomic Southern blot at high stringency. Although the overall accuracy of the sequence information is estimated to be >99% over the first 300 bp, sequencing errors are present and can cause problems in designing PCR primers. However, an analysis of the effects of sequence errors has indicated that such errors do not significantly reduce the probability that the EST sequence can be identified by BLAST searches (States, 1992). For some of the ESTs, sequence information is available from both ends of the clone. Therefore, when an EST report is obtained, it may be useful to search dbEST for possible sequence information for the other end of the clone by using a hypothetical clone ID based on the rules for clone names described in "Materials and Methods" (e.g. the other end of clone 38E4BT7 is 38E4BXP). Finally, there are sequences in the data bases that have been misclassified with respect to function. Thus, in the event of apparent homology between an EST and a protein of "known function," it is essential to critically evaluate the evidence supporting the assignment of function to the reference clone.
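Deriving the hypothetical clone ID for the other end of a clone can be sketched as a suffix swap. The sketch assumes, based only on the single example given above (38E4BT7 and 38E4BXP), that the two ends of a clone differ in a trailing T7 or XP; the full naming rules are those in "Materials and Methods," and the names `END_SUFFIXES` and `other_end_id` are hypothetical.

```python
# Assumption: the last two characters mark the sequenced end, as in the
# example 38E4BT7 <-> 38E4BXP. The complete rules may include other cases.
END_SUFFIXES = {"T7": "XP", "XP": "T7"}

def other_end_id(clone_id):
    """Return the hypothetical clone ID for the opposite end of the same clone."""
    for suffix, mate in END_SUFFIXES.items():
        if clone_id.endswith(suffix):
            return clone_id[:-len(suffix)] + mate
    raise ValueError(f"unrecognized end suffix in clone ID: {clone_id!r}")
```

The resulting ID can then be used as a search term in dbEST; an ID with no matching entry simply means that no sequence from the other end has been deposited.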

Because the ESTs are public information, the informal mechanisms that have been used to minimize duplication of effort by different laboratories with similar interests are no longer adequate. Since the resources of the EST data bases are freely available, it is inevitable that different laboratories may simultaneously undertake projects to exploit the same genes. To try to minimize the frequency of needless duplication, the Arabidopsis Biological Resource Center (ABRC) maintains a data base of all requests for EST clones that contains the identities of anyone who requests an EST clone so that different groups with similar interests can make contact. This information can be obtained by logging on to the ABRC on-line data base AIMS, either directly or via a GOPHER server (for information, send a help message to inquire-aims@genesys.cps.msu.edu).

Future Directions

One implication of the EST projects is that within the foreseeable future, the sequences of most or all of the genes in several plants will be available in public-access data bases. Thus, much of the effort that is currently expended at cloning genes by various criteria may be obviated by methods based on data base analysis. In view of this, it would be prudent for anyone embarking on a project to isolate a new gene to first evaluate the possibility that it could be identified by some criterion in a data base. Conversely, since we can now envision a day when all the cDNA sequences of several plants will be available in data bases, it may become increasingly worthwhile to simply pick an anonymous cDNA that is not homologous to any known sequence and to design experiments to deduce the function of the corresponding gene. Although this approach may seem radical at present, it is not fundamentally different from solving the chemical structure of a metabolite and then designing experiments to deduce the role of the metabolite. Indeed, in organisms such as yeast, where the complete sequence of the genome will soon be available, this approach has already been implemented (Oliver et al., 1992).

From the preliminary results presented here and elsewhere, it is apparent that the function of approximately 70% of the Arabidopsis genes cannot currently be deduced solely by sequence analysis. Although the proportion of unidentified genes will continually decrease because of progress in identifying the function of plant and nonplant genes by other means, additional developments will be required to provide information concerning gene function. One way to add information to large numbers of ESTs is to correlate the genetic map position of the ESTs with the map locations of mutations. More than 800 genetic loci have been marked by mutation in Arabidopsis and the list of registered locus names is growing rapidly (Dennis et al., 1993). The ESTs can be genetically mapped by using one of the sets of recombinant inbred lines that are available from the Arabidopsis Stock Centers (Reiter et al., 1992; Lister and Dean, 1993) or, in some cases, by hybridizing the ESTs to genetically mapped yeast artificial chromosomes containing Arabidopsis DNA (Last et al., 1991). Although not all of the Arabidopsis yeast artificial chromosomes have been aligned with the genetic map as yet (Hwang et al., 1991; Matallana et al., 1992), this approach avoids the necessity of identifying a polymorphism for each EST and is, therefore, suitable for large-scale mapping of all the ESTs. Indeed, the very act of hybridizing large numbers of ESTs to the yeast artificial chromosome libraries will lead to the identification of a complete set of overlapping yeast artificial chromosomes that span the genome and are anchored to the genetic map (Matallana et al., 1992).

The final component of genome technology that will be required to fully exploit the ESTs is the ability to use an EST to create a mutation that eliminates the function of the corresponding gene. The use of antisense technology is a useful step in this direction, and the development of facile new transformation techniques for Arabidopsis (Bechtold et al., 1993) has made the creation of antisense plants very simple. However, because of the limitations of antisense technology, a high priority should be placed on the development of facile methods for directed gene disruption (Somerville, 1993).

ACKNOWLEDGMENT

We thank Elliot Meyerowitz for identifying one of the sources of possible artifacts in the EST sequences.

Received July 13, 1994; accepted August 23, 1994. Copyright Clearance Center: 0032-0889/94/106/1241/15.

LITERATURE CITED

Abeles AL, Austin SL (1991) Antiparallel plasmid pairing may control P1 plasmid replication. Proc Natl Acad Sci USA 88: 9011-9015

Adams MA, Dubnick M, Kerlavage AR, Moreno R, Kelley JM, Utterback TR, Nagle JW, Fields C, Venter JC (1992) Sequence identification of 2,375 human brain genes. Nature 355: 632-634

Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, Kerlavage AR, McCombie WR, Venter JC (1991) Complementary DNA sequencing: expressed sequence tags and the human genome project. Science 252: 1651-1656

Altschul SF, Gish W, Miller W, Myers EW, Lipman D (1990) Basic local alignment search tool. J Mol Biol 215: 403-410

Appel RD, Bairoch A, Hochstrasser DF (1994) A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends Biochem Sci 19: 258-260

Bechtold N, Ellis J, Pelletier G (1993) In planta Agrobacterium-mediated gene transfer by infiltration of adult Arabidopsis thaliana plants. CR Acad Sci Paris 316: 1194-1199

Boguski MS, Lowe TMJ, Tolstoshev CM (1993) dbEST database for "expressed sequence tags." Nature Genet 4: 332-333

Dayhoff MO, Schwartz RM, Orcutt BC (1979) Survey of new data and computer methods of analysis. In MO Dayhoff, ed, Atlas of Protein Sequence and Structure, Vol 5, Suppl 3. National Biomedical Research Foundation, Washington, DC, pp 1-9

Dennis L, Dean C, Flavell R, Goodman H, Koornneef M, Meyerowitz E, Shimura Y, Somerville C (1993) The multinational coordinated Arabidopsis thaliana genome research project progress report: year three. U.S. National Science Foundation Publication NSF 93-173

Gibson S, Somerville CR (1993) Isolating plant genes. Trends Biotechnol 11: 306-313

Green P, Lipman D, Hillier L, Waterston R, States D, Claverie JM (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259: 1711-1716

Höfte H, Desprez T, Amselem J, Chiapello H, Caboche M, Moisan A, Jourjon MF, Charpenteau JL, Berthomieu P, Guerrier D, Giraudat J, Quigley F, Thomas F, Yu DY, Mache R, Raynal M, Cooke R, Grellet F, Delseny M, Parmentier Y, Marcillac G, Gigot C, Fleck J, Philipps G, Axelos M, Bardet C, Tremousaygue D, Lescure B (1993) An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNAs from Arabidopsis thaliana. Plant J 4: 1051-1061

Hunkapiller T, Kaiser RJ, Koop BF, Hood L (1991) Large-scale and automated DNA sequence determination. Science 254: 59-67

Hwang I, Kohchi T, Hauge B, Goodman H, Schmidt R, Cnops G, Dean C, Gibson S, Iba K, Lemieux B, Arondel V, Danhoff L, Somerville CR (1991) Identification and map position of YAC clones comprising one third of the Arabidopsis genome. Plant J 1: 367-374

Keith CS, Hoang DO, Barret BM, Feigelman B, Nelson MC, Thai H, Baysdorfer C (1993) Partial sequence analysis of 130 randomly selected maize cDNA clones. Plant Physiol 101: 329-332

Last RL, Bissinger PH, Mahoney DJ, Radwanski ER, Fink GR (1991) Tryptophan mutants in Arabidopsis: the consequences of duplicated tryptophan synthase genes. Plant Cell 3: 345-358

Lee H, Gal S, Newman TC, Raikhel NV (1993) The Arabidopsis endoplasmic reticulum retention receptor functions in yeast. Proc Natl Acad Sci USA 90: 11433-11437

Lister C, Dean C (1993) Recombinant inbred lines for mapping RFLP and phenotypic markers in Arabidopsis thaliana. Plant J 4: 745-750

Matallana E, Bell CJ, Dunn PJ, Lu M, Ecker JR (1992) Genetic and physical linkage of the Arabidopsis genome. In C Konz, NH Chua, J Schell, eds, Methods in Arabidopsis Research. World Scientific, Teaneck, NJ, pp 144-170

McCombie WR, Adams MD, Kelley JM, FitzGerald MG, Utterback TR, Khan M, Dubnick M, Kerlavage AR, Venter JC, Fields C (1992) Caenorhabditis elegans expressed sequence tags identify gene families and potential disease gene homologues. Nature Genet 1: 124-131

Meyerowitz E (1994) Structure and organization of the Arabidopsis thaliana nuclear genome. In E Meyerowitz, CR Somerville, eds, Arabidopsis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY (in press)

Nagy F, Kay SA, Chua NH (1988) Analysis of gene expression in transgenic plants. In SB Gelvin, RA Schilperoort, eds, Plant Molecular Biology Manual. Kluwer Academic, Boston, MA, pp B4: 1-29

Oliver SG, et al. (1992) The complete sequence of yeast chromosome III. Nature 357: 40-47

Palazzolo MJ, Hamilton BA, Ding D, Martin CH, Mead DA, Mierendorf RC, Raghavan KV, Meyerowitz EM, Lipshitz HD (1990) Phage lambda cDNA cloning vectors for subtractive hybridization, fusion-protein synthesis and Cre-loxP automatic plasmid subcloning. Gene 88: 25-36

Park YS, Kwak JM, Kim YS, Lee DS, Cho MJ, Lee HH, Nam HG (1993) Generation of expressed sequence tags of random root cDNA clones of Brassica napus by single-run partial sequencing. Plant Physiol 103: 359-370

Pearson WR (1991) Identifying distantly related protein sequences. Curr Opinion Struct Biol 1: 321-326

Reiter RS, Williams JGK, Feldman KA, Rafalski JA, Tingey SV, Scolnik PA (1992) Global and local genome mapping in Arabidopsis by using recombinant inbred lines and random amplified polymorphic DNA. Proc Natl Acad Sci USA 89: 1477-1481

Sankhavaram RP, Parimoo S, Weissman SM (1991) Construction of a uniform abundance (normalized) cDNA library. Proc Natl Acad Sci USA 88: 1943-1947

Sasaki T, Song J, Koga-Ban Y, Matsui E, Fang F, Higo H, Nagasaki H, Hori M, Miya M, Murayama-Kayano E, Takiguchi T, Takasuga A, Niki T, Ishimaru K, Ikeda H, Yamamoto Y, Mukai Y, Ohta I, Miyadera N, Havukkala I, Minobe Y (1994) Toward cataloguing all rice genes: large-scale sequencing of randomly chosen rice cDNAs from a callus cDNA library. Plant J 6: 615-624

Somerville CR (1993) New opportunities to dissect and manipulate plant processes. Proc R Soc Lond B 339: 199-206

States DJ (1992) Molecular sequence accuracy: analysing imperfect data. Trends Genet 8: 52-55

Uchimiya H, Kidou S, Shimazaki T, Takamatsu S, Hashimoto H, Nishi R, Aotsuka S, Matsubayashi Y, Kidou N, Umeda M, Kato A (1992) Random sequencing of cDNA libraries reveals a variety of expressed genes in cultured cells of rice (Oryza sativa L.). Plant J 2: 1005-1009

Waterston R, Martin C, Craxton M, Coulson A, Hillier L, Durbin R, Green P, Shownkeen R, Halloran N, Metzstein M, Hawkins T, Wilson R, Berks M, Du Z, Thomas K, Thierry-Mieg J, Sulston J (1992) A survey of expressed genes in Caenorhabditis elegans. Nature Genet 1: 114-123
