
Integrating gene and protein expression data: pattern analysis and profile mining


Methods 35 (2005) 303–314

www.elsevier.com/locate/ymeth


Brian Cox a,b,*, Thomas Kislinger a,c, Andrew Emili a,c,*

a Department of Medical and Molecular Genetics, University of Toronto, Toronto, Ont., Canada

b Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ont., Canada M5G 1X5

c Program in Proteomics and Bioinformatics, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ont., Canada M5G 1L6

Accepted 25 August 2004

Available online 12 January 2005

Abstract

Proteomics and functional genomics are emerging research fields devoted to the study of the entire collection of proteins and mRNA transcripts (collectively known as gene products) that define a biological system. DNA microarrays are now a popular platform for measuring changes in messenger RNA transcript levels on a genome-wide scale, while gel-free shotgun profiling methods based on tandem mass spectrometry are increasingly being used to determine the identity, modification states, and relative abundance of large numbers of proteins. By defining the behavior of entire biological pathways and networks under various physiological states, these studies aim to extend traditional reductionist molecular genetic approaches regarding the biological roles of the vast array of uncharacterized gene products. A key goal is to determine how the information encoded by the myriad of expressed gene products is integrated at the molecular, cellular, and even whole organism level to create the dynamic biochemical processes and complex physiological controls that sustain life. While comparison of the complementary information contained in proteomic and mRNA data sets poses considerable analytical challenges, these efforts should provide added insight into the fundamental mechanisms underlying physiology, development, and the emergence of disease. Here, we outline several analytical approaches, methods, and tools that have proven to be helpful in the face of this important challenge. © 2004 Elsevier Inc. All rights reserved.

Keywords: Expression profiling; Informatics; Microarrays; Protein mass spectrometry; Shotgun sequencing; Proteomics; Data analysis; Clustering; Data mining; Pattern recognition

1. Introduction

DNA microarrays are commonly used to examine global changes in messenger RNA abundance across different biological settings [1]. Likewise, advances in mass spectrometry-based proteomics technology now make it possible to characterize large numbers of proteins in complex biological samples [2]. Gel-free profiling procedures coupling high-performance liquid chromatographic fractionation of protein tryptic digests with automated tandem mass spectrometry (LC-MS) represent a particularly powerful technology for elucidating the identities, abundance, and post-translational states of hundreds to thousands of proteins. This technology can be applied to study proteins present at specific time-points within the life-cycle of a cell or organism [3,45,46]. Because cells generally respond to diverse physiological cues, developmental signals, and environmental perturbations, changes in mRNA and protein levels can serve as a particularly informative readout of phenotypic state [4].

* Corresponding author. Fax: +1 416 946 7281. E-mail addresses: [email protected] (B. Cox), andrew.emili@utoronto.ca (A. Emili).

1046-2023/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ymeth.2004.08.021


There is a growing literature outlining methods for integrating and comparing functional proteomics data sets, such as the composition of protein complexes, networks of protein–protein interactions [5,6] or microarray-derived gene expression patterns [7]. However, a more fundamental and pressing question is the correspondence of transcriptional responses to cellular protein abundance. That is, to what extent does the pattern of gene expression, which reflects RNA transcription and degradation rates, correlate with the corresponding protein levels, which are also influenced by translational and post-translational mechanisms? The limited number of comparative studies carried out to date indicates that the correlation across large data sets is typically modest, presumably due to substantial variations in post-translational processing [8–11]. For instance, a recent genome-scale epitope-tagging study of protein abundance in yeast by Weissman and colleagues [12] indicates that many essential proteins and transcription factors are present at levels that are not readily predicted by mRNA levels. These studies are likely to become far more common, and informative, as the number of related proteomic and genomic data sets steadily grows.

Of course, the value of comparing proteomics and mRNA data sets can go far beyond simple correlation analysis of gene product quantities (i.e., relative levels of protein and mRNA detected for the same gene). For instance, pioneering studies by Gerstein and colleagues [13] have revealed considerable similarity between the transcriptome and the proteome in terms of enrichment for specific structural and functional properties. We believe that this form of comparative analysis will increasingly be used to bridge the burgeoning gap between the proteomics and functional genomics research communities by creating a common, interactive knowledge-base. Such comparisons should also allow for a better determination of the suitability of using gene transcript levels as a surrogate for protein activity, as well as provide insight into molecular pathways that determine and link gene and protein expression patterns. Lastly, we expect such comparative studies to improve our understanding of the biochemical mechanisms that control a range of cellular responses.

Here, we provide an overview of common proteomic and microarray expression profiling procedures, and outline basic methods and freely available tools that can be used to map and compare mRNA transcript and protein levels, with an emphasis on deriving broad biological inferences. We emphasize critical steps and analytical issues that need to be considered to meaningfully compare the results obtained from high-throughput microarray studies with those from shotgun mass spectrometry-based proteomic analyses, illustrating each of these steps with examples of real experimental data.

2. Description of the method

2.1. Generation of proteomic and genomic data sets

Proteomics is defined as the large-scale examination of protein expression, localization, modification, structure, function, and activity. 2D-gel electrophoresis has historically served as a preferred method for high-resolution separation of protein mixtures prior to MS analysis. However, alternative gel-free LC-MS procedures greatly improve proteome coverage, leading to the detection of many of the low-abundance and membrane proteins typically missed by 2D-PAGE. Greatly improved detection limits have been achieved using capillary-scale multi-dimensional chromatography [14]. Combined with sub-cellular fractionation, modern LC-MS-based profiling methods can be used for the unbiased detection and identification of literally thousands of proteins in a single overnight analysis [15], well within the range extractable from small amounts of mouse [15] or human tissue [16]. Moreover, relative protein abundance between samples can often be accurately determined using in vitro and in vivo protein labeling methods conceptually analogous to those used in microarray studies [17].

Microarray-based profiling experiments are typically designed to detect changes in transcript levels under different experimental conditions, such as various time-points during development, following treatment with a drug or as a result of gene mutation [1]. There are multiple microarray platforms, each of which is optimized for measuring changes in transcript ratios rather than absolute abundance. The original microarray platform involved spotting large numbers of cDNAs onto membranes or glass slides [18,19]. These probes range from ~250 bp to ~2 kb, and are usually generated by PCR from an arrayed plasmid library [18]. Like all high-throughput methods, this approach is subject to spurious experimental artifacts and systematic bias [19–21]. One obvious problem stems from the variable GC content and sequence length of the probes, which can lead to different hybridization efficiencies. A second arraying method, which partly overcomes this limitation, is to use long synthetic oligos (~60 bases) unique to each transcript but with similar ΔG values of annealing [47]. A range of oligo-based microarrays are available commercially. The third is the short oligo array sold by Affymetrix (www.Affymetrix.com), which is discussed below [22].

To allow for more robust sample measurements, multiple probes are repeatedly spotted for each gene. Moreover, experiments are typically run in triplicate to validate the statistical significance of outlier values [19–21]. Hybridizations are usually carried out using cDNA pools generated by reverse transcription (RT) of total RNA or purified polyA mRNA using an oligo primer directed to the polyA sequence. During the RT reaction, cDNA is labeled either directly with fluorescent nucleotide analogs or with analogs bearing reactive side groups for subsequent labeling [19]. Reference RNA is often then co-hybridized along with the experimental sample to normalize array intensities across different chips [19]. However, difficulties in generating consistent reference RNA and improved imaging/scanning technologies have reduced this practice. Indeed, experimental samples are hybridized alone using the popular Affymetrix gene chip platform [22,23], which uses an array of 11–20 perfect match probes consisting of 25-mer nucleotide sequences targeting unique regions on each transcript. A parallel set of mismatch probes with a single base substitution in the middle of the probe serves as a background control. Biotinylated cRNA is fragmented and annealed to the slide, and hybridization is detected with a fluorescently labeled antibody specific to biotin. The scanned probe set intensities are then subjected to statistical detection analysis using proprietary algorithms, which assign a binary absent/present call to each measured gene along with an estimate of background noise, allowing estimation of the significance of differences in gene expression ratios across samples.

2.2. Considerations for protein and RNA samples

Detection of meaningful differences in recorded protein and gene expression patterns requires the use of computational tools to allow for statistically sound analysis and mining of the data. Since integration of proteomic and genomic data sets relies on the careful comparison of large heterogeneous data sets, the different technical limitations associated with each profiling platform must be considered. For instance, microarrays can only detect those transcripts having a representative probe on the chip, a limitation rapidly being overcome with advancing technology, improved gene prediction algorithms, and the completion of genome sequencing projects. Cross-hybridization and spurious signal are also a frequent, if under-appreciated, concern.

MS identification of proteins is limited by the incompleteness and redundancy of protein sequence databases used for searching MS spectra. The choice of database, and even the search algorithm, can be critical determinants of protein identification success rates [24–26]. Even high-throughput protein identification by methods such as capillary-scale multi-dimensional chromatography [14] faces limitations due to imperfect chromatographic separation and under-sampling by the mass spectrometer system being used. The complexity of mammalian tissue represents a considerable experimental challenge, and pre-fractionation methods are generally required to increase proteome coverage. One cannot overestimate the importance of proper sample selection for generating meaningful data. For instance, profiling whole brain may obscure changes in gene expression in the hypothalamus during treatment with a drug. Most tissues and organs are heterogeneous and made up of many cell types. While cell sorting, tissue culture, and sub-cellular fractionation can be used to simplify the mixtures [15,27,28], sample preparation can still be problematic for genes/proteins involved in specific settings, such as the critical transitions of the cell cycle. The last challenge is in extracting quantitative information for low intensity peptides as a reliable signature, since high-abundance proteins, such as housekeeping enzymes, are preferentially detected by LC-MS.

Another critical consideration is the adoption of suitable informatics criteria to evaluate the significance of putative protein matches. To this end, confidence filters based on probability distributions and statistical algorithms should be used to determine the likelihood of putative protein identifications. Such filters are essential for eliminating false-positive matches and also provide for standardization in the reporting and comparison of different data sets. To obtain an accurate profile, quantitative data describing relative protein abundance under the various settings must also be obtained. One option is to use differential labeling of protein samples in a manner analogous to the use of two-label systems in many microarray studies. Several innovative chemical- or isotope-based labeling strategies have been shown to improve the reliability of quantitative inferences made by LC-MS [17,29]. However, the impact of these specialized methods has been restricted to date due to the significant expertise and cost associated with these analyses. We believe that peptide or spectral count offers a far simpler semi-quantitative first pass measure for tracking changes in protein abundance for the purpose of global data set comparisons and biomarker discovery. Protein levels can be readily estimated to a good first approximation based on the peptide count or cumulative sum of recorded peptide spectra that can be reliably matched to a given protein [48]. Experimental repetition is often needed for proper determination of the spectral count, however, to deal with statistical issues that arise because MS sampling inefficiencies lead to spurious variations in large multivariate proteomic data sets.

2.3. Linking heterogeneous databases

Regardless of which platform one chooses to use, the first task is to match the genes represented on the microarray with the corresponding proteins identified in a proteomics experiment. By luck or by design, the lists of sequence identifiers may be from the same database, but more likely one has to perform some cross-referencing or indexing across platforms. Most commercial sources of microarrays provide downloadable support tables of gene accession numbers and common gene IDs relating to one or more public annotation databases. At a minimum, suppliers must provide a list of sequences spotted on the array. Detailed cross-referenced annotations are available on the web for registered Affymetrix array users (www.affymetrix.com). The NIA also maintains extensive annotation tables for a suite of oligo and cDNA microarrays (http://lgsun.grc.nia.nih.gov/cDNA/cDNA.html). Typical identifiers include accession references to databases such as SwissProt/Trembl (SPTR), NCBI, ENSEMBL, and Unigene (described below). However, there are several pitfalls to consider when using these identifiers. For instance, while SwissProt (SP) is a popular choice for spectral database searches since it is a highly curated, stable protein sequence database, it generally does not house the complete set of proteins for many organisms. Its companion database TrEMBL (TR), a computer-annotated supplement containing all the translations of EMBL nucleotide sequence entries not yet integrated in SP, offers more extensive coverage. However, TR IDs are unannotated, frequently redundant, and are continuously retired and replaced with SP-based accessions (IDs) as proteins migrate to SP.

NCBI-maintained gene and protein databases suffer from these same problems, although the creation of the RefSeq NP (protein) and NM (mRNA) accession system [30] is an attempt to standardize and reduce redundancy. Moreover, the Uniprot database [31] is trying to create a single identifier for all proteins. This may only create further problems, as each research group picks its favorite unique identifier and we are still left with the task of linking disparate data sets. Other groupings such as GeneLynx and gene card databases [32–34] are trying to consolidate all of these different identifiers into a single resource, similar to what has been done with Drosophila (www.fruitfly.org/annot/) and Caenorhabditis elegans (www.wormbase.org/). A key source of identifiers is the Unigene database (www.ncbi.nlm.nih.gov), which is generated from species-specific clustered nucleotide sequences that overlap with high-percent sequence identity. However, as new members are added to the collection, Unigene clusters are recalculated and new clusters are generated, with some members of a cluster being moved into a new cluster and occasionally a cluster ID being completely removed, which creates problems with legacy data sets. A caveat, then, to linking data sets using Unigene IDs is to ensure that the build dates are the same. Ensembl (www.ensembl.org) has one advantage in that IDs are only assigned to genes/proteins that can be associated with the assembled genome, thus providing a stable non-redundant set of identifiers. However, not all genome assemblies are finished (e.g. mouse) and gene annotation is not always complete.

If you need to retrieve tables of sequences or annotations, each database maintains its own service. Sequence and annotation information can readily be retrieved from the Ensembl database using the Ensmart system (www.ensembl.org/Multi/martview). Data and annotation from NCBI can be obtained by batch Entrez (www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?). SwissProt and Trembl data can be accessed from the Expasy web site using list retrieval (http://ca.expasy.org/sprot/sprot-retrieve-list.html) or SRS (http://ca.expasy.org/srs5/). For those working with mouse data, a particularly useful site is www.informatics.jax.org, where a myriad of data is maintained, including a phenotype browser for mutations and diseases. To facilitate routine gene-to-protein mappings, our group has developed a cross-referencing database lookup tool, Protein2Gene Lookup (http://emili11.med.utoronto.ca/~dbgroup/lookup.php), that provides a user-friendly web-accessible interface. Researchers simply enter accessions of interest to obtain the matching IDs/accessions from several major genomic and proteomic reference databases.

In some cases, a BLAST sequence alignment may be the best way to link two data sets. Using bulk sequence retrieval tools, a user can pull sequences using ID tags or entire species-specific sets. Searches can then be done using the stand-alone BLAST tool (NCBI downloads) or a utility like BioEdit (www.mbio.ncsu.edu/BioEdit/bioedit.html), using one sequence set as reference. Output should be generated in text file format so that it can be readily parsed into columns for query ID, subject ID, and scores (e value, percent identical, gaps, etc.). Caution should be used in interpreting the output, though: a Blast e value of 0 does not mean that two sequences are exactly identical. Rather, percent identity and sequence coverage should also be checked to ensure acceptable alignments, and incorporated into the selection of an acceptable threshold cut-off. Validated alignments can then be extracted as a two-column ID table, a relationship that can be used to bridge data sets. A further test of homology is to use reciprocal Blast, in which the Blast search is done twice, swapping the query and subject databases; this is done to avoid linking paralogous proteins/genes.
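The filtering and reciprocal Blast steps described above can be sketched as follows. This is only a minimal illustration, not the procedure used by the authors: it assumes the searches were saved in the standard 12-column tab-delimited BLAST layout, that sequence lengths are known, and that 90% identity and 80% query coverage are acceptable thresholds (both values are arbitrary). The function names and the toy IDs are hypothetical.

```python
# Assumed column order (tab-delimited BLAST output):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def best_hits(lines, query_lengths, min_identity=90.0, min_coverage=0.8):
    """Map each query ID to its top-bitscore subject among hits passing both filters."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        coverage = (abs(q_end - q_start) + 1) / query_lengths[query]
        if identity < min_identity or coverage < min_coverage:
            continue  # a low e value alone is not accepted as a match
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Keep only (query, subject) pairs that are each other's best hit."""
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


def row(*fields):
    """Build one tab-delimited line for the toy example."""
    return "\t".join(str(f) for f in fields)


# Toy data: hypothetical protein IDs (P*) and gene IDs (G*).
lengths = {"P1": 100, "P2": 200, "G1": 100, "G2": 100}
protein_vs_gene = [
    row("P1", "G1", 99.0, 100, 1, 0, 1, 100, 1, 100, 0.0, 200),
    row("P1", "G2", 95.0, 100, 5, 0, 1, 100, 1, 100, 1e-50, 180),
    row("P2", "G2", 98.0, 40, 1, 0, 1, 40, 1, 40, 1e-20, 90),  # only 20% coverage
]
gene_vs_protein = [
    row("G1", "P1", 99.0, 100, 1, 0, 1, 100, 1, 100, 0.0, 200),
    row("G2", "P1", 95.0, 100, 5, 0, 1, 100, 1, 100, 1e-50, 180),
]

forward = best_hits(protein_vs_gene, lengths)
reverse = best_hits(gene_vs_protein, lengths)
id_table = reciprocal_best_hits(forward, reverse)  # two-column ID table
print(id_table)
```

In this toy case G2 finds P1 as its best hit, but P1 prefers G1, so only the P1/G1 pair survives; that asymmetry is what the reciprocal test is meant to catch.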

2.4. Data analysis—methods of comparison

Once the microarray and proteomic data sets have been cross-matched, more informative comparisons can be made. Confirmation of changes in expression at both the gene and protein levels can help focus target selection. Of course, there will generally be obvious gaps in the data sets in instances where there are no gene expression data available for a protein or vice versa. Indeed, one needs to consider the reasons for missing values and have some rationale to assess the overall quality of the non-overlapping data. One can choose to focus only on the overlapping gene product data, excluding all data points that do not cross-map. Conversely, one can fully merge the data sets, placing blank values for missed matches. Since microarray studies are typically of far greater scope at present than most protein profiling efforts, this last choice may not be helpful due to the large number of blanks, which impede data processing. By using a focused approach, gene products of interest can be assessed by comparison of the overlapping data within the cluster.

Lastly, since microarrays often contain multiple probes for a given gene, a method of handling this redundancy is required. One can simply replicate the protein information for each redundantly matching probe or one can take an average of redundant probes prior to data set comparison.
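As a rough sketch of the two choices just described (averaging redundant probes, then either keeping only the overlap or fully merging with blanks), the following uses plain dictionaries keyed by a shared identifier. The IDs, values, and function names are made up for illustration, and blanks are represented by `None`.

```python
from statistics import mean


def average_probes(probe_rows):
    """Collapse (gene_id, intensity) rows to one mean intensity per gene."""
    grouped = {}
    for gene_id, intensity in probe_rows:
        grouped.setdefault(gene_id, []).append(intensity)
    return {gene_id: mean(values) for gene_id, values in grouped.items()}


def merge(mrna, protein, keep_unmatched=False):
    """Pair mRNA and protein values by ID.

    keep_unmatched=False keeps only IDs present in both data sets;
    keep_unmatched=True keeps every ID and fills missing values with None.
    """
    ids = set(mrna) | set(protein) if keep_unmatched else set(mrna) & set(protein)
    return {i: (mrna.get(i), protein.get(i)) for i in sorted(ids)}


# Toy data: gene "A" is represented by two probes on the array.
probes = [("A", 100.0), ("A", 200.0), ("B", 50.0), ("C", 10.0)]
spectral_counts = {"A": 12, "B": 3, "D": 7}

mrna = average_probes(probes)
overlap = merge(mrna, spectral_counts)
full = merge(mrna, spectral_counts, keep_unmatched=True)
print(overlap)
print(full)
```

With a real array, the fully merged table would be dominated by `None` entries on the protein side, which is the practical argument for working with the overlap.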

2.4.1. Correlating protein and microarray data

Most of the correlation studies comparing protein and gene expression in the literature have dealt with the relative abundance of mRNA and protein. In a pioneering study, Gygi et al. [9] tested gene–protein correspondence in yeast by using [35S]methionine labeling for protein quantitation and SAGE analysis for mRNA transcript quantitation, expressing both as total copies per cell. They collected complementary data for 156 genes and could show a modest positive correlation of mRNA and protein abundance. More recently, Mootha et al. [8] correlated the expression patterns of mitochondrial proteins in mammalian tissue with public microarray data, using a simpler present/absent test for concordance. A positive score was assigned when a similar preferential tissue pattern was detected for both the corresponding mRNA and protein. By this scheme, 426 of 569 detected gene products were found to be concordant. One criticism raised is an obvious bias in the data, wherein the average mRNA abundance of the detectable proteins was found to be nearly 5-fold higher than for all annotated mitochondrial genes, suggesting only high-abundance gene products strongly correlate. Further, the scoring scheme does not offer a reliable assessment of relative abundance as it only considered extremes in the data.

Griffin et al. [10] asked the more complex question of whether changes in expression correlate at the protein and transcript levels between two yeast populations grown in different carbon sources. They determined the ratio of protein expression using ICAT labeling and the ratio of mRNA using spotted cDNA microarrays. Complementary protein and mRNA abundance data were obtained for 245 genes. Many gene products linked to carbon metabolism showed expected changes in abundance, but did not change by similar magnitudes at the individual protein and mRNA levels. These observations suggest that similar gene expression might not translate into similar protein levels and that post-translational control mechanisms are a normal part of a physiological response.

2.4.2. Protein relative abundance by spectral counts

Although comparing protein lists can be informative, some estimate of relative quantity across samples is generally required for meaningful comparisons with microarray data. Protein levels can be readily estimated to a good first approximation based on the peptide count or cumulative sum of recorded peptide spectra that can be reliably matched to a given protein [48]. The redundant peptide count allows the estimation of the relative abundance of a protein across a series of experiments, provided the isolation techniques were similar, for example with respect to buffers, homogenization method, and fractionation. Spectral counts typically show a slight but statistically significant (p < 0.05) bias for molecular weight; in our hands, mouse proteins with low spectral counts have a mean molecular weight of 56 kDa (median 45 kDa) and those with 100 or more spectral counts a mean of 79 kDa (median 53 kDa). Spectral count values do not allow for intercomparison of expression levels of different proteins. Once a table of proteomic data (with spectral counts) is assembled, it can be treated much like a microarray spreadsheet with respect to data normalization and calculation of expression ratios (discussed below).
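To make the idea of treating a spectral count table like a microarray spreadsheet concrete, here is one possible way to turn two samples' counts into expression ratios. The normalization to each sample's total count and the pseudocount of 1 (added so that undetected proteins do not cause division by zero) are assumptions of this sketch, not steps prescribed in the text; the protein IDs and counts are invented.

```python
def count_ratios(counts_a, counts_b, pseudocount=1.0):
    """Return per-protein ratios (sample A / sample B) of normalized spectral counts.

    Each count is offset by a pseudocount and divided by the sample total,
    so samples with different overall numbers of spectra are comparable.
    """
    proteins = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values()) + pseudocount * len(proteins)
    total_b = sum(counts_b.values()) + pseudocount * len(proteins)
    ratios = {}
    for protein in proteins:
        fraction_a = (counts_a.get(protein, 0) + pseudocount) / total_a
        fraction_b = (counts_b.get(protein, 0) + pseudocount) / total_b
        ratios[protein] = fraction_a / fraction_b
    return ratios


# Toy spectral counts for two tissues.
lung = {"P1": 9, "P2": 4}
liver = {"P1": 4, "P2": 9}

lung_vs_liver = count_ratios(lung, liver)
for protein, ratio in lung_vs_liver.items():
    print(protein, round(ratio, 2))
```

Because the ratio is computed per protein across samples, it respects the caveat above that spectral counts should not be used to compare the levels of different proteins with one another.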

2.4.3. Example of concordance of mRNA and protein relative abundance

The rapid progress in high-throughput LC-MS-based profiling [3] suggests we are now reaching a point where we may begin to systematically address post-transcriptional regulatory mechanisms. As an illustration of this, we compared a recently published extensive data set of the proteomic patterns of adult mouse lung and liver [15] to a published Affymetrix MGU74 gene chip data set for lung and liver [35]. We linked the two data sets using SPTR IDs: the proteins had been directly identified from a database of SPTR sequences and the SPTR annotation for the MGU74 probe set was downloaded from Affymetrix. Approximately 1200 non-redundant protein-microarray pairs were found, of which 623 were used because they had a significant detection p value as calculated by Affymetrix MAS5.0. The detection p value provides a statistical evaluation of the observed intensity as a measure of the true binding of the expected transcript versus background noise or spurious cross-hybridization. A similar concordance-scoring scheme to that of Mootha et al. [8] was adopted, with the exception that the total spectral count was substituted for the binary absent/present call to provide a better measure of relative abundance. This results in a positive correlation if both the microarray and the corresponding protein show a similar ratio of expression in the lung and liver (e.g., both >1).

Considering all 623 pairs of microarray and protein data, a concordance of ~60% is observed (372/623). However, taking into account the fold-difference between the two tissues (for either protein or mRNA), the concordance improves with an increasing fold-difference (the ratio of spectral count, or microarray intensity). A maximal concordance of 68% (257/376) occurs at a protein cut-off of 7-fold, while for mRNA a maximum concordance of 69% (145/210) is observed at a cut-off of 3-fold. Note that the total number of protein–mRNA pairs considered for the concordance is reduced as we are only considering the pairs for which either the protein or the microarray ratios are above the fold cut-off. This discrepancy suggests experimental noise at the lower ratios. From our example, a positive concordance for a protein ratio of 2-fold would not be considered as strongly as a 7-fold ratio with a positive concordance. It may also be taken from this example that the signal from the microarray is less noisy than the protein spectral count, as a maximum concordance is observed at a lower ratio. Of course, many of these outliers are of special interest as they may reflect divergent, yet physiologically relevant, regulation at the transcriptional and post-transcriptional levels. Samples for which there is no concordance (the ratio of microarray intensity to protein intensity does not agree) may be taken as evidence of normal stochastic fluctuation in protein and mRNA levels, with higher ratios better reflecting meaningful biological variation.
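The concordance scoring with a fold cut-off can be written down in a few lines. The sketch below is our reading of the scheme, under stated assumptions: a pair is called concordant when the mRNA and protein ratios fall on the same side of 1, a ratio of exactly 1 is treated as "not above 1", and the cut-off is applied to the fold-difference of one chosen data type at a time. The function names and the five example pairs are hypothetical and are not the data reported above.

```python
def fold(ratio):
    """Fold-difference regardless of direction (a ratio of 0.25 is 4-fold)."""
    return max(ratio, 1.0 / ratio)


def concordance(pairs, cutoff=1.0, by="protein"):
    """Count concordant pairs among those passing a fold cut-off.

    pairs: list of (mrna_ratio, protein_ratio), all ratios positive.
    by: which ratio the cut-off is applied to, "mrna" or "protein".
    Returns (number concordant, number of pairs considered).
    """
    index = 0 if by == "mrna" else 1
    kept = [pair for pair in pairs if fold(pair[index]) >= cutoff]
    agree = sum(1 for mrna, protein in kept if (mrna > 1) == (protein > 1))
    return agree, len(kept)


# Toy lung/liver ratios as (mRNA, protein).
pairs = [(4.0, 8.0), (0.2, 0.1), (1.5, 0.8), (0.9, 1.2), (3.0, 0.5)]

all_pairs = concordance(pairs)
protein_7fold = concordance(pairs, cutoff=7.0, by="protein")
mrna_3fold = concordance(pairs, cutoff=3.0, by="mrna")
print(all_pairs, protein_7fold, mrna_3fold)
```

As in the real data, raising the cut-off shrinks the number of pairs considered while the fraction that agree tends to rise, because small ratios are the ones most easily flipped by noise.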

2.4.4. Clustering data

The increasing size and complexity of proteomic and microarray data sets provide opportunities and challenges for researchers to extract biologically relevant information. Two key goals, finding meaningful patterns in the data sets and classifying samples, can both be accomplished by applying different data-mining, pattern recognition, clustering, classification, and other association techniques to the data sets. Clustering is a common approach for sorting related sets of proteomic and microarray data and samples (tissues or experiments), and is generally the preferred and simplest routine for visually assessing intrinsic patterns within the data sets [21]. Most data sets, even from samples that cannot be classified in any obvious way, may contain hidden information about regulatory patterns (such as co-expression) that can be revealed by cluster analysis. The discovery of hidden patterns in expression profiles, referred to as profile mining, is possible if a large number of signature profiles derived for a set of unclassified or unclassifiable samples is available, or if there is a human expert who can provide correct information to guide the discovery of the hidden patterns. It should be noted that clustering only two experimental data sets is not needed for comparison as simple sorting by the ratio in a spreadsheet application is generally sufficient. Clustering is best used for tackling larger data sets where the criteria for sorting expression profiles across multiple experiments are not obvious.

Many commercial informatics packages are available for data clustering, but powerful public software packages that will meet most users' needs are freely available for download. A popular tool is Cluster 3.0 developed by de Hoon and colleagues (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/index.html), a multi-platform program based on the original clustering program developed by the Eisen group [36] (http://rana.lbl.gov/EisenSoftware.htm) but improved to handle more genes and experiments. This package enables users to perform most tasks, such as data filtering, adjusting (normalizing), and clustering using common algorithms and profile similarity (a.k.a. distance) metrics. A second package, called TreeView (http://jtreeview.sourceforge.net), developed and adapted by Alok Saldanha from the original package developed by the Eisen group, allows users to graphically display the output of the clustering program, with easy selection and extraction of interesting clusters.

2.4.5. Calculating a relative ratio of protein expression

Some estimate of relative protein quantity across samples is generally required for meaningful comparisons with microarray data and for clustering to be adequately performed. As discussed above, the spectral count can be used as a measure of relative abundance of a protein across experiments. Again, once a table of proteomic data expressed as spectral counts is assembled, it can be treated much like a microarray spreadsheet with respect to data normalization and calculation of expression ratios.

For normalization of experimental data sets, global average (or median) scaling based on the average number of peptides detected per protein per experiment can be applied. The spectral count in each experiment can then be scaled by a constant such that the global average (or median) peptide or spectral count is the same across all experiments. Typically the lowest and highest fifth percentile data are trimmed first to prevent skewing of the average. (Of course, these outliers may have biological relevance and must be reconsidered in later analyses.)
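The global scaling described above might be sketched as follows in Python. This is a minimal illustration rather than the procedure actually used: the function names and the example counts are hypothetical, and scaling every experiment to the mean of the per-experiment trimmed averages is an assumption, since any common target value would serve. Trimming affects only the scale factor, so the outlying values are kept in the scaled output for later analysis.

```python
def trimmed_mean(values, trim_percent=5):
    """Mean after dropping the lowest and highest trim_percent of values."""
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100  # values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_to_common_mean(counts_by_experiment, trim_percent=5):
    """Scale each experiment so all share the same trimmed mean count.

    counts_by_experiment: dict mapping an experiment name to the list of
        spectral counts for the proteins detected in that experiment.
    """
    means = {name: trimmed_mean(counts, trim_percent)
             for name, counts in counts_by_experiment.items()}
    target = sum(means.values()) / len(means)  # assumed common target
    return {name: [c * target / means[name] for c in counts]
            for name, counts in counts_by_experiment.items()}


# Made-up counts: experiment "b" was sampled twice as deeply as "a".
raw = {"a": list(range(1, 21)), "b": [2 * c for c in range(1, 21)]}
scaled = scale_to_common_mean(raw)
```

After scaling, the two made-up experiments have identical counts, because they differed only by a constant factor.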

There are two ways to generate a suitable expression ratio. In the first approach, a common baseline reference sample data set is used as the denominator, while the experimental sample values are used as the numerator. For example, untreated cell cultures could be the baseline, and cell cultures transformed with increasing amounts of an expression vector the experimental test samples. By this method, the changes in gene expression detected in the experimental samples are all reported as the ratio relative to that observed in the untreated control cells. Another method is to generate a pseudo experiment. In our case, the pseudo experiment data set would contain a list of all proteins detected in each of the samples as well as the corresponding average calculated spectral count value detected for each protein across all experiments. This method can be particularly beneficial in time-course experiments as such data reflect changes in expression over time, generally resulting in a more even data point distribution. Expression ratios are often transformed with either log base-2 or -10 to distribute the data evenly around 0, which signifies no change in abundance. Down- and up-regulated expression can be easily visualized using a heat map graphical display format.
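The pseudo-experiment ratio described above can be written compactly. The Python sketch below is illustrative only: the function name, the protein labels, and the counts are hypothetical. It assumes that an undetected protein contributes a count of 0 to the average and is reported as a missing value (None), since the log of zero is undefined.

```python
import math


def log2_ratios_vs_pseudo(counts):
    """log2(count / average count) for each protein in each experiment.

    counts: dict mapping a protein ID to a list of spectral counts, one
        per experiment, with 0 meaning 'not detected'.
    Undetected entries are returned as None (missing).
    """
    ratios = {}
    for protein, row in counts.items():
        reference = sum(row) / len(row)  # the pseudo-experiment value
        ratios[protein] = [
            math.log2(c / reference) if c > 0 else None for c in row
        ]
    return ratios


# Made-up spectral counts for early, mid, and late samples.
example = {"P1": [2, 4, 6], "P2": [0, 0, 9]}
ratios = log2_ratios_vs_pseudo(example)
```

For the made-up protein P1 the average count is 4, so the early value is log2(2/4) = -1 and the mid value is log2(4/4) = 0, which corresponds to no change in abundance.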

2.4.6. Considerations for clustering protein data

As an illustrative example, we will evaluate parallel mRNA and protein expression profiling data sets of developing mouse lungs recorded over three developmental time-points (B.C., T.K., and A.E., manuscript in preparation) consisting of early (a composite of embryonic day 14 (E14) and day 16 (E16)), mid (a composite of day 18 (E18) and postnatal day 2 (P2)), and late (a composite of postnatal day 14 (P14) and Adult) stages. The lung development gene expression profiles were based on a published, publicly accessible microarray data set reported by Mariani et al. [41], which was generated using Affymetrix Mu11K A and B chipsets, while the proteomic data sets were generated in-house using the LC-MS-based PRISM profiling methodology [15]. Briefly, lung tissue was separated into nuclear, cytosolic, and mitochondrial protein fractions using differential centrifugation. These fractions were digested with trypsin and the peptide mixtures were separated by two-dimensional chromatography. The eluting peptides were electrosprayed directly (in real time) into an ion trap tandem mass spectrometer. The spectra were searched against a database of mouse protein sequences obtained from the European Bioinformatics Institute (EBI; SP/TR) using the SEQUEST database search software [37]. High-confidence (error p value <0.05) matches were statistically validated using the STATQUEST probability filtering algorithm [15]. The data were then assembled into a table of log2 ratios of spectral counts against a pseudo experiment as described above.

Clustering proteomic data can cause certain problems as compared to microarray analyses due to differences in the types of biological information represented and the disparate methods of data collection. Microarrays typically generate a detectable signal for nearly all gene spots, even when a transcript is absent (i.e., background). In a protein profiling experiment, any protein absent from a sample (or sufficiently low in relative abundance) will go undetected, and hence its corresponding level will necessarily be reported with a blank or missing value in the data matrix. Null data points are generally ignored when calculating distances for generating clusters. This is typically not problematic if only a few points are missing in a data set. Indeed, there are suitable methods available to calculate 'missing' data ranging from simply filling in an average value to imputing the data, although generally these methods are only valid if less than 10% of the data is missing [38–40]. However, it is common for proteomic samples to be quite distinct, for example when comparing different sub-cellular fractions or different tissues [15], so that a much larger share of the data matrix can be blank.

To circumvent this problem, we typically substitute all blank (missing or null) values with a nominal, low log2 ratio value, based on the lowest observed log2 ratio value. Fig. 1 presents an example of cluster analysis of the proteomic patterns detected in cytosolic and nuclear fractions from three different time points of the lung protein data set. It can be seen that filling in a low value prior to clustering the data (Fig. 1A) and then removing these values for visualization (Fig. 1B, gray represents null data points) generates a superior (more consistent) cluster than obtained by leaving the values blank (Fig. 1C, detail Fig. 1D). Note how proteins detected exclusively in a single tissue fraction properly cluster together (Fig. 1B), whereas these same proteins become distributed and the clusters broken up when the data are clustered using blanks (Figs. 1C and D).
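A minimal Python sketch of this substitution is given below. The function name and the example matrix are hypothetical, and using the lowest observed log2 ratio itself as the fill value is an assumption; the text only states that the nominal value is based on it. The returned mask records which cells were filled, so that they can be shown as missing again when the clustered data are displayed.

```python
def fill_missing_with_floor(matrix):
    """Replace missing values (None) with the lowest observed log2 ratio.

    matrix: list of rows, each a list of log2 ratios or None.
    Returns (filled_matrix, missing_mask); the mask marks the substituted
    cells so they can be displayed as 'no data' after clustering.
    """
    observed = [v for row in matrix for v in row if v is not None]
    floor = min(observed)  # assumption: use the observed minimum itself
    filled = [[floor if v is None else v for v in row] for row in matrix]
    mask = [[v is None for v in row] for row in matrix]
    return filled, mask


# Made-up log2 ratios with two undetected entries.
ratios = [[1.0, None, -2.5], [None, 0.3, 0.0]]
filled, mask = fill_missing_with_floor(ratios)
```

The filled matrix would be passed to the clustering step, while the mask would be applied afterwards to restore the blank cells for visualization.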

A final consideration is calculation of the similarity of expression using a suitable distance metric, which is a method for calculating the resemblance between the expression patterns or profiles of two different gene products across all experiments. Some methods utilize average data values to minimize the effects of spurious outlier data points, while others use absolute values. This is not to say that one metric is necessarily better than another, as this is a subjective criterion. The three average linkage clusters shown in Figs. 1E–G were generated by using the more common Pearson, Spearman, and Euclidean distance metrics, respectively. Initially, all metrics appear to generate similar clusters (Figs. 1E–G) but upon closer inspection, it can be seen that changes in the signal log ratios are treated differently. This is especially evident in a group of proteins at the bottom of each cluster that have more similar expression patterns in the cytosolic fractions as compared to the nuclear fractions.

The Pearson metric defines a correlation coefficient between two lists of values. The version of Pearson used here is centered, which means the correlation is not affected by linear transformation of the data, such as adding or multiplying all values of one set by a constant. Again, this effect is most evident in the cluster of proteins at the bottom which are all preferentially detected (elevated abundance) in the cytosol rather than in the nuclei. The Spearman metric is a non-parametric distance measure and is less affected by outliers, such as the presence of weaker signal detected in the nuclear fractions. The Euclidean metric calculates expression differences directly and is therefore more sensitive to the magnitude of expression, resulting in better separation of proteins or mRNA species that exhibit similar fold changes in expression but different overall signal intensities.
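The behavior of the three metrics can be checked on small made-up profiles. The Python sketch below uses only the standard library; the function names and example values are hypothetical, and it is not the implementation used by any particular clustering package. The two correlation functions return a similarity r, from which a distance could be taken as 1 - r.

```python
import math


def pearson(x, y):
    """Centered Pearson correlation (undefined for a constant profile)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


a = [1.0, 2.0, 3.0]
shifted = [3.0, 5.0, 7.0]   # 2 * a + 1: same shape, different magnitude
outlier = [1.0, 2.0, 30.0]  # same ordering as a, one extreme value
```

Here `pearson(a, shifted)` is 1 because a linear transformation leaves the centered correlation unchanged, while `euclidean(a, shifted)` is well above 0 because the magnitudes differ. `spearman(a, outlier)` remains 1 since only the ordering matters, whereas `pearson(a, outlier)` drops to roughly 0.88.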

2.4.7. Example of combined microarray and protein data sets

Fig. 1. Clustering of protein and gene expression data. Color schema: green, low expression; red, high expression; black, no difference; and gray, no data. Ratios were calculated based on the observed protein spectral count in an experiment (sample) over the average spectral count for all experiments. (A) Three different lung cytosolic (left side) and nuclear (right side) protein fraction profiling datasets clustered using a low value substitute (bright green) for missing (blank) values. (B) The low filled values have been removed and replaced with the original blank values (gray). (C) Same dataset clustered using original blank values. (D) Expanded view of clusters. (E) Same dataset as in (B) clustered using the Pearson distance metric. (F) Same dataset clustered with the Spearman distance metric. (G) Same dataset clustered with the Euclidean metric. (H) Cluster of ~1,800 protein gene-product pairs showing early, mid and late protein ratios (left) and microarray gene expression ratios (right) arranged in a similar orientation from early-to-late time points. (I,J) Cluster detail of a concordant (I) and a discordant (J) cluster of protein-microarray profiles.

To examine if the changes in protein abundance we observed correlated with changes in mRNA abundance (as reported in the published probe sets) throughout development, we first summed the spectral count across all organellar protein fractions obtained for each time-point. To simplify the analysis, the data sets were binned into early (E13 and E16), mid (E18 and P2), and late (P14 and adult) developmental stages. We then generated the pseudo experiment for the summed data and calculated the log2 ratio for each time bin relative to this reference. To further facilitate the comparison, the microarray data were also expressed as a signal log2 ratio of its own pseudo experiment. Next, the proteins and corresponding mRNA probe sets were cross-mapped based on a common annotation cross-reference. As is quite commonly seen, the microarray data set was found to contain many redundant probe sets (i.e., mapping to the same protein), which were averaged prior to combining the two data sets. Over 1800 protein/probe set pairs (referred to as data pairs) were matched in this fashion. The final combined table contained a single column of unique (non-redundant) gene–protein identifiers, the corresponding columns of protein expression ratios across all three developmental time-points, in chronological order from earliest to latest, followed by the mRNA transcript expression ratios, also in the same chronological order.
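The merging step described above (averaging redundant probe sets, then pairing proteins and probe sets through a shared identifier) could be sketched in Python as follows. The function name, the probe and protein identifiers, and the ratio values are all hypothetical; the sketch assumes each probe set maps to at most one protein ID and that all rows list the time bins in the same order.

```python
def merge_protein_and_probe_ratios(protein_ratios, probe_ratios,
                                   probe_to_protein):
    """Build the combined table: protein ratios followed by mRNA ratios.

    protein_ratios: dict protein ID -> list of log2 ratios (early..late).
    probe_ratios: dict probe set ID -> list of log2 ratios (early..late).
    probe_to_protein: dict probe set ID -> protein ID (cross-reference).
    Redundant probe sets are averaged; unpaired entries are dropped.
    """
    grouped = {}
    for probe, ratios in probe_ratios.items():
        protein_id = probe_to_protein.get(probe)
        if protein_id is not None:
            grouped.setdefault(protein_id, []).append(ratios)

    merged = {}
    for protein_id, rows in grouped.items():
        if protein_id not in protein_ratios:
            continue  # probe set without a detected protein
        mean_mrna = [sum(column) / len(column) for column in zip(*rows)]
        merged[protein_id] = list(protein_ratios[protein_id]) + mean_mrna
    return merged


# Made-up identifiers and ratios, for illustration only.
proteins = {"P1": [-1.0, 0.0, 1.0], "P2": [0.5, 0.0, -0.5]}
probes = {"a_at": [-2.0, 0.0, 2.0], "b_at": [0.0, 0.0, 0.0],
          "c_at": [1.0, 1.0, 1.0]}
xref = {"a_at": "P1", "b_at": "P1", "c_at": "P3"}
table = merge_protein_and_probe_ratios(proteins, probes, xref)
```

In this example only P1 survives: its two probe sets are averaged, P2 has no probe set, and the probe set mapped to P3 has no detected protein.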

Using the publicly available Cluster 3.0 program, the merged data sets were clustered using the Pearson metric and average linkage by gene. Broadly viewed (Fig. 1H), the microarray and proteomic data appeared to correlate quite well. For instance, higher transcript levels at specific time-points were generally likewise reflected with elevated protein detection levels. Moreover, clustering generated many sub-groups where the mRNA and protein profiles are largely in agreement (Fig. 1I). However, several clusters of gene products did not appear to be similar at the protein or gene levels (Fig. 1J), suggesting either a high degree of post-transcriptional regulation or significant error in the measurement of protein abundance and/or mRNA transcript levels.

Fig. 2. Viewing data trends. (A) Plot of the slope (change in relative ratio over time) of measured protein levels versus the slope of the corresponding microarray gene transcript values. The data fall into three groups: regulated, where both protein and mRNA have non-zero slopes; neutral-regulated, where only one has a non-zero slope; and neutral, where both gene products have slopes not significantly different from zero. (B) Detail of the protein spectral count log ratios from lower left quadrant of panel (A), where both protein and mRNA have negative slopes. (C) Detail of the microarray signal log ratios from the same region of panel (A).

To thoroughly assess the relationship between the observed transcriptome and proteome, a more rigorous analysis of the correlation of co-expression is required. To this end, we decided to develop a simplified linear fit model to better examine the relationship between gene and protein levels. We focused our analysis on cases where both the protein and corresponding mRNA message were observed at all time-points (807 data pairs) or where the protein was detected exclusively at only a single time-point (382 data pairs), excluding all proteomic data points having incomplete microarray data (640 proteins). To generate the linear model, we independently plotted the log2 ratios of protein and gene expression as a function of time, with time also log2 transformed to make it more similar in magnitude to the expression data. Regression analysis was then performed across all protein–gene probe set pairs, and the slope was determined across all time-points. Genes and/or proteins exhibiting non-zero slopes using a suitable two-tail t test cut-off were selected for further analysis (critical values of t defined as a 90% confidence for the microarray data and 80% confidence for the proteomic data). Based on the correspondence between these patterns, the gene–protein data pairs were then further classified as either: (i) regulated, with both the mRNA and proteins exhibiting non-zero slopes (228 data pairs); (ii) neutral-regulated, with only one of each pair showing a non-zero slope (408 data pairs); and (iii) neutral, with two zero slopes (no significant change in expression) observed (176 data pairs).
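The slope test and three-way classification described above might look like the following Python sketch. It is an illustration under stated assumptions, not the analysis code: the function names, time values, and ratios are hypothetical. The default critical values are the standard two-tailed t values for one degree of freedom (three time points, two fitted parameters) at 80% and 90% confidence; the text does not state the values actually used.

```python
import math


def slope_and_t(times, log_ratios):
    """Least-squares slope of log2 ratio versus log2(time), with its t value."""
    x = [math.log2(t) for t in times]
    n = len(x)
    mx, my = sum(x) / n, sum(log_ratios) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, log_ratios))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2
              for a, b in zip(x, log_ratios))
    se = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    if se == 0.0:
        # Perfect fit: t is unbounded unless the slope itself is zero.
        t = 0.0 if slope == 0.0 else math.copysign(math.inf, slope)
    else:
        t = slope / se
    return slope, t


def classify_pair(protein_t, mrna_t, protein_crit=3.078, mrna_crit=6.314):
    """Classify a gene-protein pair from the t values of its two slopes."""
    protein_changes = abs(protein_t) > protein_crit  # 80% confidence, df = 1
    mrna_changes = abs(mrna_t) > mrna_crit           # 90% confidence, df = 1
    if protein_changes and mrna_changes:
        return "regulated"
    if protein_changes or mrna_changes:
        return "neutral-regulated"
    return "neutral"


# Made-up time values (arbitrary units) and log2 ratios.
slope, t = slope_and_t([1, 2, 4], [1.0, 0.1, -1.0])
flat_slope, flat_t = slope_and_t([1, 2, 4], [0.2, 0.2, 0.2])
```

The first made-up profile falls steadily and yields a slope close to -1 with a large negative t value, while the constant profile yields a slope and t value of zero.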

A plot of these three data groupings (Fig. 2A) indicates the tight clustering of the neutral set around a slope of zero, which implies that the scoring of a zero slope is due to constitutive expression of both the mRNA message and the protein end product. In contrast, 83% (189/228) of the regulated gene products showed clear evidence of co-regulation (either two positive or two negative slopes). However, the correlation score (r² value) for the regulated group was determined to be 0.39, which indicates a true (albeit modest) correlation overall, whereas the correlation score is only 0.18 if all the data pairs are considered. A closer examination of the regulated data found in the lower left quadrant of the plot (co-regulated negative slopes) is provided in Figs. 2B and C. The observed trends in spectral count ratios (Fig. 2B) and microarray signal ratios (Fig. 2C) clearly show the parallel patterns of downregulation (reduced expression) seen with this subset of gene products.

Although no slope could be calculated precisely for the many proteins detected at only a single time-point, we assumed a strong negative slope in the case of early specific protein expression and a strong positive slope for late-stage expression. Of the 430 proteins detected at single time-points, 382 had complete corresponding microarray data. Of these, 96, 115, and 171 were uniquely detected at early, mid and late stages of development, respectively. Importantly, greater than half of the early and late unique proteins (those detected only in the early or late time-points) showed evidence of co-regulation (that is, the slope of the microarray tested as being non-zero at a 90% confidence level), with less than 15% being dis-regulated (the mRNA tested as being non-zero, but with an opposite slope to that assumed for the protein counterpart) and approximately 30% judged to be neutral (that is, the slope could not be determined above a 90% confidence, and was therefore assumed to be zero). In contrast, the mid unique proteins (those detected exclusively at the mid developmental time point) had only 25% co-regulated and 25% dis-regulated, with the remaining 50% neutral. Hence, this relatively simple comparative modeling of the microarray and proteomic data suggests that many of the singleton proteomic data points reflect genuine developmental regulation of protein abundance at the early and late time-points.

Alternative models of the expression patterns may be useful. As evidence of this, the mid unique protein data (that is, proteins detected exclusively at the mid developmental time-point) did not show a good fit to a linear regression model when the protein expression patterns were assumed to have either a positive or negative slope. These data may fit better to a second order polynomial, with expression being low at early, high at mid, and low at late time-points.

2.4.8. Biological inference and validation

Clustering is only useful if it reveals relationships in data that are biologically meaningful. A rapid means of assessing functional clustering of data is by statistically testing clusters for enrichment or depletion of select functional categories. Several software packages are freely available to analyze clustered data, usually on the basis of annotations obtained from the gene ontology (GO) database. GenMapp [42], for example, allows the user to enter comma-delimited tables of protein or microarray data to calculate significant changes in GO terms by applying user-specified filters to the data (www.genmapp.org/download.asp). Due to the multi-step nature of the data processing for the method described, a schematic summary is displayed in Table 1. A simpler package, called GoMiner (http://discover.nci.nih.gov/gominer/index.jsp), uses two lists of gene identifiers: (1) the whole data set and (2) a sub-list that the user has selected as being different (up- or down-regulated) [43].

Table 1
Flow chart overview of method for preparing protein and microarray data for merger and analysis

Concordance scheme
1. Normalize the protein spectral counts by global scaling to the average spectral count detected per protein per sample.
2. Normalize the microarray data set by similar or other methods.
3. Cross reference the two data sets through a common ID (SwissProt/Trembl accession number, Ensembl, etc.).
4. Merge the two data sets:
   4.1. Average the intensities of the redundant matches.
   4.2. Remove all non-paired data.
5. Use a concordance score to compare the intensities of protein and mRNA in different experiments to evaluate the correlation of co-expression.

Analysis of change
1. Normalize as in the concordance scheme.
2. Generate ratios by a pseudo-experiment.
   2.1. Generate a pseudo-experimental data set, where the intensity value for a protein is the average spectral count of that protein across all experiments considered.
   2.2. The ratio is the log2 of the spectral count over the average count.
   2.3. This file should be in a tab-delimited format, which can be imported and used by various clustering software tools.
3. The protein data may be optionally merged with the microarray data before clustering.
   3.1. Select a clustering method and similarity metric (e.g., Pearson, Spearman or Euclidean distances).
4. Analyze the cluster for statistical enrichment of select markers or annotation features (e.g., GO terms) of interest.

Methods of data comparison are outlined.

We and others [44] have also developed publicly accessible web tools to perform this sort of analysis. FatiGO (http://fatigo.bioinfo.cnio.es/), for instance, carries out simple data-mining by assigning the most characteristic GO terms to each cluster using Fisher's exact test for statistical significance testing of groups of SPTR identifiers. The results are displayed in HTML and text format, along with a tree view of associated GO terms and the number of linked gene products. On the other hand, MouseSpec (http://tap.med.utoronto.ca/~posman/mouse-spec_two/) inputs a list of protein or gene IDs and outputs a summary of GO-based functional classes, biological roles, and cellular localization that are enriched in the list. p values are calculated using the hypergeometric distribution and represent the probability that the intersection of a given list with any given functional category occurs by chance. To correct for spurious false discovery due to multiple repeat testing, a Bonferroni correction factor can be applied to normalize for the number of tests conducted. Only those categories for which the chance probability of enrichment is lower than a pre-defined p value threshold are displayed.
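The hypergeometric enrichment test and Bonferroni correction described above can be sketched in a few lines of Python. This is a generic illustration and not the code behind MouseSpec or FatiGO; the function names and the example numbers are hypothetical. The p value is the one-sided probability of seeing at least the observed overlap by chance.

```python
from math import comb


def enrichment_p_value(population, category_size, list_size, overlap):
    """One-sided hypergeometric p value: P(overlap or more by chance).

    population: number of annotated gene products in the whole data set.
    category_size: how many of them belong to the functional category.
    list_size: size of the selected list (e.g., one cluster).
    overlap: members of the list that belong to the category.
    """
    upper = min(category_size, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(category_size, k) * comb(population - category_size,
                                      list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def bonferroni(p_value, number_of_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * number_of_tests)


# Made-up example: 10 gene products, 4 in the category, a list of 3
# containing 2 category members.
p = enrichment_p_value(10, 4, 3, 2)
adjusted = bonferroni(p, 5)  # as if five categories had been tested
```

For the made-up numbers the tail probability is (36 + 4) / 120 = 1/3, and the adjusted value is capped at 1 after correcting for five tests, so this category would not be reported as enriched.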

Here, we used MouseSpec to examine the properties of the co-regulated gene clusters (co-positive slopes, co-negative slopes, and neutral) from the lung mRNA and protein data comparison. Table 2 summarizes some of the statistically enriched GO categories associated with each of these groupings, many of which are biologically interesting. For instance, the positive co-regulated group showed a clear enrichment for structural molecules, suggestive of the acquisition of a terminally differentiated cell state, whereas the negatively co-regulated group was enriched for gene products involved in gene expression (e.g., transcription factors), perhaps specifying or determining cell fates. The neutral group was enriched for certain mitochondrial gene products, in particular those involved in electron transport chain activity, as well as for gene products involved in RNA binding-related functions, such as RNA processing and splicing, which are processes found in virtually all cell types.

3. Conclusion

One key aim of profiling studies is to accurately catalogue quantitative differences in the abundance of one or more gene products present in various biological samples. Pattern recognition algorithms can then be applied to sort and classify the samples and gene products based on their characteristic expression profiles. At present, the massively parallel nature of microarray technology allows for a far more comprehensive molecular analysis of the transcriptome as compared to direct measurements of the proteome using proteomic methods such as LC-MS, which generally exhibit more limited sensitivity and dynamic range. However, since proteomic approaches can provide additional insight into key determinants of biological activity (such as protein subcellular localization, protein–protein interactions, and post-translational modifications), protein profiling studies will undoubtedly continue to grow in scale and in popularity. It is therefore hoped that researchers will increasingly benefit from the unique insights into biology afforded by comparisons of the proteome and transcriptome using one or more of the approaches, methods, and/or tools outlined in this review.

Table 2
Functional categories (GO terms) enriched in the three main gene–protein co-expression subgroups

Category (GO term)                                       Positive slope     Negative slope     Neutral
                                                         p value   n        p value   n        p value   n
Mitochondrion [GO:0005739]                               0.00001   9        NS                 0.00001   20
Cytoskeleton [GO:0005856]                                0.00001   8        NS                 NS
Cell adhesion [GO:0007155]                               0.0003    6        NS                 NS
Regulation of transcription, DNA-dependent [GO:0006355]  NS                 0.0005    23       NS
RNA binding [GO:0003723]                                 NS                 0.00001   15       0.0000    14
Electron transport [GO:0006118]                          NS                 NS                 0.0012    12

Table of GO term enrichment as p value for different groupings of gene product pairs. Positive, both protein and mRNA time courses have a positive slope; Negative, both datasets exhibit negative slope; Neutral, both profiles have a slope of 0. n, gene product number; NS, not statistically significant.

References

[1] L. Smith, A. Greenfield, Hum. Mol. Genet. 12 (Spec No 1) (2003) R1–8.

[2] R. Aebersold, M. Mann, Nature 422 (2003) 198–207.

[3] T. Kislinger, A. Emili, Curr. Opin. Mol. Ther. 5 (2003) 285–293.

[4] M. Tyers, M. Mann, Nature 422 (2003) 193–197.

[5] A.C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A.M. Michon, C.M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.A. Heurtier, R.R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, G. Superti-Furga, Nature 415 (2002) 141–147.

[6] Y. Ho, A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A.R. Willems, H. Sassi, P.A. Nielsen, K.J. Rasmussen, J.R. Andersen, L.E. Johansen, L.H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B.D. Sorensen, J. Matthiesen, R.C. Hendrickson, F. Gleeson, T. Pawson, M.F. Moran, D. Durocher, M. Mann, C.W. Hogue, D. Figeys, M. Tyers, Nature 415 (2002) 180–183.

[7] H. Ge, Z. Liu, G.M. Church, M. Vidal, Nat. Genet. 29 (2001) 482–486.

[8] V.K. Mootha, J. Bunkenborg, J.V. Olsen, M. Hjerrild, J.R. Wisniewski, E. Stahl, M.S. Bolouri, H.N. Ray, S. Sihag, M. Kamal, N. Patterson, E.S. Lander, M. Mann, Cell 115 (2003) 629–640.

[9] S.P. Gygi, Y. Rochon, B.R. Franza, R. Aebersold, Mol. Cell. Biol. 19 (1999) 1720–1730.

[10] T.J. Griffin, S.P. Gygi, T. Ideker, B. Rist, J. Eng, L. Hood, R. Aebersold, Mol. Cell. Proteomics 1 (2002) 323–333.

[11] G. Chen, T.G. Gharib, C.C. Huang, J.M. Taylor, D.E. Misek, S.L. Kardia, T.J. Giordano, M.D. Iannettoni, M.B. Orringer, S.M. Hanash, D.G. Beer, Mol. Cell. Proteomics 1 (2002) 304–313.

[12] S. Ghaemmaghami, W.K. Huh, K. Bower, R.W. Howson, A. Belle, N. Dephoure, E.K. O’Shea, J.S. Weissman, Nature 425 (2003) 737–741.

[13] D. Greenbaum, R. Jansen, M. Gerstein, Bioinformatics 18 (2002) 585–596.

[14] M.P. Washburn, D. Wolters, J.R. Yates 3rd, Nat. Biotechnol. 19 (2001) 242–247.

[15] T. Kislinger, K. Rahman, D. Radulovic, B. Cox, J. Rossant, A. Emili, Mol. Cell. Proteomics 2 (2003) 96–106.

[16] Y. Pan, T. Kislinger, A.O. Gramolini, E. Zvaritch, E.G. Kranias, D.H. MacLennan, A. Emili, Proc. Natl. Acad. Sci. USA 101 (2004) 2241–2246.

[17] W.A. Tao, R. Aebersold, Curr. Opin. Biotechnol. 14 (2003) 110–118.

[18] M. Schena, D. Shalon, R.W. Davis, P.O. Brown, Science 270 (1995) 467–470.

[19] P. Hegde, R. Qi, K. Abernathy, C. Gay, S. Dharap, R. Gaspard, J.E. Hughes, E. Snesrud, N. Lee, J. Quackenbush, Biotechniques 29 (2000) 548–550 52–4, 56 passim.

[20] J. Quackenbush, Nat. Rev. Genet. 2 (2001) 418–427.

[21] J. Quackenbush, Nat. Genet. 32 (Suppl) (2002) 496–501.

[22] R.J. Lipshutz, S.P. Fodor, T.R. Gingeras, D.J. Lockhart, Nat. Genet. 21 (1999) 20–24.

[23] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, E.L. Brown, Nat. Biotechnol. 14 (1996) 1675–1680.

[24] D.L. Tabb, A. Saraf, J.R. Yates 3rd, Anal. Chem. 75 (2003) 6415–6421.

[25] R.G. Sadygov, J.R. Yates 3rd, Anal. Chem. 75 (2003) 3792–3798.

[26] M.J. MacCoss, C.C. Wu, J.R. Yates 3rd, Anal. Chem. 74 (2002) 5593–5599.

[27] S.W. Taylor, E. Fahy, S.S. Ghosh, Trends Biotechnol. 21 (2003) 82–88.

[28] S. Brunet, P. Thibault, E. Gagnon, P. Kearney, J.J. Bergeron, M. Desjardins, Trends Cell. Biol. 13 (2003) 629–638.

[29] G. Cagney, A. Emili, Nat. Biotechnol. 20 (2002) 163–170.

[30] K.D. Pruitt, K.S. Katz, H. Sicotte, D.R. Maglott, Trends Genet. 16 (2000) 44–47.

[31] R. Apweiler, A. Bairoch, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, D.A. Natale, C. O’Donovan, N. Redaschi, L.S. Yeh, Nucleic Acids Res. 32 (Database issue) (2004) D115–D119.

[32] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, D. Lancet, Trends Genet. 13 (1997) 163.

[33] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, D. Lancet, Bioinformatics 14 (1998) 656–664.

[34] M. Safran, I. Solomon, O. Shmueli, M. Lapidot, S. Shen-Orr, A. Adato, U. Ben-Dor, N. Esterman, N. Rosen, I. Peter, T. Olender, V. Chalifa-Caspi, D. Lancet, Bioinformatics 18 (2002) 1542–1543.

[35] A.I. Su, M.P. Cooke, K.A. Ching, Y. Hakak, J.R. Walker, T. Wiltshire, A.P. Orth, R.G. Vega, L.M. Sapinoso, A. Moqrich, A. Patapoutian, G.M. Hampton, P.G. Schultz, J.B. Hogenesch, Proc. Natl. Acad. Sci. USA 99 (2002) 4465–4470.

[36] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868.

[37] J.K. Eng, A.L. McCormack, J.R.I. Yates, J. Am. Soc. Mass Spectrom. 11 (1994) 976–989.

[38] T.H. Bo, B. Dysvik, I. Jonassen, Nucleic Acids Res. 32 (2004) e34.

[39] S. Oba, M.A. Sato, I. Takemasa, M. Monden, K. Matsubara, S. Ishii, Bioinformatics 19 (2003) 2088–2096.

[40] X. Zhou, X. Wang, E.R. Dougherty, Bioinformatics 19 (2003) 2302–2307.

[41] T.J. Mariani, J.J. Reed, S.D. Shapiro, Am. J. Respir. Cell Mol. Biol. 26 (2002) 541–548.

[42] K.D. Dahlquist, N. Salomonis, K. Vranizan, S.C. Lawlor, B.R. Conklin, Nat. Genet. 31 (2002) 19–20.

[43] B.R. Zeeberg, W. Feng, G. Wang, M.D. Wang, A.T. Fojo, M. Sunshine, S. Narasimhan, D.W. Kane, W.C. Reinhold, S. Lababidi, K.J. Bussey, J. Riss, J.C. Barrett, J.N. Weinstein, Genome Biol. 4 (2003) R28.

[44] F. Al-Shahrour, R. Diaz-Uriarte, J. Dopazo, Bioinformatics 20 (2004) 578–580.

[45] K.G. LeRoch, J.R. Johnson, L. Florens, Y. Zhou, A. Santrosyan, M. Grainger, S.F. Yan, K.C. Williamson, A.A. Holder, D.J. Carucci, J.R. Yates III, E.A. Winzeler, Genome Res. 14 (11) (2004) 2308–2318.

[46] J.R. Johnson, L. Florens, D.J. Carucci, J.R. Yates III, J. Proteome Res. 3 (2004) 296–306.

[47] T.R. Hughes, M. Mao, A.R. Jones, J. Burchard, M.J. Marton, K.W. Shannon, S.M. Lefkowitz, M. Ziman, J.M. Schelter, M.R. Meyer, S. Kobayashi, C. Davis, H. Dai, Y.D. He, S.B. Stephaniants, G. Cavet, W.L. Walker, A. West, E. Coffey, D.D. Shoemaker, R. Stoughton, A.P. Blanchard, S.H. Friend, P.S. Linsley, Nat. Biotechnol. 19 (2001) 324–327.

[48] H. Liu, R.G. Sadygov, J.R. Yates III, Anal. Chem. 76 (2004) 4193–4201.

