Research

New Phytologist (2009) 183: 224–236 © The Authors (2009). Journal compilation © New Phytologist (2009). Blackwell Publishing Ltd, Oxford, UK. www.newphytologist.org

Methods

Prediction of dual protein targeting to plant organelles

Jan Mitschke1*, Janina Fuss1*, Torsten Blum3, Annette Höglund3, Ralf Reski1,2, Oliver Kohlbacher3 and Stefan A. Rensing2

1Plant Biotechnology, Faculty of Biology, University of Freiburg, Schänzlestr. 1, D-79104 Freiburg, Germany; 2FRISYS, Faculty of Biology, University of Freiburg, Hauptstr. 1, D-79104 Freiburg, Germany; 3Center for Bioinformatics Tübingen, University of Tübingen, Sand 14, D-72076 Tübingen, Germany

Summary

• Dual targeting of proteins to more than one subcellular localization has been found in animals, in fungi and in plants. In the latter, ambiguous N-terminal targeting signals have been described that result in the protein being located in both mitochondria and plastids. We have developed ambiguous targeting predictor (ATP), a machine-learning implementation that classifies such ambiguous targeting signals.
• Ambiguous targeting predictor is based on a support vector machine implementation that makes use of 12 different amino acid features. Prediction results were validated using fluorescent protein fusion.
• Both in silico and in vivo evaluations demonstrate that ambiguous targeting predictor is useful for predicting dual targeting to mitochondria and plastids. Proteins that are targeted to both organelles by tandemly arrayed signals (so-called twin targeting) can be predicted by both ambiguous targeting predictor and a combination of single targeting prediction tools. Comparison of ambiguous targeting predictor with previous experimental approaches, as well as in silico approaches, shows good congruence.
• Based on the prediction results, land plant genomes are expected to encode, on average, > 400 proteins that are located in mitochondria and plastids. Ambiguous targeting predictor is helpful for functional genome annotation and can be used as a tool to further our understanding about dual protein targeting and its evolution.

Author for correspondence: Stefan A. Rensing
Tel: +49 761 203 6974
Email: [email protected]

Received: 27 November 2008
Accepted: 15 February 2009

New Phytologist (2009) 183: 224–236
doi: 10.1111/j.1469-8137.2009.02832.x

Key words: ambiguous targeting, chloroplast, genome annotation, intracellular sorting, mitochondrion.

Introduction

N-terminal targeting signals that target proteins to mitochondria, plastids and the secretory pathway are not conserved at the level of the primary sequence. Therefore, various machine learning approaches have been employed to identify typical features of such signals and to predict a protein’s subcellular localization. Numerous approaches have been implemented and are available as prediction services (Nakai & Horton, 1999; Emanuelsson et al., 2000; Guda et al., 2004; Small et al., 2004; Boden & Hawkins, 2005; Höglund et al., 2006). Prediction accuracy is often high, yet even the best chloroplast predictor, TargetP, has a false positive rate of approx. 69% and a true positive rate of approx. 86% (Zybailov et al., 2008) and there are sets of consistently misclassified proteins, some of which we address in this study. Existing tools usually assume that each protein is targeted to a single location (i.e. that the targeting signals unambiguously determine the final location of the mature protein). However, this is not always the case. In the last decade, dual targeting of a multitude of proteins has been described for native plant proteins (Peeters & Small, 2001; Silva-Filho, 2003; Mackenzie, 2005). For certain gene families, such as Arabidopsis thaliana aminoacyl-transfer RNA (tRNA) synthetases, dual mitochondrial/plastidal targeting is the rule (17/24 proteins) rather than the exception (Duchene et al., 2005). The previously unexpected high rate of dual targeting has even led to higher estimates for the size of the plastid (2700) and the mitochondrial (2000) proteomes (Millar et al., 2006), the plastid proteome having been estimated to be larger still (> 3400) in other studies (van Wijk, 2004). Protein targeting can depend on the developmental stage (e.g. the tissue type), as has been demonstrated for secretory pathway targeting in seeds and leaves of Nicotiana tabacum (Petruccelli et al., 2006). Also, protein folding, post-translational modification and protein–protein interaction can be involved in determining the targeting of proteins with multiple sites of action (Karniely & Pines, 2005). The importance of cis-elements, especially of the 5′ untranslated region (UTR), for determining the subcellular localization of dually targeted proteins has been demonstrated in several cases (Christensen et al., 2005; Kabeya & Sato, 2005; Sunderland et al., 2006; Puyaubert et al., 2008). Organelles might ‘compete’ for dually targeted proteins. In such a ‘tug-of-war’ scenario, highly efficient transport to one organelle might occur, obscuring the localization of the protein to the second target (Karniely & Pines, 2005). Also, a high abundance of a protein at one localization might render its detection in alternative localizations all but impossible (Duchene et al., 2005; Karniely & Pines, 2005). These factors make the straightforward detection of dual targeting using experimental approaches difficult.

*These authors contributed equally to this work.

During evolution, after transfer of an organellar gene to the nucleus, the gene needs to acquire a targeting signal in order for the encoded protein to be imported into its gene’s organelle of origin. Acquisition and subsequent evolution of a hydrophobic stretch at the N-terminus might be a prerequisite for this (Michl et al., 1999). Such signals might come into being by exon shuffling (i.e. by the acquisition of a pre-existing exon), by integration into a gene already carrying a targeting signal (Adams et al., 2000) or by a random process involving transcription and translation of the 5′ stretch of DNA. In terms of evolution of dual-targeting capability, alteration of dual-targeting signals in response to dietary necessities has been observed in Herbivora and Carnivora (Birdsey et al., 2004). In a biotype of the weed Amaranthus tuberculatus, herbicide resistance evolved via a codon deletion conferring dual targeting to mitochondria and plastids (Patzoldt et al., 2006). Once two functionally redundant genes are encoded in the nuclear genome, the evolution of a dual targeting signal and the subsequent deletion of one of the gene copies follows the parsimonious principle of evolution. In fact, deletion of both organellar and nuclear gene copies has been demonstrated recently in the case of the dually targeted plant ribosomal protein, S16 (Ueda et al., 2008). On the other hand, establishment of dual targeting for nonredundant proteins might enable neo-functionalization of organelles. Dual targeting might also serve to couple cytoplasmic processes with organellar processes (e.g. division, signaling, stress tolerance). A substantial fraction of A. thaliana (56) and Oryza sativa (103) transcription factors were predicted to be dually targeted to the nucleus and plastids or mitochondria (Schwacke et al., 2007), and therefore, dually targeted proteins might enable enforcement of nuclear control upon organelles (e.g. through RNA polymerases, transcription factors and tRNA synthetases).

Two principally different dual targeting mechanisms have been suggested: twin signals and ambiguous signals (Mackenzie, 2005). Whereas twin signals rely on two forms of the preprotein being translated upon different transcriptional or translational initiation or alternative splicing, ambiguous targeting signals have been shown to guide the same preprotein into two different compartments. In addition, protein isoforms generated by a twin mechanism can also be subject to ambiguous targeting, increasing the combinatorial complexity (von Braun et al., 2007; Puyaubert et al., 2008). N-terminal targeting sequences conferring targeting to mitochondria or plastids have a similar overall composition. Ambiguous targeting signals are similar to both signals; they are enriched in serine and arginine, and deficient in asparagine, glutamic acid and glycine, by comparison with mature proteins (Pujol et al., 2007). Investigation of the leading 20 residues showed that arginine is abundant in mitochondrial targeting sequences compared with those of chloroplasts, and ambiguous targeting sequences represent an intermediate situation.

It appears as if dual protein targeting is an iceberg of which we know only the tip, as only around 12 twin targeted proteins are known from plants to date, while c. 50 proteins exhibiting ambiguous targeting have been described. Dually targeted proteins are often misclassified by current prediction tools because a potential second localization is neglected. Among the scarce amount of data available to date, a total of 40 plant proteins have been described to contain an ambiguous targeting signal that directs them to both mitochondria and plastids. Therefore, we aimed to develop a tool for the accurate prediction of such ambiguous dual targeting. We tested the prediction results using the model plant Physcomitrella patens because dual targeting has been described to occur in this organism (Richter et al., 2002; Kiessling et al., 2004; Kabeya & Sato, 2005), detection of fluorescent protein fusions using transient protoplast transfection assays is a standard technique (Frank et al., 2005; Quatrano et al., 2007) and the genome sequence is available (Rensing et al., 2008).

Materials and Methods

Cultivation of plant material, RNA isolation and cDNA synthesis

P. patens (Hedw.) Bruch & Schimp. ssp. patens ‘Gransden 2004’ (Rensing et al., 2008) was cultivated as described previously (Bierfreund et al., 2003). To isolate RNA, protonema was harvested, frozen in liquid nitrogen and disrupted with a ball mill for 1 min at 30 Hz. The frozen material was mixed with 1 ml of Trizol reagent (Invitrogen) per 100 mg of plant material. After 5 min of incubation at 20°C and 20 min of centrifugation at 5000 g and 4°C, chloroform extraction (0.2 ml ml⁻¹ of Trizol) and isopropanol precipitation (0.5 ml ml⁻¹ of Trizol) of the supernatant were carried out. Reverse primers were situated well behind the end of the putative signal peptide and contained an EcoRV restriction site at their 5′ end; the forward primers were situated at the first ATG codon (early response to dehydration 4 homolog (ERD4)) or at the beginning of the 5′-UTR (pectin methylesterase (PME), phosphatidylinositol-dependent phospholipase C (PLC), a plastid division protein (FtsZ), fasciclin-like protein (FLP) and delta-aminolevulinic acid dehydratase 2 (Hem2)) and contained, at their 5′ end, a BamHI restriction site as well as two additional bases in front of it to improve restriction efficiency. For complementary DNA (cDNA) synthesis the RNA was treated with DNAse I (2.5 U per 10 µg of RNA) and ethanol precipitation was carried out to remove remnants of enzyme and buffer. Per reaction, 1–2 µg of DNAse I-treated RNA was used. The first-strand synthesis was performed using M-MuLV reverse transcriptase (Fermentas, St Leon-Rot, Germany), according to the manufacturer’s protocol.

Molecular cloning

Cloning of PCR products was performed using the ‘TOPO TA cloning kit for sequencing’ (Invitrogen), according to the provided protocol, and clones were checked by sequencing using T3/T7 primers. After digestion with BamHI and EcoRV, the DNA fragments obtained were ligated into a modified reporter vector, mAV4 (Kircher et al., 1999), containing a cyan fluorescent protein (CFP) gene instead of a green fluorescent protein (GFP) gene, yielding N-terminal fusions of the targeting signals to the fluorescent protein. The following oligonucleotide primers (Biomers, Ulm, Germany or Operon, Cologne, Germany) were used for cDNA synthesis and reverse transcription (RT)-PCR:
PME forward, ATGGATCCTCGTTCCTCGCTGGGATCAG;
PME reverse, GATATCAGGAATGTAGATCACAATGCG;
FLP forward, ATGGATCCGCACCGCAAATTTCAAACTG;
FLP reverse, GATATCATCTGGGGCAATTACGGTGAC;
FtsZ forward, ATGGATCCGCCGTGTTGCGTAGCCTTTG;
FtsZ reverse, GATATCCCGCTTCTGTAGATGCACAAG;
PLC forward, ATGGATCCATGGTGTCTATTGCGCGATTG;
PLC reverse, GATATCTACTCGGTGACCGTTAAATTC;
Hem2 forward, ATGGATCCATGGTAGGTGTGATGATGGC;
Hem2 reverse, GACATCTGGGAGGATGAAATTTGCAGG;
ERD4 forward, ATGGATCCATGACGGCTACAGCAGCGTTC;
ERD4 reverse, GATATCGAAGTTGTTATTCTCCGTCGC.
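The primer design described in the Methods (a BamHI site preceded by two extra bases on each forward primer, an EcoRV site at the 5′ end of each reverse primer) can be checked mechanically. The following is a minimal sketch in Python, not part of the original work; the function name is hypothetical, and the sequences are copied from the list above with line-break hyphens removed. The recognition sequences GGATCC (BamHI) and GATATC (EcoRV) are the standard ones.

```python
# Primer sequences as printed in the text (line-break hyphens removed).
PRIMERS = {
    "PME": ("ATGGATCCTCGTTCCTCGCTGGGATCAG", "GATATCAGGAATGTAGATCACAATGCG"),
    "FLP": ("ATGGATCCGCACCGCAAATTTCAAACTG", "GATATCATCTGGGGCAATTACGGTGAC"),
    "FtsZ": ("ATGGATCCGCCGTGTTGCGTAGCCTTTG", "GATATCCCGCTTCTGTAGATGCACAAG"),
    "PLC": ("ATGGATCCATGGTGTCTATTGCGCGATTG", "GATATCTACTCGGTGACCGTTAAATTC"),
    "Hem2": ("ATGGATCCATGGTAGGTGTGATGATGGC", "GACATCTGGGAGGATGAAATTTGCAGG"),
    "ERD4": ("ATGGATCCATGACGGCTACAGCAGCGTTC", "GATATCGAAGTTGTTATTCTCCGTCGC"),
}

BAMHI_SITE = "GGATCC"  # expected after two leading bases on forward primers
ECORV_SITE = "GATATC"  # expected at the very 5' end of reverse primers


def primers_missing_site(primers):
    """Return labels of primers whose 5' end lacks the expected site."""
    missing = []
    for gene, (forward, reverse) in primers.items():
        if forward[2:8] != BAMHI_SITE:
            missing.append(gene + " forward")
        if not reverse.startswith(ECORV_SITE):
            missing.append(gene + " reverse")
    return missing


if __name__ == "__main__":
    print(primers_missing_site(PRIMERS))
```

With the sequences as printed, the only primer reported is the Hem2 reverse primer, which begins with GACATC rather than GATATC; the text does not indicate whether this reflects the primer actually used or a typesetting issue.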

Transient transfection of P. patens protoplasts and confocal laser scanning microscopy

Protoplast transfection was performed as previously described (Frank et al., 2005). After at least 3 d of regeneration, protoplasts were analyzed using an LSM 510-i confocal laser scanning microscope (Carl Zeiss, Jena, Germany). To avoid false-positive detection of chloroplast signals, linear unmixing was carried out to separate the CFP spectrum from plastid autofluorescence. MitoTracker green FM (Invitrogen), mAV4–CFP and the signal peptide of FtsZ1-2 in-frame with GFP, which has been described as chloroplast localized (Kiessling et al., 2004), were used as controls.

Implementation of the ambiguous targeting predictor

Ambiguous targeting predictor (architecture, see Fig. S1) was implemented using Libsvm release 2.38 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and Python release 2.4 (http://www.python.org).

Training and test data sets

All negative and positive examples are available as fasta files on the ambiguous targeting predictor website. The ambiguous targeting predictor training data set consists of 43 proteins that have been described in the literature to be ambiguously targeted to mitochondria and plastids (Table S1). Another 44 proteins were used as negative examples (10–12 proteins each that were described to be exclusively targeted to the cytoplasm, plastids, mitochondria and the secretory pathway) in order to achieve a balanced training set. The negative examples were mostly derived from the TargetP (Emanuelsson et al., 2000) data set. In order to keep the approximate species distribution of the positive examples, some more recent sequence entries from SwissProt were also included.

For testing, 27 additional (independent) single targeted proteins were added to the 44 negative examples mentioned above. The resulting 71 sequences (none of which share > 48% identical positions within the N-terminal 70 amino acids) are represented by triangles in Fig. 1(b). Together with the 43 positive examples mentioned in the previous paragraph (squares in Fig. 1b), these sequences were used to generate the receiver operating characteristic (ROC) plot (Fig. 1a). In addition, seven independent positives from several species (A. thaliana, N. tabacum, Zea mays), none of them sharing more than 31% sequence identity within the N-terminal 70 amino acids with any of the 43 positive proteins of the training data set, were used for testing (circles in Fig. 1b).

Parameter optimization

We used support vector machines (SVMs) to analyze the N-terminal part of the amino acid sequences. Therefore, the leading 70 amino acids were scanned using a sliding window approach with a step size of one in order to generate support vectors (Fig. S1). The window size was left variable so that it could be optimized, and the primary sequence in each window was neglected; instead, the amino acid composition was derived. The following 12 different amino acid features were used: hydrophobicity; random coil; alpha helix; beta sheet; beta turn; negative residues; positive residues; small residues; tiny residues; arginine; alanine; and leucine/phenylalanine. The amino acid composition was evaluated for each amino acid feature based on the AAindex (Kawashima & Kanehisa, 2000). The feature values provided by this database were normalized using the normalize command, which sets the smallest value to 0.0 and the highest value to 1.0. These values provided (in addition to the window size) a second variable for each amino acid feature to determine whether or not this particular feature is present at a given position of the sequence. For each amino acid feature a single SVM was trained and both variables (window size and feature cut-off) were optimized in a grid search approach (Table 1, Fig. S1). For the amino acid features alpha helix, beta sheet and beta turn, all three features were calculated and if one of the other features yielded better results than the main feature for a given position, the main feature was not taken into account, even if it was above the optimized cut-off.
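The sliding-window encoding described above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy scale stands in for whichever AAindex entry was used, the function names are hypothetical, and the rule "window mean of normalized values at or above the cut-off means the feature is present" is one plausible reading of the text, which does not spell out the exact rule.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in scale
# (assumption: the text does not name the AAindex entries that were used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale a per-residue scale so the smallest value is 0.0 and the largest 1.0."""
    lowest, highest = min(scale.values()), max(scale.values())
    return {aa: (value - lowest) / (highest - lowest) for aa, value in scale.items()}


def window_feature_vector(sequence, scale, window_size, cutoff, n_terminal=70):
    """Slide a window (step size 1) over the leading residues.

    Each position yields 1 if the mean normalized feature value of the
    window reaches the cut-off, else 0. Residue order inside a window is
    ignored; only its composition matters. Non-standard residue codes
    would raise KeyError in this sketch.
    """
    normalized = min_max_normalize(scale)
    region = sequence[:n_terminal]
    vector = []
    for start in range(len(region) - window_size + 1):
        window = region[start:start + window_size]
        mean_value = sum(normalized[aa] for aa in window) / window_size
        vector.append(1 if mean_value >= cutoff else 0)
    return vector


if __name__ == "__main__":
    toy_sequence = "MAIL" * 20  # hypothetical 80-residue sequence
    print(window_feature_vector(toy_sequence, KYTE_DOOLITTLE, window_size=11, cutoff=0.68))
```

Window size and cut-off are the two variables that the grid search tunes per feature; a window size of 1 with a cut-off of 1.00, as listed for several features in Table 1, reduces to asking whether a single residue carries the maximal value of the scale.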

Based on the data of each SVM, a fivefold cross-validation was carried out and the Matthews correlation coefficient (MCC) (Eqn 1) was calculated as a measure that takes sensitivity, as well as specificity, into account:

$$\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TN + FN) \times (TP + FN) \times (TN + FP) \times (TP + FP)}}$$   Eqn 1

(TP, true positive; TN, true negative; FP, false positive; and FN, false negative.) As kernel for the SVMs, the radial-basis function was used, which has been shown to be very efficient for this type of biological targeting prediction (Höglund et al., 2006). The optimal SVM parameters c and γ were identified in a grid search (Table 1, Fig. S1).

Fig. 1 Ambiguous targeting predictor (ATP) prediction accuracy. (a) Receiver operating characteristic (ROC) plot of ATP performance on the expanded training data set. Sensitivity (true positive rate, y-axis) and overall specificity (true negative rate, x-axis) are plotted for score cut-offs of 0.4–1.0. Data values are shown next to the data points (open circles). We chose a cut-off of 0.7, which implies a specificity of 1.00 and a sensitivity of 0.98. While nucleo-cytoplasm and secretory pathway reach a compartment specificity of 1.0 for the score range 0.4–1.0, specificity for mitochondria and plastids converges at 0.6 and 0.7, respectively (data not shown). (b) Proteins predicted as positives at different score cut-offs. Proteins were divided into 10 bins (x-axis) by dividing the score range (0.0–1.0). The percentage of proteins per bin is shown on the y-axis for different data sets. While the number of negative examples (the single-compartment proteins mentioned above; triangles) decreased to < 10% between 0.3 and 0.4, the number of positive examples used for training (squares) significantly increased only beyond 0.7. The additional positive examples not used for training (circles) are predicted at values between 0.4 and 0.9. The complete Arabidopsis thaliana proteome (diamonds) is shown for comparison.

Training

As mentioned in the previous section, a single SVM was trained for each feature using the radial basis kernel. For the first training, c and γ were set to estimated default values (c, 0.03125; γ, 0.5) and the AAindex sliding window size and feature cut-off were optimized using a grid search based on fivefold cross-validation (Fig. S1). Using these optimized variables, the kernel variables c and γ were optimized in the second grid search (Fig. S1). In the third and final grid search, a second optimization of the AAindex sliding window size and feature cut-off was performed, yielding the final variable sets (optimized parameters, Fig. S1). Using these sets, training of each SVM was carried out individually on the positive and negative data sets.
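The Matthews correlation coefficient of Eqn 1 and the three successive grid searches can be sketched in Python as follows. This is an illustrative outline rather than the original code: `cv_mcc` is a hypothetical callable that is assumed to train one feature SVM with the given settings and return its fivefold cross-validated MCC, the grids are placeholders, and returning 0.0 for a zero denominator is a convention chosen here, not something stated in the text.

```python
import math
from itertools import product


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient as in Eqn 1."""
    denominator = math.sqrt((tn + fn) * (tp + fn) * (tn + fp) * (tp + fp))
    if denominator == 0:
        return 0.0  # convention assumed for degenerate confusion matrices
    return (tp * tn - fn * fp) / denominator


def grid_search(score, first_values, second_values):
    """Evaluate score(a, b) on the full grid; return (best_score, (a, b))."""
    best = None
    for pair in product(first_values, second_values):
        value = score(*pair)
        if best is None or value > best[0]:
            best = (value, pair)
    return best


def three_stage_search(cv_mcc, windows, cutoffs, cs, gammas,
                       c_start=0.03125, gamma_start=0.5):
    """Alternate between encoding variables and kernel variables.

    Stage 1: window size and cut-off with the default c and gamma.
    Stage 2: c and gamma with the stage 1 window size and cut-off.
    Stage 3: window size and cut-off again with the stage 2 kernel values.
    """
    _, (window, cutoff) = grid_search(
        lambda w, t: cv_mcc(w, t, c_start, gamma_start), windows, cutoffs)
    _, (c, gamma) = grid_search(
        lambda c_, g_: cv_mcc(window, cutoff, c_, g_), cs, gammas)
    best, (window, cutoff) = grid_search(
        lambda w, t: cv_mcc(w, t, c, gamma), windows, cutoffs)
    return best, {"window": window, "cutoff": cutoff, "c": c, "gamma": gamma}


if __name__ == "__main__":
    # Toy stand-in for cross-validated MCC with a known optimum.
    def toy(w, t, c, g):
        return -((w - 3) ** 2) - (t - 0.5) ** 2 - (c - 2) ** 2 - (g - 0.125) ** 2

    print(three_stage_search(toy, [1, 3, 5], [0.25, 0.5, 0.75], [0.5, 2, 8], [0.125, 0.5]))
```

Alternating between the two pairs of variables keeps each search two-dimensional instead of exploring a four-dimensional grid, at the cost of possibly missing optima that depend on interactions between encoding and kernel settings.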

Weighting and normalization

The individual SVM prediction results were weighted based on their MCC (Eqns 2 and 3). The score of each SVM (fScore) was weighted and normalized to the percentage it contributed to the total sum of all MCCs (wScore; Eqn 2); the sum of all wScores is the resulting score (cScore; Eqn 3), which is therefore normalized to [0.0–1.0] (Fig. S1).

Twin targeting analysis using existing tools

Eight examples of proteins previously described to be dually targeted by the twin mechanism (Table S2) were analyzed. The protein sequences were modified: the altered sequences simulated a second, shorter isoform, which might be generated by alternative transcription or translation initiation. The original sequence and the modified sequence were both tested using existing targeting prediction tools (as described later in this paragraph). In the best case both sequences should yield high values for different compartments other than cytoplasm (the latter would hint at the protein not being subject to dual targeting mechanisms). The original sequence was truncated at the N-terminal end just before the second methionine unless this methionine was within the first 25 amino acids. In that case, the first methionine beyond amino acid 25 was used. An internal ribosome entry site (IRES) motif search was also carried out, but no potential IRES motifs were found in the training data set. As prediction tools for twin targeting, MultiLoc/TargetLoc (Höglund et al., 2006), WoLF PSORT (Nakai & Horton, 1999) and TargetP (Emanuelsson et al., 2000) were used. The results were normalized to the respective highest possible value. To increase the informative value and to ensure that distinct results were favoured, the value of the second-best hit was subtracted from the value of the best hit. This quality measure was compared for the original sequence and the modified sequences.
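The truncation rule and the quality measure described above can be written down compactly. The Python sketch below is illustrative only: the function names are hypothetical, sequences are assumed to start with the initiator methionine, "within the first 25 amino acids" is read as 1-based positions 1 to 25, and the per-compartment scores in the usage example are invented numbers rather than output of any prediction tool.

```python
def simulate_short_isoform(sequence, min_position=25):
    """Truncate a protein just before an internal methionine.

    The second methionine is used unless it lies within the first
    `min_position` residues; in that case the first methionine beyond
    that position is used. Returns None if no suitable methionine exists.
    """
    met_indices = [i for i, aa in enumerate(sequence) if aa == "M"]
    if len(met_indices) < 2:
        return None
    start = met_indices[1]
    if start < min_position:  # 0-based index < 25 means position <= 25
        later = [i for i in met_indices if i >= min_position]
        if not later:
            return None
        start = later[0]
    return sequence[start:]


def distinctness(scores, highest_possible):
    """Best hit minus second-best hit after normalizing to the tool's maximum."""
    normalized = sorted((value / highest_possible for value in scores.values()),
                        reverse=True)
    return normalized[0] - normalized[1]


if __name__ == "__main__":
    toy = "M" + "A" * 9 + "M" + "A" * 19 + "M" + "K" * 5  # hypothetical sequence
    print(simulate_short_isoform(toy))  # second Met is at position 11, so the later one is used
    print(distinctness({"plastid": 0.8, "mitochondrion": 0.5, "cytoplasm": 0.1}, 1.0))
```

Comparing the distinctness of the full-length and the truncated sequence then shows whether the two simulated isoforms are each assigned clearly, and to different compartments.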

Results and Discussion

Ambiguous targeting predictor architecture

The ambiguous targeting predictor uses SVMs (Vapnik, 1998) for the prediction of ambiguous targeting. Support vector machines have already been successfully used in several localization prediction tools (Park & Kanehisa, 2003; Höglund et al., 2006; Shatkay et al., 2007) and have shown very good performance. Typical chloroplast targeting signals are 30–80 amino acids long (average 58 amino acids) and typical mitochondrial signals are 20–60 amino acids long (average 42 amino acids) (Zhang & Glaser, 2002). The input features of the SVM-based prediction engine ambiguous targeting predictor are therefore constructed from the 70 N-terminal amino acids using a sliding window approach (Fig. S1). A total of 12 different amino acid properties were used, which were selected based on previous results (Peeters & Small, 2001) and textbook knowledge (Lodish et al., 2007). Certain features of ambiguous targeting signals have recently been analyzed in a mutational approach, revealing the importance of arginine residues and of the second N-terminal amino acid, often an alanine (Pujol et al., 2007). Ambiguous targeting predictor therefore includes the presence of arginine and alanine as amino acid feature vectors. For each of the 12 feature vectors, a distinct support vector classifier was trained and these classifiers were combined into a joint prediction using a simple weighted voting scheme. The individual SVM prediction results (one for each amino acid feature) are weighted based on their MCC value on the training set (Table 1). This weighted score is then normalized to yield a score between 0.0 and 1.0, the latter being the best achievable score (Fig. S1). Support vector machine classifiers with a better prediction performance (a high MCC) will thus contribute more to the final result than less reliable classifiers (with a low MCC). The combination of 12 independent classifiers yielded superior results compared with the standard approach (i.e. a combined feature vector for all 12 feature sets). The ambiguous targeting predictor webtool is available online at http://www.cosmoss.org/bm/ATP.

$$\mathrm{wScore}(i) = \frac{\mathrm{fScore}(i) \times \mathrm{MCC}(\mathrm{fScore}_i)}{\sum_{j=1}^{12} \mathrm{MCC}(\mathrm{fScore}_j)}$$   Eqn 2

$$\mathrm{cScore} = \sum_{i=1}^{12} \mathrm{wScore}(i)$$   Eqn 3

Table 1 Optimized parameters for the amino acid feature support vector machines (SVMs)

Feature                 Short name   Window size   Cut-off   c       γ         MCC
Random coil             RDCL         10            0.34      0.5     0.125     0.5
Alpha helix             APHX         2             0.19      2       0.125     0.5
Negative residues       NEGR         1             1.00      2       0.5       0.5
Hydrophobicity          HYPB         11            0.68      2048    0.0005    0.45
Beta turn               BTTN         4             0.48      2048    0.00049   0.43
Arginine                ARGN         1             1.00      0.5     0.5       0.37
Beta sheet              BTSH         14            0.03      2       0.031     0.36
Tiny residues           TNYR         1             1.00      2       0.008     0.23
Alanine                 ALAN         1             1.00      0.031   2         0.19
Leucine/phenylalanine   LEPH         1             1.00      128     0.008     0.17
Positive residues       POSR         1             1.00      2       2         0.16
Small residues          SMLR         1             1.00      8       0.031     0.13

The amino acid features used by ambiguous targeting predictor (ATP) are sorted by relative importance, measured using their Matthews correlation coefficient (MCC). Optimized AAindex (window size, cut-off) and kernel parameters (c, γ) are shown.
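As a worked illustration of the weighted voting scheme (Eqns 2 and 3), the following Python sketch combines 12 per-feature scores using the MCC values listed in Table 1. It is a minimal sketch, not the published code: the function name is hypothetical, and the fScore of each SVM is assumed to lie in [0, 1] (for example a 0/1 decision), which the text implies through the stated range of the combined score but does not define in detail.

```python
# MCC per feature SVM, taken from Table 1 (keys are the short names).
FEATURE_MCC = {
    "RDCL": 0.5, "APHX": 0.5, "NEGR": 0.5, "HYPB": 0.45,
    "BTTN": 0.43, "ARGN": 0.37, "BTSH": 0.36, "TNYR": 0.23,
    "ALAN": 0.19, "LEPH": 0.17, "POSR": 0.16, "SMLR": 0.13,
}


def combined_score(f_scores, feature_mcc=FEATURE_MCC):
    """Return (cScore, wScores) for a dict of per-feature fScores.

    wScore(i) = fScore(i) * MCC(i) / sum of all MCCs   (Eqn 2)
    cScore    = sum of all wScore(i)                   (Eqn 3)
    """
    mcc_total = sum(feature_mcc.values())
    w_scores = {
        name: f_scores[name] * feature_mcc[name] / mcc_total
        for name in feature_mcc
    }
    return sum(w_scores.values()), w_scores


if __name__ == "__main__":
    # Hypothetical case: only the three top-ranked feature SVMs vote positive.
    votes = {name: 0.0 for name in FEATURE_MCC}
    for name in ("RDCL", "APHX", "NEGR"):
        votes[name] = 1.0
    c_score, _ = combined_score(votes)
    print(round(c_score, 3))
```

Because the MCC values in Table 1 sum to 3.99, the three best features together can contribute at most 1.5/3.99 of the final score, so a cScore above the 0.7 cut-off requires agreement among a broader set of feature SVMs.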

Importance of individual amino acid features

The influence of each amino acid feature on the ambiguous targeting predictor score can be derived from its MCC. The three top scoring features are random coil, alpha helix and negative residues, closely followed by hydrophobicity and beta turn (Table 1). Arginine and beta sheet also contribute well, while the other five features are of lesser importance. While for some of the features a qualitative difference exists for the full 70 amino acids (e.g. alpha helix, random coil, negative residues, arginine; Fig. S2), others exhibit regional differences (e.g. hydrophobicity, beta turn, beta sheet). Some of the more prominent differences that can be seen in the distribution plots (Fig. S2) are the lack of a hydrophobic stretch in the first 20 amino acids, a lower abundance of negatively charged amino acids and a higher abundance of arginine (Pujol et al., 2007) in the ambiguous targeting signals. The results from the feature optimization can be used to inform mutational research that aims to clarify the mechanism of ambiguous targeting.

Evaluation of the ambiguous targeting predictor prediction accuracy

For testing, the initial 43 positive examples (Table S1) were used and the negative examples were increased from 44 to 71 proteins by including 27 single localization proteins that were not part of the training data set (see the Materials and Methods for details on the training and test data sets). Different score cut-offs were evaluated based on their specificity (true negative rate) and sensitivity (true positive rate; see footnote 1). At a threshold of 0.7, which is the best performing cut-off in the ROC plot (Fig. 1a), all single targeted proteins were detected as true negatives, while 98% of the ambiguous signal sequences were detected as true positives. The ambiguous targeting predictor score is clearly correlated with sensitivity (correlation coefficient −0.84). Based on seven additional positive examples from several species that were not part of the training data set (Table S2), the accuracy of the method was further evaluated and, by using a score cut-off of 0.7, demonstrated average sensitivity (43%). At a score cut-off of 0.6, the sensitivity was 57%; the lowest score achieved among the seven true positives was 0.39. Therefore, scores of 0.8 and higher are expected to yield a very low rate of false positives while missing some of the true positives. Scores below 0.8 recover an increasing number of the true positives with a rising rate of false positives. A score cut-off of 0.7 seems to represent a good trade-off for practical application (Fig. 1). As an additional negative control, scores for the Saccharomyces cerevisiae proteome (all proteins considered negatives) were predicted. This approach led to 49 out of 5784 proteins (0.85%, equaling 99% specificity) being predicted as false positives using a score cut-off of 0.7. Because some S. cerevisiae proteins might contain functional dual targeting signals (Huang et al., 1990), the actual specificity might even be slightly higher. The score distribution for the A. thaliana proteome (Fig. 1b) is spread around a score of 0.4 (i.e. the majority of (single targeted) proteins achieves ambiguous targeting predictor scores of c. 0.4). Comparison of these score values with those for the training and test data sets (Fig. 1b) demonstrates that scores of < 0.4 yield a high number of false positives, that the score range 0.4–0.7 should be taken into consideration with caution and that scores of > 0.7 usually represent true positives.
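The reported rates can be reproduced from simple counts. The Python sketch below uses the standard definitions of sensitivity and specificity; the yeast counts (49 of 5784) are stated in the text, whereas the other counts (42 of 43 training positives, 3 and 4 of the 7 independent positives) are not given explicitly and are inferred here from the reported percentages.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Cut-off 0.7 on the 43 positives and 71 negatives
    # (42 true positives is an inference from the reported 98%).
    print(round(sensitivity(42, 1), 2), specificity(71, 0))

    # S. cerevisiae proteome as negative control: 49 of 5784 above the cut-off.
    false_positives = 49
    total = 5784
    print(round(100 * false_positives / total, 2))             # false positive share in %
    print(round(specificity(total - false_positives, false_positives), 2))

    # Seven independent positives (counts inferred from 43% and 57%).
    print(round(100 * sensitivity(3, 4)), round(100 * sensitivity(4, 3)))
```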

Comparison with other in silico approaches and databases

Recently, a combined approach using existing tools revealed a multitude of A. thaliana and rice transcription factors that are predicted to be targeted to either plastids or mitochondria in addition to being present in the nucleo-cytoplasm (Schwacke et al., 2007). Several of these proteins would be predicted to be dually targeted to both organelles by ambiguous targeting predictor, namely six out of 78 A. thaliana transcription factors predicted for plastid targeting (AT5G52020, AT2G22200, AT1G77640, AT2G44940, AT5G29000 and AT1G14410) and one out of 12 predicted for mitochondrial targeting (AT1G68180). Such proteins might thus exert nuclear transcriptional control in both semi-autonomous organelles.

The 39 A. thaliana proteins present in the training and test data set were compared with the A. thaliana subcellular database, SUBA v2.2 (Heazlewood et al., 2007). A total of 32 proteins (82%) are present among those 189 entries in the database for which dual mitochondrial and plastidal localization has been inferred by fluorescent protein fusion (16 proteins), mass spectrometry (three proteins), annotation based on The Arabidopsis Information Resource, TAIR (two proteins), AmiGO (19 proteins), Swissprot (two proteins) or a combination thereof. When applying ambiguous targeting predictor (with a score cut-off of 0.7) to the A. thaliana proteome, 523 proteins were predicted to be ambiguously targeted (Fig. 2). Of those, 37 overlapped with the 189 aforementioned SUBA entries (average ambiguous targeting predictor score for the entries: 0.5). A total of 35 proteins (90%) were present among those SUBA database entries predicted by computational tools. The individual tools predicted the proteins to be present either in plastids or mitochondria to a very different extent (TargetP 71/27%, Mitoprot2 0/96%, Subloc 0/24%, Ipsort 38/51%, Predotar 47/40%, Mitopred 0/44%, WoLF PSORT 89/4%, MultiLoc 69/31%, Loctree 49/29%, respectively). While those 8741 database entries that were predicted to be present in mitochondria and plastids based on a combination of computational tools contained 90% of the true positives checked, the number of false positives generated using this method is probably vast, given > 8700 entries compared with 523 predicted by ambiguous targeting predictor (Fig. 2).

1. Sensitivity = TP/(TP + FN); the proportion of positives that are correctly identified. Specificity = TN/(TN + FP); the proportion of negatives that are correctly identified. (TP, true positives; TN, true negatives; FP, false positives; FN, false negatives.)

New Phytologist (2009) 183: 224–236 © The Authors (2009) www.newphytologist.org Journal compilation © New Phytologist (2009)

We also compared the ambiguous targeting predictor predictions with proteomics data. For this purpose, all 690 nuclear-encoded A. thaliana plastid proteins present in the plastid protein (plprot) database (Kleffmann et al., 2006) were retrieved. In addition, all 457 A. thaliana mitochondrial proteins from the Arabidopsis mitochondrial protein database (AMPDB) (Heazlewood & Millar, 2005) that were determined by both gel-based and gel-free procedures were selected. The intersection of both data sets, 66 proteins, was subjected to ambiguous targeting predictor. The average score was 0.43, which is significantly higher than the score of 0.34 (P = 6.09E-07, one-tailed t-test) achieved on all A. thaliana proteins. Still, this average score is probably negatively biased because the databases are not manually curated and therefore might contain a certain number of false positives. A comparison with the manually curated plant proteome database (PPDB) (Sun et al., 2008) revealed an average ambiguous targeting predictor score for A. thaliana plastid proteins of 0.51, clearly demonstrating again that an intermediate ambiguous targeting predictor score is no clear indication of dual targeting. Yet, the average ambiguous targeting predictor score for those 53 manually curated A. thaliana TAIR7 PPDB proteins that are annotated as present in both plastids and mitochondria was found to be 0.73 (i.e. above the suggested confidence cut-off of 0.7).

The recently published ‘Database of proteins with multiple subcellular localizations’ (DBMLoc) (Zhang et al., 2008) contains a total of 29 proteins for which both mitochondria and plastids are listed as subcellular compartments. Of those, only three were present in the ambiguous targeting predictor training data set. A close inspection of the remaining 26 proteins revealed that the majority (20 cytochrome c6, two apocytochrome f and two voltage-dependent anion-selective channels) were selected as a result of a combination of experimental data and homology evidence, probably leading to a false-positive dual-targeting prediction. The remaining two proteins, Spinacia oleracea protoporphyrinogen oxidase and N. tabacum DNA-directed RNA polymerase 2, are proteins dually targeted by the twin mechanism.

Twin targeting prediction

Some of the proteins selected for experimental validation (Table 2) were considered as candidates for targeting using a twin mechanism, based on the presence of a secondary methionine within the putative targeting signal. In order to be able to analyze putative twin targeting of these proteins in greater detail, existing tools for the prediction of subcellular localization were applied to eight examples of proteins previously described in the literature to be dually targeted using the twin mechanism (Table S2). Subsequently, the prediction method described in the remainder of this section was applied to the P. patens proteins selected for experimental validation (Table 2).

Fig. 2 Cross-genome comparison of ambiguous targeting predictor (ATP) prediction results. The absolute number of proteins predicted to be ambiguously targeted to plastids and mitochondria (based on a score cut-off of 0.7) is shown for several land plants and algae. O. sativa, Oryza sativa; A. thaliana, Arabidopsis thaliana; P. trichocarpa, Populus trichocarpa; V. vinifera, Vitis vinifera; P. patens, Physcomitrella patens; C. reinhardtii, Chlamydomonas reinhardtii; C. merolae, Cyanidioschyzon merolae; O. tauri, Ostreococcus tauri; O. lucimarinus, Ostreococcus lucimarinus.

By predicting the localization for the full-length protein as well as for a truncated form (starting at the putative secondary methionine), in conjunction with score normalization, the prediction of dual targeting based on tandemly arrayed signal sequences is possible. The normalized score cut-offs yielding the best combination of specificity and sensitivity were 0.3 (WoLF PSORT), 0.4 (TargetP; Emanuelsson et al., 2000), 0.8 (TargetLoc; Höglund et al., 2006) and 0.5 (MultiLoc; Höglund et al., 2006). None of the tools clearly outperformed any of the other tools. As it turned out, the difference between the normalized scores for the best localization and the second best localization predicted for a given protein isoform can be used as a quality measure to assess the reliability of the prediction result. However, not all tools will always yield the correct result, suggesting that several tools should be used and a consensus approach applied.
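The twin targeting procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the raw scores would come from an external localization predictor (not called here), normalization is assumed to be division by the sum of all positive localization scores, the search window of 70 residues for the secondary methionine is borrowed from the ATP feature window, and all function names and example scores are hypothetical.

```python
from typing import Dict, Optional, Tuple


def secondary_met_isoform(seq: str, window: int = 70) -> Optional[str]:
    """Return the isoform starting at the first internal methionine found
    within the first `window` residues, or None if there is none."""
    pos = seq.find("M", 1, window)  # skip the initiator Met at index 0
    return seq[pos:] if pos != -1 else None


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw localization scores so that they sum to 1 (assumed scheme)."""
    total = sum(scores.values())
    return {loc: s / total for loc, s in scores.items()}


def best_call(scores: Dict[str, float], cutoff: float) -> Tuple[Optional[str], float]:
    """Return (best localization, or None if below cutoff; margin between
    the best and second best normalized scores)."""
    ranked = sorted(normalize(scores).items(), key=lambda kv: kv[1], reverse=True)
    (best_loc, best), (_, second) = ranked[0], ranked[1]
    return (best_loc if best >= cutoff else None), best - second


def twin_call(full: Dict[str, float], truncated: Dict[str, float], cutoff: float) -> bool:
    """Twin targeting is called when one isoform is predicted for
    mitochondria (mt) and the other for plastids (pt)."""
    full_loc, _ = best_call(full, cutoff)
    trunc_loc, _ = best_call(truncated, cutoff)
    return {full_loc, trunc_loc} == {"mt", "pt"}


# Hypothetical raw scores for a full-length and a truncated isoform.
full_scores = {"pt": 8.0, "mt": 1.0, "other": 1.0}
trunc_scores = {"pt": 1.0, "mt": 7.0, "other": 2.0}
print(secondary_met_isoform("MASLLMKT"))
print(best_call(full_scores, cutoff=0.4))
print(twin_call(full_scores, trunc_scores, cutoff=0.4))
```

The margin returned by `best_call` corresponds to the quality measure described above; a consensus approach would apply `twin_call` once per tool, each with its own cut-off, and combine the outcomes.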

Validation of the prediction results using fluorescent protein fusion

To evaluate the in silico results in vivo, CFP fusion constructs of P. patens protein-coding genes were generated to check their localization in transfected P. patens cells by confocal laser scanning microscopy (CLSM). As candidates, we chose P. patens proteins with ambiguous targeting predictor scores between c. 0.5 and 0.7 from the whole-proteome prediction (3619 proteins compared with 296 proteins with scores ≥ 0.7; Fig. 2) because this range is critical concerning the true positive/negative rate, as mentioned earlier. Moreover, we chose three proteins below this range to investigate whether proteins with a lower score may also be dually targeted. The chosen candidates (Table 2) were PME (score 0.09), FLP (score 0.13), PLC (score 0.4), ERD4 (score 0.49), Hem2 (score 0.56) and FtsZ (score 0.73). The analyzed constructs contained the 5′ part of the coding sequence, encompassing the signal peptide, in-frame with the CFP. For PLC, PME, FLP, FtsZ and Hem2, the possibility exists that their dual targeting is regulated via the twin mechanism (Table 2). Therefore, the 5′-UTR was included in those constructs in case it is important for regulating the twin mechanism (Christensen et al., 2005; Sunderland et al., 2006; Puyaubert et al., 2008).

The localization of most of the fusion proteins confirmed the expectations. The FtsZ protein is localized in both mitochondria and plastids, confirming the prediction result of ambiguous targeting predictor (score 0.73, Fig. 3e, Table 2). In the case of Hem2 (score 0.56), the annotated localization in the plastid was confirmed but no additional fluorescence in the mitochondria could be found (Fig. 3g), making it a probable false-positive result. This confirms that the chosen score cut-off of 0.7 is reasonable if one wants to exclude false positives. However, for some of the proteins with lower scores, dual targeting could also be demonstrated. An interesting case is the ERD4 homolog (score 0.49), which, during the first days of protoplast regeneration, is localized in the mitochondria, whereas after 10 d of regeneration localization switches to the chloroplast (Fig. 3b/c). Therefore, this protein seems to be another example of targeting being dependent on environmental/developmental conditions (Karniely & Pines, 2005; Petruccelli et al., 2006). The observed dual localization of PLC (Table 2) was surprising because the ambiguous targeting predictor score for this protein is rather low (0.4). However, the dual targeting to mitochondria and plastids in this case might also be a result of the twin mechanism. The results for FtsZ and PLC suggest that prediction using ambiguous targeting predictor might, in some cases, correlate with twin prediction if the protein is targeted to mitochondria and plastids (Table 2). This might be a result of the fact that ambiguous signals resemble both plastid and mitochondrial signals and therefore an ambiguous signal resembles the tandem array of targeting signals found in twin proteins and vice versa. The two chosen proteins at the lower end of the score range (FLP and PME with scores of 0.13 and 0.09, respectively) are clearly localized to only one compartment (Table 2), which suggests that at this low score the number of true positives is indeed probably very low. Taken together, three out of four proteins with an ambiguous targeting predictor score of between 0.4 and 0.73 could be shown to be dually targeted and thus are considered as true positive predictions (Table 2).
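The tally in the last sentence can be reproduced directly from the values in Table 2. The short Python sketch below encodes the six validated proteins with their scores and observed localizations and counts the dually targeted ones per score range; the function name and the tuple layout are illustrative only.

```python
# (protein, ATP score, localization observed by CFP fusion), from Table 2
TABLE_2 = [
    ("FtsZ", 0.73, "mt&pt"),
    ("Hem2", 0.56, "pt"),
    ("ERD4", 0.49, "mt&pt"),
    ("PLC", 0.40, "mt&pt"),
    ("FLP", 0.13, "secretory"),
    ("PME", 0.09, "secretory"),
]


def dual_in_range(rows, low, high):
    """Return (number dually targeted, number of proteins) with low <= score <= high."""
    in_range = [row for row in rows if low <= row[1] <= high]
    dual = [row for row in in_range if row[2] == "mt&pt"]
    return len(dual), len(in_range)


print(dual_in_range(TABLE_2, 0.4, 0.73))  # (3, 4)
print(dual_in_range(TABLE_2, 0.0, 0.39))  # (0, 2)
```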

Comparison with experimental data

Table 2 Experimentally validated Physcomitrella patens proteins

Protein  Gene model    ATP score  Localization  Twin prediction
FtsZ     Phypa_187670  0.73       mt&pt         pt/mt
Hem2     Phypa_221821  0.56       pt            pt/mt
ERD4     Phypa_180964  0.49       mt&pt         secretory/cytoplasm
PLC      Phypa_202996  0.4        mt&pt         mt/pt
FLP      Phypa_109369  0.13       secretory     secretory/pt
PME      Phypa_145966  0.09       secretory     secretory/pt

The first column contains the abbreviated protein name used throughout the text. The P. patens gene model (Rensing et al., 2008) for the corresponding locus is shown in column 2, followed by the ambiguous targeting predictor (ATP) score in column 3, the localization as confirmed by cyan fluorescent protein (CFP) fusion experiments in column 4 and the outcome of the twin targeting prediction in column 5. ERD4, early response to dehydration 4 homolog; FLP, fasciclin-like protein; FtsZ, a plastid division protein; Hem2, delta-aminolevulinic acid dehydratase 2; PLC, phosphatidylinositol-dependent phospholipase C; PME, pectin methylesterase; mt, mitochondria; pt, plastids.

Fig. 3 Fluorescent protein fusion localization in transfected Physcomitrella patens protoplasts. In order to detect the subcellular localization of fluorescent protein fusions (cyan fluorescent protein (CFP) or green fluorescent protein (GFP)), polyethylene glycol (PEG)-transfected P. patens protoplasts were analyzed, after at least 3 d of regeneration, using confocal laser scanning microscopy. MitoTracker green FM, mAV4–CFP (encoding no targeting signal) and the signal peptide of a plastid division protein (FtsZ1-2) in-frame with GFP (previously described as chloroplast localized (Kiessling et al., 2004)), were used as controls. Fluorescent protein emission is shown in cyan or green, and plastid autofluorescence is shown in red. The larger images represent merged channels and the adjacent smaller images represent the respective separate channels; sp, signal peptide. (a) Localization of mAV4–CFP (nucleo-cytoplasmic control); (b and c) early response to dehydration 4 homolog signal peptide (ERD4sp):CFP (Phypa_180964; Table 2) on day 10 (plastid localization) and day 3 (mitochondrial localization); (d) MitoTracker staining of an untransfected protoplast (mitochondrial control; the green structure to the lower left is a ghost emitted by the protoplast, probably heavily stained by MitoTracker or autofluorescent); (e) FtsZsp:CFP (Phypa_187670; Table 2) localized in plastids and mitochondria; (f) FtsZ1-2sp:GFP (plastid control); (g) Hem2sp:CFP (Phypa_221821, Table 2) localized in plastids.

The protoporphyrinogen oxidase from A. tuberculatus, in which ambiguous targeting to plastids and mitochondria evolved by a codon deletion leading to a 30-amino acid extension of the N-terminus (Patzoldt et al., 2006), yields an ambiguous targeting predictor score of 0.57. The short form of the protein, from the herbicide-susceptible biotype, yields a distinctly lower score of 0.49. In vitro evidence suggests that the A. thaliana Whirly 2 protein might be dually targeted to mitochondria and chloroplasts; the ambiguous targeting predictor score for this protein is 0.61. Recently, dual targeting of the Z. mays seryl-tRNA synthetase has been demonstrated (Rokov-Plavec et al., 2008); this protein achieves an ambiguous targeting predictor score of 0.51.

The A. thaliana holocarboxylase synthetase 1 (HCS1) gene is essential for biotin metabolism. Alternative splicing of the 5′-UTR has recently been shown to remove a small upstream open-reading frame (ORF), which represents a switch for the selection of the translation initiation site among two in-frame AUG codons (Puyaubert et al., 2008). The resulting proteins have been shown to be localized in the cytoplasm or chloroplasts, respectively. However, enzymatic activity of the protein in mitochondria has been shown and thus suggests ambiguous targeting, which might be obscured by a more efficient transport to chloroplasts (Puyaubert et al., 2008). Targeting of HCS1 might be regulated in response to metabolic requirements, comparable to expression control by metabolite-binding riboswitches (Cheah et al., 2007). The ambiguous targeting predictor score for the A. thaliana HCS1 is 0.47, making an ambiguous targeting mechanism possible. The P. patens homolog, Phypa_143161 (http://www.cosmoss.org), even generates an ambiguous targeting predictor score of 0.62.

It has been demonstrated that multiple in-frame start codons alter the localization of A. thaliana tRNA nucleotidyltransferase by differing transcriptional initiation (von Braun et al., 2007). Fluorescent protein fusion experiments performed in Allium cepa and N. tabacum cells suggested that the targeting signal starting at the very first methionine (ambiguous targeting predictor score 0.65) leads to localization in both mitochondria and plastids, whereas the targeting peptide lacking the first five amino acids (ambiguous targeting predictor score 0.91) was targeted to plastids. The protein starting at methionine 69 remained in the cytosol. Therefore, the proposed ambiguous targeting is also suggested by ambiguous targeting predictor, although the protein lacking the first five amino acids generates an even higher score than the longest one. It should be noted, however, that the fusion constructs did not contain the 5′-UTR, which might influence the initiation site, and the localization experiments carried out in heterologous systems might be misleading. Interestingly, a probable involvement of the protein proper in shifting the protein localization to mitochondria was shown, implicating the involvement of cytosolic factors (von Braun et al., 2007). The gene model describing the P. patens homolog, Phypa_21288 (ambiguous targeting predictor score 0.37), is obviously truncated. Manual inspection using the http://www.cosmoss.org genome browser revealed a gene model with a longer and putatively complete N-terminus, all_Phypa_159494, yielding an ambiguous targeting predictor score of 0.46; this protein might be ambiguously targeted.

In a recent study, it could be shown that in Medicago truncatula and Populus alba, in which the rps16 gene has been lost from the plastid genome, the plastid gene was substituted with a nuclear-encoded rps16 gene of mitochondrial origin through the capability of the encoded protein to dually target both organelles (Ueda et al., 2008). Interestingly, dual targeting of RPS16 to mitochondria and chloroplasts seems to have evolved before the Liliopsida/eudicotyledon split. Moreover, RPS16 proteins of plants that still harbor the plastid copy of the gene (e.g. A. thaliana, Lycopersicon esculentum, O. sativa) also possess dual targeting ability. Those proteins from the latter organisms for which dual targeting to plastids and mitochondria could be shown by fluorescent protein fusion generate ambiguous targeting predictor scores of 0.46, 0.46, 0.49 and 0.51, respectively (Table S2). For the A. thaliana RPS16-1, which was found to be exclusively targeted to chloroplasts in the assay, the ambiguous targeting predictor score is 0.41.

It has been shown that differences in chloroplast targeting signals exist between O. sativa and A. thaliana (Kleffmann et al., 2007; Zybailov et al., 2008). Yet, dual targeting could be validated for several P. patens proteins predicted by ambiguous targeting predictor, and the ambiguous targeting predictor score generally correlates well with the examples from different plants, as discussed earlier. Therefore, either the amino acid properties for dual plastid/mitochondrial targeting sequences are conserved throughout land plants, or the ambiguous targeting predictor approach (taking 12 different amino acid features into account) enables the prediction across a diverse set of organisms.

Cross-genome analyses

By applying ambiguous targeting predictor to the A. thaliana proteome, 523 proteins were predicted to exhibit dual mitochondrion/plastid targeting. A comparison with other plant proteomes (Populus trichocarpa, O. sativa, Vitis vinifera, P. patens; Table S3) showed that, in general, c. 450 (1.27 ± 0.4%) of the proteins carry potential ambiguous targeting signals (Fig. 2). By contrast, the proteomes of several algae (Chlamydomonas reinhardtii, Ostreococcus tauri, Ostreococcus lucimarinus and Cyanidioschyzon merolae) encode significantly fewer (P = 0.0016, Fisher’s exact test; on average c. 100) ambiguously targeted proteins. While this observation might be a result of different coding (and thus erroneous prediction) of targeting signals in the algae, it might also represent a correlation of dual targeting with increasing organismal complexity (especially if one considers the low number of predicted proteins in the highly reduced prasinophytes). Interestingly, of the A. thaliana proteins that exhibit putative ambiguous targeting, only 30% have their best BLAST hit among the lineages representing the ancestors of plastids and mitochondria (Cyanobacteria and alpha-Proteobacteria; cut-off 30% identity, 80 amino acids alignment length). While phylogenetic analysis will need to reveal details, this might suggest that a plethora of eukaryotic genes has evolved dual targeting capabilities during plant evolution. This would also suggest that neofunctionalization of the endosymbiotic organelles has taken place and that control by the nucleus (i.e. the host) is exerted using this mechanism.
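The best-hit filter used for the lineage comparison can be sketched in Python as follows. The cut-offs (30% identity, 80 amino acids alignment length) are taken from the text; interpreting them as minimum values, the dictionary layout, the lineage labels, the choice of all queried proteins as denominator and the example hits are assumptions made for illustration and do not represent the actual data.

```python
from typing import Dict, Optional

# Lineages representing the ancestors of plastids and mitochondria.
ENDOSYMBIONT_LINEAGES = {"Cyanobacteria", "Alphaproteobacteria"}


def passes_cutoff(hit: dict, min_identity: float = 30.0, min_length: int = 80) -> bool:
    """Keep a hit only if identity (%) and alignment length (aa) reach the cut-offs."""
    return hit["identity"] >= min_identity and hit["length"] >= min_length


def endosymbiont_fraction(best_hits: Dict[str, Optional[dict]]) -> float:
    """Fraction of query proteins whose best hit passes the cut-offs and
    belongs to an endosymbiont ancestor lineage (None means no hit)."""
    if not best_hits:
        return 0.0
    n = sum(
        1
        for hit in best_hits.values()
        if hit is not None
        and passes_cutoff(hit)
        and hit["lineage"] in ENDOSYMBIONT_LINEAGES
    )
    return n / len(best_hits)


# Hypothetical best hits for four query proteins.
example = {
    "protA": {"identity": 45.0, "length": 120, "lineage": "Cyanobacteria"},
    "protB": {"identity": 28.0, "length": 200, "lineage": "Alphaproteobacteria"},
    "protC": {"identity": 60.0, "length": 150, "lineage": "Metazoa"},
    "protD": None,
}
print(endosymbiont_fraction(example))  # 0.25
```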

Conclusions

Functional genome annotation requires accurate prediction of protein localization. It is therefore necessary to expand our knowledge further regarding dual targeting and to develop tools that enable prediction of dual targeting. In this study, we demonstrated that dual protein targeting can accurately be predicted by applying machine learning. We implemented a tool, ambiguous targeting predictor, for the prediction of ambiguous targeting signals. Our results demonstrated that land plant genomes encode, in general, > 400 proteins that are putatively targeted to mitochondria and plastids based on ambiguous N-terminal presequences. Evaluation of the prediction results using protoplast transfection demonstrates that proteins with ambiguous targeting predictor scores of > 0.3 might be ambiguously targeted to mitochondria and chloroplasts, while ambiguous targeting predictor scores of > 0.7 indicate high specificity. In terms of amino acid features that significantly contribute to the targeting predictability of the ambiguous targeting predictor, alpha helix, random coil, negative residues and arginine are important over the whole length of the N-terminal 70 characters, while hydrophobicity, beta sheet and beta turn exhibit regional bias. Ambiguous targeting predictor has been made available online via a web interface, allowing the user to check proteins of interest.

Acknowledgements

We are grateful to Simon Zimmer for assistance with implementation of the ambiguous targeting predictor (ATP) web tool and to Kirsten Krause for helpful comments on the manuscript. Funding by DFG (S.A.R. and R.R., grant Re 837/10-2) and BMBF (S.A.R. and R.R., grant 0313921, Freiburg Initiative in Systems Biology) is gratefully acknowledged.

References

Adams KL, Daley DO, Qiu YL, Whelan J, Palmer JD. 2000. Repeated, recent and diverse transfers of a mitochondrial gene to the nucleus in flowering plants. Nature 408: 354–357.

Bierfreund NM, Reski R, Decker EL. 2003. Use of an inducible reporter gene system for the analysis of auxin distribution in the moss Physcomitrella patens. Plant Cell Reports 21: 1143–1152.

Birdsey GM, Lewin J, Cunningham AA, Bruford MW, Danpure CJ. 2004. Differential enzyme targeting as an evolutionary adaptation to herbivory in carnivora. Molecular Biology and Evolution 21: 632–646.

Boden M, Hawkins J. 2005. Prediction of subcellular localization using sequence-biased recurrent networks. Bioinformatics 21: 2279–2286.

von Braun SS, Sabetti A, Hanic-Joyce PJ, Gu J, Schleiff E, Joyce PB. 2007. Dual targeting of the tRNA nucleotidyltransferase in plants: not just the signal. Journal of Experimental Botany 58: 4083–4093.

Cheah MT, Wachter A, Sudarsan N, Breaker RR. 2007. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 447: 497–500.

Christensen AC, Lyznik A, Mohammed S, Elowsky CG, Elo A, Yule R, Mackenzie SA. 2005. Dual-domain, dual-targeting organellar protein presequences in Arabidopsis can use nonAUG start codons. Plant Cell 17: 2805–2816.

Duchene AM, Giritch A, Hoffmann B, Cognat V, Lancelin D, Peeters NM, Zaepfel M, Marechal-Drouard L, Small ID. 2005. Dual targeting is the rule for organellar aminoacyl-tRNA synthetases in Arabidopsis thaliana. Proceedings of the National Academy of Science, USA 102: 16484–16489.

Emanuelsson O, Nielsen H, Brunak S, von Heijne G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology 300: 1005–1016.

Frank W, Decker EL, Reski R. 2005. Molecular tools to study Physcomitrella patens. Plant Biology (Stuttg) 7: 220–227.

Guda C, Fahy E, Subramaniam S. 2004. Mitopred: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics 20: 1785–1794.

Heazlewood JL, Millar AH. 2005. Ampdb: the arabidopsis mitochondrial protein database. Nucleic Acids Research 33(Database issue): D605–610.

Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH. 2007. Suba: the arabidopsis subcellular database. Nucleic Acids Research 35(Database issue): D213–218.

Höglund A, Donnes P, Blum T, Adolph HW, Kohlbacher O. 2006. Multiloc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 22: 1158–1165.

Huang J, Hack E, Thornburg RW, Myers AM. 1990. A yeast mitochondrial leader peptide functions in vivo as a dual targeting signal for both chloroplasts and mitochondria. Plant Cell 2: 1249–1260.

Kabeya Y, Sato N. 2005. Unique translation initiation at the second AUG codon determines mitochondrial localization of the phage-type RNA polymerases in the moss Physcomitrella patens. Plant Physiology 138: 369–382.

Karniely S, Pines O. 2005. Single translation–dual destination: mechanisms of dual protein targeting in eukaryotes. EMBO Reports 6: 420–425.

Kawashima S, Kanehisa M. 2000. AAindex: amino acid index database. Nucleic Acids Research 28: 374.

Kiessling J, Martin A, Gremillon L, Rensing SA, Nick P, Sarnighausen E, Decker EL, Reski R. 2004. Dual targeting of plastid division protein ftsz to chloroplasts and the cytoplasm. EMBO Reports 5: 889–894.

Kircher S, Wellmer F, Nick P, Rugner A, Schafer E, Harter K. 1999. Nuclear import of the parsley bzip transcription factor cprf2 is regulated by phytochrome photoreceptors. Journal of Cell Biology 144: 201–211.

Kleffmann T, Hirsch-Hoffmann M, Gruissem W, Baginsky S. 2006. Plprot: a comprehensive proteome database for different plastid types. Plant & Cell Physiology 47: 432–436.

Kleffmann T, von Zychlinski A, Russenberger D, Hirsch-Hoffmann M, Gehrig P, Gruissem W, Baginsky S. 2007. Proteome dynamics during plastid differentiation in rice. Plant Physiology 143: 912–923.

Lodish H, Berk A, Kaiser CA, Krieger M, Scott MP, Bretscher A, Ploegh H, Matsudaira P. 2007. Molecular Cell biology. Houndmills, UK: Palgrave Macmillan.

Mackenzie SA. 2005. Plant organellar protein targeting: a traffic plan still under construction. Trends in Cell Biology 15: 548–554.

Michl D, Karnauchov I, Berghofer J, Herrmann RG, Klosgen RB. 1999. Phylogenetic transfer of organelle genes to the nucleus can lead to new mechanisms of protein integration into membranes. Plant Journal 17: 31–40.

Millar AH, Whelan J, Small I. 2006. Recent surprises in protein targeting to mitochondria and plastids. Current Opinion in Plant Biology 9: 610–615.


Nakai K, Horton P. 1999. Psort: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochemical Sciences 24: 34–36.

Park KJ, Kanehisa M. 2003. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19: 1656–1663.

Patzoldt WL, Hager AG, McCormick JS, Tranel PJ. 2006. A codon deletion confers resistance to herbicides inhibiting protoporphyrinogen oxidase. Proceedings of the National Academy of Science, USA 103: 12329–12334.

Peeters N, Small I. 2001. Dual targeting to mitochondria and chloroplasts. Biochimica et Biophysica Acta 1541: 54–63.

Petruccelli S, Otegui MS, Lareu F, Tran Dinh O, Fitchette AC, Circosta A, Rumbo M, Bardor M, Carcamo R, Gomord V et al. 2006. A kdel-tagged monoclonal antibody is efficiently retained in the endoplasmic reticulum in leaves, but is both partially secreted and sorted to protein storage vacuoles in seeds. Plant Biotechnology Journal 4: 511–527.

Pujol C, Marechal-Drouard L, Duchene AM. 2007. How can organellar protein n-terminal sequences be dual targeting signals? In silico analysis and mutagenesis approach. Journal of Molecular Biology 369: 356–367.

Puyaubert J, Denis L, Alban C. 2008. Dual targeting of arabidopsis holocarboxylase synthetase1: a small upstream open reading frame regulates translation initiation and protein targeting. Plant Physiology 146: 478–491.

Quatrano RS, McDaniel SF, Khandelwal A, Perroud PF, Cove DJ. 2007. Physcomitrella patens: mosses enter the genomic age. Current Opinion in Plant Biology 10: 182–189.

Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud PF, Lindquist EA, Kamisugi Y et al. 2008. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319: 64–69.

Richter U, Kiessling J, Hedtke B, Decker E, Reski R, Borner T, Weihe A. 2002. Two rpot genes of Physcomitrella patens encode phage-type rna polymerases with dual targeting to mitochondria and plastids. Gene 290: 95–105.

Rokov-Plavec J, Dulic M, Duchene AM, Weygand-Durasevic I. 2008. Dual targeting of organellar seryl-trna synthetase to maize mitochondria and chloroplasts. Plant Cell Reports 5: 5.

Schwacke R, Fischer K, Ketelsen B, Krupinska K, Krause K. 2007. Comparative survey of plastid and mitochondrial targeting properties of transcription factors in arabidopsis and rice. Molecular Genetics and Genomics 13: 13.

Shatkay H, Hoglund A, Brady S, Blum T, Donnes P, Kohlbacher O. 2007. Sherloc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23: 1410–1417.

Silva-Filho MC. 2003. One ticket for multiple destinations: dual targeting of proteins to distinct subcellular locations. Current Opinion in Plant Biology 6: 589–595.

Small I, Peeters N, Legeai F, Lurin C. 2004. Predotar: a tool for rapidly screening proteomes for n-terminal targeting sequences. Proteomics 4: 1581–1590.

Sun Q, Zybailov B, Majeran W, Friso G, Olinares PD, van Wijk KJ. 2008. Ppdb, the plant proteomics database at cornell. Nucleic Acids Research 2: 2.

Sunderland PA, West CE, Waterworth WM, Bray CM. 2006. An evolutionarily conserved translation initiation mechanism regulates nuclear or mitochondrial targeting of DNA ligase 1 in Arabidopsis thaliana. Plant Journal 47: 356–367.

Ueda M, Nishikawa T, Fujimoto M, Takanashi H, Arimura SI, Tsutsumi N, Kadowaki KI. 2008. Substitution of the gene for chloroplast rps16 was assisted by generation of a dual targeting signal. Molecular Biology and Evolution 2: 2.

Vapnik VN. 1998. Statistical learning theory. Weinheim, Germany: Wiley-VCH.

van Wijk KJ. 2004. Plastid proteomics. Plant Physiology and Biochemistry 42: 963–977.

Zhang S, Xia X, Shen J, Zhou Y, Sun Z. 2008. Dbmloc: a database of proteins with multiple subcellular localizations. BMC Bioinformatics 9: 127.

Zhang XP, Glaser E. 2002. Interaction of plant mitochondrial and chloroplast signal peptides with the hsp70 molecular chaperone. Trends in Plant Science 7: 14–21.

Zybailov B, Rutschow H, Friso G, Rudella A, Emanuelsson O, Sun Q, van Wijk KJ. 2008. Sorting signals, n-terminal modifications and abundance of the chloroplast proteome. PLoS ONE 3: e1994.

Supporting Information

Additional supporting information may be found in the online version of this article.

Fig. S1 A diagram (Figure_S1.ppt) explaining the ATP architecture, including the sliding window approach for amino acid feature extraction and the training procedure.

Fig. S2 A figure (Figure_S2.png) showing distribution plots of the amino acid features used by ATP along the first 70 amino acids of the positive (blue) and negative example proteins (green).

Table S1 An Excel spreadsheet (Table_S1.xls) describing the ATP training dataset (positive samples)

Table S2 An Excel spreadsheet (Table_S2.xls) describing the additional (independent) ATP positive examples used for testing, details for some of the proteins described in Results and Discussion and the twin targeting test data

Table S3 An Excel spreadsheet (Table_S3.xls) containing the ATP prediction scores for the proteomes shown in Fig. 2

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.

