PROPEPTIDES OF PROTEASES
EVOLVED SENSORS TO
EXPLOIT ORGANELLAR PH
Johannes Elferich, Danielle M. Williamson, Bala Krishnamoorthy¶, and Ujwal Shinde*
Department of Biochemistry and Molecular Biology,
Oregon Health and Science University, 3181 SW Sam Jackson Park Road,
Portland, OR, 97239, USA
¶ Department of Mathematics, Washington State University
Pullman, WA, 99164, USA
*Corresponding Author:
Ujwal Shinde, [email protected] Phone (503)-494-8683 Facsimile: (503)-494-8393
Running title: Protease evolution to exploit organelle pH
SUMMARY:
Eukaryotic cells maintain strict control over protein secretion, in part by utilizing the pH-gradient
maintained within their secretory pathway. How eukaryotic proteins evolved from prokaryotic
orthologs to exploit the pH-gradient for biological function remains a fundamental question in
cell biology. We have previously demonstrated that protein domains located within precursor
proteins, propeptides, encode histidine-driven pH-sensors to regulate organelle-specific
activation of the eukaryotic proteases, furin and proprotein convertase-1/3. Using bioinformatics,
we analyzed over 10,000 unique proteases within evolutionarily unrelated families, and
established that eukaryotic propeptides are enriched in histidines when compared to prokaryotic
orthologs. On this basis, we propose that eukaryotic proteins evolved to contain histidines within
cognate propeptides to exploit the tightly controlled pH-gradient of the secretory pathway,
thereby directing activation within specific organelles. Enrichment of histidine in propeptides
may therefore be used to predict the presence of pH sensors in other proteases or even
protease substrates.
HIGHLIGHTS:
• Histidine residues in propeptides act as pH-sensors in furin, a eukaryotic protease
• Histidine is enriched in eukaryotic, but not prokaryotic, subtilase propeptides
• Histidine enrichment is found in protein families unrelated to subtilases
• We propose histidine enrichment as an evolutionary mechanism to sense organellar pH
INTRODUCTION:
Eukaryotes are descendants of distinct prokaryotic cells that united symbiotically to
subsequently evolve complex cellular compartments called organelles (Embley and Martin,
2006). Although both prokaryotes and eukaryotes are able to secrete proteins, only eukaryotes
employ multi-compartmental secretory and endocytotic pathways. These pathways maintain a
precise pH-gradient that acidifies from the endoplasmic reticulum (pH~7.2) to secretory vesicles
(pH~5.5). This gradient provides the unique environmental conditions essential for the optimal
structure and function of proteins within distinct biochemical pathways (Casey et al., 2010).
Since many secreted eukaryotic proteins have prokaryotic orthologs, how and when
eukaryotic proteins evolved the ability to regulate their activity within different organelles is a
central question germane to our understanding of protein trafficking. Comparing secreted
eukaryotic proteins with their bacterial orthologs may provide information about
mechanisms by which protein activity is regulated during trafficking through the secretory
pathway.
Proteases hydrolyze peptide bonds and likely arose early during evolution as simple
catabolic catalysts that generated amino acid residues in primitive organisms (Lopez-Otin and
Bond, 2008). Due to their ubiquitous distribution within prokaryotes, eukaryotes, and archaea, the
three domains of life, proteases are well suited for analysis of selective pressures that drove
adaptation of eukaryotic proteins to the complex organelle trafficking system. Since uncontrolled
proteolysis has catastrophic consequences, cells appear to have evolved two distinct
mechanisms that maintain protease activities under exquisite spatiotemporal control (Lopez-
Otin and Bond, 2008). The first mechanism involves co-evolution of specific endogenous
inhibitors, typically within compartments distinct from those containing active enzymes. The
second mechanism involves proteases being synthesized as inactive precursors called
zymogens, which become active by limited intra- or intermolecular proteolysis. In some cases
the two regulatory mechanisms are combined; N-terminal propeptides co-evolved to facilitate
folding of cognate catalytic domains and act as potent inhibitors after cleavage from the catalytic
domain (Shinde and Inouye, 1993; Shinde and Thomas, 2011).
Subtilases – a ubiquitous super-family of serine proteases – represent an ideal group of
homologs to analyze protein adaptation to eukaryotic organelles, since they exist in all three
domains of life. Bacterial subtilisin and mammalian proprotein convertase (PC) sub-families
constitute the most extensively studied enzymes (Shinde and Thomas, 2011). Despite
evolutionary divergence, proteins in these subfamilies display common folds with conserved
catalytic triads. Subtilases are almost always expressed as zymogens, with amino and
occasionally carboxy propeptide extensions. They are classified into two sub-families;
Extracellular Serine Proteases (ESP) and Intracellular Serine Proteases (ISP) (Subbian et al.,
2004). ESPs have 80-100 residue long propeptides that catalyze folding and act as inhibitors
after cleavage, while ISPs have shorter propeptides that only act as inhibitors in the zymogen.
Catalytic domains and propeptides of mammalian PCs are closely related to protease domains
of ESPs and not ISPs (Shinde and Thomas, 2011). Similar to bacterial ESPs, propeptides of
PCs assist folding and require two ordered steps of proteolytic cleavage for activation. The two
proteolytic cleavages are precisely controlled within different organelles. The first cleavage
occurs rapidly after protein folding in the endoplasmic reticulum and results in a non-covalent
complex between the propeptide and the catalytic domain. Activation requires an additional
cleavage within the propeptide; in the case of furin this cleavage occurs only after the protein is
trafficked to a different organelle, the trans-golgi-network (TGN) (Anderson et al., 2002). Other
PCs are activated in a similar manner, but within different compartments (Seidah and Prat,
2012).
Experiments in vitro show that the pH of the TGN is sufficient to trigger the second
activating cleavage of furin (Anderson et al., 1997) and that a histidine residue in the propeptide
acts as a pH-sensor (Feliciangeli et al., 2006). We recently showed that propeptides of PCs
mediate the pH of activation, as swapping propeptides between PCs reassigned the pH of
activation (Dillon et al., 2012). We therefore hypothesized that propeptides of eukaryotic
subtilases evolved to sense organelle pH in order to direct activation. Such a broad hypothesis
is difficult to test experimentally, as it would require biochemical studies on a large number of
proteins. We overcame this problem by predicting properties of protein sequences based on our
hypothesis, and testing these against sequence databases using statistical methods. Histidine is
the only residue with an intrinsic pKa near the physiological range (~6.5) and therefore likely
involved in pH-sensing mechanisms. In this paper we show that enrichment of histidines in
propeptides correlates with the requirement to sense pH for activation within the subtilase
family. Furthermore, we demonstrate similar enrichment in other protease families, indicating
that enrichment of histidines in propeptides is a common mechanism to regulate activity in the
secretory pathway.
RESULTS:
Propeptide sequences of subtilases are more divergent than cognate catalytic domains:
To identify conserved sequence elements unique to either prokaryotic or eukaryotic subtilases,
we performed an evolutionary conservation analysis using the ConSurf server (Ashkenazy et
al., 2010). The analysis of prokaryotic subtilisin and eukaryotic proprotein convertase families
was initiated using sequences of Subtilisin E and Proprotein Convertase 1/3 (PC1/3),
respectively. The resulting conservation scores were mapped on the crystal structure of the
propeptide:Subtilisin E complex (PDB: 1SCJ) and on a homology model of the propeptide:PC1
complex (based on PDB: 1P8J and 1KN6), respectively (Figure 1). Catalytic domains of
eukaryotic and prokaryotic subtilases display a highly conserved core. In contrast,
propeptides demonstrate less sequence conservation, with the dibasic cleavage motif at the C-
terminus of eukaryotic propeptides representing the only conserved region.
Since histidine 69 was demonstrated to function as a pH sensor in furin (Feliciangeli et
al., 2006), and given that propeptides of furin and PC1/3 alone are sufficient to impart organelle-
specific pH-dependent activation of cognate catalytic domains (Dillon et al., 2012), we analyzed
whether histidine residues demonstrate any sequence conservation within propeptides.
Although we could not identify absolutely conserved histidine residues in propeptides of
eukaryotic subtilases, several positions in our alignment contain a histidine residue in a
substantial fraction of sequences, especially at the position corresponding to histidine 69 in furin
(53.3% of sequences). In contrast, prokaryotic subtilases, which do not traverse the secretory
pathway, appear to encode fewer histidines within their propeptides. However, when catalytic
sequences are compared, we find strictly conserved histidine residues within prokaryotic and
eukaryotic sequences, and studies indicate that they play essential roles in catalysis or protein
stability (Carter and Wells, 1987). Hence, biased enrichment for histidine residues appears
localized within propeptides of eukaryotic subtilases.
The ConSurf analysis can only accommodate 150 sequences within each group, and a
search initiated using Subtilisin E and PC1/3 may introduce a selection bias based on input
sequences. Since subtilases encompass over 10,107 unique sequences, as per the PFAM
database family PF00082 (Punta et al., 2012), we developed a robust analysis of histidine
distribution within the available sequence data. Since PFAM employs a hidden Markov model of
only the catalytic domain in subtilases to scour through sequence databases, this method
avoids any selection bias for propeptide sequences. As PFAM families include only approximate
demarcations for start and stop positions for catalytic domains, we made the following three
suppositions to define propeptide and catalytic domain in each sequence: (i) the first 20
residues correspond to signal peptides (Hebert and Molinari, 2007) (ii) residues between
position 21 and the start of the catalytic domain correspond to the propeptide, and (iii) subtilases
with propeptides less than 50 residues represent ISP-like sequences that employ different
mechanisms for folding and activation (Subbian et al., 2004). This stringency provides a total of
6533 unique sequences from the PF00082 family for further analyses of a histidine bias.
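The three suppositions above can be sketched as a small filter. This is only an illustration of the stated rules, not the pipeline used for the analysis: the function name `split_precursor` and its arguments are hypothetical, and the catalytic domain boundaries are assumed to be supplied as 1-based, inclusive positions taken from the PFAM annotation.

```python
from typing import Optional, Tuple

SIGNAL_PEPTIDE_LENGTH = 20  # supposition (i): first 20 residues are the signal peptide
MIN_PROPEPTIDE_LENGTH = 50  # supposition (iii): shorter propeptides are treated as ISP-like


def split_precursor(
    sequence: str, catalytic_start: int, catalytic_end: int
) -> Optional[Tuple[str, str]]:
    """Return (propeptide, catalytic_domain), or None if the sequence is excluded.

    catalytic_start and catalytic_end are 1-based, inclusive positions
    (assumed here to come from the PFAM domain annotation).
    """
    # Supposition (ii): residues from position 21 up to the catalytic start.
    propeptide = sequence[SIGNAL_PEPTIDE_LENGTH:catalytic_start - 1]
    if len(propeptide) < MIN_PROPEPTIDE_LENGTH:
        return None  # ISP-like sequence, excluded from the histidine analysis
    catalytic = sequence[catalytic_start - 1:catalytic_end]
    return propeptide, catalytic
```

A sequence whose annotated catalytic domain starts at position 81 would therefore be assigned a 60-residue propeptide (positions 21 to 80) and retained, whereas one starting at position 51 would be excluded.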
Increased histidine content in propeptides of subtilases correlates with the requirement for pH-mediated activation:
We computed the abundance of histidine residues in propeptides ([His]Pro) and catalytic
domains ([His]Cat) for all sequences that met the above criteria. For comparison, we calculated
the difference in abundance in propeptides and catalytic domains within each protein (Δ[His] =
[His]Pro – [His]Cat). A positive value of Δ[His] indicates abundance of histidines in propeptides, a
negative value signifies abundance in catalytic domains, while near zero values imply equal
distribution. While Δ[His] values in individual proteins may be subject to random fluctuations, the
absence of any functional requirements would result in a distribution centered around zero. If
histidine residues in propeptides are required for the experimentally observed function of
sensing organelle specific pH, they would be selected during evolution, and one would expect
statistical bias for positive Δ[His] only within eukaryotic subtilases and near zero or negative
Δ[His] for prokaryotic subtilases.
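As a minimal sketch of the quantity defined above, Δ[His] can be computed from the two segments of a single protein. The function names are illustrative, and histidine abundance is assumed to be expressed as a percentage of residues in each segment, consistent with the percentage values reported below.

```python
def histidine_percent(segment: str) -> float:
    """Percentage of residues in the segment that are histidine (one-letter code H)."""
    if not segment:
        raise ValueError("segment must not be empty")
    return 100.0 * segment.upper().count("H") / len(segment)


def delta_his(propeptide: str, catalytic: str) -> float:
    """Delta[His] = [His]Pro - [His]Cat, in percentage points.

    Positive values indicate relative histidine enrichment in the propeptide,
    negative values enrichment in the catalytic domain.
    """
    return histidine_percent(propeptide) - histidine_percent(catalytic)
```

Because propeptides are short, a single additional histidine shifts [His]Pro by more than a percentage point in an 80-residue propeptide, which is why the comparison is made on distributions over many sequences rather than on individual proteins.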
For initial assessments, we plotted Δ[His] on a phylogenetic tree generated by the PFAM
database (Figure 2A). The tree is consistent with the homology groups defined by Siezen and
coworkers (Siezen and Leunissen, 1997), with the largest clades representing subtilisin, kexin,
proteinase K, and pyrolysin, as well as the later characterized sedolisin family (Wlodawer et al.,
2003). Four of these five families contain eukaryotic and prokaryotic proteins, suggesting these
families diverged before speciation. Only the subtilisin family is exclusively found within
prokaryotes. Interestingly, we observed that three of these four families display a predominantly
positive Δ[His] in eukaryotes, but not in prokaryotes. Only sedolisins show positive Δ[His] values
in both prokaryotes and eukaryotes.
To validate that positive Δ[His] values are unique to eukaryotic sequences, we
constructed a tree based on NCBI taxonomic classification and plotted Δ[His] within all
subtilases (Figure 2B). The slightly negative mean values of Δ[His] in prokaryotic and archaeal
proteins imply that there is no functional requirement for histidines in prokaryotic propeptides.
In contrast, we observe predominantly positive Δ[His] values in eukaryotes, with a mean value
of 1.72%. This difference signifies a strong increase in histidine content in the propeptide
compared to the catalytic domain. We observed positive Δ[His] values in all three kingdoms of
higher eukaryotes. In bacteria, the difference of about -0.3% was consistent in the three most
represented phyla. Interestingly, the phylum Acidobacteria had a mean difference of
2.13%, comparable to eukaryotes.
Although the above analysis provides a visual description, we wanted to analyze the
statistical significance of the observed bias towards positive Δ[His] values within eukaryotes.
First, we plotted distributions of [His]Pro and [His]Cat for subtilases in prokaryotes and eukaryotes
(Figure 2C). The catalytic domains in both species display a distribution centered on 2%, with
eukaryotes having slightly higher [His]Cat values than prokaryotes, as expected from the average
histidine content in the UniProt database. While the distribution of [His]Pro in prokaryotes is
shifted towards lower values with several propeptides completely lacking histidines, the [His]Pro
in eukaryotes is shifted to higher values, much greater than the catalytic domains. It is important
to note that the distribution of [His]Pro in eukaryotes displays a much higher deviation than that
for [His]Cat within both prokaryotes and eukaryotes, which is likely due to the shorter length of
propeptides. When we investigated distributions for every amino acid we found that this
enrichment exists only for histidine residues (Figure S1). To further analyze this bias we also
investigated the distribution of Δ[His] (Figure 2D), which clearly demonstrates the differences in
histidine bias in prokaryotes and eukaryotes. The Δ[His] distributions in both species are
positively skewed, with median values of -0.56% (mean = -0.34%) and 1.5% (mean = 1.7%) for
prokaryotes and eukaryotes, respectively. When differences in distribution for each individual
amino acid were plotted (Δ[AA]), only cysteine displayed a difference between prokaryotes and
eukaryotes similar to histidine (Figure S2). The cysteine bias is likely due to higher prevalence
of disulfide bonds in eukaryotes than prokaryotes. To quantify this distribution difference
between species we employed a non-parametric Mann-Whitney test (Table 1). For several
amino acids, the test resulted in small p-values (
effect size of 0.5. As seen in Figure 2E, histidine shows the highest deviation from 0.5,
suggesting this bias is not by pure chance. Only cysteine deviates substantially (more than 0.15
units from 0.5), which is likely due to higher frequency of disulfide bonds in eukaryotes than
prokaryotes. The fact that deviation from 0.5 in the effect size for histidine is considerably
greater than that observed for cysteine suggests a biological significance for a histidine bias.
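The following sketch illustrates one common way to obtain an effect size that equals 0.5 when two distributions do not differ. It assumes that the effect size referred to above is the Mann-Whitney U statistic divided by the product of the two sample sizes, that is, the probability that a randomly chosen value from one group exceeds a randomly chosen value from the other, with ties counted as one half. That reading is an assumption, and the function name is illustrative.

```python
from typing import Sequence


def probability_of_superiority(x: Sequence[float], y: Sequence[float]) -> float:
    """Return U / (len(x) * len(y)) for the Mann-Whitney U statistic of x versus y.

    0.5 means neither group tends to have larger values; values near 1.0 or 0.0
    mean that x is almost always larger or smaller than y, respectively.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for a in x:
        for b in y:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties contribute one half
    return wins / (len(x) * len(y))
```

The pairwise loop is quadratic in the number of sequences and is written for clarity only. For thousands of Δ[AA] values per group, a rank-based routine such as `scipy.stats.mannwhitneyu` would be the more practical choice and also returns the p-value.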
Since possible errors in database annotation and differences in length between
propeptides and catalytic domains may result in a false-positive bias, we developed a test that is
independent of the start annotation in the PFAM database. We calculated histidine content in a
20-residue sliding window from the beginning of the sequence to the end of catalytic domain for
all sequences. After normalization as described in Methods, we averaged the resulting histidine
content profiles for eukaryotic and prokaryotic proteins (Figure 2F). Eukaryotic but not
prokaryotic proteins show an increase in histidine content in the first 100 residues,
corresponding to the propeptide. Proteins from both species have increased histidine content at
positions 200-250 along with a small increase at the C-terminus of the catalytic domain, likely
due to presence of the catalytic histidine, and a conserved histidine at the C-terminus of the
catalytic domain. Changes in length of the sliding window do not change the overall profile
(Figure S3).
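The sliding-window calculation for a single sequence can be sketched as follows. The 20-residue default follows the text; the step size of one residue and the function name are assumptions, and the normalization and averaging across proteins described in Methods are not reproduced here.

```python
from typing import List


def sliding_histidine_profile(sequence: str, window: int = 20) -> List[float]:
    """Percent histidine in every window of `window` residues, stepping by one residue.

    Element i of the result covers residues i .. i + window - 1 (0-based).
    Sequences shorter than the window give an empty profile.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    if len(seq) < window:
        return []
    count = seq[:window].count("H")
    profile = [100.0 * count / window]
    for i in range(window, len(seq)):
        # Slide by one: add the residue entering, drop the residue leaving.
        count += (seq[i] == "H") - (seq[i - window] == "H")
        profile.append(100.0 * count / window)
    return profile
```

Since the profile starts at the first residue of the sequence rather than at an annotated domain boundary, a peak within the first 100 positions does not depend on where the database places the start of the catalytic domain.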
To decipher correlations that may exist between the histidine bias and experimental
evidence of pH-dependent activation, we analyzed histidine contents in propeptides and
catalytic domains of individual proteins. For prokaryotes and archaea, we selected proteins that
displayed “reviewed status” in the UniProt database, and for eukaryotic sequences all homologs
in Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, and the model proteases
cucumisin and Proteinase K/R (Figure 2G). While most bacterial proteins show comparable
histidine content in propeptides and catalytic domains (approximately 2%), Kumamolisin and
Xanthomonalisin display histidine content >4% in their propeptides. Consistent with our
hypothesis, both proteins undergo activation at acidic pH in vitro (Oda et al., 1987; Oyama et al.,
2002), which is not surprising because their hosts display optimum growth under acidic
conditions. Since the intracellular pH within these cells is maintained near neutral, pH sensing is
an ideal mechanism for discerning intracellular and extracellular environments. Hence, it is not
surprising that the sedolisin family shows a histidine bias within propeptides of prokaryotic
sequences, given that this family is optimized to function at low pH. Interestingly, Acidobacteria
almost exclusively express sedolisins, which explains their positive Δ[His] values. On the other
hand, all eukaryotic propeptides display histidine content above 4%, with the exception of
Proteinase K/R and SKI-1. Recombinant expression of proteinase K/R in E. coli produces active
protease (Gunkel and Gassen, 1989) and SKI-1 loses its propeptide in the ER (Seidah et al.,
1999), suggesting these processes occur at neutral pH, relaxing the necessity for histidines.
In conclusion, our results demonstrate that the histidine bias in subtilase propeptides
generally correlates with host species as it is present in eukaryotes but not in prokaryotes
(Figure 2). These results are consistent with our hypothesis that during evolution subtilases
responded to the requirement of regulating their activation as per the pH of their environment by
encoding histidine residues in propeptides to sense pH and direct activation.
Cathepsin propeptides are enriched in histidine:
To investigate whether our hypothesis applies to other pH-activated, propeptide-
dependent proteases, we analyzed histidine content in cathepsins, a large family of cysteine
peptidases found mainly in lysosomes (Turk et al., 2012). Similar to subtilases, acidic pH
initiates activation of cathepsins by propeptide proteolysis. Due to these parallels, we
hypothesized that eukaryotic cathepsins should show a similar bias for histidine in their
propeptides.
We used the PFAM family PF00112 to obtain cathepsin sequences and analyzed them
in a manner identical to subtilases. Figure 3A shows the phylogenetic tree for various
cathepsins along with their Δ[His] values. While only a few cathepsin homologs exist in
prokaryotes, this paucity is not due to exclusion of sequences based on our criteria. Although
experimental data on prokaryotic cathepsins is scarce, we included them in the analysis for
comparison. The two major well-studied lysosomal cathepsin families are cathepsin L and
cathepsin B, both of which activate at low pH (Nishimura et al., 1988; Turk et al., 1993). Since
no experimental data regarding activation of cathepsin O, cathepsin F, and plant cathepsins
was found, we excluded them from further analysis. Nonetheless, the latter two cathepsins also
display the Δ[His] bias.
While the cathepsin L family shows bias towards positive Δ[His] values, the cathepsin B
family does not. The distributions of [His]Pro and [His]Cat in the cathepsin L family are similar to
those in eukaryotic subtilases (Figure 3B). However, the cathepsin B family displays increased
[His]Pro and [His]Cat values, leading to near-zero Δ[His] values (Figure 3C). Prokaryotic
cathepsins show distributions similar to those of prokaryotic subtilases. The small number of prokaryotic
sequences precludes a statistical comparison between species with robustness similar to
subtilases.
We next applied the sliding window analysis to validate the increased histidine content in
the propeptides of cathepsin L, and to map the specific location of increased histidine content in
the sequence of cathepsin B (Figure 3D). Prokaryotic cathepsins showed low histidine content
throughout the protein, with one peak between residues 250 and 300, which is due to the
catalytic histidine. Consistent with our hypothesis, an additional increase in histidine content
exists within the first 100 residues of cathepsin L. Interestingly, cathepsin B shows a moderate
increase in histidine content within the first 100 residues compared to prokaryotes, but a
substantial second peak corresponding to the occluding loop within the catalytic domain (Figure
3D and 3E). A comparison of the crystal structures of cathepsin L and cathepsin B (Figure 3E)
shows that the catalytic domains of the two families are similar. However, the cathepsin B
propeptide is truncated compared to cathepsin L, while the occluding loop in the catalytic
domain is longer in cathepsin B. Moreover, the cathepsin B occluding loop in the catalytic
domain extends into the region occupied by the cathepsin L propeptide to form direct contacts
with the cathepsin B propeptide. Notably, histidines within the occluding loop of cathepsin B
occupy similar spatial locations as histidine residues within the cathepsin L propeptide. This
suggests that the pH-sensing capability in cathepsin B is encoded not only within the
propeptide, but also in the occluding loop within the catalytic domain. Consistent with this
prediction, experimental data demonstrate that the occluding loop interacts with the propeptide
in a pH-dependent manner and mutation of histidines in the occluding loop to alanine blocks
activation (Quraishi et al., 1999). Moving pH-sensitivity from the propeptide into the catalytic
domain provides an evolutionary advantage to cathepsin B by enabling it to switch between an
endo- and exopeptidase in a pH-dependent manner (Illy et al., 1997). In summary, these results
are consistent with our hypothesis, although subtle variations can exist within individual
propeptide-dependent protease families.
Propeptides in the cytosolic caspase family do not display a histidine bias:
Our hypothesis assumes that eukaryotic proteases require histidines in their propeptides to
sense the pH of the secretory pathway. Therefore, proteases that are expressed and function in
the cytosol would be expected to show no histidine bias within their propeptides. Caspases
constitute the most prominent, propeptide-dependent cytosolic protease family. They are
responsible for initiating apoptosis within eukaryotic cells (Creagh et al., 2003). Similar to
subtilases and cathepsins, caspases are expressed as inactive zymogens and activated by
proteolytic processing. Although apoptosis is linked with mild acidification of the cytosol, pH has
not been shown to be important for triggering caspase activation.
We used the PFAM family PF00656 to obtain caspase sequences and processed them
as described in Methods. The phylogenetic tree demonstrates that caspase homologs are found
in metazoans, fungi, and plants (Figure 4A). We exclude metacaspases (homologs in fungi and
plants) from our analysis because their propeptides contain histidine residues that are involved
in zinc binding (Tsiatsiani et al., 2011). Metazoan caspases demonstrate increased [His]Cat
values, while the [His]Pro values are similar to those of prokaryotic caspases (Figure 4B).
Consistently, Δ[His] values were slightly smaller for eukaryotic proteins (Figure 4C). The sliding
window analysis of prokaryotic and eukaryotic caspases shows that there is no substantial
histidine enrichment in the N-terminal residues (Figure 4D). Overall these results are consistent
with the assumption that the functional requirement of histidines in the propeptide is unique to
proteases that need to sense pH to direct their activation.
DISCUSSION:
We report a correlation of increased histidine content in propeptides with the
requirement to sense pH. But does a correlation imply causality? Histidine residues play
multiple unique roles in proteins because they can (i) function as proton exchangers in enzyme
catalysis (Dodson and Wlodawer, 1998), (ii) form complexes with soft metals such as iron and
zinc (Andreini et al., 2009), (iii) provide unique hydrogen bonding geometry, and (iv) alter protein
structure and interactions in a pH-dependent manner. Since propeptides are not part of the
active site that mediates proteolysis, and because the propeptides analyzed in this study do not
bind metal ions (Coulombe et al., 1996; Jain et al., 1998; Tangrea et al., 2002), one can exclude
the first two roles. It is also unlikely that propeptides in eukaryotes have a different requirement
for hydrogen bonding than their prokaryotic orthologs, thus endorsing their roles as pH-sensors
as the most likely explanation for the observed histidine bias.
Histidine residues have been demonstrated to function as pH sensors within various
proteins in prokaryotes and eukaryotes (Casey et al., 2010; Srivastava et al., 2007). In some
cases such as Na+/H+ antiporters, pH-regulation is found in prokaryotic and eukaryotic orthologs
(Slepkov et al., 2007). This is due to the common functional requirement to regulate intracellular
pH. Since in the case of the protease families investigated here, pH-sensing seems to be
unique to eukaryotic proteins (with the exception of sedolisins), one can surmise that this is
indeed due to a functional requirement that is unique to eukaryotes, such as regulation within
the secretory pathway.
It is important to note that electrostatic interactions within a protein family can migrate
during the course of evolution, even though their physiological functions such as pH sensing
can be conserved. This may occur through mutations that introduce a redundant charge pair
that displays a pKa similar to the previous one, allowing the original charge to disappear during
subsequent steps of evolution while maintaining or subtly modifying its titration properties,
without requiring exquisite stereochemistry (Harrison, 2008). In fact, we have observed that the
histidine residues in the propeptide of eukaryotic subtilases are not strictly conserved (Figure 1)
but are spatially “unconstrained” within their sequence. Based on this finding, we preferred to
analyze overall histidine content within propeptides and catalytic domains. Since histidine is
among the least abundant amino acids within proteins (~2.3%), small changes in histidine number
can significantly influence the overall content, especially in the rather short propeptides.
We selected three specific families for our analysis because we wanted proteins that
have (i) a well-distributed phylogeny in prokaryotes and eukaryotes, (ii) propeptide domains
necessary for chaperoning folding, (iii) the structures of both propeptides and catalytic domains
solved, and (iv) activation pathways which are experimentally well characterized. Subtilases,
cathepsins, and caspases conform to the above requirements, and represent more than 10,000
protease sequences.
It is astonishing that the histidine bias is found in families that have no evolutionary
relationship. Also, it is consistently found in subfamilies of subtilases, even though these likely
diverged before the speciation of eukaryotes and prokaryotes. This suggests that different
protease families may have reacted independently to the same evolutionary pressures to
converge to similar solutions, and therefore represent examples of convergent evolution. We
argue that there are two reasons why pH-sensing has independently evolved within propeptide
domains. First, propeptides appear to be under less evolutionary pressure than cognate
protease domains, which must maintain their catalytic function throughout evolution,
necessitating preservation of the active site geometry. As a consequence, propeptides are more
likely to develop new features that allow the modulation of the protease in different
environments. Secondly, since catalytic domains must often work in diverse pH-environments,
the introduction of elements that change protein conformation in a pH-dependent manner is
likely to be detrimental for protein stability and function; such elements are better “outsourced” to domains
of the protein that are no longer present in the mature protein. The notable exception is the
cathepsin B family, where the presence of the pH-sensitive occluding loop in the catalytic
domain allows a pH-dependent change in protease characteristics, which might be important for
the biological function (Nagler et al., 1997).
Further research must focus on the mechanisms by which histidines in propeptides
mediate activation. Since not every histidine in a protein acts as a pH-sensor, determinants of
pH-sensing other than the mere presence of histidines must exist. For example, the propeptides
of furin and PC1/3 both are enriched in histidine, but they show differences in pH-sensitivity.
How are such differences encoded within their sequences? Understanding the physical
principles will also improve prediction of the functional significance of histidines within sequences, an
important challenge given the abundance of sequence data compared to experimental
evidence.
While details of the mechanism of pH-sensing in propeptide-dependent proteases are
unknown, we speculate that two general mechanisms are possible. First, protonation of
histidines due to lower pH could destabilize the interaction between propeptide and protease
domain, either by directly affecting charge or hydrophobic interactions, or by destabilizing the
propeptide structure, which is required for binding. Since subtilases and cathepsins can
autoactivate, the subsequent increase in the fraction of free protease leads to digestion of the
free propeptide and thereby prevents rebinding. A second potential mechanism is that
protonation of histidine leads to structural changes in the propeptides that make cleavage motifs
within the propeptide accessible for active proteases. While a mixture of both mechanisms is
likely responsible, we speculate that the second mechanism plays a prominent role, since a
histidine within the core and not at the interface to the catalytic domain was essential for pH-
mediated activation of furin (Feliciangeli et al., 2006). Also, the second mechanism was shown
to regulate maturation of the dengue virus within the secretory pathway, where lowering the pH
leads to drastic conformational changes in the capsid proteins, which expose cleavage sites to
control capsid processing (Yu et al., 2008). In this case the substrate, and not the protease,
controls activation, tempting one to speculate that such a mechanism may be found in other
proteins processed by proteases within the secretory pathway.
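Both proposed mechanisms rest on the fraction of histidine side chains that carry a proton at a given pH. A minimal sketch of that dependence, using the Henderson-Hasselbalch relation, is given below. The pKa of 6.0 is an assumed textbook value for a free imidazole side chain; the pKa of a histidine buried in a folded propeptide can differ substantially, and the pH values are illustrative rather than measurements from this study.

```python
def protonated_fraction(ph, pka=6.0):
    """Fraction of a titratable group that is protonated at the given pH.

    Henderson-Hasselbalch: fraction = 1 / (1 + 10**(pH - pKa)).
    pka=6.0 is an assumed value for a solvent-exposed histidine side chain.
    """
    return 1.0 / (1.0 + 10 ** (ph - pka))


if __name__ == "__main__":
    # Illustrative pH values only; they are not taken from the text.
    for ph in (7.0, 6.0, 5.0):
        print(f"pH {ph:.1f}: {protonated_fraction(ph):.2f} protonated")
```

Under these assumptions, one pH unit around the pKa changes the protonated fraction roughly tenfold, which is why a modest pH difference between compartments could in principle switch a histidine-based sensor.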
Since histidine enrichment correlates with pH-mediated activation in subtilases and
cathepsins, we speculate that it can be used to predict proteins that use a similar mechanism for
activation. A list of all human proteins with annotated propeptides in the UniProt database
that have more histidines in their propeptides than expected, assuming a probability for
histidine of 2.3% (Table S1), includes 52 proteins that are either secreted or targeted to the
secretory or endocytic pathway. While this bias could be random or caused by other factors,
such as zinc-binding sites, which could explain why metalloproteases like “A Disintegrin and
metalloproteinase domain-containing protein” (ADAM) or matrix metalloproteases are frequent
in the list, we propose that members of that list use the pH of the secretory pathway to regulate
their proteolytic activation. While the lack of knowledge about the activation of these proteins and
the functions of their propeptides, especially in prokaryotic homologs, did not allow us to include a
detailed analysis, we note that several of these proteins show consistent enrichment of histidines
in eukaryotic, but not prokaryotic, homologs (Table S2).
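The enrichment criterion behind Table S1 can be sketched as an upper-tail binomial probability. The sketch below assumes that the column P(X>=k) is the probability of observing at least k histidines in a propeptide of the given length when each position is a histidine with probability 2.3%; the function name is hypothetical, and the exact procedure used for Table S1 is not spelled out in the text.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.023):
    """P(X >= k) for X ~ Binomial(n, p).

    n: propeptide length, k: observed number of histidines,
    p: assumed per-position probability of histidine (2.3% in the text).
    """
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    # Furin entry of Table S1: propeptide of 83 residues with 6.02% histidine,
    # which corresponds to 5 histidines.
    print(f"{binomial_upper_tail(83, 5):.4f}")
```

For the furin entry this gives approximately 0.043, in line with the value listed in Table S1, which supports the reading of the column as a binomial tail probability tested at the 5% level.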
In summary, this study suggests a prominent role of the pH gradient in the secretory
pathway in orchestrating the proteolytic processing of secreted proteins. Any disturbance in
this gradient could therefore lead to dysregulation of protease activity. Dysregulation of
proprotein convertases and cathepsins can have adverse effects and is associated with
diseases like cancer, atherosclerosis and Dent’s disease (Reiser et al., 2010; Seidah and Prat,
2012). Since all these diseases are also associated with changes in cytosolic pH (Naghavi et
al., 2002; Webb et al., 2011), studies are needed to determine whether the secretory pH-gradient
is also affected and whether pH-dysregulation plays a role in
disturbing the regulation of the secretory pathway.
EXPERIMENTAL PROCEDURE:
Conservation Analysis: Analysis of conserved residues was performed using the ConSurf
server with standard settings (Ashkenazy et al., 2010). The crystal structure of the
propeptide:subtilisin E complex (PDB: 1SCJ) was used as input for analyzing bacterial
subtilases, while a homology model for the catalytic domain of PC1 based on the crystal
structure of furin (PDB: 1P8J) and an NMR solution structure of the PC1 propeptide (PDB:
1KN6) docked onto the catalytic domain using the subtilisin structure as a reference, was used
for eukaryotic subtilases. Results were analyzed and plotted using the UCSF Chimera package
(Pettersen et al., 2004).
Data acquisition: The BioMart interface of the InterPro database (Hunter et al., 2012) was
used to download UniProt sequence identifiers, start and stop positions, and taxonomy
identifiers of annotations from the entries PF00082, PF00112, and PF00656 of the PFAM
database for subtilases, cathepsins, and caspases, respectively (Punta et al., 2012; The UniProt
Consortium, 2012). Protein sequences were downloaded from the UniProt database. Phylogeny
was downloaded from the PFAM database, and taxonomy was obtained from the NCBI
Taxonomy homepage.
Amino acid content calculations: Sequences with two annotated catalytic domains or those
marked as deprecated in the UniProt database were discarded. The catalytic domains were
defined as sequences between the start and stop annotations while propeptides were defined
as sequences between position 20 and the start annotations for subtilases and cathepsins.
The first 20 residues were not included since they represent the signal peptide. Since caspases
lack signal peptides, residues from position 1 to the start annotations were denoted as
propeptides. Sequences with propeptides shorter than 50 residues or longer than 300 residues
were discarded. For the remaining sequences the amino acid contents of the propeptides and
catalytic domains, [AA]Pro and [AA]Cat, were calculated by dividing the number of occurrences of
the amino acid AA in a sequence by the sequence length. The difference between [AA]Pro and
[AA]Cat was calculated as Δ[AA].
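The content calculation above can be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: the function names are hypothetical, domain annotations are taken to be 1-based and inclusive, and the propeptide is taken to run from residue 21 to the residue before the start annotation (use signal_len=0 for caspases).

```python
def amino_acid_content(seq, aa):
    """Fraction of residues in seq that equal aa (0.0 for an empty sequence)."""
    return seq.count(aa) / len(seq) if seq else 0.0


def delta_content(full_seq, cat_start, cat_stop, aa="H",
                  signal_len=20, min_pro=50, max_pro=300):
    """Return ([AA]Pro, [AA]Cat, delta[AA]) or None if the sequence is discarded.

    cat_start, cat_stop: 1-based, inclusive catalytic domain annotations (assumed).
    signal_len: residues skipped as signal peptide; 20 in the text, 0 for caspases.
    """
    propeptide = full_seq[signal_len:cat_start - 1]
    catalytic = full_seq[cat_start - 1:cat_stop]
    # Propeptides shorter than 50 or longer than 300 residues are discarded.
    if not (min_pro <= len(propeptide) <= max_pro):
        return None
    pro = amino_acid_content(propeptide, aa)
    cat = amino_acid_content(catalytic, aa)
    return pro, cat, pro - cat


if __name__ == "__main__":
    # Toy sequence: 20-residue signal, 60-residue propeptide with 6 His,
    # 100-residue catalytic domain with 2 His.
    toy = "M" * 20 + "H" * 6 + "A" * 54 + "H" * 2 + "G" * 98
    print(delta_content(toy, cat_start=81, cat_stop=180))
```

For the toy sequence the propeptide content is 10%, the catalytic content 2%, and the difference 8 percentage points.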
Tree construction: NCBI taxonomy-based trees were constructed using taxonomy identifiers
as input for the iTOL tree generator (Letunic and Bork, 2011) and adding each protein as a node
of its species. Trees were plotted using the ‘ape’ package written in the R statistical computing
language (Paradis et al., 2004; R Core Team, 2012).
Statistical testing: A non-parametric Mann-Whitney test was performed to assess differences
in the distribution of Δ[AA] between prokaryotes and eukaryotes using the R statistical
computing language. The effect size was calculated as U/mn, by dividing the test statistic U by
the product of the two sample sizes (Newcombe, 2006).
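The effect size can be illustrated with a small, dependency-free sketch. It computes U from the rank sum of the first sample, using midranks for ties, and divides by the product of the sample sizes. The function name is hypothetical, no p-value is computed, and the original analysis was carried out in R rather than with this code.

```python
def mann_whitney_effect_size(x, y):
    """Return (U, U/mn) for sample x versus sample y.

    U counts the pairs (x_i, y_j) with x_i > y_j, with ties counted as 0.5.
    U/mn estimates the probability that a random x exceeds a random y.
    """
    m, n = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for _, group in pooled[i:j] if group == 0)
        i = j
    u = rank_sum_x - m * (m + 1) / 2
    return u, u / (m * n)


if __name__ == "__main__":
    print(mann_whitney_effect_size([3, 4, 5], [1, 2]))
    # Histidine row of Table 1: U divided by the product of the sample sizes.
    print(round(7048731 / (2156 * 4256), 2))
```

A value of 0.5 indicates no shift between the two distributions, while values near 0 or 1 indicate that one group almost always has the larger value. The histidine entry of Table 1 is consistent with this definition: 7048731 / (2156 × 4256) rounds to 0.77.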
Sliding window analysis: For each sequence the number of histidines, #His(i,k), in a window
of length k starting at position i was counted for i ranging from 1 to n–k+1, where n is the length of
the sequence. To account for different sequence lengths, the starting sequence positions were
normalized as follows:
$$\#\mathrm{His}_{\mathrm{norm}}(i,k) = \#\mathrm{His}\left(\frac{n}{\tilde{n}} \cdot i,\; k\right)$$
where $\tilde{n}$ is the median sequence length and the term $\frac{n}{\tilde{n}} \cdot i$ was rounded to the nearest integer.
For each position i, the $\#\mathrm{His}_{\mathrm{norm}}(i,k)$ values were averaged over all sequences and then divided by k to obtain
the average histidine content, #His(i), at that position. This method assumes that differences in
length due to insertions and deletions are evenly distributed within the protein. Using a multiple
sequence alignment for normalization would potentially account better for the positions of
insertions, but the number of sequences and the low quality of the alignment, especially in the
propeptide region, made that impractical for this study.
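The sliding window procedure can be sketched as below. The function names are hypothetical, and two details are assumptions that the text does not specify: mapped start positions are clamped to the valid range of each sequence, and sequences shorter than the window are skipped.

```python
from statistics import median


def window_counts(seq, k, aa="H"):
    """Return #His(i, k) for start positions i = 1 .. n-k+1 (list index i-1)."""
    return [seq[i:i + k].count(aa) for i in range(len(seq) - k + 1)]


def normalized_profile(seqs, k=20, aa="H"):
    """Average histidine content per length-normalized window start position."""
    n_med = int(median(len(s) for s in seqs))
    n_pos = n_med - k + 1
    totals = [0.0] * n_pos
    used = 0
    for seq in seqs:
        counts = window_counts(seq, k, aa)
        if not counts:  # sequence shorter than the window: skipped (assumption)
            continue
        used += 1
        scale = len(seq) / n_med
        for i in range(1, n_pos + 1):
            j = int(scale * i + 0.5)         # round to the nearest integer
            j = min(max(j, 1), len(counts))  # clamp to valid starts (assumption)
            totals[i - 1] += counts[j - 1]
    # Average over sequences, then divide by k to obtain a content fraction.
    return [t / used / k for t in totals] if used else []


if __name__ == "__main__":
    print(normalized_profile(["HHHHHAAAAA", "AAAAAAAAAA"], k=5))
```

In the toy call, both sequences have the median length, so no rescaling occurs and the profile falls from 0.5 at the first window to 0.0 at the last.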
ACKNOWLEDGEMENTS:
REFERENCES:
FIGURE LEGENDS:
Figure 1: Propeptides are more divergent than cognate catalytic domains. Conservation
scores mapped onto a ribbon representation of (A) Subtilisin E and (B) PC1/3. Thick tubes
represent high divergence at a position, while thin tubes represent conservation. Color
indicates the percentage of sequences that encode a histidine residue at this position, from 0%
(grey) to 100% (blue).
Figure 2: Histidines are enriched in propeptides of eukaryotic, but not prokaryotic,
subtilases. (A) Phylogenetic tree of subtilases from the PFAM database. Bars on the outside
indicate the Δ[His] value of each sequence. A black circle represents 0%. Bars pointing outward
and inward represent positive and negative Δ[His] values, respectively. Dashed circles outside
and inside of the solid black circle represent Δ[His] values of ±1%. Eukaryotic, prokaryotic, and
archaeal sequences are colored red, blue, and green, respectively. Black arcs on the outside
mark the clades of major subtilase subfamilies. (B) Tree based on the NCBI taxonomy
classification. Annotation of Δ[His] and color coding are as above. Thick black arcs mark the
three super-kingdoms of life, while thin arcs denote kingdoms of eukaryotes and phyla of
prokaryotes. (C) Kernel density estimation of the distribution of [His]Pro and [His]Cat in
prokaryotes and eukaryotes. (D) Kernel density estimation of the distribution of Δ[His] for
prokaryotes and eukaryotes. (E) Effect size U/mn of the Mann-Whitney test for difference
between the distributions shown in panel D performed for all 20 natural amino acids. Figure S1
and S2 show the Kernel density estimations for all amino acids. (F) Sliding Window Analysis of
average histidine content in eukaryotic and prokaryotic subtilases using a window of 20
residues. The black dashed line indicates average histidine content in the UniProt database.
See methods for detailed explanation of normalization of the sequence length. Arrows indicate
relative position of annotations for the end of the propeptide domain and the catalytic histidine
residue according to subtilisin E and PC1/3. (G) Bar graph showing [His]Pro and [His]Cat values
for selected subtilases. Blue, red, and green shades represent prokaryotic, eukaryotic, and
archaeal sequences, respectively. Light shades indicate [His]Cat and dark shades indicate [His]Pro.
Figure 3: Histidine enrichment exists only in propeptide domains of the Cathepsin L
family, while it is also present in the occluding loop of the Cathepsin B family. (A)
Phylogenetic tree of cathepsins from the PFAM database. Bars on the outside indicate the
Δ[His] value of each sequence. A black circle represents 0%. Bars pointing outward and inward
represent positive and negative Δ[His] values, respectively. Dashed circles outside and inside of
the solid black circle represent Δ[His] values of ±1%. Eukaryotic, prokaryotic, archaeal, and viral
sequences are colored red, blue, green, and cyan, respectively. Black arcs on the outside mark
the clades of major cathepsin subfamilies, with the cathepsin L family shown in green and the
cathepsin B family shown in purple. (B) Kernel density estimation of the distribution of [His]Pro
and [His]Cat in cathepsin L and B families and in prokaryotes. (C) Kernel density estimation of
the distribution of Δ[His] in cathepsin L and B families and in prokaryotes. (D) Sliding Window
Analysis of average histidine content in cathepsin L and B families and in prokaryotes using a
window of 20 residues. The black dashed line indicates average histidine content in the UniProt
database. See methods for detailed explanation of normalization of the sequence length.
Arrows indicate relative position of annotations for the end of the propeptide domain and the
catalytic histidine residue according to Cathepsin L and B, as well as the occluding loop in
cathepsin B. (E) Structure superimposition of procathepsin L (PDB: 1BY8) and procathepsin B
(PDB: 1MIR). The catalytic domains are shown in grey ribbon, while propeptides are shown in
green and purple for cathepsin L and B, respectively. The occluding loop of cathepsin B is
colored in orange and the corresponding loop in cathepsin L is colored green. The sidechains of
histidine residues are depicted as stick representations. (F) A close up of interactions between
the occluding loop and the propeptide. Colors are as above. Structural depictions were created
using the UCSF Chimera package (Pettersen et al., 2004).
Figure 4: The cytosolic caspase family shows no histidine bias in propeptides. (A)
Phylogenetic tree of caspases from the PFAM database. Bars on the outside indicate the Δ[His]
value of each sequence. A black circle represents 0%. Bars pointing outward and inward
represent positive and negative Δ[His] values, respectively. Dashed circles outside and inside of
the solid black circle represent Δ[His] values of ±1%. Prokaryotic, metazoan, plant, fungal and
other eukaryotic sequences are colored blue, yellow, cyan, purple and red, respectively. Black
arcs on the outside depict the metazoan caspase and metacaspase families. (B) Kernel density
estimation of the distribution of [His]Pro and [His]Cat in prokaryotes and metazoans shown in blue
and yellow, respectively. (C) Kernel density estimation of the distribution of Δ[His] in metazoan
and prokaryotic caspases shown in yellow and blue, respectively. (D) Sliding Window Analysis
of average histidine content in metazoan and prokaryotic caspases using a window of 20
residues. See methods for detailed explanation of normalization of the sequence length. Arrows
indicate relative position of annotations for the end of the propeptide domain and the catalytic
histidine residue according to Caspase 2.
TABLE LEGENDS:
Table 1: Results of Mann-Whitney tests to evaluate differences in distribution of Δ[AA]
between eukaryotes and prokaryotes. For each amino acid the following numbers are
reported: median of Δ[AA] for eukaryotes and prokaryotes, test statistic U of the Mann-Whitney
test, the resulting p-value, and the effect size U/mn. Sample sizes were 2156 and 4256 for
eukaryotes and prokaryotes, respectively.
TABLES:
Residue  Median Eukaryotes [%]  Median Prokaryotes [%]  U        p              U/mn
A        -2.01                  -0.29                   3484667  6.3 x 10^-56   0.38
V        -0.03                  -0.13                   4692108  1.4 x 10^-1    0.51
L        1.27                   1.53                    4494450  1.8 x 10^-1    0.49
I        -0.29                  -0.61                   4845335  2.4 x 10^-4    0.53
M        -0.32                  -0.44                   4781449  5.7 x 10^-3    0.52
F        0.53                   0.06                    5019341  7.3 x 10^-10   0.55
Y        -0.27                  -1.00                   5564109  3.6 x 10^-44   0.61
W        -0.46                  -0.92                   5529692  3.2 x 10^-41   0.60
S        -0.47                  -0.18                   4329195  2.2 x 10^-4    0.47
T        -1.36                  -0.07                   3616488  9.2 x 10^-44   0.39
N        -1.63                  -1.86                   4339854  3.9 x 10^-4    0.47
Q        1.11                   1.49                    4198344  2.6 x 10^-8    0.46
C        -1.25                  -0.28                   2881834  7.5 x 10^-132  0.31
G        -5.2                   -5.3                    4576112  8.7 x 10^-1    0.50
P        -0.67                  0.04                    3852582  8.5 x 10^-26   0.42
D        -0.25                  -1.52                   5644738  1.9 x 10^-51   0.62
E        2.85                   2.51                    4864907  7.7 x 10^-5    0.53
H        1.53                   -0.56                   7048731  1.6 x 10^-270  0.77
K        1.45                   1.58                    4356812  9.6 x 10^-4    0.47
R        1.86                   0.95                    5376156  2.2 x 10^-29   0.59
REFERENCES:
Anderson, E.D., Molloy, S.S., Jean, F., Fei, H., Shimamura, S., and Thomas, G. (2002). The ordered and compartment-specific autoproteolytic removal of the furin intramolecular chaperone is required for enzyme activation. J Biol Chem 277, 12879-12890.
Anderson, E.D., VanSlyke, J.K., Thulin, C.D., Jean, F., and Thomas, G. (1997). Activation of the furin endoprotease is a multiple-step process: requirements for acidification and internal propeptide cleavage. Embo J 16, 1508-1518.
Andreini, C., Bertini, I., Cavallaro, G., Najmanovich, R.J., and Thornton, J.M. (2009). Structural analysis of metal sites in proteins: non-heme iron sites as a case study. J Mol Biol 388, 356-380.
Ashkenazy, H., Erez, E., Martz, E., Pupko, T., and Ben-Tal, N. (2010). ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic acids research 38, W529-533.
Carter, P., and Wells, J.A. (1987). Engineering enzyme specificity by "substrate-assisted catalysis". Science 237, 394-399.
Casey, J.R., Grinstein, S., and Orlowski, J. (2010). Sensors and regulators of intracellular pH. Nat Rev Mol Cell Biol 11, 50-61.
Coulombe, R., Grochulski, P., Sivaraman, J., Menard, R., Mort, J.S., and Cygler, M. (1996). Structure of human procathepsin L reveals the molecular basis of inhibition by the prosegment. Embo J 15, 5492-5503.
Creagh, E.M., Conroy, H., and Martin, S.J. (2003). Caspase-activation pathways in apoptosis and immunity. Immunological reviews 193, 10-21.
Dillon, S.L., Williamson, D.M., Elferich, J., Radler, D., Joshi, R., Thomas, G., and Shinde, U. (2012). Propeptides Are Sufficient to Regulate Organelle-Specific pH-Dependent Activation of Furin and Proprotein Convertase 1/3. J Mol Biol 423, 47-62.
Dodson, G., and Wlodawer, A. (1998). Catalytic triads and their relatives. Trends Biochem Sci 23, 347-352.
Embley, T.M., and Martin, W. (2006). Eukaryotic evolution, changes and challenges. Nature 440, 623-630.
Feliciangeli, S.F., Thomas, L., Scott, G.K., Subbian, E., Hung, C.H., Molloy, S.S., Jean, F., Shinde, U., and Thomas, G. (2006). Identification of a pH sensor in the furin propeptide that regulates enzyme activation. J Biol Chem 281, 16108-16116.
Gunkel, F.A., and Gassen, H.G. (1989). Proteinase K from Tritirachium album Limber. Characterization of the chromosomal gene and expression of the cDNA in Escherichia coli. European journal of biochemistry / FEBS 179, 185-194.
Harrison, S.C. (2008). The pH sensor for flavivirus membrane fusion. The Journal of cell biology 183, 177-179.
Hebert, D.N., and Molinari, M. (2007). In and out of the ER: protein folding, quality control, degradation, and related human diseases. Physiological reviews 87, 1377-1408.
Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., et al. (2012). InterPro in 2011: new developments in the family and domain prediction database. Nucleic acids research 40, D306-312.
Illy, C., Quraishi, O., Wang, J., Purisima, E., Vernet, T., and Mort, J.S. (1997). Role of the occluding loop in cathepsin B activity. J Biol Chem 272, 1197-1202.
Jain, S.C., Shinde, U., Li, Y., Inouye, M., and Berman, H.M. (1998). The crystal structure of an autoprocessed Ser221Cys-subtilisin E-propeptide complex at 2.0 A resolution. J Mol Biol 284, 137-144.
Letunic, I., and Bork, P. (2011). Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic acids research 39, W475-478.
Lopez-Otin, C., and Bond, J.S. (2008). Proteases: multifunctional enzymes in life and disease. J Biol Chem 283, 30433-30437.
Naghavi, M., John, R., Naguib, S., Siadaty, M.S., Grasu, R., Kurian, K.C., van Winkle, W.B., Soller, B., Litovsky, S., Madjid, M., et al. (2002). pH Heterogeneity of human and rabbit atherosclerotic plaques; a new insight into detection of vulnerable plaque. Atherosclerosis 164, 27-35.
Nagler, D.K., Storer, A.C., Portaro, F.C., Carmona, E., Juliano, L., and Menard, R. (1997). Major increase in endopeptidase activity of human cathepsin B upon removal of occluding loop contacts. Biochemistry 36, 12608-12615.
Newcombe, R.G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: general issues and tail-area-based methods. Statistics in medicine 25, 543-557.
Nishimura, Y., Kawabata, T., and Kato, K. (1988). Identification of latent procathepsins B and L in microsomal lumen: characterization of enzymatic activation and proteolytic processing in vitro. Archives of biochemistry and biophysics 261, 64-71.
Oda, K., Sugitani, M., Fukuhara, K., and Murao, S. (1987). Purification and properties of a pepstatin-insensitive carboxyl proteinase from a gram-negative bacterium. Biochimica et biophysica acta 923, 463-469.
Oyama, H., Hamada, T., Ogasawara, S., Uchida, K., Murao, S., Beyer, B.B., Dunn, B.M., and Oda, K. (2002). A CLN2-related and thermostable serine-carboxyl proteinase, kumamolysin: cloning, expression, and identification of catalytic serine residue. Journal of biochemistry 131, 757-765.
Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289-290.
Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C., and Ferrin, T.E. (2004). UCSF Chimera--a visualization system for exploratory research and analysis. Journal of computational chemistry 25, 1605-1612.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., et al. (2012). The Pfam protein families database. Nucleic acids research 40, D290-301.
Quraishi, O., Nagler, D.K., Fox, T., Sivaraman, J., Cygler, M., Mort, J.S., and Storer, A.C. (1999). The occluding loop in cathepsin B defines the pH dependence of inhibition by its propeptide. Biochemistry 38, 5017-5023.
R Core Team (2012). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing).
Reiser, J., Adair, B., and Reinheckel, T. (2010). Specialized roles for cysteine cathepsins in health and disease. The Journal of clinical investigation 120, 3421-3431.
Seidah, N.G., Mowla, S.J., Hamelin, J., Mamarbachi, A.M., Benjannet, S., Toure, B.B., Basak, A., Munzer, J.S., Marcinkiewicz, J., Zhong, M., et al. (1999). Mammalian subtilisin/kexin isozyme SKI-1: A widely expressed proprotein convertase with a unique cleavage specificity and cellular localization. Proceedings of the National Academy of Sciences of the United States of America 96, 1321-1326.
Seidah, N.G., and Prat, A. (2012). The biology and therapeutic targeting of the proprotein convertases. Nature reviews Drug discovery 11, 367-383.
Shinde, U., and Inouye, M. (1993). Intramolecular chaperones and protein folding. Trends Biochem Sci 18, 442-446.
Shinde, U., and Thomas, G. (2011). Insights from bacterial subtilases into the mechanisms of intramolecular chaperone-mediated activation of furin. Methods Mol Biol 768, 59-106.
Siezen, R.J., and Leunissen, J.A. (1997). Subtilases: the superfamily of subtilisin-like serine proteases. Protein science : a publication of the Protein Society 6, 501-523.
Slepkov, E.R., Rainey, J.K., Sykes, B.D., and Fliegel, L. (2007). Structural and functional analysis of the Na+/H+ exchanger. The Biochemical journal 401, 623-633.
Srivastava, J., Barber, D.L., and Jacobson, M.P. (2007). Intracellular pH sensors: design principles and functional significance. Physiology (Bethesda) 22, 30-39.
Subbian, E., Yabuta, Y., and Shinde, U. (2004). Positive selection dictates the choice between kinetic and thermodynamic protein folding and stability in subtilases. Biochemistry 43, 14348-14360.
Tangrea, M.A., Bryan, P.N., Sari, N., and Orban, J. (2002). Solution structure of the pro-hormone convertase 1 pro-domain from Mus musculus. J Mol Biol 320, 801-812.
The UniProt Consortium (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic acids research 40, D71-75.
Tsiatsiani, L., Van Breusegem, F., Gallois, P., Zavialov, A., Lam, E., and Bozhkov, P.V. (2011). Metacaspases. Cell death and differentiation 18, 1279-1288.
Turk, B., Dolenc, I., Turk, V., and Bieth, J.G. (1993). Kinetics of the pH-induced inactivation of human cathepsin L. Biochemistry 32, 375-380.
Turk, V., Stoka, V., Vasiljeva, O., Renko, M., Sun, T., Turk, B., and Turk, D. (2012). Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochimica et biophysica acta 1824, 68-88.
Webb, B.A., Chimenti, M., Jacobson, M.P., and Barber, D.L. (2011). Dysregulated pH: a perfect storm for cancer progression. Nature reviews Cancer 11, 671-677.
Wlodawer, A., Li, M., Gustchina, A., Oyama, H., Dunn, B.M., and Oda, K. (2003). Structural and enzymatic properties of the sedolisin family of serine-carboxyl peptidases. Acta biochimica Polonica 50, 81-102.
Yu, I.M., Zhang, W., Holdaway, H.A., Li, L., Kostyuchenko, V.A., Chipman, P.R., Kuhn, R.J., Rossmann, M.G., and Chen, J. (2008). Structure of the immature dengue virus at low pH primes proteolytic maturation. Science 319, 1834-1837.
[Figure 1: image not reproduced. Panels A and B. Labels in each panel: Catalytic Domain, Propeptide, C-terminus of propeptide.]
[Figure 2: image not reproduced. Panels A to G.
Clade labels with Δ[His] as printed in panels A and B: Archaea −0.37%, Eukaryota 1.72%, Bacteria −0.34%, Animals 2.36%, Plants 2.08%, Fungi 1.61%, Acidobacteria 2.13%. The values for Actinobacteria, Proteobacteria and Firmicutes are not recoverable. Subfamily labels: Kexin/PCs, Subtilisin, Pyrolysin/Cucumolisin, Proteinase K, Sedolisin.
Panel C: density versus [His] [%] for [His]Cat and [His]Pro in prokaryotes and eukaryotes. Panel D: density versus Δ[His] [%] for prokaryotes and eukaryotes. Panel E: U/mn (0 to 1) for each of the 20 amino acids. Panel F: histidine content [%] versus residue (0 to 500), with marked positions for the end of the propeptide and the catalytic histidine of PC1 and Subtilisin E.
Panel G: His content [%] ([His]Pro and [His]Cat) for Pyrolysin, Halolysin, Cucumisin, ARA12, XSP1, Proteinase K, Proteinase R, Cerevisin, Kexin, Furin, PC2, PC1, PC6, PC7, PC4, PC5, P80146, Aqualysin, P42780, P16558, Xanthomonalisin, Pseudomonalisin, Alkaline protease, Subtilisin E, Protease epr, Bacillopeptidase F, P54423, P00783, Subtilisin NAT, Subtilisin BPN', Subt. Carlsberg, Subtilisin, P20724, Kumamolisin, P41363, Q45670, Alkaline protease, Subtilisin J, Protease nisP. Legend: Prokaryotes, Eukaryotes, Archaea.]
[Figure 3: image not reproduced. Panels A to F.
Clade labels with Δ[His] as printed in panel A: Cathepsin O 0.28%, Cathepsin F 1.14%, Plant Cathepsins 1.81%. The values for Cathepsin B and Cathepsin L are not recoverable.
Panel B: density versus [His] [%] for [His]Cat and [His]Pro in prokaryotes, Cathepsin L and Cathepsin B. Panel C: density versus Δ[His] [%] for prokaryotes, Cathepsin L and Cathepsin B. Panel D: % histidine content versus residue (0 to 300), with marked positions for the end of the propeptide and the catalytic histidine of Cathepsin L and Cathepsin B, and for the occluding loop of Cathepsin B.]
[Figure 4: image not reproduced. Panels A to D.
Clade labels with Δ[His] as printed in panel A: Metacaspases 0.83%. The value for Metazoan Caspases is not recoverable.
Panel B: density versus [His] [%] for [His]Cat and [His]Pro in prokaryotes and metazoans. Panel C: density versus Δ[His] [%] for prokaryotes and metazoans. Panel D: % histidine content versus residue (0 to 350), with marked positions for the end of the propeptide and the catalytic histidine of Caspase 2.]
[Figure S1: image not reproduced. One kernel density panel per amino acid (Alanine to Arginine). x-axis: Content [%] (0 to 20); y-axis: Density. Series: Bacteria Prot, Bacteria IMC, Eukaryota Prot, Eukaryota IMC.]
[Figure S2: image not reproduced. One kernel density panel per amino acid (Alanine to Arginine). x-axis: Content Difference [%]; y-axis: Density. Series: Bacteria Prot, Bacteria IMC, Eukaryota Prot, Eukaryota IMC.]
[Figure S3: image not reproduced. Five panels of histidine content [%] versus residue (0 to 500), one each for windows of 10, 20, 30, 40 and 50 residues. Marked positions: end of the propeptide and catalytic histidine of PC1 and Subtilisin E.]
SUPPLEMENTAL TABLES:
TABLE S2:
                                 Prokaryotes                            Eukaryotes
Protein Name                     Propeptide [%]  Catalytic [%]  Δ(His)  Propeptide [%]  Catalytic [%]  Δ(His)
Cathepsin D/E Family             1.34            2.02           -0.68   2.92            1.15           +1.75
Carboxypeptidase Y               1.48            1.95           -0.47   2.16            2.01           +0.15
alpha-lytic protease             0.92            0.99           -0.07   1.61            0.82           +0.79
Legumain                         2.19            1.5            +0.69   8.37            4.4            +3.97
Lysosomal Acid Lipase            4.80            3.14           +1.66   5.00            3.36           +1.64
Lysosomal α-glucosidase
Coagulation factor VII
Cadherin-I                       0.81            1.16           -0.35   8.78            1.9            +6.88
β-Hexosaminidase                 1.53            3.26           -1.73   3.33            1.98 (α)       +1.35
                                 2.27            2.62           -0.35   3.11            2.13 (β)       +0.98
BMP4                                                                    6.22            5.11           +1.11
Platelet derived growth factor
SUPPLEMENTAL FIGURE LEGENDS:
Figure S1: Distributions of [AA]Pro and [AA]Cat for all 20 amino acids in eukaryotic and prokaryotic subtilases.
Figure S2: Distribution of Δ[AA] for all 20 amino acids in eukaryotic and prokaryotic subtilases.
Figure S3: Sliding window analysis of histidine content in eukaryotic and prokaryotic subtilases using different window sizes.
SUPPLEMENTAL TABLE LEGENDS:
Table S1: List of human proteins with histidine enrichment in their propeptides. Listed are all human proteins with annotated propeptides in the UniProt database that have more histidine in their propeptides than expected, assuming a 2.3% probability of histidine at each sequence position, with a significance level of 5%.
Table S2: Histidine content of propeptide and catalytic domain of several protein families. Homologs in eukaryotes and prokaryotes were identified using BLAST. Propeptide regions were assigned by homology using multiple sequence alignments.
TABLE S1:
UniProt identifier  Name                                  Length propeptide  Histidine content  P(X>=k)
P12821              Angiotensin-converting enzyme         74                 6.76%              2.80E-02
O14672              ADAM10                                194                7.73%              5.08E-05
O75078              ADAM11                                202                4.46%              4.54E-02
Q9Y3Q7              ADAM18                                168                4.76%              4.15E-02
Q9P0K1              ADAM22                                197                6.09%              2.22E-03
O75077              ADAM23                                227                4.85%              1.69E-02
Q9UKF2              ADAM30                                171                4.68%              4.52E-02
Q9BZ11              ADAM33                                174                5.17%              2.01E-02
O15204              ADAM-like protein decysin-1           175                6.29%              2.61E-03
Q9H324              ADAM-TS10                             208                5.29%              9.33E-03
P58397              ADAM-TS12                             215                6.05%              1.59E-03
Q8TE57              ADAM-TS16                             255                5.88%              9.60E-04
Q8TE60              ADAM-TS18                             237                5.91%              1.34E-03
P59510              ADAM-TS20                             232                4.74%              1.96E-02
Q9UKP5              ADAM-TS6                              223                8.52%              1.31E-06
Q9P2N4              ADAM-TS9                              269                4.09%              4.88E-02
O95972              Bone morphogenetic protein 15         249                5.22%              5.55E-03
P12643              Bone morphogenetic protein 2          259                5.79%              1.12E-03
P12644              Bone morphogenetic protein 4          273                6.23%              2.38E-04
P18075              Bone morphogenetic protein 7          263                5.70%              1.30E-03
P55287              Cadherin-11                           31                 12.90%             5.36E-03
Q13634              Cadherin-18                           29                 17.24%             4.82E-04
P12830              Cadherin-1                            132                6.06%              1.17E-02
P14091              Cathepsin E                           34                 8.82%              4.29E-02
P09668              Cathepsin H                           85                 8.24%              3.52E-03
P43235              Cathepsin K                           99                 6.06%              2.71E-02
P25774              Cathepsin S                           98                 8.16%              1.97E-03
Q6YHK3              CD109 antigen                         25                 12.00%             1.92E-02
P0CG37              Cryptic protein                       65                 7.69%              1.70E-02
Q14126              Desmoglein-2                          26                 11.54%             2.13E-02
P12259              Coagulation factor V                  836                3.59%              1.27E-02
P02765              Alpha-2-HS-glycoprotein               40                 12.50%             2.17E-03
P09958              Furin                                 83                 6.02%              4.28E-02
O60383              Growth/differentiation factor 9       295                4.41%              2.04E-02
P07686              Beta-hexosaminidase subunit beta      79                 6.33%              3.58E-02
P55103              Inhibin beta C chain                  218                4.59%              3.07E-02
P58166              Inhibin beta E chain                  217                5.07%              1.25E-02
P51460              Insulin-like 3                        47                 14.89%             9.55E-05
P19827              ITI heavy chain H1                    246                4.47%              2.84E-02
Q99538              Legumain                              110                9.09%              2.40E-04
P09848              Lactase-phlorizin hydrolase           847                3.78%              5.15E-03
P10253              Lysosomal alpha-glucosidase           42                 9.52%              1.56E-02
P14151              L-selectin                            10                 20.00%             2.11E-02
Q9NRE1              Matrix metalloproteinase-26           72                 8.33%              6.34E-03
P16519              PCSK2                                 84                 8.33%              3.29E-03
P29122              PCSK6                                 86                 5.81%              4.86E-02
P01127              PDGF subunit B                        112                5.36%              4.53E-02
Q96B86              Repulsive guidance molecule A         147                5.44%              2.10E-02
P10600              Transforming growth factor beta-3     280                5.00%              5.96E-03
Q9BZD6              Proline-rich Gla protein 4            32                 9.38%              3.68E-02
Q8N2E6              Prosalusin                            163                5.52%              1.37E-02
O43915              Vascular endothelial growth factor D  216                6.02%              1.66E-03