Date post: | 17-Aug-2015 |
GENE40060 - Genetics Research Project
Detection of short-term positive selection in Verotoxigenic
Escherichia coli
Submitted by: Alan Moran
Student Number: 11452982
Supervisor: Dr Peadar Ó Gaora, BA, MSc, PhD
Summary:
The aim of this study was to identify genes that are under short-term positive selection in
Verotoxigenic Escherichia coli (VTEC), primarily genes associated with virulence or the
enhancement of virulence. VTEC are responsible for a number of diseases, primarily
haemolytic uremic syndrome (HUS) in humans. Furthermore, these bacteria produce
characteristic virulence factors such as verotoxins and intimin. Thus, the extended aim
of this study was to investigate virulence factors, beyond those that distinguish these bacteria,
that are associated with VTEC infection. Positive, or Darwinian, selection refers to a
more extreme phenotype that is constantly selected for within the population, resulting in an
increase in the frequency of the allele. In relation to this, the ‘short-term’ basis of this
investigation describes the situation whereby this phenotype has been selected for relatively
recently; it is therefore likely that this phenotype is encoded by a single allele which exhibits no
silent mutations. A number of candidate genes were detected on this basis using the software
programme ‘Timezone’, which focused primarily on constructing phylogenetic gene trees,
and examining mutations and hotspot mutations within these trees. A common trend was
noticed among the candidate genes: most virulence-associated results were for
genes associated with bacteriophages, the membrane, transposases, and cell motility. These
results agree with many other studies which have illustrated the importance to these bacteria
of virulence-associated phenomena such as horizontal gene transfer and modification of the
membrane in order to avoid the host immune defences. Further investigations were carried out
in order to study the associated pattern of evolution. These showed
that primarily parallel hotspot mutations were occurring in these genes, an example
of selection acting on these genes to induce a gain of function rather than a loss of
function. Taken as a whole, this study demonstrated that selection is acting on these bacteria
mainly through hotspot mutations that modify primarily commensal genes and change
their function, with the effect of enhancing virulence.
Table of Contents:
Summary
1. Introduction
1.1 Background
1.2 Mechanism of Infection
1.3 Defining non-O157:H7 infections
1.4 Comparison of VTEC and commensal strains of E. coli
1.5 Short-term positive selection
1.6 Statement of Intent
2. Materials & Methods
2.1 Software
2.2 Sequence processing
2.3.1 Timezone; extraction of orthologous gene sets from multiple genomes
2.3.2 Timezone; candidate gene selection
2.4 Troubleshooting problems
3. Results
3.1 Candidate gene selection
3.2 Zonal phylogeny analysis
3.3 Candidate gene list
3.4 Core gene presence among candidates
3.5.1 DAVID analysis; O157 analysis
3.5.2 DAVID analysis; Commensal and ‘top serotype’ strains
3.6 Premature Stop Codon analysis in Commensal and ‘top serotype’ strains
3.7.1 Hotspot analysis
3.7.2 Hotspot analysis; parallel vs coincidental
3.7.3 Hotspot analysis; recombinant O157, O104, and O111 genes
4. Discussion
5. Acknowledgements
6. References
7. Appendix
1. Introduction:
1.1 Background
Escherichia coli (E. coli) is a household name for scientists and non-scientists alike. It is a
natural resident of the lower intestine in humans, and is a very well-studied model organism.
However, it often makes more negative headlines due to the many reported outbreaks of
pathotypes of this bacterium, which can have very harmful effects on the host. One such
pathotype is ‘Verotoxigenic Escherichia coli’ (VTEC), which is also referred to as Shiga
toxin-producing E. coli (STEC).
VTEC regularly cause sporadic infection and outbreaks in human populations. In addition,
this pathotype is responsible for a wide range of diseases in humans such as diarrhoea,
haemorrhagic colitis, and haemolytic uremic syndrome (HUS) (1). Strains that belong to the
serotype O157:H7 are the most common cause of infection. Farm animals such as cattle and
sheep are normally the most frequent reservoirs of this bacterium. Hence, infection often
occurs as a result of food contamination.
1.2 Mechanism of Infection
This pathotype of E. coli is referred to as VTEC due to one defining characteristic alluded to
in its name. VTEC have the capacity to produce one or more Shiga-like verotoxins (VT), VT1
and VT2, which are also referred to as stx (1). There are two sub-types of VT1 and four sub-
types of VT2, and they are encoded by bacteriophages (2). Studies have reported that VTEC
expressing VT2 in human infections have a higher risk of causing severe disease (1). Studies
of the mechanism of infection have illustrated that these toxins are AB5 toxins that bind to
tissues that express the glycolipid receptor globotriaosylceramide (Gb3). An AB5 toxin is a
toxin that contains a polypeptide A subunit that is linked to a pentamer of identical B
subunits. The A subunit is the active component, while the B subunits are responsible for
mediating the entry of the holotoxin (A subunit) into the cell (2). This results in interference
with the 60S ribosomal subunit which inhibits protein synthesis. This action leads to cell
death, or apoptosis.
Although this is the characteristic method of pathogenesis, it must be noted that it is not the
only one. Another key factor in the virulence of VTEC is its adhesion to and colonization of
specific sites such as the small intestine, in a manner similar to Enteropathogenic E. coli
strains (EPEC). In this case, attaching and effacing (AE) lesions are produced on the target
cells. VTEC achieves this through production of the adhesion factor intimin, which is responsible
for the attachment to intestinal epithelial cells (3). Intimin is encoded by the eae gene, which
is located on the chromosomal LEE pathogenicity island. Furthermore, the LEE pathogenicity
island also harbours other important virulence genes such as tir, espA, espB, espC, and espD.
The espA, espB, and espD genes are associated with the production of a Type III secretion system
(TTSS) which aids the transfer of VTEC proteins into the host cell (3). It appears that VTs
may be the defining disease-causing feature of this bacterium, but studies have illustrated that
VTEC serotypes regularly implicated in disease frequently contain the LEE pathogenicity
island (3).
1.3 Defining non-O157:H7 infections
VTEC have been the cause for much concern regarding foodborne illnesses worldwide,
resulting in outbreaks in Western and developing countries alike. As mentioned above, the
serotype O157:H7 has been the most highlighted cause of VTEC infections. As a result, this
bacterium has been widely studied. However, it is becoming increasingly evident that there
are many disease-causing non-O157:H7 serotypes also. Although these serotypes may share
similar pathogenic traits with O157:H7, they must still be examined based on their own merits
in order for a successful diagnosis to be made. Examination into this area has resulted in what
scientists now refer to as ‘the big six’, the most common infectious non-O157:H7 VTEC
agents: O26, O45, O103, O111, O121, and O145 (4).
1.4 Comparison of VTEC and commensal strains of E. coli
It is important to remember that E. coli is part of the natural microflora of the human
gastrointestinal tract, and largely exists within a commensal, or even mutualistic relationship
with humans (5). However, the pathogenic E. coli clones have been able to exploit new niches
as a result of the shift from commensalism to pathogenicity. This contrast can serve as a
useful scenario for scientists who seek to explore what other differences may now be present
in the genetic makeup of VTEC.
The application of Comparative Genomics is extremely useful in cases such as this. For
example, by contrasting the pathogenic VTEC to the natural commensal state of E. coli, one
could make possible inferences on where the shift to pathogenesis has occurred before, and
where it may occur again. This apparent shift to a pathogenic state, or pathoadaptation (6), is
not uncommon with regard to bacterial lineages. For example, Staphylococcus aureus is
commonly located in the nasopharynx and moist skin folds of humans, causing no damage to
the host. However, it can cause serious infection when found in other areas of the body. For
example, patients can suffer from pneumonia when this bacterium infects the lungs (6). Thus,
comparing the various VTEC serotypes to one another may allow scientists to make more
accurate characterizations of each. Results such as this would be highly desirable in a clinical
setting.
1.5 Short-term positive selection
Scientific research has traditionally focused on two primary methods of the acquisition of
pathogenic traits: horizontal gene transfer and the accumulation of mutations in genes over
long periods of time. However, another mechanism of adaptation in pathogenic bacterial
species is coming to the fore: the occurrence of point mutations in genes common to all
strains, also referred to as ‘core’ genes (7).
This phenomenon has been referred to by many studies as ‘short-term selection’. It
describes an evolutionary strategy adopted by many pathogenic bacterial
lineages to increase pathogenic fitness via pathoadaptation in commensal genes
present in members of that lineage (8). Although these pathogenic adaptations are beneficial
within a certain niche, there is a sacrifice involved, as they disrupt the original role
of the gene. Hence, these pathoadaptations are continuously under positive, or ‘Darwinian’,
selection, but are also constantly selected out of the genome. This pattern reflects
the cost that must be paid in order to achieve greater virulence (8).
Many studies have focused on searching for specific pathogenic genes and their association
with a certain phenotype, or niche (8). However, this type of approach is often set on
detecting genes which have adapted over a long evolutionary timescale via various mutations
in order to specifically confer a pathogenic function, or genes that have been newly acquired
via horizontal transfer. Short-term selection has often been missed by researchers because this form
of diversification occurs on a relatively recent timescale, the genes involved being
regularly selected for and against. Previous research has often lacked the tools
required to examine this type of adaptation. However, as technology and computational
approaches have developed, this type of analysis has become more feasible.
The central approach of this study involves the use of the Timezone software package, which
is designed to detect one of the main footprints of short-term positive
selection: hotspot, or convergent, mutations. Hotspot mutations are mutations which
repeatedly occur at the same amino acid positions within genes. When a hotspot mutation
occurs, it can be a very significant event, as it indicates that the replacement of a specific
amino acid provides a specific adaptive advantage in a certain environment (9). Since these
positions regularly accumulate mutations, certain functions can subsequently be selected in
and out. The nature of these mutations suits the aim of detecting short-term selection. Hence, detection
of hotspot mutations serves as a useful marker.
1.6 Statement of Intent
The chief aim of this study is to identify relevant virulence and pathogenicity genes that are
undergoing short-term positive selection in a number of VTEC strains. This will be done
by analysing the VTEC serotypes O157, O104, and O111. In
addition, a secondary aim of this study is to recognise the associated patterns of
evolution. Further comparative studies will be made between a sub-set of
commensal strains and the foremost disease-causing VTEC serotypes. This type of study is
extremely important for identifying further pathogenic factors associated with
these bacteria, which will better enable us to characterize O157 and non-O157 infections.
Hence, studies such as this could aid the development of new treatments against these
pathogenic strains.
2. Materials & Methods:
2.1 Software
Timezone requires a Windows-based (XP or higher) operating system (8). Table 1 outlines
the Timezone dependencies and other programs required in the study. Important programs
such as Clustal and BLAST are contained within the Timezone package. In addition, PAUP*
4.0 must be purchased and downloaded separately (10). This application must be installed
correctly for Timezone to utilize it properly, as described by Chattopadhyay et al. (8).
Table 1: A list of the software versions used in this project, and where to acquire them.
Program | Source
Timezone 1.0 | http://sourceforge.net/projects/timezone1/
TreeView X 0.5.0 | http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/download.html
PAUP* 4.0 | http://paup.csit.fsu.edu/downl.html
WinSCP 5.5.6 | http://winscp.net/eng/download.php
PuTTY 0.63 | http://www.chiark.greenend.org.uk/~sgtatham/putty/
2.2 Sequence processing
Relevant sequences were downloaded from NCBI and combined with a collection of novel strains
sequenced by the lab. This large amount of data was then sorted and organised into files
representative of the strains to be analysed. The Appendix (Table A) illustrates the script that
was used to perform this task.
Figure 1: Flow-chart demonstration of the process that was followed in order to prepare
sequences for Timezone. Serotype directories were labelled O157, O111, O104,
Commensals (containing a subset of commensal strains), and ‘top serotypes’ (O157 and non-
O157 ‘big six’ strains selected on the basis of reported outbreaks over the last decade or so)
(4).
Most of the sequence files contained ‘scaffolds’. In this case, a scaffold refers to the genomic
and plasmid DNA contigs. These contigs were not present together as a continuous stretch of
DNA sequence. Hence, it was necessary to concatenate the files in fasta format into one file
which was representative of the entire genome of the strain in question, as demonstrated by
Figure 1. After the concatenated file was moved into its respective directory, the
lengthy fasta headers in the sequence identifier of every strain were shortened so that
PAUP* could run efficiently. The script used for this is displayed in the Appendix
(Table B). Timezone also requires all input sequences
to be saved as ‘text’ files. Thus, it was necessary to move the processed sequences
from UNIX to Windows and subsequently save them as ‘text’ files. The final
instructions regarding the titles of the list of strains to be analysed were followed as described
by Chattopadhyay et al. (8).
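The concatenation and header-shortening steps described above can be sketched as follows. This is a minimal illustration and not the script given in the Appendix: the function names, the 30-character header limit, and the choice to keep only the first word of each header are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, Iterator


def shorten_header(header: str, max_len: int = 30) -> str:
    """Reduce a FASTA header to its first whitespace-delimited token.

    max_len is an illustrative limit, not a documented PAUP* or Timezone value.
    """
    tokens = header.lstrip(">").split()
    name = tokens[0] if tokens else "unnamed"
    return ">" + name[:max_len]


def merge_fasta_lines(files: Iterable[Iterable[str]], max_len: int = 30) -> Iterator[str]:
    """Yield the lines of several FASTA files as one stream with short headers."""
    for lines in files:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            yield shorten_header(line, max_len) if line.startswith(">") else line


def concatenate_fasta(paths: Iterable[Path], output: Path, max_len: int = 30) -> None:
    """Write all contigs from the input files into a single FASTA file."""
    with open(output, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in merge_fasta_lines([handle], max_len):
                    out.write(line + "\n")
```

Here `concatenate_fasta` would be called once per strain, with the chromosomal and plasmid contig files as input and a single per-strain file as output.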
Furthermore, it was necessary to input a fully annotated reference genome in GenBank format,
against which Timezone compares the sequences to be analysed in order to obtain the entire gene
set present. The reference genomes downloaded from NCBI are described in Table 2. These
reference genomes were also subsequently saved as text files in ‘C:\TimeZone_v1.0\Input’.
Table 2: The profile of serotypes that were subject to analysis.
Serotype | Number of strains analysed | Reference genome
O157 | 14 | E. coli O157:H7 str. Sakai
O104 | 14 | E. coli O104:H4 str. 2011C-3493
O111 | 11 | E. coli O111:H- str. 11128
Commensal serotypes | 14 | E. coli str. K-12 substr. MG1655
‘Top’ disease-causing serotypes | 10 | E. coli O157:H7 str. Sakai
At the Timezone command prompt, instructions were followed as described by
Chattopadhyay et al. (8). The cut-off value for sequence identity and coverage of sequence
length was selected as 95% in both cases. Timezone began its workflow upon entering these
final details. The entire workflow process along with the outputs produced is summarized in
Figure 2.
2.3.1 Timezone; extraction of orthologous gene sets from multiple genomes
Timezone was able to extract the orthologous gene sets from the strains to be analysed based
on alignment of these sequences with the reference genome. Most of the E. coli sequences had
up to 5200 genes present in their genome. Firstly, a list of sequences which contained non-
ACGT characters in their genes was produced. The genes from these sequences were
excluded from the creation of the orthologous gene list, as a sequence with a large number of
these characters was considered to be of poor quality (8). A list of orthologous
genes which contain premature stop codons (PSC) was also produced (Figure 2), and these genes
were likewise excluded from further analysis.
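The two exclusion checks described above can be illustrated with a short sketch. It is not Timezone's own code: the function names are hypothetical, and it assumes that each gene sequence is read in frame from its first base and that the standard stop codons (TAA, TAG, TGA) apply.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons, assumed to apply here


def has_non_acgt(seq: str) -> bool:
    """Return True if the sequence contains any character other than A, C, G or T."""
    return any(base not in "ACGT" for base in seq.upper())


def has_premature_stop(seq: str) -> bool:
    """Return True if an in-frame stop codon occurs before the final codon."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # The last codon is expected to be the natural stop, so it is not counted.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def classify_gene(seq: str) -> str:
    """Label a gene sequence as 'non_acgt', 'psc' or 'ok' (illustrative labels)."""
    if has_non_acgt(seq):
        return "non_acgt"
    if has_premature_stop(seq):
        return "psc"
    return "ok"
```

For example, `classify_gene("ATGTAAAAATAA")` returns `"psc"` because the second codon is a stop codon, whereas `classify_gene("ATGAAATAA")` returns `"ok"`.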
Figure 2: Flow-chart demonstration of the work-flow followed by Timezone. Genome
sequences or gene lists were used as input (red box). Outputs are highlighted in the blue box.
Specific analysis steps are shown in the Process column.
2.3.2 Timezone; candidate gene selection
Gene-specific alignments and phylogenetic trees were generated. These were subsequently used
to supply the main process of Timezone, whereby genes are analysed for the presence of short-
term positive selection. This is illustrated by Figure 2 and comprises numerous tests including
zonal phylogeny analysis, the calculation of the ratio of structural to silent mutations in the
terminal and internal branches of phylogenetic gene trees, the rate and ratio of total structural
to silent mutations in genes, and calculation of Tajima's D and Fu & Li's D values for each gene set.
This was followed by testing for recombination by Rec-MaxChi and Rec-Phylpro, which
separated the final list of candidate genes from candidate genes that had arisen through
recombination.
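As an illustration of one of the statistics listed above, the sketch below computes Tajima's D from the standard textbook formula (Tajima, 1989). It is not taken from Timezone, whose internal implementation is not described here. The inputs, the number of sequences `n`, the number of segregating sites `seg_sites`, and the mean number of pairwise differences `pi`, are assumed to have been obtained from a gene alignment beforehand.

```python
import math
from typing import Tuple


def harmonic_sums(n: int) -> Tuple[float, float]:
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def tajimas_d(n: int, seg_sites: int, pi: float) -> float:
    """Tajima's D for n sequences, seg_sites segregating sites and pairwise diversity pi."""
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if seg_sites < 1:
        raise ValueError("D is undefined when there are no segregating sites")
    a1, a2 = harmonic_sums(n)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # D compares pi with the diversity expected from the number of segregating sites.
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

A negative value arises when pairwise diversity is lower than expected from the number of segregating sites, that is, when there is an excess of rare variants.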
2.4 Troubleshooting problems
A Timezone run using over 10 sequences can take in excess of 30 hours to finish. This proved
to be problematic on a standalone computer with regard to maintaining power and
sustaining that workload. In response, it was necessary to set up a remote
Windows Server. In addition, Timezone was run through the Windows command line.
3. Results:
3.1 Candidate gene selection
The principle behind most of the tests carried out by Timezone is to detect changes due to
positive selection. This is normally in the form of an amino acid change (a structural or non-
synonymous change). Secondly, the tests try to identify if this change occurred relatively
recently in an evolutionary timescale. There are a number of criteria that signify this. A gene
was selected for candidacy based on meeting just one, or a combination, of the following
criteria: significantly higher allelic diversity in the evolutionarily recent zone than in the fixed
(long-term) zone (EXT>PRI diversity at P<0.05), the occurrence of evolutionarily recent
structural hotspot mutations (HSfreq-EXT), a significantly higher ratio of non-synonymous to
synonymous mutations in the terminal branches (Tips) than in internal branches (Twigs)
(Tips>Twigs dN/dS at P<0.05), dN/dS values significantly higher than 1 (dN/dS-based
selection), or a negative D* value.
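The selection rule above, under which a gene qualifies by meeting at least one criterion, can be sketched as follows. The class and field names are hypothetical and chosen for this example only; the three example genes use the values reported in Table 3, which does not list D* values, so that field is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneTests:
    """Results of the candidate-determining tests for one gene (illustrative names)."""
    gene: str
    ext_gt_pri_sig: bool        # EXT>PRI diversity significant at P<0.05
    hs_freq_ext: float          # frequency of recent structural hotspot mutations
    tips_gt_twigs_sig: bool     # Tips>Twigs dN/dS significant at P<0.05
    dnds_selection: str         # 'Positive', 'Neutral' or 'Purifying'
    d_star: Optional[float] = None  # D* value, if available


def is_candidate(g: GeneTests) -> bool:
    """A gene is a candidate if it meets at least one of the criteria."""
    return (
        g.ext_gt_pri_sig
        or g.hs_freq_ext > 0
        or g.tips_gt_twigs_sig
        or g.dnds_selection.lower() == "positive"
        or (g.d_star is not None and g.d_star < 0)
    )


genes = [
    GeneTests("ECs2998", True, 0.0, False, "Neutral"),
    GeneTests("ECs1986", False, 0.26087, False, "Purifying"),
    GeneTests("ECs1122", False, 0.33333, True, "Purifying"),
]
candidates = [g.gene for g in genes if is_candidate(g)]
```

Each of the three genes qualifies through a different criterion: ECs2998 through EXT>PRI diversity, ECs1986 through its hotspot frequency, and ECs1122 through both its hotspot frequency and the Tips>Twigs test.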
Table 3: A condensed illustration of the primary output of an O157 Timezone run.
Gene name | Product | EXT>PRI diversity at P<0.05 | HSfreq-EXT | Tips>Twigs dN/dS at P<0.05 | dN/dS-based selection
ECs2998 | Kil protein | sig | 0 | non-sig | Neutral
ECs1986 | tail assembly protein | non-sig | 0.26087 | non-sig | Purifying
ECs1122 | outer membrane protein | non-sig | 0.33333 | sig | Purifying
Table 3 displays the gene, its protein product, and the results of the candidate-determining
tests that were conducted. The tests displayed are the main tests by which a gene was selected
for candidacy, which was followed by testing for recombination. In the cases of ‘HSfreq-
EXT’ and ‘dN/dS-based selection’, values of ‘>0’ and ‘positive’ represent significance,
respectively.
3.2 Zonal phylogeny analysis
This type of analysis categorizes genes as ‘RECENT’ or ‘FIXED’ in each of the strains used
for analysis. A gene may either have multiple
evolutionarily linked alleles differing via synonymous mutations (FIXED; Primary zone) or
may be encoded by single alleles exhibiting no silent mutations (RECENT; External zone). A
high frequency of alleles in the external zone versus the primary zone signifies the presence of
positive selection.
Figure 3: Phylogram of the O157 gene ECs1991 which codes for an outer-membrane
protein. Red boxes highlight short-term selection, whereas blue boxes highlight long-term
selection. Each node label follows a format such as ‘RECENT-O157 H str H2687-n1-1S/2N-
D47E/R81H’, which denotes: zone, strain name, number of strains representing this allele (n1),
number of synonymous and non-synonymous mutations giving rise to this allele (1S/2N), and the
specific amino acid polymorphisms, including the residue position (e.g. glutamate for
aspartate at position 47) (8).
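The node label format described in the caption can be unpacked with a small parser. This is a sketch based only on the single example label given above: the regular expression and the function name are assumptions, and labels with a different layout (for instance, nodes without amino acid changes) are not handled.

```python
import re
from typing import Dict, List, Tuple

# Pattern inferred from the example label; not an official Timezone specification.
NODE_PATTERN = re.compile(r"^(RECENT|FIXED)-(.+)-n(\d+)-(\d+)S/(\d+)N-(.+)$")
CHANGE_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_node_label(label: str) -> Dict[str, object]:
    """Split a zonal phylogeny node label into its components."""
    match = NODE_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised node label: {label!r}")
    zone, strain, n_strains, syn, nonsyn, changes = match.groups()
    parsed_changes: List[Tuple[str, int, str]] = []
    for change in changes.split("/"):
        m = CHANGE_PATTERN.match(change)
        if m is None:
            raise ValueError(f"unrecognised amino acid change: {change!r}")
        # (original residue, position, replacement residue)
        parsed_changes.append((m.group(1), int(m.group(2)), m.group(3)))
    return {
        "zone": zone,
        "strain": strain,
        "n_strains": int(n_strains),
        "synonymous": int(syn),
        "non_synonymous": int(nonsyn),
        "changes": parsed_changes,
    }


node = parse_node_label("RECENT-O157 H str H2687-n1-1S/2N-D47E/R81H")
```

For the example label, `node["changes"]` contains `("D", 47, "E")`, the replacement of aspartate by glutamate at position 47, and `("R", 81, "H")`.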
3.3 Candidate gene list
A list of candidate genes was produced based on meeting the aforementioned criteria. This list
of genes then underwent testing for recombination. Candidate genes that arose through
recombination rather than mutation are not considered to be under the action of ‘true’ selection.
In addition, it should be noted that detailed output was only available for
those genes deemed suitable for candidacy (Figure 4), including genes that
were considered to be recombinant. This output comprises the DNA sequence and protein alignments
of genes, the topologies of these alignments, the results of the zonal phylogeny analysis
(ZP-trees and information on mutations and HS-mutations), and the
results of the other candidate-determining tests and recombination tests. However, an annotation
overview list was produced for all orthologues identified.
[Figure 4, panels A and B: bar charts of candidate gene product categories for O157 and for O104 (O104 categories: Rhs element proteins, phage proteins, transposases).]
Figure 4A, 4B & 4C: The number and profile of gene products extracted from the
primary output of Timezone for O157, O104, and O111. Hypothetical proteins with no
described function have been excluded from the analysis represented here. O157; total
candidate gene number: 74, total number of hypothetical proteins found: 27. O104; total
candidate gene number: 32, total number of hypothetical proteins found: 5. O111; total
candidate gene number: 68, total number of hypothetical proteins found: 5. Note that the sizes
of the bars are relative to the total number of candidate genes found for each strain.
3.4 Core gene presence among candidates
Table 2 shows the number of strains analysed for each serotype. Timezone reported the
number and names of the strain sequences in which a
candidate gene was present. Including the reference genome, 15 strains were analysed during
the O157 and O104 analyses. To be considered a core gene, a gene had to be present in all 15
strains. Likewise, 12 strains were analysed during the O111 analysis, as fewer O111 strains
were available.
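The core gene rule can be expressed as a simple presence check. The sketch below is illustrative only: the function name and the strain identifiers are hypothetical, and the labels follow the terms used in Figure 5 (unique, mosaic, core).

```python
from typing import Iterable


def classify_presence(strains_with_gene: Iterable[str], all_strains: Iterable[str]) -> str:
    """Classify a gene by how many of the analysed strains carry it."""
    analysed = set(all_strains)
    present = set(strains_with_gene) & analysed
    if not present:
        return "absent"
    if present == analysed:
        return "core"      # present in every analysed strain
    if len(present) == 1:
        return "unique"    # present in a single strain
    return "mosaic"        # present in some, but not all, strains


# Hypothetical identifiers standing in for the 15 strains of an O157 or O104 run.
strains = [f"strain_{i}" for i in range(1, 16)]
```

With these 15 identifiers, a gene found in all of them is classified as `"core"`, while a gene found in only 14 is classified as `"mosaic"`.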
[Figure 4, panel C (O111): bar chart of candidate gene product categories: DNA associated (methylation, replication, and repair), phage proteins, transposases, endonuclease, membrane proteins, endopeptidase.]
Figure 5: The distribution of core and mosaic genes throughout the genes selected for
candidacy. The coloured bar at the top of the graph represents this distribution from unique
(present in one sequence) to core (present in all sequences). There is a total of 25 core genes
under short-term positive selection.
3.5.1 DAVID analysis; O157 analysis
Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis was
completed in order to visualize the Gene Ontology (GO) terms associated with the serotypes
at the centre of this study (11) (12). Chart analysis was performed in this case. This groups
genes that are represented by similar or identical GO terms.
A threshold count of 3 was applied: for a term to be considered
significant, it must represent a minimum of 3 genes. As a result, 50 genes were excluded,
as the genes in this exclusion list may not have a relationship with any of the other genes
above the similarity threshold.
Table 4: The most commonly associated GO terms with the candidate O157 genes.
Term category | Gene count | % of total candidate genes | P-value
Outer membrane | 7 | 9.7 | 1.4e-5
Virulence-related outer membrane protein | 6 | 8.3 | 1.4e-7
Outer membrane protein, beta-barrel | 6 | 8.3 | 2.9e-7
Cell outer membrane | 6 | 8.3 | 6.6e-5
External encapsulating structure part | 6 | 8.3 | 5.9e-4
Cell envelope | 6 | 8.3 | 1.9e-3
Envelope | 6 | 8.3 | 9.8e-3
External encapsulating structure | 6 | 8.3 | 1.3e-2
Terminase small subunit | 4 | 5.6 | 3.8e-7
Terminase small subunit | 4 | 5.6 | 4.6e-6
DNA packaging | 4 | 5.6 | 6.2e-6
Phage lambda membrane protein lom | 4 | 5.6 | 1.3e-5
Phage lambda minor tail protein L | 4 | 5.6 | 2.8e-5
Putative prophage tail fibre, C-terminal | 4 | 5.6 | 1.7e-4
Phage minor tail protein L | 4 | 5.6 | 2.3e-4
Phage-related tail assembly protein I | 3 | 4.2 | 1.5e-3
Bacteriophage lambda tail assembly I | 3 | 4.2 | 6.4e-3
Table 4 shows the number of genes associated with each GO term and the percentage this
makes up of the total genes selected for analysis. Note that a gene can be associated with more
than one GO term. In addition, the mean P-values are given to display statistical
significance.
3.5.2 DAVID analysis; Commensal and ‘top serotype’ strains
Cluster analysis was the main form of inspection here. This groups chart GO terms together
based on common biology and similar function. Both analyses resulted in a large amount of
associated GO terms. Hence, the classification stringency was selected as ‘highest’. In
addition to this, the effort to maintain statistical significance was strengthened by increasing
the kappa ‘similarity term overlap’. In the case of ‘Top Serotypes’ this kappa value was raised
to 6, and in the case of ‘Commensal strains’ it value was increased to 9. In order to maintain
the analysis integrity, it was important to compare a similar number of cluster terms (<20).
However, the number Commensal strains showed greater restraint to increasing the kappa
score, thus it was necessary to increase it one factor higher.
Figure 6A & 6B: The GO cluster terms that were most represented for ‘Commensals’
and ‘top serotypes’. The biological significance of group terms is graded by their
enrichment score (11) (12). The percentage represents the ‘enrichment’ score for each cluster
over the total ‘enrichment’ score. The higher the percentage, the more enriched the group is
and hence the more biologically significant it is, relative to the other groups. The list of terms
begins with the most enriched, and ends with the least enriched.
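The percentage described in the caption is each cluster's enrichment score divided by the sum of all cluster scores. A minimal sketch of that calculation is given below; the function name is hypothetical, and the cluster names and scores in the usage note are invented for illustration and are not results from this study.

```python
from typing import Dict


def enrichment_percentages(scores: Dict[str, float]) -> Dict[str, float]:
    """Express each cluster's enrichment score as a percentage of the total score."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("total enrichment score must be positive")
    # Sort so that the most enriched cluster comes first, as in the figure.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {name: 100.0 * score / total for name, score in ordered}
```

For instance, two made-up clusters with scores of 3.0 and 1.0 would be reported as 75% and 25% respectively.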
[Figure 7 chart: number of genes with PSC (axis 0 to 500). Top Serotype: 466; Commensal: 191.]
3.6 Premature Stop Codon analysis in Commensal and ‘top serotype’ strains
Figure 2 illustrates that genes and strains that contain PSC are listed, and subsequently
excluded from further analysis as these genes have been inactivated. Some studies have
shown that PSC are correlated to bacterial evolution (26).
Figure 7: A display of the number of genes which had premature stop codons present in
each analysis. Each bar is colour-coded, and the exact number of genes with PSC is given.
21
0.2535418
0.3913443
7
0.3098282
3
0 0.1 0.2 0.3 0.4 0.5
Mean ratio of HS mutationsto total amount of aa
changes
O111 O104 O157
0%
100%
Hotspot frequencies
Long-term
Short-term
3.7.1 Hotspot analysis
Hotspot mutations have been described as the ‘footprints of short-term positive selection’.
These types of mutations can illustrate interesting patterns of evolution upon further
inspection. For example, the frequency at which HS-mutations appear in the genome can
signify the extent to which selection acts on these mutations in order to drive evolution.
Figure 8A & 8B: A display of the ratio of HS-mutations. 8A displays the total proportion
of hotspot (HS) mutations in the long and short term zones of O157, O104, and O111. There
were only 2 HS-mutations in the long-term zone of the genes of these serotypes. Hence, the
percentage of HS-mutations in the long-term zone is 0.315%, but this is not visible in this
graph. 8B illustrates the mean ratio of HS-mutations to the total number of amino acid (aa)
changes in each of the genomes of the serotypes analysed.
3.7.2 Hotspot analysis; parallel vs coincidental
The nature of these hotspot mutations is extremely important in order to determine what
pattern of evolution is being followed. Hotspot mutations can occur as either parallel or
coincidental. The former refers to a situation whereby the same amino acid replacement
occurs at each of these hotspot positions, whereas the latter refers to the occurrence of
different amino acid replacements (23). Figure 9 illustrates that parallel hotspot mutations are
predominantly occurring in these VTEC strains.
Figure 9: Different types of hotspot accumulations across the three serotypes analysed.
Here we can see the number of candidate genes in each strain that accumulated parallel
hotspot mutations only, coincidental hotspot mutations only, or both. Genes that accumulated
no hotspot mutations are not included here.
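The distinction between parallel and coincidental hotspots, and the three gene categories counted in Figure 9, can be sketched as follows. The data layout (a mapping from amino acid position to the replacement residues observed there) and the function names are assumptions for this example; a position is treated as a hotspot only if it was mutated at least twice.

```python
from typing import Dict, List


def classify_hotspot(replacements: List[str]) -> str:
    """Classify one position by the replacement residues observed at it."""
    if len(replacements) < 2:
        return "not_hotspot"          # a hotspot needs repeated mutation
    if len(set(replacements)) == 1:
        return "parallel"             # same replacement each time
    return "coincidental"             # different replacements


def classify_gene_hotspots(positions: Dict[int, List[str]]) -> str:
    """Summarise a gene as 'parallel', 'coincidental', 'both' or 'none'."""
    kinds = {classify_hotspot(r) for r in positions.values()}
    has_parallel = "parallel" in kinds
    has_coincidental = "coincidental" in kinds
    if has_parallel and has_coincidental:
        return "both"
    if has_parallel:
        return "parallel"
    if has_coincidental:
        return "coincidental"
    return "none"
```

For example, a gene in which position 47 changed to glutamate twice and position 81 changed once to histidine and once to cysteine would be classified as `"both"`.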
[Figure 10A chart: segments of 46%, 23%, and 31%; legend: #Genes Para, #Genes Coin, #Genes both. Figure 10B chart: segments of 59% and 41%; legend: #Genes Para, #Genes Coin.]
3.7.3 Hotspot analysis; recombinant O157, O104, and O111 genes
Recombination-labelled genes may point towards some interesting patterns of evolution
here: although parallel hotspot polymorphisms may occur as point mutations, such changes
may also occur due to recombination. Yet Figure 10A shows there is a high proportion of
genes with both parallel and coincidental hotspot mutations, and the percentage of genes with
just coincidental hotspot mutations also appears to be high.
Figure 10A & 10B: The distribution of the different types of HS-mutations in candidate
genes produced through recombination. 10A displays the distribution of the nature of
hotspots in recombinant genes. 10B illustrates the total distribution of parallel and
coincidental hotspot changes. In this case, recombinant genes that have both have been
included in both the number of genes with parallel, and the number of genes with coincidental
hotspot mutations.
4. Discussion:
There are many genes under short-term positive selection in the serotypes of VTEC that were
studied, and many of them are associated with pathogenicity. Figure 4
shows a prominent presence of ‘phage-related’ proteins. This grouping covers
a wide range of proteins and their functions, including DNA packaging, tail assembly,
terminases, capsid assembly, portal proteins, and holin proteins, to name just a few.
Horizontal gene transfer plays a major role in the evolution of bacteria, which can account
for this observation. This mechanism of gene transfer is commonly mediated by
bacteriophages. These phages invade their bacterial host and integrate their genomes as
prophages into the resident genetic material. Indeed, these prophages can carry important new
information such as virulence factors, or further niche adaptation mechanisms (13).
Observation of Figure 5 illustrates that nearly all aspects of the production, release, and
integration of phages are under positive selection. For example, Table 4 shows that the GO
term “Phage lambda membrane protein lom” is heavily represented. This protein is
incorporated into the host cell membrane during E. coli infection by phage lambda. Hence, it
is evident that selection is favouring this process, as it must be of benefit to these bacteria.
It is apparent that this phenomenon is not just prevalent in one serotype such as O157 in
Figure 5A, but in all three serotypes that have been examined. In addition to this, examples of
this occurrence can be observed in tangible settings. For example, there have been mass
reports of the outbreak in 2011 in Germany of haemolytic uraemic syndrome (HUS)
associated with E. coli O104:H4. Genomic studies have shown that the enhanced
pathogenicity of this strain was probably as a result of horizontal transfer due to the presence
of stx-2 (normally present in other E. coli strains) and β-lactamase-encoding plasmid CTX-M-
15 (often identified in other members of Enterobacteriaceae) (14).
The membrane is under heavy selection: Table 4 shows the number of GO terms associated
with the membrane in these VTEC strains, and their large gene counts. The membrane serves
as the primary contact region for host-pathogen interactions, and thus it appears to be a
natural candidate for positive selection, since there is constant pressure to avoid immune
system recognition and to retain the capability to invade host cells (15). Table 4
highlights two GO term categories of interest: Virulence-related outer membrane protein
(P-value 1.4e-7) and Outer membrane protein, beta-barrel (P-value 2.9e-7).
Upon further inspection, “Virulence-related outer membrane proteins” refers to protein family
members which confer a distinct virulent phenotype such as lom and OmpX in E. coli. The
structure of OmpX is integral to its function as it contains a highly-variable four-strand β-
sheet protruding from the cell surface which would aid the binding of external proteins with
complementary β-sheets. This type of binding promotes adhesion and invasion of mammalian
cells, as well as defence against the host immune response (16, 17). Indeed, it has been
established that adhesion inside the host system is a vital part of the VTEC virulence armoury.
In this manner, positive selection for this protein family further enhances the virulence of
these bacteria.
Examination of “Outer membrane protein, beta barrel” reveals that this is a
transmembrane beta-barrel structure, or porin, that allows the passage of small, hydrophilic,
or charged molecules (15). However, this structure also has a role to play in host-immune
interaction and pathogenesis since it serves as a receptor for phages, antibiotics, and colicins
(15). This transmembrane beta-barrel structure can be found in outer membrane proteins such
as OmpA, and in the outer membrane enzyme PagP of pathogenic Gram-negative bacteria.
Outer membrane protein A (OmpA) plays a multitude of roles. For example, colicins K and L
require the action of OmpA for correct functioning, and it also serves as a receptor for a
number of T-even-like phages (18). PagP, or its E. coli homolog CrcA, also helps the bacterium
to avoid the host immune system. Lipopolysaccharide (LPS) is a major component of the
outer membrane in Gram-negative bacteria. It contains a hydrophobic anchor, referred to as
lipid A. In addition to this, lipid A is also an active component of the LPS endotoxin. This
promotes septic shock during a bacterial infection in extreme cases (19). However, the
pathogenic capabilities of this lipid can be further enhanced with some modification. The
aforementioned enzymes catalyse the transfer of palmitate from a phospholipid to a
glucosamine unit of lipid A. This action provides the bacteria with resistance to the response
of the innate immune system, such as cationic anti-microbial peptides (CAMPs). Furthermore,
it also antagonizes LPS-mediated signal transduction in human cells (19). Thus, a common
trend can be observed in the membrane. Positive selection in many of these genes appears
to be acting on processes associated with host immune attack and evasion, and with the
binding of phages and colicins.
Selection for phage and membrane-associated activities is evident. However, O104 and O111
did not return any results for associated GO terms. This is most likely because there is
poorer characterization of these serotypes and hence the ‘GI’ numbers used for input did not
map to any GO terms present in the database to elicit any significant results. However, the
data presented in Figure 5 suggests that transposases and transposable-elements merit further
examination.
A transposase catalyses the movement of a transposon to another part of the genome. A
number of transposases appeared to be under short-term positive selection in this analysis,
such as transposase IS3, IS629 transposase OrfB, and IS1 and IS5 transposases. Opinions in
the literature differ as to the significance of positive selection for transposable
elements. Some studies suggest that insertion of transposable elements has a negative
fitness effect on the organism, and simply occurs due to the selfish nature of these
genetic elements. Genes, like organisms, struggle for existence and the most successful
genes are those that persist. Thus, it has been postulated that these genes persist in a
manner similar to that of pathogens persisting in their hosts (15).
However, other research suggests the contrary. For example, it has been suggested that
silent catabolic operons in E. coli can be activated by IS elements in the presence of the
substrate for that operon. In addition, this transposition occurs at a higher rate in
starving cells than in growing ones. In this case, these transposable elements contribute
to the survival of the cell (20). It remains unclear why these groups of genes are being
positively selected for in VTEC, whether for selfish purposes or for the benefit of the
organism in terms of survival and pathogenesis. Nevertheless, these transposases are
evidently being selected for, heavily so in the case of O111, and this re-opens the debate
as to what role these elements are playing.
Figure 7A shows a focus on ‘Organelle membrane’ in the commensal strains, whereas the
‘top serotypes’ display a more even distribution of terms under selection, with the term
‘cell wall biogenesis’ being the most highly represented. A case could nonetheless be made
for selection for enhanced virulence in the ‘top serotypes’, as there is a sizeable
representation of ‘cell motility’ (13%) and ‘taxis’ (10%). Some studies have reported that
enhanced cell motility and chemotaxis are associated with enhanced virulence in bacteria
(25). Thus, this is a point worth highlighting. Overall, however, the two profiles appear
broadly similar, with some minor exceptions.
Perhaps this is unsurprising however, since it is largely commensal genes that are under
selection in both cases. Previous studies have indicated that there is significant mosaicism
between the genome sequences of commensal and pathogenic strains of E. coli. Indeed, such
inspections have revealed that traits once thought to be almost unique to the pathogenic
strains can also be found within the commensal genome (22). This would most likely aid the
survival of the pathogenic strains, as the commensal population continually serves as a
useful ‘resource’ from which further pathogenic members can be obtained via horizontal
gene transfer in order to explore novel niches. However, the commensal and pathogenic
populations are not so divergent that the commensals cannot maintain the primary reservoir
habitat, where the long-term survival of the organism mainly lies. For example,
pathoadaptive traits will be selected for in the pathogenic habitat but selected against in
the commensal habitat (23). This is the theory behind ‘source-sink’ dynamics and, in this
manner, the commensal and opportunistic nature of E. coli can be maintained.
In addition, the analysis of premature stop codons (PSC) in the commensal and ‘top
serotype’ strains is particularly interesting. Figure 7 shows that the number of genes
with PSC in the top pathogenic serotype strains is almost 2.5 times the number of genes
with PSC in the commensal strains. Some studies have suggested that this is the result of
the adaptation of the pathogen to its ‘novel’ habitat. Thus far, pathoadaptation in VTEC
has been described in terms of gain-of-function modifications that allow the bacteria to
better exploit their niche. However, it is equally important for genes that are no longer
compatible with the ‘pathogenic lifestyle’ to be inactivated. In other words, this is
pathoadaptation via loss-of-function modifications. This is another direction evolution can
take during adaptation to a new habitat.
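As an illustration of how a premature stop codon might be flagged in a coding sequence, the following minimal Python sketch reads a sequence codon by codon, in frame, and reports the first stop codon that occurs before the final codon. It is not the method used in this study: the function name `first_premature_stop` and the example sequences are hypothetical, and the three standard stop codons (TAA, TAG, TGA) are assumed.

```python
# Standard stop codons (assumed; applies to the usual bacterial genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based codon index of the first in-frame stop codon that
    occurs before the final codon of a coding sequence, or None if there is
    no such codon. Trailing bases that do not fill a codon are ignored."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable_length, 3)]
    # The last codon is the expected terminator, so it is excluded.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            return index
    return None


# Hypothetical sequences, for illustration only.
intact_gene = "ATGAAAGGGTAA"        # ATG AAA GGG TAA
truncated_gene = "ATGAAATAAGGGTAA"  # ATG AAA TAA GGG TAA

intact_result = first_premature_stop(intact_gene)
truncated_result = first_premature_stop(truncated_gene)
```

Counting the genes for which such a function returns a value, separately for the commensal and pathogenic strain sets, would give the kind of ratio reported above.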
At the beginning of this study, it was suggested that this type of analysis would yield a
significant presence of core genes in the results. Figure 5 supports this hypothesis.
Although the most significant presence is technically from mosaic genes, it should be noted
that these points are mainly concentrated in the locality of the core gene region. Thus, it
has become evident that pathoadaptation is occurring in these pathogenic bacteria through
mutations in commensal genes that confer a short-term advantage, yet these mutations will
only be mildly deleterious in the ancestral, commensal niche. This type of pathoadaptation
suits the opportunistic nature of these bacteria.
Significantly, the characteristic VTEC pathogenic genes are absent from the candidate list
of genes. Before this study was conducted, it was expected that these genes would naturally
be present, given their importance to these strains of E. coli. However, the
very nature of this analysis does not include these genes. This is because the candidate
list of genes includes primarily commensal genes that are possibly being hijacked in order
to further enhance the bacteria’s virulence weaponry. These genes are under short-term
positive selection, and are just as likely to be selected against in order to restore the
balance. In contrast, the quintessential VTEC genes mentioned above are constant virulence
factors for these bacteria. In other words, they are not likely to be selected in and out
of the genome; rather, they continuously serve the pathogenic efforts of these bacteria.
The point mutations that are occurring in the commensal or ‘core’ genes are occurring mainly
as mutations in hotspot positions. Figure 8A demonstrates the extent to which short-term
positive selection uses these types of mutations as its main driver since almost none can be
witnessed to be occurring in the ‘long-term’ zone. This certainly fits in with the picture that
these protein variants which have accumulated recent hotspot mutations could be functionally
significant for short-term adaptation (23). However, such is the nature of hotspot mutations,
these protein variants could also be reverted back to their original, commensal state. In
addition to this, Figure 8B illustrates the overall importance of hotspot mutations, as
this type of mutation accounts for 25-39% of the total number of changes occurring in the
three VTEC serotypes that have been examined.
The predominance of parallel HS-mutations shown by Figure 9 signifies that selection is
acting on these genes in order to modify the protein in a specific and directional manner, as
the same amino acid is being continuously inserted into these positions. This is in line with
the principle of positive selection which aims to produce a shift in the phenotype. In contrast
to this, if coincidental hotspot mutations were predominant, this would show that selection
was acting in order to eliminate protein function as multiple types of amino acids would be
accumulating in positions that are vital to the function of the protein (23). In the case of O111,
however, it appears that both parallel and coincidental changes are co-occurring.
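The distinction between parallel and coincidental hotspot mutations can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that each independent mutation event inferred on a gene tree is available as a (position, derived amino acid) pair; this input format and the function name `classify_hotspots` are hypothetical and do not reflect how TimeZone represents its results. A position hit by two or more independent events is treated as a hotspot, and it is called parallel if every event introduces the same amino acid and coincidental otherwise (a simplification for positions with mixed changes).

```python
from collections import defaultdict


def classify_hotspots(events):
    """Classify hotspot positions from independent mutation events.

    events: iterable of (position, derived_amino_acid) pairs, one per
    independent event on the gene tree (hypothetical input format).
    Returns a dict mapping each hotspot position to "parallel" or
    "coincidental"; positions hit only once are not hotspots and are omitted.
    """
    residues_by_position = defaultdict(list)
    for position, amino_acid in events:
        residues_by_position[position].append(amino_acid)

    classes = {}
    for position, residues in residues_by_position.items():
        if len(residues) < 2:
            continue  # a single hit is not a hotspot
        if len(set(residues)) == 1:
            classes[position] = "parallel"      # same residue each time
        else:
            classes[position] = "coincidental"  # different residues
    return classes


# Hypothetical events: position 12 gains K twice, position 47 gains A and T,
# and position 90 is hit only once.
example_events = [(12, "K"), (12, "K"), (47, "A"), (47, "T"), (90, "G")]
hotspot_classes = classify_hotspots(example_events)
```

A gene in which both labels occur would correspond to the co-occurrence of parallel and coincidental changes described for O111.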
Recombinant genes are showing a high frequency of parallel changes (Figure 10). This is to
be expected as parallel HS-mutations can occur as point mutations, which also may occur due
to recombination. Hence, there is normally a much higher frequency of recombinant genes
that display parallel HS-mutations than coincidental HS-mutations. In this case, however,
there is a high proportion of recombinant genes displaying coincidental hotspot mutations.
Figure 10B illustrates the total proportion of coincidental and parallel hotspot mutations in the
recombinant genes. This supports the observation that coincidental hotspot changes account
for a high percentage of the total number of HS-mutations in recombinant genes, higher than
would normally be expected. This is suggestive of the power of positive selection to
produce sequence changes not just through mutation, but also through recombination. This
broadens the horizons of the organism and widens its scope to adapt (8, 23).
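The two ways of counting recombinant genes used in Figure 10 can be sketched as follows. This is an illustrative Python example only: the gene names and their hotspot labels are invented, and the function `tally_hotspot_genes` is hypothetical. The first tally places each gene in exactly one category (parallel only, coincidental only, or both, as in Figure 10A); the second counts a gene with both types once towards each total (as described for Figure 10B).

```python
from collections import Counter


def tally_hotspot_genes(gene_hotspot_types):
    """gene_hotspot_types maps a gene name to the set of hotspot types it
    carries, drawn from {"parallel", "coincidental"} (hypothetical format).

    Returns (exclusive, totals):
      exclusive - each gene counted once, in one of three categories
      totals    - genes carrying both types counted in both totals
    """
    exclusive = Counter()
    totals = Counter()
    for types in gene_hotspot_types.values():
        types = set(types)
        if not types:
            continue  # gene has no hotspot mutations
        if types == {"parallel", "coincidental"}:
            exclusive["both"] += 1
        elif types == {"parallel"}:
            exclusive["parallel only"] += 1
        elif types == {"coincidental"}:
            exclusive["coincidental only"] += 1
        else:
            raise ValueError(f"unexpected hotspot labels: {types}")
        for hotspot_type in types:
            totals[hotspot_type] += 1
    return exclusive, totals


# Invented example data, not results from this study.
example_genes = {
    "geneA": {"parallel"},
    "geneB": {"parallel", "coincidental"},
    "geneC": {"coincidental"},
    "geneD": {"parallel"},
}
exclusive_counts, total_counts = tally_hotspot_genes(example_genes)
```

Because genes with both types contribute to both totals, the totals can sum to more than the number of genes.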
In conclusion, this study has succeeded in achieving its aims by identifying further traits
associated with the pathogenicity of VTEC, including a more detailed characterization of the
virulence traits associated with some non-O157 strains. The results indicate that selection
is favouring ‘phages’ in these bacteria, which would increase the transfer of
virulence-associated genetic material across the population. Although we are currently
aware of the genes that identify VTEC strains, this study has focused on the ‘short-term’
selection of other genes for similar purposes. This is important as this ‘short-term’ focus
fits in with the ‘source-sink’ life cycle of Escherichia coli populations. Furthermore,
this study has also achieved its secondary goal of
recognizing the pattern of evolution that is occurring here. The short-term pathoadaptation of
VTEC is occurring largely through hotspot mutations. Once again, this type of mutation is
suitable as it can be manipulated to produce gain-of-function in genes for shifting to a
pathogenic state, or to perform a loss-of-function in genes in order to revert back to
commensalism and ultimately maintain the survival of the species. It has been observed that
many of the genes are not specialist virulence factors; rather, they are commensal genes
that are now being used to improve the armoury of virulence factors when exploiting novel
niches.
5. Acknowledgements:
A word of thanks to Lisa Rogers and Dr. Peadar Ó Gaora for their contribution to this study.
6. References:
(1). Karama, M., Johnson, R. P., Holtslander, R., McEwen, S. A., & Gyles, C. L. (2008).
Prevalence and characterization of verotoxin-producing Escherichia coli (VTEC) in cattle
from an Ontario abattoir. Canadian Journal of Veterinary Research, 72(4), 297.
(2). Karmali, M. A., Gannon, V., & Sargeant, J. M. (2010). Verocytotoxin Escherichia coli
(VTEC). Veterinary microbiology, 140(3), 360-370.
(3). Bolton, D. J. (2011). Verocytotoxigenic (Shiga toxin–producing) Escherichia coli:
virulence factors and pathogenicity in the farm to fork paradigm. Foodborne pathogens and
disease, 8(3), 357-365.
(4). Yin, S., Jensen, M. A., Bai, J., DebRoy, C., Barrangou, R., & Dudley, E. G. (2013). The
evolutionary divergence of Shiga toxin-producing Escherichia coli is reflected in clustered
regularly interspaced short palindromic repeat (CRISPR) spacer composition. Applied and
environmental microbiology, 79(18), 5710-5720.
(5). Nataro, J. P., & Kaper, J. B. (1998). Diarrheagenic Escherichia coli. Clinical microbiology
reviews, 11(1), 142-201.
(6). Sokurenko, E. V., Hasty, D. L., & Dykhuizen, D. E. (1999). Pathoadaptive mutations:
gene loss and variation in bacterial pathogens. Trends in microbiology, 7(5), 191-195.
(7). Chattopadhyay, S., Paul, S., Kisiela, D. I., Linardopoulou, E. V., & Sokurenko, E. V.
(2012). Convergent molecular evolution of genomic cores in Salmonella enterica and
Escherichia coli. Journal of bacteriology, 194(18), 5002-5011.
(8). Chattopadhyay, S., Paul, S., Dykhuizen, D. E., & Sokurenko, E. V. (2013). Tracking
recent adaptive evolution in microbial species using TimeZone. Nature protocols, 8(4), 652-665.
(9). Chattopadhyay, S., Dykhuizen, D. E., & Sokurenko, E. V. (2007). ZPS: visualization of
recent adaptive evolution of proteins. BMC bioinformatics, 8(1), 187.
(10). Swofford, D. L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other
Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
(11). Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Systematic and integrative
analysis of large gene lists using DAVID Bioinformatics Resources. Nature protocols, 4(1), 44-57.
(12). Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Bioinformatics enrichment tools:
paths toward the comprehensive functional analysis of large gene lists. Nucleic acids
research, 37(1), 1-13.
(13). Asadulghani, M. D., Ogura, Y., Ooka, T., Itoh, T., Sawaguchi, A., Iguchi, A., &
Hayashi, T. (2009). The defective prophage pool of Escherichia coli O157: prophage–
prophage interactions potentiate horizontal transfer of virulence determinants. PLoS
pathogens, 5(5), e1000408.
(14). Juhas, M. (2013). Horizontal gene transfer in human pathogens. Critical reviews in
microbiology, (0), 1-8.
(15). Petersen, L., Bollback, J. P., Dimmic, M., Hubisz, M., & Nielsen, R. (2007). Genes
under positive selection in Escherichia coli. Genome research, 17(9), 1336-1343.
(16). Otto, K., & Hermansson, M. (2004). Inactivation of ompX causes increased interactions
of type 1 fimbriated Escherichia coli with abiotic surfaces. Journal of bacteriology, 186(1),
226-234.
(17). Vogt, J., & Schulz, G. E. (1999). The structure of the outer membrane protein OmpX
from Escherichia coli reveals possible mechanisms of virulence. Structure, 7(10), 1301-1309.
(18). Johansson, M. U., Alioth, S., Hu, K., Walser, R., Koebnik, R., & Pervushin, K. (2007).
A minimal transmembrane β-barrel platform protein studied by nuclear magnetic
resonance. Biochemistry, 46(5), 1128-1140.
(19). Bishop, R. E., Gibbons, H. S., Guina, T., Trent, M. S., Miller, S. I., & Raetz, C. R.
(2000). Transfer of palmitate from phospholipids to lipid A in outer membranes of
Gram‐negative bacteria. The EMBO journal, 19(19), 5071-5080.
(20). Hall, B. G. (2000). Transposable elements as activators of cryptic genes in E. coli.
In Transposable Elements and Genome Evolution (pp. 181-187). Springer Netherlands.
(21). Rasko, D. A., Rosovitz, M. J., Myers, G. S., Mongodin, E. F., Fricke, W. F., Gajer, P., &
Ravel, J. (2008). The pangenome structure of Escherichia coli: comparative genomic analysis
of E. coli commensal and pathogenic isolates. Journal of bacteriology, 190(20), 6881-6893.
(22). Sokurenko, E. V., Hasty, D. L., & Dykhuizen, D. E. (1999). Pathoadaptive mutations:
gene loss and variation in bacterial pathogens. Trends in microbiology, 7(5), 191-195.
(23). Chattopadhyay, S., Weissman, S. J., Minin, V. N., Russo, T. A., Dykhuizen, D. E., &
Sokurenko, E. V. (2009). High frequency of hotspot mutations in core genes of Escherichia
coli due to short-term positive selection. Proceedings of the National Academy of
Sciences, 106(30), 12412-12417.
(24). Maurelli, A. T. (2007). Black holes, antivirulence genes, and gene inactivation in the
evolution of bacterial pathogens. FEMS microbiology letters, 267(1), 1-8.
(25). Josenhans, C., & Suerbaum, S. (2002). The role of motility as a virulence factor in
bacteria. International Journal of Medical Microbiology, 291(8), 605-614.
(26). Wong, T. Y., Fernandes, S., Sankhon, N., Leong, P. P., Kuo, J., & Liu, J. K. (2008).
Role of premature stop codons in bacterial evolution. Journal of bacteriology, 190(20), 6718-6725.
7. Appendix:
Table A: Bash commands used together in one script referred to as the ‘assembly
script’.
Table B. ‘Header-truncator’ script.