CLOSING THE GAP BETWEEN GENOME ANALYSIS AND THE
BIOLOGIST
Vincenzo Forgetta
Faculty of Medicine
Department of Human Genetics
McGill University, Montreal, Quebec, Canada
June 2012
A thesis submitted to McGill University in partial fulfillment of the requirements
of the degree of Doctor of Philosophy
© Vincenzo Forgetta, 2012
ABSTRACT
Bioinformatics is a crucial component of genomics research because it enables the
analyses of large and complex data sets. Conventionally, these analyses involve
the use of sophisticated software, and are largely performed by those with prior
experience in bioinformatics using adequate computational resources.
Massively parallel DNA sequencing (MPS) platforms have democratized genome
sequencing, making it affordable to the biologist. For many biologists this will be
their first venture into bioinformatics and genomics. Consequently, they may be
unfamiliar with bioinformatics or lack the necessary computer resources. For
these biologists, the potential of MPS platforms for genome analysis is only half
fulfilled: the platforms provide affordable genomic data, but not the means to
analyze it easily. One approach to closing this gap is to build software oriented
towards those with limited bioinformatics expertise or resources.
This dissertation describes a paradigm to close the gap between genome analysis
and the biologist. Using this paradigm, I have developed software tools for three
bioinformatics tasks in genome analysis: [i] assessment of a genome assembly,
[ii] display and integrated analysis of genomic data, and [iii] deriving biological
insight using public information. The first tool I developed was cgb, a program
that creates custom UCSC Genome Browsers, allowing biologists to use this
browser for genome sequences obtained from MPS platforms. Using cgb in a
comparative genomics study of Clostridium difficile helped us identify
diagnostic DNA markers associated with disease severity and show that the
pan-genome is larger than previously estimated. Next, I developed contiGo, a
general purpose tool to inspect genome assemblies via a web browser, thus
bypassing the need for the biologist to install software, satisfy hardware
requirements, and download large datasets. Along with cgb, this program enabled
us to evaluate the performance of the Roche/454 Genome Sequencer-FLX MPS
platform across five sequencing core facilities, and to produce a high quality
genome sequence of the fungus Ophiostoma novo-ulmi. Lastly, I developed BL!P,
a program to automate NCBI BLAST searches and explore the results in a
dynamic interface. This program was inspired by my work on characterizing the
genome of a multi-drug resistant and pathogenic strain of Escherichia fergusonii,
for which cgb and contiGo were also used in data analysis. These applications
have been used in other genomics projects by users with a range of bioinformatics
expertise and resources. Other data-intensive fields of science could benefit from
a similar software development paradigm.
RÉSUMÉ
Bioinformatics is now an integral part of genomics research because it enables
the analysis of large and complex data sets. Conventionally, these analyses
involve the use of sophisticated software and are generally performed by people
experienced in bioinformatics who use adequate computational resources.
High-throughput DNA sequencing platforms have democratized genome sequencing,
making it accessible to biologists. For many biologists, this will be their first
venture into bioinformatics and genomics. Consequently, they are probably
unfamiliar with bioinformatics or lack the computational resources needed to
analyze the results. For these biologists, high-throughput sequencing platforms
provide affordable genomic data but do not offer the tools to analyze it easily.
Developing software aimed at researchers with limited bioinformatics expertise
or few resources would help close this gap.
This dissertation describes a paradigm that aims to narrow, or even close, the
gap between genome analysis and the biologist. Using this paradigm, I have
developed software tools for three tasks that facilitate genome analysis: [i]
assessment of a genome assembly, [ii] display and integrated analysis of genomic
data, and [iii] obtaining biological knowledge using public information. The
first tool I developed was cgb, a program that creates custom UCSC Genome
Browsers. It allows biologists to use these browsers to evaluate sequences
obtained from high-throughput sequencing platforms. Using cgb in a comparative
genomics study of Clostridium difficile allowed us to identify diagnostic DNA
markers associated with disease severity and to show that its pan-genome is
larger than previously estimated. Next, I developed contiGo, a general-purpose
tool to review genome sequence assemblies through a web browser. This
application allows biologists to bypass the need to install software, satisfy
hardware requirements, and download large data sets. Together with cgb, this
program enabled us to evaluate the performance of the Roche/454 Genome Sequencer
FLX high-throughput sequencing platform across five sequencing facilities, and
to generate a high-quality genome sequence of the fungus Ophiostoma novo-ulmi.
Finally, I developed BL!P, a program to automate NCBI BLAST searches and explore
the results in a dynamic interface. This program was inspired by my work on
characterizing the genome of a pathogenic and multidrug-resistant strain of
Escherichia fergusonii, for which cgb and contiGo were also used in data
analysis. These applications have been used in other genomics projects by users
with a range of bioinformatics skills and resources. Other scientific fields
that generate large volumes of data could benefit from a similar software
development paradigm.
TABLE OF CONTENTS
ABSTRACT
RÉSUMÉ
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
TABLE OF ABBREVIATIONS
ACKNOWLEDGEMENTS
ORIGINALITY AND CONTRIBUTIONS TO KNOWLEDGE
PART I: INTRODUCTION
CHAPTER 1: BIOINFORMATICS AND GENOMICS
Concise History of Bioinformatics
Evolution of Genome Sequencing
Coevolution of Bioinformatics and Genome Analysis
Synthesis
Research Objectives
Thesis Outline
PART II: DISPLAY AND INTEGRATED ANALYSIS OF GENOMIC DATA
CONNECTING TEXT
Contribution of Authors
CHAPTER 2: CGB − A UNIX SHELL PROGRAM TO CREATE CUSTOM INSTANCES OF THE UCSC
GENOME BROWSER
Abstract
Introduction
Methods
Results
Discussion
Acknowledgements
CHAPTER 3: FOURTEEN-GENOME COMPARISON IDENTIFIES DNA MARKERS FOR
SEVERE-DISEASE-ASSOCIATED STRAINS OF CLOSTRIDIUM DIFFICILE
Abstract
Introduction
Material and Methods
Results
Discussion
Acknowledgements
PART III: ASSESSMENT OF A GENOME ASSEMBLY
CONNECTING TEXT
Contribution of Authors
CHAPTER 4: CONTIGO -- A TOOL TO INSPECT GENOME ASSEMBLIES IN A WEB BROWSER
Abstract
Introduction
Methods
Results
Discussion
Acknowledgments
CHAPTER 5: REPRODUCIBILITY OF THE ROCHE/454 GS-FLX TITANIUM SYSTEM TO GENOME
SEQUENCE THE DUTCH ELM DISEASE PATHOGEN
Abstract
Introduction
Methods
Results
Discussion
Acknowledgements
PART IV: DERIVING BIOLOGICAL INSIGHT USING PUBLIC INFORMATION
CONNECTING TEXT
Contribution of Authors
CHAPTER 6: A TOOL TO AUTOMATE MULTIPLE BLAST SEARCHES AND DYNAMICALLY EXPLORE
RESULTS
Abstract
Introduction
Implementation
Results
Conclusions
Availability and Requirements
Authors' contributions
Acknowledgements and Funding
CHAPTER 7: PATHOGENIC AND MULTIDRUG RESISTANT ESCHERICHIA FERGUSONII FROM
BROILER CHICKEN
Abstract
Introduction
Materials and Methods
Results and Discussion
Conclusions
Acknowledgements
PART V: DISCUSSION
CHAPTER 8: IMPACT OF RESEARCH, FUTURE WORK, AND CONCLUDING REMARKS
Impact of the Genome Sequencing Projects
Bioinformatics Software
Concluding Remarks
REFERENCES
LIST OF FIGURES
Figure 1. Improvements in DNA sequencing technology.
Figure 2. Species counts from the NCBI genome project database.
Figure 3. The genome sequencing and assembly process.
Figure 4. A custom instance of the UCSC Genome Browser for C. difficile isolate
QCD-66c26.
Figure 5. NCBI GenBank record for the clpA protein from E. fergusonii isolate
ECD227.
Figure 6. Excerpt from the NCBI BLAST output for a protein sequence from E.
fergusonii isolate ECD227 against the GenBank non-redundant database.
Figure 7. List of cgb commands for creating a custom UCSC Genome Browser.
Figure 8. Percent identity plot (top) and dot-plot (bottom) depicting the whole
genome pairwise alignments of a NAP1 isolate (QCD-66c26) versus a NAP7
isolate (QCD-23m63).
Figure 9. a) Phylogenetic tree of 14 C. difficile genomes constructed using SNP
data.
Figure 10. Distribution of SNPs that uniquely identify the NAP1 group of isolates.
Figure 11. Correlation of disease severity with SNPs from the TCS-ABC locus or
existing diagnostic methodologies.
Figure 12. A contiGo screenshot illustrating the E. fergusonii isolate ECD227
genome assembly.
Figure 13. Core facility read length distribution.
Figure 14. Base quality per core facility.
Figure 15. The O. novo-ulmi strain H327 genome assembly.
Figure 16. Homopolymer counts and overall accuracy.
Figure 17. Aspects of homopolymer accuracy.
Figure 18. Substitution error rate.
Figure 19. Test case study using BL!P for the analysis of 223 predicted proteins
from E. fergusonii ECD227.
Figure 20. Phylogenetic tree of 110 enteric bacteria and E. fergusonii ECD-227.
Figure 21. A linear representation of the ECD-227 chromosome.
Figure 22. Linear representation of the two largest ECD-227 plasmids;
pECD227_112 and pECD227_113.
Figure 23. Mortality rates (%) induced by ECD-227 compared to that induced by
clinically virulent E. coli D06-2195 and non-virulent E. coli K-12 in a day-old
chicks infection model.
LIST OF TABLES
Table 1. Characteristics of DNA sequencing platforms.
Table 2. List of cgb tasks and their commands.
Table 3. Characteristics of C. difficile isolates used in this study.
Table 4. Characteristics of C. difficile genome assemblies used in this study.
Table 5. Targets for species-level detection of C. difficile.
Table 6. Summary of participating core facilities and sequencing yield.
Table 7. Overview of the O. novo-ulmi strain H327 genome assembly.
Table 8. Homopolymer measurement statistics.
Table 9. Overview of the ECD-227 genome.
Table 10. Minimal inhibitory concentrations (MICs) of 28 antibiotics against
ECD-227.
Table 11. Antimicrobial resistance-associated genes of ECD-227.
Table 12. Gene content of different genomic islands of ECD-227.
Table 13. Virulence-associated genes of ECD-227.
TABLE OF ABBREVIATIONS
ABI Applied Biosystems
ABRF Association of Biomolecular Resource Facilities
AJAX Asynchronous JavaScript and XML
AMR Antimicrobial resistance
APEC Avian pathogenic Escherichia coli
ATCC American Type Culture Collection
BLAST Basic Local Alignment Search Tool
BLAT BLAST-like alignment tool
bp base pair
CDI Clostridium difficile infection
CDN Canadian
CFU Colony-forming unit
DDBJ DNA Data Bank of Japan
DNA Deoxyribonucleic acid
DSRG DNA Sequencing Research Group
EHEC Enterohaemorrhagic Escherichia coli
EMBL European Molecular Biology Laboratory
EST Expressed sequence tag
FAQ Frequently Asked Questions
Gb giga base pairs
GS Genome Sequencer
HTML HyperText Markup Language
INSDC International Nucleotide Sequence Database Collaboration
JSON JavaScript Object Notation
kb kilo base pairs
Mb mega base pairs
MPS Massively parallel DNA sequencing
MUGQIC McGill University and Génome Québec Innovation Centre
NCBI National Center for Biotechnology Information
nt nucleotide
OS Operating system
PCR Polymerase chain reaction
PDB Protein Data Bank
PFGE Pulsed-Field Gel Electrophoresis
PIR Protein Information Resource
PRF Protein Research Foundation
RNA Ribonucleic acid
SDA Severe disease-associated
SNP Single-nucleotide polymorphism
UCSC University of California, Santa Cruz
UK United Kingdom
UPEC Uropathogenic Escherichia coli
US United States
WGS Whole genome sequencing
ACKNOWLEDGEMENTS
Above all, I am grateful to my supervisor, Ken Dewar. His support,
encouragement, and (sometimes witty) comments enabled me to achieve a high
standard in scholarship and innovation. This dissertation benefited greatly from
his critique and drive for clarity, cohesion, and coherence. I value our friendship
and look forward to future collaborative endeavors.
I thank my supervisory committee, Joan Bartlett, Marcel Behr, Andre Dascal, and
Mathieu Blanchette, for their pertinent observations and guidance. I am also
thankful to Rob Sladek for his guidance during the early stages of thesis writing. I
am indebted to the DNA sequencing platform at the McGill University and
Genome Quebec Innovation Centre for their excellent work. Also at the Genome
Centre, I thank Gary Leveque, Pascale Marquis, and Jessica Wasserscheid for
testing and using my software programs and providing analytical support. Also, I
thank Kevin Ha, Tony Kwan, Sudeep Mehrotra, and Carl Murie, for helpful
bioinformatics discussions. I thank Joan Bartlett, Nikoleta Juretic, Caroline
Vincent, Sudeep Mehrotra and Carl Murie for thesis revisions and comments. I
thank Thomas Leslie and Kandace Springer of the Department of Human Genetics,
whose diligent administrative work made my graduate studies all the more
enjoyable. From Microsoft Research, I thank my mentor Simon Mercer for
offering me the opportunity to work/play in such a great environment. I thank the
Canadian Institutes of Health Research for awarding me a Doctoral Research
Award.
I am beholden to my parents, Anna and Giovanni, for providing the foundation
and support upon which this thesis is built. Also, I thank my three sisters Marisa,
Alba, and Valerie, who have been my personal cheerleaders from the start.
Lastly, I thank my wife, Zulay, for her endless love, support, and encouragement.
Our relationship began around the same time as this research began, and as this
thesis ends we embark on a new journey together as Mom and Dad. I dedicate this
thesis to our son Liam, whom I will love and support forever.
ORIGINALITY AND CONTRIBUTIONS TO KNOWLEDGE
In this thesis I have developed three software tools. These tools were used to
analyze data from three genome projects. Each tool has an original element:
Cgb automates the creation of custom UCSC Genome Browsers. Cgb is available
at https://github.com/vforget/cgb and has a manuscript in preparation (Chapter 2).
ContiGo is a general purpose tool for the analysis of genome assemblies that
operates within a web browser. The program is available at
https://github.com/vforget/contigo and has a manuscript in preparation (Chapter 4).
BL!P is a program to automate NCBI BLAST searches and dynamically explore
results. The program is available at http://blip.codeplex.com and has a manuscript
in preparation (Chapter 6).
Example demonstrations are made available on the website of each tool.
Combinations of these tools were used to analyze data from three genome
projects, leading to the following contributions to scientific knowledge:
The comparative genome analysis of Clostridium difficile discovered genetic
markers associated with severe-disease strains, as well as markers that could
detect C. difficile at the species level. I also found that the C. difficile
pan-genome is larger than previously estimated. These findings are published in
the Journal of Clinical Microbiology (Forgetta et al., 2011) (Chapter 3).
The multi-facility sequencing study determined that the Roche/454 GS-FLX platform
is reproducible across the tested core sequencing facilities. I also produced a
high-quality genome sequence of the fungal pathogen Ophiostoma novo-ulmi. These
findings are being prepared for submission (Chapter 5).
I characterized the genome sequence of a pathogenic and multi-drug resistant
strain of E. fergusonii from poultry. These findings are published in Poultry
Science (Forgetta et al., 2012) (Chapter 7).
CHAPTER 1: Bioinformatics and Genomics
Bioinformatics is intertwined with all aspects of genome analysis, from the
collection and processing of data to the analysis of results. This is primarily
due to the large data sets generated by high throughput DNA sequencing platforms.
Conventionally, genome analysis involves the use of sophisticated software, and
is largely performed by those with prior experience in bioinformatics using
adequate computational resources. This introduction aims to concisely describe
these three aspects — bioinformatics, genomic analysis, and DNA sequencing
platforms — and to demonstrate that a gap currently exists between genomic data
analysis and the biologist. First, I will concisely review the history of
bioinformatics; how it evolved from its early studies of molecular biology up to
its highly pervasive role in genomics research. Following this, I will review how
technological advancements in DNA sequencing platforms have transformed
genomic research, with particular emphasis on whole genome sequencing (WGS).
I will then continue to review the history of bioinformatics within the context of
WGS, focusing on three stages in a genome project: [i] assessment of a genome
assembly, [ii] display and integrated analysis of genomic data, and [iii] deriving
biological insight using public information. A short synthesis follows this
introduction, postulating, as others have, that there is a gap between biologists and
the ability to analyze large genomic data sets (McPherson, 2009; Morales &
Holben, 2011; Perez-Enciso & Ferretti, 2010; Stapley et al., 2010; Zhang et al.,
2011). The research objectives will state how I plan to address this gap between
genome analysis and the biologist. The introduction will conclude by outlining the
structure of this dissertation.
Concise History of Bioinformatics
Searls states that bioinformatics is the balanced combination of “episteme”, or
knowledge, and “techne”, or know-how (Searls, 2010). It is an interdisciplinary
enterprise that combines the fields of biology and computer science. As such,
bioinformatics concerns itself not only with the development of computational
tools, but also their use in deriving scientific knowledge from biological data.
During the late 1960s and early 1970s, when bioinformatics was in its infancy, it
was used primarily to understand aspects of molecular biology, such as phylogeny
(Fitch & Margoliash, 1967), evolution (Kimura, 1969), or the accessibility of
protein structures (Lee & Richards, 1971). The 1970s and 1980s saw further
developments in molecular biology, such as computing evolutionary distances
(Sellers, 1974) and approximate string matching (Ukkonen, 1985). But more
importantly, because we accumulated many nucleotide sequences, we developed
sequence alignment algorithms (Lipman & Pearson, 1985; Smith & Waterman,
1981; Wilbur & Lipman, 1983) and created resources for nucleotide data
submission such as GenBank (Bilofsky et al., 1986) and EMBL (Hamm &
Cameron, 1986), thus making this data publicly accessible and giving other
researchers the ability to analyze it. A few years later, the BLAST search tool was
created (Altschul et al., 1990), and its use via the internet to search GenBank
(Altschul et al., 1997) remains the first foray into bioinformatics for many
biologists. Up to the early 1990s, bioinformatics was primarily used to analyze
data produced by a laboratory experiment, such as a DNA sequence obtained
from an autoradiograph using the Sanger sequencing method (Sanger & Coulson,
1975).
Approaching the new millennium, advances made in DNA sequencing technology
enabled us to sequence entire genomes (Fleischmann et al., 1995). During this
time, bioinformatics tools were employed in every step of a genome project, from
determining the nucleotide sequence from a chromatogram (Ewing & Green,
1998; Ewing et al., 1998), to assessing the quality of DNA sequences (Chou &
Holmes, 2001), to managing and storing data (Parsons et al., 1999), to assembling
the genome from sequence reads (Gordon et al., 1998), to predicting genes
(Delcher et al., 1999), and finally, to submission of data to public repositories
such as GenBank (http://www.ncbi.nlm.nih.gov/Sequin/). Because it provided a
myriad of tools that are necessary for the collection, processing, and analysis of
genomic data, bioinformatics became a sophisticated and necessary component of
genome analysis.
During the genomics era, the progression of bioinformatics was impacted greatly
by multiple technological advancements in DNA sequencing. Therefore, to
provide proper context, I will briefly review these advancements in technology
and their impact within the context of genome sequencing. Following this, I will
continue the review of bioinformatics with emphasis on genome analysis.
Evolution of Genome Sequencing
The history of genome sequencing spans roughly 40 years and includes multiple
technological advancements. It has continually strived towards a single goal: to
rapidly and accurately determine the DNA sequence of an organism’s genome.
A genome is the complete genetically heritable information of an organism. It is
encoded as RNA in some viruses and as DNA in most other life forms. Genomic DNA
or RNA is most often double-stranded, but some viruses, such as the hepatitis C
virus (Kato, 2000), have single-stranded DNA or RNA genomes.
The genome plays a central role in determining the observable traits of an
organism, with genes and their regulators, known as transcription factors, being
the most well studied genomic elements that contribute to phenotype. The genome
is dynamic, with changes to the sequence occurring at varying scales. If these
changes occur within a functional element of a DNA sequence they may alter
phenotype. For example, single nucleotide changes are responsible for diseases
such as cystic fibrosis (Bobadilla et al., 2002), neurofibromatosis (MacCollin et
al., 1996), and sickle cell anemia (Rees et al., 2010) in humans. Alternatively,
short repetitive regions such as tri-nucleotide repeats, which are susceptible to
slippage during DNA replication, have been implicated in many human diseases
(Fu et al., 1991; Lindblad et al., 1996; Walker, 2007). At yet an even larger scale,
retrotransposons — genetic elements of hundreds to thousands of base pairs in
length — can amplify themselves and constitute a large portion of the mammalian
genome (Lander et al., 2001; Waterston et al., 2002). Most retrotransposons are
inactive, but studies suggest that they serve some functional role (Chueh et al.,
2009; Pi et al., 2010; Schmidt et al., 2012). Within a population, genetic variants
can be common or rare. In humans, common variants associated with diseases,
such as Alzheimer’s (Sillen et al., 2008), are typically of low penetrance,
contributing to phenotype in combination with other genetic variants or the
environment. Conversely, genetic variants of low frequency that are associated
with disease are in general highly penetrant and include the diseases mentioned
previously (cystic fibrosis, Huntington’s, etc.). Due to the genome’s importance in
determining phenotype, the scientific community has continually strived to
improve DNA sequencing technologies in order to determine the complete DNA
sequence of a genome. This has provided a foundation upon which we investigate
how the genome’s elements function and how this function is impacted by types
of DNA variation.
During the late 1970s, DNA sequencing became a mainstream laboratory method
because of advances made by Frederick Sanger (Sanger et al., 1977; Sanger &
Coulson, 1975). The throughput of the dideoxynucleotide sequencing method by
today’s standards is low, producing a few DNA sequence readouts (or reads) of a
few hundred bases per experiment. Nonetheless, this method was easy to use, and
led to the genome sequence of the bacteriophage φX174 (5,375 nucleotides)
(Sanger, et al., 1977), the human mitochondrion (16,569 base pairs) (Anderson et
al., 1981) and bacteriophage λ (48,502 base pairs) (Sanger et al., 1982). Since that
time, there have been at least two technological advancements in DNA
sequencing technology, each of which has resulted in dramatic increases in
throughput (Stratton et al., 2009) (Figure 1).
The first major advancement was the optimization and automation of the
dideoxynucleotide DNA sequencing method into fluorescence-based capillary
sequencers such as the ABI PRISM 3700/3730 DNA Analyzer from Applied
Biosystems. These platforms are still in use today and are capable of generating
hundreds of DNA sequence reads per instrument per day. In the late 1990s, several
laboratories deployed these platforms on a massive scale (hundreds per lab) and,
with the aid of automated data collection, sequenced the genomes of reference
organisms, such as human (Lander, et al.,
2001) and mouse (Waterston, et al., 2002). These projects employed a
hierarchical approach to genome sequencing (or hybrid thereof), where the
genome is fragmented into overlapping large-insert clones followed by shotgun
sequencing of these intermediates. This hierarchical process in combination with
the large number of DNA templates needed for shotgun sequencing made
sequencing entire genomes relatively costly and time consuming even for smaller
genomes. Even today, using the ABI 3730xl platform to obtain the DNA sequence
of a small bacterial genome (only 4 million base pairs (Mb)) using the shotgun
method would cost roughly C$240,000 and occupy four DNA sequencers for
about 5 weeks (computed using the Genome Project Cost Calculator (Forgetta &
Dewar, 2005)).
Figure 1. Improvements in DNA sequencing technology.
DNA sequencing technology has advanced considerably since the dideoxynucleotide
method using manual gel slabs. Initial advancements refined and automated this
method into fluorescence-based capillary sequencers (blue). Massively parallel
sequencing (red) employed numerous new methods, increasing throughput
dramatically while also reducing cost. Single molecule sequencing platforms
promise further increases in throughput without DNA amplification. Reprinted by
permission from Macmillan Publishers Ltd: Nature (Stratton, et al., 2009),
copyright (2009).
The advent of massively parallel DNA sequencing (MPS) technologies in
2005 changed this completely. These technologies combined advancements made
in DNA amplification and sequencing chemistries (Ronaghi et al., 1996; Ronaghi
et al., 1998; Shendure et al., 2005) with dramatic miniaturization and
parallelization of individual DNA sequencing reactions (Margulies et al., 2005;
Shendure, et al., 2005). These advancements were commercialized into DNA
sequencing platforms capable of generating hundreds of thousands to millions of
DNA sequence reads over a relatively short period of time, and at a substantially
lower cost than previous dideoxynucleotide-based technologies such as Applied
Biosystems ABI 3730xl (Table 1). Also, in contrast to the hierarchical strategy
used for past reference genomes, these technologies produce sequence data for an
entire genome.
In the 2006-2011 period, there were three MPS platforms in widespread use: the
Illumina®/Solexa Genome Analyzer (Bentley et al., 2008), the Roche/454
Genome Sequencer (GS) (Margulies, et al., 2005), and Applied Biosystems
SOLiD™ System (Shendure, et al., 2005). Each platform differs in cost and
throughput, but all are able to generate on average hundreds of thousands to
millions of sequence reads within a one to two week time frame (Table 1). For
example, as of late 2011, the cost to sequence a small bacterial genome (~4 Mb)
using the Roche/454 GS-FLX Titanium platform was below C$4,000 and could
be completed within one week. This platform was used to sequence the genomes of
the bacterial strains in Chapter 3 (Forgetta, et al., 2011) and Chapter 7 (Forgetta, et
al., 2012), and the fungal genome in Chapter 5 (in preparation). This increased
throughput and reduced cost has led to a growing demand for whole genome
sequencing from biologists studying a diverse set of organisms. This growth can
be observed by the increasing number of new species being submitted to the
NCBI Genome database, as well as the increase in the cumulative total number of
genomes, which includes the sequencing of multiple strains from the same species
(Figure 2).
Table 1. Characteristics of DNA sequencing platforms.

System (Vendor/Version)          Release          Reads         Read   Output         Run    Cost/     Runs/         Time/     Cost/       Cost/
                                    Date                 Length (bp)     (Mb)  Time (hrs)  Run ($)  Genome*^  Genome (hrs)    Mb ($)  Genome ($)
Applied Biosystems (ABI)/3730xl     2003             96          700   0.0672           2      200     1,191         2,382  2,976.19  238,095.24
Roche/454 GS 20                     2005        100,000          200       20          12   10,000         4            48    500.00   40,000.00
Illumina/GAIIx                      2006    320,000,000          216   69,120         288   10,000         1           288      0.14       11.57
ABI/SOLiD                           2007  1,000,000,000           85   85,000         360   20,000         1           360      0.24       18.82
Roche/454 GS FLX Titanium           2008      1,200,000          450      540          12   10,000         1            12     18.52    1,481.48
Illumina/HiSeq2000                  2010  1,200,000,000          300  360,000         288   24,000         1           288      0.07        5.33
Life Technologies/Ion Torrent       2011      2,000,000          220      440           4    1,000         1             4      2.27      181.82
Illumina/MiSeq                      2011     13,500,000          500    6,750          40    2,000         1            40      0.30       23.70
Pacific Biosciences/RS              2011         25,000        2,500       63           2      500         2             4      8.00      640.00
Roche/454 GS FLX+                   2012      1,200,000          650      780          26   10,000         1            26     12.82    1,025.64
Illumina/HiSeq2500                  2012    300,000,000          300   90,000          40    6,000         1            40      0.07        5.33
Life Technologies/Ion Proton     Q2-2012    500,000,000          220  110,000          12    1,500         1            12      0.01        1.09

* Genome size 4 Mb, 20x depth of coverage, i.e., minimum 80 Mb output required.
^ Values rounded to integer; multiple genomes per run possible.
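The derived columns of Table 1 appear to follow from the measured ones by simple arithmetic. The Python sketch below reproduces them for two rows under the assumptions stated in the table footnote (4 Mb genome, 20x coverage, hence 80 Mb of required output). The function and variable names are illustrative only, and the cost per genome is assumed to be the cost per Mb multiplied by the required output, which is what the tabulated values suggest.

```python
import math

GENOME_SIZE_MB = 4                        # from the Table 1 footnote
COVERAGE = 20                             # 20x depth of coverage
REQUIRED_MB = GENOME_SIZE_MB * COVERAGE   # 80 Mb of sequence needed


def platform_costs(reads, read_length_bp, run_time_hrs, cost_per_run):
    """Derive the per-genome columns of Table 1 from the per-run columns."""
    output_mb = reads * read_length_bp / 1_000_000
    runs = math.ceil(REQUIRED_MB / output_mb)   # whole runs needed for 80 Mb
    cost_per_mb = cost_per_run / output_mb
    return {
        "output_mb": output_mb,
        "runs_per_genome": runs,
        "time_per_genome_hrs": runs * run_time_hrs,
        "cost_per_mb": round(cost_per_mb, 2),
        # Assumption: cost scales with the 80 Mb used, because several
        # genomes can share one high-output run.
        "cost_per_genome": round(cost_per_mb * REQUIRED_MB, 2),
    }


abi_3730xl = platform_costs(reads=96, read_length_bp=700,
                            run_time_hrs=2, cost_per_run=200)
gs_flx_titanium = platform_costs(reads=1_200_000, read_length_bp=450,
                                 run_time_hrs=12, cost_per_run=10_000)

for name, row in [("ABI 3730xl", abi_3730xl),
                  ("454 GS FLX Titanium", gs_flx_titanium)]:
    print(name, row)
```

Under these assumptions the capillary platform needs more than a thousand runs for one small bacterial genome, whereas a single 454 Titanium run is sufficient.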
Figure 2. Species counts from the NCBI genome project database.
As of 2011, the NCBI genome project database has accumulated genome
sequences from over 3,000 species (dotted line). The number of new species
deposited on a yearly basis has increased since 2003 (bars), with 910 new species
deposited in 2011 alone.
[Figure 2 plot: species count (y-axis, 0 to 3,500) by year of database submission (x-axis, 2003 to 2011).]
Throughout this growth period and up to the present day, MPS platforms have
been continually improved, and a new platform, the Ion Torrent by Life
Technologies (Rothberg et al., 2011), has been released. In general, these
improvements are producing a greater number of longer reads at lower cost. Also,
these platforms are increasingly being applied to other areas of research, such as
measuring gene expression (Morin et al., 2008), DNA methylation status
(Korshunova et al., 2008), and protein-DNA interactions (D. S. Johnson et al.,
2007).
In addition to differences in cost and throughput, current MPS platforms also vary
in their error profiles (Gilles et al., 2011; Nakamura et al., 2011; Victoria et al.,
2012). For instance, the Illumina®/Solexa Genome Analyzer is susceptible to
substitution-type error (Victoria, et al., 2012), whereas the Roche/454 Genome
Sequencer is prone to incorrectly sequencing long monomeric repeats (Balzer et
al., 2010; Gilles, et al., 2011; Margulies, et al., 2005) (Chapter 5). These
platform-specific error rates further complicate data analysis, particularly when
sequence data from these platforms are combined. As a result, multiple
bioinformatics tools have begun to emerge that model these platform-specific error
rates (Balzer, et al., 2010; McElroy et al., 2012; Richter et al., 2008).
Currently, we are witnessing yet another change in DNA sequencing technology.
In addition to providing even longer reads (>1000 bp), platforms such as the
Pacific Biosciences RS (Eid et al., 2009) and the upcoming Oxford Nanopore
(http://www.nanoporetech.com) will sequence single DNA molecules, which is in
contrast to existing MPS platforms that require DNA amplification to achieve
sufficient signal for nucleotide detection. Single molecule sequencing is
advantageous because it avoids potential bias caused by DNA amplification and
reduces costs further (Table 1). Also, these systems are able to directly detect
methylated nucleotides (Flusberg et al., 2010), measure enzyme kinetics of single
polymerase molecules (Metzker, 2009), or monitor in real-time tRNA transit
within ribosomes (Uemura et al., 2010). The Oxford Nanopore also has the
potential to identify proteins using aptamer oligonucleotides (Cheley et al., 2006;
Howorka et al., 2004).
Coevolution of Bioinformatics and Genome Analysis
Genome sequencing and analysis rely on computers and bioinformatics to
analyze large and complex datasets. For instance, sequencing even a small
bacterial genome will generate a few hundred thousand sequence reads that will
assemble into a genome sequence that contains a few million base pairs and a few
thousand genes. Consequently, a myriad of sophisticated bioinformatics tools
have been developed to analyze data across the entire lifespan of a genome
project. During this lifespan, genomic data pass through three important steps,
which are described in the sections that follow:
1. Assessment of a genome assembly: how well the genome was pieced
together from individual sequence reads;
2. Display and integrated analysis of genomic data: how we visualize and
interact with a genome sequence and its annotations;
3. Deriving biological insight using public information: how we compare
genomic data to information in public repositories.
Assessment of a Genome Assembly
In genomics, assembly is the process of piecing together a genome sequence from
the set of individual sequence reads. The algorithm utilized by many assembly
programs can be generalized to:
i. determine the pair-wise similarity between sequence reads,
ii. group sufficiently similar reads together, and,
iii. compute a consensus sequence from the read overlaps within each
group.
The resulting assembly will consist of contiguous DNA sequences called contigs,
which represent genomic regions where the assembly program was capable of
merging (or piling up) reads and computing a consensus sequence (Figure 3).
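The three generalized steps above can be illustrated with a deliberately simplified Python sketch. It uses greedy suffix-prefix merging, which is a stand-in for the more elaborate algorithms in real assemblers, and it assumes error-free reads that all come from the same strand and are not contained within one another. Under those assumptions the consensus step reduces to concatenating the non-overlapping part of each read. All names and the example reads are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        # Step i: pair-wise similarity (here, exact suffix-prefix overlap).
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Steps ii and iii: group the two sequences and build their consensus.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


example_reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "GGGGAAAC"]
example_contigs = greedy_assemble(example_reads)
print(example_contigs)
```

The first three reads overlap and collapse into one contig, while the fourth shares no overlap with the others and remains a separate contig, mirroring the gaps that separate contigs in a draft assembly.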
Figure 3. The genome sequencing and assembly process.
Whole genome sequencing begins with the fragmenting of the genome into small
DNA templates. Templates are sequenced, producing a set of sequence
reads that are assembled into a set of contigs. Contigs are assembled by
computing the overlap between sequence reads, forming a pileup from which the
contig consensus sequence is determined.
Furthermore, contigs can be ordered and oriented into a draft genome sequence
using two methods. One method orders and orients contigs in relation to an
existing genome sequence (i.e., reference) of a similar strain or species. This was
the approach used for the Clostridium difficile strains in Chapter 3 (Forgetta, et
al., 2011) and Escherichia fergusonii ECD227 in Chapter 7 (Forgetta, et al.,
2012). The second method, which does not rely on a reference, incorporates
paired-end reads into the assembly process (Guillaume et al., 2009). This method was
used for the fungus Ophiostoma novo-ulmi in Chapter 5 (in preparation). This
draft genome sequence can be further refined in a process named genome
finishing. An important aspect of genome finishing is gap closure, where regions
between adjacent contigs (i.e., gaps) are resolved by designing primers to amplify
and sequence the intervening region. This is followed by local re-assembly
of these read sequences with the adjacent contigs, resulting in a longer consensus
sequence. We utilized this gap closing procedure for several strains of C. difficile
in Chapter 3 (Forgetta, et al., 2011).
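As a minimal sketch of the reference-based method, the Python code below orders contigs by where they align on a reference and reverse-complements those that align to the opposite strand. It assumes the alignment has already been computed by some other tool; the placement dictionary, contig names, and sequences are hypothetical and do not represent the procedure used in Chapters 3 or 7.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def order_and_orient(contigs, placements):
    """Order contigs by reference start and flip minus-strand contigs.

    contigs:    dict of contig name -> sequence
    placements: dict of contig name -> (reference start, strand),
                where strand is "+" or "-" (assumed precomputed)
    """
    ordered_names = sorted(placements, key=lambda name: placements[name][0])
    draft = []
    for name in ordered_names:
        seq = contigs[name]
        if placements[name][1] == "-":
            seq = reverse_complement(seq)
        draft.append((name, seq))
    return draft


example_contigs = {"contig1": "AAGT", "contig2": "CCGA"}
example_placements = {"contig1": (500, "-"), "contig2": (10, "+")}
draft_order = order_and_orient(example_contigs, example_placements)
print(draft_order)
```

The regions between consecutive contigs in the resulting order are the gaps that genome finishing would then attempt to close.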
Multiple factors can negatively affect the quality of a genome assembly, arising
either from elements within the genome itself or from the performance of the DNA
sequencing experiment. For example, if the genome contains elements such as repeats or
paralogous genes, the assembly process may falsely order these elements or
merge them into one contig (Phillippy et al., 2008). Errors such as these have
been observed in the human genome (Bailey et al., 2001; Eichler, 2001), and
when they occur in regions associated with human disease (Mazzarella &
Schlessinger, 1998) may lead to false associations. Another factor that can impact
the quality of an assembly is low read coverage, which can be due to insufficient
reads from the sequencing experiment, or the assembly process itself. Low read
coverage may produce erroneous base calls in the contig sequence (Hubisz et al.,
2011), which will negatively affect downstream analyses. For example, a
sequence error may falsely predict a stop codon within a coding region, resulting
in the false prediction of gene structure (Hoff, 2009).
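To make the stop codon example concrete, the following Python sketch shows how a single miscalled base can introduce a premature in-frame stop codon. The coding sequences are invented for illustration, and the check uses only the three standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def first_stop_codon(cds):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


# Hypothetical five-codon gene: ATG TGG CAA GGT TAA
true_cds = "ATGTGGCAAGGTTAA"
# Same gene with one base miscalled (G to A), turning TGG into TGA
miscalled_cds = "ATGTGACAAGGTTAA"

true_stop = first_stop_codon(true_cds)
false_stop = first_stop_codon(miscalled_cds)
print(true_stop, false_stop)
```

A gene predictor working from the miscalled sequence would truncate the gene after its first codon, even though the underlying genome encodes the full-length product.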
In general, assessment of a genome assembly aims to answer the following
questions: “Is the assembly of sufficient coverage, or is more sequencing
required?”, “What is the high-quality portion of the assembly and what artifacts,
such as collapsed repeats, could impact downstream analysis?” Finding answers
to these types of questions is typically performed using an assembly viewer, as
well as additional methods such as spreadsheets that display statistics about
contigs, such as average size or average depth of read coverage. Other in silico
methods used to evaluate genome assemblies include the consistency of mate-pair
(or paired-end) insert sizes, the percentage of high quality bases, as well as
comparative analyses such as alignment of contigs or scaffolds to the genome
sequence of a closely related strain or species, or assessing the completeness of
gene content by aligning EST sequences. In addition to purely computational
analyses, traditional experimental methods can also be used to assess the quality
of a genome assembly, such as comparing chromosomal sizes as determined by
pulsed-field gel electrophoresis (PFGE) to scaffold lengths, and comparing results
from optical mapping or restriction digests to their in silico predictions.
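The contig statistics mentioned above are straightforward to compute. The Python sketch below derives the average contig size and an approximate average depth of read coverage; it also reports N50, a widely used contiguity measure that is not named in the text and is included here only as an illustrative addition. The input values are hypothetical.

```python
def assembly_summary(contig_lengths, total_read_bases):
    """Summarize an assembly from contig lengths and total sequenced bases."""
    total_length = sum(contig_lengths)
    mean_length = total_length / len(contig_lengths)

    # N50: the length of the contig at which the running total, taken from
    # the longest contig downwards, first reaches half the assembly length.
    running_total = 0
    n50 = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total * 2 >= total_length:
            n50 = length
            break

    return {
        "contig_count": len(contig_lengths),
        "total_length": total_length,
        "mean_length": mean_length,
        "n50": n50,
        # Approximation: assumes every sequenced base was assembled.
        "mean_depth": total_read_bases / total_length,
    }


summary = assembly_summary([100, 200, 300, 400], total_read_bases=20_000)
print(summary)
```

A low mean depth would prompt the first question posed above, namely whether more sequencing is required.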
In the late 1990s, during the era of capillary-based fluorescence DNA sequencing
platforms, assembly viewers were developed to support reference genome
projects, and include programs such as Consed (Gordon, et al., 1998) and the
Staden sequence analysis package (Staden, 1996). Primarily, these programs were
used by bioinformaticians or genome analysts to refine the local assembly of
large-insert clones by correcting errors and incorporating additional sequence
reads to fill assembly gaps. At the time, these programs were compatible only
with Unix-like operating systems. The Staden package has since been ported to
other operating systems (Bonfield & Whitwham, 2010).
The high cost of genome finishing meant it was used for only small genomes or
for reference genome projects, such as human or mouse. As a result, many
genome assemblies were left in an unfinished draft stage, often containing errors
(Salzberg & Yorke, 2005). In response to this, Schatz et al. (2007) developed
Hawkeye, a program to assist genome finishing, and to increase the quality of
draft genome assemblies without finishing by identifying assembly errors. Like
Consed and the early Staden package, Hawkeye is compatible with Unix-like
operating systems, but uniquely offers numerous analytical views of the assembly
to aid in the detection of assembly errors. More modern assembly viewers support
other operating systems such as Microsoft Windows or Mac OS X (Bao et al.,
2009; Hou et al., 2010; Huang & Marth, 2008; Li et al., 2008; Milne et al., 2010).
Recent assembly viewers are tailored towards reference-based genome analyses,
such as identifying genetic variants from the mapping of sequence reads to a
reference genome (Bao, et al., 2009; Huang & Marth, 2008; Li, et al., 2008).
Other recent viewers, including Tablet (Milne, et al., 2010) and MagicViewer
(Hou, et al., 2010), support a more general analysis of a genome assembly;
however, they lack specific functionality to detect assembly errors or to assess the
quality of a genome assembly. All currently available assembly viewers require
installation onto a personal computer and a local copy of the genome assembly.
This quality assessed genome sequence is the foundation upon which we overlay
biological information such as the location of genes or repeats. The section that
follows describes tools that assist in the display and integrated analysis of a
genome sequence and the elements it contains.
Display and Integrated Analysis of Genomic Data
A genome sequence can be simply abstracted as the one-dimensional order of
nucleotides along a horizontal axis, with the position of nucleotides ordered
horizontally from left to right. Upon this coordinate system, we can position
genomic elements, such as genes, by defining where they start and end. This
process of giving meaning to regions in the genome sequence is defined as
annotation, and can represent static or dynamic content. Static elements include
genes, repeats, or transcription factor binding sites, and dynamic elements include
gene expression values from a particular tissue at a specific point in time. These
annotations can be represented as a second dimension to the genome sequence;
for any given position or range, there may exist one or more annotations, and
these annotations are vertically stacked from top to bottom as tracks of
information. Representing a genome sequence along with its annotations in this
manner is a fundamental feature of modern genome browsers (Hubbard et al.,
2002; Kent et al., 2002; Robinson et al., 2011; Stein et al., 2002), which are
software programs used to view and analyze genome annotations.
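The coordinate-plus-tracks abstraction described above maps naturally onto a small data structure. The Python sketch below stores each track as a list of named intervals and retrieves, for a given genomic range, the features from every track that overlap it, which is essentially what a browser does when drawing a region. The track names, feature names, and coordinates are hypothetical, and 1-based inclusive coordinates are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    start: int   # 1-based, inclusive (an assumption of this sketch)
    end: int     # 1-based, inclusive
    name: str


def features_in_range(tracks, start, end):
    """Return, for each track, the features overlapping [start, end]."""
    return {
        track_name: [f for f in features
                     if f.start <= end and f.end >= start]
        for track_name, features in tracks.items()
    }


example_tracks = {
    "genes": [Feature(100, 900, "geneA"), Feature(1200, 2000, "geneB")],
    "peptides": [Feature(150, 180, "pep1"), Feature(400, 430, "pep2")],
}

visible = features_in_range(example_tracks, 1, 1000)
for track_name, features in visible.items():
    print(track_name, [f.name for f in features])
```

Stacking the per-track results vertically against the shared horizontal coordinate gives the familiar genome browser view, and counting the peptides that fall within each gene is one way to perform the kind of cross-track correlation discussed for Figure 4.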
An example of using a genome browser to integrate and display genomic data is
presented in Figure 4. Using a custom instance of the UCSC Genome Browser
(created using cgb, see Chapter 2), this figure illustrates a region in the C. difficile
genome (Forgetta, et al., 2011) containing annotations for static elements such as
genes and cellular localization, as well as dynamic elements, such as peptides
from a proteomics experiment on a cell-wall protein extract (LaBoissière et al.,
2005). Visualizing genomic data in this manner allows us to correlate information
across multiple annotations. For instance, in Figure 4 we observe that the slpA
gene (top-most track), which is predicted to be expressed in the bacterial cell wall
(middle track), has more peptides associated with it than the neighboring cell wall
genes (bottom track). Also, genes in the vicinity which are predicted to be
elsewhere in the cell have no mapped peptides (Figure 4). These observations
suggest that slpA is present in greater abundance in the cell wall relative to
other predicted cell wall proteins, and that our extraction method is specific to
proteins from this cellular location (LaBoissière, et al., 2005). In addition to these
types of correlation-based analyses, genome browsers can also be used to
visualize the genome-wide distribution of an annotation such as cross-species
sequence conservation, with areas in the genome with exceptional trends
indicating potentially biologically relevant events. For example, genome regions
with excessive conservation suggest evolutionary constraint (Bejerano et al.,
2004) and may possess biological function (Cheley, et al., 2006).
Figure 4. A custom instance of the UCSC Genome Browser for C. difficile isolate
QCD-66c26.
The screenshot depicts a 36kb region of the C. difficile genome containing the
slpA (name in black background) and neighboring genes. Each gene has a
prediction for protein localization (color coded). Individual peptides from a
proteomics experiment that were mapped to the genome sequence are depicted
below the cellular localization track (black).
Available genome browsers can be divided into two categories: those that are
internet accessible, operating from within a web browser, and those that are
standalone desktop applications. Of the internet-based genome browsers created
to house reference genomes, such as human (Lander, et al., 2001) and mouse
(Waterston, et al., 2002), four are currently in widespread use: the UCSC Genome
Browser (Kent, et al., 2002), the Ensembl Genome Browser (Flicek et al., 2011),
the NCBI Map Viewer (NCBI, 2011) and the Generic Genome Browser (Stein, et
al., 2002). The desktop-based genome browsers are mainly used to assist the
manual curation of genome annotations (Lewis et al., 2002), or to visualize small
reference genomes (Rutherford et al., 2000). Genome browsers typically include
tools to search, mine, and filter the annotation database (Haider et al., 2009;
Karolchik et al., 2004), as well as tools for sequence alignment such as BLAST
(Altschul, et al., 1990; Kent, 2002) and BLAT (Kent, 2002). Also, the internet-
based genome browsers support custom annotations, such as loading results from
experiments, or the filtering and combining of existing annotations. Of the four
internet-based browsers mentioned, only the Generic Genome Browser is
designed to support non-reference genomes via a semi-guided installation of the
source code and genome data on a web server.
Up to this stage, genomic data analysis has relied chiefly on internally generated
data sets; a genome assembly and a genome sequence with annotations. However,
another important process in a genome project is comparing internally generated
genomic data to publicly available information. Performing such comparisons
allows us to investigate the genetic relationships to other species or strains
(phylogeny), and to validate or discover the function of genes and other genomic
elements (functional annotation).
Deriving Biological Insight Using Public Information
In the biological sciences, public data repositories play a crucial role in storing,
disseminating, and curating information. This information includes the sequence
of genomes, genes, and proteins, as well as metadata about the biological function
and taxonomic information for each sequence. The largest public repository of
DNA and RNA sequences is organized by the International Nucleotide Sequence
Database Collaboration (INSDC), which is an international collaboration between
three organizations (DNA Databank of Japan (DDBJ) at the National Institute for
Genetics in Mishima, Japan; the European Molecular Biology Laboratory’s
European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and the National
Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA) to
exchange biological sequence data. Within NCBI, biological sequences and
associated information are stored in the GenBank database (Benson et al., 2011).
Protein sequences stored in NCBI GenBank include records from the
UniProtKB/SwissProt (Magrane & Consortium, 2011), PIR (Wu et al., 2002),
PRF (http://www.prf.or.jp), and PDB (Berman et al., 2000) databases. As an
example, a GenBank record for the clpA protein from E. fergusonii isolate ECD-
227 (Forgetta, et al., 2012) is presented in Figure 5. In addition to the protein
sequence, such records typically contain metadata concerning the location of
functional domains, as well as information regarding the source of the sequence
(Figure 5). NCBI GenBank contains over 135 million sequence records (as of
May 2012), and is an invaluable asset to biomedical research because it allows
access to the combined knowledge of many researchers from around the world.
A common use of biological sequence databases such as GenBank is to
characterize a novel biological sequence, such as a predicted gene. Using
sequence similarity, the function and taxonomic classification of a biological
sequence can be inferred by comparing its sequence to those within public
repositories. For example, a predicted gene that is highly similar to an already
characterized sequence in a public repository may have a similar function or be of
a related species or strain. To enable this type of analysis, public repositories offer
the ability to perform sequence similarity searches against their biological
sequence databases. A leading method for biological sequence search is to run
the basic local alignment search tool (Altschul, et al., 1990), or BLAST,
against the NCBI sequence databases, such as GenBank. For example, the
annotation of the protease clpA protein from E. fergusonii (Figure 5) was based on
information gathered from comparison to the GenBank database using NCBI
BLAST (Forgetta, et al., 2012). The output of searching GenBank using the NCBI
BLAST web service for the clpA protein is presented in Figure 6. Additional
BLAST output formats are also available and are described on the NCBI website
(http://www.ncbi.nlm.nih.gov/books/NBK21097/).
The results obtained from querying a few sequences can be
analyzed manually using the NCBI BLAST web service. However, when
analyzing the result from querying a larger sequence dataset, such as the entire
predicted protein set of a bacterial genome, this process becomes more difficult to
accomplish for at least two reasons. First, due to NCBI usage policies
(http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#Queuetime),
the querying of large sequence data sets
(thousands of sequences) is impractical using the NCBI BLAST web-browser
interface. As a result, this often requires us to download and install BLAST and
the NCBI databases locally. Second, querying a large sequence dataset will
undoubtedly result in a large set of alignment results. Manually inspecting these
results is time-consuming and tedious, and is prone to bias and human
error. Numerous software applications have been developed to address one or
both of these concerns, with their evolution being gradual and towards the use of
graphical user interfaces and advanced data visualizations.
Figure 5. NCBI GenBank record for the clpA protein from E. fergusonii isolate ECD227.
In addition to the protein sequence (D) the GenBank record contains metadata such as taxonomy (A), source information (B), and
functional domains (C).
Figure 6. Excerpt from the NCBI BLAST output for a protein sequence from E.
fergusonii isolate ECD227 against the GenBank non-redundant database.
For each database hit (A), the NCBI BLAST output includes a one line
description comprising the GenBank accession (hyperlinked), a short description
of the sequence, and the alignment statistics (e.g., score [maximum and total]).
The output also includes the pair-wise alignment between the query sequence and
the matched database sequence (B).
Developed in 2001, MuSeqBox (Xing & Brendel, 2001) is an application that
post-processes BLAST results obtained from the NCBI BLAST web service or a
local installation of BLAST and a biological sequence database. It is a
command-line program that filters BLAST output on criteria, such as minimum
percent identity, producing an output in tabular format. Similar to MuSeqBox,
BioParser processes pre-computed BLAST results (Catanho et al., 2006).
However, BioParser stores BLAST results in a relational database, allowing for
the accumulation of results across many executions of the BLAST program and
the filtering of the results via a graphical user interface. The results are stored in
the relational database or can be exported to text files. Further improving the
automatic analysis and filtering of BLAST results is the PLAN program (J. He et al.,
2007). This tool is a web browser-based program that uses a local installation of
BLAST. Unlike previously mentioned programs, PLAN is an end-to-end solution
that is accessible via the internet (http://bioinfo.noble.org/plan/). Installation of
the PLAN web platform requires a computer with the necessary software
programs and system administration expertise (see
http://bioinfo.noble.org/plan/docs/install.htm). The BLAST output viewer (BOV)
(Gollapudi et al., 2008) is also web-based and has functionality similar to
MuSeqBox and BioParser. However, unlike these programs it can visualize
multiple pair-wise alignments between two query sequences in graphical format
(for an example see http://cas-bioinfo.cas.unt.edu/cgi-bin/BOV/tutorial.cgi).
Advancing data visualization even further is Circoletto (Darzentas, 2010), which
utilizes the popular Circos (Krzywinski et al., 2009) program to visualize data
using circular organization. In general, these programs are sophisticated, requiring
prior experience in bioinformatics or specific computational resources.
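The kind of post-processing these tools perform can be sketched in a few lines of Python. The example below reads BLAST results in the 12-column tabular format (as produced by BLAST+ with -outfmt 6), discards hits below a minimum percent identity, and keeps the highest-scoring remaining hit for each query. It is an illustration of the general filtering idea only, not the implementation of MuSeqBox, BioParser, or any other program named above, and the hit records are invented.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, min_identity=90.0):
    """Return the best hit per query among hits with pident >= min_identity."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["pident"]) < min_identity:
            continue
        query = hit["qseqid"]
        if (query not in best
                or float(hit["bitscore"]) > float(best[query]["bitscore"])):
            best[query] = hit
    return best


# Hypothetical results for two query proteins.
example_output = io.StringIO(
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "q1\ts2\t91.0\t200\t18\t0\t1\t200\t5\t204\t1e-80\t300\n"
    "q2\ts3\t45.0\t150\t80\t2\t1\t150\t10\t159\t1e-5\t50\n"
)

filtered = best_hits(example_output, min_identity=90.0)
for query, hit in filtered.items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

Even this small script presupposes a local BLAST run and familiarity with the command line, which illustrates why such workflows remain difficult for biologists without bioinformatics experience.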
Synthesis
The analysis of massive genomic datasets is possible only with the use of
bioinformatics tools, and MPS technologies have made genomics studies
affordable to individual biologists. For many of these biologists, this will be their
first foray into genomic analysis, which commonly involves using software that
requires bioinformatics experience and sufficient computer resources. As a result,
the promise that genome analysis brings to the biologist is only half fulfilled;
providing affordable whole genome data sets without the ability for
straightforward analysis. McPherson was the first to observe such a gap in 2009:
“… the gap between large-scale genome centers and individual investigators may
seem to be growing, not shrinking, as the next generation platforms’ apparent
promise of a ‘Genome Center in a box’ may have only been half delivered,
providing data without a full suite of tools.”
McPherson (2009)
Since 2009, others have commented about this gap in bioinformatics, either
involving an entire sub-field of biology (Morales & Holben, 2011) or more
specifically about computational bottlenecks (Perez-Enciso & Ferretti, 2010;
Zhang, et al., 2011), or concerning the interpretation of results (Stapley, et al.,
2010).
Research Objectives
This dissertation describes advances I have made to develop bioinformatics
applications that close this gap between genome analysis and the biologist. To
accomplish this goal, the thesis objectives were to develop applications that:
i. Are intended for researchers with limited computer or bioinformatics
expertise,
ii. Address the limitations of existing software to analyze genomic data from
individual genomics projects, and
iii. Encapsulate the bioinformatics know-how derived from the analytical
processes of real-world genome projects.
During my PhD research, I have conducted bioinformatics analyses across several
genome sequencing projects. These projects were used as vehicles to develop
applications for the following tasks in genome sequence analysis:
i. Display and integrated analysis of genomic data,
ii. Assessment of a genome assembly, and
iii. Deriving biological insight using public information.
Thesis Outline
This dissertation contains six manuscripts (Chapters 2 to 7) for which I am the
first author. The manuscripts are paired into three parts representing the three
processes listed in the research objectives:
Display and integrated analysis of genomic data (Chapters 2 and 3),
Assessment of a genome assembly (Chapters 4 and 5),
Deriving biological insight using public information (Chapters 6 and 7).
Within each part, the first manuscript concerns the bioinformatics application I
developed (Chapters 2, 4, and 6). The second manuscript concerns the research
that I conducted in the genome sequencing project (Chapters 3, 5, and 7), which
also served as inspiration for the applications. Introducing each part is connecting
text, explaining how limitations I encountered during my genome analyses using
existing software inspired me to develop the application. To conclude the
dissertation (Chapter 8), I will describe the impact my genomics research and
applications have had beyond the scope of this thesis. Also, for each
bioinformatics application, I will discuss aspects for improvement, such as new
features, usability, and scalability to larger data sets.
Connecting Text
My initial research coincided with a project that aimed to identify DNA-based
diagnostic targets of C. difficile using comparative genomics (Chapter 3)
(Forgetta, et al., 2011). At the time, the Roche/454 GS Platform (Margulies, et al.,
2005) was an emerging technology. As a result, my objective was to develop a
tool to support the assembly, annotation and comparative analyses of genomes
sequenced on this platform. I tested existing tools (Lewis, et al., 2002; Rutherford,
et al., 2000; Stein, et al., 2002), and chose to use the UCSC Genome Browser
(Kent, et al., 2002) because it had superior display capabilities, built-in sequence
analysis tools (i.e., BLAT (Kent, 2002)), advanced data filtering (i.e., the Table
Browser (Karolchik, et al., 2004)), and custom annotations (custom tracks). Also,
because it was internet-accessible, it facilitated collaboration among the
researchers and technicians working on the project.
At the time, the UCSC Genome Browser was used to view and analyze publicly
available reference genomes, such as human and mouse, or to establish mirrors of
these genomes at different geographic locations. The UCSC Genome Browser did
not officially support the loading of non-reference genomes, such as multiple C.
difficile genomes. Also, it had no established security measures to restrict access,
a feature that was necessary to limit data access prior to peer-reviewed
publication. For these reasons, I developed automated methods to load non-
reference (or custom) genomes into a locally installed copy of the UCSC Genome
Browser, and to protect it using a username and password. To my knowledge, I
was the first to use the UCSC Genome Browser in this way. The methods I
developed are encapsulated into a program named cgb (Chapter 2, in preparation).
The publicly accessible custom UCSC Genome Browser for C. difficile is
available at http://genomequebec.mcgill.ca/compgen/browser/cgi-bin/hgGateway.
This resource helped us to assemble, annotate, and analyze 10 C. difficile
genomes. The multiple genome analysis allowed us to identify 18 single
nucleotide polymorphisms that detected multiple severe-disease causing strains,
as well as 12 highly conserved genes that could detect C. difficile at the species
level. In addition, our whole genome multiple alignment-based analyses suggest
that the C. difficile pan-genome is three times larger than previously estimated
(Chapter 3) (Forgetta, et al., 2011).
Contribution of Authors
I created cgb and prepared and wrote the manuscript (Chapter 2). The C. difficile
study was conceived by the senior authors of the publication (Forgetta, et al.,
2011). I oversaw and conducted all the data analyses and prepared and wrote the
manuscript (Chapter 3). The sequencing teams at the Genome Center at
Washington University School of Medicine and the McGill University and
Genome Quebec Innovation Centre performed the sequencing of the C. difficile
isolates. Matthew T. Oughton, M.D., FRCPC, performed aspects of genome
finishing and gap closure for some C. difficile strains, selected candidate genes for
re-sequencing, and designed PCR and sequencing primers. Pascale Marquis was
responsible for genome finishing and gap closure of the remaining C. difficile
strains, as well as aspects of genome annotation.
CHAPTER 2: Cgb − A Unix Shell Program to Create Custom Instances of
the UCSC Genome Browser
Vincenzo Forgetta¹ & Ken Dewar¹
¹Department of Human Genetics, McGill University, Montreal, Quebec, Canada
A modified version of this manuscript is published as an e-print on arXiv and is
available at
http://arxiv.org/abs/1211.1607
The source code and documentation are available at
http://github.com/vforget/cgb
Abstract
The UCSC Genome Browser is a popular tool for the analysis of reference
genomes. Mirrors of the UCSC Genome Browser exist at multiple geographic
locations, and this mirror procedure has been modified to support custom genome
sequences. While straightforward, this procedure is lengthy and tedious and
would benefit from automation, especially when processing many genome
sequences. We present a Unix shell program that facilitates the creation of custom
UCSC Genome Browsers. It automates many steps of the browser creation
process, provides password protection for each browser instance, and automates
the creation of basic annotation tracks. As an example we generate a custom
UCSC Genome Browser for a bacterial genome obtained from a massively
parallel sequencing platform.
Introduction
In the past, large institutions sequenced de novo the genome of organisms such as
human (Lander, et al., 2001), mouse (Waterston, et al., 2002) and fly (Adams et
al., 2000), and bioinformatics tools were created to give the scientific research
community the means to access and analyze these reference genome resources. For
instance, the scientific community routinely uses genome browsers to visualize
and analyze a reference genome’s sequence and annotations, with popular
browsers being the UCSC Genome Browser (Kent, et al., 2002), the NCBI Map
Viewer (NCBI, 2011) and the Ensembl Genome Browser (Hubbard, et al., 2002).
These browsers provide a common set of functionality, such as data visualization
and text search, but each also offers functionality that makes it unique. For
example, the UCSC Genome Browser has tight integration with the BLAT
sequence alignment tool (Kent, 2002), advanced database search with the UCSC
Table Browser (Karolchik, et al., 2004), and extensibility via custom annotation
tracks (http://genome.ucsc.edu/FAQ/FAQcustom.html). These features make this
browser a leading resource for the analysis of over 40 reference genomes. As of
mid-2012, the UCSC Genome Browser receives over 600,000 hits per day
(http://genome.ucsc.edu/admin/stats/, accessed 28/05/12), and has been cited in
more than 2,000 peer-reviewed articles.
Today, massively parallel DNA sequencing (MPS) technology (Bentley, et al.,
2008; Margulies, et al., 2005; Shendure, et al., 2005) has reduced the cost of DNA
sequencing dramatically, allowing individual researchers to sequence the genomes
of many organisms. However, the UCSC Genome Browser remains primarily a
community-based resource, which excludes individual researchers from using this
tool in their analysis of non-reference genome sequences. Recently, the mirror site
installation procedure for the UCSC Genome Browser
(http://genome.ucsc.edu/admin/mirror.html) has been modified to support
non-reference genome sequences
(http://genomewiki.ucsc.edu/index.php/Minimal_Browser_Installation),
but the procedure is lengthy and tedious, making it cumbersome to perform for
many genome sequences.
We have created a Unix shell program, cgb, which facilitates the creation of
custom instances of the UCSC Genome Browser. Each browser instance can be
password-protected and can contain multiple genome sequences. We also include
functionality to build genome sequences from contigs or scaffolds, and to
automatically create basic annotations. Here we briefly describe the
implementation and functionality of cgb, and provide an example usage case.
Methods
Cgb is written in the bash (Bourne-Again shell) scripting language. We chose this
language because of its ubiquity on Unix-like platforms and its ability to execute
the external programs required to set up an instance of the UCSC Genome
Browser. Cgb relies on a functional installation of the UCSC Genome Browser
(for more information see INSTALL.txt that is provided with cgb), but it does not
require reference genome sequences or annotations. Each custom browser
instance is secured using an Apache distributed configuration file (.htaccess
file).
We also provide programs to automate the building of a genome sequence from a
set of contigs or scaffolds, and to create browser tracks for contigs, scaffolds,
gaps, depth of read coverage, and GC content. This functionality was developed
in the Python programming language.
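As an illustration of how a GC content track could be derived, the following is a minimal sketch and not the code distributed with cgb. It assumes non-overlapping windows, a hypothetical function name (gc_content_wiggle), and output in the UCSC fixedStep wiggle format; a trailing partial window is simply dropped.

```python
def gc_content_wiggle(chrom, sequence, window=100):
    """Return fixedStep wiggle lines giving percent GC per window.

    Assumptions (not taken from cgb): windows do not overlap, and a
    trailing window shorter than `window` is ignored.
    """
    # fixedStep coordinates are 1-based; span equals the window size.
    lines = [f"fixedStep chrom={chrom} start=1 step={window} span={window}"]
    seq = sequence.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        lines.append(f"{100.0 * gc / window:.1f}")
    return lines
```

The returned lines can be written to a text file and loaded as a custom annotation track; the window size here is an arbitrary default.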
Results
General Functionality
Cgb is a Unix shell program that presents the user with a series of tasks, with each
task pertaining to a particular step in the browser installation process (Table 2).
Prior to executing one or more tasks, the user specifies a new or existing identifier
for the browser instance by setting the CLIENT_NAME variable (for more detail
see Example Usage Case).
Table 2. List of cgb tasks and their commands
Task                                                        Command(s)
Manage an instance of a UCSC Genome Browser                 add, remove
Manage "Clade" entries                                      add, remove, list
Manage "Genome" entries for a particular "Clade"            add, remove, list
Manage "Build" entries for a particular "Genome"            add, remove, list
Add a FASTA sequence for a Genome Build                     add
Manage default Genome Builds                                add, remove, list
Manage BLAT servers for a Genome Build                      add, remove, list, restart
Add a contig assembly to a Genome Build                     add
Add a scaffold assembly to a Genome Build                   add
Add a depth of coverage annotation track to a Genome Build  add
Each task has a set of commands (add, remove, list, or restart) to manage data
entries that describe each genome sequence (i.e., clade, genome, and build) or that
load a genome sequence and/or annotations into the browser instance (fasta,
contig, scaffold, etc.). The list and remove commands are useful in cases where
mistakes are made or data are no longer required. Each command has a required
set of arguments that specify the value of database entries (e.g., build name) or
other properties pertaining to a genome sequence. A full description of these
arguments is available via the cgb help message or in the documentation.
We have also provided extra commands that automate the conversion of contigs or
scaffolds into a genome build. By default, contigs or scaffolds are sorted by
decreasing length and merged into one sequence record.
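The default behaviour described above (sort by decreasing length, then merge into one record) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and joining contigs with a fixed run of N characters is an assumption, since the text does not state how adjacent contigs are separated.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_contigs(records, gap_length=100):
    """Sort contigs by decreasing length and join them into one sequence.

    Assumption: contigs are separated by `gap_length` N characters.
    Returns the merged sequence and a list of (name, start, end) tuples
    giving each contig's 0-based, half-open position in that sequence.
    """
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    merged, layout, offset = [], [], 0
    for index, (name, seq) in enumerate(ordered):
        if index > 0:
            merged.append("N" * gap_length)
            offset += gap_length
        merged.append(seq)
        layout.append((name, offset, offset + len(seq)))
        offset += len(seq)
    return "".join(merged), layout
```

The layout list is what a contig or gap annotation track would be built from, since it records where each contig lands in the merged build.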
Example Usage Case
To create a custom instance of the UCSC Genome Browser we would execute the
cgb commands listed in Figure 7.
The create_browser command adds a new instance (identified by
ExampleClient1) of the UCSC Genome Browser, and protects the browser
instance by prompting the user to set a username and password (not shown). The
subsequent four steps (add_clade, add_genome, add_build, and add_defaultdb)
create entries in the browser’s database. The add_fasta command loads a genome
sequence in FASTA format, and add_blat starts the BLAT servers for this genome
sequence.
Additional genome sequences can be added to the same browser instance (within
or outside the same clade and/or genome), or new browser instances can be
created by setting a new CLIENT_NAME and repeating the tasks in Figure 7, but
with different values.
Figure 7. List of cgb commands for creating a custom UCSC Genome Browser.
Example describes the commands for creating a custom genome browser for the
C. difficile isolate QCD-66c26 chromosome (CM000441.2). Lines beginning with
a pound (#) or dollar sign ($) are comments or commands, respectively.
Discussion
By converting a lengthy and tedious procedure into a series of simple tasks, cgb
makes it easier to create custom instances of the UCSC Genome Browser.
Because many of the tasks are non-interactive they can be further automated and
customized by wrapping them into another Unix shell program, or more
interestingly, through a web-browser interface, allowing users with no knowledge
of Unix-like operating systems to create browser instances for non-reference
genome sequences. Cgb demonstrates that existing bioinformatics tools can be
adapted to address changes caused by MPS technologies, thereby reducing
resources needed to develop a new tool.
Acknowledgements
I thank Pascale Marquis and Gary Leveque for user testing. V.F. was the recipient
of a Canadian Institutes of Health Research Doctoral Research Award.
CHAPTER 3: Fourteen-Genome Comparison Identifies DNA Markers for
Severe-Disease-Associated Strains of Clostridium difficile
Copyright © 2011, American Society for Microbiology
Citation:
Vincenzo Forgetta, Matthew T. Oughton, Pascale Marquis, Ivan Brukner, Ruth
Blanchette, Kevin Haub, Vince Magrini, Elaine R. Mardis, Dale N. Gerding,
Vivian G. Loo, Mark A. Miller, Michael R. Mulvey, Maja Rupnik, Andre Dascal,
and Ken Dewar. (2011). "Fourteen-genome comparison identifies DNA markers
for severe-disease-associated strains of Clostridium difficile" Journal of Clinical
Microbiology 49(6): 2230-2238.
Reprinted with the permission of the Journal of Clinical Microbiology (JCM).
This is an author-created, uncopyedited electronic version of an article accepted
for publication in JCM. The American Society for Microbiology (ASM),
publisher of JCM, is not responsible for any errors or omissions in this version of
the manuscript or any version derived from it by third parties. The definitive
publisher-authenticated version is available at
http://jcm.asm.org/content/49/6/2230.
Abstract
Clostridium difficile is a common cause of infectious diarrhea in hospitalized
patients. Severe and increased incidence of C. difficile infection (CDI) is
associated predominantly with the strain NAP1; however, the existence of other
severe disease-associated (SDA) strains and the extensive genetic diversity across
C. difficile complicates reliable detection and diagnosis. Comparative genome
analysis of 14 sequenced genomes, including a subset of NAP1 isolates, allowed
the assessment of genetic diversity within and between strain types to identify
DNA markers that associate with severe disease. Comparative genome analysis of
14 isolates, including five publicly available strains, allowed the determination
that C. difficile has a core genome of 3.4 Mb, comprising ~3000 genes. Analysis
of the core genome identified candidate DNA markers that were subsequently
evaluated on a multi-strain panel of 177 isolates, representing more than 50
pulsovars and 8 toxinotypes. A subset of 117 isolates from the panel has
associated patient data which allowed assessment of an association between the
DNA markers and severe CDI. We identified 20 candidate DNA markers for
species-wide detection and 10,683 SNPs associated with the predominant SDA
strain (NAP1). A species-wide detection candidate marker, the gene sspA, was
found to be identical across 177 isolates sequenced and lacked significant
similarity to other species. Candidate SNPs in genes CD1269 and CD1265 were
found to associate better with disease severity than currently used diagnostic
markers, as they were also present in the A-B+ strain type. The genetic markers
identified illustrate the potential of comparative genomics for the discovery of
diagnostic DNA-based targets that are species-specific or associated with multiple
severe disease strains.
Introduction
Clostridium difficile is the most common cause of infectious diarrhea in
hospitalized patients in the industrialized world (Barbut et al., 1996; S. Johnson &
Gerding, 1998). With symptoms ranging from self-limited diarrhea to the life-
threatening fulminant colitis, C. difficile infection (CDI) has affected hundreds of
thousands of patients worldwide and substantially burdens healthcare resources
(L. Kyne et al., 2002). In particular, CDI has caused increased patient morbidity
and mortality in hospitals throughout the world since 2003, when outbreaks with
increased disease incidence and severity first emerged (Warny et al., 2005), and
from 2004 to 2007 contributed to almost 1000 deaths in the province of Quebec
(Gilca R, 2008). Investigations of these and other outbreaks across Canada, the
United States, and Western Europe led to the recognition of a severe disease-
associated (SDA) strain predominantly responsible for this epidemic (Kuijper et
al., 2006; Loo et al., 2005). This strain has been classified as North American
pulse field type 1 (NAP1), ribotype 027, toxinotype III, or restriction-
endonuclease type BI (McDonald et al., 2005), which we refer to as NAP1.
Outbreaks have also been associated with other SDA strains, such as the
NAP7/toxinotype V/ribotype 078 (NAP7) strain found in cases of human and
animal disease (Goorhuis et al., 2008; Mulvey et al., 2010), and multiple toxin A-
B+ pulsotypes/ribotypes responsible for CDI outbreaks in Ireland, UK, US, and
Canada (al-Barrak et al., 1999; Stabler et al., 2006).
To date, the monitoring of C. difficile infection in a hospital setting has been a
reactive process, where patients are tested after symptoms emerge, usually
employing methods that are time consuming, expensive, and/or lack sufficient
sensitivity (Eastwood et al., 2009; Killgore et al., 2008). Improved tests are
sought in which all at-risk patients can be tested in a rapid, reliable and cost
effective manner (Planche et al., 2008). To address this, DNA-based diagnostic
tests have been developed (Barbut et al., 2009; Sloan et al., 2008; Spigaglia et al.,
2010; Wolff et al., 2009), but they rely primarily on targets from previously
known genomic regions, such as the genes tcdC (Sloan, et al., 2008; Wolff, et al.,
2009) and gyrA/gyrB (Spigaglia, et al., 2010), and include numerous commercially
available assays [Xpect toxin A/B test (Remel, Inc., Lenexa, KS); BD GeneOhm™
Cdiff assay (BD Diagnostics, San Diego, CA); ProGastro Cd assay (Prodesse Inc.,
Waukesha, WI); Cepheid Xpert™ C. difficile (Cepheid, Sunnyvale, CA); and
illumigene™ C. difficile (Meridian Bioscience, Inc.)].
With the advent of massively parallel DNA sequencing (MPS) technology
(Margulies, et al., 2005), bacterial whole genome sequencing has become rapid
and affordable, and there are now well over 5,000 completed or in progress
microbial genomes in public databases such as NCBI Entrez. MPS-based genome
sequencing has been used in studies of C. difficile and other human pathogens,
including a comparative genome analysis of 25 isolates that was performed to
provide insight into the molecular evolution of C. difficile (M. He et al., 2010;
Scaria et al., 2010), and in a study of genome comparisons between 3 isolates that
was used to identify potential virulence mechanisms in the NAP1 strain (Stabler
et al., 2009). These studies have confirmed the mobile nature of the C. difficile
genome (Sebaihia et al., 2006) and that genetic diversity among strains is high.
Several studies (M. He, et al., 2010; Janvilisri et al., 2009; Scaria, et al., 2010) have
recognized the NAP7/8 strain type as highly divergent from other strain types and
have indicated that the C. difficile core genome may comprise only ~1000 genes
(M. He, et al., 2010; Janvilisri, et al., 2009; Scaria, et al., 2010; Stabler, et al.,
2006). To date, MPS and comparative genome analyses of C. difficile have not
been applied to the search for additional DNA-based diagnostic targets, as has
been done for other human pathogens (Dai et al., 2011; Feng et al., 2011; Garcia
Pelayo et al., 2009; Kuroda et al., 2010).
In this report we describe the comparative analysis of the genomes of 14 isolates
of C. difficile. The genomes of nine isolates, including 6 Eastern Canadian and 3
reference isolates, were sequenced as part of this study, and the genome
sequences from 5 additional publicly available isolates were also used in the
analyses. Our main objective was to identify DNA targets potentially addressing
two major clinical questions: (i) Is the patient infected with C. difficile? and, if so
(ii) Is the patient infected with a strain associated with severe disease? We thus
identified DNA-based diagnostic sequences that could be used to detect any
isolate of C. difficile, as well as targets able to discriminate between SDA and
non-SDA strain types. We also used comparative genome analysis to study the
genetic diversity between strain types, estimate the size of the C. difficile core
genome, and begin to investigate the existence of additional loci responsible for
virulence. Candidate targets identified from the 14-genome analysis were
reconfirmed with a larger panel of 177 isolates. Clinical records available for 117
of these isolates were cross-referenced with target alleles to ascertain their
association to disease severity.
Materials and Methods
Isolates for Whole Genome Sequencing
Within the available hospital collections, six isolates of the predominant SDA
strain of C. difficile (NAP1) were selected to emphasize variation across
geographic location and time of isolation (QCD-32g58, QCD-66c26, QCD-
97b34, QCD-37x79, QCD-76w55, CIP107932). The remaining three isolates
were a NAP2 strain (QCD-63q42), a SDA NAP7 strain (QCD-23m63), and
reference strain VPI10463 (ATCC43255). Isolates were incubated anaerobically
at 37 °C on 5% Columbia sheep’s blood agar in pure culture, and identified by
standard phenotypic criteria. Genomic DNA from each isolate was extracted
using a standard commercial column-based extraction (QIAamp DNA Mini Kit,
Qiagen, Mississauga, CA), stored in 1X TE buffer (pH 8.0), quantitated by
spectrophotometry (Bio Rad SmartSpec 3000 UV/Visible Spectrophotometer,
Mississauga, CA), visualized by 1% agarose gel electrophoresis and stored at -20 °C
until further use.
Whole Genome Sequencing, Gap Closure and Comparative Genome Analysis
All DNA extractions and sequencing experiments were performed at separate
times to minimize the risk of cross-contamination. Isolate QCD-32g58 was
sequenced using the first generation Roche-454 GS system (GS20). The
remaining isolates were sequenced later on the second generation Roche/454
GS-FLX system. Contigs (minimum size 500 bases) were ordered and oriented
based on alignment to the finished genome of C. difficile strain 630 (Sebaihia, et
al., 2006). Contigs that could not be ordered and oriented were placed into
separate scaffolds. Assembly improvement was accomplished by designing
primers flanking each gap, and performing conventional or long-range PCR (as
required by predicted gap size) from the genomic DNA. Bidirectional sequencing
of resulting amplicons was performed on ABI 3730xl systems. Gaps were closed
by aligning genomic and gap-directed amplicon sequences with Consed (Gordon,
et al., 1998). Genes were predicted using GLIMMER 3.01 (Delcher et al., 2007).
Gene function was predicted by aligning its protein sequence against the NCBI nr
database (e-value < 1.0e-20). A whole genome multiple alignment was created
using Multiz-TBA (Blanchette et al., 2004). DNA variation was catalogued using
only alignment blocks that [i] contained sequence from all 14 genomes, [ii] had
>80% pairwise identity, and [iii] contained genotypes with a quality value of >63.
The bootstrap consensus tree was inferred from 100 replicates and constructed
using MEGA 4 (Tamura et al., 2007). The pair-wise genome comparison between
QCD-66c26 and QCD-23m63 was computed using lastz (Harris, 2007), and the
percent identity histogram was generated using a custom Python script.
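The custom script itself is not reproduced here. As a rough sketch of the kind of calculation involved, the following assumes that each pairwise alignment block is available as two gapped strings of equal length, that gap columns are excluded from the identity calculation, and that the bin width divides 100 evenly. The function names are hypothetical.

```python
from collections import Counter


def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gap columns do not count
        columns += 1
        if a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


def identity_histogram(blocks, bin_width=5):
    """Count alignment blocks per percent-identity bin.

    Keys are the lower bound of each bin; 100% identity is placed in
    the top bin (for example 95 when bin_width is 5).
    """
    counts = Counter()
    for aligned_a, aligned_b in blocks:
        pid = percent_identity(aligned_a, aligned_b)
        bin_start = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(sorted(counts.items()))
```

The resulting dictionary can be passed to any plotting library to draw the histogram; whether blocks should be weighted by length is a design choice left open in this sketch.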
Validation via Targeted Resequencing
Within available hospital collections, 177 isolates of C. difficile were selected to
assess genetic variation for each candidate locus within the NAP1 strain and
between strains of varying PFGE type. Of the 177 isolates, 170 (91%) were
successfully PCR amplified and DNA sequenced. We also included the 9 isolates
from the whole genome sequencing as sequencing accuracy controls. We also
included a species-level control [C. spiroforme (ATCC 29900)] and a phylum-
level control [Mycobacterium intracellulare (kindly provided by Dr. Marcel Behr,
McGill University Health Centre)]. Genomic DNA was extracted from isolates
using a standard lysis-based column extraction, and DNA yield estimated by
spectrophotometry. For each candidate locus primers were designed using
Primer3 (Rozen & Skaletsky, 2000) to amplify regions of roughly 700-1000bp.
PCR and bidirectional sequencing of resulting amplicons was performed on the
ABI 377 platform. Sequence chromatograms were base called using Phred
(Ewing & Green, 1998; Ewing, et al., 1998), and polymorphisms (Q > 39)
catalogued in reference to C. difficile VPI10463.
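For readers unfamiliar with the Q > 39 threshold: a Phred quality value Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so Q = 40 is roughly one error in 10,000 base calls. The sketch below illustrates such a filter under simplifying assumptions (the read is already aligned to the reference without gaps, and the function names are hypothetical); it is not the pipeline used in this study.

```python
def phred_error_probability(q):
    """Estimated probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def high_quality_mismatches(reference, bases, qualities, min_quality=39):
    """List (position, reference_base, called_base) for confident mismatches.

    Assumption: `reference`, `bases`, and `qualities` are already aligned
    position by position, with no gaps. A mismatch is kept only when its
    quality is strictly greater than `min_quality`.
    """
    if not (len(reference) == len(bases) == len(qualities)):
        raise ValueError("inputs must have equal length")
    calls = []
    columns = zip(reference.upper(), bases.upper(), qualities)
    for position, (ref, base, quality) in enumerate(columns):
        if base != ref and quality > min_quality:
            calls.append((position, ref, base))
    return calls
```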
Results
Genome Sequencing and Analysis
We performed whole genome sequencing and assembly for nine isolates of C.
difficile (Table 3). Six isolates were of the predominant SDA strain type (NAP1)
and ranged in collection date from 1984 to 2007. The international reference
isolate CIP 107932 (isolated in 1984) and the BI-1 isolate (isolated in Minnesota
in 1988 (Razaq et al., 2007)) represented NAP1 isolates predating the 2003-2007
CDI epidemic. The other four NAP1 isolates (QCD-32g58, QCD-66c26, QCD-
37x79, QCD-97b34) were collected during the CDI epidemic from three locations
across Canada. In addition to the six NAP1 isolates, we sequenced and assembled
a SDA NAP7/toxinotype V/ribotype 078 (NAP7) isolate, QCD-23m63, collected
in 2007. The two non-SDA isolates sequenced were international reference strain
VPI10463 (ATCC43255) and a NAP2 isolate (QCD-63q42), collected in 1980
and 2005, respectively. We also used the publicly available genome assemblies of
another five isolates, including two additional NAP1 isolates (R20291 and CD196
(Stabler, et al., 2009)), NAP7 and NAP8 isolates (Human Microbiome Project),
and strain 630 (Sebaihia, et al., 2006) (Table 3). Genome assembly details and the
NCBI/GenBank accession numbers for all 14 isolates are given in Table 4.
The nine genomes sequenced in this study ranged in size from 3.94 Mb (QCD-
23m63) to 4.44 Mb (QCD-63q42) (Table 4), which corresponds to the range of
genome sizes (3.90 to 4.29 Mb) observed in other C. difficile genome sequencing
projects (Table 4). We generated 9 draft genome assemblies, where the contig
numbers in the assembly ranged from 16 (QCD-32g58) to 66 (BI-1). However,
the largest assembled contigs were >350 kb for all isolates, and the contig N80
calculation indicated that 80% of each assembled genome was found in contigs of
>59 kb (BI-1) to >300 kb (QCD-32g58). Given that C. difficile has a
typical bacterial gene density (80-85%) and typical bacterial gene lengths (500-
2000 nt), our genome assemblies provided contigs that could support synteny
analyses combining sequence similarity in the context of information on gene
order and orientation. Draft assemblies from the Human Microbiome Project also
provided contigs of lengths (contig N80 > 30 kb) useful for gene annotation and
the derivation of local synteny relationships.
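The contig N80 statistic used above can be computed as in the following sketch, which generalizes to any percentage (N50, N80, and so on). The function name is hypothetical, and integer arithmetic is used so that the threshold comparison is exact.

```python
def contig_nx(lengths, percent=80):
    """Return the contig Nx value for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together contain
    at least `percent` percent of the total assembled bases.
    """
    total = sum(lengths)
    if total <= 0:
        raise ValueError("assembly contains no bases")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare in integers to avoid floating-point rounding.
        if running * 100 >= percent * total:
            return length
    return min(lengths)  # only reached if percent > 100
```

For example, for contigs of 100, 80, 60, 40, and 20 kb (300 kb in total), the three largest contigs hold 240 kb, which is 80% of the assembly, so the N80 is 60 kb.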
We used whole-genome alignments to determine that a core of 3.4 Mb of
orthologous genomic sequence was present in all of the 14 genomes. Within the
core genome, the NAP7/8 group of isolates tended to be the most divergent from all
other groups, with an average percent identity of 97% compared to the NAP1
group, although this included regions of lower percent identity embedded within
the syntenic regions (Figure 8). The non-core genome sequences represent strain
and isolate specific insertions and deletions, often due to mobile genetic elements
and possible extrachromosomal plasmids (data not shown).
Table 3. Characteristics of C. difficile isolates used in this study.
Isolate      Year   PFGE Type  Location                   Source                         Additional Characteristics         Genome Size (Mb)  Contigs
a Isolates sequenced in this study
QCD-66c26    2007   NAP1¹      Montreal, QC, CA           56 Yo male with severe CDI     BT+; tcdC delta-117; 18bp del      4.13              45
QCD-32g58    2004   NAP1       Montreal, QC, CA           70 Yo male with CDI            BT+; tcdC delta-117; 18bp del      4.11              18
BI-1³        1988   NAP1¹      Minneapolis, MN, USA       Non-epidemic strain            BT+; tcdC delta-117; 18bp del      4.4               89
CIP 107932⁴  1984   NAP1¹      Reims, Marne, FR           28 Yo female with PMC          reference strain for binary toxin  4.04              69
QCD-37x79    2005   NAP1⁷      London, ON, CA             67 Yo patient with severe CDI  BT+; tcdC delta-117; 18bp del      4.33              59
QCD-97b34    2004   NAP1⁸      St. John's, NL, CA         70 Yo with severe CDI          BT+; tcdC delta-117; 18bp del      4.07              74
QCD-63q42    2005   NAP2       Quebec, QC, CA             67 Yo male with severe CDI     toxinotype 0                       4.44              87
VPI 10463⁴   1980²  -          -                          -                              reference strain from ATCC         4.21              88
QCD-23m63    2007   NAP7       Montreal, QC, CA           male with severe CDI           toxinotype V/ribotype 078          3.94              80
b Publicly available isolates
CD196        1985   NAP1       Paris, France              non-epidemic strain            -                                  4.11              1
R20291       2006⁶  NAP1       Stoke Mandeville, England  outbreak-associated            -                                  4.19              1
Strain 630   1980   -          Zurich, Switzerland        CDI and PMC                    -                                  4.29              2
NAP07        2008⁵  NAP7       Unknown                    Human feces                    -                                  3.90              33
NAP08        2008⁵  NAP8       Unknown                    Human feces                    -                                  4.08              24
¹ Predicted from sequence similarity
² Estimated year of isolation
³ Razaq N, et al. (2007)
⁴ Reference strain
⁵ Estimated from date of sequencing
⁶ Estimated from publication
⁷ subtype b/006
⁸ subtype a/001
Table 4. Characteristics of C. difficile genome assemblies used in this study.
Isolate      Technology  Status    Genome Size (bp)  Contigs  Largest_Contig (kb)  Contig_N80 (kb)  Accession          Reference
a Isolates sequenced in this study
QCD-66c26    GS-FLX      Draft     4,126,050         32       937.3                232.5            NZ_ABFD00000000    This study
QCD-32g58    GS-20       Draft     4,108,089         16       1247.0               302.3            NZ_AAML00000000    This study
BI-1         GS-FLX      Draft     4,392,595         66       356.4                59.6             NZ_ABHE00000000    This study
CIP 107932   GS-FLX      Draft     4,032,580         55       354.2                81.1             NZ_ABKK00000000    This study
QCD-37x79    GS-FLX      Draft     4,329,888         45       559.0                128.7            NZ_ABHG00000000    This study
QCD-97b34    GS-FLX      Draft     4,059,010         60       366.6                76.7             NZ_ABHF00000000    This study
QCD-63q42    GS-FLX      Draft     4,440,437         60       1027.2               101.2            NZ_ABHD00000000    This study
VPI 10463    GS-FLX      Draft     4,204,780         55       1293.1               138.5            NZ_ABKJ00000000    This study
QCD-23m63    GS-FLX      Draft     3,936,085         61       440.0                93.8             NZ_ABKL00000000    This study
b Publicly available isolates
CD196        GS-FLX      Complete  4,110,554         1        4,110,554 bp         4,110,554 bp     FN538970           Stabler et al. (2009)
R20291       GS-20       Complete  4,191,339         1        4,191,339 bp         4,191,339 bp     FN545816           Stabler et al. (2009)
Strain 630   ABI377      Complete  4,290,252         1        4,290,252 bp         4,290,252 bp     NC_009089          Sebaihia et al. (2006)
NAP07        GS-FLX      Draft     3,862,058         100      269.2                35.7             NZ_ADVM00000000    Human Microbiome Project*
NAP08        GS-FLX      Draft     4,022,033         111      169.7                32.3             NZ_ADNX00000000    Human Microbiome Project*
*http://www.hmpdacc.org/
Figure 8. Percent identity plot (top) and dot-plot (bottom) depicting the whole
genome pairwise alignments of a NAP1 isolate (QCD-66c26) versus a NAP7
isolate (QCD-23m63).
The dotplot in the bottom panel depicts the colinearity of the two genomes, and
the percent identity plot in the upper panel depicts the level of nucleotide-level
similarity between the two genomes (red line indicates average percent identity).
Polymorphism Discovery
Within the 3.4 Mb core genome, we identified 127,442 single nucleotide
polymorphisms (SNPs). Although other types of genomic variation are present,
including insertions and deletions, they were not analyzed in this study as they
accounted for less than 3% (3,211) of all instances of nucleotide level variation. A
phylogenetic tree constructed using the SNPs clustered the isolates into three
distinct groups: the eight NAP1 isolates, the three NAP7/8 isolates, and the three
remaining isolates (ATCC43255, 630, and QCD-63q42) which we refer to as the
R group (Figure 9a). Genetic variation among these three groups is greater than
the level of variation within any one group (Figure 9b). The NAP1 isolates differ
from the NAP7/8 group at 104,853 SNP positions, and differ from the three
remaining isolates (R group) at 17,076 SNP positions. The NAP7/8 isolates differ
from the R group of isolates at 96,302 SNP positions.
The eight isolates of the NAP1 group are nearly identical within their core
genome, differing at 9 to 62 SNP positions between any two
members (Figure 9b). We identified 11 non-synonymous nucleotide substitutions
that distinguish a subset of Canadian NAP1 isolates (QCD-32g58 (Quebec, 2004),
QCD-66c26 (Quebec, 2007), QCD-37x79 (Ontario, 2005)) as well as the UK
outbreak strain R20291 (Stoke Mandeville) from the others (QCD-97b34
(Newfoundland, 2004), BI-1 (Minnesota, 1988), CIP107932 (France, 1984), and
CD196 (France, 1985)). This set of 11 SNPs includes the previously described
mutation responsible for resistance to fluoroquinolones (Drudy et al., 2007).
The group of three NAP7/NAP8 isolates are also highly similar to each other,
with at most 1,851 SNPs separating QCD-23m63 (NAP 7) from the HMP-NAP07
isolate. However, the two HMP isolates are very similar, with only 297 SNPs.
Variation between the three isolates in the R group is greater than variation within
the other two groups, with each of the 3 isolates in the group having over 11,000
SNPs.
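The pairwise counts reported above can be illustrated with a minimal sketch. It assumes the core genome is available as equal-length aligned sequences, one per isolate, and counts only positions where both isolates have a called base; the isolate names and sequences below are hypothetical toy data, not values from this study.

```python
from itertools import combinations

def snp_differences(seq_a, seq_b):
    """Count aligned positions where both sequences have a called base and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )

# Hypothetical toy alignment (names and bases are made up for illustration).
alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAT",
    "isolate_3": "TCGAACGTAT",
}

# Number of differing positions for every pair of isolates.
pairwise = {
    (p, q): snp_differences(alignment[p], alignment[q])
    for p, q in combinations(sorted(alignment), 2)
}
```

The minimum and maximum of such a table, taken over the members of one group, correspond to the kind of range quoted for the NAP1 isolates.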
Figure 9. a) Phylogenetic tree of 14 C. difficile genomes constructed using SNP data.
Branches corresponding to partitions reproduced in fewer than 50% of bootstrap replicates are collapsed. The percentage of replicate trees
in which the associated taxa clustered together in the bootstrap test (100 replicates) is shown next to the branches. Genomes cluster
into 3 distinct groups: NAP1 isolates (red), NAP7/8 isolates (blue), and the 3 remaining isolates (black, R group); b) Number of
polymorphic SNPs observed between the three groups (large boxes), as well as variation observed between isolates from within each
group (remaining values). Also indicated are SNPs (text at center-bottom) that uniquely identify each group.
Genetic variations consistently present in a single group are candidate DNA
markers of potential diagnostic value. We found 10,683 SNPs that distinguish the
NAP1 group from the other two groups (Figure 9b). There were also 89,954 SNPs
that separated the NAP7/8 set from all others, and 6,342 SNPs that separated the
remainder (R group) from the NAP1 and NAP7/8 group.
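One way to express "consistently present in a single group" is sketched below. The definition used here is an assumption made for illustration: a position is group-specific when every member of the group carries the same allele and no isolate outside the group carries it. The function name and the toy alignment are hypothetical.

```python
def group_specific_positions(alignment, group):
    """Return positions where all members of `group` share one allele
    that is absent from every isolate outside the group."""
    members = [alignment[name] for name in group]
    others = [seq for name, seq in alignment.items() if name not in group]
    positions = []
    for i in range(len(members[0])):
        inside = {seq[i] for seq in members}
        outside = {seq[i] for seq in others}
        if len(inside) == 1 and inside.isdisjoint(outside):
            positions.append(i)
    return positions

# Hypothetical toy alignment with three groups (made-up names and bases).
alignment = {
    "g1_a": "ACGTA",
    "g1_b": "ACGTA",
    "g2_a": "TCGCA",
    "g2_b": "TCGCG",
    "g3_a": "TCATA",
}

g1_markers = group_specific_positions(alignment, ["g1_a", "g1_b"])
g2_markers = group_specific_positions(alignment, ["g2_a", "g2_b"])
g3_markers = group_specific_positions(alignment, ["g3_a"])
```

Under this definition a position that varies within the group (such as the last column for the second group) is excluded, even if one of its alleles is unique to the group.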
The 10,683 SNPs that are specific to the NAP1 group have a genome-wide
distribution (Figure 10a) and include a prominent cluster overlapping the
cytotoxin gene tcdB (Figure 10a). Among the other prominent clusters, 545 SNPs
were observed in a 5.8 kb region of 6 genes (CD1265-CD1270), annotated as a
two-component system (TCS) adjacent to an ABC transporter (TCS-ABC)
(Figure 10b). The predicted protein sequences across this TCS-ABC locus
showed no premature stop codons, providing preliminary evidence that these
remain functional genes despite the increased level of genetic variation. Other
prominent clusters include an ABC transporter (CD0336-CD0337), and a
hypothetical protein adjacent to a transcriptional regulator (CD3143-CD3144).
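To give a sense of how prominent the TCS-ABC cluster is, the reported counts can be turned into SNP densities. The sketch below uses only figures stated in the text (545 SNPs in 5.8 kb, 10,683 NAP1-specific SNPs, 3.4 Mb core genome); treating the core genome as exactly 3,400 kb is an approximation, and the windowing helper with its toy positions is a hypothetical illustration rather than the method used to draw Figure 10.

```python
from collections import Counter

# Figures taken from the text; the core genome size is rounded to 3.4 Mb.
cluster_snps, cluster_kb = 545, 5.8
nap1_snps, core_kb = 10_683, 3_400

cluster_density = cluster_snps / cluster_kb   # SNPs per kb in the TCS-ABC locus
genome_density = nap1_snps / core_kb          # SNPs per kb across the core genome
fold_enrichment = cluster_density / genome_density

def snps_per_window(positions, window=1_000):
    """Count SNP positions falling in each fixed-size window (window index -> count)."""
    return Counter(pos // window for pos in positions)

# Hypothetical SNP coordinates, for illustration only.
toy_counts = snps_per_window([10, 20, 1_500, 1_600, 1_700, 5_200])
```

With these inputs the locus carries roughly 94 SNPs per kb against a genome-wide average near 3 SNPs per kb, which is about a 30-fold difference.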
Figure 10. Distribution of SNPs that uniquely identify the NAP1 group of isolates.
a) Genome-wide distribution of SNPs that uniquely identify the NAP1 group of isolates. Prominent clusters overlap known (e.g., tcdB
in PaLoc) and unknown loci (indicated with arrows). b) Genomic intervals of 5 loci with prominent clusters of NAP1 SNPs. From top
to bottom: CD0336-CD0337, tcdB, TCS-ABC, and CD3143-CD3144. Annotations for each genomic interval depict (from top to
bottom): a scale, genomic position, pileup of NAP1 SNPs, and gene annotations (arrows indicate direction of gene).
Identification and Validation of Targets for SDA Strains
To investigate the potential of NAP1 SNPs in the TCS-ABC locus as diagnostic
targets associated with disease severity we selected 2 candidate genes, CD1265
and CD1269, for comparison against 9 putative or known virulence genes (tcdA,
tcdB, tcdC, tcdE, codY, fliC, groEL, gyrB, mviN). These genes were PCR
amplified and resequenced in a panel of 177 isolates of C. difficile. Isolates were
selected to assess genetic variation for each candidate locus within the NAP1
strain and between strains of varying PFGE type. The 177 isolates were collected
predominantly from Canadian hospital, provincial and national reference
collections, and chosen to sample variation in PFGE type. The isolates were also
selected to encompass different collection sites, and years of isolation. A second
set of international strains provided another survey of C. difficile diversity based
on toxinotyping (Rupnik, 2008). A total of 20 contributing sites were represented,
including 17 Canadian, 2 American, and one European institution. There were
163 isolates collected in or after 2002, with 22 isolated prior to 2002. Sixty-seven
isolates were classified as NAP1, 12 as NAP2 and 109 classified into more than
50 other pulsovars. Ten isolates were from the reference panel of toxinotyping
strains, and 50 were from 9 hospitals in the province of Quebec, Canada. 117
isolates were obtained from The Canadian Nosocomial Infection Surveillance
Program (CNISP) and have available patient data. Seventy of the 117 clinical
records reported severe CDI, which was defined as death associated with CDI, or ICU
admission or colectomy due to CDI (all measured <= 30 days after diagnosis).
Our set included 10 A-B+ isolates, which are of PFGE type 00065 (NAP9,
personal communication), correlated with toxinotype VIII (Janvilisri, et al., 2009).
In patient data available for 100 of the resequenced isolates, SNP
genotypes in two genes (CD1269 and CD1265) were associated with 70% (44/63) of
severe disease cases comprising three PFGE types (NAP1/NAP7/NAP9), whereas
existing typing methods, such as PFGE NAP1 classification or deletions in tcdC,
accounted for 52% (33/63) or 62% (39/63), respectively, and were limited to two
PFGE types (NAP1 and NAP7). For example, the G allele for the SNP at genomic
location 1,387,666 in CD1269 (Figure 11) associates with more severe disease
cases than the single base deletion at position 117 (∆117) in tcdC due to the
observation that the A-B+ strains have the G allele but lack the tcdC deletion. In
contrast, the ∆117 deletion in tcdC detected 33 of 63 severe disease cases, and the
larger deletions of at least 18 bp in the 3' end of tcdC accounted for 39 of 63 of
severe disease cases. Deletions in tcdC were diagnostic for two of three strain
types previously associated with severe disease (NAP1 and NAP7), whereas
genotypes for each of 18 SNPs in CD1269 (1/18) and CD1265 (17/18) (the TCS-
ABC locus) detected 44 of 63 severe disease cases and specific alleles were
associated with three SDA strain types (NAP1, NAP7, and A-B+
(NAP9/toxinotype VIII)) (Figure 11). In addition, 5 of the SNPs in CD1265 are
tri-allelic and partition the SDA strains into two groups (NAP1 versus NAP7 and
A-B+ (NAP9/toxinotype VIII)) (Figure 11). For example, the SNP at genomic
location 1,383,226 has three alleles C, T, and A. The A allele is associated with
the NAP1 strain, the T allele with the NAP7 and A-B+ strains, and the C allele
with all the others.
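The percentages quoted in this section, and the tri-allelic example, can be restated in a few lines. The counts and the allele assignments come directly from the text; the dictionary layout and the function name are hypothetical and serve only as a sketch of how such a marker could be read out.

```python
SEVERE_CASES = 63  # severe disease cases among isolates with patient data

# Number of severe cases accounted for by each marker, as reported in the text.
detected = {
    "TCS-ABC SNP genotypes (CD1265, CD1269)": 44,
    "deletions in tcdC": 39,
    "PFGE NAP1 classification": 33,
}
detection_rate = {
    marker: round(100 * n / SEVERE_CASES) for marker, n in detected.items()
}

# Alleles of the tri-allelic SNP at genomic location 1,383,226 (from the text).
ALLELE_TO_STRAIN = {
    "A": "NAP1",
    "T": "NAP7 or A-B+ (NAP9/toxinotype VIII)",
    "C": "other strain types",
}

def strain_group(allele):
    """Map an observed allele to the strain group it is associated with."""
    return ALLELE_TO_STRAIN.get(allele.upper(), "unknown allele")
```

Such a lookup only reflects the associations observed in this panel; it is not a validated diagnostic rule.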
Figure 11. Correlation of disease severity with SNPs from the TCS-ABC locus or existing diagnostic methodologies.
The incidence of severe CDI outcome (SDA column, grey boxes) is higher for the NAP1 strain and occurs in 7/10 and 2/2 cases for the
A-B+ and NAP7 (+1 closely related, 11ACD0028) strains, respectively (genome sequenced isolates are not phenotyped and indicated
with an "x"). Molecular markers such as the PFGE type NAP1, and presence of binary toxin are diagnostic for NAP1, with deletions in
tcdC additionally capturing the NAP7 strain. Genotypes in SNPs identified in the TCS-ABC locus are diagnostic for three SDA strain
types (the NAP1, NAP7, and A-B+ strains), and include 5 tri-allelic SNPs (the third allele is boxed).
Identification and Validation of Targets for Species-level Detection
We selected a set of 12 genes (spoIIIAG, CD0596, CD0117, CD3014, CD0279,
CD1795, CD2251, CD0017, cdd1, srlE, prdB and sspA) that displayed high
nucleotide level identity across all 14 sequenced genomes as species-wide
detection candidates (Table 5). After resequencing the panel of 177 isolates, sspA
was found to be 100% conserved at the nucleotide level for all isolates analyzed
(Table 5). The remaining 11 candidates contained a few polymorphisms, and did
not deviate substantially from what was originally observed in the 14 genome
analysis (Table 5). To further investigate whether these gene sequences could be
used as DNA-based markers that detect C. difficile regardless of
strain type, yet do not detect other species, we aligned the
genomic sequence from each candidate gene region against the NCBI
non-redundant DNA sequence database. Database hits for all 12 gene regions were
well below 90% identity and/or 90% query coverage (Table 5). Furthermore, C.
difficile PCR primers for all 12 genes did not lead to amplification products from
Clostridium spiroforme and Mycobacterium intracellulare (data not shown).
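The specificity criterion described above can be written as a simple filter over the best non-C. difficile BLASTn hit for each gene. The coverage and identity values below are those listed in Table 5; the 90% thresholds come from the text, while the data structure and variable names are hypothetical.

```python
# Best hit per gene: (query coverage %, maximum identity %), from Table 5.
best_hits = {
    "sspA": (94, 74), "prdB": (96, 76), "CD0017": (83, 73), "cdd1": (39, 74),
    "srlE": (97, 83), "CD2251": (13, 85), "CD1795": (16, 84), "CD0596": (22, 81),
    "CD0117": (96, 76), "CD3014": (96, 86), "CD0279": (45, 67), "spoIIIAG": (8, 84),
}

THRESHOLD = 90  # percent, applied to both coverage and identity

# A gene would be flagged only if another species matched it at
# >= 90% identity over >= 90% of the query.
cross_reactive = [
    gene
    for gene, (coverage, identity) in best_hits.items()
    if coverage >= THRESHOLD and identity >= THRESHOLD
]
highest_identity = max(identity for _, identity in best_hits.values())
```

With these values no candidate is flagged, since the highest identity to any other species is 86%.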
Table 5. Targets for species-level detection of C. difficile.
Gene      SNPs (14)   SNPs (Reseq.) Most Similar Species using BLASTn (nr)           Query Coverage (%)  Max. Identity (%)
sspA      0           0             Alkaliphilus oremlandii OhILAs                   94                  74
prdB      2           5             Clostridium sticklandii                          96                  76
CD0017    2           6             Clostridium botulinum E3 str. Alaska E43         83                  73
cdd1      2           6             Leptospira interrogans serovar lai str. 56601    39                  74
srlE      6           6             Clostridium botulinum B str. Eklund 17B          97                  83
CD2251    3           8             Trichomonas vaginalis                            13                  85
CD1795    4           9             Polistes sp. MD1 mitochondrion                   16                  84
CD0596    6           10            Brassica rapa subsp. Pekinensis                  22                  81
CD0117    6           10            Alkaliphilus metalliredigens QYMF                96                  76
CD3014    6           10            Listeria welshimeri serovar 6b str. SLCC5334     96                  86
CD0279    8           10            Clostridium acetobutylicum ATCC 824              45                  67
spoIIIAG  8           14            Mycoplasma mycoides subsp. mycoides SC str. PG1  8                   84
Discussion
The primary objective of this study was to identify DNA markers for C. difficile
that could be used to test stool samples of patients and determine: (i) whether they
are infected with C. difficile, and if they are infected (ii) whether the particular
strain has been associated with severe disease. Our strain selection, which was
influenced by the availability of isolates in well-characterized hospital collections,
such as the Canadian Nosocomial Infection Surveillance Program, included an
emphasis on isolates of the predominant epidemic strain (NAP1) as well as an
isolate of another SDA strain (NAP7) reported in literature (Goorhuis, et al.,
2008; Mulvey, et al., 2010). Our strain choices comprise the majority of severe
disease associated strain types (Hubert et al., 2007; Loo, et al., 2005; Quesada-
Gomez et al., 2010; Walkty et al., 2010) as well as two widely used research
reference isolates (CIP107932 and VPI10463). Our testing on an extended panel
of isolates, from an even wider diversity of strain types described here, further
demonstrates that we have analyzed a wide spectrum of naturally occurring C.
difficile genetic diversity.
Similar to previous studies, we observed variation in genome size that is largely
attributable to mobile genetic element activity (M. He, et al., 2010; Scaria, et al.,
2010; Sebaihia, et al., 2006; Stabler, et al., 2009). Mobile genetic elements in C.
difficile have been found to carry virulence factors (Stabler, et al., 2009), and as a
result, a more detailed comparative genome analysis of the NAP1 isolates from
this study and others may provide further insight into the differential virulence
observed within this strain. We observed 3.4 Mb of conserved sequence present in
all 14 genomes, comprising 3063 genes, which differs substantially from
observations made by other studies, which estimate a smaller core genome of less
than 1000 genes (M. He, et al., 2010; Janvilisri, et al., 2009; Scaria, et al., 2010;
Stabler, et al., 2006). We do not believe this is a reflection of strain choices, as
our most divergent pairs (NAP1 vs NAP7) were also recognized as being the most
highly divergent in other studies (M. He, et al., 2010). Rather, we believe the
differences in interpretation of core genome size are due to differences in
methodologies and analysis techniques. Given the high quality of the 3 completed
genomes and 11 draft genomes used in our analyses, we were able to combine
sequence comparisons and syntenic relationships to determine gene orthologs, and
observed that compared to NAP1 references, the NAP7/8 group displayed a
higher level of genetic diversity with an average of 97% identity. We believe that
in comparative genome hybridization (CGH) studies, for example, where probe
mismatches can affect hybridization (Naiser et al., 2008; Rennie et al., 2008), and
NAP1 or strain 630 genomes have been used as the reference, even a few
mismatches between the probe and target DNA can increase the false negative
rate (Machado & Renn, 2010; Renn et al., 2010). This leads us to suggest that the
C. difficile CGH arrays, which were designed using either strain 630 (R group) or
the NAP1 isolate QCD-32g58, may produce numerous false negative
hybridizations when tested using a more distantly related strain (e.g., NAP7),
resulting in the calculation of a smaller core genome size. Other whole
genome sequence based studies have used differing analytical techniques to
determine the core genome size. For example, one study determined the core
genome to consist of 622 genes, but only after stringently considering the non-
recombining genes in the genome (M. He, et al., 2010). Another study estimated
the core genome to be comprised of 947 genes (Scaria, et al., 2010); however, as
it was a gene-centric analysis based on sequence identity without synteny
evaluation, the study may have considered genes of lower than average sequence
identity, such as those present in the NAP7/8 group, to be non-orthologous. Our
whole genome multiple alignment-based approach allowed the identification of
orthologous regions of lower sequence identity which might not have been
detectable using CGH arrays or considered to be below similarity thresholds
using a gene-centric approach.
At this time, our catalogue of genetic variation does not include insertions or
deletions. Future work may reveal insertions or deletions in severe disease
associated strains in addition to those previously identified in tcdC (Curry et al.,
2007; Loo, et al., 2005; MacCannell et al., 2006; McDonald, et al., 2005). Indeed,
while deletions in tcdC were hypothesized to lead to
increased toxin production (MacCannell, et al., 2006; Spigaglia & Mastrantonio,
2002; Warny, et al., 2005), a recent study (Murray et al., 2009) indicated that
deletions in tcdC do not predict the biological activity of the PaLoc toxin genes,
providing further motivation to identify all sources of genetic variation within the
C. difficile genomes that may correlate with disease severity.
There are 64 SNPs that discriminate isolates within the NAP1 strain type,
including 14 SNPs that separate eastern Canadian isolates from the others, and
these demonstrate the potential of massively parallel sequencing to identify SNPs
suitable for subsequent intra-strain typing and tracking. The SNPs that distinguish
the NAP1 strain from all other strains show an uneven genome-wide distribution
and cluster in known pathogenicity genes as well as in genes with currently
unrecognized roles in CDI. The other genes with clusters of NAP1 SNPs include a
general stress (CD2599) and carbon starvation (CD2600) gene, and an ABC
transporter. The most prominent cluster of SNPs that discriminate the epidemic
strain were observed in a genetic locus consisting of a two-component system
(CD1269-70) adjacent to an ABC transporter (CD1265-68). The adjacency of
these two loci has been shown to be functionally relevant in Bacillus, and is
present in other low-GC Gram-positive bacteria, including C. difficile (Joseph et
al., 2002). The function of the TCS-ABC system has been investigated previously
in B. subtilis, and includes detoxification of antimicrobial compounds (Ohki et al.,
2003). The function of any particular TCS-ABC system is primarily dictated by
the protein domains present in the histidine kinase gene (Jordan et al., 2008).
Sequence analysis of the histidine kinase gene (CD1270) in this TCS-ABC
system suggests that it plays a role in detoxification (data not shown). In silico
analysis indicates that this locus does not consist of pseudogenes, and as a result
may have a functional role. Future experiments are needed to confirm the
expression of genes in the TCS-ABC system and investigate their roles in CDI.
Enzyme-based strategies for the detection of C. difficile have limited sensitivity
and specificity (Eastwood, et al., 2009). To address this, alternative gene targets
have been investigated, such as the tpi housekeeping gene (Dhalluin et al., 2003).
While this assay is specific to C. difficile, it requires the additional procedure of
RFLP to achieve its specificity, making it less suitable in a clinical setting. Our
genome analysis has identified numerous highly conserved genes that may be
suitable for PCR-based specific detection of C. difficile. The 12 candidate genes
displayed a high level of sequence conservation across the 14 genomes as well as
a diverse population of 177 isolates, representing major PFGE and toxinotypes.
Moreover, our in silico analysis shows that many of these candidates have low
sequence identity to other bacterial species, and upon further experimental
validation and development may define a more rapid and specific clinical
detection assay.
The development of genetic markers to track severe disease-causing strains has
previously relied on variation in known pathogenicity genes, such as the detection
of deletions in tcdC (Wolff, et al., 2009). While deletions in tcdC are diagnostic
for the NAP1 strain and the SDA NAP7 strain, they are not diagnostic for the
A-B+ isolates (NAP9/toxinotype VIII) that have been found in food animals (Pirs et
al., 2008) and retail meats (Rodriguez-Palacios et al., 2007). Using comparative
genome analysis with candidate gene resequencing, we have identified numerous
SNPs that detect these three SDA strain types and this in turn has led to an
increase in the detection of severe disease causing cases by almost 20%.
However, strains containing these SNPs do not always lead to CDI, and severe
CDI can be observed in cases of infection with non-SDA strains. Part of this may
be attributable to host factors that alter disease severity, such as variation in
immune response (Lorraine Kyne et al., 2000; L. Kyne et al., 2001), and exposure
to certain antibiotics (Loo, et al., 2005; Muto et al., 2005; Pepin et al., 2005) or
acid-reducing agents (Dial et al., 2004; Loo, et al., 2005; Pepin, et al., 2005).
Alternatively, as our resequencing of candidate regions was limited, the
resequencing of additional genes or loci may uncover SNPs that detect
additional severe disease causing isolates. Also, severe disease or outbreak size
and number may be caused by a constellation of genetic loci, such as those
responsible for sporulation (Merrigan et al., 2010) or gut survival, and
investigation of these other loci may identify SNPs that account for additional
severe disease cases.
This study demonstrates the utility of massively parallel DNA sequencing to
identify clinically relevant diagnostic markers of C. difficile. As the cost of whole
genome sequencing continues to decrease, we envision this approach being
applied to the study of other human pathogens.
Acknowledgements
We wish to thank the DNA sequencing teams at the Genome Center at
Washington University School of Medicine and the McGill University and
Genome Quebec Innovation Centre for the sequencing of the C. difficile isolates.
We thank Manon Lorange for the isolation of C. difficile QCD-63q42. We also
thank Dr. Marcel Behr for providing C. spiroforme and Mycobacterium
intracellulare control DNAs. We thank the Human Microbiome Project (HMP)
and the HMP DACC for pre-publication data release of the C. difficile NAP07
(ADVM00000000.1) and NAP08 (ADNX00000000.1) sequences. This study was
funded by Genome Canada and Genome Quebec (KD) and the NHGRI (EM). VF
is a recipient of a Canadian Institutes of Health Research Doctoral Research
Award. MO was the recipient of an AMMI Canada/CIHR/CFID/Bayer Healthcare
research fellowship.
Connecting Text
Genome browsers, including the UCSC Genome Browser (Kent, et al., 2002),
were designed to view and analyze genome annotations. They are limited in their
capacity to analyze the underlying genome assembly, that is, how individual reads
contribute to the subsequent consensus sequence. In 2006, existing assembly
viewers (Gordon, et al., 1998; Huang & Marth, 2008; Schatz et al., 2007) did not
process MPS datasets due to their large size, or were limited in their ability to
assess the quality of an entire genome assembly. As a compromise, I used a
combination of cgb and other programs, such as a spreadsheet application, to view
and analyze the C. difficile genome assemblies (Chapter 3). This approach proved
to be cumbersome, and did not meet the need to easily inspect a genome assembly
from an MPS platform.
The need to have assembly assessment tools beyond a consensus-based browser
became evident in my next project, the multi-centre sequencing and analysis of
the O. novo-ulmi genome (Chapter 5) (in preparation). The goal of this project
was to determine if the Roche/454 GS platform was reproducible across
sequencing centers. To assess variation in sequencing error rates, or other sources
of center-specific bias such as genome coverage, we required a high-quality
genome assembly which we could use as a point of reference (Chapter 5). The
knowledge gained from assessing the quality of the genome assembly enabled me
to develop a new assembly viewer that addressed the limitations of existing
software programs. The program created from this endeavor was named contiGo
(Chapter 4), and was developed to specifically assist in the quality assessment and
analysis of genome assemblies. Among other features, it offers multiple views of
the assembly statistics via tables and charts, as well as a fully and fluidly
zoomable image of the read pileup for each contig. The program was used to
assess the quality of the genome assembly during the pilot stages of the project, and
to identify the presence of a mitochondrial genome and high copy number
rDNA sequences. Subsequent to analysis with contiGo, cgb was used to create a
custom UCSC Genome Browser to assist in the analysis of the reproducibility of
the Roche/454 GS-FLX Platform. This analysis allowed us to conclude that: [i]
the platform is reproducible; and [ii] the deviation in base-calling of monomeric
repeats is more pronounced at a length of 7 nt or greater. We also produced a
high-quality assembly of the O. novo-ulmi genome for use by the fungal research
community.
Contribution of Authors
I created contiGo and prepared and wrote the manuscript (Chapter 4). Gary
Leveque and Pascale Marquis assisted with user testing. The O. novo-ulmi project
(Chapter 5) was conceived by the DNA Sequencing Research Group (DSRG) of
the Association of Biomolecular Resource Facilities (ABRF) and DNA
sequencing was performed by participating DSRG core facilities. I oversaw and
conducted the analysis of the reproducibility of the Roche/454 GS-FLX Titanium
platform, performed the genome assembly and quality assessment, and prepared
and wrote the manuscript (Chapter 5).
CHAPTER 4: ContiGo -- A Tool to Inspect Genome Assemblies in a Web
Browser
Vincenzo Forgetta¹ and Ken Dewar¹
¹Department of Human Genetics, McGill University, Montreal, Quebec, Canada
This manuscript is in preparation for submission.
The source code and documentation are available at
http://github.com/vforget/Contigo
Abstract
Assembly viewers are software programs used to inspect genome assemblies,
determining their overall quality and identifying assembly artifacts such as
collapsed repeats and low-coverage contigs. Massively parallel DNA sequencing
platforms have made genome sequencing affordable to the biologist, who may be
tasked with inspecting the genome assembly. Currently available assembly
viewers require specific hardware or software or are limited in their analytical
capabilities, restricting their use to those with adequate expertise or resources, or
to specific types of projects. To address this need we created contiGo, a program
that offers multiple analytical views of a genome assembly from within a web
browser, bypassing the need to install software and download large data sets. We
demonstrate its general purpose functionality across various example usage cases.
Introduction
During the era of dye-terminator-based DNA sequencing technologies, whole
genome sequencing (WGS) was costly and laborious. Consequently, it was
performed by a few large institutions and limited to a few organisms of high
importance, such as human (Lander, et al., 2001; Venter et al., 2001) and mouse
(Waterston, et al., 2002). Within this context, many software programs were
developed to assist in all aspects of producing a genome sequence. Assembly
editors were particularly important because they were used to inspect the genome
assembly in a process termed genome finishing, a process that corrects sequence
and assembly errors, and incorporates additional sequence information to fill in
sequence gaps. At the time, popular assembly editors were Consed (Gordon, et
al., 1998) and Staden (Staden, 1996), which were used to finish the human and the
nematode genome, respectively. As the cost of DNA sequencing dropped, the cost
of genome finishing became a limiting factor and was primarily used for
important model organisms. This left many genome assemblies in a draft stage,
consisting of a set of contigs that may contain undetected errors (Salzberg &
Yorke, 2005). To assist in the analysis of this large number of draft genomes,
Schatz et al. developed Hawkeye (Schatz, et al., 2007), an assembly viewer that
offered advanced analytical techniques to detect assembly errors. Regardless of
the software program used, the primary goal of WGS was to produce a high-
quality assembly upon which various annotations were predicted. The genome
sequence and annotations were then incorporated into genomic resources, such as
the UCSC Genome Browser (Kent, et al., 2002) and Ensembl Genome browser
(Hubbard, et al., 2002), for use by the scientific community. To date, more than
40 eukaryotic genomes are available via these online resources, with many more
prokaryotic and viral species. This has provided the scientific community a
foundation upon which they have conducted further research.
Massively parallel DNA sequencing technology (Bentley, et al., 2008; Margulies,
et al., 2005; Shendure, et al., 2005) has changed the process of WGS and analysis
in at least one important way. By dramatically reducing costs, MPS technology
has given individual researchers access to whole genome sequencing, often by
contracting the sequencing out to a third party. Similar to past model organism
genome projects, this genome sequencing service often produces a draft assembly.
However, unlike past genome projects, the initial analysis of the assembly is often
left to the individual researcher. While some of this analysis will include typical
genome finishing tasks, such as identifying low-quality contigs, or misassemblies
due to repeats, researchers may also want to inspect the genome assembly in more
specific ways; a need that requires a general purpose assembly viewer. Also,
because some researchers possess varied levels of computer expertise and
different operating systems, this tool should be cross-platform and present a low
barrier to entry, such as ease of installation and data access. Recently developed
assembly viewers (Hou, et al., 2010; Huang & Marth, 2008; Milne, et al., 2010)
address some of these issues because they are cross-platform; however, they are
mainly geared towards variant detection, choosing to focus on the inspection of
individual contigs. A more general analysis tool that includes contig level analysis
as well as the analysis of the entire assembly is needed as it would allow for a
more general and open-ended analysis.
To address this need we have created contiGo, a program that offers multiple
analytical views of a genome assembly from within a web browser. Here we
briefly describe the implementation and functionality of contiGo, and demonstrate
its functionality across multiple example usage cases.
Methods
ContiGo is implemented in the Python programming language
(http://www.python.org). As input, it accepts a genome assembly in ACE format,
and optionally other files generated by the Roche/454 GS Assembler. ContiGo
parses the input file(s), computes statistical values for the assembly, and produces
images of the read pileups for each contig. The statistical values, contig sequence,
and contig qualities are stored as an HTML table and/or JSON objects. Contig
images are converted to a format compatible with the Seadragon AJAX library
(http://expression.microsoft.com/en-us/gg413362). ContiGo’s web-browser
interface visualizes this information in a dynamic manner using freely available or
custom JavaScript libraries. For more implementation details, please refer to
ContiGo's website (https://github.com/vforget/Contigo).
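As an illustration of the first step described above, the sketch below reads contig header lines from ACE-formatted text and serializes per-contig statistics as JSON. It is not taken from the ContiGo source code; the function name and the embedded example are hypothetical. It relies only on the ACE convention that each contig starts with a line of the form "CO <name> <bases> <reads> <base segments> <U|C>", where the base count includes pad characters (*).

```python
import json

def parse_contig_headers(ace_text):
    """Collect name, padded length and read count from each CO line of an ACE file."""
    contigs = []
    for line in ace_text.splitlines():
        fields = line.split()
        # CO <contig name> <# bases> <# reads> <# base segments> <U or C>
        if len(fields) == 6 and fields[0] == "CO":
            contigs.append(
                {"name": fields[1], "length": int(fields[2]), "reads": int(fields[3])}
            )
    return contigs

# Hypothetical, heavily truncated ACE content (read records omitted).
example_ace = """AS 2 5

CO contig00001 12 3 1 U
ACGTACGTACGT

CO contig00002 8 2 1 U
ACGT*CGT
"""

contigs = parse_contig_headers(example_ace)
as_json = json.dumps(contigs)  # a form that a browser-side script could load
```

A complete parser would also read the quality (BQ) and read (AF, RD, QA) records, which are needed to draw the read pileup.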
Results
General Functionality
The ContiGo interface consists of four sections: a summary, a table, a plot, and a
contig read pileup (Figure 12):
A. Summary: Displays basic assembly statistics, and provides links for the bulk
download of the contig sequences, quality values or contig statistics.
B. Table: Lists statistics for each contig in the assembly in tabular format.
Some row values are hyperlinks (e.g., contig length) to download data for
individual contigs (e.g., sequences). Column headers can be used to sort, search,
or filter the contents of the table.
C. Plot: Visualizes contig statistics using three plots: scatter, cumulative, and
count. The scatter plot displays two contig statistics against each other (e.g.,
contig length versus read depth of coverage). The count and cumulative plots
display for one statistic a histogram of counts or cumulative counts, respectively.
Clicking on values in the plots identifies the point by providing the contig name
and other information.
D. Pileup: Clicking the contig name in the table loads the read pileup, which is
visualized as a fully zoomable image (Figure 12). The status bar above the image
reports the base position and read depth at the current pointer position.
Figure 12. A contiGo screenshot illustrating the E. fergusonii isolate ECD227
genome assembly.
(A) Genome assembly statistics and download links for contig sequences, quality
values, and statistics in tabular format. (B) Table of contig statistics that can be
filtered or sorted by multiple columns (depicted here as sorted by length and
filtered for contigs >100,000nt). Underlined row values are links to data for each
contig (e.g., sequence, quality values). (C) Three dynamic plots of contig statistic
values; a scatter, cumulative and count plot. Values to plot are selected via pull-
down menus, and clicking individual points provides further details. (D) Read
pileup for contig00064 in the vicinity of 1,500nt. The read pileup is fully
zoomable, and the position of the mouse pointer reports the approximate location within
the contig sequence (bar above pileup).
Example Usage Cases
The table, plot, and pileup can be used individually or in combination to inspect
a genome assembly. Using the bacterial genome assembly in Figure 12 as an
example, we demonstrate this functionality across multiple usage cases.
General Quality Assessment
The plot and table can be used to assess the overall quality of a genome assembly:
i. A counts plot of base quality can be used to determine that the majority of bases
in the assembly are of high quality (e.g., 4869631 bases at quality value 64).
ii. A cumulative plot of contig length can be used to obtain the amount of
sequence in contigs of a minimum length (e.g., N50 contig length of 156910 bp).
iii. A scatter plot of contig length versus read coverage determines that contigs
greater than 100 kb have a depth of coverage of approximately 14-21x.
iv. Applying these criteria (Contig Length >= 100,000 and Read Depth >= 14) to
the contig table shows that 13 contigs meet this threshold, and that all of these contigs
are of high quality (0.22-0.70 % in low quality bases) and between 48 and 51%
GC content.
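The cumulative-plot statistic in step ii can also be reproduced outside contiGo. The following is a minimal Python sketch, not contiGo's own code; the function names and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


def bases_in_contigs_at_least(lengths, min_length):
    """Total sequence held in contigs of at least min_length."""
    return sum(length for length in lengths if length >= min_length)


# Toy contig lengths (hypothetical, not the ECD227 assembly).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))                           # 8
print(bases_in_contigs_at_least(toy_lengths, 8))  # 27
```

Sorting in descending order and accumulating until half of the total is reached mirrors reading the N50 off a cumulative plot of contig length.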
Detect a High-Copy Number Plasmid
During DNA preparation, plasmid sequences are often isolated in higher copy
number than the chromosomal component of the genome. ContiGo can be used to
identify and download the sequence of putative plasmids.
i. The plot of Contig Length (x-axis) versus Read Depth (y-axis) identifies
numerous large (> 10kb) contigs with higher depth of coverage (> 28x) than most
other contigs (14-21x).
ii. These contigs can be identified on an individual basis by clicking the points in
the plot, or by applying a filter to the table. Using the plot, we can identify a
contig with a length of 38,412 and depth of 48x as contig000011, which can be
used to filter the table by contig name. Clicking the length value in the table
displays the contig sequence.
iii. An NCBI BLAST search using the contig sequence against the non-redundant
database identifies the contig as being similar to plasmid sequences.
A similar procedure can be used to inspect other large high coverage contigs.
Alternatively, the table can be filtered for contigs > 10kb and > 28x depth of
coverage. Clicking on each contig length value in series will append these
sequences to the contig sequence display window.
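The table filter described above amounts to a two-condition selection on contig length and read depth. A minimal Python sketch follows, assuming the contig statistics have been downloaded in tabular form; the field names and all rows except contig000011 are hypothetical.

```python
# Hypothetical rows standing in for the contig table. Only contig000011
# (38,412 nt, 48x) is taken from the text; the other rows are invented.
contigs = [
    {"name": "contig000011", "length": 38412, "depth": 48.0},
    {"name": "example_chromosomal", "length": 150000, "depth": 17.5},
    {"name": "example_short_repeat", "length": 1800, "depth": 129.0},
]


def putative_plasmids(rows, min_length=10_000, min_depth=28.0):
    """Keep contigs that exceed both the length and the depth threshold."""
    return [row["name"] for row in rows
            if row["length"] > min_length and row["depth"] > min_depth]


print(putative_plasmids(contigs))  # ['contig000011']
```

The thresholds default to the values used in this example (> 10kb and > 28x) and would need adjusting for an assembly with a different baseline depth of coverage.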
Inspect a Mis-assembled Repeat
Repetitive elements are difficult to assemble, and often collapse into a single
contig with higher than expected depth of coverage. Because they are composed
of paralogous sequences, consensus positions within the contig may contain
multiple correlated mis-matches that result in low-quality positions. A
combination of the plot and the read pileup can be used to identify these contigs:
i. A plot of percent low quality bases (% < 64, x-axis) versus Read Depth (y-axis)
identifies a contig with high depth of coverage (129x) and 1.23% low quality
bases as contig00071. Alternatively, filtering the table for contigs with > 100x
depth of coverage identifies four contigs. The read pileup for these contigs can be
loaded by clicking on the contig name in the table.
ii. Inspecting the read pileup for contig00064 and contig00071 identifies
numerous consensus positions with correlated mismatches. For example,
contig00064 contains three such mismatches in the vicinity of position 1500.
iii. NCBI BLAST analysis of this contig reveals it to be the 16S ribosomal
RNA gene.
Discussion
We built contiGo as a general purpose tool to inspect genome assemblies from
within a web browser. It was designed to be suitable for use on a personal
computer or as part of a WGS analysis pipeline at a sequencing core facility.
Should third-party WGS platforms choose to incorporate contiGo into their
analysis pipelines, individual researchers could remotely inspect genome
assembly data, thus bypassing the need to download data locally, install software,
and satisfy other software or hardware requirements. In addition to inspecting
genome assemblies, contiGo's general purpose functionality also lends itself to the
analysis of transcriptome or metagenome assemblies.
Acknowledgments
I thank Pascale Marquis and Gary Leveque for user testing. V. . was the recipient
of a Canadian Institutes of Health Research Doctoral Research Award. I thank
Moussa Sory Diarra of Agriculture and Agri-Food Canada for permission to use
the genome assembly of E. fergusonii ECD-227 in the example usage cases.
CHAPTER 5: Reproducibility of the Roche/454 GS-FLX Titanium System to
Genome Sequence the Dutch Elm Disease Pathogen
Vincenzo Forgetta1, Gary Leveque2, Joana Dias2, Deborah Grove3, Gregory
Grove3, Robert Lyons Jr.4, Suzanne Genik4, Chris Wright5, Alvaro Hernandez5,
Sharon Bachman5, Lorie Hetrick5, Sushmita Singh6, Nichole Peterson6, Louis
Bernier7, Ken Dewar1
1McGill University, Department of Human Genetics, Montreal, Canada, 2McGill
University and Génome Québec Innovation Center, Montreal, Canada,
3Pennsylvania State University, Huck Institutes of the Life Sciences, University
Park, Pennsylvania, United States, 4University of Michigan, DNA Sequencing
Core, Ann Arbor, Michigan, USA, 5University of Illinois, W.M. Keck Center,
Urbana, Illinois, United States, 6University of Minnesota, Biomedical Genomics
Center, St. Paul, Minnesota, USA, 7Centre de recherche en biologie forestière,
Pavillon C.-E. Marchand, Université Laval, Sainte-Foy, QC, Canada
A modified version of this manuscript has been submitted to the Journal of
Biomolecular Techniques.
All co-authors have consented to the release of data.
Abstract
In this study, we have tested the reproducibility of the Roche/454 GS-FLX
Titanium system at five core facilities. We evaluated a number of sequencing
yield and accuracy metrics using a single sample from O. novo-ulmi strain H327.
We have revealed that the Titanium system is reproducible, with some variation
detected in sequencing yield and homopolymer length accuracy. We demonstrate
that reads shorter than the expected minimum length are of lower quality. The O.
novo-ulmi H327 genome sequence we produced is of high-quality and of benefit
to the fungal and arboreal research community.
KEY WORDS: massively parallel DNA sequencing, whole genome sequencing,
Roche/454 GS-FLX Titanium, O. novo-ulmi
Introduction
Massively parallel sequencing technology has dramatically reduced the cost of
DNA sequencing, thus enabling the biologist to conduct a genomic research
project. Within these projects, the DNA sequencing component is most often
conducted by a core facility, a laboratory that prepares and sequences the DNA
sample, finally providing the biologist with a genomic data set. Ideally, given the
same sample the resulting genomic data should be of comparable quality
regardless of the chosen facility, allowing the biologist to freely select a facility
based on the needs of the project.
The DNA Sequencing Research Group (DSRG) is a collaboration between
multiple core facilities (simply facility from here on), whose mandate includes
conducting studies to assess the capabilities of its member laboratories and to
promote excellence in DNA sequencing. In accordance with this mandate, the
DSRG sought to assess the performance of the Roche/454 GS sequencing
platform (Margulies, et al., 2005) across multiple member facilities in order to
determine the reproducibility of this instrument, its protocols, and reagents. While
we could assess reproducibility using existing genomic data sets derived from a
diverse set of samples, this would introduce additional experimental variation due
to the use of multiple sample preparation procedures and the DNA base
composition of the different samples. To limit extraneous sources of experimental
variation, we performed our assessment on a single DNA sample preparation and
distributed aliquots to participating member facilities. We chose to sequence the
genome of strain H327 of O. novo-ulmi, the fungal pathogen responsible for the
current pandemic of Dutch elm disease. O. novo-ulmi is a vegetatively haploid
ascomycetous fungus with an estimated genome size of 30-50 Mb. Since very
few fungal genomes have been sequenced, and none from the Ophiostoma/
Ceratocystis group, none of the sequencing teams had previous experience to
draw from and all groups relied completely on standard protocols. Further, given
the 30-50 Mb genome size (Dewar et al., 1997), the sample was an appropriate
candidate where multiple runs would be required to gain a high level of genome
coverage. Also, as a de novo sequencing project, the genome assembly of this
organism would provide new biological knowledge in addition to our technical
evaluations.
In this study we present our assessment of the Roche/454 GS sequencing platform
across five member facilities, including comparisons of sequencing throughput as
well as accuracy. In addition, we present the genome assembly of O. novo-ulmi
H327 for use by the fungal and arboreal research communities.
Methods
Isolation of O. novo-ulmi H327
O. novo-ulmi strain H327 was isolated and DNA extracted according to methods
described by Et-Touil et al. (1999).
Genome Sequencing, Assembly and Annotation
At each core facility, whole-genome shotgun library preparation and sequencing
were performed using the Roche/454 GS-FLX Titanium system following the
manufacturer’s protocols. A single paired-end library was also prepared and
sequenced using the same system and protocols. Shotgun and paired-end reads
were assembled using
Newbler version 2.3. Assembly errors were fixed manually and with the aid of
custom Python scripts. Gene predictions on the linear scaffolds were performed
using GeneMark-ES version 2.3 (Ter-Hovhannisyan et al., 2008). Gene
predictions for the mitochondrial sequence were performed using the web version
of AUGUSTUS (Candida tropicalis model) (Stanke et al., 2008). The EST sequences
(Hintz et al., 2011) were aligned to the genome using BLAT (Kent, 2002).
Sequencing Yield and Accuracy Analysis
Sequence reads were aligned to the genome using BLAT. All analyses were
conducted using custom Python scripts. This includes generating random
pyrosequencing reads, cataloguing of homopolymers from the genome sequence,
and determining homopolymer accuracy and substitution error rates from the read
sequence alignments. Figures were generated using the Python module matplotlib
or R version 2.1 or greater.
Results
DNA Sample
A single DNA sample of O. novo-ulmi strain H327 was prepared as a single
stranded DNA fragment library according to common protocols. Aliquots of this
single library were sequenced at the five facilities (identified here as A, B, C, D
and E) according to standard protocols. The size and experience of the sequencing
teams ranged from 2 to 7 technicians and from 5 to 345 runs, respectively (Table
6). The runs performed on the instruments used in this study ranged from 5 to
197. Sequence data was returned to one facility for centralized analysis. Using
this sequence data, we measured the reproducibility of the Roche/454 GS-FLX
Titanium System across several sequencing yield metrics: overall throughput
(Table 6), read length (Figure 13), and read quality (Figure 14). We then present
the O. novo-ulmi H327 genome assembly (Figure 15 and Table 7), and use it to
evaluate the homopolymer length (Figure 16, Figure 17, and Table 8) and the
substitution error rate (Figure 18).
Sequencing Yield
For each facility, we catalogued the read counts and bases sequenced for each
region of the two region sequencing plate (Table 6). Per region, all facilities
exceeded the minimum quality thresholds specified by Roche/454 (450k reads,
180 Mb of sequence and 300nt peak read length). Across facilities, the number of
reads and bases sequenced ranged from 1.05 to 1.57 million and 411 to 580 Mb,
respectively (Table 6). Site C was the top performer, with 21% and 15% above
average counts for reads and bases sequenced, respectively. Site E ranked fifth,
with 18% below average performance for reads and bases sequenced (1.05M
reads, 411 Mb). Within each facility, the difference between regions 1 and 2
was below 10%, with the exception of facility D. For this facility, region 1 (743k
reads) exceeded region 2 (564k reads) by at least 25% for read count and bases
sequenced. The remaining three facilities (A, B and D) performed within 10% of
the average (Table 6). The average peak read length was 478 and varied from 473
to 481 nucleotides between facilities.
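The percent deviations quoted above are each facility's total expressed relative to the five-facility mean. As a hedged illustration (a sketch, not the script used in the study), the combined read counts from Table 6 can be checked as follows; the function name is an assumption.

```python
# Combined (regions 1 & 2) read counts per facility, taken from Table 6.
read_counts = {"A": 1_198_182, "B": 1_305_651, "C": 1_566_225,
               "D": 1_306_959, "E": 1_048_831}


def percent_deviation(values):
    """Percent deviation of each value from the mean of all values."""
    mean = sum(values.values()) / len(values)
    return {site: 100 * (count - mean) / mean
            for site, count in values.items()}


for site, delta in percent_deviation(read_counts).items():
    print(f"{site}: {delta:+.2f}%")
```

Run on these counts, the sketch gives -6.77, +1.59, +21.87, +1.70 and -18.39 percent for facilities A to E, in agreement with the combined-region values in Table 6.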
To compare sequence yield and read length performance to a standard measure,
we generated 500,000 random DNA sequences of 1200nt in length, and subjected
them to in-silico pyrosequencing (200 cycles). These random reads had a peak
read length of 531nt (2.6 in silico incorporations per cycle), with the shortest read
being 450nt (Figure 13). The lower peak read length for the experimental data
(478 nt) (Table 6 and Figure 13) is due to a combination of genome base
composition, as well as the trimming of low-quality sequence on the 3' end of
each read and the molecular identifier (MID) tag on the 5' end (data not shown).
Using the difference between the peak and shortest read length for the random
set (81nt), we estimate the lower bound for an ideal read from the O. novo-ulmi
genome to be 397 nt (478 less 81) (Figure 13). Therefore, reads below 397nt are likely
truncated due to short DNA templates, failed sequencing reactions or other
phenomena that result in reads of less than minimal ideal length. Using this
threshold, we observe that at least 60% of reads from 4 of 5 facilities (A, B, D and
E) are of ideal length (Table 6). Contrary to its top rank in overall sequence yield,
facility C's performance is now below that of all other facilities, with ~59% of
reads being above the minimum ideal length (Table 6). This is also evident as a
larger leading tail of short reads
for facility C (Figure 13a). Notably, facility D has the highest proportion of reads
(74%) above the minimum ideal read length of 397 nt (Table 6). To confirm these
observations are free of potential bias caused by differences in sequencing yield,
we selected 1 million reads randomly from each facility. We observed that facility
C's leading tail remained substantial and that facility D had the lowest leading tail
of reads below 397nt and a higher peak than all other facilities (Figure 13b).
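The in-silico pyrosequencing and the derivation of the 397 nt threshold can be sketched as below. This is a minimal illustration rather than the custom script used in the study: the T, A, C, G flow order, the uniform base composition, the fixed random seed, and all function names are assumptions, and a smaller set of 1,000 templates is used to keep the example quick.

```python
import random

FLOW_ORDER = "TACG"  # assumed order of nucleotide flows within one cycle


def pyrosequence_length(template, cycles=200, flow_order=FLOW_ORDER):
    """Number of template bases read after the given number of cycles.

    Each flow extends the read through every consecutive template base
    that matches the flowed nucleotide (a whole homopolymer at once).
    """
    pos = 0
    for _ in range(cycles):
        for base in flow_order:
            while pos < len(template) and template[pos] == base:
                pos += 1
    return pos


def random_template(length, rng):
    """Random DNA sequence with equal base frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def minimum_ideal_length(observed_peak, random_peak, random_shortest):
    """Observed peak minus the peak-to-shortest distance of the random set."""
    return observed_peak - (random_peak - random_shortest)


rng = random.Random(0)
lengths = [pyrosequence_length(random_template(1200, rng)) for _ in range(1000)]
print(min(lengths), sum(lengths) / len(lengths), max(lengths))
print(minimum_ideal_length(478, 531, 450))  # 397
```

Under these assumptions a cycle is expected to read about 2.67 bases (on average two homopolymer runs per cycle, each averaging 4/3 bases), which is close to the 2.6 incorporations per cycle reported above.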
Table 6. Summary of participating core facilities and sequencing yield.
                                              All Reads                                  Reads > 397nt
Site  Team  Team  Instrument  Region  Peak    Read Count (% Δ)$   Bases (% Δ)$           Read Count (%)^   Bases (%)^
      Size  Runs  Runs                Length
A     2     22    22          1       481     601,679 (-10.55)    238,617,862 (-10.45)   398,245 (66.19)   189,657,001 (79.48)
                              2       477     596,503 (-2.62)     235,886,896 (-1.52)    395,681 (66.33)   187,629,648 (79.54)
                              1 & 2   481     1,198,182 (-6.77)   474,504,758 (-6.22)    793,926 (66.26)   377,286,649 (79.51)
B     5     172   163         1       481     679,997 (1.10)      273,067,587 (2.48)     464,393 (68.29)   222,011,763 (81.30)
                              2       475     625,654 (2.14)      243,215,385 (1.54)     385,954 (61.69)   182,799,792 (75.16)
                              1 & 2   481     1,305,651 (1.59)    516,282,972 (2.04)     850,347 (65.13)   404,811,555 (78.41)
C     3     128   128         1       481     791,274 (17.64)     296,820,523 (11.4)     480,925 (60.78)   227,892,482 (76.78)
                              2       477     774,951 (26.51)     283,485,969 (18.35)    444,718 (57.39)   209,747,097 (73.99)
                              1 & 2   477     1,566,225 (21.87)   580,306,492 (14.69)    925,643 (59.10)   437,639,579 (75.42)
D     2     5     5           1       478     743,160 (10.49)     308,348,892 (15.72)    536,017 (72.13)   256,322,790 (83.13)
                              2       478     563,799 (-7.96)     239,814,876 (0.12)     425,479 (75.47)   204,793,870 (85.40)
                              1 & 2   478     1,306,959 (1.70)    548,163,768 (8.34)     961,496 (73.57)   461,116,660 (84.12)
E     7     345   197         1       475     546,931 (-18.69)    215,406,359 (-19.16)   346,190 (63.30)   163,197,131 (75.76)
                              2       473     501,900 (-18.07)    195,249,283 (-18.49)   309,626 (61.69)   145,532,130 (74.54)
                              1 & 2   473     1,048,831 (-18.39)  410,655,642 (-18.84)   655,816 (62.53)   308,729,261 (75.18)
$ Percent deviation from average across all sites for that category
^ Percent of total reads or bases for that region
Figure 13. Core facility read length distribution.
(A) Total read length distribution for the 5 core facilities (A-E) and 500,000
randomly generated reads. The peak read length for the 5 core facilities (478nt)
and the random sequences (531nt) is marked below the x-axis. The shortest random
sequence (450nt) is also marked; its distance from the random peak is 81nt. This same
distance from the peak read length for the 5 core facilities is also marked (397nt).
(B) Read length distribution for 1 million randomly selected reads from each core
facility (A-E) and 500,000 random pyrosequencing reads [same as (A)].
Next, we chose to investigate variation in base quality between facilities,
including whether reads of less than ideal length (i.e., < 397 nt) are of lower
quality than longer reads. For each facility, we catalogued the quality values of all
bases from reads below and above 397 nt separately. For reads below 397 nt, we
observed a peak quality value of 30 to 31 (Figure 14a), with on average 65% of
bases being above this peak value. This is in contrast to reads above 397 nt, where
the average peak quality value is 36 (Figure 14a) and at least 90% of bases are
above or equal to a quality value of 30. Notably, for facility C, reads >397 nt have
a peak quality value of 37, and there is a slight shift towards having a greater
number of higher quality bases. Extending the quality analysis further, we
catalogued the average quality of bases from reads below and above 397 nt by
read position (Figure 14b). We observed that the reads below 397nt are of lower
average quality across their entire length as compared to longer reads (Figure
14b). Also, longer reads maintain a minimum quality value of 35 up to 300 nt,
whereas shorter reads show a gradual decrease in average quality value across
their length.
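The per-position comparison behind Figure 14b can be sketched as follows. This is a minimal Python illustration under stated assumptions: each read is represented only by its list of base quality values, reads of exactly the threshold length are grouped with the longer reads, and the function names and toy data are hypothetical.

```python
def split_by_length(reads, threshold=397):
    """Partition reads (lists of quality values) around a length threshold."""
    short = [quals for quals in reads if len(quals) < threshold]
    long_ = [quals for quals in reads if len(quals) >= threshold]
    return short, long_


def mean_quality_by_position(reads):
    """Mean quality value at each read position across a set of reads."""
    totals, counts = [], []
    for quals in reads:
        for i, q in enumerate(quals):
            if i == len(totals):  # first read to reach this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings with a toy threshold of 4 (not real 454 data).
toy_reads = [[30, 28, 20], [32, 30], [38, 37, 36, 35], [36, 37, 36, 33, 30]]
short, long_ = split_by_length(toy_reads, threshold=4)
print(mean_quality_by_position(short))  # [31.0, 29.0, 20.0]
print(mean_quality_by_position(long_))  # [37.0, 37.0, 36.0, 34.0, 30.0]
```

Each position is averaged only over the reads that extend that far, so the tail of each curve rests on progressively fewer reads.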
Figure 14. Base quality per core facility.
(A) Percentage of total base quality values per core facility (A-E), divided by
reads above (solid lines) or below (dotted lines) 397nt in length. (B) Mean
quality values by read position per core facility (A-E), divided by reads above
(solid lines) or below (dotted lines) 397nt in length.
Overview of the H327 Genome Assembly
The O. novo-ulmi H327 genome was assembled using the combined read set from
all five facilities (6,425,848 reads, 2,529,913,632 bases), as well as 181,162
paired-end sequence reads (7 kb average insert size ± 2.5 kb). Using Newbler
version 2.3, this produced an initial assembly of 19 scaffolds at an average 61x
depth of read coverage. The 19 scaffolds consisted of 9 large multi-contig
scaffolds (> 500 kb), 9 single contig scaffolds (< 5 kb), and one scaffold of 2
contigs containing a partial sequence of the ribosomal gene (~9 kb). Paired-end
information linked either end of this ribosomal scaffold to two larger scaffolds,
leading us to merge these three scaffolds together (scaffold00010 in Table 7).
Also using paired-end information, we placed 7 of 9 small single contig scaffolds
into sequence gaps within large scaffolds (data not shown). Of the two remaining
small scaffolds, one had low read depth of coverage and was discarded (length
2187 nt), while the other of length 4,151 nt had higher than average read coverage
of 73x and was retained separately (Table 7). NCBI BLAST analysis of this
scaffold shows 39% amino acid similarity (e-value 3e-29) to a predicted protein
from the filamentous fungus Grosmannia clavigera (DiGuistini et al., 2011).
After applying these fixes, the genome assembly of O. novo-ulmi H327 consists
of 9 scaffolds containing 160 contigs, and totals 31,789,037 nucleotides (Table 7
and Figure 15a). The sizes of the eight largest scaffolds are consistent with what
was observed using pulsed-field gel electrophoresis of chromosomes for other
strains of O. novo-ulmi (Figure 15b) (Et-Touil, et al., 1999). The absence of
paired-end information spanning the extremities of these scaffolds suggests they
are linear molecules. We also predict that the 8 large scaffolds contain a total of
8,613 genes (Table 7). To further assess the overall completeness of the genome
assembly, we aligned 3,309 EST sequences from Hintz et al. (2011) and
catalogued the number of unique high-scoring alignments. The vast majority of
ESTs (3,277, 99.01%) mapped to the assembly, suggesting that the coding portion
of the genome sequence is near complete.
Table 7. Overview of the O. novo-ulmi strain H327 genome assembly.
Name           Read Depth  Contigs  Size, nt (%)        Genes (%)      ESTs (%)     Structure
scaffold00005  60.87       34       6,937,932 (21.78)   1,854 (21.45)  697 (21.06)  Linear
scaffold00011  60.7        26       6,817,711 (21.4)    1,827 (21.14)  772 (23.33)  Linear
scaffold00002  60.99       17       3,669,772 (11.52)   985 (11.4)     517 (15.62)  Linear
scaffold00018  61.05       18       3,419,703 (10.74)   968 (11.2)     335 (10.12)  Linear
scaffold00012  61.25       12       2,848,703 (8.94)    794 (9.19)     241 (7.28)   Linear
scaffold00008  61.3        17       2,801,594 (8.79)    766 (8.86)     232 (7.01)   Linear
scaffold00010  60.91       22       2,758,224 (8.66)    756 (8.75)     246 (7.43)   Linear
scaffold00015  61.14       13       2,531,247 (7.95)    663 (7.67)     237 (7.16)   Linear
scaffold00004  73.81       1        4,151 (0.01)        1 (0.01)       0 (0)        Linear
contig00013    31.18       1        66,357 (0.21)       28 (0.32)      1 (0.03)     Circular
In addition to these linear scaffolds, we also assembled a 66,357 nt circular
scaffold, which has its origin spanned by multiple paired-ends and sequence reads
(Figure 15c) and is similar in size to known Ophiostoma mitochondria (Bates et
al., 1993). Also, NCBI BLAST alignment of the 28 predicted protein sequences
shows similarity to known mitochondrial protein sequences (data not shown).
The majority of assembled bases are covered by at least one read from each
facility (99.92%, 31,680,436 nt), with 94.06% of the assembled bases being
covered by 5 or more reads from each facility. Only 23,727 nt, or 0.074%, of the
genome is covered by different combinations of 4 facilities. Also, of the total
number of bases assembled, 99.93% (31,683,719) are at a quality value of 64.
These analyses suggest that the assembly is an accurate and high-quality
representation of the O. novo-ulmi strain H327 genome.
Figure 15. The O. novo-ulmi strain H327 genome assembly.
(A) A custom UCSC Genome Browser screen capture of the O. novo-ulmi H327
assembly. Tracks from top to bottom are: (i) a 10Mb scale, (ii) a ruler, (iii)
scaffolds ordered by decreasing size (alternating black/grey), (iv) contigs (black),
(v) depth of read coverage (brown), (vi) GC percent (gray), (vii) gene
predictions, and (viii) EST alignments. (B) Pulsed-field gel electrophoresis of 8
strains of O. novo-ulmi and O. ulmi. DNA ladder sizes are marked on the right
and estimated chromosome sizes are marked on the left. Reprinted from Et-Touil,
et al. (1999). (C) The mitochondrial genome assembly. Tracks from top to
bottom are: (i) a scale (20kb), (ii) a ruler, (iii) contig (black), (iv) whole genome
shotgun reads that span the origin, confirming circularity, (v) paired-end reads that
span the origin, confirming circularity.
Sequencing Accuracy
Using the genome sequence as a reference, we assessed the reproducibility of the
Roche/454 GS-FLX Titanium System across two sequencing accuracy metrics:
homopolymer length accuracy and substitution error rate.
In the O. novo-ulmi genome we catalogued a total of 408,315 homopolymers from
4 to 10 nt in length (Table 8). The majority (402,455, 98.5%) are less than 8 nt,
and there are roughly 2.8x as many A/T as there are C/G homopolymers (Figure
16a). Accuracy was measured by comparing homopolymer length from individual
reads to the orthologous occurrence in the genome sequence. To perform this
analysis, we aligned the reads from each facility to the O. novo-ulmi H327
genome, retaining only unique alignments greater than 300nt. We then catalogued
within each alignment the homopolymer length from both the read sequence and
the orthologous genome sequence. We only considered homopolymer alignments
that had flanking nucleotides that matched between the read and genome
sequence. Due to these criteria, we were able to measure homopolymer accuracy
for roughly 65-85% of all catalogued homopolymers (Table 8). Within this set,
we observed that accuracy decreases with increasing homopolymer length (Figure
16b). We also observed that while facilities B, C, D and E performed similarly
and maintained above 50% accuracy for all homopolymer lengths considered, the
performance for facility A was below that of the other facilities, achieving 65%
accuracy for homopolymers of length 7, and below 50% for homopolymers of
length 9 and 10. This lower accuracy for facility A was observed for all four
nucleotides (A, C, G and T), with some improvement in accuracy for guanine and
cytosine homopolymers (Figure 17a). This improvement is marginal, and
variation in homopolymer accuracy between bases for any single facility is
fairly consistent (Figure 17b). We next sought to determine whether inaccuracies
in homopolymer length were due to the under calling or overcalling of
homopolymer length in individual reads. Figure 17c demonstrates that the
majority of errors across all facilities are due to under calls, and that facility A
tends to under call homopolymers of length 9 or 10 at an equal or greater
proportion as compared to correct calls (Figure 17c).
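The cataloguing and comparison steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's own script: homopolymers are found with a regular expression, the read segment passed in is assumed to be the portion of an aligned read covering the homopolymer plus one flanking base on each side, and all function names are hypothetical.

```python
import re


def catalogue_homopolymers(sequence, min_len=4, max_len=10):
    """List (start, base, length) for homopolymers of min_len to max_len nt."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    found = []
    for match in pattern.finditer(sequence):
        length = match.end() - match.start()
        if length <= max_len:
            found.append((match.start(), match.group(1), length))
    return found


def read_homopolymer_length(read_segment, left_flank, base, right_flank):
    """Homopolymer length in a read segment, or None when the flanking
    nucleotides do not match those of the genome sequence."""
    match = re.fullmatch(f"{left_flank}({base}+){right_flank}", read_segment)
    return len(match.group(1)) if match else None


def classify_call(genome_length, read_length):
    """Label a measured homopolymer as correct, under call, or overcall."""
    if read_length == genome_length:
        return "correct"
    return "under call" if read_length < genome_length else "overcall"


genome = "ACGTAAAAACCCCGT"
print(catalogue_homopolymers(genome))  # [(4, 'A', 5), (9, 'C', 4)]
observed = read_homopolymer_length("TAAAAC", "T", "A", "C")
print(observed, classify_call(5, observed))  # 4 under call
```

Requiring the flanks to match is what excludes part of the catalogued homopolymers from measurement, as reported in Table 8.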
Similar to the homopolymer analysis, we conducted the substitution error rate
analysis using only unique alignments of > 300 nt. In addition, we only
considered alignments that had one or two mismatches to the genome sequence,
and only substitutions (not gaps) that appeared more than 5 nucleotides from the
end of the read or another mismatch. Using these criteria we catalogued 135,037
substitutions, and observed that 94.9% (128,201) occurred once per read. Within
each substitution type (e.g., T>C or A>G), the substitution error rate per facility
shows slight variation (Figure 18), with facility E having a slightly higher rate for
C>T or G>A substitutions, and facility B having a higher rate for G>T or C>A
substitutions. The most marked variation is the elevated error rates across all
facilities for T>C or A>G and C>T or G>A substitutions (Figure 18). This has
been observed previously for this platform (Campbell et al., 2008), and is
attributable to PCR fidelity during library preparation (Bracho et al., 1998;
Dunning et al., 1988; Ennis et al., 1990). Notwithstanding these polymerase-based
substitutions, the substitution error rate is below 0.4 per 1000 sequenced bases for
the other 4 substitution types and is similar to previously published observations
for pyrosequencing (Campbell, et al., 2008).
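The filtering criteria and the rate calculation can be illustrated with the following Python sketch. It is not the script used in the study and makes several assumptions: the genome and read are supplied as gap-free aligned strings of equal length, a substitution is written genome>read, complementary substitutions are reported together (as in Figure 18), and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(genome_base, read_base):
    """Name a substitution together with its complementary-strand form."""
    forward = f"{genome_base}>{read_base}"
    reverse = f"{COMPLEMENT[genome_base]}>{COMPLEMENT[read_base]}"
    return " or ".join(sorted((forward, reverse)))


def usable_substitutions(genome_aln, read_aln, min_distance=5):
    """Substitutions passing the filters: one or two mismatches in the
    alignment, each more than min_distance nt from either end of the
    read and from any other mismatch."""
    mismatches = [i for i, (g, r) in enumerate(zip(genome_aln, read_aln))
                  if g != r]
    if len(mismatches) not in (1, 2):
        return []
    last = len(read_aln) - 1
    kept = []
    for i in mismatches:
        far_from_ends = i > min_distance and last - i > min_distance
        far_from_others = all(abs(i - j) > min_distance
                              for j in mismatches if j != i)
        if far_from_ends and far_from_others:
            kept.append(substitution_class(genome_aln[i], read_aln[i]))
    return kept


def rate_per_1000(substitutions, aligned_bases):
    """Substitution error rate per 1000 sequenced bases."""
    return 1000 * substitutions / aligned_bases


genome = "A" * 10 + "C" + "A" * 10
read = "A" * 10 + "T" + "A" * 10
print(usable_substitutions(genome, read))  # ['C>T or G>A']
print(rate_per_1000(1, 2500))              # 0.4
```

A rate of 0.4 per 1000 bases corresponds to one substitution per 2,500 sequenced nucleotides.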
Table 8. Homopolymer measurement statistics.
Homopolymer       Measured Homopolymers per Site (%)
Length  Genome    A                 B                 C                 D                 E
4       294,261   251,918 (85.61)   253,370 (86.1)    254,124 (86.36)   255,884 (86.96)   251,041 (85.31)
5       79,130    64,110 (81.02)    64,569 (81.6)     64,609 (81.65)    65,265 (82.48)    63,792 (80.62)
6       20,974    15,758 (75.13)    15,873 (75.68)    15,718 (74.94)    16,093 (76.73)    15,618 (74.46)
7       8,090     5,544 (68.53)     5,650 (69.84)     5,405 (66.81)     5,723 (70.74)     5,510 (68.11)
8       3,381     2,173 (64.27)     2,192 (64.83)     2,018 (59.69)     2,249 (66.52)     2,144 (63.41)
9       1,654     1,083 (65.48)     1,129 (68.26)     935 (56.53)       1,138 (68.8)      1,065 (64.39)
10      825       547 (66.3)        570 (69.09)       411 (49.82)       589 (71.39)       532 (64.48)
Figure 16. Homopolymer counts and overall accuracy.
(A) Counts of homopolymers of length 4 to 10 in the O. novo-ulmi H327 genome.
(B) Homopolymer accuracy per core facility (A-E).
Figure 17. Aspects of homopolymer accuracy.
(A) Trend of facility accuracy per nucleotide. The leftmost plot is overall
accuracy per facility (identical to Figure 16B), with subsequent plots depicting
accuracy for each nucleotide separately. (B) Trend of nucleotide accuracy per
facility. Each chart depicts overall accuracy per facility and each plot line depicts
accuracy for each nucleotide. (C) Homopolymer overcall and under call rates. The
chart depicts for each facility the percent agreement (diagonal), under call (below
diagonal), and overcall (above diagonal) of homopolymers from read sequences
to the genome sequence.
Figure 18. Substitution error rate.
For each facility, the substitution rate of a nucleotide in read sequence as
compared to the genome sequence.
Discussion
The primary objective of this study was to test the reproducibility of the
Roche/454 GS-FLX Titanium platform across DSRG member facilities. Because
using existing data sets would introduce experimental noise, we chose to whole
genome sequence a single sample from O. novo-ulmi strain H327. Consequently,
we had the opportunity to provide the genome sequence of an important pathogen
to the fungal and arboreal research communities.
We have produced a high-quality assembly of the O. novo-ulmi H327 genome.
The multiple large linear scaffolds are similar in length and number to what was
observed previously (Dewar, et al., 1997) and the number of predicted genes is
similar to related fungi (DiGuistini et al., 2009). Also, the near complete mapping
of ESTs from Hintz et al. (2011) suggests that the coding portion of the genome is
well represented. In addition to the chromosomes, we fully assembled the
mitochondrial genome into a single contig, and confirmed its circularity using
both read sequence and paired-end information. This genome sequence provides a
foundation to conduct further research, such as EST sequencing experiments
similar to Hintz et al. (2011), as well as other functional genomics experiments
such as RNAseq or proteomics. Also, this genome assembly will complement
phenotypic and genetic studies, such as those that investigate the pathogenicity of
this organism (C. M. Brasier, 1996; Clive M. Brasier & Kirk, 2001; Et-Touil, et
al., 1999).
In general, the Roche/454 GS-FLX Titanium System was found to be
reproducible across the five facilities tested in this study, with no facility
performing below the minimum throughput limits specified by Roche/454.
However, some cross-facility variation does exist, which has the potential to
impact downstream genomic analyses. This variation was observed at the
sequencing yield and accuracy level.
During this study, reproducibility was tested using several sequencing yield
metrics, including the partitioning of reads below and above an ideal length. This
metric has revealed that facilities that produce a higher number of sequence reads
(e.g., facility C) may also be producing a higher proportion of low-quality short
reads as compared to other facilities (e.g., facility D). Depending on the goal of the
genomic project, using a higher proportion of lower-quality reads may impact
downstream analysis. For genome assembly, these short low-quality reads will
remain useful, contributing to overall genome coverage; however, for
metagenomics projects, where individual reads are used to identify species,
sequencing errors in short low-quality reads may result in false species or strain
identifications. As a result, a prudent method would be to apply both of these
metrics (overall yield and the proportion of reads above the minimum ideal
length), thus providing an overall assessment of the sequencing run and how
much of it is of high quality.
To investigate sequencing quality at the nucleotide level we tested the
reproducibility of homopolymer lengths as well as determined the substitution
error rate. The five facilities tested equally well for homopolymers below 7 nt,
achieving above 75% accuracy. Above this length all facilities showed a decrease
in accuracy, calling homopolymer lengths below what was observed in the
genome sequence (under call). Notably, facility A showed the most pronounced
decrease in homopolymer accuracy, with less than 50% accuracy for
homopolymers of length 9 or 10 nt. Assembling the genome using only reads
from this facility would have incorrectly determined the length of about 2,400
homopolymers, accounting for 0.008% of the genome sequence. This decrease in
accuracy would be significant for genomic samples with larger proportions of
long homopolymers, or where individual reads are of importance, such as in
metagenomics studies. Compared to the homopolymer error rate, the substitution
error rate due to pyrosequencing is much lower, occurring once for every 2,500
sequenced nucleotides. For particular substitutions, the error rate is higher due to
PCR fidelity during library preparation (Campbell, et al., 2008), yet remains
below that of homopolymers, at a rate of approximately 3.25 in 2,500 sequenced
nucleotides. This higher substitution error rate may introduce bias in
metagenomics experiments, where we would expect at least one error per 2-3 read
sequences of 450nt. For metagenomic samples, simulating the effect that
increases in homopolymer and substitution error rates have on species
identification would be informative.
Among the facilities tested in this study, we found no correlation between the size
or experience of the sequencing team and sequencing yield or accuracy. For
instance, facility D was the least experienced (lowest number of runs) but
produced the highest quality data, whereas teams with roughly equal experience
(B, C and E) varied in performance. For example, facility E had the lowest overall
sequence yield, whereas facility C had the highest, but both teams have over 100
runs of experience. Other factors not considered in this study, such as instrument
maintenance schedules, may show correlation with reproducibility.
Our study suggests that the Roche/454 GS-FLX Titanium System is reproducible
across member facilities, with some variation in sequence yield and homopolymer
length accuracy. This study demonstrates that a more detailed analysis of
sequence yield and quality would provide greater insight into performance of the
instrument.
Acknowledgements
V.F. was the recipient of a Canadian Institutes of Health Research Doctoral
Research Award. We thank Clotilde Teiling and Tim Harkins at Roche/454 for
donating the GS-FLX Titanium reagents. We are thankful to Louis Bernier
(Centre d'étude de la forêt, Université Laval, Québec, Québec) for providing the
strain of O. novo-ulmi H 7.
Connecting Text
My work to this point was largely focused on comparisons within internal
datasets: a genome assembly, or a genome sequence and its annotations. However,
another important aspect of genome analysis is relating these internally generated
sequences and annotations to those in public repositories. For Chapter 6, I
collaborated with an investigator from Agriculture and Agri-Food Canada (Dr.
Moussa S. Diarra), and used a summer internship at Microsoft® Research
(mentored by Dr. Simon Mercer) to develop a software application that mines
biological sequence annotations in public repositories.
Dr. Diarra previously isolated an antimicrobial resistant (AMR) strain of E. coli
(ECD-227) from broiler chicken (Diarrassouba et al., 2007). We used genome
sequencing and annotation to identify determinants responsible for this phenotype
(Chapter 7) (Forgetta, et al., 2012). During this project, we used contiGo and a
custom UCSC Genome Browser of ECD-227 to analyze the genome assembly
and annotations, and found that this strain had numerous virulence and AMR
genes, some of which were located on plasmids found to be from a Salmonella
enterica serovar. Also, the genome analysis determined that ECD-227 belongs to the
species E. fergusonii, correcting the E. coli typing performed previously using other
typing methods (Diarrassouba, et al., 2007). This study is the first to report a
multi-drug resistant E. fergusonii strain from a broiler chicken.
The analysis of ECD-227 included data mining the protein annotations for terms
associated with AMR and virulence, as well as determining whether or not these
genes were acquired from other species or strains. The annotations were obtained
by using a local installation of NCBI BLAST (Altschul, et al., 1997) to compare
the protein sequences to the NCBI non-redundant protein database. Although this
analysis was facilitated by using the UCSC Genome Browser, it was primarily
conducted by me using computer programming and high performance servers;
skills and resources that most microbiologists do not possess. This experience
inspired me to develop a software application, BLAST in Pivot (BL!P) (Chapter
6), that completely automates the alignment of biological sequences using NCBI
BLAST and presents the results for data exploration in a program called Pivot
(Chapter 6). Unlike existing applications (Catanho, et al., 2006; Gollapudi, et al.,
2008; J. He, et al., 2007; Xing & Brendel, 2001), BL!P does not require a local
installation of the NCBI BLAST programs or databases and allows for data
mining capabilities beyond the simple filters found in these programs.
Contribution of Authors
I created BL!P and prepared and wrote the manuscript (Chapter 6). The E.
fergusonii study was conceived by Moussa S. Diarra, and the bacterial strain
isolation and antibiotic profiling were performed by the other authors of the
publication (Forgetta, et al., 2012). The team at the McGill University and
Genome Quebec Innovation Centre performed the sequencing of the bacterial
strain. I conducted all the genome analyses and prepared and wrote the
manuscript (Chapter 7) (Forgetta, et al., 2012).
CHAPTER 6: A Tool to Automate Multiple BLAST Searches and
Dynamically Explore Results
Vincenzo Forgetta1, Moussa S. Diarra2, Ken Dewar1, and Simon Mercer3
1Department of Human Genetics, McGill University, 740 Dr. Penfield Avenue,
Montreal, QC, H3A 1A4, Canada. 2Pacific Agri-Food Research Centre,
Agriculture and Agri-Food Canada, PO Box 1000, Agassiz, BC, V0M 1A0,
Canada. 3Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA.
This work was performed by Vincenzo Forgetta during a Microsoft Research
internship from June to September, 2010. His mentor was Simon Mercer, Ph.D.,
Director of Health and Wellbeing, Microsoft External Research.
A modified version of this manuscript has been submitted to PLOS ONE.
The source code, binaries, and documentation are available at
http://blip.codeplex.com
Abstract
NCBI BLAST is a tool widely used to annotate biological sequences. Current
limitations in the annotation process are in part dictated by the methodology used.
The manual inspection of BLAST results is slow, tedious and limited to static
analysis of textual output, while automated analyses typically discard useful
information in favor of increased speed and simplicity of analysis. These
limitations can be addressed using novel data exploration and visualization
methods. We have created a program that automates NCBI BLAST searches,
fetches associated GenBank records, and dynamically explores the results. It also
provides an interface to create customized images for each BLAST match,
allowing the user to perform further customizations to meet their data exploration
objectives.
Introduction
An NCBI BLAST (Altschul, et al., 1997) search of available biological sequence
databases is a leading method used for biological sequence comparison and
annotation. The method consists of two tasks: querying one or more sequences
against a biological sequence database, followed by the analysis of significant
alignments to suggest the origin and function of the query sequences. The
majority of users perform a BLAST analysis using the web servers available at
NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi) followed by manual inspection of
the results. This approach entails reading or scanning the BLAST results web
page for each query separately and browsing to individual records in GenBank
(Benson et al., 1997) to obtain further information regarding the matched target
sequence. For a larger number of query sequences manual inspection is not
feasible, and numerous computer programs were developed to automate the
querying and analysis of a large number of BLAST records (Catanho, et al., 2006;
Darzentas, 2010; Gollapudi, et al., 2008; J. He, et al., 2007; Xing & Brendel,
2001). These programs require the local installation of the BLAST programs and
databases or process pre-computed BLAST results, and are available as a web-
service or can be downloaded and installed on a local server.
The analysis of BLAST results involves filtering for significant hits followed by a
more detailed analysis of the resulting output for biologically relevant results.
This analysis can be conducted using text-based and/or graphical representations
of the BLAST results. Currently available programs present results primarily in
tabular format, supplemented with simple graphical views that summarize the
pair-wise alignment for each hit. In general, these programs facilitate the post-
processing of BLAST results; however, limitations still exist for their use by some
biologists. These limitations include: (i) the need to install and operate BLAST
locally, (ii) the absence of additional information provided in GenBank for the
target sequence of each BLAST hit, (iii) a query-centric analysis that hinders the
ability to detect patterns that exist across multiple query sequences, and (iv) the
inability to customize aspects of the data analysis.
To address these limitations we have created a desktop application to
automatically query NCBI BLAST and allow for the dynamic exploration of the
results across multiple query sequences. Our goal was to create an application that
automates the NCBI BLAST search of hundreds, and potentially thousands, of
biological sequences, including multiple matches per query, as well as
incorporating information from the GenBank records. Because BLAST results are
textual, our aim included the development of a user interface to create custom
image layouts for each BLAST hit to allow the user to customize the visual
representation of the individual BLAST matches. The resulting application has
been named BL!P (pronounced \blip\), or BLAST in Pivot.
This report describes how we created BL!P and demonstrates its use in exploring
and visualizing the attributes contained within the BLAST and GenBank data
across multiple query sequences. As a test case example, we demonstrate the use
of BL!P in assessing the evolutionary origins of predicted proteins from a
genomic region of an E. coli-like strain isolated from broiler chicken (Forgetta, et
al., 2012), as well as detecting the presence of an introgressed bacteriophage
within this genomic region.
Implementation
The process for generating the output is straightforward and is similar to other
programs known as wizards: programs that instruct the user to complete a task,
such as the installation of a program, through a series of linear steps. The input is
a file with either DNA or protein sequences in FASTA format. In addition, the
user is prompted to specify a project name, a location in which to store the results,
as well as the BLAST parameters used to query the database and filter the results
(for further information see documentation on the BL!P website). To avoid the
inefficient use of time, internet bandwidth, and computational resources at NCBI,
the BLAST results and GenBank records are saved to disk. This enables BL!P to
resume an interrupted process, and is particularly important when processing
large numbers of query sequences that have lengthy BLAST execution times. To
further avoid inefficiency, the similarity searches are performed using the default
e-value parameter (10.0), with filters being applied to these results and used to
obtain a list of GenBank identifiers to download from NCBI. This version of
BL!P gathers 21 attributes from the BLAST match and associated GenBank
record. Following successful completion of the BLAST search and GenBank
download, the user is allowed to customize the image that will represent each
BLAST hit. The customized image consists of textual elements from the BLAST
and GenBank record. Each element can be positioned within the image layout and
its appearance modified (font size, font color, font family, etc.) to accentuate its
importance. The background color of the image varies according to the species or
genus of the BLAST hit's target sequence.
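The save-to-disk and filter-afterwards behavior described above can be sketched as follows. This is a minimal illustration in Python, not BL!P's actual C# implementation: the `fetch` callable stands in for a remote BLAST submission, and the result layout (a `hits` list with `genbank_id` and `evalue` fields) and one-file-per-query cache are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def run_with_cache(query_ids: Iterable[str],
                   fetch: Callable[[str], dict],
                   cache_dir: Path) -> Dict[str, dict]:
    """Fetch a result per query, reusing any result already saved to disk.

    Re-running after an interruption only fetches the queries that are
    still missing from the cache directory.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, dict] = {}
    for qid in query_ids:
        # Assumes query identifiers are safe to use as file names.
        cache_file = cache_dir / f"{qid}.json"
        if cache_file.exists():
            results[qid] = json.loads(cache_file.read_text())
            continue
        result = fetch(qid)  # hypothetical remote call
        # Write to a temporary file first so that an interrupted write
        # never leaves a truncated cache entry behind.
        tmp_file = cache_file.with_suffix(".tmp")
        tmp_file.write_text(json.dumps(result))
        tmp_file.replace(cache_file)
        results[qid] = result
    return results


def ids_passing_filter(results: Dict[str, dict], max_evalue: float) -> List[str]:
    """GenBank identifiers of stored hits that pass an e-value filter."""
    ids = set()
    for result in results.values():
        for hit in result["hits"]:
            if hit["evalue"] <= max_evalue:
                ids.add(hit["genbank_id"])
    return sorted(ids)
```

Because the search itself is stored unfiltered, a stricter or looser filter can be applied later without repeating the remote search; only the list of GenBank records to download changes.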
Following generation of the images for each BLAST hit, the user is directed to
visualize their results using the PivotViewer web-browser plugin. PivotViewer is
a data exploration program that allows for the seamless exploration of large
datasets that consist of text and images, and includes real-time text search and
filtering functionality. The list of BLAST hits, represented as images, is updated
in real-time based on the text search and filter criteria specified by the user.
In addition, the set of BLAST hit images can be navigated with the mouse as well
as zoomed in and out. Selecting a BLAST hit image raises an information panel
that contains the detailed information parsed from the BLAST result and
GenBank record. Individual details can be selected to further filter the BLAST hit
images by that value (e.g., Identity), navigate to a web resource (e.g., Gene), or
link to the pairwise alignment in text format (e.g., AlignLen). Double-clicking the
match's image will navigate to the GenBank record in NCBI. Tabular results can
be exported in text format for either the entire set of BLAST results or the
currently filtered set of hits via links in the title bar.
BL!P was developed in C# using .NET 4.0 and the Microsoft Biology Foundation
(MBF) open-source bioinformatics toolkit (http://research.microsoft.com/bio).
The graphical user interface was created using Windows Presentation Foundation
version 4. The MBF toolkit was used to access NCBI resources such as BLAST
and GenBank, as well as parsing biological sequence data. Submissions to
BLAST and data from GenBank are accessed according to NCBI recommended
usage policies. Freely available dynamically linked libraries were also used (see
BL!P website). BL!P requires Microsoft Windows XP or later, .NET 4.0, and the
Silverlight PivotViewer web browser plugin (see http://blip.codeplex.com for
more information).
Results
Functionality in BL!P is demonstrated in Figure 19 using a real world example of
223 predicted proteins from a 228 Kb genomic region of ECD227. Prior to
genome sequencing, conventional tests indicated an ambiguity in its taxonomy,
with ECD227 either belonging to E. coli or to a closely related genus. Figure 19a
demonstrates the two steps required to visualize the distribution of bacterial
species names among the top-scoring matches for each query. The resulting data
visualization shows that this region in ECD227 contains a roughly equal number
of genes that are most similar to either E. coli or E. fergusonii. Figure 19b shows
a refinement of this analysis: the interspersed positional distribution of
matches for each predicted gene indicates that this pattern is not due to the
insertion of one or more genomic islands, but that isolate ECD227 is closely
related to both species. Figure 19c demonstrates how to assess for the presence of
horizontally transferred sequences using the text search capabilities built within
Pivot. The procedure is similar to Figure 19a; however, the data is sorted using a
different category (input order), and not filtered for rank. Also in this example, the
keyword “phage” was used to filter the matches accordingly, and illustrates that
there is a bacteriophage present in the centre of this genomic region.
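The three explorations described above (limit to top-ranked matches, group by species, and search for a keyword in input order) amount to simple filter and group operations. The sketch below reproduces them outside of Pivot on a handful of invented records; the `Hit` fields and the example data are hypothetical and are not taken from the ECD227 results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    query_index: int   # position of the query in the input file (input order)
    rank: int          # 1 = top-scoring match for that query
    species: str       # species of the matched target sequence
    description: str   # free-text description of the target


def species_histogram(hits: Iterable[Hit], max_rank: int = 1) -> Counter:
    """Count species names among matches with rank <= max_rank."""
    return Counter(h.species for h in hits if h.rank <= max_rank)


def keyword_positions(hits: Iterable[Hit], keyword: str) -> List[int]:
    """Input-order positions of queries with a match mentioning the keyword."""
    needle = keyword.lower()
    return sorted({h.query_index for h in hits
                   if needle in h.description.lower()})


# Invented example records, for illustration only.
EXAMPLE_HITS = [
    Hit(1, 1, "Escherichia coli", "hypothetical protein"),
    Hit(1, 2, "Escherichia fergusonii", "hypothetical protein"),
    Hit(2, 1, "Escherichia fergusonii", "phage tail protein"),
    Hit(3, 1, "Escherichia coli", "Phage integrase"),
    Hit(4, 1, "Escherichia fergusonii", "transporter"),
]

if __name__ == "__main__":
    print(species_histogram(EXAMPLE_HITS))
    print(keyword_positions(EXAMPLE_HITS, "phage"))
```

A run of consecutive positions returned by `keyword_positions` corresponds to the kind of localized peak that suggests an inserted element such as a bacteriophage.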
Figure 19. Test case study using BL!P for the analysis of 223 predicted proteins
from E. fergusonii ECD227.
(A) Pivot image demonstrating the distribution of species names among the top-scoring matches:
(1) sliders in the Rank filter were used to limit rank value to 1; (2) the histogram was sorted by
species; (3) two predominant peaks that represent top-scoring matches for Escherichia coli (left,
white) and E. fergusonii ECD227 (right, light-brown) were observed. (B) Pivot image
demonstrating the input order of top-scoring matches: (1) the histogram was sorted by input order
and filtered for matches from E. coli or E. fergusonii (2). The interspersed distribution of E. coli
and E. fergusonii matches was observed. (3) Individual matches are represented by an image,
which contains additional BLAST/GenBank detail and can be customized prior to analysis in
Pivot (Figure 19D). Selecting the image in the PivotViewer raises a panel with additional
BLAST/GenBank details (not shown). (C) Pivot image demonstrating the positional distribution
of the top 5 matches for each predicted gene, filtered for the keyword “phage”: (1) the keyword
“phage” was entered using the search box; (2) the histogram was sorted by input order; (3) a peak of
matches indicating the presence of a phage was observed. (D) BL!P image of the custom image
layout creator displaying information for the top-scoring match for query sequence 17. The image
layout (6) is a dynamic interface that allows items to be moved (not shown) and their appearance
modified (3). Selecting a category dictates the background color scheme (1) (set to species in this
example). Attributes are added to the image (2) and their appearance is modified using the controls
(3). The currently displayed item can be changed (4), with details for each item appearing on the
right (5).
Because BLAST results and GenBank records consist entirely of textual
information, an interface was created enabling the user to create their own image
layout (Figure 19d). The image layout is created by selecting filters from the list
and adding them to the image preview; filters are deleted from the preview using
the delete or backspace key. The value for the particular BLAST hit appears in the
image preview; alternate BLAST hits can be previewed using the interface. The
filter value for the BLAST hit in the image preview can be moved with the mouse
and the font's appearance can be modified to accentuate its relative importance or
provide aesthetic appeal. The background color of the image varies according to
the species or genus of the match's target sequence. For more detail please refer
to the BL!P User Guide.
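One way to picture the image layout described above is as a list of positioned text elements plus a background color chosen by category. The sketch below is an illustration only: the palette, the attribute names, and the `TextElement` structure are hypothetical, and BL!P's real color scheme and layout format are not specified in this text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical palette; the colors used by BL!P are not given in the text.
PALETTE = ["#FFFFFF", "#D2B48C", "#ADD8E6", "#90EE90", "#FFB6C1"]


def assign_background_colors(categories: Iterable[str],
                             palette: List[str] = PALETTE) -> Dict[str, str]:
    """Give each distinct category (e.g. species) a stable background color."""
    colors: Dict[str, str] = {}
    for name in categories:
        if name not in colors:
            # Cycle through the palette if there are more categories than colors.
            colors[name] = palette[len(colors) % len(palette)]
    return colors


@dataclass
class TextElement:
    attribute: str             # which BLAST/GenBank attribute to print
    x: int                     # position within the image layout
    y: int
    font_size: int = 12
    font_color: str = "#000000"


def describe_image(hit: Dict[str, str], layout: List[TextElement],
                   colors: Dict[str, str], category: str = "Species") -> dict:
    """Resolve a layout against one hit: background plus positioned text."""
    return {
        "background": colors[hit[category]],
        "elements": [(e.x, e.y, e.font_size, e.font_color, hit[e.attribute])
                     for e in layout],
    }
```

Keeping the layout separate from the hit data means one layout can be previewed against any hit, which mirrors the preview behavior described in the paragraph above.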
Conclusions
BL!P attempts to make the process of BLAST search of many query sequences
trivial and customizable, allowing a biologist to explore BLAST results in a
timely manner. The example of ECD-227 shown in the present work demonstrates
only a few of BL!P's capabilities, and interested researchers are invited to try the
sample dataset available for download on the BL!P website, or to create
collections using their own DNA or amino acid sequences.
Availability and Requirements
This software is open source and is available for free at http://blip.codeplex.com
under the Microsoft Public License (Ms-PL). BL!P requires Microsoft Windows
XP or later.
Authors' contributions
VF conceived, designed and developed the software program and wrote the
manuscript; MSD was responsible for isolating and typing E. coli ECD-227. KD
revised the manuscript and provided intellectual content; SM was important in the
conception and design of the software program. All authors read and approved the
final manuscript.
Acknowledgements and Funding
We are grateful to Xin-Yi Chua for comments and C# expertise. We also thank
the developers of the MBF and PivotViewer for their technical support, and Bob
Silverstein and Xiaoji Chen from the EPX User Experience group for comments
and suggestions.
VF was a summer intern at Microsoft Research (Redmond, WA, USA). VF
received a standard intern salary and benefits from the Microsoft Corporation
during this time. VF is a recipient of a Canadian Institutes of Health Research
Doctoral Research Award. The bacterial typing and genome sequencing of
ECD227 were funded by Agriculture and Agri-Food Canada (MSD) and the
genome sequencing was performed at the McGill University and Genome Quebec
Innovation Centre (KD).
CHAPTER 7: Pathogenic and Multidrug Resistant Escherichia fergusonii
from Broiler Chicken
Copyright © 2012, Federation of Animal Science Societies.
Citation:
Vincenzo Forgetta, Heidi Rempel, François Malouin, Rolland Jr.
Vaillancourt, Edward Topp, Ken Dewar and Moussa S. Diarra. (2012).
"Pathogenic and multidrug-resistant Escherichia fergusonii from broiler chicken."
Poultry Science 91(2): 512-525.
Reprinted with the permission of Poultry Science (PS).
This is an author-created, uncopyedited electronic version of an article accepted
for publication in PS. The Federation of Animal Science Societies, publisher of PS,
is not responsible for any errors or omissions in this version of the manuscript or
any version derived from it by third parties. The definitive publisher-authenticated
version is available at
http://ps.fass.org/content/91/2/512.
Abstract
An Escherichia spp. isolate, ECD-227, was previously identified from broiler
chicken as a phylogenetically divergent and multidrug resistant E. coli
(Diarrassouba, et al., 2007) possessing numerous virulence genes (Lefebvre et al.,
2008; Lefebvre et al., 2009). In this study, whole genome sequencing and
comparative genome analysis were used to further characterize this isolate. The
presence of known and putative antibiotic resistance and virulence open reading
frames (ORFs) was determined by comparison to pathogenic (E. coli O157:H7
TW14359, APEC O1:K1:H7, and UPEC UTI89) and non-pathogenic species (E.
coli K-12 MG1655 and E. fergusonii ATCC 35469). The assembled genome
of 4.87 Mb was sequenced to 18-fold depth of coverage and predicted to contain
4,376 ORFs. Phylogenetic analysis of 537 ORFs present across 110 enteric
bacterial species identifies ECD-227 to be E. fergusonii. The genome of ECD-227
contains five plasmids showing similarity to known E. coli and Salmonella
enterica plasmids. Virulence and antibiotic resistance genes were
identified and localized to the chromosome and plasmids. The mutation in gyrA
(S83L) involved in fluoroquinolone resistance was identified. The Salmonella-
like plasmids harbor antibiotic resistance genes on a class I integron (aadA,
qacE-sul1, aac3-VI, and sulI) as well as numerous virulence genes (iucABCD,
sitABCD, cib, traT). In addition to the genome analysis, the virulence of ECD-227
was evaluated in a day-old chick model. In the virulence assay, ECD-227 was
found to induce 18% and 30% mortality in day-old chicks after 24 h and 48 h of
infection, respectively. This study documents an avian multidrug resistant and
virulent E. fergusonii. The existence of several resistance genes to multiple
classes of antibiotics indicates that infection caused by ECD-227 would be
difficult to treat using antimicrobials currently available for poultry.
Key words: Genome sequence, Escherichia fergusonii, broiler chicken, virulence,
multidrug resistance.
Introduction
Escherichia are found in many environments, including the digestive tracts of
mammals and birds. In the gut, most are part of the normal microflora; however,
certain strains can be pathogenic and able to induce intra-intestinal or extra-
intestinal disease, and can thus pose significant risks to health. Furthermore,
pathogenic strains of poultry and livestock can be transmitted to humans where
they may cause disease (Rodriguez-Siek et al., 2005; Ron, 2006). The genomes of
numerous pathogenic Escherichia coli strains have been sequenced, including
enterohaemorrhagic E. coli (EHEC) strain O157:H7 (Kulasekara et al., 2009;
Perna et al., 2001) that causes intestinal disease, uropathogenic E. coli (UPEC)
(Chen et al., 2006; Welch et al., 2002) strains that cause urinary tract infections,
and the avian pathogenic E. coli (APEC) strain O1:K1:H7 that causes
colibacillosis in poultry (T. J. Johnson et al., 2007). E. fergusonii is a species
first described in 1985 (Farmer et al., 1985), and the genome sequence
for a non-pathogenic strain was made available as part of a survey of E. coli
genome evolution (Touchon et al., 2009). E. fergusonii has recently been shown
to cause disease in animals (Bangert et al., 1988; Hariharan et al., 2007; Herraez
et al., 2005) and humans (Bain & Green, 1999; Funke et al., 1993; Mahapatra &
Mahapatra, 2005), and it possesses an extended spectrum of resistance to
antibiotics (Lagace-Wiens et al., 2010; Savini et al., 2008).
In North America, sub-therapeutic doses of antimicrobial agents continue to be
used in livestock and poultry feed to prevent and control infections, resulting in
improved weight gain and enhanced feed efficiency. This practice is known to
modify the intestinal micro-flora and create selective pressure favoring the
emergence of antibiotic resistant strains (Aarestrup et al., 2001). The emergence
of antimicrobial resistant and pathogenic bacteria in food producing animals
represents a serious food safety concern and is a threat to public health. In
bacteria isolated from poultry, determinants of antibiotic resistance have been
identified on mobile genetic elements such as plasmids, transposons and
integrons, thus potentially allowing these determinants to be spread between
species (Fricke et al., 2009). A study by Diarrassouba et al. (2007)
on antibiotic resistance and virulence determinants
across multiple broiler chicken farms isolated several antibiotic resistant E. coli.
One of these isolates, ECD-227, was found to be resistant to 22 of 25 antibiotics
tested. Subsequent studies conducted by Lefebvre et al. (Lefebvre, et al., 2008;
Lefebvre, et al., 2009) using DNA array hybridization showed that ECD-227
possessed numerous E. coli virulence genes, but this isolate was genetically
distant to most known E. coli strains, suggesting that it could be a divergent E.
coli strain or a closely related species.
Whole genome sequencing followed by comparative genome analysis allows the
comprehensive determination of evolutionary relationships among bacterial
species, and permits the identification and localization of candidate genes
involved in antibiotic resistance and/or virulence. The objective of this study was
to sequence the genome of ECD-227 in order to clarify its evolutionary
relationship with other enteric bacteria, and to assess its genome for the presence of
determinants responsible for the antibiotic resistance of this isolate. In addition, the
level of virulence of ECD-227 was assessed in a day-old chick model.
Materials and Methods
Bacterial Strains
The isolation and original characterization of ECD-227 has been described
previously (Diarrassouba, et al., 2007; Lefebvre, et al., 2009). E. coli D06-2195
(isolated from a broiler septicemia case) and E. coli K-12 MG1655 (American
Type Culture Collection) (Ngeleka et al., 2002) were used as positive and
negative controls, respectively, in the in vivo virulence assay.
Whole Genome Sequencing and Assembly
Genomic DNA was extracted from a single colony culture of stock kept at -80 °C
in 25% glycerol using a standard commercial column-based extraction method
(QIAamp DNA Mini Kit, Qiagen, Mississauga, CA). DNA was stored in 1X TE
buffer (pH 8.0), quantified by spectrophotometry (Bio-Rad SmartSpec 3000
UV/Visible Spectrophotometer, Mississauga, CA), visualized by 1% agarose gel
electrophoresis, and stored at -20 °C until further use. Whole-genome shotgun
library preparation and sequencing were performed using the Roche/454 GS-FLX
Titanium system following manufacturer protocols, and shotgun reads were
assembled using Newbler version 2.3. Contigs of at least 500
nucleotides were ordered and oriented based on alignment to the finished genome
of E. fergusonii ATCC 35469 (GenBank Accession CU928158). Plasmid contigs
were ordered and oriented based on the most similar plasmid sequences found in
NCBI GenBank.
Annotation and Comparative Analysis
Open reading frames (ORFs) were predicted using Glimmer 3.02 (Delcher, et al.,
2007). ORFs less than 200 nt were excluded from further analysis. Translated
ORFs were compared to all known protein sequences using NCBI BLAST
(Altschul, et al., 1990) (April 2010 version of the nr database). Amino acid
alignments below 60% identity or 80% query coverage were excluded from
further analysis. Protein alignments to representative pathogenic (E. coli APEC
O1:K1:H7 (T. J. Johnson, et al., 2007), UTI89 (Chen, et al., 2006), and O157:H7
TW14359 (Kulasekara, et al., 2009)) and non-pathogenic (E. coli K-12 MG1655
(Blattner et al., 1997) and E. fergusonii ATCC 35469 (Touchon, et al., 2009))
bacteria were extracted from the BLAST results. The GC content of the genome
was computed in 2000 nt windows that were staggered by 1000 nt. Putative
virulence and antimicrobial resistance genes were identified and positioned on the
genome by alignment of protein sequences obtained from Bruant et al. (2006) or
through manual inspection of the BLAST results. Linear representations of the
genome and plasmids were generated using a custom instance of the UCSC
Genome Browser (Kent, et al., 2002) housing the ECD-227 genome and
annotations.
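The numeric thresholds in this subsection (ORFs of at least 200 nt, alignments with at least 60% identity and 80% query coverage, and GC content in 2000 nt windows staggered by 1000 nt) can be expressed compactly. The following is a sketch under assumptions, not the scripts used in the study: the function names are hypothetical, identity and coverage are assumed to be percentages, and a trailing partial window is dropped, a detail the text does not specify.

```python
from typing import List, Tuple


def keep_orf(length_nt: int, min_length: int = 200) -> bool:
    """ORFs shorter than min_length nucleotides are excluded."""
    return length_nt >= min_length


def passes_alignment_filter(identity_pct: float, query_coverage_pct: float,
                            min_identity: float = 60.0,
                            min_coverage: float = 80.0) -> bool:
    """Keep alignments that meet both the identity and coverage thresholds."""
    return identity_pct >= min_identity and query_coverage_pct >= min_coverage


def gc_content_windows(sequence: str, window: int = 2000,
                       step: int = 1000) -> List[Tuple[int, float]]:
    """GC fraction in overlapping windows, as (start, gc_fraction) pairs.

    Windows start every `step` nt; a trailing window shorter than `window`
    is dropped unless the whole sequence is shorter than one window.
    """
    seq = sequence.upper()
    results: List[Tuple[int, float]] = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if not chunk:  # empty input sequence
            break
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / len(chunk)))
    return results
```

With a 2000 nt window and a 1000 nt step, each position away from the sequence ends is covered by two windows, which smooths the GC profile while keeping the resolution at about 1 kb.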
Phylogenetic Analysis
A total of 110 enteric bacterial species were selected from the NCBI BLAST
results, from which 537 homologous ORFs were found to be present in all
members. An amino acid multiple alignment was generated for each of the 537
ORFs using MAFFT (Katoh et al., 2002), and informative alignment columns
were extracted and merged. Phylogenetic analyses were conducted in MEGA4
(Tamura, et al., 2007) as follows: the evolutionary history was inferred using the
Neighbor-Joining method; the bootstrap consensus tree was inferred from 100
replicates; and branches corresponding to partitions reproduced in less than 95%
bootstrap replicates were collapsed. All positions containing gaps and missing
data were eliminated from the dataset (Complete deletion option) giving a total of
23,813 positions in the final dataset.
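To make the "complete deletion" step concrete, the sketch below concatenates per-gene alignments and then removes every column in which any taxon has a gap or missing character. It is a simplified stand-in, not the MAFFT or MEGA4 procedure itself: the function names are hypothetical, `-` and `?` are assumed to be the gap and missing-data symbols, and the selection of informative columns mentioned above is not reproduced.

```python
from typing import Dict, List


def concatenate(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments end to end; every taxon must be in each one."""
    taxa = list(alignments[0])
    for aln in alignments:
        if set(aln) != set(taxa):
            raise ValueError("all alignments must contain the same taxa")
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}


def complete_deletion(alignment: Dict[str, str],
                      missing: str = "-?") -> Dict[str, str]:
    """Drop every column where at least one taxon has a gap or missing data."""
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    if any(len(alignment[t]) != length for t in taxa):
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(length)
            if all(alignment[t][i] not in missing for t in taxa)]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}
```

Complete deletion is conservative: a single gap in one of many taxa removes the whole column, which is why the number of retained positions can be far smaller than the combined length of the individual alignments.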
Nucleotide Sequence Accession Numbers
The sequence and annotation of the ECD-227 chromosome and plasmids have been
deposited in NCBI GenBank under the accession numbers CM001142,
CM001143, CM001144, CM001145, CM001146, and CM001147.
Antibiotic Resistance Profiling
The antibiotic resistance profile for ECD-227 was determined for 18 antibiotics
previously (Diarrassouba, et al., 2007). Additional susceptibility of ECD-227 to
10 antibiotics and to imipenem, meropenem, lincomycin and rifampin was
determined using sensititre as described previously (Diarrassouba, et al., 2007)
and by disk diffusion assays according to Clinical and Laboratory Standards
Institute (CLSI) guidelines (CLSI, 2008).
Virulence in Day-old Chicks
The virulence of ECD-227 was tested by subcutaneous inoculation of day-old
broiler chicks obtained from a local hatchery (Western Hatchery, Abbotsford, BC,
Canada). ECD-227 and the control strains were grown in Tryptic Soy broth (TSB,
Becton Dickinson, Mississauga, ON, Canada) for 18 h at 37 °C and re-suspended
in TSB (Becton Dickinson) at a concentration of 1.2 × 10^7 CFU/ml
(confirmed by viable bacterial count). Groups of 40 day-old chicks were
subcutaneously inoculated with 0.25 ml/bird (approximately 3 × 10^6 CFU) and
divided among four cages (10 birds per cage). Care was taken to assure that similar
CFUs were subcutaneously inoculated for all tested isolates. Clinical symptoms
and mortality were first monitored hourly for 12 h and thereafter at 24 h and 48 h.
Birds in discomfort (signs of severe disease such as a drop in food consumption,
listless appearance with ruffled feathers, head drawn into their bodies, rapid
labored breathing, gasping or other respiratory distress) were removed and
euthanized. Two birds per cage from the remaining birds at the end of the
experiment were euthanized with carbon dioxide for necropsy. After 48 h, isolates
that killed more than 50%, 10-50% and less than 10% of chicks were classified
as virulent, moderately virulent and non-virulent, respectively. During necropsies,
livers, spleens, hearts and lungs were taken for bacterial isolation. All
experimental procedures performed with chicks were approved by the Animal
Care Committee of the Pacific Agri-Food Research Center (Agassiz, British
Columbia, Canada) according to guidelines described by the Canadian Council on
Animal Care (CCAC, 1993).
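The dose arithmetic and the mortality-based classification described above can be written out as a short worked example. The function names are hypothetical, and the handling of the exact boundaries (10% and 50% both counted as moderately virulent) is an assumption based on the stated ranges.

```python
def inoculum_cfu(concentration_cfu_per_ml: float, volume_ml: float) -> float:
    """Number of CFU delivered per bird."""
    return concentration_cfu_per_ml * volume_ml


def classify_virulence(deaths: int, group_size: int) -> str:
    """Classify an isolate from mortality at 48 h.

    More than 50% mortality: virulent; 10-50%: moderately virulent;
    less than 10%: non-virulent.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    mortality_pct = 100.0 * deaths / group_size
    if mortality_pct > 50.0:
        return "virulent"
    if mortality_pct >= 10.0:
        return "moderately virulent"
    return "non-virulent"


if __name__ == "__main__":
    # 1.2 x 10^7 CFU/ml at 0.25 ml per bird gives 3 x 10^6 CFU per bird.
    print(inoculum_cfu(1.2e7, 0.25))
    # Hypothetical count: 12 deaths in a group of 40 corresponds to 30%.
    print(classify_virulence(12, 40))
```

Under this scheme, the 30% mortality at 48 h reported for ECD-227 in the abstract falls in the moderately virulent range.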
Results and Discussion
Phylogenetic Assessment of ECD-227
Analyses of 537 homologous protein sequences were conducted against 110
bacterial species, encompassing E. fergusonii (ATCC 35469), 74 E. coli, 29
Salmonella, and 6 other enteric bacterial species. ECD-227 was more similar to E.
fergusonii ATCC 35469 than to any other enteric bacteria analyzed (Figure 20).
In addition, analysis of the BLAST results for the chromosomal predicted protein
set (4,058 protein sequences) against the NCBI nr database showed that 2,006
protein sequences (49%) had greater similarity to E. fergusonii ATCC 35469 than
to E. coli K-12 MG1655. Another 515 sequences (12%) were equally similar
between E. fergusonii ATCC 35469 and E. coli K-12 MG1655. Only 581 (14%)
had greater similarity to E. coli K-12 MG1655. Of the 956 predicted proteins
(23%) that were not present in both E. fergusonii ATCC 35469 and E. coli K-12
MG1655 there were 677 (16%) that had significant alignment to E. fergusonii
ATCC 35469 only. The remaining group of 279 (6%) predicted proteins were
either similar to E. coli K-12 MG1655 only (48 ORFs), pathogenic E. coli strains
(90 ORFs), or had no significant alignment and were unique to ECD-227 (141
ORFs).
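The breakdown above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts quoted in the text; the dictionary labels and the `percent` helper are illustrative names, not part of the original analysis. The whole-number percentages in the text appear to be truncated rather than rounded (for example, 515 of 4,058 is about 12.7%).

```python
# Counts quoted in the text for the 4,058 chromosomal predicted proteins.
TOTAL_PROTEINS = 4058

top_level = {
    "closer to E. fergusonii ATCC 35469": 2006,
    "equally similar to both references": 515,
    "closer to E. coli K-12 MG1655": 581,
    "not present in both references": 956,
}
# Breakdown of the 956 proteins not present in both references.
not_shared = {
    "aligned to E. fergusonii ATCC 35469 only": 677,
    "remaining group": 279,
}
# Breakdown of the remaining 279 proteins.
remaining = {
    "similar to E. coli K-12 MG1655 only": 48,
    "similar to pathogenic E. coli strains": 90,
    "unique to ECD-227": 141,
}


def percent(count, total=TOTAL_PROTEINS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Each level should partition the level above it.
assert sum(top_level.values()) == TOTAL_PROTEINS
assert sum(not_shared.values()) == top_level["not present in both references"]
assert sum(remaining.values()) == not_shared["remaining group"]

for label, count in top_level.items():
    print(f"{label}: {count} ({percent(count):.1f}%)")
```

Printing one decimal place makes it easier to see where the reported whole-number percentages come from.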
Figure 20. Phylogenetic tree of 110 enteric bacteria and E. fergusonii ECD-227.
The tree was constructed using protein sequences from 537 genes. The bootstrap
consensus tree was inferred from 100 replicates, and branches corresponding to
partitions reproduced in less than 95% bootstrap replicates are collapsed. ECD-
227 is more closely related to E. fergusonii (boxed) than to other E. coli (dark
gray) or Salmonella enterica (light gray) strains.
Biochemical Typing
Initially, ECD-227 tested negative for sorbitol (SOR), arginine dihydrolase
(ADH), ornithine decarboxylase (ODC) and fermentation of amygdalin (AMY),
and was identified as E. coli (97.2% accuracy) with API 20E and APILAB
software (version 3.3.3) (Diarrassouba, et al., 2007). However, subsequent DNA
hybridization studies (Lefebvre, et al., 2008; Lefebvre, et al., 2009) showed that
ECD-227 clustered distantly from the other E. coli isolates, indicating its
divergence from other E. coli strains. Due to the discrepancy between the API
20E test and the whole genome analysis, the API 20E test was repeated and now
identified ECD-227 as SOR negative, but positive for ADH, ODC, and AMY.
Using APIWEB version 4.1 (http://apiweb.biomerieux.com) ECD-227 was
identified as E. fergusonii (99.8% accuracy, but with the ADH result not matching
the identification). The difference between the initial API identification and the
one conducted as part of this study may be due to the improved API 20E
identification system and/or misinterpretations of the original tests of ECD-227 in
the ADH, ODC, and AMY reactions. Our whole genome sequence analysis
allowed us to comprehensively determine the phylogenetic relationship of ECD-227
to other enteric bacteria. To our knowledge, our study is the
first report of E. fergusonii in commercial broilers in Canada, although there is
a previous report of E. fergusonii in processed poultry meat from
New Zealand (Millar, 2007).
Overview of the ECD-227 Genome
The draft-quality genome assembly of ECD-227 was sequenced to an average
read depth of 18x and assembled into 83 contigs. The genome comprises a
chromosome of 4,509,764 nt (Table 9 and Figure 21) and five plasmids
(pECD227), ranging in size from 4,716 to 113,116 nt (Table 9). The entire
assembly possesses 4,376 predicted ORFs, with 4,058 ORFs on the 4.51 Mb
chromosome.
Table 9. Overview of the ECD-227 genome.
Replicon | Size (nt) | No. ORFs | No. Contigs | G+C Content % (± 1 std) | Average ORF Length (nt) | Coding Density (%) | Read Depth (± 1 std) | Est. Plasmid Copy No. | Most similar plasmid (GenBank)
Chromosome | 4,509,764 | 4,058 | 59 | 50.1 ± 4.1 | 978 | 88.0 | 16.9 ± 6.1 | |
Plasmids
pECD227_5 | 4,716 | 2 | 1 | 59.1 ± 5.2 | 923 | 39.1 | 63.4 ± 16.2 | 4 | pColG (NC_010904)
pECD227_46 | 45,563 | 29 | 7 | 58.0 ± 8.2 | 787 | 50.1 | 57.4 ± 23.8 | 3 | pCVM29188_46 (NC_011078.1)
pECD227_80 | 80,153 | 74 | 3 | 52.4 ± 2.8 | 874 | 80.7 | 20.4 ± 5.3 | 1 | pO111-2 (NC_013370.1)
pECD227_112 | 112,296 | 109 | 6 | 47.1 ± 6.9 | 788 | 76.5 | 37.1 ± 10.9 | 2 | pCVM29188_101 (NC_011077.1)
pECD227_113 | 113,116 | 104 | 7 | 50.5 ± 5.8 | 801 | 73.6 | 25.8 ± 7.0 | 2 | pCVM29188_146 (NC_011076.1)
Genome | 4,865,608 | 4,376 | 83 | | | | | |
Figure 21. A linear representation of the ECD-227 chromosome.
a: scale (1 Mb); b: chromosomal position ruler; c: assembly contigs (black); d:
origin of replication; e: genes present in E. coli K-12 MG1655 with (green) or
without (black) E. fergusonii ATCC 35469; f: genomic islands greater than 4 Kb
that are absent from the E. coli K-12 MG1655 genome; g: genes present in E.
fergusonii ATCC 35469 only; h: genes present in one or more pathogenic E. coli
strains with (green) or without (black) E. fergusonii ATCC 35469; i: genes present
in ECD-227 only (red); j: genes whose annotation suggests phage origin; k: genes
associated with antimicrobial resistance (red) or virulence (blue).
The GC content of the ECD-227 chromosome is 50.1%, which is similar to E.
fergusonii strain ATCC 35469 and other enteric bacteria. The plasmids had a
variable GC content ranging from 47.1 to 59.1%. Compared with the chromosome, the
ECD-227 plasmids tend to have higher depths of read coverage, lower average ORF
length, and lower coding density (Table 9).
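Two of the derived columns in Table 9 can be reproduced from the others. The sketch below is illustrative only: it assumes that coding density is the number of ORFs multiplied by the average ORF length, divided by replicon size, and that plasmid copy number is the plasmid read depth divided by the chromosomal read depth, rounded to the nearest integer. Neither formula is stated in this section, but both happen to reproduce the tabulated values. The function and variable names are hypothetical.

```python
# Values transcribed from Table 9:
# (size_nt, n_orfs, mean_orf_len_nt, read_depth, coding_density_pct, copy_no)
CHROMOSOME = (4_509_764, 4058, 978, 16.9, 88.0, None)
PLASMIDS = {
    "pECD227_5": (4_716, 2, 923, 63.4, 39.1, 4),
    "pECD227_46": (45_563, 29, 787, 57.4, 50.1, 3),
    "pECD227_80": (80_153, 74, 874, 20.4, 80.7, 1),
    "pECD227_112": (112_296, 109, 788, 37.1, 76.5, 2),
    "pECD227_113": (113_116, 104, 801, 25.8, 73.6, 2),
}


def coding_density(size_nt, n_orfs, mean_orf_len_nt):
    """Assumed formula: percentage of the replicon covered by ORFs."""
    return 100.0 * n_orfs * mean_orf_len_nt / size_nt


def copy_number(plasmid_depth, chromosome_depth):
    """Assumed formula: depth ratio rounded to the nearest integer."""
    return round(plasmid_depth / chromosome_depth)


chrom_depth = CHROMOSOME[3]
for name, (size, n_orfs, orf_len, depth, _, _) in PLASMIDS.items():
    density = coding_density(size, n_orfs, orf_len)
    copies = copy_number(depth, chrom_depth)
    print(f"{name}: coding density {density:.1f}%, estimated copies {copies}")
```

Under these assumptions, the higher read depth of the small plasmids is what drives their higher estimated copy number.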
All five ECD-227 plasmids had high similarity to known circular plasmids.
Plasmid alignments indicate that the ECD-227 plasmids are also circular.
Plasmids pECD227_5 and pECD227_80 are more than 97% identical to pColG
(Avgustin & Grabnar, 2007) and pO111-2 (similar to prophage plasmid P1)
(Ogura et al., 2009), respectively, and do not possess any large insertions or
deletions. Plasmids pECD227_46, pECD227_112 and pECD227_113 are more
than 97% identical to Salmonella plasmids (Fricke, et al., 2009) (Table 9).
Plasmid pECD227_46 is similar to pCVM29188_46 (NC_011078.1) which has
little similarity to other known plasmids (Fricke, et al., 2009). The sequence of
pECD227_112 is highly similar to that of pCVM29188_101 (NC_011077.1), with
the exception of an extra 28 Kb region. Analysis of this 28 Kb region indicated
that it was composed of a transposon similar to Tn21 harboring a class I integron
(Figure 22a). The largest plasmid, pECD227_113, is highly similar to
pCVM29188_146 (NC_011076.1), with the exception of one region that
possesses sporadic similarity and contains a putative adhesin and some putative
invasion genes (Figure 22b). This plasmid has a backbone that is highly similar to
the E. coli APEC plasmids pAPEC-O1-ColBM (T. J. Johnson, Johnson, et al.,
2006) and pAPEC-O2-ColV (T. J. Johnson, Siek, et al., 2006). The ability of the
ECD-227 plasmids to be transferred to other Enterobacteriaceae such as
Salmonella and E. coli has not been evaluated in this study. However, to our
knowledge this is the first study reporting an E. fergusonii isolate from chicken
that possesses multiple plasmids similar to those from a Salmonella species.
Figure 22. Linear representation of the two largest ECD-227 plasmids,
pECD227_112 and pECD227_113.
From top to bottom: a: scale (50 Kb); b: chromosomal position ruler; c:
assembly contigs (black); d: position of ORFs (black); e: ORFs associated with
antimicrobial resistance (red) or virulence (blue); f: similarity to transposon Tn21
(purple); g: similarity to the most similar plasmid from NCBI GenBank.
Antimicrobial Resistance of ECD-227
Results combined from this study and a previous study (Diarrassouba, et al.,
2007) found that ECD-227 is resistant to several structural classes of antibiotics
(Table 10). In addition, ECD-227 is resistant to lincomycin and rifampin but
susceptible to imipenem and meropenem.
Acquired multidrug resistance in bacteria may be a result of accumulation of
multiple genes, each coding for resistance to a single drug, or of the increased
expression of genes coding for multi-drug efflux pumps, extruding a wide range
of drugs (Nikaido, 2009). The ECD-227 efflux pump and resistance genes are
presented in Table 11. The genomic loci for the multi-drug resistance were found
across the chromosome (Figure 21) and plasmid pECD227_112 (Figure 22).
Chromosomal loci include the multi-drug efflux acrABDEFR, ampC/yfeW and
rarD genes conferring resistance to aminoglycosides, beta-lactams, macrolides,
and phenicol antimicrobials, respectively (Nikaido, 2009). The acrABDEFR
efflux pump, belonging to the resistance-nodulation-division family (RND), is
known to play a prominent role in multi-drug resistance in Gram negative bacteria
(Nikaido, 2009). The presence of such a system in ECD-227 would be expected to
contribute to the multi-drug resistance phenotype of this isolate. Another RND-
type efflux pump cluster, mdtABCDEFGHKLMNOP, involved in the resistance to
novobiocin in E. coli, has been detected in ECD-227 (Baranova & Nikaido,
2002). A multidrug efflux pump emrABDER, of the major facilitator superfamily
(MFS) has also been identified on the chromosome of ECD-227. This MFS pump
with 14 transmembrane domains confers resistance to nalidixic acid which could
contribute to the elevated MICs (>32 µg/ml) of this antibiotic against ECD-227.
Numerous other loci associated with multi-drug efflux systems including the arnE
of the small multidrug resistance (SMR) family were also found on the
chromosome (Table 11). The plasmid pECD227_112 bears a class I integron that
contains genes aadA/aacA, qacEΔ1, sulI, and tetAR which confer resistance to
aminocyclitols/aminoglycosides, sulfamides, and tetracyclines, respectively
(Figure 22a). The class I integron is present within a transposon similar to Tn21,
which contains the mercuric transport system (merACDRT). The chloramphenicol
acetyltransferase (cat) gene conferring resistance to chloramphenicol
was not detected on this plasmid, suggesting that the reduced susceptibility
(intermediary resistance) of ECD-227 to this antibiotic might be related to the
chromosomal efflux pump mdfA, which can also confer resistance to cationic dyes
and fluoroquinolones (Nikaido, 2009). In addition to the presence of antibiotic
resistance genes, we also detected the serine to leucine mutation at codon 83
(S83L) of the gyrA gene that confers resistance to quinolones (Hopkins et al.,
2005). This mutation was not detected in the E. coli strains or E. fergusonii ATCC
35469. Further work is needed to determine how much of the resistance profile is
accounted for by chromosomal or plasmid resistance genes and to further validate
the function of each resistance gene.
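A check for the S83L substitution can be sketched as follows. This is a minimal illustration, not the procedure used in this study: it assumes an in-frame coding sequence, uses the standard genetic code, and runs on toy sequences that are hypothetical and not real gyrA sequence. The names `residue_at` and `has_s83l` are invented for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def residue_at(cds, codon_number):
    """Return the amino acid encoded by a 1-based codon of an in-frame CDS."""
    start = 3 * (codon_number - 1)
    codon = cds[start:start + 3].upper()
    if len(codon) != 3:
        raise ValueError("CDS is too short for the requested codon")
    return CODON_TABLE[codon]


def has_s83l(reference_cds, query_cds):
    """True if codon 83 encodes serine in the reference and leucine in the query."""
    return residue_at(reference_cds, 83) == "S" and residue_at(query_cds, 83) == "L"


# Toy sequences (hypothetical): 82 filler codons, then codon 83, then one more codon.
filler = "GCT" * 82
reference = filler + "TCG" + "GCT"  # TCG encodes serine
query = filler + "TTG" + "GCT"      # TTG encodes leucine

print(has_s83l(reference, query))
```

In practice the query and reference sequences would first need to be aligned so that codon numbering is comparable; that step is omitted here.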
Table 10. Minimal inhibitory concentrations (MICs) of 28 antibiotics against
ECD-227.
Class | Antibiotic | MIC (µg/ml)^a | Phenotype^b | Reference
Beta-lactam | amoxicillin | > 16 | R | Diarrassouba et al. (2007)
 | amoxicillin-clavulanic acid | 32 | R | This study
 | ampicillin | > 32 | R | This study
 | cefoxitin | 32 | R | This study
 | ceftiofur | > 4 | R | Diarrassouba et al. (2007)
 | ceftriaxone | 0.5 | S | This study
 | penicillin | > 8 | R | Diarrassouba et al. (2007)
Phenicol | chloramphenicol | 16 | I | This study
Quinolone | ciprofloxacin | ≤ 0.015 | S | This study
 | clindamycin | > 4 | R | Diarrassouba et al. (2007)
 | enrofloxacin | 1 | I | Diarrassouba et al. (2007)
 | nalidixic acid | > 32 | R | This study
 | novobiocin | > 4 | R | Diarrassouba et al. (2007)
 | sarafloxacin | > 0.25 | R | Diarrassouba et al. (2007)
Macrolide | erythromycin | > 4 | R | Diarrassouba et al. (2007)
 | tylosin | > 20 | R | Diarrassouba et al. (2007)
Aminocyclitol | spectinomycin | > 64 | R | Diarrassouba et al. (2007)
Aminoglycoside | amikacin | 4 | S | This study
 | gentamicin | > 8 | R | Diarrassouba et al. (2007)
 | kanamycin | > 64 | R | This study
 | neomycin | 32 | R | Diarrassouba et al. (2007)
 | streptomycin | 64 | R | Diarrassouba et al. (2007)
Tetracycline | oxytetracycline | > 8 | R | Diarrassouba et al. (2007)
 | tetracycline | > 8 | R | Diarrassouba et al. (2007)
Sulfamide | sulfizoxazole | > 256 | R | This study
 | sulphadimethoxime | > 256 | R | Diarrassouba et al. (2007)
 | sulphathiazole | > 256 | R | Diarrassouba et al. (2007)
 | trimethoprim/sulphamethoxazole | 0.5 | S | Diarrassouba et al. (2007)
a, Breakpoints of Clinical Laboratory Standards Institute and the Canadian Integrated Program for Antimicrobial Resistance Surveillance.
b, I = intermediary resistant; R = resistant; S = susceptible.
Table 11. Antimicrobial resistance-associated genes of ECD-227.
Presence is listed in the order: APEC O1:K1:H7, UPEC UTI89, O157:H7 TW14359, K-12 MG1655, E. fergusonii ATCC 35469.
Gene(s) | Description | Class | Presence
Chromosomal
aae(ABRX) | pHBA resistance | | + + + + +
acr(ABDEFR) | multidrug efflux system | Aminoglycoside | + + + + +
ampC | beta-lactamase | Beta-lactam | + + + + +
arnE | multidrug resistance protein, SMR family | | - + + + +
baeR | DNA-binding transcriptional regulator | | + + + + +
bcr | bicyclomycin/multidrug efflux system | | + + + + +
emrC | multidrug-resistance antiporter | Macrolide | + + + + -
gyrA codon 83 (Ser>Leu) | gyrase A | Quinolone | - - - - -
hde(ABD) | acid-resistance protein | | + + + + +
marB | Multiple antibiotic resistance protein, putative exported protein | | - - - - +
mdfA | multidrug efflux system | | + + + + +
mdl(AB) | putative multidrug transport system | | + + + + +
mdt(ABCDEFGHKLMNOP) | multidrug efflux system | | + + + + +
pmrD | polymyxin resistance protein B | | + + + + +
rarD | putative chloramphenicol resistance permease | Phenicol | + + + + +
rob | DNA-binding transcriptional activator | | + + + + +
sdiA | DNA-binding transcriptional activator | | + + + + +
soxS | DNA-binding transcriptional regulator | | - + + + +
tehB | tellurite resistance protein TehB | | + + + + +
yddA | putative multidrug transporter | | + + + + +
yfeW | putative hydrolase/beta lactamase fusion protein | Beta-lactam | - - + + +
yggT | putative resistance protein | | + + + + +
yojI | multidrug transporter membrane component/ATP-binding component | | + + + + +
YP_002381967 | putative transcriptional repressor for multidrug resistance pump (MarR family) | | - - - - +
YP_002382714 | hypothetical protein, putative transcriptional activator for multiple antibiotic resistance | | - - - - +
pECD227_112 (Tn21 and class I integron)
aadA* | Aminoglycoside | Aminocyclitol, Aminoglycoside | + - - - -
mer(ACDRT) | mercuric transport system | | - - - - -
qacEΔ1* & sulI* | sulfonamide | Sulfamide | + - - - -
tetA* & tetR | tetracycline | Tetracycline | + - - - -
aacC | aminoglycoside 3-N-acetyltransferase | Aminoglycoside | + - - - -
*, previously reported by Diarrassouba et al., 2007.
Analysis of Virulence Genes
Comparative genome analysis was performed to investigate the existence of
putative virulence genes, and to determine whether they have been acquired via
lateral gene transfer from other pathogenic bacterial strains. The protein
sequences from E. coli strains APEC O1:K1:H7 (T. J. Johnson, et al., 2007),
UTI89 (Chen, et al., 2006), as well as O157:H7 (Kulasekara, et al., 2009) were
used as comparators. The E. coli K-12 MG1655 (Blattner, et al., 1997), and the E.
fergusonii strain ATCC 35469 (Touchon, et al., 2009) (isolated from human stool)
were also included in the analysis.
Of the 4058 ORFs identified on the ECD-227 chromosome (Figure 21), 3150
(77%) were present in E. coli K-12 MG1655 (Figure 21). Of these 3150 ORFs,
3102 (76%) were also present in E. fergusonii ATCC 35469, and 2731 (67%)
were present in all five strains analyzed. The remaining 908 (22.4%) ORFs were
present in different combinations of the pathogenic E. coli strains and/or E.
fergusonii, or were unique to this genome (Figure 21). A large proportion of these
ORFs, 410 (10.1%), were specific to E. fergusonii ATCC 35469 (Figure 21) and
include a diverse set of functions such as the metabolism of propanediol,
benzoate, citrate, and oxalacetate (Table 12 and Table 13), suggesting the ability of
ECD-227 to adapt and to use a number of strategies to meet its growth
requirements in environments having these various compounds. Relatively few
genes were specific to ECD-227 (141, 3.5%) and these were interspersed
throughout the genome (Figure 21), with the exception of four clusters (Table 12).
Two of these four clusters overlap a region associated with a phage as well as a
region having sequence similarity to the dnd sulfur modification operon of E. coli
B7A (GenBank Accession AAJT00000000.2). The second largest group of 357
(8.8%) ORFs was present in a combination of extra-intestinal pathogenic E. coli
(ExPEC) or EHEC strains and/or E. fergusonii, but absent from K-12 MG1655
(Figure 21). Some of these ORFs, including a putative adhesin operon
sii(EA/BA/CA/DA), two putative intimin or invasin proteins, and the auf fimbriae
operon are well known virulence factors of ExPEC and EHEC (Table 12 and
Table 13). Numerous other ORFs found in ECD-227 that are also present in
ExPEC/EHEC could be involved in metabolism and/or virulence of this
bacterium. Overall, the ECD-227 genome contains 57 islands of greater than 4 Kb
that are absent from the K-12 genome (Figure 21 and Table 12). We identified
one CRISPR-element that appears to be derived from the E. coli O157:H7
genome. Six of the larger islands were composed of ORFs with similarity to
phage elements, four of which appeared to be derived from pathogenic E. coli
strains (Table 12). We did not identify putative virulence genes associated with
these mobile elements. We also identified ORFs associated with virulence that are
present in K-12 (Table 13), suggesting that they either have no role in virulence or
are under altered regulatory control that enables pathogenesis, which ultimately
warrants further study. The genome characteristics of E. fergusonii ECD-227
suggest that it is a species with a more complex lifestyle than E. coli.
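The island sizes reported in Table 12 are consistent with the end coordinate minus the start coordinate. The sketch below illustrates that calculation and a size filter at the 4 Kb threshold mentioned above, using four rows transcribed from Table 12. The tuple layout, the `MIN_ISLAND_NT` constant, and the function name are assumptions made for the example, as is treating 4 Kb as 4,000 nt.

```python
# (island_no, start_nt, end_nt, reported_size_nt), transcribed from Table 12.
ISLANDS = [
    (1, 58607, 71748, 13141),
    (18, 1644938, 1648925, 3987),
    (24, 2116623, 2194456, 77833),
    (55, 4258515, 4260470, 1955),
]

MIN_ISLAND_NT = 4000  # assumed interpretation of the 4 Kb threshold


def island_size(start_nt, end_nt):
    """Size as reported in Table 12: end coordinate minus start coordinate."""
    return end_nt - start_nt


for number, start, end, reported in ISLANDS:
    size = island_size(start, end)
    status = "meets threshold" if size > MIN_ISLAND_NT else "below threshold"
    match = "matches" if size == reported else "differs from"
    print(f"Island {number}: {size} nt ({status}; {match} reported size)")
```

Rows that fall below the threshold are flagged so that they can be checked against the source table.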
Table 12. Gene content of different genomic islands of ECD-227.
Presence is listed in the order: APEC O1:K1:H7, UPEC UTI89, O157:H7 TW14359, K-12 MG1655, E. fergusonii ATCC 35469.
Island no. | Start (nt) | End (nt) | Island Type/Gene Content | Size (nt) | Presence
1 | 58607 | 71748 | E. fergusonii island: Oxalacetate decarboxylase and citrate lyase proteins | 13,141 | - - - - +
2 | 116722 | 121895 | E. fergusonii island | 5,173 | - - - - +
3 | 347577 | 354887 | CRISPR-element | 7,310 | - - + - -
4 | 374915 | 385596 | E. fergusonii island: erythritol phosphase/kinase and sugar transporters | 10,681 | - - - - +
5 | 467335 | 475780 | E. fergusonii island: Virulence protein msgA and sugar isomerases/transporters | 8,445 | - - - - +
6 | 554368 | 580706 | siiCA/EA/BA; adhesin for cattle intestine colonization, type I secretion system outer membrane protein | 26,338 | - - + - +
7 | 657665 | 669091 | putative intimin (attaching and effacing protein) | 11,426 | + + - - +
8 | 765422 | 792533 | Prophage | 27,111 | + + - - +
9 | 818394 | 829464 | E. fergusonii island | 11,070 | - - - - +
10 | 877239 | 882488 | E. fergusonii island: Phosphotransferase transport system | 5,249 | - - - - +
11 | 1186459 | 1191659 | E. fergusonii island | 5,200 | - - - - +
12 | 1197745 | 1202023 | E. fergusonii island | 4,278 | - - - - +
13 | 1206828 | 1214908 | E. fergusonii island: Cellulose synthase and regulator of cellulose synthase | 8,080 | - - - - +
14 | 1524525 | 1537916 | Putative multidrug resistance proteins; a transcriptional regulator (tetR family) | 13,391 | - + - - +
15 | 1539135 | 1547437 | E. fergusonii island: Propanediol utilization | 8,302 | - - - - +
16 | 1559875 | 1567082 | E. fergusonii island: Benzoate 1,2-dioxygenase/Muconolactone isomerases | 7,207 | - - - - +
17 | 1604765 | 1611475 | E. fergusonii island: TRAP dicarboxylate transporter locus | 6,710 | - - - - +
18 | 1644938 | 1648925 | E. fergusonii island: putative transcriptional activator for multiple antibiotic resistance; putative invasin-like protein | 3,987 | - - - - +
19 | 1803892 | 1810774 | E. fergusonii island: conserved hypothetical proteins; putative metallo-beta-lactamase | 6,882 | - - - - +
20 | 1839545 | 1849418 | E. fergusonii ECD227 island, adjacent to putative integrase | 9,873 | - - - - -
21 | 1872157 | 1877266 | E. fergusonii island: ABC-type transport system, adjacent to putative integrase | 5,109 | - - - - +
22 | 2047591 | 2078493 | Conserved hypothetical proteins | 30,902 | + + + - +
23 | 2079486 | 2090458 | E. fergusonii island: Propanediol utilization | 10,972 | - - - - +
24 | 2116623 | 2194456 | Prophage | 77,833 | Partial Partial Partial - Partial
25 | 2255244 | 2266222 | E. fergusonii island: Ribitol kinase and transporters | 10,978 | - - - - +
26 | 2467532 | 2475188 | Methylaspartate utilization | 7,656 | - - + - +
27 | 2491220 | 2499032 | E. fergusonii ECD227 island | 7,812 | - - - - -
28 | 2690056 | 2699367 | E. fergusonii island | 9,311 | - - - - +
29 | 2771035 | 2775565 | E. fergusonii island | 4,530 | - - - - +
30 | 2782291 | 2791622 | Prophage remnant | 9,331 | + + - - +
31 | 2857150 | 2876371 | E. fergusonii island: phage remnant | 19,221 | - - - - +
32 | 2931454 | 2937429 | E. fergusonii ECD227 island | 5,975 | - - - - -
33 | 2974993 | 2982753 | Putative type II secretion proteins | 7,760 | + + - - +
34 | 3012149 | 3019625 | Putative TRAP-type C4-dicarboxylate transport system proteins; putative dehydrogenases | 7,476 | + + - - +
35 | 3042628 | 3047606 | ygiL/G/H, fimbrial-like adhesin/usher/chaperone proteins | 4,978 | + + - - +
36 | 3074079 | 3083856 | E. fergusonii island: Glycine/Thioredoxin reductases | 9,777 | - - - - +
37 | 3110307 | 3117540 | E. fergusonii ECD227 island | 7,233 | - - - - -
38 | 3130279 | 3147202 | E. fergusonii island: 4HPA-hydroxylase operon | 16,923 | - - - - +
39 | 3152246 | 3157999 | Restriction modification system proteins | 5,753 | - - + - -
40 | 3169948 | 3193236 | E. fergusonii ECD227 island: DND sulfur-modification system | 23,288 | - - - - -
41 | 3205317 | 3217344 | E. fergusonii island: Malonate utilization system | 12,027 | - - - - +
42 | 3452481 | 3459670 | auf fimbriae system proteins | 7,189 | + + - - +
43 | 3508317 | 3523346 | fatty acid degradation; acyl carrier proteins | 15,029 | - - + - +
44 | 3525681 | 3531290 | putative phosphotransferase system proteins | 5,609 | + + + - +
45 | 3539005 | 3552722 | E. fergusonii island: siiEB/CB/BB/DB; adhesin for cattle intestine colonization, type I secretion system proteins | 13,717 | - - - - +
46 | 3622078 | 3626265 | E. fergusonii island | 4,187 | - - - - +
47 | 3670114 | 3675130 | haemagglutinins/invasins protein | 5,016 | + + + - -
48 | 3721148 | 3726588 | acidic carbohydrate kinase/aldolase | 5,440 | + + - - +
49 | 3903191 | 3927849 | Prophage | 24,658 | + - + - -
50 | 3970143 | 3975678 | E. fergusonii island | 5,535 | - - - - +
51 | 4030655 | 4036860 | E. fergusonii island; chitobiose/cellobiose phosphotransferase system | 6,205 | - - - - +
52 | 4151194 | 4156909 | sorbitol/sorbose operon proteins | 5,715 | + + + - +
53 | 4158916 | 4178573 | E. fergusonii island: Prophage | 19,657 | - - - - +
54 | 4213335 | 4226729 | Succinyl-CoA synthetases and C4-dicarboxylate transporters | 13,394 | + + - - +
55 | 4258515 | 4260470 | E. fergusonii island | 1,955 | + - - +
56 | 4325898 | 4334798 | TRAP transporter; putative oxidoreductase/CoA transferase/endonuclease/dehydrogenase | 8,900 | + + - - +
57 | 4479722 | 4504975 | Prophage | 25,253 | + + + - +
Apart from the chromosome, virulence genes were also present on the plasmids of
ECD-227 (Table 13, Figure 22). As characterized previously (Fricke, et al., 2009),
pECD227_46, which was found to be similar to pCVM29188_46 (NC_011078.1),
contains the cytotoxic protein ccdB. The plasmid pECD227_112, which is similar
to pCVM29188_101, possesses the colicin Ib gene (cib) and the complement
resistance traT gene. The largest plasmid, pECD227_113, contains the aerobactin
siderophore system encoded by iutA and iucABCD, as well as an iron/manganese
transport system encoded by sitABCD. The virulence-associated iron uptake
systems present on this plasmid as well as chromosomally-derived iron-related
genes (Table 13) suggest that ECD-227 has the potential to survive in an iron-
poor environment such as the host, and may possibly survive outside of the gut.
The pathogenic potential or capacity for horizontal transfer of these plasmids has
not been verified experimentally; however, the pathogenicity and transmissibility
of similar plasmids have been observed previously between Salmonella and E. coli
(T. J. Johnson et al., 2010), and the contribution of plasmids present in pathogenic
strains has also been demonstrated (T. J. Johnson, et al., 2010; Mellata et al., 2010).
Table 13. Virulence-associated genes of ECD-227.
Presence is listed in the order: APEC O1:K1:H7, UPEC UTI89, O157:H7 TW14359, K-12 MG1655, E. fergusonii ATCC 35469.
Gene(s) | Description | Presence
Chromosomal
cit(ABCDEFGPX) | citrate two-component system and lyase | - - - - +
eaeH | putative attaching and effacing protein | - - - - +
feoA | ferrous iron transport | - + + + +
feoB | ferrous iron transport | + + + + +
fep(ABCG) | ferrienterobactin uptake | + + + + +
focA | formate transporter | + + + + +
sii(EA/CA/BA/DA) | adhesin for cattle intestine colonization | - - + - +
tol(BCQR) | receptor and inner membrane complex of the Tol system | + + + + +
tsx | nucleoside-specific channel-forming protein | + + + + +
csgE | assembly/transport component in curli production | + + + + +
fliC | flagellin | + + + + +
gadA | glutamate decarboxylase A, PLP-dependent | + + + + +
ibeB | IbeB invasion gene locus, ibeB, required for penetration of brain microvascular endothelial cells | + + + + +
ompA | ompA protein outer membrane protein II | + + + + +
artJ | arginine ABC transporter, periplasmic arginine-binding protein ArtJ | + + + + +
ycfZ | Hypothetical protein ycfZ putative factor | + - - + +
ydcM | Hypothetical protein ydcM putative factor | - - - - -
mviM | putative virulence factor | + + + + +
mviN | putative virulence factor | + + + + +
shf | putative virulence factor (plasmid pAA2), similar to Shigella flexneri Shf | - + - - +
virK | VirK- similar to Shigella flexneri VirK | - - - - +
pECD227_46
ccdB | cytotoxic protein | - + + - -
pECD227_112
cib | colicin Ib | - - - - -
traT | complement resistance protein | - - - - -
pECD227_113
iutA & iuc(ABCD) | aerobactin receptor and biosynthesis | + - - - -
sit(ABCD) | iron periplasmic binding protein, ATP-binding protein, and iron permeases | + + - - -
traT | complement resistance protein | + + - - -
Virulence Assay in Day-old Chicks
Due to the presence of several virulence genes detected in the present and
previous studies (Diarrassouba, et al., 2007; Lefebvre, et al., 2009), we evaluated
and compared the in vivo virulence of ECD-227 to that of a clinically virulent
isolate (D06-2195) in day-old chicks (10^6 CFU/bird). As expected, the clinical
isolate D06-2195 killed 40% and 51% of chicks at 24 and 48 h post
inoculation, respectively (Figure 23). In contrast, no mortality was recorded when
E. coli K-12 was inoculated, whereas ECD-227 showed moderate virulence
according to our classification, inducing 18% and 30% mortality at 24 h and
48 h post infection, respectively (Figure 23). Moreover, both D06-2195 and ECD-227,
but not K-12, were detected in internal tissues of all diseased and dead birds,
suggesting that ECD-227 is likely to induce septicemia in young chicks.
Determining the contribution of specific genes to ECD-227 pathogenesis is
beyond the scope of this study. However, our findings demonstrate the in vivo
virulence of ECD-227 and show that further study is warranted to investigate which
virulence determinants directly contribute to its pathogenic ability and
extra-intestinal survivability.
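The classification applied here follows the thresholds given in the methods: isolates killing more than 50%, 10-50%, and less than 10% of chicks after 48 h are virulent, moderately virulent, and non-virulent, respectively. A minimal sketch of that rule is shown below. The function name is hypothetical, and the handling of values exactly at 10% and 50% is an assumption, since the text does not say which class the boundaries fall into.

```python
def classify_virulence(mortality_pct_48h):
    """Classify an isolate from its 48 h mortality (percent of chicks killed).

    Assumption: exactly 10% and exactly 50% are treated as moderately virulent.
    """
    if not 0 <= mortality_pct_48h <= 100:
        raise ValueError("mortality must be a percentage between 0 and 100")
    if mortality_pct_48h > 50:
        return "virulent"
    if mortality_pct_48h >= 10:
        return "moderately virulent"
    return "non-virulent"


# 48 h mortality values reported in the text.
observed = {"D06-2195": 51, "ECD-227": 30, "E. coli K-12": 0}
for isolate, mortality in observed.items():
    print(f"{isolate}: {mortality}% mortality, {classify_virulence(mortality)}")
```

Applied to the reported 48 h mortality, this places D06-2195 in the virulent class, ECD-227 in the moderately virulent class, and K-12 in the non-virulent class, in agreement with the text.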
Figure 23. Mortality rates (%) induced by ECD-227 compared to that induced by
clinically virulent E. coli D06-2195 and non-virulent E. coli K-12 in a day-old
chick infection model.
Conclusions
Since E. fergusonii was first identified as a species of the family
Enterobacteriaceae by Farmer et al. in 1985 (Farmer, et al., 1985), numerous
studies have demonstrated its capacity to cause disease in both humans (Bain &
Green, 1999; Funke, et al., 1993; Lagace-Wiens, et al., 2010; Mahapatra &
Mahapatra, 2005; Savini, et al., 2008) and animals (Bangert, et al., 1988;
Hariharan, et al., 2007; Herraez, et al., 2005). This study provided the first
genome sequence of a multi-drug resistant and pathogenic E. fergusonii isolated
from a broiler chicken. This isolate causes disease and mortality in chicks and
surviving chicks may be a reservoir for further infection. Numerous antimicrobial
and virulence determinants are present on mobile genetic elements potentially
acquired from related species, further demonstrating that transfer of mobile
genetic elements from other bacterial species can lead to the emergence of
potentially new multidrug resistant pathogenic strains. Further study of the
numerous antibiotic and virulence determinants should elucidate their precise role
and whether or not their regulation plays an important role. Our findings are of
concern to poultry health because such strains, if involved in infection, would be
difficult to treat with antibiotics currently available for poultry. Furthermore, the
potential for such a strain to infect humans represents a public health
concern. This study provides the first documentation that an E. fergusonii isolated
from poultry has accrued multiple resistance and virulence genes and that the
isolate has an elevated level of pathogenicity.
Acknowledgements
This study was supported by Agriculture and Agri-Food Canada (AAFC) through
the Sustainable Agriculture Environmental Systems (SAGES) program. We also
acknowledge funding support from grant 89758-2010 from the Natural Sciences
and Engineering Research Council of Canada to FM. We recognize the technical
assistance of Andrew Metcalfe (AAFC, Agassiz, BC). The genome sequencing
was performed at the McGill University and Genome Quebec Innovation Centre.
VF was the recipient of a Canadian Institutes of Health Research Doctoral
Research Award. We are thankful to M. Ngeleka (University of Saskatchewan,
SK, Canada) for providing E. coli strain D06-2195. Pacific Agri-Food Research
Center contribution: #803.
CHAPTER 8: Impact of Research, Future Work, and Concluding Remarks
Bioinformatics has become an integral part of genomics, intertwining itself
throughout all aspects of genomic data analysis. Primarily, this interdependence
began with the use of modern DNA sequencing platforms, which required the use
of computers and software to collect, process, and analyze large and complex
genomic data sets. Moreover, recently established MPS platforms have
democratized genome sequencing, giving biologists access to a method that was
previously limited to large laboratories for the study of reference genomes. This
democratization has yet to spread to bioinformatics, leaving many biologists new
to genomic data analysis to struggle with sophisticated software that requires
bioinformatics expertise or resources. Consequently, I chose to address this gap
between genome analysis and the biologist by developing bioinformatics tools
intended for them to use. Specifically, my aim was to develop tools for three
analytical steps in a genome sequencing project: [i] display and integrated
analysis of genomic data, [ii] assembly quality assessment, and [iii] deriving
biological insight using public information. In large part, I was inspired by
research that I conducted in actual genome projects.
I developed three innovative applications. They have been published via the
internet (http://github.com/vforget/ or http://blip.codeplex.com/) and prepared as
manuscripts for peer-reviewed publication. My research also included scientific
investigations in three genome sequencing projects, which has led me to prepare
three manuscripts for peer-reviewed publication. Two of these manuscripts are
already published in scientific journals (Forgetta, et al., 2011; Forgetta, et al.,
2012). The applications and genome projects are discussed individually within
each of the manuscripts. What follows is a discussion of the impact of my
research findings that extend beyond this thesis. I also discuss future
improvements to each bioinformatics application with regard to additional features,
usability and scalability.
Impact of the Genome Sequencing Projects
Fourteen-Genome Comparison Identifies DNA Markers for Severe-Disease-
Associated Strains of C. difficile
This comparative genomics study has had impact in private and public research.
The 18 SNPs and 12 conserved genes have garnered the interest of the MultiGen
biotechnology company (http://www.multigen-diagnostics.com/) for development
into a medical diagnostic test. In addition, this genome sequencing project has
given our group the opportunity to collaborate with other researchers, which
includes the investigation of intestinal microbiome of patients infected with C.
difficile (Prof. Amee Manges).
Reproducibility of the Roche/454 GS-FLX Titanium System to Genome Sequence
the Dutch Elm Disease Pathogen
The impact of this study has been two-fold. Firstly, the O. novo-ulmi custom
UCSC Genome Browser (http://www.genomequebec.mcgill.ca/compgen/browser-
ophiostoma/cgi-bin/hgGateway) continues to be used by researchers at Université
de Laval in Quebec City (Prof. Louis Bernier) to investigate the evolution of the
lipoxygenase genes. Secondly, additional tools were developed to assess the
performance of the Roche/454 GS Platform. One such tool converts the raw data
from the instrument into a movie, which has been used to confirm the fluidics
problems with the instrument (for example, see
http://www.genomequebec.mcgill.ca/compgen/public/meetings/ABRF2010/bubble1.avi).
Pathogenic and Multidrug Resistant E. fergusonii from Broiler Chicken
This research has spurred further development of methods to investigate the
prevalence of antimicrobial resistance and virulence genes in poultry. I continue
to collaborate with Dr. Moussa S. Diarra, and we are currently investigating the
metagenomic profile of the gut microbiome in the context of food safety.
Bioinformatics Software
This section discusses the adoption of each bioinformatics tool by our research
group and by others in the scientific community. I then discuss each tool with
regard to features to be implemented, usability testing, and the scalability of
the programs to human-sized datasets.
Adoption
Overall, the adoption of each application varies according to time since initial
release. For instance, cgb, which I developed early on in my research, has been
used more often, whereas BL!P and contiGo have been used less frequently.
Cgb has been used to create custom browser instances for the three genome
sequencing projects presented in this thesis, assisting in the analysis and
visualization of results, and ultimately contributing to preparation of three
manuscripts (Chapter 3 (Forgetta, et al., 2011), Chapter 5 (in preparation), and
Chapter 7 (Forgetta, et al., 2012)). In addition, cgb has been used to create custom
browser instances for more than 50 genome sequencing projects at the McGill
University and Genome Quebec Innovation Centre (MUGQIC) (personal
communication with research staff, Gary Leveque and Pascal Marquis). These
include two large genome sequencing projects: a fungal genomics project
(http://www.fungalgenomics.ca/wiki/Main_Page) and a vervet monkey genome
project (http://www.genomequebec.mcgill.ca/compgen/vervet_research/genomics_genetics/).
Cgb is being considered for use as a service offered to clients of the MUGQIC.
ContiGo has been used to assess genome assemblies across two projects in this
thesis (Chapter 5 (Forgetta, et al., 2011) and Chapter 7 (Forgetta, et al., 2012)). It
is also being used by our group to analyze metagenomic assemblies of the vervet
gut microbiome (Sudeep Mehrotra, PhD candidate). Outside of our group, it has
been used to present genome assembly data to more than 20 clients of the
MUGQIC. ContiGo is also being considered for use as a service offered to clients
of the MUGQIC.
BL!P is published online (http://blip.codeplex.com), and is part of the Microsoft
Biology Tools collection (http://research.microsoft.com/en-us/projects/bio/mbt.aspx).
As of early 2012, the program has been downloaded over 800 times
(http://blip.codeplex.com/stats).
Features
It is common practice to update software programs because new needs arise or
existing needs change. Over the course of my PhD, the bioinformatics tools that I
developed were continually improved based upon the requirements of the genome
projects or users, including myself. What follows are features that will be
implemented to address the most recent needs I have encountered.
Currently, cgb populates the browser with a basic set of annotations, including
contigs, scaffolds, depth of read coverage, and GC percent. Future work includes
increasing the number of automated annotations, such as gene predictions, cellular
localization, and functional categories. I also plan to collaborate with the
UCSC Genome Browser software development team to allow others to download and
use cgb via the UCSC Genome Browser web portal, and to extend cgb’s
functionality by incorporating their expertise into its development.
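As an illustration of one of the basic annotations listed above, GC percent can be computed over consecutive fixed-size windows along a contig. The following is a minimal Python sketch and not cgb's actual implementation; the function name, the window size, and the tab-separated output layout are assumptions made for this example.

```python
def gc_percent_windows(sequence, window=5):
    """Return (start, end, gc_percent) for consecutive windows of a sequence.

    Hypothetical helper for illustration only. Coordinates are 0-based and
    half-open. Characters other than A, C, G, or T (for example N) are
    excluded from the denominator.
    """
    results = []
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        acgt = sum(1 for base in chunk if base in "ACGT")
        percent = 100.0 * gc / acgt if acgt else 0.0
        results.append((start, start + len(chunk), percent))
    return results


if __name__ == "__main__":
    # Print one line per window: contig name, start, end, GC percent.
    for start, end, pct in gc_percent_windows("ATGCGCGCATATNNGGCC", window=6):
        print(f"contig1\t{start}\t{end}\t{pct:.1f}")
```

Excluding ambiguous bases from the denominator is a design choice of this sketch; a window consisting only of N is reported as 0.0 rather than raising an error.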
ContiGo’s interface was developed from scratch by combining basic components
(tables, plots, and a read pileup) into a novel graphical interface. Fellow lab
members and members of the MUGQIC (Gary Leveque, Pascale Marquis, and
Sudeep Mehrotra) have used it to analyze over 50 genome or metagenome
assemblies, but further improvement is needed in some respects. For instance,
greater interactivity between the different elements of the display (e.g., selecting
points in a chart highlights the corresponding rows in the tables) will be
implemented. Also, contiGo does not currently visualize paired-end information,
which limits its ability to detect certain types of assembly errors, such as
internal rearrangements (Phillippy, et al., 2008).
Future work will involve extending contiGo to support paired-end information in
the quality assessment process.
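To illustrate how paired-end information could contribute to quality assessment, the sketch below flags read pairs whose placement in an assembly is inconsistent with the sequencing library. It is a simplified Python example under stated assumptions, not contiGo's design: the ReadPair record, its field names, the inward-facing (forward then reverse) orientation, and the use of the distance between leftmost positions as a proxy for insert size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical record of one read pair mapped to an assembly."""
    name: str
    contig1: str
    pos1: int        # leftmost mapped position of the first read
    forward1: bool   # True if the first read maps to the forward strand
    contig2: str
    pos2: int        # leftmost mapped position of the second read
    forward2: bool


def classify_pair(pair, expected_insert, tolerance):
    """Classify a pair as ok, different_contigs, bad_orientation, or bad_distance.

    Assumes an inward-facing library: the leftmost read is on the forward
    strand and the rightmost read is on the reverse strand.
    """
    if pair.contig1 != pair.contig2:
        return "different_contigs"
    if pair.pos1 <= pair.pos2:
        left_forward, right_forward = pair.forward1, pair.forward2
    else:
        left_forward, right_forward = pair.forward2, pair.forward1
    if not left_forward or right_forward:
        return "bad_orientation"
    separation = abs(pair.pos2 - pair.pos1)
    if abs(separation - expected_insert) > tolerance:
        return "bad_distance"
    return "ok"


if __name__ == "__main__":
    pair = ReadPair("pair1", "contig1", 100, True, "contig1", 9100, False)
    print(pair.name, classify_pair(pair, expected_insert=3000, tolerance=500))
```

A cluster of pairs classified as bad_orientation or bad_distance within one region of a contig would be a candidate for closer inspection as a possible rearrangement.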
I have demonstrated BL!P’s functionality at five scientific conferences and
within our group, and have received positive feedback, particularly with
respect to its data exploration capabilities. Requests for improvement largely
concern support for different input and output formats or methods. For example,
the ability to load pre-computed BLAST results will be implemented, as well as
the ability to incrementally add new query sequences.
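One way such pre-computed results could be loaded is from the tab-separated output that NCBI BLAST+ writes with -outfmt 6, whose default twelve columns are listed in the sketch below. This Python example is only an illustration of the idea and does not describe BL!P's code; the function names and the dictionary-based store are assumptions.

```python
import csv
import io

# Default column order of NCBI BLAST+ tabular output (-outfmt 6).
BLAST_TAB_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tabular(handle):
    """Yield one dict per hit from BLAST tabular text (hypothetical helper)."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and rows with missing columns.
        if len(row) < len(BLAST_TAB_FIELDS) or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_TAB_FIELDS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        yield hit


def add_hits(results, hits):
    """Add hits to a dict keyed by query id; return the newly seen query ids."""
    new_queries = []
    for hit in hits:
        if hit["qseqid"] not in results:
            results[hit["qseqid"]] = []
            new_queries.append(hit["qseqid"])
        results[hit["qseqid"]].append(hit)
    return new_queries


if __name__ == "__main__":
    text = "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t350\n"
    store = {}
    print(add_hits(store, read_blast_tabular(io.StringIO(text))))
```

Calling add_hits again with a second result file extends the same store, which is one simple way to add new query sequences incrementally without reloading earlier results.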
Usability
The degree of usability of the tools varies, and depends on their overall
novelty and on how frequently they have undergone user testing.
The UCSC Genome Browser (Kent, et al., 2002) is a popular community-based
resource, receiving over 600,000 requests per day
(http://genome.ucsc.edu/admin/stats/, accessed 28/05/12). Cgb takes this widely
used tool and brings it to a new audience, such as biologists studying
non-reference genomes. Usability of the UCSC Genome Browser itself is handled by
support staff at UCSC, who collect and assess comments from all its users
(http://genome.ucsc.edu/contacts.html). With respect to cgb, user feedback has
largely come from the analysts who prepare the custom genome browser instances.
Following their recommendations, I will automate more downstream analyses,
such as gene prediction. Furthermore, I will streamline the browser creation
process by removing superfluous error messages and detecting incorrect program
inputs.
ContiGo’s interface is unique and was initially developed to meet the needs of the
various genome projects reported in this thesis. As a result, of the three
applications developed in this context, it would benefit most from usability
testing. Multiple strategies could be used to test the usability of contiGo. One
approach would test the program using methods similar to previously published
usability studies of bioinformatics software (Bolchini et al., 2009). This would
require building a usage scenario within which a user performs a specific task and
the observer watches and takes notes. Results from this type of study would allow
me to determine which parts of the interface need refinement or what features are
missing. A second approach would be to discover new use cases beyond those
I encountered during my PhD studies. As in a previous study (Stevens et al.,
2001), we could create a questionnaire that classifies the tasks that biologists
complete during genome assembly analysis. Alternatively, as in the study
conducted by Bartlett and Toms (2005), we could interview biologists with the
goal of modelling the process they use to analyze a genome assembly. Any of
these approaches would help make contiGo a more useful tool.
During a previous internship at Microsoft Research, I had the opportunity to
collaborate with Bob Silverstein and Xiaoji Chen from the EPX User Experience
group, with whom I surveyed academic researchers in the Seattle area concerning
BL!P’s usage scenarios. These surveys were similar to those conducted by Bartlett
and Toms (2005): we used semi-structured interviews to ask a series of
questions concerning each researcher’s project, and grouped the responses into
categories from which a set of actions for improving BL!P was devised. Results
from these sessions were instrumental in BL!P’s development. For instance, the
program was originally designed as a gene-centric analysis tool. Only protein
level alignments were supported and the image that represents each BLAST result
was pre-determined and focused on presenting information pertaining to gene
function. When users were confronted with this scenario, they saw little need for
it in their daily work, but remained impressed by the automation of BLAST and
the data exploration capabilities of Pivot. As a result, I generalized BL!P to
support multiple BLAST algorithms and to create a custom image layout for the
BLAST results.
Usability is an important aspect of any bioinformatics tool. A recent survey
conducted by Bartlett et al. (2012) found that laboratory-focused users (e.g.
biologists) prefer tools that are easy to use and install, and perform the types of
analyses they want. The work in this thesis has striven to satisfy
these preferences: tools were intended for use by the biologist, and inspiration
was drawn from actual genome projects. Performing even small usability studies
with a few users would allow these tools to match biologists’ preferences more
closely.
Scalability
In 2006, when I started my PhD, the throughput and cost of genome sequencing
on MPS platforms geared it towards the study of microbial genomes. Further
reductions in cost and increases in throughput have enabled the study of larger
genomes, requiring an assessment of the scalability of the programs developed in
this thesis.
Cgb relies on the UCSC Genome Browser, a tool that was developed to browse
mammalian-sized genomes. Because of this, cgb scales easily to larger datasets.
As mentioned previously, it has been used to house multiple fungal genomes,
each being tens of Mb in size, and the vervet monkey genome assembly, which is
about four gigabases.
To date, contiGo has been tested on assemblies of up to tens of Mb in size. Larger
assemblies will require some performance tuning, such as refactoring the
algorithm for drawing the read pileup or, should the need arise, rewriting it in a
higher-performance programming language such as C or Java. Also, because
contiGo stores read pileups as images on disk, disk usage could be reduced by
using fewer pixels per base. For example, instead of drawing each nucleotide as
a character image, each nucleotide could be represented by a smaller
color-coded glyph (e.g., A is red, C is yellow, etc.).
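The potential saving can be sketched in a few lines of Python. This is an illustrative example rather than contiGo's drawing code: the colors for A and C follow the example above, whereas the colors for G and T, the gray fallback, and the 8 x 12 pixel character size used for comparison are assumptions.

```python
# Colors for A and C follow the example in the text; the colors for G and T
# and the fallback color are assumptions made for this sketch.
BASE_COLORS = {
    "A": (255, 0, 0),      # red
    "C": (255, 255, 0),    # yellow
    "G": (0, 128, 0),      # green (assumed)
    "T": (0, 0, 255),      # blue (assumed)
}
OTHER_COLOR = (128, 128, 128)  # gray for N, gaps, and so on (assumed)


def pileup_to_pixels(reads):
    """Convert aligned read strings into rows of RGB tuples, one pixel per base."""
    return [
        [BASE_COLORS.get(base.upper(), OTHER_COLOR) for base in read]
        for read in reads
    ]


def pixel_count(reads, glyph_width=1, glyph_height=1):
    """Total number of pixels needed to draw every base at a given glyph size."""
    return sum(len(read) for read in reads) * glyph_width * glyph_height


if __name__ == "__main__":
    reads = ["ACGT", "AC-T"]
    print(pileup_to_pixels(reads)[0])
    print(pixel_count(reads, glyph_width=8, glyph_height=12))  # character images
    print(pixel_count(reads))                                  # one pixel per base
```

Under these assumed glyph sizes, the uncompressed pixel count falls by a factor of 96; the saving on disk would also depend on the image format and its compression.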
The largest dataset used with BL!P has been the complete set of 4,376 proteins
from E. coli isolate ECD-227 (Forgetta, et al., 2012). To process larger datasets,
scalability of the alignment procedure and the visualization in Pivot will need to
be addressed. Support for local BLAST would help alleviate the alignment
bottleneck for users with sufficient resources, whereas a cloud-based solution,
such as AzureBLAST (http://research.microsoft.com/en-us/projects/ncbi-blast/default.aspx),
would provide high performance computing to those without
the necessary computing resources. Faster sequence alignment algorithms (Edgar,
2010) are another possible avenue for improving speed during the alignment step.
The performance of the Pivot visualization will be improved by loading subsets of
the results on-demand. For example, multiple hits per query sequence could be
grouped and only expanded when selected.
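The grouping idea can be sketched independently of Pivot. The Python class below is a hypothetical illustration and does not use the Pivot API: hits are grouped by query, only the best-scoring hit of each group is visible by default, and the remaining hits appear once a query has been expanded.

```python
from collections import defaultdict


class GroupedHits:
    """Hold BLAST hits grouped by query; show one hit per query until expanded."""

    def __init__(self, hits):
        # hits: iterable of (query_id, subject_id, bitscore) tuples
        self._groups = defaultdict(list)
        for query_id, subject_id, bitscore in hits:
            self._groups[query_id].append((subject_id, bitscore))
        # Sort each group so that the best-scoring hit comes first.
        for group in self._groups.values():
            group.sort(key=lambda hit: hit[1], reverse=True)
        self._expanded = set()

    def expand(self, query_id):
        """Mark a query as selected so that all of its hits become visible."""
        self._expanded.add(query_id)

    def visible_items(self):
        """Return (query, subject, bitscore, group_size) for the visible hits."""
        items = []
        for query_id, group in self._groups.items():
            shown = group if query_id in self._expanded else group[:1]
            for subject_id, bitscore in shown:
                items.append((query_id, subject_id, bitscore, len(group)))
        return items


if __name__ == "__main__":
    grouped = GroupedHits([("q1", "s1", 50.0), ("q1", "s2", 90.0), ("q2", "s3", 10.0)])
    print(len(grouped.visible_items()))  # collapsed view
    grouped.expand("q1")
    print(len(grouped.visible_items()))  # q1 expanded
```

In a collapsed view the number of displayed items equals the number of queries rather than the number of hits, which is where the reduction in visualization load would come from.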
As genomic datasets increase in size, the scalability of existing applications will
continue to be an important problem. Existing programs can be modified to
address this concern, as was the case with the Artemis genome browser (Carver et
al., 2012), or new programs can be created, such as the Integrative Genomics
Viewer (Robinson, et al., 2011).
Concluding Remarks
Genomics continues to be applied to other areas of science and industry. For
instance, in the field of medicine, personalized genomics promises to
revolutionize the healthcare system, where genetic testing is used to catalogue
genetic variants associated with disease or adverse drug effects for individual
patients. Personalized genomics is currently offered as a service by companies
such as 23andMe and deCODE genetics, which use SNP genotyping platforms,
such as Illumina bead arrays
(http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechBeadArray.shtml),
to assay millions of genetic variants. The
genotypes from these genetic variants are then compared to those in published
literature and associations to disease risk are assessed. Compared to the size of the
human genome, the resolution of the SNP genotyping assay is relatively low (1
SNP every few thousand nucleotides) and typically considers only variants that
are common in the population. Full human genome sequencing would provide a
more complete catalogue of an individual’s genotypes, including rare variants, but
due to higher cost, has been used in only a few individuals (Bentley, et al., 2008;
Ley et al., 2008; Wheeler et al., 2008) or within the context of large studies, such
as the 1000 Genomes Project Consortium (2010). However, the cost of whole
genome sequencing continues to decrease (Wetterstrand, 2012), and it may soon
become cost-effective for the general population. Anticipating this, vendors of
current MPS technologies are commercializing full human genome sequencing
(e.g., Complete Genomics), and companies such as Knome and HelloGenome are
starting to offer full-genome sequencing and interpretation of results. It
is therefore foreseeable that personalized genomics will become commonplace, and
will create yet another challenge for bioinformatics: to process and interpret
these data in meaningful ways. For instance, how do medical professionals such as
physicians or genetic counselors present such complex datasets to patients,
particularly when either party may have limited computer skills or knowledge of
biology or genetics?
How do we process these complex data sets into simple yet informative
abstractions, and offer a level of interactivity that facilitates communication
between the medical professional and the patient?
The paradigm used in this thesis can also be applied to address this potential gap
between patients or clinicians and genomic data. By basing software development
on multiple real-world personal genomics experiments, as well as targeting usage
to particular audiences, we can encapsulate sophisticated bioinformatics processes
into tools intended for common use, with the ultimate goal of allowing as many
people as possible to benefit from the bioinformatician’s expertise and abilities.
REFERENCES
Aarestrup, F. M., Seyfarth, A. M., Emborg, H. D., Pedersen, K., Hendriksen, R.
S., & Bager, F. (2001). Effect of abolishment of the use of antimicrobial agents
for growth promotion on occurrence of antimicrobial resistance in fecal
enterococci from food animals in Denmark. Antimicrobial Agents and
Chemotherapy, 45(7), 2054-2059.
Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D.,
Amanatides, P. G., et al. (2000). The genome sequence of Drosophila
melanogaster. Science, 287(5461), 2185-2195.
al-Barrak, A., Embil, J., Dyck, B., Olekson, K., Nicoll, D., Alfa, M., et al. (1999).
An outbreak of toxin A negative, toxin B positive Clostridium difficile-associated
diarrhea in a Canadian tertiary-care hospital. Can Commun Dis Rep, 25(7), 65-69.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990).
Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.,
et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res, 25(17), 3389-3402.
Anderson, S., Bankier, A. T., Barrell, B. G., de Bruijn, M. H., Coulson, A. R.,
Drouin, J., et al. (1981). Sequence and organization of the human mitochondrial
genome. Nature, 290(5806), 457-465.
Avgustin, J. A., & Grabnar, M. (2007). Sequence analysis of the plasmid pColG
from the Escherichia coli strain CA46. Plasmid, 57(1), 89-93.
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J., & Eichler, E. E. (2001).
Segmental duplications: organization and impact within the current human
genome project assembly. Genome Res, 11(6), 1005-1017.
Bain, M. S., & Green, C. C. (1999). Isolation of Escherichia fergusonii in cases
clinically suggestive of salmonellosis. Vet Rec, 144(18), 511.
Balzer, S., Malde, K., Lanzen, A., Sharma, A., & Jonassen, I. (2010).
Characteristics of 454 pyrosequencing data--enabling realistic simulation with
flowsim. Bioinformatics, 26(18), i420-425.
Bangert, R. L., Ward, A. C., Stauber, E. H., Cho, B. R., & Widders, P. R. (1988).
A survey of the aerobic bacteria in the feces of captive raptors. Avian Dis, 32(1),
53-62.
Bao, H., Guo, H., Wang, J., Zhou, R., Lu, X., & Shi, S. (2009). MapView:
visualization of short reads alignment on a desktop computer. Bioinformatics,
25(12), 1554-1555.
Baranova, N., & Nikaido, H. (2002). The baeSR two-component regulatory
system activates transcription of the yegMNOB (mdtABCD) transporter gene
cluster in Escherichia coli and increases its resistance to novobiocin and
deoxycholate. J Bacteriol, 184(15), 4168-4176.
Barbut, F., Braun, M., Burghoffer, B., Lalande, V., & Eckert, C. (2009). Rapid
detection of toxigenic strains of Clostridium difficile in diarrheal stools by real-
time PCR. J Clin Microbiol, 47(4), 1276-1277.
Barbut, F., Corthier, G., Charpak, Y., Cerf, M., Monteil, H., Fosse, T., et al.
(1996). Prevalence and pathogenicity of Clostridium difficile in hospitalized
patients. A French multicenter study. Arch Intern Med, 156(13), 1449-1454.
Bartlett, J. C., Ishimura, Y., & Kloda, L. A. (2012). Scientists’ Preferences for
Bioinformatics Tools: The Task-based Selection of Information Retrieval Systems.
Paper presented at the Fourth Information Interaction in Context conference (IIiX
2012), Nijmegen, the Netherlands.
Bartlett, J. C., & Toms, E. G. (2005). Developing a protocol for bioinformatics
analysis: An integrated information behavior and task analysis approach. Journal
of the American Society for Information Science and Technology, 56(5), 469-482.
Bates, M. R., Buck, K. W., & Brasier, C. M. (1993). Molecular relationships of
the mitochondrial DNA of Ophiostoma ulmi and the NAN and EAN races of O.
novo-ulmi determined by restriction fragment length polymorphisms. Mycological
Research, 97(9), 1093-1100.
Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S.,
et al. (2004). Ultraconserved elements in the human genome. Science, 304(5675),
1321-1325.
Benson, D. A., Boguski, M. S., Lipman, D. J., & Ostell, J. (1997). GenBank.
Nucleic Acids Res, 25(1), 1-6.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W.
(2011). GenBank. Nucleic Acids Res, 39(Database issue), D32-37.
Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J.,
Brown, C. G., et al. (2008). Accurate whole human genome sequencing using
reversible terminator chemistry. Nature, 456(7218), 53-59.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et
al. (2000). The Protein Data Bank. Nucleic Acids Res, 28(1), 235-242.
Bilofsky, H. S., Burks, C., Fickett, J. W., Goad, W. B., Lewitter, F. I., Rindone,
W. P., et al. (1986). The GenBank genetic sequence databank. Nucleic acids
research, 14(1), 1-4.
Blanchette, M., Kent, W. J., Riemer, C., Elnitski, L., Smit, A. F., Roskin, K. M.,
et al. (2004). Aligning multiple genomic sequences with the threaded blockset
aligner. Genome Res, 14(4), 708-715.
Blattner, F. R., Plunkett, G., 3rd, Bloch, C. A., Perna, N. T., Burland, V., Riley,
M., et al. (1997). The complete genome sequence of Escherichia coli K-12.
Science, 277(5331), 1453-1462.
Bobadilla, J. L., Macek, M., Jr., Fine, J. P., & Farrell, P. M. (2002). Cystic
fibrosis: a worldwide analysis of CFTR mutations--correlation with incidence
data and application to screening. Human mutation, 19(6), 575-606.
Bolchini, D., Finkelstein, A., Perrone, V., & Nagl, S. (2009). Better
bioinformatics through usability analysis. Bioinformatics, 25(3), 406-412.
Bonfield, J. K., & Whitwham, A. (2010). Gap5--editing the billion fragment
sequence assembly. Bioinformatics, 26(14), 1699-1703.
Bracho, M. A., Moya, A., & Barrio, E. (1998). Contribution of Taq polymerase-
induced errors to the estimation of RNA virus diversity. The Journal of general
virology, 79(Pt 12), 2921-2928.
Brasier, C. M. (1996). Low Genetic Diversity of the Ophiostoma novo-ulmi
Population in North America. Mycologia, 88(6), 951-964.
Brasier, C. M., & Kirk, S. A. (2001). Designation of the EAN and NAN races of
Ophiostoma novo-ulmi as subspecies. Mycological Research, 105(05), 547-554.
Bruant, G., Maynard, C., Bekal, S., Gaucher, I., Masson, L., Brousseau, R., et al.
(2006). Development and validation of an oligonucleotide microarray for
detection of multiple virulence and antimicrobial resistance genes in Escherichia
coli. Appl Environ Microbiol, 72(5), 3780-3784.
Campbell, P. J., Pleasance, E. D., Stephens, P. J., Dicks, E., Rance, R., Goodhead,
I., et al. (2008). Subclonal phylogenetic structures in cancer revealed by ultra-
deep sequencing. Proceedings of the National Academy of Sciences of the United
States of America, 105(35), 13081-13086.
Carver, T., Harris, S. R., Berriman, M., Parkhill, J., & McQuillan, J. A. (2012).
Artemis: an integrated platform for visualization and analysis of high-throughput
sequence-based experimental data. Bioinformatics, 28(4), 464-469.
Catanho, M., Mascarenhas, D., Degrave, W., & de Miranda, A. B. (2006).
BioParser: a tool for processing of sequence similarity analysis reports. Appl
Bioinformatics, 5(1), 49-53.
CCAC. (1993). Guide to the Care and Use of Experimental Animals (2nd ed.).
CCAC, Ottawa, ON: Canadian Council on Animal Care.
Cheley, S., Xie, H., & Bayley, H. (2006). A genetically encoded pore for the
stochastic detection of a protein kinase. Chembiochem, 7(12), 1923-1927.
Chen, S. L., Hung, C. S., Xu, J., Reigstad, C. S., Magrini, V., Sabo, A., et al.
(2006). Identification of genes subject to positive selection in uropathogenic
strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sci
U S A, 103(15), 5977-5982.
Chou, H. H., & Holmes, M. H. (2001). DNA sequence quality trimming and
vector removal. Bioinformatics, 17(12), 1093-1104.
Chueh, A. C., Northrop, E. L., Brettingham-Moore, K. H., Choo, K. H., & Wong,
L. H. (2009). LINE retrotransposon RNA is an essential structural and functional
epigenetic component of a core neocentromeric chromatin. PLoS genetics, 5(1),
e1000354.
CLSI. (2008). Performance standards for antimicrobial disk and dilution
susceptibility tests for bacteria isolated from animals: approved standard - Third
Edition. CLSI document M31-A3 (ISBN 1-56238-659-X).
1000 Genomes Project Consortium. (2010). A map of human genome variation from
population-scale sequencing. Nature, 467(7319), 1061-1073.
Curry, S. R., Marsh, J. W., Muto, C. A., O'Leary, M. M., Pasculle, A. W., &
Harrison, L. H. (2007). tcdC genotypes associated with severe TcdC truncation in
an epidemic clone and other strains of Clostridium difficile. J Clin Microbiol,
45(1), 215-221.
Dai, J., Chen, Y., Dean, S., Morris, J. G., Salfinger, M., & Johnson, J. A. (2011).
Multiple-genome comparison reveals new Loci for mycobacterium species
identification. J Clin Microbiol, 49(1), 144-153.
Darzentas, N. (2010). Circoletto: visualizing sequence similarity with Circos.
Bioinformatics, 26(20), 2620-2621.
Delcher, A. L., Bratke, K. A., Powers, E. C., & Salzberg, S. L. (2007). Identifying
bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics, 23(6),
673-679.
Delcher, A. L., Harmon, D., Kasif, S., White, O., & Salzberg, S. L. (1999).
Improved microbial gene identification with GLIMMER. Nucleic acids research,
27(23), 4636-4641.
Dewar, K., Bousquet, J., Dufour, J., & Bernier, L. (1997). A meiotically
reproducible chromosome length polymorphism in the ascomycete fungus
Ophiostoma ulmi (sensu lato). Molecular & general genetics: MGG, 255(1),
38-44.
Dhalluin, A., Lemee, L., Pestel-Caron, M., Mory, F., Leluan, G., Lemeland, J. F.,
et al. (2003). Genotypic differentiation of twelve Clostridium species by
polymorphism analysis of the triosephosphate isomerase (tpi) gene. Syst Appl
Microbiol, 26(1), 90-96.
Dial, S., Alrasadi, K., Manoukian, C., Huang, A., & Menzies, D. (2004). Risk of
Clostridium difficile diarrhea among hospital inpatients prescribed proton pump
inhibitors: cohort and case-control studies. CMAJ, 171(1), 33-38.
Diarrassouba, F., Diarra, M. S., Bach, S., Delaquis, P., Pritchard, J., Topp, E., et
al. (2007). Antibiotic resistance and virulence genes in commensal Escherichia
coli and Salmonella isolates from commercial broiler chicken farms. J Food Prot,
70(6), 1316-1327.
Diguistini, S., Liao, N. Y., Platt, D., Robertson, G., Seidel, M., Chan, S. K., et al.
(2009). De novo genome sequence assembly of a filamentous fungus using
Sanger, 454 and Illumina sequence data. Genome biology, 10(9), R94.
DiGuistini, S., Wang, Y., Liao, N. Y., Taylor, G., Tanguay, P., Feau, N., et al.
(2011). Genome and transcriptome analyses of the mountain pine beetle-fungal
symbiont Grosmannia clavigera, a lodgepole pine pathogen. Proceedings of the
National Academy of Sciences of the United States of America, 108(6), 2504-
2509.
Drudy, D., Kyne, L., O'Mahony, R., & Fanning, S. (2007). gyrA mutations in
fluoroquinolone-resistant Clostridium difficile PCR-027. Emerging Infectious
Diseases, 13, 504-505.
Dunning, A. M., Talmud, P., & Humphries, S. E. (1988). Errors in the polymerase
chain reaction. Nucleic Acids Research, 16(21), 10393.
Eastwood, K., Else, P., Charlett, A., & Wilcox, M. (2009). Comparison of nine
commercially available Clostridium difficile toxin detection assays, a real-time
PCR assay for C. difficile tcdB, and a glutamate dehydrogenase detection assay to
cytotoxin testing and cytotoxigenic culture methods. J Clin Microbiol, 47(10),
3211-3217.
Edgar, R. C. (2010). Search and clustering orders of magnitude faster than
BLAST. Bioinformatics, 26(19), 2460-2461.
Eichler, E. E. (2001). Segmental duplications: what's missing, misassigned, and
misassembled--and should we care? Genome Res, 11(5), 653-656.
Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., et al. (2009). Real-time
DNA sequencing from single polymerase molecules. Science, 323(5910), 133-
138.
Ennis, P. D., Zemmour, J., Salter, R. D., & Parham, P. (1990). Rapid cloning of
HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of
errors produced in amplification. Proceedings of the National Academy of
Sciences of the United States of America, 87(7), 2833-2837.
Et-Touil, A., Brasier, C. M., & Bernier, L. (1999). Localization of a Pathogenicity
Gene in Ophiostoma novo-ulmi and Evidence That It May Be Introgressed from
O. ulmi. Molecular Plant-Microbe Interactions, 12(1), 6-15.
Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using
phred. II. Error probabilities. Genome research, 8(3), 186-194.
Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of
automated sequencer traces using phred. I. Accuracy assessment. Genome
research, 8(3), 175-185.
Farmer, J. J., 3rd, Fanning, G. R., Davis, B. R., O'Hara, C. M., Riddle, C.,
Hickman-Brenner, F. W., et al. (1985). Escherichia fergusonii and Enterobacter
taylorae, two new species of Enterobacteriaceae isolated from clinical
specimens. J Clin Microbiol, 21(1), 77-81.
Feng, Y., Yang, W., Ryan, U., Zhang, L., Kvac, M., Koudela, B., et al. (2011).
Development of a Multilocus Sequence Tool for Typing Cryptosporidium muris
and Cryptosporidium andersoni. J Clin Microbiol, 49(1), 34-41.
Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees.
Science, 155(3760), 279-284.
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F.,
Kerlavage, A. R., et al. (1995). Whole-genome random sequencing and assembly
of Haemophilus influenzae Rd. Science, 269(5223), 496-512.
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Brent, S., Chen, Y., et al. (2011).
Ensembl 2011. Nucleic acids research, 39(Database issue), D800-806.
Flusberg, B. A., Webster, D. R., Lee, J. H., Travers, K. J., Olivares, E. C., Clark,
T. A., et al. (2010). Direct detection of DNA methylation during single-molecule,
real-time sequencing. Nature methods, 7(6), 461-465.
Forgetta, V., & Dewar, K. (2005). Genome Project Cost Calculator, from
http://genomequebec.mcgill.ca/compgen/cgc.html
Forgetta, V., Oughton, M. T., Marquis, P., Brukner, I., Blanchette, R., Haub, K.,
et al. (2011). Fourteen-genome comparison identifies DNA markers for severe-
disease-associated strains of Clostridium difficile. Journal of clinical
microbiology, 49(6), 2230-2238.
Forgetta, V., Rempel, H., Malouin, F., Vaillancourt, R., Jr., Topp, E., Dewar, K.,
et al. (2012). Pathogenic and multidrug-resistant Escherichia fergusonii from
broiler chicken. Poultry science, 91(2), 512-525.
Fricke, W. F., McDermott, P. F., Mammel, M. K., Zhao, S., Johnson, T. J., Rasko,
D. A., et al. (2009). Antimicrobial resistance-conferring plasmids with similarity
to virulence plasmids from avian pathogenic Escherichia coli strains in
Salmonella enterica serovar Kentucky isolates from poultry. Appl Environ
Microbiol, 75(18), 5963-5971.
Fu, Y. H., Kuhl, D. P., Pizzuti, A., Pieretti, M., Sutcliffe, J. S., Richards, S., et al.
(1991). Variation of the CGG repeat at the fragile X site results in genetic
instability: resolution of the Sherman paradox. Cell, 67(6), 1047-1058.
Funke, G., Hany, A., & Altwegg, M. (1993). Isolation of Escherichia fergusonii
from four different sites in a patient with pancreatic carcinoma and
cholangiosepsis. J Clin Microbiol, 31(8), 2201-2203.
Garcia Pelayo, M. C., Uplekar, S., Keniry, A., Mendoza Lopez, P., Garnier, T.,
Nunez Garcia, J., et al. (2009). A comprehensive survey of single nucleotide
polymorphisms (SNPs) across Mycobacterium bovis strains and M. bovis BCG
vaccine strains refines the genealogy and defines a minimal set of SNPs that
separate virulent M. bovis strains and M. bovis BCG strains. Infection &
Immunity, 77(5), 2230-2238.
Gilca, R., Fortin, E., Hubert, B., et al. (2008). Surveillance des diarrhées
associées à Clostridium difficile au Québec : bilan du 22 août 2004 au 18 août
2007. Retrieved October 5, 2010, from
http://www.inspq.qc.ca/pdf/publications/745_Cdifficile_bilan2004-2007.pdf
Gilles, A., Meglecz, E., Pech, N., Ferreira, S., Malausa, T., & Martin, J. F. (2011).
Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC
Genomics, 12, 245.
Gollapudi, R., Revanna, K. V., Hemmerich, C., Schaack, S., & Dong, Q. (2008).
BOV--a web-based BLAST output visualization tool. BMC Genomics, 9, 414.
Goorhuis, A., Bakker, D., Corver, J., Debast, S. B., Harmanus, C., Notermans, D.
W., et al. (2008). Emergence of Clostridium difficile Infection Due to a New
Hypervirulent Strain, Polymerase Chain Reaction Ribotype 078. Clinical
Infectious Diseases, 47(9), 1162-1170.
Gordon, D., Abajian, C., & Green, P. (1998). Consed: a graphical tool for
sequence finishing. Genome Res, 8(3), 195-202.
Guillaume, B., Montpetit, A., Forgetta, V., Leveque, G., Attiya, S., Dias, J., et al.
(2009). Accurate Genome Assembly Using Long Insert Libraries and
Next-generation Sequencing. Genome Quebec Technical Application Note.
Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., & Kasprzyk, A. (2009).
BioMart Central Portal--unified access to biological data. Nucleic acids research,
37(Web Server issue), W23-27.
Hamm, G. H., & Cameron, G. N. (1986). The EMBL data library. Nucleic acids
research, 14(1), 5-9.
Hariharan, H., Lopez, A., Conboy, G., Coles, M., & Muirhead, T. (2007).
Isolation of Escherichia fergusonii from the feces and internal organs of a goat
with diarrhea. Can Vet J, 48(6), 630-631.
Harris, R. S. (2007). Improved pairwise alignment of genomic DNA. Pennsylvania
State University.
He, J., Dai, X., & Zhao, X. (2007). PLAN: a web platform for automating high-
throughput BLAST searches and for managing and mining results. BMC
Bioinformatics, 8, 53.
He, M., Sebaihia, M., Lawley, T. D., Stabler, R. A., Dawson, L. F., Martin, M. J.,
et al. (2010). Evolutionary dynamics of Clostridium difficile over short and long
time scales. Proc Natl Acad Sci U S A, 107(16), 7527-7532.
Herraez, P., Rodriguez, A. F., Espinosa de los Monteros, A., Acosta, A. B., Jaber,
J. R., Castellano, J., et al. (2005). Fibrino-necrotic typhlitis caused by Escherichia
fergusonii in ostriches (Struthio camelus). Avian Dis, 49(1), 167-169.
Hintz, W., Pinchback, M., de la Bastide, P., Burgess, S., Jacobi, V., Hamelin, R.,
et al. (2011). Functional categorization of unique expressed sequence tags
obtained from the yeast-like growth phase of the elm pathogen Ophiostoma novo-
ulmi. BMC Genomics, 12, 431.
Hoff, K. J. (2009). The effect of sequencing errors on metagenomic gene
prediction. BMC Genomics, 10, 520.
Hopkins, K. L., Davies, R. H., & Threlfall, E. J. (2005). Mechanisms of quinolone
resistance in Escherichia coli and Salmonella: recent developments. Int J
Antimicrob Agents, 25(5), 358-373.
Hou, H., Zhao, F., Zhou, L., Zhu, E., Teng, H., Li, X., et al. (2010).
MagicViewer: integrated solution for next-generation sequencing data
visualization and genetic variation detection and annotation. Nucleic acids
research, 38(Web Server issue), W732-736.
Howorka, S., Nam, J., Bayley, H., & Kahne, D. (2004). Stochastic detection of
monovalent and bivalent protein-ligand interactions. Angew Chem Int Ed Engl,
43(7), 842-846.
Huang, W., & Marth, G. (2008). EagleView: a genome assembly viewer for next-
generation sequencing technologies. Genome Res, 18(9), 1538-1543.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., et al.
(2002). The Ensembl genome database project. Nucleic acids research, 30(1), 38-
41.
Hubert, B., Loo, V. G., Bourgault, A.-M., Poirier, L., Dascal, A., Fortin, E., et al.
(2007). A Portrait of the Geographic Dissemination of the Clostridium difficile
North American Pulsed-Field Type 1 Strain and the Epidemiology of C. difficile-
Associated Disease in Quebec. Clin Infect Dis, 44(2), 238-244.
Hubisz, M. J., Lin, M. F., Kellis, M., & Siepel, A. (2011). Error and error
mitigation in low-coverage genome assemblies. PLoS One, 6(2), e17034.
Janvilisri, T., Scaria, J., Thompson, A. D., Nicholson, A., Limbago, B. M.,
Arroyo, L. G., et al. (2009). Microarray identification of Clostridium difficile core
components and divergent regions associated with host origin. Journal of
Bacteriology, 191(12), 3881-3891.
Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide
mapping of in vivo protein-DNA interactions. Science, 316(5830), 1497-1502.
Johnson, S., & Gerding, D. N. (1998). Clostridium difficile-associated diarrhea.
Clin Infect Dis, 26, 1027-1034.
Johnson, T. J., Johnson, S. J., & Nolan, L. K. (2006). Complete DNA sequence of
a ColBM plasmid from avian pathogenic Escherichia coli suggests that it evolved
from closely related ColV virulence plasmids. J Bacteriol, 188(16), 5975-5983.
Johnson, T. J., Kariyawasam, S., Wannemuehler, Y., Mangiamele, P., Johnson, S.
J., Doetkott, C., et al. (2007). The genome sequence of avian pathogenic
Escherichia coli strain O1:K1:H7 shares strong similarities with human
extraintestinal pathogenic E. coli genomes. J Bacteriol, 189(8), 3228-3236.
Johnson, T. J., Siek, K. E., Johnson, S. J., & Nolan, L. K. (2006). DNA sequence
of a ColV plasmid and prevalence of selected plasmid-encoded virulence genes
among avian Escherichia coli strains. J Bacteriol, 188(2), 745-758.
Johnson, T. J., Thorsness, J. L., Anderson, C. P., Lynne, A. M., Foley, S. L., Han,
J., et al. (2010). Horizontal gene transfer of a ColV plasmid has resulted in a
dominant avian clonal type of Salmonella enterica serovar Kentucky. PLoS One,
5(12), e15524.
Jordan, S., Hutchings, M. I., & Mascher, T. (2008). Cell envelope stress response
in Gram-positive bacteria. FEMS Microbiol Rev, 32(1), 107-146.
Joseph, P., Fichant, G., Quentin, Y., & Denizot, F. (2002). Regulatory
relationship of two-component and ABC transport systems and clustering of their
genes in the Bacillus/Clostridium group, suggest a functional link between them.
J Mol Microbiol Biotechnol, 4(5), 503-513.
Karolchik, D., Hinrichs, A. S., Furey, T. S., Roskin, K. M., Sugnet, C. W.,
Haussler, D., et al. (2004). The UCSC Table Browser data retrieval tool. Nucleic
Acids Res, 32(Database issue), D493-496.
Kato, N. (2000). Genome of human hepatitis C virus (HCV): gene organization,
sequence diversity, and variation. Microbial & comparative genomics, 5(3), 129-
151.
Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method
for rapid multiple sequence alignment based on fast Fourier transform. Nucleic
Acids Res, 30(14), 3059-3066.
Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Res, 12(4),
656-664.
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A.
M., et al. (2002). The human genome browser at UCSC. Genome Res, 12(6), 996-
1006.
Killgore, G., Thompson, A., Johnson, S., Brazier, J., Kuijper, E., Pepin, J., et al.
(2008). Comparison of seven techniques for typing international epidemic strains
of Clostridium difficile: restriction endonuclease analysis, pulsed-field gel
electrophoresis, PCR-ribotyping, multilocus sequence typing, multilocus variable-
number tandem-repeat analysis, amplified fragment length polymorphism, and
surface layer protein A gene sequence typing. J Clin Microbiol, 46(2), 431-437.
Kimura, M. (1969). The rate of molecular evolution considered from the
standpoint of population genetics. Proceedings of the National Academy of
Sciences of the United States of America, 63(4), 1181-1188.
Korshunova, Y., Maloney, R. K., Lakey, N., Citek, R. W., Bacher, B., Budiman,
A., et al. (2008). Massively parallel bisulphite pyrosequencing reveals the
molecular complexity of breast cancer-associated cytosine-methylation patterns
obtained from tissue and serum DNA. Genome research, 18(1), 19-29.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., et
al. (2009). Circos: an information aesthetic for comparative genomics. Genome
research, 19(9), 1639-1645.
Kuijper, E. J., Coignard, B., & Tull, P. (2006). Emergence of Clostridium
difficile-associated disease in North America and Europe. Clinical Microbiology
and Infection, 12(s6), 2-18.
Kulasekara, B. R., Jacobs, M., Zhou, Y., Wu, Z., Sims, E., Saenphimmachak, C.,
et al. (2009). Analysis of the genome of the Escherichia coli O157:H7 2006
spinach-associated outbreak isolate indicates candidate genes that may enhance
virulence. Infect Immun, 77(9), 3713-3721.
Kuroda, M., Serizawa, M., Okutani, A., Sekizuka, T., Banno, S., & Inoue, S.
(2010). Genome-wide single nucleotide polymorphism typing method for
identification of Bacillus anthracis species and strains among B. cereus group
species. J Clin Microbiol, 48(8), 2821-2829.
Kyne, L., Hamel, M. B., Polavaram, R., & Kelly, C. P. (2002). Health care costs
and mortality associated with nosocomial diarrhea due to Clostridium difficile.
Clin Infect Dis, 34, 346-353.
Kyne, L., Warny, M., Qamar, A., & Kelly, C. P. (2000). Asymptomatic Carriage
of Clostridium difficile and Serum Levels of IgG Antibody against Toxin A. N
Engl J Med, 342(6), 390-397.
Kyne, L., Warny, M., Qamar, A., & Kelly, C. P. (2001). Association between
antibody response to toxin A and protection against recurrent Clostridium difficile
diarrhoea. Lancet, 357, 189-193.
LaBoissière, S., Forgetta, V., Blanchette, R., Kriazhev, L., Roy, L., Boismenu, D.,
et al. (2005). Mass Spectrometry Based Approach for Gene Annotation of
Bacterial Genomes—A Case Study on Cell-Wall Surface Proteins From C.
difficile. Genome Quebec Technical Application Note.
Lagace-Wiens, P. R., Baudry, P. J., Pang, P., & Hammond, G. (2010). First
description of an extended-spectrum-beta-lactamase-producing multidrug-
resistant Escherichia fergusonii strain in a patient with cystitis. J Clin Microbiol,
48(6), 2301-2302.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J.,
et al. (2001). Initial sequencing and analysis of the human genome. Nature,
409(6822), 860-921.
Lee, B., & Richards, F. M. (1971). The interpretation of protein structures:
estimation of static accessibility. Journal of molecular biology, 55(3), 379-400.
Lefebvre, B., Diarra, M. S., Moisan, H., & Malouin, F. (2008). Detection of
virulence-associated genes in Escherichia coli O157 and non-O157 isolates from
beef cattle, humans, and chickens. J Food Prot, 71(9), 1774-1784.
Lefebvre, B., Gattuso, M., Moisan, H., Malouin, F., & Diarra, M. S. (2009).
Genotype comparison of sorbitol-negative Escherichia coli isolates from healthy
broiler chickens from different commercial farms. Poult Sci, 88(7), 1474-1484.
Lewis, S. E., Searle, S. M., Harris, N., Gibson, M., Lyer, V., Richter, J., et al.
(2002). Apollo: a sequence annotation editor. Genome Biol, 3(12),
RESEARCH0082.
Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., McLellan, M. D., Chen, K., et al.
(2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia
genome. Nature, 456(7218), 66-72.
Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and
calling variants using mapping quality scores. Genome research, 18(11), 1851-
1858.
Lindblad, K., Savontaus, M. L., Stevanin, G., Holmberg, M., Digre, K., Zander,
C., et al. (1996). An expanded CAG repeat sequence in spinocerebellar ataxia
type 7. Genome research, 6(10), 965-971.
Lipman, D. J., & Pearson, W. R. (1985). Rapid and sensitive protein similarity
searches. Science, 227(4693), 1435-1441.
Loo, V. G., Poirier, L., Miller, M. A., Oughton, M., Libman, M. D., Michaud, S.,
et al. (2005). A Predominantly Clonal Multi-Institutional Outbreak of Clostridium
difficile-Associated Diarrhea with High Morbidity and Mortality. N Engl J Med,
353(23), 2442-2449.
MacCannell, D. R., Louie, T. J., Gregson, D. B., Laverdiere, M., Labbe, A.-C.,
Laing, F., et al. (2006). Molecular Analysis of Clostridium difficile PCR Ribotype
027 Isolates from Eastern and Western Canada. J. Clin. Microbiol., 44(6), 2147-
2152.
MacCollin, M., Braverman, N., Viskochil, D., Ruttledge, M., Davis, K., Ojemann,
R., et al. (1996). A point mutation associated with a severe phenotype of
neurofibromatosis 2. Annals of neurology, 40(3), 440-445.
Machado, H. E., & Renn, S. C. (2010). A critical assessment of cross-species
detection of gene duplicates using comparative genomic hybridization. BMC
Genomics, 11, 304.
Magrane, M., & Consortium, U. (2011). UniProt Knowledgebase: a hub of
integrated protein data. Database (Oxford), 2011, bar009.
Mahapatra, A., & Mahapatra, S. (2005). Escherichia fergusonii: an emerging
pathogen in South Orissa. Indian J Med Microbiol, 23(3), 204.
Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L.
A., et al. (2005). Genome sequencing in microfabricated high-density picolitre
reactors. Nature, 437(7057), 376-380.
Mazzarella, R., & Schlessinger, D. (1998). Pathological consequences of
sequence duplications in the human genome. Genome Res, 8(10), 1007-1021.
McDonald, L. C., Killgore, G. E., Thompson, A., Owens, R. C., Jr., Kazakova, S.
V., Sambol, S. P., et al. (2005). An Epidemic, Toxin Gene-Variant Strain of
Clostridium difficile. N Engl J Med, 353(23), 2433-2441.
McElroy, K. E., Luciani, F., & Thomas, T. (2012). GemSIM: general, error-model
based simulator of next-generation sequencing data. BMC Genomics, 13, 74.
McPherson, J. D. (2009). Next-generation gap. Nat Methods, 6(11 Suppl), S2-5.
Mellata, M., Ameiss, K., Mo, H., & Curtiss, R., 3rd. (2010). Characterization of
the contribution to virulence of three large plasmids of avian pathogenic
Escherichia coli chi7122 (O78:K80:H9). Infection and immunity, 78(4), 1528-
1541.
Merrigan, M., Venugopal, A., Mallozzi, M., Roxas, B., Viswanathan, V. K.,
Johnson, S., et al. (2010). Human hypervirulent Clostridium difficile strains
exhibit increased sporulation as well as robust toxin production. J Bacteriol,
192(19), 4904-4911.
Metzker, M. L. (2009). Sequencing in real time. Nat Biotechnol, 27(2), 150-151.
Millar, J. R. (2007). The relationship between use of apramycin in the poultry
industry and the detection of gentamicin resistant E. coli in processed chickens.
The Free Library.
Milne, I., Bayer, M., Cardle, L., Shaw, P., Stephen, G., Wright, F., et al. (2010).
Tablet--next generation sequence assembly visualization. Bioinformatics, 26(3),
401-402.
Morales, S. E., & Holben, W. E. (2011). Linking bacterial identities and
ecosystem processes: can 'omic' analyses be more than the sum of their parts?
FEMS Microbiol Ecol, 75(1), 2-16.
Morin, R., Bainbridge, M., Fejes, A., Hirst, M., Krzywinski, M., Pugh, T., et al.
(2008). Profiling the HeLa S3 transcriptome using randomly primed cDNA and
massively parallel short-read sequencing. BioTechniques, 45(1), 81-94.
Mulvey, M. R., Boyd, D. A., Gravel, D., Hutchinson, J., Kelly, S., McGeer, A., et
al. (2010). Hypervirulent Clostridium difficile strains in hospitalized patients,
Canada. Emerg Infect Dis, 16(4), 678-681.
Murray, R., Boyd, D., Levett, P. N., Mulvey, M. R., & Alfa, M. J. (2009).
Truncation in the tcdC region of the Clostridium difficile PathLoc of clinical
isolates does not predict increased biological activity of Toxin B or Toxin A.
BMC Infect Dis, 9, 103.
Muto, C. A., Pokrywka, M., Shutt, K., Mendelsohn, A. B., Nouri, K., Posey, K.,
et al. (2005). A Large Outbreak of Clostridium difficile-Associated Disease With
an Unexpected Proportion of Deaths and Colectomies at a Teaching Hospital
Following Increased Fluoroquinolone Use. Infect Control Hosp Epidemiol, 26(3),
273-280.
Naiser, T., Kayser, J., Mai, T., Michel, W., & Ott, A. (2008). Position dependent
mismatch discrimination on DNA microarrays - experiments and model. BMC
Bioinformatics, 9, 509.
Nakamura, K., Oshima, T., Morimoto, T., Ikeda, S., Yoshikawa, H., Shiwa, Y., et
al. (2011). Sequence-specific error profile of Illumina sequencers. Nucleic Acids
Res, 39(13), e90.
NCBI. (2011). NCBI Map Viewer. Retrieved December 20, 2011, from
http://www.ncbi.nlm.nih.gov/projects/mapview/
Ngeleka, M., Brereton, L., Brown, G., & Fairbrother, J. M. (2002). Pathotypes of
avian Escherichia coli as related to tsh-, pap-, pil-, and iuc-DNA sequences, and
antibiotic sensitivity of isolates from internal tissues and the cloacae of broilers.
Avian Dis, 46(1), 143-152.
Nikaido, H. (2009). Multidrug resistance in bacteria. Annu Rev Biochem, 78, 119-
146.
Ogura, Y., Ooka, T., Iguchi, A., Toh, H., Asadulghani, M., Oshima, K., et al.
(2009). Comparative genomics reveal the mechanism of the parallel evolution of
O157 and non-O157 enterohemorrhagic Escherichia coli. Proc Natl Acad Sci U S
A, 106(42), 17939-17944.
Ohki, R., Giyanto, Tateno, K., Masuyama, W., Moriya, S., Kobayashi, K., et al.
(2003). The BceRS two-component regulatory system induces expression of the
bacitracin transporter, BceAB, in Bacillus subtilis. Mol Microbiol, 49(4), 1135-
1144.
Parsons, J. D., Buehler, E., & Hillier, L. (1999). DNA sequence chromatogram
browsing using JAVA and CORBA. Genome research, 9(3), 277-281.
Pepin, J., Saheb, N., Coulombe, M. A., Alary, M. E., Corriveau, M. P., Authier,
S., et al. (2005). Emergence of fluoroquinolones as the predominant risk factor for
Clostridium difficile-associated diarrhea: a cohort study during an epidemic in
Quebec. Clin Infect Dis, 41(9), 1254-1260.
Perez-Enciso, M., & Ferretti, L. (2010). Massive parallel sequencing in animal
genetics: wherefroms and wheretos. Anim Genet, 41(6), 561-569.
Perna, N. T., Plunkett, G., 3rd, Burland, V., Mau, B., Glasner, J. D., Rose, D. J., et
al. (2001). Genome sequence of enterohaemorrhagic Escherichia coli O157:H7.
Nature, 409(6819), 529-533.
Phillippy, A. M., Schatz, M. C., & Pop, M. (2008). Genome assembly forensics:
finding the elusive mis-assembly. Genome biology, 9(3), R55.
Pi, W., Zhu, X., Wu, M., Wang, Y., Fulzele, S., Eroglu, A., et al. (2010). Long-
range function of an intergenic retrotransposon. Proceedings of the National
Academy of Sciences of the United States of America, 107(29), 12992-12997.
Pirs, T., Ocepek, M., & Rupnik, M. (2008). Isolation of Clostridium difficile from
food animals in Slovenia. J Med Microbiol, 57(Pt 6), 790-792.
Planche, T., Aghaizu, A., Holliman, R., Riley, P., Poloniecki, J., Breathnach, A.,
et al. (2008). Diagnosis of Clostridium difficile infection by toxin detection kits: a
systematic review. Lancet Infect Dis, 8(12), 777-784.
Quesada-Gomez, C., Rodriguez, C., Gamboa-Coronado Mdel, M., Rodriguez-
Cavallini, E., Du, T., Mulvey, M. R., et al. (2010). Emergence of Clostridium
difficile NAP1 in Latin America. J Clin Microbiol, 48(2), 669-670.
Razaq, N., Sambol, S., Nagaro, K., Zukowski, W., Cheknis, A., Johnson, S., et al.
(2007). Infection of hamsters with historical and epidemic BI types of Clostridium
difficile. J Infect Dis, 196(12), 1813-1819.
Rees, D. C., Williams, T. N., & Gladwin, M. T. (2010). Sickle-cell disease.
Lancet, 376(9757), 2018-2031.
Renn, S. C., Machado, H. E., Jones, A., Soneji, K., Kulathinal, R. J., & Hofmann,
H. A. (2010). Using comparative genomic hybridization to survey genomic
sequence divergence across species: a proof-of-concept from Drosophila. BMC
Genomics, 11, 271.
Rennie, C., Noyes, H. A., Kemp, S. J., Hulme, H., Brass, A., & Hoyle, D. C.
(2008). Strong position-dependent effects of sequence mismatches on signal ratios
measured using long oligonucleotide microarrays. BMC Genomics, 9, 317.
Richter, D. C., Ott, F., Auch, A. F., Schmid, R., & Huson, D. H. (2008).
MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One,
3(10), e3373.
Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S.,
Getz, G., et al. (2011). Integrative genomics viewer. Nature biotechnology, 29(1),
24-26.
Rodriguez-Palacios, A., Staempfli, H. R., Duffield, T., & Weese, J. S. (2007).
Clostridium difficile in Retail Ground Meat, Canada. Emerging Infectious
Diseases, 13(3), 485-487.
Rodriguez-Siek, K. E., Giddings, C. W., Doetkott, C., Johnson, T. J., Fakhr, M.
K., & Nolan, L. K. (2005). Comparison of Escherichia coli isolates implicated in
human urinary tract infection and avian colibacillosis. Microbiology, 151(Pt 6),
2097-2110.
Ron, E. Z. (2006). Host specificity of septicemic Escherichia coli: human and
avian pathogens. Curr Opin Microbiol, 9(1), 28-32.
Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M., & Nyren, P. (1996).
Real-time DNA sequencing using detection of pyrophosphate release. Analytical
biochemistry, 242(1), 84-89.
Ronaghi, M., Uhlen, M., & Nyren, P. (1998). A sequencing method based on real-
time pyrophosphate. Science, 281(5375), 363, 365.
Rothberg, J. M., Hinz, W., Rearick, T. M., Schultz, J., Mileski, W., Davey, M., et
al. (2011). An integrated semiconductor device enabling non-optical genome
sequencing. Nature, 475(7356), 348-352.
Rozen, S., & Skaletsky, H. (2000). Primer3 on the WWW for general users and
for biologist programmers. Methods Mol Biol, 132, 365-386.
Rupnik, M. (2008). Heterogeneity of large clostridial toxins: importance of
Clostridium difficile toxinotypes. FEMS Microbiol Rev, 32(3), 541-555.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A.,
et al. (2000). Artemis: sequence visualization and annotation. Bioinformatics,
16(10), 944-945.
Salzberg, S. L., & Yorke, J. A. (2005). Beware of mis-assembled genomes.
Bioinformatics, 21(24), 4320-4321.
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, C.
A., et al. (1977). Nucleotide sequence of bacteriophage phi X174 DNA. Nature,
265(5596), 687-695.
Sanger, F., & Coulson, A. R. (1975). A rapid method for determining sequences
in DNA by primed synthesis with DNA polymerase. Journal of molecular
biology, 94(3), 441-448.
Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F., & Petersen, G. B. (1982).
Nucleotide sequence of bacteriophage lambda DNA. Journal of molecular
biology, 162(4), 729-773.
Savini, V., Catavitello, C., Talia, M., Manna, A., Pompetti, F., Favaro, M., et al.
(2008). Multidrug-resistant Escherichia fergusonii: a case of acute cystitis. J Clin
Microbiol, 46(4), 1551-1552.
Scaria, J., Ponnala, L., Janvilisri, T., Yan, W., Mueller, L. A., & Chang, Y. F.
(2010). Analysis of ultra low genome conservation in Clostridium difficile. PLoS
One, 5(12), e15147.
Schatz, M. C., Phillippy, A. M., Shneiderman, B., & Salzberg, S. L. (2007).
Hawkeye: an interactive visual analytics tool for genome assemblies. Genome
Biol, 8(3), R34.
Schmidt, D., Schwalie, P. C., Wilson, M. D., Ballester, B., Goncalves, A., Kutter,
C., et al. (2012). Waves of retrotransposon expansion remodel genome
organization and CTCF binding in multiple mammalian lineages. Cell, 148(1-2),
335-348.
Searls, D. B. (2010). The roots of bioinformatics. PLoS computational biology,
6(6), e1000809.
Sebaihia, M., Wren, B. W., Mullany, P., Fairweather, N. F., Minton, N., Stabler,
R., et al. (2006). The multidrug-resistant human pathogen Clostridium difficile has
a highly mobile, mosaic genome. Nature Genetics, 38(7), 779-786.
Sellers, P. H. (1974). On the Theory and Computation of Evolutionary Distances.
SIAM Journal on Applied Mathematics, 26(4), 787-793.
Shendure, J., Porreca, G. J., Reppas, N. B., Lin, X., McCutcheon, J. P.,
Rosenbaum, A. M., et al. (2005). Accurate multiplex polony sequencing of an
evolved bacterial genome. Science, 309(5741), 1728-1732.
Sillen, A., Andrade, J., Lilius, L., Forsell, C., Axelman, K., Odeberg, J., et al.
(2008). Expanded high-resolution genetic study of 109 Swedish families with
Alzheimer's disease. Eur J Hum Genet, 16(2), 202-208.
Sloan, L. M., Duresko, B. J., Gustafson, D. R., & Rosenblatt, J. E. (2008).
Comparison of real-time PCR for detection of the tcdC gene with four toxin
immunoassays and culture in diagnosis of Clostridium difficile infection. J Clin
Microbiol, 46(6), 1996-2001.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular
subsequences. Journal of molecular biology, 147(1), 195-197.
Spigaglia, P., Carattoli, A., Barbanti, F., & Mastrantonio, P. (2010). Detection of
gyrA and gyrB mutations in Clostridium difficile isolates by real-time PCR. Mol
Cell Probes, 24(2), 61-67.
Spigaglia, P., & Mastrantonio, P. (2002). Molecular analysis of the pathogenicity
locus and polymorphism in the putative negative regulator of toxin production
(TcdC) among Clostridium difficile clinical isolates. J Clin Microbiol, 40(9),
3470-3475.
Stabler, R. A., Gerding, D. N., Songer, J. G., Drudy, D., Brazier, J. S., Trinh, H.
T., et al. (2006). Comparative Phylogenomics of Clostridium difficile Reveals
Clade Specificity and Microevolution of Hypervirulent Strains. J. Bacteriol.,
188(20), 7297-7305.
Stabler, R. A., He, M., Dawson, L., Martin, M., Valiente, E., Corton, C., et al.
(2009). Comparative genome and phenotypic analysis of Clostridium difficile 027
strains provides insight into the evolution of a hypervirulent bacterium. Genome
Biol, 10(9), R102.
Staden, R. (1996). The Staden sequence analysis package. Molecular
biotechnology, 5(3), 233-241.
Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and
syntenically mapped cDNA alignments to improve de novo gene finding.
Bioinformatics, 24(5), 637-644.
Stapley, J., Reger, J., Feulner, P. G., Smadja, C., Galindo, J., Ekblom, R., et al.
(2010). Adaptation genomics: the next generation. Trends Ecol Evol, 25(12), 705-
712.
Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., et al.
(2002). The generic genome browser: a building block for a model organism
system database. Genome Res, 12(10), 1599-1610.
Stevens, R., Goble, C., Baker, P., & Brass, A. (2001). A classification of tasks in
bioinformatics. Bioinformatics, 17(2), 180-188.
Stratton, M. R., Campbell, P. J., & Futreal, P. A. (2009). The cancer genome.
Nature, 458(7239), 719-724.
Tamura, K., Dudley, J., Nei, M., & Kumar, S. (2007). MEGA4: Molecular
Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol,
24(8), 1596-1599.
Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y. O., & Borodovsky, M.
(2008). Gene prediction in novel fungal genomes using an ab initio algorithm
with unsupervised training. Genome research, 18(12), 1979-1990.
Touchon, M., Hoede, C., Tenaillon, O., Barbe, V., Baeriswyl, S., Bidet, P., et al.
(2009). Organised genome dynamics in the Escherichia coli species results in
highly diverse adaptive paths. PLoS Genet, 5(1), e1000344.
Uemura, S., Aitken, C. E., Korlach, J., Flusberg, B. A., Turner, S. W., & Puglisi,
J. D. (2010). Real-time tRNA transit on single translating ribosomes at codon
resolution. Nature, 464(7291), 1012-1017.
Ukkonen, E. (1985). Algorithms for approximate string matching. Inf. Control,
64(1-3), 100-118.
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G.,
et al. (2001). The sequence of the human genome. Science, 291(5507), 1304-
1351.
Victoria, X. W., Blades, N., Ding, J., Sultana, R., & Parmigiani, G. (2012).
Estimation of sequencing error rates in short reads. BMC Bioinformatics, 13(1),
185.
Walker, F. O. (2007). Huntington's disease. Lancet, 369(9557), 218-228.
Walkty, A., Boyd, D. A., Gravel, D., Hutchinson, J., McGeer, A., Moore, D., et
al. (2010). Molecular characterization of moxifloxacin resistance from Canadian
Clostridium difficile clinical isolates. Diagn Microbiol Infect Dis, 66(4), 419-424.
Warny, M., Pepin, J., Fang, A., Killgore, G., Thompson, A., Brazier, J., et al.
(2005). Toxin production by an emerging strain of Clostridium difficile associated
with outbreaks of severe disease in North America and Europe. The Lancet,
366(9491), 1079-1084.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal,
P., et al. (2002). Initial sequencing and comparative analysis of the mouse
genome. Nature, 420(6915), 520-562.
Welch, R. A., Burland, V., Plunkett, G., 3rd, Redford, P., Roesch, P., Rasko, D.,
et al. (2002). Extensive mosaic structure revealed by the complete genome
sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A, 99(26),
17020-17024.
Wetterstrand, K. (2012). DNA Sequencing Costs: Data from the NHGRI Large-
Scale Genome Sequencing Program. Retrieved April 22, 2012, from
www.genome.gov/sequencingcosts
Wheeler, D. A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., et
al. (2008). The complete genome of an individual by massively parallel DNA
sequencing. Nature, 452(7189), 872-876.
Wilbur, W. J., & Lipman, D. J. (1983). Rapid similarity searches of nucleic acid
and protein data banks. Proceedings of the National Academy of Sciences of the
United States of America, 80(3), 726-730.
Wolff, D., Bruning, T., & Gerritzen, A. (2009). Rapid detection of the
Clostridium difficile ribotype 027 tcdC gene frame shift mutation at position 117
by real-time PCR and melt curve analysis. Eur J Clin Microbiol Infect Dis.
Wu, C. H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z. Z., et al.
(2002). The Protein Information Resource: an integrated public resource of
functional annotation of proteins. Nucleic Acids Res, 30(1), 35-37.
Xing, L., & Brendel, V. (2001). Multi-query sequence BLAST output
examination with MuSeqBox. Bioinformatics, 17(8), 744-745.
Zhang, J., Chiodini, R., Badr, A., & Zhang, G. (2011). The impact of next-
generation sequencing on genomics. J Genet Genomics, 38(3), 95-109.