How Can We Make Genomic Epidemiology a Widespread Reality?
William Hsiao, [email protected]
@wlhsiao
BC Public Health Microbiology and Reference Laboratory
BCCDC Grand Round May 26 2015
Outline
• Part 1: What is genomic epidemiology and why is it important for public health microbiology
• Part 2: What are the requirements to bring genomic epidemiology to routine public health practice
– Introducing our project IRIDA as part of the solution
Source: Peter Gleick, Scienceblogs.com
People, Place, Time
Source: Melanie Courtot
Molecular Epidemiology
• Laboratory generated biomarker results can be correlated to epidemiological investigations (People, Place, Time)
• Provides linkage based on common exposure to the same pathogen at the molecular level
• Most tests detect one or a few specific biomarkers, representing a fraction of the pathogen’s genetic information
Current Methods of Characterizing Foodborne Pathogens in a Public Health Laboratory
• Growth characteristics
• Phenotypic panels
• Agglutination reactions
• Enzyme immunoassays (EIAs)
• PCR
• DNA arrays (hybridization)
• Sanger sequencing of marker genes
• DNA restriction
• Electrophoresis (PFGE, capillary)
Each pathogen is characterized by methods that are specific to that pathogen, in multiple workflows (separate workflows for each pathogen)
Turnaround time (TAT): 5 min – weeks (months)
Source: Rebecca Lindsey
Genomic Epidemiology
Def: Using whole genome sequencing data from pathogens and epidemiological investigations to track spread of an infectious disease
Why Genomic Epidemiology
• One technology (DNA sequencing) compatible with many types of pathogens
• Capable of generating 10-1000s of high quality pathogen genomes within 1-7 days
Sequencing = lots of HQ Data
• Capture the pathogen’s entire genetic makeup
• Unbiased (~97-99+% of the genome captured using common sequencing approaches)
• Significantly more data than traditional methods
• Allow higher resolution and higher sensitivity methods to be applied
• Allow value-added evolutionary & functional study of the pathogens
– Virulence factors
– AMR genes
$10K per human genome or $10 per bacterial genome
$100M per human genome
Sequencing cost continues to drop
Variations in genomes = Basis of Comparison
• Mutations
– Point mutations
– Small insertions and deletions (indels)
– Can change functions of a gene
• Recombination, deletion, and duplication
– Rearrange genes, can change expression
– Increase gene copy number
– Delete genes
• Horizontal gene transfer
– Acquiring genetic material from a non-parental organism
– E.g. antibiotic resistance / new toxins
SNP Analysis
• What is a SNP?
– A SNP (single nucleotide polymorphism) is a DNA sequence variation occurring when a single nucleotide differs between two or more genomes
ATCGCGATATCATACGG
ATCGCAATATCATACGG
ATCGCGATATCATACGG
ATCGCGATATCATACGG
ATCGCAATATCATACGG
• A SNP can be created by a point mutation, but also by the insertion or deletion of a single nucleotide
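A minimal sketch of how SNP positions could be found, assuming the example on this slide is read as five already-aligned sequences of 17 bases each. The function name `snp_positions` is hypothetical; real SNP calling works from sequencing reads against a reference and also handles quality and indels.

```python
def snp_positions(aligned):
    """Return 1-based column positions where the aligned sequences differ."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # A column is a SNP site if more than one distinct base appears in it.
    return [i + 1 for i in range(length) if len({seq[i] for seq in aligned}) > 1]


# The slide's example, split into five 17-base sequences.
sequences = [
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
]

print(snp_positions(sequences))  # [6]: G in three sequences, A in two
```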
Why are SNPs useful
• Silent mutations that do not change protein sequences happen quite frequently due to DNA replication errors => High resolution
• SNPs occur across the whole genome and can be detected from whole genome sequencing => Unbiased markers
• SNPs can also be used to infer phylogeny of organisms
– More shared SNPs = more closely related
SNP Minimal Spanning Tree – colored by Phage Type
PT8
PT4
PT13a
PT52
The most similar isolates are connected first => clustering them together
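"Connecting the most similar isolates first" is what Kruskal's minimum spanning tree algorithm does. The sketch below assumes pairwise SNP distance is a simple count of differing positions between aligned sequences. The isolate names and sequences are made up for illustration, and this is not necessarily the tool used to draw the trees on these slides.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(isolates):
    """Kruskal's algorithm over all pairwise SNP distances.

    isolates: dict mapping isolate name -> aligned sequence.
    Returns the tree as a list of (distance, name_a, name_b) edges.
    """
    edges = sorted(
        (snp_distance(isolates[a], isolates[b]), a, b)
        for a, b in combinations(sorted(isolates), 2)
    )
    parent = {name: name for name in isolates}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, a, b in edges:  # shortest (most similar) pairs first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip edges that would close a loop
            parent[root_a] = root_b
            tree.append((dist, a, b))
    return tree


# Hypothetical isolates: A/B differ by 1 SNP, C/D differ by 1 SNP.
isolates = {
    "iso_A": "ACGTACGTAC",
    "iso_B": "ACGTACGTAT",
    "iso_C": "TCGAACGGAC",
    "iso_D": "TCGAACGGTC",
}

for edge in minimum_spanning_tree(isolates):
    print(edge)
```

The two closest pairs are joined first, and the two resulting groups are then linked by the shortest remaining edge.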
SNP Minimal Spanning Tree – colored by outbreaks
Many phylogenetic trees based on SNPs published to show clustering of outbreak cases
den Bakker et al. Emerg Infect Dis. 2014 Aug;20(8)
Non-related cases
Outbreak cases
Allard, M. et al. PLoS ONE 8 (1) 2013
Forces Driving Pathogen Genome Evolution
Specialization: “lean and mean”
New function can be derived through:
Gene expression can be turned on and off
Intra-cluster distances overlap with inter-cluster distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.
Different species have different clustering distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.
Genomics + Epidemiology
• Having genetic distance information alone may not be enough to fully characterize outbreaks
• Need to combine with epidemiological investigations
• Using known clusters to establish (sub-)species-specific genetic distance criteria
• Genomics can help connect previously unlinked cases to uncover new cases
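One way to picture "using known clusters to establish distance criteria" is to pick the SNP-distance cutoff that misclassifies the fewest known pairs. This is a sketch only: the distances below are invented for illustration, and `best_threshold` is a hypothetical helper, not part of any named tool.

```python
def best_threshold(intra, inter):
    """Pick the SNP-distance cutoff with the fewest misclassified pairs.

    intra: distances between isolates known to be in the same outbreak.
    inter: distances between isolates known to be unrelated.
    Pairs with distance <= cutoff are called "same cluster".
    Returns (cutoff, number_of_misclassified_pairs).
    """
    best = None
    for cutoff in sorted(set(intra) | set(inter)):
        missed = sum(d > cutoff for d in intra)       # outbreak pairs split apart
        false_links = sum(d <= cutoff for d in inter)  # unrelated pairs merged
        errors = missed + false_links
        if best is None or errors < best[1]:
            best = (cutoff, errors)
    return best


# Hypothetical distances; note the overlap (an intra pair at 9, an inter pair at 6).
intra_cluster = [0, 1, 2, 2, 4, 9]
inter_cluster = [6, 12, 15, 20, 31]

print(best_threshold(intra_cluster, inter_cluster))  # (4, 1)
```

Even the best cutoff leaves one pair misclassified here, because intra-cluster and inter-cluster distances overlap. That residual error is why the distance criterion needs to be combined with epidemiological investigation.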
Each year, one in eight Canadians (or four million people)
get sick with a domestically acquired food-borne illness.
http://www.phac-aspc.gc.ca/efwd-emoha/efbi-emoa-eng.php
Whole Genome Sequencing of Foodborne Pathogens Around the World
• UK Public Health England committed to sequence all the Salmonella isolates submitted to PH Lab
• US FDA and CDC (supported by National Center for Biotechnology Information) created a distributed network of labs to utilize WGS for pathogen identification
https://publichealthmatters.blog.gov.uk/2014/01/20/innovations-in-genomic-sequencing/
http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm
Genome Canada Bioinformatics Competition: Large-Scale Project
“A Federated Bioinformatics Platform for Public Health Microbial Genomics”
Our Goal
The IRIDA platform (Integrated Rapid Infectious Disease Analysis)
An open source, standards compliant, high quality genomic epidemiology analysis platform based on web-technology to support real-time (food-borne) disease outbreak investigations
www.IRIDA.ca
Partnership among public health agencies and academic institutes to bridge the gaps between advancements in genomic epidemiology and application to real-life and real-time use cases in public health agencies
- Project Team has direct access to state of the art research in academia
- Project Team is directly embedded in user organization
IRIDA Project Phases
• Phase 1: genomics process and analysis pipeline to produce categorical data (MLST and SNPs) suitable for current epidemiological analysis – almost completed
• Phase 2: combine the categorical data with epidemiological data (line list approach to replace current Excel based approach) – in progress
• Phase 3: Develop IRIDA as an exploratory platform for new ways of interpreting genomics data in light of epidemiological and clinical data – in progress; continuous process beyond current project
Interviews with key personnel to identify barriers to implementing genomic epidemiology in public health agencies
GAP 1: PUBLIC HEALTH PERSONNEL LACK TRAINING IN GENOMICS
Microbial genomics has been a valuable research tool
• Help us understand:
– microbial evolution
– pathogenesis
– create novel industrial processes
– create new laboratory tests
• Use historical isolates – not real time
• Use of laboratory strains – no associated rich clinical and epidemiological metadata
Cultural and Practical Differences
Genomics Research Laboratory | Genomics Diagnostic Laboratory
Curiosity driven | Production / case driven
Exploratory analysis tolerated | Exploratory analysis discouraged
Reproducibility = other labs’ problem | Reproducibility critical
Tweaking protocols desirable | Stability in protocols desirable
Protocols don’t need to be validated | Protocols need to be validated
Novelty justifies the high cost of experiment | Conscious of cost per unit test; tests need to be scalable
How do we bridge the cultural and the practical differences?
Solution 1a: Build a User Friendly, high quality analysis platform to process genomics data
• Carefully designed and engineered software platform is just the starting point…
Platform components (diagram labels): User Interface; Security; File system; Metadata Storage; Application logic; REST API; Workflow Execution Manager; Continuous Integration; Documentation
• Easy to use interface hiding the technical details
Solution 1b: Build Portable and Transparent Pipelines
• Use Galaxy as workflow engine – large community support
• Retooled to address usability, security, and other limitations
• Version-controlled pipeline templates
• Input files, parameters, and workflow are sent to an IRIDA-specific Galaxy for execution
• Results and provenance information are copied from Galaxy
Workflow (diagram):
1. Input files sent to Galaxy
2. Tools executed on Galaxy workers
3. Results downloaded from Galaxy
Components (diagram labels): IRIDA UI/DB; REST API; Galaxy (Assembly Tools, Variant Calling Tools, …); Shared File System; Worker, Worker
Source: Franklin Bristow
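To illustrate why sending input files, parameters, and a versioned workflow together makes a run transparent and repeatable, here is a sketch of a provenance record. It does not call the IRIDA or Galaxy APIs. Every name in it (`build_submission`, `provenance_id`, the template fields, the file names) is an assumption made for illustration.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def build_submission(template, input_files, parameters):
    """Describe one pipeline run: which workflow version, inputs, and settings.

    template: dict with hypothetical "name" and "version" keys.
    input_files: dict mapping file name -> file content (bytes).
    parameters: dict of tool settings.
    """
    return {
        "workflow": {"name": template["name"], "version": template["version"]},
        # Store checksums rather than file contents, so inputs can be verified later.
        "inputs": {name: sha256_of(data) for name, data in sorted(input_files.items())},
        "parameters": dict(sorted(parameters.items())),
    }


def provenance_id(submission):
    """Short fingerprint: identical workflow, inputs and parameters give the same id."""
    canonical = json.dumps(submission, sort_keys=True).encode("utf-8")
    return sha256_of(canonical)[:12]


template = {"name": "snp-phylogeny", "version": "0.1"}        # hypothetical template
reads = {"isolate1.fastq": b"@read1\nACGT\n+\nIIII\n"}         # toy input file
run_a = build_submission(template, reads, {"min_coverage": 10})
run_b = build_submission(template, reads, {"min_coverage": 15})

print(provenance_id(run_a) == provenance_id(run_a))  # True: same run, same id
print(provenance_id(run_a) == provenance_id(run_b))  # False: a parameter changed
```

Because the fingerprint changes whenever the template version, an input file, or a parameter changes, two results with the same id can be treated as produced by the same process.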
Solution 1c: Start the training NOW!
• Canada’s National Microbiology Laboratory has hosted genomic workshops for partners and collaborators
• At PHMRL, we have been conducting workshops to train technologists and researchers on some common genomic analysis tools
• IRIDA Project has dedicated funding for hosting workshops in 4Q of 2015 and 2016
• We would like to engage the epidemiologists in future training as well
GAP 2: INFORMATION SHARING IS INEFFICIENT AND AD-HOC
Many Players in surveillance and outbreak – ineffective information sharing
Source: M. Taylor, BCCDC
Players (diagram labels): Cases; Physicians; Frontline lab; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory
Other diagram labels: Information; Bioinformatics and Analytical Capacities
Many Systems used in Reporting Diseases –require data re-entry and re-coding
Reporting chain (diagram labels): Cases; Physicians; Local laboratory; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory; National Ministry of Health
Reporting channels (diagram labels): Fax/Electronic; Fax; Phone/Fax; Electronic/Paper; Electronic/Fax/Phone; Mailing of Samples/Fax/Electronic
Source: M. Taylor, BCCDC
Semantic Web
Credit: http://www.cs.rpi.edu/~hendler/
Semantic web is a suitable technology framework to organize and share arbitrary datasets
What’s the web?
• World-Wide-Web (WWW) is a platform where
– Information is distributed (CBC for news, Netflix for movies, etc.)
– Information is heterogeneous (text, video, pictures)
– (Relevant) information is linked by hyperlinks
– Often, information is only human readable
– Often, information is incorrect
– Often, information is not attributed
What’s Semantic web?
• Semantic web inherits many of the (good) attributes of WWW (distributed, open, heterogeneous, and linked)
• It’s designed to be:
– Machine readable based on a common language of logic
– Linking information can be automated, making data sharing easier
– Easier to describe granular data
– Errors can be detected based on logical reasoning
– Information can be attributed and can be made to persist
– “Smart Web”
IRIDA uses semantic web technologies to address information management issues
• Solutions:
– 2a: Localized instance of federated databases
– 2b: Permission control – authentication / authorization for information sharing
– 2c: User role-based display of information
Solution 2a: Local/Cloud Instances and Data Federation
• Data processing capacity pushed to data generating labs
• Allow data sharing securely for enhanced analysis
• Eventually cultivating a culture of openness of data sharing and collaborative development of tools
Authorization
Solution 2b: Security
• Local authorization per instance.
• Method-level authorization.
• Object-level authorization.
• Allow secure, fine grained and flexible information sharing controlled by data producer
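The difference between method-level and object-level authorization can be sketched as two separate checks: "may this role call this operation at all?" and "may this user touch this particular project?". The sketch below is an assumption-laden illustration, not IRIDA's actual security code; the roles, users, and function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical role table: which roles may call which methods.
METHOD_ROLES = {
    "read_project": {"analyst", "manager"},
    "share_project": {"manager"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


@dataclass
class Project:
    name: str
    owner: str
    shared_with: set = field(default_factory=set)


def can_call(user, method):
    """Method-level check: does any of the user's roles allow this method?"""
    return bool(user.roles & METHOD_ROLES.get(method, set()))


def can_access(user, project):
    """Object-level check: is this particular project visible to this user?"""
    return user.name == project.owner or user.name in project.shared_with


def read_project(user, project):
    if not can_call(user, "read_project"):
        raise PermissionError(f"{user.name} may not call read_project")
    if not can_access(user, project):
        raise PermissionError(f"{user.name} has no access to {project.name}")
    return f"{user.name} read {project.name}"


def share_project(user, project, recipient):
    """Only the data producer (owner), holding a sharing role, can grant access."""
    if not can_call(user, "share_project") or user.name != project.owner:
        raise PermissionError(f"{user.name} may not share {project.name}")
    project.shared_with.add(recipient.name)


alice = User("alice", {"manager"})
bob = User("bob", {"analyst"})
outbreak = Project("outbreak-2015", owner="alice")

try:
    read_project(bob, outbreak)
except PermissionError as err:
    print("denied:", err)

share_project(alice, outbreak, bob)  # the data producer decides to share
print(read_project(bob, outbreak))
```

Bob's role allows the method from the start, but the object-level check blocks him until the project owner shares it.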
Solution 2c: Role-based Dynamic Display driven by Ontology
• Ontologies often lack a content management system (CMS)
• An Interface Model Ontology (IFM) can define a CMS for an ontology
Source: Damion Dooley
IFM Interface View Permissions
Detailed View Restricted View
E.g. User role permissions control visibility and editing of content
Source: Damion Dooley
GAP 3: INFORMATION REPRESENTATION IS INCONSISTENT
There are at least 74 different ways to say “female” in the ENA database
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383942/
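A sketch of why inconsistent representation hurts and what mapping to a single agreed term looks like. The variants listed are illustrative guesses, not the actual 74 values reported for ENA, and `normalize_sex` is a hypothetical helper.

```python
# Hypothetical synonym table: free-text variants mapped to one agreed term.
SEX_SYNONYMS = {
    "female": {"female", "f", "fem", "woman"},
    "male": {"male", "m", "man"},
}


def normalize_sex(raw):
    """Map a free-text value to an agreed term, or None if unrecognized."""
    key = raw.strip().lower().rstrip(".")
    for term, variants in SEX_SYNONYMS.items():
        if key in variants:
            return term
    return None  # flag for manual review rather than guessing


records = ["Female", " F ", "fem.", "woman", "M", "not recorded"]
print([normalize_sex(value) for value in records])
```

Unrecognized values return `None` instead of a best guess, so they can be reviewed by a person.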
Solution 3a: Use Ontology
• Ontology: a way to describe types of entities and relations between them
• Why use ontology
– Ontology is flexible and expandable
– Lower levels of expressivity (e.g. controlled vocabulary, data dictionary) are heavy handed and show low levels of compliance and adoption
– Free text, used as an alternative, is not computing friendly
– Ontology and semantic web technologies may be a solution
The Utility of Ontologies in Food-borne Investigations
Example: Correlate PFGE type SSOXAI.0042 cases between 01 Mar 2015 - 16 Mar 2015 with Spinach Leafy Greens Produce High-Risk Food Sources and Symptoms of Nausea and Fever
Ontologist organizes how terms are related in a tree so one can search for terms at different levels
Provides great information-resolving power!!
High-Risk Food
  Produce: Leafy Greens, Sprouts
  Poultry: Deli Meat, Nuggets
  Seafood: Fish, Shellfish
Source: Emma Griffiths
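Searching "at different levels" of the tree can be sketched with a simple child-to-parent map. The grouping of terms under Produce, Poultry, and Seafood is inferred from the order of labels on the slide, Spinach is placed under Leafy Greens following the example query, and the case records are invented for illustration.

```python
# Child -> parent links, as read from the slide's tree (grouping is an assumption).
PARENT = {
    "Produce": "High-Risk Food",
    "Poultry": "High-Risk Food",
    "Seafood": "High-Risk Food",
    "Leafy Greens": "Produce",
    "Sprouts": "Produce",
    "Deli Meat": "Poultry",
    "Nuggets": "Poultry",
    "Fish": "Seafood",
    "Shellfish": "Seafood",
    "Spinach": "Leafy Greens",
}


def ancestors(term):
    """All broader terms above a term, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def matches(term, query):
    """A recorded term matches a query if it is the query or falls under it."""
    return term == query or query in ancestors(term)


# Hypothetical case records with the food exposure as recorded.
cases = [
    {"id": "case-1", "food": "Spinach"},
    {"id": "case-2", "food": "Shellfish"},
    {"id": "case-3", "food": "Sprouts"},
]


def cases_for(query):
    return [case["id"] for case in cases if matches(case["food"], query)]


print(cases_for("Leafy Greens"))    # ['case-1']
print(cases_for("Produce"))         # ['case-1', 'case-3']
print(cases_for("High-Risk Food"))  # ['case-1', 'case-2', 'case-3']
```

The same records answer narrow and broad questions without re-coding, which is the "information-resolving power" the slide refers to.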
Many Domains of Knowledge are needed to describe an outbreak investigation Build On, Work With:
OBI
TypON
NGSOnto
NIAID-GSC-BRC core metadata
MIxS Ontology
NCBI Biosample etc
TRANS – Pathogen Transmission
EPO
Exposure Ontology
Infectious Disease Ontology
CARD, ARO for AMR
USDA Nutrient DB
EFSA Comp. Food Consump. DB
Example gaps to be filled: Expand food ontology; expand CARD AMR data with others.
Lab Checklist/Ontology
• Currently finishing a lab/genomics checklist
• Metadata Domains:
– Sample Collection
– Sample Source
– Environmental
– Lab Analytics
– Sequencing Process / QC
– Sequencing Run / QC
– Assembly Process / QC
– Others overlapping with Epi: Demographic / Geographic / etc.
• Starting an epidemiology checklist to be completed this year
GAP 4: GENOMIC DATA INTERPRETATION IS COMPLEX AND TECHNOLOGY IS EVOLVING
Solution 4a: Use of QA/QC in IRIDA
• Software Engineering
– High quality software that meets regulatory guidelines
– Open source product to ensure “white box” testing
– Ontology driven software development
– Follow proper software development cycle
• Data Quality
– Built-in modules to check input data quality
– Warnings and feedback to laboratory technologists during pipeline execution
– Use of ontology to check the quality of metadata (non-genomic data)
• Analytic Tool Quality
– Utilize validation datasets
– Use of abstract pipeline description, with version control
– Periodic analysis of exceptions and boundary cases to assess tool accuracy
Solution 4b: Generation of validation datasets
To Participate, Contact Rene [email protected]
Or Errol Strain [email protected]
http://www.globalmicrobialidentifier.org/Workgroups#work-group-4
NML and BCPHMRL will be participating in the GMI proficiency test to compare our genomic sequencing and analysis protocols with other labs around the world
Solution 4c: Exploratory tools can access certain data via REST API securely
http://pathogenomics.sfu.ca/islandviewer
IslandViewer
Dhillon and Laird et al. 2015, Nucleic Acids Research
http://kiwi.cs.dal.ca/GenGIS
Parks et al. 2013, PLoS One
Availability
• Jun 1 2015: IRIDA 1.0 beta Internal Release
– Release to collaborators for installation and full test
• Jul 1 2015: IRIDA 1.0 beta1
– Announce Beta release, download, documentation available on website – www.irida.ca
• Aug 1 2015: IRIDA 1.0 beta2
– Cloud installer, with documentation
– Additional pipelines as available
– Visualization as available
Acknowledgements
Project Leaders: Fiona Brinkman – SFU; Will Hsiao – PHMRL; Gary Van Domselaar – NML
University of Lisbon: João Carriço
National Microbiology Laboratory (NML): Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olson, Tarah Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter, Aleisha Reimer
Laboratory for Foodborne Zoonoses (LFZ): Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall
Simon Fraser University (SFU): Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo
BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC): Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella
University of Maryland: Lynn Schriml
Canadian Food Inspection Agency (CFIA): Burton Blais, Catherine Carrillo, Dominic Lambert
Dalhousie University: Rob Beiko, Alex Keddy
McMaster University: Andrew McArthur, Daim Sardar
European Nucleotide Archive: Guy Cochrane, Petra ten Hoopen, Clara Amid
European Food Safety Agency: Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina
IRIDA Annual General Meeting, Winnipeg, April 8-9, 2015