How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao


William Hsiao, [email protected]

@wlhsiao

BC Public Health Microbiology and Reference Laboratory

BCCDC Grand Round May 26 2015


Outline

• Part 1: What is genomic epidemiology and why is it important for public health microbiology?

• Part 2: What are the requirements to bring genomic epidemiology to routine public health practice? Introducing our project IRIDA as part of the solution

Source: Peter Gleick, Scienceblogs.com

People, Place, Time

Source: Melanie Courtot



Molecular Epidemiology

• Laboratory generated biomarker results can be correlated to epidemiological investigations (People, Place, Time)

• Provides linkage based on common exposure to the same pathogen at the molecular level

• Most tests detect one or a few specific biomarkers, representing only a fraction of the pathogen’s genetic information

Current Methods of Characterizing Foodborne Pathogens in a Public Health Laboratory

• Growth characteristics
• Phenotypic panels
• Agglutination reactions
• Enzyme immunoassays (EIAs)
• PCR
• DNA arrays (hybridization)
• Sanger sequencing of marker genes
• DNA restriction
• Electrophoresis (PFGE, capillary)

Each pathogen is characterized by methods specific to that pathogen, in multiple workflows (separate workflows for each pathogen)

Turnaround time (TAT): 5 min – weeks (months)

Source: Rebecca Lindsey

Genomic Epidemiology

Def: Using whole genome sequencing data from pathogens and epidemiological investigations to track spread of an infectious disease

Why Genomic Epidemiology

• One technology (DNA sequencing) compatible with many types of pathogens

• Capable of generating 10-1000s of high quality pathogen genomes within 1-7 days

Sequencing = lots of HQ Data

• Capture the pathogen’s entire genetic makeup
• Unbiased (~97-99+% of the genome captured using common sequencing approaches)
• Significantly more data than traditional methods
• Allow higher resolution and higher sensitivity methods to be applied
• Allow value-added evolutionary & functional study of the pathogens
– Virulence factors
– AMR genes

$10K per human genome or $10 per bacterial genome

$100M per human genome

Sequencing cost continues to drop

Variations in genomes = Basis of Comparison

• Mutations
– Point mutations
– Small insertions and deletions (indels)
– Can change functions of a gene

• Recombination, deletion, and duplication
– Rearrange genes, can change expression
– Increase gene copy number
– Delete genes

• Horizontal gene transfer
– Acquiring genetic material from a non-parental organism
– E.g. antibiotic resistance / new toxins

SNP Analysis

• What is a SNP?
– A SNP (single nucleotide polymorphism) is a DNA sequence variation occurring when a single nucleotide differs between two or more genomes

ATCGCGATATCATACGG
ATCGCAATATCATACGG
ATCGCGATATCATACGG
ATCGCGATATCATACGG
ATCGCAATATCATACGG

• A SNP can be created by a point mutation, but can also be created by the insertion or deletion of one nucleotide
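As an illustrative sketch (not part of the original slides), the variable position in the example sequences on this slide can be found by comparing the aligned sequences column by column. The function name `find_snps` is hypothetical, and the sketch assumes the sequences are already aligned and of equal length.

```python
def find_snps(sequences):
    """Return {1-based position: observed bases} for columns that differ."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos in range(length):
        bases = {seq[pos] for seq in sequences}
        if len(bases) > 1:  # more than one base observed in this column
            snps[pos + 1] = "".join(sorted(bases))
    return snps


# The five example sequences from the slide, one per line.
example = [
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
]

print(find_snps(example))  # {6: 'AG'}
```

Real pipelines call variants from sequencing reads against a reference and apply quality filters; this sketch only shows the underlying comparison.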

Why are SNPs useful

• Silent mutations that do not change protein sequences happen quite frequently due to DNA replication errors => High Resolution

• SNPs occur across the whole genome and can be detected from whole genome sequencing => Unbiased markers

• SNPs can also be used to infer phylogeny of organisms
– More shared SNPs = more closely related

SNP Minimal Spanning Tree – colored by Phage Type

PT8

PT4

PT13a

PT52

The most similar isolates are connected first => clustering them together
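A minimal sketch of how such a tree can be built, assuming each isolate is summarized by an aligned SNP profile and that distance is the number of differing positions. Kruskal's algorithm is used here because it processes the shortest edges first, which matches "the most similar isolates are connected first"; the slides do not say which algorithm was actually used, and the isolate names and profiles below are hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions at which two aligned SNP profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm over all pairwise SNP distances."""
    edges = sorted(
        (snp_distance(profiles[u], profiles[v]), u, v)
        for u, v in combinations(sorted(profiles), 2)
    )
    parent = {name: name for name in profiles}

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, u, v in edges:  # shortest (most similar) pairs first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # skip edges that would form a cycle
            parent[root_u] = root_v
            tree.append((u, v, dist))
    return tree


# Hypothetical SNP profiles for four isolates.
profiles = {
    "isolate_A": "ACGTAC",
    "isolate_B": "ACGTAT",
    "isolate_C": "TCGAAT",
    "isolate_D": "TCGAAC",
}

for u, v, dist in minimum_spanning_tree(profiles):
    print(u, v, dist)
```

In this toy data, A-B and C-D (one SNP apart) are joined first, and the two pairs are then linked by a longer edge.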

SNP Minimal Spanning Tree – colored by outbreaks

Many phylogenetic trees based on SNPs published to show clustering of outbreak cases

den Bakker et al. Emerg Infect Dis. 2014 Aug;20(8)

Non-related cases

Outbreak cases

Allard, M. et al. PLoS ONE 8 (1) 2013

Forces Driving Pathogen Genome Evolution

Specialization: “lean and mean”

New function can be derived through:

Gene expression can be turned on and off

Intra-cluster distances overlap with inter-cluster distances

Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.

Different species have different clustering distances

Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.

Genomics + Epidemiology

• Having genetic distance information alone may not be enough to fully characterize outbreaks

• Need to combine with epidemiological investigations

• Using known clusters to establish (sub-)species-specific genetic distance criteria

• Genomics can help connect previously unlinked cases and uncover new cases
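One way to picture "using known clusters to establish distance criteria" is to compare within-cluster and between-cluster SNP distances for epidemiologically confirmed clusters and check whether a single cutoff separates them. The sketch below is an assumption-laden illustration: the case identifiers, cluster labels, and distances are hypothetical, and the midpoint rule is only one possible choice of cutoff.

```python
def distance_ranges(distances, cluster_of):
    """Split pairwise distances into within-cluster and between-cluster lists."""
    intra, inter = [], []
    for (a, b), dist in distances.items():
        if cluster_of[a] == cluster_of[b]:
            intra.append(dist)
        else:
            inter.append(dist)
    return intra, inter


def separating_threshold(intra, inter):
    """Return a cutoff between the two ranges, or None if they overlap."""
    if max(intra) < min(inter):
        return (max(intra) + min(inter)) / 2
    return None  # overlap: genetic distance alone cannot separate the clusters


# Hypothetical cases assigned to clusters by epidemiological investigation.
cluster_of = {
    "c1": "outbreak_1", "c2": "outbreak_1",
    "c3": "outbreak_2", "c4": "outbreak_2",
}
# Hypothetical pairwise SNP distances.
distances = {
    ("c1", "c2"): 3, ("c3", "c4"): 5,
    ("c1", "c3"): 40, ("c1", "c4"): 42,
    ("c2", "c3"): 41, ("c2", "c4"): 44,
}

intra, inter = distance_ranges(distances, cluster_of)
print(separating_threshold(intra, inter))  # 22.5
```

When the function returns `None`, the intra-cluster and inter-cluster distances overlap, which is the situation where epidemiological information is needed alongside the genomic data.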

Each year, one in eight Canadians (or four million people) get sick with a domestically acquired food-borne illness.

http://www.phac-aspc.gc.ca/efwd-emoha/efbi-emoa-eng.php

Whole Genome Sequencing of Foodborne Pathogens Around the World

• UK Public Health England committed to sequence all the Salmonella isolates submitted to PH Lab

• US FDA and CDC (supported by National Center for Biotechnology Information) created a distributed network of labs to utilize WGS for pathogen identification

https://publichealthmatters.blog.gov.uk/2014/01/20/innovations-in-genomic-sequencing/

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm


Genome Canada Bioinformatics Competition: Large-Scale Project

“A Federated Bioinformatics Platform for Public Health Microbial Genomics”

Our Goal

The IRIDA platform (Integrated Rapid Infectious Disease Analysis)

An open source, standards compliant, high quality genomic epidemiology analysis platform based on web technology to support real-time (food-borne) disease outbreak investigations

www.IRIDA.ca

Partnership among public health agencies and academic institutes to bridge the gaps between advancements in genomic epidemiology and application to real-life and real-time use cases in public health agencies

- Project Team has direct access to state of the art research in academia
- Project Team is directly embedded in user organization

IRIDA Project Phases

• Phase 1: genomics process and analysis pipeline to produce categorical data (MLST and SNPs) suitable for current epidemiological analysis – almost completed

• Phase 2: combine the categorical data with epidemiological data (line list approach to replace current Excel based approach) – in progress

• Phase 3: Develop IRIDA as an exploratory platform for new ways of interpreting genomics data in light of epidemiological and clinical data – in progress; continuous process beyond current project


Interviews with key personnel to identify barriers to implementing genomic epidemiology in public health agencies

GAP 1: PUBLIC HEALTH PERSONNEL LACK TRAINING IN GENOMICS

Microbial genomics has been a valuable research tool

• Helps us:
– understand microbial evolution
– understand pathogenesis
– create novel industrial processes
– create new laboratory tests

• Use of historical isolates – not real time
• Use of laboratory strains – no associated rich clinical and epidemiological metadata

Cultural and Practical Differences

Genomics Research Laboratory | Genomics Diagnostic Laboratory
Curiosity driven | Production / case driven
Exploratory analysis tolerated | Exploratory analysis discouraged
Reproducibility = other labs’ problem | Reproducibility critical
Tweaking protocols desirable | Stability in protocols desirable
Protocols don’t need to be validated | Protocols need to be validated
Novelty justifies the high cost of experiment | Conscious of cost per unit test; tests need to be scalable

How do we bridge the cultural and the practical differences?

Solution 1a: Build a User Friendly, high quality analysis platform to process genomics data

• Carefully designed and engineered software platform is just the starting point…

[Figure: platform components: User Interface, Security, File system, Metadata Storage, Application logic, REST API, Workflow Execution Manager, Continuous Integration, Documentation]

• Easy to use interface hiding the technical details



Solution 1b: Build Portable and Transparent Pipelines

• Use Galaxy as workflow engine – large community support

• Retooled to address usability, security, and other limitations

• Version-controlled pipeline templates
• Input files, parameters, and workflow are sent to an IRIDA-specific Galaxy for execution
• Results and provenance information are copied from Galaxy

[Figure: IRIDA UI/DB connected to Galaxy (Assembly Tools, Variant Calling Tools, …) through a REST API and a Shared File System, with Galaxy workers executing the tools]

1. Input files sent to Galaxy
2. Tools executed on Galaxy workers
3. Results downloaded from Galaxy

Source: Franklin Bristow
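To make "version-controlled pipeline templates" and "provenance information" concrete, here is a minimal sketch of the kind of record that could be kept for each run. It is not IRIDA's data model and does not call the Galaxy API; the class, function, pipeline, tool, and file names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineTemplate:
    """A pipeline description pinned to a version (hypothetical structure)."""
    name: str
    version: str
    tools: tuple  # ordered (tool name, tool version) pairs


def sha256_of(content: bytes) -> str:
    """Checksum used to show exactly which input data was analysed."""
    return hashlib.sha256(content).hexdigest()


def provenance_record(template, parameters, input_files):
    """Collect what is needed to explain or repeat a pipeline run."""
    return {
        "pipeline": template.name,
        "pipeline_version": template.version,
        "tools": [list(tool) for tool in template.tools],
        "parameters": dict(sorted(parameters.items())),
        "input_checksums": {
            name: sha256_of(content)
            for name, content in sorted(input_files.items())
        },
    }


# Hypothetical pipeline, parameters, and input file content.
template = PipelineTemplate(
    name="snp_phylogeny",
    version="0.1.0",
    tools=(("assembler", "1.0"), ("variant_caller", "2.3")),
)
record = provenance_record(
    template,
    {"min_coverage": 10},
    {"sample_1.fastq": b"@read1\nACGT\n+\nIIII\n"},
)
print(record["pipeline"], record["pipeline_version"])
```

Two runs with the same template version, parameters, and input checksums should be directly comparable; any difference in the record points to what changed.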

Solution 1c: Start the training NOW!

• Canada’s National Microbiology Laboratory has hosted genomic workshops for partners and collaborators

• At PHMRL, we have been conducting workshops to train technologists and researchers on some common genomic analysis tools

• IRIDA Project has dedicated funding for hosting workshops in 4Q of 2015 and 2016

• We would like to engage the epidemiologists in the future for training purposes as well

GAP 2: INFORMATION SHARING IS INEFFICIENT AND AD-HOC

Many Players in surveillance and outbreak – ineffective information sharing

Source: M. Taylor, BCCDC

[Figure labels: Cases; Physicians; Frontline lab; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory; Information; Bioinformatics and Analytical Capacities]

Many Systems used in Reporting Diseases – require data re-entry and re-coding

[Figure labels: Cases; Physicians; Local laboratory; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory; National Ministry of Health. Reporting channels shown: Fax/Electronic; Fax; Phone/Fax; Electronic/Paper; Electronic/Fax/Phone; Mailing of Samples/Fax/Electronic]

Source: M. Taylor, BCCDC

Semantic Web

Credit: http://www.cs.rpi.edu/~hendler/

Semantic web is a suitable technology framework to organize and share arbitrary datasets

What’s the web?

• World-Wide-Web (WWW) is a platform where
– Information is distributed (CBC for news, Netflix for movies, etc.)
– Information is heterogeneous (text, video, pictures)
– (Relevant) information is linked by hyperlinks
– Often, information is only human readable
– Often, information is incorrect
– Often, information is not attributed

What’s Semantic web?

• Semantic web inherits many of the (good) attributes of WWW (distributed, open, heterogeneous, and linked)

• It’s designed to be:
– Machine readable based on a common language of logic
– Linking information can be automated, making data sharing easier
– Easier to describe granular data
– Errors can be detected based on logical reasoning
– Information can be attributed and can be made to persist
– “Smart Web”
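The properties listed above can be sketched with statements stored as subject-predicate-object triples. This is a plain-Python illustration only; a real deployment would typically use RDF/OWL tooling, and every identifier and predicate name below is hypothetical. It shows automated link-following and a simple logic-based error check (a predicate declared single-valued but given two values).

```python
# Hypothetical statements about one isolate and one case.
triples = {
    ("isolate_42", "isolated_from", "case_7"),
    ("case_7", "reported_by", "provincial_lab"),
    ("case_7", "onset_date", "2015-03-02"),
    ("case_7", "onset_date", "2015-03-09"),  # conflicting second value
}

# Predicates assumed to allow at most one value per subject.
FUNCTIONAL = {"onset_date"}


def follow(triples, start, *predicates):
    """Follow a chain of links from a starting subject."""
    current = {start}
    for predicate in predicates:
        current = {o for s, p, o in triples if s in current and p == predicate}
    return sorted(current)


def inconsistencies(triples, functional):
    """Report subjects with more than one value for a single-valued predicate."""
    values = {}
    for s, p, o in triples:
        if p in functional:
            values.setdefault((s, p), set()).add(o)
    return sorted(
        (s, p, sorted(objs)) for (s, p), objs in values.items() if len(objs) > 1
    )


print(follow(triples, "isolate_42", "isolated_from", "reported_by"))
print(inconsistencies(triples, FUNCTIONAL))
```

The first call resolves which organization reported the case behind an isolate without a human reading either record; the second flags the conflicting onset dates.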

IRIDA uses semantic web technologies to address information management issues

• Solutions:– 2a: Localized Instance of federated databases

– 2b: Permission Control – authentication /authorization for information sharing

– 2c: User role-based display of information


Solution 2a: Local/Cloud Instances and Data Federation

• Data processing capacity pushed to data generating labs
• Allow data sharing securely for enhanced analysis
• Eventually cultivating a culture of openness of data sharing and collaborative development of tools

Authorization

Solution 2b: Security

• Local authorization per instance
• Method-level authorization
• Object-level authorization
• Allow secure, fine-grained, and flexible information sharing controlled by the data producer
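The difference between method-level and object-level authorization can be sketched as follows. This is not IRIDA's implementation; the role name, user and sample structures, and function names are hypothetical. The decorator decides whether a user may call an operation at all, and the check inside the function decides whether that user may see a particular object, with the sharing list controlled by the data producer.

```python
from functools import wraps


class AccessDenied(Exception):
    pass


def requires_role(role):
    """Method-level check: the caller must hold the given role."""
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            if role not in user["roles"]:
                raise AccessDenied(f"{user['name']} lacks role {role!r}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@requires_role("analyst")
def read_sample(user, sample):
    """Object-level check: only the owner or explicitly listed users may read."""
    allowed = {sample["owner"], *sample["shared_with"]}
    if user["name"] not in allowed:
        raise AccessDenied(f"{user['name']} may not read {sample['id']}")
    return sample["data"]


# Hypothetical users and sample; the owner decides who is in shared_with.
alice = {"name": "alice", "roles": {"analyst"}}
bob = {"name": "bob", "roles": {"analyst"}}
carol = {"name": "carol", "roles": set()}
sample = {"id": "s1", "owner": "alice", "shared_with": set(), "data": "ACGT"}

print(read_sample(alice, sample))   # owner can read
sample["shared_with"].add("bob")    # data producer grants access
print(read_sample(bob, sample))     # now bob can read too
```

Here `carol` would be rejected by the method-level check, while `bob` was rejected by the object-level check until the owner shared the sample.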

Solution 2c: Role-based Dynamic Display driven by Ontology

• Ontologies often lack a content management system (CMS)
• An Interface Model Ontology (IFM) can define a CMS for an ontology

Source: Damion Dooley

IFM Interface View Permissions

Detailed View Restricted View

E.g. User role permissions control visibility and editing of content

Source: Damion Dooley

GAP 3: INFORMATION REPRESENTATION IS INCONSISTENT

There are at least 74 different ways to say “female” in ENA database

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383942/
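A small sketch of the problem and of the simplest remedy, mapping free-text values onto one controlled term. The variants listed here are illustrative assumptions, not the actual values found in the ENA database, and the function name is hypothetical.

```python
# Hypothetical free-text variants mapped to a single controlled term.
SEX_SYNONYMS = {
    "female": {"female", "f", "fem", "woman"},
    "male": {"male", "m", "man"},
}


def normalize_sex(value):
    """Return the controlled term for a free-text value, or None if unknown."""
    key = value.strip().lower().rstrip(".")
    for term, variants in SEX_SYNONYMS.items():
        if key in variants:
            return term
    return None  # unrecognized values are flagged rather than guessed


for raw in [" Female ", "F", "fem.", "not recorded"]:
    print(repr(raw), "->", normalize_sex(raw))
```

A lookup table like this has to be maintained by hand for every field, which is part of the motivation for the ontology-based approach in the next slides.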

Solution 3a: Use Ontology

• Ontology: a way to describe types of entities and relations between them

• Why use ontology
– Ontology is flexible and expandable
– Lower levels of expressivity (e.g. controlled vocabulary, data dictionary) are heavy handed and show low levels of compliance and adoption
– Free text, used as an alternative, is not computing friendly
– Ontology and semantic web technologies may be a solution

The Utility of Ontologies in Food-borne Investigations

Example: Correlate PFGE type SSOXAI.0042 cases between 01 Mar 2015 and 16 Mar 2015 with Spinach / Leafy Greens / Produce / High-Risk Food Sources and Symptoms of Nausea and Fever

An ontologist organizes how terms are related in a tree so one can search for terms at different levels. Provides great information-resolving power!

High-Risk Food
  Produce: Leafy Greens, Sprouts
  Poultry: Deli Meat, Nuggets
  Seafood: Fish, Shellfish

Source: Emma Griffiths
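Searching "at different levels" of the food tree can be sketched with a parent lookup. The term hierarchy follows the slide (with Spinach placed under Leafy Greens, as in the example query, and the leaf terms grouped pairwise under Produce, Poultry, and Seafood as an assumption read from the figure); the case records and function names are hypothetical.

```python
# Child -> parent relations taken from the slide's food tree.
PARENT = {
    "Produce": "High-Risk Food",
    "Poultry": "High-Risk Food",
    "Seafood": "High-Risk Food",
    "Leafy Greens": "Produce",
    "Sprouts": "Produce",
    "Deli Meat": "Poultry",
    "Nuggets": "Poultry",
    "Fish": "Seafood",
    "Shellfish": "Seafood",
    "Spinach": "Leafy Greens",
}


def is_a(term, ancestor):
    """True if term equals ancestor or sits anywhere below it in the tree."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def cases_exposed_to(cases, category):
    """Select cases whose recorded food falls under the given category."""
    return [case["id"] for case in cases if is_a(case["food"], category)]


# Hypothetical line list entries.
cases = [
    {"id": "case_1", "food": "Spinach"},
    {"id": "case_2", "food": "Nuggets"},
    {"id": "case_3", "food": "Sprouts"},
]

print(cases_exposed_to(cases, "Leafy Greens"))    # ['case_1']
print(cases_exposed_to(cases, "Produce"))         # ['case_1', 'case_3']
print(cases_exposed_to(cases, "High-Risk Food"))  # ['case_1', 'case_2', 'case_3']
```

The same records answer a narrow query (Leafy Greens) and a broad one (High-Risk Food) without re-coding, which is what free text cannot offer.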

Many Domains of Knowledge are needed to describe an outbreak investigation

Build On, Work With:
• OBI
• TypON
• NGSOnto
• NIAID-GSC-BRC core metadata
• MIxS Ontology
• NCBI Biosample etc.
• TRANS – Pathogen Transmission
• EPO
• Exposure Ontology
• Infectious Disease Ontology
• CARD, ARO for AMR
• USDA Nutrient DB
• EFSA Comp. Food Consump. DB

Example gaps to be filled: Expand food ontology; expand CARD AMR data with others.

Lab Checklist/Ontology

• Currently finishing a lab/genomics checklist
• Metadata Domains:
– Sample Collection
– Sample Source
– Environmental
– Lab Analytics
– Sequencing Process / QC
– Sequencing Run / QC
– Assembly Process / QC
– Others overlapping with Epi: Demographic / Geographic / etc.

• Starting an epidemiology checklist to be completed this year

GAP 4: GENOMIC DATA INTERPRETATION IS COMPLEX AND TECHNOLOGY IS EVOLVING

Solution 4a: Use of QA/QC in IRIDA

• Software Engineering
– High quality software that meets regulatory guidelines
– Open Source product to ensure “white box” testing
– Ontology driven software development
– Follow proper software development cycle

• Data Quality
– Built-in modules to check for input data quality
– Warnings and feedback to laboratory technologists during pipeline execution
– Use of Ontology to check metadata (non-genomic) data quality

• Analytic Tool Quality
– Utilize validation datasets
– Use of abstract pipeline description – with version control
– Periodic analysis of exceptions and boundary cases to assess tool accuracy
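As a hedged illustration of a "built-in module to check input data quality", the sketch below reads FASTQ text and returns warnings for reads that are too short or have a low mean base quality. It assumes four-line FASTQ records with Phred+33 quality encoding; the thresholds, function names, and warning wording are hypothetical and are not taken from IRIDA.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def check_fastq(text, min_mean_quality=20.0, min_length=50):
    """Return (read id, warning) pairs for reads failing basic checks."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    warnings = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        read_id = header[1:].split()[0]
        if len(seq) < min_length:
            warnings.append((read_id, "read too short"))
        if mean_phred(qual) < min_mean_quality:
            warnings.append((read_id, "low mean quality"))
    return warnings


# Two tiny hypothetical reads: 'I' encodes Phred 40, '!' encodes Phred 0.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"

print(check_fastq(fastq_text, min_length=4))  # [('read2', 'low mean quality')]
```

Warnings like these could be surfaced to the technologist before the pipeline runs, rather than after an analysis has failed.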

Solution 4b: Generation of validation datasets

To Participate, Contact Rene [email protected]

Or Errol Strain [email protected]

http://www.globalmicrobialidentifier.org/Workgroups#work-group-4

NML and BCPHMRL will be participating in the GMI proficiency test to compare our genomic sequencing and analysis protocols with other labs around the world





Solution 4c: Exploratory tools can access certain data via REST API securely

http://pathogenomics.sfu.ca/islandviewer

IslandViewer

Dhillon and Laird et al. 2015, Nucleic Acids Research

http://kiwi.cs.dal.ca/GenGIS

Parks et al. 2013, PLoS One


Availability

• Jun 1 2015: IRIDA 1.0 beta Internal Release
– Release to collaborators for installation and full test

• Jul 1 2015: IRIDA 1.0 beta1
– Announce Beta release; download and documentation available on website – www.irida.ca

• Aug 1 2015: IRIDA 1.0 beta2
– Cloud installer, with documentation
– Additional pipelines as available
– Visualization as available


Acknowledgements

Project Leaders: Fiona Brinkman – SFU, Will Hsiao – PHMRL, Gary Van Domselaar – NML

University of Lisbon: João Carriço

National Microbiology Laboratory (NML): Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olson, Tarah Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter, Aleisha Reimer

Laboratory for Foodborne Zoonoses (LFZ): Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall

Simon Fraser University (SFU): Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo

BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC): Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella

University of Maryland: Lynn Schriml

Canadian Food Inspection Agency (CFIA): Burton Blais, Catherine Carrillo, Dominic Lambert

Dalhousie University: Rob Beiko, Alex Keddy

McMaster University: Andrew McArthur, Daim Sardar

European Nucleotide Archive: Guy Cochrane, Petra ten Hoopen, Clara Amid

European Food Safety Agency: Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina


IRIDA Annual General Meeting, Winnipeg, April 8-9, 2015
