INTEGRATING GENOMICS, EPIDEMIOLOGICAL AND CLINICAL
DATA USING ONTOLOGY
William Hsiao, PhD
BC Centre for Disease Control and University of BC
IRIDA and GenEpiO Consortia
GloPID-R Zika Workshop 2016, São Paulo
BIG DATA IS CHANGING PUBLIC HEALTH PRACTICES
• Big Data: Increasing digitization of biomedical data (digital objects) requires computers for processing and management to turn the data into information and knowledge
• Molecular/Genomic Epidemiology: high throughput DNA sequencing provides high resolution evidence for epidemiological investigations
• Raw data (GBytes) -> Processed data (MB) -> Interpreted data (KB) -> Decision (Bytes; subtyping results)
• Localized data processing reduces the data transfer bottleneck
• Data harmonization and sharing serve both human and, more critically, computer consumption
When Same Words Can Mean Different Things
SEMANTIC AMBIGUITY
Nuts
IRIDA
Sequencing Instruments
Web Application
Data management
Built-in Analytical Tools
External Galaxy
Command-line Tools
Project Information: http://www.irida.ca
• Open Source and Free
• Modular Design
• User-friendly web interface
• Robust analysis pipeline engine
• Data management for genomic data
• Secure authentication and authorization to access data and system
• Audit trail (data provenance)
IRIDA: GENOMIC DATA ANALYSIS PLATFORM
IRIDA: FEDERATION AND DATA SHARING
• Federation: Multiple local instances able to communicate with each other
• Allow on-site data generation and analysis
• Sharing: Allow data sharing securely via standard API for enhanced analysis (using 3rd party analysis tools)
• Eventually cultivate a culture of openness, responsible data sharing, and collaborative analysis tool development
Sequencing & Bioinformatics
• Sequencing, Assembly Pipeline Parameters
• QA/QC Metrics
• Tree Construction Details
Sample Information
• Isolation source
• Food, Clinical, Environment
• Food category, Body Product
• Dates, Location
Clinical and Epi Details
• Demographics
• Host disease, Symptoms
• Lab Test Results
• Exposures
GENOMIC SEQUENCES NEED TO BE INTERPRETED IN CONTEXT
Descriptive – Organized - Standardized
FAIR Principles of Digital Data Management:
F – Findable
A – Accessible
I – Interoperable
R – Reusable
Published in Nature Scientific Data, March 2016
WHEN REQUISITIONING CONTEXTUAL DATA, WE NEED TO ANTICIPATE THE NEEDS OF DOWNSTREAM USERS
Difficult to foresee data integration needs during an evolving PH emergency such as the Zika or Ebola outbreaks
ONTOLOGY
• A mechanism to specify and express a body of knowledge
• Standardized, well-defined hierarchy of terms
• Each term has a unique universal ID
• Terms interconnected with logical relationships
• Have formats that are Human AND computer readable
• This internally coherent tool can act as a universal translator between different standards
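The properties listed above (unique IDs, a hierarchy of terms, logical relationships, and translation between standards) can be sketched in a few lines of Python. This is a toy illustration only: the `EX:` identifiers, the labels, and the two "lab" vocabularies are invented here and are not taken from GenEpiO or any real ontology.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str   # unique, universal identifier (hypothetical IDs)
    label: str     # human-readable name
    parents: list[str] = field(default_factory=list)  # "is_a" links


# Toy hierarchy; every EX: identifier is invented for illustration.
TERMS = {
    t.term_id: t
    for t in [
        Term("EX:0000001", "food product"),
        Term("EX:0000002", "nut (food)", ["EX:0000001"]),
        Term("EX:0000003", "almond", ["EX:0000002"]),
        Term("EX:0000010", "hardware"),
        Term("EX:0000011", "nut (fastener)", ["EX:0000010"]),
    ]
}

# Each local standard keeps its own wording and maps it to shared IDs.
LAB_A = {"nuts": "EX:0000002", "almonds": "EX:0000003"}
LAB_B = {"tree nut product": "EX:0000002"}


def ancestors(term_id: str) -> set[str]:
    """Return every term reachable by following is_a links upward."""
    found: set[str] = set()
    stack = list(TERMS[term_id].parents)
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(TERMS[current].parents)
    return found


def is_a(term_id: str, candidate: str) -> bool:
    """True if term_id is the candidate term or one of its descendants."""
    return term_id == candidate or candidate in ancestors(term_id)


def translate(value: str, source: dict[str, str], target: dict[str, str]) -> str | None:
    """Translate a local label between two standards via the shared term ID."""
    term_id = source.get(value.lower())
    if term_id is None:
        return None
    for label, target_id in target.items():
        if target_id == term_id:
            return label
    return None
```

In this sketch the word "nuts" is unambiguous once it is bound to an ID: `is_a("EX:0000003", "EX:0000001")` holds for the food sense, while the fastener sense sits in a separate branch, and `translate("Nuts", LAB_A, LAB_B)` looks up the other lab's wording for the same term.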
Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, AMR
Sample Metadata: Isolation Source (Food, Host Body Product, Environmental), BioSample
Epidemiology Investigation: Exposures
Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status
Reporting: Case/Investigation Status
GenEpiO (Genomic Epidemiology Application Ontology)
GEN-EPI-O: COMBINING EPI, LAB, GENOMICS AND CLINICAL DATA FIELDS
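As a rough sketch of what combining fields from these domains could look like in practice, the snippet below merges per-sample records from separate sources into one contextual record. The field names, sample IDs, values, and the `integrate` helper are all hypothetical and chosen only for illustration; they do not reflect the actual GenEpiO schema.

```python
from collections import defaultdict

# Hypothetical records from three domains, keyed by a shared sample ID.
lab_results = [{"sample_id": "S1", "mlst": "ST-11", "amr": "none detected"}]
sample_metadata = [
    {"sample_id": "S1", "isolation_source": "food", "collection_date": "2016-05-02"}
]
clinical_data = [{"sample_id": "S2", "symptoms": "fever"}]


def integrate(**sources):
    """Merge records from several domains into one record per sample.

    Each field name is prefixed with its domain, so provenance is kept and
    identically named fields from different sources cannot collide.
    """
    merged = defaultdict(dict)
    for domain, records in sources.items():
        for record in records:
            sample_id = record["sample_id"]
            for key, value in record.items():
                if key != "sample_id":
                    merged[sample_id][f"{domain}.{key}"] = value
    return dict(merged)


records = integrate(lab=lab_results, sample=sample_metadata, clinical=clinical_data)
```

Prefixing by domain is one simple design choice among many; an ontology-backed system would instead bind each field to a term ID so that the same field is recognized across sources.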
GEN-EPI-O INITIAL DEVELOPMENT
Interview users: Medical & Environmental Microbiologists, Bioinformaticians, Surveillance Analysts & Lab Personnel, Epidemiologists
+
Examine resources: Software and Work Flows, Investigation Tools, Instrumentation
=
GenEpiO (Genomic Epidemiology Application Ontology)
GenEpiO is part of the OBO Foundry library of ontologies
• Prescribes best practices for ontology development
• Common relations, syntax and data formats
• Re-use terms when possible
• Committed to openness, interoperability and collaboration
• Attributable efforts
Open Biomedical Ontologies - http://www.obofoundry.org/
144 ontologies accepted or under development -> describing everything from genes and phylogenies to diseases and anatomy
See draft version at https://github.com/GenEpiO/genepio/wiki
ADVANTAGES OF ONTOLOGY
• Eliminates semantic ambiguity
• Term-mapping allows customization of displays
• Flexible to allow incorporation of new data sources/types
• Faster data integration
• Triggers actionable events in the same way
• Reproducibility (suitable for organizational accreditation, validation)
• Curator Attribution (giving credit to people working on the resources)
• Formation of GenEpiO consortium to work on a common open resource
• Identify priorities
• Build Consensus
Improved Public Health Investigation power!
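Two of the advantages listed above, customized displays through term mapping and actionable events that trigger in the same way, can be illustrated with a minimal sketch. The term IDs, audience names, labels, and alert rule below are assumptions made up for this example, not part of GenEpiO.

```python
# Hypothetical term IDs mapped to audience-specific display labels.
DISPLAY_LABELS = {
    "EX:0000100": {
        "epidemiologist": "Exposure: raw milk",
        "lab": "Source: unpasteurized milk",
    },
    "EX:0000200": {
        "epidemiologist": "Exposure: travel",
        "lab": "Travel history",
    },
}

# Actions are keyed on term IDs, so they fire regardless of display wording.
ALERT_TERMS = {"EX:0000100"}


def render(term_id: str, audience: str) -> str:
    """Return the label preferred by this audience, or the raw ID if unmapped."""
    return DISPLAY_LABELS.get(term_id, {}).get(audience, term_id)


def needs_follow_up(record_term_ids) -> bool:
    """True if any term in the record is on the alert list."""
    return any(term_id in ALERT_TERMS for term_id in record_term_ids)
```

Because both functions operate on IDs rather than free text, two groups can see different wording for the same record while the follow-up rule behaves identically for both.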
ACKNOWLEDGEMENTS
IRIDA Project Leaders
Fiona Brinkman – SFU, Will Hsiao – PHMRL, Gary Van Domselaar – NML, Rob Beiko – Dalhousie, Andrew McArthur – McMaster, Leonid Chindelevitch – SFU, Cedric Chauve – SFU
Simon Fraser University (SFU)
Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon
McMaster University
Daim Sardar
European Food Safety Agency
Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina
National Microbiology Laboratory (NML)
Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter, Eduardo Toboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall
University of Lisbon
João Carriço
BC Public Laboratory and BC Centre for Disease Control (BCCDC)
Damion Dooley, Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’sousa
University of Maryland
Lynn Schriml
Canadian Food Inspection Agency
Adam Koziol, Burton Blais, Catherine Carrillo
Dalhousie University
Alex Keddy
European Bioinformatics Institute
Melanie Courtot, Helen Parkinson
GenEpiO Project Leaders
Will Hsiao – BCCDC, Fiona Brinkman – SFU, Andrew McArthur – McMaster
Simon Fraser University (SFU)
Emma Griffiths
University of British Columbia
Damion Dooley
McMaster University
Amos Raphenya, Brian Alcock
GenEpiO