Making Phenotypic Data FAIR++ for Disease Diagnosis and Discovery
• Findable
• Accessible outside paywalls and private data sources
• Attributable
• Interoperable and computable
• Reusable, exchangeable across contexts and disciplines
Melissa Haendel, PhD
@ontowonka
Genes + Environment = Phenotypes
Computable encodings are essential
• Genes: base pairs, variant notation (e.g. HGVS)
• Environment: medical procedure coding, Environment Ontology
• Phenotypes: Human Phenotype Ontology, Mammalian Phenotype Ontology
Standard exchange formats exist for genes … but for phenotypes? Environment?
• Genes: VCF, GFF, BED
• Phenotypes: PXF (NEW)
Problems with tabular formats
• Denormalized
  – Repetition of fields
  – Ad hoc syntax for multi-valued fields, nesting
• Proliferation
  – Different formats generated for each use case
    • E.g. disease-phenotype, patient-phenotype, …
• Hard to extend
  – Not all phenotypes can be pre-packaged as a phenotype term
    • E.g. measurements, environments
• Ad hoc software, need standard libraries
• Focus should be on the data model
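To make the denormalization point concrete, here is a minimal Python sketch. The column names and the "|" separator are hypothetical, chosen only to illustrate the problem; they are not part of any PXF specification. It shows the same patient as repeated flat rows and as a single nested record.

```python
import json

# Hypothetical flat rows: patient fields repeat on every line, and the
# multi-valued "evidence" column needs an ad hoc "|" separator.
rows = [
    {"patient": "#1", "sex": "M", "phenotype": "HP:0200055", "evidence": "ECO:0000033|PMID:1"},
    {"patient": "#1", "sex": "M", "phenotype": "HP:0100024", "evidence": "ECO:0000033"},
]


def nest(rows):
    """Group flat rows into one record per patient with a list of phenotypes."""
    patients = {}
    for row in rows:
        record = patients.setdefault(
            row["patient"],
            {"id": row["patient"], "sex": row["sex"], "phenotypes": []},
        )
        record["phenotypes"].append(
            {"id": row["phenotype"], "evidence": row["evidence"].split("|")}
        )
    return list(patients.values())


nested = nest(rows)
print(json.dumps(nested, indent=2))
```

In the nested form the patient fields appear once and the multi-valued field is a real list, so no custom parsing rule is needed.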
Phenopackets for clinical labs
Information that could reach the clinical testing lab:
• Patient and family history
• Diagnostic tests, clinical phenotypes
• Genomic information
• Physical exam
• Patient medical history
Clinical labs often get no phenotypes or one-line descriptions.
What if we could make the phenotype data PHI-free and simultaneously more descriptive?
Phenopackets for journals
• Each article can be associated with a phenopacket
• Each phenopacket can be shared via DOI in any repository outside the paywall (e.g. Figshare, Zenodo) and cited as a data citation
Robinson, P. N., Mungall, C. J., & Haendel, M. (2015). Capturing phenotypes for precision medicine. Molecular Case Studies, 1(1), a000372. doi:10.1101/mcs.a000372
Phenopackets for databases
Databases (e.g. OMIA) could share genotype-to-phenotype (G2P) data in a standardized format, retaining domain or species specificity
Ontologies provide pre-packaged phenotype descriptions
A simple data model
• Entities
  – Organism
    • Patient
    • Non-human animal
    • Population
  – Genetic/genomic element
  – Condition
    • Disease
    • Phenotype
• Associations
  – E.g. between disease and phenotype
  – Each association has
    • Evidence
    • Provenance
[Diagram labels: Entity, Condition (Disease, Phenotype), association, Evidence]
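The entity/association model above can be sketched with Python dataclasses. This is an illustrative sketch only: the class and field names are assumptions made for this example and do not reproduce the actual PXF schema or any of its language bindings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Anything that can take part in an association (organism, genomic element, ...)."""
    id: str
    label: str = ""


@dataclass
class Condition(Entity):
    """A disease or phenotype; modeled here as a kind of entity."""


@dataclass
class Evidence:
    """Why an association is believed: an evidence type plus its sources."""
    type_id: str
    source_ids: List[str] = field(default_factory=list)


@dataclass
class Association:
    """Links an entity to a condition and carries evidence and provenance."""
    entity: Entity
    condition: Condition
    evidence: List[Evidence] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)


patient = Entity("#1", "patient 1")
assoc = Association(
    entity=patient,
    condition=Condition("HP:0200055", "Small hands"),
    evidence=[Evidence("ECO:0000033", ["PMID:1"])],
)
print(assoc)
```

Keeping evidence on the association, rather than on the entity or the condition, is what lets the same phenotype term be asserted about different entities with different support.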
PhenoPacket export formats
CSV JSON RDF OWL
monarchinitiative.org
title: "age of onset example"
persons:
  - id: "#1"
    label: "Donald Trump"
    sex: "M"
phenotype_profile:
  - entity: "person#1"
    phenotype:
      types:
        - id: "HP:0200055"
          label: "Small hands"
      onset:
        description: "during development"
        types:
          - id: "HP:0003577"
            label: "Congenital onset"
    evidence:
      - types:
          - id: "ECO:0000033"
            label: "Traceable Author Statement"
        source:
          - id: "PMID:1"
Image credits: upi.com
What does a PhenoPacket look like?
Canonical JSON format
Nesting allows refinement

patients.pxf
• header
• entities (persons)
• assocs (phenotype_profile)

persons:
  - id: "#1"
    label: Mickey Mouse
    date_of_birth: 1928-01-01
    sex: M
  - id: "#2"
    label: Goofy
    sex: M
phenotype_profile:
  - entity: "#1"
    phenotype:
      types:
        - id: HP:0100024
          label: conspicuously happy disposition
      onset:
        types:
          - id: HP:0011463
            label: Adult onset
        description: "Writes distracting tweets"
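As a rough illustration of the canonical JSON form, the patients.pxf example can be written as a plain Python dictionary and round-tripped through the standard json module. The values are copied from the slide example; the exact nesting is an assumption, since the slide layout was flattened, and no phenopacket library is used here.

```python
import json

# Values copied from the slide example; the nesting is an assumption.
packet = {
    "persons": [
        {"id": "#1", "label": "Mickey Mouse", "date_of_birth": "1928-01-01", "sex": "M"},
        {"id": "#2", "label": "Goofy", "sex": "M"},
    ],
    "phenotype_profile": [
        {
            "entity": "#1",
            "phenotype": {
                "types": [{"id": "HP:0100024", "label": "conspicuously happy disposition"}],
                "onset": {"types": [{"id": "HP:0011463", "label": "Adult onset"}]},
            },
        }
    ],
}

# Serialize to JSON text and parse it back.
text = json.dumps(packet, indent=2, sort_keys=True)
restored = json.loads(text)

# Resolve each association's entity reference against the declared persons.
persons_by_id = {person["id"]: person for person in restored["persons"]}
for assoc in restored["phenotype_profile"]:
    person = persons_by_id[assoc["entity"]]
    terms = [term["label"] for term in assoc["phenotype"]["types"]]
    print(person["label"], "->", terms)
```

The associations refer to entities by identifier, so a consumer has to join the two sections, as the loop at the end does.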
How does it handle measurements?

title: "measurement example, taken from genenetwork.org"
organisms:
  - id: "#1"
    label: "BXD mouse population"
    taxon: NCBITaxon:10090
phenotype_profile:
  - entity: "#1"
    phenotype:
      description: "cerebellum weight"
      types:
        - id: "PATO:0000128"
          label: "weight"
      measurements:
        - unit: mg
          value: 61.400
          property_values:
            - property: standard_error
              filler: 2.38
      attribute_of:
        types:
          - id: "UBERON:0002037"
            label: "cerebellum"
      onset:
        description: "measured in adults"
        types:
          - id: "MmusDv:0000061"
            label: "early adult"

Callouts:
• We can represent population phenotypes too
• Attribute: PATO "weight"
• Measured entity: UBERON "cerebellum"
• Units: UO
• Statistical properties (e.g. standard_error): Ontology of Statistical Properties
• For non-abnormal phenotypes we can use a trait ontology, or a building block approach, with
  – PATO
  – Uberon
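A small Python sketch of how a consumer might read the measurement example. It assumes, as one reading of the flattened slide, that property_values is nested under each measurement; the summarize helper is hypothetical and not part of any phenopacket tool.

```python
# Dictionary form of the measurement example; the nesting of
# "property_values" under each measurement is an assumption.
packet = {
    "organisms": [
        {"id": "#1", "label": "BXD mouse population", "taxon": "NCBITaxon:10090"}
    ],
    "phenotype_profile": [
        {
            "entity": "#1",
            "phenotype": {
                "description": "cerebellum weight",
                "types": [{"id": "PATO:0000128", "label": "weight"}],
                "measurements": [
                    {
                        "unit": "mg",
                        "value": 61.400,
                        "property_values": [
                            {"property": "standard_error", "filler": 2.38}
                        ],
                    }
                ],
                "attribute_of": {
                    "types": [{"id": "UBERON:0002037", "label": "cerebellum"}]
                },
            },
        }
    ],
}


def summarize(phenotype):
    """Return one human-readable line per measurement (hypothetical helper)."""
    lines = []
    for m in phenotype.get("measurements", []):
        extras = ", ".join(
            f"{pv['property']} {pv['filler']:g}" for pv in m.get("property_values", [])
        )
        text = f"{phenotype['description']}: {m['value']:g} {m['unit']}"
        lines.append(f"{text} ({extras})" if extras else text)
    return lines


summaries = summarize(packet["phenotype_profile"][0]["phenotype"])
print(summaries)
```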
Example: pathogenicity for a variant

patients.pxf
• header
• entities (variants)
• assocs (disease_profile)

variants:
  - id: CLINVAR:226213
    type: SO:0001483 ('single nucleotide variant')
    label: "NM_007294.3(BRCA1):c.4677_5075del"
    positions:
      - type: HGVS
        value: "NM_007294.3:c.4677_5075del"
disease_profile:
  - entity: CLINVAR:226213
    disease:
      - id: NCIT:C4872
        label: "Breast Carcinoma"
    interpretation: "pathogenic"
    contributors:
      - id: CLINGEN:Agent007
        label: "Clinical Pathogenicity Calculator v1"
    created: "2016-07-12T11:00:59+00:00"
    method:
      - id: doi:10.1038/gim.2015.30
        label: "ACMG ISV guidelines 2015"
    evidence:
      - id: CLINGEN:ev025
        type: ECO:9000100 ('population frequency evidence')
        acmg_criterion: CLINGEN:vic008 ('ACMG v2015 PM2, absent from controls in population databases')
        description: "Variant is absent from a large cohort of non-finnish europeans (NFE) in the ExAC population database, with sequencing coverage of the variant exceeding 25X"
        outcome: "moderately supporting"
        supporting_reference:
          - id: PMID:27997510
        supporting_data:
          - id: CLINGEN:PAF082A
            type: SEPIO:9000895 ('allele frequency data')
            value: "0"
          - id: CLINGEN:PAF082B
            type: SEPIO:9000846 ('median sequencing coverage data')
            value: "28X"
          - id: CLINGEN:PAF082C
            type: SEPIO:9000878 ('population ethnicity data')
            value: "non-finnish european"
      …

• Use GA4GH variant representation (Reece Hart leading)
• ClinGen (Larry Babb) collab
• http://bit.ly/variant-path-PXF
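Because the supporting data are structured rather than free text, the claim in the evidence description can be re-checked mechanically. The sketch below is illustrative only: the dictionary mirrors the slide's evidence item, the parsing of "28X" is an ad hoc assumption, and the 25X threshold is taken from the description string rather than from any formal rule.

```python
# One evidence item from the example, with type identifiers only.
evidence = {
    "id": "CLINGEN:ev025",
    "outcome": "moderately supporting",
    "supporting_data": [
        {"id": "CLINGEN:PAF082A", "type": "SEPIO:9000895", "value": "0"},    # allele frequency
        {"id": "CLINGEN:PAF082B", "type": "SEPIO:9000846", "value": "28X"},  # median coverage
        {"id": "CLINGEN:PAF082C", "type": "SEPIO:9000878", "value": "non-finnish european"},
    ],
}

# Index the supporting data by type so each value can be looked up directly.
data = {item["type"]: item["value"] for item in evidence["supporting_data"]}

allele_frequency = float(data["SEPIO:9000895"])
coverage = int(data["SEPIO:9000846"].rstrip("X"))  # ad hoc parse of "28X"

# Threshold of 25X comes from the free-text description in the example.
absent_with_coverage = allele_frequency == 0 and coverage > 25
print(absent_with_coverage)
```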
Complex phenotypes
• Not every phenotype can be boiled down to a pre-packaged ontology term
• PXF allows post-coordination / post-composition
  – E.g. ‘mild’, ‘severe’ qualifiers
  – Temporal qualifiers: start, end, acute/chronic, …
  – Specifying precise location of phenotype
  – On-the-fly composition of phenotypic descriptors from base ontologies
    • Chemical entities
    • Cell types
    • GO
    • Anatomy
• Additionally
  – Free text descriptions
  – Measurements / quantitative phenotypes
  – Environments (ongoing)
Mungall, C. J., Gkoutos, G., Smith, C., Haendel, M., Lewis, S., & Ashburner, M. (2010). Integrating phenotype ontologies across multiple species. Genome Biology, 11(1), R2. doi:10.1186/gb-2010-11-1-r2
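Post-composition can be sketched as building a phenotype description from base ontology terms at the point of use. The compose function below is a hypothetical helper written for this illustration; the field names follow the measurement example from this talk (types, attribute_of), and the two terms are the ones used there.

```python
def compose(quality, anatomy, description=None):
    """Build a phenotype from a quality term and the anatomical entity it applies to.

    Hypothetical helper: field names follow the measurement example
    ("types" for the quality, "attribute_of" for the measured entity).
    """
    phenotype = {
        "types": [quality],
        "attribute_of": {"types": [anatomy]},
    }
    if description:
        phenotype["description"] = description
    return phenotype


weight = {"id": "PATO:0000128", "label": "weight"}
cerebellum = {"id": "UBERON:0002037", "label": "cerebellum"}

phenotype = compose(weight, cerebellum, description="cerebellum weight")
print(phenotype)
```

No single pre-packaged "cerebellum weight" term is required; the description is assembled from a quality ontology and an anatomy ontology.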
PXF and GA4GH Stack
• PXF primary use case is as a file format
• GA4GH primary use case is as an API
• Obviously these are related…
• …but the devil is in the details
  – E.g. is there a well-defined mapping between proto and JSON?
• How can we better interoperate? Working to converge (M. Diekhans)
  – Define PXF using ProtoBuf
  – What would a query API look like?
    • As an exchange format, we don’t have to worry about this
    • Query APIs for complex data structures proliferate complexity
    • What is the overall GA4GH strategy here?
PXF, GA4GH, and other related activities
• G2P
  – PXF extends initial implementation
  – Make PXF a FHIR resource
• Metadata
  – Align how to reference ontology terms
  – Standardizing identifier prefixes
• MME
  – PXF does not provide a search API
  – PXF subsumes phenotype profile representation
• Beacon
  – PXF could be a response element
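Standardized identifier prefixes matter because every consumer has to turn a compact identifier such as HP:0200055 into a full IRI in the same way. A minimal sketch, assuming a small hand-written prefix map (the three OBO PURL bases are real; the map itself and the expand function are illustrative, not a PXF or GA4GH artifact):

```python
# Illustrative prefix map; a shared, agreed-upon map is the point of
# standardizing prefixes.
PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "PATO": "http://purl.obolibrary.org/obo/PATO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def expand(curie):
    """Expand a prefixed identifier like 'HP:0200055' into a full IRI."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in PREFIXES or not local_id:
        raise ValueError(f"unknown or malformed identifier: {curie!r}")
    return PREFIXES[prefix] + local_id


print(expand("HP:0200055"))
print(expand("UBERON:0002037"))
```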
Summary: Phenotype Exchange Format
• One model, derive alternate concrete forms
  – YAML, JSON, RDF, TSV (subset)
• Species-agnostic
  – From microbes through plants through humans
  – Clinical and basic research
• Applicable to a variety of entities
  – Patients/individual organisms, cohorts, populations
  – Diseases
  – Papers
  – Genes, genotypes, alleles, variants
• Simple for simple cases…
  – Bag of terms model
• …Incremental expressivity
  – Temporality and causality
  – Quantitative as well as qualitative
  – Negation, severity, frequency, penetrance, expressivity
• Ontology-smart
  – Rational composition (post-coordination)
  – Explicit semantics
http://phenopackets.org
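The "one model, alternate concrete forms" idea, and why TSV covers only a subset, can be illustrated with a short Python sketch. The to_tsv function and its column names are assumptions made for this example; they are not the output format of pxftools or any other phenopacket tool.

```python
import csv
import io

# Two associations in nested form; the second has no onset information.
profile = [
    {
        "entity": "#1",
        "phenotype": {
            "types": [{"id": "HP:0200055", "label": "Small hands"}],
            "onset": {"types": [{"id": "HP:0003577", "label": "Congenital onset"}]},
        },
    },
    {
        "entity": "#2",
        "phenotype": {
            "types": [{"id": "HP:0100024", "label": "conspicuously happy disposition"}]
        },
    },
]


def to_tsv(profile):
    """Flatten associations to a 'bag of terms' table; nested detail such as onset is dropped."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity", "phenotype_id", "phenotype_label"])
    for assoc in profile:
        for term in assoc["phenotype"]["types"]:
            writer.writerow([assoc["entity"], term["id"], term["label"]])
    return buffer.getvalue()


tsv = to_tsv(profile)
print(tsv)
```

The table keeps the simple case simple, while anything that needs nesting (onset, evidence, measurements) stays in the richer YAML or JSON form.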
Phenopacket Tool ecosystem
• Non-JVM language bindings
  – Python (beta)
    • https://github.com/phenopackets/phenopacket-python/
  – Javascript (alpha)
    • https://github.com/phenopackets/phenopacket-js/
• Pxftools
  – Command line library, Scala utilities
  – https://github.com/phenopackets/pxftools
• PhenoPacketScraper
  – GSOC project to make phenopackets from case study articles
  – https://github.com/monarch-initiative/phenopacket-scraper-core
• OwlSim
  – Like BLAST, for phenotypes
  – https://github.com/monarch-initiative/owlsim-v3
• WebPhenote
  – Noctua extension for phenopacket creation
  – http://create.monarchinitiative.org
Acknowledgments
• Chris Mungall (schema/architecture)
• Jules Jacobsen (Java API)
• James Balhoff (pxftools)
• Jeremy Nguyen-Xuan (pxftools)
• Seth Carbon (web phenote)
• Kent Shefcheck (Python API)
• Matt Brush (modeling)
• Dan Keith (web phenote)
• Satwik Bhattamishra (GSOC student, PhenoPacketScraper)
• Julie McMurry
• Peter Robinson
• Pier Buttigieg
• Ramona Walls
• Damian Smedley
• Sebastian Kohler
• Tudor Groza
• Harry Hochheiser
• Mark Diekhans
• Melanie Courtot
• Michael Baudis
• Helen Parkinson
• Suzanna Lewis