Knowledge Environment for · summary level data, metadata, & obfuscation strategies Solutions for...

Date post: 22-Jul-2020
Transcript
Page 1: Knowledge Environment for · summary level data, metadata, & obfuscation strategies Solutions for controlled sharing: ... Solving issues in modern bioscience research relating to...

Academic
A.J.Brookes, R.Dalgleish: University of Leicester, UK
P.Flicek, H.Parkinson: European Molecular Biology Laboratory, Germany
C.Díaz: Fundació IMIM, Spain
J.denDunnen: Leiden University Medical Centre, Netherlands
C.Béroud: Inst Natl de la Santé et de la Recherche Méd, France
A.Cambon-Thomsen: Inst Natl de la Santé et de la Recherche Méd, France
J-E.Litton: Karolinska Institute, Sweden
G.Potamias: Foundation for Research & Technology, Greece
G.Patrinos: University of Patras, Greece
S.Heath: Centre National de Génotypage, France
J.Muilu: University of Helsinki, Finland
J.L.Oliveira: University of Aveiro – IEETA, Portugal
D.Dash: Institute of Genomics and Integrative Biology, India
L.Yip: Swiss Institute of Bioinformatics, Switzerland
A.Devereau: University of Manchester, UK

SMEs
A.Kel: BioBase GmbH, Germany
H.Gudbjartsson: deCODE genetics, Iceland
D.Atlan: PhenoSystems, Belgium
T.Kanninen: Biocomputing Platforms, Finland

Associates
H.Lehvaslaiho: University of Western Cape, South Africa
M.Swertz: Groningen University Medical Centre, Netherlands
M.Vihinen: University of Tampere, Finland

GEN2PHEN Partners (www.gen2phen.org)

...towards an internet ‘Knowledge-Environment’ for G2P information

Page 2

[Figure: GEN2PHEN work packages, with flow labels DATA IN, KNOWLEDGE OUT and LEADERSHIP]

WP1 (Scientific Coordination)

WP2 (Domain Analysis)

WP3 (Standards Development)

WP4 (Genetics DBs)

WP5 (Genomics DBs)

WP6 (Integration and Searching)

WP7 (Data Flows)

WP8 (Knowledge Centre)

WP9 (Use & Sustainability)

WP10 (Project Management)

Page 3

dbGaP

EGA

All individual data: managed access (EGA, dbGaP)

All aggregate data: ‘speed pass’

‘Safe’ data: open access

[GWAS Central India / China...]

GWAS Central

Page 4

RESEARCHER IDENTIFIERS:

ORCID ID: B-1242-2010

G. Thorisson, Univ. Leicester

G. A. Thorisson, Univ. Leicester

G. A. Thorisson, Cold Spring Harbor Lab.

unique, permanent, not reused !

...but, you can have more than one !

Page 5

Openly share the ‘existence’ rather than the ‘substance’ of the data ….thereafter variably manage data access
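
The idea of sharing 'existence' rather than 'substance' can be sketched as a discovery query that reports only where matching records are held, how many there are, and under what access terms. This is a minimal illustration, not the implementation of any named resource: the `Record` fields, the `discover` function, the access labels and the example sources and genes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str   # who holds the data (hypothetical label)
    gene: str
    variant: str
    access: str   # assumed labels: "open", "request" or "closed"
    payload: str  # the 'substance'; discover() never returns it


def discover(records, gene, variant=None):
    """Report where matching data exist, not what the data say."""
    hits = {}
    for r in records:
        if r.gene == gene and (variant is None or r.variant == variant):
            key = (r.source, r.access)
            hits[key] = hits.get(key, 0) + 1
    return [
        {"source": source, "access": access, "count": count}
        for (source, access), count in sorted(hits.items())
    ]


# Illustrative records only; none of these describe real data.
records = [
    Record("lab_a", "GENE_X", "variant_1", "open", "details A"),
    Record("lab_a", "GENE_X", "variant_1", "open", "details B"),
    Record("lab_b", "GENE_X", "variant_1", "request", "details C"),
    Record("lab_b", "GENE_Y", "variant_2", "closed", "details D"),
]
hits = discover(records, "GENE_X")
```

Each hit tells the searcher that data exist and how access is managed; obtaining the records themselves remains a separate, per-source decision.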

Page 6

OPEN data sharing:

...more than one way!

Anthony Brookes University of Leicester, UK

Page 7

...a seamless internet ‘Knowledge-Environment’ for biomedical information

Page 8

GEN2PHEN: www.gen2phen.org

GEN2PHEN activities...

1: Analyse current needs and practices (global perspective)

2: Develop key standards for the G2P field

3: Create generic components, services and integration structures

4: Create search and presentation solutions, anchored on Ensembl

5: Assist deployment of GEN2PHEN solutions, and federate

6: Promote and facilitate data population into G2P databases

7: Consider system durability and long-term financing

Page 9

Issues that restrict sharing data

• Researchers may have neither the time nor the funding to submit data manually, and/or the submission process and requirements are too complicated

• Researchers receive little or no recognition or reward for releasing data, hence little incentive to try

• Researchers may have positive reasons for NOT wanting to share data (ethical, legal, competitive edge)

• No current SANCTIONS for researchers who do not maximally share data

Page 10

‘Safe’ data: open access

Individual & aggregate level data: managed access (EGA, dbGaP)

Page 11

- genetic association database

- aiming to integrate many datasets

- summary level data only

- links to data sources for primary data

Page 12

GWAS Central data content compares well with other resources

[Chart: Number of Studies and Number of p-values per resource, on a log scale from 1 to 100000000. Resource labels as extracted: GWAS catalog OADGAR GaP plus GWAS Central]

Page 13

All individual data: managed access (EGA, dbGaP)

All aggregate data: ‘speed pass’

‘Safe’ data: open access

works today, needs ‘more’

works today, needs ‘more’

absent today, needs ‘promotion’

Page 16

‘Federated’ GWAS Central

[Figure: ‘Public’ GWAS Central alongside federated nodes gwc1.org and gwc2.org]

Study                        Where?    Available?
Breast Cancer (HGVST1)       Central   ✔
Breast Cancer (HGVST2)       Central   ✔
Breast Cancer (HGVST56)      gwc1      ✔
Breast Cancer (HGVST4000)    gwc1      Request Access
Breast Cancer (HGVST4001)    gwc2      ✖

Request for access: Who are you?

Page 17

User

Resource

ORCID ID Directory

identity credentials

DAC

Page 19

User

Identity Provider (+ Directory)

Resource

Resource

Resource

Page 21

Unique identifiers for authors and other contributors

Dec’09: launch of the Open Researcher Contributor Identification Initiative - ORCID

~2/3 of the ~6 million authors in MEDLINE share a last name and first initial with at least one other author, and an ambiguous name refers to ~8 persons on average.

Torvik and Smalheiser. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (2009) vol. 3 (3)
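
To illustrate why a persistent identifier resolves this ambiguity, the sketch below keeps a small registry that ties several name and affiliation variants to one identifier. It is an illustration only and does not use any ORCID service: the `ResearcherRegistry` class and its methods are hypothetical, and the second identifier (`Z-0000-0000`) is invented to stand for a different person with the same name. The first identifier and the three name variants are the ones shown on the researcher identifier slide.

```python
from collections import defaultdict


class ResearcherRegistry:
    """Maps persistent researcher identifiers to the name variants they cover."""

    def __init__(self):
        self._variants_by_id = defaultdict(set)

    def register(self, researcher_id, name, affiliation):
        self._variants_by_id[researcher_id].add((name, affiliation))

    def ids_for_name(self, name):
        """All identifiers under which this exact name string has been used."""
        return sorted(
            rid
            for rid, variants in self._variants_by_id.items()
            if any(n == name for n, _ in variants)
        )

    def variants(self, researcher_id):
        """All (name, affiliation) pairs recorded for one identifier."""
        return sorted(self._variants_by_id.get(researcher_id, ()))


registry = ResearcherRegistry()
registry.register("B-1242-2010", "G. Thorisson", "Univ. Leicester")
registry.register("B-1242-2010", "G. A. Thorisson", "Univ. Leicester")
registry.register("B-1242-2010", "G. A. Thorisson", "Cold Spring Harbor Lab.")
# Hypothetical second person who happens to publish under the same name.
registry.register("Z-0000-0000", "G. Thorisson", "Another Institute")

ambiguous = registry.ids_for_name("G. Thorisson")  # the name alone matches two people
resolved = registry.variants("B-1242-2010")        # the identifier matches exactly one
```

A name lookup returns every person who has used that string, whereas an identifier lookup returns one person together with all of their recorded variants.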

Page 22

Digital Identities on the web...

Page 23

IDENTITY:

Page 25

Orc-ID:9324235238234

G. Thorisson, Univ. Leicester

G. A. Thorisson, Univ. Leicester

G. A. Thorisson, Cold Spring Harbor Lab.

Page 27

The problem...

Mutation data sharing amongst groups such as LSDBs, diagnostic labs, research labs, data miners/curators

Page 28

Central Database

Page 29

USERS

ONE data format

Not a ‘database’

SUBMITTERS

Cafe Variome

Page 30

VarioML

• XML format elements for LSDB data exchange use cases
  – Same format components for different applications

• Based on the Pheno-OM
  – Well defined semantics

• Intermediate format for semantic web
  – XSLT transformation to RDF

• Tools
  – Validators, Java API, XSLTs

Page 31

“Café Rouge enabled” Gensearch DNA analysis tool (Phenosystems)

Uploaded via simple operation

Page 32

DataSHIELD: Pooled data analysis without data sharing

• An Analysis Computer (AC) iteratively sends requests to fit a given GLM to the Data Computers (DCs) on which the data are stored

Page 33

• Only summary statistics are sent back to the AC after each iteration – Individual-level data never leave the DCs

• Eventually, the iterations converge to the same result as if the model had been fitted directly to the physically pooled data.
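
The iteration described above can be sketched for one simple GLM, a logistic regression with an intercept and a single covariate, fitted by Fisher scoring. This is a toy illustration of the principle and not the DataSHIELD software: the function names, the two-parameter restriction and the example data are assumptions made here. Each data computer returns only a score vector and an information matrix for the current coefficients; because these sums are additive across sites, the analysis computer obtains the same update it would get from the pooled rows.

```python
import math


def site_summaries(rows, beta):
    """Runs on a Data Computer: returns only summed statistics, never the rows."""
    b0, b1 = beta
    score = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # fitted probability
        w = mu * (1.0 - mu)                          # GLM working weight
        score[0] += y - mu
        score[1] += (y - mu) * x
        info[0][0] += w
        info[0][1] += w * x
        info[1][0] += w * x
        info[1][1] += w * x * x
    return score, info


def fit_logistic(summary_sources, iterations=25, tol=1e-10):
    """Runs on the Analysis Computer: Fisher scoring over the summed summaries."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        score = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for source in summary_sources:
            s, m = source(beta)
            for i in range(2):
                score[i] += s[i]
                for j in range(2):
                    info[i][j] += m[i][j]
        # Solve info * step = score for the 2x2 case.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step0 = (info[1][1] * score[0] - info[0][1] * score[1]) / det
        step1 = (info[0][0] * score[1] - info[1][0] * score[0]) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if abs(step0) + abs(step1) < tol:
            break
    return beta


# Made-up (x, y) rows held at two sites; y is a binary outcome.
site_a = [(0, 0), (1, 0), (2, 1), (3, 0)]
site_b = [(1, 1), (2, 0), (3, 1), (4, 1)]

federated = fit_logistic([
    lambda beta: site_summaries(site_a, beta),
    lambda beta: site_summaries(site_b, beta),
])
pooled = fit_logistic([lambda beta: site_summaries(site_a + site_b, beta)])
```

Comparing `federated` with `pooled` shows the two fits agreeing to within floating-point rounding, which is the property the slide claims for the full method.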

Page 34

Local &/or Centralised &/or Federated technologies for data display and data mining

[Figure labels]

New database for sample collections, variables + results (Web services)

Existing database for sample collections, variables + results (Web services)

Existing database for sample collections, variables + results (Web services)

Tool for discovery of sample collections + original variables + counts/means

Tool for discovery of sample collections + harmonised variables + counts/means (DataShaper development and use)

Tool for discovery of sample collections + original + harmonised variables + counts/means

Solutions for open sharing: summary level data, metadata, & obfuscation strategies

Solutions for controlled sharing: individual level data, primary and/or harmonised data

Means for controlled and/or open data use without sharing: via DataSHIELD

Eliminate ambiguity, maximise security, and enable recognition/reward:
- Digital IDs for scientific publications (DOIs)
- Digital IDs for Data Releases (DataCite)
- Digital IDs for Researchers (ORCID/OpenID)
- Digital IDs for BioResources (BRIF)

Page 35

[Figure: Today’s Healthcare compared with Tomorrow’s Healthcare. Both panels show Primary Research, Pharmacology, Clinical Experience, Medical Literature and Diagnostics. Additional label: Inconsistent & sub-optimal health-care]

Page 36

All individual data: managed access (EGA, dbGaP)

All aggregate data: ‘speed pass’

‘Safe’ data: open access

Open data ‘discovery’ (Cafe Variome)

Remote pooled data analysis (DataSHIELD)

Page 38

Acknowledgments

• GEN2PHEN Partners

• My team: Robert Free, Rob Hastings, Adam Webb, Tim Beck, Sirisha Gollapudi, Gudmundur Thorisson, Owen Lancaster

HGVbaseG2P has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.

“Data-to-Knowledge-to-Practice” (D2K2P) Center

Page 39

BIOBANKING (‘BioShaRE’)

[Figure labels]

Databases: Biobank #1, Biobank #2, Biobank #3

DATA, METADATA, CATALOGS

Harmonisation software

BioShare access, Public access

ELSI software:
- no access
- open access
- controlled access
- open discovery
- remote analysis

Page 40

..... and / or

• Open access (to any/all sensitive data) for data discovery purposes, without revealing data

• Open access (to any/all sensitive data) for pooled remote analysis

WP5: GENOMICS G2P DATABASES

Page 41

BIO-INFORMATICS MED-INFORMATICS

ACADEMICS COMPANIES

Data

Data

RESEARCH HEALTHCARE

Page 42

Personal

Clinical Mutation Omics Drugs

Population Diseases

Data +

Information +

Knowledge

Disease specific Portals

Health Care Utility

Utilisation in healthcare

Page 43

The I-Health Opportunity

[Figure labels]

DISORGANISED DIGITAL INFORMATION RELEVANT TO PERSONALIZED HEALTHCARE

All Patient & Local System Data

Biosensors

EHR

Modalities

Systems data

Text & Web pages

Computer Models

Decision Support Systems

BioScience & Omics

Databases

Feedback / Optimisation

Systematised Biomedical Knowledge

Health(care) Avatar & Personalised Care

Self-Optimising

Feasible architectural Concept New Intelligence & Utility

Research & Technology advances

Page 44

Progress to date:

- operating as part of GEN2PHEN extended goals

- created 'I-Health community', >150 academics, companies, healthcare providers

- concept presented in many international meetings and forums

- free 1/2 day workshop as satellite to ESHG (6 invited speakers, funding in place)

- major international conference in Brussels, Oct 2011 (venue booked, funding in place)

- organising a 3-day exploratory 'think tank' in spring 2012, with PHG

- high level lobbying with funders and policy makers

- incorporating I-Health elements in EUR 70M of funding applications due autumn 2011

- launching the Leicester D2K2P Center, to implement I-Health concepts

“Data-to-Knowledge-to-Practice” (D2K2P) Center

Page 45

Issues related to GWAS data sharing

• Researchers are not sharing G2P data generally for various reasons…..

– Insufficient staffing &/or bioinformatic capabilities

– Ethical issues / identifiable data (genotypes, phenotypes) / privacy

– Desire to monopolise and control “their data”

– No credit/recognition is given for data sharing or curation

• Lack of sharing is harming the scientific endeavor…..

– Most information not available to most researchers for consideration

– Heterogeneity across studies/populations, and smaller effect sizes missed

– Missed opportunities for collaboration & researcher recognition & reward

WP5: GENOMICS G2P DATABASES

Page 46

Identifying Individuals in Aggregated Data

AGGREGATE LEVEL DATA

Safe Elements:
- P values & odds ratios
  - graphically, all markers
  - non-directional, all markers
  - directional, hundreds of markers
- Allele freqs (hundreds of markers)

Unsafe Elements:
- P values & odds ratios
  - directional, all markers
- Allele freqs
  - all markers

‘Speedy’ Access Open Access
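
The safe and unsafe elements listed above can be read as a small decision rule. The sketch below is one possible encoding, under stated assumptions: "hundreds of markers" is taken to mean fewer than 1000, safe elements are routed to open access and unsafe ones to the 'speed pass' tier (following the tier names used on the earlier slides), and the function names and element labels are hypothetical.

```python
MAX_SAFE_MARKERS = 999  # assumption: "hundreds of markers" means fewer than 1000


def is_safe(element, n_markers, directional=True, graphical_only=False):
    """Return True if an aggregate-level release matches a 'safe' element.

    element: "association" (P values & odds ratios) or "allele_freq".
    """
    if element == "association":
        # Graphical or non-directional results are listed as safe for all markers.
        if graphical_only or not directional:
            return True
        # Directional results are listed as safe only for hundreds of markers.
        return n_markers <= MAX_SAFE_MARKERS
    if element == "allele_freq":
        return n_markers <= MAX_SAFE_MARKERS
    raise ValueError(f"unknown element: {element!r}")


def access_tier(element, n_markers, directional=True, graphical_only=False):
    """Map a release to an access tier: 'open' if safe, otherwise 'speed_pass'."""
    safe = is_safe(element, n_markers, directional, graphical_only)
    return "open" if safe else "speed_pass"
```

For example, `access_tier("association", 500000, directional=False)` gives `"open"`, while the same release with directional effects gives `"speed_pass"`.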

Page 47

Solving issues in modern bioscience research relating to...
- researcher disambiguation
- data access control
- data sharing & online publication
- tracking & rewarding data contributions
- data integration & knowledge mining

...via people having Digital Identities on the web

Page 48

DataSHIELD: Pooled data analysis without data sharing!!

• Conventionally, for individual-level analysis, one pools the data from each of the studies into one single large dataset, then analyses this dataset as if it were a single study.

• This requires access to individual-level data

• ELSI restrictions on 3rd party sharing

• For a wide class of analyses (GLMs), this can be avoided using the DataSHIELD approach (Wolfson et al, IJE 2010)

• DataSHIELD can give the same analysis results without disclosing any individual-level data to the researchers!

Page 49

All individual data: managed access (EGA, dbGaP)

All aggregate data: ‘speed pass’

‘Safe’ data: open access

works today, just need ‘more’

works today, just need ‘more’

absent today, needs ‘support’

Page 50

Reluctance to share

Ethico-legal restrictions

Technical obstacles (integration, access, etc)

The journey to optimal data sharing...

...tackle via people having Digital Identities on the web

Page 52

MIQAS (Minimum Information for QTLs and Association Studies)

PaGE-OM (Phenotype & Genotype Experiment Object Model)

Page 53

Universal, Core Data Model for LSDBs (from LOVD, UMD, DMuDB, Findis)

[Figure: class diagram; the 1 and * multiplicities on the associations are not recoverable from the extracted text. Classes shown: Publication, Experiment_result, Phenotype_Value, Individual, Panel, Phenotype_feature, Phenotype_method, SUBMITTER, Genotype_phenotype_correlation_experiment, Genomic_allele, Latent_genotype, Run, Assayed_genomic_genotype, Variation_assay, REFSEQ, XLINK, Molecular_sample, Genomic_allele_population_frequency]

Page 54

Core Model

Observable Entity

Observable Feature

Observation

[ObsOrInf]

Protocol Protocol Application
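
As a rough illustration of how the five core classes could fit together, the sketch below models them as plain data classes. The slide gives only the class names and the `ObsOrInf` attribute, so every field and every link between classes here is an assumption; the `'Obs'` and `'Inf'` values are taken from the domain slides that reuse this core model, and the example values are invented.

```python
from dataclasses import dataclass
from enum import Enum


class ObsOrInf(Enum):
    """Whether a value was directly observed or inferred."""
    OBS = "Obs"
    INF = "Inf"


@dataclass(frozen=True)
class ObservableEntity:
    name: str  # e.g. an individual or a panel (assumed field)


@dataclass(frozen=True)
class ObservableFeature:
    name: str  # what is being observed about the entity (assumed field)


@dataclass(frozen=True)
class Protocol:
    name: str  # the method, in general terms (assumed field)


@dataclass(frozen=True)
class ProtocolApplication:
    protocol: Protocol  # assumed link: one specific run of a protocol
    label: str


@dataclass(frozen=True)
class Observation:
    # Assumed links: an observation records a value for one feature of one
    # entity, produced by one application of a protocol.
    entity: ObservableEntity
    feature: ObservableFeature
    value: str
    obs_or_inf: ObsOrInf
    protocol_application: ProtocolApplication


# Invented example values, for illustration only.
run = ProtocolApplication(Protocol("example phenotyping method"), "run 1")
observation = Observation(
    entity=ObservableEntity("individual 1"),
    feature=ObservableFeature("example phenotype of interest"),
    value="present",
    obs_or_inf=ObsOrInf.OBS,
    protocol_application=run,
)
```

Under this reading, a domain model would specialise these classes, for instance with an individual or panel as the entity and a phenotyping run as the protocol application.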

Page 55

Phenotype Domain (‘Pheno-OM’)

[Figure: the core model (Observable Entity, Observable Feature, Observation [ObsOrInf], Protocol, Protocol Application) applied to phenotypes. Labels shown: Panel, Individual, Phenotype_of_interest, Observed_phenotype, Inferred_phenotype, Assayed_genotype, Phenotyping_method, Phenotyping_run; attribute values ObsOrInf = ‘Obs’ and ObsOrInf = ‘Inf’]

Page 56

WP6: INTEGRATION & DATA ACCESS TECHNOLOGIES

Page 57

• over 2000 standardised & interoperable LSDBs

• Web-services on top of these databases

• merging & centralisation of summary contents

• comprehensive listing of all LSDBs (with HGVS/HVP)

Page 58

Patho-DB: Phenotype Domain

[Figure: the core model (Observable Entity, Observable Feature, Observation [ObsOrInf], Protocol, Protocol Application) applied to phenotypes. Labels shown: Panel, Individual, Phenotype_of_interest, Observed_phenotype, Inferred_phenotype, Assayed_genotype, Phenotyping_method, Phenotyping_run; attribute values ObsOrInf = ‘Obs’ and ObsOrInf = ‘Inf’]

Page 59

Patho-DB: DNA Domain

[Figure: the core model (Observable Entity, Observable Feature, Observation [ObsOrInf], Protocol, Protocol Application) applied to DNA. Labels shown: Sample, Individual, Panel, Assayed_genotype, Assayed_variant, Sequence_feature, Marker, Variant, Genotype, Genotyping_method, Genotyping_run; attributes DiploidCount = float, IsCombo = YesNo, IsHaplo = YesNo, ObsOrInf = ‘Obs’, ObsOrInf = ‘Inf’]

Page 60

PROJECTS:

GEN2PHEN: technologies, standards, software, databases & policies towards seamless/holistic organisation and utility of Genotype-To-Phenotype information

BioShaRE-EU: Harmonization, standardization, implementation & utilization of biobanking research tools (sampling, computing & analysis technologies)

COPD-MAP: In charge of data management for £7M UK systems biology study into COPD. Exploring several platform options, including TransMart

'I-Health' Concepts: Mapping medical informatics needs to bridge the gap between research & healthcare informatics, part of the IT Future of Medicine Pilot being run by Hans Lehrach

Data-2-Knowledge-2-Practice Centre (Director): Two floors of biobank & I-Health IT, atop a CVD & respiratory disease clinic PLUS advanced biobank

Page 61

Patho-DB: Pathogenicity Domain

[Figure: the core model (Observable Entity, Observable Feature, Observation [ObsOrInf], Protocol, Protocol Application) applied to pathogenicity. Labels shown: Pathogenicity, Pathogenicity_of_interest, Pathogenicity_method, Pathogenicity_run, Assayed_variant, Assayed_genotype; attributes IsHaplo = YesNo, IsCombo = YesNo, DiploidCount = float, ObsOrInf = ‘Obs’, ObsOrInf = ‘Inf’, ObsOrInf = ???]

Page 62

- genetic association database

- integrates many (‘all’) datasets

- summary level data only

- links to data sources for primary data

‘GWAS Central’

Page 68

Acknowledgments

• GEN2PHEN Partners

• My team: Robert Free, Rob Hastings, Adam Webb, Tim Beck, Sirisha Gollapudi, Gudmundur Thorisson, Owen Lancaster

• I-Health supporters: Iain Buchan, Barend Mons, Allan Hanbury, Jane Kaye, Hans Lehrach, Kurt Zatloukal, Jaak Vilo, Alvis Brazma, Carlos Diaz, + 150 other groups.

GWAS Central has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.

Page 69

Clinical Decision Making

KNOWLEDGE PORTALS

Ensembl

Annotation & archiving

GLOBAL RESOURCES: LSDBs, GWAS DBs, MODBs

DIAGNOSTICS LABS

BIOBANKS

EHRs

mutation data

DMuDB

Cafe Variome

Private/Sensitive data

OMICS PROJECTS, LOCAL DBs (deep phenotypes, omics/NGS, analyses, MO data, literature)

(phenotypes, omics & lifestyle data)

Orphanet Knowledge Base

Page 70:

Variant (general)

Phenotype

M

Method (instance)

Patient

Integrates Patient and Variant Centric advantages (and optionally Method* as well), whilst also providing a place to hold the pathogenicity of the variant in that patient

Experiment = Pathogenicity (instance)

Experiment Centric

Variant (instance)

Pathogenicity (general)

Has Phenotype

*
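A minimal sketch of the experiment-centric idea on this slide, assuming that an "experiment" is simply a record tying one patient to one variant, optionally with a method, and carrying the pathogenicity call for that patient. The class names, field names and example values are hypothetical and are not taken from any GEN2PHEN schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Patient:
    patient_id: str


@dataclass(frozen=True)
class Variant:
    """A 'general' variant, independent of any one patient."""
    name: str


@dataclass(frozen=True)
class Method:
    description: str


@dataclass(frozen=True)
class Experiment:
    """One observation of a variant in a patient (a variant 'instance')."""
    patient: Patient
    variant: Variant
    pathogenicity: str               # pathogenicity of the variant in this patient
    method: Optional[Method] = None  # optional, as the slide notes


def by_patient(experiments: List[Experiment], patient: Patient) -> List[Experiment]:
    """Patient-centric view: everything observed in one patient."""
    return [e for e in experiments if e.patient == patient]


def by_variant(experiments: List[Experiment], variant: Variant) -> List[Experiment]:
    """Variant-centric view: every observation of one variant."""
    return [e for e in experiments if e.variant == variant]


# Hypothetical example data.
p1, p2 = Patient("P1"), Patient("P2")
v1, v2 = Variant("variant A"), Variant("variant B")
experiments = [
    Experiment(p1, v1, "pathogenic", Method("sequencing")),
    Experiment(p2, v1, "uncertain"),
    Experiment(p1, v2, "benign"),
]

calls_for_v1 = [e.pathogenicity for e in by_variant(experiments, v1)]
print(calls_for_v1)
```

Because both views are derived from the same list of experiments, the patient-centric and variant-centric queries need no separately stored links, and the per-patient pathogenicity has a natural home on the experiment record.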

Page 71:

Search G2P

Comments and annotations

Feed of search results

etc.

Web services

Web services

UMD Web services

Café Rouge

Resource list

WP8: KNOWLEDGE CENTRE & TRAINING

Page 72:

PaGE-OM ‘SAMPLE’ Domain

[Figure: UML class diagram. Classes: Individual, Panel, Molecular_sample, Abstract_population, Abstract_observation_target. Associations carry multiplicities of 0..1 and 0..*.]

Page 73:

PaGE-OM ‘GENOTYPE’ Domain

[Figure: UML class diagram. Classes: Assayed_genomic_genotype, Latent_genotype, Genomic_variation, Genomic_allele, Variation_assay, Frequency, Abstract_observation_target, Genomic_genotype_population_frequency, Genomic_allele_population_frequency. Association labels: "measured genotype", "detectable genotypes", "with assay details", "without assay details". Associations carry multiplicities of 0..1, 0..*, 1 and 1..*.]

Page 74:

PaGE-OM ‘PHENOTYPE’ Domain

[Figure: UML class diagram. Classes: Individual, Observable_feature, Observable_feature_category, Observation_method, Observed_value. Associations carry multiplicities of 0..1, 0..* and 1.]

Page 75:

PaGE-OM ‘EXPERIMENT’ Domain

[Figure: UML class diagram. Classes: Study, Genotype_phenotype_correlation_experiment, Observable_feature, Observation_method, Observed_value, Abstract_observation_target, Genomic_variation, Variation_assay, Genomic_observation, Experiment_result. Associations carry multiplicities of 0..* and 1.]

Page 76:

[Figure: example object instances, grouped by class]

STUDY: name=“hypertension replication study”

GENOTYPE_PHENOTYPE_CORRELATION_EXPERIMENT: name=“replication of markers on gene x”; name=“replication of markers on gene y”

EXPERIMENT_RESULT: P-value=“1.0e-4”

GENOMIC_VARIATION: id=“rs12345”

VARIATION_ASSAY: id=“rs12345.v1” description=“taqman”

GENOMIC_ALLELE: name=“C”; name=“T”

GENOMIC_ALLELE_POPULATION_FREQUENCY: value=“0.8”; value=“0.7”

PANEL: name=“hypertensives”; name=“normotensives”

OBSERVABLE_FEATURE: name=“blood pressure”

OBSERVATION_METHOD: description=“manual protocol, involving....”

OBSERVED_VALUE: average_values=(160/90); average_values=(120/70)
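The example instances on this slide can be written down as plain objects. The sketch below is an illustration only: the Python class and field names are simplified stand-ins for the PaGE-OM classes. The slide text does not say which frequency belongs to which panel or allele, which blood pressure average belongs to which panel, or which experiment holds the result, so those pairings are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Panel:
    name: str
    average_values: Tuple[int, int]  # observed value for "blood pressure"


@dataclass
class AllelePopulationFrequency:
    allele: str
    panel: Panel
    value: float


@dataclass
class ExperimentResult:
    variation_id: str
    p_value: float


@dataclass
class CorrelationExperiment:
    name: str
    results: List[ExperimentResult] = field(default_factory=list)


@dataclass
class Study:
    name: str
    experiments: List[CorrelationExperiment] = field(default_factory=list)


# ASSUMPTION: pairing of averages with panels is inferred, not stated on the slide.
hypertensives = Panel("hypertensives", (160, 90))
normotensives = Panel("normotensives", (120, 70))

# ASSUMPTION: pairing of frequencies with allele "C" and with panels is illustrative.
frequencies = [
    AllelePopulationFrequency("C", hypertensives, 0.8),
    AllelePopulationFrequency("C", normotensives, 0.7),
]

# ASSUMPTION: the single result is attached to the "gene x" experiment.
study = Study(
    "hypertension replication study",
    [
        CorrelationExperiment(
            "replication of markers on gene x",
            [ExperimentResult("rs12345", 1.0e-4)],
        ),
        CorrelationExperiment("replication of markers on gene y"),
    ],
)


def significant_results(s: Study, alpha: float) -> List[Tuple[str, str]]:
    """Return (experiment name, variation id) for results with p < alpha."""
    return [
        (exp.name, res.variation_id)
        for exp in s.experiments
        for res in exp.results
        if res.p_value < alpha
    ]


hits = significant_results(study, 0.05)
print(hits)
```

Keeping study, experiment and result as separate objects mirrors the way the model lets one study hold several correlation experiments, each with its own results.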

Page 77:

LSDBs GWDBs

1. Create ‘franchised’ databases
- data models [e.g. PaGE-OM, Pheno model]
- data management tools [BCP, Phenosys]
- databases [LOVD, UMD, IGVdb, HGVbaseG2P]

Diagnostic labs

Research labs

Genome browsers

2. Build the connections
- ontologies, nomenclatures
- data formats, tools/software
- reference standards [LRG]

GEN2PHEN

Page 78:

LSDBs GWDBs

3. Enable the data flow
- legal and ethical [permissions, privacy]
- attribution, incentives, reward [BRIF]

Diagnostic labs

Research labs

Genome browsers

4. Enable data searching
- software [SNP-DAS, APIs, HGVMart]
- interfaces [browsers, DiseaseCard]

GEN2PHEN

Page 79:

LSDBs GWDBs

5. Grid & semantic web
- workflows, software, security
- permanent global IDs for all ‘entities’ (people, web pages, pictures, functions...)
- all components declare their existence and capabilities

Diagnostic labs

Research labs

Genome browsers

GEN2PHEN

Page 80:

LOVD 3.0

[Figure: class diagram with 1 and * multiplicities. Classes shown: REFSEQ XLINK, Publication, SUBMITTER, Genomic_allele, Experiment_result, Phenotype_Value, Individual, Phenotype_feature, Phenotype_method, Genotype_phenotype_correlation_experiment, Run, Panel, Latent_genotype, Assayed_genomic_genotype, Variation_assay, Molecular_sample, Genomic_allele_population_frequency. Other labels: Patients, Variants, Genes, Phenotypes, Screenings, Submitters, Diseases.]

Page 81:

Variant (instance)

M

Method (instance)

Patient

Method Centric (current LOVD 3.0 ?)

Suitable as a database for labs generating mutation data

Page 82:

LOVD

[Figure: class diagram with 1 and * multiplicities. Classes shown: REFSEQ XLINK, Publication, SUBMITTER, Genomic_allele, Experiment_result, Phenotype_Value, Individual, Phenotype_feature, Phenotype_method, Genotype_phenotype_correlation_experiment, Run, Panel, Latent_genotype, Assayed_genomic_genotype, Variation_assay, Molecular_sample, Genomic_allele_population_frequency. Other labels: Patient, Gene_X_Variant, Patient2Variant, Gene, Phenotype, DetectionTechnique, Submitter, a "1 - *" multiplicity, and one relationship marked "???".]

Page 83:

DMuDB

[Figure: class diagram with 1 and * multiplicities. Classes shown: REFSEQ XLINK, Publication, SUBMITTER, Genomic_allele, Experiment_result, Phenotype_Value, Individual, Panel, Phenotype_feature, Phenotype_method, Genotype_phenotype_correlation_experiment, Run, Latent_genotype, Assayed_genomic_genotype, Variation_assay, Molecular_sample, Genomic_allele_population_frequency. Other labels: Variant, Reference, Patient, Referral_has_Variant, Genotype, Reference_sequence, Disease, Test_type, Laboratory, Sample, Referral, Interpretation, External-reference, and a "1 - *" multiplicity.]

Page 84:

UMD

[Figure: class diagram with 1 and * multiplicities. Classes shown: REFSEQ XLINK, Publication, SUBMITTER, Genomic_allele, Experiment_result, Phenotype_Value, Individual, Panel, Phenotype_feature, Phenotype_method, Genotype_phenotype_correlation_experiment, Run, Latent_genotype, Assayed_genomic_genotype, Variation_assay, Molecular_sample, Genomic_allele_population_frequency. Other labels: Frequency, Reference, Assay, Patient, Genotype, Variation, LRG, Submitter, Gene, Related Individuals, Transcripts, Haplotypes, and a "1 - *" multiplicity.]

Phenotype -> Disease -> Picture

Patient Specific Interpretation -> Severity -> Class -> Experimental Data

Variation Specific Interpretation -> UMD Predictor, SIFT, POLYPHEN, Structure

PLUS:

Page 85:

FINDIS

[Figure: class diagram with 1 and * multiplicities. Classes shown: REFSEQ XLINK, Publication, SUBMITTER, Genomic_allele, Experiment_result, Phenotype_Value, Individual, Phenotype_feature, Phenotype_method, Genotype_phenotype_correlation_experiment, Run, Panel, Latent_genotype, Assayed_genomic_genotype, Variation_assay, Molecular_sample, Genomic_allele_population_frequency. Other labels: Mutation, Gene, Disease, Reference Sequence, Text annotations, Numeric annotations, a "PLUS:" marker, and a "1 - *" multiplicity.]

Page 86:

Variant (general)

Variant (instance)

M

Method (instance)

Patient

Patient Centric

For the ultimate future, where the genome is sequenced once, and all variants detected

Page 87:

Variant (general)

Variant (instance)

M

Method (instance)

Patient

Variant Centric

Old approach, suitable for LSDBs. Can relate to instance or general variants or both

Page 88:

Variant (general)

Variant (instance)

M

Method (instance)

Patient

Variant + Patient Centric

Involves redundant relationships, necessarily

Page 89:

Variant (general)

Phenotype

Variant (instance)

M

Method (instance)

Patient

Phenotype Relationships

3 objectives, describing: a) phenotype of patient, b) variant pathogenicity in patient c) variant pathogenicity in general

Pathogenicity (instance)

Pathogenicity (general)

Has Phenotype

Page 90:

USERS

ONE data format

Not a ‘database’

SUBMITTERS

Cafe Variome

Page 91:

All Patient & Local System Data

Biosensors EHR

Modalities

Systems data

Text & Web pages

Computer Models

Decision Support Systems

BioScience & Omics

Databases

Feedback / Optimisation

Systematised

Biomedical

Knowledge

Health(care) Avatar & Personalised Care

Self- Optimising

Feasible architectural Concept New Intelligence & Utility

Research & Technology advances

DISORGANISED DIGITAL INFORMATION RELEVANT TO PERSONALIZED HEALTHCARE

The I-Health Opportunity

Page 92:

Local &/or Centralised &/or Federated technologies for data display and data mining

New database for sample collections, variables + results

Existing database for sample collections, variables + results

Web services Web services

Existing database for sample collections, variables + results

Web services

Tool for discovery of sample collections + original variables + counts/means

Tool for discovery of sample collections + harmonised variables + counts/means DataShaper development and use

Solutions for open sharing: summary level data, metadata, & obfuscation strategies

Solutions for controlled sharing: individual level data, primary and/or harmonised data

Means for controlled and/or open data use without sharing: via DataShield

Eliminate ambiguity, maximise security, and enable recognition/reward:
- Digital IDs for scientific publications (DOIs)
- Digital IDs for Data Releases (DataCite)
- Digital IDs for Researchers (ORCID/OpenID)
- Digital IDs for BioResources (BRIF)

Tool for discovery of sample collections + original + harmonised variables + counts/means
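The notion of "data use without sharing" on this slide can be illustrated with a toy example. This is not DataShield's own interface; it is a minimal Python sketch of the general principle, assuming each data holder returns only a count and a sum, and suppresses its answer when the count is too small (one simple obfuscation strategy for summary-level data). The threshold, function names and values are all hypothetical.

```python
from typing import List, Optional, Tuple

MIN_COUNT = 5  # hypothetical disclosure threshold

Summary = Optional[Tuple[int, float]]


def site_summary(values: List[float]) -> Summary:
    """Run at the data holder: return (count, sum) only, or None if too few records."""
    if len(values) < MIN_COUNT:
        return None  # suppress small cells instead of releasing them
    return len(values), float(sum(values))


def pooled_mean(summaries: List[Summary]) -> float:
    """Run by the analyst: combine site summaries without seeing any individual value."""
    usable = [s for s in summaries if s is not None]
    n = sum(count for count, _ in usable)
    if n == 0:
        raise ValueError("no site released a summary")
    return sum(total for _, total in usable) / n


# Hypothetical individual-level data that never leaves each site.
site_1 = [120.0, 130.0, 140.0, 150.0, 160.0]
site_2 = [110.0, 120.0, 130.0, 140.0, 150.0, 160.0]
site_3 = [200.0, 210.0]  # too few records: summary is suppressed

summaries = [site_summary(v) for v in (site_1, site_2, site_3)]
mean = pooled_mean(summaries)
print(summaries, mean)
```

The analyst obtains the same pooled mean that a central analysis of the two larger sites would give, while only aggregate numbers cross site boundaries.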

Page 93:

Harmonisation software

BioShare access Public access

DATA METADATA CATALOGS

Databases

Biobank #1

Biobank #2

Biobank #3

ELSI software
- no access
- open access
- controlled access
- open discovery
- remote analysis

Page 94:

Need: Digital ‘Big-picture’ across diseases/services/self-care/pathways

Future: Realistically complex and dynamic model/avatar of “Mr Smith”

Diabetology: Glucose control

Ophthalmology: Diabetic eye care

Nephrology: Chronic kidney disease

Key research knowledge Patient Biometrics

Page 95:

Omics data Systems studies

Computer models Biobanks/Registries

Clinical trials Disease research

Drug research Epidemiology

Animal models

RESEARCH DATA

EHR content Medical publications

Medical websites / blogs Protocols / guidelines Diagnostic test results

Biosensors outputs Lifestyle data

Environment data Drug /treatment info

HEALTHCARE DATA

RESEARCH USE HEALTHCARE USE

DIGITAL INFORMATION RELEVANT TO PERSONALIZED HEALTHCARE

ICT ‘gap’

Page 96:

I-Health Challenge: Three clouds … bring together people, methods, and research + patient data across molecular, clinical and population scales

People with relevant expertise and authorisation

State-of-the-art algorithms

Quality assured integrated data

Intelligence

Page 97:

Data-2-Knowledge-2-Practice Centre: two floors of biobank & I-Health IT, atop a CVD & respiratory disease clinic PLUS advanced biobank

Page 98:

Large scale inference

Unified Graphical Model

Electronic Health Records (eHR)

Data

Expertise Expertise Expertise Multi-scale &

Multi-system

Health:

• Research

• Policy

• Care

Model refinement Data Data

Health Records & Knowledge Silos

Health Avatars & Dynamic Models Open Unifying Modelling:

Across mechanisms and contexts

e.g. Lung cancer e.g. Chronic obstructive pulmonary disease

e.g. Coronary heart disease

Page 99:

Central DBs Federated DBs

