
Post on 03-Jun-2020


Big data technologies to support the integration of healthcare and research data

Riccardo Bellazzi

University of Pavia, Italy

Knowledge and data integration

Clinical Bioinformatics (i2b2)

Knowledge discovery

Knowledge to practice

Bedside to bench

Test new Knowledge

BIOINFORMATICS METHODOLOGY AND TECHNOLOGY TO INTEGRATE CLINICAL AND BIOLOGICAL KNOWLEDGE SUPPORTING ONCOLOGY TRANSLATIONAL RESEARCH

ONCO-i2b2 project

Architecture overview

HIS

Clinical patient management

Data

Laboratory

Research

Samples

Biobank

CRC

Anonymized data

Anonymized samples

i2b2

Researcher

Patient

Match IDs

FSM - i2b2 instances

Active since   Patients   Visits    Observations   Genetic data   NLP
2011           5.611      7.726     23.175         n              n

Onco-i2b2

Active since   Patients   Visits    Observations   Genetic data   NLP
2010           28.838     142.464   2.341.771      y              y

Cardiology

Active since   Patients   Visits    Observations   Genetic data   NLP
2009           6.334      15.094    205.418        y              n

Administration

Active since   Patients   Visits   Observations   Genetic data   NLP

Biobank 923 - 8188 - y

Sub-project: biobank

Predicting the development of diabetes complications and assessing the evolution of the disease

DATA: Clinical + Administrative information


Data pre-processing and organization

Knowledge to practice

FROM HOSPITALS

• Follow-up visits
• Medications
• Labs

FROM LOCAL HEALTHCARE AGENCIES

• Drug purchases
• Hospitalizations
• Environmental data

Temporal and Workflow Data Mining

What happens to the patient inside the hospital

What happens to the patient outside the hospital

Administrative Data

Clinical Data Analysis

Models

1000 patients

Rich temporal characterization

Big data – big opportunity for innovation (1)

(1) http://www.innovationexcellence.com/

2 directions: Analytics, Data storage and retrieval

Analytics on Map-Reduce Parallel Programming Paradigm

Relies on a distributed file system

Some libraries available for data mining
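To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is an illustration only: the slides do not say which framework or computation was used, so the task (alternate-allele frequency per SNP), the record layout and all function names are assumptions. A real deployment would run the same steps over a distributed file system rather than in memory.

```python
from collections import defaultdict


def map_genotype(record):
    """Map step: emit (snp_id, (alt_allele_count, 1)) for one genotype call.

    The record layout (sample_id, snp_id, alt_alleles) is hypothetical.
    """
    _sample_id, snp_id, alt_alleles = record
    yield snp_id, (alt_alleles, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_frequency(snp_id, values):
    """Reduce step: alternate-allele frequency for one SNP (diploid samples)."""
    total_alt = sum(alt for alt, _ in values)
    n_samples = sum(n for _, n in values)
    return snp_id, total_alt / (2 * n_samples)


def run_job(records):
    """Chain map, shuffle and reduce in a single process."""
    pairs = (pair for rec in records for pair in map_genotype(rec))
    return dict(reduce_frequency(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical toy input: two samples genotyped at two SNPs.
records = [
    ("s1", "rs1", 0),
    ("s2", "rs1", 1),
    ("s1", "rs2", 2),
    ("s2", "rs2", 2),
]
freqs = run_job(records)
```

Because each map call sees one record and each reduce call sees one key, both steps can be spread across worker nodes without changing the functions.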

02/06/2014

“Whole genome” predictors

Map-reduce at work


Training

Testing

Results – predicting longevity

All SNPs

acc 0.6178 [0.6001-0.6356]

sens 0.1317 [0.0891-0.1738]

spec 0.9783 [0.9665-0.9902]

mcc 0.2102 [0.1497-0.2707]

ppv 0.8101 [0.7196-0.9005]

npv 0.6035 [0.5916-0.6153]
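For readers unfamiliar with the abbreviations, the sketch below computes the same six measures (accuracy, sensitivity, specificity, Matthews correlation coefficient, positive and negative predictive value) from a binary confusion matrix using their standard definitions. The counts in the example are invented for illustration and are not the study's data; the function name is also an assumption.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),
        "sens": tp / (tp + fn),   # true positive rate
        "spec": tn / (tn + fp),   # true negative rate
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "ppv": tp / (tp + fp),    # precision
        "npv": tn / (tn + fn),
    }


# Hypothetical counts (NOT from the slides): a classifier that rarely
# predicts the positive class, so sensitivity is low and specificity high.
metrics = classification_metrics(tp=10, fp=5, tn=45, fn=40)
```

A profile like the one reported (low sensitivity, high specificity) arises when the classifier predicts the positive class rarely but is usually right when it does.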

Serial implementation: 284 minutes

Map reduce (single slave): 34 minutes

Top predictors

Handling queries on variants from NGS

One patient - One exome - More than 20000 variants

The majority of them with role still unknown

Store them in files

Forget them

Query them ..

SQL vs. NoSQL with genetic variants

Query

DB population

O’Connor et al. BMC Bioinformatics 2010 11(Suppl 12):S2 doi:10.1186/1471-2105-11-S12-S2

NoSQL (CouchDB) pros

Advantages

• Mutations stored in JSON format (easily readable & exportable)
• Cloud-based system (highly scalable)
• Fast pre-computed queries (no limit on the number of queries)

Technologies

• Amazon Web Services
• CouchDB & BigCouch
• i2b2 (query platform & phenotype integration)
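The sketch below illustrates the two ideas behind these points: a variant held as a JSON document, and a query answered from an index built ahead of time. It is an in-memory analogy written in plain Python, not the CouchDB API (CouchDB views are normally defined as JavaScript map functions inside the database). All field names, identifiers and gene names are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical variant documents; the schema is an assumption for illustration.
docs = [
    {"_id": "p1:GENE_A:v1", "patient_id": "p1", "gene": "GENE_A", "effect": "missense"},
    {"_id": "p2:GENE_A:v2", "patient_id": "p2", "gene": "GENE_A", "effect": "synonymous"},
    {"_id": "p1:GENE_B:v3", "patient_id": "p1", "gene": "GENE_B", "effect": "nonsense"},
]


def map_by_gene(doc):
    """View-style map function: emit (key, value) pairs for one document."""
    yield doc["gene"], doc["_id"]


def build_view(documents, map_fn):
    """Materialize the index once, so later lookups need no document scan."""
    index = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            index[key].append(value)
    return dict(index)


view = build_view(docs, map_by_gene)

# Each document serializes directly to JSON, which is what makes it easy
# to read and export.
serialized = json.dumps(docs[0], sort_keys=True)
```

After the index is built, `view["GENE_A"]` is a plain dictionary lookup, which is the sense in which a pre-computed query is fast.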

i2b2 – NGS/NoSQL Cell

i2b2 NGS plugin

A test

• Deployed on Amazon AWS
• 55 exome variant sets (1.2 M variants)

Task 1: Extract individuals with a specific phenotype and with missense or nonsense mutations in a gene of interest

Task 2: Refine Task 1 results, keeping only mutations that PolyPhen-2 predicts to be damaging

Task 3: Apply Task 2 logic to more complex phenotypes

Task 4: Check individuals with a specific autosomal dominant variant
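The first two query tasks can be sketched as a filter over variant records joined with a phenotype-defined cohort. This is a hedged illustration, not the plugin's implementation: in the system described, the cohort would come from an i2b2 query and the variants from the NoSQL store, whereas here both are small hypothetical in-memory structures. The field names and the choice of which PolyPhen-2 labels count as damaging are assumptions.

```python
# Assumption: both of these PolyPhen-2 prediction labels count as "damaging".
DAMAGING = {"probably_damaging", "possibly_damaging"}


def select_patients(cohort, variants, gene, damaging_only=False):
    """Return sorted IDs of cohort patients with a missense or nonsense
    variant in `gene`; with damaging_only=True, additionally require a
    damaging PolyPhen-2 prediction (the Task 2 refinement)."""
    hits = set()
    for v in variants:
        if v["patient_id"] not in cohort or v["gene"] != gene:
            continue
        if v["effect"] not in {"missense", "nonsense"}:
            continue
        if damaging_only and v.get("polyphen2") not in DAMAGING:
            continue
        hits.add(v["patient_id"])
    return sorted(hits)


# Hypothetical data: p1 and p2 have the phenotype of interest, p3 does not.
cohort = {"p1", "p2"}
variants = [
    {"patient_id": "p1", "gene": "GENE_A", "effect": "missense", "polyphen2": "probably_damaging"},
    {"patient_id": "p2", "gene": "GENE_A", "effect": "missense", "polyphen2": "benign"},
    {"patient_id": "p3", "gene": "GENE_A", "effect": "nonsense", "polyphen2": None},
    {"patient_id": "p1", "gene": "GENE_B", "effect": "synonymous", "polyphen2": "benign"},
]

task1 = select_patients(cohort, variants, "GENE_A")
task2 = select_patients(cohort, variants, "GENE_A", damaging_only=True)
```

One consequence of this design is that variants without a PolyPhen-2 prediction are excluded under `damaging_only=True`; whether the original system treated them the same way is not stated in the slides.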

The ST-model (Ramoni, Stefanelli et al.)

• An epistemological model of scientific and medical reasoning

• Hypothesis selection/generation phase: abstraction and abduction

• Hypothesis testing phase: ranking, deduction, eliminative induction

Thanks from the BMI Labs “Mario Stefanelli”