Big data technologies to support the integration of healthcare and
research data
Riccardo Bellazzi
University of Pavia, Italy
Knowledge and data integration
Clinical Bioinformatics (i2b2)
Knowledge discovery
Knowledge to practice
Bedside to bench
Test new Knowledge
BIOINFORMATICS METHODOLOGY AND TECHNOLOGY TO INTEGRATE
CLINICAL AND BIOLOGICAL KNOWLEDGE SUPPORTING
ONCOLOGY TRANSLATIONAL RESEARCH
ONCO-i2b2 project
Architecture overview
HIS
Clinical patient management
Data
Laboratory
Research
Samples
Biobank
CRC
Anonymized data
Anonymized samples
i2b2
Researcher
Patient
Match IDs
FSM - i2b2 instances
Instance        Active since  Patients  Visits   Observations  Genetic data  NLP
                2011          5,611     7,726    23,175        n             n
Onco-i2b2       2010          28,838    142,464  2,341,771     y             y
Cardiology      2009          6,334     15,094   205,418       y             n
Administration
Biobank                       923       -        8,188         -             y
Sub-project: biobank
Predicting the development of
Diabetes complications and
assessing the evolution of the
disease
DATA: Clinical + Administrative information
Data pre-processing and organization
Knowledge to practice
FROM HOSPITALS
• Follow-up visits
• Medications
• Labs
FROM LOCAL HEALTHCARE AGENCIES
• Drug purchases
• Hospitalizations
• Environmental data
Temporal and Workflow Data Mining
What happens to the patient inside the hospital
What happens to the patient outside the hospital
Administrative Data
Clinical Data Analysis
Models
1000 patients
Rich temporal characterization
Big data – big opportunity for innovation [1]
[1] http://www.innovationexcellence.com/
Two directions: analytics; data storage and retrieval
Analytics on Map-Reduce Parallel Programming Paradigm
Relies on a distributed file system
Some libraries available for data mining
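The map-reduce paradigm named above can be illustrated with a minimal, single-process sketch. It only mimics the map, shuffle and reduce phases that a framework would distribute over a cluster and a distributed file system. The toy records, the SNP identifiers and all function names are assumptions made for illustration; this is not the pipeline used in the talk.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records: (sample_id, class_label, {snp_id: genotype}).
RECORDS = [
    ("s1", "case", {"rs1": "AA", "rs2": "AG"}),
    ("s2", "control", {"rs1": "AG", "rs2": "AG"}),
    ("s3", "case", {"rs1": "AA", "rs2": "GG"}),
]

def map_record(record):
    """Map step: emit ((snp, genotype, label), 1) for every SNP of one sample."""
    _sample_id, label, genotypes = record
    for snp, genotype in genotypes.items():
        yield (snp, genotype, label), 1

def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce sequentially in a single process."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

counts = run_job(RECORDS)
for key in sorted(counts):
    print(key, counts[key])
```

Because each call to `map_record` depends on a single sample and each call to `reduce_counts` on a single key, both steps can run in parallel on different nodes, which is what makes the paradigm attractive for genome-wide data.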
02/06/2014
“Whole genome” predictors
Map-reduce at work
Training
Testing
Results – predicting longevity
All SNPs
acc 0.6178 [0.6001-0.6356]
sens 0.1317 [0.0891-0.1738]
spec 0.9783 [0.9665-0.9902]
mcc 0.2102 [0.1497-0.2707]
ppv 0.8101 [0.7196-0.9005]
npv 0.6035 [0.5916-0.6153]
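For reference, the six measures listed above are all functions of a binary confusion matrix. The sketch below uses their standard definitions; the counts passed in the example are hypothetical and are not the counts behind the reported results.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / total,   # accuracy
        "sens": tp / (tp + fn),     # sensitivity (recall)
        "spec": tn / (tn + fp),     # specificity
        "ppv": tp / (tp + fp),      # positive predictive value
        "npv": tn / (tn + fn),      # negative predictive value
        # Matthews correlation coefficient; defined as 0 when a margin is empty.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

# Hypothetical counts, chosen only to exercise the formulas.
m = binary_metrics(tp=20, fp=5, tn=90, fn=35)
for name, value in m.items():
    print(f"{name} {value:.4f}")
```

A low sensitivity combined with high specificity and high ppv, as in the figures above, describes a classifier that rarely predicts the positive class but is usually right when it does.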
Serial implementation: 284 minutes
Map reduce (single slave): 34 minutes
Top predictors
Handling queries on variants from NGS
One patient - One exome - More than 20000 variants
The majority of them with role still unknown
Store them in files
Forget them
Query them ..
SQL vs. NoSQL with genetic variants
Query
DB population
O’Connor et al. BMC Bioinformatics 2010 11(Suppl 12):S2 doi:10.1186/1471-2105-11-S12-S2
NoSQL (CouchDB) pros
Advantages
• Mutations stored in JSON format (easily readable & exportable)
• Cloud-based system (highly scalable)
• Fast pre-computed queries (no limit on the number of queries)
Technologies
• Amazon Web Services
• CouchDB & BigCouch
• i2b2 (query platform & phenotype integration)
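To make the idea of JSON-stored mutations and pre-computed queries concrete, here is a small Python emulation of a document store with one indexed view. In CouchDB itself, views are defined by map functions stored in design documents; this sketch only mirrors that logic in memory. The document fields, identifiers and gene names are hypothetical and do not reflect the schema used in the project.

```python
import json
from collections import defaultdict

# Hypothetical variant documents; all field names and values are illustrative.
VARIANT_DOCS = [
    {"_id": "p01:var1", "patient": "p01", "gene": "GENE_A", "effect": "missense"},
    {"_id": "p02:var1", "patient": "p02", "gene": "GENE_A", "effect": "nonsense"},
    {"_id": "p02:var2", "patient": "p02", "gene": "GENE_B", "effect": "synonymous"},
]

def map_by_gene(doc):
    """Analogue of a view map function: emit (key, value) pairs for one document."""
    yield doc["gene"], doc["patient"]

def build_view(docs, map_fn):
    """Pre-compute the index once, so later lookups do not rescan the documents."""
    view = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            view[key].append(value)
    return dict(view)

def query(view, key):
    """Look up a key in the pre-computed view."""
    return view.get(key, [])

BY_GENE = build_view(VARIANT_DOCS, map_by_gene)

print(json.dumps(VARIANT_DOCS[0], indent=2))  # documents export directly as JSON
print(query(BY_GENE, "GENE_A"))
```

The cost of building the index is paid once at load time, which is the trade-off behind the "fast pre-computed queries" advantage: each new kind of question needs its own view, but repeated lookups are cheap.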
i2b2 – NGS/NoSQL Cell
i2b2 NGS plugin
A test
Deployed on Amazon AWS
55 exome variant sets (1.2 M variants)
Task 1: Extract individuals with a specific phenotype and with missense or nonsense mutations in a gene of interest
Task 2: Refine Task 1 results considering only mutations that PolyPhen-2 predicts to be damaging
Task 3: Apply Task 2 logic to more complex phenotypes
Task 4: Check individuals with a specific autosomal dominant variant
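The first two queries listed above can be sketched as plain filters over variant records. This is an in-memory illustration, not the i2b2 plugin or its CouchDB queries: the records, the field names and the cohort set (standing in for the result of a phenotype query) are all hypothetical. The PolyPhen-2 labels used are its standard prediction categories.

```python
# Hypothetical variant records; field names and values are illustrative.
VARIANTS = [
    {"patient": "p01", "gene": "GENE_A", "effect": "missense", "polyphen2": "probably damaging"},
    {"patient": "p02", "gene": "GENE_A", "effect": "synonymous", "polyphen2": None},
    {"patient": "p03", "gene": "GENE_A", "effect": "nonsense", "polyphen2": None},
    {"patient": "p04", "gene": "GENE_A", "effect": "missense", "polyphen2": "benign"},
    {"patient": "p05", "gene": "GENE_B", "effect": "missense", "polyphen2": "probably damaging"},
]

# Stands in for the set of patients returned by a phenotype query.
PHENOTYPE_COHORT = {"p01", "p02", "p03", "p04"}

CODING_EFFECTS = {"missense", "nonsense"}
DAMAGING_LABELS = {"probably damaging", "possibly damaging"}

def task1(variants, cohort, gene):
    """Patients in the cohort with a missense or nonsense variant in the gene."""
    return {
        v["patient"]
        for v in variants
        if v["patient"] in cohort and v["gene"] == gene and v["effect"] in CODING_EFFECTS
    }

def task2(variants, cohort, gene):
    """Task 1 restricted to variants with a damaging PolyPhen-2 prediction.

    Variants without a prediction (None here) are dropped by this filter.
    """
    return {
        v["patient"]
        for v in variants
        if v["patient"] in cohort
        and v["gene"] == gene
        and v["effect"] in CODING_EFFECTS
        and v["polyphen2"] in DAMAGING_LABELS
    }

print(sorted(task1(VARIANTS, PHENOTYPE_COHORT, "GENE_A")))
print(sorted(task2(VARIANTS, PHENOTYPE_COHORT, "GENE_A")))
```

Note the design choice in `task2`: a nonsense variant with no PolyPhen-2 prediction is excluded. Whether such variants should instead be kept is a decision the sketch leaves open.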
The ST-model (Ramoni, Stefanelli et al.)
• An epistemological model of scientific and medical reasoning
• Hypothesis selection/generation phase: abstraction and abduction
• Hypothesis testing phase: ranking, deduction, eliminative induction
Thanks from the BMI Labs “Mario Stefanelli”