The SAIL DataBank
A National e-Research
Platform for Wales
David V Ford
Professor of Health Informatics
Swansea University Medical School
1. A quick overview of me and what my group does.
2. A description of the SAIL Databank as it operates in Wales
3. The technologies and systems we now have available for others to use
4. Questions / discussion
Agenda
1. Director, SAIL DataBank in Wales, UK
2. Director, Administrative Data Research
Centre Wales, part of the ADRNetwork
3. Deputy Director, CIPHER – Part of the
UK Farr Institute of Health Informatics
Research
4. Current Director of the International
Population Data Linkage Network
My major affiliations
New Data Science Building at Swansea
• Funded by MRC (Farr),
ESRC (ADRN) and Welsh
Government
• High security home to Farr
Institute, ADRC Wales,
SAIL Databank and many
other data-intensive
projects
• Office space for NHS and
other public sector staff and
industry to work alongside
university staff
Research infrastructures at Swansea
• Secure Anonymised Information Linkage (SAIL) system
• The MRC-led multi-funder Farr Institute Centre for the Improvement of Population Health through E-records Research (Farr Institute CIPHER)
• The MRC Cloud Infrastructure for Microbial Bioinformatics (CLIMB) Centre
• The Analysis Platform for the MRC UK Dementias Platform (DPUK)
• The ESRC Administrative Data Research Centre Wales (ADRC-W)
• National Centre for Population Health and Wellbeing Research (NCPHWR)
• NHS Prudent Healthcare Intelligence Unit
What is the SAIL Databank?
• Designed to safely provide a means of linking together all person-
based data within Wales for use in research and public-benefit
enquiry
• Assembling population-scale life stories from data across time and
across organisations (datasets)
• Initial focus on health, now broadened into wellbeing and beyond (+
local and central government) – social justice, housing, education,
policing, employment, etc
• Purpose: to support evaluation of natural experiments (i.e. policy and
service changes); epidemiology; “e” and hybrid cohort studies;
intervention studies (clinical trials); system modelling and many more
The data challenge
Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA. 2014;311:2479-80.
• Built on the international best-practice “Separation Principle”
• Use of our automated, untouched-by-human-hands Trusted Third
Party (TTP) data linkage system, in a totally separate organisation
• Identifiers never given to SAIL
• Sensitive data never given to TTP
• Fully automated if required
• Data never leaves the Databank (instead, access to it is granted)
• All projects approved by an independent IG panel (30% public members)
• Only the minimised data that is needed is provided to projects
• Compulsory training for all data users. Strict legal agreements.
Built on well established principles
SAIL Split File Principle
Supplier Data
ID  Name         Address         BP      Diag
56  Fred Bloggs  The Big house   120/80  G33..
78  Jim Jones    87 Peterson Rd  135/45  P123.
45  Harry Lucas  19 Meirwen      125/75  G77..

Demographics + Link Key
ID  Name         Address
56  Fred Bloggs  The Big house
78  Jim Jones    87 Peterson Rd
45  Harry Lucas  19 Meirwen

Clinical(s) + Link Key
ID  BP      Diag
56  120/80  G33..
78  135/45  P123.
45  125/75  G77..

Load into SAIL
ALF_E  BP      Diag
4252   120/80  G33..
7482   135/45  P123.
8436   125/75  G77..

Linkage (File 1, File 2, File 3)
ID  ALF       Conf
56  65276573  88
78  32377722  97
45  27638236  95
Add this field
SAIL Split File Principle
Additional project-level encryption of ALF_E to give PALF_E
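The split-file flow on these slides can be sketched in Python. This is an illustrative sketch only, not the SAIL or TTP implementation: the function names, the exact-match lookup table standing in for the TTP's matching, the keys, and the use of truncated HMAC-SHA256 to derive ALF_E and PALF_E are all assumptions made for the example. The slides state only that identifiers and clinical data are separated, that an ALF with a confidence score is returned, and that ALF_E receives additional project-level encryption.

```python
import hashlib
import hmac

# Supplier data as shown on the slide: identifiers and clinical values together.
SUPPLIER_ROWS = [
    {"ID": 56, "Name": "Fred Bloggs", "Address": "The Big house", "BP": "120/80", "Diag": "G33.."},
    {"ID": 78, "Name": "Jim Jones", "Address": "87 Peterson Rd", "BP": "135/45", "Diag": "P123."},
    {"ID": 45, "Name": "Harry Lucas", "Address": "19 Meirwen", "BP": "125/75", "Diag": "G77.."},
]

# Hypothetical TTP lookup: (name, address) -> (ALF, match confidence).
# A real linkage system would match far less naively than an exact lookup.
TTP_INDEX = {
    ("Fred Bloggs", "The Big house"): (65276573, 88),
    ("Jim Jones", "87 Peterson Rd"): (32377722, 97),
    ("Harry Lucas", "19 Meirwen"): (27638236, 95),
}


def split_file(rows):
    """Supplier side: split each record into demographics and clinical parts."""
    demographics = [{"ID": r["ID"], "Name": r["Name"], "Address": r["Address"]} for r in rows]
    clinical = [{"ID": r["ID"], "BP": r["BP"], "Diag": r["Diag"]} for r in rows]
    return demographics, clinical


def ttp_link(demographics, index=TTP_INDEX):
    """TTP side: sees identifiers only and returns ID, ALF and confidence."""
    linked = []
    for d in demographics:
        alf, conf = index[(d["Name"], d["Address"])]
        linked.append({"ID": d["ID"], "ALF": alf, "Conf": conf})
    return linked


def keyed_hash(key, value):
    """Assumed pseudonymisation step: keyed hash, truncated for readability."""
    digest = hmac.new(key, str(value).encode("utf-8"), digestmod=hashlib.sha256)
    return digest.hexdigest()[:16]


def load_into_sail(clinical, linkage, databank_key):
    """Databank side: join clinical rows to ALFs via the link key, store ALF_E only."""
    alf_by_id = {row["ID"]: row["ALF"] for row in linkage}
    return [
        {"ALF_E": keyed_hash(databank_key, alf_by_id[c["ID"]]), "BP": c["BP"], "Diag": c["Diag"]}
        for c in clinical
    ]


def project_view(sail_rows, project_key):
    """Project side: re-encrypt ALF_E with a project-specific key to give PALF_E."""
    return [
        {"PALF_E": keyed_hash(project_key, r["ALF_E"]), "BP": r["BP"], "Diag": r["Diag"]}
        for r in sail_rows
    ]
```

In this sketch the same person receives the same PALF_E within one project but a different one in another project, so extracts from two projects cannot be joined on the key.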
It’s all about data linkage
• SAIL= Secure Anonymised Information Linkage
• >12 billion records of the people of Wales, >5 million people
• 500+ feeder systems from Wales, inc >350 GP practices (>80%)
• Much data goes back 20-25 years
• All pre-linked then de-identified
• £5m+ investment in high performance IT
• Strong privacy protection & IG
• Currently supporting externally funded projects with value >£90m
• Over 300 registered users and 140+ active SAIL projects.
• >100 staff in Swansea working on Health Informatics-related projects
• Average 35 day turnaround from application to data
• Applications open to all
• Built to consume (new data) swiftly
• Secure sharing for projects, based on project specific data views.
• Total population coverage
• Used for:
observational research (case control; e-cohort studies; etc)
trial feasibility
outcome data for trials
extending traditional cohorts,
post marketing surveillance
new technology evaluation
evaluation of natural experiments (i.e. service and policy change)
• Lots of trial and cohort study participants embedded in SAIL
Core holdings:
• Annual District Birth Extract (ADBE)
• Annual District Death Extract (ADDE)
• Bowel Screening Wales (BSW)
• Breast Test Wales (BTW)
• Cervical Screening Wales (CSW)
• Congenital Anomaly Register and Information Service (CARIS)
• Emergency department Data Set (EDDS)
• National Community Child Health Database (NCCHD)
• Outpatient Dataset (OPD)
• Patient Episode Database for Wales (PEDW)
• Primary Care GP dataset
• Welsh Cancer Intelligence and Surveillance Unit (WCISU)
• Welsh Demographic Service (WDS)
• Many more!
Data resources
Project-specific holdings:
• Clinical trials participants
• Conventional cohort participants
• Cross sectional survey participants
• Many, many others!!!
Reference data:
• Data quality reports
• Extract histories
• Coding and mapping information
• Metadata
• Organisation codes
• Lots more!
Sharing what we have learned . . . .
Now available under “research collaborations”
1. National Research Data Appliances (NRDAs): New concentrator
technology for NHS and other data owning organisations, including
automated matching, anonymisation, data management, metadata
capture, data quality assessment, etc.
2. UK Secure E-Research Platform (UKSeRP) – based on the SAIL
Gateway – massively extended to provide a secure platform for data
sharing across the UK – not just SAIL data (for Farr and ADRN)
3. New focus on the capture and analysis of electronic free text data
(on-board NLP in the Appliances)
4. Initially funded by MRC Farr Institute and ESRC ADRN grants
National Research Data
Appliances
“Everything we know, on a box”
National Research Data Appliance (NRDA)
• Brings many of SAIL's capabilities onto combined hardware and
software
• Shrink wrapped, ready to go.
• Easy to use, low expertise barrier
• Multiple Appliances work together as a larger whole
• Purposes: concentrate data, make it research ready
• Provide utility to data owners and partners
• Initial development funded by MRC
• Potentially provided free to our collaborators
Key features of relevance
• Appliances federate with each other to create a sharing network
• Network can be hierarchical or peer-to-peer
• IG Controls available at every point
• Metadata builds automatically and publishes to a global catalogue
• Data quality measurement automated
• NLP addresses rich free-text datasets, converting them to SNOMED CT
• High-quality identity reconciliation is automatic; de-identification is optional
• UKSeRP provides a scalable, performant analytics platform, with full IG controls
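As one illustration of what automated data quality measurement could involve, the sketch below computes per-field completeness for an uploaded dataset. It is a minimal, assumed example: the `completeness` function and the choice of completeness as the metric are hypothetical and are not taken from the Appliance itself.

```python
def completeness(rows, fields):
    """Return the fraction of rows with a non-empty value for each field.

    `rows` is a list of dicts; None and the empty string count as missing.
    """
    total = len(rows)
    report = {}
    for field in fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


# Example with deliberately incomplete records.
sample_rows = [
    {"BP": "120/80", "Diag": "G33.."},
    {"BP": "", "Diag": "P123."},
    {"BP": None, "Diag": "G77.."},
    {"BP": "125/75", "Diag": ""},
]
quality = completeness(sample_rows, ["BP", "Diag"])
```

A real quality report would likely add validity and consistency checks alongside completeness.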
RDA Use-case
Large scale data sharing platform (SeRP)
[Diagram: four Appliances, labelled Organisation A, Organisation B, Organisation C and TTP, each showing the same functions: Upload, Link, Anonymise, Measure, Catalogue, Manage, Share, Analyse. Three "Share" connectors are also shown.]
Easy, non-technical use
Data
“sc
hem
a” a
uto
mati
cally c
om
pute
d
base
d o
n d
ata
conta
ined in u
plo
aded f
ile
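A minimal sketch of how a schema might be computed from the contents of an uploaded file is given below. It assumes a CSV upload with a header row and a small, hypothetical set of type names (integer, float, date, text); the Appliance's actual inference rules are not described in these slides.

```python
import csv
import io
from datetime import datetime


def _parse_date(value):
    # Assumes ISO-style dates (YYYY-MM-DD); other formats fall through to text.
    return datetime.strptime(value, "%Y-%m-%d")


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "empty"
    for name, parser in (("integer", int), ("float", float), ("date", _parse_date)):
        try:
            for value in non_empty:
                parser(value)
            return name
        except ValueError:
            continue
    return "text"


def infer_schema(csv_text):
    """Map each column of a CSV document to an inferred type name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        column: infer_type([(row[column] or "") for row in rows])
        for column in reader.fieldnames
    }


uploaded = (
    "id,bp_systolic,weight,visit_date,diag\n"
    "56,120,70.5,2014-01-02,G33..\n"
    "78,135,,2014-02-03,P123.\n"
)
schema = infer_schema(uploaded)
```

Empty cells are ignored during inference, so a column with gaps still receives the type of its populated values.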
Publish Dataset – depends on configuration/capabilities. Data will now be available
Data Catalogue – Key Component
Additional points following previous sessions: all DAs carry a DC; a DS can inherit from other DS DC entries; the DC is related to a Programme/Security domain. DCs replicate to the Regional/Global DC. Road map: DC used to define and create DS
A Dataset
Specific version & Date
All section attach files
Contact
Request
VIMO
Theme / Type / Level
Tags
A Dataset (cont.)
DDI, SPSS, SAS, STATA
Data Catalogue – a specific table
Secure e-Research
Platform (SeRP)
“Combine and share your data
and stay in complete control”
Secure e-Research Platform
• Modelled on the SAIL Gateway, now available as tenancies for other organisations
• Allows users to view, manipulate and analyse data using powerful and familiar tools
• No data need leave the Gateway – output vetting process on the box
• Data owners (NHS or academic) remain in total charge. They operate sharing
according to their own IG
• Full suite of IG facilities available to tenants
• All servers based at Swansea in ISO27001 and HSCIC-approved systems
• Multiple projects using different data configurations with multiple users all possible
• Full audit trails on every user, every action available.
• A SeRP tenancy can connect automatically to any number of NRDAs if required, or
can be used alone (UKSeRP is powered by its own RDA)
UK Se-RP
UKSeRP Example
Dementias Platform UK
DPUK Cohort Matrix
DPUK - expand
WP2: Shared Space Concept
[Diagram: cohorts C1 to C9, a temporary shared space for analyses, analysts, and C6 shown a second time. Data types: Data, Imaging, Omics, Sensors.]
DPUK Operational model
FARR (Wales) enabling DPUK – Imaging WP
Cohorts
……
Oxford UCL Cambs Edinburgh Imperial Newcastle
Central Hub
Data catalogue
Imaging is an important modality. All of UKSeRP is now image enabled
Image storage, HPC Cluster, Transmart, EMIF
DPUK - UK SeRP
[Architecture diagram: UKDP Specific Infrastructure and Backup Shared Infrastructure. Components shown: Incoming Dataset via Data Appliance; XNAT (x2), XNAT DICOM, XNATfs; TransMart (x2); Symantec; Harmonisation; Data Model Transformation; Load balancer (x2); PostgreSQL (x2), PG Cluster; Storage Server with 360TB storage units; Storage Backup; Job Scheduler with eight Compute Nodes; OpenStack, 10 to 15 servers, Intel 40 core, 96GB+ each.]
Questions?