
Working With Large-Scale Clinical Datasets

Page 1: Working With Large-Scale Clinical Datasets

Working With Large-Scale Clinical Datasets

Craig Smail, MA, MSc (@craigsmail)
KU Medical Center
9th October 2014

Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg

Page 2: Working With Large-Scale Clinical Datasets

Disclosures

• Industry grant funding:

– Merck

– Mallinckrodt

– Sanofi

Page 3: Working With Large-Scale Clinical Datasets

Overview

• Target audience: anyone involved (directly or indirectly) in clinical data extraction, validation, and standardization

• Sections:

1. Data extraction: planning

2. Data extraction

3. Data standardization

4. Data transfer

Page 4: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Dataset type

– Most common: limited and de-identified

– Difference: a limited dataset can contain some personal information (DOB, DOD, city, state, age)

• Legal agreements

– Data Use Agreement (DUA)

– Business Associates Agreement (BAA)

– Institutional Review Board (IRB)

• Usually only if the IRB considers the activity Human Subjects Research

Page 5: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Important to finalize list of data elements before pull

– Time-consuming to repull

– Reallocation of resources (e.g. programmer time)

• Summary statistics are helpful in the planning stage (a quick completeness check is sketched below)

– e.g., death status is requested a lot but is very rarely available in the EHR
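A minimal sketch of such a planning-stage check, assuming a hypothetical one-row-per-patient extract elements.csv with one column per candidate data element (file and column names are illustrative, not from the source):

# load candidate data elements, one row per patient (hypothetical extract)
elements = read.csv("elements.csv", header = TRUE)

# percent of patients with a non-missing value for each candidate element
completeness = round(100 * colMeans(!is.na(elements)), 1)

# sparsely populated elements (e.g., death status) sort to the top
sort(completeness)

Elements with very low completeness are candidates to drop from the pull, or to replace with a proxy as discussed on the next slides.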

Page 6: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Use a data proxy correlated with the data element of interest

– Sometimes you need to develop proxies for data points of interest (e.g. severity of pain; hypoglycemic events)

– Example use case: aspirin as a proxy for antiphospholipid antibodies lab1

• Proxy data elements should themselves be supported by the data

1 Frankovich J, Longhurst C, Sutherland S. Evidence-Based Medicine in the EMR Era. N Engl J Med 2011; 365:1758-1759.

Page 7: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

• Data extracted from large multi-specialty clinic on the east coast

• 300,000 patients in EHR

• ~10,000 with a date of death (we’ll take this as the gold standard)

• Is days since last encounter a good proxy?

Page 8: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

# import data; base R's glm() is used below, so no extra packages are needed
setwd("path/to/data") # placeholder: set to the directory holding the extract
Encs = read.csv("lastenc.csv", header = FALSE) # col 1: last encounter date, col 2: death status

# find days since last encounter
Encs[, 3] = as.numeric(as.Date("2014-09-02") - as.Date(Encs[, 1], "%m/%d/%Y"))

# binarize: no encounter in last 1000 days = 1, otherwise 0 (also tried 180, 265, 750)
Encs[, 4] = ifelse(Encs[, 3] > 1000, 1, 0)

# clean up table: keep death status and the binarized encounter gap
Encs = Encs[, c(2, 4)]

# fit model (logistic regression, but could use something else)
fit = glm(Encs[, 1] ~ Encs[, 2], data = Encs, family = "binomial")

# confusion matrix and misclassification rate
confusionMatrix = table(round(fit$fitted.values), Encs[, 1])
misclassRate = (confusionMatrix[1, 2] + confusionMatrix[2, 1]) / sum(confusionMatrix) # 0.34

Page 9: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

• Is days since last encounter a good proxy? No (error rate = 34%)

• Consequences:

Page 10: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Cohort definition

– Spell out cohort definitions explicitly, including all assumptions

– Real-world example (a sketch of this rule follows below):

• ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90 days apart’

• Further restriction specified: ‘if any value > 60 occurs in between the 90 days, then throw out’

• The word ‘consecutive’ means no values occurring within the 90 days will be considered at all

– If any other eGFR value occurs within the 90 days, the patient does not meet the first restriction
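A minimal sketch of this rule in R, assuming a data frame egfr with columns patient_id, date (a Date), and value (all names hypothetical). Because ‘consecutive’ excludes any in-between values, only adjacent measurements in time need to be checked:

# does one patient's eGFR history contain two consecutive qualifying values?
meets_rule = function(dates, values) {
  ord = order(dates)
  dates = dates[ord]
  values = values[ord]
  if (length(values) < 2) return(FALSE)
  in_range = values >= 15 & values < 60  # both members of the pair must qualify
  gap_ok = as.numeric(diff(dates)) >= 90 # at least 90 days apart
  # checking only adjacent pairs enforces 'consecutive': any other eGFR in
  # between breaks the pair, which also covers the 'value > 60 in between' exclusion
  any(head(in_range, -1) & tail(in_range, -1) & gap_ok)
}

by_patient = split(egfr, egfr$patient_id)
cohort_ids = names(by_patient)[sapply(by_patient, function(d) meets_rule(d$date, d$value))]

Writing the rule down this way forces every assumption (ordering, adjacency, the exclusion) to be made explicit before the pull.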

Page 11: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Final thought on planning:

“Not everything that counts can be counted, and not everything that can be counted counts.”

– Albert Einstein (or William Bruce Cameron, depending on whom you believe)

• Some data elements are well populated but reflect things like coding bias (e.g. ‘up-coding’ to a code with a larger reimbursement)

Page 12: Working With Large-Scale Clinical Datasets

Data Extraction

• What are data extractions being used for in the NRN?

– Pharmaceutical companies: data on 143,057 patients from 8 health-care organizations/systems

– Federally funded research (NIH, AHRQ): data on ~100,000 patients

– Health IT vendors: work with Cerner to produce performance reports for use by participating providers

• Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e., switch to another vendor)

Page 13: Working With Large-Scale Clinical Datasets

Data Extraction

• Longitudinal data are important

– Look at temporal trends in the same patient

– During EHR transitions, some EHR vendors will import all data but restrict full access to only the last 18/24/26 months; clinicians don’t like this and want to be able to access all data

Page 14: Working With Large-Scale Clinical Datasets

Data Validation

• Date parameters: look at the min and max encounter dates in the dataset; with thousands of patients, you would expect the dates to match the requested range (a quick check is sketched below)

• Percentage of distinct patients in the extraction vs. the overall practice count: cohort percentages are quite stable across practices

– e.g., ‘all patients over age 18 with a diagnosis of type-2 diabetes defined by ICD-9 code xx.xxx’

– Caveat: doesn’t work well with small practices (< 2,000 distinct patients)
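A minimal sketch of these two checks, assuming an encounter-level extract enc with columns patient_id and enc_date, and a known overall practice count (all names and the count are illustrative):

# check 1: min and max encounter dates should match the requested pull range
range(as.Date(enc$enc_date, "%m/%d/%Y")) # date format assumed to match the feed

# check 2: distinct patients in the extraction as a share of the practice count;
# this percentage should be roughly stable across comparable practices
practiceCount = 25000 # illustrative overall practice count
pctDistinct = 100 * length(unique(enc$patient_id)) / practiceCount
pctDistinct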

Page 15: Working With Large-Scale Clinical Datasets

Data Standardization

• Open-source data models (e.g. the Observational Medical Outcomes Partnership (OMOP) common data model)

• Script data out of database (e.g. SQL view)

• Map labs/procedures to a standardized concept list (a mapping sketch follows below)

– Why? Different string labels referring to the creatinine blood test arrive from three data feeds, each with its own frequency of occurrence…
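A minimal sketch of such a mapping in R, with made-up source strings standing in for the three feeds; in an OMOP-style model the target would be a concept from a standard vocabulary (the LOINC code shown, 2160-0, is serum/plasma creatinine):

# hypothetical source strings from three feeds, all mapped to one standard concept
conceptMap = data.frame(
  source_value = c("CREATININE", "Creatinine Ser/Plas", "CREAT, SERUM"),
  concept_name = "Creatinine [Mass/volume] in Serum or Plasma",
  loinc_code = "2160-0",
  stringsAsFactors = FALSE
)

# attach the standardized concept to a (hypothetical) lab results table `labs`
idx = match(labs$lab_name, conceptMap$source_value)
labs$concept_name = conceptMap$concept_name[idx]
labs$loinc_code = conceptMap$loinc_code[idx]

Unmapped source values come back as NA, which makes it easy to spot new strings that still need review.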

Page 16: Working With Large-Scale Clinical Datasets

[Table: creatinine source-value strings by frequency of occurrence]

Note: source values with counts < 100 were censored

Page 17: Working With Large-Scale Clinical Datasets

Data Transfer

• HIPAA requirements

• Usually FTP to a secure site (e.g. Egnyte)

Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/

Page 18: Working With Large-Scale Clinical Datasets

Concluding Thoughts

• Extracted data is treated as the gold standard because it is pulled directly from the data source (i.e. the EHR), but the data often come from an intermediate product (such as a registry, like the one DARTNet provides), and we usually don’t have control over the data mapping from EHR to registry

• The EHR of the future (?):

– Genetic data (WGS or WES)

» WGS = ~100 GB

» WES = ~8 GB

– Integration with consumer wearable devices (e.g. FitBit; iPhone ECG)

– Further down the road: human microbiome; home microbiome

Page 19: Working With Large-Scale Clinical Datasets

Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg

Always question the data

Page 20: Working With Large-Scale Clinical Datasets

Questions?

• Slides available from SlideShare (URL: @craigsmail)

• Email: [email protected]

