
Working With Large-Scale Clinical Datasets

Page 1: Working With Large-Scale Clinical Datasets

Working With Large-Scale Clinical Datasets

Craig Smail, MA, MSc (@craigsmail)
KU Medical Center
9th October 2014

Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg

Page 2: Working With Large-Scale Clinical Datasets

Disclosures

• Industry grant funding:

– Merck

– Mallinckrodt

– Sanofi

Page 3: Working With Large-Scale Clinical Datasets

Overview

• Target audience: anyone involved (directly or indirectly) in clinical data extraction, validation, and standardization

• Sections:

1. Data extraction: planning

2. Data extraction

3. Data standardization

4. Data transfer

Page 4: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Dataset type

– Most common: limited and de-identified

– Difference: a limited dataset can contain some personal information (DOB, DOD, city, state, age)

• Legal agreements

– Data Use Agreement (DUA)

– Business Associates Agreement (BAA)

– Institutional Review Board (IRB)

• Usually only if the IRB considers the activity Human Subjects Research

Page 5: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Important to finalize list of data elements before pull

– Time-consuming to repull

– Reallocation of resources (e.g. programmer time)

• Summary statistics are helpful in the planning stage (a quick completeness check is sketched below)

– e.g., death status is requested a lot but is very rarely available in the EHR
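A minimal sketch of such a planning-stage check, assuming a hypothetical one-row-per-patient extract elements.csv with one column per candidate data element (file and column names are illustrative, not from the source):

# load candidate data elements, one row per patient (hypothetical extract)
elements = read.csv("elements.csv", header = TRUE)

# percent of patients with a non-missing value for each candidate element
completeness = round(100 * colMeans(!is.na(elements)), 1)

# sparsely populated elements (e.g., death status) sort to the top
sort(completeness)

Elements with very low completeness are candidates to drop from the pull, or to replace with a proxy as discussed on the next slides.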

Page 6: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Use a data proxy correlated with the data element of interest

– Sometimes you need to develop proxies for data points of interest (e.g. severity of pain; hypoglycemic events)

– Example use case: aspirin as a proxy for antiphospholipid antibodies lab1

• Proxy data elements should themselves be supported by the data

1 Frankovich J, Longhurst C, Sutherland S. Evidence-Based Medicine in the EMR Era. N Engl J Med 2011; 365:1758-1759.

Page 7: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

• Data extracted from large multi-specialty clinic on the east coast

• 300,000 patients in EHR

• ~10,000 with a date of death (we’ll take this as the gold standard)

• Is days since last encounter a good proxy?

Page 8: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

# import data; base R's glm() is used below, so no extra packages are needed
setwd("path/to/data") # placeholder: set to the directory holding the extract
Encs = read.csv("lastenc.csv", header = FALSE) # col 1: last encounter date, col 2: death status

# find days since last encounter
Encs[, 3] = as.numeric(as.Date("2014-09-02") - as.Date(Encs[, 1], "%m/%d/%Y"))

# binarize: no encounter in last 1000 days = 1, otherwise 0 (also tried 180, 265, 750)
Encs[, 4] = ifelse(Encs[, 3] > 1000, 1, 0)

# clean up table: keep death status and the binarized encounter gap
Encs = Encs[, c(2, 4)]

# fit model (logistic regression, but could use something else)
fit = glm(Encs[, 1] ~ Encs[, 2], data = Encs, family = "binomial")

# confusion matrix and misclassification rate
confusionMatrix = table(round(fit$fitted.values), Encs[, 1])
misclassRate = (confusionMatrix[1, 2] + confusionMatrix[2, 1]) / sum(confusionMatrix) # 0.34

Page 9: Working With Large-Scale Clinical Datasets

Example: Proxy for Death Status

• Is days since last encounter a good proxy? No (error rate = 34%)

• Consequences:

Page 10: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Cohort definition

– Spell out cohort definitions explicitly, including all assumptions

– Real-world example (a sketch of this rule follows below):

• ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90 days apart’

• Further restriction specified: ‘if any value > 60 occurs in between the 90 days, then throw out’

• The word ‘consecutive’ means no values occurring within the 90 days will be considered at all

– If any other eGFR value occurs within the 90 days, the patient does not meet the first restriction
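A minimal sketch of this rule in R, assuming a data frame egfr with columns patient_id, date (a Date), and value (all names hypothetical). Because ‘consecutive’ excludes any in-between values, only adjacent measurements in time need to be checked:

# does one patient's eGFR history contain two consecutive qualifying values?
meets_rule = function(dates, values) {
  ord = order(dates)
  dates = dates[ord]
  values = values[ord]
  if (length(values) < 2) return(FALSE)
  in_range = values >= 15 & values < 60  # both members of the pair must qualify
  gap_ok = as.numeric(diff(dates)) >= 90 # at least 90 days apart
  # checking only adjacent pairs enforces 'consecutive': any other eGFR in
  # between breaks the pair, which also covers the 'value > 60 in between' exclusion
  any(head(in_range, -1) & tail(in_range, -1) & gap_ok)
}

by_patient = split(egfr, egfr$patient_id)
cohort_ids = names(by_patient)[sapply(by_patient, function(d) meets_rule(d$date, d$value))]

Writing the rule down this way forces every assumption (ordering, adjacency, the exclusion) to be made explicit before the pull.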

Page 11: Working With Large-Scale Clinical Datasets

Data Extraction: Planning

• Final thought on planning:

“Not everything that counts can be counted, and not everything that can be counted counts.”

– Albert Einstein (or William Bruce Cameron, depending on whom you believe)

• Some data elements are well populated but reflect things like coding bias (e.g. ‘up-coding’ to a code with a larger reimbursement)

Page 12: Working With Large-Scale Clinical Datasets

Data Extraction

• What are data extractions being used for in the NRN?

– Pharmaceutical companies: data on 143,057 patients from 8 health-care organizations/systems

– Federally funded research (NIH, AHRQ): data on ~100,000 patients

– Health IT vendors: work with Cerner to produce performance reports for use by participating providers

• Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e., switch to another vendor)

Page 13: Working With Large-Scale Clinical Datasets

Data Extraction

• Longitudinal data are important

– Look at temporal trends in the same patient

– During EHR transitions, some EHR vendors will import all data but restrict full access to only the last 18/24/26 months; clinicians don’t like this and want to be able to access all data

Page 14: Working With Large-Scale Clinical Datasets

Data Validation

• Date parameters: look at the min and max encounter dates in the dataset; with thousands of patients, you would expect the dates to match the requested range (a quick check is sketched below)

• Percentage of distinct patients in the extraction vs. the overall practice count: cohort percentages are quite stable across practices

– e.g., ‘all patients over age 18 with a diagnosis of type-2 diabetes defined by ICD-9 code xx.xxx’

– Caveat: doesn’t work well with small practices (< 2,000 distinct patients)
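A minimal sketch of these two checks, assuming an encounter-level extract enc with columns patient_id and enc_date, and a known overall practice count (all names and the count are illustrative):

# check 1: min and max encounter dates should match the requested pull range
range(as.Date(enc$enc_date, "%m/%d/%Y")) # date format assumed to match the feed

# check 2: distinct patients in the extraction as a share of the practice count;
# this percentage should be roughly stable across comparable practices
practiceCount = 25000 # illustrative overall practice count
pctDistinct = 100 * length(unique(enc$patient_id)) / practiceCount
pctDistinct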

Page 15: Working With Large-Scale Clinical Datasets

Data Standardization

• Open-source data models (e.g. the Observational Medical Outcomes Partnership (OMOP) common data model)

• Script data out of database (e.g. SQL view)

• Map labs/procedures to a standardized concept list (a mapping sketch follows below)

– Why? Different string labels referring to the creatinine blood test arrive from three data feeds, each with its own frequency of occurrence…
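A minimal sketch of such a mapping in R, with made-up source strings standing in for the three feeds; in an OMOP-style model the target would be a concept from a standard vocabulary (the LOINC code shown, 2160-0, is serum/plasma creatinine):

# hypothetical source strings from three feeds, all mapped to one standard concept
conceptMap = data.frame(
  source_value = c("CREATININE", "Creatinine Ser/Plas", "CREAT, SERUM"),
  concept_name = "Creatinine [Mass/volume] in Serum or Plasma",
  loinc_code = "2160-0",
  stringsAsFactors = FALSE
)

# attach the standardized concept to a (hypothetical) lab results table `labs`
idx = match(labs$lab_name, conceptMap$source_value)
labs$concept_name = conceptMap$concept_name[idx]
labs$loinc_code = conceptMap$loinc_code[idx]

Unmapped source values come back as NA, which makes it easy to spot new strings that still need review.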

Page 16: Working With Large-Scale Clinical Datasets

[Table: creatinine source-value strings by frequency of occurrence]

Note: source values with counts < 100 were censored

Page 17: Working With Large-Scale Clinical Datasets

Data Transfer

• HIPAA requirements

• Usually FTP to a secure site (e.g. Egnyte)

Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/

Page 18: Working With Large-Scale Clinical Datasets

Concluding Thoughts

• Extracted data is treated as the gold standard because it is pulled directly from the data source (i.e. the EHR), but the data often come from an intermediate product (such as a registry, like the one DARTNet provides), and we usually don’t have control over the data mapping from EHR to registry

• The EHR of the future (?):

– Genetic data (WGS or WES)

» WGS = ~100 GB

» WES = ~8 GB

– Integration with consumer wearable devices (e.g. FitBit; iPhone ECG)

– Further down the road: human microbiome; home microbiome

Page 19: Working With Large-Scale Clinical Datasets

Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg

Always question the data

Page 20: Working With Large-Scale Clinical Datasets

Questions?

• Slides available from SlideShare (URL: @craigsmail)

• Email: [email protected]

