Applications of aggregate data in healthcare and research
Mark Hoffman, Ph.D.
Chief Research Information Officer
@markhoffmankc
Topics
• Background
• Genomic consent
• Public data sets
  • Envirome
• Data derived from Electronic Health Records
  • Disease surveillance
  • Research
  • Quality Improvement
  • HIE data
• Big data required to train AI / Machine Learning algorithms
Data explosion - Biology and Medicine
YOU!
Every search
Every like
Devices
Connectome
• 10^10 neurons
• 10^14 connections
Genome
• 20,000 genes
• 3 billion nucleotides from each parent
Microbiome
• 10 trillion cells
• 10x more than our own cells
Proteome
• 250,000 – 1M unique proteins
Apps
EHR
Characteristics of “Big Data”:
• Volume: large loads, no formal number as threshold
• Velocity: data in motion, not at rest
• Variety: data has complexity
• Veracity: data can be traced to origin, point in time
Veracity: Relationship between “Big Data” and “Little Data”
Irresponsible data: Questionable Veracity at the “Little data” level
• “Big data” analysts don’t understand “little data”
• Failure to recognize inherent bias
  • Race, gender, age, socioeconomic, regional
• Unknown origin (provenance)
• Out of context
• Inappropriate level of rigor
• Hype
• Disallowed access
Whole exome sequencing (WES)
• Increasingly utilized for diagnostic purposes
• Sequence every gene
• Secondary findings – risk factors unrelated to original reason for ordering test
• Presidential Bioethics Commission and American College of Medical Genetics have specific recommendations for WES consent forms
Consent elements matrix
• Download consent forms from 18 academic and commercial labs offering diagnostic WES
• Score each form for recommended elements
• Also evaluated grade level readability of form
  o Flesch-Kincaid
  o Average was 10.8 vs recommended 8th grade
  o Some college reading level
Fowler, SA; Saunders, CJ; Hoffman, MA “Variation among consent forms for clinical whole exome sequencing” J. Genetic Counseling. July 2017 ePub. PMID: 28689263
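The grade-level scoring mentioned above can be sketched with the standard Flesch-Kincaid grade formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. This is a minimal sketch, not the scoring tool used in the cited study: the syllable counter is a rough vowel-group heuristic chosen for illustration, and the sample sentence is invented.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels (heuristic only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical consent-form sentence, used only to exercise the function.
sample = "Secondary findings may identify risk factors unrelated to the original test."
grade = flesch_kincaid_grade(sample)
```

A production readability check would normally use a dictionary-based syllable counter, since the vowel-group heuristic miscounts words with silent vowels.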
Consent content
• All labs acknowledge uncertainty of results
• Most acknowledge possible inclusion of data in databases
• Few labs discuss release of secondary findings to family
Envirome Data
Envirome data service
Zip code
Census tract
Context Data
Electronic Health Record
Research Data set
Public Data Sources
• 2010 Census
• USDA Food Desert
Address
Geocode
Geocoding Service or Application
Geocoding – Strategies for protecting PHI
• Address is one of 18 HIPAA protected fields
• Covered Entity: GIS installed in house
• Geocoding Service Provider: BAA? Distinct IP of origin. If research, include in consent
• Cloud Services Provider: BAA
Live at Children’s Mercy
Source and version clearly presented
Precision to enable reproducibility
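One way to picture an envirome lookup that keeps source and version attached to every value is sketched below. This is an illustration under stated assumptions, not the Children’s Mercy service: the tract identifier, the values, the version strings, and the names `ContextValue`, `CONTEXT_BY_TRACT`, and `annotate` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContextValue:
    name: str
    value: object
    source: str   # public data set the value came from
    version: str  # release of that data set, kept for reproducibility


# Hypothetical lookup table keyed by census tract; every value here is invented.
CONTEXT_BY_TRACT = {
    "00000000001": [
        ContextValue("food_desert", True, "USDA Food Desert", "hypothetical-release"),
        ContextValue("population", 4200, "2010 Census", "2010"),
    ],
}


def context_for_tract(tract_id: str) -> List[ContextValue]:
    """Return context data for a tract; an unknown tract yields an empty list."""
    return CONTEXT_BY_TRACT.get(tract_id, [])


def annotate(record: dict, tract_id: str) -> dict:
    """Copy a research record and attach tract-level context with provenance."""
    out = dict(record)
    out["census_tract"] = tract_id
    out["context"] = context_for_tract(tract_id)
    return out


annotated = annotate({"patient_key": "p1"}, "00000000001")
```

Carrying `source` and `version` on each value lets a later analysis state exactly which public release it used.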
Primary uses of EHR data
• Support point of care decisions
• Enable immediate access to documentation
• Promote compliance
• Protect patient privacy
• Automate and streamline clinical operations
• Billing
Disease Surveillance – Public Health
• Some pathogens require notification of public health
  • Highly contagious
  • Food poisoning
  • Bioterrorism
• Requirements vary by jurisdiction
• Historically notification was by FAX, mail or phone call
• Electronic reporting directly from EHR offers multiple benefits
2001 - Anthrax
• Anthrax contaminated letters sent to news media and U.S. Senators
• 5 fatalities, 17 infections
• Kansas City Health Department and Cerner agreed to collaborate
Public Health Surveillance Architecture
[Figure: three panels comparing Conventional reporting with HealthSentry. Reportable cases (non-STD): March-Sept 2002]
• DATA COMPLETENESS: % fields complete (average over 6 key data fields)
• UNDER-REPORTING: % increase by pathogen (Campy, Salm, Shig, Grp A Strep, H. Influ, Hep C); increased overall reporting by 96%
• TIMELINESS: days to receive report (average over all reportables)
Improved public health reporting
Hoffman, MA., Wilkinson, T, Bush, A, Myers, W, Griffin R, Hoff, G, Archer, R. “Multijurisdictional approach to Biosurveillance, Kansas City” Emerg. Inf. Dis. 2003 9(10):1281-1286 PMID: 14609464
Public Health Network: 2009 Influenza initiative
• Opt-in at project level
• 850+ facilities, 48 States
• 57 million cases processed
• Positive influenza A results, ILI, ED utilization
• Worked with CDC, state and local public health
Data waterfalls
• Every handoff can result in the loss of meaning
• Data attribution
• Need to understand workflow and data flow at all levels and transitions
[Diagram labels: Source, Intermediate 1, Intermediate 2, Health Facts]
[Diagram labels: EHR Vendor clients (No data rights / Data rights); De-ID, mapping, normalization; Health Facts™]
Health Facts data waterfall
• PHI
• Text documents
• Images
Data extracts
Electronic Health Record
Data warehouse
• Duplicates
• Cancelled orders
Health Facts
Data type Current release
Unique patients 63 million
Total laboratory results 4.3 billion
Total facilities 863
Total medication orders 734 million
Total diagnoses 489 million
Cerner Health Facts - Summary
• Actual, not potential data
Can we validate large EHR data sets?
Diagnostic category                                    HCUP NIS   Health Facts   t value relative difference
Nervous system                                         6.03       6.12           0.39
Eye                                                    0.15       0.14           1.53
Hepatobiliary and pancreas                             2.94       3.03           1.02
Male reproductive                                      0.50       0.55           2.23
Female reproductive                                    1.75       0.55           24.04
Pregnancy and childbirth                               11.09      4.15           18.67
Myeloproliferative, poorly differentiated neoplasms    0.91       0.86           0.7
Mental health                                          3.89       2.22           7.1
Trauma                                                 0.27       0.21           3.08
DeShazo, J; Hoffman, MA “A comparison of a multistate inpatient EHR database to the HCUP nationwide inpatient sample” BMC Health Services Res. 2015 15(1):384 PMID: 26373538
Mg and AMI - Mortality
• Mg supplementation recommended after AMI but little evidence
• After inclusion/exclusion: 11,683 HF patients with AMI and Mg results
• Both Low and High Mg levels correlate with higher risk of in-hospital mortality
Shafiq et al., J. Amer. Coll. Card. June 2017
Let the data speak
• Risk factors associated with hospital acquired C. diff infections
• Regression analysis
• Does not require a narrow question
Dean B., Campbell R., Nathanson B. et al. “Risk factors associated with hospital-origin vs community-origin Clostridium difficile-associated diarrhea” IDWeek 2012
Data-informed selection of QI projects
A1c / Sickle Cell patients - Comparison of TMC to all HF sites
Sickle Cell Patients at all 393 Facilities
• At least one A1c encounter: 3,927 (11%)
• No A1c encounters: 33,224 (89%)
A1c Encounters of Sickle Cell Patients at TMC
• At least one A1c encounter: 170 (32%)
• No A1c encounters: 356 (68%)
• Confirms high frequency
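The comparison behind this slide is simple arithmetic on patient counts, sketched here. The counts are read from the two charts (3,927 and 33,224 across all facilities; 170 and 356 at TMC), and assigning the first count in each pair to "at least one A1c encounter" follows the legend order, which is an assumption. The function name `share_with_a1c` is hypothetical.

```python
def share_with_a1c(with_a1c: int, without_a1c: int) -> float:
    """Fraction of sickle cell patients with at least one A1c encounter."""
    total = with_a1c + without_a1c
    if total == 0:
        raise ValueError("no patients in this group")
    return with_a1c / total


# Counts as read from the slide's charts (legend order assumed).
all_sites = share_with_a1c(3927, 33224)  # all 393 facilities
tmc = share_with_a1c(170, 356)           # TMC only

# How many times more often the pattern appears at TMC than across all sites.
ratio = tmc / all_sites
```

A site whose share stands well above the multi-site baseline is a natural candidate when selecting a QI project.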
HIE Data
• Health information exchanges (HIE) are implemented to support patient care
• Most HIE agreements do not include clauses that establish policies and procedures for research use of data that crosses the HIE without research consent
• HIE operator should not offer to redistribute or provide access to data for research without full understanding of consent policies
• Very risky territory, especially in light of Cambridge Analytica / Facebook issues
Recent developments
Why we need algorithms…
• Limits to human memory
  • 3,874 genes with phenotype-causing mutation (OMIM)
• Limits to human perception
  • Vision limit: 170 pixels per inch
  • Limits of hearing, touch
• Limits to our ability to recognize patterns
  • Subtle but large scale variations across population
• Knowledge explosion
  • “Omics”
  • 23 million citations in MedLine
• Major opportunities to improve the quality of care
  • Non-adherence to guidelines
  • Hospital acquired infections
  • Adverse events
  • Health disparities
• Efficiency
  • Workforce shortage
    • Urban: 53.3 physicians / 100,000 people
    • Rural: 39.8 physicians / 100,000 people, plus much greater distances
  • High cost of healthcare
Algorithms don’t just happen…
Data!
• Supervised learning
• Unsupervised learning
• Deep learning
Algorithm
Providers apply algorithms based on this model:
• Randomized clinical trials
• Literature review
• Evidence-based medicine
• Human curation
Algorithm
Algorithms are hungry for data
• Supervised learning – heavy analyst involvement. For example, annotation of pneumonia cases by radiologist.
• Unsupervised learning – cluster analysis.
• Deep learning – algorithms train algorithms – unstructured, unlabeled data
[Diagram labels: Total data set, Training set, Experimental]
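The split of a total data set into a training portion and a held-out portion can be sketched as follows. This is a minimal illustration, not the method of any particular project: the 80/20 fraction, the fixed seed, and the function name `split_data` are assumptions.

```python
import random


def split_data(records, train_fraction=0.8, seed=0):
    """Shuffle records reproducibly and split them into training and held-out lists."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be reproduced
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record IDs standing in for the total data set.
training_set, held_out_set = split_data(range(100))
```

Keeping the held-out records away from training is what lets them serve as an honest check on the algorithm.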
Most data must be “cleansed” before feeding algorithm
The reality… there is often a “man behind the curtain”
What we want to imagine: a well-oiled data machine
Most data must be manipulated before feeding algorithm
• Standard data prep:
  • Convert text to numeric
  • Address missing values
  • Address values that are “out of bounds”
• This is often perfectly appropriate but requires subject matter expertise
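The three standard prep steps can be sketched in a few lines. Everything specific here is an assumption made for illustration: the text-to-number coding, the plausibility bounds, and the function names. Choosing real codes and bounds is exactly where subject matter expertise is needed.

```python
from typing import Optional

SEX_CODES = {"female": 0, "male": 1}  # hypothetical text-to-numeric coding
GLUCOSE_RANGE = (10.0, 2000.0)        # hypothetical plausibility bounds, mg/dL


def encode_text(value: Optional[str], codes: dict) -> Optional[int]:
    """Convert a text category to a number; unknown or missing text becomes None."""
    if value is None:
        return None
    return codes.get(value.strip().lower())


def clean_numeric(raw: Optional[str], low: float, high: float) -> Optional[float]:
    """Parse a numeric string; missing, unparsable, or out-of-bounds values become None."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    if not low <= value <= high:
        return None
    return value


cleaned = clean_numeric("98.5", *GLUCOSE_RANGE)
```

Returning `None` rather than a substitute number keeps "no usable value" visible to whoever builds the model, instead of hiding it behind a default.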
Caucasian Female
• Abdominal Pain, Unspecified Site
• Acute Bronchitis
• Acute Pancreatitis
• Benign Essential Hypertension
• Coronary Atherosclerosis of Unspecified Type of Vessel
• Diabetes Mellitus without Mention of Complication, Type I
• Diabetes mellitus without mention of complication, type II or unspecified type
• Diarrhea
• Dysuria
• Esophageal Reflux
• Fever, Unspecified
• Hypersplenism
• Hypopotassemia
• Nausea with Vomiting
• Lymphosarcoma and Reticulosarcoma and Other Specified Malignant Tumors of Lymphatic Tissue
• Other Nonspecific Abnormal Serum Enzyme Levels
• Personal History of Other Diseases of Digestive System
• Q Fever
• Thrombocytopenia, Unspecified
• Unspecified Chronic Bronchitis
• Unspecified Idiopathic Peripheral Neuropathy
• Urinary Tract Infection, Site Not Specified
One very ill woman
Highlighted: Lymphosarcoma, Type II Diabetes, Q Fever, Peripheral Neuropathy, Pancreatitis
Patient type categories (subset)
Code Category
77 Client
78 Clinic
76 Cerner test patient – not valid patients
122 HLA QC
123 Home health
109 Test
Update: Cerner has removed many non-patient encounters in latest HF data cut
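Filtering non-patient encounters before analysis could look like the sketch below. Only code 76 is explicitly described on the slide as not a valid patient; treating 109 (Test) and 122 (HLA QC) as non-patient as well is an assumption, and the field name `patient_type_code` is hypothetical.

```python
# Codes taken from the slide's subset. Excluding 109 and 122 is an assumption;
# only 76 is explicitly labelled as not a valid patient.
NON_PATIENT_TYPE_CODES = {76, 109, 122}


def drop_non_patient_encounters(encounters):
    """Keep only encounters whose patient type is not a known non-patient code."""
    return [
        e for e in encounters
        if e["patient_type_code"] not in NON_PATIENT_TYPE_CODES
    ]


# Hypothetical encounters used only to exercise the filter.
sample_encounters = [
    {"encounter_id": 1, "patient_type_code": 78},   # Clinic
    {"encounter_id": 2, "patient_type_code": 76},   # Cerner test patient
    {"encounter_id": 3, "patient_type_code": 123},  # Home health
    {"encounter_id": 4, "patient_type_code": 109},  # Test
]
kept = drop_non_patient_encounters(sample_encounters)
```

The exclusion list should be reviewed with someone who knows how each site uses its patient type codes.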
What can we do? Implementation Science
• Data science teams must be interdisciplinary: subject matter and technical
  • Contribute as clinical experts: you don’t have to be a programmer!
• Reduce data issues upstream of AI and algorithms
  • Advocate best practices for system implementations:
    • Automate “boundary” checking for values:
      • Expected range
      • If necessary, provide override
    • Thoughtful consideration of default values
      • Zero vs Null
    • Versioning
      • Clinical forms: add or remove prompts
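A boundary check at the point of data entry, with an override and with null kept distinct from zero, might be sketched like this. The expected range, the `CheckResult` type, and the function name `check_value` are hypothetical; no specific EHR system is being described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    accepted: bool
    value: Optional[float]
    note: str


def check_value(value: Optional[float], low: float, high: float,
                override: bool = False) -> CheckResult:
    """Check an entered value against its expected range, allowing an explicit override."""
    if value is None:
        # Nothing was recorded: store null, never a default of zero.
        return CheckResult(True, None, "not recorded")
    if low <= value <= high:
        return CheckResult(True, value, "in expected range")
    if override:
        return CheckResult(True, value, "out of range, accepted by override")
    return CheckResult(False, None, "out of range, rejected")


# Hypothetical heart-rate range used only for illustration.
result = check_value(300, 30, 250, override=True)
```

Recording the override in `note` keeps deliberately unusual values distinguishable from entry errors later.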
Thank you!
• @markhoffmankc
• My blog: markhoffmankc.com
• Some work funded by Centers for Disease Control and Prevention
  • Grant NU47OE000105-01-01
• Thanks to:
  • Kamani Lankachandra, UMKC/TMC
  • Suman Sahil, UMKC
  • Shivani Sivisankar, UMKC
  • Earl Glynn
  • Children’s Mercy Research Informatics Team