Big Data, Big Opportunities, Big Challenges
A Version of:
2015 ENAR Presidential Address
David L DeMets, PhD
University of Wisconsin-Madison
DeMets Conflict of Interest (COI)
• I consult with
– NIH
– FDA
– IOM
– Industry
  • Pharmaceuticals
  • Medical Devices
• I serve on over 10 active DMCs currently
• I hold no stock in biopharm/device industry
• I am not selling any medical products
Topics
• Making a Difference
• Data Tsunami
• IOM Report: Genomic Predictors
• Algorithms as a Medical Device
• IOM Report: Data Sharing
• Need for Quantitative Training
Looking in Rearview Mirror
• Good to take a look in the rearview mirror from time to time
• Just don’t stare at it too long
• Look back 50-60 years to NIH
• Beginning of a build up of biostatistics and quantitative science
NIH Bethesda Campus
Temporary NIH Building T-6 (Circa 1946+)
NHLBI, T-6
Office of Biometry, First Generation (J Ellenberg)
Additional Early NIH Biostatisticians
• Edmund Gehan
• Marvin Zelen
• Seymour Geisser
• Fred Ederer
• Dan Siegel
• Hal Kahn
• George Weiss
• David Alling
• Felix Moore
• John Gart
• John Bartko
• Sid Cutler
• Bill Haensel
• ……
• There were many others not at NIH
Their Biostatistical Culture
• Make a difference
• Make yourself useful
• Engage in the science
• Find a way to contribute
• Let collaborations drive methodology
• My job description in 1972
A Case Study: Cardiovascular Disease
• During WWII/Korean War, surgeons noticed young healthy men already had evidence of atherosclerosis
• By 1950, heart disease was an epidemic
• Largest cause of mortality by far as well as morbidity
Cardiovascular Disease: a Major Problem in 1950
• How to start: Framingham Heart Study
• Began in 1948 by NIH & BU (Felix Moore, T Dawber, W Kannel, W Castelli,…)
• Enlisted 5,209 men & women (30-62) from Framingham, MA in long term cohort study
• After a decade of followup, used logistic regression models to identify major CVD risk factors (Cornfield)
• Many analyses done during 1970’s (T Gordon, D McGee, M Hjortland,….)
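For readers less familiar with the logistic regression model mentioned on this slide, its generic form is sketched here. The notation is an assumption made for illustration and is not taken from the original slides: x_1, …, x_k stand for risk factors such as age, blood pressure and cholesterol, and the β's are coefficients estimated from the cohort data.

```latex
\[
\Pr(\text{CVD event} \mid x_1,\dots,x_k)
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\bigr)},
\qquad
\log\frac{\Pr(\text{CVD event})}{1-\Pr(\text{CVD event})}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j .
\]
```

Under this model, exp(β_j) is the odds ratio associated with a one-unit increase in x_j with the other factors held fixed, which is how a fitted model points to candidate risk factors.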
Framingham Heart Study: Some Identified Risk Factors
• Age
• Gender
• Smoking
• Blood Pressure
• Cholesterol
• Diabetes
• Weight
• …….
• It was understood that modifying risk factors would have to be tested in an RCT
• Observational studies not sufficient
NHLBI Launched a Wave of RCTs in 1960’s & 1970’s
• Coronary Drug Project (CDP)
• Urokinase Pulmonary Embolism Trial (UPET)
• Hypertension Detection and Follow-up Program (HDFP)
• Multiple Risk Factor Intervention Trial (MRFIT)
• Lipid Research Clinics (LRC) trials
• Aspirin Myocardial Infarction Study (AMIS)
• Coronary Artery Surgery Study (CASS)
• Beta-Blocker Heart Attack Trial (BHAT)
• Several more in pulmonary disease
Early NHLBI Trial Results
• Reduction in mortality/morbidity by lowering blood pressure
• Lowering cholesterol could reduce mortality, but the benefit depended on the drug; some had only a modest effect
• Effective use of coronary bypass surgery – not for everyone
• Reduction in mortality post MI using beta-blockers
• Clot buster drugs improved outcomes for patients with pulmonary embolism, and later for MI patients
• …..
Early Statistical Contributions & Lessons Learned
• Assessing risk factors using regression models (Cornfield,..)
• Design of long term clinical trials (Halperin, et al)
– Changing risk, treatment effect, adherence over time
– Screening patients for trials based on risk scores
– Non-inferiority trials, i.e. “proving the null hypothesis”
• Sequential methods (Halperin, Ware, DeMets, Lan..)
– Curtailment methods
– Conditional Power
– Group Sequential & Alpha Spending
• Analysis
– “Intention to Treat” & Informative Censoring
– Longitudinal Data (Ware, Wu,..)
• ……….
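The alpha spending idea listed above allocates the overall type I error across interim analyses as a function of the information fraction t (0 < t ≤ 1). The slides give no formulas, so the following is only a minimal sketch, assuming the two commonly cited Lan-DeMets spending functions (O'Brien-Fleming type and Pocock type). The function names are illustrative, and the code returns cumulative alpha spent, not the stopping boundaries themselves.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def obf_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 - 2 * _STD_NORMAL.cdf(z / sqrt(t))


def pocock_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * log(1 + (e - 1) * t)


# Cumulative alpha spent at four equally spaced looks (illustrative schedule).
for look in (0.25, 0.5, 0.75, 1.0):
    print(f"t={look:.2f}  OBF-type={obf_spending(look):.5f}  "
          f"Pocock-type={pocock_spending(look):.5f}")
```

Both functions spend the full alpha at t = 1. The O'Brien-Fleming-type function spends very little early, which makes early stopping harder and keeps the final analysis close to a fixed-sample test.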
Biostatistics Culture: The Early Years
• An amazing period during 1960’s & 1970’s
• Common theme was to operate as statistical collaborators, not consultants
• Let statistical methodology be driven by collaborative projects
• Had time to discuss, debate, develop new methods or modifications
• First generation mentored second generation
• Led to a large MS & PhD level team nationally
• Contributions resulted in better research, better trials, better treatments and better health
Fast Forward to 2015
• Era of Big Data
• Era of Computation
• Era of Data Sharing
• Era of Reproducibility Issues
• Era of Quantitative Science
• Era of Big Challenges & Incredible Opportunity
“Big Data”
• Era of massive data sets
• Recent explosion of biomedical data
– Genome sequence data
– Public health databases
– Personal monitoring
• Need for new and better ways to make the most of this data
– Speed discovery and innovation
– Improve the nation’s health and economy
The Five V’s of Big Data (Ref: B Marr: BIG DATA)
• The defining properties of Big Data are often described as
• Volume: scale of available data
• Velocity: speed at which new data is generated and transmitted
• Variety: different types of data available
• Veracity: levels of accuracy
• Value: meaningfulness!
The Era of Big Data
Biomedical Big Data – variety
• genotypes
• electronic health records
• images
• molecular profiles
• text sources
• drug profiles
• wearable sensor data
• environmental data
Biomedical Big Data – volume & velocity: Electronic Health Records
source: CDC
Butte’s Take Home Points (SCT Presidential Address 2013)
• Molecular, clinical, trials, and epidemiological data and tools exist
• Can contribute to diagnostics and therapeutics.
• Personalized medicine ≥ DNA. Needs to include other clinical, molecular, and environmental measures.
• Public big data is highly enabling; we need to learn to take advantage of it
Amazing Data Available (Butte 2013 SCT Talk)
• NCBI/Gene Expression Omnibus (GEO)
– Over 1 M microarrays available
• dbGaP (Genotype & Phenotype)
– E.g. Framingham Heart Study
• PubChem
– >100M substances x 650 assays
• Conversant Bio
– Browse by disease
Available Data (2)
• Assay Depot
– Find latest services available worldwide at lowest costs
– Can do animal testing at very low cost
• Complete Genomics
– Sets of public genome sequences
• COMING SOON
– Patient level data from RCTs (IOM Report)
– De-identified Data Warehouse from EHRs
RCTs & Big Data
• Some argue that we already collect too much data in our standard RCTs
• Now we have the potential to collect even more, a lot more!
• Costs of traditional trials are escalating
• Phase III failure rate too high
• More trials going offshore
• Current system not sustainable
Estimated Costs of Drug Development ($Millions)
Fee R: The cost of clinical trials. Drug Discovery and Development Webcast. March 1, 2007
DiMasi JA et al: The price of innovation: new estimates of drug development costs. J Health Economics 2003; 22:151-185
DiMasi JA: New drug development in the United States from 1963 to 1999. Clin Pharmacol Ther 2001; 69:286-96
Protocol Complexity and Burden Across All Phases
                       2000-03   2004-07   % Change
Unique procedures      20.5      28.2      38%
Total procedures       105.9     158.1     49%
Total site work        28.9      44.6      54%
Eligibility criteria   31        49        58%
Enrollment rate        75%       59%       -21%
Retention rate         69%       48%       -30%
Kaitin KI: Tufts Center for the Study of Drug Development Impact Report. May/June 2010, Volume 12 (3).
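The "% Change" column in the table above is a relative change, (later − earlier) / earlier, so the enrollment and retention figures are relative declines and not percentage-point differences. The sketch below recomputes the column from the two period values as they appear in the table; the variable and function names are illustrative only.

```python
# (2000-03 value, 2004-07 value) copied from the table above.
measures = {
    "Unique procedures": (20.5, 28.2),
    "Total procedures": (105.9, 158.1),
    "Total site work": (28.9, 44.6),
    "Eligibility criteria": (31, 49),
    "Enrollment rate (%)": (75, 59),
    "Retention rate (%)": (69, 48),
}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100


# Round to whole percents, as in the published table.
recomputed = {
    name: round(percent_change(before, after))
    for name, (before, after) in measures.items()
}

for name, pct in recomputed.items():
    print(f"{name:<22}{pct:+d}%")
```

For example, the enrollment rate fell 16 percentage points (75% to 59%), which is the 21% relative decline shown in the table.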
Globalization of Clinical Trials
• Increase in non-US clinical trials
• According to DHHS Inspector General Report,
– 80 percent of the drugs approved for sale in 2008 had trials in non-US countries
– 80 percent of all subjects who participated in clinical trials were enrolled at non-US sites.
• One-third of CTs by 20 largest US-based drug companies are conducted solely outside the US (Glickman et al, NEJM, 2009)
The Five V’s of Big Data (Ref: B Marr: BIG DATA)
• The defining properties of Big Data are often described as
• Volume: scale of available data
• Velocity: speed at which new data is generated and transmitted
• Variety: different types of data available
• Veracity: levels of accuracy
• Value: meaningfulness!
Big Data provides not just challenges, but opportunities
Evolution of Translational Omics
• Institute of Medicine, The National Academies Press, 2012
Omics Based Tests
– Difficulty in defining the biological rationale underpinning an omics-based test
• Biological rationale for single-analyte tests is often evident
➠ Examples: HER2, LDL
– Challenges in data provenance and sharing
• Large, complex datasets used to create computational models
• Simple data management errors can easily occur
• Sharing of clinical data and code is not routine
• Difficult for other scientists to replicate and verify findings
Omics Based Tests (2)
– Need for expertise in multiple disciplines, a TEAM Sport
• Biologists, geneticists, statisticians, bioinformaticians, clinical pathologists,…..
• Responsibility for omics-based test is shared among many
• No single investigator has breadth of expertise needed
– No widely accepted process for translation of omics test into clinics
Three Stages of Omics Test Development (True for other predictors)
1. Discovery
2. Test Validation
3. Evaluation for Clinical Utility and Use
Omics-Based Test Development Framework
Stage I: Discovery Phase
– Candidate test is developed on training set
– Candidate test is locked down
• Training set data
• Computational algorithm
– Candidate test is evaluated & confirmed on an independent sample set
– Computational code and test data must be available
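The discovery-phase sequence above (develop on a training set, lock down, confirm on an independent sample set) can be sketched in a few lines. Everything here is a hypothetical illustration: the single-score threshold "test", the class and function names, and the toy data are assumptions chosen to show the workflow, not an actual omics predictor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: parameters cannot be altered after lock-down
class LockedTest:
    """A candidate test whose fitted parameter is fixed once created."""
    threshold: float

    def predict(self, score: float) -> bool:
        return score >= self.threshold


def develop_candidate(training_set):
    """Fit on the training set only: threshold = midpoint of the class means."""
    positives = [score for score, label in training_set if label]
    negatives = [score for score, label in training_set if not label]
    midpoint = (sum(positives) / len(positives)
                + sum(negatives) / len(negatives)) / 2
    return LockedTest(threshold=midpoint)


def accuracy(test, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(test.predict(score) == label for score, label in samples)
    return correct / len(samples)


# Hypothetical toy data: (score, true label) pairs.
training_set = [(1.0, False), (2.0, False), (5.0, True), (6.0, True)]
independent_set = [(1.0, False), (2.0, False), (3.0, True), (4.0, True)]

locked = develop_candidate(training_set)        # develop, then lock down
confirmation_accuracy = accuracy(locked, independent_set)  # never refit here
print(locked.threshold, confirmation_accuracy)
```

The point of the frozen object is that the independent set is used only for evaluation: any attempt to adjust the threshold after seeing those samples raises an error, which mirrors the lock-down requirement.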
Stage II: Test Validation Phase
• Regulatory oversight for omics based tests differs from drug development
• An omics-based test consists of both
– the data-generating assay and
– the fully specified computational model.
• Needs to be validated on an independent blinded data set
• FDA approval or clearance as a device, or
• Use of a Laboratory Developed Test (LDT) / CLIA laboratory
Stage III: Evaluation for Clinical Utility - Three Pathways
• Prospective–retrospective studies using archived specimens from previously conducted clinical trials.
• Prospective clinical trials that shadow clinical practice, where the test does not impact practice
• Prospective randomized clinical trials that directly address the utility of the omics-based test, where the test does direct patient management by assessing risk or treatment response
Evaluation for Clinical Utility and Use
When is an Algorithm a Device?
FDA Regulations for Medical Software and Mobile Medical Apps
Morgridge Institute for Research, UW-Madison
Sept 2014
Keynote Speaker: Seth A. Mailhot
Michael Best & Friedrich LLP
Definition of a “Device”
• FDA (Sec. 201(h)) defines a device as an instrument, apparatus, implement, machine, contrivance, . . . or other similar or related article, including any component, part, or accessory, which is intended:
– for use in the diagnosis of disease or other conditions,
– in the cure, treatment, or prevention of disease,
– to affect the structure or function of the body
Software Categories
• Software allowing the user to input patient-specific information along with reference material to automatically diagnose a disease or condition is a device
• Generic software aids promoted for medical applications are devices
• Algorithms implemented via software for medical intervention need FDA oversight and discussion
• This would include genomic predictors, medical apps, remote monitoring
IOM Committee on Responsible Sharing of Clinical Trial Data
2015 IOM Report
IOM Study Context
• Requested by NIH, sponsors & public organizations
• Responsible clinical trial data sharing is in the public interest to advance science
• Many CTs now not analyzed and published in a timely manner
– For 1/3 of trials, results not published after 4 years
• Already a momentum for data sharing
• Question is not whether to share, but what types of clinical trial data, when, & how to share
Key Benefits of Sharing include:
• Allows other investigators to carry out additional analyses and reproduce published findings
• Strengthens evidence base for regulatory decisions & clinical guidelines
• Increases scientific knowledge gained from investments by funders
• Maximizes contributions of participants and avoids unnecessary duplicative trials
• Stimulates new ideas for research
Key Risks and Challenges include:
• The need to:
– Protect participant privacy and honor consent (e.g. HIPAA)
– Safeguard legitimate economic interests of sponsors (IP)
– Guard against invalid secondary analyses
– Give researchers time to conduct analyses, publish and get credit for sharing (academic currency)
Key Stakeholders
• Trial Participants
• Investigators
• Institutions
• Funders / Sponsors
• Regulatory Agencies
• Research Ethics Groups
• Journals
• Professional Societies
• Patient Advocacy Groups
Data Flow for CTs
Types of Data
• Raw Patient Level Data
– CRFs
– Lab data including x-rays, MRI, EKGs, etc
– QoL instruments
• Meta Data
• Analyzable Data
• Analyzed Data
– Only a small portion of analyzable data is really analyzed for publications
• Clinical Study Reports (CSRs)
• Publications
IOM Recommendations
• Rec 1: Stakeholders in clinical trials should
– foster a culture in which data sharing is the expected norm,
– commit to responsible strategies.
• Rec 2: Sponsors and investigators should share the various types of clinical trial data no later than the following timelines (when & what)
Rec 2: Timeliness & Milestones
• Trial Registration (e.g. ClinicalTrials.gov)
– Protocol, Data Sharing Plan, SAP
• Study Completion (LPLV)
– Within 12 months, should provide
– Summary Level Results
– Lay/Public Presentation
• Publication
– Within 6 months, should provide
– Patient Level analyzed (de-identified) data supporting paper
– Full protocol, SAP & Analytic Code
Rec 2: Timeliness & Milestones (2)
• Study Completion-II
– Within 18 months, should provide
– Full analyzable de-identified data set, protocol, SAP & analytic code
• Trial submission for Regulatory Approval
– Within 30 days after approval or within 18 months of product abandonment
– Full analyzable de-identified data set, protocol, SAP & redacted Clinical Study Report (CSR)
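The Rec 2 milestones above amount to a small set of date rules. As a minimal sketch, the function below turns a trial's key dates into latest sharing dates. The function names and dictionary keys are hypothetical, and "month" is assumed to mean a calendar month, clamped to the last day of shorter months; the slides do not define it.

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Date `months` calendar months after `start`, clamped to month end."""
    years, month_index = divmod(start.month - 1 + months, 12)
    year, month = start.year + years, month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def sharing_deadlines(study_completion, publication=None,
                      approval=None, abandonment=None):
    """Latest sharing dates implied by the milestones listed above."""
    deadlines = {
        # Summary-level results and lay summary: 12 months after completion
        "summary_results": add_months(study_completion, 12),
        # Full analyzable data set, protocol, SAP, code: 18 months after completion
        "full_data_package": add_months(study_completion, 18),
    }
    if publication is not None:
        # Data, protocol, SAP and code supporting the paper: 6 months after publication
        deadlines["post_publication_package"] = add_months(publication, 6)
    if approval is not None:
        # Regulatory package: 30 days after approval
        deadlines["post_regulatory_package"] = approval + timedelta(days=30)
    elif abandonment is not None:
        # Regulatory package: 18 months after product abandonment
        deadlines["post_regulatory_package"] = add_months(abandonment, 18)
    return deadlines


# Hypothetical trial dates, for illustration only.
example = sharing_deadlines(
    study_completion=date(2015, 1, 31),
    publication=date(2015, 8, 31),
    approval=date(2016, 3, 1),
)
for item, deadline in example.items():
    print(f"{item}: {deadline.isoformat()}")
```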
Rec 3: Data Access: How?
Open Access
• De-identified data
• No restrictions on access, no conditions for use
• Appropriate for sharing clinical trial results (e.g. on clinicaltrials.gov)
• Individual Patient Data and CSRs present risks that generally need to be mitigated through appropriate controls
Rec 3: Data Access: How?
Controlled
• Not a single model, but a spectrum of controls and conditions
• De-identification
• Use of Registration and Data Use Agreements
• Central or federated data model
• Sharing data in non-downloadable format but analyzable
• Review of data requests (by sponsor or independent review panel)
Rec 4: Address Remaining CT Data Sharing Challenges
• Infrastructure: insufficient platforms to store and manage data
• Technological: current platforms are not consistently discoverable, searchable, and interoperable
• Workforce: lacks skills and knowledge to manage operational and technical aspects
• Sustainability: costs of the current model are borne by a small subset of sponsors, funders and trialists, which is unsustainable.
Biomedical Workforce Training
• Biomedical Research Workforce Working Group Report to the Advisory Committee to the Director (July 2012)
– Shirley Tilghman, Ph.D., President, Princeton University, N.J., co-chair
– Sally Rockey, Ph.D., NIH Deputy Director for Extramural Research, co-chair
– http://acd.od.nih.gov/Biomedical_research_wgreport.pdf
Biomedical Workforce Training (2)
• Data and Informatics Working Group Report to The Advisory Committee to the Director (ACD) (June 2012)
– David DeMets, Ph.D., Professor, UW-Madison; co-chair
– Lawrence Tabak, D.D.S., Ph.D., Principal Deputy Director, NIH; co-chair
• Rescuing US biomedical research from its systemic flaws by Alberts B, Kirschner M, Tilghman S & Varmus H; PNAS, April, 2014
Current Biomedical Workforce: Where Are We?
• 2015 AAAS Session: Crisis in Quantitative Training for Biomedical Science
– Key speaker: Donna Ginther PhD, Univ of Kansas
• Work based on 1980-2010 waves of Survey of Earned Doctorates and Survey of Doctorate Recipients
Academic and Non-Academic Employment Growth
Non-academic employment growth accounts for the majority of jobs for doctorates since the 1980s. Growth in doctorates is higher than academic employment growth.
(Source 1981-2010 Survey of Doctorate Recipients).
The Academic Labor Market (Ref: Ginther, U Kansas)
• In academia, Supply of PhDs has outstripped Demand.
• While the non-academic sector has absorbed many of these doctorates, given the recent economic crisis and jobless recovery, it is not clear that it will continue to do so.
• The gap between Supply and Demand will likely grow larger given the current financial issues facing academia
PhDs in Biomedical & Clinical Fields 1980-2013
Source: NSF Survey of Earned Doctorates
Quantitative Biology is small relative to other Biomedical & Clinical fields.
McKinsey Report: Data Science
McKinsey (2011) estimates the US Economy will need
• 140,000 – 190,000 Data Scientists
• 1.5 million data-literate managers to take advantage of big data opportunities
• Data Science: an emerging field that combines skills of:
– Data Management
– Computer Science
– Statistical Methodology & Analysis
• Biomedical research should develop its capacity to train students in statistics and data science.
“Big Data”
• Era of massive data sets
• Recent explosion of biomedical data
– Genome sequence data
– Public health databases
• Need for new and better ways to make the most of this data
– Speed discovery and innovation
– Ultimately lead to improvements in the nation’s health and economy
• Data and Informatics Working Group
– Presented by Tabak & DeMets 2012 Report to NIH Director
Overview of Recommendations
Recommendation 3
Build capacity by training the workforce in the relevant quantitative sciences (e.g., bioinformatics, biomathematics, biostatistics, and clinical informatics)
– 3a. Increase funding for quantitative training and fellowship awards:
• Training of experts should grow to meet the increasing demand in this field
• Perform a supply versus demand gap analysis
• Develop a strategy to meet the demand
Recommendation 3 (cont.)
– 3b. Enhance review of quantitative training applications:
• Specialized quantitative training grants are often not reviewed by those with the most relevant experience
• Consider formation of a new study section focused on the review of quantitative science training grants
– 3c. Create a required quantitative component for all NIH training and fellowship awards:
• Equip the clinical and biological scientist workforce with basic proficiency in the use of quantitative tools
• Draw on the Clinical and Translational Science Awards (CTSAs) centers in developing the curriculum for a core competency
62 CTSA Sites
Office of Biomedical Data Science
• A newly created office in the NIH Director’s Office: Associate Director for Biomedical Data Science
• Recruited Phil Bourne, PhD as Director
• Mission: To foster an open ecosystem that enables biomedical research to be conducted as a digital enterprise that enhances health, lengthens life and reduces illness and disability & to train the next generation of data scientists
Community – BD2K Awards
BD2K training investment on par with medium-sized ICs
Note about BD2K data: Although NIH planned to commit $12M in FY2016 for T programs, a total of $20M is planned for all BD2K Training programs.
BD2K and Pedagogy
• Lawrence Summers in the January 20, 2012 New York Times: “What You (Really) Need to Know.”
• Posed the question: How will what universities teach be different?
• Education will be more about how to process and use information and less about imparting it.
• Courses of study will place much more emphasis on the statistical analysis of data.
• This perspective informs how we should teach biomedical researchers in the future.
Training Students to Extract Value from Big Data
(IOM Report: 2014)
• Train “to extract value” from big data
• Must Be Multidisciplinary
– Computation, statistics, visualization, bioinformatics, data management
– Communication, problem solving
• Training for multiple audiences
• Need training menu
– PhD & MS, Certificates, Workshops
Our Future Biostatistical Culture
• Find ways to contribute & make a difference
– Engage in the science
– Conduct rigorous & innovative analyses
– Develop methodology as needed
– Respect & learn the other quantitative sciences
– Train, Train, Train
• If we fail, the gap will be filled by others with less insight and training
• We have a tremendous opportunity!
Acknowledgements
• Mark Craven (UW-Madison)
• Bernie Lo (UCSF) & IOM
• Donna Ginther (U of Kansas)
• Russ Altman (Stanford)
• Phil Bourne (NIH)
• Larry Tabak & Francis Collins (NIH)
• Many others