Big Data, Big Opportunities, Big Challenges A Version of: 2015 ENAR Presidential Address David L...

Big Data, Big Opportunities, Big Challenges

A Version of:

2015 ENAR Presidential Address

David L DeMets, PhD

University of Wisconsin-Madison

Conflict of Interest (COI)

• I consult with
– NIH
– FDA
– IOM
– Industry
  • Pharmaceuticals
  • Medical Devices

• I currently serve on more than 10 active DMCs

• I hold no stock in the biopharm/device industry

• I am not selling any medical products

Topics

• Making a Difference

• Data Tsunami

• IOM Report: Genomic Predictors

• Algorithms as a Medical Device

• IOM Report: Data Sharing

• Need for Quantitative Training

Looking in the Rearview Mirror

• Good to take a look in the rearview mirror from time to time

• Just don’t stare at it too long

• Look back 50-60 years to NIH

• Beginning of a build up of biostatistics and quantitative science

NIH Bethesda Campus

Temporary NIH Building T-6 (circa 1946+), NHLBI

Office of Biometry: First Generation (J Ellenberg)

Additional Early NIH Biostatisticians

• Edmund Gehan
• Marvin Zelen
• Seymour Geisser
• Fred Ederer
• Dan Siegel
• Hal Kahn
• George Weiss
• David Alling
• Felix Moore
• John Gart
• John Bartko
• Sid Cutler
• Bill Haenszel
• ……
• There were many others not at NIH

Their Biostatistical Culture

• Make a difference

• Make yourself useful

• Engage in the science

• Find a way to contribute

• Let collaborations drive methodology

• My job description in 1972

A Case Study: Cardiovascular Disease

• During WWII/Korean War, surgeons noticed young healthy men already had evidence of atherosclerosis

• By 1950, heart disease was an epidemic

• Largest cause of mortality by far as well as morbidity


Cardiovascular Disease: a Major Problem in 1950

• How to start: Framingham Heart Study

• Began in 1948 by NIH & BU (Felix Moore, T Dawber, W Kannel, W Castelli,…)

• Enlisted 5,209 men & women (ages 30-62) from Framingham, MA in a long-term cohort study

• After a decade of follow-up, used logistic regression models to identify major CVD risk factors (Cornfield)

• Many analyses done during the 1970’s (T Gordon, D McGee, M Hjortland,….)

Framingham Heart Study: Some Identified Risk Factors

• Age
• Gender
• Smoking
• Blood Pressure
• Cholesterol
• Diabetes
• Weight
• …….

• It was understood that modifying risk factors would have to be tested in an RCT

• Observational studies not sufficient

NHLBI Launched a Wave of RCTs in 1960’s & 1970’s

• Coronary Drug Project (CDP)

• Urokinase Pulmonary Embolism Trial (UPET)

• Hypertension Detection Followup Program (HDFP)

• Multiple Risk Factor Intervention Trial (MRFIT)

• Lipid Research Clinics (LRC) trials

• Aspirin Myocardial Infarction Study (AMIS)

• Coronary Artery Surgery Study (CASS)

• Betablocker Heart Attack Trial (BHAT)

• Several more in pulmonary disease

Early NHLBI Trial Results

• Reduction in mortality/morbidity by lowering blood pressure

• Lowering cholesterol could reduce mortality but depended on drug, some had modest effect

• Effective use of coronary bypass surgery – not for everyone

• Reduction in mortality post MI using betablockers

• Clot buster drugs improved patients with pulmonary embolism, later MI patients

• …..

Early Statistical Contributions & Lessons Learned

• Assessing risk factors using regression models (Cornfield,..)

• Design of long term clinical trials (Halperin, et al)
– Changing risk, treatment effect, adherence over time
– Screening patients for trials based on risk scores
– Non-inferiority trials, i.e. “proving the null hypothesis”

• Sequential methods (Halperin, Ware, DeMets, Lan..)
– Curtailment methods
– Conditional Power
– Group Sequential & Alpha Spending

• Analysis
– “Intention to Treat” & Informative Censoring
– Longitudinal Data (Ware, Wu,..)

• ……….
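The "Alpha Spending" item above can be illustrated with a small sketch. It assumes the two commonly cited Lan-DeMets spending functions (an O'Brien-Fleming-type and a Pocock-type form) and a two-sided overall alpha of 0.05; the function names are hypothetical and nothing here is taken from the slides themselves.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def obf_type_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 - 2*Phi(z_{alpha/2} / sqrt(t))."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * _STD_NORMAL.cdf(z / math.sqrt(t))


def pocock_type_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


# Cumulative type I error "spent" at four equally spaced interim looks
# (the look schedule is an assumption made only for this illustration).
looks = [0.25, 0.50, 0.75, 1.00]
spent_obf = [obf_type_spending(t) for t in looks]
spent_pocock = [pocock_type_spending(t) for t in looks]
```

Both functions reach the full alpha at t = 1; the O'Brien-Fleming-type form spends very little early, which is why it is often preferred when early stopping should require strong evidence.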

Biostatistics Culture: The Early Years

• An amazing period during 1960’s & 1970’s

• Common theme was to operate as statistical collaborators, not consultants

• Let statistical methodology be driven by collaborative projects

• Had time to discuss, debate, develop new methods or modifications

• First generation mentored second generation

• Led to a large MS & PhD level team nationally

• Contributions resulted in better research, better trials, better treatments and better health

Fast Forward to 2015

• Era of Big Data

• Era of Computation

• Era of Data Sharing

• Era of Reproducibility Issues

• Era of Quantitative Science

• Era of Big Challenges & Incredible Opportunity

“Big Data”

• Era of massive data sets

• Recent explosion of biomedical data
– Genome sequence data
– Public health databases
– Personal monitoring

• Need for new and better ways to make the most of this data
– Speed discovery and innovation
– Improve the nation’s health and economy

The Five V’s of Big Data (Ref: B Marr: BIG DATA)

• The defining properties of Big Data are often described as

• Volume: scale of available data

• Velocity: speed at which new data is generated and transmitted

• Variety: different types of data available

• Veracity: levels of accuracy

• Value: meaningfulness!

The Era of Big Data

Biomedical Big Data – variety

• genotypes
• electronic health records
• images
• molecular profiles
• text sources
• drug profiles
• wearable sensor data
• environmental data

Biomedical Big Data – volume & velocity: Electronic Health Records

source: CDC

Butte’s Take Home Points (SCT Presidential Address 2013)

• Molecular, clinical, trials, and epidemiological data and tools exist

• Can contribute to diagnostics and therapeutics

• Personalized medicine ≥ DNA: needs to include other clinical, molecular, and environmental measures

• Public big data is highly enabling; we need to learn to take advantage of it

Amazing Data Available (Butte 2013 SCT Talk)

• NCBI/Gene Expression Omnibus (GEO)
– Over 1 M microarrays available

• dbGaP (Genotype & Phenotype)
– E.g. Framingham Heart Study

• PubChem
– >100M substances x 650 assays

• Conversant Bio
– Browse by disease

Available Data (2)

• Assay Depot
– Find latest services available worldwide at lowest costs
– Can do animal testing at very low cost

• Complete Genomics
– Sets of public genome sequences

• COMING SOON
– Patient level data from RCTs (IOM Report)
– De-identified Data Warehouse from EHRs

RCTs & Big Data

• Some argue that we already collect too much data in our standard RCTs

• Now we have the potential to collect even more, a lot more!

• Costs of traditional trials are escalating

• Phase III failure rate too high

• More trials going offshore

• Current system not sustainable

Estimated Costs of Drug Development ($Millions)

Fee R: The cost of clinical trials. Drug Discovery and Development Webcast. March 1, 2007

DiMasi JA et al: The price of innovation: new estimates of drug development costs. J Health Economics 2003; 22:151-185

DiMasi JA: New drug development in the United States from 1963 to 1999. Clin Pharmacol Ther 2001; 69:286-96

Protocol Complexity and Burden Across All Phases

                       2000-03   2004-07   % Change
Unique procedures        20.5      28.2       38%
Total procedures        105.9     158.1       49%
Total site work          28.9      44.6       54%
Eligibility criteria       31        49       58%
Enrollment rate           75%       59%      -21%
Retention rate            69%       48%      -30%

Kaitin KI: Tufts Center for the Study of Drug Development Impact Report. May/June 2010, Volume 12 (3).
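As a quick arithmetic check on the table above, the sketch below recomputes the "% Change" column from the two period values. It assumes the column is a relative change, (later - earlier) / earlier, which also holds for the enrollment and retention rows: those are relative changes in the rates, not percentage-point differences. The variable and function names are illustrative only.

```python
# (measure, 2000-03 value, 2004-07 value) as listed in the table
rows = [
    ("Unique procedures", 20.5, 28.2),
    ("Total procedures", 105.9, 158.1),
    ("Total site work", 28.9, 44.6),
    ("Eligibility criteria", 31.0, 49.0),
    ("Enrollment rate (%)", 75.0, 59.0),
    ("Retention rate (%)", 69.0, 48.0),
]


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Rounded to whole percents, these match the reported column.
recomputed = [round(pct_change(before, after)) for _, before, after in rows]
```

For example, enrollment falling from 75% to 59% is a 16-point drop, but a 21% relative decline, which is the figure the table reports.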

Globalization of Clinical Trials

• Increase in non-US clinical trials

• According to a DHHS Inspector General Report,
– 80 percent of the drugs approved for sale in 2008 had trials in non-US countries
– 80 percent of all subjects who participated in clinical trials were enrolled at non-US sites

• One-third of CTs by the 20 largest US-based drug companies are conducted solely outside the US (Glickman et al, NEJM, 2009)

Big Data provides not just challenges, but opportunities

Evolution of Translational Omics

• Institute of Medicine, The National Academies Press, 2012

Omics Based Tests

– Difficulty in defining the biological rationale underpinning an omics-based test
  • Biological rationale for single-analyte tests is often evident
  • Examples: HER2, LDL

– Challenges in data provenance and sharing
  • Large, complex datasets used to create computational models
  • Simple data management errors can easily occur
  • Sharing of clinical data and code is not routine
  • Difficult for other scientists to replicate and verify findings

Omics Based Tests (2)

– Need for expertise in multiple disciplines, a TEAM sport
  • Biologists, geneticists, statisticians, bioinformaticians, clinical pathologists,…..
  • Responsibility for an omics-based test is shared among many
  • No single investigator has the breadth of expertise needed

– No widely accepted process for translation of an omics test into clinics

Three Stages of Omics Test Development (True for other predictors)

1. Discovery

2. Test Validation

3. Evaluation for Clinical Utility and Use

Omics-Based Test Development Framework

Stage I: Discovery Phase

– Candidate test is developed on a training set

– Candidate test is locked down
  • Training set data
  • Computational algorithm

– Candidate test is evaluated & confirmed on an independent sample set

– Computational code and test data must be available
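The discovery-phase sequence (develop on a training set, lock down, then confirm on an independent set) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the IOM procedure: the one-parameter "model", the fingerprinting step, all function names, and the numeric values are made up for the example.

```python
import hashlib
import json


def fit_cutoff(scores, labels):
    """Toy 'candidate test': cutoff halfway between the two class means."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return {"cutoff": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0}


def fingerprint(model) -> str:
    """SHA-256 of the serialized parameters, recorded at lock-down."""
    payload = json.dumps(model, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def accuracy(model, scores, labels) -> float:
    """Fraction of samples the locked test classifies correctly."""
    hits = sum((s >= model["cutoff"]) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


# 1. Develop on the training set (synthetic values for illustration only).
train_scores = [0.2, 0.4, 0.5, 1.6, 1.8, 2.0]
train_labels = [0, 0, 0, 1, 1, 1]
locked_model = fit_cutoff(train_scores, train_labels)

# 2. Lock down: record the fingerprint before any confirmation data is seen.
locked_id = fingerprint(locked_model)

# 3. Confirm on an independent sample set, checking nothing was re-tuned.
indep_scores = [0.1, 0.9, 1.3, 2.2]
indep_labels = [0, 0, 1, 1]
if fingerprint(locked_model) != locked_id:
    raise RuntimeError("model changed after lock-down")
indep_accuracy = accuracy(locked_model, indep_scores, indep_labels)
```

Recording a fingerprint at lock-down is one simple way to make later re-tuning detectable; sharing the code and data, as the slide asks, lets others repeat the check.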

Stage II: Test Validation Phase

• Regulatory oversight for omics-based tests differs from drug development

• An omics-based test consists of both
  • the data-generating assay and
  • the fully specified computational model

• Needs to be validated on an independent blinded data set

• FDA approval or clearance as a device, or use as a Laboratory Developed Test (LDT) in a CLIA laboratory

Stage III: Evaluation for Clinical Utility - Three Pathways

• Prospective–retrospective studies using archived specimens from previously conducted clinical trials

• Prospective clinical trials that shadow clinical practice, where the test does not impact practice

• Prospective randomized clinical trials that directly address the utility of the omics-based test, where the test does direct patient management by assessing risk or treatment response

Evaluation for Clinical Utility and Use

When is an Algorithm a Device?

FDA Regulations for Medical Software and Mobile Medical Apps

Morgridge Institute for Research, UW-Madison

Sept 2014

Keynote Speaker: Seth A. Mailhot

Michael Best & Friedrich LLP

Definition of a “Device”

• FDA (Sec. 201(h)) defines a device as an instrument, apparatus, implement, machine, contrivance, . . . or other similar or related article, including any component, part, or accessory, which is intended:
– for use in the diagnosis of disease or other conditions,
– in the cure, treatment, or prevention of disease,
– to affect the structure or function of the body

Software Categories

• Software allowing the user to input patient-specific information along with reference material to automatically diagnose a disease or condition is a device

• Generic software aids promoted for medical applications are devices

• Algorithms implemented via software for medical intervention need FDA oversight and discussion

• This would include genomic predictors, medical apps, remote monitoring

IOM Committee on Responsible Sharing of Clinical Trial Data

2015 IOM Report

IOM Study Context

• Requested by NIH, sponsors & public organizations

• Responsible clinical trial data sharing is in the public interest to advance science

• Many CTs now not analyzed and published in a timely manner

• For 1/3 of trials, results not published after 4 years

• Already a momentum for data sharing

• Question is not whether to share, but what types of clinical trial data, when, & how to share

Key Benefits of Sharing include:

• Allows other investigators to carry out additional analyses and reproduce published findings

• Strengthens evidence base for regulatory decisions & clinical guidelines

• Increases scientific knowledge gained from investments by funders

• Maximizes contributions of participants and avoids unnecessary duplicative trials

• Stimulates new ideas for research

Key Risks and Challenges include:

• The need to:
– Protect participant privacy and honor consent (e.g. HIPAA)
– Safeguard legitimate economic interests of sponsors (IP)
– Guard against invalid secondary analyses
– Give researchers time to conduct analyses, publish and get credit for sharing (academic currency)

Key Stakeholders

• Trial Participants
• Investigators
• Institutions
• Funders / Sponsors
• Regulatory Agencies
• Research Ethics Groups
• Journals
• Professional Societies
• Patient Advocacy Groups

Data Flow for CTs

Types of Data

• Raw Patient Level Data
– CRFs
– Lab data including x-rays, MRI, EKGs, etc.
– QoL instruments

• Meta Data

• Analyzable Data

• Analyzed Data
– Only a small portion of analyzable data is really analyzed for publications

• Clinical Study Reports (CSRs)

• Publications

IOM Recommendations

• Rec 1: Stakeholders in clinical trials should
– foster a culture in which data sharing is the expected norm,
– commit to responsible strategies.

• Rec 2: Sponsors and investigators should share the various types of clinical trial data no later than the following timelines (when & what)

Rec 2: Timeliness & Milestones

• Trial Registration (e.g. ClinicalTrials.gov)
– Protocol, Data Sharing Plan, SAP

• Study Completion (LPLV)
– Within 12 months, should provide
– Summary Level Results
– Lay/Public Presentation

• Publication
– Within 6 months, should provide
– Patient Level analyzed (de-identified) data supporting paper
– Full protocol, SAP & Analytic Code

Rec 2: Timeliness & Milestones (2)

• Study Completion-II
– Within 18 months, should provide
– Full analyzable de-identified data set, protocol, SAP & analytic code

• Trial submission for Regulatory Approval
– Within 30 days after approval or within 18 months of product abandonment
– Full analyzable de-identified data set, protocol, SAP & redacted Clinical Study Report (CSR)
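The Rec 2 milestones amount to simple date arithmetic, sketched below. The slides do not say how months are counted, so this assumes calendar months with the day clamped to the end of shorter months; the function names and dictionary keys are hypothetical labels, not terms from the IOM report.

```python
import calendar
from datetime import date, timedelta
from typing import Dict, Optional


def add_months(start: date, months: int) -> date:
    """Shift by calendar months, clamping the day (Aug 31 + 6 -> Feb 28/29)."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def sharing_deadlines(
    completion: date,
    publication: Optional[date] = None,
    approval: Optional[date] = None,
) -> Dict[str, date]:
    """Latest sharing dates implied by the Rec 2 milestones."""
    deadlines = {
        "summary-level results": add_months(completion, 12),
        "full analyzable data set": add_months(completion, 18),
    }
    if publication is not None:
        deadlines["data supporting the paper"] = add_months(publication, 6)
    if approval is not None:
        deadlines["post-approval data package"] = approval + timedelta(days=30)
    return deadlines


# Illustrative dates only; they do not refer to any real trial.
example = sharing_deadlines(
    completion=date(2015, 3, 15),
    publication=date(2015, 8, 31),
    approval=date(2016, 1, 1),
)
```

The product-abandonment branch (18 months) is omitted to keep the sketch short; it would follow the same pattern as the completion deadlines.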

Rec 3: Data Access: How?

Open Access

• De-identified data

• No restrictions on access, no conditions for use

• Appropriate for sharing clinical trial results (e.g. on ClinicalTrials.gov)

• Individual Patient Data and CSRs present risks that generally need to be mitigated through appropriate controls

Rec 3: Data Access: How?

Controlled

• Not a single model, but a spectrum of controls and conditions

• De-identification

• Use of Registration and Data Use Agreements

• Central or federated data model

• Sharing data in non-downloadable but analyzable format

• Review of data requests (by sponsor or independent review panel)

Rec 4: Address Remaining CT Data Sharing Challenges

• Infrastructure- insufficient platforms to store and manage data

• Technological- current platforms are not consistently discoverable, searchable, and interoperable

• Workforce- lacks skills and knowledge to manage operational and technical aspects

• Sustainability- current model costs are borne by small subset of sponsors, funders and trialists, and is unsustainable.

Biomedical Workforce Training

•Biomedical Research Workforce Working Group Report to the Advisory Committee to the Director (July 2012)

– Shirley Tilghman, Ph.D., President, Princeton University, N.J., co-chair

– Sally Rockey, Ph.D., NIH Deputy Director for Extramural Research, co-chair

– http://acd.od.nih.gov/Biomedical_research_wgreport.pdf


Biomedical Workforce Training (2)

• Data and Informatics Working Group Report to The Advisory Committee to the Director (ACD) (June 2012)
– David DeMets, Ph.D., Professor, UW-Madison; co-chair
– Lawrence Tabak, D.D.S., Ph.D., Principal Deputy Director, NIH; co-chair

• Rescuing US biomedical research from its systemic flaws by Alberts B, Kirschner M, Tilghman S & Varmus H; PNAS, April, 2014

Current Biomedical Workforce: Where Are We?

• 2015 AAAS Session: Crisis in Quantitative Training for Biomedical Science

– Key speaker: Donna Ginther PhD, Univ of Kansas

• Work based on 1980-2010 waves of Survey of Earned Doctorates and Survey of Doctorate Recipients

Academic and Non-Academic Employment Growth

Non-academic employment growth accounts for the majority of jobs for doctorates since the 1980s. Growth in doctorates is higher than academic employment growth.

(Source 1981-2010 Survey of Doctorate Recipients).

The Academic Labor Market (Ref: Ginther, U Kansas)

• In academia, Supply of PhDs has outstripped Demand.

• While the non-academic sector has absorbed many of these doctorates, given the recent economic crisis and jobless recovery, it is not clear that it will continue to do so.

• The gap between Supply and Demand will likely grow larger given the current financial issues facing academia

PhDs in Biomedical & Clinical Fields 1980-2013

Source: NSF Survey of Earned Doctorates

Quantitative Biology is small relative to other Biomedical & Clinical fields.

McKinsey Report: Data Science

McKinsey (2011) estimates the US economy will need

• 140,000 – 190,000 Data Scientists

• 1.5 million data-literate managers to take advantage of big data opportunities

• Data Science: an emerging field that combines skills of
– Data Management
– Computer Science
– Statistical Methodology & Analysis

• Biomedical research should develop its capacity to train students in statistics and data science.

“Big Data”

• Era of massive data sets

• Recent explosion of biomedical data
– Genome sequence data
– Public health databases

• Need for new and better ways to make the most of this data
– Speed discovery and innovation
– Ultimately lead to improvements in the nation’s health and economy

• Data and Informatics Working Group
– Presented by Tabak & DeMets, 2012 Report to NIH Director

Overview of Recommendations

Recommendation 3

Build capacity by training the workforce in the relevant quantitative sciences (e.g., bioinformatics, biomathematics, biostatistics, and clinical informatics)

– 3a. Increase funding for quantitative training and fellowship awards:
  • Training of experts should grow to meet the increasing demand in this field
  • Perform a supply versus demand gap analysis
  • Develop a strategy to meet the demand

Recommendation 3 (cont.)

– 3b. Enhance review of quantitative training applications:
  • Specialized quantitative training grants are often not reviewed by those with the most relevant experience
  • Consider formation of a new study section focused on the review of quantitative science training grants

– 3c. Create a required quantitative component for all NIH training and fellowship awards:
  • Equip the clinical and biological scientist workforce with basic proficiency in the use of quantitative tools
  • Draw on the Clinical and Translational Science Awards (CTSAs) centers in developing the curriculum for a core competency

62 CTSA Sites

Office of Biomedical Data Science

• A newly created office in the Director’s Office: Associate Director for Biomedical Data Science

• Recruited Phil Bourne, PhD as Director

• Mission: To foster an open ecosystem that enables biomedical research to be conducted as a digital enterprise that enhances health, lengthens life and reduces illness and disability & to train the next generation of data scientists

Community – BD2K Awards

BD2K training investment on par with medium-sized ICs

Note about BD2K data: Although NIH planned to commit $12M in FY2016 for T programs, a total of $20M is planned for all BD2K Training programs.

BD2K and Pedagogy

• Lawrence Summers in the January 20, 2012 New York Times: “What You (Really) Need to Know.”

• Posed the question: How will what universities teach be different?

• Education will be more about how to process and use information and less about imparting it.

• Courses of study will place much more emphasis on the statistical analysis of data.

• This perspective informs how we should teach biomedical researchers in the future.

Training Students to Extract Value from Big Data (IOM Report: 2014)

• Train “to extract value” from big data

• Must Be Multidisciplinary
– Computation, statistics, visualization, bioinformatics, data management
– Communication, problem solving

• Training for multiple audiences

• Need training menu
– PhD & MS, Certificates, Workshops

Our Future Biostatistical Culture

• Find ways to contribute & make a difference
– Engage in the science
– Conduct rigorous & innovative analyses
– Develop methodology as needed
– Respect & learn the other quantitative sciences
– Train, Train, Train

• If we fail, the gap will be filled by others with less insight and training

• We have a tremendous opportunity!

Acknowledgements

• Mark Craven (UW-Madison)

• Bernie Lo (UCSF) & IOM

• Donna Ginther (U of Kansas)

• Russ Altman (Stanford)

• Phil Bourne (NIH)

• Larry Tabak & Francis Collins (NIH)

• Many others
