The Hutch Data Commonwealth
Matthew Trunnell
CIO & VP of IT
Executive Director, Hutch Data Commonwealth
Fred Hutchinson Cancer Research Center
TechConnect Conference
13 March 2018
© Fred Hutchinson Cancer Research Center
Photo credit: Len Rubenstein
Cost per Genome
200 TB
1 PB
10 PB
New capabilities needed
IT infrastructure
• Expansive storage services with deep security and effective tools for consent management and access control
• An elastic computing infrastructure for running analyses without bound
Data services
• Biomedical data stores leveraging and driving global standard interfaces
• Methods stores that enable users and groups across the research community to share, discover and build off each other’s methods and ideas
Analysis engine
• A comprehensive suite of methods for high-dimensional data analysis for use in both research and clinical environments
• New approaches to data mining leveraging applied machine learning
Knowledge sharing and dissemination
• Knowledge bases that seamlessly weave internal and external knowledge and data
• Researcher and clinician portals that leverage all of the above to provide customized, user-driven views of the data
New competencies needed
IT infrastructure: DevOps
Data services: Data Engineering
Analysis engine: Data Science
Knowledge sharing and dissemination: UI/UX
RESEARCH EXCELLENCE
295 Total faculty members
3000+ Total employees
NCI-Designated Comprehensive Cancer Center
5 Research Divisions
• Basic Sciences
• Clinical Research
• Human Biology
• Public Health Sciences
• Vaccine & Infectious Disease
3 Nobel laureates
$284,704,566 NIH funding FY17
1,072 Scientific papers published FY18 YTD
JAMA. 2014;311(24):2479-2480. doi:10.1001/jama.2014.4228
Screening for Pancreatic Adenocarcinoma Using Signals From Web Search Logs: Feasibility Study and Results
John Paparrizos, MSc, Ryen W. White, PhD, and Eric Horvitz, MD, PhD
Columbia University, New York, NY; and Microsoft Research, Redmond, WA
DOI: 10.1200/JOP.2015.010504; published online ahead of print at jop.ascopubs.org on June 7, 2016.
ASSOCIATED CONTENT
Appendix DOI: 10.1200/JOP.2015.010504
Abstract
Introduction
People’s online activities can yield clues about their emerging health conditions. We performed an intensive study to explore the feasibility of using anonymized Web query logs to screen for the emergence of pancreatic adenocarcinoma. The methods used statistical analyses of large-scale anonymized search logs considering the symptom queries from millions of people, with the potential application of warning individual searchers about the value of seeking attention from health care professionals.
Methods
We identified searchers in logs of online search activity who issued special queries that are suggestive of a recent diagnosis of pancreatic adenocarcinoma. We then went back many months before these landmark queries were made, to examine patterns of symptoms, which were expressed as searches about concerning symptoms. We built statistical classifiers that predicted the future appearance of the landmark queries based on patterns of signals seen in search logs.
Results
We found that signals about patterns of queries in search logs can predict the future appearance of queries that are highly suggestive of a diagnosis of pancreatic adenocarcinoma. We showed specifically that we can identify 5% to 15% of cases, while preserving extremely low false-positive rates (0.00001 to 0.0001).
Conclusion
Signals in search logs show the possibilities of predicting a forthcoming diagnosis of pancreatic adenocarcinoma from combinations of subtle temporal signals revealed in the queries of searchers.
INTRODUCTION
Pancreatic adenocarcinoma poses a difficult and resistant challenge in oncology. It is the fourth leading cause of cancer death in the United States and is the sixth leading cause of cancer death in Europe.[1] The illness is frequently diagnosed too late to be treated effectively[2,3] and can progress from stage I to stage IV in just over 1 year.[4] Approximately 75% of patients with pancreatic adenocarcinoma who are not candidates for surgery will die within 1 year of diagnosis, and only 4% will survive for 5 years postdiagnosis.[5]
Early signs and symptoms of pancreatic adenocarcinoma are subtle and often
Original Contribution: Focus on Quality
Copyright © 2016 by American Society of Clinical Oncology. All rights reserved.
The Hutch Data Commonwealth
Vision: Enable investigators to leverage all possible data in the effort to eliminate disease by driving the development of data infrastructure and data science capabilities through collaborative research, strategic partnering and robust engineering.
Corollary: Enable investigators to trace backwards from publication to analysis results to laboratory results to biospecimens in freezers.
Areas of focus for the Commonwealth
• High-dimensional data integration
• Natural language processing (NLP) to extract structured information from clinical records
• Application of “deep learning” to medical image analysis
• Collection and analysis of data from mobile devices
• Data management and discovery
• Developing partnerships with regional technology and research organizations
Hutch Center IT (~90 FTE)
• Service Desk
• Systems Engineering
• Network Engineering
• Information Security
• Enterprise applications
• Project management office
• Scientific computing
Hutch Data Commonwealth (~45 FTE)
• Product engineering
• Software development
• Data engineering
• Product management
• Clinical Informatics
• Applied data science
• Partnership development
Data Commons
https://kids.nationalgeographic.com/explore/space/black-holes/
Data have gravity.
Disease-specific Commons
OpenAPS Data Commons, Nightscout Data Commons on the Open Humans platform
Data-specific Commons
Federal Research Commons
Institutional Commons
Data Commons: Cyber-infrastructure that co-locates data, storage, and computing infrastructure with commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.
Requirements:
• Permanent digital IDs
• Permanent metadata
• API-based access
• Data portability
• Data peering
• Pay-for compute
Grossman, R. et al. 2017. A Case for Data Commons: Toward Data Science as a Service. arXiv:1604.02608 [cs.CY]
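To make the requirements list more concrete, here is a toy in-memory sketch of how a few of them might fit together. It is an illustration only, not the design described by Grossman et al. or anything built at the Hutch: `ToyCommons`, `DataObject`, and every method name are hypothetical, and pay-for compute is left out entirely.

```python
import uuid
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class DataObject:
    """One dataset registered in the hypothetical commons."""
    object_id: str               # permanent digital ID, never reassigned
    metadata: Mapping[str, str]  # read-only view, standing in for permanent metadata
    location: str                # where the bytes currently live


class ToyCommons:
    """Hypothetical, in-memory stand-in for a data commons registry."""

    def __init__(self):
        self._objects = {}

    def register(self, metadata, location):
        """Assign a new permanent ID and freeze a copy of the metadata."""
        object_id = str(uuid.uuid4())
        frozen = MappingProxyType(dict(metadata))
        self._objects[object_id] = DataObject(object_id, frozen, location)
        return object_id

    def resolve(self, object_id):
        """API-based access: look an object up by its permanent ID."""
        return self._objects[object_id]

    def relocate(self, object_id, new_location):
        """Data portability: the bytes move, the ID and metadata do not."""
        old = self._objects[object_id]
        self._objects[object_id] = DataObject(old.object_id, old.metadata, new_location)

    def peer_with(self, other, object_id):
        """Data peering: a second commons learns the same ID and record."""
        other._objects[object_id] = self._objects[object_id]
```

In this sketch a caller holds only the ID returned by `register`; `resolve` keeps returning the same metadata even after `relocate` changes where the data is stored, which is the property the first two requirements are after.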
Key Takeaways
Big data requires new infrastructure, investment, and organization.
Research computing is becoming more costly.
It’s time we start sharing.