DIBBs Successes and Future Challenges
1st NSF Data Infrastructure Building Blocks PI Workshop (DIBBs17)
January 11-12, 2017
Irene M. Qualters
Director, Office of Advanced Cyberinfrastructure
Directorate for Computer & Information Science & Engineering
January 11, 2017
Image Credit: Exploratorium.
Overview
NSF, OAC and the Research Context
• Who are we?
• Where do we make investments?
DIBBs: A data portrait, 2013-16
Challenges and next steps
NSF AND THE RESEARCH CONTEXT
January 11, 2017 National Science Foundation 3
NSF by the Numbers
$7.72 billion FY 2016 budget request
94% funds research, education and related activities
50,000 proposals
11,000 awards funded
2,000 NSF-funded institutions
300,000 NSF-supported researchers
217 Nobel Prize winners
Fund research in all S&E disciplines
Fund STEM education & workforce
NSF’s mission encompasses all areas of science and engineering
NSF support is critical to the US academic science and engineering communities
Source: NSF/NCSES, Survey of Federal Funds for Research & Development, FY 2014
NSF support of Academic Basic Research (as a percentage of total federal support)
NSF Addresses National Priorities through Support of Fundamental Research
• Food/Energy/Water
• Understanding the Brain
• INCLUDES
...and thus requires a highly capable, highly interoperable Research Infrastructure
Example: LIGO detection of gravitational waves
LIGO relied on a portfolio of advances in computational science, software, hardware, and expert services:
Researcher sustained access to diverse and interoperable CI
o Massive, parallel event searches and validation;
o New high performance simulations of numerical relativity and magnetohydrodynamics;
o Support by multiple agencies and international funders
• Network upgrades at campus, national, international levels
• HPC services and resources: Open Science Grid (OSG); Comet (SDSC); Blue Waters (UIUC); XSEDE
• Computational science advances embodied in Software Infrastructure
• Simulations
• Visualizations
• Workflow and dataflow
Office of Advanced Cyberinfrastructure (OAC) Program Staff
Office Director: Irene Qualters
Office Deputy Director: A. Friedlander
Public Access: P. Knezek
Cooperative Agreements: Alejandro Suarez
[Organization chart boxes: Data; High Performance Computing; Networking/Cybersecurity; Software; Science Advisor; Cross-cutting CI; Learning/Workforce Development]
Program staff (as listed on the chart): R. Chadduck, A. Walton, R. Chadduck, R. Eigenmann, E. Walker, A. Nikolich, K. Thompson, R. Ramnath, V. Chaudhary, W. Miller, S. Prasad
OAC supports Research Cyberinfrastructure to uniquely enable collaboration and discovery frontiers at all scales
Gateways, Hubs, and Services
Cloud Resources & Services
CI-Enabled Instrumentation
Computing Resources
Data Networks, Cybersecurity
Coordination & User support
Software, Applications, Workflow Systems
Shared resources, capabilities & services across the scientific workflow
Best Practices in Data Infrastructure Workshop
May 17-18, 2016
Pittsburgh, PA
PSC hosted a workshop on Best Practices in Data Infrastructure to bring together developers and users of advanced cyberinfrastructure relating to data management and analytics. The workshop was designed with the following groups in mind: awardees of NSF DIBBs and DataNet projects, leads for acquisitions having data as a major focus, and users with challenging data requirements. The workshop was an excellent opportunity for NSF ACI developers and users to interact. Goals of the workshop included disseminating significant results, creating opportunities for collaboration between data cyberinfrastructure projects, and identifying gaps where users need additional innovation or resources.
Community Input Critical to NSF CI Planning
NSF is launching an effort to refresh the Foundation’s cyberinfrastructure vision and strategy, as the current activity, Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21), enters its final year.
Through this Request for Information, NSF invites contributions from the research community to inform this planning effort.
We request input on scientific challenges, associated cyberinfrastructure needs, and bold forward-looking ideas to advance science and engineering frontiers over the next decade and beyond.
Deadline for submissions: April 5, 2017, 5:00 PM ET.
Questions about this RFI? Send to [email protected].
https://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf17031
DIBBS: A DATA PORTRAIT, 2013-2016
DIBBs launched as a result of CIF21 Vision and Strategy
Crosscutting/NSF-wide CIF21 Initiative
Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21) is a portfolio of activities to provide integrated cyber resources that will enable new multidisciplinary research opportunities in all science and engineering fields by leveraging ongoing investments and using common approaches and components.
NSF 17-500 Data Infrastructure Building Blocks (DIBBs):
The NSF vision for a Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21) considers an integrated, scalable, and sustainable cyberinfrastructure to be crucial for innovation in science and engineering (see www.nsf.gov/cif21). The Data Infrastructure Building Blocks (DIBBs) program is an integral part of CIF21. The DIBBs program encourages development of robust and shared data-centric cyberinfrastructure capabilities, to accelerate interdisciplinary and collaborative research in areas of inquiry stimulated by data.
Summary of DIBBs funding, 2013-2016
Year   Category                            N    Value           Co-funding Directorates
2013   Implementation                      4    $27,521,583
2013   Conceptualization                   4    $429,392
2014   Early implementation                2    $9,830,819      EdHR; SBE
2014   Pilot demonstrations                16   $21,340,996     BIO; CISE; ENG; GEO; MPS; SBE
2015   Multi-campus; multi-institutional   5    $23,685,304     *co-located with CC*
2016   Pilot demonstrations                5    $1,946,064      BIO; EdHR; ENG; GEO; MPS
2016   Early implementations               8    $28,115,008     BIO; CISE; ENG; MPS; SBE
Total                                      44   $112,869,166
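As a quick consistency check on the funding summary, the sketch below re-adds the award counts and dollar values transcribed from the table and groups them by year. The variable names are illustrative only; the figures are the ones shown in the table.

```python
# Rows transcribed from the DIBBs funding summary: (year, category, awards, value in USD).
awards = [
    (2013, "Implementation", 4, 27_521_583),
    (2013, "Conceptualization", 4, 429_392),
    (2014, "Early implementation", 2, 9_830_819),
    (2014, "Pilot demonstrations", 16, 21_340_996),
    (2015, "Multi-campus; multi-institutional", 5, 23_685_304),
    (2016, "Pilot demonstrations", 5, 1_946_064),
    (2016, "Early implementations", 8, 28_115_008),
]

# Overall totals, to compare against the table's final row.
total_awards = sum(n for _, _, n, _ in awards)
total_value = sum(value for _, _, _, value in awards)

# Per-year dollar totals.
value_by_year = {}
for year, _, _, value in awards:
    value_by_year[year] = value_by_year.get(year, 0) + value

print(f"{total_awards} awards, ${total_value:,}")
for year in sorted(value_by_year):
    print(f"{year}: ${value_by_year[year]:,}")
```

The computed totals match the table's final row (44 awards, $112,869,166).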
Co-funding by directorate (detail)
DIBBs 2014 Funding Contributions
ACI 65%
BIO 1%
CISE 8%
EdHR 6%
ENG 3%
EPSCoR 0%
GEO 9%
MPS 3%
SBE 5%
DIBBs 2016 Funding Contributions
ACI 82%
BIO 2%
CISE 3%
EdHR 1%
ENG 2%
GEO 1%
MPS 8%
SBE 1%
Note: In 2013, there was no co-funding; in 2015, DIBBs was co-located with CC*.
Distribution Across Core Areas
Category                               N    Example Topics
Generation / Acquisition / Discovery   9    Access, New data types, Instrument data
Curation / Storage / Management        18   Provenance, Storage, Cloud resources, Automatic Curation, Repository Indices
Analysis / Modeling / Visualization    16   Tools, Data Integrity and Security, Spatial Data Analysis, Collaborative Data Analysis
Total                                  43
SOME OTHER “DATA” PROJECTS
Innovations at the Nexus of Food, Energy and Water
Developing research and societal capacity for robust decision support through integrated view of nation’s water data
Award 1639529 will enable the first comprehensive empirical map of the U.S. Food, Energy, and Water System. This web-based interactive map will model the impacts of economic production, consumption, and agricultural trade; political, economic, and regulatory stresses and shocks; water system; environmental flows; carbon dioxide emissions; and land use. (Ruddell, Sabo, Gurney, Shutters, and Hanemann, Northern Arizona University)
Robust and Reliable Science: Collaborative, Community Experiments in Brain Research
BrainLab CI (1649880) prototypes a cloud-based experimental-management system for reproducible science. The system will provide workflows, visualization, and analysis, and will draw on the principle of continuous integration (CI) from agile software engineering to enable users to define community experiments that open data sets and analyses to contributions from the global neuroscience community.
The system will be tested via creation of community experiments to study batch effects in MRI and spike-sorting algorithms in electrophysiology, the results of which will be shared with the community.
(Burns, Vogelstein, Miller/Johns Hopkins)
Charles Catlett, University of Chicago [Award #ACI-1532133] Co-funded by CISE/OAD, ENG/CBET, ENG/CMMI
MRI: Development of an Urban-Scale Instrument for Interdisciplinary Research
Broader Impacts:
• In partnership with the City of Chicago, 500 nodes will be mounted around the city by 2017.
• Many scientific disciplines will benefit from this new data source.
The ‘Array of Things’ instrument allows researchers to rapidly deploy sensors, embedded systems, computing, and communications systems at scale in an urban environment.
• This project funds the development and installation of AoT ‘nodes’: enclosures containing instruments for measuring temperature, barometric pressure, light, vibration, carbon monoxide, nitrogen dioxide, sulfur dioxide, ozone, ambient sound intensity, pedestrian and vehicle traffic, and surface temperature.
• All data collected by the nodes will be free and publicly available through the City of Chicago Data Portal and other open data platforms.
• Public health researchers will be able to study the relationship between diseases, which occur at higher rates in urban areas, and environmental conditions.
• Climate researchers will have higher resolution data than currently provided by existing weather stations to study urban micro-climates, with benefits for hyper-local weather forecasting and energy efficiency.
• Social scientists can study the dynamics of urban activity in public spaces and the effects on economics and livability.
H. Birali Runesha, University of Chicago [Award #ACI-1626552]
Data Lifecycle Instrument (DaLI) for Management and Sharing of Data from Instruments and Observations
DaLI Broader Impacts:
• Replicable data infrastructure
• Access to tools for data lifecycle management
• Integration with campus and national CI
• Outreach programs
The Data Lifecycle Instrument (DaLI) enables researchers to acquire, transfer, process, store, manage, and share data in a unified workflow, from sources including:
• Telescopes and Astronomical Arrays
• Ecological Field Stations and Sensors
• Massively Parallel Sequencers
• Microscopes
• Advanced Photon Source
• High Speed/High Definition Video Cameras
• Multidisciplinary Neuroscience Experiments
DaLI features scalable resources: HPC for pre- and post-processing of data, and a high-performance hierarchical storage pool.
[Images: South Pole Telescope; X-Ray Reconstruction of Moving Morphology Instrument; XENON1T Dark Matter Detector; Lightsheet Microscope]
[Diagram: DaLI data flow. Instrument: acquisition of data (XDM/MyTardis); processing to assist acquisition (LSM Tools); DaLI data management (Globus, GPFS, XDM); collaboration and sharing with external collaborators and the broader community (Globus, XDM); repositories for the community (SAGA); national cyberinfrastructures (XSEDE, CERN, ANL, etc.); field stations, research labs]
HPC/Big Data as an Enabler of NSF Big Ideas: Navigating the New Arctic
ArcticDEM: UMn/Morin
• Presidential Initiative for the Arctic Council; Interagency/Public/Private Partnership: NSF, NGA, ESRI and 5 Universities
• Publicly available, time-dependent, 2 m resolution elevation dataset, ~1-2 m vertical accuracy, covering 20 million km²
• When complete, the Arctic will have higher resolution continuous elevation data than the Western US
• Produced on Blue Waters over the next 2 years
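The scale implied by the resolution and coverage figures above can be estimated with simple arithmetic. This is only a rough sketch: the 4-byte cell size is an assumption, not a figure from the slides, and the estimate covers a single uncompressed snapshot, ignoring the time dimension.

```python
# Figures taken from the ArcticDEM bullets above.
AREA_KM2 = 20_000_000   # coverage: 20 million square kilometres
RESOLUTION_M = 2        # horizontal grid spacing in metres

# Assumption for illustration only: one 32-bit elevation value per grid cell.
BYTES_PER_CELL = 4


def dem_cell_count(area_km2: int, resolution_m: int) -> int:
    """Number of grid cells needed to tile the area at the given spacing."""
    area_m2 = area_km2 * 1_000_000          # 1 km^2 = 1,000,000 m^2
    return area_m2 // (resolution_m ** 2)   # each cell covers resolution^2 m^2


cells = dem_cell_count(AREA_KM2, RESOLUTION_M)
raw_bytes = cells * BYTES_PER_CELL

print(f"{cells:,} cells per snapshot")
print(f"about {raw_bytes / 1e12:.0f} TB uncompressed per snapshot")
```

Under these assumptions a single snapshot is on the order of trillions of cells, which helps explain why production was planned on a leadership-class system such as Blue Waters.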
CHALLENGES AND NEXT STEPS... FOR YOU AND US
How do we leverage your work and experience to advance Research CI?
Where are the gaps?
Where do you see yourselves making connections?
What has changed in the last 5 years and how do those changes create new opportunities for scientists as well as for the CI community?
An architectural vision for research cyberinfrastructure?
[Diagram labels: Science Drivers (UtB, NBO, INFEWS, S&CC, Facilities, MREFC): enabling and accelerating science drivers, including NSF initiatives & facilities; Discipline-specific Science APIs, Applications, Portals, Gateways; New Data Services: Access, Discovery, Deep Analytics, Semantics; Existing and new CI services; NSF-supported CI ecosystem; campus, national resources; international; private, commercial cloud; National/International Research and Education Network; Governance, policy, sustainability]
How well would it work? What are alternatives?
Some personal thoughts on research-driven challenges
Sustainability?
• Value of data vs. cost of data infrastructure over time
• Reusability of data infrastructure
• Commercial/commodity infrastructure
Priority/emphasis on Research Dataflows and Workflows?
• Interoperability in a “multi-cloud”, multi-institutional ecosystem
• Facilities and instruments
Contribution to Robust and Reliable Science?
• Reproducibility is a small aspect
• Credibility of analyses?
Incentives and career paths?