Towards Reproducible Data Analysis
Using Container Technologies
Sergio Maffioletti
EnhanceR project director
UZH/S3IT
https://www.enhancer.ch
Disclaimer
What I’m presenting here is the result of personal experience, plus the outcomes of various discussions within the EnhanceR project.
i.e.:
if you like the talk, congratulate me… if you don’t, blame EnhanceR
What are we going to talk about?
• Context
• What is the user story we have in mind?
• Let’s build the infrastructure support
• Let’s not stop here: building containers for/with end-users
• One more step: what do we put inside the container?
• Main challenges and open questions
Who is EnhanceR again?
What problems are we facing?
Reproducible data analysis
“Reproducibility is just collaboration with people you don’t know, including yourself next week”
— Philip Stark, UC Berkeley Statistics
Context
Repeatability (Same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.
Replicability (Different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.
Reproducibility (Different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
Let’s simplify...
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226.
What is the user story we have in mind?
On average, a researcher:
• develops on a personal server
• changes code and data as the research progresses
  • sometimes running on a large-scale research IT infrastructure
• finally gets publishable results
• prepares slides / images / tables / manuscript
• publishes the manuscript
  • at the end of a review process
What is the user story we have in mind?
Researcher’s side recommendations for Open Science:
● Share data, software, workflows, and other digital artifacts.
● Persistent links should appear in the published article for data, code, and digital artifacts.
● Citation should be standard practice, to enable credit for shared digital scholarly objects.
● Document digital scholarly artifacts, to facilitate reuse.
● Use open licensing when publishing digital scholarly objects.
What does this mean for a service provider?
● “Reproducible Data Analysis as a service” implies looking at the full stack of the service*
  ○ infrastructure + tools + competences + policies + best practices + support
● Why?
  ○ understand the user side: anticipate issues; steer adoption and development; enforce policies; plan resources better
● And in the end?
  ○ we become a valuable asset for a research group
  ○ we actually help them
* I know, I’m intentionally skipping the business aspect of this...
Let’s build the infrastructure
● what container technology
● orchestration
● integration with resource management
● storage for data and container images
● deployment and management
● monitoring
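As a sketch of the “integration with resource management” point above: on an HPC-style infrastructure the container runtime is typically invoked from a batch script, so provider-side conventions can be checked there. The following assumes Slurm and Apptainer; the image name, paths, and resource figures are all hypothetical.

```shell
# Write a batch script that runs the analysis inside a container.
# (Hypothetical image "analysis.sif" and data paths.)
cat > run_analysis.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Bind the project data directory into the container and run the pipeline
apptainer exec --bind /data/project:/data analysis.sif \
    python /opt/pipeline/run.py --input /data/raw --output /data/results
EOF
# A simple provider-side policy check: data access must go through
# an explicit bind mount, so it is visible in the job script.
grep -q -- '--bind' run_analysis.sbatch && echo "bind mount declared"
```

Submitting would then be a plain `sbatch run_analysis.sbatch`; the point is that the container invocation, and therefore the data mapping, lives in a reviewable artifact.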
Let’s build the infrastructure
● validation and verification
● automated policies
● scanning and signing
https://www.docker.com
https://www.enhancer.ch/pipeline
Let’s not stop here: building containers for/with end-users
what to consider
● automated builds / integration with CI/CD
● design strategies
● naming schema
● path binding
● documentation, metadata, and a runner script
competences
● version control, CI/CD
● the container build process
opportunities
● development best practices
● embed policies
● standardise assumptions
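The considerations above (pinned builds, naming schema, metadata, runner script) can be sketched in a minimal container recipe. Everything here — base image, label values, script name, tag pattern — is illustrative, not a prescribed standard.

```shell
# Write a minimal Dockerfile that bakes in provenance metadata
# and a single documented entrypoint. (All names hypothetical.)
cat > Dockerfile <<'EOF'
# Pin the base image instead of using a floating "latest" tag
FROM python:3.10-slim
# OCI labels document provenance and link back to source and docs
LABEL org.opencontainers.image.source="https://example.org/repo" \
      org.opencontainers.image.version="1.2.0" \
      org.opencontainers.image.description="example analysis step"
COPY run.sh /usr/local/bin/run.sh
# One entrypoint (the "runner script") makes the container's behaviour explicit
ENTRYPOINT ["/usr/local/bin/run.sh"]
EOF
# A naming schema can be enforced mechanically, e.g. <org>/<project>-<tool>:<semver>
echo "enhancer/rnaseq-align:1.2.0" \
  | grep -Eq '^[a-z0-9]+/[a-z0-9-]+:[0-9]+\.[0-9]+\.[0-9]+$' \
  && echo "tag follows schema"
```

A CI/CD job would run checks like these on every build, so policy is embedded in the pipeline rather than left to each user.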
Container design strategies
https://www.enhancer.ch/pipeline
what do we put inside the container now?
https://nbis-reproducible-research.readthedocs.io/en/course_1811/tutorial_intro/
what to consider:
● track software dependencies
● in-container executions
competences:
● track requirements during software development
● software deployment, CI/CD
opportunities:
● end-user best practices
● better handling of software dependencies
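One common way to track software dependencies inside the container is a pinned environment file that the build consumes. This is a conda-style sketch; the environment name, packages, and versions are illustrative.

```shell
# Write a pinned environment file that the container build will install from.
# (Hypothetical environment; package versions are illustrative.)
cat > environment.yml <<'EOF'
name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10.12
  - samtools=1.17
  - pip:
      - pandas==2.0.3
EOF
# In the container recipe this file drives the install, e.g.:
#   RUN conda env create -f environment.yml
# A quick check that the interpreter version is actually pinned:
grep -Eq '^[[:space:]]+- python=3' environment.yml && echo "python version pinned"
```

Keeping this file under version control alongside the Dockerfile means the software stack is reconstructable from the repository, not just from a cached image.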
Open questions
Infrastructure / pull
● what containers shall I allow on my infrastructure?
● how do I make sure the cited container is exactly what I’m getting?
● how do I verify and validate containers when we deploy them on our infrastructure?
● how do I know what the container is doing?
● how do I know whether the container has the latest security patches?
Run
● how do I make sure a deployed container runs ‘as documented’ on my data?
● “how do I find a container that I need for running RNA-seq?”
Build
● what assumptions can I make when building a container, and what should I try to avoid?
  ○ data mapping in and out / user privileges /
● where do I publish my container, and how do I get a DOI for the publication?
● how do I publish my container so that people can find it for their purposes? (metadata)
● how do I describe/document my container’s behaviour?
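For the question “how do I make sure the cited container is exactly what I’m getting?”, one common answer is content addressing: the publication cites a cryptographic digest, and the provider verifies it before deployment. For registry images the analogous step is pulling by digest (`docker pull <image>@sha256:<digest>`). Below is a minimal sketch using a local file as a stand-in artifact; all names are hypothetical.

```shell
# Stand-in artifact: a (hypothetical) container recipe file.
printf 'bootstrap: docker\nfrom: python:3.10-slim\n' > container.def

# Digest recorded at publication time (would appear in the article).
cited_digest=$(sha256sum container.def | cut -d' ' -f1)

# ...later, on the provider's infrastructure, recompute and compare...
local_digest=$(sha256sum container.def | cut -d' ' -f1)
if [ "$cited_digest" = "$local_digest" ]; then
    echo "artifact matches cited digest"   # prints: artifact matches cited digest
else
    echo "MISMATCH: refuse to deploy"
fi
```

The same pattern generalises: a deployment policy can refuse any container whose digest does not match the one cited in the accompanying publication.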
Main challenges
● Social
  ○ adoption by end-users
  ○ how to address: “is it worth the investment?”
● Technical
  ○ scale-out / orchestration
  ○ integration of specialised resources (e.g. GPUs)
  ○ multi-tenancy and privileges
  ○ documented assumptions within the containers
  ○ maintenance
    ■ bug fixes and security
  ○ portability vs. performance
Acknowledgments
● Guidelines for pipeline interoperability using containers○ https://www.enhancer.ch/pipeline
● Survey for Research IT Infrastructure providers○ https://forms.gle/JBW78qDPWabd4GDR8