Towards Reproducible Data Analysis
Using Container Technologies
Sergio Maffioletti
EnhanceR project director
UZH/S3IT
https://www.enhancer.ch
Disclaimer
What I’m presenting here is the result of personal experience, plus the outcomes of various discussions within the EnhanceR project.
i.e.:
if you like the talk, congratulate me… if you don’t, blame EnhanceR
What are we going to talk about?
• Context
• What is the user story we have in mind?
• Let’s build the infrastructure support
• Let’s not stop here: building containers for/with end-users
• One more step: what do we put inside the container?
• Main challenges and open questions
Who is EnhanceR again?
What problems are we facing?
Reproducible data analysis
“Reproducibility is just collaboration with people you don’t know, including yourself next week”
— Philip Stark, UC Berkeley Statistics
Context
Repeatability (Same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.
Replicability (Different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.
Reproducibility (Different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
Let’s simplify...
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226.
What is the user story we have in mind?
On average, a researcher:
• develops on a personal server
• changes code and data as the research progresses
  • sometimes running on a large-scale research IT infrastructure
• finally gets publishable results
• prepares slides / images / tables / manuscript
• publishes the manuscript
  • at the end of a review process
What is the user story we have in mind?
Researcher’s side recommendations for Open Science:
● Share data, software, workflows, and other digital artifacts.
● Persistent links should appear in the published article for data, code, and digital artifacts.
● Citation should be standard practice, to enable credit for shared digital scholarly objects.
● Document digital scholarly artifacts, to facilitate reuse.
● Use open licensing when publishing digital scholarly objects.
What does this mean for a service provider?
● “Reproducible Data Analysis as a service” implies looking at the full stack of the service*
  ○ infrastructure + tools + competences + policies + best practices + support
● Why?
  ○ understand the user side: anticipate issues; steer adoption and development; enforce policies; plan resources better
● And in the end?
  ○ we become a valuable asset for a research group
  ○ we actually help them
* I know, I’m intentionally skipping the business aspect of this...
Let’s build the infrastructure
● what container technology
● orchestration
● integration with resource management
● storage for data and container images
● deployment and management
● monitoring
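As a sketch of the “integration with resource management” point above: on an HPC-style infrastructure the container runtime is typically invoked from a batch script, so provider-side conventions can be checked there. The following assumes Slurm and Apptainer; the image name, paths, and resource figures are all hypothetical.

```shell
# Write a batch script that runs the analysis inside a container.
# (Hypothetical image "analysis.sif" and data paths.)
cat > run_analysis.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Bind the project data directory into the container and run the pipeline
apptainer exec --bind /data/project:/data analysis.sif \
    python /opt/pipeline/run.py --input /data/raw --output /data/results
EOF
# A simple provider-side policy check: data access must go through
# an explicit bind mount, so it is visible in the job script.
grep -q -- '--bind' run_analysis.sbatch && echo "bind mount declared"
```

Submitting would then be a plain `sbatch run_analysis.sbatch`; the point is that the container invocation, and therefore the data mapping, lives in a reviewable artifact.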
Let’s build the infrastructure
● validation and verification
● automated policies
● scanning and signing
https://www.docker.com
https://www.enhancer.ch/pipeline
Let’s not stop here: building containers for/with end-users
what to consider
● automated builds / integration with CI/CD
● design strategies
● naming schema
● path binding
● documentation, metadata, and a runner script
competences
● version control, CI/CD
● the container build process
opportunities
● development best practices
● embed policies
● standardise assumptions
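The considerations above (pinned builds, naming schema, metadata, runner script) can be sketched in a minimal container recipe. Everything here — base image, label values, script name, tag pattern — is illustrative, not a prescribed standard.

```shell
# Write a minimal Dockerfile that bakes in provenance metadata
# and a single documented entrypoint. (All names hypothetical.)
cat > Dockerfile <<'EOF'
# Pin the base image instead of using a floating "latest" tag
FROM python:3.10-slim
# OCI labels document provenance and link back to source and docs
LABEL org.opencontainers.image.source="https://example.org/repo" \
      org.opencontainers.image.version="1.2.0" \
      org.opencontainers.image.description="example analysis step"
COPY run.sh /usr/local/bin/run.sh
# One entrypoint (the "runner script") makes the container's behaviour explicit
ENTRYPOINT ["/usr/local/bin/run.sh"]
EOF
# A naming schema can be enforced mechanically, e.g. <org>/<project>-<tool>:<semver>
echo "enhancer/rnaseq-align:1.2.0" \
  | grep -Eq '^[a-z0-9]+/[a-z0-9-]+:[0-9]+\.[0-9]+\.[0-9]+$' \
  && echo "tag follows schema"
```

A CI/CD job would run checks like these on every build, so policy is embedded in the pipeline rather than left to each user.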
Container design strategies
https://www.enhancer.ch/pipeline
what do we put inside the container now?
https://nbis-reproducible-research.readthedocs.io/en/course_1811/tutorial_intro/
what to consider:
● track software dependencies
● in-container executions
competences:
● track requirements during software development
● software deployment, CI/CD
opportunities:
● end-user best practices
● better handling of software dependencies
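One common way to track software dependencies inside the container is a pinned environment file that the build consumes. This is a conda-style sketch; the environment name, packages, and versions are illustrative.

```shell
# Write a pinned environment file that the container build will install from.
# (Hypothetical environment; package versions are illustrative.)
cat > environment.yml <<'EOF'
name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10.12
  - samtools=1.17
  - pip:
      - pandas==2.0.3
EOF
# In the container recipe this file drives the install, e.g.:
#   RUN conda env create -f environment.yml
# A quick check that the interpreter version is actually pinned:
grep -Eq '^[[:space:]]+- python=3' environment.yml && echo "python version pinned"
```

Keeping this file under version control alongside the Dockerfile means the software stack is reconstructable from the repository, not just from a cached image.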
Open questions
Infrastructure / pull
● what containers shall I allow on my infrastructure?
● how do I make sure the cited container is exactly what I’m getting?
● how do I verify and validate containers when we deploy them on our infrastructure?
● how do I know what the container is doing?
● how do I know whether the container has the latest security patches?
Run
● how do I make sure a deployed container runs ‘as documented’ on my data?
● “how do I find a container that I need for running RNA-seq?”
Build
● what assumptions can I make when building a container, and what should I try to avoid?
  ○ data mapping in and out / user privileges /
● where do I publish my container, and how do I get a DOI for the publication?
● how do I publish my container so that people can find it for their purposes? (metadata)
● how do I describe/document my container’s behaviour?
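For the question “how do I make sure the cited container is exactly what I’m getting?”, one common answer is content addressing: the publication cites a cryptographic digest, and the provider verifies it before deployment. For registry images the analogous step is pulling by digest (`docker pull <image>@sha256:<digest>`). Below is a minimal sketch using a local file as a stand-in artifact; all names are hypothetical.

```shell
# Stand-in artifact: a (hypothetical) container recipe file.
printf 'bootstrap: docker\nfrom: python:3.10-slim\n' > container.def

# Digest recorded at publication time (would appear in the article).
cited_digest=$(sha256sum container.def | cut -d' ' -f1)

# ...later, on the provider's infrastructure, recompute and compare...
local_digest=$(sha256sum container.def | cut -d' ' -f1)
if [ "$cited_digest" = "$local_digest" ]; then
    echo "artifact matches cited digest"   # prints: artifact matches cited digest
else
    echo "MISMATCH: refuse to deploy"
fi
```

The same pattern generalises: a deployment policy can refuse any container whose digest does not match the one cited in the accompanying publication.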
Main challenges
● Social
  ○ adoption by end-users
  ○ how to address: “is it worth the investment?”
● Technical
  ○ scale-out / orchestration
  ○ integration of specialised resources (e.g. GPUs)
  ○ multi-tenancy and privileges
  ○ documented assumptions within the containers
  ○ maintenance
    ■ bug fixes and security
  ○ portability vs. performance
Acknowledgments
● Guidelines for pipeline interoperability using containers○ https://www.enhancer.ch/pipeline
● Survey for Research IT Infrastructure providers○ https://forms.gle/JBW78qDPWabd4GDR8