The CernVM File System
Jakob Blomer
2 May 2011
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Motivation I
Building blocks in HEP computing: scheduling (Condor, PROOF, LSF, ...) and event data storage (XRootD, dCache, GridFTP, ...).
Distribution of experiment software:
∙ Installation on the “travelling physicist’s laptop”: practically unfeasible, software development by ssh+emacs
∙ Packaging of experiment software and virtual machine images: half-life of a few days, large space and network consumption, unfeasible for Volunteer Computing
∙ Grid: distribution of several GB/day to ≈ 100 sites (AFS, RPM, tarballs, handwritten scripts)
∙ Grid: software installation on worker nodes: misuse of the job scheduling system (“installation jobs”)
∙ Experience from Tier 1/2 sites: overload of the shared software area; critical usage patterns are search path traversal, compilation, and synchronized worker nodes
Motivation II
Technologies and Problems (cont.)
Example: “An alternative model to distribute VO specific software to WLCG sites: a prototype at PIC based on CernVM file system”, E. Lanciotti for the PIC Tier1 team in collaboration with J. Blomer (CERN).
[Poster summary] In the current WLCG model, VO application software is installed into a shared site area (NFS, AFS, or similar) by jobs running on a privileged SGM node and mounted read-only on all worker nodes. Limitations: NFS scalability issues, a shared area that is sometimes unreachable or very slow under load, NFS locking by SQLite when mounted read-write, laborious installation across many Grid sites (job failures, resubmissions, tag publication), and limited per-VO quota (33 GGUS tickets on shared-area issues for LHCb in the last quarter).
The alternative model mounts a CernVM-FS repository (the result of a “make install”) read-only on the worker nodes over HTTP, complemented by a local Squid proxy. Installation uses yum, requires no kernel parameter changes, and only a local cache path, repository URL, and site proxy have to be configured; CernVM-FS v2.47 adds multi-VO support (ATLAS and LHCb repositories mounted on the same system). On a PIC test-bed of 16 worker nodes with 8 cores each, the LHCb SetupProject step (the most demanding phase for the software area, with a huge number of stat() and open() calls) shows a clear dependence of execution time on the number of jobs per node with NFS (about 100 s at 8 jobs per node), while CernVM-FS stays fast and independent of the job count; an ATLAS Athena test job (MC generation of 1 event, reconstruction, analysis) performs equally or slightly better with CernVM-FS than with NFS. Jobs need no changes beyond one environment variable, the site no longer installs software or maintains a shared area, and the promising first tests will be followed by scalability runs with a list of Squid servers and O(1000) concurrent jobs.
Source: E. Lanciotti
Worst case: millions of concurrent system calls overload the shared area (e.g. NFS, Lustre, NetApp). Problem: meta data (on the order of 10^7).
Benign characteristics: immutable files, file-level redundancy, a particular job requires only ≈ 10% of a software release.
Design criteria (first addressed by GROW-FS):
∙ One-time installation by the experiment (not by Grid sites)
∙ Distribution:
  1 only required files
  2 reliable and scalable
  3 standard protocols (NAT traversal)
∙ Aggressive local cache on worker nodes
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Distribution with CernVM-FS
Principle: Virtual software installation by means of an HTTP File System
[Figure: a worker node (CernVM 2 or standard SL5) runs CernVM-FS on top of the SL5 kernel and Fuse; an LRU-managed 1 GB CernVM-FS cache and the Linux file system buffers hold the ≈ 10 GB of a single release, while all releases remain available through a hierarchy of HTTP caches (Squid)]
Layers of the system:
1 Publishing on a release manager machine
2 Content distribution network
3 Aggressively caching Fuse module
Content Addressable Storage
[Figure: the shadow tree /cvmfs/atlas.cern.ch/software/15.6.9/ChangeLog ... is transformed into a repository consisting of a chunk store and file catalogs; chunks are named by content hash, e.g. 806fbb67373e9...]
Compression, SHA-1 ⇒ immutable files, trivial to check for corruption
Data store
∙ Compressed chunks (files)
∙ Eliminates duplicates
∙ Never deletes
File catalog
∙ Directory structure
∙ Symlinks
∙ SHA-1 of regular files
∙ Digitally signed
∙ Time to live
∙ Nested catalogs
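To make the layout concrete, a minimal sketch of the publishing side in Python; it assumes the content hash is taken over the uncompressed data and a two-level fan-out directory, and the function and path names are illustrative rather than the actual CernVM-FS server tools:

```python
import hashlib
import os
import zlib

def publish_file(path, store_dir):
    """Store one file in the chunk store: name it by the SHA-1 of its
    content and keep it zlib-compressed. The returned hash is what the
    file catalog would record for this path."""
    with open(path, "rb") as f:
        content = f.read()
    digest = hashlib.sha1(content).hexdigest()       # content address
    chunk_dir = os.path.join(store_dir, digest[:2])  # fan-out, e.g. 80/6fbb...
    os.makedirs(chunk_dir, exist_ok=True)
    chunk_path = os.path.join(chunk_dir, digest[2:])
    if not os.path.exists(chunk_path):
        # Identical content maps to the same name: duplicates are stored
        # only once, and existing chunks are never overwritten or deleted.
        with open(chunk_path, "wb") as out:
            out.write(zlib.compress(content))
    return digest
```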
Transformation into Content Addressable Storage
Problem: processing of the directory difference set.
Bad options: Fuse, redirecting syscalls, inotify, SystemTap.
[Figure: processes 1...n issue system calls through the VFS (inode cache, dentry cache); in kernel space a redirfs filter stack (cvmfsflt among filters 1...m) sits alongside Ext3, NFS, and nfsd and records writing calls into a ring buffer exposed via /dev/cvmfs, which is synchronized offline with the repository]
Redirfs filters apply on a path prefix:
Normal operation: track writing VFS calls
Ring buffer full: block writing VFS calls
Repository synchronization: reject writing VFS calls
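For illustration, a user-space sketch of what such a difference set contains, computed by comparing the shadow tree against a previous snapshot of path-to-hash entries; the redirfs approach above avoids this full rescan by recording writes as they happen, and all names here are hypothetical:

```python
import hashlib
import os

def tree_snapshot(root):
    """Map relative path -> SHA-1 of content for every regular file."""
    snapshot = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                snapshot[rel] = hashlib.sha1(f.read()).hexdigest()
    return snapshot

def difference_set(old, new):
    """Return added, changed, and removed paths between two snapshots."""
    added   = [p for p in new if p not in old]
    changed = [p for p in new if p in old and new[p] != old[p]]
    removed = [p for p in old if p not in new]
    return added, changed, removed
```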
Content Distribution, Simple Model
+ Simple: piggy-backs on deployed infrastructure
+ Scalable: in production for CernVM and several Tier 1s
+ Fast: clients served by memory cache
– Single point of failure
[Figure: a single master R/W copy feeding the proxy hierarchies]
Content Distribution, Stratum Model
+ Similarly fast and scalable
+ No single point of failure
– Complex hierarchy
[Figure: Stratum 0 master R/W copy at CERN; Stratum 1 public mirrors at CERN, RAL, BNL, and others; Stratum 2 private replicas (Tier 1) feeding the proxy hierarchies]
Content Distribution, Decentralized Model (Experimental Setup)
Backbone: stable, distributed, public p2p network of web servers
Worker nodes: independent p2p cells building a decentralized memory cache
[Figure: Stratum 0 master R/W copy at CERN with mirrors at RAL, BNL, and others forming the backbone]
Fuse Module: open() syscall
[Figure: open(/ChangeLog) enters through glibc and the VFS (inode cache, dentry cache, buffer cache); instead of ext3 or NFS, Fuse forwards the call via /dev/fuse and libfuse to the user-space CernVM-FS process, which resolves the SHA-1, performs an HTTP GET, inflates and verifies the data, and returns a file descriptor]
∙ The open() syscall returns a file descriptor pointing into the local cache
∙ Not cacheable in the kernel
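A hedged sketch of this open() path in Python: the content hash comes from the file catalog, a cache miss triggers an HTTP GET of the compressed chunk, which is inflated and verified before a descriptor to the local copy is returned. The URL layout and function names are assumptions, not the exact client code:

```python
import hashlib
import os
import urllib.request
import zlib

def cvmfs_open(path, catalog, cache_dir, mirror_url):
    """Sketch of the open() path: redirect to a locally cached copy,
    fetching and verifying the chunk over HTTP on a cache miss."""
    digest = catalog[path]                        # SHA-1 from the file catalog
    cached = os.path.join(cache_dir, digest)
    if not os.path.exists(cached):
        url = "%s/data/%s/%s" % (mirror_url, digest[:2], digest[2:])
        with urllib.request.urlopen(url) as response:    # HTTP GET
            content = zlib.decompress(response.read())   # inflate
        if hashlib.sha1(content).hexdigest() != digest:  # verify
            raise IOError("corrupted chunk for %s" % path)
        with open(cached, "wb") as f:
            f.write(content)
    # The Fuse module hands this descriptor back; subsequent read() calls
    # go straight to the local cache file.
    return os.open(cached, os.O_RDONLY)
```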
Fuse Module: stat() syscall
[Figure: stat(/ChangeLog) enters through glibc and the VFS (inode cache, dentry cache, buffer cache); Fuse forwards the call via /dev/fuse and libfuse to the user-space CernVM-FS process, which answers with the dirent/return value taken from the locally cached file catalog]
∙ Meta data operations are served entirely from locally cached file catalogs
∙ Mostly handled by the kernel caches
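Since the file catalogs are ordinary local files (SQLite databases in CernVM-FS), a stat() can be answered without touching the network. A sketch with an invented schema, assuming a catalog table holding path, size, mode, mtime, and content hash:

```python
import sqlite3

def catalog_stat(catalog_path, file_path):
    """Answer a stat()-like query from a locally cached catalog without
    any network access. The table and column names are made up for
    illustration; the real catalog schema differs."""
    conn = sqlite3.connect(catalog_path)
    row = conn.execute(
        "SELECT size, mode, mtime, hash FROM entries WHERE path = ?",
        (file_path,),
    ).fetchone()
    conn.close()
    if row is None:
        raise FileNotFoundError(file_path)
    size, mode, mtime, content_hash = row
    return {"size": size, "mode": mode, "mtime": mtime, "hash": content_hash}
```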
Client-Side Fail-Over
[Figure: worker nodes connect over HTTP to proxy servers, which connect over HTTP to mirror servers; on failure, clients fail over to the next proxy or mirror]
Proxies: SL5 Squid, load balancing + fail-over, e.g. CVMFS_HTTP_PROXY="A|B|C"
Mirrors: fail-over mirrors at CERN, RAL, BNL; for roaming users, automatic ordering based on RTT
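A simplified sketch of how a client could combine load balancing over the proxy group with fail-over to the next proxy and mirror; the real CVMFS_HTTP_PROXY syntax is richer, and the helper below is an assumption used only for illustration:

```python
import random
import urllib.request

def fetch_with_failover(url_path, proxies, mirrors):
    """Load-balance over the proxy group and fail over to the next proxy,
    then to the next mirror, mimicking CVMFS_HTTP_PROXY="A|B|C"."""
    shuffled = random.sample(proxies, len(proxies))   # load balancing
    for mirror in mirrors:                            # e.g. ordered by RTT
        for proxy in shuffled:
            try:
                opener = urllib.request.build_opener(
                    urllib.request.ProxyHandler({"http": proxy}))
                with opener.open(mirror + url_path, timeout=10) as response:
                    return response.read()
            except OSError:
                continue                              # try the next proxy/mirror
    raise IOError("all proxies and mirrors failed for " + url_path)
```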
Integrity and Authenticity
Principle: digitally signed repositories with a certificate white-list
[Figure: the release manager signs the file catalog with its certificate; the certificate white-list is signed and verified against the CernVM public key shipped with CernVM-FS, and certificate fingerprints are checked against that white-list]
Client-side steps:
1 Download the signed catalog and the signed white-list
2 Verify the white-list and check the certificate fingerprint
3 Download files
4 Compare each secure hash against its catalog entry
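The verification chain can be sketched as follows; the signature primitive is left to a caller-supplied function (e.g. backed by OpenSSL), and the white-list format assumed here, a plain list of certificate fingerprints, is an illustration rather than the actual file format:

```python
import hashlib

def verify_repository(whitelist, whitelist_sig, master_key,
                      catalog, catalog_sig, cert, verify_signature):
    """Check the trust chain before using a downloaded catalog.
    verify_signature(data, signature, key) is supplied by the caller;
    all other arguments are raw bytes."""
    # 1. The white-list must carry a valid signature checked against the
    #    CernVM master public key shipped with the client.
    if not verify_signature(whitelist, whitelist_sig, master_key):
        raise ValueError("white-list signature invalid")
    # 2. The release manager certificate must be listed in the white-list
    #    (fingerprint comparison).
    fingerprint = hashlib.sha1(cert).hexdigest()
    if fingerprint not in whitelist.decode().split():
        raise ValueError("certificate not on white-list")
    # 3. The catalog must be signed by that certificate.
    if not verify_signature(catalog, catalog_sig, cert):
        raise ValueError("catalog signature invalid")
    return True

def verify_chunk(content, catalog_hash):
    # 4. Every downloaded file is checked against the secure hash recorded
    #    in the now-trusted catalog.
    return hashlib.sha1(content).hexdigest() == catalog_hash
```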
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Characteristics I
[Figure: number of objects (·10^6) and volume (GB) for all experiments together and for ATLAS, LHCb, and CMS individually, broken down into file system objects, regular files, unique files, and compressed size]
Overall: 910 GB, 28 million file system objects
Repository core: 150 GB (16%), 4.2 million file system objects (15%)
Runtime behaviour (data + meta data, syscall counts in thousands, "all" vs. "unique"):

                              stat()          open()          read()
                              all    uniq     all    uniq     all    uniq
Kernel compilation          438.8    4.2    426.9    2.4    426.2    2.4
  cache hit rate               99%             95%             99%
ATLAS examples compilation  4987.7   43.5    111.1    2.3    119.5    2.3
  cache hit rate               91%             94%             96%

Few paths are subject to many system calls: ideal for caching
Software Characteristics II: Repository Growth
∙ LHCb monthly growth of file system objects and volume
∙ Unreferenced data chunks in the LHCb repository by April 2011 (archive data)
[Figure: monthly change in the number of file system objects (·10^3, left axis) and in volume (GB, right axis) from July 2010 to March 2011, shown separately for the shadow tree, the repository, and the archive]
Software Characteristics III: File Size Distribution
[Figure: file size percentiles from 16 B to 64 KiB for ATLAS, LHCb, ALICE, CMS, LCD, and the LCG externals, together with the sizes observed by a Squid proxy over one month, 14 million requests, HTTP compressed]
(Squid “hot set”, required at least every week: 670 000 files, 48 GB)
No need for file chunking; latency is the key issue
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Decentralized Cluster Cache I
Use case: local cluster of virtualized worker nodes
Goals:
∙ Elimination of the central cache
∙ Resilience against high peer churn
∙ Automatic configuration
Idea: customized DHT algorithm for memcached, automatic configuration by IP multicast
[Figure: CernVM-FS on each worker node stores chunks to and retrieves them from memcached instances running on the nodes]
memcached:
∙ LRU-managed cache in RAM
∙ Slab allocator
∙ Steering via TCP/UDP
(Mid-term option: an off-the-shelf distributed key-value store)
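A minimal sketch of such a shared cluster cache, using the third-party pymemcache client and a simple modulo placement instead of the customized DHT; peer discovery (e.g. via multicast) is assumed to have produced the peer list already:

```python
import hashlib
from pymemcache.client.base import Client  # third-party memcached client

class ClusterCache:
    """Worker nodes share their RAM caches through memcached; a chunk hash
    deterministically selects the responsible peer, so every node asks the
    same peer for the same chunk."""

    def __init__(self, peers):
        # peers: list of (host, port) tuples, e.g. discovered via multicast
        self.clients = [Client(peer) for peer in peers]

    def _peer_for(self, chunk_hash):
        index = int(chunk_hash[:8], 16) % len(self.clients)
        return self.clients[index]

    def get_chunk(self, chunk_hash, fetch_from_http):
        peer = self._peer_for(chunk_hash)
        data = peer.get(chunk_hash)
        if data is None:                  # miss: fall back to the proxy/mirror
            data = fetch_from_http(chunk_hash)
            peer.set(chunk_hash, data)    # populate the shared cache
        return data
```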
Decentralized Cluster Cache II
Which peer has a particular chunk?
Traditional DHT: the peer ID determines responsibility
This algorithm: peers “float” in hash space
Simulation result with a typical workload: 90% efficiency, resilient to peer churn
Each peer maintains a small number of slots (requires < 1 KB per peer):
[Figure: a binary prefix tree over the hash space (0/1, 00...11, 000...111); the slots of peers A and B cover sub-trees and are split or merged as conditions change]
split: drop responsibility for the left or right sub-tree when there are too many cache misses or too much load
merge: create a free slot by merging existing slots with the greatest common prefix
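A toy model of the floating slots, assuming binary hash prefixes as slot identifiers; the split and merge rules follow the description above, while load thresholds and peer coordination are omitted:

```python
from os.path import commonprefix

class PeerSlots:
    """A peer answers for the parts of the hash space whose binary prefixes
    it currently holds (well under 1 KB of state). Splitting sheds load;
    merging frees a slot for new responsibilities."""

    def __init__(self, prefixes):
        self.prefixes = set(prefixes)            # e.g. {"0", "10", "111"}

    def covers(self, key_bits):
        """Is this peer responsible for a chunk whose hash starts with key_bits?"""
        return any(key_bits.startswith(p) for p in self.prefixes)

    def split(self, prefix, keep_left=True):
        """Too many cache misses or too much load: keep one half of the
        sub-tree and drop responsibility for the other half."""
        self.prefixes.remove(prefix)
        self.prefixes.add(prefix + ("0" if keep_left else "1"))

    def merge(self):
        """Create a free slot by merging the pair of slots with the greatest
        common prefix into that prefix."""
        if len(self.prefixes) < 2:
            return
        pairs = [(a, b) for a in self.prefixes for b in self.prefixes if a < b]
        a, b = max(pairs, key=lambda p: len(commonprefix(p)))
        self.prefixes -= {a, b}
        self.prefixes.add(commonprefix((a, b)))
```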
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Cloud Computing for HEP
Use cases: server consolidation, worker node consolidation, portable analysis environment, Volunteer Computing, long-term data preservation
Ideal: an incarnation of utility computing
Common denominator in practical terms:
[Figure: interactive services, Grid services, build-and-test services, and worker nodes run on a private cloud and a public cloud (e.g. Amazon EC2), both behind a standardized API]
1 Virtualized infrastructure (IaaS)
2 Dynamically reconfigurable
3 Adjusts to load
HEP-Specific Obstacles
Envisaged model:
∙ Jobs packaged as virtual machines
∙ 1 VM per 1-4 CPU cores
∙ Typical running time: hours to days
∙ Relatively high peer churn
∙ ≈ 4 GB RAM per VM, optimized by Kernel Shared Memory
∙ VM image usable in various clouds and in BOINC
Obstacles (and approaches):
∙ Performance: depends on the workload, up to 15%; compensated by higher resource utilization
∙ Proliferation of VM images: high trust requirements by Grid sites; addressed by CernVM-FS
∙ Connection to job and storage frameworks; approaches: Co-Pilot, XRootD, EOS
∙ Software distribution: many GB per release, more than one release per week; distributing platform + software as images does not scale; addressed by CernVM-FS
∙ Trust: correctness of results?
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Distribution – Lessons Learned
Content addressable storage
⇒ Immutable
⇒ Trivial to verify
⇒ Trivial to cache
⇒ De-duplication
+ Versioning
+ Garbage collection by reference counting
Meta data
∙ As important as data
∙ Almost the same order of magnitude as experiment meta data
∙ Natural pre-caching by means of file catalogs
System design
∙ Scalable infrastructure
∙ Failure handling within worker nodes
∙ Failures are the norm
∙ Virtualization: homogeneous interface
∙ Separate the concerns of Grid sites and experiments
∙ Avoid large p2p networks: a mid-sized p2p backbone plus independent, mid-sized p2p cells