The CernVM File System
Jakob Blomer
2 May 2011
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Motivation I
Building blocks in HEP computing: scheduling (Condor, PROOF, LSF, ...) and event data storage (XRootD, dCache, GridFTP, ...).
Distribution of experiment software:
∙ Installation on the “travelling physicist’s laptop”: practically unfeasible, software development by ssh+emacs
∙ Packaging of experiment software and virtual machine images: half-life of a few days, large space and network consumption, unfeasible for Volunteer Computing
∙ Grid: distribution of several GB/day to ≈ 100 sites (AFS, RPM, tarballs, handwritten scripts)
∙ Grid: software installation on worker nodes: misuse of the job scheduling system (“installation jobs”)
∙ Experience from Tier 1/2 sites: overload of the shared software area; critical usage patterns are search path traversal, compilation, and synchronized worker nodes
Motivation II
Technologies and Problems (cont.)
Example: “An alternative model to distribute VO specific software to WLCG sites: a prototype at PIC based on CernVM file system”, E. Lanciotti for the PIC Tier1 team in collaboration with J. Blomer (CERN).
[Poster summary] In the current WLCG model, VO application software is installed into a shared site area (NFS, AFS, or similar) by jobs running on a privileged SGM node and mounted read-only on all worker nodes. Limitations: NFS scalability issues, a shared area that is sometimes unreachable or very slow under load, NFS locking by SQLite when mounted read-write, laborious installation across many Grid sites (job failures, resubmissions, tag publication), and limited per-VO quota (33 GGUS tickets on shared-area issues for LHCb in the last quarter).
The alternative model mounts a CernVM-FS repository (the result of a “make install”) read-only on the worker nodes over HTTP, complemented by a local Squid proxy. Installation uses yum, requires no kernel parameter changes, and only a local cache path, repository URL, and site proxy have to be configured; CernVM-FS v2.47 adds multi-VO support (ATLAS and LHCb repositories mounted on the same system). On a PIC test-bed of 16 worker nodes with 8 cores each, the LHCb SetupProject step (the most demanding phase for the software area, with a huge number of stat() and open() calls) shows a clear dependence of execution time on the number of jobs per node with NFS (about 100 s at 8 jobs per node), while CernVM-FS stays fast and independent of the job count; an ATLAS Athena test job (MC generation of 1 event, reconstruction, analysis) performs equally or slightly better with CernVM-FS than with NFS. Jobs need no changes beyond one environment variable, the site no longer installs software or maintains a shared area, and the promising first tests will be followed by scalability runs with a list of Squid servers and O(1000) concurrent jobs.
Source: E. Lanciotti
Worst case: millions of concurrent system calls overload the shared area (e.g. NFS, Lustre, NetApp). Problem: meta data (on the order of 10^7).
Benign characteristics: immutable files, file-level redundancy, a particular job requires only ≈ 10% of a software release.
Design criteria (first addressed by GROW-FS):
∙ One-time installation by the experiment (not by Grid sites)
∙ Distribution:
  1 only required files
  2 reliable and scalable
  3 standard protocols (NAT traversal)
∙ Aggressive local cache on worker nodes
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Distribution with CernVM-FS
Principle: Virtual software installation by means of an HTTP File System
[Figure: a worker node (CernVM 2 or standard SL5) runs CernVM-FS on top of the SL5 kernel and Fuse; an LRU-managed 1 GB CernVM-FS cache and the Linux file system buffers hold the ≈ 10 GB of a single release, while all releases remain available through a hierarchy of HTTP caches (Squid)]
Layers of the system:
1 Publishing on a release manager machine
2 Content distribution network
3 Aggressively caching Fuse module
Content Addressable Storage
[Figure: the shadow tree /cvmfs/atlas.cern.ch/software/15.6.9/ChangeLog ... is transformed into a repository consisting of a chunk store and file catalogs; chunks are named by content hash, e.g. 806fbb67373e9...]
Compression, SHA-1 ⇒ immutable files, trivial to check for corruption
Data store
∙ Compressed chunks (files)
∙ Eliminates duplicates
∙ Never deletes
File catalog
∙ Directory structure
∙ Symlinks
∙ SHA-1 of regular files
∙ Digitally signed
∙ Time to live
∙ Nested catalogs
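To make the layout concrete, a minimal sketch of the publishing side in Python; it assumes the content hash is taken over the uncompressed data and a two-level fan-out directory, and the function and path names are illustrative rather than the actual CernVM-FS server tools:

```python
import hashlib
import os
import zlib

def publish_file(path, store_dir):
    """Store one file in the chunk store: name it by the SHA-1 of its
    content and keep it zlib-compressed. The returned hash is what the
    file catalog would record for this path."""
    with open(path, "rb") as f:
        content = f.read()
    digest = hashlib.sha1(content).hexdigest()       # content address
    chunk_dir = os.path.join(store_dir, digest[:2])  # fan-out, e.g. 80/6fbb...
    os.makedirs(chunk_dir, exist_ok=True)
    chunk_path = os.path.join(chunk_dir, digest[2:])
    if not os.path.exists(chunk_path):
        # Identical content maps to the same name: duplicates are stored
        # only once, and existing chunks are never overwritten or deleted.
        with open(chunk_path, "wb") as out:
            out.write(zlib.compress(content))
    return digest
```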
Transformation into Content Addressable Storage
Problem: processing of the directory difference set.
Bad options: Fuse, redirecting syscalls, inotify, SystemTap.
[Figure: processes 1...n issue system calls through the VFS (inode cache, dentry cache); in kernel space a redirfs filter stack (cvmfsflt among filters 1...m) sits alongside Ext3, NFS, and nfsd and records writing calls into a ring buffer exposed via /dev/cvmfs, which is synchronized offline with the repository]
Redirfs filters apply on a path prefix:
Normal operation: track writing VFS calls
Ring buffer full: block writing VFS calls
Repository synchronization: reject writing VFS calls
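For illustration, a user-space sketch of what such a difference set contains, computed by comparing the shadow tree against a previous snapshot of path-to-hash entries; the redirfs approach above avoids this full rescan by recording writes as they happen, and all names here are hypothetical:

```python
import hashlib
import os

def tree_snapshot(root):
    """Map relative path -> SHA-1 of content for every regular file."""
    snapshot = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                snapshot[rel] = hashlib.sha1(f.read()).hexdigest()
    return snapshot

def difference_set(old, new):
    """Return added, changed, and removed paths between two snapshots."""
    added   = [p for p in new if p not in old]
    changed = [p for p in new if p in old and new[p] != old[p]]
    removed = [p for p in old if p not in new]
    return added, changed, removed
```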
Content Distribution, Simple Model
+ Simple: piggy-backs on deployed infrastructure
+ Scalable: in production for CernVM and several Tier 1s
+ Fast: clients served by memory cache
– Single point of failure
[Figure: a single master R/W copy feeding the proxy hierarchies]
Content Distribution, Stratum Model
+ Similarly fast and scalable
+ No single point of failure
– Complex hierarchy
[Figure: Stratum 0 master R/W copy at CERN; Stratum 1 public mirrors at CERN, RAL, BNL, and others; Stratum 2 private replicas (Tier 1) feeding the proxy hierarchies]
Content Distribution, Decentralized Model (Experimental Setup)
Backbone: stable, distributed, public p2p network of web servers
Worker nodes: independent p2p cells building a decentralized memory cache
[Figure: Stratum 0 master R/W copy at CERN with mirrors at RAL, BNL, and others forming the backbone]
Fuse Module: open() syscall
[Figure: open(/ChangeLog) enters through glibc and the VFS (inode cache, dentry cache, buffer cache); instead of ext3 or NFS, Fuse forwards the call via /dev/fuse and libfuse to the user-space CernVM-FS process, which resolves the SHA-1, performs an HTTP GET, inflates and verifies the data, and returns a file descriptor]
∙ The open() syscall returns a file descriptor pointing into the local cache
∙ Not cacheable in the kernel
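A hedged sketch of this open() path in Python: the content hash comes from the file catalog, a cache miss triggers an HTTP GET of the compressed chunk, which is inflated and verified before a descriptor to the local copy is returned. The URL layout and function names are assumptions, not the exact client code:

```python
import hashlib
import os
import urllib.request
import zlib

def cvmfs_open(path, catalog, cache_dir, mirror_url):
    """Sketch of the open() path: redirect to a locally cached copy,
    fetching and verifying the chunk over HTTP on a cache miss."""
    digest = catalog[path]                        # SHA-1 from the file catalog
    cached = os.path.join(cache_dir, digest)
    if not os.path.exists(cached):
        url = "%s/data/%s/%s" % (mirror_url, digest[:2], digest[2:])
        with urllib.request.urlopen(url) as response:    # HTTP GET
            content = zlib.decompress(response.read())   # inflate
        if hashlib.sha1(content).hexdigest() != digest:  # verify
            raise IOError("corrupted chunk for %s" % path)
        with open(cached, "wb") as f:
            f.write(content)
    # The Fuse module hands this descriptor back; subsequent read() calls
    # go straight to the local cache file.
    return os.open(cached, os.O_RDONLY)
```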
Fuse Module: stat() syscall
[Figure: stat(/ChangeLog) enters through glibc and the VFS (inode cache, dentry cache, buffer cache); Fuse forwards the call via /dev/fuse and libfuse to the user-space CernVM-FS process, which answers with the dirent/return value taken from the locally cached file catalog]
∙ Meta data operations are served entirely from locally cached file catalogs
∙ Mostly handled by the kernel caches
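Since the file catalogs are ordinary local files (SQLite databases in CernVM-FS), a stat() can be answered without touching the network. A sketch with an invented schema, assuming a catalog table holding path, size, mode, mtime, and content hash:

```python
import sqlite3

def catalog_stat(catalog_path, file_path):
    """Answer a stat()-like query from a locally cached catalog without
    any network access. The table and column names are made up for
    illustration; the real catalog schema differs."""
    conn = sqlite3.connect(catalog_path)
    row = conn.execute(
        "SELECT size, mode, mtime, hash FROM entries WHERE path = ?",
        (file_path,),
    ).fetchone()
    conn.close()
    if row is None:
        raise FileNotFoundError(file_path)
    size, mode, mtime, content_hash = row
    return {"size": size, "mode": mode, "mtime": mtime, "hash": content_hash}
```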
Client-Side Fail-Over
[Figure: worker nodes connect over HTTP to proxy servers, which connect over HTTP to mirror servers; on failure, clients fail over to the next proxy or mirror]
Proxies: SL5 Squid, load balancing + fail-over, e.g. CVMFS_HTTP_PROXY="A|B|C"
Mirrors: fail-over mirrors at CERN, RAL, BNL; for roaming users, automatic ordering based on RTT
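A simplified sketch of how a client could combine load balancing over the proxy group with fail-over to the next proxy and mirror; the real CVMFS_HTTP_PROXY syntax is richer, and the helper below is an assumption used only for illustration:

```python
import random
import urllib.request

def fetch_with_failover(url_path, proxies, mirrors):
    """Load-balance over the proxy group and fail over to the next proxy,
    then to the next mirror, mimicking CVMFS_HTTP_PROXY="A|B|C"."""
    shuffled = random.sample(proxies, len(proxies))   # load balancing
    for mirror in mirrors:                            # e.g. ordered by RTT
        for proxy in shuffled:
            try:
                opener = urllib.request.build_opener(
                    urllib.request.ProxyHandler({"http": proxy}))
                with opener.open(mirror + url_path, timeout=10) as response:
                    return response.read()
            except OSError:
                continue                              # try the next proxy/mirror
    raise IOError("all proxies and mirrors failed for " + url_path)
```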
Integrity and Authenticity
Principle: digitally signed repositories with a certificate white-list
[Figure: the release manager signs the file catalog with its certificate; the certificate white-list is signed and verified against the CernVM public key shipped with CernVM-FS, and certificate fingerprints are checked against that white-list]
Client-side steps:
1 Download the signed catalog and the signed white-list
2 Verify the white-list and check the certificate fingerprint
3 Download files
4 Compare each secure hash against its catalog entry
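The verification chain can be sketched as follows; the signature primitive is left to a caller-supplied function (e.g. backed by OpenSSL), and the white-list format assumed here, a plain list of certificate fingerprints, is an illustration rather than the actual file format:

```python
import hashlib

def verify_repository(whitelist, whitelist_sig, master_key,
                      catalog, catalog_sig, cert, verify_signature):
    """Check the trust chain before using a downloaded catalog.
    verify_signature(data, signature, key) is supplied by the caller;
    all other arguments are raw bytes."""
    # 1. The white-list must carry a valid signature checked against the
    #    CernVM master public key shipped with the client.
    if not verify_signature(whitelist, whitelist_sig, master_key):
        raise ValueError("white-list signature invalid")
    # 2. The release manager certificate must be listed in the white-list
    #    (fingerprint comparison).
    fingerprint = hashlib.sha1(cert).hexdigest()
    if fingerprint not in whitelist.decode().split():
        raise ValueError("certificate not on white-list")
    # 3. The catalog must be signed by that certificate.
    if not verify_signature(catalog, catalog_sig, cert):
        raise ValueError("catalog signature invalid")
    return True

def verify_chunk(content, catalog_hash):
    # 4. Every downloaded file is checked against the secure hash recorded
    #    in the now-trusted catalog.
    return hashlib.sha1(content).hexdigest() == catalog_hash
```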
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Characteristics I
[Figure: number of objects (·10^6) and volume (GB) for all experiments together and for ATLAS, LHCb, and CMS individually, broken down into file system objects, regular files, unique files, and compressed size]
Overall: 910 GB, 28 million file system objects
Repository core: 150 GB (16%), 4.2 million file system objects (15%)
Runtime behaviour (data + meta data, syscall counts in thousands, "all" vs. "unique"):

                              stat()          open()          read()
                              all    uniq     all    uniq     all    uniq
Kernel compilation          438.8    4.2    426.9    2.4    426.2    2.4
  cache hit rate               99%             95%             99%
ATLAS examples compilation  4987.7   43.5    111.1    2.3    119.5    2.3
  cache hit rate               91%             94%             96%

Few paths are subject to many system calls: ideal for caching
Software Characteristics II: Repository Growth
∙ LHCb monthly growth of file system objects and volume
∙ Unreferenced data chunks in the LHCb repository by April 2011 (archive data)
[Figure: monthly change in the number of file system objects (·10^3, left axis) and in volume (GB, right axis) from July 2010 to March 2011, shown separately for the shadow tree, the repository, and the archive]
Software Characteristics III: File Size Distribution
[Figure: file size percentiles from 16 B to 64 KiB for ATLAS, LHCb, ALICE, CMS, LCD, and the LCG externals, together with the sizes observed by a Squid proxy over one month, 14 million requests, HTTP compressed]
(Squid “hot set”, required at least every week: 670 000 files, 48 GB)
No need for file chunking; latency is the key issue
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Decentralized Cluster Cache I
Use case: local cluster of virtualized worker nodes
Goals:
∙ Elimination of the central cache
∙ Resilience against high peer churn
∙ Automatic configuration
Idea: customized DHT algorithm for memcached, automatic configuration by IP multicast
[Figure: CernVM-FS on each worker node stores chunks to and retrieves them from memcached instances running on the nodes]
memcached:
∙ LRU-managed cache in RAM
∙ Slab allocator
∙ Steering via TCP/UDP
(Mid-term option: an off-the-shelf distributed key-value store)
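A minimal sketch of such a shared cluster cache, using the third-party pymemcache client and a simple modulo placement instead of the customized DHT; peer discovery (e.g. via multicast) is assumed to have produced the peer list already:

```python
import hashlib
from pymemcache.client.base import Client  # third-party memcached client

class ClusterCache:
    """Worker nodes share their RAM caches through memcached; a chunk hash
    deterministically selects the responsible peer, so every node asks the
    same peer for the same chunk."""

    def __init__(self, peers):
        # peers: list of (host, port) tuples, e.g. discovered via multicast
        self.clients = [Client(peer) for peer in peers]

    def _peer_for(self, chunk_hash):
        index = int(chunk_hash[:8], 16) % len(self.clients)
        return self.clients[index]

    def get_chunk(self, chunk_hash, fetch_from_http):
        peer = self._peer_for(chunk_hash)
        data = peer.get(chunk_hash)
        if data is None:                  # miss: fall back to the proxy/mirror
            data = fetch_from_http(chunk_hash)
            peer.set(chunk_hash, data)    # populate the shared cache
        return data
```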
Decentralized Cluster Cache II
Which peer has a particular chunk?
Traditional DHT: the peer ID determines responsibility
This algorithm: peers “float” in hash space
Simulation result with a typical workload: 90% efficiency, resilient to peer churn
Each peer maintains a small number of slots (requires < 1 KB per peer):
[Figure: a binary prefix tree over the hash space (0/1, 00...11, 000...111); the slots of peers A and B cover sub-trees and are split or merged as conditions change]
split: drop responsibility for the left or right sub-tree when there are too many cache misses or too much load
merge: create a free slot by merging existing slots with the greatest common prefix
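A toy model of the floating slots, assuming binary hash prefixes as slot identifiers; the split and merge rules follow the description above, while load thresholds and peer coordination are omitted:

```python
from os.path import commonprefix

class PeerSlots:
    """A peer answers for the parts of the hash space whose binary prefixes
    it currently holds (well under 1 KB of state). Splitting sheds load;
    merging frees a slot for new responsibilities."""

    def __init__(self, prefixes):
        self.prefixes = set(prefixes)            # e.g. {"0", "10", "111"}

    def covers(self, key_bits):
        """Is this peer responsible for a chunk whose hash starts with key_bits?"""
        return any(key_bits.startswith(p) for p in self.prefixes)

    def split(self, prefix, keep_left=True):
        """Too many cache misses or too much load: keep one half of the
        sub-tree and drop responsibility for the other half."""
        self.prefixes.remove(prefix)
        self.prefixes.add(prefix + ("0" if keep_left else "1"))

    def merge(self):
        """Create a free slot by merging the pair of slots with the greatest
        common prefix into that prefix."""
        if len(self.prefixes) < 2:
            return
        pairs = [(a, b) for a in self.prefixes for b in self.prefixes if a < b]
        a, b = max(pairs, key=lambda p: len(commonprefix(p)))
        self.prefixes -= {a, b}
        self.prefixes.add(commonprefix((a, b)))
```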
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Cloud Computing for HEP
Use cases: server consolidation, worker node consolidation, portable analysis environment, Volunteer Computing, long-term data preservation
Ideal: an incarnation of utility computing
Common denominator in practical terms:
[Figure: interactive services, Grid services, build-and-test services, and worker nodes run on a private cloud and a public cloud (e.g. Amazon EC2), both behind a standardized API]
1 Virtualized infrastructure (IaaS)
2 Dynamically reconfigurable
3 Adjusts to load
HEP-Specific Obstacles
Envisaged model:
∙ Jobs packaged as virtual machines
∙ 1 VM per 1-4 CPU cores
∙ Typical running time: hours to days
∙ Relatively high peer churn
∙ ≈ 4 GB RAM per VM, optimized by Kernel Shared Memory
∙ VM image usable in various clouds and in BOINC
Obstacles (and approaches):
∙ Performance: depends on the workload, up to 15%; compensated by higher resource utilization
∙ Proliferation of VM images: high trust requirements by Grid sites; addressed by CernVM-FS
∙ Connection to job and storage frameworks; approaches: Co-Pilot, XRootD, EOS
∙ Software distribution: many GB per release, more than one release per week; distributing platform + software as images does not scale; addressed by CernVM-FS
∙ Trust: correctness of results?
1 Introduction
2 CernVM-FS
3 HEP Software Figures
4 Experimental Extensions
5 Use Case: Cloud Computing & Volunteer Computing
6 Conclusions
Software Distribution – Lessons Learned
Content addressable storage
⇒ Immutable
⇒ Trivial to verify
⇒ Trivial to cache
⇒ De-duplication
+ Versioning
+ Garbage collection by reference counting
Meta data
∙ As important as data
∙ Almost the same order of magnitude as experiment meta data
∙ Natural pre-caching by means of file catalogs
System design
∙ Scalable infrastructure
∙ Failure handling within worker nodes
∙ Failures are the norm
∙ Virtualization: homogeneous interface
∙ Separate the concerns of Grid sites and experiments
∙ Avoid large p2p networks: a mid-sized p2p backbone plus independent, mid-sized p2p cells