Scalla/XRootd Advancements
xrootd/cmsd (f.k.a. olbd)
Fabrizio Furano, CERN – IT/PSS
Andrew Hanushevsky, Stanford Linear Accelerator Center
http://xrootd.slac.stanford.edu
Outline
Current Elaborations
  Composite Cluster Name Space
  POSIX file system access via FUSE+xrootd
New Developments
  Cluster Management Service (cmsd)
  Cluster globalization
  WAN direct data access
Conclusion
The Distributed Name Space
The Scalla/xrootd suite implements a distributed name space
  Very scalable and efficient
  Sufficient for data analysis
Some users and applications (e.g., SRM) rely on a centralized name space
This spurred the development of a Composite Name Space (cnsd) add-on
  The simplest solution with the least entanglement
Composite Cluster Name Space

[Diagram: a client talks to the redirector (xrootd:1094); name-space operations (open/trunc, mkdir, mv, rm, rmdir) performed on the data servers are relayed by cnsd to a dedicated name-space xrootd (xrootd:2094), so opendir() refers to the directory structure maintained by xrootd:2094. The redirector carries "xroot.redirect mkdir myhost:2094"; the data servers carry "ofs.notify closew, create, mkdir, mv, rm, rmdir |/opt/xrootd/etc/cnsd".]
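Read together, the two directives in the diagram sketch the whole add-on. The placement below is my reading of the slide; "myhost" is a placeholder:

    # Redirector node: send client name-space requests (e.g., mkdir) to the
    # name-space xrootd listening on port 2094
    xroot.redirect mkdir myhost:2094

    # Data server nodes: pipe name-space events to the cnsd agent, which
    # replays them against the composite name-space xrootd
    ofs.notify closew, create, mkdir, mv, rm, rmdir |/opt/xrootd/etc/cnsd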
cnsd Specifics
Servers direct name-space actions to common xrootd(s)
  The common xrootd maintains the composite name space
  Typically, these run on the redirector nodes
Name space replicated in the file system
  No external database needed
  Small disk footprint
Deployed at SLAC for Atlas
  Needs synchronization utilities, more documentation, and packaging
  See Wei Yang for details
A similar MySQL-based system is being developed by CERN/Atlas:
Annabelle Leung <annabelle.leung@cern.ch>
Data System vs File System
Scalla is a data access system
  Some users/applications want file system semantics
  More transparent, but many times less scalable
For years users have asked: can Scalla create a file system experience?
The answer: it can, to a degree that may be good enough
We relied on FUSE to show how
What is FUSE?
Filesystem in Userspace
  Used to implement a file system in a user-space program
  Linux 2.4 and 2.6 only
  Refer to http://fuse.sourceforge.net/
Can use FUSE to provide xrootd access
  Looks like a mounted file system
SLAC and FZK have xrootd-based versions of this
  Wei Yang at SLAC: tested and practically fully functional
  Andreas Petzold at FZK: in alpha test, not fully functional yet
XrootdFS (Linux/FUSE/Xrootd)

[Diagram: on each client host, applications use the POSIX file system interface; the kernel's FUSE module hands the calls to a user-space FUSE/Xroot interface built on the xrootd POSIX client. Name-space operations (create, mkdir, mv, rm, rmdir, opendir) flow to the redirector (xrootd:1094) and the name-space xrootd (xrootd:2094).]

You should still run cnsd on the servers to capture non-FUSE events.
XrootdFS Performance

Setup
  Sun V20z (RHEL4): 2x 2.2GHz AMD Opteron, 4GB RAM, 1Gbit/sec Ethernet
  Client: VA Linux 1220 (RHEL3): 2x 866MHz Pentium 3, 1GB RAM, 100Mbit/sec Ethernet

Results
  Unix dd, globus-url-copy & uberftp: 5-7MB/sec with 128KB I/O block size
  Unix cp: 0.9MB/sec with 4KB I/O block size
Conclusion: better for some things than others.
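A back-of-envelope reading of the cp number (assuming throughput at small block sizes is capped by a fixed per-operation cost):

\[
t_{\mathrm{op}} \approx \frac{4\,\mathrm{KB}}{0.9\,\mathrm{MB/s}} \approx 4.4\,\mathrm{ms}
\]

so 4KB requests cannot exceed roughly 1MB/sec no matter what the link (~12MB/sec at 100Mbit/sec) could carry, while 128KB requests amortize the same per-operation overhead and get much closer to the wire.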
Why XrootdFS?
Makes some things much simpler
  Most SRM implementations run transparently
  Avoids pre-load library worries
But impacts other things
  Performance is limited
    Kernel-FUSE interactions are not cheap
  Rapid file creation (e.g., tar) is limited
  FUSE must be administratively installed to be used
    Difficult if it involves many machines (e.g., batch workers)
    Easier if it involves an SE node (i.e., SRM gateway)
Next Generation Clustering
Cluster Management Service (cmsd)
  Functionally replaces olbd
Compatible with the olbd config file
  Unless you are using deprecated directives
Straightforward migration (see the config sketch below)
  Either run olbd or cmsd everywhere
Currently in alpha test phase
  Available in CVS head
  Documentation on the web site
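To make the migration point concrete, a minimal sketch of a single config file that either daemon could read. It uses the all.role and all.manager directives that appear on these slides plus the config file's host conditional (if/else/fi); the host name is a placeholder:

    # Shared cluster config (sketch): one redirector plus data servers
    all.manager myredirector.example.org:1312
    if myredirector.example.org
       all.role manager     # this node is the redirector
    else
       all.role server      # every other node serves data
    fi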
cmsd Advantages
Much lower latency
New, very extensible protocol
Better fault detection and recovery
Added functionality
  Global clusters
  Authentication
  Server selection can include a space-utilization metric
  Uniform handling of opaque information
  Cross-protocol messages to better scale xproof clusters
Better implementation for reduced maintenance cost
Cluster Globalization

[Diagram: a meta-manager cmsd/xrootd pair at BNL (root://atlas.bnl.gov/), configured with "all.role meta manager" and "all.manager meta atlas.bnl.gov:1312", includes the SLAC, UOM, and UTA xroot clusters. Each site redirector runs "all.role manager" together with "all.manager meta atlas.bnl.gov:1312".]

Meta managers can be geographically replicated!
Note: the security hats will likely require you to use xrootd's native proxy support.
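Extracting the directives from the diagram, these are the two config fragments involved (the comments are added; everything else is on the slide):

    # BNL: the meta manager that federates the sites
    all.role meta manager
    all.manager meta atlas.bnl.gov:1312

    # Each site redirector (SLAC, UOM, UTA): stay a local manager,
    # but also subscribe to the meta manager at BNL
    all.role manager
    all.manager meta atlas.bnl.gov:1312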
Why Globalize?
Uniform view of participating clusters
Can easily deploy a virtual MSS
  Included as part of the existing MPS framework
Try out real-time WAN access
  You really don't need data everywhere!
Alice is slowly moving in this direction
  The non-uniform name space is an obstacle
  Slowly changing the old approach
  Some workarounds could be possible, though
Virtual MSS
A powerful mechanism to increase reliability
  Data replication load is widely distributed
  Multiple sites are available for recovery
Allows virtually unattended operation
  Based on BaBar experience with a real MSS
Idea: treat the meta-cluster that a cluster is subscribed to as its MSS
  Automatic restore after a server failure
    Missing files in one cluster are fetched from another
    Typically the fastest one that has the file really online
  Local-cluster file (pre)fetching on demand
    Can be transformed into a third-party copy
When cmsd is deployed, there is practically no need to track file location
  But metadata repositories are still needed (see the staging sketch below)
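One plausible wiring (a sketch under assumptions, not the slides' exact recipe): the standard oss staging hook on each data server can point at the federation, so a file that is offline locally gets fetched from a collaborating cluster. The script name and path are illustrative placeholders; the slides only say the mechanism comes packaged in the MPS framework.

    # Data server (sketch): when a requested file is not online locally,
    # invoke a stage command that copies it in from the global meta-cluster
    oss.stagecmd /opt/xrootd/utils/mps_Stage root://metaxrd.cern.ch/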
Virtual MSS – a way to do it

[Diagram: a meta-manager cmsd/xrootd pair at CERN (root://metaxrd.cern.ch/), configured with "all.role meta manager" and "all.manager meta metaxrd.cern.ch:1312", includes the SLAC and GSI xroot clusters; the site redirectors run "all.role manager" with the matching "all.manager meta" subscription. A local client still continues to work. Missing a file? Ask the global meta manager and get it from any other collaborating cluster.]

Meta managers can be geographically replicated!
Dumb WAN Access
Setup: client at CERN, data at SLAC
  164ms RTT, available bandwidth < 100Mb/s
Test 1: read a large ROOT tree (~300MB, 200k interactions)
  Expected time: 38000s (latency) + 750s (data) + CPU → 10 hrs!
Test 2: draw a histogram from that tree data (6k interactions)
  Measured time: ~15-20 min, using xrootd with WAN optimizations disabled
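A quick sanity check on the latency term (back-of-envelope, assuming one synchronous round trip per interaction):

\[
T_{\mathrm{latency}} \approx N_{\mathrm{req}} \times \mathrm{RTT} = 2\times10^{5} \times 0.164\,\mathrm{s} \approx 3.3\times10^{4}\,\mathrm{s} \approx 9\,\mathrm{h},
\]

which is why the estimate lands at roughly 10 hours before any real data movement or CPU time is counted.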
Source: Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007.
Smart WAN Access
Exploit the xrootd WAN optimizations
  TCP multi-streaming: up to 15x improvement in WAN data throughput
  The ROOT TTreeCache provides the hints on "future" data accesses
  TXNetFile/XrdClient "slides through", keeping the network pipeline full
Data transfer goes in parallel with computation
  Throughput improvement comparable to "batch" file-copy tools
    70-80%, and we are doing a live analysis, not a file copy!
Test 1 actual time: 60-70 seconds
  Compared to 30 seconds using a Gb LAN
  Very favorable for sparsely used files
Test 2 actual time: 7-8 seconds
  Comparable to LAN performance
  A 100x improvement over dumb WAN access, i.e., 15-20 minutes (a back-of-envelope view follows)
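The same back-of-envelope view explains the gain (assuming the full pipeline hides the per-interaction RTT): the run becomes bandwidth- or CPU-bound rather than latency-bound,

\[
T \approx \max\!\left(\frac{S}{B},\, T_{\mathrm{CPU}}\right), \qquad
\frac{S}{B} = \frac{300\,\mathrm{MB}}{100\,\mathrm{Mb/s}} = 24\,\mathrm{s},
\]

so the measured 60-70 seconds for Test 1 sits within a small factor of the bandwidth floor instead of the ~9-hour latency-bound figure.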
Source: Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007.
Conclusion
Scalla is a robust framework
Elaborative
  Composite Name Space
  XrootdFS
Extensible
  Cluster globalization
Many opportunities to enhance data analysis
  Speed and efficiency
Acknowledgements
Software Collaborators
  INFN/Padova: Alvise Dorigo
  Root: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (windows)
  Alice: Derek Feichtinger, Guenter Kickinger
  CERN: Fabrizio Furano (client), Andreas Peters (Castor)
  STAR/BNL: Pavel Jakl
  Cornell: Gregory Sharp
  SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
  BaBar: Peter Elmer (packaging)
Operational Collaborators
  BNL, CNAF, FZK, INFN, IN2P3, RAL, SLAC
Funding
  US Department of Energy
    Contract DE-AC02-76SF00515 with Stanford University
  Formerly INFN (BaBar)