Costin.Grigoras@cern.ch
Torrent-based Software Distribution in ALICE
Outline
GDB, Annecy, 10.10.2012
- Motivation
- How it works
- Site requirements
- History
- Migration status
Motivation
- ALICE was using site shared areas for installing the pre-compiled experiment software packages
- Large sites suffered from AFS/NFS scalability issues, and the shared area was a single point of failure
- Large disk space was needed for the many active versions
- The old model needed a site-local service to manage the installation, unpacking and deletion of the packages
- The requirement for strict site configuration to support the operation excludes the use of opportunistic resources/centres
- From the very beginning, the shared SW area and its access from the VO-box was considered a security risk
- All of the above and more are solved by using the Torrent protocol to distribute the software packages
Torrent terminology
- package.tar.gz: the original file, split into chunks of equal size
- package.tar.gz.torrent: metadata of the original file (SHA1 of each chunk, SHA1 of the entire file, tracker location)
- Tracker: central service where clients register and look up peers
- Initial seeder: the first client offering the complete file
- Seeder: a client holding the complete file
- Leech: a client still downloading; clients exchange chunks among themselves, preferring high-speed peers
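The metadata items above can be illustrated with a short sketch. This is a simplified model: a real .torrent file is a bencoded dictionary whose "pieces" field concatenates the raw 20-byte SHA1 digests, and the announce URL path here is a hypothetical example on the alitorrent host.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # chunks ("pieces") of equal size; only the last may be shorter

def make_torrent_metadata(data: bytes, tracker: str) -> dict:
    """Build a simplified model of what package.tar.gz.torrent contains."""
    pieces = [data[i:i + PIECE_SIZE] for i in range(0, len(data), PIECE_SIZE)]
    return {
        "announce": tracker,                                            # tracker location
        "piece length": PIECE_SIZE,
        "piece hashes": [hashlib.sha1(p).hexdigest() for p in pieces],  # SHA1 of chunks
        "file hash": hashlib.sha1(data).hexdigest(),                    # SHA1 of entire file
        "length": len(data),
    }

meta = make_torrent_metadata(b"x" * (PIECE_SIZE + 1),
                             "http://alitorrent.cern.ch:8088/announce")
print(len(meta["piece hashes"]))  # 2: one full piece plus one trailing byte
```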
How it works
- Build servers
- Software repository (one tar.gz per version)
- AliEn file catalogue: torrent://alitorrent.cern.ch/
- Torrent tracker: alitorrent.cern.ch:8088
- Torrent seeder: alitorrent.cern.ch:8092
- Worker nodes (WN 1 ... WN n) at each site (Site X, Site Y, ...) download and exchange chunks among themselves
- No seeding between sites
How it works (2)
- Build servers for SLC5 (32b, 64b), SLC6 (32b, 64b), Mac OS X, Ubuntu
- Software repository: 150 GB in 600 archives
- Total size of a compressed (4x compression factor) software set per job is ~300 MB (this is what is downloaded to the WN)
- One central tracker and seeder, limited to 50 MB/s to the world
- Fallback to other download methods (wget, xrdcp) if the torrent download fails for any reason, but the packages are seeded nevertheless
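A minimal sketch of the fallback chain, assuming the torrent client is aria2c (named later in the deck for its default ports); the exact command lines and URLs used by AliEn are not shown in the slides, so these invocations are illustrative only.

```python
import subprocess

def fetch_with_fallback(url: str, dest: str, methods=None) -> str:
    """Try each download method in order; return the name of the first that succeeds."""
    methods = methods or [
        ("torrent", ["aria2c", "-o", dest, url]),  # primary: torrent download (keeps seeding)
        ("wget",    ["wget", "-O", dest, url]),    # fallback 1
        ("xrdcp",   ["xrdcp", url, dest]),         # fallback 2
    ]
    for name, cmd in methods:
        if subprocess.call(cmd) == 0:
            return name
    raise RuntimeError(f"all download methods failed for {url}")
```

The point of the chain is that a torrent failure never blocks the job: a slower central download takes over, and the worker still seeds the package afterwards.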
How it works (3)
Bootstrap
- The pilot job script fetches and installs on the local node (`pwd`) the latest AliEn build (20 MB) via torrent
- The AliEn JobAgent gets a real job from the central queue and downloads the required software packages, continuing to seed them in the background so other local agents can quickly get them over the LAN
- The JA will run more jobs of the same type (user and SW requirements) within the TTL of the job
- Everything is downloaded into the sandbox of the job, so it is wiped at the end of its execution
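The JobAgent lifecycle described above can be sketched as follows; the function and parameter names are hypothetical, not the actual AliEn API.

```python
import shutil
import tempfile
import time

def run_job_agent(queue, ttl_seconds, download, run_job):
    """Hypothetical sketch of the JobAgent loop: run matching jobs until the TTL expires."""
    sandbox = tempfile.mkdtemp(prefix="ja-")    # everything lands in the job sandbox
    deadline = time.time() + ttl_seconds
    completed = 0
    try:
        while queue and time.time() < deadline:
            job = queue.pop(0)                  # next job of the same type (user + SW requirements)
            download(job["packages"], sandbox)  # torrent download; seeding continues in background
            run_job(job, sandbox)
            completed += 1
    finally:
        shutil.rmtree(sandbox)                  # sandbox wiped at the end of execution
    return completed
```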
Torrent features we use
- Clients explicitly publish their private IP in the central tracker, allowing the discovery of LAN peers via this common service even behind NAT
- Local Peer Discovery: multicast to discover peers on the same network
- Peer exchange: peer lists are distributed between the local peers
- Distributed Hash Tables: decentralized seeder lookup (seeders act as trackers)
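Local Peer Discovery works by sending a small announce datagram to a well-known multicast group. Below is a sketch following the BitTorrent Local Service Discovery convention (BEP 14); whether ALICE's torrent client uses exactly this wire format is an assumption.

```python
import socket

# IPv4 multicast group and port defined by the Local Service Discovery extension (BEP 14)
LPD_GROUP, LPD_PORT = "239.192.152.143", 6771

def lpd_announce(infohash: str, listen_port: int) -> bytes:
    """Build a Local Peer Discovery announce datagram (HTTP-like BT-SEARCH message)."""
    return (
        "BT-SEARCH * HTTP/1.1\r\n"
        f"Host: {LPD_GROUP}:{LPD_PORT}\r\n"
        f"Port: {listen_port}\r\n"
        f"Infohash: {infohash}\r\n"
        "\r\n\r\n"
    ).encode("ascii")

# Sending it on a worker node would look like (not executed here):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(lpd_announce("aa" * 20, 6881), (LPD_GROUP, LPD_PORT))
```

Peers listening on the same multicast group learn the sender's address and port, which is how WNs behind the same site firewall find each other without external help.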
Site requirements
- How to allow this to happen: iptables rules accepting
  - Outgoing to alitorrent.cern.ch on TCP/8088,8092
  - WN-to-WN on TCP, UDP / 6881:6999 (aria2c default listening ports)
  - UDP, IGMP to 224.0.0.0/4 (local peer discovery)
- Typically this is already the case; in some cases the ports had to be whitelisted (very smart firewalls)
- Implicitly, sites do not exchange any torrent traffic between them
- No service to run on the site or on the machines, no shared area any more, no SPF, and essentially no local support needed for this
History
- The deployment has faced only policy difficulties; it was eventually accepted once the technology was understood ("there is no evil technology, only evil use")
- First tests at CERN in 02.2009
- Site deployments starting 06.2009, as the shared areas were proving insufficient; first at the large sites, in operation for 2 years
- Presented in various forums within the collaboration and at CHEP conferences
- Large awareness call in 01.2012 at the ALICE T1/T2 Workshop in Karlsruhe
Migration status
- First transitions done in close collaboration with the sites: debugging on the WNs, following up the consequences on the local network, firewalls and such
- One month ago we asked all sites for permission to enable torrent
  - Most have confirmed that their policy allows the torrent protocol, checked the firewall rules, and now run torrent
  - Working with the rest to solve the (mostly) non-technical issues
  - Some mails went to unread mailboxes
Migration status (2)
- T0 in operation for 3 years
- T1s: 5/6 migrated
- T2s: 36/78 migrated
- Currently covering 2/3 of the resources, so on average more than 20K concurrent jobs are using torrent
- Rock solid, very efficient technology; no incidents reported
- Aiming for full migration by the time the next AliEn version is deployed, to completely drop the PackMan VoBox service and the need for a shared SW area and caches
Conclusion
Torrents have enabled us to:
- Simplify site operations by removing a VoBox service and the shared SW areas
- Significantly reduce the problems associated with SW deployment, relieving the site support staff
- Have quick software release cycles (both experiment and Grid middleware)
The migration process was carefully staged:
- Policy limitations clarified in discussions with security experts
- Discussions and deployment at T0/T1s and selected T2s (regional coverage)
- Presently moving towards complete site coverage
This lifts some of the requirements for a site VoBox, specific configurations and services.
A forward-looking system: towards opportunistic use of resources and clouds!