Experiment Support
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
DBES
A. Abramyan, L. Betev, D. Goyal, A. Grigoras, C. Grigoras, M. Litmaath, N. Manukyan,
M. Martinez, J. Porter, P. Saiz, S. Sankar, S. Schreiner
Roadmap to AliEn v2-20
9 Mar 2012, Pablo Saiz, ALICE offline week
What’s new
• Plenty of new improvements
  – Catalogue simplification
  – Client UI
  – Extreme Job Brokering
  – Removal of PackMan
  – New JDL fields
  – Proxy renewal
  – Job Memory checkup
• And baseline for new development
Catalogue Simplification
• Up to now, catalogue divided into multiple DBs:
  – Simplifies scalability
  – Logic slightly more complicated
• Changing username/userid
  – Smaller tables
Thanks Dushyant, Subho
PackMan
• Removing the PackMan/PackManMaster services
• Functionality stays in client UI/JA
  – JA can install packages directly
  – Very powerful if combined with torrent
• Speeds up most of the PackMan operations
Thanks Narine, Armenuhi
New JDL fields
• MaxWaitingTime: amount of time that job can stay in ‘WAITING’
  – If time exceeded, job ends up in error
  – New state: ERROR_EW (Expired Waiting)
• Retrial:
  – Number of times that a single job can be resubmitted
  – Resubmission done by central services
• Reusing JobId in resubmission
• Direct removal of KILLED jobs
Thanks Miguel
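The waiting-time and resubmission rules above can be sketched in a few lines of Python. This is only an illustration of the semantics on the slide, not AliEn code: the `Job` class, the function names, the use of seconds as the time unit, and the choice to treat any state starting with `ERROR` as eligible for resubmission are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: int
    status: str               # e.g. "WAITING" or "ERROR_EW"
    waiting_since: float      # time the job entered WAITING (seconds, assumed unit)
    max_waiting_time: float   # value of the MaxWaitingTime JDL field
    retrial: int = 0          # value of the Retrial JDL field
    resubmissions: int = 0    # how many times the job was already resubmitted


def expire_waiting(job: Job, now: float) -> None:
    """Move a job to ERROR_EW once it has waited longer than MaxWaitingTime."""
    if job.status == "WAITING" and now - job.waiting_since > job.max_waiting_time:
        job.status = "ERROR_EW"


def resubmit(job: Job, now: float) -> bool:
    """Resubmit a failed job, keeping its JobId, while Retrial allows it."""
    failed = job.status.startswith("ERROR")  # assumption: any error state qualifies
    if failed and job.resubmissions < job.retrial:
        job.resubmissions += 1
        job.status = "WAITING"
        job.waiting_since = now
        return True
    return False


job = Job(job_id=42, status="WAITING", waiting_since=0.0,
          max_waiting_time=3600.0, retrial=1)
expire_waiting(job, now=4000.0)
print(job.status)
print(resubmit(job, now=4000.0), job.job_id, job.status)
```

In this sketch the resubmitted job keeps `job_id` 42, which mirrors the "Reusing JobId in resubmission" bullet. A second failure would not be resubmitted because `retrial` is 1.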
Extreme Brokering
• Postpone splitting of job until last moment
• Decide data to be analyzed based on current location of JA & files not analyzed yet
• Can define Max/Min number of files to be analyzed
  – Even if the files are not local
• Fewer subjobs:
  – Easier merging
Thanks Pablo
Current situation
[Diagram: JobManager splitting a job into JOBs, shown for three cases]
• Works nicely if one replica per file
• A bit more complex with 3 SE and 2 replicas
• And a lot more with 50 SE and 3 replicas
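A rough way to see why the number of SEs and replicas matters: if subjobs are grouped by the exact set of SEs holding a file's replicas (an assumption about the splitting, not something the slide states), the number of possible groups is the number of ways to choose the replica locations. The helper name below is hypothetical.

```python
from math import comb


def max_location_groups(n_se: int, n_replicas: int) -> int:
    """Number of distinct sets of n_replicas storage elements out of n_se."""
    return comb(n_se, n_replicas)


for n_se, n_replicas in [(3, 2), (50, 3)]:
    print(n_se, "SE,", n_replicas, "replicas:",
          max_location_groups(n_se, n_replicas), "possible location sets")
```

Under that assumption, 3 SE with 2 replicas gives 3 possible location sets, while 50 SE with 3 replicas gives 19600.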
Example
[Table: File 1 to File 5 against Site A, Site B, Site C; the replica placement is not recoverable from the text]
Current schema
• Submit 4 jobs: File 1 + File 4, File 2, File 3, File 5
Broker per file
• Submit 3 empty subjobs
• When a job starts, analyze as much as possible (first job: File 1, 2, 4, 5; second job: File 3)
• If nothing left, just exit
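The difference between the two schemas in this example can be sketched as follows. The replica placement in `REPLICAS` is hypothetical (the slide's table did not survive), chosen so that the results match the job counts quoted above. The function names, the grouping rule in `split_up_front`, and the `min_files` / `max_files` parameters are also assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical replica placement, chosen to reproduce the numbers on the slide.
REPLICAS = {
    "File1": {"A", "B"},
    "File2": {"A", "C"},
    "File3": {"B", "C"},
    "File4": {"A", "B"},
    "File5": {"A"},
}


def split_up_front(replicas):
    """Current schema (assumed rule): one subjob per distinct set of sites."""
    groups = defaultdict(list)
    for name in sorted(replicas):
        groups[frozenset(replicas[name])].append(name)
    return sorted(groups.values())


def pick_files(site, replicas, pending, min_files=0, max_files=4):
    """Broker per file: called when a job agent starts at `site`.

    Takes up to max_files pending files with a local replica; if fewer than
    min_files are local, tops up with non-local files. Removes the chosen
    files from `pending` and returns them (empty list means: just exit).
    """
    local = [f for f in sorted(pending) if site in replicas[f]]
    remote = [f for f in sorted(pending) if site not in replicas[f]]
    chosen = local[:max_files]
    if len(chosen) < min_files:
        chosen += remote[:min_files - len(chosen)]
    pending.difference_update(chosen)
    return chosen


print(split_up_front(REPLICAS))      # four groups, i.e. four subjobs

pending = set(REPLICAS)
for site in ("A", "B", "C"):         # order in which job agents happen to start
    print(site, pick_files(site, REPLICAS, pending))
```

With this placement the agent at site A takes File1, File2, File4 and File5, the agent at site B takes File3, and the agent at site C finds nothing left and exits, as in the example.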
Proxy renewal system
• Replaces vobox-proxy-renewal service
• Can receive ‘validity’ or proxies
  – Simplifies CREAM-CE job submission
• No corruption of proxies
• Can be started by non-root user
• Already deployed at CERN
  – And for some CMS sites…
• Can already be deployed
Thanks Maarten
New development
• More than 1 year since last major update
• Some backward incompatible changes
  – Change of catalogue schema
• What to do with new requests, bugs:
  – Debug current system?
  – Debug in new version?
  – Both!
AliEn deployment for ALICE
[Diagram: catalogue (with BACKUP), Central Services (TaskQueue, Transfers, LDAP), Api, vobox and JA, with aliensh and ROOT as clients]
• 3 machines (+1 slave, backups): AliEn v2-17
• 12 machines: AliEn v2-19**, v2-17
• 8 machines: AliEn v2-19**
• 80 sites: AliEn v2-19.(80-163)
• JA, 40.000 wn: AliEn v2-19.(80-163)
How to test new versions…
• Build system:
  – Multiple platforms
  – Integration & basic functionality tests
    • No API/access from ROOT tests
  – Similar to the AliROOT, ROOT build systems
  – Running the whole system on a single machine
  – http://alienbuild.cern.ch:8888
Already deployed for PANDA
• Running since September
  – 12th PANDA Grid Workshop and 2nd AliEn Developers Week
• Multiple sites, smaller load than ALICE
  – No API services
  – ‘Old’ v2-20 version
Thanks PANDA
Previous major update
• Stopping the whole system
  – 1 week to redeploy
  – 1 month ironing out details
Not an option!
Second set of services:
[Diagram: two parallel sets of services, each with catalogue, Central Services (TaskQueue, Transfers, LDAP), Api, aliensh, ROOT, CE and JA]
Second set of services
• Copy of the catalogue
• 3 different central machines, 3 voboxes, same SE
• What to do with output
  – Throw away (easiest)
  – Incorporate back (easy if output in a different directory)
Timeline
[Timeline axis: Mar, Apr, May]
Now:
  – 1 week: Investigate test system
  – 1 week: Test Catalogue migration
  – 1 week: Define New VO
  – 1 week: Verify quotas
1 month:
  – New hardware for CS
  – 2 days: Central deployment from backup
  – 3 days: First site working (CERN)
  – 2 weeks: At least 2 external sites (CCIN2P3, ?)
  – After that works, keep adding sites
2 months:
  – 1 day: Switch VO
  – 1 day: Overall site upgrade
Summary
• AliEn v2-20 ready for deployment
  – With plenty of new features and bug fixes
• Minimize upgrade downtime
  – Create testing setup with several sites, and with all the SE
  – More effort on testing (also from site admins)
• Deploy Test VO with ALICE sites
• And say goodbye to v2-19 in two months
Thank you!!
Job execution
[Diagram: JobManager, TASKQUEUE, Job Broker and File catalogue (LFN, GUID, Metadata) on the central side; Site A, Site B and Site C each with CE, JA, JOBs, MonALISA and xrootd]