Recent PROOF developments
G. Ganis
PROOF workshop, 29 November 2007
29/11/2007 G. Ganis, PROOF workshop 2007

The PROOF approach in a nutshell
[Diagram: PROOF architecture. Labels: catalog, storage, PROOF farm, scheduler, query, MASTER; PROOF job: data file list, myAna.C; files; feedbacks (merged); final outputs (merged)]
• farm perceived as extension of local PC
• same syntax as in local session
• more dynamic use of resources
• real time feedback
• automated splitting and merging

Issues addressed by the developments
• User interface
   • Processing of generic jobs
   • Data set, software handling
• Performance and responsiveness
   • Load balancing within a query
   • Access to data
• Monitoring tools
   • Processing information at the end of a query
   • Memory usage
• Resource usage control
   • Enforce priorities
   • Improve responsiveness in multi-user environment
• Testing, tutorials, installation

Dataset manager
J. Iwaszkiewicz + G. Bruckner + J.F. Grosse-Oetringhaus (more in Jan-Fiete's talk)
• Metadata about a set of files stored in the sandbox on the master, in a dedicated subdirectory: <DatasetDir>/group/user/dataset or <SandBox>/dataset
• Data-sets are identified by name
   root[0] TProof *proof = TProof::Open("master");
   root[1] TFileCollection fc("dum", "", "file.list");
   root[2] proof->RegisterDataSet("MyDataSet", &fc);
   root[3] proof->ShowDataSets();
   Existing Datasets:
   MyDataSet
• Data-sets can be processed by name
   • No need to create the chain locally (i.e. on the client)
   root[] proof->Process("MyDataSet", "MySelector.C+");

Generic, non-data-driven analysis
L. Tran-Thanh
• Implement algorithm in a TSelector
   [Diagram: Selector timeline. Begin(): create histos, …, define output list. Process(): analysis, called 1…N times. Terminate(): final analysis (fitting, …). Results go to the output list.]
• New TProof::Process(const char *selector, Long64_t times)
   // Open the PROOF session
   root[0] TProof *p = TProof::Open("master")
   // Run 1000 times the analysis defined in the
   // MonteCarlo.C TSelector
   root[1] p->Process("MonteCarlo.C+", 1000)

Generic, non-data-driven analysis
• New packetizer TPacketizerUnit
   • Time-based packet sizes
   • Processing speed of each worker measured dynamically
• Included in ROOT 5.17/04
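
A minimal sketch of the time-based packet sizing idea, in standalone C++ (no ROOT). This is not the TPacketizerUnit code: the names WorkerRate and nextPacketSize, the target packet duration and the calibration packet size are all hypothetical, chosen only for illustration. The speed measured for each worker is used to size its next packet so that it should take roughly a fixed time.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical per-worker bookkeeping: cycles processed and time spent so far.
struct WorkerRate {
   std::int64_t cyclesDone = 0;    // cycles completed by this worker
   double       secondsBusy = 0.;  // time spent processing them

   // Record a finished packet.
   void Update(std::int64_t cycles, double seconds)
   {
      assert(cycles >= 0 && seconds >= 0.);
      cyclesDone += cycles;
      secondsBusy += seconds;
   }

   // Measured speed in cycles per second; 0 if nothing has been measured yet.
   double CyclesPerSecond() const
   {
      return secondsBusy > 0. ? cyclesDone / secondsBusy : 0.;
   }
};

// Size the next packet so that it should take about targetSeconds on this
// worker. Before any measurement, fall back to a small calibration packet
// (assumed default: 10 cycles). The result is kept within [1, remaining];
// 0 is returned when nothing is left to process.
std::int64_t nextPacketSize(const WorkerRate &w, double targetSeconds,
                            std::int64_t remaining,
                            std::int64_t calibrationSize = 10)
{
   if (remaining <= 0) return 0;
   std::int64_t size = calibrationSize;
   const double rate = w.CyclesPerSecond();
   if (rate > 0.)
      size = static_cast<std::int64_t>(std::llround(rate * targetSeconds));
   return std::min(remaining, std::max<std::int64_t>(size, 1));
}
```

With this scheme a fast worker receives larger packets than a slow one, so all workers report back at similar intervals and the last packets of a query finish at about the same time.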

Output file merging
L. Tran-Thanh
• Large output objects (e.g. trees) create memory problems
• Solution:
   • save them in files on the workers
   • merge the files on the master using TFileMerger
• New class TProofFile defines the file and provides tools to handle the merging
   • Unique file names are created internally to avoid crashes
• Merging happens on the master at the end of the query
• Final file is left in the sandbox on the master or saved where the client wishes
• Included in ROOT 5.17/04

Output file merging: example
   void PythiaMC::SlaveBegin(TTree *)
   {
      // Meta file object: to be added to the output list
      fProofFile = new TProofFile();
      fOutput->Add(fProofFile);
      // Output filename (any format understood by TFile::Open)
      TNamed *outf = (TNamed *) fInput->FindObject("PROOF_OUTPUTFILE");
      if (outf) fProofFile->SetOutputFileName(outf->GetTitle());
      // Open the file with a unique name
      fFile = fProofFile->OpenFile("RECREATE");
      // Create the tree and attach it to the file
      fTree = new TTree(…);
      fTree->SetDirectory(fFile);
      …
   }

   Bool_t PythiaMC::Process(Long64_t entry)
   {
      fTree->Fill();
      return kTRUE;
   }

   void PythiaMC::SlaveTerminate()
   {
      if (fFile) {
         fFile->cd();
         // Write here big objects
         fTree->Write();
         fFile->Close();
      }
   }

Software handling
• Package handling
   • Separated behaviour client / cluster for enabling
   • Real-time feedback during build
   • API to modify include / library paths on the workers
   • Use packages globally available on the cluster
• Load mechanism extended to single class / macro
   root[] TProof *proof = TProof::Open("master")
   root[] proof->Load("MyClass.C")
• Selectors / macros / classes binaries cached
   • Decreases initialization time if selector did not change
   • Version check for binaries based also on SVN revision
• Support for multiple ROOT versions

Software handling
• Next steps
   • Package versioning (e.g. ESD-v1.12.103-new)
      • Directory structure including also the ROOT version
        ~$ pwd
        <SandBox>/packages/ESD/1.12.103-new/root_v5.17.05-r20920/ESD
   • Filter the selector code into "client" and "cluster" parts
      • Clients should not be obliged to load tons of experiment libraries typically needed only for processing on the cluster
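
The versioned package layout shown in the example path can be sketched in standalone C++ as follows. This is only an illustration: the function names splitPackageTag and packageDir are hypothetical, and the rule that the first "-v" separates package name and version is an assumption read off the single example ESD-v1.12.103-new, not a documented PROOF convention.

```cpp
#include <cassert>
#include <string>

// Split a tag such as "ESD-v1.12.103-new" into name "ESD" and version
// "1.12.103-new". Assumption: the first "-v" separates name and version.
// Returns false if the tag does not have that form.
bool splitPackageTag(const std::string &tag, std::string &name, std::string &version)
{
   const std::string::size_type pos = tag.find("-v");
   if (pos == std::string::npos || pos == 0 || pos + 2 >= tag.size()) return false;
   name = tag.substr(0, pos);
   version = tag.substr(pos + 2);
   return true;
}

// Build <sandbox>/packages/<name>/<version>/<rootTag>/<name>, mirroring the
// example path; rootTag encodes the ROOT version and SVN revision,
// e.g. "root_v5.17.05-r20920".
std::string packageDir(const std::string &sandbox, const std::string &name,
                       const std::string &version, const std::string &rootTag)
{
   assert(!name.empty() && !version.empty() && !rootTag.empty());
   return sandbox + "/packages/" + name + "/" + version + "/" + rootTag + "/" + name;
}
```

Keeping the ROOT tag as its own path component lets builds of the same package version for different ROOT versions coexist in one sandbox.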

Load balancing: improved packetizer
J. Iwaszkiewicz
• Packetizer's goal: optimize work distribution to process queries as fast as possible
• Standard TPacketizer's strategy
   • first process local files, then try to process remote data
• End-of-query bottleneck
   [Plot: active workers vs. processing time]

New strategy: TPacketizerAdaptive
• Predict processing time of local files for each worker
• Keep assigning remote files from start of the query to workers expected to finish faster
• Processing time improved by up to 50%
[Plots: processing rate for all packets, NEW vs. OLD packetizer, same scale; remote packets marked]
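
A minimal sketch of the adaptive strategy in standalone C++: predict when each worker will finish, and give each remote file to the worker predicted to finish first. This is not the TPacketizerAdaptive implementation; the Worker struct, its fields and both function names are hypothetical, and the sketch ignores that remote reads are usually slower than local ones.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker state used by the sketch.
struct Worker {
   double localBytes;   // bytes of local files still to be processed
   double rate;         // measured processing rate in bytes per second (> 0)
   double remoteBytes;  // bytes of remote files assigned so far
};

// Predicted time for a worker to finish everything assigned to it.
double predictedFinish(const Worker &w)
{
   assert(w.rate > 0.);
   return (w.localBytes + w.remoteBytes) / w.rate;
}

// Assign each remote file (given by its size in bytes) to the worker that is
// currently predicted to finish first. Returns, per file, the index of the
// chosen worker, and updates the workers' remoteBytes.
std::vector<std::size_t> assignRemote(std::vector<Worker> &workers,
                                      const std::vector<double> &remoteFiles)
{
   assert(!workers.empty());
   std::vector<std::size_t> owner;
   owner.reserve(remoteFiles.size());
   for (double size : remoteFiles) {
      std::size_t best = 0;
      for (std::size_t i = 1; i < workers.size(); ++i)
         if (predictedFinish(workers[i]) < predictedFinish(workers[best])) best = i;
      workers[best].remoteBytes += size;
      owner.push_back(best);
   }
   return owner;
}
```

Because remote work goes to the workers with the least predicted load from the start, no worker is left idle near the end of the query while others still have a backlog of local files.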

Data access
• Tree cache enabled (+ asynchronous reading)
   • Expect improvements in the case of many users and non-local files
• Under study:
   • Exploit large number of cores and relatively large amount of memory of new machines
      • Separate thread for unzipping the data
      • Use xrootd as dynamic pre-loader

Monitoring the resource usage
• Per-query information
   • CPU time, wall time, bytes read, events, user, group
• Posted by the master via TVirtualMonitorWriter
   • E.g. MonALISA, MySQL
• Used for monitoring or to correct priorities based on usage history (see M. Meoni's talk)

Memory consumption monitoring
A. Kreshuk
• Workers monitor their memory usage and save info in the log file
• New button in the dialog box to display the evolution of memory usage per node in real time
• Client gets warned of high usage
   • The session may eventually be killed
• Prototype being tested

Motivation for scheduling?
• Controlling resources and how they are used
• Improving efficiency
   • assigning to a job those nodes that have data which needs to be analyzed
• Implementing different scheduling policies
   • e.g. fair share, group priorities & quotas
• Efficient use even in case of congestion

PROOF specific requirements
• Interactive system
   • Jobs should be processed as soon as submitted
   • However, when max system throughput is reached some jobs have to be postponed
• I/O bound jobs use more resources at the start and less at the end (file distribution)
• Try to process data at its location for performance
• User defines a dataset, not the number of workers

Enforcing experiment priority policies
• Based on group priority information defined in dedicated files
• Technology
   • "renice" low priority non-idle sessions
      • Priority = 20 - nice (-20 <= nice <= 19)
      • Limit max priority to avoid overloading the system
• May be centrally controlled
   • Master updates the priorities and broadcasts them to the active workers
• Feedback mechanism (e.g. via monitoring tool) allows the priorities to be adjusted (see M. Meoni's talk)
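
The priority definition above can be written out in standalone C++ as a small sketch. The relation Priority = 20 - nice and the nice range come from the slide; the function names and the default cap of 20 (i.e. nice 0) are assumptions for illustration, not PROOF settings.

```cpp
#include <algorithm>
#include <cassert>

// Priority as defined above: 20 - nice, for -20 <= nice <= 19,
// giving values from 1 (lowest) to 40 (highest).
int priorityFromNice(int nice)
{
   assert(nice >= -20 && nice <= 19);
   return 20 - nice;
}

// Inverse mapping with a cap: the nice value to apply for a requested
// priority, never going above maxPriority. The default cap of 20
// (nice 0) is an assumed value, not a PROOF default.
int niceForPriority(int priority, int maxPriority = 20)
{
   assert(maxPriority >= 1 && maxPriority <= 40);
   const int p = std::max(1, std::min(priority, maxPriority));
   return 20 - p;
}
```

On a POSIX system the resulting value could be applied to a session process with setpriority(); the slides do not say which mechanism PROOF uses internally.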

Central Scheduler
• Assigning a set of workers for a job based on:
   • The data set location
   • User priority (Quota + historical usage)
   • The current load of the cluster
• First implementation:
   • # of Workers ≈ relativePriority * nFreeCPUs
   • Assign least loaded workers first
• Missing ingredients
   • Come&Go functionality for workers
      • Needed also by the Condor interface
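
A minimal standalone C++ sketch of the first-implementation rule above: the number of workers is about relativePriority * nFreeCPUs, and the least loaded nodes are picked first. The Node struct, the function names, the rounding, and the minimum of one worker are assumptions; the slides give only the approximate formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of a free node.
struct Node {
   std::string name;
   double      load;  // current load; lower means less busy
};

// Number of workers for a job: relativePriority * nFreeCPUs, rounded.
// Assumption: at least one worker is granted whenever a CPU is free.
std::size_t nWorkersFor(double relativePriority, std::size_t nFreeCPUs)
{
   assert(relativePriority >= 0. && relativePriority <= 1.);
   if (nFreeCPUs == 0) return 0;
   const long n = std::lround(relativePriority * static_cast<double>(nFreeCPUs));
   return static_cast<std::size_t>(std::max(1L, n));
}

// Pick the n least loaded nodes; ties keep the input order.
std::vector<Node> pickWorkers(std::vector<Node> nodes, std::size_t n)
{
   std::stable_sort(nodes.begin(), nodes.end(),
                    [](const Node &a, const Node &b) { return a.load < b.load; });
   if (n < nodes.size())
      nodes.erase(nodes.begin() + static_cast<std::ptrdiff_t>(n), nodes.end());
   return nodes;
}
```

For example, a user with relative priority 0.25 on a cluster with 16 free CPUs would get 4 workers under this rule. Data set location, which the scheduler also takes into account, is left out of the sketch.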

Central scheduling
Schematic view
[Diagram: Client, PROOF master, Dataset Lookup, Scheduler (load, history, policy, …), Start workers. Numbered messages: 1: Job{dataset, …}; 2: dataset; 3: file locations; 4: Job info; 5: workers; 6: workers]

Tutorials, testing, installation
• Tutorials
   • Frame for PROOF examples: $ROOTSYS/tutorials/proof/runProof.C
   • Currently available:
      • « simple »: histogram filling with random entries
      • « h1-http »: H1 analysis reading data via HTTP
• Testing
   • Frame for PROOF tests: $ROOTSYS/test/stressProof.C
• Installation
   • Interactive script to simplify the installation of a small cluster: $ROOTSYS/etc/proof/utils/proofinstall.sh

PROOF and SVN
• PROOF development branch
   • http://root.cern.ch/svn/root/branches/dev/proof
   • Synchronized daily with the main trunk
• PROOF tags
   • http://root.cern.ch/svn/root/branches/dev/proof-tags
   • Specific « snapshots » of the dev branch
   • Binaries installed on AFS at /afs/cern.ch/sw/lcg/contrib/proof/root

Questions?
• Credits
   • G.G., J. Iwaszkiewicz, A. Kreshuk, F. Rademakers, L. Tran-Thanh (summer student '07)
   • G. Bruckner, J.F. Grosse-Oetringhaus, M. Meoni, A. Peters (ALICE)
   • F. Furano (CERN)
   • A. Hanushevsky (SLAC)