Recent PROOF developments
G. Ganis
CAF meeting, ALICE offline week, 11 October 2007
11/10/2007 G. Ganis, CAF, Alice offline week

PROOF at the CAF
• CAF: first / main PROOF testbed in LHC environment
• Understand
  • Problems
    • Instabilities and error recovery
    • Performance (end-of-query tails)
  • Missing / improvable functionality
    • Handling of input data, additional software
    • Generic task processing
    • Quota (data/resources) control
    • Handling of big outputs
    • Diagnostics tools (memory usage)
  • Multi-user behaviour
    • Fair-sharing of resources

Outline
• User interface developments
  • Improvements
    • Packetizer
    • Software handling
    • Dataset handling
  • New features
    • Non-data driven processing
    • Output file merging
    • Memory monitoring
• Resource control developments
  • Fair share based on experiment policy
  • Central scheduler

Packetizer improvements
• Packetizer's goal: optimize work distribution to process queries as fast as possible
• Standard TPacketizer's strategy
  • First process local files, then try to process remote data
  • End-of-query bottleneck
[Figure: active workers vs. processing time (J. Iwaszkiewicz)]

New strategy: TPacketizerAdaptive
• Predict processing time of local files for each worker
• Keep assigning remote files from start of the query to workers expected to finish faster
• Processing time improved by up to 50%
• Default since Jun 5th
[Figure: processing rate for all packets, remote packets marked; NEW vs. OLD packetizer, same scale]
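
The adaptive idea (predict when each worker will run out of local files, and give remote files to the worker expected to finish first) can be sketched in a few lines. This is only an illustration under stated assumptions: `WorkerEstimate`, `PredictedLocalTime` and `PickWorkerForRemoteFile` are hypothetical names, the prediction is a plain "events left divided by measured rate", and none of this is taken from the actual TPacketizerAdaptive code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-worker bookkeeping (not the TPacketizerAdaptive data model).
struct WorkerEstimate {
   double localEventsLeft;  // events still to be processed from local files
   double rate;             // measured processing rate in events/s, assumed > 0
};

// Predicted time (s) for a worker to finish its remaining local files.
double PredictedLocalTime(const WorkerEstimate &w) {
   return w.localEventsLeft / w.rate;
}

// Index of the worker expected to finish its local work first, i.e. the
// candidate to receive a remote file now rather than at the end of the query.
// Returns workers.size() when the list is empty.
std::size_t PickWorkerForRemoteFile(const std::vector<WorkerEstimate> &workers) {
   std::size_t best = workers.size();
   double bestTime = 0.;
   for (std::size_t i = 0; i < workers.size(); ++i) {
      const double t = PredictedLocalTime(workers[i]);
      if (best == workers.size() || t < bestTime) {
         best = i;
         bestTime = t;
      }
   }
   return best;
}
```

Assigning remote work early to the workers with the shortest predicted local time is what removes the end-of-query tail in this simplified picture: no worker sits idle while others still hold unprocessed local files.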

Software handling
• Package enabling
  • Separated behaviour client / cluster
  • Real-time feedback during build
  • Soon: package versioning (e.g. ESD-v1.12.103-new)
• Load mechanism extended to single class / macro
• Selectors / macros / classes binaries are now cached
  • Decreases initialization time
• API to modify include / library paths on the workers
  • Use packages globally available on the cluster
• Improved version check for binaries
  • Based also on SVN revision
    root[] TProof *proof = TProof::Open("master")
    root[] proof->Load("MyClass.C")

Dataset manager
• Metadata about a set of files stored in the sandbox on the master, in a dedicated subdirectory <DatasetDir>/group/user/dataset or <SandBox>/dataset
• Datasets are identified by name
• Datasets can be processed by name
  • No need to locally create the chain (CreateESDchain)
    root[0] TProof *proof = TProof::Open("master");
    root[1] TFileCollection *fc = new TFileCollection("dummy");
    root[2] fc->AddFromFile("ESD5000_5029.txt")
    root[3] proof->CreateDataSet("ESD5000_5029", fc->GetList());
    root[4] proof->ShowDataSets();
    Existing Datasets:
    ESD5000_5029

    root[] proof->Process("ESD5000_5029", "MySelector.C+");
J. Iwaszkiewicz + G. Bruckner (more in Gerhard's talk)

Non-data-driven analysis
• Implement algorithm in a TSelector
  [Diagram: selector phases along the time axis]
  • Begin(): create histos, …; define output list
  • Process(): analysis, called 1…N times
  • Terminate(): final analysis (fitting, …) on the output list
• New TProof::Process(const char *selector, Long64_t times)
    // Open the PROOF session
    root[0] TProof *p = TProof::Open("master")
    // Run 1000 times the analysis defined in the
    // MonteCarlo.C TSelector
    root[1] p->Process("MonteCarlo.C+", 1000)
L. Tran-Thanh
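
To make the cycle-driven flow concrete, the sketch below mimics the Begin / Process (N times) / Terminate sequence in plain C++. It is a stand-in only: `CycleSelector`, `CountingSelector` and `RunCycles` are hypothetical names and do not reproduce the ROOT TSelector or TProof interfaces; in PROOF the cycles would additionally be split into packets and distributed over the workers.

```cpp
#include <cassert>

// Stand-in for a selector-like class (hypothetical, not the ROOT TSelector API).
class CycleSelector {
public:
   virtual ~CycleSelector() {}
   virtual void Begin() {}                      // create histos, define outputs
   virtual bool Process(long long cycle) = 0;   // one unit of analysis
   virtual void Terminate() {}                  // final analysis
};

// Toy selector: accumulates the cycle indices it was given.
class CountingSelector : public CycleSelector {
public:
   long long fSum = 0;
   bool fTerminated = false;
   void Begin() override { fSum = 0; fTerminated = false; }
   bool Process(long long cycle) override { fSum += cycle; return true; }
   void Terminate() override { fTerminated = true; }
};

// Call Begin() once, Process() up to 'times' times, then Terminate() once.
// Returns the number of cycles that were processed successfully.
long long RunCycles(CycleSelector &sel, long long times) {
   sel.Begin();
   long long done = 0;
   for (long long i = 0; i < times; ++i) {
      if (!sel.Process(i)) break;  // stop at the first failing cycle
      ++done;
   }
   sel.Terminate();
   return done;
}
```

Here no input dataset is involved at all: the only "data" handed to Process() is the cycle index, which is the essence of non-data-driven processing.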

Non-data-driven analysis
• New packetizer TPacketizerUnit
  • Time-based packet sizes
  • Processing speed of each worker measured dynamically
• Included in ROOT 5.17/04
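
A time-based packet size can be pictured as "measured speed times target duration". The sketch below assumes exactly that rule; `NextPacketSize`, its parameters, and the choice of a minimum of one cycle per packet are illustrative assumptions, not the TPacketizerUnit implementation.

```cpp
#include <algorithm>
#include <cassert>

// Number of cycles to hand to a worker so that the packet takes roughly
// targetSeconds, given the worker's currently measured speed (cycles/s).
// Never returns more than the cycles left, and at least 1 while work remains.
long long NextPacketSize(double cyclesPerSecond, double targetSeconds,
                         long long cyclesLeft) {
   if (cyclesLeft <= 0) return 0;
   long long n = static_cast<long long>(cyclesPerSecond * targetSeconds);
   n = std::max(n, 1LL);            // slow workers still make progress
   return std::min(n, cyclesLeft);  // do not overshoot the end of the query
}
```

With this rule a worker measured at 50 cycles/s and a 2 s target receives 100 cycles, while a slower worker receives proportionally fewer, so all packets take about the same wall-clock time.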

Output file merging
• Address the case of large output objects (e.g. trees) which create memory problems
• Idea: save them in files on the workers and merge them using TFileMerger
• New class TProofFile defines the file and provides tools to handle the merging
  • Unique file names are created internally to avoid crashes
• Merging will happen on the master at the end of the query
• Final file is left in the sandbox on the master or saved where the client wishes
• Included in ROOT 5.17/04
L. Tran-Thanh

Output file merging: example
    void PythiaMC::SlaveBegin(TTree *) {
       // Meta file object: to be added to the output list
       fProofFile = new TProofFile();
       fOutput->Add(fProofFile);
       // Output filename (any format understood by TFile::Open)
       TNamed *outf = (TNamed *) fInput->FindObject("PROOF_OUTPUTFILE");
       if (outf) fProofFile->SetOutputFileName(outf->GetTitle());
       // Open the file with a unique name
       fFile = fProofFile->OpenFile("RECREATE");
       // Create the tree and attach it to the file
       fTree = new TTree(…);
       fTree->SetDirectory(fFile);
       …
    }
    Bool_t PythiaMC::Process(Long64_t entry) {
       // Fill the tree attached to the worker-local file
       fTree->Fill();
       return kTRUE;
    }
    void PythiaMC::SlaveTerminate() {
       if (fFile) {
          fFile->cd();
          // Write here big objects
          fTree->Write();
          fFile->Close();
       }
    }

Memory consumption monitoring
• Normal level
  • Workers monitor their memory usage and save info in the log file
  • Client gets warned of high usage
    • The session may eventually be killed
  • New button in the progress dialog box to display the evolution of memory usage per node
• Advanced level
  • Possibility to save very detailed information in a dedicated tree (TProofStats), e.g. interface to Marian Ivanov's memstat tool
  • To be run as second pass when a problem shows up
• Coming soon
A. Kreshuk

Scheduling multi-users
• Fair resource sharing
  • System scheduler not enough if N_users >= ~ N_workers / 2
• Enforce priority policies
• Two levels
  • Quota-based worker level load balancing
    • Based on group quotas
  • Central level (scheduler)
    • Per-query decisions based on cluster load, resources needed by the query, user history and priorities
    • Generic interface to external schedulers

Quota-based worker level load balancing
• Based on group priority information defined in dedicated files or communicated by masters
• Two technologies
  • Slow down requests for new packets to match the quotas
    • Worker sleeps before asking for the next packet
    • PROS: quantitatively correct
    • CONS: large fluctuations if packet sizes are large and variable; requires round-robin system scheduling; acts only on CPU
  • "renice" low priority sessions
    • Priority = 20 - nice (-20 <= nice <= 19)
    • Limit max priority to avoid over-killing the system
    • PROS: independent of packet size; controls all resources
    • CONS: quantitatively more difficult to control
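
The arithmetic behind the two techniques can be sketched as follows. The relation Priority = 20 - nice comes from the slide; everything else is an assumption made for illustration: the function names, the rule that a worker should be busy for a fraction of wall-clock time equal to its group quota, and the lower bound `kMinQuota` used to keep the sleep time finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical lower bound on the quota fraction, to keep sleep times finite.
const double kMinQuota = 0.01;

// Technique 1: sleep before requesting the next packet.
// After a packet that kept the worker busy for busySeconds, return the sleep
// time that makes busy / (busy + sleep) equal to quotaFraction.
double SleepAfterPacket(double busySeconds, double quotaFraction) {
   const double q = std::min(1.0, std::max(quotaFraction, kMinQuota));
   return busySeconds * (1.0 - q) / q;
}

// Technique 2: renice the session.
// Invert Priority = 20 - nice, with the priority limited to
// [1, min(maxPriority, 40)] so that nice stays within [-20, 19].
int NiceFromPriority(int priority, int maxPriority) {
   const int upper = std::min(maxPriority, 40);
   const int p = std::max(1, std::min(priority, upper));
   return 20 - p;
}
```

The sleep time grows with the length of the packet just processed, which illustrates the drawback quoted above: with large, variable packets the pauses become long and irregular, whereas the nice value does not depend on packet size at all.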

Resource quotas based on experiment policy
• Feedback mechanism
  • At the end of each query the amount of resources used is reported to MonALISA per user/group
  • This information is used to calculate effective group priorities based on target priorities (see Marco's talk)
  • PROOF masters broadcast the effective group priorities to their workers
• The central scheduler will use the effective priorities to determine which workers to assign to a user

Central scheduling
• Entity running on master XPD, loaded as plug-in
  • Abstract interface XrdProofSched defined
• Input:
  • Query info (via XrdProofServProxy -> proofserv)
  • Cluster status and past usage (e.g. from ML)
  • Policy
• Output:
  • List of workers to continue with
    class XrdProofSched {
       …
    public:
       virtual int GetWorkers(XrdProofServProxy *xps,
                              std::list<XrdProofWorker *> &wrks) = 0;
       …
    };

Central scheduling
[Schematic view. Elements shown: Client, Master, TProof, TProofPlayer (session), TPacketizer (query), XPD, Scheduler, Dataset Lookup, "Load, history, …"]

Central scheduling status
• Basic version in place (but not always enabled)
  • Selection of a subset of workers based on
    • Round-robin, random, load (# of sessions)
• Version using the ML information to choose the best set of workers for a given user under test

Coming versions at CAF
• Later this week
  • Non-data driven processing
  • Output file merging
  • Fair share based on experiment policy
• Next (end of October)
  • Memory monitoring
  • Improved dataset handling

PROOF and SVN
• PROOF development branch
  • http://root.cern.ch/svn/root/branches/dev/proof
  • Synchronized daily with the main trunk
• Versions installed on CAF correspond to a revision on the dev branch
  • vPROOFDEV_r20285

Questions?
Credits
• B. Bellenot, G.G., J. Iwaszkiewicz, A. Kreshuk, F. Rademakers, L. Tran-Thanh (summer student '07)
• G. Bruckner, M. Meoni, J.F. Grosse-Oetringhaus, A. Peters (ALICE)
• A. Hanushevsky (SLAC)