PROOF developments

Page 1: PROOF developments

PROOF developments

G. Ganis

CAF meeting, ALICE offline week, 11 July 2008

Page 2: PROOF developments

11/7/2008 G. Ganis, CAF, Alice offline week

Overview

- Recent / current developments focus mostly on
  - Solving instabilities and improving error recovery
  - Improving the user interface
  - Resource control in multi-user environments
- CAF is one of the main sources of feedback to
  - Understand problems
  - Spot missing functionality

Page 3: PROOF developments

Today’s Subjects

- Stability issue
  - New XrdProofd plug-in
  - Related issues
- New Log box
- Monitoring of the memory consumption
- Dataset management
- Scheduling developments

Page 4: PROOF developments

New XrdProofd plug-in (1)

- Addresses stability issues typically observed after a failure and the attempt to reset the session
- We traced these back to deadlock situations caused by concurrent actions that were not well protected
- The new plug-in implements a redesigned interaction between components, significantly reducing locks
- The changes for the user are minimal
  - But the level of asynchronism introduced may confuse people looking at the process tables, as the processes are cleaned up with some delay

Page 5: PROOF developments

New XrdProofd plug-in (2)

- New features
  - Resilience to xrootd failures/glitches
    - Applications attempt to restore the connections for 10 mins
    - Solves the problem of restarting xrootd to change the configuration
  - Directive to define workers in the xrootd config file
    - Get rid of proof.conf
    - Example: on CAF DEV the workers are defined with

        xpd.worker master lxb6043
        xpd.worker worker lxb60[41-42,44]
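
The bracketed host pattern in the worker directive reads as a compact list of host names. Below is a minimal Python sketch of how such a pattern could be expanded. It assumes that `[41-42,44]` means inclusive numeric ranges and single values appended to the prefix. The function name `expand_hosts` is hypothetical, and this is not the xrootd parser.

```python
import re


def expand_hosts(pattern):
    """Expand 'lxb60[41-42,44]' into explicit host names.

    Assumption (not taken from the xrootd sources): the bracket holds
    comma-separated items, each either a single number or an inclusive
    'lo-hi' range, which are appended to the text before the bracket.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: a single host name
    prefix, body, suffix = match.groups()
    hosts = []
    for item in body.split(","):
        lo, _, hi = item.strip().partition("-")
        hi = hi or lo  # a single value is a range of length one
        width = len(lo)  # preserve zero padding such as '07-09'
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


print(expand_hosts("lxb60[41-42,44]"))
```

Under that assumption, the CAF DEV example corresponds to one master (lxb6043) and three workers (lxb6041, lxb6042, lxb6044).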

Page 6: PROOF developments

Related Improvements

- Automatic shutdown of orphaned sessions
  - Get rid of proofserv processes hanging around
- Improved notification in case of a worker death

Page 7: PROOF developments

New Log Dialog box

- Using TProof::Mgr(master)->GetSessionLogs()
- Should work even if the session hangs

A. Kreshuk

Page 8: PROOF developments

Memory usage monitoring

- Worker: RAM vs. events processed
- Master: RAM vs. objects merged
- Should make it easy to spot memory leaks
- Additional analysis with another tool: TMemStat?

A. Kreshuk

Page 9: PROOF developments

Memory consumption monitoring

- Normal level
  - Workers monitor their memory usage and save the info in the log file
  - Client gets warned of high usage
    - The session may eventually be killed
- Advanced level
  - Possibility to save very detailed information in a dedicated tree (TProofStats) (e.g. interface to Marian Ivanov’s memstat tool)
  - To be run as a second pass when a problem shows up
- First version in SVN in the coming days

A. Kreshuk
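
A rough Python sketch of the normal-level behaviour listed above: a worker records its memory usage against the number of events processed, flags high usage, and stops once a hard limit is reached. The thresholds `warn_mb` and `kill_mb`, the function names, and the log format are all hypothetical. The slides give no values, and this is not the PROOF implementation.

```python
def check_memory(rss_mb, warn_mb=1500.0, kill_mb=2000.0):
    """Classify a memory reading; both thresholds are made-up defaults."""
    if rss_mb >= kill_mb:
        return "kill"
    if rss_mb >= warn_mb:
        return "warn"
    return "ok"


def monitor(samples, warn_mb=1500.0, kill_mb=2000.0):
    """Turn (events_processed, rss_mb) samples into log lines.

    Stops at the first sample that reaches the kill threshold, mimicking
    a session that is terminated because of excessive memory usage.
    """
    log = []
    for events, rss_mb in samples:
        action = check_memory(rss_mb, warn_mb, kill_mb)
        log.append(f"events={events} rss={rss_mb:.0f}MB action={action}")
        if action == "kill":
            break
    return log


# A steadily growing footprint, as a memory leak would produce.
for line in monitor([(1000, 800), (2000, 1600), (3000, 2100), (4000, 2500)]):
    print(line)
```

Plotting the logged memory figure against the event count is what makes a leak visible as a steady upward slope.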

Page 10: PROOF developments

Dataset management (1)

- Hot topic for T2/T3
- Dataset: metadata about a set of files
  - TFileCollection: list of TFileInfo
  - TFileInfo
    - UUID, TUrls of the file
    - TFileInfoMeta: one per TTree with name, entries, …
- Datasets are identified by name
- Info may come from different places: catalogs, SQL databases, file systems

JFGO

Page 11: PROOF developments

Dataset manager (2)

- TProofDataSetManager: abstract interface describing the basic functionality
  - RegisterDataSet, GetDataSet, VerifyDataSet, …
  - VerifyDataSet opens the files, i.e. may trigger staging
- TProofDataSetManagerFile: implementation handling information via ROOT files
  - datasetname.root
  - Stored on the master in a dedicated subdirectory <DatasetDir>/group/user/dataset

JFGO
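
To illustrate the group/user/dataset directory layout described above, here is a toy file-backed registry in Python. It is only a sketch under loud assumptions: it stores JSON instead of ROOT files, the class `FileDataSetRegistry` and its methods are hypothetical and merely echo the register/get operations, and the group, user, dataset, and URL values are invented.

```python
import json
import tempfile
from pathlib import Path


class FileDataSetRegistry:
    """Toy stand-in for a file-backed dataset manager (JSON, not ROOT)."""

    def __init__(self, dataset_dir):
        self.root = Path(dataset_dir)

    def _path(self, group, user, name):
        # Mirrors the <DatasetDir>/group/user/dataset layout.
        return self.root / group / user / f"{name}.json"

    def register(self, group, user, name, file_urls):
        """Store the list of file URLs under the dataset name."""
        path = self._path(group, user, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"name": name, "files": list(file_urls)}))
        return path

    def get(self, group, user, name):
        """Return the stored file URLs, or None for an unknown dataset."""
        path = self._path(group, user, name)
        if not path.exists():
            return None
        return json.loads(path.read_text())["files"]


with tempfile.TemporaryDirectory() as tmp:
    registry = FileDataSetRegistry(tmp)
    # Hypothetical group, user, dataset name, and URL.
    registry.register("PWG0", "jdoe", "run1234", ["root://host//data/a.root"])
    print(registry.get("PWG0", "jdoe", "run1234"))
```

Because a dataset is found by its name alone, a query only needs that name, not a list of files.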

Page 12: PROOF developments

Dataset manager (3)

- TProofDataSetManagerFile is what is used on CAF
  - Users can register, scan, get
  - Verify is disallowed (to avoid staging overload)
    - It is run by a dedicated daemon (JFGO)
- Datasets can be processed by name
  - Provides a way to cache the information needed at the validation step, speeding this up considerably
- TProofDataSetManager can also be used locally to organize your datasets or chains
  - No need for a dedicated macro to create the chain (CreateESDchain)

JFGO

Page 13: PROOF developments

Dataset manager (4)

- ATLAS is very interested
  - They are oriented towards a MySQL backend and validity tokens for the dataset
  - Will provide TProofDataSetManagerSQL
- Other issues raised by ATLAS
  - Possibility to use multiple dataset sources, e.g. file-based and SQL-based, concurrently
  - Problem of the datasets in federated clusters (multi-masters), which is challenging on the PROOF side too

JFGO

Page 14: PROOF developments

Scheduling developments

- Control resources and how they are used
- Improving efficiency
  - Assigning to a job those nodes that have the data which needs to be analyzed
- Implementing different scheduling policies
  - e.g. fair share, group priorities & quotas
- Efficient use even in case of congestion

J. Iwaszkiewicz

Page 15: PROOF developments

Scheduling developments (2)

- Assigning a set of workers for a job based on:
  - The dataset location
  - User priority (quota + historical usage)
    - Can be taken from an external source
  - The current load of the cluster
- Create (priority) queues for queries that cannot be started

Page 16: PROOF developments

Scheduling developments (3)

- Implementation exists with:
  - # of workers ≈ relativePriority * nFreeCPUs
  - Assign least loaded workers first
- Missing pieces
  - Dynamic worker setup (advanced prototype exists)
  - Worker nodes auto-registration
  - Improved load monitoring
  - Support for “put-on-hold” submission (prototype)
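
The two rules of the existing implementation can be sketched in a few lines of Python. This is an illustration only, not the PROOF scheduler. The function name `assign_workers`, the load values, the rounding, and the choice to always grant at least one worker are assumptions added for the example.

```python
def assign_workers(relative_priority, n_free_cpus, load_by_worker):
    """Pick workers for a query.

    relative_priority: fraction of the free resources the user may take.
    n_free_cpus:       number of currently free CPUs in the cluster.
    load_by_worker:    mapping worker name -> current load (any unit).
    """
    # Number of workers ~ relativePriority * nFreeCPUs.
    n_workers = round(relative_priority * n_free_cpus)
    # Assumption: grant at least one worker, never more than exist.
    n_workers = max(1, min(n_workers, len(load_by_worker)))
    # Least loaded workers first; the name breaks ties deterministically.
    ranked = sorted(load_by_worker, key=lambda w: (load_by_worker[w], w))
    return ranked[:n_workers]


loads = {"w1": 0.9, "w2": 0.1, "w3": 0.4, "w4": 0.0}
print(assign_workers(0.5, 4, loads))
```

With a relative priority of 0.5 and four free CPUs the query gets two workers, and the two idle ones (w4, then w2) are chosen before the busier nodes.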

Page 17: PROOF developments

Scheduling schema

[Diagram. Components: Client, PROOF master, Dataset Lookup, Scheduler (load, history, policy, …). Labelled steps:]

1: Job{dataset, …}
2: dataset
3: file locations
4: Job info
5: workers
6: workers (start workers)

Page 18: PROOF developments

Other developments

- PROOFLITE
  - Version of PROOF optimized for multicore machines, with workers started directly by the ROOT session (no daemon)
  - Useful to quickly test code in a real PROOF environment
  - Will be used to study I/O issues in multicore
  - Almost ready to go into the trunk
- PROOF / Condor integration
  - Possible ATLAS model for T3 farms not dedicated to PROOF
  - Condor provides a mechanism to give high priority to PROOF queries when required, by suspending/hibernating batch jobs

Page 19: PROOF developments

Questions?

Credits

- G.G., J. Iwaszkiewicz, A. Kreshuk, F. Rademakers
- M. Meoni, J.F. Grosse-Oetringhaus (ALICE)
- F. Furano, A. Peters (CERN/IT)
- A. Hanushevsky (SLAC)

