Status report on SRM v2.2 implementations: results of first stress tests
2nd July 2007
Flavia Donno, CERN IT/GD
Tests
All implementations pass the basic tests: https://twiki.cern.ch/twiki/bin/view/SRMDev
The use-case test family has been enhanced with even more tests:
- CASTOR: Passes all use-cases. Disk1 is implemented by switching off the garbage collector (not gracefully handled by CASTOR). PutDone slow. *Fixed*
- dCache: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. In Tape1Disk0, allocated space decreases when files are migrated to tape. *Fixed*
- DPM: Passes all use-cases. A garbage collector for expired space will be available with the next release of DPM (1.6.5 in certification). *Fixed*
- StoRM: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO.
- BeStMan: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. Some calls are not compliant with the specs as defined during the WLCG Workshop in January 2007 (for instance, requestToken is not always returned).
Details about the implementations' status:
https://twiki.cern.ch/twiki/bin/view/SRMDev/ImplementationsProblems
Minor issues still open:
- dCache: An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_NO_FREE_SPACE at file and request level if the space specified is expired, instead of returning SRM_FAILURE at file level and SRM_SPACE_LIFETIME_EXPIRED at request level, or (if the space token is no longer valid) SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
  An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_FAILURE at file and request level if no space of the requested class is available, instead of returning SRM_NO_FREE_SPACE at file and request level, or SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
  When a method is not supported, the explanation often contains the following string: "handler discovery and dinamic load failedjava.lang.ClassNotFoundException:..."
- StoRM: srmPrepareToPut and srmStatusOfPutRequest return SRM_FAILURE at request level instead of returning SRM_SPACE_LIFETIME_EXPIRED when the space specified in the request is expired and the space token is still available. If the space token is unavailable, SRM_INVALID_REQUEST should be returned.
- BeStMan: Permanent files are not allowed to live in volatile space.
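The expected return codes discussed above can be condensed into a small decision table. The sketch below (the helper name is invented for illustration; real implementations return these codes in their SOAP responses) encodes the request-level and file-level codes that srmPrepareToPut/srmStatusOfPutRequest should return in each situation:

```python
# Expected SRM v2.2 status codes for srmPrepareToPut / srmStatusOfPutRequest
# when the target space is problematic (sketch; helper name is illustrative).

def expected_put_status(space_expired, token_valid, space_available):
    """Return the (request-level, file-level) codes the spec calls for."""
    if space_expired and token_valid:
        # The space lifetime has run out, but the token still exists.
        return ("SRM_SPACE_LIFETIME_EXPIRED", "SRM_FAILURE")
    if not token_valid:
        # The space token itself is no longer valid.
        return ("SRM_INVALID_REQUEST", "SRM_FAILURE")
    if not space_available:
        # No space of the requested class can be allocated.
        return ("SRM_NO_FREE_SPACE", "SRM_NO_FREE_SPACE")
    # Normal case: space is granted and the file slot is ready for the put.
    return ("SRM_SUCCESS", "SRM_SPACE_AVAILABLE")
```

A table like this is also what the test suite checks against when classifying an endpoint's answer as compliant or not.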
Stress tests started on all development endpoints, using 9 client machines. Small server instances are preferred, in order to reach the limits easily.
First goals:
- Understand the limits of the instance under test
- Make sure it does not crash or hang under heavy load
- Make sure that the response time does not degrade to an "unreasonable" level
Further goals:
- Make sure there are no hidden race conditions in the most used calls
- Understand server tuning
- Learn from stress testing
Parallel stress-testing activities are on-going by the EIS team, with GSSD input.
Stress Tests description
GetParallel: This test puts a file (/etc/group) in the SRM default space. It then spawns many threads (the number is configurable statically at each run), each requesting a TURL (a protocol-dependent handle) to access the same file. The test can be driven to use different access protocols in different threads. The polling frequency used to check whether the TURL has been assigned can be fixed, or can increase over time. Polling continues even after the TURL is assigned, to check for changes in status. The test tries to clean up after itself. I plan to introduce, in another test of the same kind, other operations such as Abort while trying to use the aborted TURL.
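As a rough illustration of the GetParallel logic, the sketch below replaces the remote SRM endpoint with a mock in-process client (all class and method names are invented for this example): many threads poll for a TURL to the same file, each thread with its own access protocol.

```python
# Sketch of the GetParallel pattern (mock SRM client; names illustrative).
import threading
import time

class MockSRM:
    """Simulates a server that assigns the TURL after a short delay."""
    def __init__(self, delay=0.05):
        self._ready_at = time.monotonic() + delay

    def status_of_get_request(self, surl, protocol):
        # Before the delay elapses, the request is still being processed.
        if time.monotonic() < self._ready_at:
            return ("SRM_REQUEST_INPROGRESS", None)
        return ("SRM_SUCCESS", f"{protocol}://node{surl}")

def get_turl(srm, surl, protocol, results, idx, poll=0.01):
    # Poll at a fixed frequency until the TURL for the shared file is assigned.
    while True:
        code, turl = srm.status_of_get_request(surl, protocol)
        if code == "SRM_SUCCESS":
            results[idx] = turl
            return
        time.sleep(poll)

srm = MockSRM()
protocols = ["gsiftp", "rfio", "dcap"]        # different protocol per thread
results = [None] * 9                          # one slot per simulated client
threads = [
    threading.Thread(target=get_turl,
                     args=(srm, "/etc/group", protocols[i % 3], results, i))
    for i in range(9)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(results), "every client should eventually obtain a TURL"
```

The real test additionally keeps polling after assignment and cleans the file up; the sketch only shows the fan-out and polling core.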
GetParallelTransf: Same as the previous test, but once the TURL is obtained each thread tries to actually retrieve the file. The test tries to clean up after itself. I plan to introduce another test of the same kind where clients use the TURLs assigned to other clients.
PutGet01: This test simulates many clients putting and getting (small) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
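A minimal sketch of the PutGet01 pattern, assuming a mock in-memory store in place of a real SRM endpoint (all names are illustrative):

```python
# Sketch of PutGet01: many clients put and then get small files in parallel.
from concurrent.futures import ThreadPoolExecutor
import threading

store = {}                  # stands in for the SRM namespace
lock = threading.Lock()

def put_get(i):
    surl = f"/dteam/stress-{i}.txt"
    payload = b"x" * 128                      # a small file
    with lock:                                # srmPrepareToPut + transfer
        store[surl] = payload
    with lock:                                # srmPrepareToGet + transfer,
        data = store.pop(surl)                # ...then clean up after itself
    return data == payload

with ThreadPoolExecutor(max_workers=20) as pool:
    ok = list(pool.map(put_get, range(100)))

assert all(ok) and not store   # every round-trip succeeded; namespace clean
```

The final check mirrors the "tries to clean up after itself" property: after the run, no test files remain.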
PutGetMix: This test simulates many clients putting and getting, at random, small (order of KB) and big (order of MB/GB) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
PutMany/01: This test performs many PrepareToPut requests in parallel; the requests are then also aborted in parallel (same characteristics as the previous tests). The variant PutMany01 only performs the PrepareToPut (without the abort). Better checking of the system response is needed. No file transfer is performed!
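The PutMany pattern, parallel PrepareToPut followed by parallel Abort with no data transfer at all, can be sketched as follows (mock endpoint, invented names):

```python
# Sketch of PutMany: parallel srmPrepareToPut, then parallel srmAbortRequest.
from concurrent.futures import ThreadPoolExecutor
import itertools
import threading

class MockSRM:
    """Mock endpoint: hands out request tokens and aborts them."""
    def __init__(self):
        self._next = itertools.count(1)
        self._open = set()
        self._lock = threading.Lock()

    def prepare_to_put(self, surl):
        with self._lock:
            token = next(self._next)
            self._open.add(token)
        return token              # a requestToken, as the spec requires

    def abort_request(self, token):
        with self._lock:
            self._open.discard(token)
        return "SRM_SUCCESS"

srm = MockSRM()
with ThreadPoolExecutor(max_workers=20) as pool:
    tokens = list(pool.map(srm.prepare_to_put,
                           (f"/dteam/put-{i}" for i in range(200))))
    codes = list(pool.map(srm.abort_request, tokens))

assert len(set(tokens)) == 200    # every request got a distinct token
assert not srm._open              # every request was aborted
```

Note that no payload ever moves; the load is entirely on request handling, which is exactly what this test is probing.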
ReserveSpace: This test does not apply to CASTOR. It simulates many parallel requests to reserve 1 GB of disk space.
BringOnline: This test reserves 1 GB of disk space of type Tape1Disk0, fills it with files (122 MB each) and checks the response of the system when the reserved space is full. It checks whether some file is migrated to tape and, if so, requests that the file be staged back to disk.
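The space-accounting part of BringOnline can be illustrated with a toy model. The file and reservation sizes match the ones above; the class and the returned code are only a sketch of the behaviour the test expects once the reservation is exhausted:

```python
# Toy accounting for a 1 GB Tape1Disk0 reservation filled with 122 MB files.
GB, MB = 1024**3, 1024**2

class MockSpace:
    """Tracks the free bytes left in a reserved space."""
    def __init__(self, size):
        self.free = size

    def put_file(self, size):
        if size > self.free:
            return "SRM_NO_FREE_SPACE"   # expected once the reservation is full
        self.free -= size
        return "SRM_SUCCESS"

space = MockSpace(1 * GB)                # the 1 GB reservation
codes = [space.put_file(122 * MB) for _ in range(10)]

# Eight 122 MB files fit in 1 GB (976 MB used); the ninth does not.
assert codes.count("SRM_SUCCESS") == 8
assert codes[8] == codes[9] == "SRM_NO_FREE_SPACE"
```

In the real test, a Tape1Disk0 system may instead free disk space by migrating files to tape, which is why the test then checks for migration and requests a stage-back.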
Stress Tests presentation: http://lxdev25.cern.ch/s2farm/results/final/history/
Under the date there is one directory per run.
DPM
Each number corresponds to a node. The nodes where failures occur have bold/italic numbers.
9 client machines
BeStMan
Small instances are preferred for stress-testing. In this case the failure happened on the client side (because of S2, each client cannot run more than 100 threads).
dCache
StoRM
The system is not yet dropping requests. The response time degrades with load.
StoRM
With 60 threads the system drops requests. The system slows down (it takes more time to complete a test). However, the server recovers nicely after the crisis.
CASTOR
srmStatusOfGetRequest srm://lxb6033.cern.ch:8443 requestToken=54549
  SURL[srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt]
Returns:
  sourceSURL0=srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt
  returnStatus.explanation0="PrepareToGet failed: Bad address"
  returnStatus.statusCode0=SRM_FAILURE
  returnStatus.explanation="No subrequests succeeded"
  returnStatus.statusCode=SRM_FAILURE
A race condition?
Slow PutDone cured! Test completed in < 3 minutes.
CASTOR
The server responds well under load. Requests get dropped, but the response time is still good.
Summary of First Preliminary Results
- CASTOR: Race conditions found; working with the developers to address the problems. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- dCache: Authorization module crash. Server very slow or unresponsive (max heap size reached; a restart cures the problem). Working with the developers to address the problems.
- DPM: No failures. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- StoRM: Response time degrades with load. The system might become unresponsive; however, it recovers after the crisis. Working with the developers to address the problems.
- BeStMan: Server unresponsive under heavy load. It does not resume operations when the load decreases. Working with the developers to address the problems.
More analysis is needed in order to draw conclusions.
Stress-test client improvements
- The green/red presentation is probably not adequate. What does red mean? How can we make it easy for the developers to diagnose a problem? What happens when we increase the number of client nodes? I AM STILL PLAYING WITH THE PRESENTATION PAGE!! PLEASE DO NOT TAKE THE RED BOXES AS CORRECT!!!
- Improve the management of the test suite itself: to efficiently stop/start/abort/restart, to easily diagnose client problems, etc.
- How can we monitor improvements? Reproduce the race-condition problems. It is important to stress-test one system at a time, and to register degradation of performance.
- Extend the test suite with more use-cases. Experiments' input is very much appreciated.
- External system monitoring is needed.
Plans
- Continue stress-testing of the development endpoints for as long as the developers/sites allow. Coordinate with the other testers: in order to understand what happens, it is better to have dedicated machines.
- Publish the results: as done for the basic and use-case tests, publish a summary of the status of the implementations to help the developers react, and as a reference for sites and experiments.
- Report monthly at the MB. Follow up possible problems at the deployment sites.
- What else?