Status report on SRM v2.2 implementations: results of first stress tests
2nd July 2007
Flavia Donno, CERN IT/GD
Tests
All implementations pass the basic tests: https://twiki.cern.ch/twiki/bin/view/SRMDev
The use-case test family has been enhanced with even more tests:
- CASTOR: Passes all use-cases. Disk1 is implemented by switching off the garbage collector (not gracefully handled by CASTOR). PutDone slow. *Fixed*
- dCache: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. In Tape1Disk0, allocated space decreases when files are migrated to tape. *Fixed*
- DPM: Passes all use-cases. A garbage collector for expired space will be available with the next release of DPM (1.6.5 in certification). *Fixed*
- StoRM: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO.
- BeStMan: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. Some calls are not compliant with the specs as defined during the WLCG Workshop in January 2007 (for instance, requestToken is not always returned).
Details about the implementations' status:
https://twiki.cern.ch/twiki/bin/view/SRMDev/ImplementationsProblems
Minor issues still open:
- dCache: An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_NO_FREE_SPACE at file and request level if the space specified is expired, instead of returning SRM_FAILURE at file level and SRM_SPACE_LIFETIME_EXPIRED at request level, or (if the space token is no longer valid) SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
  An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_FAILURE at file and request level if no space of the requested class is available, instead of returning SRM_NO_FREE_SPACE at file and request level, or SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
  When a method is not supported, the explanation often contains the following string: "handler discovery and dinamic load failedjava.lang.ClassNotFoundException:..."
- StoRM: srmPrepareToPut and srmStatusOfPutRequest return SRM_FAILURE at request level instead of returning SRM_SPACE_LIFETIME_EXPIRED when the space specified in the request is expired and the space token is still available. If the space token is unavailable, SRM_INVALID_REQUEST should be returned.
- BeStMan: Permanent files are not allowed to live in volatile space.
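The expected return codes discussed above can be condensed into a small decision table. The sketch below (the helper name is invented for illustration; real implementations return these codes in their SOAP responses) encodes the request-level and file-level codes that srmPrepareToPut/srmStatusOfPutRequest should return in each situation:

```python
# Expected SRM v2.2 status codes for srmPrepareToPut / srmStatusOfPutRequest
# when the target space is problematic (sketch; helper name is illustrative).

def expected_put_status(space_expired, token_valid, space_available):
    """Return the (request-level, file-level) codes the spec calls for."""
    if space_expired and token_valid:
        # The space lifetime has run out, but the token still exists.
        return ("SRM_SPACE_LIFETIME_EXPIRED", "SRM_FAILURE")
    if not token_valid:
        # The space token itself is no longer valid.
        return ("SRM_INVALID_REQUEST", "SRM_FAILURE")
    if not space_available:
        # No space of the requested class can be allocated.
        return ("SRM_NO_FREE_SPACE", "SRM_NO_FREE_SPACE")
    # Normal case: space is granted and the file slot is ready for the put.
    return ("SRM_SUCCESS", "SRM_SPACE_AVAILABLE")
```

A table like this is also what the test suite checks against when classifying an endpoint's answer as compliant or not.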
Stress tests started on all development endpoints, using 9 client machines. Small server instances are preferred, in order to reach the limits easily.
First goals:
- Understand the limits of the instance under test
- Make sure it does not crash or hang under heavy load
- Make sure that the response time does not degrade to an "unreasonable" level
Further goals:
- Make sure there are no hidden race conditions in the most used calls
- Understand server tuning
- Learn from stress testing
Parallel stress-testing activities are on-going by the EIS team, with GSSD input.
Stress Tests description
GetParallel: This test puts a file (/etc/group) in the SRM default space. It then spawns many threads (the number is configurable statically at each run), each requesting a TURL (a protocol-dependent handle) to access the same file. The test can be driven to use different access protocols in different threads. The polling frequency used to check whether the TURL has been assigned can be fixed, or can increase over time. Polling continues even after the TURL is assigned, to check for changes in status. The test tries to clean up after itself. I plan to introduce, in another test of the same kind, other operations such as Abort while trying to use the aborted TURL.
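As a rough illustration of the GetParallel logic, the sketch below replaces the remote SRM endpoint with a mock in-process client (all class and method names are invented for this example): many threads poll for a TURL to the same file, each thread with its own access protocol.

```python
# Sketch of the GetParallel pattern (mock SRM client; names illustrative).
import threading
import time

class MockSRM:
    """Simulates a server that assigns the TURL after a short delay."""
    def __init__(self, delay=0.05):
        self._ready_at = time.monotonic() + delay

    def status_of_get_request(self, surl, protocol):
        # Before the delay elapses, the request is still being processed.
        if time.monotonic() < self._ready_at:
            return ("SRM_REQUEST_INPROGRESS", None)
        return ("SRM_SUCCESS", f"{protocol}://node{surl}")

def get_turl(srm, surl, protocol, results, idx, poll=0.01):
    # Poll at a fixed frequency until the TURL for the shared file is assigned.
    while True:
        code, turl = srm.status_of_get_request(surl, protocol)
        if code == "SRM_SUCCESS":
            results[idx] = turl
            return
        time.sleep(poll)

srm = MockSRM()
protocols = ["gsiftp", "rfio", "dcap"]        # different protocol per thread
results = [None] * 9                          # one slot per simulated client
threads = [
    threading.Thread(target=get_turl,
                     args=(srm, "/etc/group", protocols[i % 3], results, i))
    for i in range(9)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(results), "every client should eventually obtain a TURL"
```

The real test additionally keeps polling after assignment and cleans the file up; the sketch only shows the fan-out and polling core.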
GetParallelTransf: Same as the previous test, but once the TURL is obtained each thread tries to actually retrieve the file. The test tries to clean up after itself. I plan to introduce another test of the same kind where clients use the TURLs assigned to other clients.
PutGet01: This test simulates many clients putting and getting (small) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
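A minimal sketch of the PutGet01 pattern, assuming a mock in-memory store in place of a real SRM endpoint (all names are illustrative):

```python
# Sketch of PutGet01: many clients put and then get small files in parallel.
from concurrent.futures import ThreadPoolExecutor
import threading

store = {}                  # stands in for the SRM namespace
lock = threading.Lock()

def put_get(i):
    surl = f"/dteam/stress-{i}.txt"
    payload = b"x" * 128                      # a small file
    with lock:                                # srmPrepareToPut + transfer
        store[surl] = payload
    with lock:                                # srmPrepareToGet + transfer,
        data = store.pop(surl)                # ...then clean up after itself
    return data == payload

with ThreadPoolExecutor(max_workers=20) as pool:
    ok = list(pool.map(put_get, range(100)))

assert all(ok) and not store   # every round-trip succeeded; namespace clean
```

The final check mirrors the "tries to clean up after itself" property: after the run, no test files remain.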
PutGetMix: This test simulates many clients putting and getting, at random, small (order of KB) and big (order of MB/GB) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
PutMany/01: This test performs many PrepareToPut requests in parallel; the requests are then also aborted in parallel (same characteristics as the previous tests). The variant PutMany01 only performs the PrepareToPut (without the abort). Better checking of the system response is needed. No file transfer is performed!
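The PutMany pattern, parallel PrepareToPut followed by parallel Abort with no data transfer at all, can be sketched as follows (mock endpoint, invented names):

```python
# Sketch of PutMany: parallel srmPrepareToPut, then parallel srmAbortRequest.
from concurrent.futures import ThreadPoolExecutor
import itertools
import threading

class MockSRM:
    """Mock endpoint: hands out request tokens and aborts them."""
    def __init__(self):
        self._next = itertools.count(1)
        self._open = set()
        self._lock = threading.Lock()

    def prepare_to_put(self, surl):
        with self._lock:
            token = next(self._next)
            self._open.add(token)
        return token              # a requestToken, as the spec requires

    def abort_request(self, token):
        with self._lock:
            self._open.discard(token)
        return "SRM_SUCCESS"

srm = MockSRM()
with ThreadPoolExecutor(max_workers=20) as pool:
    tokens = list(pool.map(srm.prepare_to_put,
                           (f"/dteam/put-{i}" for i in range(200))))
    codes = list(pool.map(srm.abort_request, tokens))

assert len(set(tokens)) == 200    # every request got a distinct token
assert not srm._open              # every request was aborted
```

Note that no payload ever moves; the load is entirely on request handling, which is exactly what this test is probing.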
ReserveSpace: This test does not apply to CASTOR. It simulates many parallel requests to reserve 1 GB of disk space.
BringOnline: This test reserves 1 GB of disk space of type Tape1Disk0, fills it with files (122 MB each) and checks the response of the system when the reserved space is full. It checks whether some file is migrated to tape and, if so, requests that the file be staged back to disk.
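The space-accounting part of BringOnline can be illustrated with a toy model. The file and reservation sizes match the ones above; the class and the returned code are only a sketch of the behaviour the test expects once the reservation is exhausted:

```python
# Toy accounting for a 1 GB Tape1Disk0 reservation filled with 122 MB files.
GB, MB = 1024**3, 1024**2

class MockSpace:
    """Tracks the free bytes left in a reserved space."""
    def __init__(self, size):
        self.free = size

    def put_file(self, size):
        if size > self.free:
            return "SRM_NO_FREE_SPACE"   # expected once the reservation is full
        self.free -= size
        return "SRM_SUCCESS"

space = MockSpace(1 * GB)                # the 1 GB reservation
codes = [space.put_file(122 * MB) for _ in range(10)]

# Eight 122 MB files fit in 1 GB (976 MB used); the ninth does not.
assert codes.count("SRM_SUCCESS") == 8
assert codes[8] == codes[9] == "SRM_NO_FREE_SPACE"
```

In the real test, a Tape1Disk0 system may instead free disk space by migrating files to tape, which is why the test then checks for migration and requests a stage-back.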
Stress Tests presentation: http://lxdev25.cern.ch/s2farm/results/final/history/
Under the date there is one directory per run.
DPM
Each number corresponds to a node. The nodes where failures occur have bold/italic numbers.
9 client machines
BeStMan
Small instances are preferred for stress-testing. In this case the failure happened on the client side (because of S2, each client cannot run more than 100 threads).
dCache
StoRM
The system is not yet dropping requests. The response time degrades with load.
StoRM
With 60 threads the system drops requests. The system slows down (it takes more time to complete a test). However, the server recovers nicely after the crisis.
CASTOR
srmStatusOfGetRequest srm://lxb6033.cern.ch:8443 requestToken=54549
  SURL[srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt]
Returns:
  sourceSURL0=srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt
  returnStatus.explanation0="PrepareToGet failed: Bad address"
  returnStatus.statusCode0=SRM_FAILURE
  returnStatus.explanation="No subrequests succeeded"
  returnStatus.statusCode=SRM_FAILURE
A race condition?
Slow PutDone cured! Test completed in < 3 minutes.
CASTOR
The server responds well under load. Requests get dropped, but the response time is still good.
Summary of First Preliminary Results
- CASTOR: Race conditions found; working with the developers to address the problems. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- dCache: Authorization module crash. Server very slow or unresponsive (max heap size reached; a restart cures the problem). Working with the developers to address the problems.
- DPM: No failures. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- StoRM: Response time degrades with load. The system might become unresponsive; however, it recovers after the crisis. Working with the developers to address the problems.
- BeStMan: Server unresponsive under heavy load. It does not resume operations when the load decreases. Working with the developers to address the problems.
More analysis is needed in order to draw conclusions.
Stress-test client improvements
- The green/red presentation is probably not adequate. What does red mean? How can we make it easy for the developers to diagnose a problem? What happens when we increase the number of client nodes? I AM STILL PLAYING WITH THE PRESENTATION PAGE!! PLEASE DO NOT TAKE THE RED BOXES AS CORRECT!!!
- Improve the management of the test suite itself: to efficiently stop/start/abort/restart, to easily diagnose client problems, etc.
- How can we monitor improvements? Reproduce the race-condition problems. It is important to stress-test one system at a time, and to register degradation of performance.
- Extend the test suite with more use-cases. Experiments' input is very much appreciated.
- External system monitoring is needed.
Plans
- Continue stress-testing of the development endpoints for as long as the developers/sites allow. Coordinate with the other testers: in order to understand what happens, it is better to have dedicated machines.
- Publish the results: as done for the basic and use-case tests, publish a summary of the status of the implementations to help the developers react, and as a reference for sites and experiments.
- Report monthly at the MB. Follow up possible problems at the deployment sites.
- What else?