L. Arrabito, D. Bouvet, X. Canehan, P. Girard, Y. Perret, S. Poulat, R. Rumler

SW distribution tests at Lyon

Pierre Girard
Luisa Arrabito, David Bouvet
Yannick Perret, Xavier Canehan
Suzanne Poulat, Rolf Rumler

Initially presented at Jamboree LHCb

March 7th, 2011
Updated on March 25th for ATLAS CAF

Content

• AFS Latency story
  – AFS latency problem schedule
  – LHCb SetupProject tests
  – Preliminary conclusions

• xxx-FS client stress tests
  – Test suite results
  – CVMFS tests results

• Conclusions

AFS latency problem schedule

Timeline (months): May, June, July, Aug., Sept., Oct., Nov., Dec., Jan. 2011

Dated events
• 17/05: SetupProject.sh timeout (5%)
• 17/06: SetupProject.sh timeout (0.5%)
• 08/07: New LHCb setup timeout problems
• 06/09: CCIN2P3 adding new HW (24 logical cores): 110 WNs
• 03/11: ATLAS timeout problems (50% failures)
• 05/11: ATLAS increased its timeout (3600s)
• 26/11: PARTLY SOLVED: LHCb increased its timeout (3600s)
• 07/01: SOLVED: CCIN2P3 reduced the number of job slots on recent HW

AFS Story
• New AFS server version
• RO AFS volumes for LHCb SW area
• New AFS client

SL5 Story
• Many WN crashes due to IO problem
• Many (temporarily) freezing WNs after OS patch
• Stable SL5 WNs after new kernel patch

Tests Story
• LHCb test infrastructure setup for different tunings (AFS, SL5, LRMS)
• Tests of different kernel parameters

LHCb job setup tests

Source: L. Arrabito

AFS Latency / LHCb job efficiency

All-T1s CPU efficiency during the same period

Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html

CPU efficiency (%)

Month     CC      All T1s
Jan 10    39.9    42.7
Feb 10    46.3    58.4
Mar 10    67.1    74.6
Apr 10    80.3    82.4
May 10    84.7    88.2
Jun 10    87.3    88.4
Jul 10    88.5    88.3
Aug 10    88.6    73.5
Sep 10    79.8    79.4
Oct 10    80.7    85.1
Nov 10    88.6    81.7
Dec 10    88.5    89.8
Jan 11    90.9    92.8
Total     86.0    82.5

Visible effect of adding the latest (24-core) WNs?
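
One way to look at this question is to compute, from the figures above, the gap between CC and the all-T1 value for each month and the month-to-month change at CC. The following is a minimal sketch in Python that uses only the values quoted on this slide; the function and variable names are illustrative.

```python
# CPU efficiency (%) copied from the slide, Jan 10 .. Jan 11 (Total excluded).
MONTHS = ["Jan 10", "Feb 10", "Mar 10", "Apr 10", "May 10", "Jun 10", "Jul 10",
          "Aug 10", "Sep 10", "Oct 10", "Nov 10", "Dec 10", "Jan 11"]
CC = [39.9, 46.3, 67.1, 80.3, 84.7, 87.3, 88.5, 88.6, 79.8, 80.7, 88.6, 88.5, 90.9]
ALL_T1 = [42.7, 58.4, 74.6, 82.4, 88.2, 88.4, 88.3, 73.5, 79.4, 85.1, 81.7, 89.8, 92.8]


def gap_to_all_t1():
    """Return {month: CC minus all-T1 efficiency}, in percentage points."""
    return {m: round(c - t, 1) for m, c, t in zip(MONTHS, CC, ALL_T1)}


def monthly_change(values):
    """Return {month: change versus the previous month}, in percentage points."""
    return {MONTHS[i]: round(values[i] - values[i - 1], 1)
            for i in range(1, len(values))}


if __name__ == "__main__":
    gaps = gap_to_all_t1()
    changes = monthly_change(CC)
    for month in MONTHS[1:]:
        print(f"{month}: CC change {changes[month]:+.1f}, "
              f"gap to all T1s {gaps[month]:+.1f}")
```

The largest monthly drop at CC is between Aug 10 and Sep 10 (88.6 to 79.8), the month in which the schedule slide places the arrival of the new 24-core WNs (06/09).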

Preliminary conclusions

• LHCb/ATLAS environment setup is very (too) FS-intensive
  – By stracing SetupProject.sh: 17 868 open(), 110 765 stat()

• Investigate the job distribution strategy to avoid too many similar jobs on the same WN
  – According to the "lhcb-alone" and "atlas-excluded" test results

• The AFS latency problem is now an AFS client scalability problem
  – Temporarily solved by decreasing the number of job slots on the most recent machines, but … is that a major concern in the near future?

• Have the other sites already experienced the same problem?
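
The open() and stat() counts above come from stracing SetupProject.sh. As a rough illustration of how such counts can be obtained, the sketch below tallies system calls in a strace text log. It assumes the default strace line format (an optional PID prefix followed by `name(args) = result`) and a log written with something like `strace -f -o setup.trace ...`; the log file name and the function names are hypothetical, and this is not necessarily the procedure the authors used.

```python
import re
from collections import Counter

# Optional "[pid 1234] " or "1234 " prefix, then the syscall name and "(".
# Assumption: default strace text output, e.g.  open("/afs/x", O_RDONLY) = 3
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_0-9]+)\(")


def count_syscalls(lines):
    """Count system calls by name in an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            # "<... open resumed>" lines do not match, so an interrupted
            # call is only counted once, at its "<unfinished ...>" line.
            counts[match.group(1)] += 1
    return counts


def count_syscalls_in_file(path):
    """Count system calls in a strace log file (hypothetical helper)."""
    with open(path, errors="replace") as trace:
        return count_syscalls(trace)
```

For example, `count_syscalls_in_file("setup.trace")["stat"]` would give the number of stat() calls recorded in that log.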

Test suite conditions

• For each test
  – Walk of the same directory tree (DAVINCI)
  – The same actions are performed in the same order
    – LHCb SetupProject-like: 100 000 stat(), 7 000 open()
    – The first block is read to ensure the open() is effective

• The cache (if any) is pre-loaded by executing the test once beforehand

• Results are averaged over 4 executions
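
The conditions above can be sketched as a small Python harness: a deterministic walk that stat()s each file and reads the first block of the files it opens, run once to warm the cache and then timed over 4 runs. This is an illustration under stated assumptions, not the authors' test suite: the block size, the function names, and the `max_open` cap are all hypothetical.

```python
import os
import time

BLOCK_SIZE = 4096  # assumed size of the "first block"; not stated in the slides


def walk_once(root, max_open=None):
    """Walk `root`, stat() every file and open() + read the first block.

    Returns (n_stat, n_open). `max_open` optionally caps the number of opens.
    """
    n_stat = n_open = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort so that every run performs the same actions in the same order.
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                os.stat(path)
                n_stat += 1
                if max_open is None or n_open < max_open:
                    with open(path, "rb") as handle:
                        handle.read(BLOCK_SIZE)  # make the open() effective
                    n_open += 1
            except OSError:
                continue  # unreadable or dangling entry: skip it
    return n_stat, n_open


def timed_runs(root, runs=4, preload=True, max_open=None):
    """Return the mean wall-clock time in seconds over `runs` walks."""
    if preload:
        walk_once(root, max_open)  # warm the client cache, result discarded
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        walk_once(root, max_open)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

Running several copies of `timed_runs` concurrently on one WN, against the same mounted software area, would approximate the client stress scenario discussed in this talk.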

FS Test Results

CVMFS test suite conditions

• Dedicated SQUID
  – Used by the tested WN only
  – With pre-loaded LHCb cache

• On CVMFS client (0.2.53-1), before each test
  – Cache was removed
  – Service was restarted

• Different cache sizes
  – "ls -lR" on sibling directories to grow the cache
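
The cache-growing step can be sketched in Python as a recursive listing of sibling directories that stops once the local cache reaches a target size. This is only an illustration: `cache_dir`, `sibling_dirs`, `target_bytes`, and the function names are hypothetical, and removing the cache and restarting the client service (done before each test in the slides) is left out because the exact commands are not given.

```python
import os


def list_recursive(root):
    """Rough equivalent of `ls -lR root`: lstat() every entry below `root`.

    Returns the number of entries visited.
    """
    visited = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                visited += 1
            except OSError:
                pass  # entry vanished or is unreadable: ignore it
    return visited


def tree_size(path):
    """Total apparent size in bytes of the files below `path`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass
    return total


def grow_cache(sibling_dirs, cache_dir, target_bytes):
    """List sibling directories one by one until the cache is large enough.

    Returns the cache size reached, which may stay below `target_bytes`
    if the sibling directories run out first.
    """
    size = tree_size(cache_dir)
    for directory in sibling_dirs:
        if size >= target_bytes:
            break
        list_recursive(directory)
        size = tree_size(cache_dir)
    return size
```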

CVMFS Cache Size Tests Results

CVMFS Min/Max results

Comparison with the latest CVMFS (0.2.61)

NEW (2011/03/25)

Conclusions

• LHCb/ATLAS job setup should be optimized

• Multi-VO sites should try to implement a fair distribution of VO jobs over the cluster WNs
  – Restrict the number of similar jobs on a WN

• Issue of shared FS client scalability (most likely)
  – Checked with AFS, NFS4.0, and CVMFS
  – Tests must go on
    – NFS4.1 (pNFS) still to be tested
    – With other HW (for now, only PowerEdge C6100)
    – By virtualizing the WNs ("divide and rule" principle)
      – A first attempt split a 24-core WN into 2 x (12-core VM-WN)
      – Must be further investigated

• CVMFS
  – Interesting for VO SW distribution (without installation job)
  – But take care: latency increases with cache size
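
The idea of restricting the number of similar jobs on a WN can be illustrated with a toy placement rule in Python. This is a sketch of the principle only, not the configuration of any real LRMS: the function name, the `slots_per_node` and `max_same_vo` parameters, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def place_job(vo, nodes, slots_per_node, max_same_vo):
    """Pick a WN for a job of `vo`, or return None if no WN qualifies.

    `nodes` maps a WN name to the list of VO names of its running jobs.
    A WN qualifies if it has a free slot and runs fewer than `max_same_vo`
    jobs of the same VO. Among qualifying WNs, the one with the fewest jobs
    of that VO wins (ties broken by total load, then by name).
    The chosen WN's job list is updated in place.
    """
    candidates = []
    for name, running in nodes.items():
        same = Counter(running)[vo]
        if len(running) < slots_per_node and same < max_same_vo:
            candidates.append((same, len(running), name))
    if not candidates:
        return None
    chosen = min(candidates)[2]
    nodes[chosen].append(vo)
    return chosen
```

With `nodes = {"wn1": ["lhcb", "lhcb"], "wn2": ["atlas"]}`, 4 slots per WN and a cap of 2, a new "lhcb" job goes to "wn2", because "wn1" already runs 2 jobs of that VO.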

Questions & Comments

Other T1s CPU Efficiency
From January 2010 to January 2011

Plots shown for: CERN, KIT, PIC, CNAF, NL-T1, RAL

Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html

Virtualized WN: divide and rule?
