
The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2

Transcript
Page 1: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2

LHCC Review, November 19-20, 2007

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft

The German Tier 1

LHCC Review, 19/20-nov-2007, stream B, part 2

Holger Marten

Forschungszentrum Karlsruhe GmbH
Institute for Scientific Computing (IWR)

Postfach 3640, D-76021 Karlsruhe

Page 2: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


0. Content

1. GridKa location & organization - skipped - but included in the slides

2. Resources and networks

3. Mass storage & SRM

4. Grid Services

5. Reliability & 24x7 operations

6. Plans for 2008

Page 3: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


2. Resources and networks

Page 4: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Current Resources in Production

             LCG          non-LCG HEP   others
CPU [kSI2k]  1864 (55%)   1270 (37%)    264 (8%)
Disk [TB]     878          443           60
Tape [TB]    1007          585          120

October 2007 accounting (example):

• CPUs provided through fair share
• 1.6 million hours of wall time by 300k jobs on 2514 CPU cores
• 55% LCG, 45% non-LCG HEP
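These accounting figures can be cross-checked with a short calculation. This is only a sanity-check sketch: the 744 hours assume a full 31-day October with no downtime, which the slide does not state.

```python
# Sanity check of the October 2007 accounting figures quoted above.
wall_hours = 1.6e6    # reported wall time in hours
jobs = 300_000        # reported number of jobs
cores = 2514          # CPU cores in production

hours_in_october = 31 * 24                 # 744 h, assuming no downtime
capacity_hours = cores * hours_in_october  # total core-hours on offer

utilization = wall_hours / capacity_hours  # fraction of capacity actually used
avg_job_hours = wall_hours / jobs          # mean wall time per job

print(f"utilization ~ {utilization:.0%}, avg job ~ {avg_job_hours:.1f} h")
```

Under these assumptions the farm ran at roughly 85% utilization, with an average job lasting a bit over five hours.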

Page 5: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Installation of MoU Resources 2007 (from WLCG accounting spreadsheets)

[Chart: installed resources vs. the WLCG milestone]

Page 6: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


GridKa WAN connections

[Diagram: WAN links and internal network]

Page 7: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


GridKa WAN connections

[Diagram: WAN links and internal network, highlighting redundancy and the link to CERN]

Page 8: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


The benefit of network redundancy

April 26, 2007: failure of DFN router of CERN-GridKa OPN

Automatic (!) re-routing through our backup link via CNAF; this was not a test!

Page 9: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Summary of GridKa networks

• LAN

- Full internal redundancy (of one router)

- Additional layer-3 BelWue backup link (to be realized in 2008)

• WAN

- multiple 10 Gbps available to CERN, Tier-1s, Tier-2s

- Sara/Nikhef: will be in production (end of Q4/2007)

- additional CERN-independent Tier-1 transatlantic link(s) would be highly desirable

Page 10: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


3. Mass storage & SRM

Page 11: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


dCache & MSS at GridKa

Long-standing instabilities with the SRM and gridFTP implementation
• reduced availability because SAM critical tests fail; many patches since

Dual effort for complex and labour-intensive software (data management)
• running an unstable dCache SRM in production
• running the next SRM 2.2 release in pre-production
• in the end, SRM 2.2 was tested formally with F. Donno’s S2 test suite, but only to a very limited extent by the experiments

Read-only disk storage (T0D1) is an administrative difficulty
• full disks imply stopping the experiments’ work => experiments ask for “temporary ad-hoc” conversions into T1D1
• no failover or maintenance (reboot) is possible, otherwise jobs will crash

Page 12: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


dCache & MSS at GridKa

Migrated to dCache 1.8 with SRM 2.2 on Nov 6/7
• very fruitful collaboration with dCache/SRM developers in situ
• bug fix for globus-url-copy in combination with space reservation, “on-the-fly” during the migration process
=> many thanks to Timur Perelmutov and Tigran Mkrtchyan for their support

Stability has to be verified during the coming months.

Connection to tape (MSS) is fully functional and scalable for writes
• read tests by the experiments have only started recently
• difficult to estimate the tape resources needed to reach the required read throughput
• a working group with local experiment representatives will provide access patterns, tape classes and recall-optimisation proposals

Page 13: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


4. Grid Services

Page 14: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Installed WLCG middleware services*

#      Service             Remarks
3      Top-level BDII      round robin; supports EGEE region DECH
2      Resource Broker     LCG flavour; gLite WMS to be installed
1      Proxy Server
8      UI                  4x VO-Box, 2x login, 1x gm, 1x admin
4      VO-Boxes            also front-ends for experiment admins
5      3D HEP DBs          2x ATLAS, 2x LHCb (Conditions DB etc.), 1x CMS Squid
1      Site BDII (GIIS)
1      Mon Box             accounting
1      LFC                 MySQL, migrated to 3-node Oracle
3      FTS                 3 DNS load-balanced front-ends; 3-node clustered Oracle back-end
3(+1)  Compute Elements    4th CE currently being set up
2      Storage Elements    SRM v1.2 and v2.2
       dCache pools        1 head node; pool nodes with gridFTP doors
900    Worker Nodes        2500 cores, SL4; gLite 3.0.x to be migrated to 3.1.x

* In a wide sense, i.e. incl. physics DBs and dCache pools with gridFTP; only production services are listed.
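The “round robin” remark for the top-level BDII means a single DNS alias rotating over several hosts, so clients spread their queries across the instances. A minimal client-side sketch of that idea follows; the host names are hypothetical stand-ins, not the actual GridKa aliases.

```python
from itertools import cycle, islice

# Hypothetical host names standing in for the three top-level BDII instances;
# with DNS round robin, one alias would rotate over addresses like these.
bdii_hosts = ["bdii-1.example.org", "bdii-2.example.org", "bdii-3.example.org"]

rotation = cycle(bdii_hosts)           # endless round-robin iterator
first_six = list(islice(rotation, 6))  # two full passes over the pool
print(first_six)
```

Six successive picks visit each host exactly twice, which is the load-spreading behaviour the round-robin setup aims for.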

FTS 2.0 deployment example

Page 15: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


FTS 2.0 [+LFC] deployment at GridKa

Setup to ensure high availability.

Three nodes host the web services. VO and channel agents are distributed over the three nodes. The nodes are located in 2 different cabinets, so that at least one node keeps working in case of a cabinet power failure or a network switch failure.

3-node RAC on Oracle 10.2.0.3, 64-bit. The RAC will be shared with the LFC database: two nodes preferred for FTS, one node preferred for LFC. Distributed over several cabinets; mirrored disks in the SAN.

Page 16: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


FTS/LFC DB: one 3-node cluster on Oracle 10.2.0.3, 64-bit

[Diagram: the three RAC nodes, each with public, virtual and private IPs; a public VLAN (192.168.52/.53) towards the external network and a private 10.x.x.x interconnect over two switches; SAN volumes (RAID 1, 142 GB each) for FTSDATA1, FTSREC1, LFCDATA1, LFCREC1, the ASM spfile, voting disk and OCR; FTS services preferred on two nodes and LFC on the third, each able to fail over to the others.]

Page 17: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Tested FTS channels GridKa ⇔ Tier-0 / 1 / 2 (likely incomplete list)

Tier-0 ⇔ FZK: CERN - FZK

FZK ⇔ Tier-1: IN2P3 - FZK, PIC - FZK, RAL - FZK, SARA - FZK, TAIWAN - FZK, TRIUMF - FZK, BNL - FZK, FNAL - FZK, INFNT1 - FZK, NDGFT1 - FZK

FZK ⇔ Tier-2: FZK - CSCS, FZK - CYFRONET, FZK - DESY, FZK - DESYZN, FZK - FZU, FZK - GSI, FZK - ITEP, FZK - IHEP, FZK - JINR, FZK - PNPI, FZK - POZNAN, FZK - PRAGUE, FZK - RRCKI, FZK - RWTHAACHEN, FZK - SINP, FZK - SPBSU, FZK - TROITSKINR, FZK - UNIFREIBURG, FZK - UNIWUPPERTAL, FZK - WARSAW

Page 18: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


FTS 2.0 deployment experience

To-dos at GridKa after the experience with FTS 1.5
• Migrate FTS to 3 new redundant servers => buy hardware and install LAN, OS, … in advance
• Set up a new Oracle RAC (new version) on 64-bit
• Migrate the DB to redundant disks => new SAN configurations required
• Set up and test all existing transfer channels (by all experiments)

And the migration experience
• learning curve for the new 64-bit Oracle version
• fighting especially with changes in behaviour with two networks (internal + external)
• setting up and testing channels needs people, sometimes on both ends (vacation time, workshops; local admins communicate with 3 experiments, sometimes with different views, in parallel)

WLCG milestone: as a member of the MB I accepted it.

For sites, upgrading also means time-consuming service hardening and optimization; it is not just “pushing the update button.”

Page 19: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


5. Reliability & 24x7 operations

Page 20: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


SAM reliability (from WLCG report)

Page 21: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


SAM reliability

Some examples with zero severity for the experiments
• configuration changes of local or central services that result in failures for the OPS VO only
  - missing rpm ‘lcg-version’ in the new WN distribution
  - SAM tests CA certificates that had already become officially obsolete

More severe examples
• purely local hardware / software failures (redundancy required …)
• scalability of services after resource upgrades or during heavy load
• stability of “MSS-related” software pieces (SRM, gridFTP)

Overall, a very complex hierarchy of dependencies
• especially transient scalability and stability issues are difficult to analyse
• but this is necessary: analyse + fix instead of reboot! (sometimes at the expense of availability, though)

Page 22: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Site availability – OPS vs. CMS view

To be further analysed: do we have the correct (customer’s) view?

Page 23: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


To be further analysed…

Page 24: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Preparations for 24x7 support

Currently
• site admins (experts) during normal working hours
• experiment admins with special admin rights for VO-specific services
• operators (not always “experts”) watch the system and intervene during weekends and public holidays on a voluntary basis

Needed, and permanently being worked on
• redundancy, redundancy, redundancy
  - multiple experts on site 24h x 7d x 52w is out of the question
• hardening / optimization of services
  - the more scalability tests in production, the better (even if it hurts)
  - but we depend on robust software
• documentation of service components and procedures for operators
• service dashboard for operators

Page 25: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


GridKa service dashboard for operators

See A. Heiss et al., CHEP 2007

Page 26: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


6. Plans for 2008

Page 27: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


C-RRB on 23-oct-2007: LCG status report

Concern: Are sites aware of the ramp-up (incl. power & cooling)?

Page 28: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Electricity and cooling at GridKa

Planning & upgrades done during the last 3 years

• second (redundant) main power line available since 2007

• 3(+1; redundancy) x 600 kW new chillers available

• 1 MW of cooling (water cooling) capacity ready for 2008
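The “3(+1)” chiller notation above means three units cover the load and a fourth provides redundancy. A quick check of that claim can be sketched as follows, assuming the 1 MW water-cooling capacity is the load the chillers must cover:

```python
# N+1 redundancy check for the 3(+1) x 600 kW chillers vs. 1 MW cooling load.
chillers_total = 4        # 3 needed + 1 redundant
chiller_kw = 600          # capacity per chiller
cooling_load_kw = 1000    # 1 MW water-cooling capacity planned for 2008

# Even with one chiller down, remaining capacity must still cover the load.
with_one_failed = (chillers_total - 1) * chiller_kw  # 1800 kW
assert with_one_failed >= cooling_load_kw
print(f"{with_one_failed} kW available with one chiller failed")
```

With one unit failed, 1800 kW remain against a 1000 kW load, so the redundancy margin is comfortable under these assumptions.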

Capacity is not an issue, but we are concerned about running costs

• started benchmarking compute performance against electrical power in 2002

• efficiency (ratio of SPECint to power consumption) has entered our calls for tenders since 2004 (“penalty” of 4 €/W at selection)

• many discussions with providers (Intel, AMD, IBM, …)

• contributing to the HEPiX benchmarking group and publishing results
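The 4 €/W “penalty” at selection can be illustrated with a small example. All bid prices and power draws below are invented for illustration, not actual tender data:

```python
# Illustrative application of a 4 EUR/W efficiency penalty at tender
# selection; the vendors, prices and power figures are hypothetical.
PENALTY_EUR_PER_WATT = 4

bids = {
    "vendor_a": {"price_eur": 2000, "power_w": 300},  # cheaper but power-hungry
    "vendor_b": {"price_eur": 2100, "power_w": 250},  # pricier but efficient
}

def effective_cost(bid):
    # Purchase price plus the penalty applied to the node's power draw.
    return bid["price_eur"] + PENALTY_EUR_PER_WATT * bid["power_w"]

scores = {name: effective_cost(b) for name, b in bids.items()}
winner = min(scores, key=scores.get)
print(scores, "->", winner)
```

Here the nominally cheaper machine scores 3200 € against 3100 € for the efficient one, so the penalty flips the selection toward the lower-power offer.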

Page 29: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Efficiency (SPECint_rate_base2000 per W)

[Bar chart, roughly 0.05 to 0.45 SPECint_rate_base2000 per W, for: Intel Xeon 3.06 GHz, Intel Xeon 2.66 GHz, Intel Xeon 2.20 GHz, Intel Pentium 3 1.26 GHz, Intel Xeon E5345, Intel Xeon 5160, Intel Pentium M 760, AMD Opteron 270, AMD Opteron 246 (b), AMD Opteron 246 (a). Annotations: 2001-2004 very alarming; 2005-2007 much more promising.]

Based on our own benchmarks and measurements with GridKa hardware.

Page 30: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Extensions for 04/2008: everything is bought!

Oct’07
• 40 new cabinets delivered and installed
• 1/3 of CPUs (~130 machines) delivered

Nov’07: arrival and base installation of
• all new networking components (incl. cabling)
• remaining 2/3 of CPUs
• tape cartridges & drives

Nov/Dec’07
• arrival of 2.3 PB of disk (incl. non-LHC) + servers

Jan-Mar’08: installations, tests, acceptance, bug fixes, …

Page 31: The German Tier 1 LHCC Review, 19/20-nov-2007, stream B, part 2


Summary

• GridKa contributes its full MoU 2007 resources
  - we are ready for the April’08 ramp-up

• Good collaboration with sites, developers and experiments (e.g. local / remote VO admins)

• Much effort spent on
  - service hardening (redundancy …)
  - tools and procedures for operations
  - scalability and stability analysis
  - access performance optimization (e.g. tape reads)

• This remains a necessity, which requires
  - admins’ time
  - patience and understanding from customers
  - … sometimes at the expense of reliability measures

