Page 1: Current status and near-term prospects of computing for ATLAS in Russia

A. Minaenko

Meeting on Physics and Computing,

3 February 2010, SINP MSU, Moscow

Current status and near-term prospects of computing for ATLAS in Russia

Page 2

Layout

• Ru-Tier-2 tasks

• Computing resources in 2009 and their increase for 2010

• NL Tier-1

• CPU resources usage in 2009

• RuTier-2 and data distribution

• STEP09 lessons


Page 3

ATLAS RuTier-2 tasks

• The Russian Tier-2 (RuTier-2) computing facility is planned to supply computing resources to all 4 LHC experiments, including ATLAS. It is a distributed computing centre comprising the computing farms of 6 institutions: ITEP, RRC-KI, SINP (all in Moscow), IHEP (Protvino), JINR (Dubna), and PNPI (St. Petersburg)

• Recently two smaller sites, MEPhI and FIAN, have been added

• The main RuTier-2 task is to provide facilities for physics analysis of the collected data, mainly using AOD, DPD and user-derived data formats

• It also includes the development of reconstruction algorithms using limited subsets of ESD and RAW data

• 50 active ATLAS users are expected to carry out physics data analysis at RuTier-2

• The group ru has now been created within the ATLAS VO. It includes physicists intending to carry out analysis at RuTier-2; the group list currently contains 38 names. The group will have the privilege of write access to local RuTier-2 disk resources (space token LOCALGROUPDISK)

• All the data used for analysis should be stored on disks

• The second important task is production and storage of MC simulated data

• The total data volume and the CPU needed for its analysis are proportional to the collected statistics, so the required resources must grow steadily with the number of collected events


Page 4

Current RuTier-2 resources for ATLAS

Site     CPU, kSI2k   Disk, TB   ATLAS Disk, TB
IHEP        490          130           80
ITEP        550          140           10
JINR       2150          440          120
RRC-KI     2000          670          230
PNPI        410          140           60
SINP        440          150            3
MEPhI       420           60           30
FIAN         90           46           23
Total      6460         1730          556

• The total number of CPU cores is about 2500
• Red in the original table marks the sites used for user analysis of ATLAS data (IHEP, JINR, RRC-KI, PNPI); the others are used for simulation only
• At the moment the main type of LHC grid job is official production, and CPU resources are dynamically shared by all 4 LHC VOs on the basis of equal access rights for each VO at each site
• Later, when analysis jobs take a larger part of RuTier-2 resources, CPU resources will be shared in proportion to each VO's disk share at each site (see the sketch below)
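To illustrate the intended disk-proportional fair-share scheme, here is a minimal Python sketch that turns per-VO disk allocations at a site into CPU share fractions. The function name and the example numbers are the editor's assumptions, not actual RuTier-2 configuration values.

    def cpu_shares_from_disk(disk_by_vo):
        """Return the fractional CPU share of each VO, proportional to its disk share.

        disk_by_vo: mapping VO name -> disk allocated to that VO at the site (TB).
        """
        total = sum(disk_by_vo.values())
        return {vo: disk / total for vo, disk in disk_by_vo.items()}

    # Hypothetical disk allocations (TB) at one RuTier-2 site.
    example = {"alice": 100, "atlas": 230, "cms": 150, "lhcb": 90}
    for vo, share in cpu_shares_from_disk(example).items():
        print(f"{vo}: {share:.1%} of CPU")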


Page 5

ATLAS Storage Space

• ATLAS(DATA/MC)TAPE
  – At all T1s, write disk buffer to tape system
  – Centrally managed
  – ACL:
    • Write permission to /atlas/Role=production
    • Recall-from-tape permission to /atlas/Role=production

• ATLAS(DATA/MC)DISK
  – At all T1s and T2s, disk-only pools for data and Monte Carlo
  – Centrally managed
  – ACL: write permission to /atlas/Role=production

• ATLASHOTDISK
  – At all T1s and T2s, disk-only pools for hot files
  – Possibly enable file replication; how to do it depends on the storage technology
  – ACL: write permission to /atlas/Role=production

• ATLASPRODDISK
  – At all T1s and T2s, buffers for Monte Carlo production
  – Centrally managed, automated cleanup
  – ACL: write permission to /atlas/Role=production

Page 6

ATLAS Storage Space

• ATLASGROUPDISK
  – At some T1s and T2s, disk-only pools for groups
  – Managed by the group via DDM tools
  – ACL: write permission to /atlas/Role=production (!!!!)

• ATLASSCRATCHDISK
  – At all T1s and T2s, buffers for:
    • user analysis
    • data distribution
    • dataset upload
  – Centrally managed, automatic cleanup (1-month dataset cache)
  – Very subject to creation of dark data (more on this later)
  – ACL: write permission to /atlas

• ATLASLOCALGROUPDISK
  – Non-pledged space for users
  – Policy, ACLs, etc. defined by the site
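As a compact recap of the token layout described on the last two slides, the following Python sketch collects the space tokens into a small table. The field names and the structure are the editor's, not an official ATLAS DDM or SRM configuration format.

    # Hypothetical recap of the space tokens described above; fields: where deployed,
    # who manages the space, and which role may write to it.
    SPACE_TOKENS = {
        "ATLASDATATAPE":       ("T1 only",          "central", "/atlas/Role=production"),
        "ATLASMCTAPE":         ("T1 only",          "central", "/atlas/Role=production"),
        "ATLASDATADISK":       ("all T1s and T2s",  "central", "/atlas/Role=production"),
        "ATLASMCDISK":         ("all T1s and T2s",  "central", "/atlas/Role=production"),
        "ATLASHOTDISK":        ("all T1s and T2s",  None,      "/atlas/Role=production"),  # management not stated above
        "ATLASPRODDISK":       ("all T1s and T2s",  "central", "/atlas/Role=production"),
        "ATLASGROUPDISK":      ("some T1s and T2s", "group",   "/atlas/Role=production"),
        "ATLASSCRATCHDISK":    ("all T1s and T2s",  "central", "/atlas"),
        "ATLASLOCALGROUPDISK": ("site-defined",     "site",    "site-defined"),
    }

    # Example: which tokens can any ATLAS member (not only Role=production) write to?
    print([name for name, (_, _, acl) in SPACE_TOKENS.items() if acl == "/atlas"])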


Page 7

Space tokens at RuTier-2

• According to the ATLAS request, all available disk space is assigned to space tokens as shown in the table
• Preliminarily, 3 group disks are assigned to RuTier-2:
  – RRC-KI – exotic
  – JINR – SM
  – IHEP – JetEtmiss
• Probably one more group disk can be obtained after the increase of the resources
• At the end of 2009 CPU resources were increased by 1000 kSI2k and disk space by 1000 TB
• The additional resources will be available in February-March


Page 8

Estimate of resources needed to fulfil RuTier-2 tasks in 2010

• The announced effective time available for physics data taking during the long 2009-2010 run is 0.6×10^7 seconds

• The ATLAS DAQ event recording rate is 200 events per second, i.e. the total expected statistics is 1.2×10^9 events

• The current AOD event size is 180 kB, 1.8 times larger than the ATLAS computing model requirement, and it is unlikely to be reduced by the start of data taking

• Full expected size of the current AOD version is equal to 220 TB

• It is necessary to keep 30-40% of the previous AOD version for comparisons, which gives a full AOD size of 290 TB - DATADISK

• During the first years of LHC running, a very important task is the study of detector performance, which requires more detailed information than is available in the AOD. ATLAS plans to use "performance DPDs" prepared from the ESD for this task. The targeted full performance-DPD size equals the full AOD size, i.e. another 290 TB - DATADISK

• The expected physics DPD size (official physics DPDs produced by the physics analysis groups) is at the level of 0.5 of the full AOD size, i.e. 150 TB more - GROUPDISK

• 50 TB should be reserved for local users' use (ntuples and histograms kept in the LOCALGROUPDISK token, 1 TB per user) - LOCALGROUPDISK

• The expected size of simulated AOD for MC08 (10 TeV) events alone is 80 TB, so about 250 TB needs to be reserved for the full simulated AOD - MCDISK

• It is also necessary to keep some samples of ESD and RAW events and, probably, calibration data

• So the minimal requirement for the needed disk space is at the level of 1000 TB

• Using the usual CPU-to-disk ratio of 3:1, one gets an estimate of 3000 kSI2k for the needed CPU resources (the arithmetic is sketched below)
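The estimate above is simple arithmetic; a minimal Python sketch, using only the numbers quoted in the bullets (variable names are the editor's), reproduces it:

    # Resource estimate for RuTier-2 in 2010, reproducing the bullets above.
    live_time_s = 0.6e7    # effective physics data-taking time, seconds
    daq_rate_hz = 200      # ATLAS DAQ event recording rate, events/s
    aod_size_kb = 180      # current AOD event size, kB

    n_events = live_time_s * daq_rate_hz             # 1.2e9 events
    aod_tb   = n_events * aod_size_kb * 1e3 / 1e12   # ~216 TB, quoted as 220 TB

    data_disk  = 290 + 290   # current+previous AOD plus performance DPD (TB)
    group_disk = 150         # physics DPD (TB)
    local_disk = 50          # LOCALGROUPDISK, 1 TB per user (TB)
    mc_disk    = 250         # full simulated AOD (TB)

    total_disk_tb = data_disk + group_disk + local_disk + mc_disk  # ~1030 TB -> "at the level of 1000 TB"
    cpu_ksi2k     = 3 * 1000                                       # CPU/disk ratio 3:1 -> 3000 kSI2k

    print(f"events: {n_events:.2e}, AOD: {aod_tb:.0f} TB, disk: {total_disk_tb} TB, CPU: {cpu_ksi2k} kSI2k")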


Page 9

BiG Grid – the Dutch e-science grid

• Realising an operational ICT infrastructure at the national level for scientific research (e.g. High Energy Physics, Life Sciences and others).

• The project includes hardware, operations and support
• Project time: 2007 – 2011
• Project budget: 29 M€

– Hardware and operations (incl. people): 16 M€ (lion’s share for HEP).

• 4 central facility sites (NIKHEF, SARA, Groningen, Eindhoven)
• 12 small clusters for Life Sciences
• wLCG Tier-1 is run as a service


Page 10

Tier-1 by BiG Grid (history)

• The Dutch Tier-1 (NL-T1) is run as BiG Grid service by the operational partners SARA and NIKHEF

• Activity initiated by PDP-group@NIKHEF (involved in EDG and EGEE) and SARA (involved in EDG and EGEE)

• At that point a two-site setup was chosen:
  – NIKHEF: Compute and Disk
  – SARA: Compute, Disk, Mass Storage, Database and LHCOPN networking

• Tier-2s connected to NL-T1, none of them in the Netherlands (Israel, Russia, Turkey, Ireland, Northern UK as a guest)


Page 11

T1 hardware resources for ATLAS

ATLAS resources     Site   December 2009   March 2010
Computing           S      14k HEPSPEC     14k HEPSPEC
                    N      14k HEPSPEC     14k HEPSPEC
Front-end storage   S      1200 TB         2000 TB
                    N      1000 TB         1000 TB
Tape storage        S      800 TB          2100 TB (after March)
Bandwidth to tape   S      450 MBps

(S = SARA, N = NIKHEF)

ATLAS is allocated 70% of the total resources for HEP


Page 12

Architecture overview


Page 13


Groups and group space


Page 14

JINR (Dubna) MonALISA monitoring and accounting


Page 15

RuTier-2 CPU resources usage in 2009


Page 16

RuTier-2 CPU resources usage for ATLAS in 2009

By experiment:
• ALICE – 25%
• ATLAS – 31%
• CMS – 23%
• LHCb – 21%

ATLAS usage by site:
• JINR – 63%
• IHEP – 14%
• RRC-KI – 12%
• ITEP – 6%
• PNPI – 3%
• SINP – 3%


Page 17

RuTier-2 CPU resources usage in 2009 (EGEE accounting)


Page 18

NIKHEF, SARA CPU resources usage in 2009 (EGEE accounting)


Page 19

ATLAS RuTier-2 and data distribution

• The sites of RuTier-2 are associated with the ATLAS Tier-1 at SARA
• Now 8 sites (IHEP, ITEP, JINR, RRC-KI, SINP, PNPI, MEPhI, FIAN) are included in the TiersOfAtlas list, and FTS channels are tuned for data transfers to/from the sites

• Four of them (IHEP, JINR, RRC-KI, PNPI) will be used for ATLAS data analysis, and all physics data needed for analysis will be kept at these sites. The other four sites will be used mostly or only for MC simulation

• All sites are successfully participating in the data transfer functional tests (FTs). This is a coherent data transfer test Tier-0 → Tier-1s → Tier-2s for all clouds, using the existing software to generate and replicate data and to monitor the data flow. It is now a regular activity performed once per week, and all data transmitted during the FTs are deleted at the end of each week. The volume of data used for the functional tests is at the level of 10% of the data obtained during real data taking

• RuTier-2 is now subscribed to receive all simulated AOD, DPD and TAGs, as well as the real 2009 AOD. The data transfer is done automatically under the steering and control of the central ATLAS DDM (Distributed Data Management) group

• The currently used shares (40%, 30%, 15%, 15% for RRC-KI, JINR, IHEP, PNPI) correspond to the disk resources available for ATLAS at these sites (see the sketch after this list)

• MC data are transferred to the MCDISK space token, and cosmic and real data to the DATADISK space token at RuTier-2

• Problem: during the whole of January not a single file of reprocessed data was replicated to our sites
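To make the share-based subscriptions concrete, the sketch below distributes hypothetical datasets over the four analysis sites with probability proportional to their disk shares. The dataset names and the weighted-random placement are illustrative assumptions, not the actual ATLAS DDM subscription algorithm.

    import random

    # Disk-share weights for the ATLAS analysis sites of RuTier-2 (from the bullet above).
    SITE_SHARES = {"RRC-KI": 0.40, "JINR": 0.30, "IHEP": 0.15, "PNPI": 0.15}

    def choose_destination(shares, rng=random):
        """Pick a destination site with probability proportional to its disk share.

        Only an illustration of share-weighted placement; the real DDM machinery
        is more involved.
        """
        sites, weights = zip(*shares.items())
        return rng.choices(sites, weights=weights, k=1)[0]

    # Hypothetical dataset names, for illustration only.
    datasets = [f"mc08.00{i}.AOD" for i in range(10)]
    for ds in datasets:
        print(ds, "->", choose_destination(SITE_SHARES))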


Page 20

STEP09 Analysis Global Results


Page 21

STEP09 lessons

• The data transfer part of STEP09 was successful in general. Measures have been taken to make the RRC-KI SE work more stably, and the external bandwidth at IHEP has already been increased to 1 Gbps

• Analysis jobs: only IHEP demonstrated good results, with an event rate of about 10 Hz and 40% CPU efficiency. The problems at JINR were due to ATLAS software-specific bugs

• The main real problem: the farms at RRC-KI and PNPI needed to be reconfigured to remove bottlenecks. The farm design at the other sites also needs to be checked and probably revised to meet future challenges

• The known problems were fixed and the analysis exercise repeated. The goal: try to increase the event rate to 20 Hz and the CPU/wall-time ratio towards 100% (see results)

• It is necessary to estimate the output bandwidth of our individual fileservers of the different types, as well as the full SE output bandwidth at each site for each VO. This is needed to understand how many analysis jobs each site can accept (see the enclosed presentation and the sketch after this list)

• This has not been done yet; it will be done after the resource increase

• The problem is that the ratio (output bandwidth)/(fileserver volume) differs between fileservers for historical reasons. As a result, the full SE bandwidth can be considerably lower than the simple sum of the fileserver bandwidths. This is hard to fix now, but it must be taken into account in future equipment purchases
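A back-of-the-envelope way to turn a measured SE output bandwidth into a limit on concurrent analysis jobs is sketched below. The function and the numeric inputs are illustrative assumptions, not measured RuTier-2 values.

    def max_analysis_jobs(se_bandwidth_mbps, event_rate_hz, event_size_kb):
        """Rough upper limit on the number of analysis jobs an SE can feed concurrently.

        se_bandwidth_mbps: full SE output bandwidth available to the VO, MB/s
        event_rate_hz:     event rate of a single analysis job, events/s
        event_size_kb:     data read per event, kB
        """
        per_job_mbps = event_rate_hz * event_size_kb / 1024.0  # MB/s read by one job
        return int(se_bandwidth_mbps / per_job_mbps)

    # Illustrative numbers: a 400 MB/s SE, jobs reading AOD (180 kB/event)
    # at the 20 Hz target rate mentioned above.
    print(max_analysis_jobs(400, 20, 180))  # ~113 concurrent jobs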


Page 22

Hammercloud test #895


Page 23

Hammercloud test #895


