Status of the Bologna Computing Farm and GRID related activities
Vincenzo M. Vagnoni
Thursday, 7 March 2002
Outline
- Currently available resources
- Farm configuration
- Performance
- Scalability of the system (in view of the DC)
- Resources foreseen for the DC
- Grid middleware issues
- Conclusions
Current resources
Core system (hosted in two racks at INFN-CNAF):
- 56 CPUs in dual-processor machines (18 PIII 866 MHz + 32 PIII 1 GHz + 6 PIII Tualatin 1.13 GHz), 512 MB RAM each
- 2 Network Attached Storage systems: 1 TB in RAID5 with 14 IDE disks + hot spare, and 1 TB in RAID5 with 7 SCSI disks + hot spare
- 1 Fast Ethernet switch with Gigabit uplink
- Ethernet-controlled power distributor for remote power cycling
Additional resources from INFN-CNAF:
- 42 CPUs in dual-processor machines (14 PIII 800 MHz, 26 PIII 1 GHz, 2 PIII Tualatin 1.13 GHz)
Farm Configuration (I)
Diskless processing nodes with the OS centralized on a file server (root over NFS):
- Makes adding or removing a node trivial, i.e. no software installation on local disks
- Allows easy interchange of CEs when resources are shared (e.g. among various experiments), and permits dynamic reallocation without additional work
- Very stable: no real drawback observed in about one year of running
Improved security:
- Private network IP addresses and an Ethernet VLAN give a high level of isolation
- Access to external services (AFS, mccontrol, bookkeeping DB, servlets of various kinds, ...) provided by means of NAT on the gateway
The most important critical systems (single points of failure) have been made redundant, though not yet all of them:
- Two NAS in the core system with RAID5 redundancy
- Gateway and OS server: operating systems installed on two RAID1 (mirrored) disks
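The two mechanisms above, root over NFS and NAT on the gateway, can be sketched roughly as follows. All hostnames, addresses and paths are illustrative, not the actual Bologna configuration:

```shell
# --- Root over NFS (on the OS file server) -----------------------------
# Export a shared root tree to the private VLAN; diskless nodes mount it
# as their root filesystem at boot, so "installing" a node is just a new
# DHCP/PXE entry and no software ever touches a local disk.
#
# /etc/exports (illustrative):
#   /export/diskless/root  192.168.1.0/255.255.255.0(ro,no_root_squash)
#   /export/diskless/home  192.168.1.0/255.255.255.0(rw)
#
# Kernel command line on a diskless node:
#   root=/dev/nfs nfsroot=192.168.1.10:/export/diskless/root ip=dhcp

# --- NAT on the gateway ------------------------------------------------
# A Red Hat 7.2 gateway (2.4 kernel) can masquerade the private VLAN so
# nodes reach external services (AFS, bookkeeping DB, ...) without having
# public addresses:
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -s 192.168.1.0/24 -j MASQUERADE
```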
Farm Configuration (II)
[Diagram: farm layout. A gateway (Red Hat 7.2; DNS, IP forwarding, masquerading) links the public VLAN to the private VLAN. Behind a Fast Ethernet switch with Gigabit uplink sit control node 1 (diskless Red Hat 6.1, kernel 2.2.18; PBS master, Mcserver, farm monitoring) and processing nodes 2..n (diskless Red Hat 6.1, kernel 2.2.18; PBS slaves). Two RAID5 NAS units hold home directories and the OS file systems; a Red Hat 7.2 server on RAID1 disks provides various services (PXE remote boot, DHCP, NIS); an Ethernet link to the power distributor provides power control. Photos: Fast Ethernet switch, 1 TB NAS, Ethernet-controlled power distributor (32 channels), rack of 1U dual-processor motherboards.]
Performance
- The system has been fully integrated in the LHCb MC production since August 2001
- 20 CPUs until December, 60 CPUs until last week, 100 CPUs now
- Produced mostly bb inclusive DST2 with the classic detector (SICBMC v234 and SICBDST v235r4, 1.5 M events) + some 100k channel data sets for LHCb light studies
- Typically, roughly 20 hours are needed on a 1 GHz PIII for the full chain (minbias RAWH + bbincl RAWH + bbincl piled-up DST2) for 500 events
- The farm is capable of producing about (500 events/day) x (100 CPUs) = 50,000 events/day, i.e. 350,000 events/week, i.e. 1.4 TB/week (RAWH + DST2)
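The quoted rate follows from simple arithmetic; the numbers below just restate the slide's figures:

```shell
# 500 events take ~20 h on a 1 GHz PIII, i.e. ~500 events/CPU/day.
cpus=100
events_per_cpu_per_day=500
per_day=$((cpus * events_per_cpu_per_day))
per_week=$((per_day * 7))
echo "${per_day} events/day, ${per_week} events/week"
# -> 50000 events/day, 350000 events/week
# At 1.4 TB/week this is roughly 4 MB per event (RAWH + DST2).
```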
- Data transfer to CASTOR at CERN is done with standard ftp (15 Mbit/s out of an available bandwidth of 100 Mbit/s), but tests with bbftp reached very good throughput (70 Mbit/s)
- Still waiting for IT to install a bbftp server at CERN
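For reference, a bbftp transfer of one output file might look as below. The server name, account and paths are made up; the -p option sets the number of parallel TCP streams, which is what recovers throughput on the WAN compared to plain ftp:

```shell
# Illustrative only: push one DST2 file to CASTOR over 4 parallel streams.
bbftp -u lhcbprod -p 4 \
      -e "put run0123.dst /castor/cern.ch/lhcb/run0123.dst" \
      bbftpserv.cern.ch
```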
Scalability
- Production tests made these days with 82 MC processes running in parallel
- Used the two NAS systems independently (instead of sharing the load between them)
- Each NAS worked at 20% of full performance, i.e. each of them can be scaled up by much more than a factor of 2
- By distributing the load, we are pretty sure this system can handle more than 200 CPUs working at the same time at 100% (i.e. without bottlenecks)
- For the analysis we want to test other technologies: we plan to test a Fibre Channel network (SAN, Storage Area Network) on some of our machines, with nominal 1 Gbit/s bandwidth to Fibre Channel disk arrays
Resources for the DC
- Additional resources from INFN-CNAF are foreseen for the DC period
- We'll join the DC with on the order of 150-200 CPUs (around 1 GHz or more), 5 TB of disk storage and a local tape storage system (CASTOR-like? Not yet officially decided)
- Still need some work to make the system fully redundant
Grid issues (A. Collamati)
- 2 nodes are reserved at the moment for tests on GRID middleware
- The two nodes form a mini-farm, i.e. they have exactly the same configuration as the production nodes (one master node and one slave node) and can run MC jobs as well
- Globus has been installed, and first trivial tests on job submission through PBS were successful
- Next step: test job submission via Globus on a large scale by extending the PBS queue of the Globus test farm to all our processing nodes
- No interference with the running distributed production system
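A trivial submission through the Globus PBS jobmanager of that era might look as follows; the contact string is illustrative, not the actual test-farm hostname:

```shell
# Run a command synchronously on the test farm through the PBS jobmanager:
globus-job-run gridtest.bo.infn.it/jobmanager-pbs /bin/hostname

# Or submit in batch mode; this prints a contact URL usable with
# globus-job-status and globus-job-get-output:
globus-job-submit gridtest.bo.infn.it/jobmanager-pbs /bin/date
```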
Conclusions
- Bologna is ready to join the DC with a reasonable amount of resources
- Scalability tests were successful
- The farm configuration is pretty stable
- We need the bbftp server installed at CERN to fully exploit WAN connectivity and throughput
- We are waiting for CERN's decision on the DC period for the final allocation of INFN-CNAF resources
- Work on GRID middleware has started; first results are encouraging
- We plan to install Brunel ASAP