Israel ATLAS Tier2 Status, NL Cloud Meeting, 5 April 2011
Israel ATLAS TIER-2
Status
April 2011
Lorne Levinson
Israel HEP community
• ATLAS is the only LHC experiment in which we participate
– also Phenix (Heavy Ion @BNL), ILC, ZEUS
– Israel is “1.35% of ATLAS” (MoU pledge, authors, common fund)
– 25-30 people doing physics analysis
• 3 sites:
– Tel Aviv University, Tel Aviv (1956): a university
– The Technion, Israel Institute of Technology, Haifa (1924): a university
– Weizmann Institute of Science, Rehovot (1934): a research institute (Biology, Chemistry, Physics, Math & CS) with a graduate school (no undergrads)
• longest travel is Weizmann to Technion: 2 hours office-to-office
Organization
• we are a distributed Tier2/Tier3
• each site combines Tier2 and Tier3 resources in the same cluster
– all resources shared flexibly between T2 and T3 (Lustre/Storm)
• single management and budget, single purchasing
• three sites as identical as possible
• Steering Committee for overall policy
• Management & Operations team for the three sites
• stable funding approved until 2012
Storage
Continues to be the biggest reliability issue.
• Our hardware is now stable:
– replaced DDN 6620’s with DDN 9900: fully redundant, 300 disk slots, 8 x 8Gb/s FC ports, 5GB/s
– two Lustre “OSS” servers
– WI servers with 10Gb/s to cluster; TAU and Technion will install 10G in April
• Gave up on Thumpers+Lustre and Thumpers+iSCSI+Lustre.
– We NFS mount Thumpers with Solaris+ZFS for extra "archive" storage, home directories or /opt/exp_soft
• Lustre + Storm problem is that the Storm team does not test new Storm releases on Lustre
– Storm-Lustre community must solve this
Storm/Lustre
• Storm allows LCG SRM storage and our local global file name space to share the same physical storage.
– No rigid boundary
– Jobs in cluster can do Linux file I/O to read SRM files
• Storm can run over Lustre (open source) or GPFS (IBM)
• Lustre:
– Object Storage Targets serve (stripes of) file data
– Meta-Data Server holds directories
– redundant failover of MDS’s will soon be supported
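To make "stripes of file data" concrete, here is a minimal sketch of a round-robin (RAID-0 style) stripe layout, which is the general idea behind Lustre striping. The function name `locate_stripe` and the default stripe size and stripe count are illustrative assumptions, not values taken from our configuration.

```python
def locate_stripe(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a byte offset in a file to (object index, offset inside that object).

    Assumed layout: the file is cut into stripe_size chunks and chunk k is
    stored on object k % stripe_count, each object living on one storage target.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    chunk = offset // stripe_size      # which stripe-sized chunk of the file
    obj_index = chunk % stripe_count   # which object holds that chunk
    # Each object stores every stripe_count-th chunk back to back.
    obj_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


if __name__ == "__main__":
    MIB = 1 << 20
    for off in (0, MIB, 4 * MIB, 5 * MIB + 10):
        print(off, locate_stripe(off))
```

With this layout a large sequential read is spread over several targets at once, which is why the storage targets, not the metadata server, carry the data traffic.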
Storage – installed SRM + local capacity (net TB)
                 TAU   Technion   Weizmann   Total
2010             240   192        288        720
2011 purchase     96   144        144        384
Total 2011       336   336        432        1104
Heavy Ion 3Q2011: 48 (Total 1152)
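As a quick arithmetic cross-check of the capacity table above, the following sketch recomputes the row and column totals. The variable and function names are hypothetical; the numbers are copied from the table.

```python
# Net TB per site, copied from the capacity table.
SITES = ("TAU", "Technion", "Weizmann")
installed_2010 = {"TAU": 240, "Technion": 192, "Weizmann": 288}
purchase_2011 = {"TAU": 96, "Technion": 144, "Weizmann": 144}
heavy_ion_3q2011 = 48  # added to the grand total only


def capacity_summary():
    """Recompute the totals of the capacity table from the per-site entries."""
    per_site_2011 = {s: installed_2010[s] + purchase_2011[s] for s in SITES}
    total_2011 = sum(per_site_2011.values())
    return {
        "2010": sum(installed_2010.values()),
        "2011 purchase": sum(purchase_2011.values()),
        "per-site 2011": per_site_2011,
        "total 2011": total_2011,
        "with heavy ion": total_2011 + heavy_ion_3q2011,
    }


if __name__ == "__main__":
    for key, value in capacity_summary().items():
        print(key, value)
```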
Group disks
• We are hosting four ATLASGROUPDISK areas
– Muon performance (Technion)
– Top (Weizmann)
– Heavy Ion (Weizmann)
– Standard Model (TAU) (empty)
CPU
• Last purchase was dual Intel E5520 quad-core
• May delivery purchase is dual Intel X5650 hex-core
– again 4 motherboards per 2U box with redundant power supply
cores   Tel Aviv   Technion   Weizmann   Total
Now     192        272        448        944
May     336        464        640        1440
We benefit considerably from other groups placing some of their cores in our cluster:
* Weizmann: ATLAS+Phenix/Heavy-Ion, HEP Theory, Condensed matter
* Technion: HEP Theory and Bio-informatics
* TAU: HEP Theory
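Combining the box layout (4 dual-socket motherboards per 2U box) with the per-site core counts gives a rough consistency check. The sketch below assumes, which the slide does not state, that the whole increase from "Now" to "May" comes from X5650 boxes; all names are hypothetical.

```python
MOTHERBOARDS_PER_BOX = 4  # "4 motherboards per 2U box"
SOCKETS_PER_BOARD = 2     # "dual" CPU motherboards

# Per-site core counts copied from the table.
now = {"Tel Aviv": 192, "Technion": 272, "Weizmann": 448}
may = {"Tel Aviv": 336, "Technion": 464, "Weizmann": 640}


def cores_per_box(cores_per_cpu):
    """Cores in one 2U box for a given number of cores per CPU."""
    return MOTHERBOARDS_PER_BOX * SOCKETS_PER_BOARD * cores_per_cpu


def boxes_implied(site, cores_per_cpu=6):
    """(whole boxes, leftover cores) if all added cores were hex-core boxes."""
    added = may[site] - now[site]
    return divmod(added, cores_per_box(cores_per_cpu))


if __name__ == "__main__":
    print("E5520 box:", cores_per_box(4), "cores")
    print("X5650 box:", cores_per_box(6), "cores")
    for site in now:
        print(site, boxes_implied(site))
```

Under that assumption each site's increase is a whole number of 48-core boxes.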
Services nodes
Virtualize most services
• Two 8-core servers, 48GB
• Failover
• Easier management
– VM images
– Roll-back
– Image sharing
– Easier testing: temp machines
• May delivery of HW
• Deciding among: VMware, Xen, Citrix, KVM
• SE not included
Service                                              Where
gLite CE                                             per site
gLite site-BDII                                      per site
gLite MON                                            per site
gLite APEL                                           per site
ELOG electronic log book                             WI
Zenoss fabric monitoring                             per site
LDAP, DNS, DHCP, syslog                              per site
Frontier DB cache                                    per site
VOMS (for Israel)                                    TAU
gLite WMS, LB (for Israel)                           WI
gLite myproxy (for Israel)                           WI
gLite Top-BDII (for Israel)                          WI
gLite NAGIOS for Israel grid service monitoring      WI
Mantis issue tracker                                 Tech
Managers’ Wiki pages                                 Tech
Networking
Our networking is not good
• Geant connection is 2 x 1.5G (subscribed on 2 x 2.5G infrastructure)
• “Political” limits: TAU 500M, Technion 350M, WI 400M
– Because a 1G line is shared with institute traffic and the shared router is not really able to do 1G duplex
• We suspect that the gross mismatch with SARA/NIKHEF’s 10G causes failed connections due to dropped packets.
– Lowering the # of files & streams to avoid dropped packets leaves us with even worse net BW
• Expensive because it is an undersea fiber and one (Italian) company owns the fibers.
– An Israeli competitor is installing another fiber now
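To put the per-site limits in perspective, here is a back-of-the-envelope sketch of how long a transfer takes at each cap. It assumes "M" and "G" mean Mb/s and Gb/s and uses decimal terabytes; the `efficiency` parameter and all names are hypothetical, and real transfers will be slower because of protocol overhead and the dropped packets described above.

```python
# Per-site "political" limits and the international link, in Mb/s (assumed units).
CAPS_MBPS = {"TAU": 500, "Technion": 350, "WI": 400}
GEANT_MBPS = 2 * 1500  # 2 x 1.5G


def transfer_hours(terabytes, rate_mbps, efficiency=1.0):
    """Hours needed to move `terabytes` (decimal TB) at `rate_mbps` megabits/s."""
    if rate_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("rate must be positive and efficiency in (0, 1]")
    bits = terabytes * 1e12 * 8
    seconds = bits / (rate_mbps * 1e6 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    for site, cap in CAPS_MBPS.items():
        print(f"{site}: 1 TB in {transfer_hours(1, cap):.1f} h at {cap} Mb/s")
    print("Sum of site caps:", sum(CAPS_MBPS.values()), "Mb/s of", GEANT_MBPS)
```

Even at full line rate the three caps together use well under half of the international connection.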
Networking plans
May 2011(?):
• Increase international connection: from 3Gb/s to 4Gb/s.
– 5G might be possible later this year, but not budgeted.
• Replace old routers at entrances to institutes with 10G capable equipment.
– This should increase our throughput and reliability and allow us to actually use a major share of the 1G BW to the sites
• Negotiating 10G academic backbone
• Could have 10G to Geant in spring 2012
SAM/NAGIOS
• Our NGI did not take on the SAM/NAGIOS monitoring responsibility
• After the new NAGIOS tests replaced SAM tests, we received no alerts on failed tests.
• This was a severe problem
• Finally, in December, EGI, our NGI and we agreed that we would deploy a NAGIOS test service for Israel until our NGI is able to do so.
– The only functioning grid sites in Israel are our 3 ATLAS sites
• Our NAGIOS service was up and running in January.
Upcoming work
• Deploy Zenoss fabric and service monitor on all three clusters
– currently in test at Weizmann
• Deploy Puppet configuration system on all three clusters
– We gave up on Quattor after finally getting it to run: it was clear that it was unsustainable
– Currently for worker nodes at Weizmann
– Needs to include gLite nodes
• Virtualization of services (excl. SE)
• Address Storm “untested new version” problem