Page 1: HEPiX Fall 2008 @ ASGC Taipei Summary Report

HEPiX Fall 2008 @ ASGC Taipei Summary Report

HEPSysMan @ Manchester, 6 November 2008

Martin Bly, RAL Tier1 Fabric Manager

Page 2: HEPiX Fall 2008 @ ASGC Taipei Summary Report

6 November 2008 HEPiX Fall 2008 Report - HEPSysMan @ Manchester 2

Overview

• Venue/Format/Themes
• Storage
• Castor
• OS/Environment
• Benchmarking
• Virtualisation
• Networking
• Security
• Site Reports


Fall HEPiX 2008

• Hosted by Academia Sinica Grid Computing (ASGC), Taipei, Taiwan, 20th to 24th October
  – Preceded by a Grid Camp specifically aimed at Far East Tier2s
  – Very comfortable, good wireless network access
  – Excellent food!

• Format
  – Sessions based on themes
  – Themes spread over several sessions

• Agenda: http://indico.twgrid.org/conferenceDisplay.py?confId=471

Page 5: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Themes

• Site Reports (3 sessions)
• Storage Issues (3)
• Data Centres (2)
• CASTOR session
• Operating Systems & Applications (2)
• Virtualisation (2)
• Networking & Security (2)
• Benchmarking
• Miscellaneous Talks (2)
• Banquet, visit to ASGC Data Centre

Page 6: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Storage – File Systems Working Group

• Former IHEPCCC WG, now supported by the HEPiX Board.

• Work continuing
  – New tests based on a 'standard' job from CMS, run on hardware at FZK.
  – Running with xrootd, AFS, Lustre.
  – Results continue to indicate Lustre offers considerable performance benefits.

Page 7: HEPiX Fall 2008 @ ASGC Taipei Summary Report


RAID Technology Update – End-to-End Data Protection – Dr. Ted Pang (Infortrend)

• Review of issues surrounding data integrity and protection in modern storage systems.
  – Errors can occur in any one or more components in the storage system.
  – Description of existing data integrity technologies:
    • ECC for each data block on HDD, RAID parity checking, CRC, ECC on data links.
  – Silent Data Corruption – unreported, unnoticed.
    • Could be caused by on-wire corruption, incorrect writes, unfinished writes.
  – Review of methods for detecting data errors: media scan -> rebuild; media error -> mark bad data block -> report error to host; RAID parity check on read. Need more checking for data blocks.

• Protection for data integrity: add 8 bytes to each 512-byte block, written along with the original block and checked as the data is re-read.
  – DIF format, IEEE P1619.

• How does ZFS fit into the story?
  – ZFS is a whole system whereas DIF is at the hardware layer.

• What is the performance hit for DIF etc.?
  – ~5 to 10% performance and 1.5% space.
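The per-block protection scheme can be sketched in a few lines. This is an illustrative mock-up of a DIF-style 8-byte tuple (guard CRC, application tag, reference tag), not Infortrend's implementation or the exact T10 wire format:

```python
import struct

BLOCK = 512  # data bytes per sector; DIF-style formats extend this to 520

def crc16(data: bytes, poly: int = 0x8BB7) -> int:
    # Bitwise CRC-16; 0x8BB7 is the T10-DIF guard polynomial. Shown bitwise
    # for clarity -- real implementations use table-driven or hardware CRC.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def protect(block: bytes, lba: int) -> bytes:
    """Append the 8-byte tuple: 2-byte guard CRC, 2-byte app tag, 4-byte ref tag."""
    assert len(block) == BLOCK
    return block + struct.pack(">HHI", crc16(block), 0, lba & 0xFFFFFFFF)

def verify(sector: bytes, expected_lba: int) -> bool:
    """Re-check as the data is re-read: catches bit rot AND misplaced writes,
    because the reference tag must match the LBA we asked for."""
    block = sector[:BLOCK]
    guard, _app, ref = struct.unpack(">HHI", sector[BLOCK:])
    return guard == crc16(block) and ref == expected_lba

sector = protect(bytes(BLOCK), lba=42)
assert verify(sector, 42)                       # clean read
corrupt = bytearray(sector); corrupt[100] ^= 1
assert not verify(bytes(corrupt), 42)           # silent corruption now detected
assert not verify(sector, 43)                   # block returned from the wrong LBA
```

The 8 extra bytes per 512-byte block give the ~1.5% space overhead quoted in the talk (8/520 ≈ 1.5%).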

Page 8: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Lustre File System News and Roadmap – Peter Bojanic (Sun)

• Reviewed Lustre and developments.
  – Well used in HEP, and used in 6 of the top 10 in the last Top500.
  – Demonstrated wide scalability and high performance.

• Notes on Lustre 1.8, 2.0, 3.0.
  – Lustre 1.8 has faster recovery in high-availability use (adaptive timeouts), OST pools, OSS read cache, client SMP scalability.
  – Lustre 2.0 is the next 'big' release, with Kerberos security, server change logs, replication, commit on share, enhancements to the client IO subsystem.
  – Scaling to 100k clients feasible; looking to a million clients in 5 years.
  – Increasing OSS numbers.
  – Investigating increasing MDS to more than 1+1 hot standby (32+).
  – OST read and write caches, including asynchronous IO speedups.
  – Looking at Flash Cache to speed up IO – store data bursts and drain out while the compute node goes back to work.

• End-to-end data integrity with Lustre network checksums; now working on data integrity using ZFS.
• Investigating checksum offloading and checksums for ldiskfs.
• Extensive news of other future developments.
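The end-to-end network checksum idea is simple to illustrate. A minimal sketch (not Lustre's actual bulk RPC code), using CRC32 as the checksum:

```python
import zlib

def client_send(payload: bytes) -> tuple[bytes, int]:
    # Client checksums the bulk data before it leaves the node.
    return payload, zlib.crc32(payload)

def server_receive(payload: bytes, checksum: int) -> bytes:
    # Server recomputes on arrival: a mismatch means the data was corrupted
    # somewhere between the client's memory and here (NIC, wire, router),
    # even if every per-hop TCP checksum happened to pass.
    if zlib.crc32(payload) != checksum:
        raise IOError("checksum mismatch -- ask the client to resend")
    return payload

data, cksum = client_send(b"bulk write payload")
assert server_receive(data, cksum) == b"bulk write payload"
```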

Page 9: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Castor

• CASTOR status and plans – Sebastien Ponce
  • New use case: user analysis, based mainly on disk-only use.
  • Architecture changes necessary – chose xrootd as the protocol to give low-latency access, bandwidth and simultaneous connections.
  • New plugin to handle xrootd.

• SRM2 and monitoring projects in CASTOR – Giuseppe Lo Presti
  • Review of SRM v2 and announcement of a review of the database schemas.

• CASTOR Monitoring – Giuseppe Lo Presti
  • Dashboard for each instance, based on the Lemon/SLS system at CERN.

• CASTOR Operational Experiences – Miguel Coelho Dos Santos

• Increasing Tape Efficiency – Steven Murray
  • Review of post 'Curran talk' work on improvements to tape handling.

Page 10: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Operating Systems & Applications

• CVS Replacement: Getting Subversive – Alan Silverman
  • Report on the programme to replace the CVS service at CERN with Subversion.

• High Performance Cryptographic Computing – Chen-Mou Cheng (National Taiwan University)
  • Interesting talk on using the GPUs on modern Nvidia graphics cards for cryptographic work.

• Update on Mail Services at CERN – Rafal Otto (CERN)
  • Update on mail handling at CERN, including SPAM fighting in two phases:
    – Phase 1 rejects high-confidence spam with the SMTP standard error mechanism – 96% of SPAM is rejected this way.
    – Phase 2 quarantines medium-confidence spam. 50% of messages passing the phase 1 test are still spam. Users can tune the threshold (low/medium/high); messages are either delivered into a spam folder or direct to the user's mailbox.
  • Details of the upgrade to Exchange 2007 for the CERN mailers, using public betas in a virtual environment followed by a rollout as hardware is upgraded.
  • Notes on replacing the mailing list system (SIMBA) using a mix of SharePoint, Exchange group addresses and e-groups.
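The two-phase handling amounts to a simple score pipeline. A sketch with invented thresholds, not CERN's actual configuration:

```python
def filter_message(spam_score: float, user_threshold: float = 0.8) -> str:
    """Two-phase spam handling: phase 1 rejects high-confidence spam at SMTP
    time with a standard error (sender is told); phase 2 quarantines
    medium-confidence spam into the user's spam folder. The numeric cut-offs
    here are hypothetical, purely for illustration."""
    REJECT_AT = 0.95                       # hypothetical phase-1 cut-off
    if spam_score >= REJECT_AT:
        return "reject (SMTP 550)"         # sender sees a standard SMTP error
    if spam_score >= user_threshold:       # user-tunable low/medium/high
        return "quarantine (spam folder)"
    return "deliver (inbox)"

assert filter_message(0.99) == "reject (SMTP 550)"
assert filter_message(0.85) == "quarantine (spam folder)"
assert filter_message(0.10) == "deliver (inbox)"
```

Rejecting at SMTP time (rather than accepting then bouncing) avoids backscatter, since the error goes to the connected sender.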

Page 11: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Scientific Linux – Troy Dawson (FermiLab)

• SL 5.2 released in June 08 for i386 and x86_64.
  – Post-install XFS, dynamic kernel modules, kdeedu, yum utilities.
  – Pine replaced by alpine, r1000 is in the kernel, yum-installonlyn now in yum itself.
  – SL5.2 Live released for i386, x86_64.
• SL4.7 released September 08.
• SL4.7 Live released.
• Automatic download and building of RedHat errata using virtual machines for SL3,4, mock for SL5.
  – Some issues – kernel modules done by hand, 64bit library misplacement on 64bit machines.
• Fastbugs for SL4,5 pushed weekly.
• Poor choice of RPM names caused a problem with updates – fixed with a plugin to RPM and new RPMs, but still finding 'bad' RPMs.
• Working on automatic downloading of packages from RedHat.
• Need a new logo for SL6 (must be a carbon atom).
• Best guesses for next releases:
  – SL5.3 ~ March 09, SL4.8 ~ June 09, SL6 ~ Jan 2010.

Page 12: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Benchmarking I – CPU Benchmarking at GridKa: Update
Manfred Alef (GridKa)

• Reviewed the new worker nodes available at GridKa (E5345, E5430, 2356), the benchmarks used and the environment in which they are run.
  – Comparison of SPECint_base2000 for i386, x86_64/m32 and x86_64/m64 shows that 64bit on 64bit is best and that the Opterons show a greater improvement.
  – The differences with SPECint_base2006 are smaller, with no special advantage for 64bit/64bit; for E5400-series chips, 32bit modes are better because of the overhead of shifting 64bit data.
  – Comparison of the scaled benchmarks shows good scaling for Intel chips, and better results for AMD chips if one uses 64bit on 64bit.

• Discussion of the GridKa power consumption studies, starting with relevant configuration details of the nodes tested and the method of using CPUburn.
  – Comparison of the data shows that overall the more modern chips are better for total power and efficiency (more cores).
  – Noted that the San Clemente chipset (normal RAM) is much less power hungry than the chipsets supporting FB-DIMMs.
  – Also reviewed results obtained using the SPECpower benchmark for comparison, which seem to show that the power measurements with CPUburn are a better match for our purposes.

• Questioned the use of CPUburn rather than HPL, which would load RAM too.
  – Suggests that the physics code is less RAM intensive.
  – CERN use 50% CPUburn and 50% Lapack for power measurements.
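Comparisons of this kind boil down to benchmark score per watt of measured wall power. A toy calculation with invented figures (not GridKa's or CERN's measurements):

```python
# Hypothetical (benchmark score, measured AC wall power in W) per node type.
# The numbers are made up to show the shape of the comparison only.
nodes = {
    "2004-era dual-socket": (1500, 280),
    "E5400-class quad-core": (9000, 320),
}

def score_per_watt(score: float, watts: float) -> float:
    # Higher is better: more benchmark throughput per watt drawn at the plug.
    return score / watts

best = max(nodes, key=lambda n: score_per_watt(*nodes[n]))
for name, (score, watts) in nodes.items():
    print(f"{name}: {score_per_watt(score, watts):.2f} score/W")
print(f"most efficient: {best}")
```

Measuring at the AC input (as CERN does) folds PSU and chipset losses into the figure, which is what matters for data-centre capacity.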

Page 13: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Benchmarking II – Power Efficiency of Servers
Helge Meinhard

• Reviewed the need for power efficiency (limited power, cost, green-ness), and surveyed the sources of power use in a server.
• CERN approach is to treat the server as a unit and measure the AC input power.
• Outlined the method of adjudicating tenders to take power consumption into account.
• Data for generations of systems shows that efficiency is a factor of 9 better since 2004, leading to CERN retiring older systems much more aggressively.
• Quantum steps:
  – Micro-architecture – Netburst to Core;
  – multi-core chips;
  – Chipsets – 5000P (Blackford) requires FB-DIMMs, 5100 (San Clemente) requires DDR2;
  – Improvement in PSU technology;
  – (Intel) L vs E chips: can get Es that take less power than the Ls.
    • Intel says an L is guaranteed to be an L, and an E may be as good as or better than an L.
• Redundant PSUs cost a lot of power, so service nodes running idle can cost a lot of power.
• For disk servers, the same things apply.
  – Some SATA disks are more efficient than others, particularly when idle. Only a single CPU – sufficient performance. Choose the right chipset.
• Q/A: Evident that blades are better for infrastructure machines because of redundant PSUs.

Page 14: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Benchmarking III – Benchmarking Working Group

• WLCG MoUs based on SI2K
  – SPEC2000 no longer available or maintained; deprecated in favour of SPEC2006.

• Remit
  – Find a benchmark accepted by HEP and others, as many sites serve different communities.

• Work now essentially complete
  – Studied experiment apps, SPEC2006.
  – 64bit SL4 with 32bit apps – this is how the experiments currently run.
  – Profiling codes using perfmon on lxbatch and SPEC2006 codes on test units shows that the C++ elements in SPEC2006 best match the experiment codes.
  – Rigorous statistical treatment to be completed.

• All_cpp benchmark put forward as the new benchmark standard – name to be determined.
  – Advantage – must measure: vendors can't look up the answer on the SPEC site.
  – Measurement must use the standard method.
  – Run under the prevailing environment at the site.
  – Need a site licence for SPEC2006; everything else is provided.

• WG will submit a paper to CHEP09 detailing the work.
• New group from WLCG management to take the work on, to review MoUs and the process for revising commitments to the new benchmark.
  – Alef, Michelotto, Meinhard from the original group.
  – HEPiX group will continue efforts.
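SPEC-style suite scores aggregate per-benchmark runtime ratios with a geometric mean, so an "all_cpp" subset score can be sketched as follows (the ratios are invented for illustration):

```python
import math

def suite_score(ratios: list[float]) -> float:
    # SPEC-style aggregation: geometric mean of per-benchmark ratios
    # (reference runtime / measured runtime), so that no single test
    # can dominate the overall score.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios for the C++ subset of SPEC CPU2006 on one worker node.
cpp_ratios = [11.8, 9.6, 10.9, 12.4]
print(f"all_cpp score: {suite_score(cpp_ratios):.2f}")
```

The geometric mean is why vendors "can't look up the answer": a site-measured subset score under the site's prevailing environment will not match any published whole-suite figure.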

Page 15: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Virtualisation

• New possibilities and ideas around Hyper-V virtualisation and Virtual Desktop Infrastructure (VDI) – Rafal Otto (CERN)
  • Review of virtual services provision at CERN.
  • Issues include limited performance with Virtual Server 2005, difficult management, lack of support for highly available clusters, lack of a connection broker. Some new services not currently available: server consolidation and a Virtual Desktop Infrastructure.
  • Using Hyper-V as a new host – much better performance, uses hardware support for virtualisation, support for 64bit guests, multiple cores for guests, much bigger RAM for guests.
  • Notes on performance, details of the new service and facilities.
  • Review of Virtual Desktop facilities and TCO.

• Virtualization Experience with Hyper-V on Windows Server 2008 – Reinhard Baltrusch (DESY)
  • Introduction to the virtualisation setup at DESY, based on Hyper-V on Windows Server 2008.
  • Reviewed the past situation and hardware provision, and new work on their current setup.

• Bridging Grid and Virtualization – Dr. Fubo Zhang (Platform Computing)
  • An intro to Platform Computing and its activities.
  • Described their concept of how the grid and virtualisation can work together.
  • Described the LSF2 product that allows a virtual machine resource to be specified as a job resource parameter.

Page 16: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Networking – IPv4/IPv6 Transition Status and Recommendations

• Fred Baker (Cisco)
  – Chair of the IETF IPv6 Operations Working Group and a past chair of the IETF (1996-2001).
  – Overview of his position on IPv6 from his period as chair of the IETF and later as an industry commentator, with some comments on the difficulties then anticipated in bringing the Chinese university system online.
  – Reviewed the current state of the IP universe and how the various players in the game see the issues and goals differently.
  – Fred sees the goal as 'extending the lifetime of the internet with maximized application options and minimized long-term operational cost'. The IETF basic transition recommendation is "turn it on in your existing IPv4 network", and Fred outlined the possibilities of what this means. All solutions should provide both IPv4 and IPv6 transparently to all devices.
  – RFC 5211 gives an alternative deployment plan from John Curran. In 2009, prepare by bringing up v4+v6 and encourage use of IPv6. In 2010, the transition phase – raise the price of IPv4-only services to encourage transition. Turn off v4 by 2012, which seems optimistic.
  – Also reviewed alternative ways forward, including translation between the two types of addresses (there are not enough IPv4 addresses for everyone on the planet, never mind all the sensors we might want).
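"Turn it on in your existing IPv4 network" means clients written against the address-family-agnostic resolver API keep working unchanged throughout the transition. A minimal sketch:

```python
import socket

def connect_dual_stack(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Ask the resolver for ALL address families (AF_UNSPEC) and try each
    address in turn, so the same code works whether the peer is reachable
    over IPv6, IPv4, or both."""
    last_err = None
    for family, socktype, proto, _name, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock                     # first address that answers wins
        except OSError as err:
            last_err = err                  # remember the failure, try the next
    raise last_err or OSError(f"no usable address for {host}")
```

Hard-coding `AF_INET` anywhere is exactly the kind of application assumption the transition recommendations warn against.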

Page 17: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Cyber Security Update – Tony Cass (CERN)

• Tony gave a talk about the general state of cyber security, some specific issues affecting CERN and the HEP community, and general trends.
  – Debian OpenSSL vulnerability.
  – Signing by the bad guys of malicious binary RPMs using RedHat and Fedora keys.
  – Ongoing compromised-SSH-key attacks.
  – Notes on the two recent network security threats (DNS poisoning, TCP stack issues), and the problems with many SCADA systems.

• In the HEP community, there was a problem with infected USB sticks – remember floppies of old?
  – CERN will restrict access to the external DNS server and the use of TOR anonymous routing.
  – Enhanced logging and traceability from logging performed by IT/FIO.
  – Seeing lots of web application attacks attempting to take over web servers.
  – Email viruses with rapidly changing signatures.
  – Semi-targeted phishing attacks, usually messages purporting to come from the local 'IT department'.

• Tony reviewed the CMSMON web compromise, and the general consequences of the exposure of much sensitive information because CERN is an 'open lab'.
  – Plans for the future include notification of ssh logins from 'new' places.

• Issues in general cyber security included the same classes of problem cropping up in software again and again:
  – more sophisticated and malicious viruses;
  – kits for generating malicious software;
  – MBR root kits delivered by drive-by download;
  – click-jacking, massive botnets, targeted automatic attacks driven by criminal gangs.

Page 18: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Selected fragments from Site Reports

Page 19: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Site Reports I

• CERN
  – Denise Heagerty stepping down as head of security.
  – Largest risks: flaws in custom-built apps, socially engineered phishing attacks, zero-day exploits.
  – Problems with too-open web servers, test accounts with weak passwords.
  – VPN closed.
  – Skype trials for six months.
  – Designing a new data centre but no official go-ahead (power running out). Can utilise chillers to implement some water-cooled racks.
  – Procurements: problems with a major supplier going out of business – orders not delivered. 1700 systems no longer covered by warranties.

Page 20: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Site Reports II

• FermiLab
  – Pushing to ISO20000 and the ITIL v2 framework.
  – Moving email and calendaring to Exchange.

• University of Tokyo
  – Used for analysis by Atlas-Japan.
  – Torque/Maui, Quattor, Twiki, Lemon, MRTG.
  – Implementing Nagios.

• INFN (T1)
  – Many sites migrating from the national AFS cell to local AFS cells.
  – Description of TRIP systems to allow INFN-wide wireless access authentication based on RADIUS. Looked at a similar system for email access, but local resistance and a wish for local control.
  – IPv6 working group to evaluate addressing and connectivity, to get used to it for when it is necessary. Some issues with IPv6 bypassing firewalls.

Page 21: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Site Reports III

• SLAC
  – Change of name to the SLAC National Accelerator Laboratory.
  – Various staff changes:
    • Chuck Boheim now at Cornell.
    • Richard Mount now working for Atlas.
    • New Director of Computing to be appointed.

• RAL
  – Progress with the new building.
  – Tier1 migration planning.
  – Procurements.
  – Change of email addresses: [email protected]
  – Production Team (Gareth Smith, John Kelly).

Page 22: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Site Reports IV

• DESY
  – Using the systems' temperature monitoring to profile machine-room temperature.
  – Update on Lustre use.

• GSI
  – Power and cooling shortages.
    • Using water-cooled racks provided an emergency solution – also much quieter.
    • Various problems with sudden loss of cooling.
  – Continued positive experience of Lustre.

Page 23: HEPiX Fall 2008 @ ASGC Taipei Summary Report


Site Reports V

• PDSF (NERSC)
  – Report on problems with the MCP55 forcedeth driver:
    • An Nvidia driver with the same name and version number is available but contains different code – no indication in the change log that it is different. Solved the problems.
  – Moving away from 3ware-based storage to SATA/FC to separate servers from storage elements.

