IHEPCCC Meeting

CERN Site Report
Based on input from many IT colleagues, with additional information in hidden slides

Wolfgang von Rüden
IT Department Head, CERN
22 September 2006
General Infrastructure and Networking
Computer Security (IT/DI)

• Incident analysis
  • 14 compromised computers on average per month in 2006
  • Mainly due to user actions on Windows PCs, e.g. trojan code installed
  • Detected by security tools monitoring connections to IRC/botnets
  • Some Linux systems were compromised by knowledgeable attacker(s)
  • Motivation appears to be money earned from controlled computers
• Security improvements in progress
  • Strengthened computer account policies and procedures
  • Ports closed in CERN main firewall (http://cern.ch/security/firewall)
  • Controls networks separated and stronger security policies applied
  • Logging and traceability extended to better identify the cause of incidents
  • Investigation of intrusion detection at 10 Gbps based on netflow data
• What is the policy of other labs concerning high-numbered ports?
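The flow-based detection mentioned above can be sketched as a scan of NetFlow-style records for internal hosts contacting external IRC ports, a common sign of botnet command-and-control traffic. The record format, the port list and the address prefix are illustrative assumptions, not CERN's actual tooling or configuration:

```python
# Hypothetical sketch: flag flows where an internal host talks to an
# external host on a classic IRC port (typical botnet C&C channel).
IRC_PORTS = {6660, 6661, 6662, 6663, 6664, 6665, 6666, 6667, 6668, 6669, 7000}

def suspicious_flows(flows, internal_prefix="137.138."):
    """Return flows where an internal source contacts an external IRC port.

    flows: iterable of (src_ip, dst_ip, dst_port, n_bytes) tuples.
    """
    hits = []
    for src, dst, dport, nbytes in flows:
        if (src.startswith(internal_prefix)
                and not dst.startswith(internal_prefix)
                and dport in IRC_PORTS):
            hits.append((src, dst, dport, nbytes))
    return hits

flows = [
    ("137.138.1.10", "198.51.100.7", 6667, 2048),  # internal -> external IRC
    ("137.138.1.11", "137.138.2.20", 80, 512),     # internal web traffic
    ("203.0.113.5", "137.138.1.10", 6667, 128),    # inbound, not flagged here
]
print(suspicious_flows(flows))
```

A production system at 10 Gbps would of course aggregate flows from routers rather than iterate over an in-memory list, but the matching logic is the same.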
Timeline for Security Incidents, May 2000 – August 2006

[Chart: number of security incidents per month, January 2000 – July 2006, on a scale of 0 to 250. Annotated events: Code Red worm (web servers), Suckit rootkits (Linux), Blaster worm variants (Windows), IRC-based hacker networks (all platforms). Annotations note that systems exposed in the firewall caused most incidents in one period, that non-centrally managed laptops and downloaded code caused most incidents in another, and mark a change in trend for compromised machines.]
Computing and Network Infrastructure for Controls: CNIC (IT/CO & IT/CS)

• Problem:
  • Control systems are now based on TCP/IP and commercial PCs and devices
  • PLCs and other controls equipment cannot currently be secured
• Consequences: control systems are vulnerable to viruses and hacking attacks
• Risks: down-time or physical damage of accelerators and experiments
• Constraints:
  • Access to control systems by off-site experts is essential
  • Production systems can only be patched during maintenance periods
• Actions taken: set up the CNIC Working Group to
  • Establish multiple separate Campus and Controls network domains
  • Define rules and mechanisms for inter-domain and off-site communications
  • Define policies for access to and use of Controls networks
  • Designate persons responsible for controls networks and connected equipment
  • Define and build suitable management tools for Windows, Linux and networks
  • Test security of COTS devices and request corrections from suppliers
  • Collaborate with organizations and users working on better controls security
• Ref: A 'defence-in-depth' strategy to protect CERN's control systems, http://cnlart.web.cern.ch/cnlart/2006/001/15
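The inter-domain rule idea above can be illustrated as a default-deny policy: traffic between the Campus and Controls domains is refused unless the (source domain, destination domain, service) triplet has been explicitly registered. The rule table and service names below are invented for illustration, not the actual CNIC rule set:

```python
# Hypothetical default-deny inter-domain policy table.
ALLOWED = {
    ("campus", "controls", "scada-gateway"),  # operator access via a gateway
    ("controls", "campus", "logging"),        # controls logs exported outward
}

def is_allowed(src_domain, dst_domain, service):
    """Default-deny check: only registered inter-domain triplets pass."""
    if src_domain == dst_domain:
        return True  # intra-domain traffic is unrestricted in this sketch
    return (src_domain, dst_domain, service) in ALLOWED

print(is_allowed("campus", "controls", "scada-gateway"))  # True
print(is_allowed("campus", "controls", "ssh"))            # False
```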
Networking Status (IT/CS)

• Internal CERN network infrastructure progressing on time
  • New campus backbone upgraded
  • Farm router infrastructure in place
  • New infrastructure for external connectivity in place
  • Upgrade of the CERN internal infrastructure (starpoints) in progress to provide better desktop connectivity
  • Management tools to improve security control have been developed and put into production (control of connections between the Campus and the Technical networks); this is part of the CNIC project
  • New firewall infrastructure being developed to improve aggregate bandwidth and integrate into a common management scheme
  • Large parts of the experimental areas and pits now cabled
  • The DANTE PoP for GÉANT2 was installed at CERN towards the end of 2005
• Current work items
  • Improved wireless network capabilities being studied for the CERN site
LHCOPN Status

• LHCOPN links coming online
  • Final circuits to 8 Tier-1s
  • Remaining 3 due before the end of the year
• LHCOPN management
  • Operations and monitoring responsibilities being shared between EGEE (Layer 3) and DANTE (Layers 1/2)
  • Transatlantic link contracts passed to USLHCNet (Caltech) to aid DoE transparency
  • 3 links to be commissioned this year: Geneva-Chicago, Geneva-New York and Amsterdam-New York
• Current work items
  • Improve management of multiple services across transatlantic links using VCAT/LCAS technology; being studied by the USLHCNet group
  • Investigate the use of cross-border fibre for path redundancy in the OPN
LHCOPN Layer-2 Circuits

[Map of LHCOPN circuits, with links classified as 3x10G, 2x10G, 1x10G or <10G, and bandwidth-managed and cross-border fibre links marked.]
Scientific Linux @ CERN (IT/FIO)

• See https://www.scientificlinux.org/distributions/roadmap
  • Many thanks to FNAL for all their hard work
  • Support for SL3 ends 31 October 2007
• SLC4
  • CERN-specific version of SL4
    • Binary compatible for end users
    • Adds AFS, tape hardware support, etc. required at CERN
  • Certified for general use at CERN at the end of March
  • Interactive and batch services available since June
  • New CPU servers commissioned in October (1 MSI2K) will all be installed with SLC4
  • Switch of the default from SLC3 to SLC4 foreseen (hoped!) for end October/November
    • Depends on availability of EGEE middleware
    • Will almost certainly be 32-bit: too much software is not yet 64-bit compatible
• SLC5 is low priority
  • Could arrive Q1 2007 at the earliest, and there is no desire to switch OS just before LHC startup
  • But planning for 2008 needs to start soon
Internet Services (IT/IS)

• Possible subjects for HEP-wide coordination
  • E-mail: coordination on anti-spam, attachments, digital signatures, secure e-mail and common policies for visitors
  • Single sign-on and integration with Grid certificates
  • Managing vulnerabilities in desktop operating systems and applications; policies concerning "root" and "Administrator" rights on desktop computers; antivirus and anti-spyware policies
  • Common policies for web hosting; the role of CERN as a "catch-all" web hosting service for small HEP labs, conferences and activities distributed across multiple organizations
  • Desktop instant messaging and IP telephony? Protocols, integration with e-mail, presence information?
Conference and AV Support (IT/UDS)

• Video conferencing services
  • HERMES H.323 MCU: joint project with IN2P3 (host), CNRS and INSERM
  • VRVS preparing EVO rollout
  • Seamless audio/video conference integration through SIP (beta test)
• SMAC: conference recording (Smart Multimedia Archive for Conferences)
  • Joint project with EIF (engineering school) and the University of Fribourg
  • Pilot in the main auditorium
• Video conference room refurbishment
  • Pilot rooms in B.40: standard (CMS), fully-featured (ATLAS)
  • 12 more requested before LHC turn-on
• Multimedia Archive Project
  • Digitisation: photo / audio / video
  • CDS storage and publication, e.g. http://cdsweb.cern.ch/?c=Audio+Archives
Indico & Invenio Directions (IT/UDS)

• Indico as the "single interface"
  • Agenda migration virtually complete
  • VRVS booking done
  • HERMES and eDial booking soon
  • CRBS: physical room booking under study
  • Invenio for Indico search
• CDS powered by Invenio
  • Released in collaboration with EPFL
  • Finishing a major code refresh into Python
  • Flexible output formatting: XML, BibTeX
  • RSS feeds; Google Scholar interfacing
  • Available in 18 languages (contributions from around the globe)
• Collaborative tools
  • Baskets, reviewing, commenting
• Document "add-ons"
  • Citation extraction and linking (SLAC planning to collaborate)
  • Key-wording (ontology with DESY)
Open Access

• Preprints: already wholly OA
  • Operational Circular 6 (rev. 2001) requires every CERN author to submit a copy of their scientific documents to the CERN Document Server (CDS)
  • Institutional archive and HEP subject archive
• Publications
  • Tripartite colloquium, December 2005: "OA Publishing in Particle Physics"
    • Authors, publishers, funding agencies
  • Task force (report June 2006)
    • ... to study and develop sustainable business models for particle physics
    • Conclusion: a significant fraction of particle physics journals are ready for a rapid transition to OA under a consortium-funded sponsoring model
Oracle-related issues (IT/DES)

• Serious bug causing logical data corruption (wrong cursor sharing, a side effect of a new algorithm enabled by default in RDBMS 10.2.0.2)
  • LFC and VOMS affected
  • Problem reported 11 August
  • Workaround in place 21 August (with a small negative side-effect)
  • First pre-patch released 29 August
  • Second pre-patch released 14 September
  • Production patch expected any day now
• Support request escalated to the highest level
  • "In one of the most complex parts of the product"
  • Regular phone conferences with the Critical Account Manager
• Lessons learned
  • We feel we got good attention, but resolution still took time
  • It is not always good to be on the latest release!
CERN openlab

• Concept
  • Partners/contributors sponsor the latest hardware, software and brainware (young researchers)
  • CERN provides experts plus testing and validation in a Grid environment
  • Partners: 500,000 €/year for 3 years
  • Contributors: 150,000 € for 1 year
• Current activities
  • Platform competence centre
  • Grid interoperability centre
  • Security activities
  • Joint events
WLCG Update
les.robertson@cern.ch

WLCG depends on two major science grid infrastructures:
• EGEE - Enabling Grids for E-sciencE
• OSG - US Open Science Grid
Grid progress this year

• Baseline services from the TDR are in operation
• Agreement (after much discussion) on VO Boxes
• gLite 3
  • Basis for startup on the EGEE grid
  • Introduced (just) on time for SC4
  • New Workload Management System now entering production
• Metrics
  • Accounting introduced for Tier-1s and CERN (CPU and storage)
  • Site availability measurement system introduced; reporting for Tier-1s and CERN from May
  • Job failure analysis
• Grid operations
  • All major LCG sites active
  • Daily monitoring and operations now mature for both EGEE and OSG; taken in turn by 5 sites for EGEE
  • Evolution of the EGEE regional operations support structure
Pre-SC4 April tests

• CERN to Tier-1s: the SC4 target of 1.6 GB/s was reached, but only for one day
• But experiment-driven transfers (ATLAS and CMS) sustained 50% of the target under much more realistic conditions
• CMS transferred a steady 1 PB/month between Tier-1s and Tier-2s during a 90-day period
• ATLAS distributed 1.25 PB from CERN during a 6-week period
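As a back-of-envelope check, the sustained rates quoted above can be converted to MB/s, assuming decimal petabytes (1 PB = 10^15 bytes) and 30-day and 42-day averaging windows:

```python
# Convert a sustained transfer volume over a period to an average rate.
def pb_per_period_to_mb_s(petabytes, days):
    """Average rate in MB/s for `petabytes` moved over `days` days."""
    return petabytes * 1e15 / (days * 86400) / 1e6

cms = pb_per_period_to_mb_s(1.0, 30)     # CMS: 1 PB/month, Tier-1s <-> Tier-2s
atlas = pb_per_period_to_mb_s(1.25, 42)  # ATLAS: 1.25 PB in 6 weeks from CERN
print(round(cms), round(atlas))          # roughly 386 and 344 MB/s
```

Both are indeed in the region of 0.35-0.4 GB/s, consistent with "50% of the target" for the 0.8 GB/s realistic-conditions level shown in the chart.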
[Chart: data distribution rates, showing the 1.6 GB/s target and the 0.8 GB/s level.]
Availability of WLCG Tier-1 Sites + CERN, August 2006

[Per-site charts of daily test-pass fractions from SAM monitoring for CERN-PROD, FZK-LCG2, INFN-T1, RAL-LCG2, TRIUMF-LCG2, Taiwan-LCG2, USCMS-FNAL-WC1, IN2P3-CC, SARA-MATRIX, NDGF, PIC and BNL. Each panel shows measured availability and reliability (ranging from 4% to 97%, with two sites n/a) against the 88% SC4 target, with scheduled downtime marked. Site availability and reliability are as agreed in the WLCG MB on 11 July 2006 (scheduled interruptions are excluded when calculating reliability). Colour coding: < 90% of target, ≥ 90% of target, ≥ target.]

Notes:
• SAM tests at one site failed due to a dCache function failure that does not affect CMS jobs; the problem is understood and is being worked on.
• All sites were assumed up while SAM itself had problems on 1, 3 and 4 August.
• Two sites were not yet integrated into the Site Availability Monitoring (SAM) system and are not included in the overall average.
August 2006 summary
• Two sites not yet integrated into the measurement framework
• SC4 target: 88% availability
• 10-site average: 74%
• Best 8 sites average: 85%
• Reliability (which excludes scheduled downtime) is ~1% higher
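The availability/reliability distinction used here can be sketched as follows: both are the fraction of time a site passed the SAM tests, but reliability excludes scheduled interruptions from the denominator, so it is never lower than availability. The hour-level states are an illustrative simplification:

```python
# Toy availability vs. reliability calculation over a month of states.
def avail_and_reliability(states):
    """states: list of 'up', 'down' or 'sched' (scheduled interruption)."""
    total = len(states)
    up = states.count("up")
    sched = states.count("sched")
    availability = up / total
    reliability = up / (total - sched) if total > sched else 0.0
    return availability, reliability

# 744 hours in August: 600 up, 120 unscheduled down, 24 scheduled.
states = ["up"] * 600 + ["down"] * 120 + ["sched"] * 24
a, r = avail_and_reliability(states)
print(f"availability {a:.0%}, reliability {r:.0%}")  # availability 81%, reliability 83%
```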
Job Reliability Monitoring

• Ongoing work
  • A system to process and analyse job logs has been implemented for some of the major activities in ATLAS and CMS
  • Errors identified; frequencies reported to developers and the TCG
  • Expect to see results feeding through from development to products in a fairly short time
  • More impact expected when the new RB enters full production (the old RB is frozen)
• A daily report on the most important site problems allows the operations team to drill down from site, to computing elements, to worker nodes
  • In use by the end of August
• Intention is to report long-term trends by site and VO
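The drill-down described above amounts to aggregating job outcomes at three levels of granularity: site, computing element, and worker node. A minimal sketch, with record fields and names invented for illustration:

```python
from collections import Counter

def failure_report(jobs):
    """Count failed jobs by site, by (site, CE), and by (site, CE, node).

    jobs: iterable of (site, ce, worker_node, status) tuples.
    """
    by_site, by_ce, by_node = Counter(), Counter(), Counter()
    for site, ce, node, status in jobs:
        if status != "ok":
            by_site[site] += 1
            by_ce[(site, ce)] += 1
            by_node[(site, ce, node)] += 1
    return by_site, by_ce, by_node

jobs = [
    ("CERN-PROD", "ce101", "wn042", "ok"),
    ("CERN-PROD", "ce101", "wn042", "failed"),
    ("CERN-PROD", "ce102", "wn007", "aborted"),
    ("FZK-LCG2",  "ce201", "wn013", "ok"),
]
site, ce, node = failure_report(jobs)
print(site.most_common(1))  # [('CERN-PROD', 2)]
```

Starting from the site-level counter, operators can then narrow a problem down to the specific computing element and worker node responsible.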
[Chart: FNAL trend, 23 May – 20 September.]
Commissioning Schedule, 2006-2008

• SC4 becomes the initial service when reliability and performance goals are met
• Continued testing of computing models and basic services
• Testing DAQ to Tier-0 (??) and integrating into the DAQ to Tier-0 to Tier-1 data flow
• Building up end-user analysis support
• Exercising the computing systems: ramping up job rates, data management performance, ...
• Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation
• Introduce residual services: full FTS services, 3D, SRM v2.2, VOMS roles
• 1 July 2007: service commissioned with full 2007 capacity and performance
• First physics
Challenges and Concerns

• Site reliability
  • Achieve MoU targets, with a more comprehensive set of tests
  • Tier-0, Tier-1 and (major) Tier-2 sites
  • Concerns over staffing levels at some sites
  • 24x7 operation needs to be planned and tested; this will be problematic at some sites, including CERN, during the first year when unexpected problems have to be resolved
• Tier-1s and Tier-2s learning exactly how they will be used
  • Mumbai workshop, Tier-2 workshops
  • Experiment computing model tests
  • Storage, data distribution
• Tier-1/Tier-2 interaction
  • Test out data transfer services and network capability
  • Build operational relationships
• Mass storage
  • Complex systems, difficult to configure
  • Castor 2 not yet fully mature
  • SRM v2.2 to be deployed, with storage classes and policies implemented by sites
• 3D Oracle, Phase 2: sites not yet active/staffed
Challenges and Concerns, continued

• Experiment service operation
  • Manpower-intensive
  • Interaction with Tier-1s and large Tier-2s
  • Need a sustained test load to verify site and experiment readiness
• Analysis on the Grid is very challenging
  • Overall growth in usage is very promising
  • CMS has the lead, with over 13k jobs/day submitted by ~100 users using ~75 sites (July 2006)
  • Analysis jobs will continue to have an impact on, and uncover weaknesses in, services at all levels
  • Understanding the CERN Analysis Facility
• DAQ testing looks late
  • The Tier-0 needs time to react to any unexpected requirements and problems
Tier0 Update
CERN Fabric progress this year

• Tier-0 testing has progressed well
  • Artificial system tests, and ATLAS Tier-0 testing at full throughput
  • Comfortable that target data rates and throughput can be met, including with CASTOR 2
  • But DAQ systems are not yet integrated in these tests
• CERN Analysis Facility (CAF)
  • Testing of experiment approaches to the CAF started only in the past few months
  • Includes PROOF evaluation by ALICE
  • Much has still to be understood
  • Essential to maintain Tier-0/CAF hardware flexibility during the early years
• CASTOR 2
  • Performance is largely understood
  • Stability, and the ability to maintain a 24x365 service, is now the main issue
CERN Tier0 Summary (IT/FIO)

• Infrastructure
  • A difficult year for cooling, but the (long-delayed) upgrade to the air conditioning system is now complete
  • The upgrade to the electrical infrastructure should be complete in early 2007 with the installation of an additional 2.4 MW of UPS capacity
    • No spare UPS capacity for physics services until then; the additional UPS systems are required before we install the hardware foreseen for 2007
  • Looking now at a possible future computer centre, as the rise in power demand for computing systems seems inexorable; demand is likely to exceed the current 2.5 MW limit by 2009/10
    • Water-cooled racks as installed at the experiments seem to be more cost-effective than air cooling
• Procurement
  • We have evaluated tape robots from IBM and STK, and also their high-end tape drives, over the past 9 months
    • Re-use of media means high-end drives are more cost-effective over a 5-year period
    • Good performance seen from equipment from both vendors
  • CPU and disk server procurement continues with regular calls for tender
    • The long lead time between the start of the process and equipment delivery remains, but the process is well established
Tier0 Summary, continued

• Readiness for LHC production
  • Castor2 now seems on track
    • All LHC experiments fully migrated to Castor2
    • Meeting testing milestones
    • Still some development required, but effort is now focussed on the known problem areas as opposed to firefighting
  • Grid services now integrated with other production services
    • The service dashboard at https://cern.ch/twiki/bin/view/LCG/WlcgScDash shows the readiness of services for production operation; significant improvement in readiness over the past 9 months
    • Now a single daily meeting for all T0/T1 services
  • Still concerns over the possible requirement for 24x7 support by engineers
    • Many problems still cannot be debugged by on-call technicians
    • For data distribution, full problem resolution is likely to require contact with the remote site
Grid Service dashboard
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
The EGEE project

• Phase 1
  • 1 April 2004 – 31 March 2006
  • 71 partners in 27 countries (~32M € funding from the EU)
• Phase 2
  • 1 April 2006 – 31 March 2008
  • 91 partners in 32 countries (~37M € EU funding)
• Status
  • Large-scale, production-quality grid infrastructure in use by HEP and other sciences (~190 sites, 30,000 jobs/day)
  • gLite 3.0 Grid middleware deployed
• EGEE provides essential support to the LCG project
EU projects related to EGEE
Sustainability: Beyond EGEE-II

• Need to prepare for a permanent Grid infrastructure
  • Production usage of grid infrastructure requires long-term planning
  • Ensure reliable and adaptive support for all sciences
  • Independent of short project cycles
  • Modelled on the success of GÉANT
  • Infrastructure managed in collaboration with national grid initiatives
EGEE'06 Conference

• EGEE'06 – Capitalising on e-infrastructures
  • Keynotes on state-of-the-art and real-world use
  • Dedicated business track
  • Demos and business/industry exhibition
  • Involvement of the international community
• 25-29 September 2006, Geneva, Switzerland, organised by CERN
• http://www.eu-egee.org/egee06