EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Network Performance Monitoring in EGEE
Jeremy Nowell, EPCC
5th TERENA NRENs and Grids Workshop, Paris
11-12 June 2007
www.egee-npm.org
NPM in EGEE - Jeremy Nowell, 5th NRENs and Grids Workshop 2
Overview
• EGEE Overview
• Motivation and Requirements for NPM in EGEE
• Strategy
• Architecture
• Tools and data available
• Diagnostic Tool walkthrough
• Issues and Observations
• Conclusions
EGEE Overview
• EGEE
  – 1 April 2004 to 31 March 2006
  – 71 partners in 27 countries, federated in regional Grids
• EGEE-II
  – 1 April 2006 to 31 March 2008
  – > 90 partners in 32 countries
• Objectives
  – Large-scale, production-quality infrastructure for e-Science
  – Improving and maintaining “gLite” Grid middleware
  – Attracting new resources and users from industry as well as science
Why NPM for Grids?
• For Grid operations
  – Help diagnose performance problems between sites
      This transfer is slow, what’s broken? The network, the server, the middleware…
      I can’t see site X, has the network gone down or just the cluster head-node?
      My application’s performance varies with time of day: is there a network bottleneck?
• For Grid middleware
  – I want to increase the performance of file transfers between sites
  – I want to know which compute site is “closest” to my data to submit a job to it
• What’s different about NPM for the Grid?
  – Large amounts of application data, often continuous
  – Multiple streams
  – End-to-end performance crucial
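The middleware question about which compute site is “closest” to the data can be sketched as a ranking over per-path measurements. This is only an illustration in Python: the ranking rule (highest TCP achievable bandwidth, ties broken by lowest round-trip time), the `PathMeasurement` type, the site names and all numbers are assumptions, not anything defined by EGEE middleware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathMeasurement:
    """Hypothetical summary for the path from the data to one compute site."""
    compute_site: str
    rtt_ms: float            # round-trip time in milliseconds
    achievable_mbps: float   # TCP achievable bandwidth in Mbit/s


def closest_site(measurements: list[PathMeasurement]) -> Optional[str]:
    """Return the site with the highest achievable bandwidth.

    Ties are broken by the lower round-trip time. Returns None when
    no measurements are available.
    """
    if not measurements:
        return None
    best = max(measurements, key=lambda m: (m.achievable_mbps, -m.rtt_ms))
    return best.compute_site


# Made-up candidate sites and values, for illustration only.
candidates = [
    PathMeasurement("site-a", rtt_ms=12.0, achievable_mbps=400.0),
    PathMeasurement("site-b", rtt_ms=35.0, achievable_mbps=850.0),
    PathMeasurement("site-c", rtt_ms=20.0, achievable_mbps=850.0),
]
print(closest_site(candidates))  # site-c
```

A real broker would probably weigh other factors too (queue length, data size); the point here is only that “closest” has to be defined in terms of measured end-to-end metrics.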
NPM User Requirements
Middleware
• Programmatic interface
  – Web service
  – Database
• Info for 100 paths returned in 0.2 s
• Relate Compute/Storage Element with NMP
• Raw, historical data for 24 hrs
• Mainly end-to-end data

Operation Centres
• NOCs and GOCs
  – Web-based GUI
  – Interface to define alarms
  – On-demand & historical data
  – Backbone & end-to-end data
• NOCs
  – Display which tool gathered the results and how
  – Per hop data/ability to zoom in
• GOCs
  – High-level statistics
NPM Metric Requirements
Metric / Info                    Relevant to group (Middleware, NOC, GOC)
TCP Achievable Bandwidth         Yes Yes
Packet-loss                      Yes Yes Yes
Round-trip time                  Yes Yes Yes
Round-trip IPDV                  Yes Yes
One-way delay                    Yes Yes
One-way delay variation          Yes
Available bandwidth (path)       Yes Yes Yes
Available bandwidth (hop)        Yes
Packet reordering                Yes Yes
Hop/list network topology        Yes Yes
Availability                     Yes Yes
Path MTU                         Yes Yes
QoS Class                        Yes Yes
Service Level Agreement          Yes Yes
On-demand test on all metrics    Yes Yes
NPM General Requirements
• Scale and heterogeneity of the EGEE fabric pose a requirement to support diversity of all kinds
  – Multitude of ways of collecting monitoring data
      Different measurement types
      • End-to-end
          o Appropriate to experience of user and application, e.g. TCP achievable bandwidth
      • Backbone
          o Lower level measurements, used to pin-point source of problems
      Different measurement tools
      Different data formats
  – Many administrative domains
  – Different user groups
Strategy
• Aim to standardise access to NPM data across different domains and frameworks
  – Note: we are not building measurement tools, but rather facilitating access to data collected by them
• Interoperability pursued through use of OGF NM-WG
  – EGEE should not and cannot aim to enforce the uptake of a specific NPM framework across the diverse EGEE fabric or the associated networks
  – Use NM-WG interfaces where they have been adopted; facilitate their use elsewhere.
[Figure: End Users of Network Data (Resource-brokering Middleware, NOC/GOC User); NPM Clients and Services; Monitoring Frameworks (NREN using PerfSONAR, Backbone using PerfSONAR, End-sites using e2emonit, Home-grown Framework)]
NPM Architecture
[Figure: Some Client; monitoring frameworks, each labelled NM-WG: Backbone Perfmonit, Backbone PiPEs, Backbone PerfSONAR, End-site Home-grown, End-site e2emonit; detail: E2emonit Monitoring Framework, E2emonit Service, CapDiscovery, NM-WG v1]
• Single point of contact
• Standard interface
• Insulation from framework interface changes
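The three architectural points above (single point of contact, standard interface, insulation from framework changes) can be illustrated with a small adapter-style sketch in Python. Everything named here is hypothetical: `FrameworkAdapter`, `StaticAdapter` and `NpmFacade` are invented for illustration and do not reflect the actual NPM services or the NM-WG schema, and the data values are made up.

```python
from typing import Protocol


class FrameworkAdapter(Protocol):
    """Hypothetical common interface that wraps one monitoring framework."""

    def covers(self, source: str, destination: str) -> bool: ...
    def fetch(self, source: str, destination: str, metric: str) -> list[float]: ...


class StaticAdapter:
    """Stand-in for a real framework: serves canned values for known paths."""

    def __init__(self, data: dict[tuple[str, str, str], list[float]]) -> None:
        self._data = data

    def covers(self, source: str, destination: str) -> bool:
        return any(key[:2] == (source, destination) for key in self._data)

    def fetch(self, source: str, destination: str, metric: str) -> list[float]:
        return self._data.get((source, destination, metric), [])


class NpmFacade:
    """Single point of contact: routes a query to the first adapter covering the path."""

    def __init__(self, adapters: list[FrameworkAdapter]) -> None:
        self._adapters = adapters

    def query(self, source: str, destination: str, metric: str) -> list[float]:
        for adapter in self._adapters:
            if adapter.covers(source, destination):
                return adapter.fetch(source, destination, metric)
        raise LookupError(f"no framework has data for {source} -> {destination}")


# Made-up data: one end-site framework and one backbone framework.
end_site = StaticAdapter({("cyfronet", "cern", "achievable_bandwidth"): [310.0, 295.5]})
backbone = StaticAdapter({("geant2", "janet", "utilisation"): [0.42]})
facade = NpmFacade([end_site, backbone])
print(facade.query("geant2", "janet", "utilisation"))  # [0.42]
```

The client only ever calls `query`; if one framework changes its own interface, only its adapter needs to change.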
What’s available - Software
• Clients
  – The Diagnostic Tool (DT)
      For use by people
  – The Publisher
      For use by middleware
• Middleware
  – Mediator/Discoverer
• Monitoring Frameworks
  – e2emonit
      Formerly EDG::WP7
      Provided and maintained by NPM team
  – PerfSONAR
  – LHC-OPN
      Soon?
[Figure: NM-WG v2 Client; PerfSONAR Monitoring Framework, PerfSONAR Translation Service, CapDiscovery, NM-WG v1; E2emonit Monitoring Framework, E2emonit Service, CapDiscovery, NM-WG v1; Data from GÉANT2; Data from EGEE PPS]
What’s available - Metrics
• Data depends on which tools you use!
  – We will allow access to any relevant data, provided it is available using an OGF NM-WG compliant interface
• e2emonit
  – ping
      Connectivity
      • Round trip time, packet loss
  – iperf
      Real life application performance
      • TCP achievable bandwidth
  – udpmon
      Network health, congestion etc
      • UDP achievable bandwidth, one-way delay, UDP packet loss
• PerfSONAR
  – Developed by GÉANT, Internet2 and ESNet
  – Currently accessing utilisation data
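As an illustration of the ping-derived metrics listed above (round trip time, packet loss), the following Python sketch summarises one run of probes. It assumes the per-probe round-trip times are already available as a list, with `None` marking a probe that got no reply; the function name `summarise_ping` and the sample values are hypothetical and unrelated to how e2emonit stores its results.

```python
from statistics import mean
from typing import Optional


def summarise_ping(rtts_ms: list[Optional[float]]) -> dict[str, Optional[float]]:
    """Summarise one ping run; None marks a probe with no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    replies = [r for r in rtts_ms if r is not None]
    return {
        # Fraction of probes that were lost, between 0.0 and 1.0.
        "packet_loss": 1 - len(replies) / len(rtts_ms),
        "rtt_mean_ms": mean(replies) if replies else None,
        "rtt_min_ms": min(replies) if replies else None,
        "rtt_max_ms": max(replies) if replies else None,
    }


# Four probes, one lost (made-up values).
summary = summarise_ping([20.0, 22.0, None, 24.0])
print(summary["packet_loss"], summary["rtt_mean_ms"])  # 0.25 22.0
```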
Data Federation
• Use of NM-WG schema facilitates federation
  – e2emonit from EGEE sites
  – e2emonit from related projects: BalticGrid
  – PerfSONAR Measurement Archives
      Currently via translation layer
• Currently adopting version 2 of the NM-WG schema
  – Will allow access to more data sources
      Gridmon (UK GridPP)
      Other PerfSONAR components
      • E2E layer 2 link status (relevant for LHC-OPN)
      • Measurement Archives through native interface
      • BWCTL, OWAMP Measurement Points
      Others: RRD based, flow etc?
DT Usage (1)
• Step 1: Access the NPM Diagnostic Tool.
  – The Diagnostic Tool is accessed with a standard web browser; users are individually authorised to use the tool.
      • In the future, we plan to use VOMS for authorisation.
      • Please mail us for access!
  – The intended user is a NOC/GOC/ROC operator
DT Usage (2)
• Step 2: Select a Time.
  – The end-user does not have a specific time, but wants to see the performance for the past four weeks.
  – The user enters the appropriate time range, specifying a Start date/time of 2007-05-01 00:00:00 and a period of 4 weeks.
  – The user presses the Set button to confirm and the alternate time range representations update.
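The time range in Step 2 is a start date/time plus a period. A minimal Python sketch of that arithmetic, assuming a period of 4 weeks means exactly 28 days; the helper `time_range` is hypothetical and not part of the Diagnostic Tool:

```python
from datetime import datetime, timedelta


def time_range(start_text: str, weeks: int) -> tuple[datetime, datetime]:
    """Return (start, end) for a start date/time string and a period in weeks."""
    start = datetime.strptime(start_text, "%Y-%m-%d %H:%M:%S")
    return start, start + timedelta(weeks=weeks)


start, end = time_range("2007-05-01 00:00:00", weeks=4)
print(start, "to", end)  # 2007-05-01 00:00:00 to 2007-05-29 00:00:00
```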
DT Usage (3)
• Step 3: Select a Path.
  – The end-user wants to see the performance for the path between Cyfronet in Krakow and CERN.
  – The user selects e2emonit sites at Cyfronet and CERN, adds the path and then selects “Find Data For This Query”
DT Usage (4)
• Step 4: Select a Metric.
  – The end-user experienced throughput problems.
  – Although there are several possibly relevant metrics to choose from (and only those measured are available to select from), the user decides to look at the Achievable Bandwidth on the path.
  – Achievable Bandwidth is selected from the Metrics box and the Set button pressed to confirm.
DT Usage (5)
• Step 5: Select a Statistic.
  – Several types of statistical data are available, such as Minimum, Maximum, Mean.
  – A particular interval can be applied to each, to provide, for example, an hourly mean over the past two days.
  – The user just wants a general overview of measurements and elects to retrieve raw data (Statistic check-box not checked).
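To make the “hourly mean” statistic in Step 5 concrete, here is a minimal Python sketch that groups raw samples by clock hour and averages each group. It is only an illustration of the idea: the function name `hourly_mean`, the bucketing by clock hour and the sample values are assumptions, not the Diagnostic Tool's own implementation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_mean(samples: list[tuple[datetime, float]]) -> dict[datetime, float]:
    """Group (timestamp, value) samples by clock hour and average each group."""
    buckets: dict[datetime, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Made-up achievable bandwidth samples in Mbit/s.
raw = [
    (datetime(2007, 5, 1, 10, 5), 300.0),
    (datetime(2007, 5, 1, 10, 35), 320.0),
    (datetime(2007, 5, 1, 11, 10), 280.0),
]
for hour, value in hourly_mean(raw).items():
    print(hour, value)
```

Minimum and Maximum follow the same pattern with `min` or `max` in place of `mean`.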
DT Usage (6)
• Step 6: Select a View.
  – Currently Data Table and Time Plot views are available.
  – The user wants an overview of how the Achievable Bandwidth has changed over time, so selects the Time Plot.
  – The Query entry is complete, and the user selects Submit Query.
DT Usage (7)
• Step 7: Examine results.
  – The results are plotted, with Time on the x-axis and Achievable Bandwidth on the y-axis.
  – The parameters used to gather measurements are shown; here they show that the iperf tool was used to gather the achievable bandwidth information.
  – These parameters can be useful in interpreting the results.
DT Usage (8)
– Information from multiple paths may be plotted at the same time.
– Here utilisation data for the GÉANT2 to JANET router is plotted for both inbound and outbound traffic over the course of one week, obtained from the GÉANT2 PerfSONAR Measurement Archive.
Issues and Observations
• Providing data federation tools is usually not enough by itself
  – Sites will not necessarily have any monitoring data available, so they still need guidance to install monitoring tools
      Those that do have monitoring may not know about it
• Deployment of monitoring tools is not easy
  – There has to be a clear benefit to the site before they install tools
  – This benefit is not obvious until after an incident has occurred, by which time it is too late…
  – Firewall changes may be difficult (e.g. ICMP blocked by default)
  – Tools need to be trivial to install and robust when running
  – Need to carefully consider scheduling for end-to-end tests
• Different user groups may have widely different requirements for displaying data
  – e.g. site or service admins may just want an alarm that tells them “your network is broken”, and never look at the DT
  – Network people would not contemplate investigating problems without clear historical data to look at
• The network is still assumed by many to “just work”
Conclusions
• Providing federated access to network measurement data is an interesting technical challenge, but achievable
  – Facilitated by standards such as the OGF NM-WG schema
• Getting access to the data itself is much harder
  – Deployment challenge
  – Need to “sell” to sites the value of having data available