WLCG perfSONAR-PS Update
Shawn McKee / University of Michigan
WLCG Network and Transfer Metrics Co-Chair
Spring 2014 HEPiX
LAPP, Annecy, France
May 21st, 2014
Overview
perfSONAR in WLCG
The WLCG perfSONAR-PS Deployment Task-force
The (new) WLCG Network and Transfer Metrics WG
Future Plans
21 May 2014, HEPiX - Annecy
Introductory Considerations
All distributed, data-intensive projects critically depend upon the network.
Network problems can be hard to diagnose and slow to fix.
Network problems are multi-domain, which complicates the process.
Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support (as well as benefit from others' work).
Performance issues involving the network are complicated by the number of components involved end-to-end. We need the ability to better isolate performance bottlenecks.
WLCG wants to make sure its scientists are able to use the network effectively and to resolve network issues quickly when and where they occur.
Vision for perfSONAR-PS in WLCG
Goals:
Find and isolate “network” problems; alert in a timely way
Characterize network use (base-lining)
Provide a source of network metrics for higher-level services
First step: get monitoring in place to create a baseline of the current situation between sites (see details later)
Next: continuing measurements to track the network, alerting on problems as they develop
Choice of a standard “tool/framework”: perfSONAR. We wanted to benefit from the R&E community consensus.
perfSONAR’s purpose is to aid in network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as “on-demand” tests.
WLCG deployment plan
WLCG chose to deploy perfSONAR-PS at all sites worldwide.
A dedicated WLCG Operations Task-force was started in Fall 2013.
Sites are organized in regions, based on geographical location and the experiments' computing models.
All sites are expected to deploy a bandwidth host and a latency host.
Regular testing is set up using a centralized (“mesh”) configuration:
Bandwidth tests: 30-second tests every 6 hours intra-region, every 12 hours for T2-T1 inter-region, weekly elsewhere
Latency tests: 10 Hz of packets to each WLCG site
Traceroute tests between all WLCG sites every hour
Ping(ER) tests between all sites every 20 minutes
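The test matrix above can be summarized as a configuration structure. Below is a minimal Python sketch; the field names are invented for clarity, since the real perfSONAR mesh configuration uses its own JSON schema:

```python
# Illustrative sketch of the WLCG mesh test parameters described above.
# Field names are chosen for readability, not taken from the actual
# perfSONAR mesh-config schema.

HOUR = 3600
DAY = 24 * HOUR

mesh_tests = {
    "bandwidth": {
        "duration_s": 30,                  # 30-second bandwidth tests
        "interval_s": {
            "intra_region": 6 * HOUR,      # every 6 hours within a region
            "t2_t1_inter_region": 12 * HOUR,
            "elsewhere": 7 * DAY,          # weekly for all other pairs
        },
    },
    "latency": {"packet_rate_hz": 10},     # 10 Hz of packets to each site
    "traceroute": {"interval_s": 1 * HOUR},
    "ping": {"interval_s": 20 * 60},       # every 20 minutes
}
```

A central (“mesh”) configuration of this kind is what lets all sites pull a consistent test schedule instead of maintaining pairwise agreements.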
Summary of perfSONAR Deployment
The WLCG Deployment Task-force completed its work on April 27, 2014.
All WLCG sites should have installed/upgraded perfSONAR following the instructions at https://twiki.cern.ch/twiki/bin/view/LCG/PerfsonarDeployment
Baseline release is 3.3.2.
The Task-force deadline for deployment was April 1, 2014.
We have 205 hosts running and in the mesh.
There are 8 sites not installed
We have 64 sites not at the current version. Versions prior to 3.3 are unable to use the mesh configuration.
Remaining WLCG Deployments
There are 8 sites not yet installed and configured:
BelGrid-UCL: asked for SLC6 installation, pointed to https://code.google.com/p/perfsonar-ps/wiki/Level1and2Install
GR-07-UOI-HEPLAB: no hardware, on hold.
GoeGrid: no reply, 4 reminders
ICM: "We do not have free resources to deploy perfSonar", ticket closed.
MPPMU: procuring hardware
RO-11-NIPNE: site under upgrade on 09/01/2014, recent progress, needs to be added to FR
T2_Estonia: under installation/configuration
TECHNION-HEP: first reply yesterday (3 reminders).
USCMS-FNAL-WC1: installed and configured (for a long time), but not publishing in OIM
Reported at the WLCG Ops Coordination meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes140403#perfSONAR_deployment_TF
It is good that we have so few sites missing; most should eventually be deployed.
Monitoring Status
MaDDash instance at http://maddash.aglt2.org/maddash-webui
It has shown we still have some issues: too much “orange”, meaning data is either not being taken (configuration or firewall issues) or access to the results is blocked.
We have OMD monitoring the perfSONAR-PS instances: https://maddash.aglt2.org/WLCGperfSONAR/omd/
These services should migrate to OSG over the next month.
This monitoring should be useful for any future work to find/fix problems.
[Dashboard snapshots from March 6 and April 16]
MaDDash LHCONE Matrices: 28 Apr 2014
OWAMP (Latency) thresholds: no packet loss vs. packet loss > 0.01
BWCTL (Bandwidth) thresholds: BW > 0.9 Gb/s, 0.5 < BW < 0.9 Gb/s, BW < 0.5 Gb/s
The main issue is too much “orange”, indicating missing measurements/data.
Sources are rows, destinations are columns. Each box is split into two regions indicating where the test is run: the top corresponds to the row site, the bottom to the column site.
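The legend thresholds amount to a simple classifier. Here is a sketch in Python; the status labels, and the treatment of the intermediate loss band, are illustrative assumptions rather than MaDDash's actual configuration:

```python
def classify_bandwidth(bw_gbps):
    """Map a measured bandwidth to a dashboard status.

    Thresholds follow the legend: > 0.9 Gb/s is healthy,
    0.5-0.9 Gb/s is marginal, < 0.5 Gb/s is a problem.
    """
    if bw_gbps > 0.9:
        return "ok"
    if bw_gbps >= 0.5:
        return "warning"
    return "critical"


def classify_loss(loss_fraction):
    """Map measured packet loss to a dashboard status.

    The legend distinguishes "no loss" from "loss > 0.01"; how the
    band in between is colored is an assumption here.
    """
    if loss_fraction == 0:
        return "ok"
    if loss_fraction > 0.01:
        return "critical"
    return "warning"  # intermediate band: treatment assumed
```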
Task-Force Lessons Learned
Installing a service at every site is one thing, but commissioning an NxN system of links squares the effort. This is why we have perfSONAR-PS installed everywhere but not all links monitored yet.
perfSONAR is a “special” service:
It tests a multi-domain network path, with a service at the source and the destination.
It requires dedicated hardware and comes in a bundle with the OS. We understand this creates complications for some fabric infrastructures. An RPM bundle was provided to help those sites, and we also encouraged sites to share configuration experience.
We had many releases of perfSONAR during the deployment process, each coming with new features or bug fixes we requested. Some sites did install perfSONAR but are at old releases with much functionality missing. The change of OS version (v3.2 -> v3.3) was a major reason for the inertia of some sites.
We still have issues with firewalls. There are 2 kinds of firewall rules to be considered:
Rules allowing the hosts to run the tests among themselves
Rules allowing the hosts to expose information to the monitoring tools
Many sites get the first one right but not the second.
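As a concrete illustration of the two kinds of rules, here is an iptables sketch. The port numbers are typical toolkit defaults from around the v3.3 era and should be verified against the documentation for your release:

```shell
# 1) Let the hosts run tests among themselves
iptables -A INPUT -p tcp --dport 861       -j ACCEPT  # OWAMP control
iptables -A INPUT -p udp --dport 8760:9960 -j ACCEPT  # OWAMP test streams (default range)
iptables -A INPUT -p tcp --dport 4823      -j ACCEPT  # BWCTL control
iptables -A INPUT -p tcp --dport 5001:5900 -j ACCEPT  # BWCTL/iperf test ports (default range)
iptables -A INPUT -p icmp                  -j ACCEPT  # ping/traceroute tests

# 2) Let the hosts expose results to the monitoring tools
iptables -A INPUT -p tcp --dport 80  -j ACCEPT  # web interface / measurement archives
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
```

A site that applies only block (1) will test fine but still show “orange” on the dashboard, because the central monitoring cannot read its results.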
Important Remaining Issues
Get sites running older versions to upgrade
Verify we consistently get the needed metrics
Involve cloud/VO leads in debugging/fixing issues
Fix Firewalls: still a problem for many sites
Test coverage and parameters:
Should we have more VO-specific meshes/tests? e.g., WLCG -> WLCG-ATLAS, WLCG-CMS?
What frequency of testing for traceroute, BW?
Better Docs: How-tos, Debugging “orange”
WLCG Operations has convened a new Working Group to address these issues: Network and Transfer Metrics.
Mandate: Network & Transfer Metrics
Ensure all relevant network and transfer metrics are identified, collected and published
Ensure sites and experiments can better understand and fix networking issues
Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows (e.g. the ANSE project; see http://www.internet2.edu/presentations/tip2013/20130116_barczyk_anse.pdf)
Working Group Objectives
● Identify and continuously make available relevant transfer and network metrics
● Document metrics and their use
● Facilitate their integration in the middleware and/or experiment tool chain
● Coordinate commissioning and maintenance of WLCG network monitoring
o Finalize perfSONAR deployment
o Ensure all links continue to be monitored and sites stay correctly configured
o Verify coverage and optimize test parameters
Working Group Membership
● Chairs: Shawn McKee, Marian Babik
● Members: We propose to invite the previous members of the perfSONAR-PS TF responsible for the different clouds
o See https://twiki.cern.ch/twiki/bin/view/LCG/MeshUpdates
o Fill in members for missing clouds
● Inviting members knowledgeable about FAX, AAA, FTS, PhEDEx, PanDA or Rucio
● If anyone is interested in joining the effort contact me!
Use of perfSONAR-PS Metrics
Throughput: Notice problems and debug the network; also helps differentiate server problems from path problems
Latency: Notice route changes and asymmetric routes; watch for excessive packet loss
On-demand tests and NPAD/NDT diagnostics via web
Optionally: Install additional perfSONAR nodes inside the local network and/or at the periphery
Characterize local performance and internal packet loss
Separate WAN performance from internal performance
Daily dashboard check of your own site and of peers
Debugging Network Problems
Using perfSONAR-PS, we (the VOs) identify network problems by observing degradation in regular metrics for a particular “path”:
Packet loss appearing in latency tests
Significant and persistent decrease in bandwidth
Currently this requires a “human” to trigger.
Next, check for correlation with other metric changes between the sites at either end and other sites (is the problem likely at one of the ends or in the middle?).
Correlate with paths and traceroute information: has something changed in the routing? Is there a known issue in the path?
In general it is NOT as easy to do all this as we would like with the current perfSONAR-PS toolkit.
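The “human trigger” step above could in principle be automated. A minimal sketch of such a trigger, comparing recent measurements against a baseline window (the window sizes and drop factor are invented for illustration, not WLCG alerting policy):

```python
def degraded(history, recent_n=3, drop_factor=0.5):
    """Flag a path whose recent bandwidth has dropped persistently.

    `history` is a chronological list of bandwidth measurements (Gb/s).
    The last `recent_n` samples are compared against the mean of the
    earlier baseline; a persistent drop below `drop_factor` of that
    baseline triggers an alert.  All thresholds are illustrative.
    """
    if len(history) <= recent_n:
        return False  # not enough data to form a baseline
    baseline = sum(history[:-recent_n]) / (len(history) - recent_n)
    recent = history[-recent_n:]
    # Require *every* recent sample to be low, so a single noisy
    # measurement does not raise an alert.
    return all(x < drop_factor * baseline for x in recent)
```

The same pattern applies to packet loss (alert when recent loss is persistently above a threshold); correlating alerts across paths sharing an endpoint would then point to the likely location of the problem.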
Improving perfSONAR-PS Deployments
Based upon the issues we have encountered, we set up a wiki to gather best practices and solutions to the issues we have identified: http://www.usatlas.bnl.gov/twiki/bin/view/Projects/LHCperfSONAR
This page is shared with the perfSONAR-PS developers and we expect many of the “fixes” will be incorporated into future releases (most are in v3.3.2 already)
Improving resiliency (set-it-and-forget-it) is a high priority. Instances should self-maintain, and the infrastructure should be able to alert when services fail (OIM/GOCDB tests).
Disentangling problems with the measurement infrastructure versus problems in the network indicated by those measurements…
Future Use of Network Metrics
Once we have a source of network metrics being acquired we need to understand how best to incorporate those metrics into our facility operations.
Some possibilities:
Characterizing paths with “costs” to better optimize decisions in workflow and data management (underway in ANSE)
Noting when paths change and providing appropriate notification
Optimizing data access or data distribution based upon a better understanding of the network between sites
Identifying structural bottlenecks in need of remediation
Aiding network problem diagnosis and speeding repairs
In general, incorporating knowledge of the network into our processes
We will require testing and iteration to better understand when and where the network metrics are useful.
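As one toy illustration of characterizing paths with “costs”: a workflow system could rank candidate source sites by a scalar derived from the measured metrics. The functional form below is purely an assumption for illustration, not a formula from ANSE or WLCG:

```python
def path_cost(rtt_ms, bw_gbps, loss_fraction):
    """Toy path cost: prefer low latency, high bandwidth, low loss.

    Higher cost = less attractive path.  The formula is an
    illustrative assumption, not taken from any WLCG/ANSE tool.
    """
    effective_bw = bw_gbps * (1.0 - loss_fraction)
    if effective_bw <= 0:
        return float("inf")  # unusable path (total loss or no bandwidth)
    return rtt_ms / effective_bw
```

Testing and iteration would show whether such a scalar actually correlates with transfer performance; that is exactly the kind of question the working group needs to answer.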
OSG & Networking Service
OSG is building a centralized service for gathering, viewing and providing network information to users and applications.
Goal: OSG becomes the “source” for networking information for its constituents, aiding in finding/fixing problems and enabling applications and users to better take advantage of their networks
The plan is to migrate MaDDash and OMD to OSG in the next month.
The critical missing component is the datastore to organize and store the network metrics and associated metadata.
OSG (via MaDDash) is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR-PS instances.
This data must be available via an API, must be visualized, and must be organized to provide the “OSG Networking Service”.
Closing Remarks
Over the last few years WLCG sites have converged on perfSONAR-PS as their way to measure and monitor their networks for data-intensive science.
It was not easy to get global consensus, but we have it now, after pushing since 2008.
The assumption is that perfSONAR (and the perfSONAR-PS toolkit) is the de-facto standard way to do this and will be supported long-term.
It is especially critical that R&E networks agree on its use and continue to improve and develop the reference implementation.
Dashboard critical for “visibility” into networks. We can’t manage/fix/respond-to problems if we can’t “see” them.
Having perfSONAR-PS fully deployed should give us some interesting options for better management and use of our networks
Discussion/Questions
Questions or Comments?