WLCG perfSONAR-PS Update
Shawn McKee / University of Michigan
WLCG Network and Transfer Metrics Co-Chair
Spring 2014 HEPiX
LAPP, Annecy, France
May 21st, 2014
Overview
perfSONAR in WLCG
The WLCG perfSONAR-PS Deployment Task-force
The (new) WLCG Network and Transfer Metrics WG
Future Plans
21 May 2014, HEPiX - Annecy
Introductory Considerations
All distributed, data-intensive projects critically depend upon the network.
Network problems can be hard to diagnose and slow to fix.
Network problems are multi-domain, which complicates the process.
Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support (as well as benefit from others' work).
Performance issues involving the network are complicated by the number of components involved end-to-end. We need the ability to better isolate performance bottlenecks.
WLCG wants to make sure its scientists are able to use the network effectively and to resolve network issues quickly when and where they occur.
Vision for perfSONAR-PS in WLCG
Goals:
Find and isolate “network” problems; alert in a timely way
Characterize network use (base-lining)
Provide a source of network metrics for higher-level services
First step: get monitoring in place to create a baseline of the current situation between sites (see details later)
Next: continuing measurements to track the network, alerting on problems as they develop
Choice of a standard “tool/framework”: perfSONAR. We wanted to benefit from the R&E community consensus.
perfSONAR’s purpose is to aid in network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as “on-demand” tests.
WLCG deployment plan
WLCG chose to deploy perfSONAR-PS at all sites worldwide.
A dedicated WLCG Operations Task-force was started in Fall 2013.
Sites are organized in regions, based on geographical location and the experiments' computing models.
All sites are expected to deploy a bandwidth host and a latency host.
Regular testing is set up using a centralized (“mesh”) configuration:
Bandwidth tests: 30-second tests every 6 hours intra-region, every 12 hours for T2-T1 inter-region, weekly elsewhere
Latency tests: 10 Hz of packets to each WLCG site
Traceroute tests between all WLCG sites every hour
Ping(ER) tests between all sites every 20 minutes
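The test matrix above can be summarized as a configuration structure. Below is a minimal Python sketch; the field names are invented for clarity, since the real perfSONAR mesh configuration uses its own JSON schema:

```python
# Illustrative sketch of the WLCG mesh test parameters described above.
# Field names are chosen for readability, not taken from the actual
# perfSONAR mesh-config schema.

HOUR = 3600
DAY = 24 * HOUR

mesh_tests = {
    "bandwidth": {
        "duration_s": 30,                  # 30-second bandwidth tests
        "interval_s": {
            "intra_region": 6 * HOUR,      # every 6 hours within a region
            "t2_t1_inter_region": 12 * HOUR,
            "elsewhere": 7 * DAY,          # weekly for all other pairs
        },
    },
    "latency": {"packet_rate_hz": 10},     # 10 Hz of packets to each site
    "traceroute": {"interval_s": 1 * HOUR},
    "ping": {"interval_s": 20 * 60},       # every 20 minutes
}
```

A central (“mesh”) configuration of this kind is what lets all sites pull a consistent test schedule instead of maintaining pairwise agreements.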
Summary of perfSONAR Deployment
The WLCG Deployment Task-force completed its work on April 27, 2014.
All WLCG sites should have installed/upgraded perfSONAR following the instructions at https://twiki.cern.ch/twiki/bin/view/LCG/PerfsonarDeployment
Baseline release is 3.3.2.
The Task-force deadline for deployment was April 1, 2014.
We have 205 hosts running and in the mesh.
There are 8 sites not installed
We have 64 sites not at the current version. Versions prior to 3.3 are unable to use the mesh configuration.
Remaining WLCG Deployments
There are 8 sites not yet installed and configured:
BelGrid-UCL: asked for SLC6 installation, pointed to https://code.google.com/p/perfsonar-ps/wiki/Level1and2Install
GR-07-UOI-HEPLAB: no hardware, on hold.
GoeGrid: no reply, 4 reminders
ICM: "We do not have free resources to deploy perfSonar", ticket closed.
MPPMU: procuring hardware
RO-11-NIPNE: site under upgrade on 09/01/2014, recent progress, needs to be added to FR
T2_Estonia: under installation/configuration
TECHNION-HEP: first reply yesterday (3 reminders).
USCMS-FNAL-WC1: installed and configured (for a long time), but not publishing in OIM
Reported at the WLCG Ops Coordination meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes140403#perfSONAR_deployment_TF
It is good that we have so few sites missing; most should eventually be deployed.
Monitoring Status
MaDDash instance at http://maddash.aglt2.org/maddash-webui
It has shown we still have some issues: too much “orange”, meaning data is either not being taken (configuration or firewall issues) or access to the results is blocked.
We have OMD monitoring the perfSONAR-PS instances: https://maddash.aglt2.org/WLCGperfSONAR/omd/
These services should migrate to OSG over the next month.
This monitoring should be useful for any future work to find/fix problems.
[Dashboard snapshots from March 6 and April 16]
MaDDash LHCONE Matrices: 28 Apr 2014
OWAMP (Latency) thresholds: no packet loss vs. packet loss > 0.01
BWCTL (Bandwidth) thresholds: BW > 0.9 Gb/s, 0.5 < BW < 0.9 Gb/s, BW < 0.5 Gb/s
The main issue is too much “orange”, indicating missing measurements/data.
Sources are rows, destinations are columns. Each box is split into two regions indicating where the test is run: the top corresponds to the row site, the bottom to the column site.
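The legend thresholds amount to a simple classifier. Here is a sketch in Python; the status labels, and the treatment of the intermediate loss band, are illustrative assumptions rather than MaDDash's actual configuration:

```python
def classify_bandwidth(bw_gbps):
    """Map a measured bandwidth to a dashboard status.

    Thresholds follow the legend: > 0.9 Gb/s is healthy,
    0.5-0.9 Gb/s is marginal, < 0.5 Gb/s is a problem.
    """
    if bw_gbps > 0.9:
        return "ok"
    if bw_gbps >= 0.5:
        return "warning"
    return "critical"


def classify_loss(loss_fraction):
    """Map measured packet loss to a dashboard status.

    The legend distinguishes "no loss" from "loss > 0.01"; how the
    band in between is colored is an assumption here.
    """
    if loss_fraction == 0:
        return "ok"
    if loss_fraction > 0.01:
        return "critical"
    return "warning"  # intermediate band: treatment assumed
```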
Task-Force Lessons Learned
Installing a service at every site is one thing, but commissioning an NxN system of links squares the effort. This is why we have perfSONAR-PS installed everywhere but not all links monitored yet.
perfSONAR is a “special” service:
It tests a multi-domain network path, with a service at the source and the destination.
It requires dedicated hardware and comes in a bundle with the OS. We understand this creates complications for some fabric infrastructures. An RPM bundle was provided to help those sites, and we also encouraged sites to share configuration experience.
We had many releases of perfSONAR during the deployment process, each coming with new features or bug fixes we requested. Some sites did install perfSONAR but are at old releases with much functionality missing. The change of OS version (v3.2 -> v3.3) was a major reason for the inertia of some sites.
We still have issues with firewalls. There are 2 kinds of firewall rules to be considered:
Rules allowing the hosts to run the tests among themselves
Rules allowing the hosts to expose information to the monitoring tools
Many sites get the first one right but not the second.
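As a concrete illustration of the two kinds of rules, here is an iptables sketch. The port numbers are typical toolkit defaults from around the v3.3 era and should be verified against the documentation for your release:

```shell
# 1) Let the hosts run tests among themselves
iptables -A INPUT -p tcp --dport 861       -j ACCEPT  # OWAMP control
iptables -A INPUT -p udp --dport 8760:9960 -j ACCEPT  # OWAMP test streams (default range)
iptables -A INPUT -p tcp --dport 4823      -j ACCEPT  # BWCTL control
iptables -A INPUT -p tcp --dport 5001:5900 -j ACCEPT  # BWCTL/iperf test ports (default range)
iptables -A INPUT -p icmp                  -j ACCEPT  # ping/traceroute tests

# 2) Let the hosts expose results to the monitoring tools
iptables -A INPUT -p tcp --dport 80  -j ACCEPT  # web interface / measurement archives
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
```

A site that applies only block (1) will test fine but still show “orange” on the dashboard, because the central monitoring cannot read its results.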
Important Remaining Issues
Get sites running older versions to upgrade
Verify we consistently get the needed metrics
Involve cloud/VO leads in debugging/fixing issues
Fix Firewalls: still a problem for many sites
Test coverage and parameters:
Should we have more VO-specific meshes/tests? e.g., WLCG -> WLCG-ATLAS, WLCG-CMS?
What frequency of testing for traceroute, BW?
Better Docs: How-tos, Debugging “orange”
WLCG Operations has convened a new Working Group to address these issues: Network and Transfer Metrics.
Mandate: Network & Transfer Metrics
Ensure all relevant network and transfer metrics are identified, collected and published
Ensure sites and experiments can better understand and fix networking issues
Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows (e.g. the ANSE project; see http://www.internet2.edu/presentations/tip2013/20130116_barczyk_anse.pdf)
Working Group Objectives
● Identify and continuously make available relevant transfer and network metrics
● Document metrics and their use
● Facilitate their integration in the middleware and/or experiment tool chain
● Coordinate commissioning and maintenance of WLCG network monitoring
o Finalize perfSONAR deployment
o Ensure all links continue to be monitored and sites stay correctly configured
o Verify coverage and optimize test parameters
Working Group Membership
● Chairs: Shawn McKee, Marian Babik
● Members: We propose to invite the previous members of the perfSONAR-PS TF responsible for the different clouds
o See https://twiki.cern.ch/twiki/bin/view/LCG/MeshUpdates
o Fill in members for missing clouds
● Inviting members knowledgeable about FAX, AAA, FTS, PhEDEx, PanDA or Rucio
● If anyone is interested in joining the effort contact me!
Use of perfSONAR-PS Metrics
Throughput: Notice problems and debug the network; also helps differentiate server problems from path problems
Latency: Notice route changes and asymmetric routes; watch for excessive packet loss
On-demand tests and NPAD/NDT diagnostics via web
Optionally: Install additional perfSONAR nodes inside the local network and/or at the periphery
Characterize local performance and internal packet loss
Separate WAN performance from internal performance
Daily dashboard check of your own site and of peers
Debugging Network Problems
Using perfSONAR-PS, we (the VOs) identify network problems by observing degradation in regular metrics for a particular “path”:
Packet loss appearing in latency tests
Significant and persistent decrease in bandwidth
Currently this requires a “human” to trigger.
Next, check for correlation with other metric changes between the sites at either end and other sites (is the problem likely at one of the ends or in the middle?).
Correlate with paths and traceroute information: has something changed in the routing? Is there a known issue in the path?
In general it is NOT as easy to do all this as we would like with the current perfSONAR-PS toolkit.
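The “human trigger” step above could in principle be automated. A minimal sketch of such a trigger, comparing recent measurements against a baseline window (the window sizes and drop factor are invented for illustration, not WLCG alerting policy):

```python
def degraded(history, recent_n=3, drop_factor=0.5):
    """Flag a path whose recent bandwidth has dropped persistently.

    `history` is a chronological list of bandwidth measurements (Gb/s).
    The last `recent_n` samples are compared against the mean of the
    earlier baseline; a persistent drop below `drop_factor` of that
    baseline triggers an alert.  All thresholds are illustrative.
    """
    if len(history) <= recent_n:
        return False  # not enough data to form a baseline
    baseline = sum(history[:-recent_n]) / (len(history) - recent_n)
    recent = history[-recent_n:]
    # Require *every* recent sample to be low, so a single noisy
    # measurement does not raise an alert.
    return all(x < drop_factor * baseline for x in recent)
```

The same pattern applies to packet loss (alert when recent loss is persistently above a threshold); correlating alerts across paths sharing an endpoint would then point to the likely location of the problem.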
Improving perfSONAR-PS Deployments
Based upon the issues we have encountered, we set up a wiki to gather best practices and solutions to the issues we have identified: http://www.usatlas.bnl.gov/twiki/bin/view/Projects/LHCperfSONAR
This page is shared with the perfSONAR-PS developers and we expect many of the “fixes” will be incorporated into future releases (most are in v3.3.2 already)
Improving resiliency (set-it-and-forget-it) is a high priority. Instances should self-maintain, and the infrastructure should be able to alert when services fail (OIM/GOCDB tests).
Disentangling problems with the measurement infrastructure versus problems in the network indicated by those measurements…
Future Use of Network Metrics
Once we have a source of network metrics being acquired we need to understand how best to incorporate those metrics into our facility operations.
Some possibilities:
Characterizing paths with “costs” to better optimize decisions in workflow and data management (underway in ANSE)
Noting when paths change and providing appropriate notification
Optimizing data access or data distribution based upon a better understanding of the network between sites
Identifying structural bottlenecks in need of remediation
Aiding network problem diagnosis and speeding repairs
In general, incorporating knowledge of the network into our processes
We will require testing and iteration to better understand when and where the network metrics are useful.
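As one toy illustration of characterizing paths with “costs”: a workflow system could rank candidate source sites by a scalar derived from the measured metrics. The functional form below is purely an assumption for illustration, not a formula from ANSE or WLCG:

```python
def path_cost(rtt_ms, bw_gbps, loss_fraction):
    """Toy path cost: prefer low latency, high bandwidth, low loss.

    Higher cost = less attractive path.  The formula is an
    illustrative assumption, not taken from any WLCG/ANSE tool.
    """
    effective_bw = bw_gbps * (1.0 - loss_fraction)
    if effective_bw <= 0:
        return float("inf")  # unusable path (total loss or no bandwidth)
    return rtt_ms / effective_bw
```

Testing and iteration would show whether such a scalar actually correlates with transfer performance; that is exactly the kind of question the working group needs to answer.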
OSG & Networking Service
OSG is building a centralized service for gathering, viewing and providing network information to users and applications.
Goal: OSG becomes the “source” for networking information for its constituents, aiding in finding/fixing problems and enabling applications and users to better take advantage of their networks
The plan is to migrate MaDDash and OMD to OSG in the next month.
The critical missing component is the datastore to organize and store the network metrics and associated metadata.
OSG (via MaDDash) is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR-PS instances.
This data must be available via an API, must be visualized, and must be organized to provide the “OSG Networking Service”.
Closing Remarks
Over the last few years WLCG sites have converged on perfSONAR-PS as their way to measure and monitor their networks for data-intensive science.
It was not easy to get global consensus, but we have it now, after pushing since 2008.
The assumption is that perfSONAR (and the perfSONAR-PS toolkit) is the de-facto standard way to do this and will be supported long-term.
It is especially critical that R&E networks agree on its use and continue to improve and develop the reference implementation.
Dashboard critical for “visibility” into networks. We can’t manage/fix/respond-to problems if we can’t “see” them.
Having perfSONAR-PS fully deployed should give us some interesting options for better management and use of our networks
Discussion/Questions
Questions or Comments?