Managing and monitoring large scale data transfers - Networkshop44



Managing and monitoring large scale data transfers
(WLCG FTS service as an example)

Brian Davies
Networkshop44, 22/03/16

Outline

• Outline of the data transfer monitoring problem
• What is the File Transfer Service (FTS)?
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – Virtual Organisation (VO) specific monitoring
  – User monitoring
• Federated failover
• Use of “generic” monitoring tools
  – Site monitoring in conjunction with VO monitoring

WLCG has a lot of data transfers to monitor

• 167 sites in 43 countries on six continents
• Storage endpoints containing 250 PB (disk) and 300 PB (tape)
  – Organised and chaotic access
  – Supporting single/multiple endpoints for single/multiple Virtual Organisations
  – Varying in size and scope
    • 10 TB to 10s of PB of total storage (disk and tape)
    • 1/10 GE NICs, 1/10/100 Gbps, R&E networks and private OPN
    • 10 TB-1 PB filesystems/object stores, 1-300 disk servers per site
    • Multiple filesystems (XFS, HDFS, Ceph, GPFS, Lustre)
• Central production and user initiated transfers
• In the last two years WLCG has moved 0.5 EB of data
  – Over 1 billion files
• Worker node (WN) jobs produce a lot of data which also has to be stored/moved
  – One VO runs 200k concurrent jobs which last 10 minutes to 72 hours
  – 0-100s of input files, 2-3 output files
  – Individual file open times of 1-10,000 s

Transfers to a single site/1day/1VO

Easily fill our networks*

*Not all the time

Data movements vary greatly

• File size from ~10 B to ~10 GB
• Latency between hosts from 0.1 ms to 350 ms (just for the UK)
• Different workflows require different data movement
  – WAN: SE<->SE, SE->WN, WN->SE
  – LAN: WN<->SE, SE<->SE
• Different tools to monitor different workflows
• Different storage middleware
  – Native gridFTP, BeStMan, DPM, dCache, StoRM
• Different transfer protocols
  – gsiFTP, http/WebDAV, xrootd, NFSv4.1, S3
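Multi-protocol transfers like these are typically driven through a common data-management layer rather than per-protocol clients. As a minimal sketch, the gfal2 Python bindings (from the same EGI middleware family) can copy between any two supported schemes; the endpoints and file names below are placeholders, not real sites.

```python
# Minimal sketch of a protocol-agnostic copy with the gfal2 Python bindings.
# Source/destination URLs are placeholders; any supported scheme
# (gsiftp://, https://, root://, srm://, s3://) can be mixed.
import gfal2

ctx = gfal2.creat_context()

params = ctx.transfer_parameters()
params.overwrite = True          # replace an existing destination copy
params.checksum_check = True     # validate source/destination checksums
params.timeout = 3600            # seconds allowed for the whole transfer

source = "gsiftp://se.example-site-a.ac.uk/dpm/data/file.root"
destination = "root://se.example-site-b.ac.uk//store/data/file.root"

ctx.filecopy(params, source, destination)
print("copy finished:", destination)
```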

File Transfer Service (FTS) moves data!

• EGI middleware stack
• Can handle many VOs
  – 22 (HEP and non-HEP)
• Checksum validation of files
• Retry of failed transfers
• Auto-optimisation of transfer parameters to maximise throughput
• Ability to set limits suitable for varied storage setups
• Web-friendly GUI also available!
  – Mainly used via command line tools or higher level control systems
• Handles many file transfers (~1.5M a day)
  – Single to thousands of files per submission
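To illustrate how a VO framework or user drives FTS from those command line tools and control systems, here is a minimal sketch using the FTS3 REST “easy” Python bindings; the server endpoint and file URLs are placeholders, and the exact options depend on the fts3-rest client version deployed.

```python
# Minimal sketch: submit a single-file transfer job to an FTS3 server
# via the fts3-rest "easy" bindings. Endpoint and URLs are placeholders.
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3.example.ac.uk:8446")

transfer = fts3.new_transfer(
    "gsiftp://se.site-a.example/path/file.root",   # source
    "gsiftp://se.site-b.example/path/file.root",   # destination
)

# One job can carry anything from a single file to thousands of transfers;
# checksum verification and retries mirror the FTS features listed above.
job = fts3.new_job([transfer], verify_checksum=True, retry=3)

job_id = fts3.submit(context, job)
print("submitted job", job_id)

# Poll the job status until FTS reports a terminal state.
status = fts3.get_job_status(context, job_id)
print(status["job_state"])
```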

Web GUI

• An overview of all transfers is needed to spot problematic sites
  – But we also need the ability to look at individual transfers
• Web GUIs, reading log files
  – We even have web GUIs which parse log files
• The people using the monitoring vary:
  – Site admins, regional support, VO users, middleware developers
• Management and technical views
  – Different systems work well for different use cases
• What is of interest?
  – Do transfers complete or fail?
  – How fast do they complete?
  – How can I tell if my changes improve or worsen the system?
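Those two questions (do transfers complete, and how fast) boil down to a couple of aggregates over transfer records. A toy sketch, assuming a hypothetical list of per-transfer dictionaries rather than any particular dashboard’s real schema:

```python
# Toy sketch: the two headline numbers most monitoring views reduce to.
# 'transfers' is a hypothetical list of per-transfer records, not a real schema.
transfers = [
    {"state": "FINISHED", "bytes": 2_000_000_000, "seconds": 180},
    {"state": "FAILED",   "bytes": 0,             "seconds": 0},
    {"state": "FINISHED", "bytes": 500_000_000,   "seconds": 60},
]

finished = [t for t in transfers if t["state"] == "FINISHED"]

success_rate = len(finished) / len(transfers)
throughput_mbps = [t["bytes"] * 8 / t["seconds"] / 1e6 for t in finished]

print(f"success rate: {success_rate:.0%}")
print(f"mean per-file throughput: {sum(throughput_mbps)/len(throughput_mbps):.1f} Mbit/s")
```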

Monitoring at different Levels

Central FTS Monitoring (dashboards and server GUIs)

Three main VOs; usage varies between them

Overview to see if a single site is having issues

View smaller selections… Able to make sub-selections to diagnose problems, not at the world scale

Comparison between inter-SE rates

Sites want to know if they are better than their collaborators/competitors

Ability to delve into greater detail at the server level

Many embedded links to further monitoring

Down to individual transfers

To the log file

Which the VO can then re-interpret

Transfer optimisation within FTS to increase individual transfer rates

Listing Errors (Helps find most important errors to solve)

File list of failed transfers for a single failure mode

History of a single file

Dedicated transfers to monitor rates

Users Gather their own information

• But systems change, which breaks the monitoring

AAA et al. for federated failover

• VOs each have their own system (AAA/FAX)
  – But they perform similar actions
  – Copy data to the WN from remote storage if a local copy does not exist
• Allows storage-less sites to be used
• Helps to reduce failures caused by local storage related issues
• Hierarchical redirection (a minimal sketch follows below)
  – Local -> regional -> continental -> global (or another convention)
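To make the redirection hierarchy concrete, here is a minimal sketch of the failover idea as a job wrapper might implement it: use the local copy if it exists, otherwise walk an ordered list of xrootd redirectors with xrdcp. The hostnames and paths are invented for illustration; the real AAA/FAX logic lives inside the experiments’ frameworks.

```python
# Minimal sketch of hierarchical federated failover for one input file.
# Redirector hostnames are invented; real VOs configure their own hierarchy.
import os
import subprocess

LFN = "/store/data/run123/file.root"        # logical file name wanted by the job
LOCAL_PATH = "/dpm/local-site" + LFN         # where a local replica would live

# Local -> regional -> continental -> global, as on the slide.
REDIRECTORS = [
    "xrootd.local-site.example",
    "xrootd.regional.example",
    "xrootd.continental.example",
    "xrootd.global.example",
]

def fetch(lfn: str, dest: str = "input.root") -> str:
    if os.path.exists(LOCAL_PATH):
        return LOCAL_PATH                    # local replica exists, read it directly
    for redirector in REDIRECTORS:
        url = f"root://{redirector}/{lfn}"
        # xrdcp returns non-zero if this level of the federation cannot serve the file.
        if subprocess.run(["xrdcp", "-f", url, dest]).returncode == 0:
            return dest
    raise RuntimeError(f"no replica of {lfn} reachable at any level")

print("reading input from", fetch(LFN))
```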

Example of global network

FAX backup transfer mechanism also monitored

• Outline of the scale of the data transport issue for WLCG
• What is the File Transfer Service (FTS)?
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – VO specific monitoring
  – User monitoring
• Federated failover
• Use of “generic” monitoring tools
  – Site monitoring in conjunction with VO monitoring

Generic network monitoring tools

• Sites have access to established programs
  – ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit

• Goals:
  – Find and isolate “network” problems; alert in time
  – Characterize network use (base-lining)
  – Provide a source of network metrics for higher level services
• Choice of a standard open source tool: perfSONAR
  – Benefiting from the R&E community consensus
• Tasks achieved:
  – Finalized core deployment and commissioned the perfSONAR network
  – Monitoring in place to create a baseline of the current situation between sites
  – Developed test coverage and made it possible to run “on-demand” tests to quickly isolate problems and identify problematic links

Shawn McKee UoM

Overview of perfSONAR in WLCG/OSG

• End-to-end network issues are difficult to spot and localize
  – Network problems are multi-domain, complicating the process
  – Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support
  – Performance issues involving the network are complicated by the number of components involved end-to-end
• perfSONAR provides a number of standard metrics we can use
• Latency measurements provide one-way delays and packet loss metrics
  – Packet loss is almost always very bad for performance
• Bandwidth tests measure achievable throughput and track TCP retries (using iperf3)
  – Provides a baseline to watch for changes and identify bottlenecks
• Traceroute/tracepath track network topology
  – All measurements are only useful when we know the exact path they take through the network
  – Tracepath additionally measures MTU but is frequently blocked

Shawn McKee UoM
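perfSONAR measures one-way delay and loss with dedicated test infrastructure, but the basic “is this path losing packets?” question can also be asked from any host with the standard tools listed earlier. A rough stand-in sketch that parses ping output; the target hostname is a placeholder and the parsing assumes Linux iputils, so treat it as illustrative rather than a substitute for the real latency tests.

```python
# Rough sketch: measure round-trip packet loss to a peer host with plain ping.
# A crude stand-in for perfSONAR's dedicated one-way latency/loss tests;
# the target hostname is a placeholder and parsing assumes Linux iputils output.
import re
import subprocess

def packet_loss(host: str, count: int = 100) -> float:
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", host],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    if not match:
        raise RuntimeError("could not parse ping output")
    return float(match.group(1))

loss = packet_loss("perfsonar.remote-site.example")
print(f"packet loss: {loss:.1f}%  (anything persistently > 0 hurts TCP throughput)")
```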

Importance of Measuring Our Networks

Current perfSONAR Deployment

246 active perfSONAR instances, 202 running the latest version (3.5+)

• 95 sonars in the latency mesh
  – 8930 links measured at 10 Hz
  – packet loss, one-way latency, jitter, TTL, packet reordering
• 115 sonars in the traceroute mesh
  – 13110 links, hourly traceroutes, path MTU
• 102 sonars in the bandwidth mesh
  – 10920 links (iperf3)

Shawn McKee UoM

https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3

Generic network monitoring tools

• Sites have access to established programs
  – ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising bi-directional host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit

Overview Dashboards

Dedicated monitoring Tools for the TCP layer

Central Service Monitoring

Analysis of the results garners useful information

Range of connections and rates on single host

Comparison between hosts at a single site

Conclusions

• We have a lot of data to move (but successfully do so)
  – In many workflows
• FTS is one method for doing it
• Federated failover
  – Automatic retries at multiple levels help make problems transparent to the user
• Lots of monitoring to ensure both a high success rate of transfers and high throughput, both per file and overall
  – Monitoring needs to be done at multiple levels
• Generic monitoring tools are also useful
• Thank you
  – Brian.Davies@stfc.ac.uk

Contact

Thank you

Brian Davies
GridPP
Brian.Davies@stfc.ac.uk