1
Achieving High Throughput on Fast Networks (Bandwidth Challenges
and World Records)
Les Cottrell & Yee-Ting Li
Stanford Linear Accelerator Center
Presented at the ICTP “Optimization Technologies for Low Bandwidth Networks” workshop, Trieste, October 2006
Bandwidth Challenges and Internet World Records
2
Driver: LHC Network Requirements
[Diagram: the LHC tiered computing model. Experiment → Online System (~PByte/sec) → CERN Tier 0+1 center (PBs of disk; tape robot) at ~150-1500 MBytes/sec; CERN → Tier 1 centers (FNAL, IN2P3, INFN, RAL) at 10-40 Gbps; Tier 1 → Tier 2 centers at ~1-10 Gbps; Tier 2 → Tier 3 institutes at 1 to 10 Gbps; institutes → Tier 4 workstations; physics data cache at ~10 Gbps. CERN/Outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~1:1:1. Tens of Petabytes by 2007-8; an Exabyte ~5-7 years later.]
3
Internet2 Land Speed Records & SC2003-2005 Records
(product of distance × speed using TCP, with routers)
• IPv4 multi-stream record with FAST TCP: 6.86 Gbps × 27 kkm, Nov 2004
• PCI-X 2.0: 9.3 Gbps Caltech-StarLight, Dec 2005
• PCI Express: 9.8 Gbps Caltech-Sunnyvale, July 2006
• Internet2 LSR, single IPv4 TCP stream: 7.21 Gbps over 20,675 km (7.2 Gbps × 20.7 kkm); see the metric sketch below
[Chart: successive Internet2 Land Speed Records plotted as throughput × distance (Petabit-m/sec, scale 0-160); earlier marks include 0.4 Gbps × 12,272 km, 0.9 Gbps × 10,978 km, 2.5 Gbps × 10,037 km, 5.4 Gbps × 7,067 km, 5.6 Gbps × 10,949 km, 4.2 Gbps × 16,343 km and 6.6 Gbps × 16,500 km. Blue = HEP records.]
— H. Newman
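The metric plotted above is simply throughput multiplied by terrestrial route distance. A minimal sketch of the arithmetic (Python; the inputs are the 7.21 Gbps and 20,675 km figures quoted on this slide, nothing else is assumed):

```python
def lsr_metric_pbit_m_per_s(throughput_gbps: float, distance_km: float) -> float:
    """Internet2 Land Speed Record metric: throughput x distance, in Petabit-metres/second."""
    return (throughput_gbps * 1e9) * (distance_km * 1e3) / 1e15

# Single-stream IPv4 record from the chart: 7.21 Gbps over 20,675 km
print(lsr_metric_pbit_m_per_s(7.21, 20_675))  # ~149 Pbit-m/s, matching the top of the chart
```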
4
Internet2 Land Speed Record ’02-03: Outline
• Breaking the Internet2 Land Speed Record (2002/03)
  – Not to be confused with: “Rocket-powered sled breaks 1982 world land speed record”, San Francisco Chronicle, May 1, 2003
• Who did it
• What was done
• How was it done?
• What was special about this anyway?
• Who needs it?
5
Who did it: Collaborators and sponsors
• Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
• SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti (SISSA)
• LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
• CERN: Olivier Martin, Paolo Moroni
• ANL: Linda Winkler
• DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telecom, Information Society Technologies
• Cisco, Level(3), Intel
• DoE, European Commission, NSF
6
What was done?
• Beat the Gbps limit for a single TCP stream across the Atlantic – transferred a TByte in an hour
When           | From      | To        | Bottleneck | MTU   | Streams | TCP      | Throughput
Nov ’02 (SC02) | Amsterdam | Sunnyvale | 1 Gbps     | 9000B | 1       | Standard | 923 Mbps
Nov ’02 (SC02) | Baltimore | Sunnyvale | 10 Gbps    | 1500B | 10      | FAST     | 8.6 Gbps
Feb ’03        | Sunnyvale | Geneva    | 2.5 Gbps   | 9000B | 1       | Standard | 2.38 Gbps
• Set a new Internet2 TCP land speed record, 10,619 Tbit-meters/sec (see http://lsr.internet2.edu/)
• With 10 streams achieved 8.6 Gbps across the US; one Terabyte transferred in less than one hour
7
Typical Components
• CPU: Pentium 4 (Xeon) with 2.4 GHz cpu
  – For GE used a Syskonnect NIC; for 10GE used an Intel NIC
  – Linux 2.4.19 or 20
• Routers
  – Cisco GSR 12406 with OC192/POS and 1 and 10GE server interfaces (loaned, list > $1M)
  – Cisco 760x
  – Juniper T640 (Chicago)
• Level(3) OC192/POS fibers (loaned; SNV-CHI monthly lease cost ~ $220K)
• All borrowed & off the shelf
[Photos: GSR router (note the bootees, earthquake strap and heat sink), disk servers, compute servers]
8
Challenges
• After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1 Gbits/s
  – i.e. a loss rate of 1 in ~2 Gpkts (3 Tbits), or a BER of 1 in 3.6×10^12
• PCI bus limitations (66 MHz × 64 bit = 4.2 Gbits/s at best)
• At 10 Gbits/s and 180 msec RTT, requires a 500 MByte window (see the arithmetic sketch below)
• Slow start: at 1 Gbits/s it takes about 5-6 secs on a 180 msec link
  – i.e. if we want 90% of the measurement in the stable (non slow-start) phase, we need to measure for 60 secs
  – need to ship >700 MBytes at 1 Gbits/s
[Plot: throughput (Mbits/s) vs. time (seconds); Sunnyvale-Geneva, 1500 Byte MTU, stock TCP]
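As a rough illustration of where the window-size and loss-recovery figures above come from, here is a back-of-the-envelope sketch in Python. The 1460-byte MSS and the idealised one-MSS-per-RTT Reno growth model are simplifying assumptions, so the exact numbers differ somewhat from the slide's:

```python
# Rough arithmetic behind the window-size and loss-recovery bullets above.
# Assumptions (illustrative): 1460-byte MSS, AIMD growth of one MSS per RTT.

def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return rate_bps * rtt_s / 8

def reno_recovery_s(rate_bps: float, rtt_s: float, mss: int = 1460) -> float:
    """Time for Reno to regrow a halved window back to the full pipe,
    adding one MSS per RTT (idealised model)."""
    cwnd_segments = bdp_bytes(rate_bps, rtt_s) / mss
    return (cwnd_segments / 2) * rtt_s

rtt = 0.180  # ~180 ms, as quoted on the slide
print(f"Raw BDP at 10 Gbps: ~{bdp_bytes(10e9, rtt)/1e6:.0f} MBytes "
      "(socket buffers need to be larger still, hence the slide's ~500 MByte figure)")
print(f"Reno recovery from a single loss at 10 Gbps: ~{reno_recovery_s(10e9, rtt)/3600:.1f} hours")
```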
9
What was special? 1/2
• End-to-end, application-to-application, single and multi-streams (not just internal backbone aggregate speeds)
• TCP has not run out of steam yet; it scales from modem speeds into the multi-Gbits/s region
  – TCP is well understood, mature, with many good features: reliability etc.
  – Friendly on shared networks
• New TCP stacks only need to be deployed at the sender
  – Often just a few data sources, many destinations
  – No modifications to backbone routers etc.
  – No need for jumbo frames
• Used Commercial Off The Shelf (COTS) hardware and software
10
What was Special 2/2
• Raise the bar on expectations for applications and users
  – Some applications can use Internet backbone speeds
  – Provide planning information
• The network is looking less like a bottleneck and more like a catalyst/enabler
  – Reduces the need to colocate data and cpu
  – No longer need to ship literally truck- or plane-loads of data around the world
  – Worldwide collaborations of people working with large amounts of data become increasingly possible
11
Who needs it?
• HENP – the current driver
  – Multi-hundreds of Mbits/s and multi-TByte files/day transferred across the Atlantic today
    • The SLAC BaBar experiment already has almost a PByte stored
  – Tbits/s and ExaBytes (10^18 bytes) stored in a decade
• Data intensive science: astrophysics, global weather, bioinformatics, fusion, seismology…
• Industries such as aerospace, medicine, security…
• Future: media distribution
  – 1 Gbits/s = 2 full-length DVD movies/minute
  – 2.36 Gbits/s is equivalent to (see the sketch below):
    • Transferring a full CD in 2.3 seconds (i.e. 1565 CDs/hour)
    • Transferring 200 full-length DVD movies in one hour (i.e. 1 DVD in 18 seconds)
  – Will sharing movies be like sharing music today?
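The CD/DVD equivalences above are plain rate arithmetic. A quick sketch (Python; the ~680 MB CD and ~4.7 GB DVD capacities are assumed nominal sizes, which is why the results differ slightly from the slide's round figures):

```python
# Rate arithmetic behind the media-transfer equivalences above.
# Assumed nominal media sizes: CD ~680 MB, single-layer DVD ~4.7 GB.

def items_per_hour(link_gbps: float, item_bytes: float) -> float:
    """How many objects of a given size fit through the link in one hour."""
    bytes_per_s = link_gbps * 1e9 / 8
    return 3600 * bytes_per_s / item_bytes

CD, DVD = 680e6, 4.7e9
print(f"CDs/hour at 2.36 Gbps:  ~{items_per_hour(2.36, CD):.0f}")
print(f"DVDs/hour at 2.36 Gbps: ~{items_per_hour(2.36, DVD):.0f}")
print(f"Seconds per DVD:        ~{3600 / items_per_hour(2.36, DVD):.0f}")
```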
12
When will it have an impact?
• ESnet traffic has been doubling each year since 1990
• SLAC capacity has been increasing by 90%/year since 1982
  – SLAC Internet traffic increased by a factor of 2.5 in the last year
• International throughput increases by a factor of 10 in 4 years
• So traffic increases by a factor of 10 every 3.5 to 4 years (see the extrapolation sketch below), so in:
  – 3.5 to 5 years: 622 Mbps => 10 Gbps
  – 3-4 years: 155 Mbps => 1 Gbps
  – 3.5-5 years: 45 Mbps => 622 Mbps
• 2010-2012:
  – 100s of Gbits/s for high speed production net end connections
  – 10 Gbps will be mundane for R&E and business
  – Home broadband: doubling ~every year, 100 Mbits/s by end of decade (if doubling each year, then 10 Mbits/s by 2012)?
  – Aggressive goal: 1 Gbps to all Californians by 2010
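The "factor of 10 every 3.5 to 4 years" extrapolation is just compounding of the annual growth rates quoted above; a minimal sketch (Python, using only those growth rates):

```python
from math import log

def years_to_grow(factor: float, annual_growth: float) -> float:
    """Years needed for traffic to grow by `factor` at a fixed annual growth rate
    (annual_growth = 1.0 means doubling every year, 0.9 means +90%/year)."""
    return log(factor) / log(1 + annual_growth)

for rate in (0.9, 1.0, 1.5):  # +90%/yr (SLAC capacity), +100%/yr (ESnet), +150%/yr (SLAC traffic last year)
    print(f"x10 at +{rate:.0%}/year: {years_to_grow(10, rate):.1f} years")

# e.g. growing a 622 Mbps link's worth of traffic to 10 Gbps at doubling per year:
print(f"622 Mbps -> 10 Gbps: {years_to_grow(10_000 / 622, 1.0):.1f} years")
```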
13
Impact
• Caught the technical press’s attention
  – On TechTV and ABC Radio
  – Reported in places such as CNN, the BBC, the Times of India, Wired and Nature
  – Reported in English, Spanish, Portuguese, French, Dutch and Japanese
  – Guinness Book of Records (2004)
14
SC Bandwidth Challenge
• Yearly challenge held at the SuperComputing (SC) show
• ‘The Bandwidth Challenge highlights the best and brightest in new techniques for creating and utilizing vast rivers of data that can be carried across advanced networks.’
• Transfer as much data as possible using real applications over a 2-hour window
15
BWC: History
• 2002: “Extreme Bandwidth” – Caltech, SLAC, CERN
  – 12.4 Gbits/s peak, 2nd place; an LBNL video stream over UDP won
• 2003: “Bandwidth Lust” – Caltech, SLAC, LANL, CERN, Manchester, NIKHEF
  – 23 Gbits/s peak (6.6 TBytes in < 1 hour), 1st place
• 2004: “Terabyte data transfers for physics” – Caltech, SLAC, FNAL + …
  – Achieved 101 Gbits/s, 1st place
• 2005: “Global Lambda for Particle Physics”
  – Sustained > 100 Gbits/s for many hours, peak > 150 Gbits/s, 1st place
16
BWC: Overview 2005
• Distributed TeraByte Particle Physics Data Sample Analysis
  – ‘Demonstrated high speed transfers of particle physics data between host labs and collaborating institutes in the USA and worldwide. Using state of the art WAN infrastructure and Grid Web Services based on the LHC Tiered Architecture, they showed real-time particle event analysis requiring transfers of Terabyte-scale datasets.’
• In detail, during the bandwidth challenge (2 hours):
  – 131 Gbps measured by the SCInet BWC team on 17 of our 21 waves (15 minute average); > 150 Gbps on all 22
  – 95.37 TB of data transferred (3.8 DVDs per second; see the cross-check below)
  – 90-150 Gbps sustained (peak 150.7 Gbps)
• On the day of the challenge:
  – Transferred ~475 TB while ‘practising’ (waves were shared, still tuning applications and hardware)
  – Peak one-way utilisation observed on a single link was 9.1 Gbps (Caltech) and 8.4 Gbps (SLAC)
• Also wrote to StorCloud
  – SLAC: wrote 3.2 TB in 1649 files during the BWC
  – Caltech: 6 GB/sec with 20 nodes
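As a quick consistency check on the headline numbers, the total volume and the sustained rate quoted above agree; a short sketch in Python using only the figures on this slide:

```python
# Cross-check: 95.37 TB transferred during the 2-hour challenge window.
volume_bytes = 95.37e12
window_s = 2 * 3600
avg_gbps = volume_bytes * 8 / window_s / 1e9
print(f"Average rate over the window: ~{avg_gbps:.0f} Gbps")  # ~106 Gbps, inside the quoted 90-150 Gbps range
```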
17
Participants Worldwide
• Caltech/HEP/CACR/NetLab: Harvey Newman, Julian Bunn (contact), Dan Nae, Sylvain Ravot, Conrad Steenberg, Yang Xia, Michael Thomas
• SLAC/IEPM: Les Cottrell, Gary Buhrmaster, Yee-Ting Li, Connie Logg
• FNAL: Matt Crawford, Don Petravick, Vyto Grigaliunas, Dan Yocum
• University of Michigan: Shawn McKee, Andy Adamson, Roy Hockett, Bob Ball, Richard French, Dean Hildebrand, Erik Hofer, David Lee, Ali Lotia, Ted Hanss, Scott Gerstenberger
• U Florida: Paul Avery, Dimitri Bourilkov
• University of Manchester: Richard Hughes-Jones
• CERN, Switzerland: David Foster
• KAIST, Korea: Yusung Kim
• Kyungpook University, Korea: Kihwan Kwon
• UERJ, Brazil: Alberto Santoro
• UNESP, Brazil: Sergio Novaes
• USP, Brazil: Luis Fernandez Lopez
• GLORIAD, USA: Greg Cole, Natasha Bulashova
• Industry and network partners: Sun & Chelsio, Neterion, ESnet
18
19
Networking Overview
• We had 22 10 Gbits/s waves to the Caltech and SLAC/FNAL booths. Of these:
  – 15 waves to the Caltech booth (from Florida (1), Korea/GLORIAD (1), Brazil (1 × 2.5 Gbits/s), Caltech (2), LA (2), UCSD, CERN (2), U Michigan (3), FNAL (2))
  – 7 × 10 Gbits/s waves to the SLAC/FNAL booth (2 from SLAC, 1 from the UK, and 4 from FNAL)
• The waves were provided by Abilene, Canarie, Cisco (5), ESnet (3), GLORIAD (1), HOPI (1), Michigan Light Rail (MiLR), National Lambda Rail (NLR), TeraGrid (3) and UltraScienceNet (4).
20
Network Overview
21
Hardware (SLAC only)
• At SLAC:
  – 14 × 1.8 GHz Sun v20z (dual Opteron)
  – 2 × Sun 3500 disk trays (2 TB of storage)
  – 12 × Chelsio T110 10 Gb NICs (LR)
  – 2 × Neterion/S2io Xframe I (SR)
  – Dedicated Cisco 6509 with 4 × 4-port 10GE blades
• At SC|05:
  – 14 × 2.6 GHz Sun v20z (dual Opteron)
  – 10 QLogic HBAs for StorCloud access
  – 50 TB of storage at SC|05 provided by 3PAR (shared with Caltech)
  – 12 × Neterion/S2io Xframe I NICs (SR)
  – 2 × Chelsio T110 NICs (LR)
  – Shared Cisco 6509 with 6 × 4-port 10GE blades
22
Hardware at SC|05
[Photo: industrial fans to keep things cool around the back]
23
Software
• BBCP (‘BaBar File Copy’)
  – Uses ‘ssh’ for authentication
  – Multiple-stream capable (see the conceptual sketch below)
  – Features ‘rate synchronisation’ to reduce byte retransmissions
  – Sustained over 9 Gbps on a single session
• XrootD
  – Library for transparent file access (standard unix file functions)
  – Designed primarily for LAN access (transaction-based protocol)
  – Managed over 35 Gbits/s (in two directions) on 2 × 10 Gbps waves
  – Transferred 18 TBytes in 257,913 files
• dCache
  – 20 Gbps of production and test cluster traffic
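To illustrate the multiple-stream idea that BBCP (and the multi-stream records earlier in the talk) rely on, here is a minimal conceptual sketch in Python. It is not BBCP's implementation or interface; the hostname and port are made up, and the point is only that splitting a transfer across several parallel TCP connections keeps one slow or lossy stream from capping the aggregate rate:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def send_chunk(host: str, port: int, chunk: bytes) -> int:
    """Push one chunk over its own TCP connection and return the bytes sent."""
    with socket.create_connection((host, port)) as s:
        # A large socket buffer matters on long fat networks (cf. the BDP sketch earlier).
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
        s.sendall(chunk)
    return len(chunk)

def parallel_send(host: str, port: int, data: bytes, streams: int = 8) -> int:
    """Split `data` into `streams` slices and send them concurrently."""
    step = -(-len(data) // streams)  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return sum(pool.map(lambda c: send_chunk(host, port, c), chunks))

if __name__ == "__main__":
    payload = b"x" * (64 * 1024 * 1024)  # 64 MB of dummy data
    sent = parallel_send("receiver.example.org", 5000, payload, streams=8)  # hypothetical receiver
    print(f"sent {sent} bytes over 8 streams")
```

In practice BBCP does this (plus rate synchronisation) natively and far more efficiently than a user-level sketch like this.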
24
Previous year (SC|04)
BWC Aggregate Bandwidth
25
Cumulative Data Transferred
[Plot: cumulative data transferred over time, with the Bandwidth Challenge period marked]
26
Component Traffic
• Note instability
27
SLAC Cluster Contributions
[Plot: traffic in to and out from the booth, ESnet routed and ESnet SDN layer 2 via USN, with the Bandwidth Challenge period and router crashes marked]
28
Problems…
• Managerial/PR
  – The initial request for loaned hardware took place 6 months in advance!
  – Lots and lots of paperwork to keep account of all the loaned equipment (over 100 items loaned from 7 vendors)
  – Thanking/acknowledging all contributors, press release clearances
• Logistical
  – Set up and tore down a pseudo-production network and servers in the space of a week!
  – Testing could not begin until the waves were alight
    • Most waves were lit the day before the challenge!
  – Shipping so much hardware is not cheap!
  – Setting up monitoring
29
Problems…
• Tried to configure hardware and software prior to the show
• Hardware
  – NICs
    • We had 3 bad Chelsios (bad memory)
    • Neterion Xframe IIs did not work in UKLight’s Boston machines
  – Hard disks: 3 dead 10K disks (had to ship in spares)
  – 1 × 4-port 10Gb blade DOA
  – MTU mismatch between domains
  – A router blade died during stress testing the day before the BWC!
  – Cables! Cables! Cables!
• Software
  – Used golden disks for duplication (still takes 30 minutes per disk to replicate!)
  – Linux kernels: initially used 2.6.14, found severe performance problems compared to 2.6.12
  – (New) router firmware caused crashes under heavy load
    • Unfortunately, this was only discovered just before the BWC
    • Had to manually restart the affected ports during the BWC
30
Problems
• Most transfers were from memory to memory (ramdisk etc.)
  – Local caching of (small) files in memory
  – Reading and writing to disk will be the next bottleneck to overcome
31
SC05 BWC Takeaways & Lessons
• Substantive take-aways from this marathon exercise:
  – An optimized Linux kernel (2.6.12 + FAST + NFSv4) for data transport, after 7 full kernel-build cycles in 4 days
  – Scaling up SRM/gridftp to near 10 Gbps per wave, using Fermilab’s production clusters
  – A newly optimized application-level copy program, bbcp, that matches the performance of iperf under some conditions
  – Extensions of SLAC’s Xrootd, an optimized low-latency file access application for clusters, across the wide area
  – Understanding of the limits of 10 Gbps-capable computer systems, network switches and interfaces under stress
• A lot of work remains to put this into production use (for example the Caltech/CERN/FNAL/SLAC/Michigan collaboration)
32
Conclusion
• Previewed the IT challenges of the next generation of data-intensive science applications (high energy physics, astronomy etc.)
  – Petabyte-scale datasets
  – Tens of national and transoceanic links at 10 Gbps (and up)
  – 100+ Gbps aggregate data transport sustained for hours; we reached a Petabyte/day transport rate for real physics data
• Learned to gauge the difficulty of the global networks and transport systems required for the LHC mission
  – Set up, shook down and successfully ran the systems in < 1 week
  – Understood and optimized the configurations of various components (network interfaces, router/switches, OS, TCP kernels, applications) for high performance over the wide area network
33
What’s next?
• Break the 10 Gbits/s single-stream limit, currently distance-capped
• Evaluate new stacks with real-world links, and other equipment
  – Other NICs
  – Response to congestion, pathologies
  – Fairness, robustness, stability
  – Deploy for some major (e.g. HENP/Grid) customer applications
• Disk-to-disk throughput & useful applications
  – Need faster cpus (an extra ~60% MHz per Mbit/s over TCP for disk-to-disk); understand how to use multi-processors
  – Disk-to-disk marks: 536 MBytes/sec (Windows); 500 MBytes/sec (Linux)
  – Concentrate now on reliable Terabyte-scale file transfers
• System issues: PCI-X bus, network interfaces, disk I/O controllers, Linux kernel, CPU utilization
• Move from “hero” demonstrations to commonplace use
34
Press and PR SC|05
• 11/8/05 – Brit Boffins aim to Beat LAN speed record, from vnunet.com
• SC|05 Bandwidth Challenge, SLAC Interaction Point
• Top Researchers, Projects in High Performance Computing Honored at SC/05 …, Business Wire (press release), San Francisco, CA, USA
• 11/18/05 – Official Winner Announcement
• 11/18/05 – SC|05 Bandwidth Challenge Slide Presentation
• 11/23/05 – Bandwidth Challenge Results, from Slashdot
• 12/6/05 – Caltech press release
• 12/6/05 – Neterion Enables High Energy Physics Team to Beat World Record Speed at SC05 Conference, CCN Matthews News Distribution Experts
• High energy physics team captures network prize at SC|05, from SLAC
• High energy physics team captures network prize at SC|05, EurekAlert!
• 12/7/05 – High Energy Physics Team Smashes Network Record, from Science Grid this Week
• Congratulations to our Research Partners for a New Bandwidth Record at SuperComputing 2005, from Neterion
35
How was it done: Typical testbed
[Diagram: testbed spanning Sunnyvale (SNV), Chicago (CHI), Amsterdam (AMS) and Geneva (GVA), > 10,000 km (EU+US). Components include 12 × 2-cpu servers and 4 disk servers with a GSR at Sunnyvale (section first deployed for SC2002, Nov 02), 6 × 2-cpu servers and 4 disk servers at the far end, a Juniper T640 at Chicago and Cisco 7609s along the path. Links are OC192/POS (10 Gbits/s) with a 2.5 Gbits/s bottleneck segment.]
36
37
SLAC/UK Contribution
[Plot: traffic in to and out from the booth via UKLight, ESnet/USN layer 2 and ESnet routed paths]
38
SLAC-FermiLab-UK Bandwidth Contributions
[Plot: traffic in to and out from the booth, broken down by SLAC-ESnet, SLAC-ESnet routed, SLAC-ESnet-USN, FermiLab-HOPI, FNAL-UltraLight and UKLight paths]
39
SLAC/ESnet Contribution
[Plot: throughput (Mbps) per host and aggregate]
40
SLAC/FNAL Booth
[Plot: throughput (Mbps) per wave and aggregate]
41
FermiLab Contribution
[Plot: traffic over HOPI, USN and UltraLight]
42
Conclusion
• Products from this exercise:
  – An optimized Linux kernel (2.6.12 + NFSv4 + FAST and other TCP stacks) for data transport, after 7 full kernel-build cycles in 4 days
  – A newly optimized application-level copy program, bbcp, that matches the performance of iperf under some conditions
  – Extensions of Xrootd, an optimized low-latency file access application for clusters, across the wide area
  – Understanding of the limits of 10 Gbps-capable systems under stress
  – How to effectively utilize 10GE- and 1GE-connected systems to drive 10 gigabit wavelengths in both directions
  – Use of production and test clusters at FNAL reaching more than 20 Gbps of network throughput
• Significant efforts remain from the perspective of high-energy physics:
  – Management, integration and optimization of network resources
  – End-to-end capabilities able to utilize these network resources, including applications and I/O devices (disk and storage systems)
43
More Information
• Internet2 Land Speed Record publicity
  – www-iepm.slac.stanford.edu/lsr/
  – www-iepm.slac.stanford.edu/lsr2/
• 10GE tests
  – www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
  – sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html