© 2006 Open Grid Forum
Interactions Between Networks, Protocols & Applications
HPCN-RG
Richard Hughes-Jones
OGF20, Manchester, May 2007
Slide 2
ESLEA and UKLight at SC|05
Slide 3
ESLEA and UKLight
6 × 1 Gbit transatlantic Ethernet layer-2 paths: UKLight + NLR
Disk-to-disk transfers with bbcp, Seattle to UK: set TCP buffer / application to give ~850 Mbit/s; one stream of data achieved 840 Mbit/s
Stream UDP VLBI data UK to Seattle: 620 Mbit/s
No packet loss; worked well
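Setting the buffer to match the path comes down to the bandwidth-delay product. A minimal sketch of the underlying socket call (the bandwidth and RTT values are illustrative assumptions, not the measured SC|05 configuration; bbcp exposes its own options for this):

```python
import socket

# Assumed path parameters for illustration: 1 Gbit/s, ~140 ms RTT.
BANDWIDTH_BPS = 1_000_000_000
RTT_S = 0.140
bdp_bytes = int(BANDWIDTH_BPS * RTT_S / 8)   # ~17.5 Mbyte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers of roughly one BDP; the kernel may
# clamp these to its configured maxima (e.g. net.core.wmem_max).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("granted send buffer:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```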
[Figure: achieved rate (Mbit/s, 0-1000) vs time (16:00-23:00) for the four SC|05 hosts sc0501-sc0504, plus the aggregate UKLight SC|05 traffic (0-4500 Mbit/s) including the reverse TCP flow.]
Slide 4
SC|05 HEP: Moving data with bbcp
What is the end-host doing with your network protocol? Look at the PCI-X buses: 3Ware 9000 controller with RAID0, 1 Gbit Ethernet link, 2.4 GHz dual Xeon: ~660 Mbit/s
Power is needed in the end hosts, and careful application design
[Figure: PCI-X bus activity traces. RAID-controller bus: read from disk for 44 ms every 100 ms. Ethernet-NIC bus: write to network for 72 ms.]
Slide 5
SC2004: Disk-Disk bbftp
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; RTT 177 ms; SACK off
Move a 2 Gbyte file
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, with ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s works!
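The 22 Mbyte socket size matches the bandwidth-delay product of the 177 ms path at the 1 Gbit/s line rate (a worked check):

```latex
\mathrm{BDP} = B \times \mathrm{RTT}
             = 10^{9}\ \mathrm{bit/s} \times 0.177\ \mathrm{s}
             = 1.77\times10^{8}\ \mathrm{bit}
             \approx 22\ \mathrm{Mbyte}
```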
[Figure: Web100 plots for the two stacks: TCP achieved bandwidth (0-2500 Mbit/s) and CurCwnd vs time (0-20,000 ms), showing instantaneous BW, average BW and CurCwnd.]
Slide 6
[Figure: RAID0 six-disk 1 Gbyte write, 64k stripe, 3w8506-8: throughput (Mbit/s) vs trial number.]
Network & Disk Interactions
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0 with six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
Measure memory-to-RAID0 transfer rates with & without UDP traffic
[Figure: the same RAID0 write measurement with concurrent UDP traffic, for 1500-byte and 9000-byte MTU: throughput (Mbit/s) vs trial number.]
[Figure: % CPU system mode (L3+4) vs % CPU system mode (L1+2) for 8k and 64k writes, without and with 1500/9000 MTU UDP; the points follow y = 178 - 1.05x (fits y = -1.017x + 178.32 and y = -1.0479x + 174.44).]
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
Slide 7
Remote Computing Farms in the ATLAS TDAQ Experiment
Slide 8
ATLAS Remote Farms – Network Connectivity
Slide 9
ATLAS Remote Computing: Application Protocol
Event request: EFD requests an event from SFI; SFI replies with the event (~2 Mbytes)
Processing of the event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon: an instrumented TCP request-response program that emulates the EFD-to-SFI communication of the Event Filter.
[Diagram: message sequence between the Event Filter Daemon (EFD) and SFI/SFO over time: Request event, Send event data, Process event, Request buffer, Send OK, Send processed event; plus a histogram of request-response times.]
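tcpmon itself is an instrumented tool; the request-response pattern it emulates looks roughly like the sketch below (the host, port and use of Python are illustrative assumptions, not the actual tcpmon code):

```python
import socket
import time

REQUEST_SIZE = 64                  # small "Request event" message
RESPONSE_SIZE = 2 * 1024 * 1024    # ~2 Mbyte event, as in EFD <-> SFI

def request_events(host, port, n_events):
    """Emulate the EFD side: send a small request, time the big response."""
    s = socket.create_connection((host, port))
    for i in range(n_events):
        t0 = time.time()
        s.sendall(b"R" * REQUEST_SIZE)
        received = 0
        while received < RESPONSE_SIZE:
            chunk = s.recv(65536)
            if not chunk:
                raise ConnectionError("server closed the connection")
            received += len(chunk)
        print(f"event {i}: request-response time {time.time() - t0:.3f} s")
    s.close()

# request_events("sfi.example.org", 5000, 100)   # placeholder host/port
```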
Slide 10
TCP Activity Manc-CERN Req-Resp
[Figure: Web100 trace of DataBytesOut and DataBytesIn vs time.]
Round trip time 20 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP in slow start: the 1st event takes 19 RTT or ~380 ms
[Figure: DataBytesOut, DataBytesIn and CurCwnd vs time (ms), showing the congestion window collapsing between requests.]
The TCP congestion window gets reset on each request: the TCP stack follows RFC 2581 & RFC 2861, reducing Cwnd after inactivity
Even after 10 s, each response takes 13 RTT or ~260 ms
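On Linux this RFC 2861 behaviour can be switched off with the tcp_slow_start_after_idle sysctl; the presenters may have used a patched stack instead, but this is one standard way to get the "no cwnd reduction" behaviour shown on the next slide (needs root):

```python
# Equivalent to: sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# Stops the kernel from shrinking cwnd on connections that go idle.
with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
    f.write("0")
```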
[Figure: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms).]
Transfer achievable throughput: 120 Mbit/s
Slide 11
TCP Activity Manc-CERN Req-Resp, TCP stack with no cwnd reduction
Round trip time 20 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 19 RTT or ~380 ms
[Figure: DataBytesOut and DataBytesIn vs time.]
[Figure: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms); packets out/in and CurCwnd vs time (ms).]
The TCP congestion window grows nicely
Response takes 2 RTT after ~1.5 s; rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Data is transferred WHEN the application requires it
(First events take 3 round trips; later ones take 2.)
Slide 12
TCP Activity Alberta-CERN Req-Resp, TCP stack with no Cwnd reduction
Round trip time 150 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 11 RTT or ~1.67 s
The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
Response in 2 RTT after ~2.5 s; rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Figure: DataBytesOut and DataBytesIn vs time; TCP achieved bandwidth and Cwnd vs time (ms); packets out/in and CurCwnd vs time (ms).]
Slide 13
Moving Constant Bit-rate Data in Real-Time for Very Long Baseline Interferometry
Stephen Kershaw, Ralph Spencer, Matt Strong, Simon Casey, Richard Hughes-Jones
The University of Manchester
Slide 14
What is VLBI?
The VLBI signal wave front is sent as data over the network to the correlator
Resolution is set by the baseline length
Sensitivity: bandwidth B is as important as integration time τ (noise goes as 1/√(Bτ)); we can use as many gigabits as we can get!
Slide 15
European e-VLBI Test Topology
[Map: sites Onsala (Sweden), Chalmers University of Technology (Gothenburg), Jodrell Bank (UK), Dwingeloo (Netherlands), Medicina (Italy), Toruń (Poland), Metsähovi (Finland); links include a dedicated DWDM link, Gbit links and 2 × 1 Gbit links.]
Slide 16
CBR Test Setup
Slide 17
CBR over TCP
[Figure: two panels of message arrival time (s) vs message number for drop rates of 1 in 5k, 10k, 20k, 40k, and no loss. With loss the data are delayed; with no loss the data arrive on time.]
Effect of loss rate on message arrival time: TCP buffer 1.8 MB (BDP), RTT 27 ms, and TCP buffer 0.9 MB (BDP), RTT 15.2 ms
When there is packet loss, TCP decreases the rate
Can TCP deliver the data on time?
Slide 18
[Diagram: arrival time vs message number for a CBR stream: the expected-arrival line has slope 1/throughput; a packet loss delays the stream, which then resynchronises with the expected arrival time.]
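The bookkeeping behind this picture is simple: for a CBR stream, message i is expected at t0 + i × (message size / rate), so the expected-arrival line has slope 1/throughput, and any positive residual is delay added by TCP. A small sketch of that analysis (illustrative, not the actual test code):

```python
def arrival_delays(arrival_times, msg_bytes, rate_bps):
    """Delay of each message relative to its CBR-expected arrival time."""
    interval = msg_bytes * 8 / rate_bps        # CBR inter-message time (s)
    t0 = arrival_times[0]
    return [t - (t0 + i * interval) for i, t in enumerate(arrival_times)]

# Example: 1448-byte messages at 525 Mbit/s (the rates used in these tests);
# a loss around message 2 shows up as a step in the delay.
print(arrival_delays([0.0, 22.1e-6, 0.020, 0.020022], 1448, 525e6))
```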
Slide 19
[Figure: CurCwnd, achieved rate (Mbit/s), packets retransmitted and DupAcksIn vs time (s, 0-120).]
CBR over TCP – Large TCP Buffer
Message size 1448 bytes; data rate 525 Mbit/s; route Manchester - JIVE, RTT 15.2 ms
TCP buffer 160 MB; drop 1 in 1.12 million packets
Throughput increases: peak throughput ~734 Mbit/s, min. throughput ~252 Mbit/s
Slide 20
[Figure: one-way delay (ms, 0-3000) vs message number (2,000,000-5,000,000).]
CBR over TCP – Message Delay
Message size 1448 bytes; data rate 525 Mbit/s; route Manchester - JIVE, RTT 15.2 ms
TCP buffer 160 MB; drop 1 in 1.12 million packets
Peak delay ~2.5 s
Slide 21
Slide 22
Summary & Conclusions
Standard TCP is not optimum for high-throughput, long-distance links
Packet loss is a killer for TCP: check on-campus links & equipment and the access links to backbones; users need to collaborate with the campus network teams; DANTE PERT
New stacks are stable and give better response & performance: still need to set the TCP buffer sizes! Check other kernel settings, e.g. the window-scale maximum; watch for "TCP stack implementation enhancements"
TCP tries to be fair: large MTU has an advantage; short distances (small RTT) have an advantage
TCP does not share bandwidth well with other streams
The end hosts themselves: plenty of CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power; the interaction between hardware, protocol processing and the disk sub-system is complex
Application architecture & implementation are also important: the TCP protocol dynamics strongly influence the behaviour of the application
Users are now able to perform sustained 1 Gbit/s transfers
Slide 23
Any Questions?
Slide 24
Network switch limits behaviour: end-to-end UDP packets from udpmon
Only 700 Mbit/s throughput
Lots of packet loss
Packet loss distribution shows the throughput is limited
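udpmon's method, in outline: send a train of UDP frames with a fixed inter-frame spacing, then compare sent and received counts to get the wire rate and loss. A much-simplified sender sketch (the real udpmon paces with calibrated timing loops and carries its own test protocol; names and ports here are placeholders):

```python
import socket
import time

def udp_burst(host, port, n_frames, frame_bytes, spacing_us):
    """Send n_frames UDP packets with (approximately) fixed spacing."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytes(frame_bytes - 4)
    spacing = spacing_us * 1e-6
    t_next = time.perf_counter()
    for seq in range(n_frames):
        # A sequence number lets the receiver count loss and reordering.
        s.sendto(seq.to_bytes(4, "big") + payload, (host, port))
        t_next += spacing
        while time.perf_counter() < t_next:   # crude busy-wait pacing
            pass

# udp_burst("receiver.example.org", 14196, 10000, 1472, 12)  # placeholders
```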
[Figure: udpmon results for w05gva-gig6, 29 May 2004: received wire rate (Mbit/s) and % packet loss vs spacing between frames (µs) for packet sizes 50-1472 bytes; 1-way delay (µs) vs packet number at 12 µs wait.]
Slide 25
LightPath Topologies
Slide 26
Switched LightPaths [1]: Lightpaths are a fixed point-to-point path or circuit
Optical links (with FEC) have a BER of 10^-16, i.e. a packet loss rate of 10^-12, or about 1 loss in 160 days (a rough check below)
In SJ5, LightPaths are known as Bandwidth Channels
Host-to-host lightpath: one application, no congestion; advanced TCP stacks for large delay-bandwidth products
Lab-to-lab lightpaths: many applications share the path; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks
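A rough check of the "1 loss in about 160 days" figure (assuming losses come only from bit errors; the sustained rate is an assumption):

```python
BER = 1e-16            # bit error rate of the optical link with FEC
rate_bps = 7.2e8       # assume ~720 Mbit/s sustained traffic
# A 1500-byte packet is 12,000 bits, so the packet loss rate is ~1.2e-12.
seconds_per_loss = 1 / (BER * rate_bps)   # one errored bit per this many s
print(seconds_per_loss / 86400, "days")   # ~160 days
```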
Slide 27
Switched LightPaths [2]
Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP, …
E.g. with e-VLBI the data wave-front gets distorted and correlation fails
User-controlled lightpaths: Grid scheduling of CPUs & network; many application flows; no congestion on each path; lightweight framing possible
Slide 28
Test of TCP Sharing: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions
[Diagram: SLAC-CERN test setup: iperf or UDT flows through the TCP/UDP bottleneck, with 1/s ICMP/ping traffic; 2-minute and 4-minute regions. Les Cottrell & RHJ, PFLDnet 2005.]
Slide 29
Low performance on fast, long-distance paths: AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
Net effect: TCP recovers slowly and does not effectively use the available bandwidth, so poor throughput and unequal sharing (a quick calculation below)
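The scale of the problem is easy to see with a quick calculation (the path figures are the SLAC-CERN numbers from this test; the MSS is an assumption):

```python
rate_bps, rtt_s, mss_bytes = 1e9, 0.180, 1448    # SLAC-CERN path, assumed MSS
cwnd_pkts = rate_bps * rtt_s / (8 * mss_bytes)   # cwnd needed to fill the pipe
# After one loss, Reno halves cwnd and then adds back 1 packet per RTT:
recovery_rtts = cwnd_pkts / 2
print(f"recovery takes {recovery_rtts:.0f} RTTs"
      f" = {recovery_rtts * rtt_s / 60:.0f} minutes")   # ~23 minutes
```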
TCP Reno single stream
Congestion has a dramatic effect
Recovery is slow
Increase recovery rate
SLAC to CERN
RTT increases when TCP achieves its best throughput
Les Cottrell & RHJ PFLDnet 2005
Remaining flows do not take up slack when flow removed
Slide 30
Hamilton TCP: one of the best performers
Throughput is high; big effects on RTT when it achieves best throughput; flows share equally
Appears to need >1 flow to achieve best throughput
Two flows share equally (SLAC-CERN)
>2 flows appears less stable