© 2006 Open Grid Forum
Interactions Between Networks, Protocols & Applications
HPCN-RG
Richard Hughes-Jones
OGF20, Manchester, May 2007
Slide 2
ESLEA and UKLight at SC|05
Slide 3
ESLEA and UKLight
6 × 1 Gbit transatlantic Ethernet layer-2 paths: UKLight + NLR
Disk-to-disk transfers with bbcp, Seattle to UK: set TCP buffer / application to give ~850 Mbit/s; one stream of data achieved 840 Mbit/s
Stream UDP VLBI data UK to Seattle: 620 Mbit/s
No packet loss; worked well
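Setting the buffer to match the path comes down to the bandwidth-delay product. A minimal sketch of the underlying socket call (the bandwidth and RTT values are illustrative assumptions, not the measured SC|05 configuration; bbcp exposes its own options for this):

```python
import socket

# Assumed path parameters for illustration: 1 Gbit/s, ~140 ms RTT.
BANDWIDTH_BPS = 1_000_000_000
RTT_S = 0.140
bdp_bytes = int(BANDWIDTH_BPS * RTT_S / 8)   # ~17.5 Mbyte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers of roughly one BDP; the kernel may
# clamp these to its configured maxima (e.g. net.core.wmem_max).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("granted send buffer:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```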
[Figure: achieved rate (Mbit/s, 0-1000) vs time (16:00-23:00) for the four SC|05 hosts sc0501-sc0504, plus the aggregate UKLight SC|05 traffic (0-4500 Mbit/s) including the reverse TCP flow.]
Slide 4
SC|05 HEP: Moving data with bbcp
What is the end-host doing with your network protocol? Look at the PCI-X buses: 3Ware 9000 controller with RAID0, 1 Gbit Ethernet link, 2.4 GHz dual Xeon: ~660 Mbit/s
Power is needed in the end hosts, and careful application design
[Figure: PCI-X bus activity traces. RAID-controller bus: read from disk for 44 ms every 100 ms. Ethernet-NIC bus: write to network for 72 ms.]
Slide 5
SC2004: Disk-Disk bbftp
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; RTT 177 ms; SACK off
Move a 2 Gbyte file
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, with ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s works!
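The 22 Mbyte socket size matches the bandwidth-delay product of the 177 ms path at the 1 Gbit/s line rate (a worked check):

```latex
\mathrm{BDP} = B \times \mathrm{RTT}
             = 10^{9}\ \mathrm{bit/s} \times 0.177\ \mathrm{s}
             = 1.77\times10^{8}\ \mathrm{bit}
             \approx 22\ \mathrm{Mbyte}
```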
[Figure: Web100 plots for the two stacks: TCP achieved bandwidth (0-2500 Mbit/s) and CurCwnd vs time (0-20,000 ms), showing instantaneous BW, average BW and CurCwnd.]
Slide 6
[Figure: RAID0 six-disk 1 Gbyte write, 64k stripe, 3w8506-8: throughput (Mbit/s) vs trial number.]
Network & Disk Interactions
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0 with six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
Measure memory-to-RAID0 transfer rates with & without UDP traffic
[Figure: the same RAID0 write measurement with concurrent UDP traffic, for 1500-byte and 9000-byte MTU: throughput (Mbit/s) vs trial number.]
[Figure: % CPU system mode (L3+4) vs % CPU system mode (L1+2) for 8k and 64k writes, without and with 1500/9000 MTU UDP; the points follow y = 178 - 1.05x (fits y = -1.017x + 178.32 and y = -1.0479x + 174.44).]
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
Slide 7
Remote Computing Farms in the ATLAS TDAQ Experiment
Slide 8
ATLAS Remote Farms – Network Connectivity
Slide 9
ATLAS Remote Computing: Application Protocol
Event request: EFD requests an event from SFI; SFI replies with the event (~2 Mbytes)
Processing of the event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon: an instrumented TCP request-response program that emulates the EFD-to-SFI communication of the Event Filter.
[Diagram: message sequence between the Event Filter Daemon (EFD) and SFI/SFO over time: Request event, Send event data, Process event, Request buffer, Send OK, Send processed event; plus a histogram of request-response times.]
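tcpmon itself is an instrumented tool; the request-response pattern it emulates looks roughly like the sketch below (the host, port and use of Python are illustrative assumptions, not the actual tcpmon code):

```python
import socket
import time

REQUEST_SIZE = 64                  # small "Request event" message
RESPONSE_SIZE = 2 * 1024 * 1024    # ~2 Mbyte event, as in EFD <-> SFI

def request_events(host, port, n_events):
    """Emulate the EFD side: send a small request, time the big response."""
    s = socket.create_connection((host, port))
    for i in range(n_events):
        t0 = time.time()
        s.sendall(b"R" * REQUEST_SIZE)
        received = 0
        while received < RESPONSE_SIZE:
            chunk = s.recv(65536)
            if not chunk:
                raise ConnectionError("server closed the connection")
            received += len(chunk)
        print(f"event {i}: request-response time {time.time() - t0:.3f} s")
    s.close()

# request_events("sfi.example.org", 5000, 100)   # placeholder host/port
```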
Slide 10
TCP Activity Manc-CERN Req-Resp
[Figure: Web100 trace of DataBytesOut and DataBytesIn vs time.]
Round trip time 20 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP in slow start: the 1st event takes 19 RTT or ~380 ms
[Figure: DataBytesOut, DataBytesIn and CurCwnd vs time (ms), showing the congestion window collapsing between requests.]
The TCP congestion window gets reset on each request: the TCP stack follows RFC 2581 & RFC 2861, reducing Cwnd after inactivity
Even after 10 s, each response takes 13 RTT or ~260 ms
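On Linux this RFC 2861 behaviour can be switched off with the tcp_slow_start_after_idle sysctl; the presenters may have used a patched stack instead, but this is one standard way to get the "no cwnd reduction" behaviour shown on the next slide (needs root):

```python
# Equivalent to: sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# Stops the kernel from shrinking cwnd on connections that go idle.
with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
    f.write("0")
```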
[Figure: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms).]
Transfer achievable throughput: 120 Mbit/s
Slide 11
TCP Activity Manc-CERN Req-Resp, TCP stack with no cwnd reduction
Round trip time 20 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 19 RTT or ~380 ms
[Figure: DataBytesOut and DataBytesIn vs time.]
[Figure: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms); packets out/in and CurCwnd vs time (ms).]
The TCP congestion window grows nicely
Response takes 2 RTT after ~1.5 s; rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Data is transferred WHEN the application requires it
(First events take 3 round trips; later ones take 2.)
Slide 12
TCP Activity Alberta-CERN Req-Resp, TCP stack with no Cwnd reduction
Round trip time 150 ms
64-byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 11 RTT or ~1.67 s
The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
Response in 2 RTT after ~2.5 s; rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Figure: DataBytesOut and DataBytesIn vs time; TCP achieved bandwidth and Cwnd vs time (ms); packets out/in and CurCwnd vs time (ms).]
Slide 13
Moving Constant Bit-rate Data in Real-Time for Very Long Baseline Interferometry
Stephen Kershaw, Ralph Spencer, Matt Strong, Simon Casey, Richard Hughes-Jones
The University of Manchester
Slide 14
What is VLBI?
The VLBI signal wave front is sent as data over the network to the correlator
Resolution is set by the baseline length
Sensitivity: bandwidth B is as important as integration time τ (noise goes as 1/√(Bτ)); we can use as many gigabits as we can get!
Slide 15
European e-VLBI Test Topology
[Map: sites Onsala (Sweden), Chalmers University of Technology (Gothenburg), Jodrell Bank (UK), Dwingeloo (Netherlands), Medicina (Italy), Toruń (Poland), Metsähovi (Finland); links include a dedicated DWDM link, Gbit links and 2 × 1 Gbit links.]
Slide 16
CBR Test Setup
Slide 17
CBR over TCP
[Figure: two panels of message arrival time (s) vs message number for drop rates of 1 in 5k, 10k, 20k, 40k, and no loss. With loss the data are delayed; with no loss the data arrive on time.]
Effect of loss rate on message arrival time: TCP buffer 1.8 MB (BDP), RTT 27 ms, and TCP buffer 0.9 MB (BDP), RTT 15.2 ms
When there is packet loss, TCP decreases the rate
Can TCP deliver the data on time?
Slide 18
[Diagram: arrival time vs message number for a CBR stream: the expected-arrival line has slope 1/throughput; a packet loss delays the stream, which then resynchronises with the expected arrival time.]
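The bookkeeping behind this picture is simple: for a CBR stream, message i is expected at t0 + i × (message size / rate), so the expected-arrival line has slope 1/throughput, and any positive residual is delay added by TCP. A small sketch of that analysis (illustrative, not the actual test code):

```python
def arrival_delays(arrival_times, msg_bytes, rate_bps):
    """Delay of each message relative to its CBR-expected arrival time."""
    interval = msg_bytes * 8 / rate_bps        # CBR inter-message time (s)
    t0 = arrival_times[0]
    return [t - (t0 + i * interval) for i, t in enumerate(arrival_times)]

# Example: 1448-byte messages at 525 Mbit/s (the rates used in these tests);
# a loss around message 2 shows up as a step in the delay.
print(arrival_delays([0.0, 22.1e-6, 0.020, 0.020022], 1448, 525e6))
```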
Slide 19
[Figure: CurCwnd, achieved rate (Mbit/s), packets retransmitted and DupAcksIn vs time (s, 0-120).]
CBR over TCP – Large TCP Buffer
Message size 1448 bytes; data rate 525 Mbit/s; route Manchester - JIVE, RTT 15.2 ms
TCP buffer 160 MB; drop 1 in 1.12 million packets
Throughput increases: peak throughput ~734 Mbit/s, min. throughput ~252 Mbit/s
Slide 20
[Figure: one-way delay (ms, 0-3000) vs message number (2,000,000-5,000,000).]
CBR over TCP – Message Delay
Message size 1448 bytes; data rate 525 Mbit/s; route Manchester - JIVE, RTT 15.2 ms
TCP buffer 160 MB; drop 1 in 1.12 million packets
Peak delay ~2.5 s
Slide 21
Slide 22
Summary & Conclusions
Standard TCP is not optimum for high-throughput, long-distance links
Packet loss is a killer for TCP: check on-campus links & equipment and the access links to backbones; users need to collaborate with the campus network teams; DANTE PERT
New stacks are stable and give better response & performance: still need to set the TCP buffer sizes! Check other kernel settings, e.g. the window-scale maximum; watch for "TCP stack implementation enhancements"
TCP tries to be fair: large MTU has an advantage; short distances (small RTT) have an advantage
TCP does not share bandwidth well with other streams
The end hosts themselves: plenty of CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power; the interaction between hardware, protocol processing and the disk sub-system is complex
Application architecture & implementation are also important: the TCP protocol dynamics strongly influence the behaviour of the application
Users are now able to perform sustained 1 Gbit/s transfers
Slide 23
Any Questions?
Slide 24
Network switch limits behaviour: end-to-end UDP packets from udpmon
Only 700 Mbit/s throughput
Lots of packet loss
Packet loss distribution shows the throughput is limited
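udpmon's method, in outline: send a train of UDP frames with a fixed inter-frame spacing, then compare sent and received counts to get the wire rate and loss. A much-simplified sender sketch (the real udpmon paces with calibrated timing loops and carries its own test protocol; names and ports here are placeholders):

```python
import socket
import time

def udp_burst(host, port, n_frames, frame_bytes, spacing_us):
    """Send n_frames UDP packets with (approximately) fixed spacing."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytes(frame_bytes - 4)
    spacing = spacing_us * 1e-6
    t_next = time.perf_counter()
    for seq in range(n_frames):
        # A sequence number lets the receiver count loss and reordering.
        s.sendto(seq.to_bytes(4, "big") + payload, (host, port))
        t_next += spacing
        while time.perf_counter() < t_next:   # crude busy-wait pacing
            pass

# udp_burst("receiver.example.org", 14196, 10000, 1472, 12)  # placeholders
```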
[Figure: udpmon results for w05gva-gig6, 29 May 2004: received wire rate (Mbit/s) and % packet loss vs spacing between frames (µs) for packet sizes 50-1472 bytes; 1-way delay (µs) vs packet number at 12 µs wait.]
Slide 25
LightPath Topologies
Slide 26
Switched LightPaths [1]: Lightpaths are a fixed point-to-point path or circuit
Optical links (with FEC) have a BER of 10^-16, i.e. a packet loss rate of 10^-12, or about 1 loss in 160 days (a rough check below)
In SJ5, LightPaths are known as Bandwidth Channels
Host-to-host lightpath: one application, no congestion; advanced TCP stacks for large delay-bandwidth products
Lab-to-lab lightpaths: many applications share the path; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks
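A rough check of the "1 loss in about 160 days" figure (assuming losses come only from bit errors; the sustained rate is an assumption):

```python
BER = 1e-16            # bit error rate of the optical link with FEC
rate_bps = 7.2e8       # assume ~720 Mbit/s sustained traffic
# A 1500-byte packet is 12,000 bits, so the packet loss rate is ~1.2e-12.
seconds_per_loss = 1 / (BER * rate_bps)   # one errored bit per this many s
print(seconds_per_loss / 86400, "days")   # ~160 days
```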
Slide 27
Switched LightPaths [2]
Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP, …
E.g. with e-VLBI the data wave-front gets distorted and correlation fails
User-controlled lightpaths: Grid scheduling of CPUs & network; many application flows; no congestion on each path; lightweight framing possible
Slide 28
Test of TCP Sharing: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions
[Diagram: SLAC-CERN test setup: iperf or UDT flows through the TCP/UDP bottleneck, with 1/s ICMP/ping traffic; 2-minute and 4-minute regions. Les Cottrell & RHJ, PFLDnet 2005.]
Slide 29
Low performance on fast, long-distance paths: AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
Net effect: TCP recovers slowly and does not effectively use the available bandwidth, so poor throughput and unequal sharing (a quick calculation below)
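The scale of the problem is easy to see with a quick calculation (the path figures are the SLAC-CERN numbers from this test; the MSS is an assumption):

```python
rate_bps, rtt_s, mss_bytes = 1e9, 0.180, 1448    # SLAC-CERN path, assumed MSS
cwnd_pkts = rate_bps * rtt_s / (8 * mss_bytes)   # cwnd needed to fill the pipe
# After one loss, Reno halves cwnd and then adds back 1 packet per RTT:
recovery_rtts = cwnd_pkts / 2
print(f"recovery takes {recovery_rtts:.0f} RTTs"
      f" = {recovery_rtts * rtt_s / 60:.0f} minutes")   # ~23 minutes
```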
TCP Reno single stream
Congestion has a dramatic effect
Recovery is slow
Increase recovery rate
SLAC to CERN
RTT increases when TCP achieves its best throughput
Les Cottrell & RHJ PFLDnet 2005
Remaining flows do not take up slack when flow removed
Slide 30
Hamilton TCP: one of the best performers
Throughput is high; big effects on RTT when it achieves best throughput; flows share equally
Appears to need >1 flow to achieve best throughput
Two flows share equally (SLAC-CERN)
>2 flows appears less stable