Post on 16-Dec-2015
Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC
H.-W. Jin, S. Narravula, G. Brown,
K. Vaidyanathan, P. Balaji, and D.K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
{ jinhy, narravul, browngre, vaidyana, balaji, panda}@cse.ohio-state.edu
Contents
• Introduction
• WAN Emulator for Cluster-of-Clusters
• Performance Evaluation of RDMA over IP
• Conclusions and Future Work
Introduction
• Sockets over TCP/IP
• RDMA over LAN
  – InfiniBand, Myrinet, Quadrics
  – HPC middleware (MPI) and file systems (PVFS)
• RDMA over WAN
  – iWARP, RDDP
  – Grid and Internet applications
• RDMA-enabled Gigabit Ethernet NIC
  – Ammasso
Ammasso Gigabit Ethernet NIC
[Figure: software stack. Applications sit on top of two interfaces: the Sockets Interface and CCIL (Cluster Core Interface Lang.). Layers shown in the Operating System: Sockets, TCP, IP, Device Driver. Layers shown on the Ammasso Gigabit Ethernet NIC: RDMA, TOE (TCP/IP Offload Engine), Gigabit Ethernet.]
Problem Statement
• There have been no comprehensive quantitative evaluations of RDMA over a WAN environment
• How to Emulate the WAN Environment?
• What Kind of Performance Metrics?
• Sockets vs. CCIL
Experimental WAN Setup
[Figure: IP Network A and IP Network B, each attached to a GigE switch, are connected through a Linux workstation-based router (interfaces eth0 and eth1, device driver, IP layer) that performs the WAN emulation.]
WAN Emulator for Cluster-of-Clusters
• Characteristics of WAN Environments
  – High network delay
  – Packet loss
  – Etc.
• User-Level or Kernel-Level Emulator?
• Blocking or Queueing based Delay Adding?
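Degen: Delay generator
[Figure: Degen architecture on the router. Components shown: the eth0 and eth1 device drivers, the IP layer with its routing decision and the Degen Netfilter hook, the Degen kernel module, and the Degen daemon, together with the timestamp, delay queue, and reinjection steps.]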
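The timestamp, delay queue, and reinjection steps named in the Degen figure can be illustrated with a small user-level sketch. This is not the Degen kernel module; the class name `DelayQueue`, its methods, and the microsecond clock passed in by the caller are all assumptions made for illustration. It only shows the queueing-based approach: each packet is stamped on arrival and handed back once the configured delay has elapsed, so the forwarding path never blocks.

```python
from collections import deque


class DelayQueue:
    """Hold packets for a fixed delay before releasing them (hypothetical sketch)."""

    def __init__(self, delay_us):
        self.delay_us = delay_us
        # (release_time_us, packet) pairs. With a constant delay and a clock
        # that never goes backwards, arrival order is also release order.
        self._pending = deque()

    def enqueue(self, packet, now_us):
        # Timestamp the packet on arrival; it becomes due delay_us later.
        self._pending.append((now_us + self.delay_us, packet))

    def release_due(self, now_us):
        # Return, in arrival order, every packet whose delay has elapsed.
        due = []
        while self._pending and self._pending[0][0] <= now_us:
            due.append(self._pending.popleft()[1])
        return due


q = DelayQueue(delay_us=4000)        # emulate a 4 ms delay
q.enqueue("pkt-a", now_us=0)
q.enqueue("pkt-b", now_us=1000)
print(q.release_due(now_us=3999))    # nothing is due yet
print(q.release_due(now_us=4000))    # pkt-a is due
print(q.release_due(now_us=5000))    # pkt-b is due
```

A blocking design would instead sleep for the delay before forwarding each packet, which serializes packets and caps throughput; the queue keeps the added delay independent of the packet rate.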
Kernel Patch for CCIL WAN Communication
• Ammasso Setup
  – Ammasso 1100
  – Ammasso software version amso1100-1.2-ga2
• Packet Drops for CCIL WAN Communication
  – Timeout
  – Retransmission
• Kernel Patch on Router
Contents
• Introduction
• WAN Emulator for Cluster-of-Clusters
• Performance Evaluation of RDMA over IP
  – Basic communication latency
  – Computation and communication overlap
  – Communication progress
  – CPU resource requirements
  – Unification of communication interface
  – Bandwidth (throughput)
• Conclusions and Future Work
Basic Communication Latency
[Figure: two panels comparing Sockets and CCIL. Left: latency (us) against message size (4 to 16384 bytes). Right: latency (us) against network delay (0, 1, 2, 4, 8 ms), with a 1KB message size.]
• No impact of zero-copy on the basic communication latency
• Basic communication latency is not an important metric
Computation and Communication Overlap
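[Figure: test setup. Nodes n0 and n1 are attached to separate switches that are connected through the router. The timeline shows Send, Computation (t1), and Receive within the Total Time (t2).]
Overlap Ratio = t1 / t2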
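As a small worked example of the overlap ratio t1 / t2 defined on this slide, the sketch below computes it from two timings. The function name `overlap_ratio` and the sample values are hypothetical and are not measurements from the slides. A ratio close to 1 means the total time is barely longer than the computation, so the communication is almost fully hidden behind it.

```python
def overlap_ratio(computation, total):
    """Overlap Ratio = t1 / t2 (t1: computation time, t2: total time, same unit)."""
    if total <= 0:
        raise ValueError("total time must be positive")
    if computation < 0 or computation > total:
        raise ValueError("computation time must lie within the total time")
    return computation / total


# Hypothetical timings in milliseconds, chosen only to illustrate the formula.
print(overlap_ratio(242, 250))   # communication mostly hidden behind computation
print(overlap_ratio(242, 484))   # half of the total time is not overlapped
```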
Computation and Communication Overlap
[Figure: two panels comparing Sockets and CCIL. One plots overlap ratio (0 to 1) against computation time (0 to 422 ms); the other plots overlap ratio against network delay (0, 1, 2, 4, 8 ms). Panel labels: 1KB Message Size; 242ms Computation. Annotations as extracted: 1098%, 114%.]
• RDMA can achieve a better computation and communication overlap
• Its benefit reduces as the network delay increases
Communication Progress
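[Figure: test setup. Nodes n0 and n1 are attached to separate switches that are connected through the router. The timeline shows a Request, a Response Delay by Load, and the Response; the labeled interval is the Data Fetching Latency.]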
Communication Progress
[Figure: two panels comparing Sockets and CCIL, with latency (us) on a logarithmic axis. One plots latency against response delay by load (0, 1, 4, 16, 64 ms); the other plots latency against network delay (0, 1, 2, 4, 8 ms). Panel labels: 16ms Response Delay; 1KB Message Size. Annotations: 98%, 65%.]
• RDMA can achieve better communication progress
• Its benefit reduces as the network delay increases
CPU Resource Requirements
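[Figure: test setup. Nodes n0 and n1 are attached to separate switches that are connected through the router, with 40 streams between them and an application running alongside. The measured quantity is the application execution time.]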
CPU Resource Requirements
[Figure: two panels comparing Sockets and CCIL. One plots execution time (sec) against message size (1K to 16K bytes); the other plots execution time (sec) against network delay (0, 1, 2, 4, 8 ms), with a 16KB message size.]
• RDMA-based communication does not affect the application execution time
• RDMA has strong potential to save CPU resources
Unification of Communication Interface
[Figure: two clusters, each with its own switch, illustrating inter-cluster and intra-cluster communication, next to a plot of latency (us) against message size (4 to 16384 bytes) for Sockets and CCIL. Annotation: 38%.]
• RDMA over IP can provide a unified communication interface
• RDMA can achieve lower latency for intra-cluster communication
Bandwidth
• Where is the bottleneck?
  – Ethernet devices on the router
  – TCP window size
[Figure: two panels comparing Sockets and CCIL. One plots bandwidth (Mbps) against message size (4 to 16384 bytes); the other plots bandwidth (Mbps) against network delay (0, 1, 2, 4, 8 ms), with a 16KB message size.]
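To see why the TCP window size is listed as a possible bottleneck, recall the standard bound for a single connection: at most one window of data can be in flight per round trip, so throughput cannot exceed window / RTT. The sketch below evaluates that bound. The function name `window_limited_mbps`, the 64 KiB window, and the round-trip times are assumptions for illustration only; the slides do not state the window size used or how the emulated delay maps to round-trip time.

```python
def window_limited_mbps(window_bytes, rtt_ms):
    """Upper bound on one connection's throughput: one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("round-trip time must be positive")
    bits_per_round_trip = window_bytes * 8
    bits_per_second = bits_per_round_trip / (rtt_ms / 1000.0)
    return bits_per_second / 1e6


# Hypothetical 64 KiB window; round-trip times are illustrative values.
for rtt_ms in (1, 2, 4, 8, 16):
    bound = window_limited_mbps(64 * 1024, rtt_ms)
    print(f"RTT {rtt_ms:2d} ms -> at most {bound:.1f} Mbps")
```

Under this bound, doubling the round-trip time halves the achievable throughput unless the window grows with it.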
Conclusions
• The first quantitative study of RDMA over IP on a WAN setup
• WAN Emulator for Cluster-of-Clusters
  – Degen
• RDMA over IP Can
  – Save CPU resources on the server side, even in a high-delay WAN environment
  – Achieve better
    • computation and communication overlap
    • communication progress
    • peak bandwidth
  – Provide a unified interface
Future Work
• Performance Evaluations
  – Other performance factors
    • impact of address exchange
    • bandwidth
  – Application-level performance
• WAN Emulator for Cluster-of-Clusters
  – Delay model
  – Other components
• RDMA-aware Middleware for Widely Distributed Systems over WAN
Acknowledgements
Our research is supported by the following organizations:
• Current Funding support by
• Current Equipment donations by