Using TCP/IP traffic shaping to achieve iSCSI service predictability
Paper presentation
Jarle Bjørgeengen
University of Oslo / USIT
November 11, 2010
Outline
- About resource sharing in storage devices
- Lab setup / job setup
- Experiment illustrating the problem
- One half of the solution: the throttle
- Live demo
- Part two of the solution: the controller
- How the controller works
- Conclusion and future work
General problem of sharing resources
[Figure: consumers accessing virtual disks in a centralized storage pool through a SAN; the virtual disks compete for shared physical resources.]
- Free competition causes unpredictable I/O performance for any given consumer.
- Remaining capacity affects performance.
- Storage is managed by sysadmins.
- Sysadmins are unable to make keepable promises about performance.
Lab setup
[Figure: lab setup. Striped logical volumes lv_b2..lv_b5 (64KB stripe size across 10 disks; HP SC10 enclosure, 10 x 36GB 10k disks) in volume group vg_perc are exported at the block layer by the iSCSI target (iet) as iqn.iscsilab:perc_b2..perc_b5 over TCP connections to the initiators b2..b5, which see them as /dev/iscsi_0. A monitor host bm issues the measured reads, and Argus captures traffic at the TCP/IP layer.]
Is read response time affected by write activity?
[Figure: job setup. The monitor host bm runs a random read job at a fixed rate of 256 kB/s against its logical volume, while three other initiators run sequential writes at full speed against theirs.]
The answer is yes
Long response times adversely affect application service availability.
[Plot: wait time (ms) vs. time (s) for the read job: no interference, 1 thread (1 machine), 3 threads (3 machines), 12 threads (3 machines).]
Throttling method
[Figure: initiator/target write timelines. Left: after the TCP handshake (SYN, SYN+ACK, ACK), each write is acknowledged immediately. Right: a throttling delay is inserted before each ACK, so the same sequence of writes takes longer.]
Relation between packet delay and average rate
[Plots: time to read 200MB and time to write 200MB of data (s) vs. introduced delay (0 to 9.6 ms).]
Write rate falls from 15 MB/s to 2.5 MB/s; read rate from 22 MB/s to 5 MB/s.
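Why a fixed per-packet delay translates into a rate: TCP allows at most one window of unacknowledged data in flight, so (a simplified model, not taken from the slides) the achievable throughput is bounded by

\[ \text{rate} \lesssim \frac{W}{RTT_0 + d} \]

where $W$ is the effective TCP window, $RTT_0$ the undelayed round-trip time, and $d$ the introduced packet delay. Growing $d$ lowers the ceiling smoothly, consistent with the curves above.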
Managing consumers
- Need to operate on sets of consumers (throttlable = {10.0.0.243, 10.0.0.244})
- ipset: one rule to match them all
ipset -N $throttlable ipmap --network 10.0.0.0/24
ipset -A $throttlable 10.0.0.243
ipset -A $throttlable 10.0.0.244
iptables -t mangle -A POSTROUTING -m set --match-set $throttlable dst -j MARK --set-mark $mark
The mark selects a step in the range of available packet delays (see the tc filter sketch below).
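A minimal sketch (interface name is an assumption; handles match the delay-queue tree shown later) of how a mark value is steered into its delay class with a tc fw filter:

DEV=eth1      # assumed iSCSI-facing interface
mark=10       # delay step chosen for this set of consumers
# packets carrying fwmark 10 are queued in class 1:10, whose netem child adds the delay
tc filter add dev $DEV parent 1: protocol ip prio 1 handle $mark fw flowid 1:$mark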
Live demonstration
- Manual throttling and QoS specification
- An automatic QoS policy and automated throttling
Dynamic throttling decision
Figure: block diagram of a PID controller. Created by SilverStar (at) en.wikipedia. Licensed under the terms of Creative Commons Attribution 2.5 Generic.
Modified PID function
[Flowchart: one iteration of the modified PID function, with sample period $T_s$ and output limit $U_{k,max}$:]
- $U_p = K_p \, e_k$
- $U_i = U_{i,k-1} + \frac{T_s K_p}{T_i} e_k$; if $U_i < 0$ set $U_i = 0$, and if $U_i > U_{k,max}$ set $U_i = U_{i,k-1}$ (integral anti-windup)
- $U_d = K_p T_d \frac{e_k - e_{k-1}}{T_s}$
- $U_k = U_p + U_i + U_d$, clamped to $[0, U_{k,max}]$
- $mark = \mathrm{int}(\lceil U_k \rceil)$
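A hypothetical bash/awk sketch of one controller iteration (the actual controller is pid_reg.pl, which is not shown in the slides; gains and limits here are placeholder values):

# One step of the modified PID function with integral anti-windup.
# E_PREV and UI_PREV carry controller state between iterations.
pid_step() {   # $1 = error e_k, e.g. measured wait time minus the threshold
  awk -v ek="$1" -v ekp="${E_PREV:-0}" -v uip="${UI_PREV:-0}" \
      -v kp=0.5 -v ti=2.0 -v td=0.1 -v ts=1.0 -v umax=21 'BEGIN {
    up = kp * ek                       # proportional term
    ui = uip + ts * kp / ti * ek       # integral term
    if (ui < 0) ui = 0                 # clamp: no negative windup
    else if (ui > umax) ui = uip       # clamp: hold previous value
    ud = kp * td * (ek - ekp) / ts     # derivative term
    uk = up + ui + ud
    if (uk < 0) uk = 0; if (uk > umax) uk = umax
    mark = int(uk); if (uk > mark) mark++   # mark = int(ceil(Uk))
    print mark, ui
  }'
}
# Usage: read mark UI_PREV <<< "$(pid_step "$err")"; E_PREV=$err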
The completely automated approach
[Figure: process architecture. set_maintaner.pl reads /proc/net/iet/sessions, /proc/net/iet/volumes and the output of lvs to create the ISCSIMAP shared memory segment and to create and maintain the IP-set members. perf_maintainer.pl reads saturation indicators from /proc/diskstats into the PDATA shared memory segment. pid_reg.pl reads both segments and spawns one pid_thread per resource; the threads apply the throttles and publish throttle values in the CMEM segment, which perf_server.pl reads for plotting with gnuplot. Legend: files, shared memory, processes.]
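For illustration, a sketch of how the target-to-initiator mapping might be pulled from the iSCSI Enterprise Target proc interface (the exact file format is an assumption; set_maintaner.pl itself is not shown in the slides):

# Print "target-IQN initiator-IP" pairs from the iet session list.
awk '/^tid:/ { target = substr($2, 6) }           # e.g. "tid:1 name:iqn.iscsilab:perc_b2"
     /ip:/   { for (i = 1; i <= NF; i++)          # e.g. "cid:0 ip:10.0.0.243 state:active"
                 if ($i ~ /^ip:/) print target, substr($i, 4) }' \
    /proc/net/iet/sessions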
Impact
- The packet delay throttle is very efficient.
- It solves the throttling need completely for iSCSI (and likely for other TCP-based storage networks too).
- The modified PID controller consistently keeps response time low in spite of rapidly changing load interference.
- The concept is widely applicable.
Future work
[Figure: proposed QoS bridge placed between the iSCSI disk array and the Ethernet switch facing the consumers, fed with resource/consumer maps and virtual disk latencies by an array-specific plugin using SNMPGET.]
- Packet delay throttle with other algorithms
- PID controller with other throttles
Thanks for your attention!
Overhead
- Negligible overhead introduced by the tc filters
- Differences measured 20 times
- A t-test at 99% confidence shows 0.4% / 1.7% overhead for read/write (worst case)
Is response time improved by throttling?
[Plot: small job average wait time (ms, left axis) and aggregated interference throughput (kB/s, right axis) vs. time (s); throttling periods with 4.6 ms and 9.6 ms delay are marked.]
Automatically controlled wait time
[Plot: average wait time (ms) vs. time (s): no regulation, and thresholds of 20 ms, 15 ms, and 10 ms.]
The throttled rates
[Plot: aggregate write rate (kB/s) vs. time (s): no regulation, and smoothed curves for the 20 ms, 15 ms, and 10 ms thresholds.]
Exposing the throttling value
[Plot: vg_aic read wait time under automatic regulation (threshold 15 ms), the packet delay introduced to writers (ms), and the aggregated write rate (kB/s) vs. time (s).]
Effect of the packet delay throttle: Reads
[Plot: read rate (kB/s) vs. time (s) for initiators b2, b3, b4, and b5.]
Effect of the packet delay throttle: Writes
[Plot: write rate (kB/s) vs. time (s) for initiators b2, b3, b4, and b5.]
The tc delay queues
[Figure: the tc delay queue tree. An htb root qdisc (1:, r2q 10, default 1) holds 20 identical leaf classes 1:2 through 1:21 (rate/ceil 8000Mbit, quantum 200000); each class feeds a netem qdisc (limit 1000, handles 12: through 121:) adding a fixed delay that steps from 99us and 598us up through 1.1 ms, 1.6 ms, and so on to 9.6 ms in 0.5 ms increments.]
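A sketch of how a tree like this can be generated (interface name is an assumption; delays step by 0.5 ms per mark, matching the figure):

DEV=eth1   # assumed iSCSI-facing interface
tc qdisc add dev $DEV root handle 1: htb default 1 r2q 10
for i in $(seq 2 21); do
  d=$(awk -v i=$i 'BEGIN { printf "%.1f", 0.1 + 0.5 * (i - 2) }')   # 0.1ms .. 9.6ms
  tc class add dev $DEV parent 1: classid 1:$i htb rate 8000mbit ceil 8000mbit
  tc qdisc add dev $DEV parent 1:$i handle 1$i: netem limit 1000 delay ${d}ms
  # fwmark i (set by the iptables MARK rule) selects delay step i
  tc filter add dev $DEV parent 1: protocol ip prio 1 handle $i fw flowid 1:$i
done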
The tc bandwidth queues
[Figure: the tc bandwidth queue tree. An htb root qdisc (1:, r2q 10, default 1) with a 1000Mbit parent class 1:1 and 24 leaf classes 1:2 through 1:25 whose rate/ceil steps down from 950000Kbit in 50000Kbit steps to 50000Kbit, then through 45000, 35000, 25000 and 15000 to 5000Kbit.]
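For comparison, a sketch of building the rate ladder the same way (values abbreviated; DEV and the fwmark-to-class selection as for the delay queues):

tc qdisc add dev $DEV root handle 1: htb default 1 r2q 10
tc class add dev $DEV parent 1: classid 1:1 htb rate 1000mbit ceil 1000mbit
tc class add dev $DEV parent 1:1 classid 1:2 htb rate 950mbit ceil 950mbit
tc filter add dev $DEV parent 1: protocol ip prio 1 handle 2 fw flowid 1:2
# ...and so on down to classid 1:25 at 5mbit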
Input signal
[Plot: raw wait time (ms) vs. time (s), with two smoothed curves.]
Red: Exponentially Weighted Moving Average (EWMA). Green: moving median.
$L(t) = l(t)\,\alpha + L(t-1)(1-\alpha)$
The EWMA is also called a low-pass filter.
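A one-line sketch of the EWMA applied to a stream of wait-time samples (file name and the alpha value are assumptions):

awk -v alpha=0.2 '{ L = (NR == 1) ? $1 : alpha * $1 + (1 - alpha) * L; print L }' wait_times.txt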
\[ u(t) = \underbrace{K_p e(t)}_{\text{Proportional}} + \underbrace{\frac{K_p}{T_i} \int_0^t e(\tau)\, d\tau}_{\text{Integral}} + \underbrace{K_p T_d\, e'(t)}_{\text{Derivative}} \qquad \text{(continuous form)} \]

\[ u_k = \underbrace{u_{k-1}}_{\text{Previous}} + K_p \Big(1 + \frac{T}{T_i}\Big) e_k - K_p e_{k-1} + \underbrace{\frac{K_p T_d}{T} \big(e_k - 2 e_{k-1} + e_{k-2}\big)}_{\text{Delta}} \qquad \text{(incremental form)} \]

\[ u_k = \underbrace{K_p e_k}_{\text{Proportional}} + \underbrace{u_{i(k-1)} + \frac{K_p T}{T_i} e_k}_{\text{Integral}} + \underbrace{\frac{K_p T_d}{T} \big(e_k - e_{k-1}\big)}_{\text{Derivative}} \qquad \text{(absolute form)} \]