Solving the TCP-incast Problem with Application-Level Scheduling
Maxim Podlesny, University of Waterloo
Carey Williamson, University of Calgary
Motivation
• Emerging IT paradigms
– Data centers, grid computing, HPC, multi-core
– Cluster-based storage systems, SAN, NAS
– Large-scale data management "in the cloud"
– Data manipulation via "service-oriented computing"
• Cost and efficiency advantages from IT trends, economies of scale, and marketplace specialization
• Performance advantages from parallelism
– Partition/aggregation, MapReduce, BigTable, Hadoop
– Think RAID at Internet scale! (1000x)
Problem Statement
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on a bottleneck link
• Limited switch buffer size (e.g., 32 KB)

How to provide high goodput for data center applications?

[Figure: N synchronized servers share the bottleneck link; TCP retransmission timeouts lead to TCP throughput degradation.]
Related Work

• E. Krevat et al., "On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems", Proceedings of Supercomputing 2007
• A. Phanishayee et al., "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", Proceedings of FAST 2008
• Y. Chen et al., "Understanding TCP Incast Throughput Collapse in Datacenter Networks", Proceedings of WREN 2009
• V. Vasudevan et al., "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication", Proceedings of ACM SIGCOMM 2009
• M. Alizadeh et al., "Data Center TCP (DCTCP)", Proceedings of ACM SIGCOMM 2010
• A. Shpiner et al., "A Switch-based Approach to Throughput Collapse and Starvation in Data Centers", Proceedings of IWQoS 2010
Summary of Related Work

• Data centers have specific network characteristics
• The TCP-incast throughput-collapse problem emerges
• Possible solutions:
– Tweak TCP timers and/or parameters for this environment
– Redesign (or replace!) TCP in this environment
– Rewrite applications for this environment (Facebook)
– Increase switch buffer sizes (extra queueing delay!)
– Smart edge coordination for uploads/downloads
Data Center System Model
[Figure: N servers (1, 2, 3, …, N) connected through a switch to a single client. A logical data block of size S (e.g., 1 MB) is striped across the servers; each server returns one Server Request Unit (SRU) (e.g., 32 KB) in data packets of size S_DATA. The switch has a small buffer B and link capacity C.]
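
A minimal sketch (Python, mine) collecting the model parameters above, using example values that appear in these slides; the class and field names are illustrative, not from the paper:

    from dataclasses import dataclass

    @dataclass
    class DataCenterModel:
        n_servers: int      # N: servers striping the block
        block_size: int     # S: logical data block size, bytes (e.g., 1 MB)
        sru_size: int       # SRU: bytes returned per server (e.g., 32 KB)
        packet_size: int    # S_DATA: data packet size, bytes
        buffer_size: int    # B: switch buffer size, bytes (e.g., 32 KB)
        capacity: float     # C: link capacity, bits per second

    # Example values from the slides; N = S / SRU = 32 for a fixed 32 KB SRU
    model = DataCenterModel(n_servers=32, block_size=2**20,
                            sru_size=32 * 2**10, packet_size=2**10,
                            buffer_size=32 * 2**10, capacity=1e9)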
Performance Comparisons
Internet vs. data center network:
• Internet propagation delay: 10–100 ms
• Data center propagation delay: 0.1 ms
• Packet size 1 KB, link capacity 1 Gbps → packet transmission time ≈ 0.01 ms
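
A quick check of the transmission-time arithmetic (Python):

    # 1 KB packet on a 1 Gbps link
    print(8 * 1024 / 1e9)  # 8.192e-06 s, i.e., ~0.008 ms, rounded to 0.01 ms above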
Analysis Overview (1 of 2)

• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss
• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling
Analysis Overview (2 of 2)

• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss
– Determine flow size in packets (based on SRU and MSS)
– Determine maximum outstanding packets per flow (Wmax)
– Determine maximum flow concurrency (based on B and Wmax)
• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling
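
A sketch of the concurrency bound (Python), under my simplifying assumption that avoiding loss means the peak outstanding data of all concurrent flows fits in the switch buffer B (buffer drain during the RTT is ignored):

    # n flows, each with up to Wmax outstanding packets of size S_DATA,
    # must together fit in the switch buffer B.
    def max_concurrency(buffer_bytes: int, w_max_pkts: int, packet_bytes: int) -> int:
        return buffer_bytes // (w_max_pkts * packet_bytes)

    # e.g., B = 32 KB, Wmax = 11 packets, S_DATA = 1 KB  ->  n = 2
    print(max_concurrency(32 * 1024, 11, 1024))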
Determining Wmax

• Recall TCP slow-start dynamics:
– Initial TCP congestion window (cwnd) is 1 packet
– ACKs cause cwnd to double every RTT (1, 2, 4, 8, 16, …)
• Consider a TCP transfer of an arbitrary SRU (e.g., 21 packets)
• Determine the peak power-of-2 cwnd value (WA)
• Determine the "residual window" for the last RTT (WB)
• Wmax depends on both WA and WB (e.g., WA + WB/2)
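
A sketch of the Wmax computation (Python), using the example combination rule WA + WB/2 given on this slide; the paper's exact rule may differ:

    def w_max(flow_size_pkts: int) -> float:
        # Slow start: cwnd doubles each RTT (1, 2, 4, ...) until the next
        # full window would overshoot the remaining packets.
        cwnd, sent = 1, 0
        while sent + cwnd <= flow_size_pkts:
            sent += cwnd
            cwnd *= 2
        wa = cwnd // 2                # peak power-of-2 window fully used
        wb = flow_size_pkts - sent    # residual packets in the last RTT
        return wa + wb / 2            # example rule from this slide

    print(w_max(21))  # WA = 8, WB = 6  ->  Wmax = 11.0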
Scheduling Overview
[Figure: the N servers are partitioned into k groups of at most n servers each, with groups responding one after another.]
Scheduling Details
Using lossless scheduling of server responses: at most n servers responding simultaneously, with k groups of responding servers scheduled.

Server i (1 ≤ i ≤ N) starts responding at:

t_i = (⌈i/n⌉ − 1) · T̃

where ⌈i/n⌉ is the group of server i and T̃ is the SRU completion time used for scheduling.
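
A sketch of the start-time computation (Python), following the group-offset form above:

    import math

    def start_time(i: int, n: int, t_sched: float) -> float:
        group = math.ceil(i / n)        # 1-based group index of server i
        return (group - 1) * t_sched    # groups are staggered by T_sched

    # e.g., n = 100 servers per group, T_sched = 1 ms: server 250 is in group 3
    print(start_time(250, 100, 0.001))  # 0.002 s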
Theoretical Results
Maximum goodput of an application in a data center with lossless scheduling is:

g = S / ((k − 1) · T̃ + T + d_max)

where:
• S – size of a logical data block
• T – actual completion time of an SRU
• T̃ – SRU completion time used for scheduling
• k – number of groups of servers to use
• d_max – real-system scheduling variance
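
A sketch evaluating the goodput formula (Python); the values below are illustrative, not from the slides:

    def goodput(block_bits: float, k: int, t_sched: float,
                t_actual: float, d_max: float) -> float:
        # g = S / ((k - 1) * T~ + T + d_max)
        return block_bits / ((k - 1) * t_sched + t_actual + d_max)

    # 1 MB block, 4 groups, 2 ms scheduled and actual SRU times, 0.5 ms variance
    g = goodput(8 * 2**20, 4, 0.002, 0.002, 0.0005)
    print(f"{g / 1e9:.3f} Gbps")  # ~0.987 Gbps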
Solution: Analytical Model Results
Results for 10 KB Fixed SRU Size (1 of 2)
Results for 10 KB Fixed SRU Size (2 of 2)
Results for Varied SRU Size (1 MB / N)
Effect of TCP Timer Granularity
Summary and Conclusion
• Application-level scheduling for TCP-incast throughput collapse
• Main idea: scheduling server responses so that there are no losses
• Maximum goodput with lossless scheduling
• Non-monotonic goodput, highly sensitive to network configuration parameters
Future Work
• Implementing and testing our solution in real data centers
• Evaluating our solution for different application traffic scenarios