8/6/2019 FEEDBACK–BASED TWO-STAGE SWITCH
http://slidepdf.com/reader/full/feedbackbased-two-stage-switch 1/177
Title: Feedback-based two-stage switch architecture for high-speed router design
Author(s): Hu, Bing
Citation:
Issue Date: 2010
URL: http://hdl.handle.net/10722/56798
Rights: unrestricted
FEEDBACK–BASED TWO-STAGE SWITCH
ARCHITECTURE FOR HIGH SPEED ROUTER
DESIGN
BY
HU BING
PH.D. THESIS
DECEMBER 2009
Abstract of thesis entitled
Feedback–Based Two-Stage Switch Architecture for
High Speed Router Design
submitted by
Hu Bing
for the degree of Doctor of Philosophy
at The University of Hong Kong
in December 2009
Due to the widespread use of WDM technology in optical fiber, transmission
capacity has increased sharply, while the processing capacity of current commercial
routers has grown only slowly. This speed mismatch between fiber and router creates
a pressing need for building next generation high-speed routers. A major bottleneck in
high-speed router design is the switch architecture, which determines how packets are
moved from one linecard to another. In this thesis, we focus on designing efficient
and scalable switch architectures to enable the next generation high-speed routers.
A load-balanced two-stage switch configures its two switch fabrics according
to a pre-determined and periodic sequence of switch configurations. It is attractive
because no centralized scheduler is required and close to 100% throughput can be
obtained. But it also faces two major challenges: packet mis-sequencing and poor
delay performance. In this thesis, we propose a feedback-based two-stage switch
architecture to simultaneously address these two challenges. Notably, we require
only a single-packet buffer for each middle-stage port VOQ. This greatly cuts
down the average packet delay. At the same time, in-order packet delivery and high
throughput are ensured by properly selecting and coordinating the two sequences of
switch configurations. As compared with the existing load-balanced switch
architectures and scheduling algorithms, our feedback-based switch imposes a
modest requirement on switch hardware, yet consistently yields the best delay-
throughput performance.
To further enhance the performance of the feedback-based switch, original
extensions and refinements are made. Specifically, a three-stage switch architecture
is proposed for further cutting down the average packet delay. A feedback
suppression scheme is designed for reducing the communication overhead. A
multicast scheduling algorithm is invented for carrying multicast traffic using the
same unicast switch fabric. A batch scheduler is devised for multi-cabinet
implementation of the feedback-based switch. To address the fairness issue in
handling inadmissible traffic patterns, a fair scheduler is designed for allocating the
bandwidth of over-subscribed outputs based on the max-min fairness criterion. Last
but not least, an optical implementation of the feedback-based two-stage switch is
proposed.
Feedback–Based Two-Stage Switch Architecture for
High Speed Router Design
by
Hu Bing
B.Eng., M.Phil., UESTC
A thesis submitted in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy
at The University of Hong Kong
December 2009
Declaration
I declare that this thesis represents my own work, except where due
acknowledgement is made, and that it has not been previously included in a thesis,
dissertation or report submitted to this University or to any other institution for a
degree, diploma or other qualification.
Signed _________________________________
Hu Bing
Acknowledgments
First, I would like to express my deep gratitude to my research supervisor,
Doctor Kwan L. Yeung, for his guidance and encouragement throughout my graduate
study. Doctor Yeung's unreserved support covers every detail of my research work,
from teaching me research methodologies to taking pains to polish papers. His
instructions and infinite patience were essential for completing this thesis. I feel
privileged to have had this opportunity to study under his supervision.
I thank the Electrical and Electronic Engineering Department at the
University of Hong Kong, for creating such a great education and research
environment. I thank all staff members in the department for their kind help and
warm assistance. I also thank the University of Hong Kong for its financial support,
which enabled me to complete my Ph.D. study. My thanks also go to my lab-mates and
friends whose encouragement and help are essential.
Finally, I have been incredibly fortunate to have the endless support of my dear
parents, both materially and spiritually.
Table of Contents
Declaration .......................................................................................................... i
Acknowledgments ............................................................................................... ii
Table of Contents ................................................................................................ iii
List of Figures ..................................................................................................... viii
List of Symbols ................................................................................................... xi
List of Abbreviations .......................................................................................... xiv
Chapter 1 Introduction
1.1 Overview of Routers .......................................................................... 1
1.2 Switch Architectures .......................................................................... 7
1.2.1 Output-queued Switches ........................................................ 8
1.2.2 Input-queued Switches ........................................................... 8
1.2.3 CIOQ and Buffered Crossbar Switches ................................. 10
1.2.4 Load-Balanced Two-Stage Switches ..................................... 12
1.3 Contributions ..................................................................................... 13
1.4 Thesis Overview ................................................................................ 16
Chapter 2 Feedback-Based Two-Stage Switch Design
2.1 Introduction ....................................................................................... 18
2.2 Related Work ..................................................................................... 22
2.2.1 Using Re-sequencing Buffers ................................................ 22
2.2.2 Preventing Packets from Becoming Mis-sequenced ............. 23
2.3 Feedback-Based Two-Stage Switch .................................................. 26
2.3.1 Some Observations and Motivations ..................................... 26
2.3.2 Designing Scalable Feedback Mechanism ............................. 28
2.3.3 Solving Packet Mis-sequencing Problem .............................. 31
2.3.4 Feedback-Based Scheduling Algorithms ............................... 34
2.4 Performance Evaluations ................................................................... 36
2.4.1 Performance under Uniform Traffic ...................................... 37
2.4.2 Performance under Uniform Bursty Traffic ........................... 38
2.4.3 Performance under Hotspot Traffic ....................................... 39
2.5 The Stability of Feedback-Based Two-Stage Switch ........................ 41
2.5.1 The Existing Approaches ......................................................... 41
2.5.2 Fluid Model for Feedback-Based Two-Stage Switch ............. 42
2.5.3 100% Throughput Proof .......................................................... 45
2.6 Chapter Summary .............................................................................. 49
Chapter 3 Cutting Down Average Packet Delay
3.1 Introduction ....................................................................................... 50
3.2 Optimal Joint Sequence Design .......................................................... 52
3.2.1 In-order Packet Delivery Only ............................................... 53
3.2.2 Both In-order Packet Delivery and Staggered Symmetry ...... 59
3.2.3 Finding the Number of Different Joint Sequences ................. 61
3.2.4 Discussions ............................................................................. 63
3.3 Three-Stage Switch ............................................................................ 64
3.3.1 Three-Stage Switch Architecture ........................................... 64
3.3.2 Traffic Matrix Estimation ....................................................... 69
3.3.3 Performance Evaluations ....................................................... 70
3.4 Chapter Summary .............................................................................. 73
Chapter 4 Cutting Down Communication Overhead
4.1 Introduction ....................................................................................... 74
4.2 Feedback Suppression Algorithms .................................................... 75
4.2.1 Set-based Feedback (Set-feedback) ....................................... 77
4.2.2 Queue-based Feedback Version 1 (Q-feedback-1) ................ 78
4.2.3 Queue-based Feedback Version 2 (Q-feedback-2) ................ 79
4.3 Performance Evaluations ................................................................... 80
4.3.1 Performance under Uniform Traffic ...................................... 81
4.3.2 Performance under Uniform Bursty Traffic ........................... 82
4.3.3 Performance under Hotspot Traffic ....................................... 82
4.3.4 Performance under Different Switch Size N .......................... 83
4.4 Chapter Summary .............................................................................. 84
Chapter 5 Supporting Multicast Traffic
5.1 Introduction ....................................................................................... 85
5.2 Related Work ..................................................................................... 87
5.2.1 Multicast Switches Based on Bufferless Switch Fabrics ....... 87
5.2.2 Buffered Crossbar Based Multicast Switches ........................ 89
5.3 Multicast Scheduling in Feedback-Based Two-Stage Switch ........... 90
5.3.1 Multicast Scheduling .............................................................. 90
5.3.2 Discussions ............................................................................. 92
5.4 Performance Evaluations ................................................................... 93
5.4.1 Performance under Uniform Mixing Traffic ......................... 94
5.4.2 Performance under Uniform Bursty Mixing Traffic .............. 96
5.4.3 Performance under Binomial Mixing Traffic ........................ 97
5.5 Chapter Summary .............................................................................. 99
Chapter 6 Multi-cabinet Implementation
6.1 Introduction ....................................................................................... 100
6.2 Related Work ..................................................................................... 102
6.2.1 Multi-cabinet Implementation of Input-queued Switch ......... 102
6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch . 103
6.3 Multi-cabinet Implementation of Feedback-Based Switch ............... 103
6.3.1 Revamped Feedback Mechanism ........................................... 103
6.3.2 Batch Scheduler Design ......................................................... 106
6.3.3 Some Properties ..................................................................... 107
6.4 Performance Evaluations ................................................................... 109
6.4.1 Performance under Uniform Traffic ...................................... 110
6.4.2 Performance under Uniform Bursty Traffic ........................... 111
6.4.3 Performance under Hotspot Traffic ....................................... 112
6.5 Chapter Summary .............................................................................. 112
Chapter 7 Scheduling Inadmissible Traffic Patterns
7.1 Introduction ....................................................................................... 113
7.2 Related Work ..................................................................................... 115
7.2.1 Fair Scheduling under Admissible Traffic .............................. 115
7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only .... 115
7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports 116
7.3 Our Approach .................................................................................... 117
7.4 Max-min Fairness Criterion ............................................................... 120
7.5 Performance Evaluations ................................................................... 122
7.5.1 Under Server-client Traffic Model ........................................ 122
7.5.2 Attack-traffic Scenario ........................................................... 124
7.6 Chapter Summary .............................................................................. 125
Chapter 8 An Optical Implementation of Feedback-Based Switch
8.1 Introduction ....................................................................................... 126
8.2 Related Work ..................................................................................... 127
8.3 Load Balanced Optical Switch (LBOS) ............................................ 128
8.3.1 Switch Architecture ................................................................. 128
8.3.2 Switch Operation ..................................................................... 130
8.3.3 Equivalence to Load Balanced Electronic Switches .............. 133
8.4 Extensions and Refinements of LBOS .............................................. 134
8.4.1 Cutting down the Average Delay by Reconfiguration ........... 134
8.4.2 Supporting Multicast ............................................................... 136
8.4.3 Implementing Fair Scheduler Optically ................................. 137
8.5 Performance Evaluations ................................................................... 137
8.5.1 Performance under Uniform Traffic ........................................ 138
8.5.2 Performance under Uniform Bursty Traffic ............................ 139
8.5.3 Performance under Hotspot Traffic ....................................... 139
8.5.4 Performance for Linecard Placement ..................................... 141
8.6 Chapter Summary .............................................................................. 141
Chapter 9 Conclusion
9.1 Our Contributions ................................................................................ 142
9.2 Future Work ........................................................................................ 145
9.2.1 100% Throughput Proof without Speedup ............................. 145
9.2.2 Building a Large Feedback-Based Two-Stage Switch ............ 146
9.2.3 More Scalable Fairness Algorithm in LBOS ......................... 146
9.2.4 Scalable Iterative Algorithm for Input-queued Switch ............ 146
References ........................................................................................................... 147
Publications ......................................................................................................... 157
List of Figures
Fig. 1.1: A generic router ................................................................................ 2
Fig. 1.2: A router works in two different planes ............................................. 2
Fig. 1.3: The first-generation router architecture .......................................... 3
Fig. 1.4: The second-generation router architecture ...................................... 4
Fig. 1.5: The third-generation router architecture ......................................... 5
Fig. 1.6: The fourth-generation router architecture ....................................... 6
Fig. 1.7: An input-queued switch with Virtual Output Queues (VOQs) ........ 9
Fig. 1.8: A buffered crossbar switch ............................................................... 11
Fig. 2.1 A load-balanced two-stage switch architecture ................................ 19
Fig. 2.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 21
Fig. 2.3 Feedback operation in joint sequences with staggered symmetry ... 30
Fig. 2.4 Delay vs input load p, with uniform traffic ........................................ 38
Fig. 2.5 Delay vs input load p, with uniform bursty traffic ........................... 39
Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes 40
Fig. 2.7 Delay vs input load p, with hot-spot traffic ...................................... 40
Fig. 3.1 The feedback-based two-stage switch architecture ............................ 51
Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 52
Fig. 3.3 The relation between staggered symmetry and in-order delivery .... 53
Fig. 3.4 The generic joint configuration at time slot t ................................... 56
Fig. 3.5 Generic joint sequence with anchor output and ordered properties . 57
Fig. 3.6 Joint sequence with staggered symmetry and in-order delivery ...... 60
Fig. 3.7 A three-stage switch architecture ...................................................... 65
Fig. 3.8 An example of using three-stage switch .......................................... 66
Fig. 3.9 Traffic matrix and delay matrix ......................................................... 66
Fig. 3.10 An example of identifying the minimum independent set ................ 67
Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b) ..... 69
Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch ...... 71
Fig. 3.13 Delay vs number of sample intervals T , with 3-stage switch ............ 72
Fig. 4.1 Timing diagram of feedback switch with feedback suppression ...... 76
Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback .. 81
Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback ...... 82
Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback ... 83
Fig. 4.5 Throughput vs switch size N , with partial feedback ......................... 84
Fig. 5.1 Delay vs output load λ , with uniform mixing traffic ........................ 94
Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7 ................. 95
Fig. 5.3 Delay vs output load λ , with bursty mixing traffic ........................... 97
Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7 ................... 98
Fig. 5.5 Delay vs output load λ , with binomial mixing traffic ...................... 99
Fig. 6.1 The timing diagram of switch with large propagation delay .............. 101
Fig. 6.2 Multi-cabinet implementation of the feedback-based switch ........... 103
Fig. 6.3 Feedback operation in multi-cabinet implementation ...................... 104
Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet ......... 110
Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet ............. 111
Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet .......... 112
Fig. 7.1 A 4×4 feedback-based switch with output port 3 oversubscribed by inputs
0, 1, 2 and 3. ...................................................................................... 114
Fig. 7.2 Output 0’s throughput vs its output load λ, under server-client traffic 123
Fig. 7.3 Output 0’s throughput vs its output load λ , under attack traffic ....... 124
Fig. 8.1 A 4×4 load balanced optical switch .................................................. 129
Fig. 8.2 The internal structure of linecard i ................................................... 129
Fig. 8.3 Time diagram for load balanced optical switch ............................... 131
Fig. 8.4 Timing diagram for pipelined packet sending and receiving ............ 132
Fig. 8.5 A joint sequence in load-balanced switch .......................................... 133
Fig. 8.6 Two possible linecard placement patterns using OXC ....................... 136
Fig. 8.7 Delay vs input load, under uniform traffic in LBOS ......................... 139
Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS ............... 140
Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS ......................... 140
List of Symbols
N Switch size
VOQ1(i,k ) The VOQ at input port i with packets destined for output k
VOQ2( j,k ) The VOQ at middle-stage port j with packets destined for output k
flow(i,k ) Packets arriving at input i and destined for output k
K Anchor output port for an input port i
p Input load for an input port
s p Burst size in uniform bursty traffic
S j The set of VOQ2( j,k ) (for k =0,1,…, N -1) with 0-occupancy
d The middle-stage port delay experienced in a feedback switch
{r i, j} The N × N request matrix, where r i, j denotes the number of requests from flow(i, j)
Z ij(n) The number of packets in VOQ1(i, j) at the beginning of time slot n
Aij(n) The cumulative number of arrivals for VOQ1(i, j) at the beginning
of time slot n
Dij(n) The cumulative number of departures for VOQ1(i, j) at the
beginning of time slot n
Bij(n) The number of packets in VOQ2(i, j) at the beginning of time slot n
X ij(n) The cumulative number of arrivals for VOQ2(i, j) at the beginning
of time slot n
Y ij(n) The cumulative number of departures for VOQ2(i, j) at the
beginning of time slot n
λ ij The mean packet arrival rate to VOQ1(i, j)
ω A sample in random event
Aij(t ,ω) The cumulative number of arrivals to VOQ1(i, j) for a fixed ω at
time t
Z ij(t ,ω) The number of packets in VOQ1(i, j) for a fixed ω at time t
Dij(t ,ω) The cumulative number of departures from VOQ1(i, j) for a fixed ω
at time t
X ij(t ,ω) The cumulative number of arrivals to VOQ2(i, j) for a fixed ω at
time t
Bij(t ,ω) The number of packets in VOQ2(i, j) for a fixed ω at time t
Y ij(t ,ω) The cumulative number of departures from VOQ2(i, j) for a fixed ω
at time t
C ij(t ) The joint queue occupancy of all packets arrived at input port i plus
all packets destined for output j
{r n} Any sequence {r n} with r n → ∞ as n → ∞
S The times of speedup
f (t ) A non-negative, absolutely continuous function defined on R+∪{0}
q If VOQ2( j,k ) is not empty, the packet in VOQ2( j,k ) will be
transmitted to output port k with fixed delay q
M The number of reduced Latin squares
{d ij} The delay matrix, where d ij is the traffic-weighted average middle-
stage packet delay of all the N flows destined to output port i-1
Qi,j The packet counter Qi,j is associated with each of the VOQ1(i,j)
T The sampling interval
u The number of non-overlapped sets per port
g The number of VOQs per non-overlapped set
Gm The non-overlapped set of VOQs
F Denotes VOQ1(i,F ) is the longest queue at input i at time t
b The number of bits sent in the second stage of Q-feedback-1
z Denotes VOQ2( j,z ) is an empty VOQ at middle-stage port j
C The number of sets Gm sent when cutting down the feedback bits
m The number of multicast VOQs at each input port
E y The vector reports the occupancy status from VOQ2( j, yN /m) to
VOQ2( j, yN /m+ N /m-1)
T c The overall average delay experienced by all copies of all multicast
packets
T p The average delay experienced by the last-copy of all multicast
packets
T c(k ) The average delay for multicast packets with fan-out k
T p(k ) The last-copy delay for multicast packets with fan-out k
λ The switch output load
P k The probability of generating a fan-out set with size k in binomial
mixing traffic
h The mean fan-out size in binomial mixing traffic
List of Abbreviations
ACK Acknowledgement
AMFS Adaptive Max-min Fair Scheduling
AWGR Arrayed Waveguide Grating Router
bps bits per second
CIOQ Combined Input Output Queuing
CMS Concurrent Matching Switch
CP Cross Point
CPU Central Processing Unit
CR Contention and Reservation
DRRM Dual Round Robin Matching
EDF Earliest Departure First
EDFA Erbium-Doped Fiber Amplifier
FDL Fiber Delay Line
FIFO First In First Out
F-MWM Fair Maximum Weight Matching
FOFF Full Ordered Frames First
GPS-SW Generalized Processor Sharing in network Switch
HOL Head Of Line
i.i.d Independent and Identically Distributed
ILP Integer Linear Programming
I-SMCB Input-based Shared Memory Crosspoint Buffer
LBOS Load-Balanced Optical Switch
LQF Longest Queue First
MEMS Micro Electro Mechanical Systems
MSM Maximal Size Matching
MWM Maximum Weight Matching
MURS Multicast and Unicast Round robin Scheduling
O-E-O Optical-Electrical-Optical
O-SMCB Output-based Shared Memory Crosspoint Buffer
OXC Optical Cross-Connect
PF Padded Frame
PIM Parallel Iterative Matching
RR Round Robin
RTT Round Trip Time
SRR Synchronous Round Robin
TCAM Ternary Content Addressable Memory
TDMA Time Division Multiple Access
TFQA Tracking Fair Quota Allocation
UFS Uniform Frame Spreading
VOD Video On Demand
VOQ Virtual Output Queue
WDM Wavelength Division Multiplexing
WF2Q+ Worst-case Fair Weighted Fair Queueing +
w.r.t With Regard To
Chapter 1
Introduction
1.1 Overview of Routers
The Internet is a network of networks. The basic unit of data exchange on the
Internet is an IP packet. Routers play a crucial role in the Internet by connecting
different networks together and forwarding each IP packet to its correct destination.
An N × N generic router is shown in Fig. 1.1. It consists of a routing processor, a
switch fabric and N linecards. The routing processor executes the routing protocols,
maintains the routing information and forwarding tables, and performs network
management functions within the router. A linecard is a subsystem that receives
datagrams externally on ingress, or internally from the switch fabric on egress. Each
linecard is (logically) divided into input port (for processing ingress traffic) and
output port (for processing egress traffic). A switch fabric allows inputs to be
connected with outputs for packet forwarding.
Fig. 1.1 A generic router
Fig. 1.2 A router works in two different planes
A router operates in two different planes [1,2]: control and forwarding (Fig.
1.2). The control plane constructs a routing table using the routing protocol, where
the router learns which linecard is the most appropriate for forwarding specific
packets to specific destinations. Forwarding, the predominant plane in a router, is
responsible for the actual process of switching a packet received from one linecard to
another. Forwarding involves packet-by-packet processing and is generally more
time-critical than the operations at the control plane.
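To make the forwarding-plane operation concrete, the per-packet lookup can be sketched as a longest-prefix match against the table built by the control plane. This is only an illustrative sketch: the prefixes and linecard indices below are hypothetical, and real routers perform this lookup in specialized hardware (e.g. TCAMs) rather than software.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> egress linecard index.
# The control plane (e.g. via routing protocols) builds this table;
# the forwarding plane consults it for every packet.
TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): 1,
    ipaddress.ip_network("10.1.0.0/16"): 2,
    ipaddress.ip_network("0.0.0.0/0"): 0,   # default route
}

def lookup(dst: str) -> int:
    """Longest-prefix match: the most specific matching prefix wins."""
    addr = ipaddress.ip_address(dst)
    best = max((p for p in TABLE if addr in p),
               key=lambda p: p.prefixlen)
    return TABLE[best]

print(lookup("10.1.2.3"))   # most specific match is 10.1.0.0/16 -> linecard 2
print(lookup("192.0.2.1"))  # only the default route matches -> linecard 0
```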
Fig. 1.3 The first-generation router architecture
From Fig. 1.1, we can see that the switch fabric is at the very heart of a router.
In fact, the evolution of routers is accompanied by the evolution of switch fabrics.
Historically, routers have been realized with packet-switching software executing on
a general-purpose CPU. Those first generation routers appeared before the early
1990s, consisting of a CPU, a centralized memory and several linecards (Fig. 1.3).
Linecards are connected to the CPU and centralized memory via a shared bus [4]
(instead of a dedicated switch fabric). The CPU is responsible for all operations at
control and forwarding planes. When a packet arrives at an input linecard, it will
cross the shared bus to arrive at the centralized memory. When the output linecard is
identified by the CPU, the packet will be read out from the memory and forwarded to
the output linecard via the shared bus again. As each packet needs to traverse the
shared bus twice, the bus bandwidth limits the router performance. Besides, the use
of a single CPU also undermines the router performance. An example of the first
generation routers is Huawei Quidway AR18 series routers [3].
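Since every packet must cross the shared bus twice (input linecard to memory, then memory to output linecard), the aggregate switching capacity is at most half the raw bus bandwidth. A minimal sketch of this bound (the 2 Gbps bus figure is illustrative, not from the text):

```python
def effective_capacity_gbps(bus_bw_gbps: float, traversals: int = 2) -> float:
    """Each packet consumes bus bandwidth once per traversal, so the
    aggregate switching capacity is the bus bandwidth divided by the
    number of traversals each packet makes."""
    return bus_bw_gbps / traversals

# Illustrative: a 2 Gbps shared bus can switch at most 1 Gbps of traffic
# when every packet must cross it twice.
print(effective_capacity_gbps(2.0))
```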
Fig. 1.4 The second-generation router architecture
In a second generation router, shown in Fig. 1.4, a route cache, a satellite
processor and memory are allocated to each linecard. The operations at the
forwarding plane are segregated from the central CPU and carried out by distributed
linecards. If routing information can be found in the local linecard route cache, a
packet will traverse the shared bus once, by going to the destination linecard directly.
Otherwise, the packet will be sent to the centralized memory for processing by the
central CPU, as the case of first generation routers. A major limitation of the second
generation router is the shared bus, which can support at most one packet traversal at
a time. Cisco 7500 series routers [6] belong to the second generation of routers.
To alleviate the bottleneck of using a single shared bus, the third generation
router introduces an interconnection network as the switch fabric (Fig. 1.5). This
enables multiple packets to traverse the switch fabric in parallel and without
contention. This architecture improves the routers’ switching capacity from the
second generation’s 2 Gbps to about 1 Tbps. An implicit requirement for
implementing the architecture in Fig. 1.5 is that all linecards and the switch fabric
must be housed in the same standard-sized switch cabinet. A typical cabinet [7] has a
size of 2.1 m × 0.6 m × 1.0 m, and is supplied with no more than 14 kW of power.
Accordingly, each cabinet can house only up to 16 linecards. An example of this
generation of routers is the Cisco 12000 series routers [7].
Fig. 1.5 The third-generation router architecture
To accommodate Internet traffic in the range of 10 Tbps, a large number of
linecards and a huge power budget are necessary. A report [73] shows that a
router consumes 0.01 kW of power per 1 Gbps and that a linecard supports 40 Gbps.
Handling 10 Tbps of data would thus require 100 kW of power and 125 linecards, which
cannot be supported by the third generation router architecture shown in Fig. 1.5.
The fourth generation routers remove the limitations on space and power by
distributing the linecards over different cabinets, as shown in Fig. 1.6, so that the
burdens of space and power are parceled out. Optical fibers connect all cabinets to
the central electronic switch fabric. (Note that it is difficult to run copper wires at
high speed
due to insertion loss, near-end crosstalk, electromagnetic emissions, echo and
propagation skew [96-97].) As the centralized switch fabric works in the electrical
domain, packets arriving on fiber must be converted to electrical signals for switching,
and vice versa when they depart the switch fabric. This extra O-E-O conversion and
the need for a centralized scheduler (for configuring the switch fabric on a per-slot
basis) prevent the fourth generation router from reaching even higher speeds. The
Cisco CRS-1 [8] is an example of the fourth generation router. Notably, it can push
the switching capacity to 90 Tbps, with 1152 linecards each running at 40 Gbps.
Fig. 1.6 The fourth-generation router architecture
Nowadays, commercial dense WDM systems [9] can support up to 160
parallel wavelengths in a single fiber, with a transmission rate of up to 80 Gbps on
each wavelength. Consequently, the fourth generation routers can only process
packets coming from 4 fibers. Besides, due to the speed mismatch between the linecard
processing rate (e.g. 40 Gbps in the Cisco CRS-1) and the fiber, a linecard cannot be
directly connected to a dense WDM fiber. Therefore, there is a pressing need
for building high-speed routers that can fully exploit the capacity of a fiber.
1.2 Switch Architectures
In a router, the forwarding plane involves packet-by-packet processing, which
is generally more time-critical than the operations of the control plane [2]. As shown
in Fig. 1.2, the forwarding plane comprises two major functions: table lookup, for
identifying the correct output linecard of a packet, and switching, for the actual
delivery of the packet.
IP table lookup algorithms can be classified into trie-based [74-79], range-
based [80-81], and hash-based algorithms [82-88]. These algorithms can be
implemented in software, hardware or both. Software schemes benefit from low
cost and flexibility. Hardware solutions, e.g. TCAM (Ternary Content Addressable
Memory [89-95]), are more efficient as they can search contents in parallel and
complete a lookup in a single clock cycle. Nevertheless, as the table lookup process
can be distributed to each linecard, its high-speed implementation tends to be less
critical than switching. Indeed, table lookup at 100 Gbps per linecard has been
reported in [88], whereas, due to the limitations of O-E-O conversion and the
centralized scheduler, a switching rate of 40 Gbps per linecard seems to be the
current limit.
In this thesis, we focus on designing efficient and scalable switch architecture
to enable the next generation high-speed routers. Based on the switch architecture,
routers can be generally classified into output-queued, input-queued, and combined
input-output queued (CIOQ).
1.2.1 Output-queued Switches
In an output-queued switch, all packets are switched to their respective
output linecards as soon as they arrive at the inputs. Accordingly, no input port
buffer is required, and the output-queued switch provides the optimal packet delay-
throughput performance. But the switch fabric must be powerful enough to deliver
up to N packets to any output port, and each output buffer must be fast enough to
receive up to N packets in each time slot, where N is the switch size (i.e. the number
of linecards). In other words, the switch fabric and output ports must operate at N
times the individual link rate. This makes high-speed output-queued switches
expensive to build and difficult to scale.
It should be noted that the complexity of a switch fabric can be measured by
the number of switch configurations it needs to realize. A switch configuration is an
internal switch fabric connection pattern mapping the set of N inputs to the N
outputs. An output-queued switch fabric needs to realize N^N configurations, as
up to N packets can go to the same output.
1.2.2 Input-queued Switches
In an input-queued switch, all packets are buffered at the input ports and wait
for their turns to be served by the switch fabric. No switch fabric speedup is required
(i.e. the fabric only needs to run at the same speed as each input link), but each
input can send at most one packet and each output can receive at most one packet in
every time slot. Accordingly, the number of switch configurations to be realized by
an input-queued switch is N!, which is substantially smaller than the N^N required by
an output-queued switch. This makes input-queued switches more suitable for
building high-speed routers with a large port count.
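The gap between these two configuration counts can be checked directly: an output-queued fabric must realize every mapping from inputs to outputs, while an input-queued fabric only realizes permutations. The following snippet is our own illustration (not from the thesis) that enumerates both for a small switch:

```python
from itertools import permutations, product

def output_queued_configs(n):
    """Any input may target any output, so a configuration is an
    arbitrary mapping from n inputs to n outputs: N^N in total."""
    return sum(1 for _ in product(range(n), repeat=n))

def input_queued_configs(n):
    """At most one packet per input and per output, so a configuration
    is a permutation of the outputs: N! in total."""
    return sum(1 for _ in permutations(range(n)))

# For a small 4 x 4 switch: 4^4 = 256 mappings vs 4! = 24 matchings.
```

Even at N = 4 the output-queued fabric must support an order of magnitude more configurations, and the gap widens rapidly with N.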
Fig. 1.7 An input-queued switch with Virtual Output Queues (VOQs)
On the other hand, input-queued switches suffer from the well-known
problem of head-of-line (HOL) blocking, which limits the maximum throughput of an
input-queued switch to just 58.6% under uniform traffic [10]. To eliminate HOL
blocking, Virtual Output Queueing (VOQ) is proposed [11], where each input port
maintains a separate queue for each output (Fig. 1.7). A centralized scheduler is
needed to maximize the throughput of a VOQ switch. The scheduling problem is
equivalent to the matching problem in a bipartite graph [98]. It has been found that
for any admissible traffic pattern, 100% throughput can be achieved by MWM
(Maximum Weight Matching [12]). However, the MWM algorithm has a high time
complexity of O(N^3 log N). MSM (Maximal Size Matching) algorithms with lower
computation overheads, notably PIM (Parallel Iterative Matching [13]), iSLIP [14,15]
and DRRM (Dual Round-Robin Matching [16]), have then been proposed. They are
iterative algorithms involving a non-negligible amount of communication overhead for state
information exchange, which scales up quickly with the number of iterations to
be carried out, the link speed and the switch size.
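The 58.6% figure can be reproduced approximately with a small saturation experiment: every input always holds a head-of-line packet with a uniformly random destination, each output serves one contending packet per slot, and losers block their queue. This is our own hedged sketch of the classic HOL model (not code from the thesis); for large N the measured throughput approaches 2 − sqrt(2) ≈ 0.586.

```python
import random

def hol_saturation_throughput(n, slots, seed=1):
    """Simulate saturated FIFO inputs under uniform traffic.
    Losing head-of-line packets block their queue (HOL blocking);
    each winner immediately exposes a fresh random-destination packet."""
    rng = random.Random(seed)
    heads = [rng.randrange(n) for _ in range(n)]  # HOL destinations
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, dest in enumerate(heads):
            contenders.setdefault(dest, []).append(i)
        for dest, inputs in contenders.items():
            winner = rng.choice(inputs)       # one packet per output
            heads[winner] = rng.randrange(n)  # winner's queue advances
            served += 1
    return served / (n * slots)
```

For n = 32 and a few thousand slots, the measured throughput settles slightly below 0.6, consistent with the asymptotic bound in [10].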
As an example, the ATLANTA architecture proposed in [105] is based on the
input-queued switch architecture. Notably, its switch fabric is implemented as a
three-stage (memory/space/memory) Clos network, where packets are buffered at the
first and third stages while the second stage is constructed from crossbar switch
modules. To avert overflow in the fabric-embedded buffers of the first and third
stages, backpressure signals are sent from the first stage to the input ports, as well as
from the third stage to the second-stage crossbars. Nevertheless, its performance is
limited by the required packet/slot-based switch re-configurations.
1.2.3 CIOQ and Buffered Crossbar Switches
In a CIOQ switch, packets are buffered at both input and output ports [17].
The switch fabric is the same as an input-queued switch fabric, where in each time
slot at most a single packet can leave/join an input/output port. A centralized
scheduler is responsible for selecting the most "critical" packets to deliver in each
time slot. A packet may arrive at an output port out of order, so an output
buffer/queue is required. It has been shown [17] that with a speedup of two (i.e. in
each time slot, up to two packets can leave/join an input/output port), a CIOQ switch
can precisely emulate an output-queued switch. Like an input-queued switch, the
number of switch configurations to be realized by a CIOQ switch is N!. But the
complexity of the centralized scheduler is by no means less than that of an
input-queued switch.
Notably, the buffered crossbar switch [18-20] is an elegant approach to
implementing CIOQ switches that adopts a distributed approach to scheduling. In
addition to buffering packets at each input, a buffered crossbar switch allows packets
to be buffered at each crosspoint of the switch fabric, as shown in Fig. 1.8. It has
been shown that buffered crossbars can yield performance comparable to output-
queued switches. Although the buffered crossbar is touted for its technological
feasibility and simpler scheduler, it requires 2N schedulers (one for each input/output
port), N^2 in-fabric crosspoint buffers, and the switch configuration must still be
determined on a slot-by-slot basis. It should be noted that the total of N^2 crosspoint
buffers is very difficult to build. A report [100] shows that a memory of one 512-bit
word occupies 0.0278 mm^2 of silicon even under state-of-the-art 0.18 μm VLSI
technology. Assuming a switch size of N = 32, holding all crosspoint buffers, at
1000 bits each, would require 55.6 mm^2 of silicon, which dominates the cost in
terms of area and is prohibitive [47-48,100].
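The area estimate follows directly from the figures quoted from [100]; a quick arithmetic check (our own, for illustration) confirms the 55.6 mm^2 total:

```python
# Figures quoted in the text: one 512-bit memory word occupies
# 0.0278 mm^2 in 0.18 um technology; a 32 x 32 buffered crossbar
# holds N^2 crosspoint buffers of 1000 bits each.
n = 32
bits_per_buffer = 1000
bits_per_word = 512
area_per_word_mm2 = 0.0278

total_bits = n * n * bits_per_buffer      # 1,024,000 bits in total
words = total_bits / bits_per_word        # 2,000 memory words
area_mm2 = words * area_per_word_mm2      # 55.6 mm^2 of silicon
```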
Fig. 1.8 A buffered crossbar switch
For a buffered crossbar switch, because the input/output contention is relaxed,
the total number of switch configurations to be realized is N^N, the same complexity
as an output-queued switch fabric. Besides, the communication overhead of collecting
the queue size at each crosspoint buffer (for input/output arbitration) can be a
potential performance bottleneck.
1.2.4 Load-Balanced Two-Stage Switches
Load-balanced two-stage switches (or load-balanced switches) have received
a great deal of attention recently [21-32] because they are more scalable and can
provide close to 100% throughput. A load-balanced switch consists of two stages of
switch fabrics, as shown in Fig. 2.1. Each switch fabric is configured according to a
pre-determined and periodic sequence of switch configurations. As a result, each
switch fabric needs to realize only N switch configurations (instead of N! for
input-queued and CIOQ switches, and N^N for output-queued and buffered crossbar
switches). This greatly facilitates high-speed implementation.
Besides, due to the pre-determined nature of the sequence of configurations,
the load-balanced switch removes the need for a centralized scheduler, another major
bottleneck in designing high-speed switches. As a load-balanced switch provides
multiple paths for packets belonging to the same flow to arrive at the same output
port, packets may arrive out of order due to the different middle-stage port delays
experienced en route. Many efforts [22-32] have been made to address this notorious
packet mis-sequencing problem (to be reviewed in Chapter 2). It is not difficult to
see that higher switch throughput usually comes at the cost of poorer delay
performance: throughput is improved by better load balancing, but better load
balancing tends to aggravate the packet mis-sequencing problem.
1.3 Contributions
In this dissertation, we dedicate our efforts to designing an efficient and
scalable switch architecture for next generation high-speed routers. We have two
key design objectives:
- No need for a centralized scheduler, as a centralized scheduler is a major
obstacle to a scalable switch architecture; and
- Amenability to optics, which can avoid the extra O-E-O conversion in the
fourth generation routers when packets are switched from one linecard to
another.
We follow the approach of the load-balanced switch due to its scalability (no
centralized scheduler) and close-to-100% throughput performance. But its notorious
packet mis-sequencing problem must be properly addressed; otherwise, the
complexity of the load-balanced switch as well as its delay performance would suffer.
To this end, an elegant solution called the feedback-based two-stage switch (or
feedback-based switch for short) is proposed in this thesis. Before diving into the
details, our major contributions are outlined below.
Feedback-based Two-stage Switch Design: Unlike other load-balanced
switches, at each middle-stage port between the two switch fabrics of our
feedback-based two-stage switch, only a single-packet-buffer for each VOQ
is required. Although packets belonging to the same flow pass through
different middle-stage VOQs, the delays they experience at the different middle-
stage ports will be identical. This is made possible by properly selecting and
coordinating the two sequences of switch configurations to form a joint
sequence with both the staggered symmetry property and the in-order packet
delivery property. Based on the staggered symmetry property, an efficient feedback
mechanism is designed to allow the right middle-stage port N-bit occupancy
vector to be delivered to the right input port at the right time. Compared
with existing load-balanced switch architectures and scheduling
algorithms, our solution imposes a modest requirement on switch hardware,
but consistently yields the best delay-throughput performance.
Cutting down the average packet delay of the switch: As different flows
experience different middle-stage delays, we can cut down the average packet
delay by assigning heavy flows to paths with less middle-stage delay. For a
given traffic matrix, we can find an optimal joint sequence that minimizes
the average middle-stage delay, but this involves tedious computation. A
three-stage switch architecture is thus proposed, which adds another stage of
switch fabric to dynamically map heavy flows to paths with less
middle-stage port delay.
Cutting down the communication overhead of the feedback-based switch: In
a feedback-based switch, each middle-stage port needs to piggyback an N-bit
occupancy vector to its connected output in each time slot. To cut down this
communication overhead, the size of an occupancy vector can be reduced by
only reporting the status of selected middle-stage VOQs. To identify the VOQs
of interest, we first partition the N VOQs into u non-overlapping sets, each
identified by a set number. In each time slot, every input port
piggybacks its set numbers of interest to the connected middle-stage port.
This guides a middle-stage port to report only the status of the VOQs of
interest.
Supporting multicast: By slightly modifying the operation of the original
feedback-based two-stage switch, we show that the feedback-based switch
supports multicast traffic efficiently. A notable feature of this multicast
extension is that the switch fabric remains unicast, while packet
duplication is distributed to both the input and middle-stage ports.
Multi-cabinet implementation: In a single-cabinet implementation, the
propagation delay between the linecards and the switch fabric is negligible. In a
multi-cabinet implementation, due to the non-negligible propagation delay
between linecards and switch fabric, the requirement that occupancy vectors
must arrive at the output/input ports within a single time slot would significantly
lower the feedback-based switch's efficiency. To this end, we revamp the
original feedback mechanism to support multi-cabinet implementation, and a
new batch scheduler is also designed.
Fairness support for switching inadmissible traffic: As long as the traffic
is admissible, packets can arrive at the outputs with bounded delays due to the
close-to-100% throughput of our feedback switch, so fairness in throughput is
not an issue. Under inadmissible traffic (i.e. some output ports are over-
subscribed), the feedback switch may suffer from the ring-fairness problem,
i.e. "up-stream" input ports can starve some "down-stream" input ports. To
address this ring-fairness problem, we design an algorithm that allocates the
bandwidth of over-subscribed outputs according to the max-min fairness
criterion.
Optical implementation of the feedback-based switch: To ensure that packets
can be switched from one linecard to another all-optically, an optical feedback-
based switch called the Load-Balanced Optical Switch (LBOS) is proposed.
LBOS leverages an N-wavelength WDM fiber ring to connect the N linecards
together. The ring network is engineered such that the amount of time a
packet should be buffered at a middle-stage port exactly matches the
propagation delay that this packet would experience en route.
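The max-min criterion invoked in the fairness contribution above is standard: repeatedly give every unsatisfied flow an equal share of the remaining bandwidth, freezing flows whose demand is met. The progressive-filling sketch below is a generic illustration of that criterion (our own, not the thesis's ring-specific algorithm):

```python
def max_min_allocation(capacity, demands):
    """Generic progressive filling. demands maps each flow to its
    requested rate; returns the max-min fair allocation per flow."""
    alloc = {}
    remaining = dict(demands)
    cap = float(capacity)
    while remaining:
        share = cap / len(remaining)
        satisfied = {f: d for f, d in remaining.items() if d <= share}
        if not satisfied:
            for f in remaining:       # bottleneck reached: equal split
                alloc[f] = share
            return alloc
        for f, d in satisfied.items():
            alloc[f] = d              # small demand fully met
            cap -= d
            del remaining[f]
    return alloc
```

For an over-subscribed output of capacity 1.0 with demands 0.2, 0.4 and 1.0, progressive filling allocates 0.2, 0.4 and 0.4, respectively: small demands are fully met and the leftover bandwidth is split among the rest.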
1.4 Thesis Overview
This thesis consists of nine chapters. In Chapter 2, we first review the
existing work on the packet mis-sequencing problem of load-balanced
switches. Then the framework of our proposed feedback-based two-stage switch is
introduced, and the delay and throughput performance of the feedback-based switch is
compared with existing algorithms by simulations. The stability of the
feedback-based switch under a speedup of two is also proved.
In Chapter 3, we cut down the average packet delay of a feedback-based
switch by assigning heavy flows to experience less middle-stage ports delays. In
Chapter 4, we focus on designing efficient feedback suppression schemes for cutting
down the communication overhead of sending middle-stage occupancy vectors. In
Chapter 5, we extend the feedback-based switch to support multicast traffic. In
Chapter 6, the feedback-based switch is refined to support multi-cabinet
implementation. In Chapter 7, a fair scheduling algorithm for inadmissible traffic is
proposed. An optical implementation of the feedback-based switch, called LBOS, is
introduced in Chapter 8. Finally, Chapter 9 summarizes our contributions in this
thesis, and highlights some interesting future research directions.
Chapter 2
Feedback-Based Two-Stage Switch Design
2.1 Introduction
Due to its more scalable switch fabric, the input-queued switch architecture is
more suitable than the output-queued switch for high-speed router implementation.
However, an input-queued switch requires a centralized scheduler to determine its
switch configuration on a slot-by-slot basis. The requirement for a centralized
scheduler is thus the major bottleneck in further increasing the router's capacity.
Load-balanced two-stage switches [21-32] remove the bottleneck of
the centralized scheduler and can provide close to 100% throughput. A load-balanced
two-stage switch consists of two stages of switch fabrics, as shown in
Fig. 2.1. Each fabric is configured according to a pre-determined and periodic
sequence of switch configurations, with the only requirement that each input
connects to each output exactly once in the sequence. The two fabrics can use
different sequences. There are many ways to generate such a sequence; for example,
a sequence can be constructed by cyclically shifting the set of input/output
connections used in each time slot, such that at time slot t, input i (for
i = 0, 1, ..., N-1) is connected to output j, where j is given by

j = (i + t) mod N.    (2.1)

In Fig. 2.2(a), the sequence of blue/dotted configurations represents the
configurations used by the first-stage switch fabric in Fig. 2.1, and it is generated
based on (2.1). Note that each switch port is abstracted as a circle in Fig. 2.2.
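Equation (2.1) can be verified in a few lines: each slot's connection pattern is a permutation, and over one period of N slots each input is connected to every output exactly once, which is precisely the basic requirement of a load-balanced switch. (Our own illustrative snippet.)

```python
def configuration(t, n):
    """Connection pattern at slot t under Eq. (2.1):
    entry i is the output that input i connects to."""
    return [(i + t) % n for i in range(n)]

n = 4
period = [configuration(t, n) for t in range(n)]
# Each slot's pattern is a permutation of the outputs, and over the
# period of n slots every input i meets every output exactly once.
```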
Fig. 2.1 A load-balanced two-stage switch architecture.
For the generic load-balanced switch architecture shown in Fig. 2.1, we use
VOQ1(i,k) to represent the VOQ at input port i with packets destined for output k,
and VOQ2(j,k) to denote the VOQ at middle-stage port j with packets destined for
output k. We define flow(i,k) as the packets arriving at input i and destined for output k.
Packets from flow(i,k) are buffered at VOQ1(i,k). Packets (from different inputs)
destined for output k are buffered at VOQ2(j,k) for j = 0, 1, ..., N-1. Aiming to
convert the incoming non-uniform traffic into uniform traffic, the first-stage switch
fabric spreads packets evenly over all middle-stage ports. Then the second-stage
switch fabric delivers the packets from the middle-stage ports to their respective
outputs. From the above, we can see that in each time slot there are two switch
configurations, one at each fabric. We call them a joint configuration, and the
sequence of N joint configurations forms a joint sequence. Three possible joint
sequences are shown in Fig. 2.2. It is important to point out that all three joint
sequences in Fig. 2.2 meet the basic requirement of a load-balanced two-stage
switch, but they have different properties, namely in-order packet delivery and
staggered symmetry. These two properties, which form the basis of our
feedback-based two-stage switch design, will be discussed in detail in Section 2.3.
In Chapter 3, the problem of optimal joint sequence design will be investigated.
Due to the two-stage nature, flow(i,k) packets may arrive at output k via
different middle-stage VOQ2(j,k)'s (for j = 0, 1, ..., N-1) and thus may experience
different amounts of middle-stage port delay. This leads to the problem of packet
mis-sequencing. Many efforts [21-32] have been made to address this notorious packet
mis-sequencing problem (reviewed in Section 2.2). It is not difficult to see that
higher switch throughput usually comes at the cost of poorer delay performance:
throughput is improved by better load balancing, but better load
balancing tends to aggravate the packet mis-sequencing problem.
Fig. 2.2 Some joint sequences for a 4 × 4 load-balanced switch.
In this chapter, we show that the efforts made in load balancing and in keeping
packets in order can complement each other in improving both the delay and throughput
performance of the switch. We adopt a simple load-balanced switch architecture
where each middle-stage port between the two stages of switch fabrics has only a
single-packet-buffer for each VOQ. Although packets belonging to the same flow
will pass through different middle-stage VOQs, the delays they experience at the
different middle-stage ports will be identical. This is made possible by properly
selecting and coordinating the two sequences of switch configurations (used by the
two stages of switch fabrics) to form a joint sequence with both the staggered symmetry
property and the in-order packet delivery property. Based on the staggered symmetry
property, an efficient feedback mechanism is designed to allow the right middle-stage
port occupancy vector to be delivered to the right input port at the right time.
Accordingly, the performance of load balancing as well as switch throughput is
significantly improved.
The rest of this chapter is organized as follows. In the next section, we review
the existing work for solving the packet mis-sequencing problem of load-balanced
switches. In Section 2.3, our proposed feedback switch framework is introduced. The
delay and throughput performance of our proposed solutions is compared with other
existing algorithms in Section 2.4 by simulations. In Section 2.5, we prove that for
any arbitrary work-conserving input port scheduler, the feedback-based switch can
achieve 100% throughput under a speedup of two. Finally, we conclude this chapter
in Section 2.6.
2.2 Related Work
Two main approaches can be followed to solve the mis-sequencing problem
of load-balanced switches: using re-sequencing buffers at the outputs, or preventing
packets from becoming mis-sequenced in the first place.
2.2.1 Using Re-sequencing Buffers
When out-of-order packets arrive at an output port, they are temporarily
stored in a re-sequencing buffer (not shown in Fig. 2.1), waiting to be read out and
written onto the output link in the correct order. To this end, each packet header
should carry a sequence number field (or timestamp), which is added to the packet
upon its arrival at an input port. With the original two-stage switch architecture [21],
packets can be mis-sequenced by an arbitrary amount, and thus a finite re-sequencing
buffer is not possible. Efforts have been made to bound the re-sequencing delay at
additional cost, such as N writes to memory in one time slot [22] or a 3-D
re-sequencing buffer [23].
In [24], a three-stage load-balanced switch is presented where each of the
three stages of switch fabrics is configured by pre-determined and periodic
configurations. The buffers ahead of each stage of switch fabric are called the
first-stage buffer, the second-stage buffer and the third-stage buffer (i.e. the
re-sequencing buffer), respectively. Every arriving packet first reserves a position
in the third-stage buffer. Upon successful reservation, the packet is forwarded to
the first-stage buffer by a flow splitter according to its assigned position number in
the third-stage buffer. Packets are transmitted through the first two switches in a
FIFO manner and are inserted into their reserved positions in the third-stage buffer.
Although the switch is proved to be stable, this design requires additional hardware
as well as global information exchange for buffer reservation. The high
implementation complexity may defeat the original purpose of using a load-balanced
switch.
2.2.2 Preventing Packets from Becoming Mis-sequenced
Instead of re-ordering packets at each output, we can prevent packets from
becoming mis-sequenced in the first place [25-32]. This removes not only the
re-sequencing buffers, but also the corresponding re-sequencing delay. The majority
of the work in this direction [26-29] adopts the notion of a "frame". For an N × N
switch, a frame consists of N packets belonging to the same flow. At each input port,
incoming packets join their respective VOQs. If the size of a VOQ reaches N packets,
the flow is said to have a full frame of packets. With the UFS (Uniform Frame
Spreading) algorithm [26], an input port is allowed to send only from flows/VOQs
with at least a
full frame of packets. Once a frame transmission starts, N packets from the selected
flow are sent in the next N slots, with each packet arriving at a distinct middle-stage
port from 0 to N-1. The frame transmission starts when the input port is connected to
a particular middle-stage port, say port 0. Each input has a distinct frame starting
time because the inputs connect to middle-stage port 0 at different slots. Under this
frame notion, upon joining the VOQ at each middle-stage port, every packet in a
frame sees the same middle-stage VOQ size. If the transmission at the second-stage
switch fabric is coordinated such that an output is connected to the middle-stage
ports in the same (cyclic) order as an input is, in-order packet delivery is guaranteed.
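The per-input behaviour of UFS just described can be sketched as follows. This is our own simplified illustration under the stated assumptions, not the exact pseudocode of [26]:

```python
from collections import deque

def ufs_send_frame(voq, n, first_middle_port=0):
    """If the VOQ holds a full frame (>= n packets), pop n packets and
    return (packet, middle_port) pairs for the next n slots, one packet
    per slot to middle-stage ports first_middle_port, +1, ... (mod n).
    Otherwise the input stays idle and nothing is sent."""
    if len(voq) < n:
        return []                      # no full frame yet
    return [(voq.popleft(), (first_middle_port + s) % n)
            for s in range(n)]
```

Because every packet of the frame joins a middle-stage VOQ of the same length, and the second stage visits the middle-stage ports in the same cyclic order, the frame leaves the switch in order.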
A downside of the UFS algorithm is that when the traffic load is light, it takes
time to form a full frame of packets, so the delay performance suffers. To cut down
the delay, FOFF (Full Ordered Frames First) [27] was proposed. Instead of waiting
for full frames of packets, FOFF allows mis-sequencing caused by sending partial
frames, but the amount of mis-sequencing at the middle-stage ports is bounded. As a
result, the amount of re-sequencing buffer required at each output is also bounded.
The PF (Padded Frame) algorithm [28] also improves the delay performance of
UFS, but without the re-sequencing buffers of FOFF. The idea is that when no full
frame is available for sending, a partial frame can be sent as a "faked" full frame
by padding it with dummy packets. The CR (Contention and Reservation)
algorithm [29] further improves the performance of PF by supporting two modes
of frame transmission: contention and reservation. If an input i has a full
frame of packets when i connects to middle-stage port 0, i enters the reservation mode and
the transmission in the next N slots is governed by UFS. Otherwise, input i enters the
contention mode, where the packet sent in each slot is selected by a round-robin
scheduler and must be acknowledged at the end of that time slot. A packet is
removed from the input VOQ only if a positive ACK (ACKnowledgement) is
received.
The CR algorithm requires a dedicated feedback/acknowledgement from each
middle-stage port in each time slot, but the construction of the feedback path is not
discussed in [29]. In contrast, the Mailbox switch [30], which also requires a
feedback path, constructs it smartly by adopting the joint sequence of switch
configurations in Fig. 2.2(c), where input i and output i are always connected to the
same middle-stage port. In each time slot, when a packet arrives at a middle-stage
port (from, say, input i), the middle-stage port calculates the packet's departure time
(i.e. when it will be sent to its destination output) based on its location in the VOQ.
The departure time is then sent to the connected output port i using the second
switch fabric. As input i and output i reside on the same switch linecard, output i can
relay the departure time of the packet to input i at negligible cost. A feedback path
for reporting middle-stage packet departure times is thus created. Based on the
received packet departure time, the next packet of the flow will be dispatched and
inserted into a middle-stage VOQ only if it will depart no earlier than the previous
packet of the same flow. Although the Mailbox switch maintains packet order
without relying on the frame notion, its overall throughput is limited.
In [31], a distributed and iterative scheduling algorithm, CMS (Concurrent
Matching Switch), is introduced. Despite the fixed uniform mesh in both
stages of switch fabrics, its logical configurations are the same as the joint sequence
in Fig. 2.2(c). For every arriving packet, the input port sends a request to the currently
(logically) connected middle-stage port. Each middle port records the received
requests in its own N × N matrix {ri,j}, where ri,j denotes the number of requests from
flow(i,j). Every N time slots, each middle-stage port concurrently and independently
finds a matching based on its own {ri,j}. (Note that CMS can achieve stability using
randomized scheduling with amortized constant time and hardware complexity per
port, independent of N.) In the following N time slots, the matched packets are
transmitted to the middle-stage ports. As soon as they arrive, the middle-stage ports
forward them to the connected output ports. Since the packets selected in each slot
traverse the two switches in parallel and without conflicts, there is no out-of-order
problem. However, the packet delay can be quite large: even the best case is 3N time
slots when a parallel optical mesh is used. Having said that, the delay performance of
Chang's original architecture [21] is on the order of O(N) if it is implemented using
an R/N optics abstraction.
2.3 Feedback-based Two-stage Switch
2.3.1 Some Observations and Motivations
The delay and throughput performance of a load-balanced switch hinges on
how well the load-balancing and in-order packet delivery are implemented.
Obviously, if the incoming traffic is well-balanced by the first stage switch, the
throughput performance will be improved as the second stage switch can maximize
the number of packets sent in each time slot. Consequently, the packet delay will also
be reduced due to higher throughput.
But how should the load-balancing performance be measured? Many scheduling
algorithms (e.g. in [23, 25]) try to ensure all middle-stage VOQs have the same
queue size. But as far as the throughput performance is concerned, we only need to
ensure each middle-stage VOQ2( j,k ) (in Fig. 2.1) does not suffer from either buffer
underflow or overflow problem. A buffer underflow occurs if there are packets
waiting in some input ports for a particular output k , but VOQ2( j,k ) is empty at the
time that middle-stage port j is connected to output k , yielding an idle transmission
slot on the second stage switch. On the other hand, buffer overflow is equally
undesirable as the overflowed packet is dropped, and the transmission slot in the first
stage switch is wasted. Indeed, as long as no buffer underflow and overflow at each
VOQ2( j,k ) is ensured, the actual buffer size for each VOQ2( j,k ) has no impact on the
throughput performance of the switch. Therefore, it may not be appropriate to
increase the buffer size of VOQ2( j,k ) for boosting throughput performance.
In a load-balanced switch, the head of line packet in each middle-stage VOQ
will experience an average delay of N /2 slots (due to the deterministic nature of the N
configurations), and each additional packet in the line will experience an additional
delay of N slots. To minimize delay, a small buffer size at each VOQ2( j,k ) is preferred.
In general, mechanisms for ensuring in-order packet delivery tend to penalize
the packet delay performance more than throughput. If re-sequencing buffers are
used for solving the mis-sequencing problem, packets suffer from an additional
re-sequencing delay. Since packet mis-sequencing is due to packets of the same flow
experiencing different delays at different middle-stage ports, a smaller buffer size at
each VOQ2( j,k ) is favored because middle-stage packet delay can be reduced and
thus the mis-sequencing problem can be eased. Consequently, a smaller
re-sequencing buffer/delay is also possible. In fact, buffering a packet at an input port
(instead of a middle-stage port) gives more flexibility in sending because an input
can retry in the subsequent slots at different middle-stage ports (which may even
have a shorter queue size).
If the frame notion is used for ensuring in-order packet delivery, the time
required for forming a frame dominates the delay performance especially when the
load is light. Besides, frame-based transmission tends to make the traffic to
downstream switches more bursty, resulting in poor delay jitter performance.
Although PD [28] and CR [29] improve the delay performance of UFS [26], the use
of fake frames/packets undermines the load-balancing performance. In this chapter,
we are interested in designing a scheduling algorithm without using re-sequencing
buffers for in-order packet delivery, and without incurring the frame-based
scheduling overheads.
From our observations above, we can see that a smaller buffer size at each
VOQ2( j,k ) is preferred if we can ensure (a) no underflow and overflow at each
VOQ2( j,k ), and (b) no packet mis-sequencing. The smallest buffer size at each
VOQ2( j,k ) is 1. In the rest of this chapter, we shall focus on using a single-packet-
buffer at each VOQ2( j,k ).
2.3.2 Designing Scalable Feedback Mechanism
Now the issue is how to ensure each single-packet-buffered VOQ2( j,k ) is free
of either buffer overflow or underflow. If an input port knows the occupancy of its
connected VOQ2( j,k ) before sending a packet to it, the buffer overflow problem can
be easily solved. Then, do we have an efficient feedback mechanism for reporting the
occupancy of VOQ2( j,k ) to input ports?
We propose a simple yet novel feedback mechanism based on a joint
sequence with the staggered symmetry property. A joint sequence of switch
configurations has the staggered symmetry property if, whenever middle-stage port j is
connected to output port k at time slot t, input port k is connected to the same
middle-stage port j at the next slot (t+1). In essence, for each given sequence in the
first stage switch, the second stage sequence (and thus the joint sequence) can be
obtained directly from the property itself. In Fig. 2.2(a), the first stage sequence is
constructed from (2.1) by cyclically shifting the set of connections used in each slot. Each
configuration in the second stage is obtained from the staggered symmetry property.
We can see that for every pair of staggered configurations, e.g. the second switch
configuration at t =0 and the first switch configuration at t =1, they are mirror images
of each other.
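To make this construction concrete, here is a small Python sketch of a joint sequence with the staggered symmetry property. Since (2.1) is not reproduced in this excerpt, we assume the cyclic-shift form j = (i + t) mod N for the first stage; the second-stage configuration then follows (2.2), and the final assertion checks the mirror-image (staggered symmetry) relation described above.

```python
N = 4

def first_stage(i, t, n=N):
    # Assumed cyclic-shift form of (2.1): input i -> middle port (i + t) mod N.
    return (i + t) % n

def second_stage(j, t, n=N):
    # Eq. (2.2): middle port j -> output (j + N - 1 - t) mod N,
    # derived from the staggered symmetry property.
    return (j + n - 1 - t) % n

# Staggered symmetry: if middle port j connects to output k at slot t,
# then input k connects to the same middle port j at slot t + 1.
for t in range(2 * N):
    for j in range(N):
        k = second_stage(j, t)
        assert first_stage(k, t + 1) == j
```

The second switch configuration at slot t and the first switch configuration at slot t+1 are indeed mirror images of each other, which is exactly what the assertion verifies.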
As each VOQ2( j,k ) only has a single packet buffer, a single bit is sufficient to
denote its occupancy. For the N VOQ2( j,k )’s at middle-stage port j (for k =0, …, N -1),
their joint occupancy can be denoted by an N -bit occupancy vector. Since each pair
of input k and output k reside on the same linecard, the occupancy vector at middle-
stage port j can be piggybacked on the data packet sent to output k , which is then
made available to input k at negligible cost. Due to the staggered symmetry property
of the joint sequence used, input k will be connected to middle port j in the next time
slot. This gives a very efficient feedback path, allowing the occupancy vector from
the right middle-stage port to be delivered to the right input at the right time. In the
next time slot, each input port scheduler will select a packet for sending based on the
received occupancy vector. If the packet is properly selected, both buffer overflow
and underflow at a middle-stage VOQ2( j,k ) can be avoided. (In Section 2.3.4, three
simple input port schedulers are designed.)
Fig. 2.3 Feedback operation in joint sequences with staggered symmetry.
The timing diagram in Fig. 2.3 summarizes the feedback operation, while
assuming each switch reconfiguration involves certain overhead. We can see that
switch reconfiguration takes place in parallel with relaying the occupancy vector
from output k to input k and the execution of the scheduling algorithm. The
occupancy vector is created by taking both packet arrival/departure in the current slot
into account. In creating the vector, the occupancy bit of VOQ2( j,k ’) is always set to
0 if middle port j will connect to output k ’ in the next slot. This is because the packet
(if any) in VOQ2( j,k ’) is guaranteed to be sent in the next time slot. Besides, when a
buffered packet in VOQ2( j,k ’) is being sent, VOQ2( j,k ’) can receive another packet
simultaneously. Due to parallel packet transmission in the two switch stages, a packet
cannot be delivered from an input to an output in a single time slot, i.e. the minimum
delay a packet experiences at a middle-stage port is one slot.
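As a hypothetical illustration of this vector-creation rule, the sketch below builds the occupancy vector of middle-stage port j at the end of slot t. The representation (`buffered[k]` marking whether VOQ2(j,k) holds a packet) and the helper name are our own, not from the thesis; the bit of the output that port j serves in the next slot is cleared, since that packet (if any) is guaranteed to depart.

```python
N = 4

def occupancy_vector(buffered, j, t, n=N):
    # buffered[k] is True if VOQ2(j, k) currently holds a packet.
    # The bit for the output that middle port j serves in the NEXT slot is
    # always set to 0, because that packet (if any) departs in slot t + 1.
    next_out = (j + n - 1 - (t + 1)) % n   # output of port j at slot t+1, per (2.2)
    return [0 if k == next_out else int(buffered[k]) for k in range(n)]

# Port 0 serves output (0 + 4 - 1 - 1) mod 4 = 2 at slot 1,
# so the bit for output 2 is cleared even though VOQ2(0,2) is occupied.
assert occupancy_vector([True, True, True, False], j=0, t=0) == [1, 1, 0, 0]
```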
From Fig. 2.3, we can also see that the feedback operation requires accurate
timing synchronization within a time slot. We notice that accurate synchronization of
less than 10 ns is reported in [106], and a scheme to achieve 1 ns synchronization is
proposed in [107]. Therefore, synchronization within a time slot of, say 40 ns, would
not be a major issue.
Note that the joint sequence in Fig. 2.2(c) does not have the staggered
symmetry property. If it is used for implementing the feedback path (as in [30,32]),
the occupancy vector cannot be piggybacked onto a data packet. Instead, a dedicated
feedback packet must be sent from each middle-stage port to its connected output in
each time slot. This incurs not only extra propagation delay for sending the feedback
packet, but also extra packetization and synchronization overhead. As a result, the
duration of a time slot in [30,32] would be much longer than that shown in Fig. 2.3.
If the switch performance is studied using the number of time slots, the inefficiencies
of using a “larger” time slot could be easily overlooked.
2.3.3 Solving Packet Mis-sequencing Problem
If the load-balanced switch in Fig. 2.1 is configured by the joint sequence in
Fig. 2.2(a), will we face the packet mis-sequencing problem? We know that packet
order will be preserved if every packet of a flow experiences the same amount of
delay when passing through any middle-stage port. This is obviously true if
middle-stage ports are bufferless, so that every packet experiences the same 0-slot delay.
Will it still be true for the case of a single packet buffer per VOQ2(j,k)?
Surprisingly, a closer examination of the joint sequence in Fig. 2.2(a) reveals
that packets of the same flow do experience the same middle-stage port delay. Take
flow(0,1) in Fig. 2.2(a) as an example. If a packet is sent (from input 0) to middle-
stage port 0 at t =0, it will be buffered at VOQ2(0,1) for 2 slots until VOQ2(0,1) is
connected to output 1 at t =2. If the next packet of the flow is sent to middle-stage
port 1 at t =1, it will be buffered at VOQ2(1,1) for, again, 2 slots until VOQ2(1,1) is
connected to output 1 at t =3.
In the following, we prove that this is true for each and every flow, and for
any switch size N . Consider the joint sequence in Fig. 2.2(a). The sequence used by
the first stage switch is constructed from (2.1). The sequence used by the second
stage switch is constructed according to the staggered symmetry property, which can
be represented by (2.2). That is, at time t (for 0 ≤ t < N), middle-stage port j is connected
to output k , where k is given by
k = ( j + N – 1 – t ) mod N (2.2)
Statement 1: (Anchor Output). In Fig. 2.2(a), input i is always connected to
output K , where K = [(i+ N –1) mod N ], via one of the middle-stage ports.
Proof: At time t , input i is connected to output k via middle-stage port j.
Substituting j from (2.1) into (2.2), we can express k in terms of i:

k = [((i + t) mod N) + N − 1 − t] mod N = (i + N − 1) mod N = K (2.3)
We can see that K depends only on i. Thus for a given input i, it is always connected
to the same anchor output K . #
Statement 2: (Deterministic Delay at Middle-stage Ports). Let K be the
anchor output of input i. For every packet of flow(i,k ), it experiences the same d slots
delay in one of the middle-stage ports, where d is given by
d = N            if K = k
d = K − k        if K > k          (2.4)
d = K − k + N    if K < k
Proof: Suppose at slot t , input i is connected to its anchor output K via
middle-stage port j and a packet is sent to join VOQ2( j,k ). From (2.2), middle port j is
connected to each output in descending order of the output port number. Then if K ≠ k,
this packet will experience exactly (K − k) modulo N slots of delay in VOQ2(j,k) due to
the single packet buffer at VOQ2(j,k). If K = k, this packet can only be sent when
middle port j connects to output K again, so its middle-stage delay is N time slots.
In short, this packet will experience exactly d slots of delay as calculated by (2.4), and d is
bounded by [1, N]. #
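Statement 2 can also be checked mechanically. The sketch below (again assuming the cyclic-shift form j = (i + t) mod N for (2.1), which is not reproduced in this excerpt) compares the closed form (2.4) against a direct simulation of the second-stage sequence (2.2):

```python
N = 8

def anchor_output(i, n=N):
    # Statement 1: K = (i + N - 1) mod N.
    return (i + n - 1) % n

def delay_formula(K, k, n=N):
    # Eq. (2.4): fixed middle-stage delay of flow(i, k), K the anchor output.
    if K == k:
        return n
    return K - k if K > k else K - k + n

def delay_by_simulation(i, k, t, n=N):
    # A packet of flow(i, k) sent at slot t joins middle port j = (i + t) mod n
    # (assumed form of (2.1)); it departs at the first later slot in which
    # port j is connected to output k under (2.2). Minimum delay is 1 slot.
    j = (i + t) % n
    d = 1
    while (j + n - 1 - (t + d)) % n != k:
        d += 1
    return d

for i in range(N):
    K = anchor_output(i)
    for k in range(N):
        for t in range(N):
            assert delay_by_simulation(i, k, t) == delay_formula(K, k)
```

For every flow and every sending slot the simulated delay equals (2.4), confirming that the delay is deterministic and independent of t.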
Statement 3 (In-order Packet Delivery). In-order packet delivery is
guaranteed if the joint sequence of configurations is constructed using (2.1) and (2.2).
Proof: Assume packets A and B of flow(i,k ) join VOQ2( j1,k ) and VOQ2( j2,k )
at time t A and t B (where t B>t A), respectively. Let d A and d B be their respective delays
experienced in VOQ2. Mis-sequencing occurs only if packet B reaches output k
earlier than packet A, i.e. t A+d A>t B+d B. However, this will never happen because
t B>t A and d A=d B from Statement 2. #
It can be easily seen that the delay a packet experiences at a middle-stage
port is bounded between [1, N] slots, and the average middle-stage packet delay is
merely (N+1)/2 slots for uniform traffic. From Fig. 2.2, we can see that some joint
sequences have the staggered symmetry property only, some have the in-order packet
delivery property only, and some have both properties. For instance, the joint
sequence in Fig. 2.2(b) has the staggered symmetry property but cannot ensure in-
order packet delivery. Consider packets from flow(0,1). Two different middle-stage
delays will be experienced, 2-slot via middle port 3 and 4-slot via middle port 1. This
causes packet out of order. On the other hand, the joint sequence in Fig. 2.2(c) can
provide in-order packet delivery but lacks the staggered symmetry property. The
systematic study of joint sequences is carried out in Chapter 3, but as far as this
chapter is concerned, we focus only on the joint sequence in Fig. 2.2(a).
2.3.4 Feedback-Based Scheduling Algorithms
Based on the received occupancy vector, each input port selects a packet for
sending. Such an input port scheduler should be designed to avoid both buffer
overflow and underflow at the connected middle-stage VOQ. Suppose input i is
connected to middle-stage port j at slot t, and its anchor output is K. Based on the N-bit
occupancy vector received from middle-stage port j in the previous slot t−1, we find the
candidate set Sj, i.e. the set of VOQ2(j,k)'s (for k = 0, 1, …, N−1) with 0-occupancy. Input i
can only choose the HOL packet of a VOQ1(i,h) whose VOQ2(j,h) is in Sj for sending. This avoids buffer
overflow at VOQ2( j,h).
From Fig. 2.2(a), we can see that middle port j is connected to each output in
descending order of the output port number. Therefore, we know a priori that in the
next slot t+1, port j will be connected to output K−1 (wrapped around modulo N). If
VOQ2(j,K−1) is empty and VOQ1(i,K−1) is not, we will face an underflow in
VOQ2(j,K−1) at slot t+1. As such, the scheduling algorithm should always give the
highest priority to scheduling the HOL packet of VOQ1(i,K−1) at slot t. With the above
considerations in mind, we present three simple input port schedulers below.
RR (Round-Robin): If VOQ1(i,h′) was selected in the previous slot, then the
next non-empty VOQ1(i,h) with VOQ2(j,h) ∈ Sj is selected. Comment: RR
gives fair access to each VOQ1, and RR is amenable to hardware
implementation [33].
LQF (Longest Queue First): Among all the non-empty VOQ1(i,h)'s with
VOQ2(j,h) ∈ Sj, the one with the longest queue size is selected. Comment:
LQF is good for non-uniform traffic, but requires O(N) comparisons. We can
replace it by Quasi-LQF [34], a very efficient sub-optimal LQF algorithm
requiring only a single comparison per time slot.
EDF (Earliest Departure First): Among all the non-empty VOQ1(i,h)'s with
VOQ2(j,h) ∈ Sj, the one with the earliest departure time at the middle-stage
port is selected. The departure time is calculated from (2.4). Comment: EDF
should not be confused with the classic Earliest Deadline First. Our EDF aims
at minimizing the chance of buffer overflow at each VOQ2, which is achieved
by always giving priority to the VOQ1 with the minimum middle-stage delay
to send first.
Take an example. Assume a 4×4 feedback switch is configured by the joint
sequence of Fig. 2.2(a), and at time slot 0 a packet of VOQ1(0,0) is sent. Assume
further that at time slot 1 there are 1, 2, 0 and 3 packets in VOQ1(0,0),
VOQ1(0,1), VOQ1(0,2) and VOQ1(0,3) respectively, and the feedback indicates that the
corresponding middle-stage buffer for output port 0 is not empty. Therefore, only
VOQ1(0,1) and VOQ1(0,3) are legitimate candidates, i.e. VOQ2(j,1), VOQ2(j,3)
∈ Sj. Then at time slot 1, RR and EDF would select the packet at VOQ1(0,1) for
sending, but LQF would transmit the HOL packet of VOQ1(0,3).
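This 4×4 example can be reproduced in a few lines of Python. The RR pointer position and the tie-breaking order below are our assumptions, chosen to match the example; the EDF rule uses the delay formula (2.4) with the anchor output K = 3 of input 0.

```python
N = 4
anchor_K = (0 + N - 1) % N              # anchor output of input 0 is 3

def middle_delay(k, K=anchor_K, n=N):
    # Eq. (2.4): fixed middle-stage delay of flow(0, k).
    return n if K == k else (K - k) % n

queue_len = {0: 1, 1: 2, 2: 0, 3: 3}    # VOQ1(0,k) lengths at slot 1
occupied = {0}                          # feedback: VOQ2(j,0) holds a packet
candidates = [k for k in range(N)
              if queue_len[k] > 0 and k not in occupied]   # legitimate VOQ1s

last = 0                                # VOQ1(0,0) was served at slot 0
rr = next((last + d) % N for d in range(1, N + 1)
          if (last + d) % N in candidates)        # round-robin scan from last+1
lqf = max(candidates, key=lambda k: queue_len[k])  # longest queue first
edf = min(candidates, key=middle_delay)            # earliest departure first

assert (rr, lqf, edf) == (1, 3, 1)
```

RR and EDF pick VOQ1(0,1) while LQF picks VOQ1(0,3), matching the text.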
To give a scheduler more time to execute, batch scheduling [35] can be used,
where a single scheduling decision is made over a batch of time slots (instead of per
slot). Packets arriving in the current batch of slots will be considered in the next batch.
Indeed, the multi-cabinet implementation of the feedback-based switch in Chapter 6
belongs to this category.
2.4 Performance Evaluations
In this section, the performance of our proposed feedback-based scheduling
algorithms is compared with some representative algorithms by simulations. In the
following, we only present simulation results for a switch of size N = 32, although
similar conclusions apply to other sizes (unless explicitly stated otherwise, the default
switch size is N = 32 in all simulation results of this thesis). In our simulations,
we focus on studying the performance of the three proposed feedback-based
scheduling algorithms in Section 2.3, i.e. round robin (RR), longest queue first (LQF)
and earliest departure first (EDF). For comparison, we also implement:
LQF with byte-focal switch architecture (LQF_Byte-Focal) [23], which
outperforms FOFF and in general is the best performing algorithm based on
resequencing buffer.
CR algorithm [29], which is the best performing frame-based scheduling
algorithm.
iSLIP algorithm [15], which serves as a benchmark for single-stage input-
queued switches. Specifically, we implement iSLIP with a single iteration
(iSLIP-1), as multiple iterations involve heavy communication overhead.
Output-queued switch, which serves as a lower bound.
2.4.1 Performance under Uniform Traffic
Uniform traffic is generated as follows. At each time slot for each input, a
packet arrives with probability p and is destined to each output with equal probability.
Fig. 2.4 shows the delay-throughput performance under uniform traffic. We can see
that the three input port schedulers RR, LQF and EDF yield comparable,
less-than-20-slot delay performance for input load up to p = 0.9. When p > 0.94, LQF gives the
best performance (as it always serves the most needed flow first), followed by
EDF and RR. The average packet delay at middle-stage ports can be easily derived:
(1+ N )/2 = 16.5 time slots. If we deduct this portion from the overall delay, we can
see that the (input port) delay of our scheduling algorithms matches the output-
queued switch performance very well. Compared with LQF_Byte-Focal, our three
schedulers give significantly smaller delay. When p is reasonably large (>0.6), our
algorithms also beat iSLIP and CR. When p=0.7, the delay of LQF_Byte-Focal is 95
time slots, iSLIP 44, CR 152 and ours only 20.
Fig. 2.4 Delay vs input load p, with uniform traffic.
2.4.2 Performance under Uniform Bursty Traffic
Bursty arrivals are modeled by the ON/OFF traffic model. In the ON state, a
packet arrives in every time slot. In the OFF state, no packets are generated.
Packets of the same burst have the same output, and the output for each burst is
uniformly distributed. Given an average input load p and an average burst size sp, the
state transition probability from OFF to ON is p/[sp(1−p)] and that from ON to OFF is
1/sp. Without loss of generality, we set the burst size sp = 30 packets.
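A minimal sketch of this ON/OFF source follows (the helper name and structure are our own, not code from the thesis). With the transition probabilities above, the stationary ON fraction, and hence the empirical load, should be close to p:

```python
import random

def onoff_source(p, sp, n_slots, seed=1):
    # Two-state ON/OFF arrival process for one input.
    # P(OFF -> ON) = p / [sp * (1 - p)],  P(ON -> OFF) = 1 / sp,
    # giving average load p and mean burst size sp. One packet per ON slot.
    rng = random.Random(seed)
    on = False
    arrivals = []
    for _ in range(n_slots):
        if on and rng.random() < 1.0 / sp:
            on = False
        elif not on and rng.random() < p / (sp * (1.0 - p)):
            on = True
        arrivals.append(1 if on else 0)
    return arrivals

a = onoff_source(p=0.6, sp=30, n_slots=500_000)
load = sum(a) / len(a)
assert abs(load - 0.6) < 0.03   # empirical load is close to p
```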
Fig. 2.5 shows the delay-throughput performance under uniform-bursty traffic.
In Fig. 2.5, we can see that delay builds up quickly with input load, which is due to
the bursty traffic nature. Nevertheless, our RR, LQF and EDF still outperform the
LQF_Byte-Focal and CR algorithms. At p = 0.8, the delay of LQF_Byte-Focal is 224
time slots, 232 for CR, 156 for our RR/LQF/EDF, and 114 for the output-queued switch.
Fig. 2.6 shows the delay performance of LQF under uniform-bursty traffic with
different burst sizes. We can see that average packet delay increases almost linearly
with burst size.
Fig. 2.5 Delay vs input load p, with uniform bursty traffic.
2.4.3 Performance under Hotspot Traffic
Packets arrive at each input port in each time slot with probability p. Packet
destinations are generated as follows. For input port i, a packet goes to output i+N/2
(mod N) with probability ½, and goes to any other output with probability 1/[2(N−1)]. Fig. 2.7
shows the delay-throughput performance under hotspot traffic.
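The hotspot destination rule can be sketched as below (a hypothetical generator, not code from the thesis); we take the hotspot output modulo N so that it wraps around for inputs i ≥ N/2. The empirical frequencies should match ½ for the hotspot output and 1/[2(N−1)] for every other output:

```python
import random

def hotspot_dest(i, n, rng):
    # Hotspot pattern: output (i + n/2) mod n with probability 1/2,
    # any other output with probability 1 / (2 * (n - 1)).
    hot = (i + n // 2) % n
    if rng.random() < 0.5:
        return hot
    return rng.choice([k for k in range(n) if k != hot])

rng = random.Random(42)
N, trials = 8, 100_000
counts = [0] * N
for _ in range(trials):
    counts[hotspot_dest(0, N, rng)] += 1

assert abs(counts[4] / trials - 0.5) < 0.01          # hotspot output of input 0
assert all(abs(counts[k] / trials - 1 / 14) < 0.01   # 1 / [2*(N-1)] = 1/14
           for k in range(N) if k != 4)
```

Note that with this pattern every output receives an aggregate load of p/2 + (N−1)·p/[2(N−1)] = p, so the traffic remains admissible for p ≤ 1.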
Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes.
Fig. 2.7 Delay vs input load p, with hot-spot traffic.
From Fig. 2.7, again we can see that our three schedulers are consistently
better than the others, and among the three, LQF again gives the best (lowest) delay
performance. Nevertheless, it is interesting to point out that the performance
difference among the three schedulers is much smaller than that in a single-stage
switch; this is due to the use of the first stage switch for load balancing. For
simplicity, we shall concentrate only on LQF below.
2.5 The Stability of Feedback-Based Two-Stage Switch
Simulation results in the previous section allow us to study the average
performance under specific traffic patterns. In this section, we prove that under a
speedup of two, a feedback-based switch using any arbitrary work-conserving
port-based scheduling algorithm (not just RR, LQF and EDF) is stable under any
admissible traffic pattern.
2.5.1 The Existing Approaches
Generally there are two approaches to proving 100% throughput: the
Lyapunov method and the fluid model. The Lyapunov method consists of
three steps [14,17]. First, model the VOQ-length process by a Markov chain. Then
convert the stability problem to a linear programming problem. Finally, use
appropriate Lyapunov functions. Based on this approach, switches using MWM [12],
MSM [14] and CIOQ [17] are proved to be stable.
In the Lyapunov method, the packet arrival process at each input is required to
be Bernoulli i.i.d. (independent and identically distributed). To remove this limitation,
the fluid model approach can be used. Under the assumption that the packet arrival
process at each input obeys the strong law of large numbers, a much broader class of traffic
can be accounted for. The 100% throughput proofs for MWM and CIOQ in [36], and
for the buffered crossbar switch in [20,37], are based on the fluid model.
2.5.2 Fluid Model for Feedback-Based Two-Stage Switch
Like [20,36-37], we first establish a fluid model for scheduling packets. Let
the number of packets in VOQ1(i, j) at the beginning of time slot n be Z ij(n). Let the
cumulative number of arrivals and departures for VOQ1(i, j) at the beginning of slot n
be Aij(n) and Dij(n), respectively. We have:
Zij(n) = Zij(0) + Aij(n) − Dij(n),  n ≥ 0,  i, j = 1, …, N (2.5)
Let the number of packets in VOQ2(i,j) at the beginning of slot n be Bij(n).
Because there is only one packet buffer for each VOQ2(i,j), we have Bij(n) = 0 if
VOQ2(i,j) is empty and Bij(n) = 1 if VOQ2(i,j) is occupied. Let the cumulative numbers of
arrivals and departures for VOQ2(i,j) at the beginning of slot n be Xij(n) and Yij(n),
respectively. The following relationship holds:

Bij(n) = Bij(0) + Xij(n) − Yij(n),  n ≥ 0,  i, j = 1, …, N (2.6)
We assume that the packet arrival process obeys the strong law of large
numbers with probability one, i.e.
lim n→∞ Aij(n)/n = λij,  i, j = 1, …, N,

where λij is the mean packet arrival rate to VOQ1(i,j). The switch is, by definition,
rate stable if:
lim n→∞ Dij(n)/n = λij,  i, j = 1, …, N.
An admissible traffic matrix is defined as the one that satisfies the following
constraints.
Σi λij ≤ 1  and  Σj λij ≤ 1,  i, j = 1, …, N (2.7)
If a switch is rate stable for an admissible traffic matrix, then the switch delivers
100% throughput.
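Condition (2.7) can be checked directly on a rate matrix; the sketch below (with hypothetical example matrices of our own) verifies that no input row and no output column is oversubscribed:

```python
def is_admissible(lam):
    # Admissibility condition (2.7): every row sum (per input) and every
    # column sum (per output) of the rate matrix is at most 1.
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) <= 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) <= 1 for j in range(n))
    return rows_ok and cols_ok

uniform = [[0.9 / 4] * 4 for _ in range(4)]         # uniform traffic, load 0.9
assert is_admissible(uniform)
assert not is_admissible([[0.6, 0.6], [0.0, 0.0]])  # input 0 oversubscribed
```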
The fluid model is determined by a limiting procedure illustrated below. First,
the discrete functions are extended to right continuous functions. For arbitrary time t
∈ [n, n+ 1):
Aij(t ) = Aij(n);
Z ij(t ) = Z ij(n);
Dij(t ) = Dij(n) + (t - n)( Dij(n + 1) - Dij(n) );
X ij(t ) = X ij(n);
Bij(t ) = Bij(n);
Y ij(t ) = Y ij(n) + (t - n)( Y ij(n + 1) - Y ij(n) );
Note that all functions are random elements of D[0, ∞). We shall sometimes
use the notation Aij(·,ω), Zij(·,ω), Dij(·,ω), Xij(·,ω), Bij(·,ω), and Yij(·,ω) to explicitly
denote the dependency on the sample path ω. For a fixed ω, at time t, we have [36]:
Aij(t ,ω), the cumulative number of arrivals to VOQ1(i, j)
Z ij(t ,ω), the number of packets in VOQ1(i, j)
Dij(t ,ω), the cumulative number of departures from VOQ1(i, j)
X ij(t ,ω), the cumulative number of arrivals to VOQ2(i, j)
Bij(t ,ω), the number of packets in VOQ2(i, j)
Y ij(t ,ω), the cumulative number of departures from VOQ2(i, j)
For each r > 0, we define the scaled processes

Aʳij(t,ω) = (1/r)·Aij(rt,ω);   Zʳij(t,ω) = (1/r)·Zij(rt,ω);   Dʳij(t,ω) = (1/r)·Dij(rt,ω);
Xʳij(t,ω) = (1/r)·Xij(rt,ω);   Bʳij(t,ω) = (1/r)·Bij(rt,ω);   Yʳij(t,ω) = (1/r)·Yij(rt,ω).
It is shown in [20,37] that for each fixed ω satisfying (2.5), (2.6) and any sequence
{rn} with rn → ∞ as n → ∞, there exists a subsequence {rnk} and continuous
functions Āij(·), Z̄ij(·), D̄ij(·), X̄ij(·), B̄ij(·), Ȳij(·) such that, along this subsequence
(r = rnk), the scaled processes converge uniformly on compacts as k → ∞, for any t ≥ 0:

Aʳij(t,ω) → λij·t;   Zʳij(t,ω) → Z̄ij(t);   Dʳij(t,ω) → D̄ij(t);
Xʳij(t,ω) → X̄ij(t);   Bʳij(t,ω) → B̄ij(t);   Yʳij(t,ω) → Ȳij(t). (2.8)
Definition 1: Any function obtained through the limiting procedure in (2.8) is
said to be a fluid limit of the switch. So the fluid model equations using our proposed
scheduling algorithms are:
Z̄ij(t) = Z̄ij(0) + λij·t − D̄ij(t),  t ≥ 0 (2.9)

B̄ij(t) = B̄ij(0) + X̄ij(t) − Ȳij(t),  t ≥ 0 (2.10)
Definition 2: The fluid model of a switch operating under a scheduling
algorithm is said to be weakly stable if for every fluid model solution (D̄, Z̄)
with Z̄(0) = 0, we have Z̄(t) = 0 for almost every t ≥ 0.
From [36], the switch is rate stable if the corresponding fluid model is weakly
stable. Our goal here is to prove that for every fluid model solution (D̄, Z̄) using our
scheduling algorithms, Z̄(t) = 0 for almost every t. To prove this, we will use the
following Fact 1 from [36]:

Fact 1: Let f be a non-negative, absolutely continuous function defined on R+∪{0}
with f(0) = 0. Assume that (d/dt)f(t) ≤ 0 for almost every t such that f(t) > 0. Then f(t) = 0
for almost every t ≥ 0. (Note that R+ is the set of positive real numbers, and (d/dt)f(t)
denotes the derivative of the function f(t) at time t.)
2.5.3 100% Throughput Proof
In the following, we show that our proposed scheduling algorithms give
100% throughput. The result is quite strong in the sense that it holds for any arbitrary
work-conserving scheduling algorithm with a speedup of two. In other words, each
input i can choose to serve any non-empty VOQ1(i,k ) for which VOQ2( j,k ) is empty.
Theorem 1: (Sufficiency) A work-conserving scheduling algorithm can
achieve 100% throughput with a speedup of two for any admissible traffic pattern
obeying the strong law of large numbers.
Proof: Let C̄ij(t) denote the joint queue occupancy of all packets arrived at
input port i, plus all packets destined for output j. We have

C̄ij(t) = Σp Z̄ip(t) + Σm [Z̄mj(t) + B̄mj(t)] (2.11)

Z̄(t) and B̄(t) are all non-negative, absolutely continuous functions, so C̄ij(t) is
non-negative and absolutely continuous too. We can see that C̄ij(0) = 0, and then we have

(d/dt)C̄ij(t) = Σp (d/dt)Z̄ip(t) + Σm [(d/dt)Z̄mj(t) + (d/dt)B̄mj(t)]

Combined with (2.9) and (2.10), we get

(d/dt)C̄ij(t) = Σp [λip − (d/dt)D̄ip(t)] + Σm [λmj − (d/dt)D̄mj(t) + (d/dt)X̄mj(t) − (d/dt)Ȳmj(t)]

With a work-conserving scheduling algorithm, packets that leave VOQ1(m,j) enter
VOQ2(m,j), for m = 1, …, N, so Σm (d/dt)D̄mj(t) = Σm (d/dt)X̄mj(t). Then

(d/dt)C̄ij(t) = Σp λip + Σm λmj − Σp (d/dt)D̄ip(t) − Σm (d/dt)Ȳmj(t).

From the admissible traffic condition (2.7), we get

(d/dt)C̄ij(t) ≤ 2 − [Σp (d/dt)D̄ip(t) + Σm (d/dt)Ȳmj(t)] (2.12)
For any non-empty VOQ1(i, j), i.e.0)( t Z ij , then by continuity of )(t Z , such that
0)'( t Z ij for ],[ t t t . Set
)(min],[
t Z a ijt t t
.
For large enough k , we have 2/)( at Z k nr
ij for ],[ t t t . Also, for large
enough k we have .12/ ar k n Thus
1)( t Z ij for )],(,[ t r t r t
k k nn which means
that VOQ1(i, j) holds at least one packet in the long interval )].(,[ t r t r k k nn With a
work-conserving scheduling algorithm, flow(i,j) packets always experience the same
fixed middle-stage port delay of d slots, where d is given by (2.4). During the time
interval [r_{n_k} t, r_{n_k} \hat{t}], when input port i is connected to any middle port g:
- if VOQ2(g,j) is empty, a packet is transmitted from input port i to middle port g, and D_{ig}(t) is increased by one;
- if VOQ2(g,j) is not empty, the packet in VOQ2(g,j) will be transmitted to output port j with a fixed delay of q slots, where q = d mod N, and \sum_m Y_{mj}(t) will be increased by one after q slots. (The packet in VOQ2(g,j) is sent when middle port g is connected to output j. If this occurs in the current time slot, q = 0; otherwise, it takes another q = d slots.)
If the switch is operated with a speedup of S, then over the long time interval
[r_{n_k} t, r_{n_k} \hat{t}] it fulfills

\sum_p [D_{ip}(r_{n_k} \hat{t}) - D_{ip}(r_{n_k} t)] + \sum_m [Y_{mj}(r_{n_k} \hat{t} + q) - Y_{mj}(r_{n_k} t + q)] \ge S r_{n_k} (\hat{t} - t).
Note that \sum_m Y_{mj}(t) is monotonically non-decreasing and increases by at most one in
every time slot. So we have

\sum_m Y_{mj}(r_{n_k} \hat{t} + q) \le \sum_m Y_{mj}(r_{n_k} \hat{t}) + q,

\sum_m Y_{mj}(r_{n_k} t + q) \ge \sum_m Y_{mj}(r_{n_k} t).
Combining them together, we have

\sum_p [D_{ip}(r_{n_k} \hat{t}) - D_{ip}(r_{n_k} t)] + \sum_m [Y_{mj}(r_{n_k} \hat{t}) - Y_{mj}(r_{n_k} t)] + q \ge S r_{n_k} (\hat{t} - t).
Since q is pre-determined and within [0, N-1], its impact is insignificant in the fluid
limit [20]. Dividing the above inequality by r_{n_k} and letting k \to \infty, the fluid limits are
obtained as

\sum_p [\bar{D}_{ip}(\hat{t}) - \bar{D}_{ip}(t)] + \sum_m [\bar{Y}_{mj}(\hat{t}) - \bar{Y}_{mj}(t)] \ge S (\hat{t} - t).
Further dividing the above inequality by (\hat{t} - t) and letting \hat{t} \to t, the
derivative of the fluid limit is

\sum_p \dot{\bar{D}}_{ip}(t) + \sum_m \dot{\bar{Y}}_{mj}(t) \ge S.    (2.13)
With a speedup of two (i.e. S = 2), combining (2.12) and (2.13), we get

\dot{\bar{C}}_{ij}(t) \le 0

whenever \bar{Z}_{ij}(t) > 0. Based on Fact 1, \bar{C}_{ij}(t) = 0 for almost every t \ge 0. Due to (2.11)
and \bar{C}_{ij}(t) = 0, we have \bar{Z}_{ij}(t) = 0 for almost every t \ge 0. Theorem 1 is proved. #
It should be noted that existing stability proofs [21-30] adopt a common
approach of showing that the delay performance of a specific algorithm is within a
finite bound of that of the output-queued switch. Since the buffer size at each middle-stage
port is usually assumed to be infinite, the derived bound with respect to the
output-queued switch can be unrealistically large.
2.6 Chapter Summary
In this chapter, a framework for designing feedback-based scheduling
algorithms was proposed to elegantly solve the notorious packet mis-sequencing
problem of a load-balanced switch without sacrificing the switch’s delay and
throughput performance. Unlike existing approaches, we showed that the efforts
made in load balancing and keeping packets in order can complement each other.
Specifically, at each middle-stage port between the two switch fabrics of a load-
balanced switch, only a single-packet-buffer for each VOQ is required. In-order
packet delivery is made possible by properly selecting and coordinating the two
sequences of switch configurations to form a joint sequence with both staggered
symmetry property and in-order packet delivery property. Compared with the
existing load-balanced switch architectures and scheduling algorithms, our solutions
impose the most modest requirements on switch hardware, yet consistently yield the best
delay and throughput performance under various traffic conditions.
Chapter 3
Cutting Down Average Packet Delay
3.1 Introduction
For an N × N switch, there are N^2 input-output pairs, and thus it needs to carry
a total of N^2 different packet flows. In a feedback-based switch (Fig. 3.1), although
the amount of middle-stage port delay experienced by packets of the same flow is the
same, packets of different flows may experience different middle-stage port delays.
The feedback-based switch in Fig. 3.1 is configured with the joint sequence in Fig.
3.2(a). Flow(0,1) packets will experience a 2-slot middle port delay, e.g. arriving at
middle port 0 at t = 0 and leaving at t = 2. On the other hand, flow(0,2) packets will only
experience a 1-slot middle port delay, e.g. arriving at middle port 0 at t = 0 and leaving
at t = 1. Assume flow(0,1) and flow(0,2) are the only flows in the switch, and the
packet arrival rate of flow(0,1) is much higher than that of flow(0,2). To minimize
the average packet delay, can we swap the two flows such that flow(0,1)
packets experience the 1-slot middle port delay instead? In general, if the traffic rate
matrix of a switch is known (e.g. by measurement), can we cut down the average
middle-stage packet delay by assigning heavy flows to experience smaller middle-stage
delays? This problem is investigated in this chapter along two directions.
Fig. 3.1 The feedback-based two-stage switch architecture.
First, from Chapter 2 we know that there exists a set of joint sequences with
both staggered symmetry and in-order packet delivery properties (the joint sequence
in Fig. 3.2(a) is just a particular instance). Then, for a given traffic matrix, we try to
find an optimal joint sequence that minimizes the average middle-stage delay. But
the search involves rather tedious computation. A more practical solution is therefore
proposed: adding another stage of switch fabric to dynamically map heavy flows
to smaller middle-stage port delays. We call it a feedback-based three-stage
switch.
The rest of this chapter is organized as follows. In the next section, we design
the optimal joint sequence for feedback-based two-stage switch under specific traffic.
In Section 3.3, the three-stage switch architecture is introduced to minimize the
average middle-stage delay. Finally, we conclude this chapter in Section 3.4.
Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch.
3.2 Optimal Joint Sequence Design
A feedback-based two-stage switch has a single packet buffer at each middle-
stage VOQ2( j,k ). It is configured by a pre-determined joint sequence of N joint
configurations. A joint sequence consists of two (component) sequences of N
configurations, one for each switch stage, called first stage sequence and second
stage sequence. From Fig. 3.2 and our discussion in Chapter 2, we can see that some
joint sequences have the staggered symmetry property only, some have in-order
packet delivery property only, some have both properties, and yet more have none of
the properties (not shown). The relationship among them can be described by Fig. 3.3.
For a feedback-based switch to properly function, a joint sequence should have both
staggered symmetry and in-order packet delivery properties. To find the optimal joint
sequence for a given traffic matrix, we have to answer the following two questions:
1. What is the necessary and sufficient condition for both staggered symmetry
and in-order delivery in a feedback-based two-stage switch?
2. How many such joint sequences exist?
Fig. 3.3 The relation between staggered symmetry and in-order delivery.
A broader sense definition of feedback-based switch shall be adopted in this
section to denote any load-balanced switch with single packet buffer at each middle-
stage VOQ2( j,k ). If staggered symmetry property is also required, we spell it out
explicitly as feedback-based switch with staggered symmetry property.
3.2.1 In-Order Packet Delivery Only
Statement 4: A constant middle-stage delay for all packets belonging to the
same flow (and for all N^2 flows) is a necessary and sufficient condition for in-order
packet delivery in a feedback-based switch.
Proving sufficient condition: If the middle-stage delay is constant for all
packets of the same flow, then middle-stage ports will not cause any packet
out-of-order problem. #
Note that this sufficient condition is not limited to the feedback-based two-
stage switch architecture. It can be applied to other load-balanced switch
architectures [21, 31].
Proving necessary condition: In a feedback-based switch, assume flow(i,k )
packets do not experience a constant middle port delay. Nevertheless, based on the
periodicity of joint sequence, the middle-stage port delay is always bounded between
[1, N ] slots and if packets A and B of flow(i,k ) enter middle-stage ports at time slot t
and t + N respectively, they will still experience the same middle port delay. Therefore,
there exist packets C and D belonging to flow(i,k ), such that C enters a middle-stage
port at slot t and experiences a middle-stage delay of d s slots, whereas D enters a
middle-stage port at slot t +1 and experiences a middle-stage delay of d (d < d s) slots.
Because d and d_s are both positive integers,

d + 1 \le d_s.

Packets C and D leave the middle-stage ports (and thus the switch, as there is no
output buffer) at slots t + d_s and t + 1 + d, respectively. If d + 1 = d_s, then t + d_s = t + 1 + d, which
means C and D would leave at the same time slot. This contradicts the property of the
joint sequence, so d + 1 \ne d_s, i.e.
d + 1 < d_s.    (3.1)

From (3.1), we get t + 1 + d < t + d_s. In other words, packet C leaves its middle port
after D, causing packets to go out of sequence. As a non-constant middle-stage delay
causes packets to go out of sequence, this proves the necessary condition. #
This necessary condition is only valid under the feedback-based two-stage
switch architecture, which means that there is only one packet buffer for every
VOQ2( j,k ) and two switch fabrics are configured by a joint sequence. Note that in a
feedback-based two-stage switch, the middle-stage port delay is always bounded
between [1, N ] slots. If the middle port delay is not upper bounded by N slots, the
necessary condition may fail. For example, if every subsequent packet of a flow incurs
a larger middle-stage delay than the previous packet, in-order delivery
could still be sustained.
In Fig. 3.2(c), we can see that each input port always connects to a fixed
output port (via some middle-stage port) in all time slots. We call this anchor output
property. In this case, outputs 0, 1, 2 and 3 are the anchor outputs of inputs 0, 1, 2
and 3, respectively. Further consider input 0 in Fig. 3.2(c), it connects to middle ports
0, 1, 2 and 3 in a cyclic manner in each subsequent time slot. We denote this cycle by
(0, 1, 2, 3). Similarly, we can see that inputs 1, 2 and 3 connect to middle ports
following cycles (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0), respectively. Indeed, (0, 1, 2,
3), (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0) are just different ways to express the same
cycle (0, 1, 2, 3). If all input ports of a switch connect to middle ports following the
same cycle, we say the sequence of N configurations is ordered. We can see that both
first and second sequences of configurations in Fig. 3.2(a) and (c) are ordered.
Statement 5: If a joint sequence of configurations has the anchor output
property, and one of its two sequences is ordered, then the other sequence is also
ordered.
Proof : Without loss of generality, let the first stage sequence of configurations
be ordered based on cycle ( j1, j2, j3, j4 ... j N ). At time slot t , let middle ports j1, j2 , j3 ,
j4 ... j N be connected by input ports i1, i2 , i3 , i4 ... i N respectively. Further let k 1, k 2 , k 3 ,
k 4 ... k N be the anchor outputs for i1, i2 , i3 , i4 ... i N . We can get the generic joint
configuration at time slot t as shown in Fig. 3.4:
Fig. 3.4: The generic joint configuration at time slot t.
Similarly, the joint configurations at each subsequent time slot up to t + N -1
can be constructed based on the anchor output and ordered sequence properties as
shown in Fig. 3.5, resulting in a joint sequence of N joint configurations. By
construction, we can see that the second sequence of configurations (identified by
solid lines) is also ordered, and follows the cycle (k_1, k_N, k_{N-1}, k_{N-2}, ..., k_2). #
Fig. 3.5: Generic joint sequence with anchor output and ordered properties.
If the two component sequences of a joint sequence are both ordered, we say
the joint sequence has the ordered property. Note that the tuple (i x, j x, k x) could take
any value in [0, N -1], so the joint sequence in Fig. 3.5 is a generic expression for all
possible joint sequences with anchor output and ordered properties.
Statement 6: Anchor output and ordered properties are the necessary and
sufficient condition for a constant middle port delay for packets of the same flow in a
feedback-based two-stage switch.
Proving sufficient condition: Let the first stage sequence of configurations be
ordered based on cycle ( j1, j2, j3, j4 ... j N ). Further let k 1, k 2 , k 3 , k 4 ... k N be the anchor
outputs for i1, i2 , i3 , i4 ... i N . From Statement 5, the second sequence of configurations
is also ordered, and follows the cycle (k_1, k_N, k_{N-1}, k_{N-2}, ..., k_2). This joint sequence is
shown in Fig. 3.5. Consider a packet A of flow(i1,k N ) being transmitted to some
middle port j. Due to anchor output, j connects to (anchor) output port k 1 (of input i1)
at the current time slot. Packet A arrives and waits at middle port j until j connects to
k_N. Since the second stage sequence of configurations is ordered in the cycle (k_1, k_N,
k_{N-1}, k_{N-2}, ..., k_2), j will connect to output port k_N after one time slot. That means for any
arbitrary middle port j, the middle-stage port delay for flow(i_1, k_N) is always 1 slot.
Repeating the above procedure for all N^2 possible flows, we can see that each flow has a
constant middle-stage delay. The sufficient condition is proved. #
Proving necessary condition: In a feedback-based two-stage switch, the
middle port delay is bounded between [1, N ] slots. Due to the connectivity of a joint
sequence, different flows arrived at an input port must experience distinct amount of
middle-stage port delays. In other words, at each input port, there exists exactly one
flow(i,k ) experiencing a constant middle-stage port delay of d time slots, for d =
1, …, N . Assume flow(i,k ) experiences the constant middle-stage port delay of d = N
time slots. At time slot t, input i connects to some middle-stage port j′, and j′ connects
to some output port k′. If a packet B of flow(i,k) is transmitted to middle port j′ in this
slot, because of the constant N time slots middle-stage port delay for flow(i,k ), j′ will
connect to output port k after N slots. The joint sequence is periodic with a cycle of N
slots, so k=k ′. For arbitrary time slot t and middle port j′, k=k ′ is always true and this
shows that output k is the anchor output for input i. Repeating the above procedure for
all input ports, we can see that each input port has a distinct anchor output port. This
shows that anchor output is the necessary condition for a constant middle-stage port
delay.
At input i, there exists flow(i,k ′) with constant 1-slot middle-stage port delay.
Let output k be the anchor output for i, as proved above. When input i connects to
some middle-stage port j at any time slot t , due to anchor output property, j connects
to output k . At current time slot, if a packet C of flow(i,k ′) is sent to middle port j,
then one time slot later, j will connect to output k ′ to keep the constant 1-slot middle
port delay. Since both middle port j and time slot t are arbitrarily selected, all middle
ports connect to output ports k and k ′ following the same order of k first and then k ′.
Repeating the above process from the 1-slot middle-stage port delay up to (N-1) slots, we can
show that all middle ports connect to output ports following the same ordered sequence.
This proves the necessary condition. #
From Statements 4 and 6, we can directly get Statement 7:
Statement 7: Anchor output and ordered properties are the necessary and
sufficient condition for packet in-order delivery in a feedback-based two-stage switch.
3.2.2 Both In-Order Packet Delivery and Staggered Symmetry
Statement 8: If one sequence of configurations is ordered and the other
sequence is constructed by the staggered symmetry property, then the resulting joint
sequence has the anchor output property.
Proof : Staggered symmetry property refers to the fact that for any middle-
stage port j, if it is connected to output k at time slot t , then at next slot (t +1) input k
is connected to the same middle-stage port j. In other words, the second
configuration at time slot t is a (vertical) mirror image of the first configuration at
time slot t +1, and the second configuration at t + N -1 wraps around to become the
mirror image of the first configuration at t .
Fig. 3.6: Joint sequence with staggered symmetry and in-order delivery.
Without loss of generality, let the first stage sequence of configurations be
ordered with cycle (j_1, j_2, j_3, j_4, ..., j_N). Due to the staggered symmetry, we can see that
the second stage sequence is also ordered, but interestingly, based on the cycle
(i_N, i_{N-1}, ..., i_2, i_1), which runs in the opposite direction to that followed by the first stage.
The resulting joint sequence is shown in Fig. 3.6. In each time slot, connection
pattern in the first stage fabric is shifted downwards once (i.e. towards the right hand
side of ( j1, j2, j3, j4 ... j N )), whereas the connection pattern in the second stage is
shifted upwards once. From an input port’s point of view, the net effect is that the
shifting in opposite direction cancel out each other, and the input connects to the
same output (but via a different middle-stage port) as in the previous time slot. This
proves the sufficient condition for Statement 8. #
Statement 9: For a feedback-based two-stage switch, the necessary and
sufficient conditions for both in-order packet delivery and staggered symmetry are:
one sequence of configurations is ordered and the other sequence is constructed by
the staggered symmetry property.
Proof : The sufficient condition for in-order packet delivery and staggered
symmetry is a direct consequence from Statements 7 and 8. If in-order packet
delivery is guaranteed, from Statement 7 an ordered sequence of configurations is a
necessary condition. Obviously the staggered symmetry property itself is the
necessary condition for both in-order packet delivery and staggered symmetry
properties. #
3.2.3 Finding the Number of Different Joint Sequences
Statement 9 answers question 1 (i.e. the necessary and sufficient
condition for both staggered symmetry and in-order delivery). For the sake of
finding the optimal joint sequence, in the following we focus on question 2 (i.e.
how many such joint sequences exist).
All possible joint sequences: To find the number of sequences that satisfy the
requirement of each input visiting each output exactly once in the sequence,
we can make use of the solution for the classic problem of the Latin square [38].
A Latin square is an N × N table filled with N different symbols in such a way
that each symbol occurs exactly once in each row and exactly once in each
column. From [38], the total number of Latin squares is given by N!(N-1)!M,
where M is the number of reduced Latin squares (and M ≥ 1). Unlike a Latin
square, in a load-balanced switch the configuration sequence is periodic with
period N, and “sequences” beginning with different starting time slots should be
counted once. Accordingly, the number of configuration sequences in the first
stage fabric is N times smaller than the number of Latin squares, or [(N-1)!]^2 M.
For a given first stage sequence, there are N!(N-1)!M ways to select
the second sequence, resulting in a total of N![(N-1)!]^3 M^2 possible joint
sequences. (Note that in this case, “sequences” with different starting time
slots are counted individually because they produce different joint sequences.)
Joint sequences with in-order delivery property only: Based on Statement
7, the number of joint sequences providing in-order packet delivery is the
product of the number of different anchor output patterns and the number of
ordered sequences. Since each input must have a distinct anchor output, there
are N! ways to select an anchor output pattern. Similarly, the number of
possible configurations in a time slot is N!, and there are (N-1)! possible
cycles that a configuration (sequence) can follow. This results in N!(N-1)!
possible choices. But among them, we only count “sequences” with different
starting time slots once, so the total number of ordered sequences is [(N-1)!]^2.
Then, the total number of joint sequences that keep packets in order is
N![(N-1)!]^2.
Joint sequences with staggered symmetry property only: If the sequence
of configurations used by one switch fabric is known, we can always
construct a unique joint sequence with the staggered symmetry property. So
the number of joint sequences with the staggered symmetry property equals the
number of possible single-stage sequences, or [(N-1)!]^2 M, where M is the
number of reduced Latin squares [38].
Number of joint sequences with both properties: From Statement 9,
once the first stage sequence of configurations is determined, so is the second
stage by the staggered symmetry property. Then the number of joint
sequences with both properties equals the number of ordered
sequences, which is given by [(N-1)!]^2. It should be noted that if all the
isomorphic joint sequences are counted once, then there are only (N-1)!
unique/non-isomorphic joint sequences, each yielding a different delay
experience at middle-stage ports.
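The [(N-1)!]^2 count of ordered sequences can be cross-checked by brute force for a small switch. A sketch (not from the thesis), assuming N = 3, where M = 1 and the two possible cycles are listed explicitly:

```python
# Sketch: enumerate all ordered single-stage sequences for N = 3, counting
# sequences that differ only in their starting time slot once.
from itertools import permutations
from math import factorial

N = 3
cycles = [(0, 1, 2), (0, 2, 1)]        # the (N-1)! distinct cycles over N middle ports

def build(p0, cycle):
    """Ordered sequence: every input advances one step along the same cycle each slot."""
    step = {cycle[x]: cycle[(x + 1) % N] for x in range(N)}
    seq, conf = [], list(p0)
    for _ in range(N):
        seq.append(tuple(conf))
        conf = [step[m] for m in conf]     # advance each input's middle port
    return tuple(seq)

def canonical(seq):
    """Identify the N rotations (different starting slots) of the same periodic sequence."""
    return min(tuple(seq[(t + s) % N] for t in range(N)) for s in range(N))

ordered = {canonical(build(p0, c)) for p0 in permutations(range(N)) for c in cycles}
assert len(ordered) == factorial(N - 1) ** 2   # [(N-1)!]^2 ordered sequences
```

The N! starting configurations times (N-1)! cycles give N!(N-1)! raw sequences; deduplicating the N rotations of each leaves [(N-1)!]^2 of them, matching the count above.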
3.2.4 Discussions
Both questions above are now addressed, and we build on them
to identify the optimal joint sequence for a given traffic matrix. Statement 9 provides
an efficient mechanism to design a joint sequence for feedback-based two-stage
switches. We first show how a joint sequence can be constructed based on Statement
9. Assume the first stage sequence of configurations is ordered based on cycle ( j1, j2,
j3, j4 ... j N ). At time slot t , let middle ports j1, j2 , j3 , j4 ... j N be connected by input ports
i1, i2 , i3 , i4 ... i N respectively. Due to the ordered property, the first stage
configurations at each subsequent time slot and up to t + N -1 can be constructed.
When the first stage sequence is obtained, the second stage sequence of
configurations can be constructed directly from the staggered symmetry property.
The resulting joint sequence is shown in Fig. 3.6. Note that the tuple ( i x, j x) could
take any value in [0, N -1], so the joint sequence in Fig. 3.6 is a generic expression for
all possible joint sequences with both in-order packet delivery and staggered
symmetry properties. By substituting all possible values for (i x, j x) into Fig. 3.6, we
can systematically find all joint sequences with both staggered symmetry and in-
order packet delivery properties.
Let us take a closer look at Fig. 3.6. We observe that the middle-stage port
delay for flow(i,i) (i = i_1, i_2, ..., i_N) is always N-1 slots. In other words, it is not
possible to map flow(i,i) (i = 0, 1, ..., N-1) to experience less than (N-1)-slot middle-stage
delay by any joint sequence satisfying Statement 9 (i.e. with both staggered
symmetry and in-order packet delivery properties). Also from Fig. 3.6, if output port j
is the anchor output for input port i, then the middle-stage port delay for flow(j,i) is
always N-2 slots, regardless of the values of i and j.
We can see that when using a joint sequence with both in-order delivery and
staggered symmetry, the delays of different flows are intricately correlated with
each other. To find the optimal joint sequence that gives the minimum overall switch
delay performance, we have to use brute force to check every possible
joint sequence in the pool of [(N-1)!]^2, which involves rather tedious computation.
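To make the cost of this search concrete, it can be sketched as follows (not the thesis's implementation). The per-flow middle-stage delays of each candidate joint sequence are obtained by simulating it directly; the traffic matrix is the 4 × 4 example of Fig. 3.9(a), and `delays`/`weighted` are hypothetical helper names introduced here:

```python
# Sketch: brute-force search over all joint sequences satisfying Statement 9
# for the minimum traffic-weighted total middle-stage delay.
from itertools import permutations

N = 4
lam = [[0.3, 0.2, 0.2, 0.1],
       [0.1, 0.2, 0.1, 0.4],
       [0.2, 0.5, 0.1, 0.1],
       [0.2, 0.1, 0.4, 0.3]]

def delays(p0, cycle):
    """Middle-stage delay of every flow under one candidate joint sequence."""
    step = {cycle[x]: cycle[(x + 1) % N] for x in range(N)}
    first, conf = [], list(p0)                   # first[t][i] = middle port of input i
    for _ in range(N):
        first.append(list(conf))
        conf = [step[m] for m in conf]
    # Staggered symmetry: middle j -> output k at slot t iff first[(t+1) % N][k] == j.
    second = [[None] * N for _ in range(N)]
    for t in range(N):
        for k in range(N):
            second[t][first[(t + 1) % N][k]] = k
    d = [[None] * N for _ in range(N)]
    for i in range(N):
        j = first[0][i]                          # packet enters middle port j at slot 0
        for k in range(N):                       # wait until j connects to output k
            d[i][k] = next(tp for tp in range(1, N + 1) if second[tp % N][j] == k)
    return d

def weighted(d):
    return round(sum(lam[i][k] * d[i][k] for i in range(N) for k in range(N)), 6)

costs = {weighted(delays(p0, (0,) + c))
         for p0 in permutations(range(N)) for c in permutations(range(1, N))}
best = min(costs)
identity = weighted(delays(tuple(range(N)), (0, 1, 2, 3)))
```

Every candidate keeps flow(i,i) at the (N-1)-slot delay noted above, so the search can only rearrange the off-diagonal flows; even so, the best joint sequence already improves on the baseline ordering for this traffic matrix.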
3.3 Three-Stage Switch
In this section, we follow another, more practical approach, which adds
another stage of switch fabric for dynamically mapping heavy flows to
smaller middle-stage port delays, called the three-stage switch.
3.3.1 Three-Stage Switch Architecture
The three-stage switch architecture is shown in Fig. 3.7. Any joint sequence
with staggered symmetry and in-order packet delivery properties, e.g. the one in Fig.
3.2(a), can be used by the first two switch fabrics. The selected joint sequence will
not be changed according to traffic. Instead, the configuration of the third stage
switch is designed/adjusted to map heavy flows to smaller middle-stage delays.
As the configuration in the third switch fabric is based on traffic, it is updated only if
there is a significant enough change in traffic pattern. Since no buffer is required at
the virtual output ports (in Fig. 3.7), adding the third stage switch fabric does not
increase the packet delay (assuming propagation delay is negligible). In other words,
as soon as packets arrive at virtual outputs, they are re-directed to outputs via the
configuration in the third fabric. The 0-delay at the virtual outputs (due to 0-buffer)
also ensures no packet mis-sequencing, and no interruption to the original middle-
stage VOQ occupancy feedback mechanism.
Fig. 3.7 A three-stage switch architecture.
An example is shown in Fig. 3.8. With the three-stage switch in Fig. 3.8(b),
packets of flow(0,3) are delivered to virtual output 2 (instead of 3). After staying at
middle-stage ports for one slot, a packet arrives at virtual output 2 and is immediately
re-directed to output 3. We can see that the middle-stage delay of flow(0,3) packets is
just one slot, whereas 4 slots are required using the two-stage switch implementation
in Fig. 3.8(a).
Without loss of generality, assume the traffic matrix {λ_ij} is obtained. Then a
delay matrix {d_ij} can be constructed, where entry d_ij corresponds to connecting virtual
output port j-1 to output port i-1, and its value is the traffic-weighted average
middle-stage packet delay of all the N flows destined to output port i-1 under this
connection. From Chapter 2, each of the N flows destined to an output port experiences a distinct
middle-stage delay, ranging from 1 to N slots. For the 4×4 traffic matrix {λ_ij} in Fig.
3.9(a), the corresponding delay matrix {d_ij} is found and shown in Fig. 3.9(b). As an
example,
d_34 = 4λ_13 + λ_23 + 2λ_33 + 3λ_43 = 0.8 + 0.1 + 0.2 + 1.2 = 2.3 slots.
Fig. 3.8 An example of using the three-stage switch.
(a) Traffic matrix {λ_ij}:
0.3 0.2 0.2 0.1
0.1 0.2 0.1 0.4
0.2 0.5 0.1 0.1
0.2 0.1 0.4 0.3

(b) Delay matrix {d_ij}:
1.9 1.9 1.9 2.3
2.1 3.1 2.5 2.3
1.9 1.5 2.3 2.3
2.6 2.1 2.4 1.9
Fig. 3.9 Traffic matrix and delay matrix.
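The construction of the delay matrix can be sketched as follows. The per-flow middle-stage delay table W below is an assumption: the thesis does not print it, so it is reconstructed here to be consistent with the d_34 example above (flow(m,3) has delays 4, 1, 2, 3 for m = 0..3) and with the requirement that each input and each virtual output sees every delay 1..N exactly once:

```python
# Sketch: build the delay matrix {d_ij} of Fig. 3.9(b) from the traffic matrix of
# Fig. 3.9(a). W[m][k] is the (assumed) fixed middle-stage delay of flow(m,k).
N = 4
lam = [[0.3, 0.2, 0.2, 0.1],
       [0.1, 0.2, 0.1, 0.4],
       [0.2, 0.5, 0.1, 0.1],
       [0.2, 0.1, 0.4, 0.3]]
W = [[3, 2, 1, 4],
     [4, 3, 2, 1],
     [1, 4, 3, 2],
     [2, 1, 4, 3]]

# d[i][j]: traffic-weighted delay if virtual output j is mapped to output i
# (0-indexed, i.e. d[i][j] corresponds to the thesis's d_{i+1,j+1}).
d = [[round(sum(lam[m][i] * W[m][j] for m in range(N)), 6)
      for j in range(N)] for i in range(N)]
```

With this W, entry d[2][3] reproduces the worked example d_34 = 2.3, and the full matrix matches Fig. 3.9(b).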
Definition 3: A set of entries of a matrix are independent if none of them
occupies the same row or column.
A legitimate configuration in the third stage switch fabric must correspond to
an independent set. In Fig. 3.9(b), [d 11, d 22,d 33, d 44] is an independent set with virtual
output i mapping to output i. In this case, the three-stage switch degenerates into the
two-stage one. The average middle-stage packet delay experienced by all N^2 flows in the
two-stage switch is thus
d_11 + d_22 + d_33 + d_44 = 1.9 + 3.1 + 2.3 + 1.9 = 9.2 slots.
Minimizing the overall average middle-stage packet delay becomes finding
an independent set from the delay matrix such that the sum of all entries in the set is
minimized. Optimal algorithms with polynomial running time exist [39, 40], with a
time complexity of O(N^3). This is acceptable as the configuration of the third switch
fabric is not changed on a per-slot basis. For completeness, the algorithm is summarized below:
Delay matrix {d_ij}:
1.9 1.9 1.9 2.3
2.1 3.1 2.5 2.3
1.9 1.5 2.3 2.3
2.6 2.1 2.4 1.9

After Step 1 (row/column reduction):
0    0    0    0.4
0    1.0  0.4  0.2
0.4  0    0.8  0.8
0.7  0.2  0.5  0

After Step 2 (starring zeros):
0*   0    0    0.4
0    1.0  0.4  0.2
0.4  0*   0.8  0.8
0.7  0.2  0.5  0*

After Steps 3-5 (priming and augmenting):
0    0    0*   0.4
0*   1.0  0.4  0.2
0.4  0*   0.8  0.8
0.7  0.2  0.5  0*
Fig. 3.10 An example of identifying the minimum independent set.
For a given matrix, it finds the independent set with the minimum weight [39,40].
1. Subtract from each row of the matrix its smallest element; then subtract from
each column its smallest element.
2. Find a zero element, Z. If there is no starred zero in its row or its column,
mark Z with a star. Repeat for each zero of the matrix. Go to Step 3.
3. Cover every column containing starred 0 with a line. If all columns are
covered, the starred zeros form the desired independent set; Exit. Otherwise,
go to Step 4.
4. Choose an uncovered zero and mark it with a prime. If there is no starred zero
Z in this row, go to Step 5. If there is a starred zero Z in this row, cover this
row with a line and uncover the column of Z . Repeat until all zeros are
covered. Go to Step 6.
5. There is a sequence of alternating starred and primed zeros constructed as
follows: let Z 0 denote the uncovered 0'. Let Z 1 denote the 0* in Z 0's column (if
any). Let Z 2 denote the 0' in Z 1's row. Continue in a similar way until the
sequence stops at a 0', Z 2k , which has no 0* in its column. Unstar each starred
zero of the sequence, and star each primed zero of the sequence. Erase all
primes and uncover every line. Return to Step 3.
6. Let h denote the smallest uncovered element of the matrix; it will be positive.
Add h to each covered row; then subtract h from each uncovered column.
Return to Step 4 without altering any asterisks, primes, or covered lines.
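For small N, the same minimum-weight independent set can also be found by exhaustively checking all N! assignments, which makes the result above easy to verify. A sketch (a real implementation would use the O(N^3) algorithm summarized above; `min_assignment` is a hypothetical helper name):

```python
# Sketch: exhaustive minimum-weight assignment on the delay matrix of Fig. 3.9(b).
from itertools import permutations

d = [[1.9, 1.9, 1.9, 2.3],
     [2.1, 3.1, 2.5, 2.3],
     [1.9, 1.5, 2.3, 2.3],
     [2.6, 2.1, 2.4, 1.9]]

def min_assignment(cost):
    """Return (perm, total): perm[row] = chosen column of the minimum independent set."""
    n = len(cost)
    best_perm, best_sum = None, float("inf")
    for perm in permutations(range(n)):          # all N! independent sets
        s = sum(cost[r][perm[r]] for r in range(n))
        if s < best_sum:
            best_perm, best_sum = perm, s
    return best_perm, round(best_sum, 6)

perm, total = min_assignment(d)
```

The search recovers the independent set [d_13, d_21, d_32, d_44] with total weight 7.4 slots, agreeing with the stepwise Fig. 3.10 derivation.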
For the delay matrix in Fig. 3.9(b), we can find a minimum independent set
[d_13, d_21, d_32, d_44]. Fig. 3.10 shows the detailed steps and Fig. 3.11 shows the resulting
third stage configuration. The minimum average middle-stage packet delay is
d_13 + d_21 + d_32 + d_44 = 1.9 + 2.1 + 1.5 + 1.9 = 7.4 slots.
This gives a 19.6% reduction in middle-stage delay compared with the two-stage
switch counterpart.
While changing the third-stage configuration, attention should be paid to the
in-flight packets buffered at middle-stage ports. Their destinations are based on the
old mapping rendered by the old third-stage configuration. As such, we have to
suspend the inputs from sending packets to middle-stage ports for N slots; otherwise,
packets based on different mappings will coexist at middle-stage ports. During this
suspension period, the buffered middle-stage packets can be properly cleared and the
new configuration for the third switch fabric will be enforced immediately afterwards.
We call this N-slot suspension period the reconfiguration penalty.
Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b).
3.3.2 Traffic Matrix Estimation
Traffic matrix estimation among all the nodes in a network is generally
difficult. Fortunately, here we only need to find the traffic matrix at a single node, i.e.
between the N switch inputs and the N switch outputs. In this section, a simple traffic
matrix estimation algorithm is presented. In particular, a packet counter Qi,j is
associated with each of the N^2 flows/VOQ1(i,j)'s. At the beginning of each sampling
interval of T time slots, Qi,j is initialized to 0 and is increased by one for every
subsequent packet arrival. Let λ ij be the estimated traffic rate/load for flow(i, j). λ ij is
updated every T slots using the following exponentially weighted moving averaging
function:
λij = 0.875·λ′ij + 0.125·(Qi,j / T)
where λ′ij is the previous estimate and the weighting on the current sample is set to
0.125 (a value deemed suitable by simulations).
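As a minimal sketch (the matrix layout and function name are our own, not from the thesis), the per-interval update can be written as:

```python
ALPHA = 0.125  # weight on the current sample, as assumed in the text

def update_estimates(rate, counts, T):
    """One sampling-interval update of the per-flow rate estimates:
    rate[i][j] <- 0.875 * rate[i][j] + 0.125 * counts[i][j] / T,
    after which the packet counters Q[i][j] are reset to zero."""
    n = len(rate)
    for i in range(n):
        for j in range(n):
            rate[i][j] = (1 - ALPHA) * rate[i][j] + ALPHA * counts[i][j] / T
            counts[i][j] = 0  # restart the counter for the next interval
    return rate

# Example: a single flow that sent 50 packets in a T=100-slot interval,
# starting from a previous estimate of 0.2 packets/slot:
rate = update_estimates([[0.2]], [[50]], T=100)
# 0.875*0.2 + 0.125*0.5 = 0.2375 packets/slot
```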
We also introduce another criterion for suppressing unnecessary updates of
the third-stage configuration, so as to minimize the reconfiguration penalty.
Specifically, when a new input load λ ij is obtained, we check if the load change is
significant enough by
λij / λ′ij ∈ [0.9, 1.1]    (3.2)
If all flows satisfy (3.2), the existing third-stage configuration remains. Otherwise, a
new third-stage configuration is determined based on the updated traffic matrix. The
fluctuation range in (3.2) can also be tuned to balance the reconfiguration penalty
and the possible delay performance gain.
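The suppression test can be sketched as follows; the handling of a previously idle flow (λ′ij = 0) is our own assumption, since the text does not specify it:

```python
def needs_reconfiguration(new_rate, old_rate, lo=0.9, hi=1.1, eps=1e-9):
    """Return True if condition (3.2) fails for some flow, i.e. some
    load ratio new/old falls outside [0.9, 1.1]; only then is a new
    third-stage configuration computed."""
    n = len(new_rate)
    for i in range(n):
        for j in range(n):
            old = old_rate[i][j]
            if old < eps:                 # previously idle flow (assumption)
                if new_rate[i][j] > eps:  # ... that has become active
                    return True
                continue
            if not (lo <= new_rate[i][j] / old <= hi):
                return True
    return False
```

For example, a flow whose load rises from 0.4 to 0.5 packets/slot (ratio 1.25) triggers a reconfiguration, while a rise to 0.42 (ratio 1.05) does not.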
Finally, the three-stage switch architecture above is resilient to errors in
estimating the traffic matrix. This is because the close to 100% throughput is
guaranteed by the joint sequence used in the first two switch fabrics, whereas the
third fabric is purely for cutting down the delay. Therefore, adding the third stage
fabric has no negative impact on switch throughput, packet order, or the
middle-stage VOQ occupancy feedback mechanism.
3.3.3 Performance Evaluations
In Chapter 2, the unbeatable delay-throughput performance of the feedback-
based two-stage switch architecture has been well-demonstrated under various traffic
conditions. In this section, we only focus on the improvement of the three-stage
switch over the original feedback switch.
We first study the performance under hot-spot traffic model. For input port i,
a packet goes to the hot-spot output (i+x) mod N with probability 1/2, and to each other
output with probability 1/[2(N-1)]. The hot-spot can be changed by varying x. This
traffic model is chosen because the overall traffic pattern remains admissible while
increasing input load p from 0 to 1, or varying x. Without loss of generality, the joint
sequence shown in Fig. 3.2(a) is assumed.
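A simulation helper for this arrival process might look as follows (the modulo-N wrap of (i+x) is implied by the text; the function name is ours):

```python
import random

def hotspot_destination(i, x, N, rng=random):
    """Draw a destination for a packet arriving at input i: the hot-spot
    output (i+x) mod N with probability 1/2, and each of the remaining
    N-1 outputs with probability 1/(2(N-1))."""
    hot = (i + x) % N
    if rng.random() < 0.5:
        return hot
    # otherwise pick uniformly among the non-hot-spot outputs
    return rng.choice([k for k in range(N) if k != hot])
```

Sampling many destinations for a fixed input should show roughly half of them landing on the hot-spot output, with the remainder spread evenly.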
Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch.
In the hot-spot traffic model, heavy flow can be easily and correctly identified
by our proposed traffic estimation algorithm. As such, Fig. 3.12 only shows the
delay-throughput performance (of a switch with size N =32) against input load. The
y-axis is the overall average switch delay, which combines both input delay and
middle-stage delay. With two-stage switch architecture, varying hot-spot x results in
different delay-throughput performances. From Fig. 3.12, we can see that when the
hot-spot is at output (i+30), the lowest/best delay is obtained because the hot-spot
flow is assigned to experience 1-slot middle-stage delay. When the hot-spot is at
output (i+31), the highest/poorest delay is obtained because hot-spot flow is assigned
to experience the largest 32-slot middle-stage delay.
With our three-stage switch architecture, we can always map the hot-spot
flow to experience the lowest 1-slot middle-stage delay by properly configuring the
third switch fabric. That means no matter what the value of x is, the overall delay-
throughput performance rendered by our three-stage architecture is always the same
as the case of the hot-spot at output (i+30). This cuts the delay by as much as 15
time slots, giving a 60.7% delay improvement at p=0.6 and 43.4% at p=0.95.
Fig. 3.13 Delay vs. number of sampling intervals, with 3-stage switch.
Fig. 3.13 shows the delay versus time, or the number of sampling intervals,
where each sampling interval is T = 10^5 slots. The initial traffic pattern/matrix changes
twice during the simulation, at the 40-th and the 70-th sampling intervals,
respectively. Each change is represented by a randomly generated traffic matrix.
(Each matrix entry is uniformly distributed between 0 and 1, and the whole matrix is
regulated to be admissible.) From Fig. 3.13, we can see that our traffic estimation
algorithm is quite effective in adapting to the changes in traffic pattern, and the
overall improvement of three-stage switch, as compared with the original two-stage
switch, is about 8%.
3.4 Chapter Summary
In this chapter, we improved the delay performance of the feedback-based two-stage
switch by assigning heavy flows to experience smaller middle-stage delays. We
followed two approaches. First, for a given traffic matrix, we can find an optimal
joint sequence that minimizes the average middle-stage delay. In the second
approach, we extended the feedback-based two-stage switch architecture to three
stages, whereby the third switch fabric dynamically maps heavy flows to experience
smaller middle-stage port delays.
Chapter 4
Cutting Down Communication Overhead
4.1 Introduction
The occupancy vector in our feedback-based two-stage switch requires N bits.
When the switch size N is large, the N -bit occupancy vector may become a
bottleneck. For example, with a 1024×1024 switch carrying 128-byte packets, the
(second) switch fabric must operate at a speedup of two for carrying the extra 1024
bits of occupancy vector.
In this chapter, we focus on cutting down the communication overhead. The
size of an occupancy vector can be reduced by only reporting the status of selected
middle-stage VOQs. To identify VOQs of interest, we first partition the N VOQs
into u non-overlapped sets, each identified by a set number. In each time slot, every
input port piggybacks its set numbers of interest to the connected middle-stage port.
This “guides” each middle-stage port to only report the status of selected VOQs.
The rest of this chapter is organized as follows. In the next section, by
exploiting the feedback path in the first-stage, a set of efficient feedback suppression
algorithms are designed. In Section 4.3, we compare all the proposed algorithms by
simulations. Finally, we conclude the chapter in Section 4.4.
4.2 Feedback Suppression Algorithms
Firstly, we partition the N VOQs at each port, either input or middle-stage,
into u non-overlapped sets, denoted by G1, G2, …, Gu. Without loss of generality,
assume g = N/u is an integer. Then each set Gm (m = 1, 2, …, u) contains g queues.
Specifically, at input k ,
Gm={VOQ1(k ,(m-1) g +1),VOQ1(k ,(m-1) g +2),…, VOQ1(k , mg )}.
At middle-stage port j,
Gm={VOQ2( j,(m-1) g +1), VOQ2( j,(m-1) g +2),…,VOQ2( j, mg )}.
To cut down the communication overhead, the size of an occupancy vector can be
reduced by only reporting the status of selected Gm.
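The partition can be sketched as below (0-based queue indices for convenience, whereas the text numbers queues from 1):

```python
def partition_sets(N, u):
    """Partition queue indices 0..N-1 into u non-overlapping sets of
    g = N/u consecutive queues each; set numbers are 1-based as in the
    text, queue indices are 0-based."""
    assert N % u == 0, "g = N/u must be an integer"
    g = N // u
    return {m: list(range((m - 1) * g, m * g)) for m in range(1, u + 1)}

print(partition_sets(8, 2))  # -> {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}
```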
To maximize switch performance, longer queues should be given more
chances to send packets. With full N -bit occupancy vector, the LQF scheduling
provides the best performance by always selecting the longest queue from all the N
VOQ1(i, j)’s at each input port. If we can select Gm based on where the longest queue
resides, the performance would not drop. We propose to construct another feedback
mechanism for an input to piggyback its set numbers of interest to the connected
middle-stage port. We can make use of the otherwise wasted bandwidth in the first
stage switch for this purpose, as shown in Fig. 4.1. (Note that the speedup required
for carrying feedback in the second stage switch is also applied to the first stage
switch.) But unlike the feedback mechanism in the second stage (for middle-stage
ports to inform outputs/inputs), the (identity of) longest queue received from an input
i by middle port j at slot t can only be used N slots later, i.e. next time middle port j is
connected to output i. Since packets arrive and depart in every slot, the longest queue
identified N slots ago may not be the current longest queue – this is the price we must
pay. Nevertheless, for highly skewed non-uniform traffic patterns, the history data
usually serves as a good estimate.
Fig. 4.1 Timing diagram of feedback switch with feedback suppression.
With the above feedback mechanism in the first stage switch, three packet
scheduling algorithms are designed.
4.2.1 Set-Based Feedback (Set-feedback)
Let VOQ1(i,F ) denote the longest queue at input i at time t . If F ∈Gm, then
the value of m is stored at input i and piggybacked (using log u bits) on the packet
sent to the connected middle-stage port j. Port j stores the value of m and when it is
connected to output i at time t + N -1, j sends a g -bit vector, corresponding to the
occupancy of the g queues in set Gm. Input/output i knows which set the g -bit
occupancy vector refers to, based on the stored value of m at time t. At slot t + N , input
i selects a packet to send from the longest available queue in {VOQ1(i,(m-1) g +1),
VOQ1(i,(m-1) g +2), …, VOQ1(i,mg )}. “Available” means the corresponding
VOQ2(j,k) is empty and VOQ1(i,k) is not. In doing so, the likelihood that the selected
packet comes from the longest queue among all the N VOQ1s at input i is increased.
The feedback bits required in the first and second stages are log u and g bits
respectively.
An example: Consider a 4×4 (N=4) feedback-based switch. At each
input/middle-stage port, VOQs are partitioned into u=2 non-overlapped sets,
denoted by G1 and G2. Then at input 1, set G1 contains {VOQ1(1,0), VOQ1(1,1)} and
G2 contains {VOQ1(1,2), VOQ1(1,3)}. Assume VOQ1(1,3) is the longest queue at
input 1 at time slot 0. Since VOQ1(1,3)∈G2, the value 2 (the identity of G2) is stored
at input 1 and piggybacked using 1 (= log u) bit on the packet sent to the connected
middle-stage port 0. Middle port 0 stores the value of 2 and when it is connected to
output 1 at time slot 3, middle port 0 sends a 2-bit ( g =2) vector, corresponding to the
occupancy of {VOQ2(0,2), VOQ2(0,3)} in set G2. Input/output 1 knows which set the
2-bit occupancy vector refers to, based on the value 2 stored at time slot 0. At slot 4,
input 1 selects a packet to send from the longest available queue in
{VOQ1(1,2),VOQ1(1,3)}.
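The input-side selection of Set-feedback can be sketched as follows (0-based indices and function names are our own):

```python
def set_number(F, g):
    """Return the (1-based) set number m of the set Gm containing
    queue index F (0-based)."""
    return F // g + 1

def select_packet(voq_len, m, occ, g):
    """Pick the longest *available* queue in set Gm at an input port.
    voq_len: lengths of the N input VOQ1s; occ: g-bit occupancy of the
    corresponding middle-stage VOQ2s (True = occupied). A queue is
    available if its VOQ2 is empty and its VOQ1 is non-empty."""
    base = (m - 1) * g
    candidates = [base + k for k in range(g)
                  if not occ[k] and voq_len[base + k] > 0]
    return max(candidates, key=lambda q: voq_len[q]) if candidates else None
```

For the example above (N=4, g=2): `set_number(3, 2)` returns 2, and with input VOQ lengths `[0, 0, 3, 5]` and both middle-stage queues in G2 empty, `select_packet` picks queue 3.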
4.2.2 Queue-Based Feedback Version 1 (Q-feedback-1)
Let VOQ1(i,F ) denote the longest queue at input i at time t . Unlike Set-
feedback , the value of F is stored at input i and piggybacked (using log N bits) on the
packet sent to middle-stage port j. Port j stores the value of F . When it is connected
to output i at slot t + N -1, j sends a b-bit occupancy vector, containing the occupancy
of b queues from VOQ2(j,F) to VOQ2(j,F+b-1) (wrapped around modulo N). Input/output
i knows which queues the b-bit occupancy vector refers to, based on the value of F
stored at time t. At slot t + N , input i selects a packet to send from the longest available
queue in {VOQ1(i, F ), VOQ1(i, F +1), …, VOQ1(i, F +b-1)}. The feedback bits required
in the first and second stages are log N and b bits respectively. (Note that b= g is not
necessary.)
An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is
the longest queue at input 1 at time slot 0. Then the value of 3 is stored at input 1 and
piggybacked using 2 (log N ) bits on the packet sent to the connected middle-stage
port 0. Middle port 0 stores the value of 3 (the identity of VOQ1(1,3)) and when it is
connected to output 1 at time slot 3, middle port 0 sends a 3-bit (b=3) vector,
corresponding to the occupancy of {VOQ2(0,3), VOQ2(0,0), VOQ2(0,1)}.
Input/output 1 knows which queues the 3-bit occupancy vector refers to, based on the stored value
of 3 at time slot 0. At slot 4, input 1 selects a packet to send from the longest
available queue in {VOQ1(1,3),VOQ1(1,0), VOQ1(1,1)}.
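The wrap-around window of queues covered by the b-bit vector can be written as a one-liner; with F=3, b=3, N=4 it covers queues 3, 0, 1 as in the example above:

```python
def feedback_window(F, b, N):
    """Queue indices covered by the b-bit occupancy vector:
    F, F+1, ..., F+b-1, wrapped around modulo N."""
    return [(F + k) % N for k in range(b)]

print(feedback_window(3, 3, 4))  # -> [3, 0, 1]
```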
4.2.3 Queue-Based Feedback Version 2 (Q-feedback-2)
This algorithm is the same as Q-feedback-1 except that the second stage
feedback is generated as follows. When middle-stage port j is connected to output i at
slot t + N -1, we randomly select an empty queue VOQ2( j,z ). Middle-stage port j then
sends a (1+log N)-bit occupancy vector, with the first bit indicating the occupancy of
VOQ2(j,F) and the following log N bits carrying the value z. At slot t+N, input i selects
a packet from the longest available queue in {VOQ1(i,F), VOQ1(i,z)}. The feedback
bits required in the first and second stages are log N and 1+log N bits respectively.
An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is
the longest queue at input 1 at time slot 0. Then the value of 3 is stored at input 1 and
piggybacked using 2 (log N ) bits on the packet sent to the connected middle-stage
port 0. Middle port 0 stores the value of 3 (the identity of VOQ1(1,3)) and when it is
connected to output 1 at time slot 3, it randomly selects an empty queue, say
VOQ2(0,2). Middle port 0 then sends a 3-bit (= 1+log N) occupancy vector, with the
first bit indicating the occupancy of VOQ2(0,3) and the following 2 (= log N) bits
carrying the value 2 (the identity of VOQ2(0,2)). At slot 4, input 1 selects a
packet to send from the longest available queue in {VOQ1(1,3),VOQ1(1,2)}.
Note that the three algorithms above can all be extended to carry the feedback
of the top-C longest queues (instead of the longest queue only). In Set-feedback , this
requires C ·log u bits in the first stage (for identifying up to C sets of Gm that contain
the top-C longest queues), and C · g bits in the second stage. Similarly, in Q-feedback-
1, we need C ·log N bits in the first stage and C ·b bits in the second stage. For Q-
feedback-2, we need C ·log N bits and C ·(1+ log N ) bits, respectively.
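The per-scheme bit counts can be tabulated by a small helper (a sketch; log denotes log2, and N and u are assumed to be powers of two). It reproduces the parameter settings used later in Section 4.3:

```python
import math

def feedback_bits(scheme, N, u=None, b=None, C=1):
    """(first-stage, second-stage) feedback bits per time slot for the
    three schemes, carrying the top-C longest queues."""
    if scheme == "set":
        return C * int(math.log2(u)), C * (N // u)   # C*log u, C*g
    if scheme == "q1":
        return C * int(math.log2(N)), C * b          # C*log N, C*b
    if scheme == "q2":
        return C * int(math.log2(N)), C * (1 + int(math.log2(N)))
    raise ValueError(scheme)

# Parameter settings of Section 4.3 (N = 32, 12-bit second-stage target):
print(feedback_bits("set", 32, u=8, C=3))  # -> (9, 12)
print(feedback_bits("q1", 32, b=6, C=2))   # -> (10, 12)
print(feedback_bits("q2", 32, C=2))        # -> (10, 12)
```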
4.3 Performance Evaluations
In this section, the delay-throughput performance of the proposed scheduling
algorithms is studied by simulations. Without loss of generality, a switch with size
N=32 is assumed unless otherwise specified. Scheduling algorithms with full
feedback (in the second-stage switch) require 32 bits. With our proposed feedback
schemes, we target using only 12 bits (roughly 1/3). The detailed parameter
settings are as follows:
For Set-feedback, in order to form a 12-bit feedback, we partition the 32 VOQs
into u=8 sets, each having g=4 elements. Feedback of the Top-3 longest
queues is used, i.e. C =3. The feedback bits required in the first stage and
second stage are 9 bits and 12 bits respectively.
For Q-feedback-1, we set b to 6 and C to 2. The feedback bits required in the
first stage and second stage become 10 bits and 12 bits respectively, which
are comparable to that of Set-feedback .
For Q-feedback-2, we set C to 2. The feedback bits required in the first stage
and second stage are 10 bits and 12 bits respectively.
For comparison, we also implement a) original feedback-switch with full N -
bit feedback (LQF algorithm); b) iSLIP algorithm [15] (with a single iteration),
which serves as a benchmark for single-stage input-queued switches; and c) output-
queued switch, which serves as a performance lower bound. In simulations, we use
the same traffic models as Chapter 2.4, i.e. the uniform, uniform bursty and hot-spot
traffic.
4.3.1 Performance under Uniform Traffic
Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback.
Fig. 4.2 compares the delay performance of the six schemes under uniform
traffic. We can see that the delay gap between full-feedback and our proposed Set-
feedback, Q-feedback-1 and Q-feedback-2 increases with the input load. At p=0.1, they give
almost the same delay performance. At p=0.8, the delay gap grows to about 20 slots. But
when compared with iSLIP, our proposed schemes require 40+ fewer slots, yielding a
55% cut in delay. Among the three proposed schemes, Set-feedback generally
outperforms the other two. With the fixed number of bits for conveying feedback
occupancy, Set-feedback can convey the Top-3 longest queues instead of Top-2 (as in
Q-feedback), which identifies the longest VOQ with higher accuracy.
4.3.2 Performance under Uniform Bursty Traffic
From Fig. 4.3, the delay performance under bursty traffic, we can see that
Set-feedback gives the best performance (lowest delay), followed by Q-feedback-1
and Q-feedback-2. In general, iSLIP has smaller delay for low input load
( p≤0.5). At p=0.6, the delay is 183 slots for iSLIP and 92 slots for Set-feedback ,
yielding a 50% cut in delay. Compared with full-feedback , delay is increased from 70
slots to 92 at p=0.6, which represents the price paid for minimizing feedback bits.
Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback.
4.3.3 Performance under Hotspot Traffic
From Fig. 4.4, the delay performance under hot-spot traffic, again we can see
that Set-feedback , Q-feedback-1 and Q-feedback-2 give comparable performance.
Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback.
4.3.4 Performance under Different Switch Size N
Based on the above simulation results, Set-feedback gives the best
performance among our proposed algorithms. In the following, we focus on the
performance of Set-feedback under different traffic patterns with different switch
sizes N . Note that we still limit the feedback for Set-feedback to 12 bits, regardless of
the switch size. Specifically, when N is 64, we set u=16 (so g =64/16=4) and C =3.
The feedback bits required in the first and second stages both become 12 bits.
When N =128, we set u=32 (so g =4) and C =3. The feedback bits required in the first
stage and second stage are 15 and 12 bits respectively.
From Fig. 4.5, we can see that when N =128, 12-bit Set-feedback yields 94.5%
throughput under uniform traffic. In other words, Set-feedback trades just 5.5% of
throughput for an 88.3% saving in communication overhead.
Fig. 4.5 Throughput vs. switch size N , with partial feedback.
4.4 Chapter Summary
In this chapter, we focused on cutting down the communication overhead in
feedback-based two-stage switch. The size of an occupancy vector, which is sent by
a middle-stage port to an output port in every time slot, is reduced by only reporting the
status of selected middle-stage VOQs. To identify VOQs of interest, we first
partitioned the N VOQs into u non-overlapped sets, each being identified by a set
number. In each time slot, every input port piggybacks its set numbers of interest to
the connected middle-stage port. This guides a middle-stage port to only report the
status of the VOQs of interest. Extensive simulation results showed that our proposed
feedback suppression algorithms are very efficient.
Chapter 5
Supporting Multicast Traffic
5.1 Introduction
The migration of broadcasting and multicasting services, such as cable TV
and multimedia-on-demand, to packet-oriented networks will play a dominant role in
the near future. These highly popular applications have the potential of loading up
the Internet. To keep up with the bandwidth demand of such applications, the next
generation of packet switches/routers needs to provide efficient multicast switching
and packet replication.
When a multicast packet arrives at a switch, the set of output ports the packet
is destined for, i.e. the packet's fan-out set, is retrieved from the local forwarding table
(like IP multicast). The cardinality of the fan-out set, i.e. its fan-out, denotes the
number of copies into which the packet should be cloned. Packets arriving at the same input
port and destined for the same fan-out set belong to the same multicast flow. The
total number of possible multicast (and unicast) flows at an input port is 2^N − 1. An
admissible multicast traffic pattern requires no over-subscribed input and output
ports. That means the packet arrival rate at each input port should be less than or
equal to its capacity, or 1 packet/slot. Similarly, the aggregated packet arrival rate at
each output port (after packet duplication) must also be smaller than or equal to 1
packet/slot. A multicast switch aims at providing 100% throughput for any
admissible multicast traffic pattern with minimum possible packet delay.
For the sake of scalability, multicast switches are mainly designed based on
input-queued switch architecture, where a centralized scheduler is responsible for
scheduling. Switch fabrics used can be bufferless [41-45] or buffered [46-49]. For
multicast switches based on bufferless switch fabrics [41-45], in-switch multicast
capability (i.e. in-switch packet duplication and forwarding) is usually assumed,
where an input port can send a (multicast) packet to multiple output ports in a single
time slot. Such multicast fabrics are more expensive than their unicast counterparts.
Besides, the centralized scheduling algorithms are usually derived from their unicast
counterparts. Note that even for (simpler) unicast switches, a major bottleneck is the
implementation of the centralized scheduler.
For multicast switches with buffered switch fabrics, they mainly adopt the
buffered crossbar [18-20] as their switch fabrics. Recall that for the buffered crossbar
switch introduced in Chapter 1, even though the scheduler is simpler, its switch
fabric needs to realize N^N switch configurations, the same complexity as an
output-queued switch fabric.
In short, two limiting factors for high-speed multicast switch design are the
switch fabric complexity and the need for a sophisticated centralized scheduler. In
this chapter, we show that feedback-based two-stage switch can support multicast
traffic efficiently by slightly modifying its original operations. It elegantly
overcomes these two major obstacles. Specifically, it does not require a centralized
scheduler, and relies on a unicast switch fabric (realizing only N switch
configurations) to carry both unicast and multicast traffic.
The rest of the chapter is organized as follows. In the next section, we review
some related work on multicast switch design. The feedback-based two-stage switch
is modified to support multicast traffic in Section 5.3 and simulation results are
presented in Section 5.4. We conclude the chapter in Section 5.5.
5.2 Related Work
5.2.1 Multicast Switches Based on Bufferless Switch Fabrics
Multicast switches based on bufferless switch fabrics [41-45] usually assume
in-fabric multicast capability (i.e. in-fabric packet duplication and forwarding), and
require a rather sophisticated centralized scheduler. In [41], each switch input port
maintains N +1 virtual queues, N for unicast and one for multicast. Priority is given to
schedule multicast traffic. If there are still idle inputs/outputs after scheduling
multicast packets, unicast packets are considered to increase switch utilization.
Although a multicast packet can be "split" and sent over multiple time slots, multicast
traffic suffers from severe head-of-line (HOL) blocking due to the single multicast
queue.
In [42], the number of multicast queues is increased to m to reduce HOL
blocking. When a multicast packet arrives, it selects a multicast queue to join in order
to balance the loading among different multicast queues. But packets assigned to
different queues generally have overlapped fan-out sets. Priority is given to schedule
a unicast packet first or a multicast packet first depending on the service ratio
between the two classes. An iterative algorithm is also adopted to maximize the
throughput in each time slot.
In [43], packet splitting is allowed to further cut down the HOL blocking.
Specifically, each input maintains k unicast/multicast shared queues, one for each
non-overlapped set of outputs. When a multicast packet arrives and if its fan-out set
intersects with the output sets of multiple queues, packet-splitting "breaks" the
original packet into "smaller" ones, each with a modified fan-out set that lies
entirely within the output set of the queue it joins. An iterative algorithm is then
used to maximize the switch throughput. Simulation results show that high
throughput can only be achieved with a large number of iterations. But a large
number of iterations is not suitable for high-speed implementation.
In [44], the number of unicast/multicast shared pointer queues increases to
k = N , one for each output port (like the classic VOQs for unicast traffic). When a
multicast/unicast packet arrives, it is time-stamped and stored in a shared memory.
Then its memory address (i.e. a pointer) is stored in all pointer queues that overlap
with the packet’s fan-out set. An iterative scheduling algorithm based on the
timestamps of buffered packets is designed for maximizing throughput. The major
problem with this approach, again, is its high communication overheads.
In [45], dynamic queuing policies are studied, where packet splitting upon
arrival is not allowed. The switch needs to identify active flows and then assign them
to different shared multicast queues based on the current switch load.
5.2.2 Buffered Crossbar Based Multicast Switches
Buffered crossbar switch architecture [18-20] is touted for its technology
feasibility and simpler scheduler. However, the buffered crossbar switch is not
scalable due to its 2N separate schedulers, N^2 in-fabric crosspoint buffers, and the
need for N^N switch configurations. The buffered crossbar has also been extended to
support multicast traffic [46-49]. MURS [46] gives priority to schedule unicast and
multicast traffic in a round robin fashion. Specifically, if unicast gets priority in time
slot t , unicast traffic will be scheduled first. If there are still idle outputs after
scheduling unicast traffic, multicast traffic is considered. Then in slot t +1, multicast
traffic gets the scheduling priority.
To reduce the hardware cost, I-SMCB (Input-based Shared Memory
Crosspoint Buffer [47]) and O-SMCB (Output-based Shared Memory Crosspoint
Buffer [48]) aim at cutting down the number of crosspoint buffers from N^2 to N^2/2. The key idea
is to share one crosspoint buffer by two adjacent input ports [47] or two adjacent
output ports [48]. But such a hardware cost reduction is offset by its throughput
degradation. In [49], the theoretical relationship between throughput performance
and crosspoint buffer size is studied under a special multicast traffic pattern. It is
concluded that to avoid throughput degradation, the amount of buffer to be deployed
at every crosspoint must scale logarithmically with the switch size N .
5.3 Multicast Scheduling in Feedback-Based Two-Stage Switch
5.3.1 Multicast Scheduling
We extend the feedback-based two-stage switch (Fig. 2.1) to support
multicast traffic. At each input port, in addition to the N unicast VOQ1(i,k )’s, we add
another m shared queues for multicast. We adopt a simple queuing policy that divides
the outputs into m equal and non-overlapped sets (assuming N /m is an integer),
where set x (1≤ x≤m) contains outputs {( x-1) N /m, ( x-1) N /m+1,…, x·N /m-1}. Packet
splitting is used to “split” multicast packets to join different queues. So when a
multicast packet arrives and if its fan-out set intersects with the fan-out sets of
multiple queues, then the original packet is "split" into "smaller" ones, each with a
modified fan-out set that lies entirely within the output set of its target queue.
Note that the packet after splitting usually remains as a multicast packet but with a
smaller fan-out set. It is worth noting that when m=1, all multicast packets share the
same multicast queue; and when m= N , packet splitting converts all multicast packets
into unicast.
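The splitting rule can be sketched as follows (0-based output indices, matching the output-set definition above; the function name is ours):

```python
def split_fanout(fanout, N, m):
    """Split a multicast packet's fan-out set across the m shared queues.
    Queue x (1-based) owns outputs {(x-1)N/m, ..., xN/m - 1}; each piece
    keeps only the outputs owned by its target queue."""
    g = N // m                       # outputs per queue
    pieces = {}
    for out in sorted(fanout):
        x = out // g + 1             # target queue of this output
        pieces.setdefault(x, set()).add(out)
    return pieces

print(split_fanout({1, 5, 6}, N=8, m=2))  # -> {1: {1}, 2: {5, 6}}
```

With m=1 the whole fan-out set lands in the single queue, and with m=N every piece is a singleton, i.e. all multicast packets become unicast, as noted in the text.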
Without loss of generality, we assume the two stages of switch fabrics are
configured using the joint sequence of Fig. 2.2(a). In each time slot, based on the
received occupancy vector of middle-stage port k , input i selects a packet for sending
among its N+m local queues. Priority is given to schedule multicast traffic by
examining the m multicast queues first. Here we only consider giving the higher
priority to multicast traffic, as in general it is more time critical than unicast, but it
should be noted that our multicast scheduler can be revised to schedule unicast and
multicast packets depending on the service ratio (like [42]) or in a round robin
fashion (like [46]). Specifically, the HOL packet whose fan-out set has the largest
overlap with the set of empty queues at middle-port k is selected. (If no overlap, a
unicast packet is selected instead.) A copy of the selected packet is sent to the
middle-port together with an N -bit duplication vector , which identifies the overlap
between the empty VOQ2( j,k )’s and the packet fan-out set. Then, the fan-out set of
the selected multicast packet is updated to exclude those in the duplication vector. If
the updated fan-out set is empty, the selected multicast packet is removed from the
multicast queue. When a packet arrives at the middle-stage port, it will be cloned and
stored at the corresponding empty (unicast) VOQ2( j,k )’s based on the duplication
vector.
If there are no backlogged multicast packets or none of them can be selected
(due to zero-overlap between the empty VOQ2( j,k )’s and any multicast packet’s fan-
out set), we select a unicast packet for sending using the LQF scheduler. In this case,
the duplication vector is set to all 0’s. Note that the packet transmission in the
second-stage switch fabric is the same as in a unicast switch. Following the pre-
determined sequence of configurations, when middle-stage port j connects to output
k, the packet (if any) at VOQ2(j,k) is sent together with the occupancy vector of
middle-stage port j.
5.3.2 Discussions
In our proposed multicast scheduling algorithm, packet duplication takes
place at both input ports and middle-stage ports. Packet duplication at input ports
“breaks” a multicast packet into smaller ones. Since multicast packets in different
multicast queues have non-overlapped fan-out sets, both HOL blocking and output
contention can be eased. Besides, storing multicast packets at inputs reduces the
input port buffer requirement. Since both switch fabrics in the feedback switch are
unicast, a multicast packet traverses the first fabric as a unicast packet; a complicated
switch fabric with in-fabric duplication (as in [41-45]) is not required. When a split
multicast packet arrives at a middle-stage port, the second stage of packet duplication
occurs, converting all multicast packets into unicast packets for delivery by the second
switch fabric.
When there is only a single multicast queue (m = 1), all packet duplication is
carried out at middle-stage ports. Under light traffic, input port queue size can be
minimized. But for heavy traffic, the switch will experience severe HOL blocking
because a multicast packet will not be removed (from the only queue) until all its
copies are sent. With m > 1, packet splitting ensures that packet duplication occurs
partially at input ports and packets in different queues have non-overlapped
destinations. This reduces the HOL blocking. Let the switch size be N . When m= N ,
all packet duplication is carried out at input ports. In this case, there is no need for
“multicast” queues because they only store unicast packets. In other words, each
input port only needs to maintain N unicast queues. The HOL blocking is completely
eliminated, and the stability proof in Chapter 2 can also be applied to the multicast
feedback switch with m= N .
Unlike the feedback-based two-stage unicast switch, the load-balancing in the
first stage switch is based on multicast packets. Extensive simulation results show
that the final unicast traffic presented to the second stage switch is generally uniform.
This is attributed to the use of a single-packet-buffer per middle-stage VOQ2(j,k), and
the efficient feedback mechanism for reporting the middle-stage port occupancy. To
further increase the buffer utilization, we can use pointer queues [44] to separately
store a packet and its memory address. A multicast packet thus needs to be stored only
once at an input port, and an entry in VOQ1(i,k) contains only the memory address of
the packet. Likewise, this can be applied to buffers at middle-stage ports.
The proposed multicast scheduling algorithm inherits the in-order packet
delivery property from its unicast counterpart. This is because we can treat each
distributary of a multicast flow as a unicast flow. In Chapter 2, it has been shown that
packets belonging to the same unicast flow always experience the same middle-stage
port delay. Therefore, when they arrive at the output port, they will be in order. If
packets belonging to every distributary flow arrive at their respective outputs in order,
the corresponding multicast flow will not experience the packet mis-sequencing problem.
5.4 Performance Evaluations
To the best of our knowledge, our proposed multicast scheduling is the only
one that does not rely on a centralized scheduler, and its switch fabric only needs to
realize N switch configurations (instead of N !). To study its performance, we vary the
number of multicast queues (m) at each input port. In our simulations, we distinguish
between the overall average delay experienced by all copies (Tc) of a multicast packet
and the average delay experienced by the last copy (Tp) of each multicast packet. Tp
corresponds to the worst-case delay and provides some insight into the delay variation
among different copies of a multicast packet. For multicast packets with fan-out k,
Tc(k) and Tp(k) denote their average delay and average last-copy delay respectively;
these show the fairness in handling packets with different fan-outs. Although we only
present simulation results for a switch of size N=32 below, the same conclusions and
observations apply to other switch sizes.
5.4.1 Performance under Uniform Mixing Traffic
Fig. 5.1 Delay vs output load λ , with uniform mixing traffic
At every time slot for each input, a packet arrives with probability p (i.e.
input load is p). If a packet arrives, it has equal probability of being unicast or
multicast. If the packet is unicast, it is destined to each output with equal probability. If
the packet is multicast, its fan-out size k is randomly selected between [2, 32], and
the identity of each output in the fan-out set is also randomly selected from all output
ports. Fig. 5.1 shows the switch delay performance against switch output load λ ,
where
λ = p[0.5 + 0.5×(2+32)/2] = 9p. (5.1)
To ensure the traffic in our simulations is always admissible, we must have λ ≤ 1 (or
p ≤1/9).
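As a quick sanity check of (5.1), the helper below (our name, not the thesis's) computes the output load from the traffic mix: half of the arrivals are unicast with fan-out 1, and half are multicast with fan-out uniform on [2, N], i.e. a mean of (2+N)/2 = 17 for N = 32.

```python
def output_load(p, n=32):
    """Output load under uniform mixing traffic: with probability 0.5 a packet
    is unicast (fan-out 1), otherwise multicast with mean fan-out (2+n)/2."""
    return p * (0.5 * 1 + 0.5 * (2 + n) / 2)
```

With N = 32 each arriving packet generates 9 output packets on average, so the input load p = 1/9 saturates the outputs.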
Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7
From the delay-throughput performance in Fig. 5.1, we can see that for output
load λ < 0.85, m=1 and m=2 provide a lower average packet delay than m=32. At λ =
0.7, m=2 cuts down the overall average delay (Tc) by 58.8% and the average last-copy
delay (Tp) by 51%. When λ>0.85, m=32 (packet duplication at input ports only) yields a
lower delay because there is no HOL blocking, while the HOL blocking under m=1
(packet duplication at middle-stage ports only) intensifies with the traffic load. This also
explains why m=2 (packet duplication at both input and middle-stage ports) is better
than m=1.
Fig. 5.2 shows the delay performance against different fan-outs, while fixing
λ = 0.7. When m=2, we can see that Tc(k), the average delay for packets with fan-out
k, is the lowest, and remains almost constant at 20 slots as fan-out k increases. Even
Tp(k), the average last-copy delay for packets with fan-out k, increases rather slowly
with k. This shows that m=2 is fair in handling packets with different fan-outs. On
the contrary, with m=32, both Tc(k) and Tp(k) increase more rapidly with fan-out size.
5.4.2 Performance under Uniform Bursty Mixing Traffic
We use the same traffic generator except that bursty arrivals are modeled by
the ON/OFF traffic model of Chapter 2.4. In the ON state, a packet arrival is
generated in every time slot, and each arrival has equal probability of being unicast or
multicast. Simulation results in Figs. 5.3 and 5.4 are based on a burst size of sp = 30 packets.
Again, we can express the aggregated load at each output port by (5.1).
From Fig. 5.3, the performance gap between m=2 and m=32 is much wider
than that in Fig. 5.1. This is because bursty traffic causes more unevenly distributed
queue sizes in the input ports when m=32. With m=2, packet duplication mainly
occurs at middle-stage ports. In this case, both input port queue size and input port
delay are reduced. With m=1, packet duplication occurs only at middle-stage ports,
and the throughput suffers from severe HOL blocking. From Fig. 5.4, we can
again see that m=2 is fair in handling packets with different fan-outs. Although m=32
also gives improved fairness performance, this is at the cost of very high average
delay (Tc(k) > 750 slots).
Fig. 5.3 Delay vs output load λ , with bursty mixing traffic
5.4.3 Performance under Binomial Mixing Traffic
Binomial mixing traffic [45] is the same as the Bernoulli uniform mixing
traffic model except in how the fan-out size of a multicast packet is generated. Let Pk be
the probability of generating a fan-out set of size k. The k destinations are uniformly
distributed over all output ports. The value of k is chosen according to a non-uniform
binomial distribution, with mean fan-out h:
Pk = C(N,k)·(h/N)^k·(1 − h/N)^(N−k).
Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7
In our simulations, we set mean fan-out h = 17. Then the output load λ is:
λ = p[0.5 + 0.5×17] = 9p.
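A small check of this fan-out model (illustrative code; the function name is ours): a binomial over N = 32 outputs with success probability h/N has a pmf that sums to one and a mean fan-out of exactly h = 17, which recovers λ = 9p just as in (5.1).

```python
from math import comb

def fanout_pmf(n, h):
    """P_k = C(n,k) * (h/n)^k * (1 - h/n)^(n-k): binomial fan-out, mean h."""
    p = h / n
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

pmf = fanout_pmf(32, 17)
mean_fanout = sum(k * pk for k, pk in enumerate(pmf))
```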
The delay performance shown in Fig. 5.5 is comparable with that in Fig. 5.1. This is
because the two traffic models are quite similar. Specifically, they have the same
Bernoulli packet arrival, same average fan-out size of 17, and their fan-out sets are
all uniformly selected from all outputs. We skip the figure of delay vs fan-out
because it has a similar trend as that in Fig. 5.2.
From the simulation results above, we can see that setting m=2 is sensible as
it ensures sufficiently low packet delay and high throughput. Besides, the extra
complexity involved in maintaining two multicast queues is marginal.
Fig. 5.5 Delay vs output load λ , with binomial mixing traffic
5.5 Chapter Summary
In this chapter, the feedback-based two-stage switch was extended to schedule
multicast traffic by slightly modifying its operations. The resulting switch not only
removes the centralized scheduler but also supports multicast traffic using a simple
unicast switch fabric. Simulation results showed that with packet duplication at both
input ports and middle-stage ports, the proposed multicast scheduling algorithm is
effective in cutting down both average delay and delay variation among different
copies of the same multicast packet.
Chapter 6
Multi-cabinet Implementation
6.1 Introduction
To accommodate the growth of Internet traffic, high-speed routers consist
of a large number of linecards (e.g. 1152 linecards in the Cisco CRS-1 [8]), resulting in
large physical space and power requirements. Consequently, a multi-cabinet
implementation of routers is needed [50-51], where the distance between linecards
and (central) switch fabrics can be tens of meters.
In a single-cabinet implementation, the propagation delay between linecards
and switch fabrics is negligible. In a multi-cabinet implementation, due to the non-
negligible propagation delay, the requirement that occupancy vectors must arrive at
input ports within a single time slot will significantly lower the feedback-based
switch efficiency. This is illustrated in Fig. 6.1. Since the occupancy vector needs to
take the in-flight packet (in the first switch fabric) into account, it can only be
generated when the packet (at least partly) arrives. A dedicated feedback packet is
required, as piggybacking the occupancy vector onto a data packet is not possible. Finally,
an input port must wait for the occupancy vector to arrive before another packet can
be scheduled for sending. From Fig. 6.1, we can see that the duration of a slot must
be at least twice the propagation delay between linecards and the switch fabrics. But
in each slot, only a single packet can be sent. Since a switch fabric cannot be
reconfigured while there are in-flight packets, the slot duration is (roughly) the
duration that a switch configuration lasts.
Fig. 6.1 The timing diagram of switch with large propagation delay
In this chapter, we revamp the original feedback mechanism and design a new
batch scheduler to solve this problem. The basic idea is to schedule and send multiple
packets while each switch configuration lasts. The key challenge is how to keep
the original close-to-100% throughput performance and ensure in-order packet
delivery.
The rest of the chapter is organized as follows. In the next section, we review
some related work on addressing the impact brought by propagation delay. In Section
6.3, the feedback mechanism is revamped and a new batch scheduler is designed. Its
performance is evaluated in Section 6.4 and we conclude the chapter in Section 6.5.
6.2 Related Work
6.2.1 Multi-cabinet Implementation of Input-queued Switch
To improve the performance of input-queued switch under multi-cabinet
implementation, SRR (Synchronous Round Robin) scheduler is proposed [51]. SRR
is a distributed and iterative scheme in which each input port sends only one request,
based on a cyclic, TDMA-like (Time Division Multiple Access) preferential
scheduling of VOQs. A request is selected by logically numbering the slots with an
incremental counter ranging from 0 to N−1. If the preferred VOQ is empty, then the
longest one is selected. Each output also has a preferential input to grant based on the
same TDMA-like cycle. If the preferred input request does not arrive, one request is
randomly selected for the grant. An input port receives the grant for its current request
one round-trip time later. While waiting for the grant to arrive, each input continually
sends its preferred request on a slot-by-slot basis. From [51], we can see that when
the traffic is bursty, the switch throughput is rather limited.
6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch
A multi-cabinet implementation of buffered crossbar switch is studied in [19],
where a large packet buffer size at each crosspoint is required to achieve high
throughput. This imposes further challenges to the implementation of buffered
crossbar switch. In [18], virtual crosspoint queues are introduced to alleviate the in-
fabric buffer requirement but the resulting switch gives poor throughput performance
under some traffic conditions.
6.3 Multi-cabinet Implementation of Feedback-Based Switch
6.3.1 Revamped Feedback Mechanism
Fig. 6.2 Multi-cabinet implementation of the feedback-based switch
Fig. 6.2 shows a multi-cabinet implementation of a feedback-based two-stage
switch. To increase the switch efficiency, we can send multiple packets in a slot. The
minimum duration of a slot is the round trip propagation time between linecards and
switch fabrics, or RTT seconds. Let the (maximum) number of packets that can be
sent in each slot be x. The value of x depends on packet size ( B bytes), RTT , and the
line rate ( R bps). Roughly, we have
x = RTT / packet_duration = RTT·R / (8B).
For a typical distance of 20 meters between linecards and switch fabrics, the
(minimum) slot duration is RTT =200 ns. To transmit a packet of 200 bytes on a
40Gbps line, 40 ns are required. Reserving some guard times for control, we can
transmit x = 4 packets in a slot, as shown in Fig. 6.3.
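The arithmetic behind x can be sketched as follows (the function name is ours; units are chosen so the division is exact). With RTT = 200 ns, R = 40 Gbps and B = 200 bytes, a packet lasts 40 ns, so up to five packets fit in one slot before guard times are deducted.

```python
def max_batch_size(rtt_ns, rate_gbps, pkt_bytes):
    """x = RTT / packet_duration = RTT * R / (8B), ignoring guard times."""
    pkt_ns = 8 * pkt_bytes / rate_gbps   # bits / (Gbit/s) gives nanoseconds
    return int(rtt_ns // pkt_ns)
```

Reserving guard time for control then leaves x = 4 usable packet slots, as used in the text.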
Fig. 6.3 Feedback operation in multi-cabinet implementation
But can we still keep the in-order packet delivery and high-throughput
properties of a single-cabinet implementation of the feedback switch? With the
following modifications, the answer is yes. First of all, the buffer size at each middle-
stage port VOQ2( j,k ) is increased to x to accommodate up to x packet arrivals in each
time slot. The occupancy vector is expanded to N ·log x bits, as the size of each VOQ
requires log x bits.
The feedback operation is also revamped. Refer to Fig. 6.3. Assume at time
slot t input port i connects to output k via middle-stage port j. At the beginning of slot
t , (based on the occupancy vector received in the previous slot) input i uses a local
batch scheduler (to be detailed in Section 6.3.2) to select up to x packets for sending.
A special header (destination report) is appended to the first packet sent, which
contains the destinations of the x packets to be sent in this slot. As each destination
requires log N bits, the destination report consists of x·log N bits.
While input ports are sending packets to middle-stage ports, middle-stage
ports are sending packets to output ports in parallel. When a middle-stage port ( j) is
connected to an output port (k ), all backlogged packets (at most x) in VOQ2( j,k ) will
be completely cleared. (Backlogged packets refer to packets that arrived in previous time
slots, excluding those arriving in the current slot.) In fact, due to the
predetermined sequence of configurations used, middle port j knows beforehand
which VOQ2( j,k ) will be cleared at which time slot.
Middle-stage port j generates the occupancy vector upon receiving the
destination report from input i. The destination report contains the destinations of all
the packets to arrive in the following slot duration. Therefore, at the time the
occupancy vector is generated (in the middle of slot t ), it already looks ahead to get
the accurate VOQ status at the time the last packet sent in slot t arrives at middle-
stage port j (see Fig. 6.3). The occupancy vector is then appended to the next packet
sent in the second switch fabric for transmission, i.e. packet 3 in Fig. 6.3.
When the occupancy vector arrives at output k and is made available to input
k at the beginning of slot t +1, the input port batch scheduler selects and sends up to x
packets to middle-stage port j. It should be emphasized that the scheduling is based
on what will happen when the selected packets arrive at middle-stage port j (i.e. the
information in the occupancy vector received). Notably, the first packet from input k
will arrive at middle-stage port j right after the last packet from input i. The
bandwidth of the switch fabric is thus fully utilized.
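The look-ahead in the feedback generation can be sketched like this (illustrative names, not from the thesis): the middle-stage port adds the destinations announced in the report to its current VOQ2 backlog, so the vector it feeds back already describes the state after all the in-flight packets land.

```python
def lookahead_occupancy(voq2_len, dest_report, x):
    """voq2_len[k]: current backlog of VOQ2(j,k); dest_report: destinations of
    the up-to-x packets announced by the connected input for this slot.
    Returns the occupancy vector as seen after all announced packets arrive."""
    occ = list(voq2_len)
    for k in dest_report:
        occ[k] += 1                       # count the in-flight packets too
    assert all(q <= x for q in occ), "each VOQ2 buffers at most x packets"
    return occ
```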
6.3.2 Batch Scheduler Design
Now we focus on the batch scheduler design. Without loss of generality, we
assume a LQF batch scheduler at input port k. Specifically, input k identifies the set
of VOQ2's at middle-stage port j that have room for new packets; denote this set by
Sj. It then finds the longest queue VOQ1(k,h) such that VOQ2(j,h) belongs to Sj,
schedules the HOL packet of VOQ1(k,h) for sending, and updates Sj and the size of
VOQ1(k,h). This process is repeated until x packets are scheduled (or no more
packets are available).
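A minimal sketch of this batch LQF loop (function and variable names are ours; queue lengths are plain integers):

```python
def batch_lqf(voq1_len, voq2_room, x):
    """LQF batch scheduler at input k: repeatedly pick the longest VOQ1(k,h)
    whose VOQ2(j,h) still has room, up to x packets per slot. Returns the
    list of scheduled destinations in selection order."""
    voq1, room, batch = list(voq1_len), list(voq2_room), []
    while len(batch) < x:
        cands = [h for h in range(len(voq1)) if voq1[h] > 0 and room[h] > 0]
        if not cands:
            break
        h = max(cands, key=lambda d: voq1[d])   # longest queue first
        batch.append(h)
        voq1[h] -= 1
        room[h] -= 1
    return batch
```

For example, with VOQ1 backlogs [3, 0, 5, 2] and VOQ2 room [1, 4, 2, 4], the scheduler drains destination 2 twice before its VOQ2 fills, then falls back to the next-longest queues.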
As with the scheduler in the single-cabinet implementation, we also include the
following refinements in the batch scheduler:
Forced-zero-queue-size: If middle-stage port j will connect to output k' in the next
slot t+1, then in the current slot t middle-stage port j reports a zero queue
size for VOQ2(j,k'). This is because VOQ2(j,k') is guaranteed to be exhausted
at the end of slot t+1 (i.e. all its packets will be sent to output k'). With
forced-zero-queue-size, the batch scheduler has more flexibility in selecting
packets to send.
Preventing underflow: Assume input port i connects to middle-stage port j in time
slot t, and j will connect to output k' in the next slot t+1. If flow(i,k') has
packets waiting in VOQ1(i,k') but VOQ2(j,k') does not have x packets ready
for sending in slot t+1, an underflow will occur. To avoid the possible loss of
efficiency due to underflow at VOQ2(j,k'), at slot t input i should always give
priority to sending packets from VOQ1(i,k') to VOQ2(j,k').
6.3.3 Some Properties
The new batch scheduler operates on the architecture of Fig. 6.2, which
adopts the same joint sequence as Fig. 2.2(a). In the following, we show that the
multi-cabinet implementation of the feedback-based scheduler ensures in-order
packet delivery and 100% throughput under a speedup of two, supports asymmetric
reconfiguration, and cuts the communication overhead:
In-order packet delivery. Each flow having a constant middle-stage delay is a
sufficient condition for packet in-order delivery in two-stage switch (proven
in Chapter 3). While extending the feedback-based switch to multi-cabinet
implementation, we allow x packets to be sent in each time slot. The constant
middle port delay for packets of the same flow is still guaranteed by the
adopted joint sequence. The delay a packet experiences at a middle-stage port
is again bounded by [1, N] slots. Without loss of generality, assume m (out of
x) packets arriving at middle-stage port j in the same time slot belong to the
same flow(i,k ). Those m packets will be buffered at VOQ2( j,k ) for the same
amount of time until middle-port j is connected to output k . Then, they will
be delivered to output k (possibly together with ( x – m) packets from other
flows) in the same slot. So the constant middle-stage delay is still guaranteed,
and thus the in-order delivery property is still ensured.
100% throughput under speedup of two. For multi-cabinet implementation
with a batch size of x packets, we can treat each batch as a single aggregate
packet. Then the multi-cabinet switch is equivalent to a single-cabinet switch.
In other words, the propagation delay between linecards and switch fabrics
does not affect/reduce the throughput performance of a multi-cabinet switch.
Asymmetric reconfiguration. In Fig. 6.3, when the last bit of the x-th packet
arrives at the middle-stage port, the first stage switch fabric can start to re-
configure. When the last bit of the x-th packet departs the second switch
fabric, the second switch fabric can start to re-configure. In other words, the
reconfiguration of second fabric can start before the last bit of the x-th packet
arrives at the output port. For optical switch fabrics with non-negligible
amount of re-configuration overheads, such a pipelined packet transmission
and asymmetric reconfiguration can be very efficient.
Cutting down the communication overhead. In the original feedback-based switch,
the communication overhead for sending a single packet is N bits. From Fig.
6.3, we can see that only a single occupancy vector of N·log x bits is required
for the x packets sent by the batch scheduler. The per-packet communication overhead
is thus reduced from N bits to (N·log x)/x bits.
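The overhead comparison above is simple arithmetic; the sketch below (our helper name) follows the thesis in writing log x for the bits needed per VOQ size field. For N = 32 and x = 4, the per-packet feedback drops from 32 bits to 16 bits.

```python
from math import log2

def per_packet_feedback_bits(n, x):
    """One n*log2(x)-bit occupancy vector amortized over a batch of x packets."""
    return n * log2(x) / x
```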
6.4 Performance Evaluations
In this section, we study the performance of our multi-cabinet implementation
of the feedback-based switch by simulations. In the following, we only present
simulation results for a switch of size N=32, although similar conclusions apply to
other sizes. As the duration of a time slot differs with the propagation delay
considered (see Figs. 6.1 and 6.3), the delay performance is
measured by the number of time units, where each time unit is equivalent to the
transmission time of a packet at line rate. In our simulations, we use the same traffic
models as Chapter 2.4, i.e. uniform, uniform bursty and hot-spot. We assume the
propagation delay between linecards and switch fabrics is y, which varies from 1 to 2
time units. For simplicity, we ignore the overheads for switch reconfiguration,
scheduling, etc. Three scheduling algorithms are compared:
LQF without batch scheduling. When propagation delay is y time units, we
denote the algorithm by LQF/ y. The operation of LQF/ y is based on Fig. 6.1,
where only one packet can be sent in each slot. In other words, this is a direct
extension from the single-cabinet case.
LQF with batch scheduling (as shown in Fig. 6.3). When propagation delay is
y, we denote the algorithm by B-LQF/ y and the number of packets that can be
sent in each time slot is 2 y.
SRR algorithm [51]. When the propagation delay is y, we denote SRR as
SRR/ y. We regard SRR as a “generalization” of iSLIP [15] for multi-cabinet
implementation. In other words, SRR serves as a benchmark for single-stage
input-queued switches. Note that we do not compare with LQF-Byte-focal
[23] and CR [29] because they cannot be used for multi-cabinet
implementation.
6.4.1 Performance under Uniform Traffic
From Fig. 6.4, we can see that due to the inefficiency caused by propagation
delay, LQF/ y can only obtain up to 25% and 50% throughput when y=2 and 1
respectively. With B-LQF/ y, close-to-100% throughput can be obtained. Note that the
average middle-stage port delay is still 16.5 slots. Since the duration of a slot is 2 y
time units, the average middle-stage port delay is 33 time units for y=1 and 66 for
y=2.
Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet
6.4.2 Performance under Uniform Bursty Traffic
In Fig. 6.5, our B-LQF/ y again yields close-to-100% throughput under bursty
traffic. Despite the fact that the middle-stage packet delay increases with the slot
duration, it is interesting to observe that when input load p>0.94, B-LQF/2 starts to
outperform B-LQF/1, though very slightly. The reason is as follows. In a time slot,
each input port can send up to 2y packets to a middle-stage port with B-LQF. So packets in
B-LQF/2 tend to have a higher chance to enter the middle port than B-LQF/1. The
earlier packets enter the middle port, the less input port delay they experience. So
with B-LQF/2, packets tend to experience less input port delay. Under heavy bursty
loading, the input port delay dominates the overall delay performance. For B-LQF/2,
the drop in input port delay starts to outweigh the increase in middle port delay at
p=0.94.
Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet
6.4.3 Performance under Hotspot Traffic
In Fig. 6.6, we can again see that B-LQF/ y yields close-to-100% throughput,
and significantly outperforms its non-batch scheduling counterparts.
Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet
6.5 Chapter Summary
In a multi-cabinet implementation of the feedback-based switch, due to the non-
negligible propagation delay between linecards and switch fabric, the requirement
that occupancy vectors must arrive at output/input ports within a single time slot will
significantly lower the switch efficiency. In this chapter, we revamped the original
feedback mechanism and a new batch scheduler was designed to address this
problem. We showed that with multi-cabinet implementation, the refined feedback-
based two-stage switch still guarantees in-order packet delivery, and provides close-
to-100% throughput performance.
Chapter 7
Scheduling Inadmissible Traffic Patterns
7.1 Introduction
In the previous chapters, the feedback-based switch was designed while
focusing on handling admissible traffic patterns (i.e. both the input ports and output
ports are not over-subscribed), like [21-32]. For any admissible traffic patterns, as
long as the switch is stable, all packets can arrive at the outputs with bounded delays.
In this case, fairness in throughput is not an issue. But in practice, admissible traffic
patterns cannot be ensured, as an output port can experience oversubscription from
time to time. Therefore, a router should also be designed to efficiently handle
inadmissible traffic patterns.
It is interesting to note that under an inadmissible traffic pattern where some
output ports are over-subscribed, the overall throughput in feedback-based switch is
not affected, as the over-subscribed outputs will always be fully utilized (due to
the work-conserving nature of the port scheduler used). However, different input
ports will have an unfair throughput share of the oversubscribed outputs. In other
words, the feedback switch will suffer from the ring-fairness problem, i.e. for packets
going to the same over-subscribed output (e.g. output 3 in Fig. 7.1), the further away
“up-stream” input ports (e.g. input 0 in Fig. 7.1) can throttle the nearby “down-
stream” input ports (e.g. input 3 in Fig. 7.1).
Fig. 7.1 A 4 x 4 feedback-based switch with output port 3 oversubscribed by
inputs 0, 1, 2 and 3.
To address the ring-fairness issue for over-subscribed outputs, a fair scheduler
is designed for the feedback-based switch in this chapter. The basic idea of the fair
scheduler is to reserve the middle-stage buffers for flows whose input VOQs exceed a
pre-determined threshold Q. The bandwidth of an over-subscribed output is then
allocated to those input VOQs (exceeding Q) using a simple round robin (RR)
scheduler. We show that the optimal value of the threshold equals the switch size
(Q= N ) and the resulting algorithm can meet the max-min fairness criterion.
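The core of the idea can be sketched as follows. This is an illustrative sketch with assumed names, and the behavior for non-congested traffic is simplified here to a longest-queue fallback: the over-subscribed output's bandwidth is rotated round-robin over only those inputs whose VOQ exceeds Q.

```python
def rr_pick(voq_len, q_threshold, rr_ptr):
    """For one over-subscribed output: serve inputs whose VOQ length exceeds
    the threshold Q in round-robin order, starting from rr_ptr. Returns the
    chosen input and the advanced pointer."""
    n = len(voq_len)
    congested = [i for i in range(n) if voq_len[i] > q_threshold]
    if not congested:
        # no VOQ exceeds Q: fall back to a longest-queue-first choice
        return max(range(n), key=lambda i: voq_len[i]), rr_ptr
    for step in range(n):
        i = (rr_ptr + step) % n
        if i in congested:
            return i, (i + 1) % n
```

Because the pointer advances past each served input, every congested VOQ receives the same share of the output's bandwidth over time, regardless of its position on the "ring".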
The rest of the chapter is organized as follows. In the next section, we review
some related work on fair scheduling algorithm design. In Section 7.3, our fair
scheduling algorithm is proposed. In Section 7.4, we show that the proposed
algorithm satisfies the max-min fairness criterion. Its performance is then evaluated
in Section 7.5 by simulations. Finally, we conclude the chapter in Section 7.6.
7.2 Related Work
In the literature, fair schedulers are designed to handle both admissible and
inadmissible traffic patterns. For inadmissible traffic patterns, algorithms can be
further divided into two types: those with over-subscribed output ports only, and
those with both over-subscribed input and output ports.
7.2.1 Fair Scheduling under Admissible Traffic
In [53], a centralized algorithm called GPS-SW (Generalized Processor
Sharing in network Switch) is proposed. Under the assumption that the traffic is
admissible, GPS-SW uses a matrix-scaling approach to maximize throughput while
distributing the excess available bandwidth in a fair fashion. However, the example
in [54] shows that under admissible traffic achieving both max-min fairness and
100% throughput at the same time is impossible. For the sake of fairness under
admissible traffic, GPS-SW sacrifices its throughput performance.
7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only
The F-MWM (Fair-MWM) algorithm, proposed in [55] for input-queued switches, assumes that output ports can be oversubscribed but input ports cannot. Therefore, F-MWM only considers fairly allocating the bandwidth of output ports. As soon as an (input) VOQ's length exceeds a pre-set threshold, the VOQ is moved to the congested list. Each VOQ in the congested list is served exactly once during every N time slots. The VOQs not in the congested list are scheduled using LQF.
TFQA (Tracking Fair Quota Allocation [56]) is a variant of F-MWM that applies to buffered crossbar switches. Unlike F-MWM, it maintains an adaptive threshold at each input port. The VOQs that exceed the threshold are placed in a primary class. Dual round-robin pointers in each input port schedule the packets, one pointer for the primary class and another for all VOQs. Higher priority is always given to the primary class.
7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports
In [54, 52], both input and output ports can be oversubscribed, so the bandwidth of all inputs and outputs must be fairly allocated. The algorithm for input-queued switches [52] operates in two main phases. In the first phase, only the output port bandwidth is considered. At the end of the first phase, the only possible bottlenecks for the flows are the input ports. In the second phase, the algorithm allocates bandwidth at the input ports in a max-min fair fashion, resulting in an allocation that is overall max-min fair.
AMFS (Adaptive Max-min Fair Scheduling) [54] is based on the architecture
of a buffered crossbar switch. AMFS maintains two systems: a virtual system that exactly emulates WF2Q+ (Worst-case Fair Weighted Fair Queueing+ [57]) and a real system, AMFS itself, that actually schedules the flows. The virtual WF2Q+ calculates per-flow virtual scheduling start and finish times, which AMFS attempts to emulate. It has been proven that AMFS can sustain 100% throughput for admissible traffic and ensure max-min fairness for inadmissible traffic without any speedup. However, this proof assumes an infinite crosspoint buffer size. Furthermore, the algorithm incurs the overhead of maintaining the virtual WF2Q+ system.
7.3 Our Approach
Like [55-56], we consider inadmissible traffic patterns with oversubscribed output ports only. This assumption is reasonable: input ports can indeed avoid being over-subscribed [55] thanks to the physical line-rate constraint on each ingress port. But an output port must process the egress traffic of N incoming flows, so output port bandwidth over-subscription is difficult to avoid.
First of all, an overload vector {wi} (i=0,1,...,N-1) and a reservation vector {qi} (i=0,1,...,N-1) are required for conveying reservation requests and grants at each middle-stage port j. All elements of the two vectors are initialized to -1. If wi = l and l > -1, input port i has at least Q packets destined for output l. If qi = m, then VOQ2(j,i) of the current middle-stage port j is reserved for input port m (for sending a packet to output port i). In each time slot, based on the values of {wi} and {qi}, the following operations are carried out at each
middle port j in parallel:
Sending a reservation/overload request. For any input port m, among its VOQs of length ≥ Q, select VOQ1(m,l) based on a round-robin (RR) scheduler; the identity of VOQ1(m,l) is piggybacked (using log N bits) onto the current packet transmission to middle port j. Middle port j updates its overload vector so that wm = l.
Determining the winner. Assume middle-stage port j connects to an output port k. Middle port j examines its {wi}. If all wi ≠ k (i=0,1,...,N-1), make sure that the reservation vector {qi} has qk = -1, meaning no reservation on VOQ2(j,k) is required (as none of the input ports has Q or more packets for output k). If some wi = k, then select (based on an RR scheduler) one of them, say wl = k, and set qk = l. This indicates that VOQ2(j,k) (of middle port j) is reserved for input port l. Then reset all wi = k to wi = -1 to indicate that the corresponding reservation requests for VOQ2(j,k) have been processed.
Ensuring a reservation is honored. Before middle-stage port j sends its occupancy vector to its connected output port k, j first examines its reservation vector {qi}. If there is any qi = m, where m ≥ 0 and m ≠ k, middle-stage port j knows that VOQ2(j,i) is not available, as it has been successfully reserved by input port m. Therefore, the feedback bit in the occupancy vector for VOQ2(j,i) is overwritten to 1. This ensures that VOQ2(j,i) can only be used by input port m.
Input port scheduling. Any VOQ1 that sent a reservation request at time slot t is given the highest priority for scheduling at time slot t + N. Otherwise, send the HOL packet from the longest VOQ1 whose corresponding middle-stage VOQ2 is empty (as in the original feedback-based switch with port scheduler LQF).
An example: Consider a 4×4 feedback-based switch (with the fair scheduler) configured by the joint sequence of Fig. 2.2(a). Further assume that at time slot 0, the lengths of VOQ1(0,0) and VOQ1(0,2) exceed threshold 4, so input 0 selects one of them (say VOQ1(0,2)) based on RR for sending a reservation request to its currently connected middle port 1. Both input port 0 and middle port 1 record the identity of VOQ1(0,2). When middle port 1 connects to output port 2 at time slot 1, it checks its received reservation requests for output port 2 and selects one (say that of VOQ1(0,2)) to grant based on RR. The middle-port VOQ2(1,2) can then only be used by input port 0 in the following 4 time slots. Meanwhile, VOQ1(0,2) is given the highest priority for scheduling at time slot 4.
An input port generates a reservation request if a VOQ1 exceeds a pre-
determined threshold Q. The delay between an input port generating a reservation
request and knowing the result is N time slots (one round trip time for the joint
sequence). Within these N time slots, each input port can send up to N packets. If Q
is smaller than N and the reservation is successful, by the time the input port learns the result, the corresponding VOQ1 may already be empty, as the backlogged packets may have been exhausted while waiting for the result to arrive. This would create a wasted slot. (On the other hand, if the reservation fails, no slot is wasted even though the corresponding VOQ1 may still be empty then.) If Q ≥ N, it is guaranteed that at least one packet remains in the queue to make use of the reserved slot. However, having a large Q would adversely affect the packet
delay performance. Therefore, we use Q = N in our proposed fair scheduler to get the
best delay-throughput performance.
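The worst-case arithmetic behind the choice Q = N can be written out explicitly; a minimal sketch under our reading of the timing (at most N-1 packets of the requesting VOQ1 depart during slots t+1, ..., t+N-1, before the reserved slot at t+N):

```python
N = 32          # switch size = round-trip time of the joint sequence, in slots
Q = N           # reservation threshold used by the fair scheduler

backlog = Q     # VOQ1 length when the reservation request is generated
drained = N - 1 # packets that can leave that VOQ1 before the reserved slot
print(backlog - drained)   # 1: at least one packet is left for the reserved
                           # slot, so a successful reservation is never wasted
```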
7.4 Max-min Fairness Criterion
In the following, we would like to show that our fair scheduler can satisfy the
max-min fairness criterion. Firstly, we borrow the following two definitions from
[52,58]:
Definition 4: The allocation vector {ai} is said to be feasible if and only if:
Each entity receives an allocation greater than or equal to zero; that is, for all i, ai ≥ 0.
The total allocated resource is less than or equal to the available resource U; that is, ∑ai ≤ U.
Definition 5: For the demand vector {bi}, the allocation vector {ai} is said to
be max-min fair if:
1. It is feasible.
2. No entity receives an allocation greater than its demand; that is, for all i, ai≤ bi.
3. For all i, the allocation of entity i cannot be increased while satisfying the above
two conditions and without reducing the allocation of some other entity j for
which a j ≤ ai.
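Definition 5 is met by the classic water-filling allocation. The sketch below (the function name and example demands are ours, not from the thesis) computes such an allocation for checking small cases:

```python
from fractions import Fraction

def max_min_fair(demands, capacity):
    """Water-filling allocation satisfying Definition 5: repeatedly give every
    unsatisfied entity an equal share of the remaining capacity, capping each
    entity at its demand."""
    n = len(demands)
    alloc = [Fraction(0)] * n
    remaining = Fraction(capacity)
    active = set(range(n))
    while active and remaining > 0:
        share = remaining / len(active)
        # entities whose residual demand fits inside the equal share get
        # exactly their demand; leftover is redistributed in the next round
        capped = {i for i in active if demands[i] - alloc[i] <= share}
        if not capped:
            for i in active:
                alloc[i] += share
            remaining = Fraction(0)
        else:
            for i in capped:
                remaining -= demands[i] - alloc[i]
                alloc[i] = Fraction(demands[i])
            active -= capped
    return alloc

# Three flows demanding 0.2, 0.6 and 0.9 of an output of capacity 1:
# flow 0 is fully served; flows 1 and 2 split the remaining 4/5 equally.
print(max_min_fair([Fraction(2, 10), Fraction(6, 10), Fraction(9, 10)], 1))
```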
As long as an algorithm meets the three conditions above, it satisfies the max-min fairness criterion. Note that in our fair scheduler, the demand bi is the traffic load from input port i to an over-subscribed output port j. Let the capacity of
output port j be U, i.e. the available resource is U. Assume the fair scheduler divides U among the inputs, giving input port i an allocation ai (i=0,1,...,N-1). Obviously, ai ≥ 0 and ∑ai ≤ U, so {ai} is feasible (condition 1). By setting the threshold for generating a reservation request at Q = N, the fair scheduler wastes no reserved slot, so ai ≤ bi for all i (condition 2). In the following, we focus on condition 3, i.e. we try to increase some bandwidth allocation ai and see how this affects the other inputs. Assume the switch has been "warmed up". Let ci be the number of times that input i's VOQ(i,j) exceeds threshold Q during L time slots. We have
ci ≤ L for all i (i=0,1,...,N-1). (7.1)
If input i has a larger ci than input k's ck, then according to the fair scheduler, input i generates more reservation requests and thus gets a larger share of output j's bandwidth (as output j is over-subscribed). That is,
ai ≥ ak, if ci ≥ ck (7.2)
Taking a closer look at ci, there are two possible cases:
ci < L: In one or more time slots, the length of VOQ(i,j) is less than threshold Q. Then traffic load bi is satisfied by bandwidth allocation ai, i.e.
bi = lim L→∞ (ci /L) · ai
Therefore, ai cannot be further increased because ai has conformed to condition 2.
ci = L: The length of VOQ(i, j) is always longer than threshold Q. This indicates
that traffic load bi cannot be satisfied by bandwidth allocation ai because the
output port j is over-subscribed:
∑ai = U (7.3)
From (7.1), we have:
ci ≥ ck for all k (k=0,1,...,N-1)
Combining it with (7.2), we get:
ai ≥ ak, for all k (k=0,1,...,N-1) (7.4)
To increase ai, we have to reduce some ak (k=0,1,...,N-1) due to (7.3); but by (7.4), every such ak satisfies ak ≤ ai, so any increase of ai reduces the allocation of an input with a smaller or equal share. This is precisely condition 3. Combining the proofs of the three conditions in Definition 5, our fair scheduler satisfies the max-min fairness criterion. Note that we focus here on max-min fairness, but proportional fairness can also be supported with a minor revision of the fair scheduler.
7.5 Performance Evaluations
In this section, we focus on the fairness performance in allocating the
bandwidth of an over-subscribed output port using the original feedback-based
switch (Feedback) and the proposed fair scheduler (Feedback-F). (Note that for admissible traffic patterns, the fair scheduler yields the same performance as the original feedback-based switch; those results are thus not shown in this chapter.)
7.5.1 Server-client Traffic Model
The server-client traffic model in [59] is first adopted for generating
inadmissible traffic. At each time slot for every input, a packet arrives with
probability p. Linecards are partitioned into two types: a server (i.e. linecard 0) and
N -1 clients. The server transmits packets with equal probability to all clients. Each
client transmits 1/3 of its traffic toward the server and 2/3 to the other N -2 clients
with equal probability. The server is a hotspot and when N =32, the amount of traffic
going to the server is given by
λ = p( N -1)/3= 31 p/3.
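The server-client model above can be sketched as a destination sampler. The function name and the value of p are illustrative, not from [59]; the exact-arithmetic check at the end reproduces the coefficient 31/3:

```python
import random
from fractions import Fraction

N = 32          # switch size; linecard 0 is the server
p = 0.9         # per-input arrival probability per slot (illustrative value)

def arrival_destination(i):
    """Sample the destination of a packet arriving at input i under the
    server-client model: the server spreads uniformly over the clients;
    a client sends 1/3 to the server and 2/3 over the other N-2 clients."""
    if i == 0:
        return random.randint(1, N - 1)          # server -> uniform clients
    if random.random() < 1 / 3:
        return 0                                 # client -> server
    others = [j for j in range(1, N) if j != i]  # the other N-2 clients
    return random.choice(others)

# Offered load at the server: each of the N-1 clients sends 1/3 of rate p
lam_coeff = Fraction(N - 1, 3)
print(lam_coeff)   # 31/3, i.e. lambda = 31p/3 as in the text
```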
Fig. 7.2 shows the bandwidth share of three representative flows, (1,0), (9,0)
and (2,0), at the server, versus the total server loading λ. Note that to reach output
port 0, the middle stage port delays for flows (1,0), (9,0) and (2,0) are 32, 8 and 1
time slots, respectively. The server becomes over-subscribed (i.e. the traffic becomes
inadmissible) when λ > 1. With the original feedback-based switch, flow(2,0) (yellow) and flow(9,0) (purple) are quickly throttled by flow(1,0) (light blue) due to the ring-fairness problem. With Feedback-F, the three flows share the oversubscribed server bandwidth equally (together with the remaining 28 flows, not shown), thanks to its proven max-min fair allocation.
Fig. 7.2 Output 0’s throughput vs its output load λ, under server-client traffic.
7.5.2 Attack-traffic Scenario
We also emulate an attack-traffic scenario, where output 0 is gradually
dominated by traffic coming from input 1. The detailed traffic model is as follows. At
each time slot for each input port, a packet arrives with probability p. For input port 1,
an arrived packet goes to output port 0 with probability 0.5 (we call it an attack-flow),
and the remaining 0.5 probability is equally shared by all other output ports. For any
other input ports, an arrived packet goes to all N -1 output ports with equal probability.
Therefore, at the over-subscribed output 0, when N =32, the output load λ is:
λ=0.5 p + p·( N -2)/( N -1)= p·91/62
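The coefficient 91/62 follows from exact arithmetic over the N-2 well-behaved inputs plus the attacker; a quick check:

```python
from fractions import Fraction

N = 32
# input 1 sends to output 0 with probability 1/2; each of the other N-2
# inputs (2..N-1) spreads uniformly over its N-1 possible outputs, so it
# reaches output 0 with probability 1/(N-1)
coeff = Fraction(1, 2) + (N - 2) * Fraction(1, N - 1)
print(coeff)   # 91/62, matching lambda = 91p/62
```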
Fig. 7.3 Output 0’s throughput vs its output load λ , under attack traffic
From Fig. 7.3, as output load λ increases, with Feedback the throughput share
for flow(2,0) (yellow) and flow(9,0) (purple) quickly drops to 0, while the
throughput for the attack-flow(1,0) (light blue) increases linearly. When Feedback-F
is used, the attack-flow(1,0) is regulated/reduced, due to the max-min fair allocation
nature. Specifically, the attack-flow(1,0) can only make use of the excess bandwidth
(if any) from other flows with smaller traffic demands, i.e. flow( i,0)s (i=2,3…31).
From Fig. 7.3, we can see that the malicious flow can be identified and punished by
the proposed fair scheduler.
7.6 Chapter Summary
For an inadmissible traffic pattern where some outputs are over-subscribed,
the feedback-based two-stage switch suffers from the ring-fairness problem. To this end, a fair scheduler for the feedback-based switch was designed in this chapter. We adopted the simple idea of reserving a middle-stage buffer for any input VOQ exceeding a threshold Q. The bandwidth of over-subscribed outputs is then allocated to the input VOQs (exceeding Q) on an RR basis. We proved that the resulting algorithm satisfies the max-min fairness criterion. The simulation results also confirmed the max-min fair nature of the proposed scheduler.
Chapter 8
An Optical Implementation of Feedback-Based Switch
8.1 Introduction
For routers with an electronic switch fabric (e.g. Fig. 1.6), packets must go
through additional O-E-O conversion while being switched from one linecard to
another. This not only limits the router speed, but also increases the difficulties in
designing a high-speed electronic switch fabric. In this chapter, we propose an
optical implementation of our (electronic) feedback-based switch to enable a packet
to be switched all-optically from one linecard to another. We call the resulting switch
load balanced optical switch (LBOS).
It should be noted that despite all the advantages of optics [60-61],
implementing an all-optical router is still far from being practical because of the
immature technologies in optical processing and buffering. In this chapter, we focus
on designing hybrid electro-optic routers, where packet buffering and table lookup
are carried out in electrical domain, and switching is done optically.
The rest of this chapter is organized as follows. In the next section, we review related work on optical switches used in hybrid electro-optic routers. In Section 8.3, the design and operation of LBOS are detailed. In Section 8.4, LBOS is extended and refined along the lines of the electrical feedback-based switch. Simulation results are presented in Section 8.5, and we conclude the chapter in Section 8.6.
8.2 Related Work
There are various efforts in designing efficient optical switches for high-
speed routers. Notably, in the 100 Tb/s router project [27], optical implementation of
a load-balanced electronic switch [21] is considered. The three-stage Clos network
architecture is adopted where the center stage is implemented using optical MEMS
[62]. But all-optical packet transmission from an input linecard/port to an output linecard/port is not possible, as packets must be temporarily stored and processed in the electrical domain between the stages of the Clos network. Besides, to tackle the packet mis-sequencing problem, a large re-sequencing buffer of N²+1 packets is required at each output port, where N is the switch size.
Recently, Fasnet [63], an optical switch fabric comprising N switch linecards
connected by two counter-rotating WDM fiber rings, is proposed. The notion of
counter-rotating WDM fiber rings originally appears in designing metro networks
[64], and is further refined in [59,65-68]. In Fasnet [63], one ring is used for
transmission, while the other is for reception. The N wavelengths in the transmission
ring are switched to the reception ring at a folding point between the two rings. Only
a special input port (called master input) can generate a frame header (called
locomotive). Other input ports can put their packets at the end of a frame as its frame
header passes by. At each input port, the maximum number of packets that can be
attached after one frame header is limited by a fairness quota of Y packets. Y can be
accumulated, but has an upper bound of U ×Y , where the values of Y and U should be
given in advance. Unlike [27], this ring-based switch architecture allows all-optical
packet transmission from one linecard to another. But its delay-throughput
performance is rather limited, which is further aggravated by the fairness algorithm
adopted.
8.3 Load Balanced Optical Switch (LBOS)
8.3.1 Switch Architecture
Our load-balanced optical switch (LBOS) is targeted at all-optical switching
of a packet from one linecard to another. As depicted in Fig. 8.1, LBOS consists of N
linecards connected by an N -wavelength WDM fiber ring. Each linecard i has two
ports, input i and output i. Linecard/output i is configured to receive (only) on its
dedicated wavelength channel λi. To send a packet to linecard j, linecard/input i needs
to transmit the packet onto channel λ j when λ j is idle.
Fig. 8.1 A 4x4 load balanced optical switch.
Fig. 8.2 The internal structure of linecard i.
The internal structure of linecard i is similar to that used by Fasnet [63], and
is shown in Fig. 8.2. For simplicity, the electrical buffers for implementing the virtual
output queues (VOQ(i,k )’s) at each input port are not shown. A linecard has three
major modules: a receiver on channel λi, a “tunable” transmitter (implemented using
a fixed laser array) and a wavelength monitor. In Fig. 8.2, the EDFA (Erbium Doped
Fiber Amplifier) is used to compensate for the optical signal loss en route. A filter drops wavelength λi from the fiber and passes all other channels to a splitter. The dropped λi enters the high bit-rate burst-mode receiver. The splitter taps out a fraction of the light and feeds it to the monitor module. The remaining signal in the fiber goes through an FDL (Fiber Delay Line) of td seconds, where td is the time required for the monitor to identify an idle channel (detailed in the next paragraph) and for the transmitter to start sending a packet onto a selected idle channel.
For the fraction of light entering the monitor, a demultiplexer separates it into N-1 individual λ's and directs them to the dc-coupled photodiode array. A threshold comparator detects idle wavelength channels. Among all the idle channels, the linecard controller identifies its longest VOQ(i,j), and the head-of-line packet from VOQ(i,j) is sent using the transmitter module. (We call this the LQF scheduler.) The transmitter module consists of a fixed laser array, where laser λj is used to send a packet destined to linecard j. (A fixed laser array can be more cost-effective than a single fast tunable laser.) Finally, the transmitted packet is merged back onto the fiber ring by the optical coupler (in Fig. 8.2) and continues its journey to the next linecard.
8.3.2 Switch Operation
Let the packet duration, i.e. the amount of time required to send a packet
(onto a wavelength channel), be t pkt seconds. We define the duration of a time slot to
be t d+t p seconds, where t d is the propagation delay of the FDL in Fig. 8.2 and t p is the
propagation delay of the fiber from the coupler in Fig. 8.2 to the drop filter of the
next linecard. Assume the whole system is synchronized, and in each time slot, at
most one packet can be transmitted and/or received by each linecard. For the proper
operation of the switch, we must have t d ≥ t pkt and t p ≥ t pkt. This is illustrated by Fig.
8.3, where linecard i starts to receive a packet at the beginning of slot t and it takes
t pkt seconds to receive the entire packet. Meanwhile, the monitor identifies the idle
channels, and a packet is sent onto the selected idle channel. The optical coupler adds
the packet back to the fiber ring at t d seconds after the beginning of the current slot. It
takes another t p seconds for the first bit of the packet to arrive at linecard i+1. This
marks the end of time slot t and the beginning of slot t +1. It is easy to see that a
packet sent by linecard i will arrive at linecard j after ( j – i) mod N time slots.
Fig. 8.3 Timing diagram for load balanced optical switch (LBOS).
From Fig. 8.3, we can see that in each time slot, the transmitter is idle in the
first t d seconds, whereas the receiver and monitor are idle for the last t p seconds. As
only a single packet is sent/received in each slot, the efficiency of LBOS is t pkt/(t d+t p),
or at most 50% (assuming t p=t d=t pkt). To enhance the efficiency, transmitter, receiver
and monitor can operate in parallel for pipelined packet sending, receiving and
scheduling, as shown in Fig. 8.4. Specifically, in the first half of time slot t , the
transmitter can send a packet scheduled in the second half of slot t –1. In the second
half of slot t, the receiver can receive a packet sent by some linecard in the first half of an earlier time slot; meanwhile, the monitor can schedule another packet for sending in the first half of slot t+1. In other words, two packets can be received and transmitted in each time slot. (We call this pipelined LBOS when we need to distinguish it from the original LBOS.)
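The timing relations above can be checked numerically. A small sketch follows; the parameter values are our assumptions (t_pkt = t_d = t_p = 100 ns), not mandated by the design:

```python
N = 4
t_pkt = 100e-9   # packet duration (s); illustrative value
t_d = 100e-9     # FDL delay, must satisfy t_d >= t_pkt
t_p = 100e-9     # propagation delay to the next linecard, t_p >= t_pkt

slot = t_d + t_p
print(t_pkt / slot)       # 0.5: basic LBOS tops out at 50% efficiency
print(2 * t_pkt / slot)   # 1.0: pipelining moves two packets per slot

def ring_delay(i, j, n=N):
    """Slots for a packet sent by linecard i to reach linecard j."""
    return (j - i) % n

print(ring_delay(0, 3))   # 3 slots downstream
print(ring_delay(3, 0))   # 1 slot: the packet wraps around the ring
```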
Fig. 8.4 Timing diagram for pipelined packet sending and receiving.
From the operations of the LBOS above, we can see that LBOS effectively
balances the loading in the ring network by spreading (i) packets going to different
destinations over different wavelength channels (i.e. space/wavelength domain load
balancing), and (ii) packets going to the same destination over different time slots (i.e.
time domain load balancing). In the next sub-section, we show that our LBOS is an
optical counterpart of the load-balanced electronic switch architecture in [32].
8.3.3 Equivalence to Load-Balanced Electronic Switches
Consider the basic LBOS operating based on the timing diagram in Fig. 8.3.
If we treat the fiber ring as a FDL, then the ring network “buffers” a packet from
linecard i to j for exactly ( j – i) mod N time slots. Since one round trip time (RTT)
along the ring is N time slots, a specific wavelength channel on the ring can
carry/buffer up to N in-flight optical packets. With N wavelengths, the fiber ring can
buffer up to N 2 packets. Therefore, (optical) packets are “buffered” as they propagate
along the fiber ring in different wavelengths, which exactly mimics the buffering
services rendered by the middle-stage VOQ2( j,k )’s in LBES (Fig. 2.1). In a specific
time slot, the channel status (i.e. idle or not) of all the wavelengths passing by, which
is equivalent to the occupancy of VOQ2( j,k )’s in Fig. 2.1, will be conveniently
detected by the wavelength monitor on each linecard – the need for dedicated
feedback packets/vectors is thus removed.
Fig. 8.5 A joint sequence in load-balanced switch.
Assume the LBES (with a single-packet buffer at each VOQ2(j,k)) is configured by the sequence of configurations shown in Fig. 8.5. Then we can easily find a one-to-one mapping between every instance of the sequence in Fig. 8.5 and the
corresponding operation on the ring network in Fig. 8.1. Due to this equivalence, our LBOS inherits all the nice features of the LBES [32-33], such as being scalable, distributed, and yielding close-to-100% throughput and low average packet delay.
8.4 Extensions and Refinements of LBOS
8.4.1 Cutting down the Average Delay by Reconfiguration
In LBOS, the delay experienced by a packet is the summation of the queuing
delay at the input linecard and the propagation delay between the input and output
linecards. Due to the way linecards are connected in a ring, the propagation delay is
predetermined and fixed. For example, in Fig. 8.1 each packet of flow(0,3) requires 3
time slots from linecard 0 to linecard 3. Then for a given traffic matrix { λi,j}, the
average packet propagation delay is:
ji
j i
ji h H ,, (8.1)
where λi,j is arrival rate and hi,j is the propagation delay for flow(i, j), respectively. We
have 0≤ λi,j≤1 and 0≤hi,j≤ N -1 for ]1,0[, N ji . In LBOS, we assume that flow(i,i)
does not enter the ring, and thus hi,i=0.
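As a sanity check, (8.1) can be evaluated directly for small cases; a sketch (the function name is ours) using the ring delay h(i,j) = (j - i) mod N:

```python
from fractions import Fraction

N = 4

def avg_prop_delay(lam):
    """H = sum_i sum_j lam[i][j] * h(i, j), with h(i, j) = (j - i) mod N the
    ring propagation delay and h(i, i) = 0 since flow(i, i) never enters
    the ring."""
    return sum(lam[i][j] * ((j - i) % N)
               for i in range(N) for j in range(N) if i != j)

# A single flow (0, 3) at full rate on the 4x4 ring of Fig. 8.1:
lam = [[Fraction(0)] * N for _ in range(N)]
lam[0][3] = Fraction(1)
print(avg_prop_delay(lam))   # 3 slots, as computed in the text
```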
Assume λ0,3=1 and it is the only flow of the switch in Fig. 8.1. From (8.1), we have H=3 slots. If we swap the positions of linecards 0 and 2 in Fig. 8.1, H becomes 1. This shows that by judiciously connecting linecards to form a ring, the propagation delay (and hence the average packet delay) can be minimized. It is not difficult to show that, for a given traffic matrix, finding the optimal linecard placement for minimizing H has the same complexity as the classic traveling
salesman problem [71]. Nevertheless, such a linecard placement problem can be
formulated as an ILP (Integer Linear Programming) problem.
Notations:
xi: the propagation delay experienced by packets of flow(0,i), where 0 ≤ xi ≤ N-1, for i ∈ [0, N-1]. In fact, xi indicates linecard i's position in the ring relative to linecard 0.
fi,j: binary variable, defined for j > i, i, j ∈ [0, N-1]. fi,j = 1 means xi > xj, and fi,j = 0 means xi < xj.
Objective:
minimize ∑i ∑j>i λi,j [(xj - xi) + N·fi,j] + ∑i ∑j>i λj,i [(xi - xj) + N·(1 - fi,j)] (8.2)
Subject to the following ring topology constraints:
x0 = 0 (8.3)
1 ≤ xi ≤ N-1 for i ∈ [1, N-1] (8.4)
xi - xj - N·fi,j ≥ 1 - N, for j > i, i, j ∈ [0, N-1] (8.5)
xj - xi + N·fi,j ≥ 1, for j > i, i, j ∈ [0, N-1] (8.6)
Notably, constraints (8.5) and (8.6) above ensure xi ≠ xj if i ≠ j.
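For small N, the placement problem can be cross-checked by brute force over all (N-1)! placements with linecard 0 fixed at position 0. The sketch below (names are ours) is a stand-in for, not an implementation of, the ILP (8.2)-(8.6):

```python
from itertools import permutations

N = 4

def prop_delay(pos, i, j):
    """Delay of flow(i, j) given pos[i] = linecard i's position on the ring."""
    return (pos[j] - pos[i]) % N

def best_placement(lam):
    """Exhaustively search the (N-1)! placements with linecard 0 fixed at
    position 0; feasible only for small N."""
    best = None
    for perm in permutations(range(1, N)):
        pos = {0: 0}
        pos.update({card: slot + 1 for slot, card in enumerate(perm)})
        h = sum(lam[i][j] * prop_delay(pos, i, j)
                for i in range(N) for j in range(N) if i != j)
        if best is None or h < best[0]:
            best = (h, pos)
    return best

lam = [[0] * N for _ in range(N)]
lam[0][3] = 1                      # single flow (0, 3), as in the example
h, pos = best_placement(lam)
print(h)                           # 1: the optimum places linecard 3
print(pos[3])                      # right after linecard 0, at position 1
```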
Note that the linecard placement pattern is changed only when there is a significant enough change in the traffic matrix. Even so, it is generally infeasible to reconnect linecards manually. To this end, we can implement an LBOS using an OXC (Optical Cross-Connect), as shown in Fig. 8.6. Note that all (N-1)! possible linecard placement patterns can be realized by an OXC, which supports N! configurations.
Further note that inexpensive OXC (with millisecond or more reconfiguration delay)
can be used if the reconfiguration takes place infrequently.
Fig. 8.6 Two possible linecard placement patterns using an OXC: (a) {0-1-2-3} and (b) {0-3-1-2}.
8.4.2 Supporting Multicast
The transmitter module in Fig. 8.2 consists of a fixed laser array. The lasers are turned on by direct current injection when a packet is to be sent, and data bits are then "written" onto a channel by an external modulator. The laser array facilitates multicasting: bits can be written simultaneously by the external modulator onto multiple wavelengths (whose lasers have been turned on for carrying a multicast packet). In this way, packet "replication" is performed in the optical domain (where bandwidth is less scarce). In other words, multicasting can be implemented without increasing the (expensive) bandwidth requirement of the electronic transmitters, as the electronic cost of sending a packet to multiple destinations is the same as sending it to a single destination. All the multicast scheduling algorithms in Chapter 5 can be implemented in multicast LBOS.
8.4.3 Implementing Fair Scheduler Optically
To implement the fair scheduler in Chapter 7 optically, an optical control
channel λ N is required for conveying reservation requests and grants (shown in Fig.
8.7), which is comparable to the control channel in an OBS network for making data
burst reservations. In other words, an extra transceiver on channel λ N is required at
each linecard for processing the control packets in electrical domain. Refer to Fig.
8.2, the λ N receiver is added in parallel with the λ i receiver in the receiver module,
and the λ N transmitter is added to the laser array at the transmitter module. Due to the
relative low data rate on the control channel, an inexpensive low-speed transceiver
can be used, e.g. using LEDs instead of laser diodes.
Assume the pipelined LBOS (in Fig. 8.4) is used and the traffic carried on the
ring network (in Fig. 8.1) is as shown in Fig. 8.7. We focus on the control channel λN
(where N = 4 in Fig. 8.7). In each packet duration, λN carries two vectors, an overload
vector {wi} and a reservation vector {qi}, where i = 0, 1, …, N-1. During a packet
duration, linecard k drops λN and then uploads the updated {wi} and {qi} on λN again.
Meanwhile, the operations of the fair scheduler (in Chapter 7) are carried out in the
electrical domain.
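The per-packet-duration control update at a linecard can be sketched as follows. This is an illustrative sketch with our own function and parameter names; the thesis only specifies that linecard k drops λN, updates {wi} and {qi} electrically, and re-uploads both vectors.

```python
# Hypothetical sketch of the control-channel update at linecard k in each
# packet duration (names and the exact update rule are our assumptions).
# The linecard "drops" λN, reads the overload vector {w_i} and reservation
# vector {q_i}, updates its own entries electrically, and re-uploads them.

def control_update(k, w, q, my_overload, my_request):
    """Update the control vectors at linecard k.

    w           -- overload vector {w_i}, i = 0..N-1
    q           -- reservation vector {q_i}, i = 0..N-1
    my_overload -- overload status this linecard reports
    my_request  -- reservation request this linecard posts
    """
    w = list(w)          # work on copies: the optical signal is terminated
    q = list(q)          # and regenerated, not modified in flight
    w[k] = my_overload   # report this linecard's overload status
    q[k] = my_request    # post this linecard's reservation request
    return w, q          # re-uploaded on the control channel

# Example with N = 4 (as in Fig. 8.7): linecard 2 updates its entries.
w, q = control_update(2, [0, 1, 0, 0], [3, 0, 0, 1], my_overload=1, my_request=2)
```

The fair-scheduler logic of Chapter 7 would then operate on the updated vectors in the electrical domain before the next packet duration.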
8.5 Performance Evaluations
In this section, we study the performance of our proposed LBOS under the
same three types of traffic patterns as in Chapter 2, i.e. uniform, uniform bursty and
hot-spot traffic. For comparison, Fasnet [63], which has a hardware complexity
similar to that of LBOS, is implemented. In simulating Fasnet, we adopt the best
parameters reported in [63], i.e. a fairness quota Y = 100 packets and a maximum
accumulated quota of U×Y = 500 packets. For both LBOS and Fasnet, we assume the
propagation delay between adjacent linecards is 100 ns (tp = 100 ns) and each linecard
introduces an FDL delay of 100 ns (td = 100 ns). The duration of a time slot is thus
200 ns, or two time units. We assume packets arrive at the beginning of each time
unit. For the non-pipelined LBOS (in Fig. 8.3), only one packet can be sent/received
in every two time units. With pipelined LBOS (in Fig. 8.4), one packet can be sent in
each time unit.
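As a sanity check, the timing figures above can be reproduced with a few lines of arithmetic (all values are taken from the text; the variable names are ours):

```python
# Sanity check of the ring timing figures used in the simulations.
t_p = 100          # propagation delay between adjacent linecards, ns
t_d = 100          # per-linecard FDL delay, ns
time_unit = 100    # one time unit, ns

slot = t_p + t_d                    # one time slot = 200 ns
units_per_slot = slot // time_unit  # = 2 time units per slot

# Non-pipelined LBOS sends at most one packet per two time units;
# pipelined LBOS sends one packet per time unit.
max_load_nonpipelined = 1 / units_per_slot   # 50% throughput cap
max_load_pipelined = 1 / 1                   # close-to-100% achievable

# Under uniform traffic, hop counts range from 1 to N-1 slots,
# averaging (1 + (N - 1)) / 2 = N/2 slots.
N = 32
avg_slots = (1 + (N - 1)) // 2           # 16 time slots
avg_units = avg_slots * units_per_slot   # 32 time units
```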
We also implement the iSLIP algorithm [14] (with a single iteration), which
serves as a benchmark for input-queued switches, and an output-queued switch, which
serves as a delay lower bound. In simulating them, zero propagation delay between
linecards is assumed (in their favor). It should be noted that both iSLIP and the output-
queued switch are generally not practical for optical implementation.
For simplicity, we only present simulation results for a switch of size N = 32
linecards below; similar conclusions and observations can be obtained for
other switch sizes.
8.5.1 Performance under Uniform Traffic
From Fig. 8.7, we can see that without pipelined sending and receiving,
LBOS can only attain up to 50% throughput. For pipelined LBOS, close-to-100%
throughput can be obtained. Note that the delay reported is the total delay a
packet experiences at the input port and en route. For LBOS, the average propagation
delay is 32 time units or 16 time slots (i.e., (1+N-1)/2 = N/2 slots under uniform traffic with
Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS.
Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS.
From Fig. 8.9, again we can see that pipelined LBOS consistently
outperforms Fasnet and delivers close-to-100% throughput.
8.5.4 Performance for Linecard Placement
We randomly generate 20 16×16 admissible traffic matrices. For each matrix,
the average propagation delay is calculated using (8.1), and the average over the 20
matrices is found to be H = 16.1 time units. With the optimized linecard placement
(obtained by solving the ILP in (8.2)-(8.6)), the average propagation delay drops to 14.1 time
units, a saving of 12.3% in propagation delay.
We then carry out simulations to obtain the average packet delay (i.e. taking the
input-port queuing delay into account) for each scenario. We found that without
optimized placement, the average delay is 25.9 time units, and with optimized
placement, it drops to 22.9 time units.
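The reported saving can be checked directly from the quoted figures; the ILP itself, given in (8.2)-(8.6), is not reproduced here, and the small difference from the quoted 12.3% comes from the H values being rounded to one decimal place:

```python
# Checking the linecard-placement numbers reported above (all values from
# the text; variable names are ours).
H_random = 16.1     # avg propagation delay, arbitrary placement (time units)
H_placed = 14.1     # avg propagation delay, optimized placement (time units)

saving = (H_random - H_placed) / H_random   # ~0.124, i.e. about 12%

# Average packet delay (propagation + input-port queueing), from simulation:
delay_random, delay_placed = 25.9, 22.9
queueing_random = delay_random - H_random   # queueing component, ~9.8 units
queueing_placed = delay_placed - H_placed   # queueing component, ~8.8 units
```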
8.6 Chapter Summary
In this chapter, we designed an optical implementation of the feedback-based
switch for use in a hybrid electro-optic router, called LBOS. It comprises N
linecards connected by an N-wavelength WDM fiber ring. Each linecard i is
configured to receive on channel λi. To send a packet, a linecard selects and transmits on an
idle channel according to where the packet goes. Packets are switched from one linecard
to another all-optically, so the extra O-E-O conversion found in state-of-the-art
routers is removed. We also showed that LBOS inherits all the nice features of a load-
balanced electronic switch.
Chapter 9
Conclusion
9.1 Our Contributions
In this dissertation, we dedicated our efforts to designing efficient and scalable
switch architectures for next generation high-speed routers. The two major design
objectives are no need for a centralized scheduler and amenability to optical
implementation.
In Chapter 2, we focused on removing the centralized scheduler by following
the approach of the load-balanced switch, due to its scalability and close to 100%
In a feedback-based switch, each middle-stage port needs to piggyback an N-bit
occupancy vector to its connected output in each time slot. In Chapter 4, we
concentrated on cutting down this communication overhead. The size of an
occupancy vector can be reduced by only reporting the status of selected middle-
stage VOQs. To identify the VOQs of interest, we partition the N VOQs into u non-
overlapping sets, each identified by a set number. In each time slot, every input
port piggybacks its set numbers of interest to the connected middle-stage port. This
guides a middle-stage port to report only the status of the VOQs of interest.
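A minimal sketch of this set-based reporting is given below; the function names and the equal-size partition of the N VOQs into u sets are our own illustrative assumptions, not the thesis's exact scheme:

```python
# Illustrative sketch of reduced occupancy feedback: the N middle-stage VOQs
# are split into u non-overlapping sets, and a middle-stage port reports only
# the VOQs in the sets its connected input port asked about.

def voq_set(j, N, u):
    """Set number of VOQ j when the N VOQs are split into u equal sets."""
    return j // (N // u)

def report(occupancy, requested_sets, N, u):
    """Return the status of only the VOQs whose set number was requested."""
    return {j: occupancy[j] for j in range(N)
            if voq_set(j, N, u) in requested_sets}

# Example: N = 8 VOQs in u = 4 sets of two; the input asks about sets {0, 3},
# so only VOQs 0, 1, 6 and 7 are reported back.
occ = [1, 0, 0, 2, 0, 0, 3, 1]
r = report(occ, {0, 3}, N=8, u=4)
```

The reported vector shrinks from N entries to N/u entries per requested set, which is the source of the communication saving.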
In Chapter 5, by slightly modifying the operation of the original feedback-
based two-stage switch, we showed that the feedback-based switch supports multicast
traffic efficiently. A notable feature of this multicast extension is that the switch
fabric remains unicast, while packet duplication is distributed to both the input and
middle-stage ports.
In a single-cabinet implementation, the propagation delay between linecards
and the switch fabric is negligible. In a multi-cabinet implementation, due to the non-
negligible propagation delay between linecards and the switch fabric, the requirement
that occupancy vectors must arrive at output/input ports within a single time slot would
significantly lower the efficiency of the feedback-based switch. To address this, we revamped
the original feedback mechanism in Chapter 6 for multi-cabinet implementation, and
a new batch scheduler was also devised.
As long as the incoming traffic is admissible, due to the close to 100%
throughput performance of our feedback-based switch, packets arrive at outputs with
bounded delays, so fairness in throughput is not an issue. Under inadmissible traffic
(i.e. when some output ports are over-subscribed), the feedback-based switch suffers from the
ring-fairness problem: “up-stream” input ports can starve some “down-stream”
input ports. To address this ring-fairness problem, an algorithm that allocates the
bandwidth of over-subscribed outputs based on the max-min fairness criterion was
proposed in Chapter 7.
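The max-min fairness criterion itself can be illustrated with the textbook water-filling procedure below. This is a generic, centralized sketch of the criterion, not the distributed algorithm proposed in Chapter 7:

```python
# Generic max-min fair allocation (water-filling): split an over-subscribed
# output's capacity among competing input demands so that no input can gain
# without reducing the allocation of an input with an equal or smaller share.

def max_min_fair(capacity, demands):
    alloc = [0.0] * len(demands)
    remaining = list(range(len(demands)))
    while remaining and capacity > 1e-12:
        share = capacity / len(remaining)
        # inputs whose residual demand fits within the equal share
        satisfied = [i for i in remaining if demands[i] - alloc[i] <= share]
        if not satisfied:
            for i in remaining:       # every demand exceeds the equal share:
                alloc[i] += share     # give each the same share and stop
            break
        for i in satisfied:           # fully satisfy the smallest demands,
            capacity -= demands[i] - alloc[i]
            alloc[i] = demands[i]     # then redistribute the leftover
            remaining.remove(i)
    return alloc

# Over-subscribed output: demands sum to 1.3 of a capacity of 1.0.
a = max_min_fair(1.0, [0.2, 0.5, 0.6])   # -> [0.2, 0.4, 0.4]
```

The small demand (0.2) is fully served, and the leftover 0.8 is split equally between the two larger demands, which is exactly the max-min outcome.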
In Chapter 8, we proposed an optical implementation of the feedback-based
switch, called Load-Balanced Optical Switch (LBOS). LBOS leverages an N -
wavelength WDM fiber ring to connect N linecards together. The ring network was
engineered such that the amount of time a packet should be buffered at a middle-
stage port exactly matches the propagation delay that this packet would experience
en route. We showed that with LBOS, all-optical packet transmission from an input
linecard to an output linecard is ensured.
9.2 Future Work
9.2.1 100% Throughput Proof without Speedup
In Chapter 2, we proved that under a speedup of two, the feedback-based switch
using any arbitrary work-conserving port scheduler is stable. Indeed, our simulation
results suggest that LQF without speedup is stable over a wide range of traffic patterns.
However, due to the lack of suitable theoretical tools, stability without speedup has yet
to be proved. In future work, we hope to come up with a 100% throughput proof
(without speedup) by appealing to other powerful mathematical models.
9.2.2 Building a Large Feedback-Based Two-Stage Switch
In a feedback-based switch, the average packet delay grows linearly with the
switch size N. Therefore, when N is large, the average delay suffers. To address this,
it is our hope that a large feedback-based switch can be constructed from a number of
small feedback switch modules. The delay would then grow
linearly with the module size instead of the whole switch size.
9.2.3 More Scalable Fairness Algorithm in LBOS
In Chapter 8, a dedicated control wavelength channel is required for
implementing the fair scheduler in LBOS. In this approach, an extra fixed receiver and
transmitter (on the control channel) are required at each linecard, which increases the
hardware complexity of LBOS. It is thus a worthwhile research direction to
implement the fair scheduler without increasing the hardware complexity.
9.2.4 Scalable Iterative Algorithm for Input-queued Switch
Besides load-balanced switches, we can also refine other switch architectures
for use in next generation high-speed routers. The input-queued switch with
iterative matching algorithms, as introduced in Chapter 1, is not scalable because
finding a maximal size matching incurs a communication overhead of up to N
iterations. A very interesting question is whether a maximal size matching can be
achieved in a single iteration that effectively functions as “N iterations”. To accomplish
this objective, the “weight” information (e.g. queue size) should be considered in this
single-iteration matching. Such an idea is very interesting and merits deep
deliberation.
scheduling for local area networks,” ACM Transactions on Computer Systems,
Vol. 11, pp. 319 – 352, 1993.
[14] N. McKeown, “Scheduling algorithms for input-queued cell switches,” PhD.
Thesis, University of California at Berkeley, 1995.
[15] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,”
IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp. 188 – 201, April 1999.
[16] Y. Li, S. Panwar and H. J. Chao, “On the performance of a dual round-robin
switch,” INFOCOM 1998, March 1998, San Francisco, USA.
[17] S. T. Chuang, A. Goel, N. McKeown and B. Prabhakar, “Matching output queueing with a combined input/output-queued switch,” IEEE Journal on
Selected Areas in Communications, Vol. 17, pp. 1030 – 1039, June 1999.
[18] K. Yoshigoe, “Threshold-based exhaustive round-robin for the CICQ switch
with virtual crosspoint queues,” ICC 2007 , June 2007, Glasgow, Scotland.
[19] R. Luijten, C. Minkenberg and M. Gusat, “Reducing memory size in buffered
crossbars with large internal flow control latency,” GLOBECOM 2003, Dec. 2003, San Francisco, USA.
[20] Y. Shen, S. S. Panwar and H. J. Chao, “Providing 100% throughput in a
buffered crossbar switch,” IEEE HPSR 2007 , May 2007, New York, USA.
[21] C. S. Chang, D. S. Lee and Y. S. Jou, “Load balanced Birkhoff-von Neumann
switches, part I: one-stage buffering,” Computer Communications, Vol. 25, pp.
611 – 622, 2002.
[22] C. S. Chang, D. S. Lee and C. M. Lien, “Load balanced Birkhoff-von
Neumann switches, part II: multi-stage buffering,” Computer
Communications, Vol. 25, pp. 623 – 634, 2002.
[23] Y. Shen, S. Jiang, S. S. Panwar and H. J. Chao, “Byte-focal: a practical load-
balanced switch,” IEEE HPSR 2005, May 2005, Hong Kong.
[24] X. L. Wang, Y. Cai, S. Xiao and W. B. Gong, “A three-stage load-balancing
switch,” INFOCOM 2008, April 2008, Phoenix, AZ, USA.
[25] I. Keslassy and N. McKeown, “Maintaining packet order in two-stage
switches,” INFOCOM 2002, June 2002, New York, USA.
[26] I. Keslassy, “The load-balanced router,” PhD. Thesis, Stanford University,
2004.
[27] I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N.
McKeown, “Scaling the internet routers using optics,” ACM SIGCOMM’03,
Aug. 2003, Karlsruhe, Germany.
[28] J. J. Jaramillo, F. Milan and R. Srikant, “Padded frames: a novel algorithm
for stable scheduling in load-balanced switches,” IEEE/ACM Transactions on
Networking, Vol. 16, No. 5, Oct. 2008.
[29] C. L. Yu, C. S. Chang and D. S. Lee, “CR switch: a load-balanced switch
with contention and reservation,” INFOCOM 2007 , May 2007, Anchorage,
Alaska, USA.
[30] C. S. Chang, D. S. Lee and Y. J. Shih, “Mailbox switch: a scalable two-stage
switch architecture for conflict resolution of ordered packets,” INFOCOM
2004, March 2004, Hong Kong.
[31] B. Lin and I. Keslassy, “The concurrent matching switch architecture,”
INFOCOM 2006 , April 2006, Barcelona, Spain.
[32] H. I. Lee, “A two-stage switch with load balancing scheme maintaining
packet sequence,” IEEE Communications Letters, Vol. 10, pp. 290-292, Apr. 2006.
[33] P. Gupta and N. McKeown, “Design and Implementation of a Fast Crossbar
Scheduler,” IEEE Micro, Vol. 19, Issue 1, pp. 20 - 28, Jan.-Feb. 1999.
[34] Y. S. Lin and C. B. Shung, “Quasi-pushout cell discarding,” IEEE
Communications Letters, Vol. 1, pp. 146-148, Sept. 1997
[35] B. Wu, K. L. Yeung, M. Hamdi and X. Li, “Minimizing internal speedup for
buffered crossbar switches,” IEEE HPSR 2006 , June 2006, Poznan, Poland.
[47] Z. Q. Dong and R. R. Cessa, “Packet switching and replication of multicast
traffic by crosspoint buffered packet switches,” IEEE HPSR 2007 , May 2007,
New York, USA.
[48] Z. Q. Dong and R. R. Cessa, “Input- and output-based shared-memory
crosspoint buffered packet switches for multicast traffic switching and
replication,” ICC 2008, May 2008, Beijing, China.
[49] P. Giaccone and E. Leonardi, “Asymptotic performance limits of switches
with buffered crossbars supporting multicast traffic,” IEEE Transactions on
Information theory, Vol. 54, No. 2, Feb. 2008.
[50] C. Minkenberg, R. Luijten, F. Abel, W. Denzel and M. Gusat, “Current issues
in packet switch design,” Proceedings of ACM SIGCOMM, pp. 119-124,
January 2003.
[51] A. Scicchitano, A. Bianco, P. Giaccone, E. Leonardi and E. Schiattarella,
“Distributed scheduling in input queued switches” ICC 2007 , June 2007,
Glasgow, Scotland.
[52] M. Hosaagrahara and H. Sethu, “Max-min fair scheduling in input-queued
switches” IEEE Transaction on Parallel and Distributed System, Vol. 19, NO.
4, April 2008.
[53] R. Yim, N. Devroye, V. Tarokh, and H. T. Kung, “Achieving fairness in
generalized processor sharing for network switches,” Proc. 22nd Biennial
Symp. Comm., pp. 185-187, 2004.
[54] X. Zhang, S. R. Mohanty and L. N. Bhuyan, “Adaptive max-min fair
scheduling in buffered crossbar switches without speedup,” INFOCOM 2007 ,
May 2007, Anchorage, Alaska , USA
[55] N. Kumar, R. Pan, and D. Shah, “Fair scheduling in input-queued switches
under inadmissible traffic,” GLOBECOM 2004, Vol. 3, No. 29, pp. 1713-1717,
Dec. 2004, Dallas, Texas, USA.
[56] N. Hua, P. Wang, D. P. Jin, L. G. Zeng, B. Liu and G. Feng, “Simple and fair
scheduling algorithm for combined input-crosspoint-queued switch,” ICC
2007 , June 2007, Glasgow, Scotland.
[57] J. R. Bennett and H. Zhang, “Hierarchical packet fair queueing algorithms,”
IEEE/ACM Transactions on Networking , vol. 5, no. 5, pp. 675–689, Oct.
1997.
[58] D. P. Bertsekas and R. Gallager , “Data networks,” Englewood Cliffs, NJ:
Prentice-Hall, 1992.
[59] A. Bianco, D. Cuda, J. Finochietto and F. Neri, “Multi-metaring protocol:
fairness in optical packet ring networks,” ICC 2007, June 2007, Glasgow, Scotland.
[60] H. Kogan and I. Keslassy, “Optimal-complexity optical router,” INFOCOM
2007, May 2007, Anchorage, Alaska, USA.
[61] M. Maier and M. Reisslein, “Trends in optical switching techniques: a short
survey,” IEEE Network , pp. 42 – 47, Nov./Dec. 2008.
[62] R. Ryf et al., “1296-port MEMS transparent optical crossconnect with 2.07 petabit/s switch capacity,” Optical Fiber Comm. Conf. and Exhibit (OFC) ’01,
Vol. 4, pp. PD28-P1-3, 2001.
[63] A. Bianco, E. Carta, D. Cuda, J. M. Finochietto and F. Neri,“A distributed
scheduling algorithm for an optical switching fabric,” ICC 2008, May 2008,
Beijing, China.
[64] A. Carena, V. D. Feo, J. Finochietto, R. Gaudino, F. Neri, C. Piglione and
P. Poggiolini, “RINGO: an experimental WDM optical packet network for
metro applications,” IEEE Journal on Selected Areas in Communications, Vol.
22, No. 8, pp. 1561-1571, Oct. 2004.
[65] A. Bianco, J. M. Finochietto, G. Giarratana, F. Neri and C. Piglione,
“Measurement-based reconfiguration in optical ring metro networks,”
Journal of Lightwave Technology, Vol. 23, No. 10, pp. 3156-3166, Oct. 2005
[66] A. Antonino, A. Bianco, A. Bianciotto, V. D. Feo, J. M. Finochietto, R.
Gaudino and F. Neri, “Wonder: a resilient WDM packet for metro
applications,” Optical Switching and Networking, Vol. 5, pp. 19-28, 2008.
[67] A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and M. Valcarenghi, “Wonder:
a PON over a folded bus,” GLOBECOM 2008, Nov. 2008, New Orleans, LA,
USA.
[68] A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and C. Piglione, “Multi-fasnet
protocol: short-term fairness control in WDM slotted MANs,” ICC 2006 ,
May 2006, Paris, France.
[69] X. Wang and K. L. Yeung. “Load balanced two-stage switches using arrayed
waveguide grating routers,” IEEE HPSR 2007 , June, 2007, New York, USA.
[70] J. C. Palais, “Fiber optic communications,” 5th ed., Upper Saddle River, NJ:
Pearson/Prentice Hall, 2005.
[71] A. Desai and S. Milner, “Autonomous reconfiguration in free-space optical
sensor networks,” IEEE Journal on Selected Areas in Communications
(JSAC), Vol. 23, No. 8, pp. 1556-1563, Aug. 2005
[72] T. Akin, “Hardening Cisco routers,” O’Reilly, Feb. 2002.
[73] A. Vukovic, “Network power density challenges,” ASHRAE Journal, Vol. 47,
Issue 4, pp. 55-59, Apr. 2005.
[74] M. Degermark, A. Brodnik, S. Carlsson and S. Pink, “Small forwarding
tables for fast routing lookups,” ACM SIGCOMM Computer Communication
Review, Vol. 27, Issue 4, pp. 3-14, Oct. 1997.
[75] W. Eatherton, G. Varghese and Z. Dittia, “Tree bitmap: hardware/software IP
lookups with incremental updates,” ACM SIGCOMM Computer
Communication Review, Vol. 34, Issue 2, pp. 97-122, April 2004.
[76] H. Song, J. Turner and J. Lockwood, “Shape shifting tries for faster IP
lookup,” IEEE ICNP 2005, pp. 358-367, 2005.
[77] V. Srinivasan and G. Varghese, “Faster IP lookups using controlled prefix
expansion,” ACM SIGMETRICS Performance Evaluation Review, Vol. 26,
Issue 1, pp. 1-10, June 1998.
[78] S. Nilsson and G. Karlsson, “IP-address lookup using LC-trie,” IEEE Journal
on Selected Areas in Communications, Vol. 17, pp. 1083-1092, June 1999.
[79] L. C. Wuu, K. M. Chen and T. J. Liu, “A longest prefix first search tree for IP
lookup,” IEEE ICC 2005, May 2005, Seoul, Korea.
[80] P. R. Warkhede, S. Suri and G. Varghese, “Multi-way range trees: scalable IP
lookup with fast updates,” Computer Networks, Vol. 44, No. 3, pp. 289-303,
2002.
[81] H. Lu and S. Sahni, “A B-tree dynamic router-table design,” IEEE
Transactions on Computers, Vol. 54, pp. 813-823, 2005.
[82] H. Lu and S. Sahni, “O(log W ) multidimensional packet classification,”
IEEE/ACM Transactions on Networking , Vol. 15, Issue 2, pp. 462-472, April
2007
[83] P. C. Wang, C. L. Lee, C. T. Chan and H. Y. Chang, “Performance
improvement of two-dimensional packet classification by filter rephrasing,”
IEEE/ACM Transactions on Networking, Vol. 15, Issue 4, pp. 906-917, Aug.
2007.
[84] M. Waldvogel, G. Varghese, J. Turner and B. A. Plattner, “Scalable high speed
IP routing lookups,” ACM SIGCOMM 1997, pp. 25-36, Sept. 1997, Cannes,
France.
[85] Q. Sun, X. H. Huang, X. J. Zhou and Y. Ma, “A dynamic binary hash scheme
for IPv6 lookup,” GLOBECOM 2008, Nov. 2008, New Orleans, LA, USA.
[86] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, “Longest prefix matching
using Bloom filters,” ACM SIGCOMM 2003, pp. 201-212, 2003.
[87] R. Sangireddy, N. Futamura, S. Aluru and A. K. Somani, “Scalable, memory
efficient, high-speed IP lookup algorithms,” IEEE/ACM Transactions on
Networking, Vol. 13, Issue 4, pp. 802-812, Aug. 2005.
[88] H. Y. Song, F. Hao, M. Kodialam and T. V. Lakshman, “IPv6 lookups using
distributed and load balanced bloom filters for 100Gbps core router line
cards,” INFOCOM 2009, April 2009, Rio de Janeiro, Brazil
[89] H. Y. Song and J. Turner, “Fast filter updates for packet classification using
TCAM,” GLOBECOM 2006 , Nov. 2006, San Francisco, USA
[90] R. Panigrahy and S. Sharma, “Reducing TCAM power consumption and
increasing throughput,” 10th IEEE Symposium on High Performance
Interconnects (HotI’02), pp. 107-112, 2002.
[91] F. Zane, G. Narlikar, and A. Basu, “CoolCAMs: power-efficient TCAMs for
forwarding engines,” INFOCOM 2003, April 2003, San Francisco, USA
[92] K. Zheng, C. C. Hu, H. B. Liu and Bin Liu, “An ultra high throughput and
power efficient TCAM-based IP lookup engine,” INFOCOM 2004, May 2004,
Hong Kong
[93] M. J. Akhbarizadeh, M. Nourani, R. Panigrahy and S. Sharma, “High-speed
and low-power network search engine using adaptive block-selection
scheme,” Proceedings of the 13th Symposium on High Performance
Interconnects, pp.73–78, 2005
[94] H. Yu, J. Chen, J. Wang, S. Q. Zheng and M. Nourani, “An improved TCAM-
based IP lookup engine,” IEEE HPSR 2008, May 2008, Shanghai, China
[95] H. Yu, J. Chen, J. P. Wang and S. Q. Zheng, “High-performance TCAM-
based IP lookup engines,” INFOCOM 2008, April 2008, Phoenix, AZ, USA
[96] A. Enteshari and M. Kavehrad, “40-100Gbps transmission over copper,”
DesignCon 2009, Feb. 2009, Santa Clara, CA. USA.
[97] M. Kavehrad, and J. F. Doherty, “10Gbps transmission over standard
category-5 copper cable,” GLOBECOM 2003, Dec. 2003, San Francisco, CA.
USA.
[98] G. Chartrand, “Introductory graph theory,” New York: Dover, p. 116, 1985.
[99] D. Gale and L. S. Shapley, “College admissions and the stability of
marriage,” Amer. Math. Monthly, vol. 69, pp.9–15, 1962.
[100] G. Kornaros, “BCB: a buffered crossbar switch fabric utilizing shared
memory,” Proc. Ninth EUROMICRO Conf. Digital System Design (DSD ’06),
pp. 180-188, Aug. 2006.
[101] H. Arimoto, T. Kitatani, T. Tsuchiya, K. Shinoda, T. Ohtoshi, M. Aoki and S.
Tsuji, “N-type doping to an active-short cavity DBR laser to expand its
continuous tuning range,” IEEE Photonics Letters, Vol. 20, No. 16, Aug. 15,
2008.
[102] J. E. Simsarian, M. C. Larson, H. E. Garrett, H. Xu and T. A. Strand, “Less
than 5-ns wavelength switching with an SG-DBR laser,” IEEE Photonics
Letters, Vol. 16, No. 4, Feb. 15, 2006.
[103] F. O. Ilday, J. Buckley, L. Kuznetsova and F. W. Wise, “Generation of 36-
femtosecond pulses from a ytterbium fiber laser,” Conference on Lasers and
Electro-Optics 2004 (CLEO), Vol. 2, pp. 3, May 2004.
[104] A. V. Konyashchenko, L. L. Losev and S. Y. Tenyakov, “Raman frequency
shifter for laser pulses shorter than 100 fs,” Optics Express, Vol. 15, No.
19, pp. 11855-11859, Sep. 2007.
[105] F. M. Chiussi, J. G. Kneuer and V. P. Kumar, “Low-cost scalable switching
solutions for broadband networking: the ATLANTA architecture and chipset,”
IEEE Communications Magazine, pp. 44-53, Dec. 1997.
[106] A. E. Tan, “IEEE 1588 precision time protocol time synchronization
performance,” National Semiconductor Application Note 1728, Oct. 2007.
[107] R. Palaniappan, Y. Wang, T. Clarke and B. Goldiez, “Simulation of an ultra-
wide band enhanced time difference of arrival System,” Parallel and
Distributed Computing and Systems, pp.306-309, Nov. 2007.