8/6/2019 FEEDBACK–BASED TWO-STAGE SWITCH
http://slidepdf.com/reader/full/feedbackbased-two-stage-switch 1/177
Title: Feedback-based two-stage switch architecture for high-speed router design
Author(s): Hu, Bing
Citation:
Issue Date: 2010
URL: http://hdl.handle.net/10722/56798
Rights: unrestricted
FEEDBACK–BASED TWO-STAGE SWITCH
ARCHITECTURE FOR HIGH SPEED ROUTER
DESIGN
BY
HU BING
PH.D. THESIS
DECEMBER 2009
Abstract of thesis entitled
Feedback–Based Two-Stage Switch Architecture for
High Speed Router Design
submitted by
Hu Bing
for the degree of Doctor of Philosophy
at The University of Hong Kong
in December 2009
Due to the widespread use of WDM technology in optical fiber, transmission
capacity has increased sharply, while the processing capacity of current commercial
routers has grown only slowly. This speed mismatch between fiber and router creates
a pressing need for building next generation high-speed routers. A major bottleneck in
high-speed router design is the switch architecture, which determines how packets are
moved from one linecard to another. In this thesis, we focus on designing efficient
and scalable switch architectures to enable the next generation high-speed routers.
A load-balanced two-stage switch configures its two switch fabrics according
to a pre-determined and periodic sequence of switch configurations. It is attractive
because no centralized scheduler is required and close to 100% throughput can be
obtained. But it also faces two major challenges: packet mis-sequencing and poor
delay performance. In this thesis, we propose a feedback-based two-stage switch
architecture to simultaneously address these two challenges. Notably, we require
only a single-packet buffer for each middle-stage port VOQ. This greatly cuts
down the average packet delay. At the same time, in-order packet delivery and high
throughput are ensured by properly selecting and coordinating the two sequences of
switch configurations. As compared with the existing load-balanced switch
architectures and scheduling algorithms, our feedback-based switch imposes a
modest requirement on switch hardware, yet consistently yields the best delay-
throughput performance.
To further enhance the performance of the feedback-based switch, original
extensions and refinements are made. Specifically, a three-stage switch architecture
is proposed for further cutting down the average packet delay. A feedback
suppression scheme is designed for reducing the communication overhead. A
multicast scheduling algorithm is invented for carrying multicast traffic using the
same unicast switch fabric. A batch scheduler is devised for multi-cabinet
implementation of the feedback-based switch. To address the fairness issue in
handling inadmissible traffic patterns, a fair scheduler is designed for allocating the
bandwidth of over-subscribed outputs based on the max-min fairness criterion. Last
but not least, an optical implementation of the feedback-based two-stage switch is
proposed.
Feedback–Based Two-Stage Switch Architecture for
High Speed Router Design
by
Hu Bing
B.Eng., M.Phil., UESTC
A thesis submitted in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy
at The University of Hong Kong
December 2009
Declaration
I declare that this thesis represents my own work, except where due
acknowledgement is made, and that it has not been previously included in a thesis,
dissertation or report submitted to this University or to any other institution for a
degree, diploma or other qualification.
Signed _________________________________
Hu Bing
Acknowledgments
First, I would like to express my deep gratitude to my research supervisor,
Doctor Kwan L. Yeung, for his guidance and encouragement throughout my graduate
study. Doctor Yeung's unreserved support covers every detail of my research work,
from teaching me research methodologies to taking pains to polish papers. His
instructions and infinite patience were essential for completing this thesis. I feel
privileged to have had this opportunity to study under his supervision.
I thank the Electrical and Electronic Engineering Department at the
University of Hong Kong, for creating such a great education and research
environment. I thank all staff members in the department for their kind help and
warm assistance. I also thank the University of Hong Kong for its financial support,
which enabled me to complete my Ph.D. study. My thanks also go to my lab-mates and
friends whose encouragement and help are essential.
Finally, I have been incredibly fortunate to have the endless support of my dear
parents, both materially and spiritually.
Table of Contents
Declaration .......................................................................................................... i
Acknowledgments ............................................................................................... ii
Table of Contents ................................................................................................ iii
List of Figures ..................................................................................................... viii
List of Symbols ................................................................................................... xi
List of Abbreviations .......................................................................................... xiv
Chapter 1 Introduction
1.1 Overview of Routers .......................................................................... 1
1.2 Switch Architectures .......................................................................... 7
1.2.1 Output-queued Switches ........................................................ 8
1.2.2 Input-queued Switches ........................................................... 8
1.2.3 CIOQ and Buffered Crossbar Switches ................................. 10
1.2.4 Load-Balanced Two-Stage Switches ..................................... 12
1.3 Contributions ..................................................................................... 13
1.4 Thesis Overview ................................................................................ 16
Chapter 2 Feedback-Based Two-Stage Switch Design
2.1 Introduction ....................................................................................... 18
2.2 Related Work ..................................................................................... 22
2.2.1 Using Re-sequencing Buffers ................................................ 22
2.2.2 Preventing Packets from Becoming Mis-sequenced ............. 23
2.3 Feedback-Based Two-Stage Switch .................................................. 26
2.3.1 Some Observations and Motivations ..................................... 26
2.3.2 Designing Scalable Feedback Mechanism ............................. 28
2.3.3 Solving Packet Mis-sequencing Problem .............................. 31
2.3.4 Feedback-Based Scheduling Algorithms ............................... 34
2.4 Performance Evaluations ................................................................... 36
2.4.1 Performance under Uniform Traffic ...................................... 37
2.4.2 Performance under Uniform Bursty Traffic ........................... 38
2.4.3 Performance under Hotspot Traffic ....................................... 39
2.5 The Stability of Feedback-Based Two-Stage Switch ........................ 41
2.5.1 The Existing Approaches ......................................................... 41
2.5.2 Fluid Model for Feedback-Based Two-Stage Switch ............. 42
2.5.3 100% Throughput Proof .......................................................... 45
2.6 Chapter Summary .............................................................................. 49
Chapter 3 Cutting Down Average Packet Delay
3.1 Introduction ....................................................................................... 50
3.2 Optimal Joint Sequence Design .......................................................... 52
3.2.1 In-order Packet Delivery Only ............................................... 53
3.2.2 Both In-order Packet Delivery and Staggered Symmetry ...... 59
3.2.3 Finding the Number of Different Joint Sequences ................. 61
3.2.4 Discussions ............................................................................. 63
3.3 Three-Stage Switch ............................................................................ 64
3.3.1 Three-Stage Switch Architecture ........................................... 64
3.3.2 Traffic Matrix Estimation ....................................................... 69
3.3.3 Performance Evaluations ....................................................... 70
3.4 Chapter Summary .............................................................................. 73
Chapter 4 Cutting Down Communication Overhead
4.1 Introduction ....................................................................................... 74
4.2 Feedback Suppression Algorithms .................................................... 75
4.2.1 Set-based Feedback (Set-feedback) ....................................... 77
4.2.2 Queue-based Feedback Version 1 (Q-feedback-1) ................ 78
4.2.3 Queue-based Feedback Version 2 (Q-feedback-2) ................ 79
4.3 Performance Evaluations ................................................................... 80
4.3.1 Performance under Uniform Traffic ...................................... 81
4.3.2 Performance under Uniform Bursty Traffic ........................... 82
4.3.3 Performance under Hotspot Traffic ....................................... 82
4.3.4 Performance under Different Switch Size N .......................... 83
4.4 Chapter Summary .............................................................................. 84
Chapter 5 Supporting Multicast Traffic
5.1 Introduction ....................................................................................... 85
5.2 Related Work ..................................................................................... 87
5.2.1 Multicast Switches Based on Bufferless Switch Fabrics ....... 87
5.2.2 Buffered Crossbar Based Multicast Switches ........................ 89
5.3 Multicast Scheduling in Feedback-Based Two-Stage Switch ........... 90
5.3.1 Multicast Scheduling .............................................................. 90
5.3.2 Discussions ............................................................................. 92
5.4 Performance Evaluations ................................................................... 93
5.4.1 Performance under Uniform Mixing Traffic ......................... 94
5.4.2 Performance under Uniform Bursty Mixing Traffic .............. 96
5.4.3 Performance under Binomial Mixing Traffic ........................ 97
5.5 Chapter Summary .............................................................................. 99
Chapter 6 Multi-cabinet Implementation
6.1 Introduction ....................................................................................... 100
6.2 Related Work ..................................................................................... 102
6.2.1 Multi-cabinet Implementation of Input-queued Switch ......... 102
6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch . 103
6.3 Multi-cabinet Implementation of Feedback-Based Switch ............... 103
6.3.1 Revamped Feedback Mechanism ........................................... 103
6.3.2 Batch Scheduler Design ......................................................... 106
6.3.3 Some Properties ..................................................................... 107
6.4 Performance Evaluations ................................................................... 109
6.4.1 Performance under Uniform Traffic ...................................... 110
6.4.2 Performance under Uniform Bursty Traffic ........................... 111
6.4.3 Performance under Hotspot Traffic ....................................... 112
6.5 Chapter Summary .............................................................................. 112
Chapter 7 Scheduling Inadmissible Traffic Patterns
7.1 Introduction ....................................................................................... 113
7.2 Related Work ..................................................................................... 115
7.2.1 Fair Scheduling under Admissible Traffic .............................. 115
7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only .... 115
7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports 116
7.3 Our Approach .................................................................................... 117
7.4 Max-min Fairness Criterion ............................................................... 120
7.5 Performance Evaluations ................................................................... 122
7.5.1 Under Server-client Traffic Model ........................................ 122
7.5.2 Attack-traffic Scenario ........................................................... 124
7.6 Chapter Summary .............................................................................. 125
Chapter 8 An Optical Implementation of Feedback-Based Switch
8.1 Introduction ....................................................................................... 126
8.2 Related Work ..................................................................................... 127
8.3 Load Balanced Optical Switch (LBOS) ............................................ 128
8.3.1 Switch Architecture ................................................................. 128
8.3.2 Switch Operation ..................................................................... 130
8.3.3 Equivalence to Load Balanced Electronic Switches .............. 133
8.4 Extensions and Refinements of LBOS .............................................. 134
8.4.1 Cutting down the Average Delay by Reconfiguration ........... 134
8.4.2 Supporting Multicast ............................................................... 136
8.4.3 Implementing Fair Scheduler Optically ................................. 137
8.5 Performance Evaluations ................................................................... 137
8.5.1 Performance under Uniform Traffic ........................................ 138
8.5.2 Performance under Uniform Bursty Traffic ............................ 139
8.5.3 Performance under Hotspot Traffic ....................................... 139
8.5.4 Performance for Linecard Placement ..................................... 141
8.6 Chapter Summary .............................................................................. 141
Chapter 9 Conclusion
9.1 Our Contributions ................................................................................ 142
9.2 Future Work ........................................................................................ 145
9.2.1 100% Throughput Proof without Speedup ............................. 145
9.2.2 Building a Large Feedback-Based Two-Stage Switch ............ 146
9.2.3 More Scalable Fairness Algorithm in LBOS ......................... 146
9.2.4 Scalable Iterative Algorithm for Input-queued Switch ............ 146
References ........................................................................................................... 147
Publications ......................................................................................................... 157
List of Figures
Fig. 1.1: A generic router ................................................................................ 2
Fig. 1.2: A router works in two different planes ............................................. 2
Fig. 1.3: The first-generation router architecture .......................................... 3
Fig. 1.4: The second-generation router architecture ...................................... 4
Fig. 1.5: The third-generation router architecture ......................................... 5
Fig. 1.6: The fourth-generation router architecture ....................................... 6
Fig. 1.7: An input-queued switch with Virtual Output Queues (VOQs) ........ 9
Fig. 1.8: A buffered crossbar switch ............................................................... 11
Fig. 2.1 A load-balanced two-stage switch architecture ................................ 19
Fig. 2.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 21
Fig. 2.3 Feedback operation in joint sequences with staggered symmetry ... 30
Fig. 2.4 Delay vs input load p, with uniform traffic ........................................ 38
Fig. 2.5 Delay vs input load p, with uniform bursty traffic ........................... 39
Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes 40
Fig. 2.7 Delay vs input load p, with hot-spot traffic ...................................... 40
Fig. 3.1 The feedback-based two-stage switch architecture ............................ 51
Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 52
Fig. 3.3 The relation between staggered symmetry and in-order delivery .... 53
Fig. 3.4 The generic joint configuration at time slot t ................................... 56
Fig. 3.5 Generic joint sequence with anchor output and ordered properties . 57
Fig. 3.6 Joint sequence with staggered symmetry and in-order delivery ...... 60
Fig. 3.7 A three-stage switch architecture ...................................................... 65
Fig. 3.8 An example of using three-stage switch .......................................... 66
Fig. 3.9 Traffic matrix and delay matrix ......................................................... 66
Fig. 3.10 An example of identifying the minimum independent set ................ 67
Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b) ..... 69
Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch ...... 71
Fig. 3.13 Delay vs number of sample intervals T , with 3-stage switch ............ 72
Fig. 4.1 Timing diagram of feedback switch with feedback suppression ...... 76
Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback .. 81
Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback ...... 82
Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback ... 83
Fig. 4.5 Throughput vs switch size N , with partial feedback ......................... 84
Fig. 5.1 Delay vs output load λ , with uniform mixing traffic ........................ 94
Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7 ................. 95
Fig. 5.3 Delay vs output load λ , with bursty mixing traffic ........................... 97
Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7 ................... 98
Fig. 5.5 Delay vs output load λ , with binomial mixing traffic ...................... 99
Fig. 6.1 The timing diagram of switch with large propagation delay .............. 101
Fig. 6.2 Multi-cabinet implementation of the feedback-based switch ........... 103
Fig. 6.3 Feedback operation in multi-cabinet implementation ...................... 104
Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet ......... 110
Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet ............. 111
Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet .......... 112
Fig. 7.1 A 4×4 feedback-based switch with output port 3 oversubscribed by inputs
0, 1, 2 and 3. ...................................................................................... 114
Fig. 7.2 Output 0’s throughput vs its output load λ, under server-client traffic 123
Fig. 7.3 Output 0’s throughput vs its output load λ , under attack traffic ....... 124
Fig. 8.1 A 4×4 load balanced optical switch .................................................. 129
Fig. 8.2 The internal structure of linecard i ................................................... 129
Fig. 8.3 Time diagram for load balanced optical switch ............................... 131
Fig. 8.4 Timing diagram for pipelined packet sending and receiving ............ 132
Fig. 8.5 A joint sequence in load-balanced switch .......................................... 133
Fig. 8.6 Two possible linecard placement patterns using OXC ....................... 136
Fig. 8.7 Delay vs input load, under uniform traffic in LBOS ......................... 139
Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS ............... 140
Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS ......................... 140
List of Symbols
N Switch size
VOQ1(i,k ) The VOQ at input port i with packets destined for output k
VOQ2( j,k ) The VOQ at middle-stage port j with packets destined for output k
flow(i,k ) Packets arriving at input i and destined for output k
K Anchor output port for an input port i
p Input load for an input port
s p Burst size in uniform bursty traffic
S j The set of VOQ2( j,k ) (for k =0,1,…, N -1) with 0-occupancy
d The middle-stage port delay experienced in a feedback switch
{r i, j} The N × N request matrix, where r i, j denotes the number of requests from flow(i, j)
Z ij(n) The number of packets in VOQ1(i, j) at the beginning of time slot n
Aij(n) The cumulative number of arrivals for VOQ1(i, j) at the beginning
of time slot n
Dij(n) The cumulative number of departures for VOQ1(i, j) at the
beginning of time slot n
Bij(n) The number of packets in VOQ2(i, j) at the beginning of time slot n
X ij(n) The cumulative number of arrivals for VOQ2(i, j) at the beginning
of time slot n
Y ij(n) The cumulative number of departures for VOQ2(i, j) at the
beginning of time slot n
λ ij The mean packet arrival rate to VOQ1(i, j)
ω A sample in random event
Aij(t ,ω) The cumulative number of arrivals to VOQ1(i, j) for a fixed ω at
time t
Z ij(t ,ω) The number of packets in VOQ1(i, j) for a fixed ω at time t
Dij(t ,ω) The cumulative number of departures from VOQ1(i, j) for a fixed ω
at time t
X ij(t ,ω) The cumulative number of arrivals to VOQ2(i, j) for a fixed ω at
time t
Bij(t ,ω) The number of packets in VOQ2(i, j) for a fixed ω at time t
Y ij(t ,ω) The cumulative number of departures from VOQ2(i, j) for a fixed ω
at time t
C ij(t ) The joint queue occupancy of all packets arrived at input port i plus
all packets destined for output j
{r n} Any sequence {r n} with r n → ∞ as n → ∞
S The times of speedup
f (t ) A non-negative, absolutely continuous function defined on R+∪{0}
q If VOQ2( j,k ) is not empty, the packet in VOQ2( j,k ) will be
transmitted to output port k with fixed delay q
M The number of reduced Latin squares
{d ij} The delay matrix, where d ij is the traffic-weighted average middle-
stage packet delay of all the N flows destined to output port i-1
Qi,j The packet counter Qi,j is associated with each of the VOQ1(i,j)
T The sampling interval
u The number of non-overlapped sets per port
g The number of VOQs per non-overlapped set
Gm The non-overlapped set of VOQs
F Denotes VOQ1(i,F ) is the longest queue at input i at time t
b The number of bits sent in the second stage of Q-feedback-1
z Denotes VOQ2( j,z ) is an empty VOQ at middle-stage port j
C The number of sets Gm sent when cutting down the feedback bits
m The number of multicast VOQs at each input port
E y The vector reports the occupancy status from VOQ2( j, yN /m) to
VOQ2( j, yN /m+ N /m-1)
T c The overall average delay experienced by all copies of all multicast
packets
T p The average delay experienced by the last-copy of all multicast
packets
T c(k ) The average delay for multicast packets with fan-out k
T p(k ) The last-copy delay for multicast packets with fan-out k
λ The switch output load
P k The probability of generating a fan-out set with size k in binomial
mixing traffic
h The mean fan-out size in binomial mixing traffic
List of Abbreviations
ACK Acknowledgement
AMFS Adaptive Max-min Fair Scheduling
AWGR Arrayed Waveguide Grating Router
bps bits per second
CIOQ Combined Input Output Queuing
CMS Concurrent Matching Switch
CP Cross Point
CPU Central Processing Unit
CR Contention and Reservation
DRRM Dual Round Robin Matching
EDF Earliest Departure First
EDFA Erbium-Doped Fiber Amplifier
FDL Fiber Delay Line
FIFO First In First Out
F-MWM Fair Maximum Weight Matching
FOFF Full Ordered Frames First
GPS-SW Generalized Processor Sharing in network Switch
HOL Head Of Line
i.i.d Independent and Identically Distributed
ILP Integer Linear Programming
I-SMCB Input-based Shared Memory Crosspoint Buffer
LBOS Load-Balanced Optical Switch
LQF Longest Queue First
MEMS Micro Electro Mechanical Systems
MSM Maximal Size Matching
MWM Maximum Weight Matching
MURS Multicast and Unicast Round robin Scheduling
O-E-O Optical-Electrical-Optical
O-SMCB Output-based Shared Memory Crosspoint Buffer
OXC Optical Cross-Connect
PF Padded Frame
PIM Parallel Iterative Matching
RR Round Robin
RTT Round Trip Time
SRR Synchronous Round Robin
TCAM Ternary Content Addressable Memory
TDMA Time Division Multiple Access
TFQA Tracking Fair Quota Allocation
UFS Uniform Frame Spreading
VOD Video On Demand
VOQ Virtual Output Queue
WDM Wavelength Division Multiplexing
WF2Q+ Worst-case Fair Weighted Fair Queueing +
w.r.t With Regard To
Chapter 1
Introduction
1.1 Overview of Routers
The Internet is a network of networks. The basic unit of data exchange on the
Internet is an IP packet. Routers play a crucial role in the Internet by connecting
different networks together and forwarding each IP packet to its correct destination.
An N × N generic router is shown in Fig. 1.1. It consists of a routing processor, a
switch fabric and N linecards. The routing processor executes the routing protocols,
maintains the routing information and forwarding tables, and performs network
management functions within the router. A linecard is a subsystem that receives
datagrams externally on ingress, or internally from the switch fabric on egress. Each
linecard is (logically) divided into input port (for processing ingress traffic) and
output port (for processing egress traffic). A switch fabric allows inputs to be
connected with outputs for packet forwarding.
Fig. 1.1 A generic router
Fig. 1.2 A router works in two different planes
A router operates in two different planes [1,2]: control and forwarding (Fig.
1.2). The control plane constructs a routing table using the routing protocol, where
the router learns which linecard is the most appropriate for forwarding specific
packets to specific destinations. Forwarding, the predominant plane in a router, is
responsible for the actual process of switching a packet received from one linecard to
another. Forwarding involves packet-by-packet processing and is generally more
time-critical than the operations at the control plane.
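To make the forwarding-plane operation concrete, the per-packet lookup can be sketched as a longest-prefix match against the table built by the control plane. This is only an illustrative sketch: the prefixes and linecard indices below are hypothetical, and real routers perform this lookup in specialized hardware (e.g. TCAMs) rather than software.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> egress linecard index.
# The control plane (e.g. via routing protocols) builds this table;
# the forwarding plane consults it for every packet.
TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): 1,
    ipaddress.ip_network("10.1.0.0/16"): 2,
    ipaddress.ip_network("0.0.0.0/0"): 0,   # default route
}

def lookup(dst: str) -> int:
    """Longest-prefix match: the most specific matching prefix wins."""
    addr = ipaddress.ip_address(dst)
    best = max((p for p in TABLE if addr in p),
               key=lambda p: p.prefixlen)
    return TABLE[best]

print(lookup("10.1.2.3"))   # most specific match is 10.1.0.0/16 -> linecard 2
print(lookup("192.0.2.1"))  # only the default route matches -> linecard 0
```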
Fig. 1.3 The first-generation router architecture
From Fig. 1.1, we can see that the switch fabric is at the very heart of a router.
In fact, the evolution of routers is accompanied by the evolution of switch fabrics.
Historically, routers have been realized with packet-switching software executing on
a general-purpose CPU. Those first generation routers appeared before the early
1990s, consisting of a CPU, a centralized memory and several linecards (Fig. 1.3).
Linecards are connected to the CPU and centralized memory via a shared bus [4]
(instead of a dedicated switch fabric). The CPU is responsible for all operations at
control and forwarding planes. When a packet arrives at an input linecard, it will
cross the shared bus to arrive at the centralized memory. When the output linecard is
identified by the CPU, the packet will be read out from the memory and forwarded to
the output linecard via the shared bus again. As each packet needs to traverse the
shared bus twice, the bus bandwidth limits the router performance. Besides, the use
of a single CPU also undermines the router performance. An example of the first
generation routers is Huawei Quidway AR18 series routers [3].
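Since every packet must cross the shared bus twice (input linecard to memory, then memory to output linecard), the aggregate switching capacity is at most half the raw bus bandwidth. A minimal sketch of this bound (the 2 Gbps bus figure is illustrative, not from the text):

```python
def effective_capacity_gbps(bus_bw_gbps: float, traversals: int = 2) -> float:
    """Each packet consumes bus bandwidth once per traversal, so the
    aggregate switching capacity is the bus bandwidth divided by the
    number of traversals each packet makes."""
    return bus_bw_gbps / traversals

# Illustrative: a 2 Gbps shared bus can switch at most 1 Gbps of traffic
# when every packet must cross it twice.
print(effective_capacity_gbps(2.0))
```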
Fig. 1.4 The second-generation router architecture
In a second generation router, shown in Fig. 1.4, a route cache, a satellite
processor and memory are allocated to each linecard. The operations at the
forwarding plane are segregated from the central CPU and carried out by distributed
linecards. If routing information can be found in the local linecard route cache, a
packet will traverse the shared bus once, by going to the destination linecard directly.
Otherwise, the packet will be sent to the centralized memory for processing by the
central CPU, as the case of first generation routers. A major limitation of the second
generation router is the shared bus, which can support at most one packet traversal at
a time. Cisco 7500 series routers [6] belong to the second generation of routers.
To alleviate the bottleneck of using a single shared bus, the third generation
router introduces an interconnection network as the switch fabric (Fig. 1.5). This
enables multiple packets to traverse the switch fabric in parallel and without
contention. This architecture improves the routers’ switching capacity from the
second generation’s 2 Gbps to about 1 Tbps. An implicit requirement for
implementing the architecture in Fig. 1.5 is that all linecards and the switch fabric
must be housed in the same standard-sized switch cabinet. A typical cabinet [7] has a
size of 2.1 m × 0.6 m × 1.0 m, and is supplied with no more than 14 kW of power.
Accordingly, each cabinet can house only up to 16 linecards. An example of this
generation of routers is the Cisco 12000 series routers [7].
Fig. 1.5 The third-generation router architecture
To accommodate Internet traffic in the range of 10 Tbps, a large number of
linecards and a huge power budget are necessary. A report [73] shows that a
router consumes 0.01 kW of power per 1 Gbps and that a linecard supports 40 Gbps.
Handling 10 Tbps of data would thus require 100 kW of power and 125 linecards, which
cannot be supported by the third generation router architecture shown in Fig. 1.5.
The fourth generation routers remove the limitations on space and power by
distributing the linecards over different cabinets, as shown in Fig. 1.6, so that the
burdens of space and power are parceled out. Optical fibers connect all cabinets to
the central electronic switch fabric. (Note that it is difficult to run copper wires at
high speed
due to insertion loss, near-end crosstalk, electromagnetic emissions, echo and
propagation skew [96-97].) As the centralized switch fabric works in the electrical
domain, packets arriving on fiber must be converted to electrical signals for switching,
and vice versa when they depart the switch fabric. This extra O-E-O conversion and
the need for a centralized scheduler (for configuring the switch fabric on a per-slot
basis) prevent the fourth generation router from reaching even higher speeds. The
Cisco CRS-1 [8] is an example of the fourth generation router. Notably, it can push
the switching capacity to 90 Tbps, with 1152 linecards each running at 40 Gbps.
Fig. 1.6 The fourth-generation router architecture
Nowadays, commercial dense WDM systems [9] can support up to 160
parallel wavelengths in a single fiber, with a transmission rate of up to 80 Gbps on
each wavelength. Consequently, the fourth generation routers can only process
packets coming from 4 fibers. Besides, due to the speed mismatch between the linecard
processing rate (e.g. 40 Gbps in the Cisco CRS-1) and the fiber, a linecard cannot be
directly connected to a dense WDM fiber. Therefore, there is a pressing need
for building high-speed routers that can fully exploit the capacity of a fiber.
1.2 Switch Architectures
In a router, the forwarding plane involves packet-by-packet processing, which
is generally more time-critical than the operations of the control plane [2]. As shown
in Fig. 1.2, the forwarding plane comprises two major functions: table lookup, for
identifying the correct output linecard of a packet, and switching, for the actual
delivery of the packet.
IP table lookup algorithms can be classified into trie-based [74-79], range-
based [80-81], and hash-based algorithms [82-88]. These algorithms can be
implemented in software, hardware or both. Software schemes benefit from low
cost and flexibility. Hardware solutions, e.g. TCAM (Ternary Content Addressable
Memory [89-95]), are more efficient as they can search contents in parallel and
complete a lookup in a single clock cycle. Nevertheless, as the table lookup process
can be distributed to each linecard, its high-speed implementation tends to be less
critical than switching. Indeed, table lookup at 100 Gbps per linecard has been
reported in [88], whereas, due to the limitations of O-E-O conversion and the
centralized scheduler, a switching rate of 40 Gbps per linecard seems to be the
current limit.
In this thesis, we focus on designing efficient and scalable switch architecture
to enable the next generation high-speed routers. Based on the switch architecture,
routers can be generally classified into output-queued, input-queued, and combined
input-output queued (CIOQ).
1.2.1 Output-queued Switches
In an output-queued switch, all packets are switched to their respective
output linecards as soon as they arrive at the inputs. Accordingly, no input port
buffer is required, and the output-queued switch provides the optimal packet delay-
throughput performance. But the switch fabric must be powerful enough to deliver
up to N packets to any output port, and each output buffer must be fast enough to
receive up to N packets in each time slot, where N is the switch size (i.e. the number
of linecards). In other words, the switch fabric and output ports must operate at N
times the individual link rate. This makes high-speed output-queued switches
expensive to build and difficult to scale.
It should be noted that the complexity of a switch fabric can be measured by
the number of switch configurations it needs to realize. A switch configuration is an
internal switch fabric connection pattern mapping the set of N inputs to the N
outputs. An output-queued switch fabric needs to realize N^N configurations, as
up to N packets can go to the same output.
1.2.2 Input-queued Switches
In an input-queued switch, all packets are buffered at the input ports and wait
for their turns to be served by the switch fabric. No switch fabric speedup is required
(i.e. the fabric only needs to run at the same speed as each input link), but each
input can send at most one packet and each output can receive at most one packet in
every time slot. Accordingly, the number of switch configurations to be realized by
an input-queued switch is N!, which is substantially smaller than the N^N required by
an output-queued switch. This makes input-queued switches more suitable for
building high-speed routers with a large port count.
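The gap between these two configuration counts can be checked directly: an output-queued fabric must realize every mapping from inputs to outputs, while an input-queued fabric only realizes permutations. The following snippet is our own illustration (not from the thesis) that enumerates both for a small switch:

```python
from itertools import permutations, product

def output_queued_configs(n):
    """Any input may target any output, so a configuration is an
    arbitrary mapping from n inputs to n outputs: N^N in total."""
    return sum(1 for _ in product(range(n), repeat=n))

def input_queued_configs(n):
    """At most one packet per input and per output, so a configuration
    is a permutation of the outputs: N! in total."""
    return sum(1 for _ in permutations(range(n)))

# For a small 4 x 4 switch: 4^4 = 256 mappings vs 4! = 24 matchings.
```

Even at N = 4 the output-queued fabric must support an order of magnitude more configurations, and the gap widens rapidly with N.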
Fig. 1.7 An input-queued switch with Virtual Output Queues (VOQs)
On the other hand, input-queued switches suffer from the well-known
problem of head-of-line (HOL) blocking, which limits the maximum throughput of an
input-queued switch to just 58.6% under uniform traffic [10]. To eliminate HOL
blocking, Virtual Output Queueing (VOQ) is proposed [11], where each input port
maintains a separate queue for each output (Fig. 1.7). A centralized scheduler is
needed to maximize the throughput of a VOQ switch. The scheduling problem is
equivalent to the matching problem in a bipartite graph [98]. It has been found that
for any admissible traffic pattern, 100% throughput can be achieved by MWM
(Maximum Weight Matching [12]). However, the MWM algorithm has a high time
complexity of O(N^3 log N). MSM (Maximal Size Matching) algorithms with lower
computation overheads, notably PIM (Parallel Iterative Matching [13]), iSLIP [14,15]
and DRRM (Dual Round-Robin Matching [16]), have then been proposed. They are
iterative algorithms involving a non-negligible amount of communication overhead for state
information exchange, which scales up quickly with the number of iterations to
be carried out, the link speed and the switch size.
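The 58.6% figure can be reproduced approximately with a small saturation experiment: every input always holds a head-of-line packet with a uniformly random destination, each output serves one contending packet per slot, and losers block their queue. This is our own hedged sketch of the classic HOL model (not code from the thesis); for large N the measured throughput approaches 2 − sqrt(2) ≈ 0.586.

```python
import random

def hol_saturation_throughput(n, slots, seed=1):
    """Simulate saturated FIFO inputs under uniform traffic.
    Losing head-of-line packets block their queue (HOL blocking);
    each winner immediately exposes a fresh random-destination packet."""
    rng = random.Random(seed)
    heads = [rng.randrange(n) for _ in range(n)]  # HOL destinations
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, dest in enumerate(heads):
            contenders.setdefault(dest, []).append(i)
        for dest, inputs in contenders.items():
            winner = rng.choice(inputs)       # one packet per output
            heads[winner] = rng.randrange(n)  # winner's queue advances
            served += 1
    return served / (n * slots)
```

For n = 32 and a few thousand slots, the measured throughput settles slightly below 0.6, consistent with the asymptotic bound in [10].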
As an example, the ATLANTA architecture proposed in [105] is based on the
input-queued switch architecture. Notably, its switch fabric is implemented as a
three-stage (memory/space/memory) Clos network, where packets are buffered at the
first and third stages while the second stage is constructed from crossbar switch
modules. To avert overflow in the fabric-embedded buffers of the first and third
stages, backpressure signals are sent from the first stage to the input ports, as well as
from the third stage to the second-stage crossbars. Nevertheless, its performance is
limited by the required packet/slot-based switch re-configurations.
1.2.3 CIOQ and Buffered Crossbar Switches
In a CIOQ switch, packets are buffered at both input and output ports [17].
The switch fabric is the same as an input-queued switch fabric, where in each time
slot at most a single packet can leave/join an input/output port. A centralized
scheduler is responsible for selecting the most "critical" packets to deliver in each
time slot. A packet may arrive at an output port out of order, so an output
buffer/queue is required. It has been shown [17] that with a speedup of two (i.e. in
each time slot, up to two packets can leave/join an input/output port), a CIOQ switch
can precisely emulate an output-queued switch. Like an input-queued switch, the
number of switch configurations to be realized by a CIOQ switch is N!. But the
complexity of the centralized scheduler is by no means less than that of an
input-queued switch.
Notably, the buffered crossbar switch [18-20] is an elegant approach to
implementing CIOQ switches that adopts a distributed approach to scheduling. In
addition to buffering packets at each input, a buffered crossbar switch allows packets
to be buffered at each crosspoint of the switch fabric, as shown in Fig. 1.8. It has
been shown that buffered crossbars can yield performance comparable to output-
queued switches. Although the buffered crossbar is touted for its technological
feasibility and simpler scheduler, it requires 2N schedulers (one for each input/output
port), N^2 in-fabric crosspoint buffers, and the switch configuration must still be
determined on a slot-by-slot basis. It should be noted that the total of N^2 crosspoint
buffers is very difficult to build. A report [100] shows that a memory of one 512-bit
word occupies 0.0278 mm^2 of silicon even under state-of-the-art 0.18 μm VLSI
technology. Assuming a switch size of N = 32, holding all crosspoint buffers, at
1000 bits each, would require 55.6 mm^2 of silicon, which dominates the cost in
terms of area and is prohibitive [47-48,100].
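The area estimate follows directly from the figures quoted from [100]; a quick arithmetic check (our own, for illustration) confirms the 55.6 mm^2 total:

```python
# Figures quoted in the text: one 512-bit memory word occupies
# 0.0278 mm^2 in 0.18 um technology; a 32 x 32 buffered crossbar
# holds N^2 crosspoint buffers of 1000 bits each.
n = 32
bits_per_buffer = 1000
bits_per_word = 512
area_per_word_mm2 = 0.0278

total_bits = n * n * bits_per_buffer      # 1,024,000 bits in total
words = total_bits / bits_per_word        # 2,000 memory words
area_mm2 = words * area_per_word_mm2      # 55.6 mm^2 of silicon
```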
Fig. 1.8 A buffered crossbar switch
For a buffered crossbar switch, because the input/output contention is relaxed,
the total number of switch configurations to be realized is N^N, the same complexity
as an output-queued switch fabric. Besides, the communication overhead of collecting
the queue size at each crosspoint buffer (for input/output arbitration) can be a
potential performance bottleneck.
1.2.4 Load-Balanced Two-Stage Switches
Load-balanced two-stage switches (or load-balanced switches) have received
a great deal of attention recently [21-32] because they are more scalable and can
provide close to 100% throughput. A load-balanced switch consists of two stages of
switch fabrics, as shown in Fig. 2.1. Each switch fabric is configured according to a
pre-determined and periodic sequence of switch configurations. As a result, each
switch fabric needs to realize only N switch configurations (instead of N! for
input-queued and CIOQ switches, and N^N for output-queued and buffered crossbar
switches). This greatly facilitates high-speed implementation.
Besides, due to the pre-determined nature of the sequence of configurations,
the load-balanced switch removes the need for a centralized scheduler, another major
bottleneck in designing high-speed switches. As a load-balanced switch provides
multiple paths for packets belonging to the same flow to arrive at the same output
port, packets may arrive out of order due to the different middle-stage port delays
experienced en route. Many efforts [22-32] have been made to address this notorious
packet mis-sequencing problem (to be reviewed in Chapter 2). It is not difficult to
see that higher switch throughput usually comes at the cost of poorer delay
performance: throughput is improved by better load balancing, but better load
balancing tends to aggravate the packet mis-sequencing problem.
1.3 Contributions
In this dissertation, we dedicate our efforts to designing an efficient and
scalable switch architecture for next generation high-speed routers. We have two
key design objectives:
- No need for a centralized scheduler, as a centralized scheduler is a major
obstacle to a scalable switch architecture; and
- Amenability to optics, which can avoid the extra O-E-O conversion in the
fourth generation routers when packets are switched from one linecard to
another.
We follow the approach of the load-balanced switch due to its scalability (no
centralized scheduler) and close-to-100% throughput performance. But its notorious
packet mis-sequencing problem must be properly addressed; otherwise, the
complexity of the load-balanced switch as well as its delay performance would suffer.
To this end, an elegant solution called the feedback-based two-stage switch (or
feedback-based switch for short) is proposed in this thesis. Before diving into the
details, our major contributions are outlined below.
Feedback-based Two-stage Switch Design: Unlike other load-balanced
switches, at each middle-stage port between the two switch fabrics of our
feedback-based two-stage switch, only a single-packet-buffer for each VOQ
is required. Although packets belonging to the same flow pass through
different middle-stage VOQs, the delays they experience at the different middle-
stage ports will be identical. This is made possible by properly selecting and
coordinating the two sequences of switch configurations to form a joint
sequence with both the staggered symmetry property and the in-order packet
delivery property. Based on the staggered symmetry property, an efficient feedback
mechanism is designed to allow the right middle-stage port N-bit occupancy
vector to be delivered to the right input port at the right time. Compared
with existing load-balanced switch architectures and scheduling
algorithms, our solution imposes a modest requirement on switch hardware,
but consistently yields the best delay-throughput performance.
Cutting down the average packet delay of the switch: As different flows
experience different middle-stage delays, we can cut down the average packet
delay by assigning heavy flows to paths with less middle-stage delay. For a
given traffic matrix, we can find an optimal joint sequence that minimizes
the average middle-stage delay, but this involves tedious computation. A
three-stage switch architecture is thus proposed, which adds another stage of
switch fabric to dynamically map heavy flows to paths with less
middle-stage port delay.
Cutting down the communication overhead of the feedback-based switch: In
a feedback-based switch, each middle-stage port needs to piggyback an N-bit
occupancy vector to its connected output in each time slot. To cut down this
communication overhead, the size of an occupancy vector can be reduced by
only reporting the status of selected middle-stage VOQs. To identify the VOQs
of interest, we first partition the N VOQs into u non-overlapping sets, each
identified by a set number. In each time slot, every input port
piggybacks its set numbers of interest to the connected middle-stage port.
This guides a middle-stage port to report only the status of the VOQs of
interest.
Supporting multicast: By slightly modifying the operation of the original
feedback-based two-stage switch, we show that the feedback-based switch
supports multicast traffic efficiently. A notable feature of this multicast
extension is that the switch fabric remains unicast, while packet
duplication is distributed to both the input and middle-stage ports.
Multi-cabinet implementation: In a single-cabinet implementation, the
propagation delay between the linecards and the switch fabric is negligible. In a
multi-cabinet implementation, due to the non-negligible propagation delay
between linecards and switch fabric, the requirement that occupancy vectors
must arrive at the output/input ports within a single time slot would significantly
lower the feedback-based switch's efficiency. To this end, we revamp the
original feedback mechanism to support multi-cabinet implementation, and a
new batch scheduler is also designed.
Fairness support for switching inadmissible traffic: As long as the traffic
is admissible, packets can arrive at the outputs with bounded delays due to the
close-to-100% throughput of our feedback switch, so fairness in throughput is
not an issue. Under inadmissible traffic (i.e. some output ports are over-
subscribed), the feedback switch may suffer from the ring-fairness problem,
i.e. "up-stream" input ports can starve some "down-stream" input ports. To
address this ring-fairness problem, we design an algorithm that allocates the
bandwidth of over-subscribed outputs according to the max-min fairness
criterion.
Optical implementation of the feedback-based switch: To ensure that packets
can be switched from one linecard to another all-optically, an optical feedback-
based switch called the Load-Balanced Optical Switch (LBOS) is proposed.
LBOS leverages an N-wavelength WDM fiber ring to connect the N linecards
together. The ring network is engineered such that the amount of time a
packet should be buffered at a middle-stage port exactly matches the
propagation delay that this packet would experience en route.
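The max-min criterion invoked in the fairness contribution above is standard: repeatedly give every unsatisfied flow an equal share of the remaining bandwidth, freezing flows whose demand is met. The progressive-filling sketch below is a generic illustration of that criterion (our own, not the thesis's ring-specific algorithm):

```python
def max_min_allocation(capacity, demands):
    """Generic progressive filling. demands maps each flow to its
    requested rate; returns the max-min fair allocation per flow."""
    alloc = {}
    remaining = dict(demands)
    cap = float(capacity)
    while remaining:
        share = cap / len(remaining)
        satisfied = {f: d for f, d in remaining.items() if d <= share}
        if not satisfied:
            for f in remaining:       # bottleneck reached: equal split
                alloc[f] = share
            return alloc
        for f, d in satisfied.items():
            alloc[f] = d              # small demand fully met
            cap -= d
            del remaining[f]
    return alloc
```

For an over-subscribed output of capacity 1.0 with demands 0.2, 0.4 and 1.0, progressive filling allocates 0.2, 0.4 and 0.4, respectively: small demands are fully met and the leftover bandwidth is split among the rest.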
1.4 Thesis Overview
This thesis consists of nine chapters. In Chapter 2, we first review the
existing work on the packet mis-sequencing problem of load-balanced
switches. Then the framework of our proposed feedback-based two-stage switch is
introduced, and the delay and throughput performance of the feedback-based switch is
compared with existing algorithms by simulations. The stability of the
feedback-based switch under a speedup of two is also proved.
In Chapter 3, we cut down the average packet delay of a feedback-based
switch by assigning heavy flows to experience less middle-stage ports delays. In
Chapter 4, we focus on designing efficient feedback suppression schemes for cutting
down the communication overhead of sending middle-stage occupancy vectors. In
Chapter 5, we extend the feedback-based switch to support multicast traffic. In
Chapter 6, the feedback-based switch is refined to support multi-cabinet
implementation. In Chapter 7, a fair scheduling algorithm for inadmissible traffic is
proposed. An optical implementation of the feedback-based switch, called LBOS, is
introduced in Chapter 8. Finally, Chapter 9 summarizes our contributions in this
thesis, and highlights some interesting future research directions.
Chapter 2
Feedback-Based Two-Stage Switch Design
2.1 Introduction
Due to its more scalable switch fabric, the input-queued switch architecture is
more suitable than the output-queued switch for high-speed router implementation.
However, an input-queued switch requires a centralized scheduler to determine its
switch configuration on a slot-by-slot basis. The requirement for a centralized
scheduler is thus the major bottleneck in further increasing the router's capacity.
Load-balanced two-stage switches [21-32] remove the bottleneck of
the centralized scheduler and can provide close to 100% throughput. A load-balanced
two-stage switch consists of two stages of switch fabrics, as shown in
Fig. 2.1. Each fabric is configured according to a pre-determined and periodic
sequence of switch configurations, with the only requirement that each input
connects to each output exactly once in the sequence. The two fabrics can use
different sequences. There are many ways to generate such a sequence; for example,
a sequence can be constructed by cyclically shifting the set of input/output
connections used in each time slot, such that at time slot t, input i (for
i = 0, 1, ..., N-1) is connected to output j, where j is given by

j = (i + t) mod N.    (2.1)

In Fig. 2.2(a), the sequence of blue/dotted configurations represents the
configurations used by the first-stage switch fabric in Fig. 2.1, and it is generated
based on (2.1). Note that each switch port is abstracted as a circle in Fig. 2.2.
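Equation (2.1) can be verified in a few lines: each slot's connection pattern is a permutation, and over one period of N slots each input is connected to every output exactly once, which is precisely the basic requirement of a load-balanced switch. (Our own illustrative snippet.)

```python
def configuration(t, n):
    """Connection pattern at slot t under Eq. (2.1):
    entry i is the output that input i connects to."""
    return [(i + t) % n for i in range(n)]

n = 4
period = [configuration(t, n) for t in range(n)]
# Each slot's pattern is a permutation of the outputs, and over the
# period of n slots every input i meets every output exactly once.
```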
Fig. 2.1 A load-balanced two-stage switch architecture.
For the generic load-balanced switch architecture shown in Fig. 2.1, we use
VOQ1(i,k) to represent the VOQ at input port i with packets destined for output k,
and VOQ2(j,k) to denote the VOQ at middle-stage port j with packets destined for
output k. We define flow(i,k) as the packets arriving at input i and destined for output k.
Packets from flow(i,k) are buffered at VOQ1(i,k). Packets (from different inputs)
destined for output k are buffered at VOQ2(j,k) for j = 0, 1, ..., N-1. Aiming to
convert the incoming non-uniform traffic into uniform traffic, the first-stage switch
fabric spreads packets evenly over all middle-stage ports. Then the second-stage
switch fabric delivers the packets from the middle-stage ports to their respective
outputs. From the above, we can see that in each time slot there are two switch
configurations, one at each fabric. We call them a joint configuration, and the
sequence of N joint configurations forms a joint sequence. Three possible joint
sequences are shown in Fig. 2.2. It is important to point out that all three joint
sequences in Fig. 2.2 meet the basic requirement of a load-balanced two-stage
switch, but they have different properties, namely in-order packet delivery and
staggered symmetry. These two properties, which form the basis of our
feedback-based two-stage switch design, will be discussed in detail in Section 2.3.
In Chapter 3, the problem of optimal joint sequence design will be investigated.
Due to the two-stage nature, flow(i,k) packets may arrive at output k via
different middle-stage VOQ2(j,k)'s (for j = 0, 1, ..., N-1) and thus may experience
different amounts of middle-stage port delay. This leads to the problem of packet
mis-sequencing. Many efforts [21-32] have been made to address this notorious packet
mis-sequencing problem (reviewed in Section 2.2). It is not difficult to see that
higher switch throughput usually comes at the cost of poorer delay performance:
throughput is improved by better load balancing, but better load
balancing tends to aggravate the packet mis-sequencing problem.
Fig. 2.2 Some joint sequences for a 4 × 4 load-balanced switch.
In this chapter, we show that the efforts made in load balancing and in keeping
packets in order can complement each other in improving both the delay and throughput
performance of the switch. We adopt a simple load-balanced switch architecture
where each middle-stage port between the two stages of switch fabrics has only a
single-packet-buffer for each VOQ. Although packets belonging to the same flow
will pass through different middle-stage VOQs, the delays they experience at the
different middle-stage ports will be identical. This is made possible by properly
selecting and coordinating the two sequences of switch configurations (used by the
two stages of switch fabrics) to form a joint sequence with both the staggered symmetry
property and the in-order packet delivery property. Based on the staggered symmetry
property, an efficient feedback mechanism is designed to allow the right middle-stage
port occupancy vector to be delivered to the right input port at the right time.
Accordingly, the performance of load balancing as well as switch throughput is
significantly improved.
The rest of this chapter is organized as follows. In the next section, we review
the existing work for solving the packet mis-sequencing problem of load-balanced
switches. In Section 2.3, our proposed feedback switch framework is introduced. The
delay and throughput performance of our proposed solutions is compared with other
existing algorithms in Section 2.4 by simulations. In Section 2.5, we prove that for
any arbitrary work-conserving input port scheduler, the feedback-based switch can
achieve 100% throughput under a speedup of two. Finally, we conclude this chapter
in Section 2.6.
2.2 Related Work
Two main approaches can be followed to solve the mis-sequencing problem
of load-balanced switches: using re-sequencing buffers at the outputs, or preventing
packets from becoming mis-sequenced in the first place.
2.2.1 Using Re-sequencing Buffers
When out-of-order packets arrive at an output port, they are temporarily
stored in a re-sequencing buffer (not shown in Fig. 2.1), waiting to be read out and
written onto the output link in the correct order. To this end, each packet header
should carry a sequence number field (or timestamp), which is added to the packet
upon its arrival at an input port. With the original two-stage switch architecture [21],
packets can be mis-sequenced by an arbitrary amount, and thus a finite re-sequencing
buffer is not possible. Efforts have been made to bound the re-sequencing delay at
additional cost, such as N writes to memory in one time slot [22] or a 3-D
re-sequencing buffer [23].
In [24], a three-stage load-balanced switch is presented where each of the
three stages of switch fabrics is configured by pre-determined and periodic
configurations. The buffers ahead of each stage of switch fabric are called the
first-stage buffer, the second-stage buffer and the third-stage buffer (i.e. the
re-sequencing buffer), respectively. Every arriving packet first reserves a position
in the third-stage buffer. Upon successful reservation, the packet is forwarded to
the first-stage buffer by a flow splitter according to its assigned position number in
the third-stage buffer. Packets are transmitted through the first two switches in a
FIFO manner and are inserted into their reserved positions in the third-stage buffer.
Although the switch is proved to be stable, this design requires additional hardware
as well as global information exchange for buffer reservation. The high
implementation complexity may defeat the original purpose of using a load-balanced
switch.
2.2.2 Preventing Packets from Becoming Mis-sequenced
Instead of re-ordering packets at each output, we can prevent packets from
becoming mis-sequenced in the first place [25-32]. This removes not only the
re-sequencing buffers, but also the corresponding re-sequencing delay. The majority
of the work in this direction [26-29] adopts the notion of a "frame". For an N × N
switch, a frame consists of N packets belonging to the same flow. At each input port,
incoming packets join their respective VOQs. If the size of a VOQ reaches N packets,
the flow is said to have a full frame of packets. With the UFS (Uniform Frame
Spreading) algorithm [26], an input port is allowed to send only from flows/VOQs
with at least a
full frame of packets. Once a frame transmission starts, N packets from the selected
flow are sent in the next N slots, with each packet arriving at a distinct middle-stage
port from 0 to N-1. The frame transmission starts when the input port is connected to
a particular middle-stage port, say port 0. Each input has a distinct frame starting
time because the inputs connect to middle-stage port 0 at different slots. Under this
frame notion, upon joining the VOQ at each middle-stage port, every packet in a
frame sees the same middle-stage VOQ size. If the transmission at the second-stage
switch fabric is coordinated such that an output is connected to the middle-stage
ports in the same (cyclic) order as an input is, in-order packet delivery is guaranteed.
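The per-input behaviour of UFS just described can be sketched as follows. This is our own simplified illustration under the stated assumptions, not the exact pseudocode of [26]:

```python
from collections import deque

def ufs_send_frame(voq, n, first_middle_port=0):
    """If the VOQ holds a full frame (>= n packets), pop n packets and
    return (packet, middle_port) pairs for the next n slots, one packet
    per slot to middle-stage ports first_middle_port, +1, ... (mod n).
    Otherwise the input stays idle and nothing is sent."""
    if len(voq) < n:
        return []                      # no full frame yet
    return [(voq.popleft(), (first_middle_port + s) % n)
            for s in range(n)]
```

Because every packet of the frame joins a middle-stage VOQ of the same length, and the second stage visits the middle-stage ports in the same cyclic order, the frame leaves the switch in order.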
A downside of the UFS algorithm is that when the traffic load is light, it takes
time to form a full frame of packets, so the delay performance suffers. To cut down
the delay, FOFF (Full Ordered Frames First) [27] was proposed. Instead of waiting
for full frames of packets, FOFF allows mis-sequencing caused by sending partial
frames, but the amount of mis-sequencing at the middle-stage ports is bounded. As a
result, the amount of re-sequencing buffer required at each output is also bounded.
The PF (Padded Frame) algorithm [28] also improves the delay performance of
UFS, but without the re-sequencing buffers of FOFF. The idea is that when no full
frame is available for sending, a partial frame can be sent as a "faked" full frame
by padding it with dummy packets. The CR (Contention and Reservation)
algorithm [29] further improves the performance of PF by supporting two modes
of frame transmission: contention and reservation. If an input i has a full
frame of packets when i connects to middle-stage port 0, i enters the reservation mode and
the transmission in the next N slots is governed by UFS. Otherwise, input i enters the
contention mode, where the packet sent in each slot is selected by a round-robin
scheduler and must be acknowledged at the end of that time slot. A packet is
removed from the input VOQ only if a positive ACK (ACKnowledgement) is
received.
The CR algorithm requires a dedicated feedback/acknowledgement from each
middle-stage port in each time slot, but the construction of the feedback path is not
discussed in [29]. In contrast, the Mailbox switch [30], which also requires a
feedback path, constructs it smartly by adopting the joint sequence of switch
configurations in Fig. 2.2(c), where input i and output i are always connected to the
same middle-stage port. In each time slot, when a packet arrives at a middle-stage
port (from, say, input i), the middle-stage port calculates the packet's departure time
(i.e. when it will be sent to its destination output) based on its location in the VOQ.
The departure time is then sent to the connected output port i using the second
switch fabric. As input i and output i reside on the same switch linecard, output i can
relay the departure time of the packet to input i at negligible cost. A feedback path
for reporting middle-stage packet departure times is thus created. Based on the
received packet departure time, the next packet of the flow will be dispatched and
inserted into a middle-stage VOQ only if it will depart no earlier than the previous
packet of the same flow. Although the Mailbox switch maintains packet order
without relying on the frame notion, its overall throughput is limited.
In [31], a distributed and iterative scheduling algorithm, CMS (Concurrent
Matching Switch), is introduced. Despite the fixed uniform mesh in both
stages of switch fabrics, its logical configurations are the same as the joint sequence
in Fig. 2.2(c). For every arriving packet, the input port sends a request to the currently
(logically) connected middle-stage port. Each middle port records the received
requests in its own N × N matrix {ri,j}, where ri,j denotes the number of requests from
flow(i,j). Every N time slots, each middle-stage port concurrently and independently
finds a matching based on its own {ri,j}. (Note that CMS can achieve stability using
randomized scheduling with amortized constant time and hardware complexity per
port, independent of N.) In the following N time slots, the matched packets are
transmitted to the middle-stage ports. As soon as they arrive, the middle-stage ports
forward them to the connected output ports. Since the packets selected in each slot
traverse the two switches in parallel and without conflicts, there is no out-of-order
problem. However, the packet delay can be quite large: even the best case is 3N time
slots when a parallel optical mesh is used. Having said that, the delay performance of
Chang's original architecture [21] is on the order of O(N) if it is implemented using
an R/N optics abstraction.
2.3 Feedback-based Two-stage Switch
2.3.1 Some Observations and Motivations
The delay and throughput performance of a load-balanced switch hinges on
how well the load-balancing and in-order packet delivery are implemented.
Obviously, if the incoming traffic is well-balanced by the first stage switch, the
throughput performance will be improved as the second stage switch can maximize
the number of packets sent in each time slot. Consequently, the packet delay will also
be reduced due to higher throughput.
But how should the load-balancing performance be measured? Many scheduling
algorithms (e.g. in [23, 25]) try to ensure all middle-stage VOQs have the same
queue size. But as far as the throughput performance is concerned, we only need to
ensure each middle-stage VOQ2( j,k ) (in Fig. 2.1) does not suffer from either buffer
underflow or overflow problem. A buffer underflow occurs if there are packets
waiting in some input ports for a particular output k , but VOQ2( j,k ) is empty at the
time that middle-stage port j is connected to output k , yielding an idle transmission
slot on the second stage switch. On the other hand, buffer overflow is equally
undesirable as the overflowed packet is dropped, and the transmission slot in the first
stage switch is wasted. Indeed, as long as no buffer underflow and overflow at each
VOQ2( j,k ) is ensured, the actual buffer size for each VOQ2( j,k ) has no impact on the
throughput performance of the switch. Therefore, it may not be appropriate to
increase the buffer size of VOQ2( j,k ) for boosting throughput performance.
In a load-balanced switch, the head of line packet in each middle-stage VOQ
will experience an average delay of N /2 slots (due to the deterministic nature of the N
configurations), and each additional packet in the line will experience an additional
delay of N slots. To minimize delay, a small buffer size at each VOQ2( j,k ) is preferred.
In general, mechanisms for ensuring in-order packet delivery tend to penalize
the packet delay performance more than throughput. If re-sequencing buffers are
used for solving the mis-sequencing problem, packets suffer from an additional
re-sequencing delay. Since packet mis-sequencing is due to packets of the same flow
experiencing different delays at different middle-stage ports, a smaller buffer size at
each VOQ2( j,k ) is favored because middle-stage packet delay can be reduced and
thus the mis-sequencing problem can be eased. Consequently, a smaller
re-sequencing buffer/delay is also possible. In fact, buffering a packet at an input port
(instead of a middle-stage port) gives more flexibility in sending because an input
can retry in the subsequent slots at different middle-stage ports (which may even
have a shorter queue size).
If the frame notion is used for ensuring in-order packet delivery, the time
required for forming a frame dominates the delay performance especially when the
load is light. Besides, frame-based transmission tends to make the traffic to
downstream switches more bursty, resulting in poor delay jitter performance.
Although PD [28] and CR [29] improve the delay performance of UFS [26], the use
of fake frames/packets undermines the load-balancing performance. In this chapter,
we are interested in designing a scheduling algorithm without using re-sequencing
buffers for in-order packet delivery, and without incurring the frame-based
scheduling overheads.
From our observations above, we can see that a smaller buffer size at each
VOQ2( j,k ) is preferred if we can ensure (a) no underflow and overflow at each
VOQ2( j,k ), and (b) no packet mis-sequencing. The smallest buffer size at each
VOQ2( j,k ) is 1. In the rest of this chapter, we shall focus on using a single-packet-
buffer at each VOQ2( j,k ).
2.3.2 Designing Scalable Feedback Mechanism
Now the issue is how to ensure each single-packet-buffered VOQ2( j,k ) is free
of either buffer overflow or underflow. If an input port knows the occupancy of its
connected VOQ2( j,k ) before sending a packet to it, the buffer overflow problem can
be easily solved. Then, do we have an efficient feedback mechanism for reporting the
occupancy of VOQ2( j,k ) to input ports?
We propose a simple yet novel feedback mechanism based on a joint
sequence with the staggered symmetry property. A joint sequence of switch
configurations has the staggered symmetry property if, whenever middle-stage port j is
connected to output port k at time slot t, input port k is connected to the same
middle-stage port j at the next slot (t+1). In essence, for each given sequence in the
first stage switch, the second stage sequence (and thus the joint sequence) can be
obtained directly from the property itself. In Fig. 2.2(a), the first stage sequence is
constructed from (2.1) by cyclically shifting the set of connections used in each slot. Each
configuration in the second stage is obtained from the staggered symmetry property.
We can see that for every pair of staggered configurations, e.g. the second switch
configuration at t =0 and the first switch configuration at t =1, they are mirror images
of each other.
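To make this construction concrete, here is a small Python sketch of a joint sequence with the staggered symmetry property. Since (2.1) is not reproduced in this excerpt, we assume the cyclic-shift form j = (i + t) mod N for the first stage; the second-stage configuration then follows (2.2), and the final assertion checks the mirror-image (staggered symmetry) relation described above.

```python
N = 4

def first_stage(i, t, n=N):
    # Assumed cyclic-shift form of (2.1): input i -> middle port (i + t) mod N.
    return (i + t) % n

def second_stage(j, t, n=N):
    # Eq. (2.2): middle port j -> output (j + N - 1 - t) mod N,
    # derived from the staggered symmetry property.
    return (j + n - 1 - t) % n

# Staggered symmetry: if middle port j connects to output k at slot t,
# then input k connects to the same middle port j at slot t + 1.
for t in range(2 * N):
    for j in range(N):
        k = second_stage(j, t)
        assert first_stage(k, t + 1) == j
```

The second switch configuration at slot t and the first switch configuration at slot t+1 are indeed mirror images of each other, which is exactly what the assertion verifies.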
As each VOQ2( j,k ) only has a single packet buffer, a single bit is sufficient to
denote its occupancy. For the N VOQ2( j,k )’s at middle-stage port j (for k =0, …, N -1),
their joint occupancy can be denoted by an N -bit occupancy vector. Since each pair
of input k and output k reside on the same linecard, the occupancy vector at middle-
stage port j can be piggybacked on the data packet sent to output k , which is then
made available to input k at negligible cost. Due to the staggered symmetry property
of the joint sequence used, input k will be connected to middle port j in the next time
slot. This gives a very efficient feedback path, allowing the occupancy vector from
the right middle-stage port to be delivered to the right input at the right time. In the
next time slot, each input port scheduler will select a packet for sending based on the
received occupancy vector. If the packet is properly selected, both buffer overflow
and underflow at a middle-stage VOQ2( j,k ) can be avoided. (In Section 2.3.4, three
simple input port schedulers are designed.)
Fig. 2.3 Feedback operation in joint sequences with staggered symmetry.
The timing diagram in Fig. 2.3 summarizes the feedback operation, while
assuming each switch reconfiguration involves certain overhead. We can see that
switch reconfiguration takes place in parallel with relaying the occupancy vector
from output k to input k and the execution of the scheduling algorithm. The
occupancy vector is created by taking both packet arrival/departure in the current slot
into account. In creating the vector, the occupancy bit of VOQ2( j,k ’) is always set to
0 if middle port j will connect to output k ’ in the next slot. This is because the packet
(if any) in VOQ2( j,k ’) is guaranteed to be sent in the next time slot. Besides, when a
buffered packet in VOQ2( j,k ’) is being sent, VOQ2( j,k ’) can receive another packet
simultaneously. Due to parallel packet transmission in the two switch stages, a packet
cannot be delivered from an input to an output in a single time slot, i.e. the minimum
delay a packet experiences at a middle-stage port is one slot.
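As a hypothetical illustration of this vector-creation rule, the sketch below builds the occupancy vector of middle-stage port j at the end of slot t. The representation (`buffered[k]` marking whether VOQ2(j,k) holds a packet) and the helper name are our own, not from the thesis; the bit of the output that port j serves in the next slot is cleared, since that packet (if any) is guaranteed to depart.

```python
N = 4

def occupancy_vector(buffered, j, t, n=N):
    # buffered[k] is True if VOQ2(j, k) currently holds a packet.
    # The bit for the output that middle port j serves in the NEXT slot is
    # always set to 0, because that packet (if any) departs in slot t + 1.
    next_out = (j + n - 1 - (t + 1)) % n   # output of port j at slot t+1, per (2.2)
    return [0 if k == next_out else int(buffered[k]) for k in range(n)]

# Port 0 serves output (0 + 4 - 1 - 1) mod 4 = 2 at slot 1,
# so the bit for output 2 is cleared even though VOQ2(0,2) is occupied.
assert occupancy_vector([True, True, True, False], j=0, t=0) == [1, 1, 0, 0]
```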
From Fig. 2.3, we can also see that the feedback operation requires accurate
timing synchronization within a time slot. We notice that accurate synchronization of
less than 10 ns is reported in [106], and a scheme to achieve 1 ns synchronization is
proposed in [107]. Therefore, synchronization within a time slot of, say 40 ns, would
not be a major issue.
Note that the joint sequence in Fig. 2.2(c) does not have the staggered
symmetry property. If it is used for implementing the feedback path (as in [30,32]),
the occupancy vector cannot be piggybacked onto a data packet. Instead, a dedicated
feedback packet must be sent from each middle-stage port to its connected output in
each time slot. This incurs not only extra propagation delay for sending the feedback
packet, but also extra packetization and synchronization overhead. As a result, the
duration of a time slot in [30,32] would be much longer than that shown in Fig. 2.3.
If the switch performance is studied using the number of time slots, the inefficiencies
of using a “larger” time slot could be easily overlooked.
2.3.3 Solving Packet Mis-sequencing Problem
If the load-balanced switch in Fig. 2.1 is configured by the joint sequence in
Fig. 2.2(a), will we face the packet mis-sequencing problem? We know that packet
order will be preserved if every packet of a flow experiences the same amount of
delay when passing through any middle-stage port. This is obviously true if
middle-stage ports are bufferless, so that every packet experiences the same 0-slot delay.
Will it still be true for the case of a single packet buffer per VOQ2(j,k)?
Surprisingly, a closer examination of the joint sequence in Fig. 2.2(a) reveals
that packets of the same flow do experience the same middle-stage port delay. Take
flow(0,1) in Fig. 2.2(a) as an example. If a packet is sent (from input 0) to middle-
stage port 0 at t =0, it will be buffered at VOQ2(0,1) for 2 slots until VOQ2(0,1) is
connected to output 1 at t =2. If the next packet of the flow is sent to middle-stage
port 1 at t =1, it will be buffered at VOQ2(1,1) for, again, 2 slots until VOQ2(1,1) is
connected to output 1 at t =3.
In the following, we prove that this is true for each and every flow, and for
any switch size N . Consider the joint sequence in Fig. 2.2(a). The sequence used by
the first stage switch is constructed from (2.1). The sequence used by the second
stage switch is constructed according to the staggered symmetry property, which can
be represented by (2.2). That is, at time t (for 0 ≤ t < N), middle-stage port j is connected
to output k , where k is given by
k = ( j + N – 1 – t ) mod N (2.2)
Statement 1: (Anchor Output). In Fig. 2.2(a), input i is always connected to
output K , where K = [(i+ N –1) mod N ], via one of the middle-stage ports.
Proof: At time t , input i is connected to output k via middle-stage port j.
Substituting j from (2.1) into (2.2), we can express k in terms of i:

k = [((i + t) mod N) + N − 1 − t] mod N = (i + N − 1) mod N = K (2.3)
We can see that K depends only on i. Thus for a given input i, it is always connected
to the same anchor output K . #
Statement 2: (Deterministic Delay at Middle-stage Ports). Let K be the
anchor output of input i. For every packet of flow(i,k ), it experiences the same d slots
delay in one of the middle-stage ports, where d is given by
d = N            if K = k
d = K − k        if K > k          (2.4)
d = K − k + N    if K < k
Proof: Suppose at slot t , input i is connected to its anchor output K via
middle-stage port j and a packet is sent to join VOQ2( j,k ). From (2.2), middle port j is
connected to each output in descending order of the output port number. Then if K ≠ k,
this packet will experience exactly (K − k) modulo N slots of delay in VOQ2(j,k) due to
the single packet buffer at VOQ2(j,k). If K = k, this packet can only be sent when
middle port j connects to output K again, so its middle-stage delay is N time slots.
In short, this packet will experience exactly d slots of delay as calculated by (2.4), and d is
bounded by [1, N]. #
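Statement 2 can also be checked mechanically. The sketch below (again assuming the cyclic-shift form j = (i + t) mod N for (2.1), which is not reproduced in this excerpt) compares the closed form (2.4) against a direct simulation of the second-stage sequence (2.2):

```python
N = 8

def anchor_output(i, n=N):
    # Statement 1: K = (i + N - 1) mod N.
    return (i + n - 1) % n

def delay_formula(K, k, n=N):
    # Eq. (2.4): fixed middle-stage delay of flow(i, k), K the anchor output.
    if K == k:
        return n
    return K - k if K > k else K - k + n

def delay_by_simulation(i, k, t, n=N):
    # A packet of flow(i, k) sent at slot t joins middle port j = (i + t) mod n
    # (assumed form of (2.1)); it departs at the first later slot in which
    # port j is connected to output k under (2.2). Minimum delay is 1 slot.
    j = (i + t) % n
    d = 1
    while (j + n - 1 - (t + d)) % n != k:
        d += 1
    return d

for i in range(N):
    K = anchor_output(i)
    for k in range(N):
        for t in range(N):
            assert delay_by_simulation(i, k, t) == delay_formula(K, k)
```

For every flow and every sending slot the simulated delay equals (2.4), confirming that the delay is deterministic and independent of t.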
Statement 3 (In-order Packet Delivery). In-order packet delivery is
guaranteed if the joint sequence of configurations is constructed using (2.1) and (2.2).
Proof: Assume packets A and B of flow(i,k ) join VOQ2( j1,k ) and VOQ2( j2,k )
at time t A and t B (where t B>t A), respectively. Let d A and d B be their respective delays
experienced in VOQ2. Mis-sequencing occurs only if packet B reaches output k
earlier than packet A, i.e. t A+d A>t B+d B. However, this will never happen because
t B>t A and d A=d B from Statement 2. #
It can be easily seen that the delay a packet experiences at a middle-stage
port is bounded between [1, N] slots, and the average middle-stage packet delay is
merely (N+1)/2 slots for uniform traffic. From Fig. 2.2, we can see that some joint
sequences have the staggered symmetry property only, some have the in-order packet
delivery property only, and some have both properties. For instance, the joint
sequence in Fig. 2.2(b) has the staggered symmetry property but cannot ensure in-
order packet delivery. Consider packets from flow(0,1). Two different middle-stage
delays will be experienced, 2-slot via middle port 3 and 4-slot via middle port 1. This
causes packet out of order. On the other hand, the joint sequence in Fig. 2.2(c) can
provide in-order packet delivery but lacks the staggered symmetry property. The
systematic study of joint sequences is carried out in Chapter 3, but as far as this
chapter is concerned, we focus only on the joint sequence in Fig. 2.2(a).
2.3.4 Feedback-Based Scheduling Algorithms
Based on the received occupancy vector, each input port selects a packet for
sending. Such an input port scheduler should be designed to avoid both buffer
overflow and underflow at the connected middle-stage VOQ. Suppose input i is
connected to middle-stage port j at slot t, and its anchor output is K. Based on the N-bit
occupancy vector received from middle-stage port j in the previous slot t−1, we find the
candidate set Sj, i.e. the set of VOQ2(j,k)'s (for k = 0, 1, …, N−1) with 0-occupancy. Input i
can only choose the HOL packet of a VOQ1(i,h) whose VOQ2(j,h) is in Sj for sending. This avoids buffer
overflow at VOQ2( j,h).
From Fig. 2.2(a), we can see that middle port j is connected to each output in
descending order of the output port number. Therefore, we know a priori that in the
next slot t+1, port j will be connected to output K−1 (wrapped around modulo N). If
VOQ2(j,K−1) is empty and VOQ1(i,K−1) is not, we will face an underflow in
VOQ2(j,K−1) at slot t+1. As such, the scheduling algorithm should always give the
highest priority to scheduling the HOL packet of VOQ1(i,K−1) at slot t. With the above
considerations in mind, we present three simple input port schedulers below.
RR (Round-Robin): If VOQ1(i,h′) was selected in the previous slot, then the
next non-empty VOQ1(i,h) with VOQ2(j,h) ∈ Sj is selected. Comment: RR
gives fair access to each VOQ1, and RR is amenable to hardware
implementation [33].
LQF (Longest Queue First): Among all the non-empty VOQ1(i,h)'s with
VOQ2(j,h) ∈ Sj, the one with the longest queue size is selected. Comment:
LQF is good for non-uniform traffic, but requires O(N) comparisons. We can
replace it by Quasi-LQF [34], a very efficient sub-optimal LQF algorithm
requiring only a single comparison per time slot.
EDF (Earliest Departure First): Among all the non-empty VOQ1(i,h)'s with
VOQ2(j,h) ∈ Sj, the one with the earliest departure time at the middle-stage
port is selected. The departure time is calculated from (2.4). Comment: EDF
should not be confused with the classic Earliest Deadline First. Our EDF aims
at minimizing the chance of buffer overflow at each VOQ2, which is achieved
by always giving priority to the VOQ1 with the minimum middle-stage delay
to send first.
Take an example. Assume a 4×4 feedback switch is configured by the joint
sequence of Fig. 2.2(a), and at time slot 0 a packet of VOQ1(0,0) is sent. Assume
further that at time slot 1 there are 1, 2, 0 and 3 packets in VOQ1(0,0),
VOQ1(0,1), VOQ1(0,2) and VOQ1(0,3) respectively, and the feedback indicates that the
corresponding middle-stage buffer for output port 0 is not empty. Therefore, only
VOQ1(0,1) and VOQ1(0,3) are legitimate candidates, i.e. VOQ2(j,1), VOQ2(j,3)
∈ Sj. Then at time slot 1, RR and EDF would select the packet at VOQ1(0,1) for
sending, but LQF would transmit the HOL packet of VOQ1(0,3).
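This 4×4 example can be reproduced in a few lines of Python. The RR pointer position and the tie-breaking order below are our assumptions, chosen to match the example; the EDF rule uses the delay formula (2.4) with the anchor output K = 3 of input 0.

```python
N = 4
anchor_K = (0 + N - 1) % N              # anchor output of input 0 is 3

def middle_delay(k, K=anchor_K, n=N):
    # Eq. (2.4): fixed middle-stage delay of flow(0, k).
    return n if K == k else (K - k) % n

queue_len = {0: 1, 1: 2, 2: 0, 3: 3}    # VOQ1(0,k) lengths at slot 1
occupied = {0}                          # feedback: VOQ2(j,0) holds a packet
candidates = [k for k in range(N)
              if queue_len[k] > 0 and k not in occupied]   # legitimate VOQ1s

last = 0                                # VOQ1(0,0) was served at slot 0
rr = next((last + d) % N for d in range(1, N + 1)
          if (last + d) % N in candidates)        # round-robin scan from last+1
lqf = max(candidates, key=lambda k: queue_len[k])  # longest queue first
edf = min(candidates, key=middle_delay)            # earliest departure first

assert (rr, lqf, edf) == (1, 3, 1)
```

RR and EDF pick VOQ1(0,1) while LQF picks VOQ1(0,3), matching the text.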
To give a scheduler more time to execute, batch scheduling [35] can be used,
where a single scheduling decision is made over a batch of time slots (instead of per
slot). Packets arriving in the current batch of slots will be considered in the next batch.
Indeed, the multi-cabinet implementation of the feedback-based switch in Chapter 6
belongs to this category.
2.4 Performance Evaluations
In this section, the performance of our proposed feedback-based scheduling
algorithms is compared with some representative algorithms by simulations. In the
following, we only present simulation results for a switch of size N = 32, although
similar conclusions apply to other sizes (unless explicitly stated otherwise, the default
switch size is N = 32 in all simulation results of this thesis). In our simulations,
we focus on studying the performance of the three proposed feedback-based
scheduling algorithms in Section 2.3, i.e. round robin (RR), longest queue first (LQF)
and earliest departure first (EDF). For comparison, we also implement:
LQF with byte-focal switch architecture (LQF_Byte-Focal) [23], which
outperforms FOFF and in general is the best performing algorithm based on
resequencing buffer.
CR algorithm [29], which is the best performing frame-based scheduling
algorithm.
iSLIP algorithm [15], which serves as a benchmark for single-stage input-
queued switches. Specifically, we implement iSLIP with a single iteration
(iSLIP-1), as multiple iterations involve heavy communication overhead.
Output-queued switch, which serves as a lower bound.
2.4.1 Performance under Uniform Traffic
Uniform traffic is generated as follows. At each time slot for each input, a
packet arrives with probability p and is destined to each output with equal probability.
Fig. 2.4 shows the delay-throughput performance under uniform traffic. We can see
that the three input port schedulers RR, LQF and EDF yield comparable,
less-than-20-slot delay performance for input load up to p = 0.9. When p > 0.94, LQF gives the
best performance (as it always serves the most needed flow first), followed by
EDF and RR. The average packet delay at middle-stage ports can be easily derived:
(1+ N )/2 = 16.5 time slots. If we deduct this portion from the overall delay, we can
see that the (input port) delay of our scheduling algorithms matches the output-
queued switch performance very well. Compared with LQF_Byte-Focal, our three
schedulers give significantly smaller delay. When p is reasonably large (>0.6), our
algorithms also beat iSLIP and CR. When p=0.7, the delay of LQF_Byte-Focal is 95
time slots, iSLIP 44, CR 152 and ours only 20.
Fig. 2.4 Delay vs input load p, with uniform traffic.
2.4.2 Performance under Uniform Bursty Traffic
Bursty arrivals are modeled by the ON/OFF traffic model. In the ON state, a
packet arrives in every time slot. In the OFF state, no packets are generated.
Packets of the same burst have the same output, and the output for each burst is
uniformly distributed. Given an average input load p and an average burst size sp, the
state transition probability from OFF to ON is p/[sp(1−p)] and that from ON to OFF is
1/sp. Without loss of generality, we set the burst size sp = 30 packets.
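A minimal sketch of this ON/OFF source follows (the helper name and structure are our own, not code from the thesis). With the transition probabilities above, the stationary ON fraction, and hence the empirical load, should be close to p:

```python
import random

def onoff_source(p, sp, n_slots, seed=1):
    # Two-state ON/OFF arrival process for one input.
    # P(OFF -> ON) = p / [sp * (1 - p)],  P(ON -> OFF) = 1 / sp,
    # giving average load p and mean burst size sp. One packet per ON slot.
    rng = random.Random(seed)
    on = False
    arrivals = []
    for _ in range(n_slots):
        if on and rng.random() < 1.0 / sp:
            on = False
        elif not on and rng.random() < p / (sp * (1.0 - p)):
            on = True
        arrivals.append(1 if on else 0)
    return arrivals

a = onoff_source(p=0.6, sp=30, n_slots=500_000)
load = sum(a) / len(a)
assert abs(load - 0.6) < 0.03   # empirical load is close to p
```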
Fig. 2.5 shows the delay-throughput performance under uniform-bursty traffic.
In Fig. 2.5, we can see that delay builds up quickly with input load, which is due to
the bursty traffic nature. Nevertheless, our RR, LQF and EDF still outperform the
LQF_Byte-Focal and CR algorithms. At p = 0.8, the delay of LQF_Byte-Focal is 224
time slots, 232 for CR, 156 for our RR/LQF/EDF, and 114 for the output-queued switch.
Fig. 2.6 shows the delay performance of LQF under uniform-bursty traffic with
different burst sizes. We can see that average packet delay increases almost linearly
with burst size.
Fig. 2.5 Delay vs input load p, with uniform bursty traffic.
2.4.3 Performance under Hotspot Traffic
Packets arrive at each input port in each time slot with probability p. Packet
destinations are generated as follows. For input port i, a packet goes to output i+N/2
(mod N) with probability ½, and goes to any other output with probability 1/[2(N−1)]. Fig. 2.7
shows the delay-throughput performance under hotspot traffic.
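The hotspot destination rule can be sketched as below (a hypothetical generator, not code from the thesis); we take the hotspot output modulo N so that it wraps around for inputs i ≥ N/2. The empirical frequencies should match ½ for the hotspot output and 1/[2(N−1)] for every other output:

```python
import random

def hotspot_dest(i, n, rng):
    # Hotspot pattern: output (i + n/2) mod n with probability 1/2,
    # any other output with probability 1 / (2 * (n - 1)).
    hot = (i + n // 2) % n
    if rng.random() < 0.5:
        return hot
    return rng.choice([k for k in range(n) if k != hot])

rng = random.Random(42)
N, trials = 8, 100_000
counts = [0] * N
for _ in range(trials):
    counts[hotspot_dest(0, N, rng)] += 1

assert abs(counts[4] / trials - 0.5) < 0.01          # hotspot output of input 0
assert all(abs(counts[k] / trials - 1 / 14) < 0.01   # 1 / [2*(N-1)] = 1/14
           for k in range(N) if k != 4)
```

Note that with this pattern every output receives an aggregate load of p/2 + (N−1)·p/[2(N−1)] = p, so the traffic remains admissible for p ≤ 1.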
Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes.
Fig. 2.7 Delay vs input load p, with hot-spot traffic.
From Fig. 2.7, again we can see that our three schedulers are consistently
better than the others, and among the three, LQF again gives the best (lowest) delay
performance. Nevertheless, it is interesting to point out that the performance
difference among the three schedulers is much smaller than that in a single-stage
switch; this is due to the use of the first stage switch for load balancing. For
simplicity, we shall concentrate only on LQF below.
2.5 The Stability of Feedback-Based Two-Stage Switch
Simulation results in the previous section allow us to study the average
performance under specific traffic patterns. In this section, we prove that under a
speedup of two, a feedback-based switch using any arbitrary work-conserving
port-based scheduling algorithm (not just RR, LQF and EDF) is stable under any
admissible traffic pattern.
2.5.1 The Existing Approaches
Generally there are two approaches to proving 100% throughput: the
Lyapunov method and the fluid model. The Lyapunov method consists of
three steps [14,17]. First, model the VOQ-length process by a Markov chain. Then
convert the stability problem to a linear programming problem. Finally, use
appropriate Lyapunov functions. Based on this approach, switches using MWM [12],
MSM [14] and CIOQ [17] are proved to be stable.
In the Lyapunov method, the packet arrival process at each input is required to
be Bernoulli i.i.d. (independent and identically distributed). To remove this limitation,
the fluid model approach can be used. Under the assumption that the packet arrival
process at each input obeys the strong law of large numbers, a much broader class of traffic
can be accounted for. The 100% throughput proofs for MWM and CIOQ in [36], and
for the buffered crossbar switch in [20,37], are based on the fluid model.
2.5.2 Fluid Model for Feedback-Based Two-Stage Switch
Like [20,36-37], we first establish a fluid model for scheduling packets. Let
the number of packets in VOQ1(i, j) at the beginning of time slot n be Z ij(n). Let the
cumulative number of arrivals and departures for VOQ1(i, j) at the beginning of slot n
be Aij(n) and Dij(n), respectively. We have:
Zij(n) = Zij(0) + Aij(n) − Dij(n),  n ≥ 0,  i, j = 1, …, N (2.5)
Let the number of packets in VOQ2(i,j) at the beginning of slot n be Bij(n).
Because there is only one packet buffer for each VOQ2(i,j), we have Bij(n) = 0 if
VOQ2(i,j) is empty and Bij(n) = 1 if VOQ2(i,j) is occupied. Let the cumulative numbers of
arrivals and departures for VOQ2(i,j) at the beginning of slot n be Xij(n) and Yij(n),
respectively. The following relationship holds:

Bij(n) = Bij(0) + Xij(n) − Yij(n),  n ≥ 0,  i, j = 1, …, N (2.6)
We assume that the packet arrival process obeys the strong law of large
numbers with probability one, i.e.
lim n→∞ Aij(n)/n = λij,  i, j = 1, …, N,

where λij is the mean packet arrival rate to VOQ1(i,j). The switch is, by definition,
rate stable if:
lim n→∞ Dij(n)/n = λij,  i, j = 1, …, N.
An admissible traffic matrix is defined as the one that satisfies the following
constraints.
Σi λij ≤ 1  and  Σj λij ≤ 1,  i, j = 1, …, N (2.7)
If a switch is rate stable for an admissible traffic matrix, then the switch delivers
100% throughput.
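Condition (2.7) can be checked directly on a rate matrix; the sketch below (with hypothetical example matrices of our own) verifies that no input row and no output column is oversubscribed:

```python
def is_admissible(lam):
    # Admissibility condition (2.7): every row sum (per input) and every
    # column sum (per output) of the rate matrix is at most 1.
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) <= 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) <= 1 for j in range(n))
    return rows_ok and cols_ok

uniform = [[0.9 / 4] * 4 for _ in range(4)]         # uniform traffic, load 0.9
assert is_admissible(uniform)
assert not is_admissible([[0.6, 0.6], [0.0, 0.0]])  # input 0 oversubscribed
```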
The fluid model is determined by a limiting procedure illustrated below. First,
the discrete functions are extended to right continuous functions. For arbitrary time t
∈ [n, n+ 1):
Aij(t ) = Aij(n);
Z ij(t ) = Z ij(n);
Dij(t ) = Dij(n) + (t - n)( Dij(n + 1) - Dij(n) );
X ij(t ) = X ij(n);
Bij(t ) = Bij(n);
Y ij(t ) = Y ij(n) + (t - n)( Y ij(n + 1) - Y ij(n) );
Note that all functions are random elements of D[0, ∞). We shall sometimes
use the notation Aij(·,ω), Zij(·,ω), Dij(·,ω), Xij(·,ω), Bij(·,ω), and Yij(·,ω) to explicitly
denote the dependency on the sample path ω. For a fixed ω, at time t, we have [36]:
Aij(t ,ω), the cumulative number of arrivals to VOQ1(i, j)
Z ij(t ,ω), the number of packets in VOQ1(i, j)
Dij(t ,ω), the cumulative number of departures from VOQ1(i, j)
X ij(t ,ω), the cumulative number of arrivals to VOQ2(i, j)
Bij(t ,ω), the number of packets in VOQ2(i, j)
Y ij(t ,ω), the cumulative number of departures from VOQ2(i, j)
For each r > 0, we define the scaled processes

Aʳij(t,ω) = (1/r)·Aij(rt,ω);   Zʳij(t,ω) = (1/r)·Zij(rt,ω);   Dʳij(t,ω) = (1/r)·Dij(rt,ω);
Xʳij(t,ω) = (1/r)·Xij(rt,ω);   Bʳij(t,ω) = (1/r)·Bij(rt,ω);   Yʳij(t,ω) = (1/r)·Yij(rt,ω).
It is shown in [20,37] that for each fixed ω satisfying (2.5), (2.6) and any sequence
{rn} with rn → ∞ as n → ∞, there exists a subsequence {rnk} and continuous
functions Āij(·), Z̄ij(·), D̄ij(·), X̄ij(·), B̄ij(·), Ȳij(·) such that, along this subsequence
(r = rnk), the scaled processes converge uniformly on compacts as k → ∞, for any t ≥ 0:

Aʳij(t,ω) → λij·t;   Zʳij(t,ω) → Z̄ij(t);   Dʳij(t,ω) → D̄ij(t);
Xʳij(t,ω) → X̄ij(t);   Bʳij(t,ω) → B̄ij(t);   Yʳij(t,ω) → Ȳij(t). (2.8)
Definition 1: Any function obtained through the limiting procedure in (2.8) is
said to be a fluid limit of the switch. So the fluid model equations using our proposed
scheduling algorithms are:
Z̄ij(t) = Z̄ij(0) + λij·t − D̄ij(t),  t ≥ 0 (2.9)

B̄ij(t) = B̄ij(0) + X̄ij(t) − Ȳij(t),  t ≥ 0 (2.10)
Definition 2: The fluid model of a switch operating under a scheduling
algorithm is said to be weakly stable if for every fluid model solution (D̄, Z̄)
with Z̄(0) = 0, we have Z̄(t) = 0 for almost every t ≥ 0.
From [36], the switch is rate stable if the corresponding fluid model is weakly
stable. Our goal here is to prove that for every fluid model solution (D̄, Z̄) using our
scheduling algorithms, Z̄(t) = 0 for almost every t. To prove this, we will use the
following Fact 1 from [36]:

Fact 1: Let f be a non-negative, absolutely continuous function defined on R+∪{0}
with f(0) = 0. Assume that (d/dt)f(t) ≤ 0 for almost every t such that f(t) > 0. Then f(t) = 0
for almost every t ≥ 0. (Note that R+ is the set of positive real numbers, and (d/dt)f(t)
denotes the derivative of the function f(t) at time t.)
2.5.3 100% Throughput Proof
In the following, we show that our proposed scheduling algorithms give
100% throughput. The result is quite strong in the sense that it holds for any arbitrary
work-conserving scheduling algorithm with a speedup of two. In other words, each
input i can choose to serve any non-empty VOQ1(i,k ) for which VOQ2( j,k ) is empty.
Theorem 1: (Sufficiency) A work-conserving scheduling algorithm can
achieve 100% throughput with a speedup of two for any admissible traffic pattern
obeying the strong law of large numbers.
Proof: Let C̄ij(t) denote the joint queue occupancy of all packets arrived at
input port i, plus all packets destined for output j. We have

C̄ij(t) = Σp Z̄ip(t) + Σm [Z̄mj(t) + B̄mj(t)] (2.11)

Z̄(t) and B̄(t) are all non-negative, absolutely continuous functions, so C̄ij(t) is
non-negative and absolutely continuous too. We can see that C̄ij(0) = 0, and then we have

(d/dt)C̄ij(t) = Σp (d/dt)Z̄ip(t) + Σm [(d/dt)Z̄mj(t) + (d/dt)B̄mj(t)]

Combined with (2.9) and (2.10), we get

(d/dt)C̄ij(t) = Σp [λip − (d/dt)D̄ip(t)] + Σm [λmj − (d/dt)D̄mj(t) + (d/dt)X̄mj(t) − (d/dt)Ȳmj(t)]

With a work-conserving scheduling algorithm, packets that leave VOQ1(m,j) enter
VOQ2(m,j), for m = 1, …, N, so Σm (d/dt)D̄mj(t) = Σm (d/dt)X̄mj(t). Then

(d/dt)C̄ij(t) = Σp λip + Σm λmj − Σp (d/dt)D̄ip(t) − Σm (d/dt)Ȳmj(t).

From the admissible traffic condition (2.7), we get

(d/dt)C̄ij(t) ≤ 2 − [Σp (d/dt)D̄ip(t) + Σm (d/dt)Ȳmj(t)] (2.12)
For any non-empty VOQ1(i, j), i.e.0)( t Z ij , then by continuity of )(t Z , such that
0)'( t Z ij for ],[ t t t . Set
)(min],[
t Z a ijt t t
.
For large enough k , we have 2/)( at Z k nr
ij for ],[ t t t . Also, for large
enough k we have .12/ ar k n Thus
1)( t Z ij for )],(,[ t r t r t
k k nn which means
that VOQ1(i, j) holds at least one packet in the long interval )].(,[ t r t r k k nn With a
work-conserving scheduling algorithm, flow(i,j) packets always experience the same
fixed middle-stage port delay of d slots, where d is given by (2.4). During the time
interval [r_{n_k} t, r_{n_k} \hat{t}], when input port i is connected to any middle port g:
- if VOQ2(g,j) is empty, a packet is transmitted from input port i to middle port g, and D_{ig}(t) is increased by one;
- if VOQ2(g,j) is not empty, the packet in VOQ2(g,j) will be transmitted to output port j with a fixed delay of q slots, where q = d mod N, and \sum_m Y_{mj}(t) will be increased by one after q slots. (The packet in VOQ2(g,j) is sent when middle port g is connected to output j. If this occurs in the current time slot, q = 0; otherwise, it takes another q = d slots.)
If the switch is operated with a speedup of S, then over the long time interval
[r_{n_k} t, r_{n_k} \hat{t}] it fulfills

\sum_p [D_{ip}(r_{n_k} \hat{t}) - D_{ip}(r_{n_k} t)] + \sum_m [Y_{mj}(r_{n_k} \hat{t} + q) - Y_{mj}(r_{n_k} t + q)] \ge S r_{n_k} (\hat{t} - t).
Note that \sum_m Y_{mj}(t) is monotonically non-decreasing and increases by at most one in
every time slot. So we have

\sum_m Y_{mj}(r_{n_k} \hat{t} + q) \le \sum_m Y_{mj}(r_{n_k} \hat{t}) + q,

\sum_m Y_{mj}(r_{n_k} t + q) \ge \sum_m Y_{mj}(r_{n_k} t).
Combining them together, we have

\sum_p [D_{ip}(r_{n_k} \hat{t}) - D_{ip}(r_{n_k} t)] + \sum_m [Y_{mj}(r_{n_k} \hat{t}) - Y_{mj}(r_{n_k} t)] + q \ge S r_{n_k} (\hat{t} - t).
Since q is pre-determined and within [0, N-1], its impact is insignificant in the fluid
limit [20]. Dividing the above inequality by r_{n_k} and letting k \to \infty, the fluid limits are
obtained as

\sum_p [\bar{D}_{ip}(\hat{t}) - \bar{D}_{ip}(t)] + \sum_m [\bar{Y}_{mj}(\hat{t}) - \bar{Y}_{mj}(t)] \ge S (\hat{t} - t).
Further dividing the above inequality by (\hat{t} - t) and letting \hat{t} \to t, the
derivative of the fluid limit is

\sum_p \dot{\bar{D}}_{ip}(t) + \sum_m \dot{\bar{Y}}_{mj}(t) \ge S.    (2.13)
With a speedup of two (i.e. S = 2), combining (2.12) and (2.13), we get

\dot{\bar{C}}_{ij}(t) \le 0

whenever \bar{Z}_{ij}(t) > 0. Based on Fact 1, \bar{C}_{ij}(t) = 0 for almost every t \ge 0. Due to (2.11)
and \bar{C}_{ij}(t) = 0, we have \bar{Z}_{ij}(t) = 0 for almost every t \ge 0. Theorem 1 is proved. #
It should be noted that existing stability proofs [21-30] adopt a common
approach of showing that the delay performance of a specific algorithm is within a
finite bound of that of the output-queued switch. Since the buffer size at each middle-stage
port is usually assumed to be infinite, the derived bound with respect to the
output-queued switch can be unrealistically large.
2.6 Chapter Summary
In this chapter, a framework for designing feedback-based scheduling
algorithms was proposed to elegantly solve the notorious packet mis-sequencing
problem of a load-balanced switch without sacrificing the switch’s delay and
throughput performance. Unlike existing approaches, we showed that the efforts
made in load balancing and keeping packets in order can complement each other.
Specifically, at each middle-stage port between the two switch fabrics of a load-
balanced switch, only a single-packet-buffer for each VOQ is required. In-order
packet delivery is made possible by properly selecting and coordinating the two
sequences of switch configurations to form a joint sequence with both staggered
symmetry property and in-order packet delivery property. Compared with the
existing load-balanced switch architectures and scheduling algorithms, our solutions
impose the most modest requirements on switch hardware, yet consistently yield the best
delay and throughput performance under various traffic conditions.
Chapter 3
Cutting Down Average Packet Delay
3.1 Introduction
For an N × N switch, there are N^2 input-output pairs, and thus it needs to carry
a total of N^2 different packet flows. In a feedback-based switch (Fig. 3.1), although
the amount of middle-stage port delay experienced by packets of the same flow is the
same, packets of different flows may experience different middle-stage port delays.
The feedback-based switch in Fig. 3.1 is configured with the joint sequence in Fig.
3.2(a). Flow(0,1) packets will experience a 2-slot middle port delay, e.g. arriving at
middle port 0 at t = 0 and leaving at t = 2. On the other hand, flow(0,2) packets will only
experience a 1-slot middle port delay, e.g. arriving at middle port 0 at t = 0 and leaving
at t = 1. Assume flow(0,1) and flow(0,2) are the only flows in the switch, and the
packet arrival rate of flow(0,1) is much higher than that of flow(0,2). To minimize
the average packet delay, can we swap the two flows such that flow(0,1)
packets experience the 1-slot middle port delay instead? In general, if the traffic rate
matrix of a switch is known (e.g. by measurement), can we cut down the average
middle-stage packet delay by assigning heavy flows to experience smaller middle-stage
delays? This problem is investigated in this chapter along two directions.
Fig. 3.1 The feedback-based two-stage switch architecture.
First, from Chapter 2 we know that there exists a set of joint sequences with
both staggered symmetry and in-order packet delivery properties (the joint sequence
in Fig. 3.2(a) is just a particular instance). Then, for a given traffic matrix, we try to
find an optimal joint sequence that minimizes the average middle-stage delay. But
the search involves rather tedious computation. A more practical solution is therefore
proposed: adding another stage of switch fabric to dynamically map heavy flows
to smaller middle-stage port delays. We call it a feedback-based three-stage
switch.
The rest of this chapter is organized as follows. In the next section, we design
the optimal joint sequence for feedback-based two-stage switch under specific traffic.
In Section 3.3, the three-stage switch architecture is introduced to minimize the
average middle-stage delay. Finally, we conclude this chapter in Section 3.4.
Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch.
3.2 Optimal Joint Sequence Design
A feedback-based two-stage switch has a single packet buffer at each middle-
stage VOQ2( j,k ). It is configured by a pre-determined joint sequence of N joint
configurations. A joint sequence consists of two (component) sequences of N
configurations, one for each switch stage, called first stage sequence and second
stage sequence. From Fig. 3.2 and our discussion in Chapter 2, we can see that some
joint sequences have the staggered symmetry property only, some have in-order
packet delivery property only, some have both properties, and yet more have none of
the properties (not shown). The relationship among them can be described by Fig. 3.3.
For a feedback-based switch to properly function, a joint sequence should have both
staggered symmetry and in-order packet delivery properties. To find the optimal joint
sequence for a given traffic matrix, we have to answer the following two questions:
1. What is the necessary and sufficient condition for both staggered symmetry
and in-order delivery in a feedback-based two-stage switch?
2. How many such joint sequences exist?
Fig. 3.3 The relation between staggered symmetry and in-order delivery.
A broader sense definition of feedback-based switch shall be adopted in this
section to denote any load-balanced switch with single packet buffer at each middle-
stage VOQ2( j,k ). If staggered symmetry property is also required, we spell it out
explicitly as feedback-based switch with staggered symmetry property.
3.2.1 In-Order Packet Delivery Only
Statement 4: A constant middle-stage delay for all packets belonging to the
same flow (and for all N^2 flows) is a necessary and sufficient condition for in-order
packet delivery in a feedback-based switch.
Proving sufficient condition: If the middle-stage delay is constant for all
packets of the same flow, then middle-stage ports will not cause any packet
out-of-order problem. #
Note that this sufficient condition is not limited to the feedback-based two-
stage switch architecture. It can be applied to other load-balanced switch
architectures [21, 31].
Proving necessary condition: In a feedback-based switch, assume flow(i,k )
packets do not experience a constant middle port delay. Nevertheless, based on the
periodicity of joint sequence, the middle-stage port delay is always bounded between
[1, N ] slots and if packets A and B of flow(i,k ) enter middle-stage ports at time slot t
and t + N respectively, they will still experience the same middle port delay. Therefore,
there exist packets C and D belonging to flow(i,k ), such that C enters a middle-stage
port at slot t and experiences a middle-stage delay of d s slots, whereas D enters a
middle-stage port at slot t +1 and experiences a middle-stage delay of d (d < d s) slots.
Because d and d_s are both positive integers,

d + 1 \le d_s.

Packets C and D leave the middle-stage ports (and thus the switch, as there is no
output buffer) at slots t + d_s and t + 1 + d, respectively. If d + 1 = d_s, then t + d_s = t + 1 + d, which
means C and D would leave at the same time slot. This contradicts the property of the
joint sequence, so d + 1 \ne d_s, i.e.
d + 1 < d_s.    (3.1)

From (3.1), we get t + 1 + d < t + d_s. In other words, packet C leaves its middle port
after D, causing packets to go out of sequence. As a non-constant middle-stage delay
causes packets to go out of sequence, this proves the necessary condition. #
This necessary condition is only valid under the feedback-based two-stage
switch architecture, which means that there is only one packet buffer for every
VOQ2( j,k ) and two switch fabrics are configured by a joint sequence. Note that in a
feedback-based two-stage switch, the middle-stage port delay is always bounded
between [1, N ] slots. If the middle port delay is not upper bounded by N slots, the
necessary condition may fail. For example, if every subsequent packet of a flow incurs
a larger middle-stage delay than the previous packet, in-order delivery
could still be sustained.
In Fig. 3.2(c), we can see that each input port always connects to a fixed
output port (via some middle-stage port) in all time slots. We call this anchor output
property. In this case, outputs 0, 1, 2 and 3 are the anchor outputs of inputs 0, 1, 2
and 3, respectively. Further consider input 0 in Fig. 3.2(c), it connects to middle ports
0, 1, 2 and 3 in a cyclic manner in each subsequent time slot. We denote this cycle by
(0, 1, 2, 3). Similarly, we can see that inputs 1, 2 and 3 connect to middle ports
following cycles (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0), respectively. Indeed, (0, 1, 2,
3), (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0) are just different ways to express the same
cycle (0, 1, 2, 3). If all input ports of a switch connect to middle ports following the
same cycle, we say the sequence of N configurations is ordered. We can see that both
first and second sequences of configurations in Fig. 3.2(a) and (c) are ordered.
Statement 5: If a joint sequence of configurations has the anchor output
property, and one of its two sequences is ordered, then the other sequence is also
ordered.
Proof : Without loss of generality, let the first stage sequence of configurations
be ordered based on cycle ( j1, j2, j3, j4 ... j N ). At time slot t , let middle ports j1, j2 , j3 ,
j4 ... j N be connected by input ports i1, i2 , i3 , i4 ... i N respectively. Further let k 1, k 2 , k 3 ,
k 4 ... k N be the anchor outputs for i1, i2 , i3 , i4 ... i N . We can get the generic joint
configuration at time slot t as shown in Fig. 3.4:
Fig. 3.4: The generic joint configuration at time slot t.
Similarly, the joint configurations at each subsequent time slot up to t + N -1
can be constructed based on the anchor output and ordered sequence properties as
shown in Fig. 3.5, resulting in a joint sequence of N joint configurations. By
construction, we can see that the second sequence of configurations (identified by
solid lines) is also ordered, and follows the cycle (k_1, k_N, k_{N-1}, k_{N-2}, ..., k_2). #
Fig. 3.5: Generic joint sequence with anchor output and ordered properties.
If the two component sequences of a joint sequence are both ordered, we say
the joint sequence has the ordered property. Note that the tuple (i x, j x, k x) could take
any value in [0, N -1], so the joint sequence in Fig. 3.5 is a generic expression for all
possible joint sequences with anchor output and ordered properties.
Statement 6: Anchor output and ordered properties are the necessary and
sufficient condition for a constant middle port delay for packets of the same flow in a
feedback-based two-stage switch.
Proving sufficient condition: Let the first stage sequence of configurations be
ordered based on cycle ( j1, j2, j3, j4 ... j N ). Further let k 1, k 2 , k 3 , k 4 ... k N be the anchor
outputs for i1, i2 , i3 , i4 ... i N . From Statement 5, the second sequence of configurations
is also ordered, and follows the cycle (k_1, k_N, k_{N-1}, k_{N-2}, ..., k_2). This joint sequence is
shown in Fig. 3.5. Consider a packet A of flow(i1,k N ) being transmitted to some
middle port j. Due to anchor output, j connects to (anchor) output port k 1 (of input i1)
at the current time slot. Packet A arrives and waits at middle port j until j connects to
k_N. Since the second stage sequence of configurations is ordered in the cycle (k_1, k_N,
k_{N-1}, k_{N-2}, ..., k_2), j will connect to output port k_N after one time slot. That means for any
arbitrary middle port j, the middle-stage port delay for flow(i_1, k_N) is always 1 slot.
Repeating the above procedure for all N^2 possible flows, we can see that each flow has a
constant middle-stage delay. The sufficient condition is proved. #
Proving necessary condition: In a feedback-based two-stage switch, the
middle port delay is bounded between [1, N ] slots. Due to the connectivity of a joint
sequence, different flows arrived at an input port must experience distinct amount of
middle-stage port delays. In other words, at each input port, there exists exactly one
flow(i,k ) experiencing a constant middle-stage port delay of d time slots, for d =
1, …, N . Assume flow(i,k ) experiences the constant middle-stage port delay of d = N
time slots. At time slot t, input i connects to some middle-stage port j′, and j′ connects
to some output port k′. If a packet B of flow(i,k) is transmitted to middle port j′ in this
slot, because of the constant N time slots middle-stage port delay for flow(i,k ), j′ will
connect to output port k after N slots. The joint sequence is periodic with a cycle of N
slots, so k=k ′. For arbitrary time slot t and middle port j′, k=k ′ is always true and this
shows that output k is the anchor output for input i. Repeating the above procedure for
all input ports, we can see that each input port has a distinct anchor output port. This
shows that anchor output is the necessary condition for a constant middle-stage port
delay.
At input i, there exists flow(i,k ′) with constant 1-slot middle-stage port delay.
Let output k be the anchor output for i, as proved above. When input i connects to
some middle-stage port j at any time slot t , due to anchor output property, j connects
to output k . At current time slot, if a packet C of flow(i,k ′) is sent to middle port j,
then one time slot later, j will connect to output k ′ to keep the constant 1-slot middle
port delay. Since both middle port j and time slot t are arbitrarily selected, all middle
ports connect to output ports k and k ′ following the same order of k first and then k ′.
Repeating the above process from the 1-slot middle-stage port delay up to (N-1) slots, we can
show that all middle ports connect to output ports following the same ordered sequence.
This proves the necessary condition. #
From Statements 4 and 6, we can directly get Statement 7:
Statement 7: Anchor output and ordered properties are the necessary and
sufficient condition for packet in-order delivery in a feedback-based two-stage switch.
3.2.2 Both In-Order Packet Delivery and Staggered Symmetry
Statement 8: If one sequence of configurations is ordered and the other
sequence is constructed by the staggered symmetry property, then the resulting joint
sequence has the anchor output property.
Proof : Staggered symmetry property refers to the fact that for any middle-
stage port j, if it is connected to output k at time slot t , then at next slot (t +1) input k
is connected to the same middle-stage port j. In other words, the second
configuration at time slot t is a (vertical) mirror image of the first configuration at
time slot t +1, and the second configuration at t + N -1 wraps around to become the
mirror image of the first configuration at t .
Fig. 3.6: Joint sequence with staggered symmetry and in-order delivery.
Without loss of generality, let the first stage sequence of configurations be
ordered with cycle (j_1, j_2, j_3, j_4, ..., j_N). Due to the staggered symmetry, we can see that
the second stage sequence is also ordered, but interestingly, based on the cycle
(i_N, i_{N-1}, ..., i_2, i_1), which runs in the opposite direction to that followed by the first stage.
The resulting joint sequence is shown in Fig. 3.6. In each time slot, connection
pattern in the first stage fabric is shifted downwards once (i.e. towards the right hand
side of ( j1, j2, j3, j4 ... j N )), whereas the connection pattern in the second stage is
shifted upwards once. From an input port’s point of view, the net effect is that the
shifting in opposite direction cancel out each other, and the input connects to the
same output (but via a different middle-stage port) as in the previous time slot. This
proves the sufficient condition for Statement 8. #
Statement 9: For a feedback-based two-stage switch, the necessary and
sufficient conditions for both in-order packet delivery and staggered symmetry are:
one sequence of configurations is ordered and the other sequence is constructed by
the staggered symmetry property.
Proof : The sufficient condition for in-order packet delivery and staggered
symmetry is a direct consequence from Statements 7 and 8. If in-order packet
delivery is guaranteed, from Statement 7 an ordered sequence of configurations is a
necessary condition. Obviously the staggered symmetry property itself is the
necessary condition for both in-order packet delivery and staggered symmetry
properties. #
3.2.3 Finding the Number of Different Joint Sequences
Statement 9 answers question 1 (i.e. the necessary and sufficient
condition for both staggered symmetry and in-order delivery). For the sake of
finding the optimal joint sequence, in the following we focus on question 2 (i.e.
how many such joint sequences exist).
All possible joint sequences: To find the number of sequences that satisfy the
requirement of each input visiting each output exactly once in the sequence,
we can make use of the solution for the classic problem of the Latin square [38].
A Latin square is an N × N table filled with N different symbols in such a way
that each symbol occurs exactly once in each row and exactly once in each
column. From [38], the total number of Latin squares is given by N!(N-1)!M,
where M is the number of reduced Latin squares (and M ≥ 1). Unlike a Latin
square, in a load-balanced switch the configuration sequence is periodic with
period N, and “sequences” beginning with different starting time slots should be
counted once. Accordingly, the number of configuration sequences in the first
stage fabric is N times smaller than the number of Latin squares, or [(N-1)!]^2 M.
For a given first stage sequence, there are N!(N-1)!M ways to select
the second sequence, resulting in a total of N![(N-1)!]^3 M^2 possible joint
sequences. (Note that in this case, “sequences” with different starting time
slots are counted individually because they produce different joint sequences.)
Joint sequences with in-order delivery property only: Based on Statement
7, the number of joint sequences providing in-order packet delivery is the
product of the number of different anchor output patterns and the number of
ordered sequences. Since each input must have a distinct anchor output, there
are N! ways to select an anchor output pattern. Similarly, the number of
possible configurations in a time slot is N!, and there are (N-1)! possible
cycles that a configuration (sequence) can follow. This results in N!(N-1)!
possible choices. But among them, we only count “sequences” with different
starting time slots once, so the total number of ordered sequences is [(N-1)!]^2.
Then, the total number of joint sequences that keep packets in order is
N![(N-1)!]^2.
Joint sequences with staggered symmetry property only: If the sequence
of configurations used by one switch fabric is known, we can always
construct a unique joint sequence with the staggered symmetry property. So
the number of joint sequences with the staggered symmetry property equals the
number of possible single-stage sequences, or [(N-1)!]^2 M, where M is the
number of reduced Latin squares [38].
Number of joint sequences with both properties: From Statement 9,
once the first stage sequence of configurations is determined, so is the second
stage by the staggered symmetry property. Then the number of joint
sequences with both properties equals the number of ordered
sequences, which is given by [(N-1)!]^2. It should be noted that if all the
isomorphic joint sequences are counted once, then there are only (N-1)!
unique/non-isomorphic joint sequences, each yielding a different delay
experience at middle-stage ports.
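The [(N-1)!]^2 count of ordered sequences can be cross-checked by brute force for a small switch. A sketch (not from the thesis), assuming N = 3, where M = 1 and the two possible cycles are listed explicitly:

```python
# Sketch: enumerate all ordered single-stage sequences for N = 3, counting
# sequences that differ only in their starting time slot once.
from itertools import permutations
from math import factorial

N = 3
cycles = [(0, 1, 2), (0, 2, 1)]        # the (N-1)! distinct cycles over N middle ports

def build(p0, cycle):
    """Ordered sequence: every input advances one step along the same cycle each slot."""
    step = {cycle[x]: cycle[(x + 1) % N] for x in range(N)}
    seq, conf = [], list(p0)
    for _ in range(N):
        seq.append(tuple(conf))
        conf = [step[m] for m in conf]     # advance each input's middle port
    return tuple(seq)

def canonical(seq):
    """Identify the N rotations (different starting slots) of the same periodic sequence."""
    return min(tuple(seq[(t + s) % N] for t in range(N)) for s in range(N))

ordered = {canonical(build(p0, c)) for p0 in permutations(range(N)) for c in cycles}
assert len(ordered) == factorial(N - 1) ** 2   # [(N-1)!]^2 ordered sequences
```

The N! starting configurations times (N-1)! cycles give N!(N-1)! raw sequences; deduplicating the N rotations of each leaves [(N-1)!]^2 of them, matching the count above.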
3.2.4 Discussions
Both questions above are now addressed, and we build on them
to identify the optimal joint sequence for a given traffic matrix. Statement 9 provides
an efficient mechanism to design a joint sequence for feedback-based two-stage
switches. We first show how a joint sequence can be constructed based on Statement
9. Assume the first stage sequence of configurations is ordered based on cycle ( j1, j2,
j3, j4 ... j N ). At time slot t , let middle ports j1, j2 , j3 , j4 ... j N be connected by input ports
i1, i2 , i3 , i4 ... i N respectively. Due to the ordered property, the first stage
configurations at each subsequent time slot and up to t + N -1 can be constructed.
When the first stage sequence is obtained, the second stage sequence of
configurations can be constructed directly from the staggered symmetry property.
The resulting joint sequence is shown in Fig. 3.6. Note that the tuple ( i x, j x) could
take any value in [0, N -1], so the joint sequence in Fig. 3.6 is a generic expression for
all possible joint sequences with both in-order packet delivery and staggered
symmetry properties. By substituting all possible values for (i x, j x) into Fig. 3.6, we
can systematically find all joint sequences with both staggered symmetry and in-
order packet delivery properties.
Let us take a closer look at Fig. 3.6. We observe that the middle-stage port
delay for flow(i,i) (i = i_1, i_2, ..., i_N) is always N-1 slots. In other words, it is not
possible to map flow(i,i) (i = 0, 1, ..., N-1) to experience less than (N-1)-slot middle-stage
delay by any joint sequence satisfying Statement 9 (i.e. with both staggered
symmetry and in-order packet delivery properties). Also from Fig. 3.6, if output port j
is the anchor output for input port i, then the middle-stage port delay for flow(j,i) is
always N-2 slots, regardless of the values of i and j.
We can see that when using a joint sequence with both in-order delivery and
staggered symmetry, the delays of different flows are intricately correlated with
each other. To find the optimal joint sequence that gives the minimum overall switch
delay performance, we have to use brute force to check every possible
joint sequence in the pool of [(N-1)!]^2, which involves rather tedious computation.
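To make the cost of this search concrete, it can be sketched as follows (not the thesis's implementation). The per-flow middle-stage delays of each candidate joint sequence are obtained by simulating it directly; the traffic matrix is the 4 × 4 example of Fig. 3.9(a), and `delays`/`weighted` are hypothetical helper names introduced here:

```python
# Sketch: brute-force search over all joint sequences satisfying Statement 9
# for the minimum traffic-weighted total middle-stage delay.
from itertools import permutations

N = 4
lam = [[0.3, 0.2, 0.2, 0.1],
       [0.1, 0.2, 0.1, 0.4],
       [0.2, 0.5, 0.1, 0.1],
       [0.2, 0.1, 0.4, 0.3]]

def delays(p0, cycle):
    """Middle-stage delay of every flow under one candidate joint sequence."""
    step = {cycle[x]: cycle[(x + 1) % N] for x in range(N)}
    first, conf = [], list(p0)                   # first[t][i] = middle port of input i
    for _ in range(N):
        first.append(list(conf))
        conf = [step[m] for m in conf]
    # Staggered symmetry: middle j -> output k at slot t iff first[(t+1) % N][k] == j.
    second = [[None] * N for _ in range(N)]
    for t in range(N):
        for k in range(N):
            second[t][first[(t + 1) % N][k]] = k
    d = [[None] * N for _ in range(N)]
    for i in range(N):
        j = first[0][i]                          # packet enters middle port j at slot 0
        for k in range(N):                       # wait until j connects to output k
            d[i][k] = next(tp for tp in range(1, N + 1) if second[tp % N][j] == k)
    return d

def weighted(d):
    return round(sum(lam[i][k] * d[i][k] for i in range(N) for k in range(N)), 6)

costs = {weighted(delays(p0, (0,) + c))
         for p0 in permutations(range(N)) for c in permutations(range(1, N))}
best = min(costs)
identity = weighted(delays(tuple(range(N)), (0, 1, 2, 3)))
```

Every candidate keeps flow(i,i) at the (N-1)-slot delay noted above, so the search can only rearrange the off-diagonal flows; even so, the best joint sequence already improves on the baseline ordering for this traffic matrix.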
3.3 Three-Stage Switch
In this section, we follow another, more practical approach, which adds
another stage of switch fabric for dynamically mapping heavy flows to
smaller middle-stage port delays, called the three-stage switch.
3.3.1 Three-Stage Switch Architecture
The three-stage switch architecture is shown in Fig. 3.7. Any joint sequence
with staggered symmetry and in-order packet delivery properties, e.g. the one in Fig.
3.2(a), can be used by the first two switch fabrics. The selected joint sequence will
not be changed according to traffic. Instead, the configuration of the third stage
switch is designed/adjusted to map heavy flows to smaller middle-stage delays.
As the configuration in the third switch fabric is based on traffic, it is updated only if
there is a significant enough change in traffic pattern. Since no buffer is required at
the virtual output ports (in Fig. 3.7), adding the third stage switch fabric does not
increase the packet delay (assuming propagation delay is negligible). In other words,
as soon as packets arrive at virtual outputs, they are re-directed to outputs via the
configuration in the third fabric. The 0-delay at the virtual outputs (due to 0-buffer)
also ensures no packet mis-sequencing, and no interruption to the original middle-
stage VOQ occupancy feedback mechanism.
Fig. 3.7 A three-stage switch architecture.
An example is shown in Fig. 3.8. With the three-stage switch in Fig. 3.8(b),
packets of flow(0,3) are delivered to virtual output 2 (instead of 3). After staying at
middle-stage ports for one slot, a packet arrives at virtual output 2 and is immediately
re-directed to output 3. We can see that the middle-stage delay of flow(0,3) packets is
just one slot, whereas 4 slots are required using the two-stage switch implementation
in Fig. 3.8(a).
Without loss of generality, assume the traffic matrix {λ_ij} is obtained. Then a
delay matrix {d_ij} can be constructed, where entry d_ij corresponds to connecting virtual
output port j-1 to output port i-1, and its value is the traffic-weighted average
middle-stage packet delay of all the N flows destined to output port i-1 under this
connection. From Chapter 2, each of the N flows destined to an output port experiences a distinct
middle-stage delay, ranging from 1 to N slots. For the 4×4 traffic matrix {λ_ij} in Fig.
3.9(a), the corresponding delay matrix {d_ij} is found and shown in Fig. 3.9(b). As an
example,
d_34 = 4λ_13 + λ_23 + 2λ_33 + 3λ_43 = 0.8 + 0.1 + 0.2 + 1.2 = 2.3 slots.
Fig. 3.8 An example of using the three-stage switch.
(a) Traffic matrix {λ_ij}:
0.3 0.2 0.2 0.1
0.1 0.2 0.1 0.4
0.2 0.5 0.1 0.1
0.2 0.1 0.4 0.3

(b) Delay matrix {d_ij}:
1.9 1.9 1.9 2.3
2.1 3.1 2.5 2.3
1.9 1.5 2.3 2.3
2.6 2.1 2.4 1.9
Fig. 3.9 Traffic matrix and delay matrix.
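The construction of the delay matrix can be sketched as follows. The per-flow middle-stage delay table W below is an assumption: the thesis does not print it, so it is reconstructed here to be consistent with the d_34 example above (flow(m,3) has delays 4, 1, 2, 3 for m = 0..3) and with the requirement that each input and each virtual output sees every delay 1..N exactly once:

```python
# Sketch: build the delay matrix {d_ij} of Fig. 3.9(b) from the traffic matrix of
# Fig. 3.9(a). W[m][k] is the (assumed) fixed middle-stage delay of flow(m,k).
N = 4
lam = [[0.3, 0.2, 0.2, 0.1],
       [0.1, 0.2, 0.1, 0.4],
       [0.2, 0.5, 0.1, 0.1],
       [0.2, 0.1, 0.4, 0.3]]
W = [[3, 2, 1, 4],
     [4, 3, 2, 1],
     [1, 4, 3, 2],
     [2, 1, 4, 3]]

# d[i][j]: traffic-weighted delay if virtual output j is mapped to output i
# (0-indexed, i.e. d[i][j] corresponds to the thesis's d_{i+1,j+1}).
d = [[round(sum(lam[m][i] * W[m][j] for m in range(N)), 6)
      for j in range(N)] for i in range(N)]
```

With this W, entry d[2][3] reproduces the worked example d_34 = 2.3, and the full matrix matches Fig. 3.9(b).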
Definition 3: A set of entries of a matrix are independent if none of them
occupies the same row or column.
A legitimate configuration in the third stage switch fabric must correspond to
an independent set. In Fig. 3.9(b), [d 11, d 22,d 33, d 44] is an independent set with virtual
output i mapping to output i. In this case, the three-stage switch degenerates into the
two-stage one. The average middle-stage packet delay experienced by all N^2 flows in the
two-stage switch is thus
d_11 + d_22 + d_33 + d_44 = 1.9 + 3.1 + 2.3 + 1.9 = 9.2 slots.
Minimizing the overall average middle-stage packet delay becomes finding
an independent set from the delay matrix such that the sum of all entries in the set is
minimized. Optimal algorithms with polynomial running time exist [39, 40], with a
time complexity of O(N^3). This is acceptable as the configuration of the third switch
fabric is not changed on a per-slot basis. For completeness, the algorithm is summarized below:
Delay matrix {d_ij}:
1.9 1.9 1.9 2.3
2.1 3.1 2.5 2.3
1.9 1.5 2.3 2.3
2.6 2.1 2.4 1.9

After Step 1 (row/column reduction):
0    0    0    0.4
0    1.0  0.4  0.2
0.4  0    0.8  0.8
0.7  0.2  0.5  0

After Step 2 (starring zeros):
0*   0    0    0.4
0    1.0  0.4  0.2
0.4  0*   0.8  0.8
0.7  0.2  0.5  0*

After Steps 3-5 (priming and augmenting):
0    0    0*   0.4
0*   1.0  0.4  0.2
0.4  0*   0.8  0.8
0.7  0.2  0.5  0*
Fig. 3.10 An example of identifying the minimum independent set.
For a given matrix, it finds the independent set with the minimum weight [39,40].
1. Subtract from each row of the matrix its smallest element; then subtract from
each column its smallest element.
2. Find a zero element, Z. If there is no starred zero in its row or its column,
mark Z with a star. Repeat for each zero of the matrix. Go to Step 3.
3. Cover every column containing starred 0 with a line. If all columns are
covered, the starred zeros form the desired independent set; Exit. Otherwise,
go to Step 4.
4. Choose an uncovered zero and mark it with a prime. If there is no starred zero
Z in this row, go to Step 5. If there is a starred zero Z in this row, cover this
row with a line and uncover the column of Z . Repeat until all zeros are
covered. Go to Step 6.
5. There is a sequence of alternating starred and primed zeros constructed as
follows: let Z 0 denote the uncovered 0'. Let Z 1 denote the 0* in Z 0's column (if
any). Let Z 2 denote the 0' in Z 1's row. Continue in a similar way until the
sequence stops at a 0', Z 2k , which has no 0* in its column. Unstar each starred
zero of the sequence, and star each primed zero of the sequence. Erase all
primes and uncover every line. Return to Step 3.
6. Let h denote the smallest uncovered element of the matrix; it will be positive.
Add h to each covered row; then subtract h from each uncovered column.
Return to Step 4 without altering any asterisks, primes, or covered lines.
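For small N, the same minimum-weight independent set can also be found by exhaustively checking all N! assignments, which makes the result above easy to verify. A sketch (a real implementation would use the O(N^3) algorithm summarized above; `min_assignment` is a hypothetical helper name):

```python
# Sketch: exhaustive minimum-weight assignment on the delay matrix of Fig. 3.9(b).
from itertools import permutations

d = [[1.9, 1.9, 1.9, 2.3],
     [2.1, 3.1, 2.5, 2.3],
     [1.9, 1.5, 2.3, 2.3],
     [2.6, 2.1, 2.4, 1.9]]

def min_assignment(cost):
    """Return (perm, total): perm[row] = chosen column of the minimum independent set."""
    n = len(cost)
    best_perm, best_sum = None, float("inf")
    for perm in permutations(range(n)):          # all N! independent sets
        s = sum(cost[r][perm[r]] for r in range(n))
        if s < best_sum:
            best_perm, best_sum = perm, s
    return best_perm, round(best_sum, 6)

perm, total = min_assignment(d)
```

The search recovers the independent set [d_13, d_21, d_32, d_44] with total weight 7.4 slots, agreeing with the stepwise Fig. 3.10 derivation.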
For the delay matrix in Fig. 3.9(b), we can find a minimum independent set
[d_13, d_21, d_32, d_44]. Fig. 3.10 shows the detailed steps and Fig. 3.11 shows the resulting
third stage configuration. The minimum average middle-stage packet delay is
d_13 + d_21 + d_32 + d_44 = 1.9 + 2.1 + 1.5 + 1.9 = 7.4 slots.
This gives a 19.6% reduction in middle-stage delay compared with the two-stage
switch counterpart.
While changing the third-stage configuration, attention should be paid to the
in-flight packets buffered at middle-stage ports. Their destinations are based on the
old mapping rendered by the old third-stage configuration. As such, we have to
suspend the inputs from sending packets to middle-stage ports for N slots; otherwise,
packets based on different mappings will coexist at middle-stage ports. During this
suspension period, the buffered middle-stage packets can be properly cleared and the
new configuration for the third switch fabric will be enforced immediately afterwards.
We call this N-slot suspension period the reconfiguration penalty.
Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b).
3.3.2 Traffic Matrix Estimation
Traffic matrix estimation among all the nodes in a network is generally
difficult. Fortunately, here we only need to find the traffic matrix at a single node, i.e.
between the N switch inputs and the N switch outputs. In this section, a simple traffic
matrix estimation algorithm is presented. In particular, a packet counter Qi,j is
associated with each of the N^2 flows/VOQ1(i,j)'s. At the beginning of each sampling
interval of T time slots, Qi,j is initialized to 0 and is increased by one for every
subsequent packet arrival. Let λ ij be the estimated traffic rate/load for flow(i, j). λ ij is
updated every T slots using the following exponentially weighted moving averaging
function:
λij = 0.875·λ′ij + 0.125·(Qi,j / T)
where λ′ij is the previous estimate and the weighting on the current sample is set to
0.125 (a value deemed suitable by simulations).
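As a minimal sketch (the matrix layout and function name are our own, not from the thesis), the per-interval update can be written as:

```python
ALPHA = 0.125  # weight on the current sample, as assumed in the text

def update_estimates(rate, counts, T):
    """One sampling-interval update of the per-flow rate estimates:
    rate[i][j] <- 0.875 * rate[i][j] + 0.125 * counts[i][j] / T,
    after which the packet counters Q[i][j] are reset to zero."""
    n = len(rate)
    for i in range(n):
        for j in range(n):
            rate[i][j] = (1 - ALPHA) * rate[i][j] + ALPHA * counts[i][j] / T
            counts[i][j] = 0  # restart the counter for the next interval
    return rate

# Example: a single flow that sent 50 packets in a T=100-slot interval,
# starting from a previous estimate of 0.2 packets/slot:
rate = update_estimates([[0.2]], [[50]], T=100)
# 0.875*0.2 + 0.125*0.5 = 0.2375 packets/slot
```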
We also introduce another criterion for suppressing unnecessary updates of
the third-stage configuration, so as to minimize the reconfiguration penalty.
Specifically, when a new input load λ ij is obtained, we check if the load change is
significant enough by
λij / λ′ij ∈ [0.9, 1.1]    (3.2)
If all flows satisfy (3.2), the existing third-stage configuration remains. Otherwise, a
new third-stage configuration is determined based on the updated traffic matrix. The
fluctuation range in (3.2) can also be tuned to balance the reconfiguration penalty
and the possible delay performance gain.
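The suppression test can be sketched as follows; the handling of a previously idle flow (λ′ij = 0) is our own assumption, since the text does not specify it:

```python
def needs_reconfiguration(new_rate, old_rate, lo=0.9, hi=1.1, eps=1e-9):
    """Return True if condition (3.2) fails for some flow, i.e. some
    load ratio new/old falls outside [0.9, 1.1]; only then is a new
    third-stage configuration computed."""
    n = len(new_rate)
    for i in range(n):
        for j in range(n):
            old = old_rate[i][j]
            if old < eps:                 # previously idle flow (assumption)
                if new_rate[i][j] > eps:  # ... that has become active
                    return True
                continue
            if not (lo <= new_rate[i][j] / old <= hi):
                return True
    return False
```

For example, a flow whose load rises from 0.4 to 0.5 packets/slot (ratio 1.25) triggers a reconfiguration, while a rise to 0.42 (ratio 1.05) does not.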
Finally, the three-stage switch architecture above is resilient to errors in
estimating the traffic matrix. This is because the close to 100% throughput is
guaranteed by the joint sequence used in the first two switch fabrics, whereas the
third fabric is purely for cutting down the delay. Therefore, adding the third stage
fabric has no negative impact on switch throughput, packet order, or the
middle-stage VOQ occupancy feedback mechanism.
3.3.3 Performance Evaluations
In Chapter 2, the unbeatable delay-throughput performance of the feedback-
based two-stage switch architecture has been well-demonstrated under various traffic
conditions. In this section, we only focus on the improvement of the three-stage
switch over the original feedback switch.
We first study the performance under hot-spot traffic model. For input port i,
a packet goes to the hot-spot output (i+x) mod N with probability 1/2, and to each other
output with probability 1/[2(N-1)]. The hot-spot can be changed by varying x. This
traffic model is chosen because the overall traffic pattern remains admissible while
increasing input load p from 0 to 1, or varying x. Without loss of generality, the joint
sequence shown in Fig. 3.2(a) is assumed.
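A simulation helper for this arrival process might look as follows (the modulo-N wrap of (i+x) is implied by the text; the function name is ours):

```python
import random

def hotspot_destination(i, x, N, rng=random):
    """Draw a destination for a packet arriving at input i: the hot-spot
    output (i+x) mod N with probability 1/2, and each of the remaining
    N-1 outputs with probability 1/(2(N-1))."""
    hot = (i + x) % N
    if rng.random() < 0.5:
        return hot
    # otherwise pick uniformly among the non-hot-spot outputs
    return rng.choice([k for k in range(N) if k != hot])
```

Sampling many destinations for a fixed input should show roughly half of them landing on the hot-spot output, with the remainder spread evenly.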
Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch.
In the hot-spot traffic model, heavy flow can be easily and correctly identified
by our proposed traffic estimation algorithm. As such, Fig. 3.12 only shows the
delay-throughput performance (of a switch with size N =32) against input load. The
y-axis is the overall average switch delay, which combines both input delay and
middle-stage delay. With two-stage switch architecture, varying hot-spot x results in
different delay-throughput performances. From Fig. 3.12, we can see that when the
hot-spot is at output (i+30), the lowest/best delay is obtained because the hot-spot
flow is assigned to experience 1-slot middle-stage delay. When the hot-spot is at
output (i+31), the highest/poorest delay is obtained because hot-spot flow is assigned
to experience the largest 32-slot middle-stage delay.
With our three-stage switch architecture, we can always map the hot-spot
flow to experience the lowest 1-slot middle-stage delay by properly configuring the
third switch fabric. That means no matter what the value of x is, the overall delay-
throughput performance rendered by our three-stage architecture is always the same
as the case of the hot-spot at output (i+30). This cuts the delay by as much as 15
time slots, giving a 60.7% delay improvement at p=0.6 and 43.4% at p=0.95.
Fig. 3.13 Delay vs. number of sampling intervals, with 3-stage switch.
Fig. 3.13 shows the delay versus time, or the number of sampling intervals,
where each sampling interval is T = 10^5 slots. The initial traffic pattern/matrix changes
twice during the simulation, at the 40-th and the 70-th sampling intervals,
respectively. Each change is represented by a randomly generated traffic matrix.
(Each matrix entry is uniformly distributed between 0 and 1, and the whole matrix is
regulated to be admissible.) From Fig. 3.13, we can see that our traffic estimation
algorithm is quite effective in adapting to the changes in traffic pattern, and the
overall improvement of three-stage switch, as compared with the original two-stage
switch, is about 8%.
3.4 Chapter Summary
In this chapter, we improved the delay performance of the feedback-based two-stage
switch by assigning heavy flows to experience smaller middle-stage delays. We
followed two approaches. First, for a given traffic matrix, we can find an optimal
joint sequence that minimizes the average middle-stage delay. In the second
approach, we extended the feedback-based two-stage switch architecture to three
stages, whereby the third switch fabric dynamically maps heavy flows to experience
smaller middle-stage port delays.
Chapter 4
Cutting Down Communication Overhead
4.1 Introduction
The occupancy vector in our feedback-based two-stage switch requires N bits.
When the switch size N is large, the N -bit occupancy vector may become a
bottleneck. For example, with a 1024×1024 switch carrying 128-byte packets, the
(second) switch fabric must operate at a speedup of two for carrying the extra 1024
bits of occupancy vector.
In this chapter, we focus on cutting down the communication overhead. The
size of an occupancy vector can be reduced by only reporting the status of selected
middle-stage VOQs. To identify VOQs of interest, we first partition the N VOQs
into u non-overlapped sets, each identified by a set number. In each time slot, every
input port piggybacks its set numbers of interest to the connected middle-stage port.
This “guides” each middle-stage port to only report the status of selected VOQs.
The rest of this chapter is organized as follows. In the next section, by
exploiting the feedback path in the first-stage, a set of efficient feedback suppression
algorithms are designed. In Section 4.3, we compare all the proposed algorithms by
simulations. Finally, we conclude the chapter in Section 4.4.
4.2 Feedback Suppression Algorithms
Firstly, we partition the N VOQs at each port, either input or middle-stage,
into u non-overlapped sets, denoted by G1, G2, …, Gu. Without loss of generality,
assume g = N/u is an integer. Then each set Gm (m = 1, 2, …, u) contains g queues.
Specifically, at input k ,
Gm={VOQ1(k ,(m-1) g +1),VOQ1(k ,(m-1) g +2),…, VOQ1(k , mg )}.
At middle-stage port j,
Gm={VOQ2( j,(m-1) g +1), VOQ2( j,(m-1) g +2),…,VOQ2( j, mg )}.
To cut down the communication overhead, the size of an occupancy vector can be
reduced by only reporting the status of selected Gm.
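The partition can be sketched as below (0-based queue indices for convenience, whereas the text numbers queues from 1):

```python
def partition_sets(N, u):
    """Partition queue indices 0..N-1 into u non-overlapping sets of
    g = N/u consecutive queues each; set numbers are 1-based as in the
    text, queue indices are 0-based."""
    assert N % u == 0, "g = N/u must be an integer"
    g = N // u
    return {m: list(range((m - 1) * g, m * g)) for m in range(1, u + 1)}

print(partition_sets(8, 2))  # -> {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}
```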
To maximize switch performance, longer queues should be given more
chances to send packets. With full N -bit occupancy vector, the LQF scheduling
provides the best performance by always selecting the longest queue from all the N
VOQ1(i, j)’s at each input port. If we can select Gm based on where the longest queue
resides, the performance would not drop. We propose to construct another feedback
mechanism for an input to piggyback its set numbers of interest to the connected
middle-stage port. We can make use of the otherwise wasted bandwidth in the first
stage switch for this purpose, as shown in Fig. 4.1. (Note that the speedup required
for carrying feedback in the second stage switch is also applied to the first stage
switch.) But unlike the feedback mechanism in the second stage (for middle-stage
ports to inform outputs/inputs), the (identity of) longest queue received from an input
i by middle port j at slot t can only be used N slots later, i.e. next time middle port j is
connected to output i. Since packets arrive and depart in every slot, the longest queue
identified N slots ago may not be the current longest queue – this is the price we must
pay. Nevertheless, for highly skewed non-uniform traffic patterns, the history data
usually serves as a good estimate.
Fig. 4.1 Timing diagram of feedback switch with feedback suppression.
With the above feedback mechanism in the first stage switch, three packet
scheduling algorithms are designed.
4.2.1 Set-Based Feedback (Set-feedback)
Let VOQ1(i,F ) denote the longest queue at input i at time t . If F ∈Gm, then
the value of m is stored at input i and piggybacked (using log u bits) on the packet
sent to the connected middle-stage port j. Port j stores the value of m and when it is
connected to output i at time t + N -1, j sends a g -bit vector, corresponding to the
occupancy of the g queues in set Gm. Input/output i knows which set the g -bit
occupancy vector refers to, based on the stored value of m at time t. At slot t + N , input
i selects a packet to send from the longest available queue in {VOQ1(i,(m-1) g +1),
VOQ1(i,(m-1) g +2), …, VOQ1(i,mg )}. “Available” means the corresponding
VOQ2(j,k) is empty and VOQ1(i,k) is not. In doing so, the likelihood that the selected
packet comes from the longest queue among all the N VOQ1s at input i is increased.
The feedback bits required in the first and second stages are log u and g bits
respectively.
An example: Consider a 4×4 (N=4) feedback-based switch. At each
input/middle-stage port, VOQs are partitioned into u=2 non-overlapped sets,
denoted by G1 and G2. Then at input 1, set G1 contains {VOQ1(1,0), VOQ1(1,1)} and
G2 contains {VOQ1(1,2), VOQ1(1,3)}. Assume VOQ1(1,3) is the longest queue at
input 1 at time slot 0. Since VOQ1(1,3)∈G2, the value 2 (the identity of G2) is stored
at input 1 and piggybacked using 1 (= log u) bit on the packet sent to the connected
middle-stage port 0. Middle port 0 stores the value of 2 and when it is connected to
output 1 at time slot 3, middle port 0 sends a 2-bit ( g =2) vector, corresponding to the
occupancy of {VOQ2(0,2), VOQ2(0,3)} in set G2. Input/output 1 knows which set the
2-bit occupancy vector refers to, based on the value 2 stored at time slot 0. At slot 4,
input 1 selects a packet to send from the longest available queue in
{VOQ1(1,2),VOQ1(1,3)}.
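The input-side selection of Set-feedback can be sketched as follows (0-based indices and function names are our own):

```python
def set_number(F, g):
    """Return the (1-based) set number m of the set Gm containing
    queue index F (0-based)."""
    return F // g + 1

def select_packet(voq_len, m, occ, g):
    """Pick the longest *available* queue in set Gm at an input port.
    voq_len: lengths of the N input VOQ1s; occ: g-bit occupancy of the
    corresponding middle-stage VOQ2s (True = occupied). A queue is
    available if its VOQ2 is empty and its VOQ1 is non-empty."""
    base = (m - 1) * g
    candidates = [base + k for k in range(g)
                  if not occ[k] and voq_len[base + k] > 0]
    return max(candidates, key=lambda q: voq_len[q]) if candidates else None
```

For the example above (N=4, g=2): `set_number(3, 2)` returns 2, and with input VOQ lengths `[0, 0, 3, 5]` and both middle-stage queues in G2 empty, `select_packet` picks queue 3.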
4.2.2 Queue-Based Feedback Version 1 (Q-feedback-1)
Let VOQ1(i,F ) denote the longest queue at input i at time t . Unlike Set-
feedback , the value of F is stored at input i and piggybacked (using log N bits) on the
packet sent to middle-stage port j. Port j stores the value of F . When it is connected
to output i at slot t + N -1, j sends a b-bit occupancy vector, containing the occupancy
of b queues from VOQ2(j,F) to VOQ2(j,F+b-1) (wrapped around modulo N). Input/output
i knows which queues the b-bit occupancy vector refers to, based on the value of F
stored at time t. At slot t + N , input i selects a packet to send from the longest available
queue in {VOQ1(i, F ), VOQ1(i, F +1), …, VOQ1(i, F +b-1)}. The feedback bits required
in the first and second stages are log N and b bits respectively. (Note that b= g is not
necessary.)
An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is
the longest queue at input 1 at time slot 0. Then the value of 3 is stored at input 1 and
piggybacked using 2 (log N ) bits on the packet sent to the connected middle-stage
port 0. Middle port 0 stores the value of 3 (the identity of VOQ1(1,3)) and when it is
connected to output 1 at time slot 3, middle port 0 sends a 3-bit (b=3) vector,
corresponding to the occupancy of {VOQ2(0,3), VOQ2(0,0), VOQ2(0,1)}.
Input/output 1 knows which queues the 3-bit occupancy vector refers to, based on the stored value
of 3 at time slot 0. At slot 4, input 1 selects a packet to send from the longest
available queue in {VOQ1(1,3),VOQ1(1,0), VOQ1(1,1)}.
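The wrap-around window of queues covered by the b-bit vector can be written as a one-liner; with F=3, b=3, N=4 it covers queues 3, 0, 1 as in the example above:

```python
def feedback_window(F, b, N):
    """Queue indices covered by the b-bit occupancy vector:
    F, F+1, ..., F+b-1, wrapped around modulo N."""
    return [(F + k) % N for k in range(b)]

print(feedback_window(3, 3, 4))  # -> [3, 0, 1]
```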
4.2.3 Queue-Based Feedback Version 2 (Q-feedback-2)
This algorithm is the same as Q-feedback-1 except that the second stage
feedback is generated as follows. When middle-stage port j is connected to output i at
slot t + N -1, we randomly select an empty queue VOQ2( j,z ). Middle-stage port j then
sends a (1+log N)-bit occupancy vector, with the first bit indicating the occupancy of
VOQ2(j,F) and the following log N bits carrying the value z. At slot t+N, input i selects
a packet from the longest available queue in {VOQ1(i,F), VOQ1(i,z)}. The feedback
bits required in the first and second stages are log N and 1+log N bits respectively.
An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is
the longest queue at input 1 at time slot 0. Then the value of 3 is stored at input 1 and
piggybacked using 2 (log N ) bits on the packet sent to the connected middle-stage
port 0. Middle port 0 stores the value of 3 (the identity of VOQ1(1,3)) and when it is
connected to output 1 at time slot 3, it randomly selects an empty queue, say
VOQ2(0,2). Middle port 0 then sends a 3-bit (= 1+log N) occupancy vector, with the
first bit indicating the occupancy of VOQ2(0,3) and the following 2 (= log N) bits
carrying the value 2 (the identity of VOQ2(0,2)). At slot 4, input 1 selects a
packet to send from the longest available queue in {VOQ1(1,3),VOQ1(1,2)}.
Note that the three algorithms above can all be extended to carry the feedback
of the top-C longest queues (instead of the longest queue only). In Set-feedback , this
requires C ·log u bits in the first stage (for identifying up to C sets of Gm that contain
the top-C longest queues), and C · g bits in the second stage. Similarly, in Q-feedback-
1, we need C ·log N bits in the first stage and C ·b bits in the second stage. For Q-
feedback-2, we need C ·log N bits and C ·(1+ log N ) bits, respectively.
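The per-scheme bit counts can be tabulated by a small helper (a sketch; log denotes log2, and N and u are assumed to be powers of two). It reproduces the parameter settings used later in Section 4.3:

```python
import math

def feedback_bits(scheme, N, u=None, b=None, C=1):
    """(first-stage, second-stage) feedback bits per time slot for the
    three schemes, carrying the top-C longest queues."""
    if scheme == "set":
        return C * int(math.log2(u)), C * (N // u)   # C*log u, C*g
    if scheme == "q1":
        return C * int(math.log2(N)), C * b          # C*log N, C*b
    if scheme == "q2":
        return C * int(math.log2(N)), C * (1 + int(math.log2(N)))
    raise ValueError(scheme)

# Parameter settings of Section 4.3 (N = 32, 12-bit second-stage target):
print(feedback_bits("set", 32, u=8, C=3))  # -> (9, 12)
print(feedback_bits("q1", 32, b=6, C=2))   # -> (10, 12)
print(feedback_bits("q2", 32, C=2))        # -> (10, 12)
```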
4.3 Performance Evaluations
In this section, the delay-throughput performance of the proposed scheduling
algorithms is studied by simulations. Without loss of generality, a switch with size
N=32 is assumed unless otherwise specified. Scheduling algorithms with full
feedback (in the second-stage switch) require 32 bits. With our proposed feedback
schemes, we target using only 12 bits (roughly 1/3). The detailed parameter
settings are as follows:
For Set-feedback, in order to form a 12-bit feedback, we partition the 32 VOQs
into u=8 sets, each having g=4 elements. Feedback of the Top-3 longest
queues is used, i.e. C =3. The feedback bits required in the first stage and
second stage are 9 bits and 12 bits respectively.
For Q-feedback-1, we set b to 6 and C to 2. The feedback bits required in the
first stage and second stage become 10 bits and 12 bits respectively, which
are comparable to that of Set-feedback .
For Q-feedback-2, we set C to 2. The feedback bits required in the first stage
and second stage are 10 bits and 12 bits respectively.
For comparison, we also implement a) original feedback-switch with full N -
bit feedback (LQF algorithm); b) iSLIP algorithm [15] (with a single iteration),
which serves as a benchmark for single-stage input-queued switches; and c) output-
queued switch, which serves as a performance lower bound. In simulations, we use
the same traffic models as Chapter 2.4, i.e. the uniform, uniform bursty and hot-spot
traffic.
4.3.1 Performance under Uniform Traffic
Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback.
Fig. 4.2 compares the delay performance of the six schemes under uniform
traffic. We can see that the delay gap between full-feedback and our proposed Set-
feedback, Q-feedback-1 and Q-feedback-2 increases with the input load. At p=0.1, they give
almost the same delay performance. At p=0.8, the delay gap grows to about 20 slots. But
when compared with iSLIP, our proposed schemes require 40+ fewer slots, yielding a
55% cut in delay. Among the three proposed schemes, Set-feedback generally
outperforms the other two. With the fixed number of bits for conveying feedback
occupancy, Set-feedback can convey the Top-3 longest queues instead of Top-2 (as in
Q-feedback), which identifies the longest VOQ with higher accuracy.
4.3.2 Performance under Uniform Bursty Traffic
From Fig. 4.3, the delay performance under bursty traffic, we can see that
Set-feedback gives the best performance (lowest delay), followed by Q-feedback-1
and Q-feedback-2. In general, iSLIP has smaller delay for low input load
( p≤0.5). At p=0.6, the delay is 183 slots for iSLIP and 92 slots for Set-feedback ,
yielding a 50% cut in delay. Compared with full-feedback , delay is increased from 70
slots to 92 at p=0.6, which represents the price paid for minimizing feedback bits.
Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback.
4.3.3 Performance under Hotspot Traffic
From Fig. 4.4, the delay performance under hot-spot traffic, again we can see
that Set-feedback , Q-feedback-1 and Q-feedback-2 give comparable performance.
Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback.
4.3.4 Performance under Different Switch Size N
Based on the above simulation results, Set-feedback gives the best
performance among our proposed algorithms. In the following, we focus on the
performance of Set-feedback under different traffic patterns with different switch
sizes N . Note that we still limit the feedback for Set-feedback to 12 bits, regardless of
the switch size. Specifically, when N is 64, we set u=16 (so g =64/16=4) and C =3.
The feedback bits required in the first and second stages both become 12 bits.
When N =128, we set u=32 (so g =4) and C =3. The feedback bits required in the first
stage and second stage are 15 and 12 bits respectively.
From Fig. 4.5, we can see that when N =128, 12-bit Set-feedback yields 94.5%
throughput under uniform traffic. In other words, Set-feedback trades just 5.5% of
throughput for an 88.3% saving in communication overhead.
Fig. 4.5 Throughput vs. switch size N , with partial feedback.
4.4 Chapter Summary
In this chapter, we focused on cutting down the communication overhead in
feedback-based two-stage switch. The size of an occupancy vector, which is sent by
a middle-stage port to an output port in every time slot, is reduced by only reporting the
status of selected middle-stage VOQs. To identify VOQs of interest, we first
partitioned the N VOQs into u non-overlapped sets, each being identified by a set
number. In each time slot, every input port piggybacks its set numbers of interest to
the connected middle-stage port. This guides a middle-stage port to only report the
status of the VOQs of interest. Extensive simulation results showed that our proposed
feedback suppression algorithms are very efficient.
Chapter 5
Supporting Multicast Traffic
5.1 Introduction
The migration of broadcasting and multicasting services, such as cable TV
and multimedia-on-demand, to packet-oriented networks will play a dominant role in
the near future. These highly popular applications have the potential of loading up
the Internet. To keep up with the bandwidth demand of such applications, the next
generation of packet switches/routers needs to provide efficient multicast switching
and packet replication.
When a multicast packet arrives at a switch, the set of output ports the packet
is destined for, i.e. the packet's fan-out set, is retrieved from the local forwarding table
(like IP multicast). The cardinality of the fan-out set, i.e. its fan-out, denotes the
number of copies into which the packet should be cloned. Packets arriving at the same input
port and destined for the same fan-out set belong to the same multicast flow. The
total number of possible multicast (and unicast) flows at an input port is 2^N − 1. An
admissible multicast traffic pattern requires no over-subscribed input and output
ports. That means the packet arrival rate at each input port should be less than or
equal to its capacity, or 1 packet/slot. Similarly, the aggregated packet arrival rate at
each output port (after packet duplication) must also be smaller than or equal to 1
packet/slot. A multicast switch aims at providing 100% throughput for any
admissible multicast traffic pattern with minimum possible packet delay.
For the sake of scalability, multicast switches are mainly designed based on
input-queued switch architecture, where a centralized scheduler is responsible for
scheduling. Switch fabrics used can be bufferless [41-45] or buffered [46-49]. For
multicast switches based on bufferless switch fabrics [41-45], in-switch multicast
capability (i.e. in-switch packet duplication and forwarding) is usually assumed,
where an input port can send a (multicast) packet to multiple output ports in a single
time slot. Such multicast fabrics are more expensive than their unicast counterparts.
Besides, the centralized scheduling algorithms are usually derived from their unicast
counterparts. Note that even for (simpler) unicast switches, a major bottleneck is the
implementation of the centralized scheduler.
For multicast switches with buffered switch fabrics, they mainly adopt the
buffered crossbar [18-20] as their switch fabrics. Recall that for the buffered crossbar
switch introduced in Chapter 1, even though the scheduler is simpler, its switch
fabric needs to realize N^N switch configurations, the same complexity as an
output-queued switch fabric.
In short, two limiting factors for high-speed multicast switch design are the
switch fabric complexity and the need for a sophisticated centralized scheduler. In
this chapter, we show that feedback-based two-stage switch can support multicast
traffic efficiently by slightly modifying its original operations. It elegantly
overcomes these two major obstacles. Specifically, it does not require a centralized
scheduler, and relies on a unicast switch fabric (realizing only N switch
configurations) to carry both unicast and multicast traffic.
The rest of the chapter is organized as follows. In the next section, we review
some related work on multicast switch design. The feedback-based two-stage switch
is modified to support multicast traffic in Section 5.3 and simulation results are
presented in Section 5.4. We conclude the chapter in Section 5.5.
5.2 Related Work
5.2.1 Multicast Switches Based on Bufferless Switch Fabrics
Multicast switches based on bufferless switch fabrics [41-45] usually assume
in-fabric multicast capability (i.e. in-fabric packet duplication and forwarding), and
require a rather sophisticated centralized scheduler. In [41], each switch input port
maintains N +1 virtual queues, N for unicast and one for multicast. Priority is given to
schedule multicast traffic. If there are still idle inputs/outputs after scheduling
multicast packets, unicast packets are considered to increase switch utilization.
Although a multicast packet can be "split" and sent over multiple time slots, multicast
traffic suffers from severe head-of-line (HOL) blocking due to the single multicast
queue.
In [42], the number of multicast queues is increased to m to reduce HOL
blocking. When a multicast packet arrives, it selects a multicast queue to join in order
to balance the loading among different multicast queues. But packets assigned to
different queues generally have overlapped fan-out sets. Priority is given to schedule
a unicast packet first or a multicast packet first depending on the service ratio
between the two classes. An iterative algorithm is also adopted to maximize the
throughput in each time slot.
In [43], packet splitting is allowed to further cut down the HOL blocking.
Specifically, each input maintains k unicast/multicast shared queues, one for each
non-overlapped set of outputs. When a multicast packet arrives and if its fan-out set
intersects with the output sets of multiple queues, packet-splitting "breaks" the
original packet into "smaller" ones, each with a modified fan-out set that lies
entirely within the output set of the queue it joins. An iterative algorithm is then
used to maximize the switch throughput. Simulation results show that high
throughput can only be achieved with a large number of iterations. But a large
number of iterations is not suitable for high-speed implementation.
In [44], the number of unicast/multicast shared pointer queues increases to
k = N , one for each output port (like the classic VOQs for unicast traffic). When a
multicast/unicast packet arrives, it is time-stamped and stored in a shared memory.
Then its memory address (i.e. a pointer) is stored in all pointer queues that overlap
with the packet’s fan-out set. An iterative scheduling algorithm based on the
timestamps of buffered packets is designed for maximizing throughput. The major
problem with this approach, again, is its high communication overheads.
In [45], dynamic queuing policies are studied, where packet splitting upon
arrival is not allowed. The switch needs to identify active flows and then assign them
to different shared multicast queues based on the current switch load.
5.2.2 Buffered Crossbar Based Multicast Switches
Buffered crossbar switch architecture [18-20] is touted for its technology
feasibility and simpler scheduler. However, the buffered crossbar switch is not
scalable due to its 2N separate schedulers, N^2 in-fabric crosspoint buffers, and the
need for N^N switch configurations. The buffered crossbar has also been extended to
support multicast traffic [46-49]. MURS [46] gives priority to schedule unicast and
multicast traffic in a round robin fashion. Specifically, if unicast gets priority in time
slot t , unicast traffic will be scheduled first. If there are still idle outputs after
scheduling unicast traffic, multicast traffic is considered. Then in slot t +1, multicast
traffic gets the scheduling priority.
To reduce the hardware cost, I-SMCB (Input-based Shared Memory
Crosspoint Buffer [47]) and O-SMCB (Output-based Shared Memory Crosspoint
Buffer [48]) aim at cutting down the number of crosspoint buffers from N^2 to N^2/2. The key idea
is to share one crosspoint buffer by two adjacent input ports [47] or two adjacent
output ports [48]. But such a hardware cost reduction is offset by its throughput
degradation. In [49], the theoretical relationship between throughput performance
and crosspoint buffer size is studied under a special multicast traffic pattern. It is
concluded that to avoid throughput degradation, the amount of buffer to be deployed
at every crosspoint must scale logarithmically with the switch size N .
5.3 Multicast Scheduling in Feedback-Based Two-Stage Switch
5.3.1 Multicast Scheduling
We extend the feedback-based two-stage switch (Fig. 2.1) to support
multicast traffic. At each input port, in addition to the N unicast VOQ1(i,k )’s, we add
another m shared queues for multicast. We adopt a simple queuing policy that divides
the outputs into m equal and non-overlapped sets (assuming N /m is an integer),
where set x (1≤ x≤m) contains outputs {( x-1) N /m, ( x-1) N /m+1,…, x·N /m-1}. Packet
splitting is used to “split” multicast packets to join different queues. So when a
multicast packet arrives and if its fan-out set intersects with the fan-out sets of
multiple queues, then the original packet is "split" into "smaller" ones, each with a
modified fan-out set that lies entirely within the output set of its target queue.
Note that the packet after splitting usually remains as a multicast packet but with a
smaller fan-out set. It is worth noting that when m=1, all multicast packets share the
same multicast queue; and when m= N , packet splitting converts all multicast packets
into unicast.
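The splitting rule can be sketched as follows (0-based output indices, matching the output-set definition above; the function name is ours):

```python
def split_fanout(fanout, N, m):
    """Split a multicast packet's fan-out set across the m shared queues.
    Queue x (1-based) owns outputs {(x-1)N/m, ..., xN/m - 1}; each piece
    keeps only the outputs owned by its target queue."""
    g = N // m                       # outputs per queue
    pieces = {}
    for out in sorted(fanout):
        x = out // g + 1             # target queue of this output
        pieces.setdefault(x, set()).add(out)
    return pieces

print(split_fanout({1, 5, 6}, N=8, m=2))  # -> {1: {1}, 2: {5, 6}}
```

With m=1 the whole fan-out set lands in the single queue, and with m=N every piece is a singleton, i.e. all multicast packets become unicast, as noted in the text.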
Without loss of generality, we assume the two stages of switch fabrics are
configured using the joint sequence of Fig. 2.2(a). In each time slot, based on the
received occupancy vector of middle-stage port k , input i selects a packet for sending
among its N+m local queues. Priority is given to schedule multicast traffic by
examining the m multicast queues first. Here we only consider giving the higher
priority to multicast traffic, as in general it is more time critical than unicast, but it
should be noted that our multicast scheduler can be revised to schedule unicast and
multicast packets depending on the service ratio (like [42]) or in a round robin
fashion (like [46]). Specifically, the HOL packet whose fan-out set has the largest
overlap with the set of empty queues at middle-port k is selected. (If no overlap, a
unicast packet is selected instead.) A copy of the selected packet is sent to the
middle-port together with an N -bit duplication vector , which identifies the overlap
between the empty VOQ2( j,k )’s and the packet fan-out set. Then, the fan-out set of
the selected multicast packet is updated to exclude those in the duplication vector. If
the updated fan-out set is empty, the selected multicast packet is removed from the
multicast queue. When a packet arrives at the middle-stage port, it will be cloned and
stored at the corresponding empty (unicast) VOQ2( j,k )’s based on the duplication
vector.
If there are no backlogged multicast packets or none of them can be selected
(due to zero-overlap between the empty VOQ2( j,k )’s and any multicast packet’s fan-
out set), we select a unicast packet for sending using the LQF scheduler. In this case,
the duplication vector is set to all 0’s. Note that the packet transmission in the
second-stage switch fabric is the same as in a unicast switch. Following the pre-
determined sequence of configurations, when middle-stage port j connects to output
k, the packet (if any) at VOQ2(j,k) is sent together with the occupancy vector of
middle-stage port j.
5.3.2 Discussions
In our proposed multicast scheduling algorithm, packet duplication takes
place at both input ports and middle-stage ports. Packet duplication at input ports
“breaks” a multicast packet into smaller ones. Since multicast packets in different
multicast queues have non-overlapped fan-out sets, both HOL blocking and output
contention can be eased. Besides, storing multicast packets at inputs reduces the
input port buffer requirement. Since both switch fabrics in the feedback switch are
unicast, a multicast packet traverses the first fabric as a unicast packet; a complicated
switch fabric with in-fabric duplication (as in [41-45]) is not required. When a split
multicast packet arrives at a middle-stage port, the second stage of packet duplication
occurs, converting all multicast packets into unicast packets for delivery by the second
switch fabric.
When there is only a single multicast queue (m = 1), all packet duplication is
carried out at middle-stage ports. Under light traffic, input port queue size can be
minimized. But for heavy traffic, the switch will experience severe HOL blocking
because a multicast packet will not be removed (from the only queue) until all its
copies are sent. With m > 1, packet splitting ensures that packet duplication occurs
partially at input ports and packets in different queues have non-overlapped
destinations. This reduces the HOL blocking. Let the switch size be N . When m= N ,
all packet duplication is carried out at input ports. In this case, there is no need for
“multicast” queues because they only store unicast packets. In other words, each
input port only needs to maintain N unicast queues. The HOL blocking is completely
eliminated, and the stability proof in Chapter 2 can also be applied to the multicast
feedback switch with m= N .
Unlike the feedback-based two-stage unicast switch, the load-balancing in the
first stage switch is based on multicast packets. Extensive simulation results show
that the final unicast traffic presented to the second stage switch is generally uniform.
This is attributed to the use of a single-packet-buffer per middle-stage VOQ2(j,k), and
the efficient feedback mechanism for reporting the middle-stage port occupancy. To
further increase the buffer utilization, we can use pointer queues [44] to separately
store a packet and its memory address. A multicast packet thus needs to be stored only
once at an input port, and an entry in VOQ1(i,k) contains only the memory address of
the packet. Likewise, this can be applied to buffers at middle-stage ports.
The proposed multicast scheduling algorithm inherits the in-order packet
delivery property from its unicast counterpart. This is because we can treat each
distributary of a multicast flow as a unicast flow. In Chapter 2, it has been shown that
packets belonging to the same unicast flow always experience the same middle-stage
port delay. Therefore, when they arrive at the output port, they will be in order. If
packets belonging to every distributary flow arrive at their respective outputs in order,
the corresponding multicast flow will not experience the packet mis-sequencing problem.
5.4 Performance Evaluations
To the best of our knowledge, our proposed multicast scheduling is the only
one that does not rely on a centralized scheduler, and its switch fabric only needs to
realize N switch configurations (instead of N !). To study its performance, we vary the
number of multicast queues (m) at each input port. In our simulations, we distinguish
between the overall average delay experienced by all copies (Tc) of a multicast packet
and the average delay experienced by the last copy (Tp) of each multicast packet. Tp
corresponds to the worst-case delay and provides some insight into the delay variation
among different copies of a multicast packet. For multicast packets with fan-out k,
Tc(k) and Tp(k) denote their average delay and average last-copy delay respectively;
these show the fairness in handling packets with different fan-outs. Although we only
present simulation results for a switch of size N=32 below, the same conclusions and
observations apply to other switch sizes.
5.4.1 Performance under Uniform Mixing Traffic
Fig. 5.1 Delay vs output load λ , with uniform mixing traffic
At every time slot for each input, a packet arrives with probability p (i.e.
input load is p). If a packet arrives, it has equal probability of being unicast or
multicast. If the packet is unicast, it is destined to each output with equal probability. If
the packet is multicast, its fan-out size k is randomly selected between [2, 32], and
the identity of each output in the fan-out set is also randomly selected from all output
ports. Fig. 5.1 shows the switch delay performance against switch output load λ ,
where
λ = p[0.5 + 0.5×(2+32)/2] = 9p. (5.1)
To ensure the traffic in our simulations is always admissible, we must have λ ≤ 1 (or
p ≤1/9).
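As a quick sanity check of (5.1), the helper below (our name, not the thesis's) computes the output load from the traffic mix: half of the arrivals are unicast with fan-out 1, and half are multicast with fan-out uniform on [2, N], i.e. a mean of (2+N)/2 = 17 for N = 32.

```python
def output_load(p, n=32):
    """Output load under uniform mixing traffic: with probability 0.5 a packet
    is unicast (fan-out 1), otherwise multicast with mean fan-out (2+n)/2."""
    return p * (0.5 * 1 + 0.5 * (2 + n) / 2)
```

With N = 32 each arriving packet generates 9 output packets on average, so the input load p = 1/9 saturates the outputs.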
Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7
From the delay-throughput performance in Fig. 5.1, we can see that for output
load λ < 0.85, m=1 and m=2 provide a lower average packet delay than m=32. At λ =
0.7, m=2 cuts down the overall average delay (Tc) by 58.8% and the average last-copy
delay (Tp) by 51%. When λ>0.85, m=32 (packet duplication at input ports only) yields a
lower delay because there is no HOL blocking, while the HOL blocking under m=1
(packet duplication at middle-stage ports only) intensifies with the traffic load. This also
explains why m=2 (packet duplication at both input and middle-stage ports) is better
than m=1.
Fig. 5.2 shows the delay performance against different fan-outs, while fixing
λ = 0.7. When m=2, we can see that Tc(k), the average delay for packets with fan-out
k, is the lowest, and remains almost constant at 20 slots as fan-out k increases. Even
Tp(k), the average last-copy delay for packets with fan-out k, increases rather slowly
with k. This shows that m=2 is fair in handling packets with different fan-outs. On
the contrary, with m=32, both Tc(k) and Tp(k) increase more rapidly with fan-out size.
5.4.2 Performance under Uniform Bursty Mixing Traffic
We use the same traffic generator except that bursty arrivals are modeled by
the ON/OFF traffic model of Chapter 2.4. In the ON state, a packet arrival is
generated in every time slot, and each arrival has equal probability of being unicast or
multicast. Simulation results in Figs. 5.3 and 5.4 are based on a burst size of sp = 30 packets.
Again, we can express the aggregated load at each output port by (5.1).
From Fig. 5.3, the performance gap between m=2 and m=32 is much wider
than that in Fig. 5.1. This is because bursty traffic causes more unevenly distributed
queue sizes in the input ports when m=32. With m=2, packet duplication mainly
occurs at middle-stage ports. In this case, both input port queue size and input port
delay are reduced. With m=1, packet duplication occurs only at middle-stage ports,
and the throughput suffers from severe HOL blocking. From Fig. 5.4, we can
again see that m=2 is fair in handling packets with different fan-outs. Although m=32
also gives improved fairness performance, this is at the cost of very high average
delay (Tc(k) > 750 slots).
Fig. 5.3 Delay vs output load λ , with bursty mixing traffic
5.4.3 Performance under Binomial Mixing Traffic
Binomial mixing traffic [45] is the same as the Bernoulli uniform mixing
traffic model except in how the fan-out size of a multicast packet is generated. Let Pk be
the probability of generating a fan-out set of size k. The k destinations are uniformly
distributed over all output ports. The value of k is chosen according to a non-uniform
binomial distribution, with mean fan-out h:
Pk = C(N,k)·(h/N)^k·(1 − h/N)^(N−k).
Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7
In our simulations, we set mean fan-out h = 17. Then the output load λ is:
λ = p[0.5 + 0.5×17] = 9p.
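A small check of this fan-out model (illustrative code; the function name is ours): a binomial over N = 32 outputs with success probability h/N has a pmf that sums to one and a mean fan-out of exactly h = 17, which recovers λ = 9p just as in (5.1).

```python
from math import comb

def fanout_pmf(n, h):
    """P_k = C(n,k) * (h/n)^k * (1 - h/n)^(n-k): binomial fan-out, mean h."""
    p = h / n
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

pmf = fanout_pmf(32, 17)
mean_fanout = sum(k * pk for k, pk in enumerate(pmf))
```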
The delay performance shown in Fig. 5.5 is comparable with that in Fig. 5.1. This is
because the two traffic models are quite similar. Specifically, they have the same
Bernoulli packet arrival, same average fan-out size of 17, and their fan-out sets are
all uniformly selected from all outputs. We skip the figure of delay vs fan-out
because it has a similar trend as that in Fig. 5.2.
From the simulation results above, we can see that setting m=2 is sensible as
it ensures sufficiently low packet delay and high throughput. Besides, the extra
complexity involved in maintaining two multicast queues is marginal.
Fig. 5.5 Delay vs output load λ , with binomial mixing traffic
5.5 Chapter Summary
In this chapter, the feedback-based two-stage switch was extended to schedule
multicast traffic by slightly modifying its operations. The resulting switch not only
removes the centralized scheduler but also supports multicast traffic using a simple
unicast switch fabric. Simulation results showed that with packet duplication at both
input ports and middle-stage ports, the proposed multicast scheduling algorithm is
effective in cutting down both average delay and delay variation among different
copies of the same multicast packet.
Chapter 6
Multi-cabinet Implementation
6.1 Introduction
To accommodate the growth of Internet traffic, high-speed routers consist
of a large number of linecards (e.g. 1152 linecards in the Cisco CRS-1 [8]), resulting in
large physical space and power requirements. Consequently, a multi-cabinet
implementation of routers is needed [50-51], where the distance between linecards
and (central) switch fabrics can be tens of meters.
In a single-cabinet implementation, the propagation delay between linecards
and switch fabrics is negligible. In a multi-cabinet implementation, due to the non-
negligible propagation delay, the requirement that occupancy vectors must arrive at
input ports within a single time slot will significantly lower the feedback-based
switch efficiency. This is illustrated in Fig. 6.1. Since the occupancy vector needs to
take the in-flight packet (in the first switch fabric) into account, it can only be
generated when the packet (at least partly) arrives. A dedicated feedback packet is
required, as piggybacking the occupancy vector onto a data packet is not possible. Finally,
an input port must wait for the occupancy vector to arrive before another packet can
be scheduled for sending. From Fig. 6.1, we can see that the duration of a slot must
be at least twice the propagation delay between linecards and the switch fabrics. But
in each slot, only a single packet can be sent. Since a switch fabric cannot be
reconfigured while there are in-flight packets, the slot duration is (roughly) the
duration that a switch configuration lasts.
Fig. 6.1 The timing diagram of switch with large propagation delay
In this chapter, we revamp the original feedback mechanism and design a new
batch scheduler to solve this problem. The basic idea is to schedule and send multiple
packets while each switch configuration lasts. The key challenge is how to keep
the original close-to-100% throughput performance and ensure in-order packet
delivery.
The rest of the chapter is organized as follows. In the next section, we review
some related work on addressing the impact brought by propagation delay. In Section
6.3, the feedback mechanism is revamped and a new batch scheduler is designed. Its
performance is evaluated in Section 6.4 and we conclude the chapter in Section 6.5.
6.2 Related Work
6.2.1 Multi-cabinet Implementation of Input-queued Switch
To improve the performance of input-queued switch under multi-cabinet
implementation, SRR (Synchronous Round Robin) scheduler is proposed [51]. SRR
is a distributed and iterative scheme in which each input port sends only one request,
based on a cyclic, TDMA-like (Time Division Multiple Access) preferential
scheduling of VOQs. A request is selected by logically numbering the slots with an
incremental counter ranging from 0 to N−1. If the preferred VOQ is empty, then the
longest one is selected. Each output also has a preferential input to grant based on the
same TDMA-like cycle. If the preferred input request does not arrive, one request is
randomly selected for the grant. An input port receives the grant for its current request
one round-trip time later. While waiting for the grant to arrive, each input continually
sends its preferred request on a slot-by-slot basis. From [51], we can see that when
the traffic is bursty, the switch throughput is rather limited.
6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch
A multi-cabinet implementation of buffered crossbar switch is studied in [19],
where a large packet buffer size at each crosspoint is required to achieve high
throughput. This imposes further challenges to the implementation of buffered
crossbar switch. In [18], virtual crosspoint queues are introduced to alleviate the in-
fabric buffer requirement but the resulting switch gives poor throughput performance
under some traffic conditions.
6.3 Multi-cabinet Implementation of Feedback-Based Switch
6.3.1 Revamped Feedback Mechanism
Fig. 6.2 Multi-cabinet implementation of the feedback-based switch
Fig. 6.2 shows a multi-cabinet implementation of a feedback-based two-stage
switch. To increase the switch efficiency, we can send multiple packets in a slot. The
minimum duration of a slot is the round trip propagation time between linecards and
switch fabrics, or RTT seconds. Let the (maximum) number of packets that can be
sent in each slot be x. The value of x depends on packet size ( B bytes), RTT , and the
line rate ( R bps). Roughly, we have
x = RTT / packet_duration = RTT·R / (8B).
For a typical distance of 20 meters between linecards and switch fabrics, the
(minimum) slot duration is RTT =200 ns. To transmit a packet of 200 bytes on a
40Gbps line, 40 ns are required. Reserving some guard times for control, we can
transmit x = 4 packets in a slot, as shown in Fig. 6.3.
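The arithmetic behind x can be sketched as follows (the function name is ours; units are chosen so the division is exact). With RTT = 200 ns, R = 40 Gbps and B = 200 bytes, a packet lasts 40 ns, so up to five packets fit in one slot before guard times are deducted.

```python
def max_batch_size(rtt_ns, rate_gbps, pkt_bytes):
    """x = RTT / packet_duration = RTT * R / (8B), ignoring guard times."""
    pkt_ns = 8 * pkt_bytes / rate_gbps   # bits / (Gbit/s) gives nanoseconds
    return int(rtt_ns // pkt_ns)
```

Reserving guard time for control then leaves x = 4 usable packet slots, as used in the text.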
Fig. 6.3 Feedback operation in multi-cabinet implementation
But can we still keep the in-order packet delivery and high-throughput
properties of a single-cabinet implementation of the feedback switch? With the
following modifications, the answer is yes. First of all, the buffer size at each middle-
stage port VOQ2( j,k ) is increased to x to accommodate up to x packet arrivals in each
time slot. The occupancy vector is expanded to N ·log x bits, as the size of each VOQ
requires log x bits.
The feedback operation is also revamped. Refer to Fig. 6.3. Assume at time
slot t input port i connects to output k via middle-stage port j. At the beginning of slot
t , (based on the occupancy vector received in the previous slot) input i uses a local
batch scheduler (to be detailed in Section 6.3.2) to select up to x packets for sending.
A special header (destination report) is appended to the first packet sent, which
contains the destinations of the x packets to be sent in this slot. As each destination
requires log N bits, the destination report consists of x·log N bits.
While input ports are sending packets to middle-stage ports, middle-stage
ports are sending packets to output ports in parallel. When a middle-stage port ( j) is
connected to an output port (k ), all backlogged packets (at most x) in VOQ2( j,k ) will
be completely cleared. (Backlogged packets refer to packets that arrived in previous time
slots, excluding those arriving in the current slot.) In fact, due to the
predetermined sequence of configurations used, middle port j knows beforehand
which VOQ2( j,k ) will be cleared at which time slot.
Middle-stage port j generates the occupancy vector upon receiving the
destination report from input i. The destination report contains the destinations of all
the packets to arrive in the following slot duration. Therefore, at the time the
occupancy vector is generated (in the middle of slot t ), it already looks ahead to get
the accurate VOQ status at the time the last packet sent in slot t arrives at middle-
stage port j (see Fig. 6.3). The occupancy vector is then appended to the next packet
sent in the second switch fabric for transmission, i.e. packet 3 in Fig. 6.3.
When the occupancy vector arrives at output k and is made available to input
k at the beginning of slot t +1, the input port batch scheduler selects and sends up to x
packets to middle-stage port j. It should be emphasized that the scheduling is based
on what will happen when the selected packets arrive at middle-stage port j (i.e. the
information in the occupancy vector received). Notably, the first packet from input k
will arrive at middle-stage port j right after the last packet from input i. The
bandwidth of the switch fabric is thus fully utilized.
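The look-ahead in the feedback generation can be sketched like this (illustrative names, not from the thesis): the middle-stage port adds the destinations announced in the report to its current VOQ2 backlog, so the vector it feeds back already describes the state after all the in-flight packets land.

```python
def lookahead_occupancy(voq2_len, dest_report, x):
    """voq2_len[k]: current backlog of VOQ2(j,k); dest_report: destinations of
    the up-to-x packets announced by the connected input for this slot.
    Returns the occupancy vector as seen after all announced packets arrive."""
    occ = list(voq2_len)
    for k in dest_report:
        occ[k] += 1                       # count the in-flight packets too
    assert all(q <= x for q in occ), "each VOQ2 buffers at most x packets"
    return occ
```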
6.3.2 Batch Scheduler Design
Now we focus on the batch scheduler design. Without loss of generality, we
assume a LQF batch scheduler at input port k. Specifically, input k identifies the set
of VOQ2's at middle-stage port j that have room for new packets; denote this set by
Sj. It then finds the longest queue VOQ1(k,h) such that VOQ2(j,h) belongs to Sj,
schedules the HOL packet of VOQ1(k,h) for sending, and updates Sj and the size of
VOQ1(k,h). This process is repeated until x packets are scheduled (or no more
packets are available).
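A minimal sketch of this batch LQF loop (function and variable names are ours; queue lengths are plain integers):

```python
def batch_lqf(voq1_len, voq2_room, x):
    """LQF batch scheduler at input k: repeatedly pick the longest VOQ1(k,h)
    whose VOQ2(j,h) still has room, up to x packets per slot. Returns the
    list of scheduled destinations in selection order."""
    voq1, room, batch = list(voq1_len), list(voq2_room), []
    while len(batch) < x:
        cands = [h for h in range(len(voq1)) if voq1[h] > 0 and room[h] > 0]
        if not cands:
            break
        h = max(cands, key=lambda d: voq1[d])   # longest queue first
        batch.append(h)
        voq1[h] -= 1
        room[h] -= 1
    return batch
```

For example, with VOQ1 backlogs [3, 0, 5, 2] and VOQ2 room [1, 4, 2, 4], the scheduler drains destination 2 twice before its VOQ2 fills, then falls back to the next-longest queues.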
As with the scheduler in the single-cabinet implementation, we also include the
following refinements in the batch scheduler:
Forced-zero-queue-size: If middle-stage port j will connect to output k' in the next
slot t+1, then in the current slot t middle-stage port j reports a zero queue
size for VOQ2(j,k'). This is because VOQ2(j,k') is guaranteed to be exhausted
at the end of slot t+1 (i.e. all its packets will be sent to output k'). With
forced-zero-queue-size, the batch scheduler has more flexibility in selecting
packets to send.
Preventing underflow: Assume input port i connects to middle-stage port j in time
slot t, and j will connect to output k' in the next slot t+1. If flow(i,k') has
packets waiting in VOQ1(i,k') but VOQ2(j,k') does not have x packets ready
for sending in slot t+1, an underflow will occur. To avoid the possible loss of
efficiency due to underflow at VOQ2(j,k'), at slot t input i should always give
priority to sending packets from VOQ1(i,k') to VOQ2(j,k').
6.3.3 Some Properties
The new batch scheduler operates on the architecture of Fig. 6.2, which
adopts the same joint sequence as Fig. 2.2(a). In the following, we show that the
multi-cabinet implementation of the feedback-based scheduler ensures in-order
packet delivery and 100% throughput under a speedup of two, supports asymmetric
reconfiguration, and cuts the communication overhead:
In-order packet delivery. Each flow having a constant middle-stage delay is a
sufficient condition for packet in-order delivery in two-stage switch (proven
in Chapter 3). While extending the feedback-based switch to multi-cabinet
implementation, we allow x packets to be sent in each time slot. The constant
middle port delay for packets of the same flow is still guaranteed by the
adopted joint sequence. The delay a packet experiences at a middle-stage port
is again bounded by [1, N] slots. Without loss of generality, assume m (out of
x) packets arriving at middle-stage port j in the same time slot belong to the
same flow(i,k ). Those m packets will be buffered at VOQ2( j,k ) for the same
amount of time until middle-port j is connected to output k . Then, they will
be delivered to output k (possibly together with ( x – m) packets from other
flows) in the same slot. So the constant middle-stage delay is still guaranteed,
and thus the in-order delivery property is still ensured.
100% throughput under speedup of two. For multi-cabinet implementation
with a batch size of x packets, we can treat each batch as a single aggregate
packet. Then the multi-cabinet switch is equivalent to a single-cabinet switch.
In other words, the propagation delay between linecards and switch fabrics
does not affect/reduce the throughput performance of a multi-cabinet switch.
Asymmetric reconfiguration. In Fig. 6.3, when the last bit of the x-th packet
arrives at the middle-stage port, the first stage switch fabric can start to re-
configure. When the last bit of the x-th packet departs the second switch
fabric, the second switch fabric can start to re-configure. In other words, the
reconfiguration of second fabric can start before the last bit of the x-th packet
arrives at the output port. For optical switch fabrics with non-negligible
amount of re-configuration overheads, such a pipelined packet transmission
and asymmetric reconfiguration can be very efficient.
Cutting down the communication overhead. In the original feedback-based switch,
the communication overhead for sending a single packet is N bits. From Fig.
6.3, we can see that only a single occupancy vector of N·log x bits is required
for the x packets sent by the batch scheduler. The per-packet communication overhead
is thus reduced from N bits to (N·log x)/x bits.
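The overhead comparison above is simple arithmetic; the sketch below (our helper name) follows the thesis in writing log x for the bits needed per VOQ size field. For N = 32 and x = 4, the per-packet feedback drops from 32 bits to 16 bits.

```python
from math import log2

def per_packet_feedback_bits(n, x):
    """One n*log2(x)-bit occupancy vector amortized over a batch of x packets."""
    return n * log2(x) / x
```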
6.4 Performance Evaluations
In this section, we study the performance of our multi-cabinet implementation
of the feedback-based switch by simulations. In the following, we only present
simulation results for a switch of size N=32, although similar conclusions apply to
other sizes. As the duration of a time slot differs with the propagation delay
considered (see Figs. 6.1 and 6.3), the delay performance is
measured by the number of time units, where each time unit is equivalent to the
transmission time of a packet at line rate. In our simulations, we use the same traffic
models as Chapter 2.4, i.e. uniform, uniform bursty and hot-spot. We assume the
propagation delay between linecards and switch fabrics is y, which varies from 1 to 2
time units. For simplicity, we ignore the overheads for switch reconfiguration,
scheduling, etc. Three scheduling algorithms are compared:
LQF without batch scheduling. When propagation delay is y time units, we
denote the algorithm by LQF/ y. The operation of LQF/ y is based on Fig. 6.1,
where only one packet can be sent in each slot. In other words, this is a direct
extension from the single-cabinet case.
LQF with batch scheduling (as shown in Fig. 6.3). When propagation delay is
y, we denote the algorithm by B-LQF/ y and the number of packets that can be
sent in each time slot is 2 y.
SRR algorithm [51]. When the propagation delay is y, we denote SRR as
SRR/ y. We regard SRR as a “generalization” of iSLIP [15] for multi-cabinet
implementation. In other words, SRR serves as a benchmark for single-stage
input-queued switches. Note that we do not compare with LQF-Byte-focal
[23] and CR [29] because they cannot be used for multi-cabinet
implementation.
6.4.1 Performance under Uniform Traffic
From Fig. 6.4, we can see that due to the inefficiency caused by propagation
delay, LQF/ y can only obtain up to 25% and 50% throughput when y=2 and 1
respectively. With B-LQF/ y, close-to-100% throughput can be obtained. Note that the
average middle-stage port delay is still 16.5 slots. Since the duration of a slot is 2 y
time units, the average middle-stage port delay is 33 time units for y=1 and 66 for
y=2.
Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet
6.4.2 Performance under Uniform Bursty Traffic
In Fig. 6.5, our B-LQF/ y again yields close-to-100% throughput under bursty
traffic. Despite the fact that the middle-stage packet delay increases with the slot
duration, it is interesting to observe that when input load p>0.94, B-LQF/2 starts to
outperform B-LQF/1, though very slightly. The reason is as follows. In a time slot,
each input port can send up to 2y packets to a middle-stage port with B-LQF. So packets in
B-LQF/2 tend to have a higher chance to enter the middle port than B-LQF/1. The
earlier packets enter the middle port, the less input port delay they experience. So
with B-LQF/2, packets tend to experience less input port delay. Under heavy bursty
loading, the input port delay dominates the overall delay performance. For B-LQF/2,
the drop in input port delay starts to outweigh the increase in middle port delay at
p=0.94.
Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet
6.4.3 Performance under Hotspot Traffic
In Fig. 6.6, we can again see that B-LQF/ y yields close-to-100% throughput,
and significantly outperforms its non-batch scheduling counterparts.
Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet
6.5 Chapter Summary
In a multi-cabinet implementation of the feedback-based switch, due to the non-
negligible propagation delay between linecards and switch fabric, the requirement
that occupancy vectors must arrive at output/input ports within a single time slot will
significantly lower the switch efficiency. In this chapter, we revamped the original
feedback mechanism and a new batch scheduler was designed to address this
problem. We showed that with multi-cabinet implementation, the refined feedback-
based two-stage switch still guarantees in-order packet delivery, and provides close-
to-100% throughput performance.
Chapter 7
Scheduling Inadmissible Traffic Patterns
7.1 Introduction
In the previous chapters, the feedback-based switch was designed while
focusing on handling admissible traffic patterns (i.e. both the input ports and output
ports are not over-subscribed), like [21-32]. For any admissible traffic patterns, as
long as the switch is stable, all packets can arrive at the outputs with bounded delays.
In this case, fairness in throughput is not an issue. But in practice, admissible traffic
patterns cannot be ensured, as an output port can experience oversubscription from
time to time. Therefore, a router should also be designed to efficiently handle
inadmissible traffic patterns.
It is interesting to note that under an inadmissible traffic pattern where some
output ports are over-subscribed, the overall throughput in feedback-based switch is
not affected, as the over-subscribed outputs will always be fully utilized (due to
the work-conserving nature of the port scheduler used). However, different input
ports will have an unfair throughput share of the oversubscribed outputs. In other
words, the feedback switch will suffer from the ring-fairness problem, i.e. for packets
going to the same over-subscribed output (e.g. output 3 in Fig. 7.1), the further away
“up-stream” input ports (e.g. input 0 in Fig. 7.1) can throttle the nearby “down-
stream” input ports (e.g. input 3 in Fig. 7.1).
Fig. 7.1 A 4 x 4 feedback-based switch with output port 3 oversubscribed by
inputs 0, 1, 2 and 3.
To address the ring-fairness issue for over-subscribed outputs, a fair scheduler
is designed for the feedback-based switch in this chapter. The basic idea of the fair
scheduler is to reserve the middle-stage buffers for flows whose input VOQs exceed a
pre-determined threshold Q. The bandwidth of an over-subscribed output is then
allocated to those input VOQs (exceeding Q) using a simple round robin (RR)
scheduler. We show that the optimal value of the threshold equals the switch size
(Q= N ) and the resulting algorithm can meet the max-min fairness criterion.
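The core of the idea can be sketched as follows. This is an illustrative sketch with assumed names, and the behavior for non-congested traffic is simplified here to a longest-queue fallback: the over-subscribed output's bandwidth is rotated round-robin over only those inputs whose VOQ exceeds Q.

```python
def rr_pick(voq_len, q_threshold, rr_ptr):
    """For one over-subscribed output: serve inputs whose VOQ length exceeds
    the threshold Q in round-robin order, starting from rr_ptr. Returns the
    chosen input and the advanced pointer."""
    n = len(voq_len)
    congested = [i for i in range(n) if voq_len[i] > q_threshold]
    if not congested:
        # no VOQ exceeds Q: fall back to a longest-queue-first choice
        return max(range(n), key=lambda i: voq_len[i]), rr_ptr
    for step in range(n):
        i = (rr_ptr + step) % n
        if i in congested:
            return i, (i + 1) % n
```

Because the pointer advances past each served input, every congested VOQ receives the same share of the output's bandwidth over time, regardless of its position on the "ring".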
The rest of the chapter is organized as follows. In the next section, we review
some related work on fair scheduling algorithm design. In Section 7.3, our fair
scheduling algorithm is proposed. In Section 7.4, we show that the proposed
algorithm satisfies the max-min fairness criterion. Its performance is then evaluated
in Section 7.5 by simulations. Finally, we conclude the chapter in Section 7.6.
7.2 Related Work
In the literature, fair schedulers are designed to handle both admissible and
inadmissible traffic patterns. For inadmissible traffic patterns, algorithms can be
further divided into two types: those with over-subscribed output ports only, and
those with both over-subscribed input and output ports.
7.2.1 Fair Scheduling under Admissible Traffic
In [53], a centralized algorithm called GPS-SW (Generalized Processor
Sharing in network Switch) is proposed. Under the assumption that the traffic is
admissible, GPS-SW uses a matrix-scaling approach to maximize throughput while
distributing the excess available bandwidth in a fair fashion. However, the example
in [54] shows that under admissible traffic achieving both max-min fairness and
100% throughput at the same time is impossible. For the sake of fairness under
admissible traffic, GPS-SW sacrifices its throughput performance.
7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only
The F-MWM (Fair-MWM) algorithm, proposed in [55] for input-queued switches, assumes that output ports can be oversubscribed but input ports cannot. Therefore, F-MWM only considers fairly allocating the bandwidth of output ports. As soon as an (input) VOQ's length exceeds a pre-set threshold, the VOQ is moved to the congested list. Each VOQ in the congested list is served exactly once during every N time slots. The VOQs not in the congested list are scheduled using LQF.
TFQA (Tracking Fair Quota Allocation [56]) is a variant of F-MWM that applies to buffered crossbar switches. Unlike F-MWM, it maintains an adaptive threshold at each input port. The VOQs that exceed the threshold are placed in a primary class. Dual round-robin pointers in each input port schedule the packets, one pointer for the primary class and another for all VOQs. Higher priority is always given to the primary class.
7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports
In [54, 52], both input and output ports can be oversubscribed, so the bandwidth of all inputs and outputs must be fairly allocated. The algorithm for input-queued switches [52] operates in two main phases. In the first phase, only the output port bandwidth is considered. At the end of the first phase, the only possible bottlenecks for the flows are the input ports. In the second phase, the algorithm allocates bandwidth at the input ports in a max-min fair fashion, resulting in an allocation that is overall max-min fair.
AMFS (Adaptive Max-min Fair Scheduling) [54] is based on the architecture
of a buffered crossbar switch. AMFS maintains two systems: a virtual system that exactly emulates WF2Q+ (Worst-case Fair Weighted Fair Queueing+ [57]) and a real system, AMFS itself, that actually schedules the flows. The virtual WF2Q+ calculates per-flow virtual scheduling start and finish times, which AMFS attempts to emulate. It has been proven that AMFS can sustain 100% throughput for admissible traffic and ensure max-min fairness for inadmissible traffic without any speedup. However, this proof assumes an infinite crosspoint buffer size. Furthermore, the algorithm incurs the overhead of maintaining the virtual WF2Q+ system.
7.3 Our Approach
Like [55-56], we consider inadmissible traffic patterns with oversubscribed output ports only. This assumption is reasonable: input ports can indeed avoid being over-subscribed [55] thanks to the physical line-rate constraint on each ingress port. But an output port must process the egress traffic of N incoming flows, so output port bandwidth over-subscription is difficult to avoid.
First of all, an overload vector {wi} (i=0,1,...,N-1) and a reservation vector {qi} (i=0,1,...,N-1) are required for conveying reservation requests and grants at each middle-stage port j. All elements of the two vectors are initialized to -1. If wi = l and l > -1, input port i has at least Q packets destined for output l. If qi = m, then VOQ2(j,i) of the current middle-stage port j is reserved for input port m (for sending a packet to output port i). In each time slot, based on the values of {wi} and {qi}, the following operations are carried out at each
middle port j in parallel:
Sending a reservation/overload request. For any input port m, among its VOQs of length ≥ Q, select VOQ1(m,l) based on a round-robin (RR) scheduler; the identity of VOQ1(m,l) is piggybacked (using log N bits) onto the current packet transmission to middle port j. Middle port j updates its overload vector so that wm = l.
Determining the winner. Assume middle-stage port j connects to an output port k. Middle port j examines its {wi}. If all wi ≠ k (i=0,1,...,N-1), make sure that the reservation vector {qi} has qk = -1, meaning no reservation on VOQ2(j,k) is required (as none of the input ports has Q or more packets for output k). If some wi = k, then select (based on an RR scheduler) one of them, say wl = k, and set qk = l. This indicates that VOQ2(j,k) (of middle port j) is reserved for input port l. Then reset all wi = k to wi = -1 to indicate that the corresponding reservation requests for VOQ2(j,k) have been processed.
Ensuring a reservation is honored. Before middle-stage port j sends its occupancy vector to its connected output port k, j first examines its reservation vector {qi}. If there is any qi = m, where m ≥ 0 and m ≠ k, middle-stage port j knows that VOQ2(j,i) is not available, as it has been successfully reserved by input port m. Therefore, the feedback bit in the occupancy vector for VOQ2(j,i) is overwritten to 1. This ensures that VOQ2(j,i) can only be used by input port m.
Input port scheduling. Any VOQ1 that sent a reservation request at time slot t is given the highest priority for scheduling at time slot t + N. Otherwise, send the HOL packet from the longest VOQ1 whose corresponding middle-stage VOQ2 is empty (as in the original feedback-based switch with port scheduler LQF).
An example: Consider a 4×4 feedback-based switch (with the fair scheduler) configured by the joint sequence of Fig. 2.2(a). Further assume that at time slot 0, the lengths of VOQ1(0,0) and VOQ1(0,2) exceed threshold 4, so input 0 selects one of them (say VOQ1(0,2)) based on RR for sending a reservation request to its currently connected middle port 1. Both input port 0 and middle port 1 record the identity of VOQ1(0,2). When middle port 1 connects to output port 2 at time slot 1, it checks its received reservation requests for output port 2 and selects one (say that of VOQ1(0,2)) to grant based on RR. The middle-port VOQ2(1,2) can then only be used by input port 0 in the following 4 time slots. Meanwhile, VOQ1(0,2) is given the highest priority for scheduling at time slot 4.
An input port generates a reservation request if a VOQ1 exceeds a pre-
determined threshold Q. The delay between an input port generating a reservation
request and knowing the result is N time slots (one round trip time for the joint
sequence). Within these N time slots, each input port can send up to N packets. If Q
is smaller than N and the reservation is successful, by the time the input port learns the result, the corresponding VOQ1 may already be empty, as the backlogged packets may have been exhausted while waiting for the result to arrive. This would create a wasted slot. (On the other hand, if the reservation fails, no slot is wasted even though the corresponding VOQ1 may still be empty then.) If Q ≥ N, it is guaranteed that at least one packet remains in the queue to make use of the reserved slot. However, having a large Q would adversely affect the packet
delay performance. Therefore, we use Q = N in our proposed fair scheduler to get the
best delay-throughput performance.
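The worst-case arithmetic behind the choice Q = N can be written out explicitly; a minimal sketch under our reading of the timing (at most N-1 packets of the requesting VOQ1 depart during slots t+1, ..., t+N-1, before the reserved slot at t+N):

```python
N = 32          # switch size = round-trip time of the joint sequence, in slots
Q = N           # reservation threshold used by the fair scheduler

backlog = Q     # VOQ1 length when the reservation request is generated
drained = N - 1 # packets that can leave that VOQ1 before the reserved slot
print(backlog - drained)   # 1: at least one packet is left for the reserved
                           # slot, so a successful reservation is never wasted
```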
7.4 Max-min Fairness Criterion
In the following, we would like to show that our fair scheduler can satisfy the
max-min fairness criterion. Firstly, we borrow the following two definitions from
[52,58]:
Definition 4: The allocation vector {ai} is said to be feasible if and only if:
Each entity receives an allocation greater than or equal to zero; that is, for all i, ai ≥ 0.
The total allocated resource is less than or equal to the available resource U; that is, ∑ai ≤ U.
Definition 5: For the demand vector {bi}, the allocation vector {ai} is said to
be max-min fair if:
1. It is feasible.
2. No entity receives an allocation greater than its demand; that is, for all i, ai≤ bi.
3. For all i, the allocation of entity i cannot be increased while satisfying the above
two conditions and without reducing the allocation of some other entity j for
which a j ≤ ai.
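Definition 5 is met by the classic water-filling allocation. The sketch below (the function name and example demands are ours, not from the thesis) computes such an allocation for checking small cases:

```python
from fractions import Fraction

def max_min_fair(demands, capacity):
    """Water-filling allocation satisfying Definition 5: repeatedly give every
    unsatisfied entity an equal share of the remaining capacity, capping each
    entity at its demand."""
    n = len(demands)
    alloc = [Fraction(0)] * n
    remaining = Fraction(capacity)
    active = set(range(n))
    while active and remaining > 0:
        share = remaining / len(active)
        # entities whose residual demand fits inside the equal share get
        # exactly their demand; leftover is redistributed in the next round
        capped = {i for i in active if demands[i] - alloc[i] <= share}
        if not capped:
            for i in active:
                alloc[i] += share
            remaining = Fraction(0)
        else:
            for i in capped:
                remaining -= demands[i] - alloc[i]
                alloc[i] = Fraction(demands[i])
            active -= capped
    return alloc

# Three flows demanding 0.2, 0.6 and 0.9 of an output of capacity 1:
# flow 0 is fully served; flows 1 and 2 split the remaining 4/5 equally.
print(max_min_fair([Fraction(2, 10), Fraction(6, 10), Fraction(9, 10)], 1))
```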
As long as an algorithm meets the three conditions above, it satisfies the max-min fairness criterion. Note that in our fair scheduler, the demand bi is the traffic load from input port i to an over-subscribed output port j. Let the capacity of
output port j be U, i.e. the available resource is U. Assume the fair scheduler divides U among the inputs, giving input port i an allocation ai (i=0,1,...,N-1). Obviously, ai ≥ 0 and ∑ai ≤ U, so {ai} is feasible (condition 1). By setting the threshold for generating a reservation request at Q = N, the fair scheduler wastes no reserved slot, so ai ≤ bi for all i (condition 2). In the following, we focus on condition 3, i.e. we try to increase some bandwidth allocation ai and see how this affects the other inputs. Assume the switch has been "warmed up". Let ci be the number of times that input i's VOQ(i,j) exceeds threshold Q during L time slots. We have
ci ≤ L for all i (i=0,1,...,N-1). (7.1)
If input i has a larger ci than input k's ck, then according to the fair scheduler, input i generates more reservation requests and thus gets a larger share of output j's bandwidth (as output j is over-subscribed). That is,
ai ≥ ak, if ci ≥ ck (7.2)
Taking a closer look at ci, there are two possible cases:
ci < L: In one or more time slots, the length of VOQ(i,j) is less than threshold Q. Then traffic load bi is satisfied by bandwidth allocation ai, i.e.
bi = lim L→∞ (ci /L) · ai
Therefore, ai cannot be further increased because ai has conformed to condition 2.
ci = L: The length of VOQ(i, j) is always longer than threshold Q. This indicates
that traffic load bi cannot be satisfied by bandwidth allocation ai because the
output port j is over-subscribed:
∑ai = U (7.3)
From (7.1), we have:
ci ≥ ck for all k (k=0,1,...,N-1)
Combining it with (7.2), we get:
ai ≥ ak, for all k (k=0,1,...,N-1) (7.4)
To increase ai, we have to reduce some ak (k=0,1,...,N-1) due to (7.3); but by (7.4), every such ak satisfies ak ≤ ai, so any increase of ai reduces the allocation of an input with a smaller or equal share. This is precisely condition 3. Combining the proofs of the three conditions in Definition 5, our fair scheduler satisfies the max-min fairness criterion. Note that we focus here on max-min fairness, but proportional fairness can also be supported with a minor revision of the fair scheduler.
7.5 Performance Evaluations
In this section, we focus on the fairness performance in allocating the
bandwidth of an over-subscribed output port using the original feedback-based
switch (Feedback) and the proposed fair scheduler (Feedback-F). (Note that for admissible traffic patterns, the fair scheduler yields the same performance as the original feedback-based switch; those results are thus not shown in this chapter.)
7.5.1 Server-client Traffic Model
The server-client traffic model in [59] is first adopted for generating
inadmissible traffic. At each time slot for every input, a packet arrives with
probability p. Linecards are partitioned into two types: a server (i.e. linecard 0) and
N -1 clients. The server transmits packets with equal probability to all clients. Each
client transmits 1/3 of its traffic toward the server and 2/3 to the other N -2 clients
with equal probability. The server is a hotspot and when N =32, the amount of traffic
going to the server is given by
λ = p( N -1)/3= 31 p/3.
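The server-client model above can be sketched as a destination sampler. The function name and the value of p are illustrative, not from [59]; the exact-arithmetic check at the end reproduces the coefficient 31/3:

```python
import random
from fractions import Fraction

N = 32          # switch size; linecard 0 is the server
p = 0.9         # per-input arrival probability per slot (illustrative value)

def arrival_destination(i):
    """Sample the destination of a packet arriving at input i under the
    server-client model: the server spreads uniformly over the clients;
    a client sends 1/3 to the server and 2/3 over the other N-2 clients."""
    if i == 0:
        return random.randint(1, N - 1)          # server -> uniform clients
    if random.random() < 1 / 3:
        return 0                                 # client -> server
    others = [j for j in range(1, N) if j != i]  # the other N-2 clients
    return random.choice(others)

# Offered load at the server: each of the N-1 clients sends 1/3 of rate p
lam_coeff = Fraction(N - 1, 3)
print(lam_coeff)   # 31/3, i.e. lambda = 31p/3 as in the text
```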
Fig. 7.2 shows the bandwidth share of three representative flows, (1,0), (9,0)
and (2,0), at the server, versus the total server loading λ. Note that to reach output
port 0, the middle stage port delays for flows (1,0), (9,0) and (2,0) are 32, 8 and 1
time slots, respectively. The server becomes over-subscribed (i.e. the traffic becomes
inadmissible) when λ > 1. With the original feedback-based switch, flow(2,0) (yellow) and flow(9,0) (purple) are quickly throttled by flow(1,0) (light blue) due to the ring-fairness problem. With Feedback-F, the three flows share the oversubscribed server bandwidth equally (together with the remaining 28 flows, not shown), thanks to its proven max-min fair allocation.
Fig. 7.2 Output 0’s throughput vs its output load λ, under server-client traffic.
7.5.2 Attack-traffic Scenario
We also emulate an attack-traffic scenario, where output 0 is gradually
dominated by traffic coming from input 1. The detailed traffic model is as follows. At
each time slot for each input port, a packet arrives with probability p. For input port 1,
an arrived packet goes to output port 0 with probability 0.5 (we call it an attack-flow),
and the remaining 0.5 probability is equally shared by all other output ports. For any
other input ports, an arrived packet goes to all N -1 output ports with equal probability.
Therefore, at the over-subscribed output 0, when N =32, the output load λ is:
λ=0.5 p + p·( N -2)/( N -1)= p·91/62
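The coefficient 91/62 follows from exact arithmetic over the N-2 well-behaved inputs plus the attacker; a quick check:

```python
from fractions import Fraction

N = 32
# input 1 sends to output 0 with probability 1/2; each of the other N-2
# inputs (2..N-1) spreads uniformly over its N-1 possible outputs, so it
# reaches output 0 with probability 1/(N-1)
coeff = Fraction(1, 2) + (N - 2) * Fraction(1, N - 1)
print(coeff)   # 91/62, matching lambda = 91p/62
```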
Fig. 7.3 Output 0’s throughput vs its output load λ , under attack traffic
From Fig. 7.3, as output load λ increases, with Feedback the throughput share
for flow(2,0) (yellow) and flow(9,0) (purple) quickly drops to 0, while the
throughput for the attack-flow(1,0) (light blue) increases linearly. When Feedback-F
is used, the attack-flow(1,0) is regulated/reduced, due to the max-min fair allocation
nature. Specifically, the attack-flow(1,0) can only make use of the excess bandwidth
(if any) from other flows with smaller traffic demands, i.e. flow( i,0)s (i=2,3…31).
From Fig. 7.3, we can see that the malicious flow can be identified and punished by
the proposed fair scheduler.
7.6 Chapter Summary
For an inadmissible traffic pattern where some outputs are over-subscribed,
the feedback-based two-stage switch suffers from the ring-fairness problem. To this end, a fair scheduler for the feedback-based switch was designed in this chapter. We adopted the simple idea of reserving a middle-stage buffer for any input VOQ exceeding a threshold Q. The bandwidth of over-subscribed outputs is then allocated to the input VOQs (exceeding Q) on an RR basis. We proved that the resulting algorithm satisfies the max-min fairness criterion. The simulation results also confirmed the max-min fair nature of the proposed scheduler.
Chapter 8
An Optical Implementation of Feedback-Based Switch
8.1 Introduction
For routers with an electronic switch fabric (e.g. Fig. 1.6), packets must go
through additional O-E-O conversion while being switched from one linecard to
another. This not only limits the router speed, but also increases the difficulties in
designing a high-speed electronic switch fabric. In this chapter, we propose an
optical implementation of our (electronic) feedback-based switch to enable a packet
to be switched all-optically from one linecard to another. We call the resulting switch
load balanced optical switch (LBOS).
It should be noted that despite all the advantages of optics [60-61],
implementing an all-optical router is still far from being practical because of the
immature technologies in optical processing and buffering. In this chapter, we focus
on designing hybrid electro-optic routers, where packet buffering and table lookup
are carried out in electrical domain, and switching is done optically.
The rest of this chapter is organized as follows. In the next section, we review related work on optical switches used in hybrid electro-optic routers. In Section 8.3, the design and operation of LBOS are detailed. In Section 8.4, LBOS is extended and refined along the lines of the electrical feedback-based switch. Simulation results are presented in Section 8.5, and we conclude the chapter in Section 8.6.
8.2 Related Work
There are various efforts in designing efficient optical switches for high-
speed routers. Notably, in the 100 Tb/s router project [27], optical implementation of
a load-balanced electronic switch [21] is considered. The three-stage Clos network
architecture is adopted where the center stage is implemented using optical MEMS
[62]. But all-optical packet transmission from an input linecard/port to an output linecard/port is not possible, as packets must be temporarily stored and processed in the electrical domain between the stages of the Clos network. Besides, to tackle the packet mis-sequencing problem, a large re-sequencing buffer of N²+1 packets is required at each output port, where N is the switch size.
Recently, Fasnet [63], an optical switch fabric comprising N switch linecards
connected by two counter-rotating WDM fiber rings, is proposed. The notion of
counter-rotating WDM fiber rings originally appears in designing metro networks
[64], and is further refined in [59,65-68]. In Fasnet [63], one ring is used for
transmission, while the other is for reception. The N wavelengths in the transmission
ring are switched to the reception ring at a folding point between the two rings. Only
a special input port (called master input) can generate a frame header (called
locomotive). Other input ports can put their packets at the end of a frame as its frame
header passes by. At each input port, the maximum number of packets that can be
attached after one frame header is limited by a fairness quota of Y packets. Y can be
accumulated, but has an upper bound of U ×Y , where the values of Y and U should be
given in advance. Unlike [27], this ring-based switch architecture allows all-optical
packet transmission from one linecard to another. But its delay-throughput
performance is rather limited, which is further aggravated by the fairness algorithm
adopted.
8.3 Load Balanced Optical Switch (LBOS)
8.3.1 Switch Architecture
Our load-balanced optical switch (LBOS) is targeted at all-optical switching
of a packet from one linecard to another. As depicted in Fig. 8.1, LBOS consists of N
linecards connected by an N -wavelength WDM fiber ring. Each linecard i has two
ports, input i and output i. Linecard/output i is configured to receive (only) on its
dedicated wavelength channel λi. To send a packet to linecard j, linecard/input i needs
to transmit the packet onto channel λ j when λ j is idle.
Fig. 8.1 A 4x4 load balanced optical switch.
Fig. 8.2 The internal structure of linecard i.
The internal structure of linecard i is similar to that used by Fasnet [63], and
is shown in Fig. 8.2. For simplicity, the electrical buffers for implementing the virtual
output queues (VOQ(i,k )’s) at each input port are not shown. A linecard has three
major modules: a receiver on channel λi, a “tunable” transmitter (implemented using
a fixed laser array) and a wavelength monitor. In Fig. 8.2, the EDFA (Erbium Doped
Fiber Amplifier) is used to compensate for the optical signal loss en route. A filter drops wavelength λi from the fiber and passes all other channels to a splitter. The dropped λi enters the high bit-rate burst-mode receiver. The splitter taps out a fraction of the light and feeds it to the monitor module. The remaining signal in the fiber goes through an FDL (Fiber Delay Line) of td seconds, where td is the time required for the monitor to identify an idle channel (detailed in the next paragraph) and for the transmitter to start sending a packet onto a selected idle channel.
For the fraction of light entering the monitor, a demultiplexer separates it into N-1 individual λ's and directs them to the dc-coupled photodiode array. A threshold comparator detects idle wavelength channels. Among all the idle channels, the linecard controller identifies its longest VOQ(i,j), and the head-of-line packet from VOQ(i,j) is sent using the transmitter module. (We call this the LQF scheduler.) The transmitter module consists of a fixed laser array, where laser λj is used to send a packet destined to linecard j. (A fixed laser array can be more cost-effective than a single fast tunable laser.) Finally, the transmitted packet is merged back onto the fiber ring by the optical coupler (in Fig. 8.2) and continues its journey to the next linecard.
8.3.2 Switch Operation
Let the packet duration, i.e. the amount of time required to send a packet
(onto a wavelength channel), be t pkt seconds. We define the duration of a time slot to
be t d+t p seconds, where t d is the propagation delay of the FDL in Fig. 8.2 and t p is the
propagation delay of the fiber from the coupler in Fig. 8.2 to the drop filter of the
next linecard. Assume the whole system is synchronized, and in each time slot, at
most one packet can be transmitted and/or received by each linecard. For the proper
operation of the switch, we must have t d ≥ t pkt and t p ≥ t pkt. This is illustrated by Fig.
8.3, where linecard i starts to receive a packet at the beginning of slot t and it takes
t pkt seconds to receive the entire packet. Meanwhile, the monitor identifies the idle
channels, and a packet is sent onto the selected idle channel. The optical coupler adds
the packet back to the fiber ring at t d seconds after the beginning of the current slot. It
takes another t p seconds for the first bit of the packet to arrive at linecard i+1. This
marks the end of time slot t and the beginning of slot t +1. It is easy to see that a
packet sent by linecard i will arrive at linecard j after ( j – i) mod N time slots.
Fig. 8.3 Timing diagram for load balanced optical switch (LBOS).
From Fig. 8.3, we can see that in each time slot, the transmitter is idle in the
first t d seconds, whereas the receiver and monitor are idle for the last t p seconds. As
only a single packet is sent/received in each slot, the efficiency of LBOS is t pkt/(t d+t p),
or at most 50% (assuming t p=t d=t pkt). To enhance the efficiency, transmitter, receiver
and monitor can operate in parallel for pipelined packet sending, receiving and
scheduling, as shown in Fig. 8.4. Specifically, in the first half of time slot t , the
transmitter can send a packet scheduled in the second half of slot t –1. In the second
half of slot t, the receiver can receive a packet sent by some linecard in the first half of an earlier time slot; meanwhile, the monitor can schedule another packet for sending in the first half of slot t+1. In other words, two packets can be received and transmitted in each time slot. (We call this pipelined LBOS when we need to distinguish it from the original LBOS.)
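The timing relations above can be checked numerically. A small sketch follows; the parameter values are our assumptions (t_pkt = t_d = t_p = 100 ns), not mandated by the design:

```python
N = 4
t_pkt = 100e-9   # packet duration (s); illustrative value
t_d = 100e-9     # FDL delay, must satisfy t_d >= t_pkt
t_p = 100e-9     # propagation delay to the next linecard, t_p >= t_pkt

slot = t_d + t_p
print(t_pkt / slot)       # 0.5: basic LBOS tops out at 50% efficiency
print(2 * t_pkt / slot)   # 1.0: pipelining moves two packets per slot

def ring_delay(i, j, n=N):
    """Slots for a packet sent by linecard i to reach linecard j."""
    return (j - i) % n

print(ring_delay(0, 3))   # 3 slots downstream
print(ring_delay(3, 0))   # 1 slot: the packet wraps around the ring
```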
Fig. 8.4 Timing diagram for pipelined packet sending and receiving.
From the operations of the LBOS above, we can see that LBOS effectively
balances the loading in the ring network by spreading (i) packets going to different
destinations over different wavelength channels (i.e. space/wavelength domain load
balancing), and (ii) packets going to the same destination over different time slots (i.e.
time domain load balancing). In the next sub-section, we show that our LBOS is an
optical counterpart of the load-balanced electronic switch architecture in [32].
8.3.3 Equivalence to Load-Balanced Electronic Switches
Consider the basic LBOS operating based on the timing diagram in Fig. 8.3.
If we treat the fiber ring as a FDL, then the ring network “buffers” a packet from
linecard i to j for exactly ( j – i) mod N time slots. Since one round trip time (RTT)
along the ring is N time slots, a specific wavelength channel on the ring can
carry/buffer up to N in-flight optical packets. With N wavelengths, the fiber ring can
buffer up to N 2 packets. Therefore, (optical) packets are “buffered” as they propagate
along the fiber ring in different wavelengths, which exactly mimics the buffering
services rendered by the middle-stage VOQ2( j,k )’s in LBES (Fig. 2.1). In a specific
time slot, the channel status (i.e. idle or not) of all the wavelengths passing by, which
is equivalent to the occupancy of VOQ2( j,k )’s in Fig. 2.1, will be conveniently
detected by the wavelength monitor on each linecard – the need for dedicated
feedback packets/vectors is thus removed.
Fig. 8.5 A joint sequence in load-balanced switch.
Assume the LBES (with a single-packet buffer at each VOQ2(j,k)) is configured by the sequence of configurations shown in Fig. 8.5. Then we can easily find a one-to-one mapping between every instance of the sequence in Fig. 8.5 and the
corresponding operation on the ring network in Fig. 8.1. Due to this equivalence, our LBOS inherits all the nice features of the LBES [32-33], such as being scalable, distributed, and yielding close-to-100% throughput and low average packet delay.
8.4 Extensions and Refinements of LBOS
8.4.1 Cutting down the Average Delay by Reconfiguration
In LBOS, the delay experienced by a packet is the summation of the queuing
delay at the input linecard and the propagation delay between the input and output
linecards. Due to the way linecards are connected in a ring, the propagation delay is
predetermined and fixed. For example, in Fig. 8.1 each packet of flow(0,3) requires 3
time slots from linecard 0 to linecard 3. Then for a given traffic matrix { λi,j}, the
average packet propagation delay is:
ji
j i
ji h H ,, (8.1)
where λi,j is arrival rate and hi,j is the propagation delay for flow(i, j), respectively. We
have 0≤ λi,j≤1 and 0≤hi,j≤ N -1 for ]1,0[, N ji . In LBOS, we assume that flow(i,i)
does not enter the ring, and thus hi,i=0.
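As a sanity check, (8.1) can be evaluated directly for small cases; a sketch (the function name is ours) using the ring delay h(i,j) = (j - i) mod N:

```python
from fractions import Fraction

N = 4

def avg_prop_delay(lam):
    """H = sum_i sum_j lam[i][j] * h(i, j), with h(i, j) = (j - i) mod N the
    ring propagation delay and h(i, i) = 0 since flow(i, i) never enters
    the ring."""
    return sum(lam[i][j] * ((j - i) % N)
               for i in range(N) for j in range(N) if i != j)

# A single flow (0, 3) at full rate on the 4x4 ring of Fig. 8.1:
lam = [[Fraction(0)] * N for _ in range(N)]
lam[0][3] = Fraction(1)
print(avg_prop_delay(lam))   # 3 slots, as computed in the text
```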
Assume λ0,3=1 and it is the only flow of the switch in Fig. 8.1. From (8.1), we have H=3 slots. If we swap the positions of linecards 0 and 2 in Fig. 8.1, H becomes 1. This shows that by judiciously connecting linecards to form a ring, the propagation delay (and hence the average packet delay) can be minimized. It is not difficult to show that, for a given traffic matrix, finding the optimal linecard placement for minimizing H has the same complexity as the classic traveling
salesman problem [71]. Nevertheless, such a linecard placement problem can be
formulated as an ILP (Integer Linear Programming) problem.
Notations:
xi: the propagation delay experienced by packets of flow(0,i), where 0 ≤ xi ≤ N-1, for i ∈ [0, N-1]. In fact, xi indicates linecard i's position in the ring relative to linecard 0.
fi,j: binary variable, defined for j > i, i, j ∈ [0, N-1]. fi,j = 1 means xi > xj, and fi,j = 0 means xi < xj.
Objective:
minimize ∑i ∑j>i λi,j [(xj - xi) + N·fi,j] + ∑i ∑j>i λj,i [(xi - xj) + N·(1 - fi,j)] (8.2)
Subject to the following ring topology constraints:
x0 = 0 (8.3)
1 ≤ xi ≤ N-1 for i ∈ [1, N-1] (8.4)
xi - xj - N·fi,j ≥ 1 - N, for j > i, i, j ∈ [0, N-1] (8.5)
xj - xi + N·fi,j ≥ 1, for j > i, i, j ∈ [0, N-1] (8.6)
Notably, constraints (8.5) and (8.6) above ensure xi ≠ xj if i ≠ j.
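For small N, the placement problem can be cross-checked by brute force over all (N-1)! placements with linecard 0 fixed at position 0. The sketch below (names are ours) is a stand-in for, not an implementation of, the ILP (8.2)-(8.6):

```python
from itertools import permutations

N = 4

def prop_delay(pos, i, j):
    """Delay of flow(i, j) given pos[i] = linecard i's position on the ring."""
    return (pos[j] - pos[i]) % N

def best_placement(lam):
    """Exhaustively search the (N-1)! placements with linecard 0 fixed at
    position 0; feasible only for small N."""
    best = None
    for perm in permutations(range(1, N)):
        pos = {0: 0}
        pos.update({card: slot + 1 for slot, card in enumerate(perm)})
        h = sum(lam[i][j] * prop_delay(pos, i, j)
                for i in range(N) for j in range(N) if i != j)
        if best is None or h < best[0]:
            best = (h, pos)
    return best

lam = [[0] * N for _ in range(N)]
lam[0][3] = 1                      # single flow (0, 3), as in the example
h, pos = best_placement(lam)
print(h)                           # 1: the optimum places linecard 3
print(pos[3])                      # right after linecard 0, at position 1
```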
Note that the linecard placement pattern is changed only when there is a significant enough change in the traffic matrix. Even so, it is generally infeasible to reconnect linecards manually. To this end, we can implement an LBOS using an OXC (Optical Cross-Connect), as shown in Fig. 8.6. Note that all (N-1)! possible linecard placement patterns can be realized by an OXC, which supports N! configurations.
Further note that inexpensive OXC (with millisecond or more reconfiguration delay)
can be used if the reconfiguration takes place infrequently.
Fig. 8.6 Two possible linecard placement patterns using an OXC: (a) {0-1-2-3} and (b) {0-3-1-2}.
8.4.2 Supporting Multicast
The transmitter module in Fig. 8.2 consists of a fixed laser array. The lasers are turned on by direct current injection when a packet is to be sent, and data bits are then "written" onto a channel by an external modulator. The laser array facilitates multicasting: bits can be written simultaneously by the external modulator onto multiple wavelengths (whose lasers have been turned on for carrying a multicast packet). In this way, packet "replication" is performed in the optical domain (where bandwidth is less scarce). In other words, multicasting can be implemented without increasing the (expensive) bandwidth requirement of the electronic transmitters, as the electronic cost of sending a packet to multiple destinations is the same as sending it to a single destination. All the multicast scheduling algorithms in Chapter 5 can be implemented in multicast LBOS.
8.4.3 Implementing Fair Scheduler Optically
To implement the fair scheduler in Chapter 7 optically, an optical control
channel λ N is required for conveying reservation requests and grants (shown in Fig.
8.7), which is comparable to the control channel in an OBS network for making data
burst reservations. In other words, an extra transceiver on channel λ N is required at
each linecard for processing the control packets in electrical domain. Refer to Fig.
8.2, the λ N receiver is added in parallel with the λ i receiver in the receiver module,
and the λ N transmitter is added to the laser array at the transmitter module. Due to the
relative low data rate on the control channel, an inexpensive low-speed transceiver
can be used, e.g. using LEDs instead of laser diodes.
Assume the pipelined LBOS (in Fig. 8.4) is used and the traffic carried on the
ring network (in Fig. 8.1) is as shown in Fig. 8.7. We focus on the control channel λN
(where N = 4 in Fig. 8.7). In each packet duration, λN carries two vectors, an overload
vector {wi} and a reservation vector {qi}, where i = 0, 1, …, N-1. During a packet
duration, linecard k drops λN and then uploads the updated {wi} and {qi} on λN again.
Meanwhile, the operations of the fair scheduler (in Chapter 7) are carried out in the
electrical domain.
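The per-packet-duration control update at a linecard can be sketched as follows. This is an illustrative sketch with our own function and parameter names; the thesis only specifies that linecard k drops λN, updates {wi} and {qi} electrically, and re-uploads both vectors.

```python
# Hypothetical sketch of the control-channel update at linecard k in each
# packet duration (names and the exact update rule are our assumptions).
# The linecard "drops" λN, reads the overload vector {w_i} and reservation
# vector {q_i}, updates its own entries electrically, and re-uploads them.

def control_update(k, w, q, my_overload, my_request):
    """Update the control vectors at linecard k.

    w           -- overload vector {w_i}, i = 0..N-1
    q           -- reservation vector {q_i}, i = 0..N-1
    my_overload -- overload status this linecard reports
    my_request  -- reservation request this linecard posts
    """
    w = list(w)          # work on copies: the optical signal is terminated
    q = list(q)          # and regenerated, not modified in flight
    w[k] = my_overload   # report this linecard's overload status
    q[k] = my_request    # post this linecard's reservation request
    return w, q          # re-uploaded on the control channel

# Example with N = 4 (as in Fig. 8.7): linecard 2 updates its entries.
w, q = control_update(2, [0, 1, 0, 0], [3, 0, 0, 1], my_overload=1, my_request=2)
```

The fair-scheduler logic of Chapter 7 would then operate on the updated vectors in the electrical domain before the next packet duration.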
8.5 Performance Evaluations
In this section, we study the performance of our proposed LBOS under the
same three types of traffic patterns as in Chapter 2, i.e. uniform, uniform bursty and
hot-spot traffic. For comparison, Fasnet [63], which has a hardware complexity
similar to that of LBOS, is implemented. In simulating Fasnet, we adopt the best
parameters reported in [63], i.e. a fairness quota Y = 100 packets and a maximum
accumulated quota of U×Y = 500 packets. For both LBOS and Fasnet, we assume the
propagation delay between adjacent linecards is 100 ns (tp = 100 ns) and each linecard
introduces an FDL delay of 100 ns (td = 100 ns). The duration of a time slot is thus
200 ns, or two time units. We assume packets arrive at the beginning of each time
unit. For the non-pipelined LBOS (in Fig. 8.3), only one packet can be sent/received
in every two time units. With pipelined LBOS (in Fig. 8.4), one packet can be sent in
each time unit.
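As a sanity check, the timing figures above can be reproduced with a few lines of arithmetic (all values are taken from the text; the variable names are ours):

```python
# Sanity check of the ring timing figures used in the simulations.
t_p = 100          # propagation delay between adjacent linecards, ns
t_d = 100          # per-linecard FDL delay, ns
time_unit = 100    # one time unit, ns

slot = t_p + t_d                    # one time slot = 200 ns
units_per_slot = slot // time_unit  # = 2 time units per slot

# Non-pipelined LBOS sends at most one packet per two time units;
# pipelined LBOS sends one packet per time unit.
max_load_nonpipelined = 1 / units_per_slot   # 50% throughput cap
max_load_pipelined = 1 / 1                   # close-to-100% achievable

# Under uniform traffic, hop counts range from 1 to N-1 slots,
# averaging (1 + (N - 1)) / 2 = N/2 slots.
N = 32
avg_slots = (1 + (N - 1)) // 2           # 16 time slots
avg_units = avg_slots * units_per_slot   # 32 time units
```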
We also implement the iSLIP algorithm [14] (with a single iteration), which
serves as a benchmark for input-queued switches, and an output-queued switch, which
serves as a delay lower bound. In simulating them, zero propagation delay between
linecards is assumed (in their favor). It should be noted that both iSLIP and the output-
queued switch are generally not practical for optical implementation.
For simplicity, we only present simulation results for a switch of size N = 32
linecards below; similar conclusions and observations can be obtained for
other switch sizes.
8.5.1 Performance under Uniform Traffic
From Fig. 8.7, we can see that without pipelined sending and receiving,
LBOS can only attain up to 50% throughput. For pipelined LBOS, close-to-100%
throughput can be obtained. Note that the delay reported is the total delay a
packet experiences at the input port and en route. For LBOS, the average propagation
delay is 32 time units or 16 time slots (i.e., (1+N-1)/2 = N/2 slots under uniform traffic with
Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS.
Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS.
From Fig. 8.9, again we can see that pipelined LBOS consistently
outperforms Fasnet and delivers close-to-100% throughput.
8.5.4 Performance for Linecard Placement
We randomly generate 20 16×16 admissible traffic matrices. For each matrix,
the average propagation delay is calculated using (8.1), and the average over the 20
matrices is found to be H = 16.1 time units. With the optimized linecard placement
(obtained by solving the ILP in (8.2)-(8.6)), the average propagation delay drops to 14.1 time
units, a saving of 12.3% in propagation delay.
We then carry out simulations to obtain the average packet delay (i.e. taking the
input-port queuing delay into account) for each scenario. We found that without
optimized placement, the average delay is 25.9 time units, and with optimized
placement, it drops to 22.9 time units.
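The reported saving can be checked directly from the quoted figures; the ILP itself, given in (8.2)-(8.6), is not reproduced here, and the small difference from the quoted 12.3% comes from the H values being rounded to one decimal place:

```python
# Checking the linecard-placement numbers reported above (all values from
# the text; variable names are ours).
H_random = 16.1     # avg propagation delay, arbitrary placement (time units)
H_placed = 14.1     # avg propagation delay, optimized placement (time units)

saving = (H_random - H_placed) / H_random   # ~0.124, i.e. about 12%

# Average packet delay (propagation + input-port queueing), from simulation:
delay_random, delay_placed = 25.9, 22.9
queueing_random = delay_random - H_random   # queueing component, ~9.8 units
queueing_placed = delay_placed - H_placed   # queueing component, ~8.8 units
```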
8.6 Chapter Summary
In this chapter, we designed an optical implementation of the feedback-based
switch for use in a hybrid electro-optic router, called LBOS. It comprises N
linecards connected by an N-wavelength WDM fiber ring. Each linecard i is
configured to receive on channel λi. To send a packet, a linecard selects and transmits on an
idle channel according to where the packet goes. Packets are switched from one linecard
to another all-optically, so the extra O-E-O conversion found in state-of-the-art
routers is removed. We also showed that LBOS inherits all the nice features of a load-
balanced electronic switch.
Chapter 9
Conclusion
9.1 Our Contributions
In this dissertation, we dedicated our efforts to designing efficient and scalable
switch architectures for next generation high-speed routers. The two major design
objectives are no need for a centralized scheduler and amenability to optical
implementation.
In Chapter 2, we focused on removing the centralized scheduler by following
the approach of the load-balanced switch, due to its scalability and close to 100%
In a feedback-based switch, each middle-stage port needs to piggyback an N-bit
occupancy vector to its connected output in each time slot. In Chapter 4, we
concentrated on cutting down this communication overhead. The size of an
occupancy vector can be reduced by only reporting the status of selected middle-
stage VOQs. To identify the VOQs of interest, we partition the N VOQs into u non-
overlapping sets, each identified by a set number. In each time slot, every input
port piggybacks its set numbers of interest to the connected middle-stage port. This
guides a middle-stage port to report only the status of the VOQs of interest.
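A minimal sketch of this set-based reporting is given below; the function names and the equal-size partition of the N VOQs into u sets are our own illustrative assumptions, not the thesis's exact scheme:

```python
# Illustrative sketch of reduced occupancy feedback: the N middle-stage VOQs
# are split into u non-overlapping sets, and a middle-stage port reports only
# the VOQs in the sets its connected input port asked about.

def voq_set(j, N, u):
    """Set number of VOQ j when the N VOQs are split into u equal sets."""
    return j // (N // u)

def report(occupancy, requested_sets, N, u):
    """Return the status of only the VOQs whose set number was requested."""
    return {j: occupancy[j] for j in range(N)
            if voq_set(j, N, u) in requested_sets}

# Example: N = 8 VOQs in u = 4 sets of two; the input asks about sets {0, 3},
# so only VOQs 0, 1, 6 and 7 are reported back.
occ = [1, 0, 0, 2, 0, 0, 3, 1]
r = report(occ, {0, 3}, N=8, u=4)
```

The reported vector shrinks from N entries to N/u entries per requested set, which is the source of the communication saving.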
In Chapter 5, by slightly modifying the operation of the original feedback-
based two-stage switch, we showed that the feedback-based switch supports multicast
traffic efficiently. A notable feature of this multicast extension is that the switch
fabric remains unicast, while packet duplication is distributed to both the input and
middle-stage ports.
In a single-cabinet implementation, the propagation delay between linecards
and the switch fabric is negligible. In a multi-cabinet implementation, due to the non-
negligible propagation delay between linecards and the switch fabric, the requirement
that occupancy vectors must arrive at output/input ports within a single time slot would
significantly lower the efficiency of the feedback-based switch. To address this, we revamped
the original feedback mechanism in Chapter 6 for multi-cabinet implementation, and
a new batch scheduler was also devised.
As long as the incoming traffic is admissible, due to the close to 100%
throughput performance of our feedback-based switch, packets arrive at outputs with
bounded delays, so fairness in throughput is not an issue. Under inadmissible traffic
(i.e. when some output ports are over-subscribed), the feedback-based switch suffers from the
ring-fairness problem: “up-stream” input ports can starve some “down-stream”
input ports. To address this ring-fairness problem, an algorithm that allocates the
bandwidth of over-subscribed outputs based on the max-min fairness criterion was
proposed in Chapter 7.
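The max-min fairness criterion itself can be illustrated with the textbook water-filling procedure below. This is a generic, centralized sketch of the criterion, not the distributed algorithm proposed in Chapter 7:

```python
# Generic max-min fair allocation (water-filling): split an over-subscribed
# output's capacity among competing input demands so that no input can gain
# without reducing the allocation of an input with an equal or smaller share.

def max_min_fair(capacity, demands):
    alloc = [0.0] * len(demands)
    remaining = list(range(len(demands)))
    while remaining and capacity > 1e-12:
        share = capacity / len(remaining)
        # inputs whose residual demand fits within the equal share
        satisfied = [i for i in remaining if demands[i] - alloc[i] <= share]
        if not satisfied:
            for i in remaining:       # every demand exceeds the equal share:
                alloc[i] += share     # give each the same share and stop
            break
        for i in satisfied:           # fully satisfy the smallest demands,
            capacity -= demands[i] - alloc[i]
            alloc[i] = demands[i]     # then redistribute the leftover
            remaining.remove(i)
    return alloc

# Over-subscribed output: demands sum to 1.3 of a capacity of 1.0.
a = max_min_fair(1.0, [0.2, 0.5, 0.6])   # -> [0.2, 0.4, 0.4]
```

The small demand (0.2) is fully served, and the leftover 0.8 is split equally between the two larger demands, which is exactly the max-min outcome.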
In Chapter 8, we proposed an optical implementation of the feedback-based
switch, called Load-Balanced Optical Switch (LBOS). LBOS leverages an N -
wavelength WDM fiber ring to connect N linecards together. The ring network was
engineered such that the amount of time a packet should be buffered at a middle-
stage port exactly matches the propagation delay that this packet would experience
en route. We showed that with LBOS, all-optical packet transmission from an input
linecard to an output linecard is ensured.
9.2 Future Work
9.2.1 100% Throughput Proof without Speedup
In Chapter 2, we proved that under a speedup of two, the feedback-based switch
using any arbitrary work-conserving port scheduler is stable. Indeed, our simulation
results suggest that LQF without speedup is stable over a wide range of traffic patterns.
However, due to the lack of suitable theoretical tools, stability without speedup has yet
to be proved. In future work, we hope to come up with a 100% throughput proof
(without speedup) by appealing to other powerful mathematical models.
9.2.2 Building a Large Feedback-Based Two-Stage Switch
In a feedback-based switch, the average packet delay grows linearly with the
switch size N. Therefore, when N is large, the average delay suffers. To address this,
it is our hope that a large feedback-based switch can be constructed from a number of
small feedback switch modules. The delay would then grow
linearly with the module size instead of the whole switch size.
9.2.3 More Scalable Fairness Algorithm in LBOS
In Chapter 8, a dedicated control wavelength channel is required for
implementing the fair scheduler in LBOS. In this approach, an extra fixed receiver and
transmitter (on the control channel) are required at each linecard, which increases the
hardware complexity of LBOS. It is thus a worthwhile research direction to
implement the fair scheduler without increasing the hardware complexity.
9.2.4 Scalable Iterative Algorithm for Input-queued Switch
Besides load-balanced switches, we can also refine other switch architectures
for use in next generation high-speed routers. The input-queued switch with
iterative matching algorithms, as introduced in Chapter 1, is not scalable because
finding a maximal size matching incurs a communication overhead of up to N
iterations. A very interesting question is whether a maximal size matching can be
achieved in a single iteration that effectively functions as “N iterations”. To accomplish
this objective, the “weight” information (e.g. queue size) should be considered in this
single-iteration matching. Such an idea is very interesting and merits deep
deliberation.
scheduling for local area networks,” ACM Transactions on Computer Systems,
Vol. 11, pp. 319 – 352, 1993.
[14] N. McKeown, “Scheduling algorithms for input-queued cell switches,” PhD.
Thesis, University of California at Berkeley, 1995.
[15] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,”
IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp. 188 – 201, April 1999.
[16] Y. Li, S. Panwar and H. J. Chao, “On the performance of a dual round-robin
switch,” INFOCOM 1998, March 1998, San Francisco, USA.
[17] S. T. Chuang, A. Goel, N. McKeown and B. Prabhakar, “Matching output queueing with a combined input/output-queued switch,” IEEE Journal on
Selected Areas in Communications, Vol. 17, pp. 1030 – 1039, June 1999.
[18] K. Yoshigoe, “Threshold-based exhaustive round-robin for the CICQ switch
with virtual crosspoint queues,” ICC 2007 , June 2007, Glasgow, Scotland.
[19] R. Luijten, C. Minkenberg and M. Gusat, “Reducing memory size in buffered
crossbars with large internal flow control latency,” GLOBECOM 2003, Dec. 2003, San Francisco, USA.
[20] Y. Shen, S. S. Panwar and H. J. Chao, “Providing 100% throughput in a
buffered crossbar switch,” IEEE HPSR 2007 , May 2007, New York, USA.
[21] C. S. Chang, D. S. Lee and Y. S. Jou, “Load balanced Birkhoff-von Neumann
switches, part I: one-stage buffering,” Computer Communications, Vol. 25, pp.
611 – 622, 2002.
[22] C. S. Chang, D. S. Lee and C. M. Lien, “Load balanced Birkhoff-von
Neumann switches, part II: multi-stage buffering,” Computer
Communications, Vol. 25, pp. 623 – 634, 2002.
[23] Y. Shen, S. Jiang, S. S. Panwar and H. J. Chao, “Byte-focal: a practical load-
balanced switch,” IEEE HPSR 2005, May 2005, Hong Kong.
[24] X. L. Wang, Y. Cai, S. Xiao and W. B. Gong, “A three-stage load-balancing
switch,” INFOCOM 2008, April 2008, Phoenix, AZ, USA.
[25] I. Keslassy and N. McKeown, “Maintaining packet order in two-stage
switches,” INFOCOM 2002, June 2002, New York, USA.
[26] I. Keslassy, “The load-balanced router,” PhD. Thesis, Stanford University,
2004.
[27] I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N.
McKeown, “Scaling the internet routers using optics,” ACM SIGCOMM’03,
Aug. 2003, Karlsruhe, Germany.
[28] J. J. Jaramillo, F. Milan and R. Srikant, “Padded frames: a novel algorithm
for stable scheduling in load-balanced switches,” IEEE/ACM Transactions on
Networking, Vol. 16, No. 5, Oct. 2008.
[29] C. L. Yu, C. S. Chang and D. S. Lee, “CR switch: a load-balanced switch
with contention and reservation,” INFOCOM 2007 , May 2007, Anchorage,
Alaska, USA.
[30] C. S. Chang, D. S. Lee and Y. J. Shih, “Mailbox switch: a scalable two-stage
switch architecture for conflict resolution of ordered packets,” INFOCOM
2004, March 2004, Hong Kong.
[31] B. Lin and I. Keslassy, “The concurrent matching switch architecture,”
INFOCOM 2006 , April 2006, Barcelona, Spain.
[32] H. I. Lee, “A two-stage switch with load balancing scheme maintaining
packet sequence,” IEEE Communications Letters, Vol. 10, pp. 290-292, Apr. 2006.
[33] P. Gupta and N. McKeown, “Design and Implementation of a Fast Crossbar
Scheduler,” IEEE Micro, Vol. 19, Issue 1, pp. 20 - 28, Jan.-Feb. 1999.
[34] Y. S. Lin and C. B. Shung, “Quasi-pushout cell discarding,” IEEE
Communications Letters, Vol. 1, pp. 146-148, Sept. 1997
[35] B. Wu, K. L. Yeung, M. Hamdi and X. Li, “Minimizing internal speedup for
buffered crossbar switches,” IEEE HPSR 2006 , June 2006, Poznan, Poland.
[47] Z. Q. Dong and R. R. Cessa, “Packet switching and replication of multicast
traffic by crosspoint buffered packet switches,” IEEE HPSR 2007 , May 2007,
New York, USA.
[48] Z. Q. Dong and R. R. Cessa, “Input- and output-based shared-memory
crosspoint buffered packet switches for multicast traffic switching and
replication,” ICC 2008, May 2008, Beijing, China.
[49] P. Giaccone and E. Leonardi, “Asymptotic performance limits of switches
with buffered crossbars supporting multicast traffic,” IEEE Transactions on
Information theory, Vol. 54, No. 2, Feb. 2008.
[50] C. Minkenberg, R. Luijten, F. Abel, W. Denzel and M. Gusat, “Current issues
in packet switch design,” Proceedings of ACM SIGCOMM, pp. 119-124,
January 2003.
[51] A. Scicchitano, A. Bianco, P. Giaccone, E. Leonardi and E. Schiattarella,
“Distributed scheduling in input queued switches” ICC 2007 , June 2007,
Glasgow, Scotland.
[52] M. Hosaagrahara and H. Sethu, “Max-min fair scheduling in input-queued
switches” IEEE Transaction on Parallel and Distributed System, Vol. 19, NO.
4, April 2008.
[53] R. Yim, N. Devroye, V. Tarokh, and H. T. Kung, “Achieving fairness in
generalized processor sharing for network switches,” Proc. 22nd Biennial
Symp. Comm., pp. 185-187, 2004.
[54] X. Zhang, S. R. Mohanty and L. N. Bhuyan, “Adaptive max-min fair
scheduling in buffered crossbar switches without speedup,” INFOCOM 2007 ,
May 2007, Anchorage, Alaska , USA
[55] N. Kumar, R. Pan, and D. Shah, “Fair scheduling in input-queued switches
under inadmissible traffic,” GLOBECOM 2004, Vol. 3, No. 29, pp. 1713-1717,
Dec. 2004, Dallas, Texas, USA.
[56] N. Hua, P. Wang, D. P. Jin, L. G. Zeng, B. Liu and G. Feng, “Simple and fair
scheduling algorithm for combined input-crosspoint-queued switch,” ICC
2007 , June 2007, Glasgow, Scotland.
[57] J. R. Bennett and H. Zhang, “Hierarchical packet fair queueing algorithms,”
IEEE/ACM Transactions on Networking , vol. 5, no. 5, pp. 675–689, Oct.
1997.
[58] D. P. Bertsekas and R. Gallager , “Data networks,” Englewood Cliffs, NJ:
Prentice-Hall, 1992.
[59] A. Bianco, D. Cuda, J. Finochietto and F. Neri, “Multi-metaring protocol:
fairness in optical packet ring networks,” ICC 2007, June 2007, Glasgow, Scotland.
[60] H. Kogan and I. Keslassy, “Optimal-complexity optical router,” INFOCOM
2007, May 2007, Anchorage, Alaska, USA.
[61] M. Maier and M. Reisslein, “Trends in optical switching techniques: a short
survey,” IEEE Network , pp. 42 – 47, Nov./Dec. 2008.
[62] R. Ryf et al., “1296-port MEMS transparent optical crossconnect with 2.07 petabit/s switch capacity,” Optical Fiber Comm. Conf. and Exhibit (OFC) ’01,
Vol. 4, pp. PD28-P1-3, 2001.
[63] A. Bianco, E. Carta, D. Cuda, J. M. Finochietto and F. Neri,“A distributed
scheduling algorithm for an optical switching fabric,” ICC 2008, May 2008,
Beijing, China.
[64] A. Carena, V. D. Feo, J. Finochietto, R. Gaudino, F. Neri, C. Piglione and
P. Poggiolini, “RINGO: an experimental WDM optical packet network for
metro applications,” IEEE Journal on Selected Areas in Communications, Vol.
22, No. 8, pp. 1561-1571, Oct. 2004.
[65] A. Bianco, J. M. Finochietto, G. Giarratana, F. Neri and C. Piglione,
“Measurement-based reconfiguration in optical ring metro networks,”
Journal of Lightwave Technology, Vol. 23, No. 10, pp. 3156-3166, Oct. 2005
[66] A. Antonino, A. Bianco, A. Bianciotto, V. D. Feo, J. M. Finochietto, R.
Gaudino and F. Neri, “Wonder: a resilient WDM packet for metro
applications,” Optical Switching and Networking, Vol. 5, pp. 19-28, 2008.
[67] A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and M. Valcarenghi, “Wonder:
a PON over a folded bus,” GLOBECOM 2008, Nov. 2008, New Orleans, LA,
USA.
[68] A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and C. Piglione, “Multi-fasnet
protocol: short-term fairness control in WDM slotted MANs,” ICC 2006 ,
May 2006, Paris, France.
[69] X. Wang and K. L. Yeung. “Load balanced two-stage switches using arrayed
waveguide grating routers,” IEEE HPSR 2007 , June, 2007, New York, USA.
[70] J. C. Palais, “Fiber optic communications,” 5th ed., Upper Saddle River, NJ:
Pearson/Prentice Hall, 2005.
[71] A. Desai and S. Milner, “Autonomous reconfiguration in free-space optical
sensor networks,” IEEE Journal on Selected Areas in Communications
(JSAC), Vol. 23, No. 8, pp. 1556-1563, Aug. 2005
[72] T. Akin, “Hardening Cisco routers,” O’Reilly, Feb. 2002.
[73] A. Vukovic, “Network power density challenges,” ASHRAE Journal, Vol. 47,
Issue 4, pp. 55-59, Apr. 2005.
[74] M. Degermark, A. Brodnik, S. Carlsson and S. Pink, “Small forwarding
tables for fast routing lookups,” ACM SIGCOMM Computer Communication
Review, Vol. 27, Issue 4, pp. 3-14, Oct. 1997.
[75] W. Eatherton, G. Varghese and Z. Dittia, “Tree bitmap: hardware/software IP
lookups with incremental updates,” ACM SIGCOMM Computer
Communication Review, Vol. 34, Issue 2, pp. 97-122, April 2004.
[76] H. Song, J. Turner and J. Lockwood, “Shape shifting tries for faster IP
lookup,” IEEE ICNP 2005, pp. 358-367, 2005.
[77] V. Srinivasan and G. Varghese, “Faster IP lookups using controlled prefix
expansion,” ACM SIGMETRICS Performance Evaluation Review, Vol. 26,
Issue 1, pp. 1-10, June 1998.
[78] S. Nilsson and G. Karlsson, “IP-address lookup using LC-trie,” IEEE Journal
on Selected Areas in Communications, Vol. 17, pp. 1083-1092, June 1999.
[79] L. C. Wuu, K. M. Chen and T. J. Liu, “A longest prefix first search tree for IP
lookup,” IEEE ICC 2005, May 2005, Seoul, Korea.
[80] P. R. Warkhede, S. Suri and G. Varghese, “Multi-way range trees: scalable IP
lookup with fast updates,” Computer Networks, Vol. 44, No. 3, pp. 289-303,
2002.
[81] H. Lu and S. Sahni, “A B-tree dynamic router-table design,” IEEE
Transactions on Computers, Vol. 54, pp. 813-823, 2005.
[82] H. Lu and S. Sahni, “O(log W ) multidimensional packet classification,”
IEEE/ACM Transactions on Networking , Vol. 15, Issue 2, pp. 462-472, April
2007
[83] P. C. Wang, C. L. Lee, C. T. Chan and H. Y. Chang, “Performance
improvement of two-dimensional packet classification by filter rephrasing,”
IEEE/ACM Transactions on Networking, Vol. 15, Issue 4, pp. 906-917, Aug.
2007.
[84] M. Waldvogel, G. Varghese, J. Turner and B. A. Plattner, “Scalable high speed
IP routing lookups,” ACM SIGCOMM 1997, pp. 25-36, Sept. 1997, Cannes,
France.
[85] Q. Sun, X. H. Huang, X. J. Zhou and Y. Ma, “A dynamic binary hash scheme
for IPv6 lookup,” GLOBECOM 2008, Nov. 2008, New Orleans, LA, USA.
[86] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, “Longest prefix matching
using Bloom filters,” ACM SIGCOMM 2003, pp. 201-212, 2003.
[87] R. Sangireddy, N. Futamura, S. Aluru and A. K. Somani, “Scalable, memory
efficient, high-speed IP lookup algorithms,” IEEE/ACM Transactions on
Networking, Vol. 13, Issue 4, pp. 802-812, Aug. 2005.
[88] H. Y. Song, F. Hao, M. Kodialam and T. V. Lakshman, “IPv6 lookups using
distributed and load balanced bloom filters for 100Gbps core router line
cards,” INFOCOM 2009, April 2009, Rio de Janeiro, Brazil
[89] H. Y. Song and J. Turner, “Fast filter updates for packet classification using
TCAM,” GLOBECOM 2006 , Nov. 2006, San Francisco, USA
[90] R. Panigrahy and S. Sharma, “Reducing TCAM power consumption and
increasing throughput,” 10th IEEE Symposium on High Performance
Interconnects (HotI’02), pp. 107-112, 2002.
[91] F. Zane, G. Narlikar, and A. Basu, “CoolCAMs: power-efficient TCAMs for
forwarding engines,” INFOCOM 2003, April 2003, San Francisco, USA
[92] K. Zheng, C. C. Hu, H. B. Liu and Bin Liu, “An ultra high throughput and
power efficient TCAM-based IP lookup engine,” INFOCOM 2004, May 2004,
Hong Kong
[93] M. J. Akhbarizadeh, M. Nourani, R. Panigrahy and S. Sharma, “High-speed
and low-power network search engine using adaptive block-selection
scheme,” Proceedings of the 13th Symposium on High Performance
Interconnects, pp.73–78, 2005
[94] H. Yu, J. Chen, J. Wang, S. Q. Zheng and M. Nourani, “An improved TCAM-
based IP lookup engine,” IEEE HPSR 2008, May 2008, Shanghai, China
[95] H. Yu, J. Chen, J. P. Wang and S. Q. Zheng, “High-performance TCAM-
based IP lookup engines,” INFOCOM 2008, April 2008, Phoenix, AZ, USA
[96] A. Enteshari and M. Kavehrad, “40-100Gbps transmission over copper,”
DesignCon 2009, Feb. 2009, Santa Clara, CA. USA.
[97] M. Kavehrad, and J. F. Doherty, “10Gbps transmission over standard
category-5 copper cable,” GLOBECOM 2003, Dec. 2003, San Francisco, CA.
USA.
[98] G. Chartrand, “Introductory graph theory,” New York: Dover, p. 116, 1985.
[99] D. Gale and L. S. Shapley, “College admissions and the stability of
marriage,” Amer. Math. Monthly, vol. 69, pp.9–15, 1962.
[100] G. Kornaros, “BCB: a buffered crossbar switch fabric utilizing shared
memory,” Proc. Ninth EUROMICRO Conf. Digital System Design (DSD ’06),
pp. 180-188, Aug. 2006.
[101] H. Arimoto, T. Kitatani, T. Tsuchiya, K. Shinoda, T. Ohtoshi, M. Aoki and S.
Tsuji, “N-type doping to an active-short cavity DBR laser to expand its
continuous tuning range,” IEEE Photonics Letters, Vol. 20, No. 16, Aug. 15,
2008.
[102] J. E. Simsarian, M. C. Larson, H. E. Garrett, H. Xu and T. A. Strand, “Less
than 5-ns wavelength switching with an SG-DBR laser,” IEEE Photonics
Letters, Vol. 16, No. 4, Feb. 15, 2006.
[103] F. O. Ilday, J. Buckley, L. Kuznetsova and F. W. Wise, “Generation of 36-
femtosecond pulses from a ytterbium fiber laser,” Conference on Lasers and
Electro-Optics 2004 (CLEO), Vol. 2, pp. 3, May 2004.
[104] A. V. Konyashchenko, L. L. Losev and S. Y. Tenyakov, “Raman frequency
shifter for laser pulses shorter than 100 fs,” Optics Express, Vol. 15, No.
19, pp. 11855-11859, Sep. 2007.
[105] F. M. Chiussi, J. G. Kneuer and V. P. Kumar, “Low-cost scalable switching
solutions for broadband networking: the ATLANTA architecture and chipset,”
IEEE Communications Magazine, pp. 44-53, Dec. 1997.
[106] A. E. Tan, “IEEE 1588 precision time protocol time synchronization
performance,” National Semiconductor Application Note 1728, Oct. 2007.
[107] R. Palaniappan, Y. Wang, T. Clarke and B. Goldiez, “Simulation of an ultra-
wide band enhanced time difference of arrival System,” Parallel and
Distributed Computing and Systems, pp.306-309, Nov. 2007.