PARALLEL ALGORITHMS FOR HIGH PERFORMANCE SWITCHING
IN COMMUNICATION NETWORKS
APPROVED BY SUPERVISORY COMMITTEE:
Dr. S. Q. Zheng, Chair
Dr. I. Hal Sudborough
Dr. Jason P. Jue
Dr. R. N. Uma
Dr. Yuke Wang
Dr. Ashwin Gumaste
Copyright 2004
Enyue Lu
All Rights Reserved
To my grandparents,
My parents,
and
My husband.
PARALLEL ALGORITHMS FOR HIGH PERFORMANCE SWITCHING
IN COMMUNICATION NETWORKS
by
ENYUE LU, B.S., M.S., M.S.
DISSERTATION
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT DALLAS
August 2004
ACKNOWLEDGEMENTS
I would like to express my greatest gratitude to my research advisor, Professor
S. Q. Zheng. Without his support, guidance, and constant encouragement, this
dissertation would not have been possible.
I am grateful to Professor I. Hal Sudborough, Professor Jason P. Jue, Professor
R. N. Uma, Professor Yuke Wang, and Dr. Ashwin Gumaste for serving on my
supervisory committee. I am also grateful to Professor Kemin Zhang, Professor Weifan
Wang, Professor Yuehua Bu, and Professor Tianxing Yao for their guidance and
encouragement at the beginning of my research work in combinatorics and graph
theory. I would like to thank Professor Edwin Sha, Professor Kang Zhang, and
Professor I-Ling Yen for their suggestions and kind help. I thank Charles Jackson,
Guanyun Zou, and other colleagues for their support when I worked at Nortel Networks
as a co-op in spring 2001. I also thank my labmates, Dr. Mei Yang, Yi Zhang, Bing
Yang, Chuanjun Li, Priya Shetty, and Rohit Raut, for their helpful comments,
suggestions, and friendship.
I am especially grateful to my grandparents, parents, and husband for their
love, support, and encouragement over the years. I dedicate this work to them.
PARALLEL ALGORITHMS FOR HIGH PERFORMANCE SWITCHING
IN COMMUNICATION NETWORKS
Publication No.
Enyue Lu, Ph.D.
The University of Texas at Dallas, 2004
Supervising Professor: Dr. S. Q. Zheng
The explosive growth of the Internet is driving demand for faster transmission
rates and faster switching technologies. On one hand, switching algorithms, including
routing for establishing connections between inputs and outputs and scheduling for
resolving packet contention, play a fundamental role in the performance of switching
networks. On the other hand, low-cost, high-speed, and large-capacity switching
architectures are very attractive for high-performance switches and routers.
The main contributions of this dissertation can be categorized into three aspects:
Routing: Designing fast parallel routing algorithms for electronic or optical
multistage interconnection networks using time, space, and wavelength approaches.
• Time Dilation: By modeling the permutation decomposition problem as the
problem of edge coloring of bipartite graphs, we simplify the existing proof
of the decomposability of a permutation and reduce the decomposition time
to logarithmic. Using equitable coloring techniques, we further improve the
routing time complexity for optical Benes networks.
• Space Dilation: We study the connection capacity of a class of multistage
nonblocking switching networks constructed from Banyan networks by horizontal
concatenation of extra stages and/or vertical stacking of multiple copies, and
develop sublinear-time routing algorithms by modeling the routing problems
for these networks as weak and strong edge colorings of bipartite graphs.
• Wavelength Dilation: We model the wavelength routing problem as the vertex
coloring problem, show the maximum number of wavelengths needed, and
develop polylogarithmic-time routing algorithms for WRSS Banyan networks
and WRSR Benes networks.
Scheduling: Developing efficient parallel stable matching and acyclic stable
matching algorithms for switch scheduling.
• Stable Matching: We propose a new approach, parallel iterative improvement,
to solving the stable matching problem using randomization and greedy selection.
Simulation shows that our algorithm has good average performance and
converges in a small number of iterations with high probability.
• Acyclic Stable Matching: We model the acyclic stable matching problem as
the dominating set problem for a rooted dependency graph. The scheduling
algorithms based on our acyclic stable matching have low time complexity and
are feasible for high-speed implementation.
Architecture: Proposing rearrangeable nonblocking Benes group connectors and Clos
group connectors to serve as the major switching matrix in the design of ingress
edge routers for burst-switched DWDM networks. Based on our routing algorithms,
the hardware of Benes group connectors can be further reduced.
TABLE OF CONTENTS
Acknowledgements v
Abstract vi
List of Tables xi
List of Figures xii
CHAPTER 1 INTRODUCTION 1
1.1 Overview of Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Switch Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Output Contention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Internal Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Crosstalk Problem in Photonic Switching . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Previous Related Work on Switching Algorithms . . . . . . . . . . . . . . . . . . . . . . 21
1.3.1 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.2 Switch Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.3 Switch Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.3.4 Crosstalk-Free Routing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4 Motivations and Contributions of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 38
1.5 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
CHAPTER 2 A PARALLEL ITERATIVE IMPROVEMENT STABLE MATCHING ALGORITHM 43
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3 Parallel Iterative Improvement Matching Algorithm . . . . . . . . . . . . . . . . . . . 47
2.3.1 Constructing an Initial Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3.2 Constructing a New Matching from an Existing Matching . . . . . . . . 48
2.3.3 PII Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Implementations of PII Algorithm on Parallel Computing Machine Models . . . . 53
2.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
CHAPTER 3 DESIGN AND IMPLEMENTATION OF AN ACYCLIC STABLE MATCHING SCHEDULER 58
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 A Parallel Stable Matching Algorithm for Rooted Dependency Graph. . 59
3.2.1 Dominating Set for Dependency Graph . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.2 The Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2.3 Comparison with GS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3 Implementing the Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
CHAPTER 4 PARALLEL ROUTING ALGORITHMS FOR NONBLOCKING ELECTRONIC AND PHOTONIC SWITCHING NETWORKS 69
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Nonblocking Networks Based on Banyan-type Networks . . . . . . . . . . . . . . . 71
4.3 Graph Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 I/O Mapping Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.2 Graph Coloring and Nonblockingness . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Routing in Rearrangeable Nonblocking Networks . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Rearrangeable Nonblockingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4.2 Algorithm for Balanced 2-Coloring of G(N, K, g) . . . . . . . . . . . 78
4.4.3 Algorithm for g-Edge Coloring of G(N, K, g) . . . . . . . . . . . . . 80
4.4.4 Parallel Routing in a Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.5 Overall Routing Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5 Routing in Strictly Nonblocking Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.1 Strict Nonblockingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.2 Algorithm for Strong (2g-1)-Edge Coloring of G(N, K, g) . . . . . . . 86
4.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6 Self-Routing Nonblocking Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6.1 Connection Capacity of BL(N) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6.2 Constructing T(N, �) . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
CHAPTER 5 PARALLEL CROSSTALK-FREE ROUTING FOR OPTICAL BENES NETWORKS 98
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Parallel Permutation Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 Decomposability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.2 Decomposing a Permutation into Two Semi-Permutations . . . . . . . 102
5.2.3 Parallel Decomposition Algorithm for Partial Permutations . . . . . 104
5.3 Routing a Semi-Permutation in an Optical Benes Network . . . . . . . . . . . . 107
5.3.1 A Routing Algorithm Based on Parallel Decomposition . . . . . . . . . 107
5.3.2 The Improved Routing of Partial Semi-Permutation by Equitable Coloring . . . . 108
5.4 Comparisons of Three Dilation Approaches for Optical Benes Networks 110
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
CHAPTER 6 PARALLEL ROUTING AND WAVELENGTH ASSIGNMENTS FOR OPTICAL INTERCONNECTION NETWORKS 114
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2 Parallel Routing and Wavelength Assignment in WRSS Banyan Networks . . . . 117
6.3 Parallel Routing and Wavelength Assignment in WRSR Benes Networks . . . . 118
6.3.1 Upper Bound for the Number of Wavelengths . . . . . . . . . . . . . . . . . . 119
6.3.2 Routing and Wavelength Assignment Algorithm . . . . . . . . . . . . . . . . 120
6.4 Implementation on Realistic Multiprocessor Systems . . . . . . . . . . . . . . . . . . 128
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
CHAPTER 7 PARALLEL ROUTING ALGORITHMS FOR GROUP CONNECTORS 130
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3 Parallel Routing for Benes Group Connectors . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.1 Structure of GB(N, n) . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.2 Graph Model of GB(N, n) . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.3 Algorithm for GB(N, n) . . . . . . . . . . . . . . . . . . . . . . . . 138
7.3.4 Analysis and Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.4 Parallel Routing for Clos Group Connectors. . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 Generalizations and Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
CHAPTER 8 CONCLUDING REMARKS 147
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Bibliography 152
Vita
LIST OF TABLES
1.1 Nonblockingness of 3-stage Clos networks . . . . . . . . . . . . . . . . 16
1.2 Hardware costs of networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Routing algorithms for 3-stage Clos networks and Benes networks . . . . . 36
2.1 Time complexity for implementations of PII algorithm on three parallel computing machine models . . . . 56
3.1 Comparison of algorithms for finding a stable matching . . . . . . . . . 65
3.2 Timing and area results of the scheduler design. . . . . . . . . . . . . . . . . . . . . . . 67
4.1 Comparison of self-routing strictly nonblocking photonic switching networks . . . . 96
LIST OF FIGURES
1.1 Switching in the telecommunication networks . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Developments in switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Design of a telephone exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Third-generation packet switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Processor system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Output contention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 An OQ switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.8 HOL blocking in an IQ switch with FIFO buffers . . . . . . . . . . . . . 9
1.9 A VOQ switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.10 Internal blocking in a Baseline network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.11 Crossbar switch: (a) architecture; (b) states of crosspoint . . . . . . . . . . . . . 12
1.12 An SE and its two states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.13 Self-routing of Baseline network BL(16) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.14 3-stage Clos network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.15 Benes network B(8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.16 Electro-optic SE: (a) two states; (b) crosstalk . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.17 The relationship of the baseline network, butterfly network, and hypercube: (a) BL(8); (b) BF(8); (c) H(4) . . . . 25
1.18 Model of switch scheduling: (a) a VOQ switch; (b) bipartite graph; (c) maximum size matching; (d) maximal size matching . . . . 26
1.19 Scheduling based on stable matching in a VOQ switch . . . . . . . . . . . . . . . . 28
1.20 Matrix decomposition: M = M1 + M2 . . . . . . . . . . . . . . . . . . . 32
1.21 Graph representation: (a) edge Coloring; (b) matching . . . . . . . . . . . . . . . 33
1.22 Main work of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.1 Parallel random matching generation: (a) initial lists; (b) lists obtained after randomization . . . . 48
2.2 Finding a new matching from an existing matching. . . . . . . . . . . . . . . . . . . 50
2.3 Parallel computing models: (a) a 16-processor hypercube; (b) a 4 × 4 mesh of trees; (c) a 4 × 4 array with multiple broadcasting buses . . . . 54
2.4 Performance comparisons: (a) average number of iterations for algorithms to find a stable matching; (b) frequencies for algorithms to find a stable matching within n iterations . . . . 56
3.1 Finding a stable matching in a rooted dependency graph: (a) a rooted dependency graph ~G and its reduced subgraphs ~G' and ~G''; (b) a stable matching is found by the GS algorithm in 5 iterations . . . . 61
3.2 Finding a stable matching in an acyclic dependency graph: (a) an acyclic dependency graph ~G and its reduced subgraphs ~G' and ~G''; (b) a stable matching is found by the GS algorithm in 6 iterations . . . . 64
3.3 A 4 × 4 scheduler design: (a) scheduler block diagram; (b) circuit structure; (c) node block diagram . . . . 66
4.1 A network B(16, 2, 3, �) . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Finding a balanced 2-coloring: (a) an I/O mapping; (b) a balanced 2-coloring of an I/O mapping graph G(32, 25, 8); (c) a set of components; (d) pointer initialization for pointer jumping . . . . 74
4.3 Number of connection paths: (a) 1 path in B(16, 0, 1, �); (b) 2 paths in B(16, 1, 1, �); (c) 4 paths in B(16, 2, 1, �); (d) 8 paths in B(16, 3, 1, �) . . . . 76
4.4 Edge coloring: (a) a (weak) edge coloring; (b) a strong edge coloring . . 77
4.5 Construction of networks: (a) T(8, 0) based on BL(32); (b) T(8, 1) based on BL(64) . . . . 94
5.1 A 2-edge coloring of bipartite graph G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 A decomposition example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Decomposition of a partial permutation based on 2-edge coloring: (a) 5 different types of paths; (b) directed paths formed by pointer initialization and a 2-edge coloring . . . . 106
5.4 An equitable 2-edge coloring of a graph: (a) 3 odd paths and primary edges; (b) directed paths formed by pointer initialization and an equitable 2-edge coloring . . . . 110
5.5 A space dilated Benes network DB(8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 A 2 × 2 multi-wavelength SE: (a) two states; (b) signal transmission . . . . 116
6.2 A crosstalk-free routing and wavelength assignment for B(4): (a) a WRSR B(4) containing only basic SEs; (b) a WRSR B(4) containing non-basic SEs . . . . 121
6.3 Routing a permutation in WRSR B(8): (a) finding a wavelength assignment; (b) crosstalk-free routing in B(8) . . . . 126
7.1 Block diagrams of a group connector: (a) G(8, 4); (b) G'(8, 4) . . . . . 130
7.2 Block diagram of an ingress edge router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.3 A Benes group connector GB(16, 4) with k = 2 . . . . . . . . . . . . . . 135
7.4 Hardware redundancy of P1 and control bit selection of P2 in G(16, 4) . . . . 139
7.5 The settings of SEs in the first stage of GB(16, 4) according to the equitable 2-edge coloring . . . . 142
7.6 Construction of a Clos group connector: (a) a 3-stage Clos group connector GC(m, n, r); (b) a 2-stage Clos group connector GC(m, m, r) . . . . 143
CHAPTER 1
INTRODUCTION
1.1 Overview of Switching
The ITU-T, the telecommunication standardization sector of the ITU (International
Telecommunications Union), defines switching as:
"The establishing, on demand, of an individual connection from a desired
inlet to a desired outlet within a set of inlets and outlets for as long as is required
for the transfer of information."
Today, the information not only denotes the speech we hear in our telephone
receiver, but also incorporates all types of information from several telecommunica-
tion services, as shown in Figure 1.1.
Figure 1.1. Switching in the telecommunication networks
One hundred and twenty years ago, switching meant an operator interconnecting
two subscribers with each other. Today we view the concept of switching
differently. Present-day switching equipment must be capable of handling more
services than before, including high-quality audio, video of different quality
standards, LAN-to-LAN communication, the transfer of large data files, and new
interactive services based on the cable-TV network. But there is more to it than
the switching of information related to the service user. Information used by the
network - signaling information, for example - must also be switched.
Partly as a consequence of this, the number of switching techniques in the
public network has increased in recent years. From the beginning we had only circuit
switching, which is very suitable for telephone services. Since then, subscribers have
demanded better utilization of transmission capacity and larger bandwidth, and
other techniques have emerged. As a result of the requirements imposed by data
communication, circuit switching was supplemented in the 1970s with the packet
switching technique.
Today we also have frame relay and two types of cell switching: asynchronous
transfer mode (ATM) and distributed queue dual bus (DQDB). The origin of frame
relay and the techniques for cell switching can be traced to packet switching.
Business networks use still other techniques, such as distributed packet switching
by means of buses and rings (for example, Ethernet and token ring) and the fiber
distributed data interface (FDDI) standard.
The service explosion and the tendency to transmit very large amounts of
information through the network have brought the requirement for performance into
focus in recent years. Good performance means that delays through the switching
equipment are minimized, that the flow of information is not distorted in any way,
and that the switched bandwidth can match service requirements.
It is the switching equipment that primarily limits the bandwidth of a con-
nection. Today, we can make use of very high bit rates, up to tens of billions of bits
per second (tens of Gbit/s) in optical transmission systems. However, in switching
equipment, we must change over to electrical signals and considerably lower bit rates.
The next step is to use optical switching with electronic switch control. And
in time, we will most assuredly have fully optical switching systems. Indeed, in
view of the intensive research and development being carried out in this area, it
should not be long before the first optical space switches are commercially available.
Figure 1.2 describes technical developments in the field of switching (public
switching only). A detailed survey is given in [65].
Figure 1.2. Developments in switching
1.2 Background
Switching is the process by which a network element, called a switch, forwards data
arriving at one of its inputs to one of its outputs. There are three kinds of such
switches: (1) telephone switches, which support the telephone network; (2) datagram
switches (also called routers), which tie the Internet together; and (3) ATM switches,
optimized to deal with small, fixed-size packets called cells.
1.2.1 Switch Architecture
Generally speaking, a switch contains the following parts: inputs, outputs, a
switching fabric, and a switch controller. Connections are established between
inputs and outputs through the switching fabric. The switch controller is
responsible for configuring the switching fabric to establish connections.
Switches fall into two categories: circuit switches and packet switches. Circuit
switches switch voice samples, while packet switches switch packets that contain
both data and descriptive meta-data [39].
Circuit Switches
In a telephone switch, as shown in Figure 1.3, the switching fabric carries voice
and the switch controller handles the setup and teardown of circuits. A switch
transfers information from an input to an output. This can be complicated because
a large central office switch may have more than 150,000 inputs and outputs.
Figure 1.3. Design of a telephone exchange
The two basic ways to connect inputs to outputs are time division switching
and space division switching. In time division switching, the switch has only
one input and one output; each incoming voice sample is stored in one of N time
slots in sequence, and the switch controller determines the order in which the
time slots are read from the sequence. In space division switching, the switch
consists of N inputs and N outputs, and each voice sample takes a connection path
from its input through the switch to its output, depending on its source and
destination. In this dissertation, we focus on space division switching.
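The time-slot interchange at the heart of time division switching can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and arguments are not taken from the dissertation:

```python
def time_slot_interchange(frame, read_order):
    """Time division switching sketch: the N voice samples of one frame
    are written into N time slots in arrival order, then read back in the
    order the switch controller dictates (read_order is a permutation of
    the slot indices 0..N-1)."""
    slots = list(frame)                     # write phase: store N samples
    return [slots[i] for i in read_order]   # read phase: permuted readout
```

For example, reading a 3-sample frame in the order [2, 0, 1] delivers the third sample first; over N slots a single physical path can thus realize any permutation of N logical circuits.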
Packet Switches
There are two types of packet switches: virtual circuit ATM switches and datagram
routers. An ATM switch handles fixed-size packets, called cells, while a datagram
router handles variable-size packets. In this dissertation, we call them both
"switches" and use "cells" to refer to both fixed-size and variable-size packets.
Packet switches have evolved through three generations. Details about each
generation and comparisons of the characteristics of the three generations of
switches can be found in [39]. Figure 1.4 shows the block diagram of the
third-generation switch architecture, in which packets arriving at the inputs
simultaneously enter the switch fabric, through which they are routed in parallel
to the outputs.
Figure 1.4. Third-generation packet switch
Processor System
Although processor control in both circuit and packet switches can be implemented
in several ways, two main divisions have been made:
(1) Centralized control, where all work to set up connections is controlled from
a central processor system; and
(2) Distributed control, where the control functions are shared by a number
of processors that are more or less independent of one another.
In centralized control, if only one processor is used to perform both routine
work and advanced operations, the system is called a single-processor system. In
this system, the processor must be dimensioned according to the most difficult
tasks. At the same time, however, because the routine tasks are the most
time-consuming, the processor may have difficulty getting everything done. One
solution to this kind of problem is to let several processors share the work
load, which is called a multiprocessor system.
In distributed control systems there is no central processor for the overall
functions. Instead, the switching equipment is divided into a number of switching
parts, each of which has its own processor. In this case, the processors may have
complete control over all the work in their respective switching parts, or
centralized control may be retained, to a lesser degree, over certain functions
that connect the different switching parts.
Figure 1.5 gives an overall view of di�erent processor systems.
1.2.2 Output Contention
Output contention happens when connections from different inputs are requested to
be established to the same output simultaneously, as shown in Figure 1.6. In each
switching slot, only one connection can be established per output. Thus, for
circuit switching, only one connection request can be accepted and the others
will be blocked. For packet
Figure 1.5. Processor system
switching, only one cell can be transmitted across the switching fabric to a given
output, and each output can send only one cell; thus, the other cells must either
be discarded or buffered [11].
Figure 1.6. Output contention
For packet switching, switches can be categorized by where the cells are buffered.
In this dissertation, we only consider switches with buffers at the inputs and/or
outputs.
Output Queueing Switch
In an output queueing (OQ) switch, all cells destined for the same output are
allowed to arrive at the output at the same time. Since only one cell can be
transmitted via the output link at a time, the remaining cells are buffered at
the output, as shown in Figure 1.7.
The price to pay for this scheme for solving output contention is the need to
operate the switch fabric and the memory at each output port at N times the line
speed if there are N inputs. As the line speed or the number of switch ports
increases, this scheme becomes a bottleneck.
Figure 1.7. An OQ switch
Input Queueing Switch
Another way to resolve output contention is to place a buffer at each input port.
Only one cell is allowed to go to a given output port at a time, and the cells
that lose contention wait in the input buffers. A switch with this architecture
is called an input queueing (IQ) switch. An arbiter is needed to decide which
cells should be chosen and which cells should be rejected. This decision can be
based on cell priority or timestamp, or be random.
In an IQ switch with first-in-first-out (FIFO) input buffers, a blocked cell at
the head of an input queue can prevent the cells behind it, destined for idle
outputs, from being forwarded. This is called the head-of-line (HOL) blocking
problem. As shown in Figure 1.8, the cell at input N destined for output 1 is
blocked even though output 1 is idle. Due to HOL blocking, the throughput of an
input-buffered switch is at most 58.6% for random uniform traffic [38].
Figure 1.8. HOL blocking in an IQ switch with FIFO buffers
Input Queueing with Virtual Output Queues
The HOL blocking problem of the IQ switch can be overcome by maintaining at each
input a separate FIFO queue for each output. Each input buffer of the switch is
logically divided into N logical queues. Such a FIFO queue is called a virtual
output queue (VOQ), introduced in [89], and such a switch architecture is called a
virtual-output-queueing (VOQ) switch, as shown in Figure 1.9. All N VOQs in each
input buffer share the same physical memory, and each holds the cells destined for
a unique output port. Hence, HOL blocking is reduced and the throughput is increased.
However, a VOQ switch requires a fast and intelligent arbitration mechanism. Since
there are N^2 VOQs, up to N^2 (instead of N) HOL cells compete for switching in each
cell slot. A complex arbitration is needed to decide which N cells should be switched
in each cell slot, and this becomes the bottleneck of the switch.
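The VOQ bookkeeping described above can be sketched as follows. This is a hypothetical illustration only; the class and method names are ours, not the dissertation's.

```python
from collections import deque

class VOQInput:
    """One input port of an N x N VOQ switch: N logical FIFO queues
    (one per output) sharing the input buffer, so a cell blocked for
    one output never delays cells bound for other outputs."""

    def __init__(self, n_outputs: int):
        self.voq = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, output: int) -> None:
        """Place an arriving cell in the queue of its destination output."""
        self.voq[output].append(cell)

    def hol_cells(self):
        """Head-of-line cells of this input; across all N inputs,
        up to N^2 such cells compete for switching in each cell slot."""
        return [(out, q[0]) for out, q in enumerate(self.voq) if q]
```

Note that a cell bound for a busy output sits at the head of its own VOQ without hiding the HOL cells of the other queues, which is exactly how VOQ removes HOL blocking.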
Figure 1.9. A VOQ switch
Combined Input and Output Queueing Switches
Combined input and output queueing (CIOQ) switches have buffers at both the input
and output ports. This kind of switch architecture is intended to combine the
advantages of both input buffering and output buffering. In an IQ switch, the input
buffer speed is comparable to the input line rate. In a CIOQ switch, each output
port can accept up to L (1 < L < N) cells in each time slot. If more than L cells
are destined for the same output port, the excess cells are stored in the input
buffers instead of being discarded.
To achieve a desired throughput, the speedup factor L can be engineered
based on the input traffic distribution. Since the output buffer memory only needs
to operate at L times the line rate, a large-scale switch can be built by using
combined input and output buffering. However, this type of switch requires a
complicated arbitration mechanism to determine which L of the N HOL cells destined
for the same output port may proceed.
1.2.3 Internal Blocking
A switch fabric is a set of links and switching elements for establishing connections
between inputs and outputs. There are many implementations of switch fabrics. In
this dissertation, we focus on switch fabrics implemented by interconnection
networks.
While a connection is being established in an interconnection network, it
can face another contention problem, called internal link blocking. It occurs when
multiple connections contend for a link inside the switch fabric at the same time.
As shown in Figure 1.10, an internal physical link is shared by two connections,
the connection from 0 to 4 and the connection from 2 to 5.
Figure 1.10. Internal blocking in a Baseline network
According to their internal blocking properties, switches are classified as blocking
and nonblocking. That is, in a nonblocking switch, a connection path is always
available to connect any idle input to any idle output, while in a blocking switch
a connection path may not be found between an idle input and an idle output.
Nonblocking networks have been favored in switching systems since they can
set up any one-to-one I/O mapping. There are three types of nonblocking networks:
strictly nonblocking (SNB), wide-sense nonblocking (WSNB), and rearrangeable
nonblocking (RNB) [7], [31]. In both SNB and WSNB networks, a connection can be
established from any idle input to any idle output without disturbing existing
connections. In SNB networks any of the available paths for a connection can be
chosen, whereas in WSNB networks a rule must be followed to choose one. In an RNB
network, a path from any idle input to any idle output is available if the
rearrangement of existing connections is allowed. In the following, we introduce
several interconnection networks that will be discussed throughout the dissertation.
Crossbar
Basically, an N × N crossbar, as shown in Figure 1.11, consists of an array of N × N
individually operated crosspoints, which control connections. Each crosspoint has
two logical states, cross and bar, where the cross state is the default.
Figure 1.11. Crossbar switch: (a) architecture; (b) states of crosspoint
The crossbar has three attractive properties: it is strictly nonblocking, simple
in architecture, and modular. However, its hardware cost in terms of the number of
crosspoints grows as O(N^2), which is prohibitively high for large N.
A connection between input i and output j is established by setting the (i, j)-th
crosspoint to the bar state while letting the other crosspoints along the connection
remain in the cross state. The bar state of a crosspoint can be triggered
individually by the destination of each incoming connection. That is, a connection
from an input to an output in a crossbar is set up using only the addresses of its
source and destination, regardless of other connections. This property is called
the self-routing property, and a network with this property is called a self-routing
network. Thus, the crossbar switch is a self-routing network. A self-routing network
can be either nonblocking, such as a crossbar, or blocking, such as the Banyan
network introduced in the following.
Banyan-type network
The Banyan-type network is a multistage interconnection network (MIN), which
usually comprises a number of switching elements (SEs) grouped into several stages
interconnected by a set of links. Each SE can be implemented by a 2 × 2 crossbar.
Depending on whether the upper/lower input is connected with the upper/lower
output of an SE, it has two logical states, namely, straight and cross (see Figure
1.12).
Figure 1.12. An SE and its two states
A class of Banyan-type networks has received considerable attention. A network
belonging to this class satisfies the following basic properties:
i. It has N inputs, N outputs, log N stages, and N/2 SEs in each stage.
ii. There is a unique path between each input and each output.
iii. Let u and v be two SEs in stage i, and let S_j(u) and S_j(v) be the sets of SEs
in stage j, i + 1 ≤ j ≤ log N − 1, that u and v can reach. Then S_j(u) ∩ S_j(v) = ∅
or S_j(u) = S_j(v) for any u and v.
Because of the above properties (short connection diameter, unique
connection path, uniform modularity, etc.), Banyan-type networks are very attractive
for constructing switching networks. Several well-known networks, such as the
Banyan, Omega, Shuffle, and Baseline networks, belong to this class. It has been
shown that these networks are topologically equivalent [2, 96]. In this dissertation,
we use the Baseline network as the representative of Banyan-type networks.
An N × N Baseline network, denoted by BL(N), is constructed recursively.
A BL(2) is a 2 × 2 SE. A BL(N), N = 2^n and n > 1, consists of a switching stage
of N/2 SEs and a shuffle connection, followed by a stack of two BL(N/2)'s. Thus,
a BL(N) has log N stages labeled 0, ..., log N − 1 from left to right, and each
stage has N/2 SEs labeled 0, ..., N/2 − 1 from top to bottom. The upper and
lower outputs of each SE in stage i are connected with two BL(N/2^{i+1})'s, named
the upper subnetwork and the lower subnetwork, respectively. The N links
interconnecting two adjacent stages i and i + 1 are called the output links of
stage i and the input links of stage i + 1. The input (resp. output) links in the
first (resp. last) stage of BL(N) are connected with the N inputs (resp. outputs)
of BL(N). To facilitate our discussions, the label of each stage, link, and SE is
represented by a binary number. Let a_l a_{l-1} ... a_1 a_0 be the binary
representation of a. We use ā to denote the integer that has the binary
representation a_l a_{l-1} ... a_1 (1 − a_0). An example is shown in Figure 1.13.
Figure 1.13. Self-routing of Baseline network BL(16)
Self-routing in BL(N) is decided by the destination, d_{n-1} d_{n-2} ... d_0, of
each connection. If the (n − i)-th bit, d_{n-i-1}, of the destination equals 0, the
input of the SE on the connection path in stage i is connected to the SE's upper
output, and to the lower output otherwise (i.e., d_{n-i-1} = 1). For example, in
Figure 1.13, the connection path P0 from 0010 to 1011 in BL(16) is set up by
self-routing. Since the destination is 1011, the connection path passes through the
lower, upper, lower, and lower outputs of SEs 1, 4, 4, and 5 in stages 0, 1, 2, and
3, respectively. More specifically, for a connection with destination
d_{n-1} ... d_0 at stage i, if it arrives at input link b_{n-1} ... b_1 b_0, then it
leaves from output link b_{n-1} ... b_1 d_{n-i-1}. Since two adjacent stages are
connected by a shuffle connection (i.e., output link b_{n-1} b_{n-2} ... b_1 b_0 in
stage i is connected to input link b_{n-1} ... b_{n-i} b_0 b_{n-i-1} ... b_1 in
stage i + 1), the unique path for each connection can be derived as follows. The
labels of the input link, SE, and output link at stage i on a connection from source
s_{n-1} ... s_0 to destination d_{n-1} ... d_0 are:
Input link: d_{n-1} ... d_{n-i} s_{n-1} ... s_{i+1} s_i
SE: d_{n-1} ... d_{n-i} s_{n-1} ... s_{i+1}
Output link: d_{n-1} ... d_{n-i} s_{n-1} ... s_{i+1} d_{n-i-1}
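The stage-i SE labels above can be computed directly from the source and destination bits. The following sketch (the function name is ours) returns the SE label at each stage of the unique path in BL(2^n):

```python
def baseline_path(src: int, dst: int, n: int):
    """SE labels ((n-1)-bit each) along the unique path from src to dst
    in BL(2**n): at stage i the SE label is the i high bits of dst
    followed by bits s_{n-1} ... s_{i+1} of src."""
    path = []
    for i in range(n):
        d_bits = dst >> (n - i)                               # d_{n-1} ... d_{n-i}
        s_bits = (src >> (i + 1)) & ((1 << (n - 1 - i)) - 1)  # s_{n-1} ... s_{i+1}
        path.append((d_bits << (n - 1 - i)) | s_bits)
    return path
```

For the path P0 from 0010 to 1011 in BL(16), this yields SEs 1, 4, 4, 5 in stages 0 through 3, matching Figure 1.13.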
The Banyan-type switch provides several advantages. Firstly, its cost in terms of
the number of SEs is O(N log N), which makes it much more suitable than the crossbar
for the construction of large switches. Secondly, self-routing is an attractive
feature in that no central control mechanism is needed for establishing connections.
Thirdly, due to its modular and recursive structure, large-scale switches can be
built from smaller switches without modifying their structure.
The main drawback of the Banyan-type switch is that it is a blocking network.
Its performance degrades rapidly as the size of the switch increases.
Three-Stage Clos Network
The three-stage Clos network [13], as shown in Figure 1.14, consists of three
stages of switch modules (SMs), which can be implemented by crossbars. In the
first stage, a set of N inputs is broken up into r1 subsets of n1 inputs. Each
subset of inputs goes into a unique first-stage SM. Each of the first-stage SMs has
m outputs connecting to all m middle-stage SMs. Similarly, each of the middle-stage
SMs has r2 outputs connecting to all r2 third-stage SMs. In the third stage, N
output lines are provided by r2 subsets of n2 lines. A 3-stage Clos network is
denoted by C(n1, r1, m, n2, r2), and, in the symmetric case where n1 = n2 = n and
r1 = r2 = r, by C(n, m, r).
Figure 1.14. 3-stage Clos network
Depending on the number of SMs in the middle stage, the 3-stage Clos network
can be blocking, rearrangeable nonblocking, or strictly nonblocking. That is, by
increasing the value of m, the probability of blocking is reduced. The following
table shows the relation between the value of m and the blockingness of C(n, m, r).

m:          m < n      n ≤ m < 2n − 1    m ≥ 2n − 1
C(n, m, r): blocking   RNB               SNB

Table 1.1. Nonblockingness of 3-stage Clos networks
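Table 1.1 can be expressed as a simple check (a sketch; the function name is ours, not the dissertation's):

```python
def clos_blockingness(n: int, m: int) -> str:
    """Classify a symmetric 3-stage Clos network C(n, m, r) by the
    number m of middle-stage switch modules, following Table 1.1."""
    if m >= 2 * n - 1:
        return "SNB"        # strictly nonblocking
    if m >= n:
        return "RNB"        # rearrangeably nonblocking
    return "blocking"
```

For instance, with n = 4, at least m = 7 middle SMs are needed for strict nonblocking, while m = 4 already suffices for rearrangeable nonblocking.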
The 3-stage Clos network provides an advantage in that it reduces the hardware
complexity from O(N^2) in the case of the crossbar switch to O(N^{3/2}), and the
switch can be designed to be nonblocking. Furthermore, it also provides more
reliability, since there is more than one possible path through the switch to
connect any input port to any output port.
Benes network
The Benes network [6] is a well-known rearrangeable nonblocking MIN. The N × N
Benes network, denoted by B(N), can be constructed in the following three ways.
(1) Based on crossbars
B(N) can be constructed recursively based on an SE (i.e., a 2 × 2 crossbar). A
B(2) is an SE. A B(N) consists of a switching stage of N/2 SEs, an N/2 × 2 shuffle
connection (i.e., O_i is connected to I_j with j ≡ (N/2)i + ⌊i/2⌋ mod N in two
adjacent stages [70]), followed by a stack of two B(N/2)'s, a 2 × N/2 shuffle
connection, and another switching stage of N/2 SEs. Thus, a B(N) consists of
2 log N − 1 stages labeled 0, 1, ..., 2 log N − 2 from left to right (in this
dissertation, all logarithms are in base 2), and each stage consists of N/2 SEs
labeled 0, 1, ..., N/2 − 1 from top to bottom. A pair of SEs i and ī in the same
stage is called a pair of dual SEs. The two inputs (outputs) of a 2 × 2 SE are
called dual inputs (outputs). Each B(N) contains 2 B(N/2)'s, alternately named the
upper subnetwork and the lower subnetwork from top to bottom, and 4 B(N/4)'s,
alternately named upper subnetwork, lower subnetwork, upper subnetwork, and lower
subnetwork from top to bottom, and so on. As shown in Figure 1.15, a B(8) contains
2 B(4)'s within dashed boxes, each containing 2 B(2)'s within dotted boxes.
Figure 1.15. Benes network B(8)
(2) Based on Banyan networks
A Benes network can also be constructed by concatenating a Baseline network
and a reverse Baseline network with the center stages overlapped.
(3) Based on 3-stage Clos networks
A Benes network B(N) can also be constructed recursively by replacing the SMs in
the middle stages of 3-stage Clos networks, log N − 2 times. More specifically, in
order to construct B(N), in the first iteration we replace each of the two SMs in
the middle stage of C(2, 2, N/2) with a C(2, 2, N/2^2); in the second iteration we
replace each of the 2^2 SMs in the middle stages of the 2 C(2, 2, N/2^2)'s with a
C(2, 2, N/2^3); ...; in the (log N − 2)-th iteration, we replace each of the
2^{log N − 2} SMs in the middle stages of the 2^{log N − 3} C(2, 2, 4)'s with a
C(2, 2, 2).
The Benes network is a rearrangeable nonblocking permutation network and
one of the most efficient switching architectures in terms of the number of
crosspoints used. In general, a Benes network B(N) requires 2N(2 log N − 1)
crosspoints, N being a power of two. The main disadvantage of this type of network
is that a fast and intelligent mechanism is needed to rearrange the existing
connections to avoid internal blocking when establishing new connections.
The high degree of connection capability of SNB and WSNB networks comes at a
high hardware cost, while RNB networks can usually be constructed at lower hardware
cost. For networks of the same size, with N inputs and N outputs, Table 1.2 shows
the hardware cost in terms of the number of crosspoints for different
interconnection networks.

Network               Blockingness   Cost
Crossbar              SNB            N^2
3-stage Clos network  WSNB, RNB      Θ(N^{1.5})
Benes network         RNB            2N(2 log N − 1)
Banyan network        Blocking       2N log N

Table 1.2. Hardware costs of networks
1.2.4 Crosstalk Problem in Photonic Switching
To build a large IP router with a capacity of 1 Tb/s and beyond, either electronic
or optical switching can be used. The deployment of optical fibers as a transmission
medium has prompted the search for a solution to the speed mismatch between
transmission and switching. Optical routers have better scalability than electronic
routers in terms of switching capacity. However, the required optical technologies
are too immature for all-optical switching to happen anytime soon. A hybrid
approach, in which optical signals are switched but both switch control and routing
decisions are carried out electronically, is more practical. Advances in
electro-optic technologies provide a promising choice to meet the increasing demands
for high channel bandwidth and low communication latency in optical communication.
A hybrid optical MIN (OMIN) can be built from 2 × 2 electro-optic switching
elements (SEs) such as the common lithium-niobate (LiNbO3) SE (e.g., [27, 30, 84]).
Such an SE is a directional coupler with two inputs and two outputs. Depending on
the amount of voltage applied at the junction of its two waveguides, optical signals
carried on either of the two inputs can be coupled to either of the two outputs. An
electronically controlled optical SE can have a switching speed ranging from
hundreds of picoseconds to tens of nanoseconds [78], which is much faster than
Micro-Electro-Mechanical Systems (MEMS) [78]. However, large OMINs built by
integrating these electro-optic SEs suffer from crosstalk, which is caused by
undesired coupling between signals carried in two waveguides, so that the two signal
channels interfere with each other. Crosstalk can also be generated by intersecting
interconnection links (crossovers) in OMINs. It has been shown that crossover
crosstalk can be made negligible by careful physical design of the interconnection
patterns [40].
The crosstalk generated by signals with different wavelengths can be quite
easily eliminated by a wavelength filter, and thus this type of crosstalk is called
filterable crosstalk. The crosstalk generated by signals with the same wavelength
is called non-filterable crosstalk [56]. The crosstalk originating from an SE is
called first-order crosstalk, which may result in higher-order crosstalk when it
interferes with signals in other SEs. In this dissertation, we only consider
non-filterable first-order crosstalk. An optical switching network is considered
crosstalk-free if the connections passing through the same SE carry different
wavelengths at any given time.
Each SE has two logical states, namely, straight and cross (see Figure 1.16 (a)).
Figure 1.16 shows an example of crosstalk in an SE. In the straight state, a small
fraction of the signal injected at the upper input may be detected at the lower
output (see Figure 1.16 (b)). Crosstalk can also occur when an SE is in the cross
state. Consequently, the input signal will be distorted at the output due to the
loss and crosstalk accumulated along a connection path.
Figure 1.16. Electro-optic SE: (a) two states; (b) crosstalk
Let us look at blocking properties more closely. In an SE, two inputs (resp.
outputs) intending to be connected with the same output (resp. input) cause an
output link conflict (resp. input link conflict). The existence of crosstalk in
photonic switching networks adds a new dimension of blocking, called node conflict,
which happens when more than one connection with the same wavelength passes through
the same SE at the same time. Figure 1.13 shows two connection paths, P0 from 0010
to 1011 and P1 from 0100 to 1010. P0 and P1 have an output link conflict in stage 2
and an input link conflict in stage 3, because both inputs of SE 4 in stage 2 intend
to be connected with its lower output and both outputs of SE 5 in stage 3 intend to
be connected with its upper input. The two paths have node conflicts at SE 4 in
stage 2 and SE 5 in stage 3, respectively.
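The crosstalk-free condition can be checked mechanically: no two connections carrying the same wavelength may share an SE. A minimal sketch follows; the data layout and function name are our own assumptions, with each path given as a list of (stage, SE) pairs.

```python
from collections import defaultdict

def is_crosstalk_free(connections):
    """Check that no two connections with the same wavelength pass
    through the same SE. connections: list of (path, wavelength),
    where path is a list of (stage, se) pairs."""
    seen = defaultdict(set)            # (stage, se) -> wavelengths present
    for path, wavelength in connections:
        for se in path:
            if wavelength in seen[se]:
                return False           # node conflict: same wavelength, same SE
            seen[se].add(wavelength)
    return True
```

With the paths P0 and P1 of Figure 1.13, which share SE 4 in stage 2 and SE 5 in stage 3, the routing is crosstalk-free only if the two connections use different wavelengths.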
1.3 Previous Related Work on Switching Algorithms
1.3.1 Parallel Computing
Before discussing the existing algorithms for switch scheduling and routing, we
first introduce the computing models for these algorithms.
A commonly accepted model for designing and analyzing sequential algo-
rithms consists of a central processing unit with a random-access memory attached
to it. The typical instruction set for this model includes reading from and writing
into the memory, and basic logic and arithmetic operations.
The main purpose of parallel processing is to perform computations faster
than can be done with a single processor, by using a number of processors
concurrently. A parallel computer system is simply a collection of processors,
typically of the same type, interconnected in a certain fashion to allow the
coordination of their activities and the exchange of data. Parallel computer
systems can be classified according to a variety of architectural features and
modes of operation. In particular, these criteria include the type and number of
processors, the interconnections among processors and the corresponding
communication schemes, the overall control and synchronization, and the
input/output schemes. Parallel algorithms are evaluated by several important
criteria, such as the number of processors, time performance, space utilization,
and communication scheme.
The bounds on the resources (for example, processors, time, and space) required
by an algorithm are measured as a function of the input size, which reflects the
amount of data to be processed. We are primarily interested in the worst-case
analysis of algorithms. Given an input size N, each resource bound represents the
maximum amount of that resource required by any instance of size N. These bounds
are expressed using the following standard notation:
• f(N) = O(g(N)) if there exist positive constants c and N0 such that
f(N) ≤ c·g(N) for all N ≥ N0.
• f(N) = Ω(g(N)) if there exist positive constants c and N0 such that
f(N) ≥ c·g(N) for all N ≥ N0.
• f(N) = Θ(g(N)) if f(N) = O(g(N)) and f(N) = Ω(g(N)).
For two functions f(N) and g(N), f(N) = O(g(N)), or Θ(g(N)), or Ω(g(N)) means that
f(N) is asymptotically no greater than, or equal to, or no less than g(N),
respectively [15].
The running time of an algorithm is estimated by the number of basic opera-
tions required by the algorithm as a function of the input size. In this dissertation,
we assume the basic operations include reading from and writing into the memory,
sending to and receiving from processors, and the basic arithmetic and logic op-
erations such as adding, subtracting, comparing, or multiplying two numbers, and
computing the bitwise logic OR or AND of two words. The cost of an operation does
not depend on the word size.
Parallel computing models can be used as general frameworks for describing and
analyzing parallel algorithms. Here we only introduce two models, the shared-memory
model and the network model, which will be discussed in this dissertation.
The shared-memory model consists of a number of processors, each of which
has its own local memory and can execute its own local program, and all of which
communicate by exchanging data through a shared memory unit. Each processor
is uniquely identified by an index, called a processor number or processor id, which
is available locally. If all the processors operate synchronously under the control
of a common clock, this synchronous shared-memory model is called the parallel
random-access machine (PRAM) model. The key assumptions of the PRAM model are
shared memory and a synchronous mode of operation. There are several variations
access machine (PRAM) model. The key assumptions about the PRAM model are
shared-memory and synchronous mode of operation. There are several variations
of the PRAM model based on the assumptions regarding the handling of the si-
multaneous access of several processors to the same location of the global memory.
The exclusive read exclusive write (EREW) PRAM does not allow any simultaneous
access to a single memory location. The concurrent read exclusive write (CREW)
PRAM allows simultaneous access for a read instruction only. Access to a location
for a read or a write instruction is allowed in the concurrent read concurrent write
(CRCW) PRAM. These three models do not differ substantially in their computational
power, although the CREW PRAM is more powerful than the EREW PRAM, and the CRCW
PRAM is the most powerful of the three.
In the network model, a network is viewed as a graph G(V, E), where each node
i ∈ V represents a processor, and each edge (i, j) ∈ E represents a two-way
communication link between processors i and j. Each processor is assumed to have
its own local memory, and no shared memory is available. The operation of a network
may be either synchronous or asynchronous. Processors communicate with each other
by sending and receiving data over the two-way communication links between them.
The network model incorporates the topology of the interconnection between
the processors into the model itself. In the following, we introduce several
topologies that will often be used in the dissertation.
A completely connected multiprocessor system of size N consists of a set of
processing elements (PEs) PE_i, 0 ≤ i ≤ N − 1, connected in such a way that there
is a connection between every pair of PEs. In this dissertation, unless otherwise
specified, we assume that each PE can communicate with at most one other PE during
a communication step. With this restriction, any algorithm for such a system is
equivalent to an algorithm under the EREW PRAM abstract model of parallel computing.
A linear array processor system of size N consists of N processors P_1, P_2, ...,
P_N connected in a linear array; that is, processor P_i is connected to P_{i-1} and
P_{i+1}, whenever they exist. A two-dimensional array is the two-dimensional version
of the linear array: N^2 processors are arranged into an N × N grid such that
processor P_{i,j} is connected to processors P_{i±1,j} and P_{i,j±1}, whenever they
exist.
A hypercube multiprocessor system of size N = 2^d, denoted by H(2^d), consists of
N processors, indexed from 0 to N − 1, interconnected into a d-dimensional Boolean
cube that can be defined as follows. Let the binary representation of i be
i_{d-1} i_{d-2} ... i_0, where 0 ≤ i ≤ N − 1. Then processor P_i is connected to
processors P_{i^(j)}, where i^(j) = i_{d-1} ... ī_j ... i_0 and ī_j = 1 − i_j, for
0 ≤ j ≤ d − 1. In other words, two processors are connected if and only if their
indices differ in exactly one bit position. The hypercube has a recursive structure:
we can extend a d-dimensional cube to a (d + 1)-dimensional cube by connecting
corresponding processors of two d-dimensional cubes, one with the most significant
address bit equal to 0 and the other with the most significant address bit equal
to 1. Thus an H(2^d) is constructed from 2 H(2^{d-1})'s by adding the 2^{d-1}
edges, named d-dimension edges, that connect the corresponding 2^{d-1} nodes of the
2 H(2^{d-1})'s. H(2) is an edge with two nodes. The hypercube is popular because of
its regularity, small diameter, many interesting graph-theoretic properties, and
ability to handle many computations quickly and simply.
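The neighbor rule above (flip exactly one address bit) reduces to a one-liner; a sketch with a hypothetical function name:

```python
def hypercube_neighbors(i: int, d: int):
    """Processors adjacent to P_i in H(2**d): the d indices obtained
    by flipping each of the d address bits of i in turn."""
    return [i ^ (1 << j) for j in range(d)]
```

For example, in H(8) processor P_0 (000) is adjacent to P_1, P_2, and P_4, and P_5 (101) is adjacent to P_1, P_4, and P_7.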
A butterfly multiprocessor system of size (N/2) log N, denoted by BF(N), consists
of (N/2) log N processors. The structure of the butterfly is isomorphic to the
Banyan-type network discussed in Section 1.2.3. Butterfly networks also belong to
the hypercube family [47], because H(N/2) can be obtained from BF(N) by merging all
SEs in row i of BF(N) into node i of H(N/2) and merging all links connecting SEs
contained in two different nodes into an edge of H(N/2). Figure 1.17 shows the
relationship of the baseline network, butterfly network, and hypercube. In Figure
1.17 (c), the d-dimension edges are labeled by d*.
Figure 1.17. The relationship of baseline network, butterfly network and hypercube:
(a) BL(8); (b) BF(8); (c) H(4)
1.3.2 Switch Scheduling
For packet switching, only one cell can be transmitted across the switching fabric
to each output at a time. Due to output contention, an arbitration process is
needed to decide which cells are to be transferred. The arbitration scheme, called
switch scheduling, is essentially a service discipline that arranges the service
order among cells. An algorithm implementing the arbitration scheme is called a
scheduling algorithm.
Mathematical Models for Switch Scheduling
The scheduling problem for packet switches can be modeled as a matching problem
on a bipartite graph G = (V, E), where V = V1 ∪ V2, V1 = {inputs}, V2 = {outputs},
and E = {connections for the packets/cells at the heads of the input queues}.
Figure 1.18 shows an example of the graph model for a VOQ switch.
Figure 1.18. Model of switch scheduling: (a) a VOQ switch; (b) bipartite graph; (c)
maximum size matching; (d) maximal size matching
In a graph G, a set of independent edges (no two edges in the set are adjacent
to each other) is called a matching of G. A maximum size matching of G is one
with the largest number of edges among all matchings of G. A maximal size matching
is one that is not contained in any other matching. That means, for a maximal
matching, if we add an edge that is not in the matching, then this edge must be
adjacent to some edge in the matching. Figure 1.18 shows an example of a maximum
size matching and a maximal size matching of the bipartite graph for a VOQ switch.
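A maximal (not maximum) size matching can be found by a single greedy pass over the edges, which is one reason maximal matchings are attractive for fast schedulers. A sketch (the function name and edge representation are ours):

```python
def greedy_maximal_matching(edges):
    """Greedy maximal size matching on a bipartite graph.
    edges: list of (input, output) pairs; each input and each output
    is used at most once. The result is maximal: any remaining edge
    shares an endpoint with some matched edge."""
    used_in, used_out, matching = set(), set(), []
    for u, v in edges:
        if u not in used_in and v not in used_out:
            matching.append((u, v))
            used_in.add(u)
            used_out.add(v)
    return matching
```

The greedy result depends on the edge order and may have fewer edges than a maximum size matching, but it never admits another edge without a conflict.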
Due to performance requirements, we can associate a weight with each
connection/edge. For example, connections with longer waiting times have larger
weights. Similarly to maximum and maximal size matchings, we define a maximum
weight matching as one with the largest weight among all weighted matchings of G
and a maximal weight matching as one that is not contained in any other weighted
matching. Clearly, maximum/maximal size matchings are special cases of
maximum/maximal weight matchings with the weights of all edges equal to 1.
The existing maximum size and maximum weight matching algorithms are too complex
in both time complexity and hardware implementation, and therefore they are not
practical for high-speed switches. Researchers have turned their attention to
heuristic algorithms that find a maximal size or maximal weight matching quickly.
These heuristic algorithms can be classified into three categories: sequential,
parallel, and neural algorithms. A comprehensive survey of these algorithms can be
found in [60]. In this dissertation, we will focus on a special matching problem,
named the stable matching problem.
Stable Matching
The stable matching problem (or stable marriage problem) was first introduced by
Gale and Shapley (GS) in 1962 [20]. Given n men, n women, and 2n ranking lists in
which each person ranks all members of the opposite sex in order of preference, a
matching is a set of n man-woman pairs such that each man/woman appears in exactly
one pair. A matching is stable if there do not exist one man and one woman who are
not matched to each other but each of whom strictly prefers the other to his/her
current partner in the matching; otherwise, the matching is unstable. Gale and
Shapley showed that every instance of the stable matching problem admits at least
one stable matching, which can be computed in O(n^2) iterations. The paper of Gale
and Shapley sparked much interest in many aspects and variants of the classical
stable matching problem. For a good survey of this subject, refer to [24].
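The GS deferred-acceptance procedure can be sketched as follows (0-indexed; the data layout is our assumption): each free man proposes to his most-preferred woman not yet proposed to, and each woman holds the best offer received so far.

```python
def gale_shapley(men_pref, women_rank):
    """Gale-Shapley deferred acceptance: men propose, women hold offers.
    men_pref[m]: list of women in man m's order of preference.
    women_rank[w][m]: rank (smaller = preferred) of man m by woman w.
    Returns a dict woman -> man; the result is always stable."""
    n = len(men_pref)
    next_choice = [0] * n      # index of the next woman each man proposes to
    engaged_to = {}            # woman -> currently held man
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_pref[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                      # first offer: hold it
        elif women_rank[w][m] < women_rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])         # woman trades up
            engaged_to[w] = m
        else:
            free_men.append(m)                     # rejected; tries next woman
    return engaged_to
```

Each man proposes to each woman at most once, giving the O(n^2) iteration bound cited above.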
Recently, solutions to the stable matching problem have been applied to switch
scheduling for packet switches. Many scheduling algorithms based on stable
matchings have been proposed for both input queued (IQ) switches and combined
input and output queued (CIOQ) switches (e.g., [12, 35, 36, 58, 63, 64, 72, 85]).
It has been shown that scheduling algorithms based on stable matchings can provide
QoS guarantees. In these algorithms, the man set and the woman set consist of all
input ports and all output ports, respectively, and the ranking list for each
input/output is defined according to the performance requirements. For example, in
[58] McKeown proposed two scheduling algorithms, GS longest queue first (GS-LQF)
and GS oldest cell first (GS-OCF), with ranking lists based on the occupancy of
the input queues and on the waiting time of the cells at the heads of the input
queues, respectively. The GS-LQF and GS-OCF algorithms were shown to achieve
asymptotically 100% throughput under both uniform and non-uniform traffic for IQ
switches.
Figure 1.19 shows an example of how scheduling based on stable matching works,
where the ranking lists are defined by the occupancy of the input queues. There are
three inputs and three outputs. At each input, cells destined to different outputs are
queued in different queues. The man set and woman set consist of the three inputs and
three outputs respectively. The ranking list for each input is defined by the lengths
of its three queues destined to the different outputs, and the ranking list for each
output is defined by the lengths of the three queues destined to it at the different
inputs, where the longest queue has the highest ranking, which is 1, and the shortest
queue has the lowest ranking, which is 3. The scheduling is based on the stable
matching found, shown as dotted lines.
[A 3 x 3 VOQ switch with a scheduler and switch fabric; at each input, cells
destined to different outputs are queued separately.]

Ranking lists:
  input 1: {1,3,2}    output 1: {2,1,3}
  input 2: {1,2,3}    output 2: {3,2,1}
  input 3: {3,1,2}    output 3: {2,3,1}
Stable matching: (1,3), (2,1), (3,2)

Figure 1.19. Scheduling based on stable matching in a VOQ switch
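The stability of the matching in Figure 1.19 can be checked directly from the ranking lists. The sketch below (our own encoding, not part of the dissertation) tests whether any input-output pair would both strictly prefer each other to their assigned partners.

```python
def is_stable(matching, in_rank, out_rank):
    """matching: dict input -> output; in_rank[i][j] = rank input i gives
    output j; out_rank[j][i] = rank output j gives input i (1 = best)."""
    partner_of_out = {o: i for i, o in matching.items()}
    for i, oi in matching.items():
        for j in out_rank:
            if j == oi:
                continue
            # (i, j) is unstable iff both strictly prefer each other
            if (in_rank[i][j] < in_rank[i][oi]
                    and out_rank[j][i] < out_rank[j][partner_of_out[j]]):
                return False
    return True

# Ranking lists of Figure 1.19, transcribed as rank dictionaries
in_rank = {1: {1: 1, 2: 3, 3: 2}, 2: {1: 1, 2: 2, 3: 3}, 3: {1: 3, 2: 1, 3: 2}}
out_rank = {1: {1: 2, 2: 1, 3: 3}, 2: {1: 3, 2: 2, 3: 1}, 3: {1: 2, 2: 3, 3: 1}}
matching = {1: 3, 2: 1, 3: 2}
```

Running the check confirms that (1,3), (2,1), (3,2) has no unstable pair, while, for instance, the identity matching does.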
Acyclic Stable Matching
Applications of stable matching in switch scheduling have been proposed. However,
the classical GS stable matching algorithm is infeasible for high-speed implementation
due to its high complexity. Instead, a special stable matching, called an acyclic
stable matching, has been shown to be useful in implementing scheduling for high-speed
switches/routers.
Let M = {m_1, m_2, ..., m_n} and W = {w_1, w_2, ..., w_n} be the sets of n
men and n women respectively. Let mL_i = {wr_{i,1}, wr_{i,2}, ..., wr_{i,n}} and
wL_i = {mr_{i,1}, mr_{i,2}, ..., mr_{i,n}} be the ranking lists for man m_i and woman
w_i respectively, where wr_{i,j} (resp. mr_{i,j}) is the rank of woman w_j (resp. man
m_j) by man m_i (resp. woman w_i). Let A be a ranking matrix of size n x n, where
each entry of A is a pair a_{i,j} = (wr_{i,j}, mr_{j,i}).
Given a ranking matrix A, we define the dependency graph as a directed graph
G constructed as follows: each entry a_{i,j} of A is represented by a vertex v_{i,j}
of G; for any two vertices v_{i,j} and v_{i,k}, if a^h_{i,j} < a^h_{i,k}, then there
is an edge from v_{i,j} to v_{i,k}; for any two vertices v_{i,j} and v_{l,j}, if
a^v_{i,j} < a^v_{l,j}, then there is an edge from v_{i,j} to v_{l,j} (here a^h_{i,j}
= wr_{i,j} and a^v_{i,j} = mr_{j,i} denote the two components of a_{i,j}). Thus, for
any instance of the stable matching problem, there is a corresponding dependency
graph. If the dependency graph is acyclic, the solution is called an acyclic
stable matching.
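The dependency graph just described can be built and tested for acyclicity directly. The following sketch (our own encoding, using Kahn's topological sort) takes wr[i][j] as man i's rank of woman j and mr[j][i] as woman j's rank of man i, both 0-indexed.

```python
from collections import deque

def is_acyclic_instance(wr, mr):
    """Build the dependency graph of the ranking matrix defined by wr/mr and
    return True iff it contains no directed cycle."""
    n = len(wr)
    verts = [(i, j) for i in range(n) for j in range(n)]
    succ = {v: [] for v in verts}
    indeg = {v: 0 for v in verts}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # row edge: from the entry with the smaller left value
                if k != j and wr[i][j] < wr[i][k]:
                    succ[(i, j)].append((i, k)); indeg[(i, k)] += 1
                # column edge: from the entry with the smaller right value
                if k != i and mr[j][k] > mr[j][i]:
                    succ[(i, j)].append((k, j)); indeg[(k, j)] += 1
    q = deque(v for v in verts if indeg[v] == 0)
    seen = 0
    while q:
        v = q.popleft(); seen += 1
        for u in succ[v]:
            indeg[u] -= 1
            if indeg[u] == 0:
                q.append(u)
    return seen == len(verts)   # every vertex popped iff the graph is acyclic
```

Not every instance is acyclic: swapping the women's preferences in a 2 x 2 instance is enough to close a cycle through all four entries.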
Scheduling algorithms based on acyclic stable matching have been proposed
for combined input and output queued (CIOQ) switches [12, 63, 72, 85], and
it has been shown that, with some speedup, an acyclic stable matching scheduling
algorithm can provide QoS guarantees for both unicast and multicast traffic with
fixed-length and variable-length packets.
1.3.3 Switch Routing
In a switching network, when more than one input requests to be connected to
the same output, output contention occurs. Output contention can be resolved
by switch scheduling. For a set of connection requests without output contention,
the process of establishing conflict-free connection paths to satisfy these requests
is called switch routing. An algorithm, called a routing algorithm, is needed to find
these paths. Once a set of conflict-free paths is found, the connections can be
properly set up.
Let I and O be the sets of N inputs, denoted I_0, ..., I_{N-1}, and N outputs,
denoted O_0, ..., O_{N-1}, of an interconnection network respectively. Let pi : I -> O
be an I/O mapping that indicates connections from I to O. If there is a connection
from I_i to O_j, then we set pi(i) = j and pi^{-1}(j) = i, and we call I_i (O_j) an
active input (output). If j != pi(i) for every active I_i, we call O_j an idle output.
We say that an input (resp. output, link, SE) is active if it is on a connection path,
and idle otherwise. An I/O mapping from I to O is one-to-one if each I_i is mapped to
at most one O_j and pi(i) != pi(j) for any i != j. In this dissertation, all I/O
mappings are one-to-one and all connections belong to a one-to-one I/O mapping.
A one-to-one I/O mapping involving K (<= N) active inputs is called a partial
permutation, or a non-maximum I/O mapping. A partial permutation with K = N active
inputs is also called a permutation, or a maximum I/O mapping. Clearly, a permutation
represents the maximum number of connections that can be realized in a single pass
through an interconnection network.
Since crossbar and Banyan-type networks are self-routing networks, routing in
these networks simply follows their self-routing rules discussed in Subsection 1.2.3.
In the following, we discuss previous routing work for Benes networks and
3-stage Clos networks.
Mathematical Models for Switch Routing
There are three general mathematical models for designing a routing algorithm for
the Benes network and the 3-stage Clos network.

(i) Matrix Decomposition:

We represent a 3-stage Clos network C(n, m, r) as a matrix M in which each
row corresponds to one input SM (i.e. an n x m crossbar in the first stage),
each column corresponds to one output SM (i.e. an m x n crossbar in the third
stage), and entry (i, j) of M indicates the number of connection requests from
input SM i to output SM j. The problem is to partition M into m permutation
matrices, where each row and each column of a permutation matrix has at most one
nonzero entry, equal to 1. All requests in a permutation matrix can be routed
through one middle SM (i.e. an r x r crossbar in the second stage).
For the special case of the Benes network B(N), we represent its permutation as
a matrix M of size N/2 x N/2, where each row corresponds to one input
SE (i.e. a 2 x 2 crossbar in the first stage), each column corresponds to one
output SE (i.e. a 2 x 2 crossbar in the last stage), and entry (i, j) of M indicates
the number of connection requests from input SE i to output SE j. The problem is
to partition M into 2 permutation matrices. Thus, all connections in a permutation
matrix can be routed through the same subnetwork B(N/2). Figure 1.20 shows
an example of the decomposition of a matrix into two permutation matrices for a
permutation pi of B(8), where

    pi = ( 0 1 2 3 4 5 6 7
           3 2 5 0 4 6 7 1 )
After the matrix decomposition in Figure 1.20, we can set the SEs in the first
and last stages, and route the sub-permutation

    pi_1 = ( 1 3 4 6
             2 0 4 7 )

and the sub-permutation

    pi_2 = ( 0 2 5 7
             3 5 6 1 )
in the upper subnetwork B(4) and the lower subnetwork B(4), respectively.
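The decomposition above can be computed by walking the cycles formed by connections that share an input SE or an output SE and coloring them alternately. The sketch below is our own illustrative rendering of this idea, not the dissertation's algorithm; which half lands in the upper subnetwork depends on the starting color.

```python
def decompose(perm):
    """Split a permutation of size N (N even) into two halves so that no two
    connections in the same half share an input SE or an output SE.
    Returns two dicts {input: output}."""
    n = len(perm)
    in_se = [[] for _ in range(n // 2)]    # the two connections at each input SE
    out_se = [[] for _ in range(n // 2)]   # the two connections at each output SE
    for x in range(n):
        in_se[x // 2].append(x)
        out_se[perm[x] // 2].append(x)
    color = {}
    for start in range(n):
        if start in color:
            continue
        x, c = start, 0
        while x not in color:
            color[x] = c
            a, b = out_se[perm[x] // 2]
            y = b if a == x else a         # shares x's output SE: other half
            if y in color:
                break                      # cycle closed
            color[y] = 1 - c
            a, b = in_se[y // 2]
            x = b if a == y else a         # shares y's input SE: back to color c
    upper = {x: perm[x] for x in range(n) if color[x] == 0}
    lower = {x: perm[x] for x in range(n) if color[x] == 1}
    return upper, lower
```

Applied to the permutation pi of B(8) above, the walk splits the inputs into the sets {1, 3, 4, 6} and {0, 2, 5, 7}, exactly the sub-permutations pi_1 and pi_2.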
        ( 0 2 0 0 )          ( 0 1 0 0 )          ( 0 1 0 0 )
    M = ( 1 0 1 0 )    M_1 = ( 1 0 0 0 )    M_2 = ( 0 0 1 0 )
        ( 0 0 1 1 )          ( 0 0 1 0 )          ( 0 0 0 1 )
        ( 1 0 0 1 )          ( 0 0 0 1 )          ( 1 0 0 0 )

Figure 1.20. Matrix decomposition: M = M_1 + M_2
The other two models are related to graph theory. We first introduce some
definitions and notations. Let G be a graph, V(G) be the set of vertices of G, and
E(G) be the set of edges of G. We use |V(G)| and |E(G)| to denote the total numbers of
vertices and edges in V(G) and E(G) respectively. A graph G is called a bipartite
graph if V(G) can be partitioned into two parts so that no two vertices in the same
part are adjacent to each other. The degree of a vertex is the total number of edges
incident to the vertex, and the degree of a graph G, denoted by Delta(G), is the
maximum vertex degree of G. If each vertex in G has the same degree d, then G is
called a d-regular graph. As defined earlier, a set of independent edges in a graph
is a matching. If the edges in a matching cover all vertices of G, the matching is
called a perfect matching of G. If all edges of a graph G can be colored with c
different colors so that incident edges have different colors, G is called c-edge
colorable and such a coloring is called a c-edge coloring of G.
We can represent a Clos network C(n, m, r) with a permutation pi by a graph
G, where V(G) = {input SMs and output SMs} and E(G) = {connections between
input SMs and output SMs}. It is clear that G is a bipartite graph with all input
SMs as one part and all output SMs as the other, and Delta(G) <= n since each input
SM has n inputs and each output SM has n outputs. G may have more than one edge
between two vertices; however, there is a one-to-one correspondence between a
connection in pi and an edge in graph G. Thus, we can label each edge in the graph
by the input of its corresponding connection.
(ii) Edge Coloring:

König's theorem [8] states that every bipartite graph is Delta(G)-edge colorable.
Thus, we can color the edges of G using Delta(G) colors so that the edges
incident to the same vertex have distinct colors. The edges in each color class can
then be routed through the same middle SM of C(n, m, r), with different color classes
assigned to different middle SMs.
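König's alternating-path proof translates directly into an edge-coloring procedure. The sketch below is our own rendering with an illustrative vertex encoding: it colors a bipartite multigraph with Delta colors by swapping two colors along an alternating path whenever a new edge finds no common free color at its endpoints.

```python
def bipartite_edge_color(edges):
    """Edge-color a bipartite (multi)graph with Delta colors, following
    Konig's alternating-path argument. The two sides must use distinct
    vertex objects, e.g. ('I', i) vs ('O', j). Returns one color per edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    delta = max(deg.values())
    slot = {v: {} for v in deg}          # slot[x][c] = edge colored c at x
    color = [None] * len(edges)
    other = lambda f, x: edges[f][1] if edges[f][0] == x else edges[f][0]
    free = lambda x: next(c for c in range(delta) if c not in slot[x])
    for e, (u, v) in enumerate(edges):
        a, b = free(u), free(v)
        if a != b:
            # a is free at u but may be used at v: swap colors a/b along the
            # alternating path from v (it can never reach u)
            path, x, c = [], v, a
            while c in slot[x]:
                f = slot[x][c]
                path.append(f)
                x, c = other(f, x), a + b - c
            for f in path:               # clear old colors, then re-register
                p, q = edges[f]
                del slot[p][color[f]]; del slot[q][color[f]]
                color[f] = a + b - color[f]
            for f in path:
                p, q = edges[f]
                slot[p][color[f]] = f; slot[q][color[f]] = f
        color[e] = a                     # a is now free at both endpoints
        slot[u][a] = e; slot[v][a] = e
    return color
```

On the degree-2 graph of a Benes permutation this produces exactly the 2-edge coloring used for the decomposition into two subnetworks.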
(iii) Matching: We can also set up a Clos network by repeatedly finding a matching
and removing it from G until all edges of G have been considered, letting the
edges in the same matching be routed through the same SM in the middle stage.
For the Benes network, the corresponding graph G is a bipartite graph with degree
at most 2. Hence, G is 2-edge colorable and decomposes into at most 2 perfect
matchings [8]. Figure 1.21 shows the graph model G representing the permutation pi
of B(8). In Figure 1.21 (a), we color G with two different colors, one denoted by
solid lines and the other by dashed lines. In Figure 1.21 (b), two perfect matchings,
M_1 and M_2, of G are found. Clearly, the edges with the same color (or the edges
belonging to the same matching) correspond to the connections in the same
sub-permutation pi_1 or pi_2.
[The bipartite graph G on the input SEs {0,1}, {2,3}, {4,5}, {6,7} and the output
SEs {0,1}, {2,3}, {4,5}, {6,7}; in (a) the edges of G are colored with two colors,
shown as solid and dashed lines; in (b) the two perfect matchings M_1 (edges labeled
by inputs 1, 3, 4, 6) and M_2 (edges labeled by inputs 0, 2, 5, 7) are shown.]

Figure 1.21. Graph representation: (a) edge coloring; (b) matching
Since the edges of the same color are independent, the edges with the same
color form a matching. Also, all entries in each permutation matrix form a matching
of G by the definition of a permutation matrix. Therefore, matrix decomposition,
edge coloring, and matching are essentially equivalent, but each has its own
implementation techniques [31].
Routing Algorithms
Routing algorithms for the Benes network B(N) can be obtained by running a routing
algorithm for the 3-stage Clos network log N - 1 times, since a Benes network is
recursively constructed by replacing the middle stage of SMs with 3-stage Clos
networks. Therefore, its time complexity is simply the time complexity of the
3-stage Clos network algorithm multiplied by a factor of log N. We first introduce
the routing algorithms based on the above three models.
Routing algorithms based on matrix decomposition techniques have been developed
for 3-stage Clos networks C(n, m, r) [9, 34, 78]. It is known that matrix
decomposition does not always work [9]. However, it works for Benes networks.
Waksman [93] proposed a routing algorithm for n = 2, which was elaborated by
Opferman and Tsao-Wu [66] and named the looping algorithm. The time complexity of
the looping algorithm is O(N log N) for sequential control and O(N) for parallel
control.
Sequential matching algorithms for bipartite graphs are available in the graph
theory literature. One, due to Hopcroft and Karp [28], runs in O(|V|^2.5) time.
Another, by Gabow [19], runs in O(|V|^0.5 (|E| + |V|)) time. A third algorithm, due
to Cole and Hopcroft, finds matchings in O(|E| log |V|) steps.
Based on these matching algorithms, sequential routing algorithms for Clos
networks C(n, m, r) can be derived. The application of the Hopcroft-Karp matching
algorithm to C(n, m, r) leads to an O(m r^2.5)-time routing procedure, the
application of Gabow's algorithm results in an O(m r^0.5 (N + r))-time routing
procedure, and the application of Cole and Hopcroft's algorithm leads to an
O(m N log r)-time routing procedure.
There are two primary methods used in edge coloring: König's method of
alternating paths and Euler partitions. Gabow and Kariv [19] formalized König's
proof into a procedure that performs a coloring in O(|V| * |E|) time. This algorithm
leads to an O(Nr)-time routing procedure. The edge-coloring algorithm based
on Euler partitions takes O(|V|^0.5 |E| log Delta) time. For C(n, m, r), this leads
to an O(r^0.5 N log m)-time algorithm, assuming that N is a power of 2.
A detailed survey of routing algorithms for Clos networks based on matching
and edge coloring can be found in [10]. Table 1.3 lists routing algorithms for
3-stage Clos networks and Benes networks based on these three approaches, with
their time complexities for sequential and parallel implementations. These time
complexities are based on PRAM models.
The primary routing algorithms for setting up the Benes network, besides the
algorithms derived from the Clos network, include the parallel algorithm of [62],
the self-routing algorithms [61, 77] for some permutations, and the non-recursive
algorithm of [45].
Since the Benes network B(N) has O(N log N) SEs, one cannot set up the network
in less than O(N log N) time using a single processor. Parallel routing algorithms
are an alternative. Nassimi and Sahni [62] gave a parallel routing algorithm for the
Benes network and analyzed its complexity on a completely connected multiprocessor
system and on various non-completely connected multiprocessor systems such as the
mesh-connected computer, perfect shuffle computer, and cube-connected computer [62].
The idea of this algorithm is to implement the looping algorithm in parallel. As
shown in Table 1.3, the time complexity of the parallel routing algorithms in [49]
and [62] for B(N) is O(log^2 N), which is the best time complexity among the known
routing algorithms for Benes networks.
Algorithm                   Network                 Sequential Time      Parallel Time
Matrix Decomposition [66]   3-stage Clos network    x                    x
                            Benes                   O(N log N)           O(N)
Matching [28]               3-stage Clos network    O(m r^2.5)           x
                            Benes                   O(N^2.5)             x
Matching [19]               3-stage Clos network    O(m r^0.5 (N + r))   x
                            Benes                   O(N^1.5)             x
Matching [14]               3-stage Clos network    O(m N log r)         O(mN)
                            Benes                   O(N log N)           O(N)
Edge Coloring [18]          3-stage Clos network    O(N log m)           O(N)
                            Benes                   O(N)                 x
Edge Coloring [19]          3-stage Clos network    O(Nr)                x
                            Benes                   O(N^2)               x
Edge Coloring [10]          3-stage Clos network    O(r^0.5 N log m)     O(r^0.5 N)
                            Benes                   O(N^2.5)             x
Edge Coloring [49]          3-stage Clos network    O(N log m)           O(log^2 N)
                            Benes                   O(N)                 O(log^2 N)
Parallel Looping [62]       3-stage Clos network    x                    x
                            Benes                   x                    O(log^2 N)

Table 1.3. Routing algorithms for 3-stage Clos networks and Benes networks
To reduce the time complexity of routing algorithms for the Benes network,
another alternative is the self-routing algorithm, in which the setting of every
switch in the interconnection network is controlled by routing tag bits attached
to the input packets. Although the Benes network is not a self-routing network,
Lenfant [48] first showed that the Benes network can self-route all five families
of frequently used permutations, each with a different routing algorithm. Nassimi
and Sahni [61] presented a unified self-routing algorithm for a class of
permutations that contains the five families of permutations Lenfant considered.
Raghavendra and Boppana [77] gave a self-routing algorithm, different from that of
Nassimi and Sahni, to route linear permutations in the Benes network. The time
complexity of a self-routing algorithm is O(log N) for an N x N Benes network
since it contains 2 log N - 1 stages.
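The underlying bit-control idea can be illustrated on the omega (shuffle-exchange) member of the Banyan family. This sketch is ours and shows only basic destination-tag routing, not the Benes self-routing algorithms cited above: at each of the log N stages the packet is shuffled and then leaves the 2 x 2 SE on the port named by the next destination bit.

```python
def omega_route(src, dst, n_bits):
    """Trace a packet through an N = 2**n_bits omega network using
    destination-tag self-routing: perfect shuffle, then the SE output
    port is the next destination bit (MSB first)."""
    mask = (1 << n_bits) - 1
    p, path = src, [src]
    for stage in range(n_bits):
        p = ((p << 1) | (p >> (n_bits - 1))) & mask   # perfect shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1       # routing-tag bit
        p = (p & ~1) | bit                            # 2x2 SE output port
        path.append(p)
    return path
```

For N = 8 any source reaches any destination in log N = 3 stages, independently of the source, which is why the tag needs only the destination address.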
Although self-routing algorithms are faster than the known parallel algorithms,
they cannot route all permutations. K.Y. Lee [45] presented a Benes network
routing algorithm that sets half of the Benes network by self-routing and realizes
all permutations. This algorithm does not view the Benes network as a recursive
network, but rather as a concatenation of two Banyan networks, SN1 and SN2, where
SN1 corresponds to the first log N - 1 stages and SN2 corresponds to the remaining
log N stages of the Benes network. The basic idea of this non-recursive algorithm
is as follows: the SEs in SN1 are set up via a full binary tree using set-partitioning
functions, and the SEs in SN2 are set up by bit control. Thus this routing algorithm
sets SEs one stage at a time, from the leftmost stage to the rightmost stage.
Although the time complexity of this non-recursive algorithm is the same as that of
the looping algorithm, it has two advantages over the looping algorithm.
First, it eliminates the information exchange among different stages, making
pipelined switch setting feasible in Multiple Instruction Multiple Data
(MIMD) environments, because it sets switches stage by stage, one stage at a time.
Second, SN2 is bit-controlled in real time, so the bottleneck remains only within
SN1.
1.3.4 Crosstalk-Free Routing
To reduce the internal blocking effect in optical switching networks, three
approaches have been proposed: space dilation, time dilation, and wavelength
dilation. In wavelength dilation, the idea is to ensure that the connections in the
same SE have different wavelengths. In this dissertation, different wavelengths refer
to wavelengths with enough wavelength spacing that no crosstalk is generated when
such wavelengths pass through the same SE/link, and crosstalk refers to first-order
SE crosstalk [99]. In space and time dilation, node conflicts can be eliminated by
ensuring that at most one connection passes through an SE in an OMIN. More
specifically, in space dilation node conflicts can be avoided by increasing the
number of SEs in an OMIN (e.g. [41, 42, 68, 86, 91, 92, 94]), while in time dilation
a set of conflicting connections is partitioned into subsets so that the connections
in each subset can be established simultaneously without conflicts (e.g. [69, 73,
74, 83, 99]). Clearly, space dilation trades hardware cost while time dilation
trades time. In wavelength dilation, the crosstalk between two signals passing
through the same SE is suppressed by routing that ensures the two wavelengths are
different (e.g. [81, 82]), or by using wavelength converters (e.g. [22, 75]).
1.4 Motivations and Contributions of Dissertation
Nonblocking networks are favored for switching whenever possible. The
crosstalk-free requirement in photonic networks adds a new dimension of constraints
on nonblockingness. Switching algorithms, including routing for establishing
connections between inputs and outputs and scheduling for resolving packet
contentions, play a fundamental role in the performance of switching networks. Any
algorithm that requires more than linear time would be considered too slow for
real-time applications. One remedy is to use multiple processors to establish
connections in parallel; the other is to construct low-cost, high-speed,
large-capacity nonblocking switching architectures.
The contributions of this dissertation mainly include developing parallel
algorithms for routing and scheduling and proposing cost-effective high-speed
switching architectures, as shown in Figure 1.22. We tackle these challenging
switching problems by using combinatorial and graph-theoretic approaches, applying
parallel processing and computing techniques, and adopting implementation and
experimental evaluations.
Switching architecture
  Routing:
    Parallel routing algorithms for nonblocking electronic and photonic
      switching networks (Chapter 4)
    Parallel crosstalk-free routing for optical Benes networks (Chapter 5)
    Parallel routing and wavelength assignment for optical interconnection
      networks (Chapter 6)
    Parallel routing algorithms for group connectors (Chapter 7)
  Scheduling:
    A parallel iterative improvement stable matching algorithm (Chapter 2)
    Design and implementation of an acyclic stable matching scheduler (Chapter 3)

Figure 1.22. Main work of the dissertation
1.5 Outline of Dissertation
This dissertation is organized as follows.
In Chapter 2, we propose a new approach, parallel iterative improvement
(PII), to solving the stable matching problem. This approach treats the stable
matching problem as an optimization problem with all possible matchings forming
its solution space. Since a stable matching always exists for any stable matching
problem instance, finding a stable matching is equivalent to finding a matching with
the minimum number (which is always zero) of unstable pairs. A particular PII
algorithm is presented to show the effectiveness of this approach; it constructs a
new matching from an existing matching and uses techniques such as randomization
and greedy selection to speed up the convergence process. Simulation results show
that the PII algorithm has better average performance than the classical
stable matching algorithms and converges within a linear number of iterations with
high probability. We also discuss implementations on the hypercube, mesh of trees,
and array with multiple broadcasting buses.
In Chapter 3, we model the acyclic stable matching problem as the dominating
set problem for a rooted dependency graph, and then propose a parallel algorithm
for finding the dominating set. One advantage of scheduling algorithms based on our
acyclic stable matching is their low time complexity: for any instance of the
acyclic stable matching problem, our acyclic stable matching algorithm can find a
stable matching in O(N log N) time, while the classical stable matching algorithm
needs O(N^2) time. Another advantage is feasibility for high-speed implementation.
We design and implement a scheduler based on our acyclic stable matching algorithm
in hardware. Simulation results show that the number of 2-input NAND gates and the
timing of our design are proportional to N^2 and N respectively, making it feasible
to implement at high speed with current CMOS technologies.
In Chapter 4, we study a class of multistage nonblocking switching networks
B(N, x, p, a), constructed by horizontally concatenating x (<= log N - 1)
extra stages to an N x N Banyan-type network and vertically stacking p copies of
the extended Banyan network, with the crosstalk-free constraint (a = 1) or without
it (a = 0). This class of networks contains the Banyan network, Benes network,
and Cantor network as special cases. By modeling the routing problems for this
class of networks as weak and strong edge colorings of bipartite graphs, we develop
fast parallel routing algorithms that can route an arbitrary partial permutation
with K (<= N) connections in a rearrangeable nonblocking network B(N, x, p, a) in
O((x + log p) log K + log N) time and in a strictly nonblocking network
B(N, 0, p', a) in O(log p' log K + p' log p') time.
In Chapter 5, we model the permutation decomposition problem as an edge coloring
problem on a bipartite graph, and simplify the existing proof of the
decomposability of a permutation into two crosstalk-free (CF) partial permutations.
By applying parallel processing techniques, we develop a fast parallel
decomposition algorithm to decompose a permutation into two CF partial permutations
using a linear number of processors, which improves the time complexity of the
existing permutation decomposition algorithms from linear to logarithmic. Using
equitable coloring techniques, we further improve the time complexity for
establishing a set of K connections in a time-dilated optical Benes network from
O(N log N) to O(log^2 K + log N).
In Chapter 6, we extend the concept of nonblocking from space division
switching to wavelength division switching. We model the wavelength routing
problem as a vertex coloring problem and develop fast parallel routing algorithms
for realizing an arbitrary permutation in wavelength-rearrangeable space-strict-sense
Banyan networks and wavelength-rearrangeable space-rearrangeable Benes networks
in O(log^2 N) time and O(log^3 N) time respectively, and discuss implementations of
both algorithms on a hypercube.
In Chapter 7, we consider Benes group connectors and Clos group connectors,
which are based on Benes networks and 3-stage Clos networks. We develop fast
parallel routing algorithms for both group connectors to realize connections from
N inputs to n output groups. Benes group connectors and Clos group connectors
have lower hardware cost than the corresponding Benes networks and Clos networks,
respectively. We show that, with the proposed routing algorithms, the hardware of
Benes group connectors can be reduced further.

In Chapter 8, we conclude our research work and briefly discuss future work.
CHAPTER 2
A PARALLEL ITERATIVE IMPROVEMENT STABLE
MATCHING ALGORITHM
2.1 Introduction
Recently, the application of stable matching to switch scheduling has been proposed
in the literature. To design efficient switch scheduling, we must improve the
time complexity of stable matching algorithms. One possible solution is to use
parallel processing.
For real-time applications, the algorithm proposed by Gale and Shapley, or simply
the GS algorithm, with time complexity O(n^2 log n) using O(n) processors, is not
fast enough. Attempts to find parallel stable matching algorithms with low
complexity have been made by many researchers (e.g. [1, 23, 26, 29, 59, 87, 90]).
To date, the best known parallel algorithm for the stable matching problem takes
O(sqrt(n) * log^3 n) time [17]. This algorithm runs on a CRCW PRAM (concurrent-read
concurrent-write parallel random access machine) with n^4 processors, which makes
it infeasible for applications in packet switching networks.
The parallelizability of the stable matching problem is far from fully
understood. It is widely believed that this problem is not in NC. The parallel
version of the stable matching algorithm by Gale and Shapley needs O(n log n)
iterations on average [24]. It has been suggested that parallel stable matching
algorithms cannot be expected to provide high speedup on average [37, 76]. Thus,
designing efficient parallel algorithms that perform well for most cases is a
challenging endeavor.
In this chapter, we propose a new approach, parallel iterative improvement
(PII), to solving the stable matching problem. Since a stable matching always
exists for any instance of the stable matching problem, finding a stable matching
is equivalent to finding a feasible matching with the minimum number (which is
always zero) of unstable pairs. The PII algorithm consists of two alternating
phases, an Initiation Phase and an Iteration Phase. An Initiation Phase is a
procedure that randomly generates a matching. An Iteration Phase consists of
multiple improvement iterations. We try to speed up the convergence process by
exploring parallelism in identifying a subset of unmatched pairs to replace matched
pairs in an existing matching so that the number of unstable pairs in the newly
obtained matching can be reduced. Due to the greedy selection of new matching pairs,
the PII algorithm may not converge in one Iteration Phase. However, we observed
that the PII algorithm tends to find a stable matching in o(n) iterations on
average, and in n iterations with high probability. We show that an Initiation
Phase and an iteration of an Iteration Phase take O(log n) time on both a
completely connected multiprocessor system and an array with multiple broadcasting
buses, and O(log^2 n) time on both the hypercube and the mesh of trees (MOT), all
assumed to have n^2 processing elements (PEs). Simulations show that the PII
algorithm has better average performance than the classical stable matching
algorithms and converges in n iterations with high probability. For real-time
applications with hard time constraints, the proposed algorithm can terminate at
any time during its execution, and the matching with the minimum number of unstable
matching pairs found so far can be used as an approximation of a stable matching.
The rest of the chapter is organized as follows. In Section 2.2, we study the
properties of stable matchings. In Section 2.3, we present our PII algorithm. We
show implementations of the PII algorithm on parallel computing machine models
in Section 2.4. Section 2.5 presents simulation results. Section 2.6 summarizes the
chapter.
2.2 Definitions and Properties

For a ranking matrix A = {a_{i,j}}, we call wr_{i,j} (resp. mr_{j,i}) the left value
(resp. right value) of a_{i,j}, and denote it by a^L_{i,j} (resp. a^R_{i,j}). For
convenience, we use (a^x_{i,j}, a^y_{i,j}) to denote the indices (i, j) of pair
a_{i,j}. Clearly, the ordered list of left values of all pairs in row i of A is the
man ranking list mL_i, and the ordered list of right values of all pairs in column j
is the woman ranking list wL_j. Example 1 shows the ranking matrix obtained from the
given ranking lists.
Example 1 An instance of the stable matching problem:

Man ranking lists:     Woman ranking lists:     Ranking matrix:
mL_1: {4, 2, 3, 1}     wL_1: {1, 4, 2, 3}       (4,1) (2,1) (3,4) (1,3)
mL_2: {3, 1, 2, 4}     wL_2: {1, 2, 3, 4}       (3,4) (1,2) (2,2) (4,1)
mL_3: {2, 4, 1, 3}     wL_3: {4, 2, 3, 1}       (2,2) (4,3) (1,3) (3,4)
mL_4: {1, 4, 3, 2}     wL_4: {3, 1, 4, 2}       (1,3) (4,4) (3,1) (2,2)
A pair a_{i,j} in A corresponds to a man-woman pair (m_i, w_j). A matching,
denoted by M, corresponds to n pairs of A with no two pairs in the same row/column.
If a pair of A is in M, it is called a matching pair of M, and a non-matching pair
otherwise. For any matching M of ranking matrix A, we define the marked ranking
matrix, A_M, as the ranking matrix with all matching pairs marked. Thus, for any
matching M, each row i (resp. column j) of A_M has exactly one matching pair, which
is denoted M(R_i) (resp. M(C_j)). A pair a_{i,j} is an unstable pair if
a^L_{i,j} < M(R_i)^L and a^R_{i,j} < M(C_j)^R. By the definition of stable matching,
we have:

Property 1 A matching M is stable if and only if there is no unstable pair in A_M.
With respect to A_M, we define a set NM1 of type-1 new matching pairs (simply
nm1-pairs) as follows. If there is no unstable pair in A_M, NM1 = {} (the empty
set). Otherwise, for each row with at least one unstable pair, select the one with
the minimum left value among all unstable pairs in this row as an nm1-generating
pair; for each column with at least one nm1-generating pair, select the one with
the minimum right value as an nm1-pair.
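The nm1-pair selection just described amounts to two reduction steps over the unstable pairs, one per row and one per column. The sketch below uses an illustrative 0-indexed encoding of our own (wr[i][j] = man i's rank of woman j, mr[j][i] = woman j's rank of man i).

```python
def nm1_pairs(wr, mr, match):
    """match[i] = j means (m_i, w_j) is a matching pair (0-indexed).
    Returns the set NM1 of type-1 new matching pairs."""
    n = len(wr)
    col_match = [None] * n               # col_match[j] = row of M(C_j)
    for i, j in enumerate(match):
        col_match[j] = i
    # unstable pair: both values beat those of the row/column matching pairs
    unstable = [(i, j) for i in range(n) for j in range(n)
                if wr[i][j] < wr[i][match[i]]
                and mr[j][i] < mr[j][col_match[j]]]
    gen = {}                             # row -> column of its nm1-generating pair
    for i, j in unstable:
        if i not in gen or wr[i][j] < wr[i][gen[i]]:
            gen[i] = j
    best = {}                            # column -> row of its nm1-pair
    for i, j in gen.items():
        if j not in best or mr[j][i] < mr[j][best[j]]:
            best[j] = i
    return {(i, j) for j, i in best.items()}
```

For the instance of Example 1 with the matching {(m_1, w_1), (m_2, w_2), (m_3, w_3), (m_4, w_4)}, the only unstable pair is a_{1,2}, so it is the unique nm1-pair; a matching in which every man gets his first choice yields no nm1-pairs at all.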
Based on NM1, we define a set NM2 of type-2 new matching pairs (simply
nm2-pairs) by a procedure that first identifies nm2-generating pairs and then
identifies nm2-pairs using an nm2-generating graph.

For any nm1-pair a_{i,j} in A_M, the pair a_{l,k} with l = M(C_j)^x and
k = M(R_i)^y is called the nm2-generating pair corresponding to a_{i,j}. We say that
nm1-pair a_{i,j} and its corresponding nm2-generating pair a_{l,k} are associated
with the matching pairs a_{i,k} and a_{l,j}. We define an nm2-generating graph G_M
as follows: V(G_M) = {nm2-generating pairs}, and E(G_M) = {e = (u, v) | the two
nm2-generating pairs u and v are associated with a common matching pair}. Since each
nm2-generating pair is associated with two matching pairs, we have:
Property 2 Given any A_M, the degree of the nm2-generating graph G_M is at most 2.

By Property 2, each connected component in G_M is a cycle or a chain, called an
nm2-generating cycle or an nm2-generating chain (an isolated node is a chain of
length 0). If a node in G_M has degree 2, it is called an internal node; otherwise,
it is called an end node. Clearly, if an nm2-generating pair a_{i,j} is an internal
node in G_M, there are two nm1-pairs, one in row i and the other in column j; if an
nm2-generating pair a_{i,j} is an end node in G_M, there is at most one nm1-pair in
row i or column j. We call an end node a_{i,j} a row end (resp. column end) of an
nm2-generating chain if there is no nm1-pair in row i (resp. column j) of A_M. An
isolated node is both a row end and a column end.
From the nm2-generating graph, we generate the set NM2 of nm2-pairs as
follows. For each nm2-generating chain with row end a_{i1,j1} and column end
a_{i2,j2}, we generate an nm2-pair a_{i1,j2}. No nm2-pair is generated from any
nm2-generating cycle. Hence, there is a one-to-one correspondence between
nm2-generating chains and nm2-pairs. Let NM = NM1 ∪ NM2. We call NM the set of new
matching pairs (simply nm-pairs). Based on the way that NM is generated, we know
that NM1 and NM2 are disjoint, and each row/column of A_M contains at most one
nm-pair.

A matching pair a_{i,j} in A_M is called a replaced matching pair (simply
rm-pair) if it is in the same row/column as an nm-pair. We denote the set of
rm-pairs by RM. Based on the way that RM is constructed, we have:

Lemma 1 If there is at least one unstable pair in A_M, then M' = (M - RM) ∪ NM
is a matching different from M.
2.3 Parallel Iterative Improvement Matching Algorithm
In this section, we present our main result, a parallel iterative improvement algorithm (the PII algorithm) for a completely connected multiprocessor system, which consists of a set of PEs connected in such a way that there is a direct connection between every pair of PEs. We assume that each PE can communicate with at most one adjacent PE during every communication step. The PII algorithm uses n² PEs. To facilitate our discussion, these n² PEs are placed as an n × n array. As input, PE_{i,j}, (1 ≤ i, j ≤ n), contains entry a_{i,j} of the ranking matrix A. When the algorithm terminates, a stable matching is found, with each PE_{i,j} indicating whether pair (m_i, w_j) is in the matching. The key idea of the PII algorithm is to construct a new matching M' from an existing matching M in the hope that M' is "closer" to a stable matching than M.
2.3.1 Constructing an Initial Matching
Randomly generating an initial matching can be reduced to generating a random
permutation, which can be done by a sequential algorithm proposed in [16]. We
present a parallel implementation of this algorithm.
Let each PE maintain a pointer. Initially, every PE_{i,j}, (1 ≤ i ≤ n − 1), sets its pointer to point to PE_{i+1,j}, and as a result, there are n disjoint lists. Then, each PE_{i,i} randomly chooses a j (i ≤ j ≤ n), and the two PEs swap their pointers, i.e., PE_{i,i} points to PE_{i+1,j} and PE_{i,j} points to PE_{i+1,i}. Consequently, n new disjoint lists originating from the PE_{1,j}'s are formed. After performing log n rounds of pointer jumping [34], each PE_{1,j} finds the other end PE_{n,p(1,j)} of its list, where p(1, j) is the column position of the PE pointed to by PE_{1,j}. Hence, a matching {(j, p(1, j)) | 1 ≤ j ≤ n} is formed. Figure 2.1 shows an example of generating a matching of size 4, where the matching obtained from (b) consists of the pairs (1, 4), (2, 3), (3, 1), (4, 2). Clearly, this parallel implementation takes O(log n) time since each list has length n.
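The pointer-swapping construction above can be simulated sequentially. The following is a minimal sketch (the function name and data layout are ours, not the dissertation's); each row of the pointer table is the identity with one random transposition applied, so following the lists composes n − 1 independent transpositions, which is the sequential random-permutation construction of [16]:

```python
import random

def random_matching(n, rng=random.Random()):
    # ptr[i][j]: column of the PE one row below that PE in row i, column j
    # points to (0-indexed); initially each PE points straight down,
    # forming n disjoint vertical lists
    ptr = [list(range(n)) for _ in range(n - 1)]
    # each diagonal PE picks a random j in [i, n) and swaps pointers with
    # the PE at column j of the same row: one transposition per row
    for i in range(n - 1):
        j = rng.randrange(i, n)
        ptr[i][i], ptr[i][j] = ptr[i][j], ptr[i][i]
    # follow each list from the top row to its other end; on the PE array,
    # log n rounds of pointer jumping do this step in parallel
    matching = []
    for j in range(n):
        col = j
        for row in range(n - 1):
            col = ptr[row][col]
        matching.append((j, col))   # the pair (j, p(1, j))
    return matching
```

Because every row of `ptr` is a permutation, the composition is a bijection, so the returned pairs always form a matching.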
Figure 2.1. Parallel random matching generation: (a) initial lists; (b) lists obtained
after randomization.
2.3.2 Constructing a New Matching from an Existing Matching
A basic operation of the PII algorithm is to construct a new matching M' = (M − RM) ∪ NM from an existing matching M if M is unstable. In the following, we describe six steps that carry out this operation.
Step 1: Recognize unstable pairs. Every PE with a matching pair in M broadcasts its column position together with its left value to the other PEs in the same row; every PE with a matching pair in M broadcasts its row position together with its right value to the other PEs in the same column. Let u_{i,j} be a Boolean variable indicating whether pair a_{i,j} is unstable. If both of PE_{i,j}'s values are smaller than the corresponding broadcast values, set u_{i,j} := true; otherwise set u_{i,j} := false. The broadcasting in rows/columns takes O(log n) time.
Step 2: Stability checking. Determine whether there exists a PE_{i,j} with u_{i,j} = true by parallel search in binary-tree fashion in rows/columns. Since each row/column has n PEs, the search takes O(log n) time. If u_{i,j} = false for every PE_{i,j}, then the current matching M is stable, and the algorithm terminates. Otherwise, go to the next step.
Step 3: Find NM1. For each row with at least one unstable pair, find the unstable pair with the minimum left value, and mark this pair as an nm1-generating pair. For each column with at least one nm1-generating pair, find the nm1-generating pair with the minimum right value, and mark this pair as an nm1-pair. The find-minimum operation in rows/columns takes O(log n) time.
Step 4: Find nm2-generating pairs. For each PE_{i,j} containing an nm1-pair, mark the pair in PE_{l,k} as an nm2-generating pair, where l = M(C_j)_x and k = M(R_i)_y. Clearly, this step takes only O(1) time.
Step 5: Find NM2. This step has two major objectives: (1) each nm2-generating node that is both a row end and a column end in G_M recognizes itself as an isolated node, and (2) the row end of each nm2-generating chain in G_M finds its column end. Let each PE_{i,j} containing an nm2-generating pair maintain two pointers, an r-pointer and a c-pointer. The r-pointer (resp. c-pointer) of PE_{i,j} points to the PE containing an nm2-generating pair in column M(R_i)_y (resp. row M(C_j)_x) if there is an nm1-pair in row i (resp. column j), and otherwise to itself. If both the r-pointer and the c-pointer of PE_{i,j} point to itself, then it corresponds to an isolated node in G_M; if the r-pointer (resp. c-pointer) of PE_{i,j} points to itself but the other pointer points to some other PE, then PE_{i,j} contains an nm2-generating pair that is the row (resp. column) end of an nm2-generating chain; if both pointers of PE_{i,j}
point to other PEs, its nm2-generating pair corresponds to an internal node of G_M. Figure 2.2 shows an example of finding a new matching M' from an existing matching M, where M = {a_{0,0}, a_{1,9}, a_{2,10}, a_{3,7}, a_{4,8}, a_{5,1}, a_{6,6}, a_{7,5}, a_{8,4}, a_{9,3}, a_{10,2}}, NM1 = {a_{1,7}, a_{3,1}, a_{4,6}, a_{5,9}, a_{6,5}, a_{7,4}, a_{8,3}, a_{10,10}}, NM2 = {a_{2,2}, a_{9,8}}, RM = {a_{1,9}, a_{2,10}, a_{3,7}, a_{4,8}, a_{5,1}, a_{6,6}, a_{7,5}, a_{8,4}, a_{9,3}, a_{10,2}} and M' = (M − RM) ∪ NM = {a_{0,0}} ∪ NM1 ∪ NM2. On a completely connected multiprocessor with n² PEs, objective (1) can easily be achieved in O(1) time, and objective (2) can be achieved by performing ⌈log n⌉ rounds of pointer jumping [34], since the length of each nm2-generating chain is at most n. Once objectives (1) and (2) are accomplished, the nm2-pairs can be computed in O(1) time.
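The pointer-jumping step used for objective (2) can be sketched sequentially as follows. Here `succ` is a hypothetical successor array (our own layout) in which a chain end points to itself; each doubling round lets every node follow two hops at once, so ⌈log n⌉ rounds suffice for chains of length at most n:

```python
import math

def find_chain_ends(succ):
    # succ[x] is x's successor; an end node satisfies succ[x] == x.
    # After ceil(log2 n) doubling rounds every node points at the end of
    # its chain, since each round doubles the distance already covered.
    n = len(succ)
    ptr = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        ptr = [ptr[ptr[x]] for x in range(n)]   # jump two hops at once
    return ptr
```

For the chain 0 → 1 → 2 → 3 (with 3 pointing to itself), two rounds suffice and every node ends up pointing at node 3.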
Figure 2.2. Finding a new matching from an existing matching. The figure marks the matching pairs, nm1-pairs, nm2-generating pairs, and nm2-pairs on the grid of rows and columns, together with the r-pointers and c-pointers of Step 5. In this example G_M consists of an nm2-generating chain a_{6,8}, a_{7,6}, a_{8,5}, a_{9,4}, an nm2-generating cycle a_{1,1}, a_{3,9}, a_{5,7}, and an nm2-generating isolated node a_{2,2}.
Step 6: Construct a new matching. Each PE_{i,j} containing an nm-pair marks the matching pair in row i as a replaced pair, and marks itself as a matching pair. This step takes O(1) time.
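The six steps can be condensed into a sequential routine; the sketch below is ours, not the dissertation's (entries are stored as `A[i][j] = (left, right)` and a matching as a dict from row to matched column), and the chain walk in Step 5 relies on the degree bound of Property 2:

```python
def improve(A, M):
    # One PII improvement step: returns M' = (M - RM) + NM1 + NM2,
    # or None if M is already stable.
    n, mrow = len(A), dict(M)
    mcol = {j: i for i, j in mrow.items()}
    left = lambda i, j: A[i][j][0]
    right = lambda i, j: A[i][j][1]
    # Steps 1-2: recognize unstable pairs and check stability
    unstable = [(i, j) for i in range(n) for j in range(n)
                if mrow[i] != j and left(i, j) < left(i, mrow[i])
                and right(i, j) < right(mcol[j], j)]
    if not unstable:
        return None
    # Step 3: per row, the minimum-left unstable pair is nm1-generating;
    # per column, the minimum-right nm1-generating pair is an nm1-pair
    gen = {}
    for i, j in unstable:
        if i not in gen or left(i, j) < left(i, gen[i]):
            gen[i] = j
    best = {}
    for i, j in gen.items():
        if j not in best or right(i, j) < right(best[j], j):
            best[j] = i
    NM1 = [(i, j) for j, i in best.items()]
    nm1_rows, nm1_cols = {i for i, _ in NM1}, {j for _, j in NM1}
    # Step 4: nm2-generating pair (l, k) and its two associated matching pairs
    nodes = {}
    for i, j in NM1:
        l, k = mcol[j], mrow[i]
        nodes[(l, k)] = ((i, k), (l, j))
    # Step 5: walk each chain from its row end (its row holds no nm1-pair)
    # to its column end; cycles generate no nm2-pair
    assoc = {}
    for v, pairs in nodes.items():
        for p in pairs:
            assoc.setdefault(p, []).append(v)
    NM2 = []
    for l, k in nodes:
        if l not in nm1_rows:                    # row end (maybe isolated)
            cur, prev = (l, k), None
            while cur[1] in nm1_cols:            # not yet the column end
                nxt = [w for p in nodes[cur] for w in assoc[p]
                       if w != cur and w != prev][0]
                prev, cur = cur, nxt
            NM2.append((l, cur[1]))              # the nm2-pair a_{i1,j2}
    # Step 6: drop replaced pairs, add the nm-pairs
    NM = NM1 + NM2
    rows, cols = {i for i, _ in NM}, {j for _, j in NM}
    Mp = {i: j for i, j in mrow.items() if i not in rows and j not in cols}
    Mp.update(dict(NM))
    return Mp
```

On the 4 × 4 ranking matrix of Example 2 with the diagonal as the (arbitrary) initial matching, one call replaces two pairs and returns a new matching whose rows and columns are again distinct, as Lemma 1 guarantees.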
Given an initial matching, the above procedure can be applied repeatedly, and a stable matching may be found. Example 2 shows how the PII algorithm finds a new matching from a given unstable matching for the instance of Example 1.
Example 2 A stable matching is found after one iteration of the Iteration Phase. The instance is the ranking matrix of Example 1:

4,1  2,1  3,4  1,3
3,4  1,2  2,2  4,1
2,2  4,3  1,3  3,4
1,3  4,4  3,1  2,2

The iteration marks on this matrix, in turn, the initial matching pairs, the unstable pairs, the nm1-generating pairs, the nm1-pairs, the nm2-generating pairs, the nm2-pairs, the replaced pairs, and finally the new matching pairs. It is not difficult to verify that the new matching is a stable matching. □
2.3.3 PII Algorithm
Now we are ready to present our PII algorithm. Conceptually, the PII algorithm has two alternating phases: the Initiation Phase and the Iteration Phase. The Initiation Phase finds an initial matching arbitrarily. The Iteration Phase contains at most c · n iterations, where c is a constant controlling the number of iterations in the Iteration Phase. Each iteration of the Iteration Phase checks whether an existing matching M is stable. If M is stable, the algorithm terminates; otherwise, a new matching M' is constructed. Then, M' is used as M for the next iteration. After c · n iterations in an Iteration Phase, the PII algorithm goes back to the Initiation Phase to randomly generate a new initial matching, and a new Iteration Phase is started from this newly generated matching. As analyzed above, an Initiation Phase and an iteration of an Iteration Phase each take O(log n) time on a completely connected multiprocessor system with n² PEs.
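The two-phase control structure can be sketched as follows; `improve` stands for any routine realizing one iteration of Section 2.3.2 and is a parameter of this hypothetical driver (the function names and the dict-based matching layout are our assumptions):

```python
import random

def pii(A, improve, c=2, rng=random.Random(0)):
    # Alternate Initiation Phases (random initial matching) with
    # Iteration Phases of at most c*n improvement iterations each.
    n = len(A)

    def is_stable(M):
        # M is stable iff no pair outside M beats both of its partners
        mcol = {j: i for i, j in M.items()}
        return not any(M[i] != j
                       and A[i][j][0] < A[i][M[i]][0]
                       and A[i][j][1] < A[mcol[j]][j][1]
                       for i in range(n) for j in range(n))

    while True:
        cols = list(range(n))
        rng.shuffle(cols)                 # Initiation Phase
        M = dict(enumerate(cols))
        for _ in range(c * n):            # Iteration Phase
            if is_stable(M):
                return M
            M = improve(A, M)             # construct M' from M
```

Because a stable matching always exists and the initial matchings are uniformly random, restarting after c · n iterations eventually finds one.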
In an iteration of an Iteration Phase, a new matching M' = (M − RM) ∪ NM1 ∪ NM2 is constructed from an existing matching M. It is easy to verify that the pairs in NM1 were unstable for M but become stable for M'; the pairs in NM2 are stable for M', regardless of whether they were stable for M; and the pairs in M − RM, which were stable for M, remain stable for M'. Intuitively, the number of unstable pairs for M' is smaller than the number of unstable pairs for M. In most cases this is true, and this is the heuristic behind the PII algorithm.
However, new unstable pairs may be generated for M'. Let the initial matching be M_0 and the matching generated in the i-th iteration be M_i. Since the sets of nm1-pairs, nm2-pairs and rm-pairs with respect to M_{i−1} are unique, the matching M_i is constructed uniquely from M_{i−1}. Hence, if M_i ∈ {M_j | j ∈ {0, 1, ..., i − 1}}, i.e., the newly generated matching is the same as a previously generated matching, no stable matching can be found. Example 3 shows a case that will not converge to a stable matching. It is possible to include a procedure for detecting this cyclic situation. Such a procedure, however, is too time-consuming. This is why we decided to start a new round after c · n iterations of an Iteration Phase, where c is a carefully selected constant. The random permutation generating algorithm we use generates random matchings with uniform distribution according to [3]. Therefore, by the existence of a stable matching, the PII algorithm can always find one for any instance of the stable matching problem.
Example 3 The matching M_i (i ≥ 4) is the same as the matching M_j with j = i mod 4. Starting from an initial matching M_0 on the ranking matrix

4,1  2,1  3,4  1,3
3,4  1,2  2,2  4,1
2,2  4,3  1,3  3,4
1,3  4,4  3,1  2,2

the iterations produce matchings M_1, M_2 and M_3, and then return to M_0, so the sequence of matchings cycles with period 4. □
Our simulation results (see Section 2.5) indicate that the PII algorithm performs better than the GS algorithm. However, we are unable to theoretically exclude the possibility that the total number of iterations is very large. In order to enforce a bound on the number of iterations, we propose to run the PII algorithm and the parallel GS algorithm simultaneously in a time-sharing fashion. We denote this modified PII algorithm by PII-GS; it terminates once one of the two algorithms generates a stable matching. Clearly, the PII-GS algorithm converges to a stable matching within O(n²) iterations in the worst case.
2.4 Implementations of the PII Algorithm on Parallel Computing Machine Models
In this section, we consider implementing the PII algorithm on three well-known parallel computing systems: the hypercube, the mesh of trees (MOT), and the array with multiple broadcasting buses. Without loss of generality, assume n = 2^k. If 2^k < n < 2^{k+1}, the PII algorithm can be implemented on a 2^{2(k+1)}-processor system with a constant slow-down factor. The n² PEs in each system are placed as an n × n array, and the n PEs in each row/column form a row/column connection (see Figure 2.3). We assume that our parallel computing systems operate in a synchronous fashion. Basic O(1)-time parallel operations of a hypercube and a MOT can be found in [47]. For an array with multiple broadcasting buses, we assume that each bus has a controller. A processor can request to communicate with the controller or any other processor on the bus. At any time, a constant number of processors on a bus may send requests to the bus controller, and the controller arbitrarily selects one request (if any) to grant bus access. The controller of a bus can broadcast a message to all the processors on the bus. We assume that each processor-to-processor, processor-to-controller, and broadcasting operation takes O(1) time.
It is simple to notice that multiple-broadcasting, finding-minimum, and pointer jumping are the most time-consuming operations in the PII algorithm. Pointer jumping can be carried out by sorting. Let C be a parallel computing machine with n² processors, and let T_B(n), T_M(n) and T_S(n) be the time required for multiple-broadcasting, finding-minimum and sorting on C, respectively. Then, an Initiation Phase of the PII algorithm can be implemented on C in O(T_S(n) · log n) time, and each iteration of an Iteration Phase of the PII algorithm can be implemented on C in O(max{T_B(n), T_M(n), T_S(n) · log n}) time.
Figure 2.3. Parallel computing models: (a) a 16-processor hypercube; (b) a 4 × 4 mesh of trees; (c) a 4 × 4 array with multiple broadcasting buses
For a hypercube and a MOT, the broadcasting and finding-minimum operations in PII are performed in parallel row-wise or column-wise, resulting in T_B(n) = T_M(n) = O(log n). For an array with multiple broadcasting buses, T_B(n) = O(1). The finding-minimum operation can be carried out on a bus in O(log n) time using a binary search method. For an n²-processor hypercube, T_S(n) = O(log² n), while T_S(n) = Θ(n) for a MOT and an array with multiple broadcasting buses, since either of their bisection widths is n. If we use sorting to implement the pointer jumping operations, both an Initiation Phase and an iteration in an Iteration Phase of the PII algorithm require O(log³ n) time on a hypercube and Θ(n log n) time on a MOT and an array with multiple broadcasting buses. In the following, however, we show that sorting can be avoided on these parallel computing models using special features of the PII algorithm.
First, we show how to implement an Initiation Phase without pointer jumping. This can be done by adopting a parallel implementation in [25] of the algorithm of [16]. Let π_i, (1 ≤ i ≤ n − 1), be the permutation interchanging i and r_i, where r_i is chosen randomly from the set {i, ..., n}, while leaving the other elements of {1, 2, ..., n} fixed. Let π_n be the identity permutation. Initially, we use row i to represent π_i. The computation π = π_1 ∘ π_2 ∘ ... ∘ π_{n−1} ∘ π_n is organized in a complete binary tree of height log n. For example, for n = 8, π = ((π_1 ∘ π_2) ∘ (π_3 ∘ π_4)) ∘ ((π_5 ∘ π_6) ∘ (π_7 ∘ π_8)). Hence, all that remains is to consider the composition of two permutations. Given a permutation π', let D(π') = {i | 1 ≤ i ≤ n and π'(i) ≠ i}. The algorithm of [25] associates |D(π')| processors with π'. In our implementation, we mimic the operations of one processor in [25] using a set of processors and their connections. More specifically, we associate each row/column i with π_i at the beginning. Let π_{(i,j)} = π_i ∘ π_{i+1} ∘ ... ∘ π_j = π_{(i, i+(j−i+1)/2−1)} ∘ π_{(i+(j−i+1)/2, j)}, where j − i + 1 is a power of 2. Note that |D(π_{(i,j)})| ≤ j − i + 1. Thus, we can use rows i through j and columns i through j to perform the operations assigned to the processors for computing π_{(i,j)} in the algorithm of [25]. The communication paths for computing the compositions of permutations at the same level of the binary computation tree are disjoint, because they use disjoint sets of row and column connections. Since the height of the binary computation tree is O(log n), an Initiation Phase of the PII algorithm takes O(log² n) time on a hypercube and a MOT. If an array with multiple broadcasting buses is used, an Initiation Phase of the PII algorithm takes O(log n) time.
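The tree-structured composition can be sketched sequentially (the function names are ours); the compositions at each level are independent and correspond to the disjoint row/column groups described above:

```python
import random

def compose(p, q):
    # (p ∘ q)(i) = p[q[i]]; permutations are 0-indexed lists
    return [p[x] for x in q]

def tree_compose(perms):
    # pairwise-combine along a binary tree of height log n;
    # assumes len(perms) is a power of two (the identity pads it out)
    while len(perms) > 1:
        perms = [compose(perms[k], perms[k + 1])
                 for k in range(0, len(perms), 2)]
    return perms[0]

def random_permutation(n, rng=random.Random()):
    # tau_i interchanges i and a random r_i in {i, ..., n-1} (0-indexed);
    # the final identity plays the role of pi_n, so n = 2^k gives a
    # power-of-two list as the document assumes
    taus = []
    for i in range(n - 1):
        t = list(range(n))
        r = rng.randrange(i, n)
        t[i], t[r] = t[r], t[i]
        taus.append(t)
    taus.append(list(range(n)))
    return tree_compose(taus)     # pi = pi_1 ∘ pi_2 ∘ ... ∘ pi_n
```

Each level halves the number of partial permutations, so log n rounds of pairwise composition suffice.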
We now show how to implement an iteration of an Iteration Phase without using pointer jumping. Since each row/column contains at most one nm2-generating pair, each pointer jumping step of Step 5 can be decomposed into disjoint parallel 1-to-1 row communications followed by disjoint parallel 1-to-1 column communications without conflicts. Thus, every pointer jumping step can be implemented in O(log² n) time on a hypercube and a MOT, and in O(log n) time on an array with multiple broadcasting buses. We also note that simulating an n × n MOT by an n/2 × n/2 MOT (which has 3n²/4 − n < n² processors) results in a constant slowdown factor. To summarize, we show the improvement of the time complexity of the PII algorithm on three parallel computing systems in Table 2.1.
                    Initiation Phase                An iteration in Iteration Phase
Machine model       with sorting   without sorting  with sorting   without sorting
Hypercube           O(log³ n)      O(log² n)        O(log³ n)      O(log² n)
MOT                 O(n log n)     O(log² n)        O(n log n)     O(log² n)
Array with buses    O(n log n)     O(log n)         O(n log n)     O(log n)

Table 2.1. Time complexity of implementations of the PII algorithm on three parallel computing machine models
Figure 2.4. Performance comparisons: (a) average number of iterations for the PII, PII-GS, and GS algorithms to find a stable matching, versus the size of stable matching (10 to 100); (b) cumulative frequency (%) with which the PII, PII-GS, and GS algorithms find a stable matching within n iterations, versus the size of stable matching.
2.5 Simulation Results
We have simulated the PII, PII-GS, and parallel GS algorithms for different sizes n ∈ {10, 20, ..., 100} of stable matching, with 10000 runs each. The ranking lists and initial matchings are generated by the random permutation algorithm [16]. Each Iteration Phase contains n iterations. The performance comparisons are based on the average number of parallel iterations each algorithm needs to generate a stable matching and the frequency with which each algorithm converges within n iterations. From the simulation, we observe that the PII and PII-GS algorithms significantly outperform the GS algorithm. Figure 2.4 shows that the PII and PII-GS algorithms converge within n iterations with very high probability, while the probability for the GS algorithm to converge within the same number of iterations decreases quickly as the problem size increases.
2.6 Summary
In this chapter, we proposed a new approach, parallel iterative improvement, to solving the stable matching problem. The PII algorithm requires n² PEs, among which n PEs are required to perform arithmetic operations (for random number generation), while the other PEs can be simple comparators. The classical GS algorithm and most existing stable matching algorithms can only find the man-optimal or woman-optimal stable matching. By [24], the man-optimal (resp. woman-optimal) stable matching is woman-pessimal (resp. man-pessimal), i.e., every man (resp. woman) gets his (resp. her) best partner while every woman (resp. man) gets her (resp. his) worst partner over all stable matchings. However, due to its randomness, the PII algorithm may construct any stable matching in the set of stable matchings. Therefore, this algorithm generates stable matchings with more fairness.
In some applications, such as real-time packet/cell scheduling for a switch, a stable matching is desirable but may not be found quickly within a tight time constraint. Thus, finding a "near-stable" matching by relaxing the solution quality to satisfy the time constraint is more important for such applications. Most existing parallel stable matching algorithms cannot guarantee a matching with a small number of unstable pairs within a given time interval; interrupting the computation of such an algorithm does not yield any matching. However, the PII algorithm can be stopped at any time. By maintaining the matching with the minimum number of unstable pairs found so far, a matching that is close to a stable matching can be computed quickly.
CHAPTER 3
DESIGN AND IMPLEMENTATION OF AN ACYCLIC
STABLE MATCHING SCHEDULER
3.1 Introduction
The scheduling algorithms based on general stable matchings are too complex for high-speed implementation. It turns out that for stable matching instances with acyclic dependency graphs, finding stable matchings takes less time. Researchers have proposed several scheduling algorithms for CIOQ switches based on acyclic stable matchings. In [72], Prabhakar and McKeown proposed the most urgent cell first algorithm (MUCFA) for a CIOQ switch with a speedup of 4 to emulate output queued (OQ) switch performance. Chuang and Stoica independently improved the result to a speedup of 2 with the critical cell first (CCF) algorithm [12] and the joined preferred matching (JPM) algorithm [85]. In [63], Nong et al. proved that with some speedup, an acyclic stable matching scheduling algorithm can provide QoS guarantees for both unicast and multicast traffic with fixed-length and variable-length packets.
The advantage of acyclic stable matching scheduling algorithms is their feasibility for high-speed implementation. However, there is no hardware design and implementation of acyclic stable matching scheduling algorithms in the literature. In this chapter, we propose a parallel algorithm for the acyclic stable matching problem and present its hardware implementation. We first model the acyclic stable matching problem as the dominating set problem for rooted dependency graphs. We show that the root set and the dominating set of a rooted dependency graph are identical. We then propose a parallel algorithm, FIND_ROOTS, to find the root set of a rooted dependency graph in O(n log n) time with n² simple processing elements (PEs). We further present a hardware design and implementation of the proposed algorithm. Simulation results show that the number of 2-input NAND gates and the timing of our design are proportional to n² and n log n, respectively. The proposed design can be used to implement schedulers based on acyclic stable matching algorithms, such as those in [72, 12, 85, 63].
The rest of the chapter is organized as follows. In Section 3.2, we propose our parallel algorithm FIND_ROOTS. In Section 3.3, we focus on the design and implementation of FIND_ROOTS in hardware. Section 3.4 summarizes the chapter.
3.2 A Parallel Stable Matching Algorithm for Rooted Dependency Graph
In this chapter, for a ranking matrix A = {a_{i,j}}, we call wr_{i,j} (resp. mr_{j,i}) the horizontal value (resp. vertical value) of a_{i,j}, and denote it by a^h_{i,j} (resp. a^v_{i,j}). By the definitions, we know that, given an n × n ranking matrix A, a set of man-woman pairs is a matching M if any two pairs (m_{i1}, w_{j1}) and (m_{i2}, w_{j2}) in M correspond to two entries a_{i1,j1} and a_{i2,j2} in different rows and columns of A; M is a stable matching if there does not exist a pair (m_i, w_j) ∉ M such that a^h_{i,j} < a^h_{i,k} and a^v_{i,j} < a^v_{l,j}, where (m_i, w_k), (m_l, w_j) ∈ M.
3.2.1 Dominating Set for Dependency Graph
A dominating set of a dependency graph ~G is a set of vertices, denoted V_d, such that the following two conditions are satisfied: (1) any two vertices in V_d correspond to two entries in different rows and columns of the ranking matrix; (2) for any vertex v ∈ V(~G) − V_d, there is a directed edge from a vertex in V_d to v.
Since each vertex v_{i,j} in ~G corresponds to a pair of man and woman (m_i, w_j), by the definitions of stable matching and dominating set, we have the following fact.
Fact 1 Let ~G be a dependency graph. V_d is the vertex subset corresponding to a stable matching if and only if V_d is a dominating set of ~G.
By Fact 1, the problem of finding a stable matching is reduced to the problem of finding a dominating set. In general, the dominating set of a dependency graph may not be unique, and finding one is time consuming. However, we find that the problem of finding dominating sets for a special class of dependency graphs, named rooted dependency graphs, is much easier. A rooted dependency graph is defined recursively as follows: an empty graph is a rooted dependency graph; a non-empty dependency graph ~G is a rooted dependency graph if (1) it contains one or more roots, each being a vertex without any incoming edge; and (2) the reduced subgraph, which is obtained from ~G by removing all vertices in the same rows and columns in which the roots are located and all outgoing edges from these removed vertices, is also a rooted dependency graph. The root set of a rooted dependency graph ~G is the set that consists of all roots of ~G and of the reduced subgraphs recursively generated from ~G. The following fact is obvious.
Fact 2 Let ~G be the dependency graph of a ranking matrix A in which each entry a_{i,j} = (wr_{i,j}, mr_{j,i}). For any vertex v_{i,j}, the number of incoming edges from the vertices in row i is equal to wr_{i,j} − 1, and the number of incoming edges from the vertices in column j is equal to mr_{j,i} − 1.
By Fact 2, we know that a vertex whose corresponding entry is (1, 1) is a root, since it has no incoming edge. By Facts 1 and 2, we have the following theorem.
Theorem 1 For a rooted dependency graph ~G, the root set is the same as the dominating set, which is unique for ~G.
Figure 3.1. Finding a stable matching in a rooted dependency graph: (a) a rooted dependency graph ~G and its reduced subgraphs ~G' and ~G''; (b) the stable matching is found by the GS algorithm in 5 iterations.
Example 4 Figure 3.1 (a) shows an example of a rooted dependency graph ~G and its reduced subgraphs ~G' and ~G'', where each root is marked as a dark circle. The dependency graph ~G corresponds to the following ranking matrix A1:

3,3  4,1  1,1  2,3
1,2  2,4  3,2  4,2
1,1  2,3  4,3  3,1
2,4  3,2  1,4  4,4

The horizontal and vertical values of each entry in the ranking matrix are shown in the corresponding vertex. From the figure, clearly, neither of the two vertices v_{1,3} and v_{3,1}, which are marked as dark circles in ~G, has an incoming edge, since each of them corresponds to an entry (1, 1) in the ranking matrix. Hence, ~G has two roots, v_{1,3} and v_{3,1}. After removing all vertices in rows 1, 3 and columns 1, 3 and their outgoing edges from ~G, we get the reduced subgraph ~G', which has the root v_{4,2}, marked as a dark circle in ~G'. After removing all vertices in row 4 and column 2 and their outgoing edges from ~G', we get the reduced subgraph ~G'', which contains only one vertex, v_{2,4}, which is also a root of ~G''. By the definition, we know ~G is a rooted dependency graph. It is easy to verify that the root set, {v_{1,3}, v_{3,1}, v_{4,2}, v_{2,4}}, is the dominating set of ~G. By Theorem 1, the dominating set corresponds to the stable matching of ranking matrix A1, which is {(1, 3), (3, 1), (4, 2), (2, 4)}. □
A rooted dependency graph may not be acyclic (i.e., the graph may have a directed cycle). In Figure 3.1 (a), ~G contains the cycle (v_{1,1}, v_{4,1}, v_{4,2}, v_{3,2}, v_{2,2}, v_{2,3}, v_{2,4}, v_{1,4}), whose edges are marked as dark edges. However, an acyclic graph always has at least one root, and its reduced subgraph is also acyclic. Thus, we have the following fact.
Fact 3 An acyclic dependency graph is a rooted dependency graph, but a rooted dependency graph may not be an acyclic dependency graph.
In the following, we propose a parallel algorithm for finding the root set (i.e., the stable matching) of a rooted dependency graph.
3.2.2 The Algorithm
Given a rooted dependency graph ~G constructed from an n × n ranking matrix A, we first find the roots of ~G. If the reduced subgraph ~G' of ~G is not empty, we continue to find the remaining vertices of the root set of ~G recursively until the total number of found roots equals n. The algorithm for finding the root set of a rooted dependency graph, FIND_ROOTS, is described in the following.
Algorithm FIND_ROOTS
begin
    G := ~G;      /* ~G is the dependency graph */
    V_r := ∅;     /* V_r is the root set */
    while there exists a root in G do
        Step 1: find the set V'_r of roots of G and let V_r := V_r ∪ V'_r;
        Step 2: find the reduced subgraph G' of G and let G := G'.
end
Based on Theorem 1 and Fact 1, the set of roots obtained from FIND_ROOTS corresponds to the set of man-woman pairs in the stable matching. We analyze the time complexity of FIND_ROOTS using n² PEs as follows. The n² PEs are placed as an n × n array, and the n PEs in each row or column are fully connected. Each PE_{i,j} corresponds to a vertex v_{i,j} of ~G and has a pair of horizontal (h for short) and vertical (v for short) values, set initially to (wr_{i,j}, mr_{j,i}). Since the total number of roots in the root set of ~G is equal to n, FIND_ROOTS runs in at most n iterations. Each iteration of FIND_ROOTS consists of two steps. Based on Fact 2, we know that step 1 can be done in O(1) time by each PE_{i,j} checking whether its (h, v) = (1, 1). Conceptually, step 2 contains two substeps. In substep 1, each root vertex v_{i,j} found in step 1 sets its (h, v) = (0, 0) and marks all vertices in row i and column j as vertices to be deleted. Since all PEs in the same row or column are fully connected, this substep takes O(1) time. In substep 2, each undeleted vertex v_{i,j} decreases its h (resp. v) value by k if its h (resp. v) value is greater than those of k deleted vertices in row i (resp. column j). Since there are at most n deleted vertices in each row/column, this substep can be done in O(log n) time. Therefore, based on the above discussion, we have the following theorem.
Theorem 2 Given any instance of the stable matching problem, if its corresponding dependency graph is a rooted dependency graph (including an acyclic dependency graph), we can find the stable matching in O(n log n) time on n² PEs.
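The iterative root removal can be sketched as a sequential routine (names and 0-indexing are ours). Instead of explicitly decrementing the (h, v) values, a root is detected as the entry whose wr value is minimum in its surviving row and whose mr value is minimum in its surviving column, which is the same test as (h, v) = (1, 1):

```python
def find_roots(A):
    # Sequential sketch of FIND_ROOTS. A[i][j] = (wr, mr). Returns the
    # root set as (row, column) pairs in the order found, or None if the
    # dependency graph is not rooted (some round yields no root).
    n = len(A)
    alive_r, alive_c = set(range(n)), set(range(n))
    roots = []
    while alive_r:
        # Step 1: a root is row-minimal in wr and column-minimal in mr
        # among the surviving rows and columns
        new = [(i, j) for i in sorted(alive_r) for j in sorted(alive_c)
               if all(A[i][j][0] <= A[i][jj][0] for jj in alive_c)
               and all(A[i][j][1] <= A[ii][j][1] for ii in alive_r)]
        if not new:
            return None
        roots += new
        # Step 2: reduce the subgraph by deleting the roots' rows/columns
        alive_r -= {i for i, _ in new}
        alive_c -= {j for _, j in new}
    return roots
```

On the ranking matrix A1 of Example 4, this returns the root set {(0, 2), (2, 0), (3, 1), (1, 3)} (0-indexed) in three rounds, matching the stable matching {(1, 3), (3, 1), (4, 2), (2, 4)} found there.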
3.2.3 Comparison with GS Algorithm
Figure 3.2. Finding a stable matching in an acyclic dependency graph: (a) an acyclic dependency graph ~G and its reduced subgraphs ~G' and ~G''; (b) the stable matching is found by the GS algorithm in 6 iterations.
Gale and Shapley proposed an algorithm for solving the stable matching problem in [20]. The GS algorithm works in the following way. Each man first proposes to his most preferred woman; each woman keeps the proposal of the man who has the highest rank in her ranking list among those who have proposed to her, and rejects all the other proposals. Each rejected man then proposes to the next woman on his ranking list. The GS algorithm continues this process until all women have proposals. When the GS algorithm stops, each woman and the man whose
proposal the woman keeps become a pair of partners, and all pairs form a stable matching. Gale and Shapley showed that a stable matching always exists and can be found in O(n²) iterations. Due to the dependency in the GS algorithm, the number of iterations cannot easily be reduced by parallelism, regardless of the number of PEs used. The running time of the parallel GS algorithm is O(n² log n) on n PEs, since each iteration takes O(log n) time to find the minimum of at most n distinct numbers.
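For reference, the proposal process just described can be sketched sequentially (the list-based input format is our choice, not the dissertation's):

```python
def gale_shapley(man_pref, woman_rank):
    # man_pref[m]: women in order of m's preference (best first);
    # woman_rank[w][m]: rank of man m in woman w's list (smaller = better)
    n = len(man_pref)
    nxt = [0] * n            # next position each man will propose to
    keep = {}                # woman -> man whose proposal she keeps
    free = list(range(n))
    while free:
        m = free.pop()
        w = man_pref[m][nxt[m]]
        nxt[m] += 1
        if w not in keep:
            keep[w] = m                              # first proposal: keep it
        elif woman_rank[w][m] < woman_rank[w][keep[w]]:
            free.append(keep[w])                     # better man: reject old
            keep[w] = m
        else:
            free.append(m)                           # rejected: try next woman
    return {m: w for w, m in keep.items()}
```

Each man proposes to each woman at most once, giving the O(n²) iteration bound; the sequential dependency between successive proposals is what limits parallel speedup.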
For stable matching problems with rooted dependency graphs, the GS algorithm
does not work as fast as FIND ROOTS. Figure 3.1 (b) shows an example of finding
the stable matching using the GS algorithm, where new proposals are marked by light
lines and kept proposals by dark lines in each iteration. As shown
in Figure 3.1 (b), to find the stable matching for ranking matrix A1, the GS algorithm
needs 5 iterations while FIND ROOTS needs only 3. This means that O(n)
iterations are not sufficient for the GS algorithm to find the stable matching for rooted
dependency graphs. Furthermore, O(n) iterations are not sufficient for the GS algorithm
to find the stable matching for acyclic dependency graphs. Figure 3.2 shows an
example of an acyclic dependency graph. Consider finding a stable matching of the
ranking matrix A2:

    1,1  3,1  2,3  4,3
    1,2  4,2  2,2  3,1
    2,3  4,3  1,1  3,4
    2,4  4,4  3,4  1,2
According to Algorithm FIND ROOTS, we first find the roots of G̃, which are v_{1,1} and
v_{3,3}; then find the root of G̃′, which is v_{2,4}; and finally find the root of G̃″, which is v_{4,2}.
Thus, the GS algorithm needs 6 iterations while FIND ROOTS needs only 3.
In the worst case, the parallel GS algorithm finds the stable matching for
a rooted dependency graph or an acyclic dependency graph in O(n^2 log n) time.
FIND ROOTS, however, finds the stable matching for such graphs in n iterations, each taking O(log n) time. Thus, the
worst-case speedup of FIND ROOTS over the GS algorithm is O(n). Both
FIND ROOTS and the GS algorithm take the man and woman ranking lists as inputs, and
every list contains n numbers; hence the two algorithms need the
same space. Table 3.1 compares the parallel GS algorithm and the parallel FIND ROOTS
algorithm for finding the stable matching in any rooted or acyclic
dependency graph with respect to time, the number of PEs, and memory space.
Algorithm     Time          PEs   Space
GS            O(n^2 log n)  n     O(n^2)
FIND ROOTS    O(n log n)    n^2   O(n^2)

Table 3.1. Comparison of algorithms for finding a stable matching
3.3 Implementing the Scheduler
One of the objectives of our work is to design a scheduler that is feasible to implement.
In this section, we present the hardware design and implementation of a scheduler
based on the FIND ROOTS algorithm. An n × n scheduler has n^2 pairs of inputs
(wr_{1,1}, mr_{1,1}), ..., (wr_{n,n}, mr_{n,n}), and n pairs of outputs, which are the indices of the n
roots, s_1, s_2, ..., s_n. The circuit consists of n^2 nodes arranged as an n × n array.
Each node corresponds to an entry in the ranking matrix A and a vertex of A's
dependency graph. We use 2n buses to interconnect the n^2 nodes such that node n_{i,j},
where 1 ≤ i, j ≤ n, is connected to the i-th row bus, r_i, and the j-th column bus, c_j.
Each bus is log n bits wide. The first bit line of all n row buses is connected to a
controller, which is used to select one out of possibly multiple bus requests (in
case multiple root nodes exist in the graph). Each node n_{i,j} has 2 inputs for reading
its (h, v) pair and one output for sending out its index. Figure 3.3 shows the scheduler
block diagram, circuit structure, and node block diagram of a 4 × 4 scheduler.
Figure 3.3. A 4 × 4 scheduler design: (a) scheduler block diagram; (b) circuit structure; (c) node block diagram.
The operation of an n × n scheduler takes n iterations. Initially, each node n_{i,j}
sets its (h, v) = (wr_{i,j}, mr_{j,i}). An iteration operates as follows. Each node n_{i,j} that
finds its (h, v) = (1, 1) (i.e., it is a root node) sends a 'request signal' on its
row bus. If the controller detects more than one requesting bus, it
confirms the bus with the minimum row index and sends back a 'grant signal' on
that bus. Once a root node n_{i,j} gets the 'grant signal' from its row bus, it sends
a 'mask signal' on row bus r_i and column bus c_j to "eliminate" all nodes on row i
and column j; meanwhile, it updates its (h, v) to (0, 0) and sends out its index.
Once a node on row i (resp. column j) receives a 'mask signal', it broadcasts its
v (resp. h) value on its column (resp. row) bus. If a node's h (resp. v) value
is greater than the h (resp. v) value received from its row bus (resp. column bus), it
subtracts 1 from its h (resp. v) value.
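The node and controller behavior just described can be simulated in software. The sketch below is an illustrative model only: a minimum-row scan stands in for the priority-encoder arbitration, and it assumes, as in FIND ROOTS, that a root node with (h, v) = (1, 1) exists at every iteration:

```python
def schedule(wr, mr):
    """Software model of the n x n scheduler (one pass = one hardware iteration).

    wr[i][j] and mr[j][i] give node (i, j) its initial (h, v) pair.
    Returns the n granted root indices (i, j) in grant order.
    """
    n = len(wr)
    h = [[wr[i][j] for j in range(n)] for i in range(n)]
    v = [[mr[j][i] for j in range(n)] for i in range(n)]
    alive = [[True] * n for _ in range(n)]
    roots = []
    for _ in range(n):
        # controller: among requesting roots (h, v) = (1, 1), grant min row index
        i, j = min((a, b) for a in range(n) for b in range(n)
                   if alive[a][b] and h[a][b] == 1 and v[a][b] == 1)
        roots.append((i, j))
        for k in range(n):                # masked row-i nodes broadcast v on columns
            vk = v[i][k]
            alive[i][k] = False
            for t in range(n):
                if alive[t][k] and v[t][k] > vk:
                    v[t][k] -= 1          # close the gap left by the masked node
        for t in range(n):                # masked column-j nodes broadcast h on rows
            ht = h[t][j]
            alive[t][j] = False
            for k in range(n):
                if alive[t][k] and h[t][k] > ht:
                    h[t][k] -= 1
    return roots
```

The decrement steps mirror the mask/broadcast phase: surviving nodes renormalize their ranks after a row and a column are eliminated.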
Size    N=2    N=4     N=6     N=8     N=10    N=12
Timing  62.24  137.28  239.52  315.84  399.6   479.52
Area    1166   5410    12263   29283   41002   59342

Table 3.2. Timing and area results of the scheduler design.
The major advantage of this design is its simplicity. We use only 2n log n-bit
buses to broadcast signals to nodes in the same row or column, and one
log n-bit priority encoder functioning as a controller for bus arbitration. Although
n^2 nodes are used, the logic of each node is simple, consisting mainly of two log n-bit
registers to store its h and v values, one log n-bit comparator, and one log n-bit
adder.
We conducted simulations of the scheduler design using Synopsys design tools.
We wrote the VHDL [32] code, then compiled and synthesized it with the Synopsys design
analyzer [88] using its library lsi 10k. The design analyzer was directed to min-
imize the area cost of the design. Table 3.2 gives the timing results (in
ns) and the area results (in number of 2-input NAND gates) of the
scheduler design for n = 2, 4, 6, 8, 10, 12. The timing and the number of 2-input
NAND gates grow proportionally to n and n^2 respectively, making the design feasible
to implement with current CMOS technologies.
Another advantage of the design is its flexibility. Our scheduler design works
well in real applications, including the case where ranks in some ranking lists are not
distinct (e.g., cells with the same priority), the case where the lengths of some ranking
lists are not equal to n (e.g., some input queue has no cell destined for a given
output port), and the case where the sizes of the man set and the woman set differ
(e.g., the number of input queues is not equal to the number of output queues).
3.4 Summary
In this chapter, we addressed the acyclic stable matching problem and proposed
a parallel algorithm that solves the stable matching problem for rooted dependency
graphs, which contain all acyclic dependency graphs as special cases. We designed
a hardware scheduler based on the proposed algorithm. Simulation results show that
the proposed scheduler design is feasible with current CMOS technologies. To the
best of our knowledge, this is the first hardware design for acyclic
stable matching. It is very useful in constructing high-speed switches/routers.
CHAPTER 4
PARALLEL ROUTING ALGORITHMS FOR NONBLOCKING ELECTRONIC
AND PHOTONIC SWITCHING NETWORKS
4.1 Introduction
Recently, a class of multistage nonblocking switching networks has been proposed.
Each network in this class, denoted by B(N, x, p, ε), has relatively low hardware cost
and short connection diameter, O(N^1.5 log N) and O(log N) respectively, in terms
of the number of SEs. A B(N, x, p, ε), ε ∈ {0, 1}, is constructed by horizontally
concatenating x (≤ log N − 1) extra stages to an N × N Banyan-type network and
vertically stacking p copies of the extended Banyan. Networks B(N, x, p, 0) and
B(N, x, p, 1) are similar in structure, but the latter does not allow any two connec-
tion paths to pass through the same SE while the former does. B(N, x, p, 0) and
B(N, x, p, 1) are suitable for electronic and optical implementation, respectively. It
has been shown that B(N, x, p, ε) can be SNB, WNB, and RNB for certain values
of x and p, given N and ε [41, 42, 57, 91, 92].
A trivial lower bound on the time for routing K (0 ≤ K ≤ N) connections
sequentially in B(N, x, p, ε) is Ω(K log N). This lower bound is obtained by assum-
ing that for a connection it takes O(1) time to correctly guess which plane to use
without causing conflict and O(log N) time to compute the connection path in that
plane. Clearly, when x ≠ 0 and p > 1, correctly assigning connections to planes and
routing connections in each plane is not easy. When the number of connection re-
quests is large, the routing time complexity is greater than O(N). Parallel processing
techniques should be used to meet the stringent real-time requirements [31].
To the best of our knowledge, except for some special cases such as the Banyan network
(i.e., B(N, 0, 1, ε)) and the Benes network (i.e., B(N, log N − 1, 1, ε)), no effort to inves-
tigate faster routing for the whole class of these networks has been reported in the
literature. For B(N, x, p, ε) to be useful, a fast switch control mechanism must be
devised.
The focus of this chapter is the control aspect of the class of B(N, x, p, ε)
networks in the context of their use as electrical and optical switching networks. In
particular, our objective is to speed up the routing process using parallel processing
techniques. A completely connected multiprocessor system of N processing elements
(PEs) is used as the parallel computation model. Such a model is by no means
practical; it is used as a general abstract model in which to derive parallel algorithms.
Efficient algorithms on more realistic models, such as a hypercube whose architec-
tural complexity is the same as that of a single plane of B(N, x, p, ε), can easily
be obtained from our algorithms.
There are three basic approaches to designing routing algorithms, namely
matrix decomposition, matching, and graph edge-coloring, and they are essentially
equivalent [31]. In this chapter, we assume that N = 2^n. By examining the con-
nection capacity of B(N, x, p, ε), we first model the routing problems for this class
of networks as weak and strong edge-colorings of bipartite graphs, which unifies and
extends previous models for RNB and SNB networks. Based on our model, we pro-
pose fast routing algorithms for B(N, x, p, ε) using parallel processing techniques.
We show that the presented parallel routing algorithms can route K connections in
O(log K log N) time for an RNB B(N, x, p, ε) and in O(d* log N) time for an SNB
B(N, 0, p*, ε), where d* is the degree of the I/O mapping graph of the new connec-
tions. Since K = N and d* = O(√N) in the worst case, the proposed algorithms
can always route O(N) connections in an RNB B(N, x, p, ε) in O(log^2 N) time and
in an SNB B(N, x, p*, ε) in O(√N log N) time. As a by-product of our analysis,
we also propose a class of simple self-routing nonblocking networks, T(N, ε). Com-
pared with the crossbar, the presented new networks have lower hardware cost, shorter
connection diameter, and fewer required wavelengths.
The remainder of the chapter is organized as follows. In Section 4.2, we discuss
the topology of B(N, x, p, ε). In Section 4.3, we model routing in B(N, x, p, ε) as two
coloring problems on an I/O mapping graph G(N, K, g). In Section 4.4, we propose a
fast parallel routing algorithm for RNB B(N, x, p, ε) based on a weak g-edge coloring
of G(N, K, g). In Section 4.5, we extend this parallel routing algorithm to SNB
B(N, x, p, ε) based on a strong (2g − 1)-edge coloring of G(N, K, g). In Section 4.6,
we present a new structure, T(N, ε), for self-routing nonblocking networks. Finally,
we summarize this chapter in Section 4.7.
4.2 Nonblocking Networks Based on Banyan-type Networks
If the Baseline network is used for photonic switching, it is a blocking network, since
two connections may pass through the same SE, which causes node conflict. Even if
the Baseline network is used for electronic switching, it is still a blocking network, since
two connections may try to pass through the same input (resp. output) link, which
causes input (resp. output) link conflict.
Although the Baseline network is a blocking network, a nonblocking network
can be built by extending it in three ways: horizontal concatenation of extra stages
to the back of a Baseline network, vertical stacking of multiple copies of a Baseline
network, and the combination of both horizontal concatenation and vertical stacking
[41, 42, 91, 92]. Such an extended network is constructed by concatenating the mirror
image of the first x (< n) stages of BL(N) to the back of a BL(N), then vertically
making p copies of the extended BL(N) (each copy is called a plane), and finally
connecting the inputs (resp. outputs) in the first (resp. last) stage to N 1 × p splitters
(resp. N p × 1 combiners). Specifically, the i-th input (resp. output) of the j-th plane
is connected to the j-th output (resp. input) of the i-th 1 × p splitter (resp. p × 1
combiner), which is connected to the i-th input (resp. output) of this network.
We denote a network constructed in this way by B(N, x, p, ε), where ε is the crosstalk
factor: ε = 0 if the network has no crosstalk-free constraint and ε = 1 if
the network has a crosstalk-free constraint. Clearly, B(N, 0, 1, ε) is a Baseline network
and B(N, n − 1, 1, ε) is a Benes network [6]. In B(N, x, 1, ε), a subnetwork, denoted
by B(N, x, 1/2^l, ε) (0 ≤ l ≤ n − 1), is defined as a B(N/2^l, max{x − l, 0}, 1, ε) from
stage l to stage n + max{x − l, 0} − 1. Figure 4.1 shows an example of B(16, 2, 3, ε),
which contains three planes of B(16, 2, 1, ε); each B(16, 2, 1, ε) is constructed
from B(16, 0, 1, ε) by adding two extra stages.
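For concreteness, the construction above fixes the dimensions of a B(N, x, p, ε). The helper below tallies them under the stated construction; the accounting (N/2 2×2 SEs per stage, N splitters and N combiners) is the usual one for Banyan-type networks, and the function name is illustrative:

```python
from math import log2

def network_size(N, x, p):
    """Dimension accounting for B(N, x, p, *) under the construction above:
    each plane has n + x stages (n = log2 N) of N/2 2x2 SEs, and the network
    front/back ends use N 1-by-p splitters and N p-by-1 combiners."""
    n = int(log2(N))
    assert (N & (N - 1)) == 0 and 0 <= x < n  # N a power of two, x < n
    stages = n + x
    per_plane = stages * N // 2
    return {"stages": stages, "SEs_per_plane": per_plane,
            "SEs_total": p * per_plane, "splitters": N, "combiners": N}
```

For the B(16, 2, 3, ε) of Figure 4.1 this gives 6 stages and 48 SEs per plane, 144 SEs in all.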
Figure 4.1. A network B(16, 2, 3, ε)
4.3 Graph Model
In this section, we model the routing problems in B(N, x, p, ε) networks as two edge-
coloring problems on bipartite graphs.
4.3.1 I/O Mapping Graphs
For B(N, x, p, ε), the set of N inputs (resp. outputs) is divided into N/g modulo-g input
groups (resp. modulo-g output groups). Let g = 2^i, 0 ≤ i ≤ n. Then the k-th modulo-g
input group comprises inputs I_{(k-1)g}, I_{(k-1)g+1}, ..., I_{kg-1}, and the k-th modulo-g
output group comprises outputs O_{(k-1)g}, O_{(k-1)g+1}, ..., O_{kg-1}, where 1 ≤ k ≤ N/g.
If a connection path does not have any link (resp. node) conflict with other
connection paths, it is called a link conflict-free (resp. node conflict-free) path.
Clearly, a node conflict-free path is also link conflict-free, but the converse is not true.
If a set of connections can be set up by conflict-free paths in B(N, x, 1, ε), these
connections are called feasible connections of B(N, x, 1, ε). Our goal is to quickly set
up K link (resp. node) conflict-free paths for the K connections of any I/O mapping in
B(N, x, p, 0) (resp. B(N, x, p, 1)). To achieve this goal, we usually decompose a set of
connections into disjoint subsets and route each subset in one plane of B(N, x, p, ε),
so that each subset is feasible for its assigned plane.
Given any I/O mapping π with K connections for B(N, x, p, ε), we construct a
graph G(N, K, g), named the I/O mapping graph, as follows. The vertex set consists of
two parts, V1 and V2. Each part has N/g vertices, i.e., each modulo-g input (resp.
output) group is represented by a vertex in V1 (resp. V2). There is an edge between
vertex ⌊i/g⌋ in V1 and vertex ⌊j/g⌋ in V2 if j = π(i). Thus, G(N, K, g) is a bipartite
graph with N/g vertices in each of V1 and V2 and K edges, where at most g edges
are incident at any vertex. Thus, the degree of G(N, K, g) is at most g. Since there
may be more than one connection from a modulo-g input group to the same modulo-g
output group, G(N, K, g) may have parallel edges between two vertices, i.e., it
may be a multigraph. However, there is a one-to-one correspondence between active
inputs/outputs in an I/O mapping and the edges in the I/O mapping graph, and
thus we can label each edge by its corresponding input. An edge e is called the left
edge (resp. right edge) of edge f if e = f̄ (resp. π(e) = π(f)‾). Any edge has at most
one left edge and at most one right edge in G(N, K, g). Two edges e and f are called
neighboring edges if e is the left or right edge of f. We define a linear component (or
simply, a component) of G(N, K, g) as follows: two edges e and f belong to the same
component if and only if there is a sequence of edges e = e1, ..., ej = f such that ei
and ei+1, 1 ≤ i ≤ j − 1, are neighboring edges. If every edge in a component has two
neighboring edges, the component is called a closed component; otherwise it is called
an open component. By extending "neighboring edge" to an equivalence relation,
each edge is in exactly one component, and thus components are edge-disjoint in
G(N, K, g).
Example 5 In Figure 4.2, (a) shows an I/O mapping with 32 inputs, 25 of which
are active; (b) shows the I/O mapping graph G(32, 25, 8) of (a), where V1 (resp. V2)
of G(32, 25, 8) has 4 vertices and each vertex in V1 (resp. V2) represents 8 inputs (resp.
outputs) belonging to the same modulo-8 input (resp. output) group; (c) shows all
components of G(32, 25, 8) in (b). □
Figure 4.2. Finding a balanced 2-coloring: (a) an I/O mapping; (b) a balanced 2-coloring of an I/O mapping graph G(32, 25, 8); (c) a set of components ((i) 1 closed component, (ii) 5 open components); (d) pointer initialization for pointer jumping
4.3.2 Graph Coloring and Nonblockingness
If we route connections in B(N, x, p, ε) one by one using sequential algorithms, the
time complexity of establishing K connections is Ω(K · (log N + x)), since it takes
Ω(log N + x) time to set up one connection. For a large number of connections, the
time required is more than O(N), which is not acceptable for real-time applications.
Parallel processing techniques can be used to speed up routing in B(N, x, p, ε). We
say that two connections share a modulo-g input (resp. output) group if their sources
(resp. destinations) are in the same modulo-g input (resp. output) group. Let us
first study the connection capability of B(N, x, p, ε).
Lemma 2 For any connection set C of B(N, 0, 1, ε), if no two connections in C
share any modulo-g input (resp. output) group, then the connection paths for C
satisfy the following conditions:
(i) they are node conflict-free in the first (resp. last) log g stages;
(ii) they are input link conflict-free in the first log g + 1 (resp. last log g) stages and
output link conflict-free in the first log g (resp. last log g + 1) stages.
Lemma 3 For any pair of input and output in B(N, x, 1, ε), there are 2^x paths
connecting them.
It is easy to verify that Lemmas 2 and 3 are true from the topology of
BL(N) (refer to [57] for formal proofs and see Figure 4.3 for examples). Using the
above two lemmas, the following claim can be easily derived from the results of [57].
Lemma 4 Given a connection set C of B(N, x, 1, ε), if any two connections in C do
not share any modulo-2^⌊(n−x+ε)/2⌋ input group and also do not share any modulo-2^⌊(n−x+ε)/2⌋
output group, then C is feasible for B(N, x, 1, ε).
By Lemma 4, if we assign the connections in B(N, x, p, ε) whose sources (resp.
destinations) pass through the same modulo-g input (resp. output) group to
different planes, then we can route the connections in B(N, x, p, ε) without conflict.
Thus, in order to route conflict-free connections in B(N, x, p, ε), we first need to
Figure 4.3. Number of connection paths: (a) 1 path in B(16, 0, 1, ε); (b) 2 paths in B(16, 1, 1, ε); (c) 4 paths in B(16, 2, 1, ε); (d) 8 paths in B(16, 3, 1, ε)
determine which plane to use for each connection. By constructing an I/O
mapping graph G(N, K, g) with g = 2^⌊(n−x+ε)/2⌋, we can reduce the problem of routing
K connections in B(N, x, p, ε) to the following two graph coloring problems:
Weak Edge Coloring Problem (WEC problem): Given an I/O mapping graph
G(N, K, g) with K_0 (< K) colored edges, color all K edges with a set of colors such that
no two edges with the same color are incident at the same vertex of G(N, K, g), where
changing the colors of the K_0 colored edges is allowed. If we can find a weak
edge coloring of G(N, K, g) using at most c_1 different colors, we call this coloring a
(weak) c_1-edge coloring of G(N, K, g). The definition of weak edge coloring is the
same as the definition of edge coloring in graph theory, so we omit "weak" in the
remainder of this chapter.
Strong Edge Coloring Problem (SEC problem): Given an I/O mapping graph
G(N, K, g) with K_0 (< K) colored edges, color the K − K_0 uncolored edges with a set of
colors such that no two edges with the same color are incident at the same vertex
of G(N, K, g), without changing the colors of the K_0 colored edges. If we can find
a strong edge coloring of G(N, K, g) using at most c_2 different colors, we call this
coloring a strong c_2-edge coloring of G(N, K, g).
If we regard the colored (resp. uncolored) edges in G(N, K, g) as the existing
(resp. new) connections in B(N, x, p, ε), a solution to the WEC problem is a plane
assignment for routing in an RNB network, since we may reroute existing connections,
and a solution to the SEC problem is a plane assignment for routing in an SNB
network, since rerouting existing connections is prohibited. Clearly, for the same
G(N, K, g), c_1 ≤ c_2.
Example 6 In Figure 4.4, there are three edges labeled a, b, and c. Edges a
and b have already been colored using colors 1 and 2, respectively. A WEC solution
is given in (a), and an SEC solution is given in (b). Note that, in (b), an additional
color is needed for edge b because the colors of the existing colored edges a and c cannot
be changed. □
Figure 4.4. Edge coloring: (a) a (weak) edge coloring; (b) a strong edge coloring
In the following two sections, we show how to speed up routing for RNB
networks and SNB networks using WEC and SEC of I/O mapping graphs, respectively.
4.4 Routing in Rearrangeable Nonblocking Networks
In this section, we present a fast parallel routing algorithm for RNB B(N, x, p, ε)
based on a weak g-edge coloring of G(N, K, g).
4.4.1 Rearrangeable Nonblockingness
The following claim is implied by the results of [57].
Lemma 5 If p ≥ 2^⌊(n−x+ε)/2⌋, then B(N, x, p, ε) is rearrangeable nonblocking.
It is important to note that the minimum value of p in Lemma 5 equals
the value of g in Lemma 4, where p is the number of B(N, x, 1, ε) planes required
for B(N, x, p, ε) to be rearrangeable nonblocking.
By Lemmas 4 and 5, if we assign the connections (including existing and new
connections) sharing the same modulo-g input or output group to different planes,
the connections are feasible for every assigned plane. The routing can then be
completed by setting up conflict-free connection paths within each plane.
Lemma 6 Every bipartite multigraph G has a Δ(G)-edge coloring, where Δ(G) is
the degree of G.
By Lemma 6 (see a proof in [8]), if we set g = 2^⌊(n−x+ε)/2⌋ in G(N, K, g), the
plane assignment for a set of connections in an RNB B(N, x, p, ε) can be solved by
finding a g-edge coloring of G(N, K, g).
4.4.2 Algorithm for Balanced 2-Coloring of G(N, K, g)
In order to solve the WEC problem efficiently, we present an algorithm for a problem
named the balanced 2-coloring problem: given an I/O mapping graph G(N, K, g), color
its edges with 2 colors so that every vertex is incident to at most g/2 edges of one
color and at most g/2 of the other.
We choose to present our parallel algorithms for a completely connected mul-
tiprocessor system with N PEs. Initially, each PE_i, 0 ≤ i ≤ N − 1, reads π(i) from
input i, sets the value of π^{-1} in PE_{π(i)} to i, and then performs the following two steps.
Step 1. Divide the I/O mapping graph G(N, K, g) into a set of components.
This step is done by each edge i finding its left edge ī and right edge π^{-1}(π(i)‾).
Step 2. Color the components with two colors, red and blue, so that neighboring
edges in each component have different colors.
Each component has two specific representatives, referred to simply as Reps.
(There is one exception: a component of length 1 has only one Rep,
which is the edge itself.) For closed and open components, the Reps are defined differently.
For a closed component, we define the two edges with the minimum labels as the two Reps;
for an open component, if an edge e has no left edge or e's left edge has no right
edge, e is defined as a Rep. Figure 4.2(c) shows the Reps of all possible types of
components, where the Reps of each component are marked as dark lines and edges
are labeled by their corresponding inputs. Step 2 can be done by coloring edges with
the Reps as references, using the pointer jumping technique in [33]. At the beginning,
each edge sets its pointer to point to the right edge of its left edge if it exists, and
to itself otherwise. By doing so, two disjoint directed cycles are formed for a closed
component, and two disjoint directed paths are formed for an open component with
more than one edge, each containing a Rep. For an open component, furthermore,
the end pointer of each directed path points to one of the Reps. For example,
Figure 4.2(d) shows the directed cycles and paths formed from the components
of Figure 4.2(c). Then, by performing ⌈log(K/2)⌉ rounds of parallel pointer jumping,
each edge finds the Rep belonging to its directed cycle or path. Finally, each
edge is colored by comparing the value of the Rep it found with that found by
its neighbor: if the value of the Rep found by an edge is no larger than its
neighbor's, the edge is colored red; otherwise it is colored blue. Figure 4.2(b)
shows a balanced 2-coloring of the I/O mapping graph G(32, 25, 8), where solid
lines are colored red and dashed lines blue.
The detailed implementation of the balanced 2-coloring algorithm is given in
Algorithm 1, where we use the operator ":=" to denote an assignment local to a PE or
to the control unit, and the operator "←" to denote an assignment requiring
interprocessor communication.
The correctness and time complexity of Algorithm 1 are given in the following
theorem.
Theorem 3 A balanced 2-coloring of any G(N, K, g) can be found in O(log K) time
using a completely connected multiprocessor system of N PEs.
Proof: Given an I/O mapping graph G(N, K, g), Step 1 can be done in O(1) time
using a completely connected multiprocessor system of N PEs. In Step 2, since the
length of each directed cycle or path is at most ⌈K/2⌉, each edge can find a Rep
by ⌈log(K/2)⌉ rounds of pointer jumping. Clearly, all edges in the same directed cycle
or path are colored with the same color, since they find the same Rep. The pointer
initialization implies that each edge and its neighboring edges are in different directed
cycles or paths, and thus they have different colors. By the definition of left/right
edge, there are no more than g/2 pairs of neighboring edges incident at any vertex
of G(N, K, g). Thus, the colorings of all components compose a balanced 2-coloring
of G(N, K, g). Therefore, a balanced 2-coloring of any G(N, K, g) can be found in
O(log K) time. □
4.4.3 Algorithm for g-Edge Coloring of G(N, K, g)
Based on the balanced 2-coloring algorithm, a WEC solution to any I/O mapping
graph G(N, K, g) with no more than g colors can be found as follows. Let d be the
degree of G(N, K, g); clearly, d ≤ g. First, remove the colors of the K_0 colored edges.
Then, perform at most ⌈log d⌉ iterations as follows. In the initial iteration (iteration
0), we find a balanced 2-coloring of G(N, K, g) using colors 0 and 1 if d > 1, and let
G_0 and G_1 be the graphs induced by the edges with colors 0 and 1, respectively. If
Algorithm 1 A Balanced 2-Coloring of an I/O Mapping Graph
Input: G(N, K, g)
Output: a balanced 2-coloring of G(N, K, g)
for all PE_i, 0 ≤ i ≤ N − 1, do
  l(i) := r(i) := −1; /* l(i), r(i) are the left edge and right edge of edge i, respectively */
  if π(i) ≠ −1 then
    if π(ī) ≠ −1 then
      l(i) := ī;
    end if
    if π^{-1}(π(i)‾) ≠ −1 then
      r(i) ← π^{-1}(π(i)‾);
    end if
    if l(i) ≠ −1 and r(l(i)) ≠ −1 then
      q(i) ← r(l(i)); /* q(i) is a pointer */
      p(i) := 0; /* p(i) records whether edge i reaches the end of a path */
    else
      q(i) := i;
      p(i) := 1; /* edge i is an end of a path */
    end if
    m(i) := m′(i) := i; /* m(i) (resp. m′(i)) holds the Rep found by edge i (resp. i's neighbor) */
    for t := 1 to ⌈log(K/2)⌉ do
      m(i) ← min{m(i), m(q(i))};
      p(i) ← p(q(i));
      q(i) ← q(q(i));
    end for
    if p(i) = 1 then
      m(i) := q(i); /* the Rep found by edge i is the label of the PE to which i is pointing */
    end if
    if l(i) ≠ −1 then /* i has a left edge */
      m′(i) ← m(l(i));
    else if r(i) ≠ −1 then /* i has a right edge */
      m′(i) ← m(r(i));
    end if
    if m(i) ≤ m′(i) then
      c(i) := 0; /* color i red */
    else
      c(i) := 1; /* color i blue */
    end if
  end if
end for
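For readers who want to trace Algorithm 1, the sketch below replays its per-PE logic sequentially in Python. Two points are assumptions on my part: the companion ī is taken to be i with its least significant bit complemented, and the parallel pointer jumping is replayed as synchronous array updates:

```python
import math

def balanced_2_coloring(pi, N):
    """Sequential trace of Algorithm 1 on an I/O mapping pi (pi[i] = -1: idle).

    Assumes the companion ("bar") of input/output i is i ^ 1.  Returns a list
    c with c[i] in {0 (red), 1 (blue), -1 (idle input)}.
    """
    inv = [-1] * N                        # inverse mapping pi^{-1}
    for i in range(N):
        if pi[i] != -1:
            inv[pi[i]] = i
    l = [-1] * N
    r = [-1] * N
    q = list(range(N))                    # pointers; ends point to themselves
    p = [1] * N                           # 1: the pointer chain ends here
    m = list(range(N))                    # running minimum label along the chain
    for i in range(N):
        if pi[i] == -1:
            continue
        if pi[i ^ 1] != -1:
            l[i] = i ^ 1                  # left edge: the companion input
        if inv[pi[i] ^ 1] != -1:
            r[i] = inv[pi[i] ^ 1]         # right edge: the companion output
    for i in range(N):
        if pi[i] != -1 and l[i] != -1 and r[l[i]] != -1:
            q[i] = r[l[i]]                # skip to the next edge of the same chain
            p[i] = 0
    K = sum(1 for i in range(N) if pi[i] != -1)
    rounds = max(1, math.ceil(math.log2(max(2, K) / 2)))
    for _ in range(rounds):               # synchronous pointer jumping
        m, p, q = ([min(m[i], m[q[i]]) for i in range(N)],
                   [p[q[i]] for i in range(N)],
                   [q[q[i]] for i in range(N)])
    for i in range(N):
        if p[i] == 1:
            m[i] = q[i]                   # open chain: the Rep is the chain's end
    c = [-1] * N
    for i in range(N):
        if pi[i] == -1:
            continue
        if l[i] != -1:
            m2 = m[l[i]]                  # Rep found by the left neighbor
        elif r[i] != -1:
            m2 = m[r[i]]                  # Rep found by the right neighbor
        else:
            m2 = i                        # isolated edge: color it red
        c[i] = 0 if m[i] <= m2 else 1
    return c
```

Neighboring edges end up in different chains of the same component, find different Reps, and therefore receive different colors, exactly as argued in the proof of Theorem 3.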
If Δ(G_0) > 1 (resp. Δ(G_1) > 1), we execute iteration 1 to find a balanced 2-coloring
for G_0 (resp. G_1) using colors 00 and 01 (resp. 10 and 11). This process continues
recursively in a binary-tree fashion until a solution to WEC is reached. More formally,
in each recursive iteration i, 1 ≤ i ≤ ⌈log d⌉ − 1, we find a balanced 2-coloring for
each graph G_z using colors z0 and z1 (i.e., concatenating 0 or 1 with z) if Δ(G_z) > 1,
where z is the binary representation of an integer in {0, 1, ..., 2^i − 1} denoting the
color of the edges of G_z in iteration i − 1.
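The binary-tree recursion can be prototyped on top of any balanced 2-coloring routine. In the sketch below a sequential Euler partition plays the role of the parallel balanced 2-coloring primitive (an assumption made for brevity), while the color strings z → z0, z1 follow the recursion in the text:

```python
from collections import defaultdict

def g_edge_coloring(edges):
    """Edge-color a bipartite multigraph by recursive balanced splitting.

    edges: list of (u, w) pairs (left vertex u, right vertex w).  Returns one
    color string per edge; colors grow as z -> z0, z1 down the recursion, so
    at most 2^ceil(log2 d) colors are used, d being the maximum degree.
    """
    colors = [""] * len(edges)

    def euler_split(ids):
        # walk trails, alternating sides, so every vertex keeps at most
        # ceil(deg/2) edges on each side (Euler partition)
        inc = defaultdict(list)
        for e in ids:
            u, w = edges[e]
            inc[("L", u)].append(e)
            inc[("R", w)].append(e)
        unused, side = set(ids), {}

        def walk(v):
            bit = 0
            while True:
                e = next((x for x in inc[v] if x in unused), None)
                if e is None:
                    return
                unused.discard(e)
                side[e] = bit
                bit ^= 1
                u, w = edges[e]
                v = ("R", w) if v[0] == "L" else ("L", u)

        for v in list(inc):               # open trails start at odd vertices
            if len(inc[v]) % 2 == 1:
                walk(v)
        for v in list(inc):               # what remains decomposes into cycles
            walk(v)
        return side

    def max_deg(ids):
        deg = defaultdict(int)
        for e in ids:
            u, w = edges[e]
            deg[("L", u)] += 1
            deg[("R", w)] += 1
        return max(deg.values(), default=0)

    def recurse(ids, z):
        if max_deg(ids) <= 1:             # already proper: commit color z
            for e in ids:
                colors[e] = z or "0"
            return
        side = euler_split(ids)
        recurse([e for e in ids if side[e] == 0], z + "0")
        recurse([e for e in ids if side[e] == 1], z + "1")

    recurse(list(range(len(edges))), "")
    return colors
```

Each level halves the maximum degree, so the recursion depth is ⌈log d⌉, mirroring the iteration count above.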
Theorem 4 For any I/O mapping graph G(N, K, g), a g-edge coloring can be found
in O(log d · log K) time using a completely connected multiprocessor system of N PEs,
where d is the degree of G(N, K, g).
Proof: Let d′ = 2^k, where k is the smallest integer satisfying d ≤ 2^k. We prove
the theorem by induction on k. If k = 1, it is true, since a balanced 2-coloring is
a 2-edge coloring by Theorem 3. Assume that the theorem holds for any k < m ≤ n.
Now we prove that it holds for k = m. First, we find a balanced
2-coloring of G(N, K, g), which can be done in O(log K) time by Theorem 3. Let G_0
and G_1 be the graphs induced by the edges of the two colors of this balanced
2-coloring. By the definition of balanced 2-coloring, we know that Δ(G_0) ≤ d′/2 and
Δ(G_1) ≤ d′/2. By the hypothesis, we can find a (d′/2)-edge coloring for each of G_0
and G_1 in O((k − 1) · log K) time on completely connected multiprocessor subsystems
of |E(G_0)| and |E(G_1)| PEs, respectively. These two colorings can be carried out
simultaneously, since E(G_0) ∩ E(G_1) = ∅. The (d′/2)-edge colorings of G_0 and G_1
compose a d′-edge coloring of G(N, K, g), which takes O(k · log K) total time using
a completely connected multiprocessor system of N PEs. Since d′/2 < d ≤ d′ ≤ g,
the theorem holds. □
4.4.4 Parallel Routing in a Plane
We have shown how to assign each connection to a plane in an RNB B(N, x, p, ε).
In this section, we show how connections are routed within each plane.
Lemma 7 Let C be a set of feasible connections for B(N, x, 1, ε). If each connection
in C is set up in the first and last x stages such that the output link in stage i and
the input link in stage log N − i on each connection are connected with the same
subnetwork B(N, x, 1/2^{i+1}, ε), 0 ≤ i ≤ x − 1, then C can be routed by self-routing in
the middle log N − x stages.
Proof: By the topology of B(N, x, 1, ε), we know that each connection must pass
through the same subnetwork B(N, x, 1/2^i, ε), 0 ≤ i ≤ log N − 1. Since the middle
log N − x stages of B(N, x, 1, ε) consist of 2^x Baseline networks BL(N/2^x), this lemma
is true. □
Theorem 5 Let C be a set of K feasible connections of B(N;x; 1; �). Then C can
be correctly routed in O(x logK + logN) time using a completely connected multi-
processor system of N PEs.
Proof: By Lemma 7, all we need to do is route C correctly in the first and last x stages for x ≥ 1. By the topology of B(N, x, 1, β), the output link in stage i and the input link in stage log N − i on each connection are connected with the same subnetwork B(N, x, 1/2^{i+1}, β), 0 ≤ i ≤ x − 1. Thus, we need to decide which subnetwork is used by each connection, since there are 2^i copies of B(N, x, 1/2^i, β). This reduces to a 2-edge coloring of a bipartite graph of degree 2. For each subnetwork B(N, x, 1/2^i, β), 0 ≤ i ≤ x − 1, we construct an I/O mapping graph G(N/2^i, K_i, 2), where K_i is the number of connections passing through it. We color the edges of G(N/2^i, K_i, 2) with two different colors and assign the connections (edges) with the same color to pass through the same subnetwork B(N, x, 1/2^{i+1}, β). Specifically, in each iteration i, 0 ≤ i ≤ x − 1, we run the g-edge coloring algorithm with g = 2 on the 2^i graphs G(N/2^i, K_i, 2). By Theorem 4, each iteration can be done in O(log K) time. Thus, the time to set up K feasible connections in the first and last x stages is O(x log K). By Lemma 7, we can set up the connections in the middle log N − x stages by self-routing, which takes log N − x time. Therefore, the total time to route K feasible connections of B(N, x, 1, β) is O(x log K + log N) using a completely connected multiprocessor system of N PEs. □
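Each iteration of the proof above 2-colors the edges of a bipartite multigraph of maximum degree 2, whose components are simply alternating paths and even cycles. A minimal sequential sketch of that coloring step (illustrative only; the dissertation's algorithm runs in parallel, and the function name is ours):

```python
def two_edge_color(num_u, num_v, edges):
    """2-edge-color a bipartite multigraph of maximum degree 2.

    edges: list of (u, v) pairs.  Returns one color (0 or 1) per edge
    such that edges sharing a vertex receive different colors.  This is
    always possible: with degree <= 2 every component is a path or an
    even cycle, so alternating colors along it never conflicts.
    """
    inc_u = [[] for _ in range(num_u)]  # edge ids incident to each u-vertex
    inc_v = [[] for _ in range(num_v)]  # edge ids incident to each v-vertex
    for e, (u, v) in enumerate(edges):
        inc_u[u].append(e)
        inc_v[v].append(e)
    color = [None] * len(edges)
    for start in range(len(edges)):
        if color[start] is not None:
            continue
        stack = [(start, 0)]            # walk this path/cycle, alternating
        while stack:
            e, c = stack.pop()
            if color[e] is not None:
                continue
            color[e] = c
            u, v = edges[e]
            for nb in inc_u[u] + inc_v[v]:
                if color[nb] is None:
                    stack.append((nb, 1 - c))
    return color
```

Connections with the first color go to one half-sized subnetwork, the rest to the other.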
4.4.5 Overall Routing Performance
Theorem 6 For any RNB B(N, x, p, β) such that p ≥ 2^{⌊(n−x+β)/2⌋}, K connections (including existing and new connections) can be correctly routed in O(log K log N) time using a completely connected multiprocessor system of N PEs.

Proof: Let g = 2^{⌊(n−x+β)/2⌋}. By Theorem 4, we can find a g-edge coloring of the I/O mapping graph G(N, K, g) in O(log d log K) time, where d is the degree of G(N, K, g). By Lemma 4, we assign the connections with the same color to the same plane. In each plane B(N, x, 1, β), by Theorem 5, we can route the connections in O(x log K + log N) time. Since x < log N and d ≤ g = 2^{⌊(n−x+β)/2⌋}, the total time is O((x + log d) log K + log N) = O(log K log N). □
By Lemma 5, for the special cases of an RNB B(N, 0, p, β) and an RNB B(N, n − 1, p, β), the minimum number p of planes of the Baseline network and the Benes network equals 2^{⌊(n+β)/2⌋} and 2^{⌊(1+β)/2⌋}, respectively. Consequently, we can route N connections in O(log² N) time for both B(N, 0, p, β) and B(N, n − 1, p, β). For the RNB B(N, n − 1, p, 0), which is the electronic Benes network, this performance matches the best known results reported in [49, 62].
4.5 Routing in Strictly Nonblocking Networks
In this section, we present a fast parallel routing algorithm for an SNB B(N, x, p, β) based on a strong (2g − 1)-edge coloring of G(N, K, g).
4.5.1 Strict Nonblockingness
The following lemma can be easily derived from the results of [92].
Lemma 8 If

    p ≥ (1 + β)x + 2^{(n−x)/2} (3/2 + β/2) − 1,     for even n − x;
    p ≥ (1 + β)x + 2^{(n−x+1)/2} (1 + β/2) − 1,     for odd n − x,

then B(N, x, p, β) is strictly nonblocking.
For an SNB network, we can route new connections (as long as these connections form an I/O mapping from idle inputs to idle outputs) without disturbing the existing ones; however, this routing problem is harder than that in an RNB network when the new connections must be routed simultaneously. In this section, we present a parallel algorithm based on graph coloring to speed up routing.

Based on the discussion in Section 4.3, we know that the routing problem for an SNB B(N, x, p, β) can be solved by finding a strong edge coloring of the I/O mapping graph G(N, K, g) with g = 2^{⌊(n−x+β)/2⌋}.
Lemma 9 Any multigraph G has a strong (2Δ(G) − 1)-edge coloring, where Δ(G) is the degree of G.

Proof: Consider coloring the edges in an arbitrary order. Since each edge in G is adjacent to at most 2Δ(G) − 2 edges, any uncolored edge in G can always be assigned a color so that the total number of colors used is no more than 2Δ(G) − 1. □
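The argument is constructive: color the edges in any order, always avoiding the colors already used at an edge's two endpoints. A sequential sketch of this greedy procedure (illustrative only; the names are ours):

```python
def greedy_edge_color(num_u, num_v, edges):
    """Color the edges of a bipartite multigraph greedily.

    Each edge avoids the colors already used at its two endpoints;
    since at most 2*Delta - 2 adjacent edges can block it, colors
    0 .. 2*Delta - 2 always suffice (Lemma 9's bound).
    """
    deg_u = [0] * num_u
    deg_v = [0] * num_v
    for u, v in edges:
        deg_u[u] += 1
        deg_v[v] += 1
    delta = max(deg_u + deg_v)
    used_u = [set() for _ in range(num_u)]
    used_v = [set() for _ in range(num_v)]
    colors = []
    for u, v in edges:
        c = next(c for c in range(2 * delta - 1)
                 if c not in used_u[u] and c not in used_v[v])
        colors.append(c)
        used_u[u].add(c)
        used_v[v].add(c)
    return colors
```

Because each edge is colored without touching earlier edges, the same loop also extends a partial coloring, which is exactly the situation with K0 existing connections below.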
We consider a subclass of SNB networks, B(N, 0, p*, β) with p* = 2^{⌊(n+β)/2⌋+1} − 1. By Lemma 8, we know that B(N, 0, p*, β) is an SNB network. Since each plane of B(N, 0, p*, β) is a Baseline network, the routing of connections within any plane can be done by self-routing. Thus, the problem of setting up connections in B(N, 0, p*, β) reduces to finding a plane for each new connection so that all connections, including existing ones, are conflict-free. By Lemmas 4 and 9, this can be done by finding a strong (2g − 1)-edge coloring for the graph G(N, K, g) of B(N, 0, p*, β) with K0 existing connections and K − K0 new connections, where g = 2^{⌊(n+β)/2⌋} = (p* + 1)/2. In the next subsection, we present an algorithm to find a strong (2g − 1)-edge coloring of G(N, K, g).
4.5.2 Algorithm for Strong (2g − 1)-Edge Coloring of G(N, K, g)
Let G(N, K0, g) and G(N, K − K0, g) denote the graphs obtained from G(N, K, g) by keeping only the K0 colored edges and by removing the K0 colored edges, respectively. Let d* be the degree of G(N, K − K0, g), and let d0 = 2^k, where k is the smallest integer satisfying d* ≤ 2^k. Conceptually, a strong (2g − 1)-edge coloring of G(N, K, g) with K0 (< K) colored edges can be found in the following two steps.

Step 1: Find a set of matchings {M1, M2, ..., M_{d0}} of G(N, K − K0, g).

Step 2: For i from 1 to d0, color the edges in M_i without changing the colors of the edges in G(N, K0, g) ∪ (∪_{j<i} M_j).

Finding a set of d0 matchings in a graph is equivalent to coloring the edges of the graph with d0 different colors, because edges with the same color are not adjacent to each other. Thus, Step 1 can be done by finding a d0-edge coloring of G(N, K − K0, g) using the algorithm described in Section 4.4. This d0-edge coloring divides the K − K0 uncolored edges (corresponding to new connections) into d0 matchings. By Theorem 4, Step 1 takes O(log d0 · log(K − K0)) = O(log d* · log(K − K0)) time using a completely connected multiprocessor system of N PEs.
In G(N, K, g), each edge is adjacent to at most 2g − 2 edges, and hence there are at most 2g − 2 colored edges adjacent to each edge in a matching M_i. Since edges with the same color cannot be adjacent, we can color every edge in a matching with one of the unused colors. This can be done by searching for a free color among 2g − 1 colors in parallel, as follows. Associate a Boolean array C[1..2g − 1] of 2g − 1 elements with each vertex of G(N, K, g), with C[r] = 0 if and only if an edge adjacent to the vertex has been colored with color r. Consider an edge e in M_i that connects vertices u and v of G(N, K, g), and let C_u and C_v be the C arrays associated with vertices u and v, respectively. Performing a bitwise AND operation on C_u and C_v, we obtain a Boolean array D_{u,v} such that D_{u,v}[s] = C_u[s] ∧ C_v[s], 1 ≤ s ≤ 2g − 1. Then, D_{u,v}[t] = 1 if and only if color t is available for edge e. We can assign g/2 PEs to each vertex w of G(N, K, g), and these PEs collectively maintain C_w. Then, using g PEs, D_{u,v} can be computed in O(1) time, and some t such that D_{u,v}[t] = 1 can be found by performing a parallel binary prefix sums operation on D_{u,v}, which takes O(log g) time. Since no two edges of a matching are adjacent, the uncolored edges in the matching can be colored simultaneously by their assigned PEs in O(log g) time, and Step 2 takes O(d0 log g) time. Since d0/2 < d* ≤ d0, O(d0 log g) = O(d* log g). Therefore, we have the following claim.
Theorem 7 For any I/O mapping graph G(N, K, g) with K0 (< K) colored edges, a strong (2g − 1)-edge coloring can be found in O(log d* log(K − K0) + d* log g) time using a completely connected multiprocessor system of N PEs, where d* is the degree of G(N, K − K0, g).
4.5.3 Performance Analysis
We summarize the overall performance of our routing algorithm for the SNB network B(N, 0, p*, β) in the following theorem.

Theorem 8 For an SNB network B(N, 0, p, β) with p ≥ p* = 2^{⌊(n+β)/2⌋+1} − 1, connections from any K − K0 idle inputs to any K − K0 idle outputs, with K0 existing connections, can be correctly routed in O(d* log N) time using a completely connected multiprocessor system of N PEs, where d* is the degree of G(N, K − K0, g).
Proof: By Theorem 7, we can find a strong (2g − 1)-edge coloring of G(N, K, g) in O(log d* log(K − K0) + d* log g) time using a completely connected multiprocessor system of N PEs. We assign each new connection with color i to the i-th plane of B(N, 0, p, β). By Lemma 4, these new connections can be routed by self-routing in O(log N) time. Thus, the total time is O(log d* log(K − K0) + d* log g + log N). Since g = 2^{⌊(n+β)/2⌋} and K − K0 ≤ N, this time complexity is O(d* log N). □
Since d* ≤ g = 2^{⌊(n+β)/2⌋}, we have d* = O(√N) in the worst case. Assuming that the edges of G(N, K − K0, g) are uniformly distributed, d* = ⌈(K − K0)g/N⌉ = O((K − K0)/√N) on average. Therefore, the performance of our algorithm is summarized by the following claim.

Corollary 1 Under the same conditions as Theorem 8, the worst-case and average-case time complexities of our routing algorithm are O(√N log N) and O(((K − K0)/√N) log N), respectively.
By Lemma 8, we derive the minimum number of planes, p_min, of B(N, 0, p, β) as follows.

i. If there is no crosstalk-free constraint (i.e., β = 0),

    p_min = (3/2)·2^{n/2} − 1,      for even n;
    p_min = 2^{(n+1)/2} − 1,        for odd n;

and

ii. If there is a crosstalk-free constraint (i.e., β = 1),

    p_min = 2^{n/2+1} − 1,          for even n;
    p_min = (3/2)·2^{(n+1)/2} − 1,  for odd n.
Compared with B(N, 0, p_min, β), the hardware redundancy of B(N, 0, p*, β) is as follows:

    p* − p_min = 0,         if β = 0 and n is odd;
    p* − p_min = √N/2,      if β = 0 and n is even;
    p* − p_min = 0,         if β = 1 and n is even;
    p* − p_min = √(2N)/2,   if β = 1 and n is odd.
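These closed forms follow from Lemma 8 with x = 0 together with p* = 2^{⌊(n+β)/2⌋+1} − 1 (note √N/2 = 2^{n/2−1} and √(2N)/2 = 2^{(n−1)/2}, so all entries are integers). A quick numerical check, as a verification script of our own:

```python
def p_min(n, beta):
    """Minimum number of planes of B(2**n, 0, p, beta), Lemma 8 with x = 0."""
    if n % 2 == 0:
        return (3 + beta) * 2 ** (n // 2) // 2 - 1
    return (2 + beta) * 2 ** ((n + 1) // 2) // 2 - 1

def p_star(n, beta):
    """p* = 2^(floor((n + beta)/2) + 1) - 1."""
    return 2 ** ((n + beta) // 2 + 1) - 1

for n in range(2, 12):
    d0 = p_star(n, 0) - p_min(n, 0)
    d1 = p_star(n, 1) - p_min(n, 1)
    assert d0 == (2 ** (n // 2 - 1) if n % 2 == 0 else 0)        # beta = 0
    assert d1 == (0 if n % 2 == 0 else 2 ** ((n - 1) // 2))      # beta = 1
```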
The hardware cost of B(N, 0, p*, β), in terms of the number of SEs, is higher than that of B(N, 0, p_min, β) in half of the cases, but both have the same hardware complexity of Θ(N^{1.5} log N). The time for routing O(N) connections, however, is improved from Ω(N log N) to a sublinear O(√N log N) in the worst case.
4.6 Self-Routing Nonblocking Networks
The attenuation of light passing through a switch has several components, such as fiber-to-switch and switch-to-fiber coupling loss, propagation loss in the medium, loss at waveguide bends, loss at the couplers, etc. In a large switch, a substantial part of this attenuation is directly proportional to the number of couplers that the optical path passes through. Thus, the connection diameter is used to characterize the signal loss [68]. Although B(N, x, p, β), built from a Banyan-type network by horizontal concatenation and/or vertical stacking, has connection diameter O(log N) and can be strictly nonblocking, finding a plane for each new connection relies on global information (i.e., knowledge of the other connections), which increases the time for setting up connections, as discussed in the previous sections. In this section, we propose a self-routing strictly nonblocking switching network with O(log N) connection diameter.
4.6.1 Connection Capacity of BL(N)
Lemma 10 For any connection set C of BL(N), if no two connections in C share any modulo-g input group, then the connection paths for C are node conflict-free in the first log g stages; if no two connections in C share any modulo-g output group, then the connection paths for C are node conflict-free in the last log g stages, 2 ≤ g ≤ 2^n.
It is easy to verify that Lemma 10 is true from the topology of BL(N). For example, in Figure 1.13 in Chapter 1, the two connections along paths P0 and P1 do not share any modulo-4 input group, and thus there is no node conflict in the first two stages. But they share the first modulo-8 input group and the sixth modulo-2 output group, and thus there are node conflicts in stages 2 and 3. By Lemma 10, the following claim can be derived.
Lemma 11 Given a connection set C of BL(N), if no two connections in C share any modulo-2^{⌊(n+β)/2⌋} input group and no two share any modulo-2^{⌊(n+β)/2⌋} output group, then

(i) for β = 0, there is no link conflict in BL(N);
(ii) for β = 1, there is no node conflict in BL(N).
Proof: We prove the lemma by considering the following two cases.

(1) n is even. We have 2^{⌊(n+β)/2⌋} = 2^{n/2}. Since no two connections share any modulo-2^{n/2} input or output group, by Lemma 10 there is no node conflict in the first n/2 and last n/2 stages. Since n/2 + n/2 = n, there is no node conflict in any of the n stages of BL(N). Because no node conflict in stage i implies no link conflict in stage i, there is neither link conflict nor node conflict in BL(N).

(2) n is odd. There are two subcases.

(2.1) For β = 0, we have 2^{⌊(n+β)/2⌋} = 2^{(n−1)/2}. Since no two connections share any modulo-2^{(n−1)/2} input or output group, by Lemma 10 there is no node conflict in the first (n−1)/2 stages, stage 0 to stage (n−3)/2, or in the last (n−1)/2 stages, stage (n+1)/2 to stage n − 1. Thus, there is no node conflict in any stage except the central stage, stage (n−1)/2, of BL(N). Since the output links of stage (n−3)/2 are the input links of stage (n−1)/2, and the input links of stage (n+1)/2 are the output links of stage (n−1)/2, there is no link conflict in any stage of BL(N).

(2.2) For β = 1, we have 2^{⌊(n+β)/2⌋} = 2^{(n+1)/2}. By Lemma 10, there is no node conflict in the first (n+1)/2 and last (n+1)/2 stages. Since (n+1)/2 + (n+1)/2 > n, there is no node conflict in BL(N). □
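Lemma 10, on which this proof rests, can be checked numerically. Assuming the standard recursive definition of the Baseline network (which we take to match the one in Chapter 1), the stage-i SE on the path from input s to output d is determined by the top i bits of d and the top n − i − 1 bits of s, so whole paths can be computed directly. The formula and names below are our own derivation, given only as an illustration:

```python
def bl_path_nodes(n, s, d):
    """SEs visited by the path from input s to output d in BL(2**n).

    At recursion level i the path sits in the half-sized subnetwork
    selected by the top i bits of d, entering it at input s >> i, so the
    SE row inside that subnetwork is s >> (i + 1).  Returns one global
    SE index per stage.
    """
    nodes = []
    for i in range(n):
        block = d >> (n - i)                  # subnetwork chosen so far
        row = s >> (i + 1)                    # SE row within that subnetwork
        nodes.append(block * (1 << (n - i - 1)) + row)
    return nodes

# Four connections of BL(16) in pairwise-distinct modulo-4 input groups
# and modulo-4 output groups: Lemma 10 predicts no shared SE in the
# first two and last two stages (here, all four stages).
conns = [(1, 9), (5, 2), (8, 13), (14, 4)]
paths = [bl_path_nodes(4, s, d) for s, d in conns]
for stage in range(4):
    stage_nodes = [p[stage] for p in paths]
    assert len(stage_nodes) == len(set(stage_nodes))
```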
By Lemma 11, if we allow only one connection to pass through each modulo-2^{⌊n/2⌋} input and output group at any time, then we can route connections in BL(N) without link conflicts; if we allow only one connection to pass through each modulo-2^{⌊(n+1)/2⌋} input and output group at any time, then we can route connections in BL(N) without node conflicts. The new class of self-routing strictly nonblocking networks is built on this idea.
4.6.2 Constructing T(N, β)
In this subsection, we assume that M = 2^m = N²/2^{1−β} and g = N/2^{1−β} = 2^{n−1+β}.
Lemma 12 Given a connection set C of BL(M), if no two connections in C share any modulo-g input group and no two share any modulo-g output group, then C can be set up without conflict in BL(M).

Proof: From M = 2^m = N²/2^{1−β} = (2^n)²·2^{−1+β} = 2^{2n−1+β}, we have m = 2n − 1 + β. According to Lemma 11, if no two connections in C share any modulo-2^{⌊(m+β)/2⌋} = 2^{⌊(2n−1+2β)/2⌋} = 2^{n−1+β} input or output group at any time, then we can route the connections of C in BL(M) under the link conflict-free constraint (i.e., β = 0) or under the node conflict-free constraint (i.e., β = 1). □
We select the first input in each modulo-g input group of BL(M) as a useful input of BL(M), and the first output in each modulo-g output group of BL(M) as a useful output of BL(M). Clearly, M/g = N. Thus, restricted to these useful inputs and outputs, BL(M) can be used as an N × N self-routing switching network under the link or node conflict-free constraint, depending on the value of β, by Lemma 12. In the following, we show how to construct an N × N self-routing strictly nonblocking network, denoted by T(N, β), from BL(M).
We first give some definitions. A link (resp. SE) is called a redundant link (resp. SE) if its removal does not affect the switching functionality of BL(M) for establishing connections from the N useful inputs to the N useful outputs; otherwise, it is called an essential link (resp. SE). T(N, β) is constructed from BL(M) by performing the following two steps to remove all redundant links and SEs.

Step 1. Because BL(M) has m = 2n − 1 + β = n + log g stages, the subnetworks of BL(M) induced by the SEs from stage n to the last stage form a set of 2^n BL(g)s. Since each of these BL(g)s is connected with exactly one useful output of BL(M), at most one connection of any given set of connections from useful inputs to useful outputs is routed through each BL(g). We replace each of these BL(g)s by a g × 1 combiner, and set the output of this combiner as an output of T(N, β).

Step 2. To complete the construction of T(N, β), we need to remove the remaining redundant SEs and links in the first n stages of BL(M). This can be done stage by stage, from stage 0 to stage n − 1, as follows. Initially, the N useful inputs are considered to be connected with N essential links into stage 0. In stage i, 0 ≤ i ≤ n − 1, we perform the following operations. First, we identify all essential SEs and links: if an SE has an input connected with an essential link, it is marked as an essential SE and its two output links are marked as essential links. Second, we remove all redundant SEs and links: if a link is not an essential link, it is removed; if both input links of an SE have been removed, the SE and its two output links are considered redundant and removed.
Example 7 Figure 4.5(a)(i) and (b)(i) show BL(32) and BL(64), respectively, where essential links and SEs are highlighted in dark color and redundant links and SEs are colored gray. Figure 4.5(a)(ii) and (b)(ii) show T(8, 0) and T(8, 1) constructed from BL(32) and BL(64), respectively. □
In BL(M), the two outputs of each SE in one stage are connected with two SEs of the next stage, one in the upper subnetwork and the other in the lower subnetwork. Thus, the number of essential SEs in stage i (0 ≤ i ≤ n − 1) equals min{2^i N, M/2} = min{2^{n+i}, 2^{2n−2+β}}. Let s(N, β) denote the number of SEs in T(N, β). It is easy to verify that there are 2^{n+i} essential 1 × 2 SEs in stage i (0 ≤ i ≤ n − 2), 2^{2n−2+β} essential (2 − β) × 2 SEs in stage n − 1, and no essential SEs in the remaining stages of BL(M). Therefore, by a simple calculation, the total number of SEs in T(N, β) is
    s(N, β) = Σ_{i=0}^{n−2} 2^{n+i} + 2^{2n−2+β} = ((3 + β)/4)N² − N
            = (3/4)N² − N,   if β = 0;
            = N² − N,        if β = 1.
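The closed form can be confirmed by summing the per-stage counts directly (a small verification script of our own):

```python
def se_count(n, beta):
    """Total SEs of T(2**n, beta), summed stage by stage."""
    total = sum(2 ** (n + i) for i in range(n - 1))  # stages 0 .. n-2
    total += 2 ** (2 * n - 2 + beta)                 # stage n-1
    return total

for n in range(2, 11):
    N = 2 ** n
    assert se_count(n, 0) == 3 * N * N // 4 - N      # beta = 0
    assert se_count(n, 1) == N * N - N               # beta = 1
```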
In T(N, β), input (resp. output) i corresponds to input (resp. output) i′ of BL(M), where the binary representation of i′ is the binary representation of i concatenated with log g 0s at the end. This means the first log M − log g = n bits of i and i′ are the same. Therefore, the routing process in T(N, β) is the same as that in BL(N), which is self-routing.
We summarize the above discussion in the following claim.

Theorem 9 T(N, β) is an N × N self-routing strictly nonblocking network of log N stages. For β = 0, it consists of (3/4)N² − N SEs, among which N²/2 − N SEs are of size 1 × 2 and N²/4 SEs are of size 2 × 2; for β = 1, it consists of N² − N SEs, all of size 1 × 2.
In an optical switching network, for practical reasons, the number of wavelengths used must be small. Clearly, if two connection paths are allowed to pass
94
( a )
0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
4
5
6
7
0
1
2
3
( i ) ( ii )
( b )( i ) ( ii )
Figure 4.5. Construction of networks: (a) T (8; 0) based on BL(32); (b) T (8; 1) based
on BL(64)
95
through an SE, then at least two wavelengths are required. In general, two wavelengths are not sufficient for an optical switching network. For example, for an N × N crossbar to establish the identity permutation, which maps input i to output i, N wavelengths are necessary for crosstalk-free routing. In this respect, T(N, β) is superior, as indicated in the following claim.

Corollary 2 T(N, 1) is crosstalk-free with one wavelength, and T(N, 0) is crosstalk-free with two wavelengths.
Proof: Since all SEs in T(N, 1) are of size 1 × 2, only one connection can pass through an SE at a time. Thus, one wavelength is sufficient for crosstalk-free routing in T(N, 1). All SEs in T(N, 0) are of size 1 × 2 except those in the last stage. Thus, a total of two wavelengths is sufficient to ensure that the connections passing through the same SE use different wavelengths. □
4.6.3 Comparison
Compared with self-routing Banyan-type networks, T(N, β) is strictly nonblocking, which is promising for high performance switching.

Compared with an N × N crossbar for photonic switching, T(N, 1) requires slightly fewer SEs and only one wavelength; T(N, 0) requires far fewer SEs, with two wavelengths available. The difference between the N × N crossbar and T(N, β) for photonic switching is quite noticeable, as shown in Table 4.1.
4.7 Summary
One major contribution of this chapter is the design and analysis of parallel routing algorithms for a class of nonblocking switching networks, B(N, x, p, β). Although the assumed parallel machine model is a completely connected multiprocessor system
    Networks   Number of SEs   Diameter   Number of wavelengths
    Crossbar   N²              2N − 1     N
    T(N, 0)    (3/4)N² − N     log N      2
    T(N, 1)    N² − N          log N      1

Table 4.1. Comparison of self-routing strictly nonblocking photonic switching networks
of N PEs, the proposed algorithms can be transformed into algorithms for more realistic parallel computing models. The pointer jumping and binary searching, which dominate the complexity of the proposed algorithms, can be reduced to sorting on realistic parallel computing structures. It is interesting to note that sorting can be implemented on a Banyan-type network in O(log² N) time [47]. Thus, the proposed algorithms can set up connections in B(N, x, p, β) with a slow-down factor of O(log² N) on a Banyan-type network, whose complexity is no larger than that of one plane of B(N, x, p, β).
The approach of applying edge-coloring techniques to investigate the capacity and routability of RNB switching networks has been widely used (refer to [10, 31]). We extended this approach to SNB networks by defining strong edge coloring. For a class of RNB and SNB banyan-based switching networks obtained by horizontal expansion and vertical replication, we proposed a unified mathematical formulation for designing parallel routing algorithms using this approach. For the SNB case, if there are no existing connections, our routing algorithms for SNB and RNB are the same, since we only need to run the first step of the SNB routing algorithm. But when there are existing connections, we have to run the second step of the SNB routing algorithm, whose performance is essentially proportional to d*, which can be as large as O(√N), making our algorithm less practical. An open problem is how to design an efficient polylogarithmic-time routing algorithm for the SNB case.
The results of this chapter have valuable architectural implications for the design and implementation of future large-scale electronic and optical switching networks. Scalable nonblocking switching networks tend to have no self-routing capability. For example, for a nonblocking switching network B(N, x, p, β), though self-routing capabilities exist in a portion of it, routing is still computation intensive. Therefore, in the design of a switching network, in addition to its hardware cost in terms of the cost of SEs and interconnection links (and wavelengths), we must take the routing complexity into consideration. Finding low-cost, high-speed nonblocking switching networks remains a great challenge.
CHAPTER 5
PARALLEL CROSSTALK-FREE ROUTING FOR OPTICAL BENES NETWORKS
5.1 Introduction
Benes networks are rearrangeable nonblocking permutation networks and are among the most efficient switching architectures in terms of the number of 2 × 2 switching elements (SEs) used. In optical Benes networks, if two I/O connecting paths with the same (or close) wavelength(s) share a common SE, crosstalk will occur. To reduce the crosstalk effect, three approaches have been proposed: time, space, and wavelength dilation. In the time dilation approach (e.g., [69, 74, 83, 99]), crosstalk can be avoided by using the principle of reconfiguration with the time division multiplexing (RTDM) paradigm proposed by C. Qiao et al. in [73]. More specifically, a set of permutation connections is partitioned into subsets so that the connections in each subset can be established simultaneously without crosstalk, and the subsets can be used to form a sequence of configurations for the set of connections. Such a subset is called a crosstalk-free (CF) partial permutation. Since the paths realizing a CF partial permutation in a given OMIN do not share any SE, the time dilation approach is also useful for establishing a set of connections that would normally cause conflicts in blocking OMINs such as Banyan networks [69, 74, 83].
In this chapter, we focus on how to quickly configure an optical Benes network for realizing an arbitrary (partial) permutation using the time dilation approach. It has been shown in [99, 100] that for an optical Benes network, a special type of partial permutation, named a semi-permutation, can be realized in one pass, and any permutation can be decomposed into two semi-permutations. However, the existing permutation decomposition algorithms have O(N) time complexity, and the existing crosstalk-free routing algorithms for an N × N optical Benes network take O(N log N) time, which is slow even for circuit switching in the optical domain. Using parallel processing techniques, we give a permutation decomposition algorithm that decomposes a partial permutation with K connections in O(log K) time. By applying our permutation decomposition algorithm and equitable coloring techniques, we present a routing algorithm for realizing an arbitrary partial permutation with K (≤ N) connections in an N × N optical Benes network in O(log² K + log N) time. In addition, we show that the time dilation approach is the most cost-effective for optical Benes networks, provided that the costs in time, space, and wavelength are interchangeable.
The rest of this chapter is organized as follows. In Section 5.2, we present a logarithmic-time parallel decomposition algorithm that decomposes a (partial) permutation into two (partial) semi-permutations. Section 5.3 presents a parallel routing algorithm for realizing semi-permutations in optical Benes networks without crosstalk. In Section 5.4, we compare the three dilation approaches to avoiding crosstalk in optical Benes networks. Section 5.5 summarizes this chapter.
5.2 Parallel Permutation Decomposition
In this section, we first give a simple proof that any permutation can be decomposed into two semi-permutations, then present a parallel permutation decomposition algorithm, and finally extend this algorithm to the decomposition of an arbitrary partial permutation. The presented decomposition algorithm is based on finding a 2-edge coloring of a bipartite graph G with Δ(G) ≤ 2.
5.2.1 Decomposability
Clearly, a permutation cannot be realized in a single pass through an N × N OMIN without crosstalk. Hence, we are interested in a type of partial permutation that can be passed through an OMIN simultaneously without crosstalk. Y. Yang et al. [99] introduced a concept called a semi-permutation, which is a partial permutation that ensures only one active input in each SE of the first and last stages of an OMIN at the same time. Formally, we have the following definition.
Definition 1 For any permutation π of {0, 1, ..., N − 1}, a partial permutation with N/2 active inputs, x_0, x_1, ..., x_{N/2−1}, is called a semi-permutation of π if it satisfies

    {⌊x_0/2⌋, ⌊x_1/2⌋, ..., ⌊x_{N/2−1}/2⌋}
        = {⌊π(x_0)/2⌋, ⌊π(x_1)/2⌋, ..., ⌊π(x_{N/2−1})/2⌋}
        = {0, 1, ..., N/2 − 1}.  □
A partial semi-permutation is a partial permutation whose input (resp. output) set is a subset of the input (resp. output) set of a semi-permutation. Clearly, a semi-permutation is a maximum partial permutation that can be realized in one pass through an N × N OMIN built with 2 × 2 SEs.
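Definition 1 is easy to check mechanically: a partial permutation with N/2 connections is a semi-permutation exactly when its inputs and its outputs each cover every dual pair {2i, 2i + 1} once. A small checker (our own sketch, representing a partial permutation as a dict from active input to output):

```python
def is_semi_permutation(N, partial):
    """Check Definition 1: the N/2 active inputs and their outputs must
    each hit every dual pair {2i, 2i+1} exactly once."""
    if len(partial) != N // 2:
        return False
    in_pairs = {x // 2 for x in partial}
    out_pairs = {y // 2 for y in partial.values()}
    return in_pairs == out_pairs == set(range(N // 2))

# One half of the permutation in Example 8 below is a semi-permutation;
# a partial permutation using both dual inputs 0 and 1 is not.
assert is_semi_permutation(8, {0: 1, 3: 5, 4: 3, 6: 7})
assert not is_semi_permutation(8, {0: 1, 1: 6, 4: 3, 6: 7})
```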
For any (partial) permutation π, we can construct a bipartite graph, called the I/O mapping graph, as follows. The vertex set consists of two parts, V1 and V2; each part has N/2 vertices, corresponding to I and O, respectively. That is, a pair of inputs (resp. outputs) 2i and 2i + 1, i ∈ {0, 1, ..., N/2 − 1}, called dual inputs (resp. dual outputs), is represented by a vertex in V1 (resp. V2). There is an edge between vertex ⌊i/2⌋ in V1 and vertex ⌊j/2⌋ in V2 if and only if j = π(i). An I/O mapping graph may contain parallel edges between two vertices. Because there is a one-to-one correspondence between the active inputs and outputs of a permutation and the edges of its I/O mapping graph, we can label each edge by its corresponding input or output. The following theorem was proved in [99]. We provide a much simpler proof, which serves as the foundation of our parallel algorithms.
Theorem 10 Any (partial) permutation can be decomposed into two (partial) semi-permutations.

Proof: Let Δ(G) be the maximum degree of the vertices in the I/O mapping graph G. Clearly, Δ(G) ≤ 2, since each vertex represents two inputs or two outputs of an OMIN. It is known that every bipartite graph G is Δ(G)-edge colorable; that is, we can color the edge set E(G) with Δ(G) colors so that adjacent edges have different colors (see a proof in [8]). Thus, the I/O mapping graph G constructed from a (partial) permutation is 2-edge colorable. If we color G with a 2-edge coloring, then two edges incident at a vertex of G, whose ends correspond to two dual inputs or outputs, must have different colors. Thus, each of the subgraphs induced by the edges of the same color corresponds to a (partial) semi-permutation. □
Example 8 For the permutation

    π = ( 0 1 2 3 4 5 6 7
          1 6 0 5 3 2 7 4 ),

a 2-edge coloring of its corresponding I/O mapping graph is shown in Figure 5.1, where each edge is labeled by its corresponding input, and solid and dashed edges are colored with different colors. The solid and dashed edges correspond to the two semi-permutations

    π1 = ( 0 3 4 6        π2 = ( 1 2 5 7
           1 5 3 7 )  and        6 0 2 4 ),

respectively. Clearly, π = π1 ∪ π2. □
Figure 5.1. A 2-edge coloring of bipartite graph G
Each connected component of the I/O mapping graph G is a cycle for a permutation, while for a partial permutation it is a cycle, a path, or an isolated vertex. If we interchange the colors of the edges in any component of G, the result is still a 2-edge coloring of G. Thus, the decomposition of a (partial) permutation may not be unique; we can assign either of two dual inputs/outputs to either of the two semi-permutations. In fact, given a (partial) permutation, if the number of connected components (excluding isolated vertices) in its corresponding I/O mapping graph is c, 1 ≤ c ≤ N/2, there are 2^{c−1} ways to decompose the permutation into a pair of (partial) semi-permutations.
5.2.2 Decomposing a Permutation into Two Semi-Permutations
We choose to present our parallel algorithms on a completely connected multiprocessor system, since any algorithm on this parallel computing model can be easily transformed to algorithms on more realistic multiprocessor systems. A completely connected multiprocessor system of size N consists of a set of N processor elements (PEs) connected in such a way that there is a direct connection between every pair of PEs. The PEs are labeled beginning with 0 and placed in an array according to their labels in nondecreasing order. We assume that each PE can communicate with at most one other PE during a communication step.
To facilitate the description of our algorithms, we introduce some notation. Let a_v a_{v−1} ... a_1 a_0 be the binary representation of a. We use ā to denote the integer with binary representation a_v a_{v−1} ... a_1 (1 − a_0); that is, ā is obtained from a by flipping the last bit. We use the operator ":=" to denote an assignment local to a PE or to the control unit, and the operator "←" to denote an assignment requiring interprocessor communication.

Initially, each PE_i reads π(i) from the inputs, assigns value i to m(i), and sets the value of π^{−1} in PE_{π(i)} to i. The pointer p(i) of PE_i is then set to point to the PE with index π^{−1}(x̄), where x = π(ī); this is done in two steps. In the first step, PE_i computes ī and reads the value x = π(ī) from the PE with index ī. In the second step, PE_i computes x̄ and reads the value of π^{−1}(x̄) from the PE with index x̄. Then, by ⌈log(N/2)⌉ rounds of pointer jumping [33], each PE_i sets the value of m(i) to the minimum index of the PEs it has ever pointed to. Finally, the parity of m(i) decides in which semi-permutation I_i is placed; i.e., all inputs with the same parity of m(i) are in the same semi-permutation. The detailed implementation is given in Algorithm 2.
Algorithm 2 A Parallel Decomposition
Input: A permutation π
Output: Two semi-permutations

for all PE_i, 0 ≤ i ≤ N − 1, do
    m(i) := i;
    π^{−1}(π(i)) ← i;
    p(i) ← π^{−1}(x̄), where x = π(ī); /* pointer initialization */
    for t := 1 to ⌈log(N/2)⌉ do
        m(i) ← min{m(i), m(p(i))}; /* comparison */
        p(i) ← p(p(i)); /* pointer jumping */
    end for
    if m(i) is even then
        I_i is in the first semi-permutation;
    else
        I_i is in the second semi-permutation;
    end if
end for
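A sequential simulation of Algorithm 2 can clarify its logic (the synchronous pointer-jumping rounds become plain list updates; this is an illustration, not the parallel implementation):

```python
def decompose(pi):
    """Split permutation pi (a list) into two semi-permutations,
    following Algorithm 2: initialize pointers p(i) to pi^-1 of the dual
    of pi(dual(i)), take the minimum label on each pointer cycle by
    jumping, and place each input by the parity of that minimum."""
    N = len(pi)
    inv = [0] * N
    for i, y in enumerate(pi):
        inv[y] = i                      # pi^-1
    dual = lambda a: a ^ 1              # flip the last bit
    p = [inv[dual(pi[dual(i)])] for i in range(N)]   # pointer initialization
    m = list(range(N))
    for _ in range((N // 2).bit_length()):           # enough jumping rounds
        m = [min(m[i], m[p[i]]) for i in range(N)]   # comparison
        p = [p[p[i]] for i in range(N)]              # pointer jumping
    first = {i: pi[i] for i in range(N) if m[i] % 2 == 0}
    second = {i: pi[i] for i in range(N) if m[i] % 2 == 1}
    return first, second

# The permutation of Example 8 splits into exactly pi_1 and pi_2.
first, second = decompose([1, 6, 0, 5, 3, 2, 7, 4])
assert first == {0: 1, 3: 5, 4: 3, 6: 7}
assert second == {1: 6, 2: 0, 5: 2, 7: 4}
```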
Theorem 11 Algorithm 2 correctly computes two semi-permutations for any permutation in O(log N) time on a completely connected multiprocessor system of N PEs.

Proof: After the initialization of the N pointers, a set of directed cycles (including loops) is formed. It is easy to see that two dual inputs, as well as two inputs mapped to a pair of dual outputs, are in different directed cycles; moreover, the minimum indices of the two cycles containing a pair of dual inputs are themselves duals, and hence have opposite parities. Since the length of each directed cycle is at most N/2, after ⌈log(N/2)⌉ rounds of pointer jumping, each PE_i holds the minimum index of the inputs in the directed cycle/loop to which I_i belongs. Therefore, two dual inputs/outputs are placed in different semi-permutations. Clearly, the algorithm takes O(log N) time, since pointer jumping dominates the time complexity. □
Example 9 Consider the permutation in Example 8. After initializing the N pointers, two directed cycles, (0 → 6 → 3) and (1 → 2 → 7), and two loops, 4 and 5, are formed, as shown in Figure 5.2, where each edge is represented by a circle, two dual inputs are connected by a dotted line, and two inputs mapped to two dual outputs are connected by a dashed line. □
Figure 5.2. A decomposition example
5.2.3 Parallel Decomposition Algorithm for Partial Permutations
The decomposition algorithm presented in the last subsection can be generalized
to decompose any partial permutation with K(< N) active inputs into two partial
semi-permutations.
Initially, each PE_i is associated with edge i. Let p(i) be a pointer of PE_i, which is initially set to point to the PE with index π⁻¹(\overline{π(ī)}) if this index exists (i.e., input i is active, its dual ī is active, and there is an active input j such that π(j) = \overline{π(ī)}), and is set to point to PE_i itself otherwise. For a partial permutation with K active inputs, its corresponding I/O mapping graph G is the union of a set of paths and cycles. For cycles, the case is the same as in Algorithm 2. For paths, pointer initialization forms two directed paths from each path. Each end of a directed path points to an edge i, labeled by its corresponding input, such that input ī is idle or output \overline{π(i)} is idle. By pointer jumping, the two directed paths formed from a path can be colored with two different colors by comparing the indices of the end edges. That is, for an edge corresponding to input i, if the label of the end edge found by i is less than the one found by the input j with j = ī or j = π⁻¹(\overline{π(i)}), then color i with one color; otherwise color i with the other color. Clearly, the edges corresponding to the vertices in the same directed path are colored with the same color, and two dual inputs and outputs are colored with different colors.
Example 10 Figure 5.3 shows how to decompose a partial permutation into two
partial semi-permutations based on a 2-edge coloring. In Figure 5.3 (a), each edge of
5 paths is labeled by its corresponding input. In Figure 5.3 (b), the directed paths are
formed by pointer initialization, where each edge is represented by a circle. The edges
represented by solid circles are colored with one color, and the edges represented by
dashed circles are colored with the other color. The connections corresponding to the
edges with the same color form a partial semi-permutation. □
Since the pointer jumping dominates the time complexity of the algorithm and each connected component in G has at most K edges, the extended parallel decomposition can be done in O(log K) time on a completely connected multiprocessor system of N PEs. In summary, we have the following theorem.
Figure 5.3. Decomposition of a partial permutation based on 2-edge coloring: (a) 5 different types of paths; (b) directed paths formed by pointer initialization and a 2-edge coloring
Theorem 12 For any partial permutation with K active inputs, two partial semi-permutations can be computed in O(log K) time on a completely connected multiprocessor system of N PEs.
Every (partial) semi-permutation can pass through the SEs in the first and last stages of an N × N OMIN without crosstalk in one pass. In order to route a semi-permutation in a single pass without crosstalk, we need to ensure that there is only one active input at each SE in every stage of the OMIN. In the next section, we present a fast routing algorithm for realizing a (partial) semi-permutation in an optical Benes network so that no two connections pass through the same SE at the same time.
5.3 Routing a Semi-Permutation in an Optical Benes Network
In this section, we first present a routing algorithm for realizing an arbitrary semi-permutation in an optical Benes network based on our parallel decomposition algorithm, and then we improve the time complexity of the routing algorithm using an equitable coloring technique.
5.3.1 A Routing Algorithm Based on Parallel Decomposition
The algorithm for routing a semi-permutation in an optical Benes network is given
as Algorithm 3.
Algorithm 3 A Semi-Permutation Routing Algorithm in Optical Benes Networks
Input: A semi-permutation
Output: A setting of SEs of B(N) without crosstalk
Step 1. If the size of the semi-permutation is 1, then set up B(2) according to the connection request, and exit.
Step 2. Decompose the semi-permutation into 2 parts, named the upper/lower semi-permutation, such that two active inputs/outputs in a pair of dual SEs in the first/last stage are in different parts.
Step 3. Set the SEs in the first and last stages so that the active inputs and outputs in the upper/lower semi-permutation are connected with the upper/lower subnetwork.
Step 4. Recursively call this algorithm in the upper/lower subnetwork with the upper/lower semi-permutation as input.
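To make the recursion concrete, here is a sequential Python sketch of Algorithm 3 (hypothetical helper names; the dual of x is assumed to be x XOR 1, and the cycle 2-coloring of Step 2 is done sequentially rather than by parallel pointer jumping). For each connection it returns the upper(0)/lower(1) subnetwork chosen at each recursion level, which determines the settings of the first- and last-stage SEs.

```python
def split(perm):
    """2-color the cycles of q(i) = perm^-1(dual(perm(dual(i)))) by the
    parity of each cycle's minimum index (a sequential stand-in for the
    parallel decomposition); dual(x) = x ^ 1."""
    n = len(perm)
    inv = [0] * n
    for i, o in enumerate(perm):
        inv[o] = i
    q = [inv[perm[i ^ 1] ^ 1] for i in range(n)]
    color = [None] * n
    for i in range(n):
        if color[i] is None:
            cyc, j = [], i
            while color[j] is None:
                color[j] = -1
                cyc.append(j)
                j = q[j]
            for j in cyc:
                color[j] = min(cyc) % 2
    return color

def route_semi(conns):
    """conns: {input: output}, a semi-permutation of B(N) with one active
    input per first-stage SE and one active output per last-stage SE.
    Returns {input: [bits]}: the subnetwork (0 = upper, 1 = lower) chosen
    at each recursion level, so that no SE ever carries two signals."""
    if len(conns) == 1:
        return {i: [] for i in conns}
    ins = sorted(conns)                      # the input of SE k is ins[k]
    sigma = [conns[i] // 2 for i in ins]     # induced permutation on SEs
    color = split(sigma)                     # Step 2: decompose
    res = {}
    for half in (0, 1):                      # Steps 3-4: set SEs, recurse
        sub = {k: sigma[k] for k in range(len(ins)) if color[k] == half}
        for k, bits in route_semi(sub).items():
            res[ins[k]] = [half] + bits
    return res
```

The key invariant is that a semi-permutation of B(N) induces a full permutation on its first-/last-stage SEs, so Step 2 reduces to the same cycle 2-coloring as Algorithm 2, one level down.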
Theorem 13 For any semi-permutation of an optical B(N), Algorithm 3 correctly sets the SEs of B(N) without crosstalk in O(log²N) time on a completely connected multiprocessor system of N PEs.
Proof: By the topology of B(N), we know that every pair of dual SEs in stage i (resp. 2 log N − 2 − i), 0 ≤ i ≤ log N − 2, is connected with two SEs in stage i + 1 (resp. 2 log N − 3 − i), and these two SEs are in different subnetworks B(N/2^{i+1}). In order to guarantee that no crosstalk occurs in any stage of B(N), two active inputs (resp. outputs) belonging to a pair of dual SEs of stage i (resp. 2 log N − 2 − i) must be connected with SEs in different subnetworks B(N/2^{i+1}). This is equivalent to finding a 2-edge coloring of a bipartite graph G, where the 2 active inputs (outputs) belonging to a pair of dual SEs of stage i (2 log N − 2 − i) compose a vertex and each connection corresponds to an edge. Thus, using the parallel decomposition algorithm recursively, the SEs are set without crosstalk for any given semi-permutation. By Theorem 11, the time complexity of Step 2 of Algorithm 3 is O(log N). Since there are 2 log N − 1 stages and every parallel decomposition step decides the setting of the SEs of two stages (i.e., the first and last stages of a subnetwork) of B(N), the time complexity of Algorithm 3 is O(log²N). □
5.3.2 The Improved Routing of Partial Semi-Permutations by Equitable
Coloring
Since a partial semi-permutation is a subset of a semi-permutation, it can be routed in an optical Benes network in one pass without crosstalk. By applying the extended parallel decomposition in Step 2 of Algorithm 3, the total time for routing any partial permutation with K active inputs in an optical B(N) is O(log N log K).
In order to further improve the routing time to O(log²K + log N), we introduce the concept of an equitable edge coloring. A graph G is equitably c-edge colorable if E(G) can be colored with c colors so that adjacent edges receive different colors and the difference between the sizes of any two color classes is at most one, where a color class is the subset of E(G) with the same color. Clearly, both cycles and paths are 2-edge colorable. For any 2-edge coloring of a path or cycle, the sizes of the two color classes are equal for a cycle or an even path (a path with an even number of edges), while the difference between the sizes of the two color classes is one for an odd path (a path with an odd number of edges). The color given to more than half of the edges of an odd path is called its primary color. Thus, given a partial permutation whose I/O mapping graph G has x odd paths, we color the paths and cycles of G with two colors c₁ and c₂ so that ⌈x/2⌉ odd paths have c₁ as primary color and the remaining ⌊x/2⌋ odd paths have c₂ as primary color. These 2-edge colorings of cycles and paths compose an equitable 2-edge coloring of G.
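As a small illustration (a hypothetical sequential helper, not the parallel prefix-sums implementation), an equitable 2-edge coloring of a union of paths and cycles can be obtained by alternating colors inside each component and alternating the primary color over the odd paths:

```python
def equitable_2_edge_coloring(components):
    """components: list of ('path', k) or ('cycle', k), where k is the
    number of edges (cycles are assumed to be even, as in I/O mapping
    graphs).  Returns one 0/1 color list per component; adjacent edges
    get different colors and the color classes differ in size by <= 1."""
    colorings = []
    primary = 0
    for kind, k in components:
        if kind == 'path' and k % 2 == 1:
            start = primary          # majority color of this odd path
            primary ^= 1             # alternate primary colors over odd paths
        else:
            start = 0                # even components are balanced anyway
        colorings.append([(start + e) % 2 for e in range(k)])
    return colorings
```

Only the odd paths can unbalance the two classes, so alternating their primary colors keeps the overall difference at most one.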
To route a partial semi-permutation in an optical B(N) without crosstalk in O(log²K + log N) time, we need to perform a preprocessing step and apply the equitable 2-edge coloring technique in Step 2 of Algorithm 3. The preprocessing links the K PEs corresponding to the K active inputs. It can be done by a parallel prefix sums operation [33], which takes O(log N) time on a completely connected multiprocessor system of N PEs. In the following, we show how to color the x odd paths of G with 2 colors so that the difference between the sizes of the 2 color classes is at most 1. It is easy to see that for any odd path, an edge whose dual input is not active will be colored with the primary color. We call such an edge a primary edge. We concatenate all primary edges by a parallel prefix sums operation on the K linked PEs and alternately color the primary edges with the two colors. Thus there are ⌈x/2⌉ primary edges with one color and ⌊x/2⌋ primary edges with the other color. The edges in an odd path are then colored using the primary edge as a reference: if an edge e and a primary edge f are in the same directed path, then e and f receive the same color; otherwise they receive different colors. Therefore, an equitable 2-edge coloring of G is found. Since the pointer jumping and parallel prefix sums operations dominate the time complexity, an equitable 2-edge coloring of G can be found in O(log K) time on a completely connected multiprocessor system of N PEs.
Example 11 Figure 5.4 shows how to find an equitable 2-edge coloring. The primary edges are marked as dark lines in Figure 5.4 (a). □
Figure 5.4. An equitable 2-edge coloring of a graph: (a) 3 odd paths and primary edges; (b) directed paths formed by pointer initialization and an equitable 2-edge coloring
Using the equitable 2-edge coloring technique, we can decompose a partial permutation into two partial semi-permutations whose sizes differ by at most one. When we route a partial semi-permutation in an optical Benes network, applying the equitable 2-edge coloring technique in Step 2 of Algorithm 3 halves the size of the partial permutation entering each subnetwork. Thus, after log K iterations, at most one active input enters any subnetwork. Consequently, the time for setting up a partial semi-permutation with K active inputs in an optical B(N) is O(log K) in each of the first log K iterations and O(1) in each of the remaining iterations. Therefore, we have the following claim.
Theorem 14 Any partial permutation with K (< N) active inputs of an optical B(N) can be routed without crosstalk in O(log²K + log N) time using a completely connected multiprocessor system of N PEs.
5.4 Comparisons of Three Dilation Approaches for Optical Benes Networks
Three approaches, namely time dilation, space dilation, and wavelength dilation, can be used to avoid crosstalk in OMINs.
In the time dilation approach, given any (partial) permutation, we first use the parallel decomposition algorithm to decompose the (partial) permutation into 2 (partial) semi-permutations, and then use Algorithm 3 twice to route the two (partial) semi-permutations without crosstalk.
In space dilation, a dilated Benes network, denoted DB(N), consists of 2 copies of B(N), with each pair of corresponding inputs and outputs connected to a 1 × 2 SE and a 2 × 1 combiner, respectively [42, 57] (see Figure 5.5 for an example).
Figure 5.5. A space dilated Benes network DB(8)
For routing a permutation in a DB(N), we first decompose the permutation into 2 semi-permutations, and then route each semi-permutation in one of the copies of B(N) simultaneously. By Theorems 11-14, we have the following corollary.
Corollary 3 Any (partial) permutation with K (≤ N) active inputs of an optical Benes network B(N) can be routed without crosstalk in O(log²K + log N) time on a completely connected multiprocessor system of N PEs by either time or space dilation.
By Corollary 3, the time complexity of routing a (partial) permutation in an optical B(N) is the same as the time complexity of the best known parallel routing algorithms for realizing a (partial) permutation in an electronic B(N) [43, 49, 62].
Compared with the time dilation approach, the space dilation approach uses more than double the hardware, i.e., twice the SEs and links plus the splitters and combiners, and takes more than half of the time dilation approach's time to route a permutation, i.e., the time for decomposition plus the routing of one semi-permutation.
In wavelength dilation, if there is a wavelength converter available in each SE, we can convert two input signals with the same wavelength entering the same SE to different wavelengths. Thus, two wavelengths are necessary, plus the cost of the wavelength converters. If there is no wavelength converter available, i.e., each connection is assigned a single wavelength, then we find that two wavelengths are not sufficient. An example is given as follows.
Example 12 Routing the permutation

π = ( 0 1 2 3 )
    ( 0 2 1 3 )

in an optical B(4).
In order to route the permutation π in B(4), by the topology of B(4), we know that inputs 0 and 1 (outputs 2 and 3) are connected with different subnetwork B(2)'s, which are the two SEs in the second stage of B(4). Since π(1) = 2 and π(3) = 3, inputs 1 and 3 must be connected with different SEs in the second stage. Consequently, inputs 0 and 3 must be connected with the same SE in the second stage, which contains only 2 SEs. In order to avoid crosstalk, we must use different wavelengths for connections 0 → 0 and 3 → 3. We also know that the connections 0 → 0 and 1 → 2 must be carried on signals with different wavelengths since they pass through the same SE in the first stage. Thus, connections 3 → 3 and 1 → 2 must have the same wavelength if there are only two available wavelengths. However, the connections 3 → 3 and 1 → 2 pass through the same SE in the last stage of B(4), which will cause crosstalk. In fact, we need 4 wavelengths to route the above permutation π in B(4). □
From the above discussion, we conclude that the time dilation approach is the most cost-effective, provided that the costs both in space and in wavelength are at least as high as the cost in time.
5.5 Summary
In this chapter, we proposed a fast parallel decomposition algorithm with time complexity O(log N), which decomposes any permutation of size N into two semi-permutations, ensuring no crosstalk in the SEs of the first and last stages of OMINs. Based on this parallel decomposition, we further presented a fast crosstalk-free parallel routing algorithm, which can set up any permutation in O(log²N) time in an optical B(N). The proposed decomposition algorithm can be generalized to any partial permutation. Using the equitable 2-edge coloring technique, any partial permutation with K (< N) active inputs can be routed in O(log²K + log N) time in an optical B(N).
In addition, the proposed algorithms, which run on a completely connected multiprocessor system, can be easily translated into algorithms on more realistic multiprocessor systems. For example, the time complexities of our routing algorithms for time and space dilated B(N) depend on the parallel permutation decomposition algorithm. Pointer jumping is the most time-consuming operation in the decomposition algorithm. Each pointer jumping step on a completely connected multiprocessor system can be implemented on a hypercube by a sorting operation, which takes O(log²N) time. Consequently, the decomposition algorithm and the routing algorithms can be implemented in O(log³N) time and O(log⁴N) time, respectively, on a hypercube.
CHAPTER 6
PARALLEL ROUTING AND WAVELENGTH ASSIGNMENTS FOR OPTICAL
INTERCONNECTION NETWORKS
6.1 Introduction
Networks using optical transmission and maintaining optical data paths can be used to remove the expensive optic-electro and electro-optic conversions. Electronic parallel processing for controlling such networks is capable, in principle, of meeting future high data rate requirements. A nonblocking space-division-multiplexing network can be strictly nonblocking (SNB) or rearrangeable nonblocking (RNB). In SNB networks, a connection can be established from any idle input to any idle output without disturbing existing connections, while in RNB networks the connection can be established if rearrangement of existing connections is allowed. With wavelength-division-multiplexing (WDM) technology, the concepts of SNB and RNB in space division switching can be extended to wavelength division switching. Depending on whether wavelengths can be reassigned, this extension results in four combinations: wavelength-rearrangeable space-rearrangeable (WRSR), wavelength-rearrangeable space-strict-sense (WRSS), wavelength-strict-sense space-rearrangeable (WSSR), and wavelength-strict-sense space-strict-sense (WSSS). It has been shown that, using both wavelength and space multiplexing techniques in a fully dynamic manner, networks can achieve higher bandwidth and higher connectivity [80].
The crosstalk in photonic switching networks adds a new type of blocking, node blocking, also called wavelength conflict, compared with only link blocking in electronic switching networks. Clearly, if a photonic switching network is free of wavelength conflict, it must be free of link conflict, since connections with different wavelengths can share the same link in such networks.
In order to minimize wavelength conflicts in photonic switching networks, three approaches, space dilation, time dilation, and wavelength dilation, have been proposed. Since connections with neighboring wavelengths do not share any SE, the wavelength dilation approach is also useful for establishing a set of connections that would normally cause link conflicts in blocking space-division-multiplexing OMINs such as Banyan networks.
In this chapter, we focus on the wavelength dilation approach to quickly configure an OMIN and assign each connection a wavelength for realizing a permutation without crosstalk. In wavelength dilation, if wavelength converters are available, we can convert input signals with the same wavelength entering the same SE to different wavelengths. Thus, only two wavelengths are needed, plus the cost of the wavelength converters. The use of wavelength converters increases hardware cost and configuration time. If no wavelength converter is available, i.e., each connection uses a single wavelength, then we need to find a wavelength assignment for the connections plus a setting of the SEs so that there is no crosstalk in the OMIN. In this chapter, we assume that no wavelength converter is available in OMINs and ensure by routing that the wavelengths in the same SE are different.
The switch model used in this chapter follows [71, 82]. The OMINs under this switch model can be built from 2 × 2 multi-wavelength SEs, in which each input (resp. output) is capable of receiving (resp. transmitting) optical signals on a set of wavelengths and each wavelength is switched independently in the SE [82]. Such a multi-wavelength SE has an independently controllable state, straight or cross as shown in Figure 6.1 (a), for each wavelength. Figure 6.1 (b) shows a signal transmission in a multi-wavelength SE, where the connections for wavelength λ₂ in the upper input and wavelength λ′₂ in the lower input are in the cross state and all other connections are in the straight state. If an SE can only receive/transmit one wavelength for each input/output, it is called a basic SE. The OMINs considered in this chapter are WRSS Banyan networks and WRSR Benes networks, where the WRSR Benes networks contain only basic SEs.
Figure 6.1. A 2 × 2 multi-wavelength SE: (a) two states; (b) signal transmission
For a permutation of an OMIN, if there is a setting of the SEs that realizes the permutation and a wavelength assignment of the connections so that no two connections with the same wavelength share any SE, we call this setting and wavelength assignment a crosstalk-free configuration of the OMIN for the permutation. An algorithm that can find a crosstalk-free configuration for any permutation of an OMIN is called a crosstalk-free routing and wavelength assignment algorithm for the OMIN.
In order to design crosstalk-free routing and wavelength assignment algorithms, we first study the permutation capacity of these OMINs, and then show how to partition a set of connections into subsets so that the connections in each subset can be established simultaneously, and how to assign a wavelength to each connection so that the connections in different subsets have different wavelengths. By applying graph edge and vertex coloring techniques, we show that our algorithms can route any permutation without crosstalk in O(log²N) time for a WRSS Banyan network using at most 2^⌊(log N + 1)/2⌋ wavelengths, and in O(log³N) time for a WRSR Benes network using at most 2 log N wavelengths, on a completely connected multiprocessor system of N PEs. Finally, we show that both the routing and the wavelength assignment algorithms can be implemented on a hypercube with N/2 PEs in O(log⁴N) time.
The rest of the chapter is organized as follows. In Section 6.2, a parallel crosstalk-free routing and wavelength assignment algorithm for WRSS Banyan networks is given. In Section 6.3, we develop a parallel crosstalk-free routing and wavelength assignment algorithm for WRSR Benes networks. Section 6.4 shows how to implement our algorithms on a hypercube. Section 6.5 summarizes the chapter.
6.2 Parallel Routing and Wavelength Assignment in WRSS Banyan Networks
The idea behind our crosstalk-free routing and wavelength assignment algorithm for WRSS Banyan networks is as follows. We partition a set of connections into subsets so that the connections in the same subset do not share any SE, and then assign the connections in different subsets different wavelengths and the connections in the same subset the same wavelength. Each of these subsets is called a crosstalk-free (CF) subset. Clearly, this wavelength assignment will not cause any crosstalk in the SEs. Since BL(N) is a self-routing network, the routing for each connection can be easily done following the self-routing rule. We only need to consider how to partition a set of connections into CF subsets and assign the connections in different subsets different wavelengths.
By Lemma 4 in Chapter 4, we have the following lemma.
Lemma 13 Given a partial permutation π of BL(N), if no two connections in π share a modulo-2^⌊(n+1)/2⌋ input group and no two connections share a modulo-2^⌊(n+1)/2⌋ output group, then π can be routed in BL(N) simultaneously without crosstalk.
We let g = 2^⌊(n+1)/2⌋ in the rest of this section. According to Lemma 13, if we assign different wavelengths to the connections in π whose sources (resp. destinations) share the same modulo-g input (resp. output) group, then we can route π in BL(N) without crosstalk. This wavelength assignment problem can be reduced to a Δ(G(π, g))-edge coloring of a bipartite graph G(π, g). In G(π, g), the vertex set consists of two parts, V₁ and V₂. Each part has N/g vertices, i.e., each modulo-g input (resp. output) group is represented by a vertex in V₁ (resp. V₂). There is an edge between vertex ⌊i/g⌋ in V₁ and vertex ⌊j/g⌋ in V₂ if j = π(i). Thus, G(π, g) is a bipartite graph with N/g vertices in each of V₁ and V₂ and K edges, where at most g edges are incident on any vertex, so the degree of G(π, g) is at most g. It has been proved that any bipartite graph G has a Δ(G)-edge coloring [8]. Hence, G(π, g) has a g-edge coloring since G(π, g) is bipartite and Δ(G(π, g)) ≤ g. Thus, if we find a g-edge coloring of G(π, g), then we can assign wavelength i to the connections corresponding to the edges with color i, 0 ≤ i ≤ g − 1.
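To illustrate the reduction, the following Python sketch (hypothetical names; a sequential alternating-path algorithm standing in for the parallel Δ-edge coloring algorithm of Chapter 4) builds G(π, g) for a full permutation π and g-edge-colors it, yielding a wavelength for every connection:

```python
def edge_color_bipartite(nL, nR, edges, delta):
    """Delta-edge-color a bipartite multigraph given as (u, v) pairs,
    using the classical alternating-path argument behind Koenig's
    edge-coloring theorem."""
    atL = [dict() for _ in range(nL)]        # per vertex: {color: edge id}
    atR = [dict() for _ in range(nR)]
    col = [None] * len(edges)
    free = lambda d: next(c for c in range(delta) if c not in d)
    for e, (u, v) in enumerate(edges):
        a, b = free(atL[u]), free(atR[v])
        if a != b:
            # collect the maximal a/b alternating path starting at v ...
            path, side, x, want = [], 'R', v, a
            while True:
                d = atR[x] if side == 'R' else atL[x]
                if want not in d:
                    break
                f = d[want]
                path.append(f)
                x = edges[f][0] if side == 'R' else edges[f][1]
                side = 'L' if side == 'R' else 'R'
                want = a if want == b else b
            for f in path:                   # ... and swap a <-> b along it
                fu, fv = edges[f]
                c = a if col[f] == b else b
                del atL[fu][col[f]], atR[fv][col[f]]
                col[f] = c
                atL[fu][c] = f
                atR[fv][c] = f
        col[e] = a                           # a is now free at both ends
        atL[u][a] = e
        atR[v][a] = e
    return col

def assign_wavelengths(pi, g):
    """Build G(pi, g) for a permutation pi and return one wavelength
    (= edge color) in 0..g-1 per connection."""
    n_groups = len(pi) // g
    edges = [(i // g, pi[i] // g) for i in range(len(pi))]
    return edge_color_bipartite(n_groups, n_groups, edges, g)
```

The parity argument in König's proof guarantees the alternating path from v never ends at u, so recoloring it frees color a at both endpoints of the new edge.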
In Section 4.4 of Chapter 4, we showed that there is an efficient algorithm for finding a Δ(G)-edge coloring of a bipartite graph G. By Theorem 4, we have the following corollary.
Corollary 4 For any partial permutation π with K (≤ N) active inputs, a crosstalk-free routing and wavelength assignment of π for a WRSS BL(N) can be found in O(log N · log K) time, using at most 2^⌊(n+1)/2⌋ wavelengths, on a completely connected multiprocessor system of N PEs.
It is easy to verify that 2^⌊(n+1)/2⌋ wavelengths are also necessary for a WRSS BL(N), since there exist permutations with 2^⌊(n+1)/2⌋ connections sharing a common SE.
6.3 Parallel Routing and Wavelength Assignment in WRSR Benes Networks
For space-division multiplexing, Benes networks are rearrangeable nonblocking. By [53, 99], we know that each permutation can be decomposed into two crosstalk-free partial permutations so that each CF partial permutation can be routed in an optical Benes network simultaneously. Hence, if we assign the same wavelength to the connections in the same CF partial permutation and different wavelengths to the connections in different CF partial permutations, two wavelengths are sufficient for a WRSR B(N) in which SEs may contain non-basic states. (Figure 6.2 shows an example, where different line styles denote different wavelengths.) In the following two subsections, we consider the case in which the WRSR Benes networks contain only basic SEs, leading to reduced hardware complexity [71].
6.3.1 Upper Bound for the Number of Wavelengths
In order to find an upper bound for the number of wavelengths needed for crosstalk-free routing, we need to consider routing a permutation in an OMIN. We model the wavelength assignment for a permutation in an OMIN as the vertex coloring of a graph G_ω, where the vertex set V(G_ω) = {connections} and the edge set E(G_ω) = {{u, v} | connections u and v conflict with each other}. We call G_ω a wavelength conflict graph. Although finding the minimum number of wavelengths and assigning the wavelengths to the connections are equivalent to finding the minimum number of colors and assigning the colors to the vertices, respectively, which are both NP-complete for general graphs, we can find an upper bound for the number of wavelengths needed for realizing any permutation in WRSR Benes networks.
Theorem 15 For any permutation of a WRSR B(N),

ω ≤ 2 log N if N ≤ 4, and ω ≤ 2 log N − 1 otherwise,

where ω is the number of wavelengths needed for the crosstalk-free routing of a permutation in B(N).
Proof: Each connection conflicts with at most 2 log N − 1 connections, since it passes through a total of 2 log N − 1 basic SEs and meets at most one other connection in each. Thus Δ(G_ω) ≤ 2 log N − 1. By Brooks' theorem (see a proof in [8]), if G_ω is neither a complete graph nor an odd cycle, then we need at most Δ(G_ω) colors to color V(G_ω) so that any two adjacent vertices receive different colors; otherwise Δ(G_ω) + 1 colors are sufficient. Clearly, for any permutation of an OMIN with N > Δ(G_ω) + 1, G_ω is neither a complete graph nor an odd cycle, since Δ(G_ω) < N − 1 and N is even. Therefore, the theorem holds. □
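The Δ(G_ω) + 1 part of the bound can be realized by a simple greedy first-fit coloring of the wavelength conflict graph; the Python sketch below (hypothetical names, sequential, and achieving only the Δ + 1 bound rather than Brooks' sharper Δ bound, which needs a more careful algorithm) assigns wavelengths accordingly. For instance, on a K₄ conflict graph it uses 4 wavelengths, matching 2 log N for N = 4.

```python
def greedy_wavelengths(conflicts):
    """conflicts: {connection: set of conflicting connections}, the
    adjacency structure of the wavelength conflict graph G_w.  First-fit
    greedy coloring: uses at most max-degree + 1 wavelengths."""
    wl = {}
    for c in sorted(conflicts):
        used = {wl[d] for d in conflicts[c] if d in wl}
        # a connection has at most |conflicts[c]| neighbors, so some
        # wavelength in 0..|conflicts[c]| must be free
        wl[c] = next(w for w in range(len(conflicts[c]) + 1) if w not in used)
    return wl
```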
The following example shows that 4 wavelengths are necessary for the crosstalk-free routing of some permutation in a B(4) that contains only basic SEs.

Example 13 Routing the permutation

π = ( 0 1 2 3 )
    ( 0 2 1 3 )

in an optical B(4).

By the topology of B(4), it is easy to verify that each connection conflicts with all three other connections, and thus 4 wavelengths are necessary for routing this permutation in B(4) without crosstalk. A wavelength assignment for π is shown in Figure 6.2 (a). □
The simple proof of the upper bound on the number of required wavelengths in Theorem 15 does not directly lead to a wavelength assignment algorithm. In the next subsection, we utilize the properties of our permutation decomposition and the structure of the Benes network to obtain a fast parallel crosstalk-free routing and wavelength assignment algorithm for a WRSR B(N) using no more than 2 log N wavelengths.

Figure 6.2. A crosstalk-free routing and wavelength assignment for B(4): (a) a WRSR B(4) containing only basic SEs; (b) a WRSR B(4) containing non-basic SEs

6.3.2 Routing and Wavelength Assignment Algorithm
Our routing and wavelength assignment algorithm uses the permutation decomposition algorithm of [53] as a subalgorithm and a vertex coloring technique similar to that of [21]. Conceptually, the algorithm has log N iterations. In each iteration i, if 0 ≤ i < log N − 1, the algorithm decides the setting of the SEs in stage i and stage 2 log N − 2 − i and uses at most 2(i + 1) + 1 wavelengths to ensure that there is no wavelength conflict in stage j for any j ∈ {0, ..., i} ∪ {2 log N − 2 − i, ..., 2 log N − 2}; if i = log N − 1, the algorithm decides the setting of the SEs in stage log N − 1 and uses at most 2 log N wavelengths to ensure that there is no wavelength conflict in B(N).
We define a wavelength class as the set of connections assigned the same wavelength. A wavelength λ is called a free wavelength for a connection c if λ is not assigned to any connection conflicting with c.
Each PE_i is associated with connection i and maintains one variable λ(i) and two arrays C_i and W_i, 0 ≤ i ≤ N − 1. For any 0 ≤ i ≤ N − 1, C_i consists of 2 log N − 1 entries C_i[j], 0 ≤ j ≤ 2 log N − 2, and W_i consists of 2 log N entries W_i[k], 0 ≤ k ≤ 2 log N − 1. λ(i), C_i[j], and W_i[k] are used to record, for connection i, the assigned wavelength, the new conflicting connection found in iteration ⌊j/2⌋, and the number of conflicting connections with wavelength k, respectively. We call C_i and W_i the connection conflict array and the wavelength conflict array of connection i, respectively. The other variables are all working variables. Initially, let λ(i) := 0, C_i[j] := ∞, and W_i[k] := 0, for i ∈ {0, ..., N − 1}, j ∈ {0, ..., 2 log N − 2}, and k ∈ {0, ..., 2 log N − 1}, respectively. We use the operator ":=" to denote an assignment local to a PE or to the control unit, and the operator "←" to denote an assignment requiring interprocessor communication. In our parallel routing and wavelength assignment algorithm, each iteration i consists of the following steps:
Step 1-Permutation Decomposition: decompose the (partial) permutation of each subnetwork B(N/2^i) into two parts, named the upper and lower partial permutations, such that two active inputs (resp. outputs) in an SE in the first (resp. last) stage of B(N/2^i) are in different parts.
Step 2-Setting SEs: set the SEs in the first and last stages of each B(N/2^i) in such a way that (i) if i ≠ log N − 1, the active inputs and outputs in the upper (resp. lower) partial permutation are connected with an upper (resp. lower) subnetwork B(N/2^{i+1}); (ii) if i = log N − 1, each active input is connected with its mapped output.
The above two steps decide the routing for the given permutation. The following steps find a wavelength assignment for the routing solution. For all PE_c, 0 ≤ c ≤ N − 1, do in parallel:
Step 3-Recording Conflicting Connections: (i) if there is a connection c′ such that c and c′ pass through the same SE in stage i and c′ ≠ C_c[j] for all 0 ≤ j < 2i, then C_c[2i] := c′; (ii) if i ≠ log N − 1 and there is a connection c″ such that c and c″ pass through the same SE in stage 2 log N − 2 − i and c″ ≠ C_c[j] for all 0 ≤ j < 2i + 1, then C_c[2i + 1] := c″.
Step 4-Reassigning Wavelengths: if connection c is in a lower partial permutation, λ′(c) := λ(c) and λ(c) := λ(c) + (2i + 1).
Step 5-Updating Conflicting Wavelengths: update the wavelength conflicts by (i) adding new conflicts and (ii) updating existing conflicts, where (ii) consists of two substeps: (ii-1) clearing old wavelengths and (ii-2) adding updated wavelengths. The detailed implementation of this step is given in Algorithm 4.
Algorithm 4 Updating Conflicting Wavelengths
if i ≠ log N − 1, j′ := 2i + 1; otherwise, j′ := 2i;
for all PE_c, 0 ≤ c ≤ N − 1, do
  t(c) := ∞;
  for j = 2i to j′ do
    if C_c[j] ≠ ∞ and λ(c) ≤ 2 log N − 1 then
      t(C_c[j]) ← λ(c);
    end if
    if t(c) ≠ ∞ then
      W_c[t(c)] := W_c[t(c)] + 1; /* (i): adding new conflicts */
      t(c) := ∞;
    end if
  end for
  if connection c is in a lower partial permutation and i ≠ 0 then
    for j = 0 to 2i − 1 do
      if C_c[j] ≠ ∞ then
        t(C_c[j]) ← λ′(c);
      end if
      if t(c) ≠ ∞ then
        W_c[t(c)] := W_c[t(c)] − 1; /* (ii-1): clearing old wavelengths */
        t(c) := ∞;
      end if
      if C_c[j] ≠ ∞ and λ(c) ≤ 2 log N − 1 then
        t(C_c[j]) ← λ(c);
      end if
      if t(c) ≠ ∞ then
        W_c[t(c)] := W_c[t(c)] + 1; /* (ii-2): adding updated wavelengths */
        t(c) := ∞;
      end if
    end for
  end if
end for
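To make the bookkeeping in Algorithm 4 concrete, the following is a minimal sequential sketch under our own naming (the dissertation's version runs one PE per connection in parallel, with "←" denoting inter-PE communication). Here `C[c]` holds `None` where the dissertation stores ∞, `lowered[c]` marks connections placed in a lower partial permutation, and `lam`/`lam_old` play the roles of λ and λ′.

```python
# Sequential sketch of Algorithm 4's bookkeeping (our naming, not the
# dissertation's parallel formulation): each connection c tells every
# conflicting connection d about c's current wavelength counts.
def update_conflicting_wavelengths(i, logN, lowered, C, W, lam, lam_old):
    j_hi = 2 * i + 1 if i != logN - 1 else 2 * i
    for c in range(len(C)):
        # (i) conflicts newly recorded in this iteration (slots 2i .. j_hi)
        for j in range(2 * i, j_hi + 1):
            d = C[c][j]
            if d is not None and lam[c] <= 2 * logN - 1:
                W[d][lam[c]] += 1          # d learns c's wavelength
        # (ii) existing conflicts of a connection whose wavelength shifted
        if lowered[c] and i != 0:
            for j in range(2 * i):
                d = C[c][j]
                if d is not None:
                    W[d][lam_old[c]] -= 1  # (ii-1) clear c's old wavelength
                    if lam[c] <= 2 * logN - 1:
                        W[d][lam[c]] += 1  # (ii-2) add c's new wavelength
```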
By the above five steps, it is easy to see that the wavelength assignment in each iteration will not result in any conflict in the SEs that have been set up so far. However, we can reduce the number of wavelengths by reassigning new wavelengths in {0, ..., 2(i + 1)} to the connections with wavelengths in {2(i + 1) + 1, ..., 2(2i + 1) − 1 = 4i + 1} without resulting in any wavelength conflict. (The correctness of this reassignment of wavelengths will be proved in Lemma 14.) This is done as follows: for λ* = 2(i + 1) + 1 to 4i + 1, if λ(c) = λ*, then perform the following two steps:
Step 6 - Adjusting Wavelengths: find a free wavelength j ∈ {0, 1, ..., j′ + 1} such that W_c[j] = 0 by checking the values in {W_c[0], ..., W_c[j′ + 1]}, and set λ′(c) := λ(c) and λ(c) := j. (The value of j′ in this step and the next step is the same as that in Algorithm 4.)
Step 7 - Updating Conflicting Wavelengths: for k = 0 to j′, do (i) if C_c[k] ≠ ∞ and λ′(c) ≤ 2 log N − 1, then decrease W_{C_c[k]}[λ′(c)] by 1; and (ii) if C_c[k] ≠ ∞, then increase W_{C_c[k]}[λ(c)] by 1. (The detailed implementation is similar to Algorithm 4.)
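Step 6 amounts to a linear scan of the wavelength conflict array; a sketch under our own naming:

```python
def find_free_wavelength(Wc, j_hi):
    # Step 6 sketch: scan W_c[0 .. j_hi + 1] for an entry with count 0.
    # Lemma 14 guarantees such an entry exists within this range.
    for j in range(j_hi + 2):
        if Wc[j] == 0:
            return j
    raise ValueError("no free wavelength in range")
```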
Lemma 14 After iteration i, 0 ≤ i ≤ log N − 1, of our parallel routing and wavelength assignment algorithm, there is no wavelength conflict in stage j, for any j ∈ {0, ..., i} ∪ {2 log N − 2 − i, ..., 2 log N − 2}, and at most ω_i wavelengths are used, where

  ω_i ≤ { 2(i + 1)      if i = 0 or i = log N − 1,
        { 2(i + 1) + 1  otherwise.
Proof: The proof is by induction on iteration i. If i = 0, it is true since two connections passing through the same SE in the first and last stages are assigned different wavelengths and ω_0 = 2. Now we assume that it is true for any i < k ≤ log N − 1. In iteration k, by assumption, we know that there is no wavelength conflict in stage j, for any j ∈ {0, ..., k − 1} ∪ {2 log N − 1 − k, ..., 2 log N − 2}, using ω_{k−1} wavelengths. By Step 4, two connections passing through the same SE in stage k and stage 2 log N − 2 − k are assigned different wavelengths, using 2·ω_{k−1} wavelengths. Hence, there is no wavelength conflict in stage j for any j ∈ {0, ..., k} ∪ {2 log N − 2 − k, ..., 2 log N − 2}, using 2·ω_{k−1} wavelengths. In the following, we show that 2·ω_{k−1} wavelengths are more than necessary in the case that 2·ω_{k−1} > 2(k + 1) + 1 if k ≠ log N − 1, or the case that 2·ω_{k−1} > 2 log N if k = log N − 1. In iteration k, each connection conflicts with at most 2(k + 1) connections if k ≠ log N − 1 and at most 2 log N − 1 connections if k = log N − 1. This is because for iteration j, if j ≤ k < log N − 1, we need to consider wavelength conflicts in two stages, stages j and 2 log N − 2 − j; if j = k = log N − 1, we only need to consider wavelength conflicts in stage log N − 1, since stage j and stage 2 log N − 2 − j are the same. Thus, in Step 6, a free wavelength of index no greater than 2(k + 1) for k < log N − 1 and 2 log N − 1 for k = log N − 1 can always be found. Furthermore, the connections in the same wavelength class have no wavelength conflict, so we can adjust the wavelengths of these connections at the same time without introducing any new conflict. □
Theorem 16 For any (partial) permutation, a routing and wavelength assignment for a WRSR B(N) can be found in O(log^3 N) time using at most 2 log N wavelengths on a completely connected multiprocessor system of N PEs.
Proof: By the recursive structure of B(N) and by applying our permutation decomposition algorithm recursively, we can find a setting of SEs in B(N) so that any permutation can be realized. By Lemma 14, we know that the wavelength assignment assures no wavelength conflict for the routing solution. Now, we analyze the time complexity. It is easy to see that in each iteration, Steps 2 and 4 take O(1) time and each of the other steps takes O(log N) time. Iteration i has at most ω_{i−1} (≤ 2i + 1) wavelength classes to be adjusted; thus, Steps 6 and 7 in iteration i are executed at most ω_{i−1} (≤ 2i + 1) = O(log N) times. Since there are log N iterations, the total time complexity of our routing and wavelength assignment algorithm is O(log^3 N). □
Example 14 Figure 6.3 shows the process for routing the permutation

  π = ( 0 1 2 3 4 5 6 7
        0 2 1 4 3 7 5 6 )

in a WRSR B(8).
[Figure content omitted: for iterations 0, 1, and 2, the figure traces Steps 1-7 on the example permutation, showing the upper and lower partial permutations produced by Step 1, the connection conflict arrays C_i[0..4], the wavelength conflict arrays W_i[0..5], and the wavelengths 0-5 assigned to inputs 0-7.]
Figure 6.3. Routing a permutation in WRSR B(8): (a) finding a wavelength assignment; (b) crosstalk-free routing in B(8)
In iteration i, by Step 1, each (partial) permutation is divided into one upper partial permutation and one lower partial permutation by applying the parallel decomposition algorithm. Step 2 sets up the SEs in stage i and stage 2 log N − 2 − i according to the decomposition. That is, if a connection is in the upper partial permutation, it is connected with the upper subnetwork; otherwise, it is connected with the lower subnetwork. In Step 3, if two connections conflict with each other, possibly in two SEs, they record each other in their respective connection conflict arrays only once. Step 4 reassigns a new wavelength, ω_{i−1} larger than the original one, to each connection in the lower partial permutations so that there is no wavelength conflict in any SE that has been set up so far. After the wavelength reassignment, in Step 5, each connection c updates W_d for each connection d in C_c as follows: (i) if d is a new conflicting connection of c generated in this iteration, the entry indexed by c's updated wavelength in W_d is increased by 1; and (ii) if d is an existing conflicting connection of c, the entry indexed by c's original wavelength in W_d is decreased by 1 and the entry indexed by c's updated wavelength in W_d is increased by 1. By Lemma 14, we know that the number of wavelengths is no more than ω_i. Hence, any wavelength greater than ω_i can be adjusted to a wavelength with label less than or equal to ω_i, which is done by Step 6. Step 6 finds a free wavelength by looking at the wavelength conflict array and assigns the index of an entry with value 0 as the new wavelength of each adjusted connection. Step 6 is immediately followed by Step 7, which updates the wavelength conflict array of each adjusted connection so that every connection always maintains up-to-date wavelength conflict information. Figure 6.3(a) shows how our routing and wavelength assignment algorithm works step by step and also shows the corresponding wavelength conflict graph G_ω generated in each iteration, where every connection is represented by a circle labeled by its corresponding input. In G_ω, conflicting connections are connected by edges, and the wavelengths assigned in each iteration are shown beside the circles. Figure 6.3(b) shows the final routing and wavelength assignment for π in B(8), where 6 wavelengths are used. □
6.4 Implementation on Realistic Multiprocessor Systems
The presented algorithms, which run on a completely connected multiprocessor system, can be easily transformed into algorithms on more realistic multiprocessor systems. As an example, in this section we show how to implement our algorithms on a hypercube of N/2 PEs such that any (partial) permutation can be routed without crosstalk in a WRSS BL(N) and a WRSR B(N) in O(log^4 N) time.
In our presentation, the Benes network B(N) is the back-to-back concatenation of two BL(N)'s. Butterfly networks are in the family of the hypercube, as discussed in Subsection 1.3.1 of Chapter 1. Since each PE can communicate with at most one other PE in every communication step of our algorithms, in the following we show how to implement one communication step of a completely connected multiprocessor system of N PEs by a set of one-to-one communications on a hypercube H(N/2), in which each PE is responsible for a pair of connections i and ī.
The time complexity of our routing and wavelength assignment algorithm for a WRSS BL(N) depends on the edge coloring algorithm, which can be implemented in O(log^3 N) time on H(N/2) [54]. Thus, the routing and wavelength assignment algorithm for a WRSS BL(N) takes O(log^4 N) time on H(N/2).
Considering our routing and wavelength assignment algorithm for a WRSR B(N), we can see that the total time for routing on H(N/2) only depends on the decomposition algorithm [53], which can be implemented in O(log^3 N) time on H(N/2), since each pointer jumping step on a completely connected multiprocessor system can be implemented on H(N/2) by a sorting operation, which takes O(log^2 N) time. Consequently, the routing on H(N/2) takes O(log^4 N) time. For wavelength assignment, communications among PEs only occur in Step 5 and Step 7, in which PE_c needs to talk to PE_d if d is recorded in C_c (see the "←" operations in Algorithm 4). Fortunately, all conflicting connections of c are recorded in the connection conflict array C_c in the order of the SEs through which c passes from both sides, i.e., from a pair of outside stages i and 2 log N − 2 − i towards the center stage, stage log N − 1. Thus, these conflicting connections can be located using this ordering via the interstage connections in B(N). Since the interstage interconnection pattern between stage i (resp. 2 log N − 2 − i) and stage i + 1 (resp. 2 log N − 3 − i) of B(N) corresponds to the (log(N/2) − i)-dimension edges of H(N/2), the communication ordering defined by the connection conflict arrays directly corresponds to a classic hypercube communication technique called dimension ordering. Thus, the total time for wavelength assignment on H(N/2) remains unchanged. Therefore, when our routing and wavelength assignment algorithm for a WRSR B(N) is implemented on H(N/2), it has a slowdown factor of O(log N) and its time complexity is O(log^4 N).
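As an illustration of dimension ordering (a standard hypercube technique; this sketch is ours and not part of the dissertation's algorithm), the following computes, for each step, every PE's partner across one hypercube dimension; over log(N/2) steps each PE talks to each of its neighbors exactly once.

```python
# Sketch: dimension-ordered communication schedule on a hypercube H(P),
# P = 2^d. In each step, every PE exchanges with its neighbor across one
# dimension; neighbors along dimension `dim` differ in exactly bit `dim`.
def dimension_ordered_partners(P):
    d = P.bit_length() - 1                 # number of hypercube dimensions
    schedule = []
    for dim in reversed(range(d)):         # highest dimension first
        step = {pe: pe ^ (1 << dim) for pe in range(P)}
        schedule.append(step)
    return schedule
```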
6.5 Summary
In this chapter, we studied the crosstalk problem in OMINs using the wavelength dilation approach. We proposed parallel routing and wavelength assignment algorithms to route a partial permutation in optical WRSS Banyan networks and WRSR Benes networks so that there is no crosstalk in these networks. An arbitrary partial permutation can be routed without crosstalk in a WRSS BL(N) in O(log^2 N) time using at most 2⌊(log N + 1)/2⌋ wavelengths, and in a WRSR B(N) with only basic SEs in O(log^3 N) time using at most 2 log N wavelengths, on a completely connected multiprocessor system with N PEs. The proposed algorithms, which run on a completely connected multiprocessor system, can be easily transformed into algorithms on more realistic multiprocessor systems. For example, our routing and wavelength assignment algorithms for a WRSS BL(N) and a WRSR B(N) take O(log^4 N) time on a hypercube with N/2 PEs.
CHAPTER 7
PARALLEL ROUTING ALGORITHMS FOR GROUP CONNECTORS
7.1 Introduction
Recently, a new class of interconnection networks called group connectors was proposed in [101]. A group connector G(N, g) is defined as an interconnection network that consists of N inputs and N outputs such that (1) its N outputs are divided into g output groups with N/g functionally equivalent outputs in each group; and (2) it can provide any simultaneous (N/g)-to-one connections from the N inputs to the g output groups, possibly without the ability to distinguish the order of outputs within each group. Another type of N × N group connector, G′(N, g), can be defined by dividing its N inputs and N outputs into g equal-size groups each. For G′(N, g), if the inputs in the same input group are allowed to be connected to outputs in different output groups, G′(N, g) and G(N, g) are the same in functionality; otherwise, g separate planes of N/g × N/g networks can be used to implement G′(N, g). Figure 7.1(a) and (b) illustrate the block diagrams of G(8, 4) and G′(8, 4), respectively.
[Figure content omitted.]
Figure 7.1. Block diagrams of a group connector: (a) G(8, 4); (b) G′(8, 4)
Group connectors have many applications. In general, a group connector G(N, g) captures the simultaneous connections between N clients and N servers, where the servers are divided into g equal-size server groups such that the N/g servers in each group are functionally equivalent. A group connector G(N, g) can also be viewed as a g × g permutation network with an internal speedup factor of N/g achieved by space-division multiplexing [98].
Group connectors are particularly useful in dense wavelength-division multiplexing (DWDM) networks. With DWDM, it is now possible to transmit different wavelengths of light over the same fiber, which has provided another dimension for increasing bandwidth capacity. A group connector can be used as a switching network in a DWDM router. For example, if some inputs and one or more groups of outputs are connected to a local node, a group connector can be used as an add-drop cross-connect switching matrix. Group connectors can also be used in the construction of ingress edge routers of DWDM networks. An ingress edge router in a DWDM optical network has a set of N electrical or optical input links and a set of g optical output links. Each optical output link i consists of a set of N/g data channels Ch_{i,1}, ..., Ch_{i,N/g}, each using a different wavelength. Associated with each input link there is an input line card (ILC), and associated with each output link there is an output line card (OLC). A switching matrix M sits between the ILCs and OLCs, and N/g connections run from the output of M to each OLC. The main function of each ILC is to route input packets to appropriate OLCs by routing table lookup. Each OLC transmits the received packets over the N/g optical channels of the link it controls. The block diagram of a DWDM ingress edge router is shown in Figure 7.2. A group connector G(N, g) serves as the major switching matrix M in the design of ingress edge routers of a burst-switched DWDM network [102].
[Figure content omitted: N ILCs connected through switching matrix M to g OLCs, each OLC driving channels Ch_{i,1}, ..., Ch_{i,N/g}.]
Figure 7.2. Block diagram of an ingress edge router
As discussed in Subsection 1.2.3 of Chapter 1, an interconnection network is rearrangeable nonblocking if it can realize all possible permutations between inputs
and outputs when the rearrangement of existing connections is permitted. Similarly, a group connector is rearrangeable nonblocking if it can realize all possible connections between the inputs and output groups when the rearrangement of existing connections is permitted. It has been shown [101] that the group connectors based on the Benes network, called Benes group connectors and denoted by GB(N, n), and the group connectors based on the 3-stage Clos network, called Clos group connectors and denoted by GC(m, n, r), are both rearrangeable nonblocking. Rearrangeable nonblocking networks, including Benes networks, Benes group connectors, 3-stage Clos networks and Clos group connectors, are very attractive for fixed-size cell switching architectures. In such a switch, variable-length packets are segmented into cells upon arrival, transferred across the switch matrix, and then reassembled before they depart. Using fixed-size cells allows for slotted switching, which makes it easier for the scheduler to configure the switch matrix for high throughput.
When a group connector is used as a switching matrix in a high-speed packet router/switch, packet-forwarding speed is crucial. Several factors affect packet-forwarding speed: routing (label) table lookup, switch scheduling, switch routing and switch internal transmission. For group connectors, the implementations of switch scheduling and switch routing are of particular importance.
In this chapter, we address the issue of how to quickly set up SEs so that K (≤ N) conflict-free paths between inputs and output groups can be established in a group connector G(N, g). In particular, we present a parallel algorithm, named ROUTE, and its variations, for the setup of a group connector with K connection requests. In this context, the Benes network B(N) is a special case of the Benes group connector GB(N, n) with n = N. Thus, our algorithms can be applied to Benes networks directly. Given any permutation, all known best sequential algorithms for setting up B(N) take O(N log N) time [45, 66, 93], and the best time complexity of parallel algorithms is O(log^2 N) [49, 62]. Given any partial permutation with O(K) connection requests, parallel algorithms that set up the Benes network in O(log^2 N) time and in O(log^2 K + log N) time were proposed in [46] and [43], respectively. Our main algorithm ROUTE extends the algorithm of [43] to set up a Benes group connector for a non-maximum mapping between inputs and output groups. Like the algorithm of [45], our algorithm sets up the SEs in the first log N − 1 stages of GB(N, n) so that the SEs in the remaining stages can be set up by self-routing [44]. On the other hand, given any non-maximum mapping with O(K) connection requests, our algorithm runs in O(log^2 K + log N) time on a completely connected computer or the EREW PRAM model with N processing elements (PEs), as does the algorithm of [43]. When it is implemented on a perfect shuffle computer or a hypercube of N PEs, O(log^4 K + log^2 K · log N) time is sufficient. Our algorithm ROUTE combines the advantages of the algorithms of [43] and [45, 62]. For a Clos group connector GC(m, n, r) with O(K) busy inputs, by using the decomposition technique in [49], our algorithm ROUTE_CLOS can determine the switch setting in O(log K log m) time if m is an integral power of two, and in O(log^2 K log m) time otherwise, on a completely connected computer or the EREW PRAM model with N PEs.
The rest of the chapter is organized as follows. Section 7.2 introduces definitions and notations. In Section 7.3, we develop a parallel routing algorithm ROUTE for Benes group connectors. Section 7.4 extends algorithm ROUTE to ROUTE_CLOS for setting up the connections in Clos group connectors. Section 7.5 discusses the implementation of our algorithms on more realistic parallel machine models and the hardware redundancy of group connectors. Section 7.6 summarizes the chapter.
7.2 Preliminaries
Let I, O and G be the sets of N inputs, N outputs and g output groups of G(N, g), respectively. Let φ : I → G be an I/G mapping that indicates connection requests from inputs to output groups. If there is a connection request from I_i to G_j, set φ(i) = j and call I_i a busy input; otherwise set φ(i) = −1 and call I_i an idle input. An I/G mapping from I to G is legal if each input is mapped to at most one output group and at most N/g different inputs are mapped to the same output group. When a group connector is used as a switching matrix, legal mappings can be enforced by using arbitration hardware [97, 103]. A legal I/G mapping is maximum if all inputs are busy, and non-maximum otherwise. We denote a legal I/G mapping involving K busy inputs as φ|K. Clearly, if K = N, φ|N is a legal maximum I/G mapping. A group connector G(N, g) has a feasible configuration for a given φ|K if all SEs can be set up so that there are K conflict-free paths connecting the busy inputs to their output groups. Thus, G(N, g) is rearrangeable nonblocking if it has a feasible configuration for any φ|K.
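As a concrete restatement of legality (a hypothetical helper of our own, not part of the dissertation), a mapping φ, encoded as a list with −1 for idle inputs, can be checked as follows:

```python
# Sketch: check that an I/G mapping phi is legal for G(N, g).
# phi[i] is the output group of input i, or -1 if input i is idle;
# at most N//g busy inputs may be mapped to any one output group.
from collections import Counter

def is_legal_mapping(phi, N, g):
    assert len(phi) == N and N % g == 0
    if any(not (-1 <= j < g) for j in phi):
        return False
    load = Counter(j for j in phi if j != -1)   # busy inputs per group
    return all(c <= N // g for c in load.values())
```

For instance, the mapping used in Example 15 later in this chapter is legal for G(16, 4), since no output group is requested by more than N/g = 4 busy inputs.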
7.3 Parallel Routing for Benes Group Connectors
In this section, we develop a fast parallel routing algorithm ROUTE for Benes group connectors.
7.3.1 Structure of GB(N;n)
A Benes group connector GB(N;n) with N inputs and n = N=2k output groups
is constructed from a Benes network B(N) by permanently setting all inputs in its
135
last k stages straight, which leads to eliminating these SEs (see Figure 7.3 for an
example).
[Figure content omitted: inputs 0-15, stages 0-4, output groups 0-3, with the 1-, 2-, and 3-level subnetworks marked.]
Figure 7.3. A Benes group connector GB(16, 4) with k = 2
An L-level subnetwork (0 ≤ L ≤ log N − 1) of a Benes group connector GB(N, n) is defined as a Benes group connector GB(H, h) with H = N/2^L and h = min{H, n}. Thus, a GB(N, n) contains 2^L L-level subnetworks. Clearly, the 0-level subnetwork of GB(N, n) is GB(N, n) itself. Figure 7.3 shows a GB(16, 4), which contains two 1-level subnetworks GB(8, 4), four 2-level subnetworks GB(4, 4), and eight 3-level subnetworks GB(2, 2). All SEs in the first (resp. last) stage of a subnetwork GB(H, h) are called input (resp. output) SEs of GB(H, h).
For a GB(N, n), we label the N inputs/outputs as 0, 1, ..., N − 1, the n output groups as 0, ..., n − 1, and the N/2 SEs in each stage as 0, ..., N/2 − 1 from top to bottom; the 2 log N − 1 stages are indexed 0 through 2 log N − 2 from left to right. Denote the inputs, outputs and output groups of GB(N, n) by I_i, O_i, and G_j respectively, where 0 ≤ i ≤ N − 1 and 0 ≤ j ≤ n − 1. Thus, O_i connects to G_j, where j = i mod n, in the last stage of GB(N, n), and every G_j consists of O_{jN/n+l}, 0 ≤ l ≤ N/n − 1. If an input SE of GB(N, n) has 2 (resp. 1, 0) busy inputs, it is called a busy (resp. semi-busy, idle) input SE.
7.3.2 Graph Model of GB(N, n)
Each GB(N, n) with a legal mapping φ|K can be represented as a graph G as follows:
Case 1: N = n. The vertex set V(G) = {v | v is an input SE or an output SE} and the edge set E(G) = {(v, w, i) | there is a busy input i of input SE v with φ(i) being an output of output SE w}.
Case 2: N > n. The vertex set V(G) = {v | v is an input SE or an output group} and the edge set E(G) = {(v, w, i) | there is a busy input i of input SE v with φ(i) = w}.
It is clear that G is a bipartite graph in both cases, with all input SEs as one part A and all output SEs (Case 1) or all output groups (Case 2) as the other part B. Each edge in G corresponds to a pair consisting of an input and its mapped output group. There is a one-to-one correspondence between the busy inputs of GB(N, n) and the edges of G. We label each edge by its corresponding busy input; hence, we can interchange the notions of an edge and its corresponding input. If an edge is the end edge of a path, we say its labeled input is the end input of the path; if two edges are adjacent, we say their labeled inputs are adjacent; if an edge is colored with some color, we say its labeled input is colored with that color.
In the following theorem, Theorem 17, we show that a GB(N, n) with a legal I/G mapping φ|K has a feasible configuration, which is done by showing that G has an equitable 2-edge coloring. The proof of Theorem 17 not only shows that the Benes group connector is rearrangeable nonblocking, but also implies a sequential algorithm, which we will implement in parallel in the next section, to set up the SEs for any legal I/G mapping of a Benes group connector.
Theorem 17 Given any legal I/G mapping φ|K of a Benes group connector GB(N, n), GB(N, n) has a feasible configuration.
Proof. Let GB(H, h) be an L-level subnetwork of GB(N, n). The proof is by induction on L. If L = log N − 1, GB(H, h) is GB(2, 2) or GB(2, 1), which consists of a single node (i.e., a single 2 × 2 SE), and the claim is obviously true. Assume that the claim is true for any L-level (0 < L ≤ log N − 1) subnetwork GB(H, h). For any legal I/G mapping of GB(N, n), we know that GB(N, n) (i.e., the 0-level subnetwork) can be represented as a bipartite graph G. We first prove that G has an equitable 2-edge coloring. Since G is bipartite, G does not contain any odd cycle, i.e., any cycle in G has an even number of edges [8]. So E(G) is the union of a set of even cycles and paths. Thus, we can alternately color the edges with one of two different colors, beginning with any busy input, along each even cycle or path, so that adjacent edges on the same cycle or path have different colors. We know that every vertex in part A has degree ≤ 2, since each input SE has at most 2 busy inputs, and every vertex in part B has degree ≤ 2 for Case 1 or ≤ 2^k for Case 2, since each output SE has at most 2 outputs and each output group is mapped to by at most 2^k busy inputs, respectively. Thus, if a vertex has degree d, then ⌈d/2⌉ of its adjacent edges are colored with one color and ⌊d/2⌋ are colored with the other. Therefore G has an equitable 2-edge coloring.
Next, we show that there is a feasible configuration of the Benes group connector GB(N, n) if its graph model G has an equitable 2-edge coloring. We let the two ends (i.e., an input and its mapped output group) of the edges with the same color connect with the same 1-level subnetwork. By the definition of the equitable 2-edge coloring, this setting of SEs satisfies the mapping constraints for the 0-level subnetwork GB(N, n). Since every pair consisting of an input and its mapped output group is connected with the same 1-level subnetwork, it generates two legal I/G mappings for the two 1-level subnetworks of GB(N, n). By induction, each of the 1-level subnetworks has a feasible configuration. Therefore, GB(N, n) has a feasible configuration. □
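The alternating-coloring argument in the proof can be sketched in code for the simple case where every vertex has degree at most 2 (so that E(G) is literally a union of paths and even cycles); this is an illustrative sketch of ours, not the parallel algorithm of Section 5.3.

```python
# Sketch: equitable 2-edge coloring of a bipartite (multi)graph whose
# vertices all have degree <= 2. We start walks at path ends first, then
# sweep remaining (cycle) components, alternating colors 0/1 along each.
from collections import defaultdict

def equitable_2_edge_coloring(edges):
    # edges: list of (u, v) pairs; the two parts must use disjoint labels.
    incident = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        incident[v].append(idx)
    color = [None] * len(edges)
    starts = [v for v, ids in incident.items() if len(ids) == 1]  # path ends
    starts += list(incident)                                      # then cycles
    for s in starts:
        for idx in incident[s]:
            if color[idx] is not None:
                continue
            c, v, e = 0, s, idx
            while e is not None and color[e] is None:
                color[e] = c
                c = 1 - c
                v = edges[e][1] if edges[e][0] == v else edges[e][0]
                nxt = [j for j in incident[v] if color[j] is None]
                e = nxt[0] if nxt else None
    return color
```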
We define the mapping constraints as follows: for any L-level subnetwork GB(H, h) of GB(N, n) (0 ≤ L ≤ log N − 2),
(1) every busy input of the input SEs and its mapped output group are connected with the same (L + 1)-level subnetwork;
(2) two dual inputs (outputs) are connected with two different (L + 1)-level subnetworks; and
(3) if H > h, the busy inputs of GB(H, h) mapped to the same output group are partitioned into two parts, each of size ≤ H/(2h), which are connected with different (L + 1)-level subnetworks GB(H/2, h); otherwise (i.e., H = h), two inputs mapped to two dual outputs must be connected through two different GB(H/2, H/2) subnetworks.
By Theorem 17 and the topology of GB(N, n), we have the following corollary.
Corollary 5 Given any legal I/G mapping φ|K, a setting G of the SEs is a feasible configuration of GB(N, n) if and only if G satisfies the mapping constraints.
7.3.3 Algorithm for GB(N, n)
For a legal I/G mapping φ|K, each busy input I_i is specified to be connected to a unique output group G_j. We consider the operation of setting up K link-disjoint paths from busy inputs to their mapped output groups as a routing process, and an algorithm for establishing I/G connections as a routing algorithm. Our routing algorithm is based on the sequential algorithm of [45] and the parallel algorithms of [43, 62] for routing a permutation in a Benes network.
A GB(N, n) consists of 2 log N − 1 − k stages. It can be regarded as a concatenation of two parts, P1 being the first log N − 1 stages, and P2 being the remaining stages. For each busy input I_i of P2 in stage s of GB(N, n), we define its control bit to be φ(i)_{2 log N − 2 − k − s}, i.e., bit 2 log N − 2 − k − s of the binary representation of φ(i). The control bit is used for self-routing: if the control bit of a busy upper input is 0 (resp. 1), then this busy input is set straight (resp. cross), and if the control bit of a busy lower input is 0 (resp. 1), then this busy input is set cross (resp. straight).
Lemma 15 If a Benes group connector GB(N, n) has a feasible configuration for a legal I/G mapping φ|K, then all busy inputs of P2 can be set up by self-routing.
Proof. Since GB(N, n) has a feasible configuration, two dual outputs of an output SE in the last stage differ only in the first bit (see Figure 7.4 for an example, where C, x ∈ {0, 1}). Thus, if the control bit is 0 for a busy upper input or 1 for a busy lower input, this input is set straight, and if the control bit is 1 for a busy upper input or 0 for a busy lower input, this input is set cross. In general, if GB(N, n) has a feasible configuration, then, for two dual outputs of an SE in stage s (log N − 1 ≤ s ≤ 2 log N − k − 2), the (2 log N − 1 − k − s)-th bit must equal 0 for the upper one and 1 for the lower one. Therefore, according to the control bit φ(i)_{2 log N − 2 − k − s} of busy input i, we can set up the SEs in P2 by self-routing. □
[Figure content omitted: stages 0-4 of G(16, 4) annotated with output-address bit patterns of the forms 00Cx, 000x/001x, and 000C.]
Figure 7.4. Hardware redundancy of P1 and control bit selection of P2 in G(16, 4)
If we define the P2-passable condition of GB(N, n) as (2) and (3) of the mapping constraints, then by Corollary 5 and Lemma 15 we have the following claim:
Theorem 18 For any legal I/G mapping φ|K of GB(N, n), if P1 is set up to satisfy the P2-passable condition and P2 is set up by self-routing, then the setting of GB(N, n) is a feasible configuration.
Our algorithm, named ROUTE, is presented for a parallel computer with N PEs that are completely connected. The N PEs are labeled 0, 1, ..., N − 1. Algorithm ROUTE consists of two phases: PHASE I and PHASE II. In PHASE I, the settings of all SEs in the first log N − 1 stages are determined. In PHASE II, self-routing is performed for the SEs in the remaining stages. Conceptually, PHASE I consists of log N − 1 iterations and each iteration contains 4 steps. In the i-th iteration, 1 ≤ i ≤ k, the settings of the busy inputs of stage i − 1 are determined in the following way: we consider the 2^{i−1} independent (i − 1)-level subnetworks so that the P2-passable condition is satisfied in each subnetwork. In the (k + 1)-th iteration, we encounter 2^k independent B(n) routing problems. Then, for each such problem, our algorithm degenerates to a parallel routing algorithm based on the sequential algorithm of [45]. PHASE II is the self-routing process. Since the P2-passable condition is satisfied after PHASE I, self-routing for P2 is always possible by Theorem 18. The basic structure of algorithm ROUTE is given as follows:
Algorithm 5 ROUTE
Input: a legal mapping φ|K for GB(N, n)
Output: a feasible configuration of GB(N, n)
PHASE I: set up the busy inputs in P1 in log N − 1 iterations;
PHASE II: set up the busy inputs in P2 by self-routing.
In PHASE 1, in order to satisfying P2-passable condition of GB(N;n), by
the proof of Theorem 17, we need to give graph model G an equitable 2-edge coloring
and let the inputs with the same color connect to the same subnetwork of next level.
A parallel algorithm for �nding an equitable 2-edge coloring for a bipartite graph
can be found in Section 5.3 of Chapter 5.
Our algorithm for PHASE II is presented in such a way that all PEs participate in the routing process for P2 of GB(N, n). In fact, once the settings of the SEs in P1 are determined, cells can be injected into GB(N, n); when the cells reach P2, each SE can determine its own setting by inspecting its control bits. For an integer x with binary representation b_v b_{v−1} ... b_1 b_0, we use x_l to denote its (l + 1)-th significant bit b_l, and x^(l) to denote the integer with binary representation b_v b_{v−1} ... (1 − b_l) ... b_1 b_0, i.e. x with bit l complemented.
Algorithm 6 SelfRout
Input: N, k, Γ
Output: settings of the busy inputs in P2
for s = logN − 1 to 2logN − k − 2 do
    s' := 2logN − 2 − s;
    for all PE_i, 0 ≤ i ≤ N − 1, in parallel do
        if (i is even and (Γ(i))_{s'−k} = 0) or (i is odd and (Γ(i))_{s'−k} = 1) then
            set input i of stage s to straight;
        else
            set input i of stage s to cross.
        end if
    end for
end for
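The self-routing phase fits in a few lines of ordinary code. The sketch below follows Algorithm SelfRout with the bit indexing described above; the function name and the encoding of idle inputs as −1 are assumptions made for illustration:

```python
def self_rout(N, k, gamma):
    """Sketch of Algorithm SelfRout: in each stage s of P2, input i is
    set straight or cross by inspecting one bit of its destination
    group gamma[i] (gamma[i] == -1 marks an idle input)."""
    logN = N.bit_length() - 1
    setting = {}
    for s in range(logN - 1, 2 * logN - k - 1):   # stages logN-1 .. 2logN-k-2
        sp = 2 * logN - 2 - s                     # mirrored stage index s'
        for i in range(N):                        # one PE per input, conceptually parallel
            if gamma[i] < 0:
                continue                          # idle input: setting is a don't-care
            bit = (gamma[i] >> (sp - k)) & 1      # the bit (Gamma(i))_{s'-k}
            straight = (i % 2 == 0) == (bit == 0)
            setting[(s, i)] = 'straight' if straight else 'cross'
    return setting
```

For GB(16, 4) (so k = 2), the loop visits stages 3 and 4, matching the logN − k stages of P2.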
7.3.4 Analysis and Example
In PHASE I, since the length of a cycle or path is at most K, we need O(logK) time per iteration to find an equitable 2-edge coloring during the first logK iterations. Because the number of busy inputs connecting to the same subnetwork of the next level is halved after each iteration, each iteration of PHASE I takes only O(1) time after the first O(logK) iterations. Thus, the total time for PHASE I is O(log^2 K + logN). There are logN − k iterations in PHASE II, each taking O(1) time. Therefore, the total time complexity of Algorithm ROUTE is O(log^2 K + logN), and we have the following claim:
Theorem 19 For any legal I/G mapping Γ|K, Algorithm ROUTE correctly computes a feasible configuration of GB(N, n) in O(log^2 K + logN) time on a completely connected parallel computer or the EREW PRAM model using N PEs.
Example 15 The parallel algorithm ROUTE sets up the busy inputs in the first stage of GB(16, 4) for a legal I/G mapping

i    : 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Γ(i) : 1  -1 1  0  2  3  -1 1  3  2  0  3  2  0  -1 2

using the equitable 2-edge coloring technique.
After PHASE I of our algorithm, the SEs in the first stage of GB(16, 4) are set up as shown in Figure 7.5, and two new mappings are given to the inputs of the input SEs in the two subnetworks of the next level:

i    : 0  1  2  3  4  5  6  7
Γ(i) : 1  0  2  -1 3  3  0  2

and

i    : 8  9  10 11 12 13 14 15
Γ(i) : -1 1  3  1  2  0  2  -1
[Figure 7.5 here: the 16 inputs on the left feed the first-stage SEs, whose outputs are split between the upper and lower 1-level subnetworks leading to the output groups.]

Figure 7.5. The settings of the SEs in the first stage of GB(16, 4) according to the equitable 2-edge coloring.
7.4 Parallel Routing for Clos Group Connectors

A Clos group connector GC(m, n, r) is constructed from the three-stage Clos network by replacing the third stage with the fat-and-slim concentrators proposed in [67]. As shown in Figure 7.6(a), a Clos group connector GC(m, n, r) consists of r m × n SMs in the first stage, n r × r SMs in the second stage, and r n × m SMs in the third stage, with N = mr inputs and r output groups. The SMs in the first two stages are implemented by crossbar networks, and the SMs in the last stage are implemented by concentrators.
[Figure 7.6 here. Panel (a): r m×n crossbars in the first stage, n r×r crossbars in the second stage, and r n×m concentrators in the third stage, with the inputs on the left and output groups g1, ..., gr on the right. Panel (b): the two-stage connector, with r m×m crossbars in the first stage and m r×r crossbars in the second stage.]

Figure 7.6. Construction of a Clos group connector: (a) a 3-stage Clos group connector GC(m, n, r); (b) a 2-stage Clos group connector GC(m, m, r).
Similar to the Benes group connector, we model a Clos group connector GC(m, n, r) with a mapping Γ|K, which maps K(≤ N) busy inputs to r output groups, as a graph ~G with vertex set V(~G) = {v | v is an SM in the first stage or an output group} and edge set E(~G) = {e = vw | there is an input i of an input SM v with Γ(i) = w}. Clearly, ~G is a bipartite graph with all SMs in the first stage as one part and all output groups as the other part. Let Δ(~G) denote the maximum degree of ~G. Then Δ(~G) ≤ m, since each SM in the first stage has at most m busy inputs and each output group has at most m busy inputs mapped to it. By the topology of Clos group connectors, every output of each SM in the first stage is connected to a different SM in the middle stage, and every output of each SM in the second stage is connected to a different output group. Because each SM, being a crossbar, is nonblocking, setting up GC(m, n, r) amounts to routing all outputs of every SM in the first stage to different SMs in the second stage so that the inputs connected to the same SM in the second stage are mapped to different output groups. Hence, if we can color E(~G) with Δ(~G) colors, then we can set up the SMs of GC(m, n, r) by connecting the busy inputs corresponding to edges of different colors to different SMs in the second stage. Since the number of SMs in the second stage is n ≥ m, there is a set of edge-disjoint paths from the inputs to the outputs connecting input i to output group Γ(i) for 1 ≤ i ≤ m·r. Therefore, we can apply the routing approach of the Benes group connector to the Clos group connector GC(m, n, r). The basic strategy of the algorithm for the Clos group connector, denoted ROUTE_CLOS, is similar to [49]: first reduce the case of arbitrary m, the number of inputs of each SM in the first stage, to the case in which m is an integral power of two, and then recursively decompose the original d-edge coloring into two d/2-edge colorings (4 ≤ d ≤ m), so that we can apply algorithms similar to those in Section 7.3.3 to an N/2-vertex bipartite graph with maximum degree 2. For brevity, the detailed implementation is omitted. In summary, we have the following claim:
Theorem 20 For any legal I/G mapping Γ|K, Algorithm ROUTE_CLOS correctly computes a feasible configuration of GC(m, n, r) in O(logK logm) time if m is an integral power of two, and in O(log^2 K logm) time otherwise, on a completely connected parallel computer or the EREW PRAM model using N PEs.
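The degree bound that drives ROUTE_CLOS can be checked directly from a mapping. The sketch below (the function name and the −1 idle-input convention are assumptions) builds the bipartite demand multigraph ~G of GC(m, n, r) and returns Δ(~G), which must not exceed m for n ≥ m middle-stage SMs to suffice:

```python
from collections import Counter

def clos_demand_max_degree(m, r, gamma):
    """Build the demand multigraph ~G of GC(m, n, r) from a legal
    mapping gamma (gamma[i] = output group of busy input i, -1 if
    idle) and return its maximum degree Delta(~G)."""
    sm_deg = Counter()        # degree of each first-stage SM
    grp_deg = Counter()       # degree of each output group
    for i, g in enumerate(gamma):
        if g < 0:
            continue
        sm_deg[i // m] += 1   # input i enters first-stage SM number i // m
        grp_deg[g] += 1
    degrees = list(sm_deg.values()) + list(grp_deg.values())
    return max(degrees, default=0)
```

An edge coloring of ~G with Δ(~G) colors then names, for each busy input, the middle-stage SM it should be routed through.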
7.5 Generalizations and Hardware Redundancy

The parallel machine model we used is not realistic. However, our algorithms can be converted to fit any realistic machine model, since a parallel operation involving interprocessor communication can be achieved by sorting. Let S(N) be the time for sorting N elements on a parallel machine M with N processors. Then, as with the algorithm for routing in the Benes network B(N) of [62], Algorithms ROUTE and ROUTE_CLOS can be implemented on such a machine with N PEs in no more than O(log^2 K · S(N) + logN · S(N)) and O(log^2 K logm · S(N)) time, respectively, where K is the number of busy inputs. For example, when implemented on parallel computers whose PEs are connected by perfect-shuffle or hypercube networks, Algorithm ROUTE takes O(log^4 K + log^2 K · logN) time.
Also, it is not difficult to see that the proposed parallel algorithm for GB(N, n) can set the dual inputs of the SEs indexed by (N/2^(s+1)) · j, where 0 ≤ s ≤ logN − 2 and j ∈ {0, ..., 2^s − 1}, in stage s to straight. Thus, these SEs can be eliminated, so the hardware redundancy in the first logN − 1 stages is

Σ_{i=0}^{logN−2} 2^i = 2^(logN−1) − 1 = N/2 − 1

SEs. Translated into crossing points, the number of saved crosspoints in the Benes group connector GB(N, N/2^k), compared with the Benes network B(N), increases to 2N·k + 2N − 4. For example, we can eliminate 1 + 2 + 4 = 7 SEs in the first 3 stages of GB(16, 4); compared with Figure 7.3, the Benes group connector in Figure 7.4 has a much lower hardware cost. When k = 0, in which case GB(N, n) is the Benes network B(N), the hardware redundancy achieved by our algorithm is the same as the number given in [93].
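The redundancy count is easy to verify programmatically; this small check (the function name is illustrative) reproduces the 1 + 2 + 4 = 7 SEs eliminated in the first three stages of GB(16, 4):

```python
def eliminated_ses(N):
    """SEs in the first logN - 1 stages that ROUTE always sets to
    straight: stage s contributes 2**s of them, so the total is
    sum_{s=0}^{logN-2} 2**s = N/2 - 1."""
    logN = N.bit_length() - 1
    per_stage = [2 ** s for s in range(logN - 1)]
    total = sum(per_stage)
    assert total == N // 2 - 1          # closed form from the text
    return per_stage, total
```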
For the three-stage Clos group connector, it has been shown [101] that n ≥ m is a sufficient condition for GC(m, n, r) to be rearrangeably nonblocking, and that this condition is also necessary. If we choose the minimum value of n, i.e. n = m, each concentrator in the last stage becomes an m × m concentrator. Thus, we can obtain a rearrangeably nonblocking two-stage group connector by removing all concentrators of the last stage, as shown in Figure 7.6(b).
7.6 Summary

We have introduced a class of interconnection networks, the Benes group connector and the Clos group connector, based on the Benes network and the 3-stage Clos network with reduced hardware redundancy, and we have designed fast parallel algorithms, based on graph coloring, for configuring these group connectors. Our algorithms can be implemented on various realistic parallel machine models at the cost of a multiplicative factor equal to the time for sorting N elements on the model. To our knowledge, none of the known algorithms for setting up the Benes network B(N) can be directly applied to set up Benes group connectors; however, by letting n = N, our algorithm for GB(N, n) can be directly applied to set up the Benes network B(N) with the same time complexity. All known algorithms for setting up Clos networks consider only full permutations [4, 10, 31, 49] and cannot be directly applied to set up Clos group connectors for non-maximum mappings. By letting K = N, our algorithm for GC(m, n, r) can be directly applied to set up 3-stage Clos networks.
CHAPTER 8
CONCLUDING REMARKS
A switching network plays a key role in communication networks. Nonblocking switching networks are favored whenever possible. The crosstalk-free requirement in photonic networks adds a new dimension of constraints to nonblockingness. Switching algorithms play a fundamental role in nonblocking networks, and any algorithm that requires more than linear time would be considered too slow for real-time applications. One remedy is to use multiple processors to route connections in parallel.

The design and analysis of efficient switching algorithms is one of the most active research areas in communication networks. Using parallel computing and processing techniques to improve the time complexity of switching algorithms poses great challenges. In this chapter, we summarize the major contributions of this dissertation and discuss directions for further research extending it in the switching area.
8.1 Contributions

One major contribution of this dissertation is the design and analysis of fast parallel routing and wavelength assignment algorithms for establishing connections in switching networks.

We studied a class of multistage nonblocking switching networks B(N, x, p, a), which contains the Banyan network, Benes network, and Cantor network as special cases. By modeling the routing problems for this class of networks as weak and strong edge colorings of bipartite graphs, we developed fast parallel routing algorithms that can route an arbitrary partial permutation with K(≤ N) connections in a rearrangeably nonblocking network B(N, x, p, a) in O((x + log p) logK + logN) time and in a strictly nonblocking network B(N, 0, p*, a) in O(log p* logK + p* log p*) time.
The crosstalk problem in photonic switching adds a new dimension of blocking for switching networks. We presented fast parallel routing and wavelength assignment algorithms to establish connections in photonic switching networks, using time, space, and wavelength dilation to avoid crosstalk.

We modeled the routing and wavelength assignment problems as graph coloring problems using combinatorial and graph-theoretic approaches. Using various parallel graph coloring techniques, such as edge coloring, vertex coloring, equitable coloring, and balanced coloring, we presented fast parallel routing algorithms for photonic switching.
Using the time dilation approach, we proposed a fast parallel decomposition algorithm with time complexity O(logN) to decompose a permutation into two semi-permutations that can be routed separately through an optical Benes network without crosstalk. The presented parallel crosstalk-free routing algorithm can set up any permutation in an optical B(N) in O(log^2 N) time. This decomposition algorithm can be extended to set up any partial permutation with K(< N) connections in an optical B(N) in O(log^2 K + logN) time.
Using the space dilation approach, crosstalk can be avoided by increasing the number of SEs in photonic switching networks. We developed sublinear-time parallel routing algorithms for the class of networks constructed from Banyan-type networks by horizontal concatenation of extra stages and/or vertical stacking of multiple copies.
Using the wavelength dilation approach, we presented fast parallel routing and wavelength assignment algorithms to route connections in optical WRSS Banyan networks and WRSR Benes networks so that the connections passing through the same SE have different wavelengths. An arbitrary partial permutation can be routed without crosstalk in a WRSS BL(N) in O(log^2 N) time using at most 2^⌊(logN+1)/2⌋ wavelengths, and in a WRSR B(N) with only basic SEs in O(log^3 N) time using at most 2 logN wavelengths.
The presented algorithms, which run on a completely connected multiprocessor system, can be easily transformed into algorithms on more realistic multiprocessor systems. For example, the proposed algorithms for setting up connections in B(N, x, p, a) have a slow-down factor of O(log^2 N) on a Banyan-type multiprocessor system whose complexity is no larger than one plane of B(N, x, p, a); the decomposition algorithm and the routing algorithms for the optical Benes network can be implemented in O(log^3 N) and O(log^4 N) time, respectively, on a hypercube; and the routing and wavelength assignment algorithms for a WRSS BL(N) and a WRSR B(N) take O(log^4 N) time on a hypercube with N/2 PEs.
Another major contribution of this dissertation is the design and analysis of fast parallel stable matching and acyclic stable matching algorithms for switch scheduling.
For the stable matching problem, we proposed a new approach, parallel iterative improvement (PII), which treats the problem as an optimization problem, and presented a particular PII algorithm based on this approach. Using techniques such as randomization and greedy selection, the PII algorithm, as experimental evaluations show, has better average performance than the classical stable matching algorithms and converges in a linear number of iterations with high probability. Due to the non-uniqueness of random selection, the stable matchings generated by the PII algorithm provide more fairness. The PII algorithm can also be stopped at any time with a "near-stable" matching as output, satisfying the time constraints of real-time applications. In addition, the PII algorithm can be easily implemented on realistic parallel computing models such as the hypercube, mesh of trees, and array with multiple broadcasting buses, with at most a logarithmic-time slow-down factor.
For the acyclic stable matching problem, we modeled it as a dominating set problem on a rooted dependency graph and then proposed a parallel algorithm for finding the dominating set. For any instance of the acyclic stable matching problem, our acyclic stable matching algorithm finds a stable matching in O(N logN) time, while the classical stable matching algorithm needs O(N^2) time. Simulation results show that a scheduler based on our acyclic stable matching algorithm is feasible for high-speed implementation using current CMOS technologies.
The design of low-cost, high-speed, and large-capacity nonblocking switching architectures is also a contribution of this dissertation.

Scalable nonblocking switching networks tend to have no self-routing capability. For example, although self-routing capabilities exist in a portion of a nonblocking switching network B(N, x, p, a), its routing is still computation intensive. By studying the connection capacity of Banyan-type networks, we proposed a new class of self-routing strictly nonblocking networks, T(N, ·). Compared with existing strictly nonblocking self-routing networks, T(N, ·) has lower hardware cost, a shorter connection diameter, and a much smaller number of required wavelengths. Consequently, these networks are more feasible for implementation, with reduced optical signal attenuation and crosstalk.
We have introduced a class of interconnection networks, the Benes group connector and the Clos group connector, based on the Benes network and the 3-stage Clos network with reduced hardware redundancy, and we have designed fast parallel routing algorithms for these group connectors. We also showed that, with our routing algorithms, the hardware of Benes group connectors can be reduced further.

Most of the results discussed in this dissertation have been reported to the research community through the publications [50]-[55].
8.2 Future Work

Efficient switching algorithms and switching architectures directly affect the performance of communication networks, and more work needs to be done in this area. In the following, we suggest some directions for possible further research extending this dissertation.

Multicasting is an important feature for any switching network intended to support broadband integrated services digital networks (B-ISDN). We expect that our routing algorithms can be extended, with some modifications, to support multicasting in switching networks.
With wavelength-division multiplexing (WDM) technology, the concepts of SNB and RNB in space-division switching can be extended to wavelength-division switching. Depending on whether wavelengths can be reassigned, this extension results in four combinations: wavelength-rearrangeable space-rearrangeable (WRSR), wavelength-rearrangeable space-strict-sense (WRSS), wavelength-strict-sense space-rearrangeable (WSSR), and wavelength-strict-sense space-strict-sense (WSSS). It has been shown that, by using both wavelength and space multiplexing in a fully dynamic manner, networks can achieve higher bandwidth and higher connectivity. It is worthwhile to investigate the connection capacities of these networks and to design efficient routing and wavelength assignment algorithms for them.
Scheduling algorithms based on stable matchings have been shown to provide QoS guarantees. It is desirable to seek new approaches that further reduce the time complexities of solutions to the stable matching and acyclic stable matching problems.
In the design of a switching network, in addition to the hardware cost in terms of SEs and interconnection links (and wavelengths), we must take the routing complexity into consideration. Finding low-cost, high-speed nonblocking switching networks remains a great challenge.
BIBLIOGRAPHY

[1] H. Abeledo and U. G. Rothblum, "Paths to marriage stability", Discrete Applied Mathematics, vol. 63, pp. 1-12, 1995.
[2] D. P. Agrawal, "Graph theoretical analysis and design of multistage interconnection networks", IEEE Transactions on Computers, vol. C-32, no. 7, pp. 637-648, July 1983.
[3] R. Anderson, "Parallel algorithms for generating random permutations on a shared memory machine", Proceedings of the 2nd ACM Symposium on Parallel Algorithms and Architectures, pp. 95-102, 1990.
[4] S. Andersen, "The looping algorithm extended to base 2^t rearrangeable switching networks", IEEE Transactions on Communications, vol. 25, pp. 1057-1063, 1977.
[5] V. E. Benes, "On rearrangeable three-stage connecting networks", Bell System Technical Journal, vol. 41, no. 5, pp. 1481-1492, Sep. 1962.
[6] V. E. Benes, "Permutation groups, complexes, and rearrangeable connecting networks", Bell System Technical Journal, vol. 43, pp. 1619-1640, July 1964.
[7] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.
[8] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, Elsevier North-Holland, 1976.
[9] J. Carpinelli, "Interconnection networks: Improved routing methods for Clos and Benes networks", Ph.D. Thesis, Rensselaer Polytechnic Institute, Troy, NY, Aug. 1987.
[10] J. Carpinelli and A. Y. Oruc, "Applications of matching and edge-coloring algorithms to routing in Clos networks", Networks, vol. 24, pp. 319-326, Sep. 1994.
[11] H. J. Chao, C. H. Lam, and E. Oki, Broadband Packet Switching Technologies, John Wiley & Sons, Inc., 2001.
[12] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queuing with a combined input/output-queued switch", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1030-1039, 1999.
[13] C. Clos, "A study of non-blocking switching networks", Bell System Technical Journal, vol. 32, pp. 406-424, Mar. 1953.
[14] R. Cole and J. Hopcroft, "On edge coloring bipartite graphs", SIAM Journal on Computing, vol. 11, pp. 540-546, 1982.
[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, The MIT Press and McGraw-Hill Book Company, second edition, 2001.
[16] R. Durstenfeld, "Random permutation (Algorithm 235)", Communications of the ACM, vol. 7, no. 7, p. 420, 1964.
[17] T. Feder, N. Megiddo, and S. Plotkin, "A sublinear parallel algorithm for stable matching", Theoretical Computer Science, vol. 233, pp. 297-308, 2000.
[18] H. Gabow, "Using Euler partitions to edge color bipartite multigraphs", International Journal of Computer and Information Sciences, vol. 5, pp. 345-355, 1976.
[19] H. Gabow and O. Kariv, "Algorithms for edge coloring bipartite graphs and multigraphs", SIAM Journal on Computing, vol. 11, pp. 117-129, 1982.
[20] D. Gale and L. S. Shapley, "College admissions and the stability of marriage", American Mathematical Monthly, vol. 69, pp. 9-15, 1962.
[21] A. V. Goldberg, S. A. Plotkin, and G. E. Shannon, "Parallel symmetry-breaking in sparse graphs", Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp. 315-323, 1987.
[22] Q. P. Gu and S. Peng, "Wavelengths requirement for permutation routing in all-optical multistage interconnection networks", Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS), pp. 761-768, May 2000.
[23] D. Gusfield, "Three fast algorithms for four problems in stable marriage", SIAM Journal on Computing, vol. 16, no. 1, pp. 111-128, 1987.
[24] D. Gusfield and R. W. Irving, The Stable Marriage Problem: Structure and Algorithms, MIT Press, 1989.
[25] T. Hagerup, "Fast parallel generation of random permutations", Proceedings of the 18th Annual International Colloquium on Automata, Languages and Programming, pp. 405-416, 1991.
[26] T. Hattori, T. Yamasaki, and M. Kumano, "New fast iteration algorithm for the solution of the generalized stable marriage problem", Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 6, pp. 1051-1056, 1999.
[27] H. Hinton, "A non-blocking optical interconnection network using directional couplers", Proceedings of the IEEE Global Telecommunications Conference, pp. 885-889, Nov. 1984.
[28] J. E. Hopcroft and R. M. Karp, "An n^{2.5} algorithm for maximum matching in bipartite graphs", SIAM Journal on Computing, vol. 2, pp. 225-231, 1973.
[29] M. E. C. Hull, "A parallel view of stable marriages", Information Processing Letters, vol. 18, no. 1, pp. 63-66, 1984.
[30] D. K. Hunter, P. J. Legg, and I. Andonovic, "Architecture for large dilated optical TDM switching networks", IEE Proceedings on Optoelectronics, vol. 140, no. 5, pp. 337-343, Oct. 1993.
[31] F. K. Hwang, The Mathematical Theory of Nonblocking Switching Networks, World Scientific, 1998.
[32] IEEE Standards Board, IEEE Standard VHDL Language Reference Manual, 2002.
[33] J. Jaja, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
[34] A. Jajszczyk, "A simple algorithm for the control of rearrangeable switching networks", IEEE Transactions on Computers, vol. 33, pp. 169-171, 1985.
[35] A. C. Kam, K. Y. Siu, R. A. Barry, and E. C. Swanson, "A cell switch WDM broadcast LAN with bandwidth guarantee and fair access", IEEE Journal of Lightwave Technology, vol. 16, no. 12, pp. 2265-2280, Dec. 1998.
[36] A. Kam and K.-Y. Siu, "Linear complexity algorithms for QoS support in input-queued switches with no speedup", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1040-1056, June 1999.
[37] D. Kapur and M. S. Krishnamoorthy, "Worst-case choice for the stable marriage problem", Information Processing Letters, vol. 21, pp. 27-30, 1985.
[38] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input vs. output queueing on a space-division packet switch", IEEE Transactions on Communications, vol. 35, no. 12, pp. 110-115, May 1987.
[39] S. Keshav, An Engineering Approach to Computer Networking, Addison-Wesley Inc., 1997.
[40] C. T. Lea, "Crossover minimization in directional-coupler-based photonic switching systems", IEEE Transactions on Communications, vol. 36, no. 3, pp. 355-363, Mar. 1988.
[41] C. T. Lea, "Multi-log_2 N networks and their applications in high-speed electronic and photonic switching systems", IEEE Transactions on Communications, vol. 38, no. 10, pp. 1740-1749, Oct. 1990.
[42] C. T. Lea and D. J. Shyy, "Tradeoff of horizontal decomposition versus vertical stacking in rearrangeable nonblocking networks", IEEE Transactions on Communications, vol. 39, no. 6, pp. 899-904, June 1991.
[43] C. Y. Lee and A. Y. Oruc, "A fast parallel algorithm for routing unicast assignments in Benes networks", IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 3, pp. 329-334, Mar. 1995.
[44] K. Y. Lee, "On the rearrangeability of a (2 logN − 1) stage permutation network", IEEE Transactions on Computers, vol. 34, no. 5, pp. 412-425, May 1985.
[45] K. Y. Lee, "A new Benes network control algorithm", IEEE Transactions on Computers, vol. 36, no. 6, pp. 768-772, June 1987.
[46] T. T. Lee and S. Y. Liew, "Parallel routing algorithms in Benes-Clos networks", IEEE Transactions on Communications, vol. 50, no. 11, pp. 1841-1847, Nov. 2002.
[47] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays · Trees · Hypercubes, Morgan Kaufmann Publishers, 1992.
[48] J. Lenfant, "Parallel permutations of data: a Benes network control algorithm for frequently used permutations", IEEE Transactions on Computers, vol. 27, pp. 637-647, July 1978.
[49] G. F. Lev, N. Pippenger, and L. G. Valiant, "A fast parallel algorithm for routing in permutation networks", IEEE Transactions on Computers, vol. 30, pp. 93-100, Feb. 1981.
[50] E. Lu and S. Q. Zheng, "Parallel routing algorithms for nonblocking electronic and photonic multistage switching networks", Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2004), Workshop on Advances in Parallel and Distributed Computing Models, April 2004.
[51] E. Lu and S. Q. Zheng, "A parallel iterative improvement stable matching algorithm", Proceedings of the International Conference on High Performance Computing (HiPC), Lecture Notes in Computer Science, Springer-Verlag, pp. 55-65, Dec. 2003.
[52] E. Lu, M. Yang, Y. Zhang, and S. Q. Zheng, "Design and implementation of an acyclic stable matching scheduler", Proceedings of the IEEE Global Communications Conference (GlobeCom), pp. 3938-3942, Dec. 2003.
[53] E. Lu and S. Q. Zheng, "High-speed crosstalk-free routing for optical multistage interconnection networks", Proceedings of the 12th IEEE International Conference on Computer Communications and Networks (ICCCN), pp. 249-254, Oct. 2003.
[54] E. Lu and S. Q. Zheng, "A fast parallel routing algorithm for Benes group switches", Proceedings of the 14th IASTED International Conference on Parallel and Distributed Computing and Systems, pp. 67-72, Nov. 2002.
[55] E. Lu and S. Q. Zheng, "Parallel algorithms for controlling group switches", Proceedings of the 15th ISCA International Conference on Parallel and Distributed Computing Systems, pp. 84-89, Sep. 2002.
[56] G. Maier, A. Pattavina, and S. G. Colombo, "Control of non-filterable crosstalk in optical-cross-connect banyan architectures", Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), vol. 2, pp. 1228-1232, Nov.-Dec. 2000.
[57] G. Maier and A. Pattavina, "Design of photonic rearrangeable networks with zero first-order switching-element-crosstalk", IEEE Transactions on Communications, vol. 49, no. 7, pp. 1268-1279, Jul. 2001.
[58] N. McKeown, "Scheduling algorithms for input-buffered cell switches", Ph.D. Thesis, University of California at Berkeley, 1995.
[59] D. G. McVitie and L. B. Wilson, "The stable marriage problem", Communications of the ACM, vol. 14, no. 7, pp. 486-490, 1971.
[60] C. Minkenberg, "On packet switch design", Ph.D. dissertation, Eindhoven University of Technology, 2001.
[61] N. Nassimi and S. Sahni, "A self-routing Benes network and parallel permutation algorithms", IEEE Transactions on Computers, vol. 30, no. 5, pp. 332-340, May 1981.
[62] N. Nassimi and S. Sahni, "Parallel algorithms to set up the Benes permutation network", IEEE Transactions on Computers, vol. 31, no. 2, pp. 148-154, Feb. 1982.
159
[63] G. Nong and M. Hamdi, \On the provision of integrated QoS guarantees of
unicast and multicast traÆc in input-queued switches", Proceedings of IEEE
Globecom 1999, vol. 3, pp. 1742-1746, 1999.
[64] G. Nong and M. Hamdi, \On the provision of quality-of-service guarantees for
input queued switches", IEEE Communications Magazine, vol. 38, no. 12, pp.
62-69, 2000.
[65] A. Olsson, Understanding Telecommunications, Ericsson, 2002.
[66] D. C. Opferman, and N. T. Tsao-Wu, \On a class of rearrangeable switching
networks", Part I: Control Algorithm, Bell System Technical Journal, vol. 50,
pp. 1,579-1,600, 1971.
[67] A. Y. Oruc and H. M. Huang, \Crosspoint complexity of sparse crossbar con-
centrators", IEEE Transactions on Information Theroy, vol. 42, no. 9, pp. 1466-
1471, Sep. 1996.
[68] K. Padmanabhan and A. Netravali, \Dilated network for photonic switching",
IEEE Transactions on Communications, vol. COM-35, no. 12, pp. 1357-1365,
Dec. 1987.
[69] Y. Pan, C. Qiao, and Y. Yang, \Optical multistage interconnection networks:
new challenges and approaches", IEEE Communications Magazine, vol. 37, no.
2, pp. 50-56, Feb. 1999.
[70] J. H. Patel, \Performance of processor-memory interconnections for multipro-
cessors", IEEE Transactions on Computers, vol. 30, no. 10, pp. 771-780, Oct.
1981.
[71] G. Pieris and G. Sasaki, \A linear lightwave Benes network", IEEE/ACM Trans-
actions on Networking, vol. 1, no. 4, pp. 441-445, Aug. 1993.
160
[72] B. Prabhakar and N. McKeown, \On the speedup required for combined input-
and output-queued switching", Automatica, vol. 35, no. 12, pp. 1909-1920, 1999.
[73] C. Qiao, R. Melhem, D. Chiarulli, and S. Levitan, \A time domain approach
for avoiding crosstalk in optical blocking multistage interconnection networks",
IEEE Journal Lightwave Technology, vol. 12, no. 10, pp. 1854-1862, Oct. 1994.
[74] C. Qiao, \Analysis of space-time tradeo�s in photonic switching networks",
Proceedings of IEEE INFOCOM, vol. 2, pp. 822-829, Mar. 1996.
[75] X. Qin and Y. Yang, \Nonblocking WDM switching networks with full and
limited wavelength conversion", IEEE Transactions on Communications, vol.
50, no. 12, pp. 2032-2041, Dec. 2002.
[76] M. J. Quinn, \A note on two parallel algorithms to solve the stable marriage
problem", BIT, vol. 25, pp. 473-476, 1985.
[77] C. S. Raghavendra and R. V. Boppana, \On self-routing in Benes and shu�e-
exchange networks", IEEE Trans. Comput. , vol. 40, no. 9, pp. 1057-1064, Sep.
1991.
[78] R. Ramaswami and K. Sivarajan, Optical Networks: A Practical Perspective,
second edition, Morgan Kaufmann, 2001.
[79] H. Ramanujam, \Decomposition of permutation networks", IEEE Transactions
on Computers, vol. 22, pp. 639-643, 1973.
[80] J. Sharony, S. Jiang, T. E. Stern, and K. W. Cheung, \Wavelength rearrangeable
and strictly nonblocking networks", IEEE Electronics Letters, vol. 28, no. 6, pp.
536-537, Mar. 1992.
161
[81] J. Sharony, K. W. Cheung, and T. E. Stern, "Wavelength Dilated Switches
(WDS)-a new class of high density, suppressed crosstalk, dynamic wavelength-
routing crossconnects", IEEE Photonics Technology Letters, vol. 4, no. 8, pp.
933-935, Aug. 1992.
[82] J. Sharony, K. W. Cheung, and T. E. Stern, "The wavelength dilation concept in
lightwave networks-implementation and system considerations", IEEE Journal
of Lightwave Technology, vol. 11, no. 5/6, pp. 900-907, May-Jun. 1993.
[83] X. Shen, F. Yang, and Y. Pan, "Equivalent permutation capabilities between
time-division optical Omega networks and non-optical extra-stage Omega net-
works", IEEE/ACM Transactions on Networking, vol. 9, no. 4, Aug. 2001.
[84] G. H. Song and M. Goodman, "Asymmetrically-dilated cross-connect switches
for low-crosstalk WDM optical networks", Proceedings of the IEEE Lasers and
Electro-Optics Society 8th Annual Meeting, vol. 1, pp. 212-213, Oct. 1995.
[85] I. Stoica and H. Zhang, "Exact emulation of an output queueing switch by a
combined input output queueing switch", Proceedings of the 6th IEEE/IFIP
IWQoS'98, Napa Valley, CA, pp. 218-224, May 1998.
[86] F. M. Suliman, A. B. Mohammad, and K. Seman, "A space dilated lightwave
network-a new approach", Proceedings of the IEEE 10th International Conference
on Telecommunications (ICT 2003), vol. 2, pp. 1675-1679, 2003.
[87] A. Subramanian, "A new approach to stable matching problems", SIAM Journal
on Computing, vol. 23, no. 4, pp. 671-700, 1994.
[88] Synopsys Design Analyzer Datasheet, available at
http://www.synopsys.com/products/logic/deanalyzer_ds.html, 1997.
[89] Y. Tamir and G. L. Frazier, "High-performance multiqueue buffers for VLSI
communication switches", Proceedings of the IEEE 15th Annual International
Symposium on Computer Architecture, pp. 343-354, 1988.
[90] S. S. Tseng and R. C. T. Lee, "A parallel algorithm to solve the stable marriage
problem", BIT, vol. 24, pp. 308-316, 1984.
[91] M. Vaez and C. T. Lea, "Wide-sense nonblocking Banyan-type switching sys-
tems based on directional couplers", IEEE Journal on Selected Areas in Com-
munications, vol. 16, no. 7, pp. 1327-1332, Sep. 1998.
[92] M. Vaez and C. T. Lea, "Strictly nonblocking directional-coupler-based switch-
ing networks under crosstalk constraint", IEEE Transactions on Communica-
tions, vol. 48, no. 2, pp. 316-323, Feb. 2000.
[93] A. Waksman, "A permutation network", Journal of the ACM, vol. 15, no. 1, pp.
159-163, Jan. 1968.
[94] J. E. Watson et al., "A low-voltage 8×8 Ti:LiNbO3 switch with a dilated Benes
architecture", IEEE Journal of Lightwave Technology, vol. 8, pp. 794-800, May
1990.
[95] T. S. Wong and C. T. Lea, "Crosstalk reduction through wavelength assign-
ment in WDM photonic switching networks", IEEE Transactions on Commu-
nications, vol. 49, no. 7, pp. 1280-1287, Jul. 2001.
[96] C. L. Wu and T. Y. Feng, "On a class of multistage interconnection networks",
IEEE Transactions on Computers, vol. C-29, no. 8, pp. 694-702, Aug. 1980.
[97] M. Yang and S. Q. Zheng, "The kDDR scheduling algorithms for multi-server
packet switches", Proceedings of the ISCA 15th International Conference on
Parallel and Distributed Computing Systems, pp. 78-83, 2002.
[98] M. Yang and S. Q. Zheng, "Efficient scheduling for CIOQ switches with space-
division multiplexing speedup", Proceedings of IEEE INFOCOM, 2003.
[99] Y. Yang, J. Wang, and Y. Pan, "Permutation capability of optical multistage
interconnection networks", Journal of Parallel and Distributed Computing, vol.
60, no. 1, pp. 72-91, Jan. 2000.
[100] Y. Yang and J. Wang, "Optimal all-to-all personalized exchange in a class of
optical multistage networks", IEEE Transactions on Parallel and Distributed
Systems, vol. 12, no. 6, pp. 567-582, Jun. 2001.
[101] Y. Yang, S. Q. Zheng, and D. Verchere, "Group switching for DWDM networks",
submitted for publication.
[102] S. Q. Zheng and Y. Xiong, "Ingress edge router architecture and related chan-
nel scheduling algorithms for OBS networks", Alcatel internal technical report,
2000.
[103] S. Q. Zheng, M. Yang, and F. Masetti, "Hardware switch scheduling in high-
speed, high-capacity IP routers", Proceedings of the 14th IASTED International
Conference on Parallel and Distributed Computing and Systems, pp. 636-641,
2002.
VITA
Enyue Lu received a B.S. degree in mathematics from Zhejiang Normal University,
China, in 1996, an M.S. degree in mathematics from Nanjing University, China, in
1999, and an M.S. degree in computer science from the University of Texas at Dallas
in 2001. She is currently a Ph.D. candidate in the Computer Science Department at
the University of Texas at Dallas.
Enyue Lu's current research interests include parallel processing and computing,
computer and communication networks, algorithm design and analysis, computer
architectures, software engineering, databases, and combinatorics and graph theory.
Her Ph.D. dissertation focuses on the design and analysis of efficient switching algo-
rithms for high-performance switches and routers. She has published several refereed
papers in these areas and received a Best Paper Award at the 14th IASTED Interna-
tional Conference on Parallel and Distributed Computing and Systems in 2002. From
January 2001 to May 2001, she worked as a co-op in the UMTS/GSM Services
Development Group at Nortel Networks, Richardson, Texas.