
Switch Architectures: Input Queued, Output Queued, Combined Input and Output Queued

Date post: 21-Dec-2015
Page 1: 1 Switch Architectures Input Queued, Output Queued, Combined Input and Output Queued.

Switch Architectures

Input Queued, Output Queued,

Combined Input and Output Queued


Outline

I. Introduction

II. System Model

III. The Least Cushion First/Most Urgent First Algorithm

IV. Conclusion

Ⅰ. Introduction

Exponential growth of Internet traffic demands large-scale switches.

Common switch architectures:

Output Queued
    High performance
    Easier to provide QoS guarantees
    Has a serious scaling problem

Input Queued
    More scalable
    Suffers from HOL (head-of-line) blocking
    Virtual Output Queues can improve performance
    Difficult to provide QoS guarantees

Output Queued-Shared Bus

[Figure: output queued switch built on a shared bus, connecting input ports 1-4 to output ports 1-4]

Output Queued-Shared Memory

[Figure: output queued switch built on a shared memory, connecting input ports 1-4 to output ports 1-4]

Input Queued

[Figure: input queued switch with one queue at each of input ports 1-4, feeding output ports 1-4]

Input Queued with VOQ

[Figure: input queued switch in which each of input ports 1-4 keeps a separate virtual output queue for each of output ports 1-4]

Ⅰ. Introduction

Memory BW requirements for three common switch architectures:

Architecture                      Memory BW    Example (S = 10 Gbps, N = 16)
Input queued                      2S           20 Gbps
Output queued, shared bus         (N+1)S       170 Gbps
Output queued, shared memory      2NS          320 Gbps

S : link speed

N : switch size (N×N)

Input queueing is necessary!

The switch fabric can be sped up to improve performance, which gives the CIOQ (combined input and output queued) switch.
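
The memory bandwidth figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name and dictionary keys are illustrative and not part of the original material.

```python
def memory_bandwidth_gbps(link_speed_gbps, n_ports):
    """Memory bandwidth for the three architectures compared on the slide.

    link_speed_gbps is S and n_ports is N in the slide's notation.
    """
    s, n = link_speed_gbps, n_ports
    return {
        "input_queued": 2 * s,                      # 2S
        "output_queued_shared_bus": (n + 1) * s,    # (N+1)S
        "output_queued_shared_memory": 2 * n * s,   # 2NS
    }


if __name__ == "__main__":
    for name, bw in memory_bandwidth_gbps(10, 16).items():
        print(f"{name}: {bw} Gbps")
```

With S = 10 Gbps and N = 16 this reproduces the 20, 170 and 320 Gbps entries, and it makes the linear growth in N of the output queued designs easy to see for other port counts.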

Ⅰ. Introduction

Matching Algorithms for Performance Improvement:

Matching          Complexity (Iteration)    Description
Maximum           $O(N^{2.5})$              Achieves 100% throughput under uniform traffic
Maximum weight    $O(N^3 \log N)$           Achieves 100% throughput under either uniform or non-uniform traffic
Maximal           $O(N)$                    Achieves 100% throughput with a speedup of 2
Stable            $O(N^2)$                  Exactly emulates an OQ switch with a speedup of 2

Ⅰ. Introduction

Exact Emulation: under identical input traffic, the departure time of every cell is the same in the CIOQ switch and in the OQ switch.

[Figure: a CIOQ switch and the emulated OQ switch, each with inputs 1 to N and outputs 1 to N, receive identical input traffic and produce an identical departure pattern]

Ⅰ. Introduction

We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm:

    O(N) complexity with parallel comparators
    Exactly emulates an OQ switch with a speedup of 2
    No constraint on service discipline

Ⅱ. System Model

[Figure: CIOQ switch model with a switching fabric operating at speedup = 2]

Ⅱ. System Model

The switch fabric is sped up by a factor of 2.

There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2.

A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut through).

A cell delivered in phase k.2 can only be transmitted in slot k+1 or after.


Ⅲ. The Least Cushion First / Most Urgent First Algorithm

Let $x_{i,j}$ denote a cell at input port i destined to output port j.

Definition 1: The cushion of cell $x_{i,j}$, denoted $C(x_{i,j})$: the number of cells residing in output port j which will depart the emulated OQ switch earlier than cell $x_{i,j}$.

Definition 2: The cushion between input port i and output port j, denoted $C(i,j)$: the minimum of $C(x_{i,j})$ over all cells at input port i destined to output port j. If there is no cell destined to output port j, then $C(i,j)$ is set to $\infty$.

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals $C(i,j)$.

Definition 4: The input thread of cell $x_{i,j}$ at input port i, denoted $IT(x_{i,j})$: the set of cells at input port i which have a cushion smaller than or equal to $C(x_{i,j})$, except cell $x_{i,j}$ itself. Let $|IT(x_{i,j})|$ denote the size of $IT(x_{i,j})$.
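
To make Definitions 1 to 3 concrete, here is a minimal Python sketch that builds a scheduling matrix. It assumes that every cell carries the time at which it would leave the emulated OQ switch; that representation, the function names and the sample data are assumptions made for illustration, not part of the slides.

```python
import math


def cushion(cell_departure, output_buffer_departures):
    """Definition 1: cells already at the output that depart earlier than this cell."""
    return sum(1 for d in output_buffer_departures if d < cell_departure)


def scheduling_matrix(n, input_cells, output_buffers):
    """Definitions 2 and 3: entry (i, j) is the minimum cushion over cells at input i for output j.

    input_cells[i] is a list of (output_port, emulated_departure_time) pairs.
    output_buffers[j] lists the emulated departure times of cells already at output j.
    An entry stays at infinity when input i holds no cell for output j.
    """
    matrix = [[math.inf] * n for _ in range(n)]
    for i, cells in enumerate(input_cells):
        for j, departure in cells:
            matrix[i][j] = min(matrix[i][j], cushion(departure, output_buffers[j]))
    return matrix


if __name__ == "__main__":
    # Hypothetical 2x2 switch: input 0 holds three cells, input 1 is empty.
    cells = [[(0, 5), (0, 3), (1, 4)], []]
    buffers = [[1, 2, 4], [7]]
    for row in scheduling_matrix(2, cells, buffers):
        print(row)
```

In the sample data the two cells at input 0 for output 0 have cushions 3 and 2, so the (0,0) entry is 2, while the empty input 1 produces a row of infinite entries.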


Ⅲ. The Least Cushion First / Most Urgent First Algorithm

LCF / MUF Algorithm

Step 1:

Select the (i,j)th entry which satisfies $C(i,j) = \min_{k,l}\{C(k,l)\}$ (Least Cushion First). If the selected entry is $\infty$, then stop.

If more than one entry has the least cushion and these entries reside in different columns, then select a column (i.e., an output port) arbitrarily.

For the selected column, say column j, determine the row i which has the most urgent cell $x_{i,j}$ among all cells $x_{k,j}$ at all input ports (Most Urgent First).

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

LCF / MUF Algorithm

Step 2:

Eliminate the ith row and the jth column (i.e., match output port j to input port i) of the scheduling matrix.

If the reduced matrix becomes null, then stop. Otherwise, use the reduced matrix and go to Step 1.

Consider for example the scheduling matrix given in page 13
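
Steps 1 and 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the `urgency` matrix (the emulated OQ departure order of the most urgent cell at input i for output j, smaller meaning more urgent), the lowest-index tie-break between columns, and the sample matrices are all assumptions.

```python
import math


def lcf_muf(cushion, urgency):
    """Return a list of (input, output) matches for one scheduling phase.

    cushion[i][j] is C(i, j), with math.inf when input i has no cell for output j.
    urgency[i][j] is a hypothetical departure rank of the most urgent such cell.
    """
    n = len(cushion)
    rows, cols = set(range(n)), set(range(n))
    matching = []
    while rows and cols:
        # Step 1, Least Cushion First: smallest entry of the reduced matrix.
        # Ties between columns are broken by the lowest column index here.
        least, j = min((cushion[r][c], c) for r in rows for c in cols)
        if math.isinf(least):
            break  # no remaining input has a cell for any remaining output
        # Step 1, Most Urgent First: within column j, pick the row whose
        # head cell departs earliest in the emulated OQ switch.
        candidates = [r for r in rows if not math.isinf(cushion[r][j])]
        i = min(candidates, key=lambda r: urgency[r][j])
        matching.append((i, j))
        # Step 2: eliminate row i and column j, then repeat on the reduced matrix.
        rows.remove(i)
        cols.remove(j)
    return matching


if __name__ == "__main__":
    inf = math.inf
    C = [[1, 0, inf], [0, 2, 1], [inf, 0, 3]]
    U = [[4, 3, inf], [2, 9, 6], [inf, 1, 8]]
    print(lcf_muf(C, U))
```

For the sample matrices the first pass selects column 0 and matches input 1 to it; the second pass selects column 1 and matches input 2; the only remaining entry is infinite, so the procedure stops with input 0 unmatched.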

Ⅳ. Conclusion

We propose a new scheduling algorithm, the least cushion first / most urgent first algorithm:

    Exactly emulates an OQ switch
    No constraint on service discipline

Implementation issues of the LCF / MUF algorithm:

    A switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port.
    It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g., WFQ).
    Feasible for static priority assignment schemes.


Outline

Systolic Array

Binary Heap

Pipelined Heap

Hardware Design

The Systolic Array Priority Queue

[Figure: a chain of blocks 1 to n, each with a permanent data register and a temporary register; new values enter at block 1, which holds the highest value, and priority values are non-increasing along the chain]

For n = 1000:

Hardware required: 1000 comparators, 2000 registers.

Performance: constant time.

The Binary Heap Priority Queue

[Figure: a binary heap with root value 16, shown both as a binary tree and as a value array indexed 1-15]

For n = 1000:

Hardware required: 1 comparator, 1 register, 1 SRAM.

Performance: O(log n).


The Pipelined-Heap

Modified binary heap data structure

Constant-time operation. Similar to the Systolic Array.

Good hardware scalability. Similar to the Binary Heap.

P-heap Data Structure (B,T)

[Figure: a four-level P-heap with nodes 1-15. The Binary Array (B) stores a value and a capacity for each node; the Token Array (T) stores an (operation, value, position) entry for each of Level 1 to Level 4.]

Capacity of nodes 1-15: 4 1 3 1 0 1 2 0 1 0 0 1 0 1 1

The Enqueue (Insert) Operation

[Figure: five snapshots of the P-heap during an enqueue: (a) local-enqueue(1), (b) local-enqueue(2), (c) local-enqueue(3), (d) local-enqueue(4), (e) final state. In each snapshot the token array holds an "enq" operation with its value and position.]

The Dequeue (Delete) Operation

[Figure: five snapshots of the P-heap during a dequeue: (a) initial state, (b) local-dequeue(1), (c) local-dequeue(2), (d) local-dequeue(3), (e) final state. In each snapshot the token array holds a "deq" operation with its position.]

Pipelined Operation

[Figure: four panels of a P-heap with levels 1-6, showing operations at different levels proceeding in a pipeline]

Hardware Requirements

log N SRAMs represent the Binary Array B, where N is the size of the P-heap.

log N registers represent the Token Array T.

log N comparators are required, one for each level of the P-heap.

Binary Heap

Viewed as an array:

Index i    1    2    3    4    5    6
A[i]       16   11   12   8    11   9

Viewed as a binary tree: node 1 (16) is the root with children 2 (11) and 3 (12); node 2 has children 4 (8) and 5 (11); node 3 has child 6 (9).

Left(i) = 2*i

Right(i) = 2*i + 1

Parent(i) = i / 2 (integer division)

Heap property: A[i] >= A[Left(i)] and A[i] >= A[Right(i)]

Binary Heap : Insert Operation

The new value 14 is placed at index 7 and then exchanged with its parent at index 3.

Before the exchange (viewed as an array):

Index i    1    2    3    4    5    6    7
A[i]       16   11   12   8    10   9    14

After the exchange (viewed as an array):

Index i    1    2    3    4    5    6    7
A[i]       16   11   14   8    10   9    12

[Figure: the same two states viewed as binary trees]

Binary Heap : Delete Operation

The root value 16 is removed, the last value 12 is moved to the root, and it is then exchanged with its larger child.

Before the delete (viewed as an array):

Index i    1    2    3    4    5    6    7
A[i]       16   11   14   8    10   9    12

After moving the last value to the root:

Index i    1    2    3    4    5    6
A[i]       12   11   14   8    10   9

After the exchange:

Index i    1    2    3    4    5    6
A[i]       14   11   12   8    10   9

[Figure: the same three states viewed as binary trees]


Binary Heap Operations

Both insert and delete are O(log N) operations (i.e. number of levels in the tree)

2*i can be implemented as left shift

i / 2 can be implemented as right shift
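
The insert and delete operations, with the index arithmetic done by shifts, can be sketched as below. This is a minimal software sketch assuming a 1-indexed Python list (index 0 unused); the class and method names are illustrative, and it is not a model of the hardware design discussed in the slides.

```python
class MaxHeap:
    """Array-based binary max-heap using 1-based indices as on the slides."""

    def __init__(self, values=()):
        self.a = [None]  # slot 0 is unused so that Left(i) = 2*i holds
        for v in values:
            self.insert(v)

    def insert(self, value):
        """Append at the next free index, then sift up: O(log N)."""
        self.a.append(value)
        i = len(self.a) - 1
        while i > 1 and self.a[i >> 1] < self.a[i]:   # i >> 1 is Parent(i)
            self.a[i >> 1], self.a[i] = self.a[i], self.a[i >> 1]
            i >>= 1

    def delete_max(self):
        """Remove the root, move the last value up, then sift down: O(log N)."""
        if len(self.a) == 1:
            raise IndexError("delete from empty heap")
        top = self.a[1]
        last = self.a.pop()
        n = len(self.a) - 1
        if n >= 1:
            self.a[1] = last
            i = 1
            while True:
                left, right = i << 1, (i << 1) | 1    # 2*i and 2*i + 1
                largest = i
                if left <= n and self.a[left] > self.a[largest]:
                    largest = left
                if right <= n and self.a[right] > self.a[largest]:
                    largest = right
                if largest == i:
                    break
                self.a[i], self.a[largest] = self.a[largest], self.a[i]
                i = largest
        return top


if __name__ == "__main__":
    heap = MaxHeap([16, 11, 12, 8, 10, 9])
    heap.insert(14)
    print(heap.a[1:])
    print(heap.delete_max(), heap.a[1:])
```

The usage lines replay the insert of 14 and the subsequent delete from the preceding slides, so the printed arrays can be compared against them.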

Some Scheduling Algorithms

Outline

PIM

RRM

iSLIP (Better solution)

Scheduling Algorithms

When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs.

This is equivalent to finding a matching on a bipartite graph with N input vertices and N output vertices.

The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.

Scheduling packets

For example, with P(input #, output #) = order to leave:

P(1,1) = 1, P(1,2) = 3

P(3,2) = 3, P(3,4) = 1

P(4,4) = 2

[Figure: crossbar switch with these cells waiting on the input side and the output ports on the output side]

The scheduling algorithm needs to decide the path and order of packets through the crossbar switch.

High performance systems

Usually, we design algorithms with the following properties:

High Throughput

Starvation Free

Fast

Simple to Implement


Parallel Iterative Matching (PIM)

PIM has three steps to implement

Step1 : Request

Step2 : Grant

Step3 : Accept

Each decision is made randomly.
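
The request, grant and accept steps can be sketched in Python as follows. This is a minimal illustration, assuming a 0/1 request matrix and uniform random choices; the function names, the `seed` parameter and the 0-indexed ports are assumptions for this sketch and not part of the slides.

```python
import random


def pim_iteration(requests, matched_inputs, matched_outputs, rng):
    """One request-grant-accept round; returns new {input: output} matches."""
    n = len(requests)
    grants = {}  # input -> outputs that granted it in this round
    for k in range(n):
        if k in matched_outputs:
            continue
        # Request: unmatched inputs with a cell for output k.
        requesters = [i for i in range(n)
                      if i not in matched_inputs and requests[i][k]]
        if requesters:
            # Grant: the output picks one requesting input at random.
            grants.setdefault(rng.choice(requesters), []).append(k)
    # Accept: each granted input picks one granting output at random.
    return {i: rng.choice(outputs) for i, outputs in grants.items()}


def pim(requests, iterations, seed=0):
    """Run up to `iterations` rounds and return the matching {input: output}."""
    rng = random.Random(seed)
    match = {}
    for _ in range(iterations):
        new = pim_iteration(requests, set(match), set(match.values()), rng)
        if not new:
            break  # nothing left to match
        match.update(new)
    return match


if __name__ == "__main__":
    # Inputs 0, 2 and 3 have cells, mirroring P(1,1), P(1,2), P(3,2), P(3,4), P(4,4).
    R = [[1, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 0, 1]]
    print(pim(R, iterations=4))
```

Because every round with a remaining requestable pair adds at least one match, running N rounds always ends in a maximal matching, although which matching is found depends on the random choices.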

The mathematical model of the algorithm

We can assume that every input In[i] maintains the following state information:

Table Ri[0] … Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise)

Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise)

Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no output is accepted)

The mathematical model (cont'd)

Every output Out[k] maintains the following state information:

Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1 if Out[k] receives a request from In[i] (0 otherwise)

Variable Gk, where Gk = i if Out[k] sends a grant to In[i] (-1 if no input is granted)

Variable Adk, where Adk = 1 if the grant from Out[k] is accepted (0 otherwise)

The model of PIM

Therefore, we can represent the PIM algorithm in terms of these state variables.

An example of the PIM algorithm

Requests: P(1,1)=1, P(1,2)=3; P(3,2)=3, P(3,4)=1; P(4,4)=2

[Figure: (a) Request, (b) Grant, (c) Accept, followed by a second iteration]

Problems with PIM

Hard to implement randomness in hardware

Unfairness occurs among connections when the switch is oversubscribed

Throughput is limited to approximately 63% for a single iteration:

$1 - \left(\frac{N-1}{N}\right)^N \to 1 - \frac{1}{e} \approx 63\%$ as N grows
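
The 63% figure can be checked numerically. The sketch below assumes the standard saturated case in which every input has cells for every output, so an input stays unmatched only if none of the N outputs grants it, which happens with probability ((N-1)/N)^N; the function name is illustrative.

```python
import math


def pim_single_iteration_throughput(n):
    """Fraction of inputs matched after one PIM iteration on a saturated N x N switch."""
    return 1 - ((n - 1) / n) ** n


if __name__ == "__main__":
    for n in (2, 4, 16, 64, 1024):
        print(n, round(pim_single_iteration_throughput(n), 4))
    print("limit:", round(1 - 1 / math.e, 4))
```

The value starts at 0.75 for a 2×2 switch and decreases toward 1 - 1/e as N grows, which is why additional iterations are needed to approach full throughput.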

The unfairness problem

Offered load: λ1,1 = 1, λ1,2 = 1, λ2,1 = 1

Resulting service rates: μ1,1 = 1/4, μ1,2 = 3/4, μ2,1 = 3/4

[Figure: 2×2 switch in which input 1 has traffic for outputs 1 and 2, and input 2 has traffic for output 1]

Round-Robin Matching Algorithm (RRM)

Use rotating priority to match inputs and outputs

Need a pointer gi to identify the highest priority element

Apply rotating priority on both inputs and outputs


The model of RRM

RRM scheduling

Requests: P(1,1)=1, P(1,2)=3; P(3,2)=3, P(3,4)=1; P(4,4)=2

[Figure: panels (a), (b), (c) for these requests, with round-robin pointer wheels over positions 1-4 for grant pointers g2 and g4 and accept pointer a1]

Synchronization Problem

When an output receives requests, it chooses an input to grant, and its pointer gi must then advance to a new value.

For example:

λ1,1 = λ1,2 = 1

λ2,1 = λ2,2 = 1

μ1,1 = μ1,2 = 1/4

μ2,1 = μ2,2 = 1/4

Because the grant pointers advance in lockstep, both outputs keep granting the same input, so only one cell is transferred per cell time.

Efficiency = 50%

iSLIP algorithm

Used to fix the synchronization problem of RRM

An output changes its pointer gi only when its grant is accepted by the input; otherwise the pointer gi keeps its value

Solves the synchronization problem and achieves 100% throughput
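
The difference between the two pointer-update rules can be illustrated with a small simulation. This is a simplified sketch, assuming a single iteration per cell time, every VOQ permanently backlogged, and all pointers starting at 0; the function name and the `update_only_on_accept` flag are hypothetical and only distinguish the RRM rule from the iSLIP rule.

```python
def simulate(n, slots, update_only_on_accept):
    """Count cells transferred over `slots` cell times on a saturated n x n switch.

    update_only_on_accept=False models the RRM grant-pointer rule,
    update_only_on_accept=True models the iSLIP rule.
    """
    grant_ptr = [0] * n    # per-output pointer over inputs
    accept_ptr = [0] * n   # per-input pointer over outputs
    transferred = 0
    for _ in range(slots):
        # Request and grant: every input requests every output, so each
        # output grants the input its pointer currently designates.
        grants = {}
        for k in range(n):
            grants.setdefault(grant_ptr[k], []).append(k)
        # Accept: each granted input takes the first granting output at or
        # after its accept pointer, then advances that pointer past it.
        accepted = {}
        for i, outputs in grants.items():
            k = min(outputs, key=lambda o: (o - accept_ptr[i]) % n)
            accepted[k] = i
            accept_ptr[i] = (k + 1) % n
        # Grant-pointer update: always (RRM) or only when accepted (iSLIP).
        for k in range(n):
            i = grant_ptr[k]
            if not update_only_on_accept or accepted.get(k) == i:
                grant_ptr[k] = (i + 1) % n
        transferred += len(accepted)
    return transferred


if __name__ == "__main__":
    slots = 100
    print("RRM  :", simulate(2, slots, False), "cells in", slots, "slots")
    print("iSLIP:", simulate(2, slots, True), "cells in", slots, "slots")
```

On the 2×2 example the RRM rule keeps both outputs pointing at the same input, so one cell moves per slot out of a possible two, while the iSLIP rule desynchronizes the pointers after the first slot and then moves two cells in every slot.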


The model of iSLIP

Example of iSLIP

λ1,1 = λ1,2 = 1

λ2,1 = λ2,2 = 1

[Figure: the 1st, 2nd and 3rd matches produced by iSLIP for this traffic]

100% throughput is achieved


Comparison of three algorithms

