Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science
[email protected]
http://www.stanford.edu/~nickm
High Performance Switching and Routing
1. The Demand for Bandwidth
2. The Shortage of Switching/Routing Capacity
3. The Architecture of Switches and Routers
4. Some (of our) solutions
What's the Problem?
[Figure: growth of "most things" over time]
The San Jose NAP: The demand
[Figure: aggregate bandwidth (Mbps) at the San Jose NAP, rising from about 100 Mbps through 123, 150, 200, 460, 520 and 620 Mbps to over 720 Mbps in roughly two years]
Source: http://www.mfsdatanet.com/MAE/west.stats.html
High Performance Switching and Routing
5
The supplyR
oute
r P
erfo
rman
ce (
pack
ets/
seco
nd)
103
19901986 1994 1997
104105
106
Why we need faster switches/routers
[Figure: packets/second, 1986 to 1997; demand grows faster than supply]
Traffic Inversion: 10 years ago
[Diagram]
Traffic Inversion: Today
[Diagram: traffic flowing between ISPs]
Why is this a problem?
[Figure: packet loss (%) measured on November 1st, 1996]
1. The Demand for Bandwidth
2. The Shortage of Switching/Routing Capacity
3. The Architecture of Switches and Routers
4. Some (of our) solutions
The Architecture of Switches and Routers
Generic Packet Switch (e.g. IP Router, ATM Switch, LAN Switch):
[Diagram: an incoming packet (Data + Hdr) has its header examined by the Forwarding Decision logic, crosses the Interconnect, and departs under an Output Scheduler; a Signaling & Mgmt Processor oversees the switch]
Performance of IP Routers
[Timing diagram: for minimum-size back-to-back packets, the arrival time of one packet must cover both the copy time and the forwarding decision time]
Performance of IP Routers
For minimum-size back-to-back packets:
- The copy rate must keep up with the arrival rate (most routers do this ~ok)
- The header processing time must fit in the arrival time (most routers do this poorly!)
The Evolution of Routers: The first shared memory routers
[Diagram: line cards (MAC + DMA) share a bus with a central Routing CPU and a single Buffer Memory; every packet crosses the bus into the shared buffer and back out again]
The Evolution of Routers: Reducing the number of bus copies
[Diagram: each line card gets its own Buffer Memory and a Route Cache, receiving updates from the central Routing CPU, so most packets are forwarded card-to-card with fewer trips across the bus]
The Evolution of Routers: Avoiding bus contention
[Diagram: line cards (MAC, Buffer Memory, Route Cache) and the route CPU connect through a switched backplane instead of a shared bus]
Advantage: non-blocking backplane, high throughput.
Disadvantage: difficult to provide QoS.
1. The Demand for Bandwidth
2. The Shortage of Switching/Routing Capacity
3. The Architecture of Switches and Routers
4. Some (of our) solutions
Some (of our) Solutions
1. Accelerating Forwarding Decisions:
   • Longest-matching prefixes
2. Interconnections: Switched Backplanes
   • Input Queueing
     - Theory
     - Unicast
     - Multicast
   • Fast Buffering
   • Speedup
   • The Tiny Tera Project
Routing Lookups
With classful addresses (Class A, Class B, Class C), the network part of a destination such as 212.17.9.4 can be found by Exact Match (hash, cache, pipeline...).
Routing Table: 212.17.9.0 → Port 4
Routing Lookups with CIDR ("supernetting")
CIDR uses "longest matching prefix" routing: 212.17.9.4 matches 212.0.0.0/8, 212.17.0.0/16 and 212.17.9.0/24, and the longest match (the /24) wins.
Hashing, caching and pipelining are hard!
Perform Lookups Faster
Observation #1: routing tables keep growing, while the cost of memory (per byte) keeps falling.
[Figure: size of routing tables and cost of memory vs time]
Performing Lookups Faster
Observation #2: most prefixes are /24 or shorter; a /24 such as 212.17.9.0/24 covers a block of 256 addresses of the 0..2^32−1 space.
[Figure: number of prefixes in the routing table vs prefix length, concentrated at /24 and below]
[Example: looking up 212.17.9.1 by indexing a table with the address bits; each entry holds either an output port (Port 3, Port 4, Port 5) or a "look further" pointer into a 256-entry second-level table]
16 Mbytes of 50 ns DRAM for the first level, <1 Mbyte of 50 ns DRAM for the second: 20 million lookups per second.
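The two-table lookup described above can be sketched in software. The model below is my own illustration (class and method names are not from the talk): a large first-level table indexed by the top 24 address bits, plus small 256-entry second-level blocks for the rare prefixes longer than /24.

```python
import ipaddress

class TwoLevelTable:
    """Illustrative two-level longest-prefix-match table: one entry per
    /24 (top 24 address bits), plus 256-entry second-level blocks for
    the few prefixes longer than /24."""

    def __init__(self):
        self.tbl24 = {}    # top-24-bit index -> port, or ('L', block_id)
        self.blocks = []   # each block: 256 ports covering one /24

    def add(self, cidr, port):
        # NOTE: add shorter prefixes first, so longer ones overwrite them.
        net = ipaddress.ip_network(cidr)
        base = int(net.network_address)
        if net.prefixlen <= 24:
            # Fill every /24 slot the prefix covers.
            for idx in range(base >> 8, (base + net.num_addresses) >> 8):
                self.tbl24[idx] = port
        else:
            idx = base >> 8
            entry = self.tbl24.get(idx)
            if not isinstance(entry, tuple):       # first long prefix here:
                self.blocks.append([entry] * 256)  # seed with the /24's port
                entry = ('L', len(self.blocks) - 1)
                self.tbl24[idx] = entry
            block = self.blocks[entry[1]]
            for off in range(base & 0xFF, (base & 0xFF) + net.num_addresses):
                block[off] = port

    def lookup(self, addr):
        a = int(ipaddress.ip_address(addr))
        entry = self.tbl24.get(a >> 8)               # one big-memory read
        if isinstance(entry, tuple):
            return self.blocks[entry[1]][a & 0xFF]   # one more small read
        return entry

t = TwoLevelTable()
t.add('212.0.0.0/8', 1)
t.add('212.17.0.0/16', 2)
t.add('212.17.9.0/24', 4)
print(t.lookup('212.17.9.4'))   # longest match is the /24 -> port 4
```

Most lookups finish in a single read of the big table, which is why cheap 50 ns DRAM suffices for tens of millions of lookups per second.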
1. Accelerating Forwarding Decisions:
   • Longest-matching prefixes
2. Interconnections: Switched Backplanes
   • Input Queueing
     - Theory
     - Unicast
     - Multicast
   • Fast Buffering
   • Speedup
   • The Tiny Tera Project
Should we use shared memory or input-queueing? (N ports)
Shared Memory:
  Advantages: highest throughput; possible to control packet delay.
  Disadvantages: N-fold internal speedup.
Input Queueing:
  Advantages: simplicity; high bandwidth.
  Disadvantages: HOL blocking; less efficient; difficult to control packet delay.
Memory Bandwidth
[Figures: memory size vs time, and memory speed vs time, for SRAM and DRAM]
An aside: How fast can shared memory operate?
How fast can a 16 port switch run with this architecture?
[Diagram: shared memory with route lookup; 200-byte packets; 5 ns SRAM]
Each packet needs 2 memory operations per cell time (one write in, one read out) at 5 ns each, i.e. 10 ns per 200-byte (1600-bit) packet ⇒ aggregate bandwidth is 160 Gb/s.
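The arithmetic behind that bound can be checked directly; a quick sanity check in Python:

```python
packet_bits = 200 * 8      # 200-byte packet = 1600 bits
mem_ops = 2                # each packet is written once and read once
op_time_ns = 5             # 5 ns SRAM access
ns_per_packet = mem_ops * op_time_ns
agg_gbps = packet_bits / ns_per_packet   # bits per ns equals Gb/s
print(agg_gbps)            # 160.0 Gb/s, shared among all 16 ports
```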
Should we use shared memory or input-queueing?
Because of a shortage of memory bandwidth, most multigigabit and terabit switches and routers use either:
1. Input Queueing, or
2. Combined Input and Output Queueing.
Head of Line Blocking / "Virtual Output Queueing"
The Problem: with a single FIFO cell buffer per input (N ports), the cell at the head of the queue can block cells behind it that are bound for idle outputs; the maximum throughput is ρmax = 2 − √2 ≈ 58%.
A Solution: "Virtual Output Queueing": each input keeps a separate queue of cells for each output, giving ρmax = 100%.
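The 2 − √2 limit can be checked numerically. The sketch below is my own saturation model (every input always busy, uniform destinations, ties broken at random), not code from the talk; for large N the measured throughput approaches 2 − √2 ≈ 0.586.

```python
import math
import random

def hol_saturation_throughput(n=32, slots=20000, seed=1):
    """Throughput of an input-queued switch with one FIFO per input,
    at saturation: only head-of-line cells (one per input) can move."""
    random.seed(seed)
    heads = [random.randrange(n) for _ in range(n)]  # HOL destinations
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, out in enumerate(heads):
            contenders.setdefault(out, []).append(i)
        for out, inputs in contenders.items():
            winner = random.choice(inputs)        # output serves one input
            heads[winner] = random.randrange(n)   # reveal the next HOL cell
            served += 1
    return served / (n * slots)

print(round(2 - math.sqrt(2), 3))              # 0.586
print(round(hol_saturation_throughput(), 3))   # close to 0.586 for large n
```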
...but requires scheduling...
[Diagram: input i holds virtual output queues Q(i,1)..Q(i,n); arrivals A_i(t) at input i, departures D_j(t) from output j; each cell time a matching M(t) decides which queue each input serves]
....which is equivalent to graph matching
[Example: the queue occupancies between 4 inputs and 4 outputs form a weighted request graph; the scheduler chooses a bipartite matching, here one of weight 18]
Practical Algorithms
1. iSLIP (Weight = 1): iterative round-robin, simple to implement. Simple, fast, efficient.
2. iLQF (Weight = Occupancy): good for non-uniform traffic. Complex!
3. iOCF (Weight = Cell Age): good for non-uniform traffic. Complex!
4. LPF (Weight = Backlog): good for non-uniform traffic. Simple.
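As a rough illustration of iSLIP's round-robin request-grant-accept structure, here is my own simplified single-iteration sketch (not the production algorithm, which runs several iterations per cell time):

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept iteration of round-robin (iSLIP-style)
    matching. requests[i][j]: input i has a cell queued for output j."""
    n = len(requests)
    # Grant: each output grants the requesting input nearest its pointer.
    grants = {}
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants[j] = i
                break
    # Accept: each input accepts the granting output nearest its pointer.
    # Pointers advance only past accepted grants, which desynchronizes
    # the arbiters over successive cell times.
    match = {}
    for i in range(n):
        offers = [j for j, g in grants.items() if g == i]
        if offers:
            j = min(offers, key=lambda o: (o - accept_ptr[i]) % n)
            match[i] = j
            grant_ptr[j] = (i + 1) % n
            accept_ptr[i] = (j + 1) % n
    return match

# Input 0 wants output 0, input 1 wants output 1: both get matched.
print(islip_iteration([[True, False], [False, True]], [0, 0], [0, 0]))
```

With all requests in conflict a single iteration can leave inputs unmatched; running the iteration a few times per cell time fills in most of the remaining matches.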
Achieving 100% Throughput: Longest Queue First & Oldest Cell First
[Example: a 4×4 request graph with queue weights of 10 and 1; the maximum weight matching prefers the weight-10 edges]
Weight = Queue Length (LQF) or Waiting Time (OCF) ⇒ 100% throughput.
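For the small examples on these slides, a maximum weight matching can be found by brute force. This is purely illustrative (real schedulers use far cheaper approximations such as the iterative algorithms listed earlier):

```python
from itertools import permutations

def max_weight_match(W):
    """Exhaustive maximum weight matching for an n x n weight matrix:
    try every one-to-one assignment of inputs to outputs."""
    n = len(W)
    best = max(permutations(range(n)),
               key=lambda p: sum(W[i][p[i]] for i in range(n)))
    return dict(enumerate(best)), sum(W[i][best[i]] for i in range(n))

# LQF prefers the long (weight-10) queues over the short (weight-1) ones.
W = [[10, 1], [1, 10]]
print(max_weight_match(W))   # ({0: 0, 1: 1}, 20)
```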
Theorems
Def: 100% throughput ⇔ E[ Σ_{i,j} L_{i,j}(n) ] < ∞, ∀ n
Lyapunov Stability Criterion:
E[ V(L(n+1)) − V(L(n)) | L(n) ] ≤ −ε‖L(n)‖ ≤ 0, ∀ ‖L(n)‖ > k
Theorem: Both LQF and OCF can achieve 100% throughput for independent traffic, both uniform and non-uniform.
Proof: http://tiny-tera.stanford.edu/~adisak/research.html
Approximating LQF and OCF: iLQF & iOCF
Iteration steps:
Step 1. Request
Step 2. Grant to the largest request
Step 3. Accept the grant for the largest request
[Example: with queue weights of 10 and 1, iterations 1, 2, 3 of grant and accept settle on the weight-10 edges]
iLQF and iOCF: The Problem is in the Comparators
[Diagram: each of the N grant arbiters and N accept arbiters must run b-bit magnitude comparisons over the queue lengths L_{1,1}..L_{N,N}, decode log N-bit results, and clear requests between iterations; the magnitude comparators sit on the critical path]
Solution to Complexity Problem
☛ Longest Port First (LPF)
☛ Oldest Port First (OPF)
Advantages
• SIMPLER: can use maximum size matching, O(N^2.5).
• FASTER: moves the magnitude comparator out of the critical path, and lends itself well to pipelining.
LPF Algorithm: Using Port Occupancy
The weight of request (i, j) is its port occupancy: the occupancy of input i plus the occupancy of output j,
w_{i,j} = Σ_{j'} L_{i,j'} + Σ_{i'} L_{i',j}
e.g. for a 2×2 switch, w_{1,1} = (L_{1,1} + L_{1,2}) + (L_{1,1} + L_{2,1}).
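The port-occupancy weights are cheap to form because they need only row and column sums of the queue occupancies. A small sketch (my own illustration):

```python
def lpf_weights(L):
    """w[i][j] = (occupancy of input i) + (occupancy of output j).
    Note L[i][j] itself is counted once in each sum."""
    n = len(L)
    inp = [sum(row) for row in L]                             # input occupancies
    out = [sum(L[i][j] for i in range(n)) for j in range(n)]  # output occupancies
    return [[inp[i] + out[j] for j in range(n)] for i in range(n)]

L = [[1, 2], [3, 4]]
print(lpf_weights(L))   # [[7, 9], [11, 13]]; w[0][0] = (1+2) + (1+3) = 7
```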
On The Theorems
Theorem: LPF can achieve 100% throughput for independent traffic, both uniform and non-uniform.
Theorem: An LPF match is of both maximum weight and maximum size.
Proof: via the Lyapunov criterion E[ V(L(n+1)) − V(L(n)) | L(n) ] ≤ −ε‖L(n)‖ ≤ 0, ∀ ‖L(n)‖ > k, with the quadratic function V(L(n)) = Lᵀ(n) T L(n).
Presorting Inputs & Outputs
[Example: a 4×4 matrix of request weights is permuted by sorting the inputs and outputs by occupancy, gathering the heaviest requests together before matching]
High Performance Switching and Routing
43
11 0
10 1
00 1
1920 9
4142 31
4445 34
2021 10
22
25
1
0
1
0
1
1
0
1
0
1
Remove Weights
11 0
10 1
00 1
Matching01 0
10 0
00 1
Implementation
[Diagram: input occupancies (e.g. {10, 20, 30}) and output occupancies (e.g. {20, 25, 15}) feed two sorters producing input and output permutations (e.g. {3, 2, 1} and {2, 1, 3}); crossbars apply the permutations to the raw requests, a maximum size matcher runs on the permuted requests, and the resulting match is permuted back]
Multicast Traffic: Queue Architecture
1. Making use of the crossbar
2. Why treat multicast differently?
3. Why maintain a single FIFO queue?
4. Fanout-splitting
Fanout-Splitting
[Figure: average cell latency (log scale, 0.1 to 10000) vs offered load (0.04 to 0.24), with and without fanout-splitting; fanout-splitting sustains a higher load before latency blows up]
Multicast Traffic
1. Residue Concentration
2. Tetris-based schedulers
1. Accelerating Forwarding Decisions:
   • Longest-matching prefixes
2. Interconnections: Switched Backplanes
   • Input Queueing
     - Theory
     - Unicast
     - Multicast
   • Fast Buffering
   • Speedup
   • The Tiny Tera Project
Fast Buffering: Ping-pong Memory
[Diagram: replace a single buffer memory of size M with two buffer memories of size M/2 each]
[Diagram: cells are written alternately ("ping-ponged") into memory 1 and memory 2, each of size M/2, with occupancies X1 and X2]
Fast Buffering: Ping-pong Memory
[Figure: buffer occupancy vs time against the levels M/2 and M; the maximum "cost" of splitting the memory is M/2]
Fast Buffering: Ping-pong Memory
[Figure: log(overflow rate) vs buffer size M, for a single memory of size M, a ping-pong pair (M/2, M/2), and a single memory of size M/2]
In practice, the cost is <5%.
Some Results: Input Queued Switch
Wastage Factor: ω(R) ≡ (M(R) − M̃(R)) / M(R)
• ω(R) decreases with M
• ω(R) decreases with burstiness
• ω(R) decreases with load
• ω(R) decreases with number of ports
1. Accelerating Forwarding Decisions:
   • Longest-matching prefixes
2. Interconnections: Switched Backplanes
   • Input Queueing
     - Theory
     - Unicast
     - Multicast
   • Fast Buffering
   • Speedup
   • The Tiny Tera Project
Matching Output Queueing with Input- and Output-Queueing
Combined Input- and Output-Queueing: how much speedup is enough?
[Diagram: a speedup of k means k reads and writes per cell time]
Matching Output Queueing with Input- and Output-Queueing
How much speedup is enough? Conventional wisdom suggests that a speedup of k = 2 to 4 leads to high throughput.
Matching Output Queueing with Input- and Output-Queueing
[Diagram: fed the same traffic, does a combined input- and output-queued switch behave identically to an output queued switch?]
Fact: To match output queueing with FIFO input queues, k = N is necessary and sufficient.
Fact: To match output queueing with virtual output queues, k = 2 − 1/N is necessary and sufficient.
1. Accelerating Forwarding Decisions:
   • Longest-matching prefixes
2. Interconnections: Switched Backplanes
   • Input Queueing
     - Theory
     - Unicast
     - Multicast
   • Fast Buffering
   • Speedup
   • The Tiny Tera Project