Sundar Iyer
Winter 2012, Lecture 7
Packet Buffers
EE384 Packet Switch Architectures
The Problem
• All packet switches (e.g. Internet routers, Ethernet switches) require packet buffers for periods of congestion.
• Size: A commonly used “rule of thumb” says that buffers need to hold one RTT (about 0.25s) of data. Even if this could be reduced to 10ms, a 4x10Gb/s linecard would require 400Mbits of buffering.
• Speed: Clearly, the buffer needs to store (retrieve) packets as fast as they arrive (depart). At 4x10Gb/s, minimum-sized (40B) packets can arrive and depart every 8ns.
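The size and speed figures above follow from simple arithmetic. A minimal sketch, with function names chosen here purely for illustration:

```python
def buffer_bits(line_rate_bps: float, buffering_time_s: float) -> float:
    """Bits of buffering needed to hold buffering_time_s of traffic at line rate."""
    return line_rate_bps * buffering_time_s


def packet_time_ns(packet_bytes: int, line_rate_bps: float) -> float:
    """Time in ns that one packet occupies the line."""
    return packet_bytes * 8 / line_rate_bps * 1e9


LINE_RATE = 4 * 10e9  # 4x10Gb/s linecard, in bits per second

print(buffer_bits(LINE_RATE, 10e-3) / 1e6)  # Mbits needed for 10ms of buffering
print(buffer_bits(LINE_RATE, 0.25) / 1e9)   # Gbits needed for a 0.25s RTT
print(packet_time_ns(40, LINE_RATE))        # ns between minimum-sized (40B) packets
```

The first value reproduces the 400Mbit figure and the last one the 8ns packet time; the 0.25s case shows why the full rule of thumb is so expensive at this line rate.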
An Example: Packet buffers for a 40Gb/s linecard
[Figure: a Buffer Manager in front of a Buffer Memory. Write rate R: one 40B packet every 8ns. Read rate R: one 40B packet every 8ns. Scheduler requests are unpredictable.]
Memory needs to be accessed for a write or a read every 4ns.
Memory Operations Per Second (MOPS)
What is MOPS?
• Number of unique memory operations per second
• Refers to the speed of the address (not data) bus
• Inverse of the random access time
Examples
• SRAM with 4ns access time = 250M MOPS
• DRAM with 50ns access time = 20M MOPS
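Since MOPS is just the inverse of the random access time, the two examples can be checked directly, along with the access time the 40Gb/s linecard needs. This is a small sketch; the helper names are hypothetical.

```python
def mops(access_time_ns: float) -> float:
    """Unique memory operations per second for a given random access time."""
    return 1e9 / access_time_ns


def required_access_time_ns(packet_bytes: int, line_rate_gbps: int) -> float:
    """Access time needed when every packet time holds one write and one read."""
    packet_time = packet_bytes * 8 / line_rate_gbps  # bits / (Gb/s) = ns
    return packet_time / 2


print(mops(4))    # SRAM with a 4ns access time
print(mops(50))   # DRAM with a 50ns access time
print(mops(required_access_time_ns(40, 40)))  # rate needed for 40B packets at 40Gb/s
```

With 40B packets at 40Gb/s the required access time is 4ns, which the SRAM example meets and the DRAM example misses by more than a factor of ten.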
Memory Technology
Use SRAM?
+ Fast enough random access time, but
- Low density, high cost, high power.
Use DRAM?
+ High density means we can store the data, but
- Can’t meet the random access time.
The Problem: No single memory technology is a good match
Ideal to have access/s of SRAM, Cost & Density of DRAM
Memory              Access rate   Cost         Pin bandwidth
SRAM (S)            800M MOPS     $1 per Mb    800 Mb/s per pin
FCRAM/RLDRAM (F)    50M MOPS      4c per Mb    1000 Mb/s per pin
XDRAM (X)           25M MOPS      2c per Mb    3200 Mb/s per pin
DDR3 (D)            25M MOPS      1c per Mb    1600 Mb/s per pin
Sol 1: Can’t we just use lots of DRAMs as separate memories in parallel?
[Figure: eight separate Buffer Memory banks, each holding 40B packets.]
Solution
– Write 40B packets to available banks
– Read 40B packets from specified banks
Problem
– What if back-to-back reads occur from a small number of banks?
Read or write 40B every 4ns, each time from a different memory with a 32ns access time.
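The bank-conflict problem can be made concrete with a toy model. The sketch below assumes one request per 4ns slot, a bank that stays busy for 32ns once an access starts, and requests that simply wait for their bank. The function name and the waiting model are illustrative assumptions, not part of the lecture.

```python
def count_bank_conflicts(bank_sequence, slot_ns=4, access_ns=32):
    """Count requests that find their bank still busy.

    bank_sequence[i] is the bank addressed in slot i (one request per slot).
    A bank stays busy for access_ns after it starts an access; a request to
    a busy bank waits until the bank is free.
    """
    busy_until = {}  # bank -> time (ns) at which the bank becomes free
    conflicts = 0
    for slot, bank in enumerate(bank_sequence):
        now = slot * slot_ns
        if busy_until.get(bank, 0) > now:
            conflicts += 1             # bank is busy: the request must wait
            start = busy_until[bank]
        else:
            start = now
        busy_until[bank] = start + access_ns
    return conflicts


round_robin = [i % 8 for i in range(64)]  # requests spread over all 8 banks
two_banks = [i % 2 for i in range(64)]    # requests hammer only 2 banks

print(count_bank_conflicts(round_robin))
print(count_bank_conflicts(two_banks))
```

Spreading requests evenly over eight banks gives each bank exactly 32ns between accesses, so nothing waits. Writes can be steered to free banks, but reads must go wherever the requested packet lives, so a scheduler that keeps asking for packets in the same few banks falls steadily behind.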
Sol 2: Can’t we just use lots of DRAMs as one monolithic memory in parallel?
[Figure: a Buffer Manager in front of eight Buffer Memory devices accessed together as one wide memory. Write rate R: one 40B packet every 8ns. Read rate R: one 40B packet every 8ns. The eight devices hold bytes 0-39, 40-79, …, 280-319 of each 320B block.]
Read/write 320B every 32ns
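The 320B block size follows from the line rate and the DRAM access time: the memory has to sustain a write stream and a read stream, each at rate R, with only one access every 32ns. A rough sketch of that arithmetic (the function name is illustrative):

```python
def block_size_bytes(line_rate_gbps: int, access_time_ns: int) -> float:
    """Bytes that each memory access must move to sustain 2R of bandwidth."""
    bits_per_access = 2 * line_rate_gbps * access_time_ns  # Gb/s * ns = bits
    return bits_per_access / 8


block = block_size_bytes(40, 32)
print(block)        # bytes per access for a 40Gb/s linecard and 32ns DRAM
print(block / 40)   # number of 40B-wide memories needed side by side
```

For R = 40Gb/s and a 32ns access time this gives 320B per access, i.e. eight 40B-wide memories operated in lockstep.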
Sol 2: Works fine if there is only one FIFO
[Figure: 40B packets arrive at the Buffer Manager at write rate R (one 40B packet every 8ns) and are gathered into 320B blocks. Blocks are written to and read from the slow Buffer Memory 320B at a time (bytes 0-39, 40-79, …, 280-319), and packets depart at read rate R (one 40B packet every 8ns).]
Sol 2: Works fine if there is only one FIFO, and supports variable-length packets
[Figure: packets of arbitrary length (?B) arrive at the Buffer Manager at write rate R and are packed into 320B blocks. Blocks are written to and read from the Buffer Memory 320B at a time (bytes 0-39, 40-79, …, 280-319), and packets depart at read rate R.]
Sol 2: In practice, buffer holds many FIFOs
[Figure: the Buffer Memory holds Q FIFOs (1, 2, …, Q), each stored as a sequence of 320B blocks (bytes 0-39, 40-79, …, 280-319); Q might be 1k – 64k. Variable-length packets (?B) arrive at the Buffer Manager at write rate R (one 40B packet every 8ns) and depart at read rate R (one 40B packet every 8ns).]
How can we write multiple variable-length packets into different queues?
Problem
A block contains packets for different queues, which must be written to, or read from different memory locations.
Sol 3: Hybrid Memory Hierarchy
[Figure: a packet processor between arriving packets (rate R) and departing packets (rate R), backed by a small, fast SRAM cache and a big, slow DRAM memory.]
A CPU cache is probabilistic: it has a small but non-zero miss rate.
Q: Why is randomness a problem in this context?
Sol 4: Hybrid Memory Hierarchy with 100% Cache Hit Rate
[Figure: arriving packets (rate R) enter a small SRAM that holds the tails of FIFOs 1, 2, …, Q; the tails are written to DRAM b bytes at a time. A large DRAM memory holds the FIFO bodies. FIFO heads are read from DRAM b bytes at a time into a small SRAM that holds the heads of FIFOs 1, 2, …, Q, from which packets depart at rate R in response to unpredictable scheduler requests.]
Design questions
1. What is the minimum SRAM needed to guarantee that a byte is always available in SRAM when requested?
2. What algorithm minimizes the SRAM size?
An Example: Q = 5, w = 9+, b = 6
[Figure: animation of the bytes held in SRAM for each of the Q = 5 queues at t = 0, 1, …, 13, then at t = 19 and t = 23. Some frames are marked "Replenish" (including t = 13 and t = 19); the t = 23 frame is marked "Read".]
The size of the SRAM cache
Necessity
– How large does the SRAM cache need to be under any management algorithm?
– Claim: wQ > Q(b - 1)(2 + lnQ)
Sufficiency
– For any pattern of arrivals, what is the smallest SRAM cache needed so that a byte is always available when requested?
– For one particular algorithm: wQ = Qb(2 + lnQ)
[Figure: the SRAM cache drawn as Q queues, each w bytes deep.]
Definitions
Occupancy: X(q,t)
The number of bytes in FIFO q (in SRAM) at time t.
Deficit: D(q,t) = w - X(q,t)
[Figure: Q queues, each w bytes deep; the filled part of a queue is its occupancy and the empty part is its deficit.]
Smallest SRAM cache
1st iteration: read 1 byte from every queue. $Q/b$ queues are replenished, and $Q(1 - 1/b)$ queues have a deficit of 1 byte.
2nd iteration: read 1 byte from every queue with a deficit of 1 byte. $Q(1 - 1/b)^2$ queues have a deficit of 2 bytes.
x-th iteration: at the end of the x-th iteration, $Q(1 - 1/b)^x$ queues have a deficit of $x$ bytes.
After some number of iterations, we are down to one queue:
$$Q\left(1 - \frac{1}{b}\right)^{x} = 1, \quad \text{i.e.} \quad x = \frac{\ln Q}{\ln\left(1 + \frac{1}{b-1}\right)} > (b-1)\ln Q, \quad \text{since } \ln(1+\epsilon) < \epsilon \text{ for small } \epsilon > 0.$$
If the queue has fewer than $b$ bytes in it, then successive reads make the queue under-run before it can be replenished. So, $w > (b-1) + (b-1)\ln Q$.
Smallest SRAM cache
In addition, each queue needs to hold (b – 1) bytes in case it is replenished with b bytes when only 1 byte has been removed.
Therefore, SRAM size must be at least: Qw > Q(b – 1)(2 + lnQ).
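The iteration count in the necessity argument is easy to check numerically. The sketch below solves Q(1 - 1/b)^x = 1 for x, compares it with (b - 1) ln Q, and evaluates the resulting lower bound on the SRAM size. The function names are hypothetical, and the values of Q and b are borrowed from the 40Gb/s example used in this lecture.

```python
import math


def iterations_to_one_queue(Q: int, b: int) -> float:
    """x such that Q * (1 - 1/b)**x == 1."""
    return math.log(Q) / -math.log(1 - 1 / b)


def sram_lower_bound_bytes(Q: int, b: int) -> float:
    """Necessary SRAM size Q(b - 1)(2 + ln Q), in bytes."""
    return Q * (b - 1) * (2 + math.log(Q))


Q, b = 128, 640
x = iterations_to_one_queue(Q, b)
print(x, (b - 1) * math.log(Q))       # x is slightly larger than (b - 1) ln Q
print(sram_lower_bound_bytes(Q, b))   # lower bound on total SRAM, in bytes
```

For large b the two quantities on the first line are almost equal, so replacing x by (b - 1) ln Q loses very little.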
Most Deficit Queue First
Algorithm: Every b timeslots, replenish the queue with the largest deficit.
Claim: An SRAM cache of size Qw > Qb(2 + lnQ) is sufficient.
Examples:
1. 40Gb/s linecard, b=640, Q=128: SRAM = 560kBytes
2. 160Gb/s linecard, b=2560, Q=512: SRAM = 10MBytes
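A minimal sketch of Most Deficit Queue First, together with the sizing formula behind the examples. It assumes one byte is read per timeslot and tracks, for each queue, the bytes read but not yet replenished; this count goes negative right after a queue receives b bytes while owed fewer. The function names and the request patterns are illustrative only.

```python
import math


def mdqf_sram_bytes(Q: int, b: int) -> float:
    """SRAM size Q*b*(2 + ln Q) claimed to be sufficient for MDQF, in bytes."""
    return Q * b * (2 + math.log(Q))


def simulate_mdqf(requests, Q: int, b: int) -> int:
    """Run MDQF over a request sequence and return the largest deficit seen.

    requests[t] is the queue that has one byte read in timeslot t.
    """
    deficit = [0] * Q
    max_deficit = 0
    for t, q in enumerate(requests):
        deficit[q] += 1                      # one byte leaves queue q
        max_deficit = max(max_deficit, deficit[q])
        if (t + 1) % b == 0:                 # every b timeslots...
            neediest = max(range(Q), key=lambda i: deficit[i])
            deficit[neediest] -= b           # ...replenish the most-deficit queue
    return max_deficit


print(mdqf_sram_bytes(128, 640))     # compare with the 40Gb/s example
print(mdqf_sram_bytes(512, 2560))    # compare with the 160Gb/s example
print(simulate_mdqf([t % 4 for t in range(16)], Q=4, b=2))  # round-robin reads
print(simulate_mdqf([0] * 12, Q=4, b=3))                    # one queue only
```

The two request patterns are deliberately simple. Feeding in a harsher pattern, such as the iteration used in the necessity argument, is a natural way to compare the largest observed deficit against b(2 + ln Q).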
Intuition for Theorem
The maximum number of un-replenished requests for any i queues, $w_i$, is the solution of the difference equation
$$w_{i-1} = b + \frac{i-1}{i}\, w_i, \qquad i \in \{2, 3, \ldots, Q\},$$
with boundary condition $w_Q = Qb$.
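The difference equation can be unrolled numerically. The sketch below assumes the recurrence has the form w_{i-1} = b + ((i-1)/i) w_i with boundary condition w_Q = Qb; both the form and the function name are assumptions made for illustration. It compares the resulting single-queue value w_1 with b(2 + ln Q).

```python
import math


def solve_deficit_recurrence(Q: int, b: int) -> float:
    """Unroll w_{i-1} = b + (i-1)/i * w_i from w_Q = Q*b down to w_1."""
    w = Q * b                       # boundary condition w_Q = Q*b
    for i in range(Q, 1, -1):       # i = Q, Q-1, ..., 2
        w = b + (i - 1) / i * w     # w_{i-1} from w_i
    return w


Q, b = 128, 640
print(solve_deficit_recurrence(Q, b))   # w_1 under the assumed recurrence
print(b * (2 + math.log(Q)))            # the per-queue bound b(2 + ln Q)
```

Dividing the assumed recurrence by (i - 1) shows that w_1 = b(1 + 1 + 1/2 + … + 1/(Q-1)), and the harmonic sum is what produces the ln Q term in the bound.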