
Sundar Iyer

Winter 2012, Lecture 7

Packet Buffers

EE384: Packet Switch Architectures

The Problem

• All packet switches (e.g., Internet routers, Ethernet switches) require packet buffers for periods of congestion.

• Size: A commonly used “rule of thumb” says that buffers need to hold one RTT (about 0.25s) of data. Even if this could be reduced to 10ms, a 4x10Gb/s linecard would require 400Mbits of buffering.

• Speed: Clearly, the buffer needs to store (retrieve) packets as fast as they arrive (depart). At 4x10Gb/s, minimum-sized packets must arrive and depart every 8ns.
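The sizing arithmetic in the bullets above can be checked directly (a small sketch; the 10ms target and the 4x10Gb/s line rate are the figures quoted on the slide):

```python
# Rule of thumb: buffer capacity = RTT x line rate.
line_rate_gbps = 40                      # 4 x 10 Gb/s linecard
rtt_ms = 10                              # reduced RTT target
buffer_mbits = line_rate_gbps * rtt_ms   # Gb/s * ms = Mbit
print(buffer_mbits)                      # -> 400

# Time budget per minimum-sized (40B) packet:
packet_bits = 40 * 8
packet_time_ns = packet_bits / line_rate_gbps  # 1 Gb/s = 1 bit/ns
print(packet_time_ns)                          # -> 8.0
```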

An Example: Packet buffers for a 40Gb/s linecard

[Figure: A Buffer Manager connects the line to the Buffer Memory and serves unpredictable scheduler requests. Write rate R: one 40B packet every 8ns; read rate R: one 40B packet every 8ns.]

Memory needs to be accessed for a write or read every 4ns.

Memory Operations Per Second (MOPS)

What is MOPS?
• Number of unique memory operations per second
• Refers to the speed of the address (not data) bus
• Inverse of random access time

Examples
• SRAM with 4ns access time = 250M MOPS
• DRAM with 50ns access time = 20M MOPS
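Since MOPS is just the inverse of the random access time, the two examples can be reproduced with a one-liner (a sketch; the helper name is illustrative):

```python
def mops(access_time_ns: float) -> float:
    """MOPS is the inverse of the random access time.

    Returns millions of unique memory operations per second."""
    return 1e9 / access_time_ns / 1e6

print(mops(4))    # SRAM, 4 ns access  -> 250.0
print(mops(50))   # DRAM, 50 ns access -> 20.0
```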

Memory Technology

Use SRAM?
+ Fast enough random access time, but
– Low density, high cost, high power.

Use DRAM?
+ High density means we can store data, but
– Can't meet random access time.

The Problem: No single memory technology is a good match.
Ideal: the accesses/sec of SRAM with the cost and density of DRAM.

Technology         MOPS    Cost per Mb    Bandwidth per pin
SRAM (S)           800M    $1             800 Mb/s
FCRAM/RLDRAM (F)   50M     4c             1000 Mb/s
XDRAM (X)          25M     2c             3200 Mb/s
DDR3 (D)           25M     1c             1600 Mb/s

Sol 1: Can’t we just use lots of DRAMs as separate memories in parallel?

[Figure: Eight parallel Buffer Memory banks; each 40B packet is written to or read from an individual bank. Read or write 40B every 4ns, each access going to a different '32ns access time' memory.]

Solution
– Write 40B packets to available banks
– Read 40B packets from specified banks

Problem
– What if back-to-back reads occur from a small number of banks?
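The back-to-back-read problem can be made concrete with a toy timing simulation (bank count, access time, and slot duration are the figure's values; the function name is illustrative):

```python
def stall_slots(reads, banks=8, access_ns=32, slot_ns=4):
    """Count the 4ns slots a read must wait because its bank is busy.

    Each bank can start a new 40B access only every access_ns, while a
    new read request arrives every slot_ns."""
    busy_until = [0] * banks        # time at which each bank is free
    now, stalls = 0, 0
    for bank in reads:
        if busy_until[bank] > now:
            stalls += (busy_until[bank] - now) // slot_ns
            now = busy_until[bank]
        busy_until[bank] = now + access_ns
        now += slot_ns
    return stalls

spread = [i % 8 for i in range(64)]   # reads spread round-robin: no stalls
hot = [0] * 64                        # back-to-back reads to one bank
print(stall_slots(spread), stall_slots(hot))
```

With reads spread over all eight banks the line rate is sustained, but an adversarial pattern hitting one bank forces the system down to that bank's 32ns access time.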

Sol 2: Can’t we just use lots of DRAMs as one monolithic memory in parallel?

[Figure: The Buffer Manager aggregates packets into 320B blocks and stripes each block across the banks as one wide memory (bytes 0-39, 40-79, ..., 280-319). Write rate R: one 40B packet every 8ns; read rate R: one 40B packet every 8ns. The wide memory is read or written 320B every 32ns.]

Sol 2: Works fine if there is only one FIFO

[Figure: Arriving 40B packets (one every 8ns) are accumulated by the Buffer Manager into 320B blocks, written to the slow buffer memory one 320B block at a time (bytes 0-39, 40-79, ..., 280-319), and read back as 320B blocks to depart as 40B packets (one every 8ns).]
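The single-FIFO aggregation step can be sketched as follows (class and method names are illustrative, not from the slides):

```python
class BlockAggregator:
    """Coalesce fixed-size packets into wide blocks for a slow memory.

    Packets of packet_size bytes are buffered until a full block of
    block_size bytes can be written in one wide memory access."""

    def __init__(self, packet_size=40, block_size=320):
        assert block_size % packet_size == 0
        self.block_size = block_size
        self.pending = bytearray()
        self.blocks_written = []        # stand-in for the slow DRAM

    def write_packet(self, packet: bytes):
        self.pending += packet
        if len(self.pending) >= self.block_size:
            # One wide 320B access replaces eight 40B accesses.
            self.blocks_written.append(bytes(self.pending[:self.block_size]))
            del self.pending[:self.block_size]

agg = BlockAggregator()
for i in range(16):                     # sixteen 40B packets ...
    agg.write_packet(bytes([i]) * 40)
print(len(agg.blocks_written))          # ... become two 320B block writes
```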

Sol 2: Works fine if there is only one FIFO, and supports variable-length packets

[Figure: The same arrangement, but with variable-length (?B) packets arriving and departing; the Buffer Manager still transfers fixed 320B blocks to and from the buffer memory.]

Sol 2: In practice, buffer holds many FIFOs

[Figure: Q FIFOs (Q might be 1k-64k), each made of 320B blocks (bytes 0-39, 40-79, ..., 280-319). Write rate R: one 40B packet every 8ns; read rate R: one 40B packet every 8ns. How can we write multiple variable-length packets into different queues?]

Problem
A block contains packets for different queues, which must be written to, or read from, different memory locations.

Sol 3: Hybrid Memory Hierarchy

[Figure: Arriving packets (rate R) pass through a packet processor; a small fast SRAM cache sits in front of a big slow DRAM; departing packets leave at rate R.]

A CPU cache is probabilistic

Q: Why is randomness a problem in this context?

[Figure: Arriving packets (rate R) fill Q FIFOs whose heads are kept in a small SRAM; departing packets (rate R) are driven by unpredictable scheduler requests. With a probabilistic cache there is a small but non-zero miss rate, so a requested byte may not be in SRAM.]

Sol 4: Hybrid Memory Hierarchy with 100% Cache Hit Rate

[Figure: A small SRAM holds the heads and tails of Q FIFOs; a large DRAM holds the FIFO bodies. Writes of b bytes move FIFO tails from SRAM into DRAM; reads of b bytes move FIFO bodies from DRAM into the SRAM heads.]

Design questions

1. What is the minimum SRAM needed to guarantee that a byte is always available in SRAM when requested?

2. What algorithm minimizes the SRAM size?

An Example: Q = 5, w = 9+, b = 6

[Figure: An animation over timeslots t = 0 through t = 23, showing each queue's SRAM occupancy as one byte is read per timeslot and queues are periodically replenished with b = 6 bytes from DRAM.]

The size of the SRAM cache

Necessity
– How large does the SRAM cache need to be under any management algorithm?
– Claim: wQ > Q(b − 1)(2 + lnQ)

Sufficiency
– For any pattern of arrivals, what is the smallest SRAM cache needed so that a byte is always available when requested?
– For one particular algorithm: wQ = Qb(2 + lnQ)

[Figure: Q FIFOs, each w bytes deep, in SRAM.]

Definitions

Occupancy X(q,t): the number of bytes in FIFO q (in SRAM) at time t.

Deficit: D(q,t) = w − X(q,t)

[Figure: Q FIFOs of depth w bytes; the filled part of each FIFO is its occupancy, the remainder its deficit.]

Smallest SRAM cache

Consider a read pattern that always reads from the queues with the largest deficit.

1st iteration: read 1 byte from every queue. During these Q reads, Q/b queues are replenished, and Q(1 − 1/b) queues have a deficit of 1 byte.

2nd iteration: read 1 byte from every queue with a deficit of 1 byte. Now Q(1 − 1/b)² queues have a deficit of 2 bytes.

At the end of the xth iteration, Q(1 − 1/b)^x queues have a deficit of x bytes.

After some number of iterations, we are down to one queue:

Q(1 − 1/b)^x = 1, i.e. x = lnQ / ln(b/(b − 1)) ≥ (b − 1) lnQ, since ln(1 + y) ≤ y for y = 1/(b − 1).

If that queue has fewer than x bytes in SRAM, then successive reads make the queue under-run before it can be replenished. So w > (b − 1)(1 + lnQ).
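The iteration count in this argument is easy to check numerically (a small sketch; the Q and b values are arbitrary):

```python
from math import log

def max_deficit_iterations(Q: int, b: int) -> int:
    """Iterate the adversarial pattern: each round, every remaining
    queue's deficit grows by 1 while a 1/b fraction of the queues is
    replenished. Returns the number of rounds until one queue remains."""
    remaining, rounds = float(Q), 0
    while remaining > 1:
        remaining *= 1 - 1 / b
        rounds += 1
    return rounds

Q, b = 1000, 64
print(max_deficit_iterations(Q, b))   # close to (b - 1) * ln(Q)
print(round((b - 1) * log(Q)))
```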

Smallest SRAM cache

In addition, each queue needs to hold (b – 1) bytes in case it is replenished with b bytes when only 1 byte has been removed.

Therefore, SRAM size must be at least: Qw > Q(b – 1)(2 + lnQ).

Most Deficit Queue First

Algorithm: Every b timeslots, replenish the queue with the largest deficit.

Claim: An SRAM cache of size Qw > Qb(2 + lnQ) is sufficient.

Examples:
1. 40Gb/s linecard, b = 640, Q = 128: SRAM = 560 kBytes
2. 160Gb/s linecard, b = 2560, Q = 512: SRAM = 10 MBytes
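The sufficiency bound Qb(2 + lnQ) reproduces the example sizes (a quick check; the helper name is illustrative, and the results round to the slide's figures):

```python
from math import log

def sram_bytes(Q: int, b: int) -> float:
    """Sufficient SRAM size in bytes for MDQF: Q * b * (2 + ln Q)."""
    return Q * b * (2 + log(Q))

print(sram_bytes(128, 640) / 1e3)    # ~560 kBytes (40 Gb/s linecard)
print(sram_bytes(512, 2560) / 1e6)   # ~10.8 MBytes (160 Gb/s linecard)
```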


Intuition for Theorem

The maximum number of un-replenished requests for any i queues, w_i, is the solution of the difference equation

w_i = (i / (i − 1)) (w_{i−1} + b),  i = 2, 3, ..., Q

with boundary condition w_1 = b. Its solution satisfies w_Q ≤ Qb(2 + lnQ).
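One plausible reading of the recurrence on this slide is w_1 = b with w_i = (i/(i−1))(w_{i−1} + b); under that reading the bound can be checked numerically:

```python
from math import log

def w_Q(Q: int, b: int) -> float:
    """Solve the difference equation w_i = (i/(i-1)) * (w_{i-1} + b)
    with boundary condition w_1 = b, returning w_Q."""
    w = float(b)
    for i in range(2, Q + 1):
        w = i / (i - 1) * (w + b)
    return w

Q, b = 128, 640
print(w_Q(Q, b) <= Q * b * (2 + log(Q)))   # the bound Qb(2 + lnQ) holds
```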
