February 17, 2009 http://csg.csail.mit.edu/arvind L06-1
IP Lookup
Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
IP Lookup block in a router
[Figure: a Line Card (LC) contains a Packet Processor, Queue Manager, Exit functions, a Control Processor, and the IP Lookup block with its SRAM (lookup table); several LCs connect through Arbitration to a Switch.]
A packet is routed based on the "Longest Prefix Match" (LPM) of its IP address with entries in a routing table.
Line rate and the order of arrival must be maintained.
Line rate: 15 Mpps for 10GE
Sparse tree representation
Routing table (prefix, result):
  7.14.*.*       A
  7.14.7.3       B
  10.18.200.*    C
  10.18.200.5    D
  5.*.*.*        E
  *              F
Example lookups (IP address, result):
  7.13.7.3       F
  10.18.201.5    F
  7.14.7.2       A
  5.13.7.2       E
  10.18.200.7    C
[Figure: sparse tree of tables with entries indexed 0 to 255; the slide also lists the number of memory references (M Ref) for each lookup.]
In this lecture: Level 1: 16 bits, Level 2: 8 bits, Level 3: 8 bits
1 to 3 memory accesses
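The three-level table (16, 8, and 8 bits) can be tried out in software. The sketch below is a Python model, not the lecture's code: the names `build`, `lpm`, and the `("leaf", x)` / `("ptr", base)` entry encoding are assumptions made for illustration. It expands the routing table above into a flat RAM and counts memory accesses per lookup.

```python
LEVEL_BITS = (16, 8, 8)  # level sizes used in this lecture

def ip(s):
    """Convert dotted-quad text to a 32-bit integer."""
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def build(prefixes):
    """Build a flat RAM from (address, prefix_length, result) triples.

    Entries are ("leaf", result) or ("ptr", base_of_next_table);
    this encoding is an assumption of the sketch.
    """
    ram = [("leaf", None)] * (1 << 16)          # level-1 table at base 0
    # Shorter prefixes first, so longer ones overwrite them.
    for addr, length, result in sorted(prefixes, key=lambda p: p[1]):
        a = ip(addr)
        base, shift, covered = 0, 32, 0
        for bits in LEVEL_BITS:
            shift -= bits
            covered += bits
            idx = (a >> shift) & ((1 << bits) - 1)
            if length <= covered:
                # Prefix ends in this level: fill every entry it covers.
                span = 1 << (covered - length)
                start = idx & ~(span - 1)
                for i in range(start, start + span):
                    ram[base + i] = ("leaf", result)
                break
            kind, val = ram[base + idx]
            if kind == "leaf":
                # Allocate a 256-entry table that inherits the old result.
                new_base = len(ram)
                ram.extend([("leaf", val)] * 256)
                ram[base + idx] = ("ptr", new_base)
                val = new_base
            base = val
    return ram

def lpm(ram, addr):
    """Return (result, number_of_memory_accesses) for one address."""
    a = ip(addr)
    base, shift, accesses = 0, 32, 0
    for bits in LEVEL_BITS:
        shift -= bits
        kind, val = ram[base + ((a >> shift) & ((1 << bits) - 1))]
        accesses += 1
        if kind == "leaf":
            return val, accesses
        base = val
    raise AssertionError("level-3 entries must be leaves")

TABLE = [("0.0.0.0", 0, "F"), ("5.0.0.0", 8, "E"), ("7.14.0.0", 16, "A"),
         ("10.18.200.0", 24, "C"), ("7.14.7.3", 32, "B"),
         ("10.18.200.5", 32, "D")]
ram = build(TABLE)
for addr in ("7.13.7.3", "10.18.201.5", "7.14.7.2", "5.13.7.2", "10.18.200.7"):
    print(addr, lpm(ram, addr))
```

The access counts printed here belong to this 16/8/8 layout only; they are not the M Ref values from the slide's sparse tree figure.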
"C" version of LPM
/* ipa[hi:lo] denotes a bit slice of the address (not legal C) */
int lpm (IPA ipa) /* 3 memory lookups */
{
  int p;
  /* Level 1: 16 bits */
  p = RAM[ipa[31:16]];
  if (isLeaf(p)) return value(p);
  /* Level 2: 8 bits */
  p = RAM[ptr(p) + ipa[15:8]];
  if (isLeaf(p)) return value(p);
  /* Level 3: 8 bits */
  p = RAM[ptr(p) + ipa[7:0]];
  return value(p); /* must be a leaf */
}
It is not obvious from the C code how to deal with
- memory latency
- pipelining
[Figure: RAM layout with a level-1 table indexed 0 to 2^16 - 1 and level-2 and level-3 tables each indexed 0 to 2^8 - 1.]
Must process a packet every 1/15 µs or 67 ns
Must sustain 3 memory dependent lookups in 67 ns
Memory latency ~30 ns to 40 ns
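The timing numbers above can be checked with a few lines of arithmetic. This is a rough Python sketch; the function names are made up for illustration, and it assumes the worst case of three dependent lookups per packet.

```python
import math

LINE_RATE_PPS = 15e6        # 15 Mpps for 10GE, from the slide
LOOKUPS_PER_PACKET = 3      # worst case: one memory access per level

def budget_ns(line_rate_pps=LINE_RATE_PPS):
    """Time available per packet at line rate, in nanoseconds."""
    return 1e9 / line_rate_pps

def min_packets_in_flight(latency_ns):
    """Packets that must overlap if each one does its lookups serially."""
    serial_ns = LOOKUPS_PER_PACKET * latency_ns
    return math.ceil(serial_ns / budget_ns())

print(f"per-packet budget: {budget_ns():.1f} ns")
for latency in (30, 40):
    print(f"{latency} ns latency: {LOOKUPS_PER_PACKET * latency} ns serial,",
          f"at least {min_packets_in_flight(latency)} packets in flight")
```

Three dependent accesses take 90 to 120 ns, which exceeds the 67 ns budget, so lookups for different packets have to overlap in the memory.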
Longest Prefix Match for IP lookup: 3 possible implementation architectures
Rigid pipeline
Inefficient memory usage but simple design
Linear pipeline
Efficient memory usage through memory port replicator
Circular pipeline
Efficient memory with most complex control
Designer's Ranking: 1 2 3
Which is "best"?
Arvind, Nikhil, Rosenband & Dave, ICCAD 2004
Circular pipeline
The fifo holds the request while the memory access is in progress
The architecture has been simplified for the sake of the lecture. Otherwise, a “completion buffer” has to be added at the exit to make sure that packets leave in order.
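[Figure: inQ feeds the enter? stage, which issues requests to the RAM while the fifo holds the request; the done? stage sends finished lookups to outQ (yes) or recirculates them (no).]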
interface FIFO#(type t);
  method Action enq(t x); // enqueue an item
  method Action deq();    // remove oldest entry
  method t first();       // inspect oldest item
endinterface
[Figure: FIFO module ports. enq: n-bit data in, enab, rdy (not full). deq: enab, rdy (not empty). first: n-bit data out, rdy (not empty).]
n = # of bits needed to represent a value of type t
Request-Response Interface for Synchronous Memory
interface Mem#(type addrT, type dataT);
  method Action req(addrT x);
  method Action deq();
  method dataT peek();
endinterface
[Figure: a synchronous memory with latency N (Addr, Enable, Data, Ack, DataReady) wrapped to provide req, deq, and peek; a counter ctr (ctr++, ctr--) drives Ready as (ctr > 0).]
Making a synchronous component latency-insensitive
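A software model can make the wrapper idea concrete. The Python sketch below is an illustration only, not the lecture's design: the class name, the `tick` method, and the reading of `ctr` as a count of free response slots (so a request is accepted only when `ctr > 0`) are all assumptions.

```python
from collections import deque

class LatencyInsensitiveMem:
    """Fixed-latency memory behind a req/peek/deq interface (a sketch)."""

    def __init__(self, contents, latency, capacity):
        self.contents = contents
        self.latency = latency
        self.ctr = capacity          # assumed meaning: free response slots
        self.in_flight = deque()     # [cycles_remaining, data] pairs
        self.responses = deque()     # data waiting to be dequeued

    def can_req(self):
        return self.ctr > 0          # the Ready condition

    def can_deq(self):
        return len(self.responses) > 0

    def req(self, addr):
        assert self.can_req()
        self.in_flight.append([self.latency, self.contents[addr]])
        self.ctr -= 1                # reserve a slot for the response

    def peek(self):
        assert self.can_deq()
        return self.responses[0]

    def deq(self):
        assert self.can_deq()
        self.responses.popleft()
        self.ctr += 1                # slot is free again

    def tick(self):
        """Advance one clock cycle."""
        for item in self.in_flight:
            item[0] -= 1
        while self.in_flight and self.in_flight[0][0] == 0:
            self.responses.append(self.in_flight.popleft()[1])

mem = LatencyInsensitiveMem([10, 20, 30, 40], latency=2, capacity=2)
mem.req(0)
mem.req(1)
blocked = not mem.can_req()          # both slots reserved
mem.tick()
early = mem.can_deq()                # nothing has arrived after one cycle
mem.tick()
first = mem.peek()
mem.deq()
```

Because a slot is reserved at request time, a response always has somewhere to land, and callers only ever look at `can_req` and `can_deq`, never at the latency.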
Circular Pipeline Code
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
rule recirculate (True);
  TableEntry p = ram.peek(); ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else begin
    fifo.enq(rip << 8);
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
[Figure: circular pipeline with inQ, enter?, RAM, fifo, and done?]
When can enter fire?
inQ has an element and ram & fifo each has space
done? is the same as isLeaf
Circular Pipeline Code: discussion
[The enter and recirculate rules and the pipeline figure are repeated from the "Circular Pipeline Code" slide.]
When can recirculate fire?
ram & fifo each has an element and ram, fifo & outQ each has space
Is this possible?
One Element FIFO
module mkFIFO1 (FIFO#(t));
  Reg#(t)    data <- mkRegU();
  Reg#(Bool) full <- mkReg(False);
  method Action enq(t x) if (!full);
    full <= True; data <= x;
  endmethod
  method Action deq() if (full);
    full <= False;
  endmethod
  method t first() if (full);
    return (data);
  endmethod
  method Action clear();
    full <= False;
  endmethod
endmodule
enq and deq cannot even be enabled together, much less fire concurrently!
[Figure: FIFO module with enq (n-bit data, enab, rdy = not full) and deq (enab, rdy = not empty).]
The functionality we want is as if deq "happens" before enq; if deq does not happen then enq behaves normally
We can build such a FIFO
Dead cycles
[The enter and recirculate rules and the pipeline figure are repeated from the "Circular Pipeline Code" slide.]
Can a new request enter the system when an old one is leaving?
Is this worth worrying about?
(Assume simultaneous enq & deq is allowed.)
The Effect of Dead Cycles
[Figure: circular pipeline with in, enter, RAM, fifo, and done? (yes: exit, no: recirculate).]
Circular pipeline:
- RAM takes several cycles to respond to a request
- Each IP request generates 1-3 RAM requests
- FIFO entries hold the base pointer for the next lookup and the unprocessed part of the IP address
What is the performance loss if "exit" and "enter" don't ever happen in the same cycle?
>33% slowdown! Unacceptable
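One way to arrive at the 33% figure is a back-of-the-envelope count. The Python sketch below assumes the pipeline is otherwise kept full, that each RAM request occupies one pipeline slot, and that a packet's exit wastes one slot when no new packet can enter in that cycle. This reading of the slide is an assumption, and the function names are illustrative.

```python
def cycles_per_packet(ram_requests, enter_and_exit_overlap):
    """Pipeline slots one packet occupies: one per RAM request, plus one
    dead slot on exit when a new packet cannot enter in the same cycle."""
    return ram_requests + (0 if enter_and_exit_overlap else 1)

def slowdown(ram_requests):
    """Fraction of extra cycles per packet caused by the dead cycle."""
    with_overlap = cycles_per_packet(ram_requests, True)
    without_overlap = cycles_per_packet(ram_requests, False)
    return without_overlap / with_overlap - 1.0

for k in (1, 2, 3):
    print(f"{k} RAM request(s): {slowdown(k):.0%} more cycles per packet")
```

Under these assumptions a packet needing all three lookups costs 4 slots instead of 3, about 33% more, and packets that finish in fewer lookups fare worse, which is consistent with the ">33%" on the slide.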
The compiler issue
Can the compiler detect all the conflicting conditions? Yes
- Important for correctness
Does the compiler detect conflicts that do not exist in reality? Yes
- False positives lower the performance
- The main reason is that sometimes the compiler cannot detect under what conditions the two rules are mutually exclusive or conflict free
What can the user specify easily?
- Rule priorities to resolve nondeterministic choice
In many situations the correctness of the design is not enough; the design is not done unless the performance goals are met
Scheduling conflicting rules
When two rules conflict on a shared resource, they cannot both execute in the same clock cycle.
The compiler produces logic that ensures that, when both rules are applicable, only one will fire.
Which one? Source annotations:
(* descending_urgency = "recirculate, enter" *)
So is there a dead cycle?
[The enter and recirculate rules and the pipeline figure are repeated from the "Circular Pipeline Code" slide.]
In general these two rules conflict, but when isLeaf(p) is true there is no apparent conflict!
Rule Splitting
rule foo (True);
  if (p) r1 <= 5;
  else r2 <= 7;
endrule
is split into
rule fooT (p);
  r1 <= 5;
endrule
rule fooF (!p);
  r2 <= 7;
endrule
Rules fooT and fooF can be scheduled independently with some other rule
Splitting the recirculate rule
rule recirculate (!isLeaf(ram.peek()));
  IP rip = fifo.first();
  fifo.enq(rip << 8);
  ram.req(ram.peek() + rip[15:8]);
  fifo.deq();
  ram.deq();
endrule
rule exit (isLeaf(ram.peek()));
  outQ.enq(ram.peek());
  fifo.deq();
  ram.deq();
endrule
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
Now rules enter and exit can be scheduled simultaneously, assuming fifo.enq and fifo.deq can be done simultaneously
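The effect of urgency and rule splitting on scheduling can be mimicked with a toy scheduler. The Python sketch below is a simplified model and not how the Bluespec compiler is implemented: the `schedule` function and its greedy "most urgent first, skip anything that conflicts with a rule already chosen" policy are assumptions made for illustration.

```python
def schedule(ready, urgency, conflicts):
    """Pick the rules that fire in one cycle.

    ready:     set of rule names whose guards are true this cycle
    urgency:   rule names, most urgent first
    conflicts: set of frozenset pairs that may not fire together
    """
    firing = []
    for rule in urgency:
        if rule in ready and all(
                frozenset((rule, chosen)) not in conflicts
                for chosen in firing):
            firing.append(rule)
    return firing

# Before splitting: recirculate and enter conflict on ram and fifo.
before = schedule(
    ready={"recirculate", "enter"},
    urgency=["recirculate", "enter"],
    conflicts={frozenset(("recirculate", "enter"))},
)

# After splitting: exit does not touch ram.req or fifo.enq, so it is
# assumed not to conflict with enter (given a fifo that allows
# simultaneous enq and deq).
after = schedule(
    ready={"exit", "enter"},
    urgency=["recirculate", "exit", "enter"],
    conflicts={frozenset(("recirculate", "enter"))},
)
print(before, after)
```

In the first case only recirculate fires, even in cycles where it merely retires a result; in the second, exit and enter fire together and the dead cycle goes away.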
Back to the fifo problem
[The mkFIFO1 module and its port figure are repeated from the "One Element FIFO" slide.]
The functionality we want is as if deq "happens" before enq; if deq does not happen then enq behaves normally
RWire to rescue
interface RWire#(type t);
  method Action wset(t x);
  method Maybe#(t) wget();
endinterface
Like a register in that you can read and write it, but unlike a register
- read happens after write
- data disappears in the next cycle
RWires can break the atomicity of a rule if not used properly
One Element "Loopy" FIFO
module mkLFIFO1 (FIFO#(t));
  Reg#(t)      data  <- mkRegU();
  Reg#(Bool)   full  <- mkReg(False);
  RWire#(void) deqEN <- mkRWire();
  method Action enq(t x) if (!full || isValid(deqEN.wget()));
    full <= True; data <= x;
  endmethod
  method Action deq() if (full);
    full <= False; deqEN.wset(?);
  endmethod
  method t first() if (full);
    return (data);
  endmethod
  method Action clear();
    full <= False;
  endmethod
endmodule
[Figure: FIFO module in which the enq rdy signal is (!full) or the deq enable; deq rdy is not empty.]
This works correctly in both cases (fifo full and fifo empty).
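The difference between the two one-element FIFOs shows up in a cycle-level software model. The Python sketch below is illustrative only: the `cycle` method, which attempts a deq and an enq in the same clock, and the `loopy` flag are assumptions of the model rather than part of the BSV code.

```python
class Fifo1:
    """One-element FIFO: enq is ready only when the FIFO is empty."""
    loopy = False

    def __init__(self):
        self.full = False
        self.data = None

    def cycle(self, enq=None, deq=False):
        """Attempt a deq and/or an enq in one clock cycle.

        Guards are evaluated against the state at the start of the cycle.
        Returns (deq_fired, enq_fired).
        """
        deq_fired = deq and self.full
        # Loopy variant: enq is also ready when deq fires this cycle.
        enq_ready = (not self.full) or (self.loopy and deq_fired)
        enq_fired = enq is not None and enq_ready
        if deq_fired:
            self.full = False
        if enq_fired:                 # enq "happens" after deq
            self.full, self.data = True, enq
        return deq_fired, enq_fired

class LoopyFifo1(Fifo1):
    loopy = True

plain, loopy = Fifo1(), LoopyFifo1()
plain.cycle(enq=1)
loopy.cycle(enq=1)
plain_result = plain.cycle(enq=2, deq=True)   # full: only deq fires
loopy_result = loopy.cycle(enq=2, deq=True)   # full: deq, then enq
empty_result = LoopyFifo1().cycle(enq=3, deq=True)  # empty: only enq
print(plain_result, loopy_result, empty_result)
```

With the plain FIFO the slot drains and the new element has to wait a cycle; with the loopy FIFO the old element leaves and the new one takes its place in the same cycle, and an empty loopy FIFO still accepts an enq as usual.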
Problem solved!
LFIFO fifo <- mkLFIFO; // use a loopy fifo
[The recirculate rule is repeated unchanged from the "Circular Pipeline Code" slide.]
RWire has been safely encapsulated inside the Loopy FIFO; users of the Loopy FIFO need not be aware of RWires
Packaging a module: Turning a rule into a method
[Figure: circular pipeline with inQ, enter?, RAM, fifo, done?, and outQ.]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
becomes
method Action enter (IP ip);
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
endmethod
Similarly a method can be written to extract elements from the outQ
Circular pipeline with Completion Buffer
[Figure: luReq feeds inQ; enter? takes a token from cbuf (getToken) and issues requests to the RAM; the fifo carries the remaining IP bits (remainingIP) and the token; done? sends finished lookups to cbuf (yes) or recirculates them (no); cbuf drives luResp.]
Completion buffer
- gives out tokens to control the entry into the circular pipeline
- ensures that departures take place in order even if lookups complete out-of-order
The fifo holds the token while the memory access is in progress: Tuple2#(Bit#(16), Token)
Circular Pipeline Code with Completion Buffer
rule enter (True);
  Token tok <- cbuf.getToken();
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(tuple2(ip[15:0], tok));
  inQ.deq();
endrule
rule recirculate (True);
  TableEntry p <- ram.resp();
  match {.rip, .tok} = fifo.first();
  if (isLeaf(p)) cbuf.put(tok, p);
  else begin
    fifo.enq(tuple2(rip << 8, tok));
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
[Figure: circular pipeline with inQ, cbuf, enter?, RAM, fifo, and done?]
Completion buffer
interface CBuffer#(type t);
  method ActionValue#(Token) getToken();
  method Action put(Token tok, t d);
  method ActionValue#(t) getResult();
endinterface
typedef Bit#(TLog#(n)) TokenN#(numeric type n);
typedef TokenN#(16) Token;
module mkCBuffer (CBuffer#(t)) provisos (Bits#(t,sz));
  RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull();
  Reg#(Token) i <- mkReg(0);       // input index
  Reg#(Token) o <- mkReg(0);       // output index
  Reg#(Int#(32)) cnt <- mkReg(0);  // number of filled slots
  …
Elements must be representable as bits (the Bits#(t,sz) proviso)
[Figure: buf slots marked I (Invalid) or V (Valid), with indices i and o and the count cnt.]
Completion buffer
// state elements
// buf, i, o, cnt ...
method ActionValue#(Token) getToken() if (cnt < maxToken);
  cnt <= cnt + 1; i <= i + 1;
  buf.upd(i, tagged Invalid);
  return i;
endmethod
method Action put(Token tok, t data);
  buf.upd(tok, tagged Valid data);
endmethod
method ActionValue#(t) getResult() if (cnt > 0 &&& buf.sub(o) matches tagged Valid .x);
  o <= o + 1; cnt <= cnt - 1;
  return x;
endmethod
[Figure: buf slots marked I (Invalid) or V (Valid), with indices i and o and the count cnt.]
Homework: Think about concurrency issues, i.e., can these methods be executed concurrently? Do they need to?
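The in-order behavior of the completion buffer is easy to check in software. The Python sketch below mirrors the three methods but is not a translation of the BSV module: it is sequential (so it says nothing about the concurrency question above), it uses `None` to stand for Invalid (so `None` cannot be stored as a result), and the `can_*` guard helpers are assumptions of the model.

```python
class CompletionBuffer:
    """Hands out tokens in order and releases results in token order."""

    def __init__(self, size):
        self.size = size
        self.buf = [None] * size   # None models an Invalid slot
        self.i = 0                 # next token to hand out
        self.o = 0                 # next token to retire
        self.cnt = 0               # tokens outstanding

    def can_get_token(self):
        return self.cnt < self.size

    def get_token(self):
        assert self.can_get_token()
        tok = self.i
        self.buf[tok] = None                   # mark the slot Invalid
        self.i = (self.i + 1) % self.size
        self.cnt += 1
        return tok

    def put(self, tok, data):
        self.buf[tok] = data                   # slot becomes Valid

    def can_get_result(self):
        return self.cnt > 0 and self.buf[self.o] is not None

    def get_result(self):
        assert self.can_get_result()
        x = self.buf[self.o]
        self.o = (self.o + 1) % self.size
        self.cnt -= 1
        return x

cb = CompletionBuffer(4)
t0, t1, t2 = cb.get_token(), cb.get_token(), cb.get_token()
cb.put(t2, "C")                      # youngest lookup finishes first
stalled = not cb.can_get_result()    # oldest token is still pending
cb.put(t0, "A")
first = cb.get_result()
cb.put(t1, "B")
rest = [cb.get_result(), cb.get_result()]
print(first, rest)
```

Results leave in token order (A, B, C) even though C completed first, which is the property the circular pipeline needs to keep packets in arrival order.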
Implementations of Static pipelines
Two designers, two results
LPM versions                  Best Area (gates)   Best Speed (ns)
Static V (Replicated FSMs)    8898                3.60
Static V (Single FSM)         2271                3.56
[Figure: Replicated: IP addr, a Counter, MUX / De-MUX blocks, four FSMs, and the RAM produce the result; each packet is processed by one FSM. BEST: a shared FSM with the RAM and a MUX takes the IP addr and produces the result.]
Synthesis results
LPM versions   Code size (lines)   Best Area (gates)    Best Speed (ns)    Mem. util. (random workload)
Static V       220                 2271                 3.56               63.5%
Static BSV     179                 2391 (5% larger)     3.32 (7% faster)   63.5%
Linear V       410                 14759                4.7                99.9%
Linear BSV     168                 15910 (8% larger)    4.7 (same)         99.9%
Circular V     364                 8103                 3.62               99.9%
Circular BSV   257                 8170 (1% larger)     3.67 (2% slower)   99.9%
Synthesis: TSMC 0.18 µm lib
V = Verilog; BSV = Bluespec SystemVerilog
- Bluespec results can match carefully coded Verilog
- Micro-architecture has a dramatic impact on performance
- Architecture differences are much more important than language differences in determining QoR