Post on 21-Dec-2015
ECE 565: High-Level Synthesis, An Introduction
Shantanu Dutt
ECE Dept., UIC
HLS Flow
• Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs) via muxes, demuxes, tristate buffers, buses, dedicated interconnects)
Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)
HLS Flow (contd)
Allocation: Simple counting of FUs after the above 2 stages
(Binding)
Simple HLS Examples
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) w/ X delay of 2 cc’s and + delay of 1 cc
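[Figure: datapath. Registers a, b, c, d (load signals lda, ldb, ldc, ldd) feed one multiplier (X) and one adder (+); mux1 and mux2 (inputs I0, I1) select the adder operands; a demux routes the adder output to register y (ldy) or z (ldz); the multiplier output goes to register x (ldx)]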
Controller FSM (entered from Reset; state labels in the figure: cc 3i, cc 3(i+1), cc 3(i+2)):
• [x ← a × b] (c1): ldx=1
• [y ← c + d] (c2): mux1=0, mux2=0, demux=0, ldy=1
• [z ← x + y] (c3): lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
lda = 1 ⇒ reg. “a” is loaded
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
i) Non-overlapped pipelined scheduling
cc’s: 1 2 3 4 5 6
X: c1(1), c1(2)
+: c2(1), c3(1), c2(2), c3(2)
(ck(j) denotes operation ck of iteration j)
Note: Unspecified control signals have either an inactive value or, if such a concept doesn’t exist for the control signal, the don’t-care value
(a) Scheduling
(b) Arch. Synthesis
(c) Controller FSM Synthesis
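The non-overlapped controller can be sketched as a small register-transfer simulation. This is only an illustration under assumptions that the transcript does not confirm: the three states are visited in the order c2, c1, c3 (one order that is consistent with a 2-cc multiplier), mux1 and mux2 share a single select value, `ld_in` stands for lda = ldb = ldc = ldd = 1, and the names `STATES`, `step` and `run` are hypothetical.

```python
# Control words for the three controller states; unspecified signals are
# treated as inactive. The state order is an assumption, not from the slides.
STATES = [
    {"mux": 0, "demux": 0, "ldy": 1},              # c2: y <- c + d
    {"ldx": 1},                                    # c1: x <- a * b (2nd multiplier cc)
    {"mux": 1, "demux": 1, "ldz": 1, "ld_in": 1},  # c3: z <- x + y, load next inputs
]


def step(regs, ctl, next_inputs):
    """Return the register values after the clock edge that ends this cc."""
    nxt = dict(regs)
    # mux select 0 -> adder operands (c, d); select 1 -> adder operands (x, y).
    if ctl.get("mux"):
        left, right = regs["x"], regs["y"]
    else:
        left, right = regs["c"], regs["d"]
    adder_out = left + right
    # demux routes the adder output to y (0) or z (1); the load signal gates it.
    if ctl.get("ldy") and ctl.get("demux") == 0:
        nxt["y"] = adder_out
    if ctl.get("ldz") and ctl.get("demux") == 1:
        nxt["z"] = adder_out
    if ctl.get("ldx"):
        nxt["x"] = regs["a"] * regs["b"]
    if ctl.get("ld_in") and next_inputs is not None:
        nxt.update(zip("abcd", next_inputs))
    return nxt


def run(samples):
    """Simulate one 3-cc iteration per (a, b, c, d) sample.

    Returns the list of z values and the total number of cc's used.
    """
    regs = dict(zip("abcd", samples[0]), x=0, y=0, z=0)  # Reset loads first inputs
    outputs, cc = [], 0
    for i in range(len(samples)):
        nxt_in = samples[i + 1] if i + 1 < len(samples) else None
        for ctl in STATES:
            regs = step(regs, ctl, nxt_in)
            cc += 1
        outputs.append(regs["z"])
    return outputs, cc
```

Because every register update is computed from the values held before the edge, the sketch follows the rule that a register is loaded at the clock edge after the cc in which its load signal is asserted.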
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont’d)
ii) Overlapped pipelined scheduling
cc’s: 1 2 3 4 5 6
X: c1(1), c1(2)
+: c2(1), c3(1), c2(2), c3(2)
(ck(j) denotes operation ck of iteration j)
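[Figure: datapath. Registers a, b, c, d (load signals lda, ldb, ldc, ldd), one multiplier (X) and one adder (+), mux1 and mux2 (inputs I0, I1) at the adder inputs, a demux at the adder output, and registers x, y, z (load signals ldx, ldy, ldz)]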
Controller FSM (entered from Reset; state labels in the figure: cc 3i, cc 3(i+1)):
• [y ← c + d, x ← a × b] (c1, c2): lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
• [z ← x + y] (c3): ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
• For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped schedule
• Overlapped sched.: time for n iterations = 2n+1 cc’s; throughput = n/(2n+1) ~ 0.5 outputs/cc
• Non-overlapped sched.: time for n iterations = 3n cc’s; throughput = n/(3n) ~ 0.33 outputs/cc
• For large n, the overlapped schedule gives ~50% higher throughput (0.5 vs. 0.33 outputs/cc), i.e., ~33% fewer cc’s per output
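The cycle counts and throughputs quoted above can be checked with a few lines of Python. This is a minimal sketch that only encodes the two formulas from the slide (3n and 2n+1 cc’s); the function names are hypothetical.

```python
def cycles_non_overlapped(n):
    """cc's needed for n iterations when each iteration takes 3 cc's."""
    return 3 * n


def cycles_overlapped(n):
    """cc's needed for n iterations with the overlapped schedule (2n + 1)."""
    return 2 * n + 1


def throughput(n, cycles):
    """Outputs per cc for n iterations under the given cycle-count formula."""
    return n / cycles(n)


for n in (1, 4, 100):
    t_non = throughput(n, cycles_non_overlapped)
    t_ovl = throughput(n, cycles_overlapped)
    # The ratio t_ovl / t_non equals 3n / (2n + 1), which tends to 1.5.
    print(n, cycles_non_overlapped(n), cycles_overlapped(n), t_ovl / t_non)
```

The ratio of the two throughputs is 3n/(2n+1), so the gain grows with n and approaches a factor of 1.5.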
(a) Scheduling
(b) Arch. Synthesis
(c) Controller FSM Synthesis
Simple HLS Examples (contd)
• Some DFG control operation nodes:
Distributor: routes its input (in) to out1 if Condition is T, or to out2 if Condition is F
Selector: passes in1 to its output (out) if Condition is T, or in2 if Condition is F
• Conditional code: if (a > b) then c ← a-b; else c ← b-a;
• Possible DFGs corresponding to the above conditional code:
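Since the DFG figures are not reproduced in this transcript, the following is a hedged sketch of two plausible DFG forms for the conditional code, written as plain Python functions. The function names and the use of `None` for an inactive output are assumptions made for illustration only.

```python
def distributor(cond, value):
    """Route value to the T output if cond holds, else to the F output.

    Returns (out_T, out_F); the inactive output is modeled as None.
    """
    return (value, None) if cond else (None, value)


def selector(cond, in_t, in_f):
    """Pass in_t to the output if cond holds, else in_f."""
    return in_t if cond else in_f


def cond_dfg_select_only(a, b):
    """Form 1: compute both a-b and b-a, then select one result."""
    cond = a > b
    return selector(cond, a - b, b - a)


def cond_dfg_distributed(a, b):
    """Form 2: distribute the operands so only the needed subtraction runs."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    c_t = a_t - b_t if cond else None      # active only on the T branch
    c_f = b_f - a_f if not cond else None  # active only on the F branch
    return selector(cond, c_t, c_f)
```

Form 1 needs two subtractions per evaluation but no distributors; Form 2 performs a single subtraction per evaluation, at the cost of extra control nodes.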
Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
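[Figure: DFG of the loop. Nodes: dist, >, sel, -; inputs a, b; T/F control edges, with one control value initialized to F; output: final a]
(a) Scheduling (using only 1 adder/sub)
Scheduling & binding: c1 and c2 are both bound to the single +
(b) Arch. Synthesis
[Figure: datapath. Registers a, b, r1 (load signals lda, ldb, ldr1) and an output register (ldfina); a mux and a demux around the single adder; cin = 1 and b’ at the adder input give a + b’ + 1 = a - b, since b’ + 1 is the 2’s complement of b, i.e., -b; s xor ovfl (1 = -ve, 0 = +ve) goes to the FSM]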
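The idea of reusing a single adder for both the subtraction and the comparison can be sketched as follows. This is a minimal model under stated assumptions: an 8-bit word by default, a > b decided by computing b - a and testing s xor ovfl, and b > 0 so that the loop terminates. The names `sub_flags` and `reduce_a` are hypothetical and not taken from the slides.

```python
def sub_flags(a, b, width=8):
    """Compute a - b as a + b' + 1 on width-bit two's-complement words.

    Returns (diff, negative), where diff is the raw width-bit result and
    negative is (sign bit) xor (overflow), i.e. the true sign of a - b.
    """
    mask = (1 << width) - 1
    a &= mask
    b_inv = ~b & mask                      # b' (bitwise complement of b)
    diff = (a + b_inv + 1) & mask          # cin = 1 supplies the "+1"
    sign_a = a >> (width - 1)
    sign_b_inv = b_inv >> (width - 1)
    sign_s = diff >> (width - 1)
    # Overflow: both adder operands have the same sign, the sum a different one.
    ovfl = int(sign_a == sign_b_inv and sign_s != sign_a)
    return diff, bool(sign_s ^ ovfl)


def reduce_a(a, b, width=8):
    """while (a > b) a <- a - b, using only sub_flags (one adder/sub)."""
    while True:
        _, a_greater = sub_flags(b, a, width)  # b - a < 0  <=>  a > b
        if not a_greater:
            return a
        a, _ = sub_flags(a, b, width)
```

For example, `reduce_a(17, 5)` steps through 12 and 7 and stops at 2, at which point a > b no longer holds.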
Delay Nodes in DFGs
A delay node is generally implemented as a register; a delay node thus becomes a state variable.
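As a small illustration of a delay node acting as a state variable, the sketch below models the node as a register that is updated once per clock. The DFG used, y(n) = x(n) + y(n-1), is a hypothetical example and does not come from the slides; the class and function names are likewise assumptions.

```python
class DelayNode:
    """A delay node modeled as a register that holds the previous value."""

    def __init__(self, initial=0):
        self.state = initial  # the state variable stored in the register

    def clock(self, new_value):
        """Load new_value at the clock edge and return the old contents."""
        old = self.state
        self.state = new_value
        return old


def running_sum(samples):
    """Hypothetical DFG y(n) = x(n) + y(n-1), with one delay node on y."""
    delay = DelayNode(0)
    outputs = []
    for x in samples:
        y = x + delay.state  # the adder reads the register output, y(n-1)
        delay.clock(y)       # y(n) becomes the stored state for the next step
        outputs.append(y)
    return outputs
```

Calling `running_sum([1, 2, 3])` yields the partial sums of the input, with the register carrying the state from one sample to the next.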
Delay Nodes in DFGs (contd)
[Figure panels: Transformation in the DFG; Mapping to the architecture (the delay node becomes a register)]
Detailed HLS Example
Detailed HLS Example (contd)
The synthesized architecture
Note: It is not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed)
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); goal: min. latency
Different paths (i/p → o/p) in the DFG
(b) Reg. alloc. for o/p of operations
(c) Arch. synthesis
For WAR constraint
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps going to the same children), which are available or will be available at u’s earliest finish, will have the largest lifetime at that point.
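The main part of this heuristic (priority = longest delay to the o/p) can be sketched as a simple list scheduler. The sketch makes several assumptions: there is exactly one FU of each type, the DFG is acyclic, and the sibling-lifetime tie-break is not implemented (ties fall back to the order in `ops`). The function name, its parameters and the small example DFG are hypothetical.

```python
def list_schedule(ops, deps, fu_of, delay):
    """Schedule ops on one FU per type, highest delay-to-output first.

    ops:   list of operation names
    deps:  op -> list of predecessor ops
    fu_of: op -> FU type
    delay: FU type -> delay in cc's
    Returns (start, finish) dicts of 1-based cc numbers.
    """
    succs = {u: [v for v in ops if u in deps.get(v, [])] for u in ops}
    memo = {}

    def to_output(u):
        # Longest delay from the start of u to a primary output.
        if u not in memo:
            longest = max((to_output(v) for v in succs[u]), default=0)
            memo[u] = delay[fu_of[u]] + longest
        return memo[u]

    start, finish = {}, {}
    fu_free = {fu: 1 for fu in delay}  # first cc at which each FU is free
    cc = 1
    while len(start) < len(ops):
        ready = [u for u in ops if u not in start
                 and all(p in finish and finish[p] < cc
                         for p in deps.get(u, []))]
        for u in sorted(ready, key=to_output, reverse=True):
            fu = fu_of[u]
            if fu_free[fu] <= cc:
                start[u] = cc
                finish[u] = cc + delay[fu] - 1
                fu_free[fu] = finish[u] + 1
        cc += 1
    return start, finish


# Example: x = a * b (c1), y = c + d (c2), z = x + y (c3).
ops = ["c1", "c2", "c3"]
deps = {"c3": ["c1", "c2"]}
fu_of = {"c1": "mul", "c2": "add", "c3": "add"}
delay = {"mul": 2, "add": 1}
start, finish = list_schedule(ops, deps, fu_of, delay)
```

On this small DFG the multiplication gets the highest priority because it lies on the longest path to the o/p, and the overall latency is 3 cc’s.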
Detailed HLS Example (contd)
Detailed HLS Example—Register Allocation
3 non-primary i/p regs. needed
Detailed HLS Example—Register Allocation (contd)
• In the conflict graph (one per FU), there is an edge between 2 var. nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes)
• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)
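A minimal sketch of left-edge register allocation on lifetime intervals is given below. It assumes half-open lifetimes [start, end) measured in cc’s, so that a variable whose lifetime ends at cc t can share a register with one that starts at t. The variable names and lifetimes are made up for illustration. This simple version sorts the intervals and makes one pass per register, so it is not the linear-time variant.

```python
def left_edge(lifetimes):
    """Assign variables to registers with the left-edge algorithm.

    lifetimes: dict var -> (start, end), interpreted as half-open [start, end).
    Returns a list of registers, each a list of the variables it holds.
    """
    # Sort once by left edge (start of lifetime).
    remaining = sorted(lifetimes.items(), key=lambda item: item[1][0])
    registers = []
    while remaining:
        current, last_end, deferred = [], None, []
        for var, (begin, end) in remaining:
            if last_end is None or begin >= last_end:
                current.append(var)  # fits after the previous occupant
                last_end = end
            else:
                deferred.append((var, (begin, end)))  # try the next register
        registers.append(current)
        remaining = deferred
    return registers


# Hypothetical lifetimes; at most 2 variables are live at any time.
example = {"A": (1, 3), "B": (2, 5), "C": (3, 6), "D": (5, 7)}
allocation = left_edge(example)
```

For interval graphs the number of registers produced this way equals the maximum number of simultaneously live variables, which is why the greedy pass is optimal here.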
Scheduling heuristic: Among the available opers, schedule on the avail. FUs those whose delay to the o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps going to the same children), which are avail. or will be avail. at u’s earliest finish, will have the largest lifetime at that point.
Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily. B’s lifetime increases, but D’s (dep. of B) decreases similarly, so the heuristic should be based on more global information