PROJECT REPORT - home.iitk.ac.inhome.iitk.ac.in/~sohum/Cornell_Result.pdf · 1 CORNELL-IIT SUMMER...

CORNELL-IIT SUMMER INTERNSHIP 2014

PROJECT REPORT Advised by Prof. Rajit Manohar and Jon Tse

Sohum

7/12/2014

This work is the result of Cornell-IIT Summer Internship Program 2014 in the Asynchronous VLSI Lab of Cornell University.

Introduction

This work is the result of the Cornell-IIT Summer Internship Program 2014 in the Asynchronous VLSI Lab of Cornell University. The aim of this project is to design a digital neuron that is efficient in terms of area, speed of state update, and energy required per update. The neuron is modeled on the Izhikevich equations, which describe the behavior of a biological neuron.

Several designs were made to compare two aspects of the device: mode of operation (serial vs. parallel) and method of solving (with a multiplier or with a fused multiplier-adder). Finally, pipelined versions of the parallel neurons were designed to achieve better area efficiency.
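For reference, the Izhikevich model updates a membrane potential v and a recovery variable u. The sketch below is a minimal floating-point version of one forward-Euler update, assuming the commonly published form of the model: dv/dt = 0.04v^2 + 5v + 140 - u + I and du/dt = a(bv - u), with the reset v <- c, u <- u + d when v reaches 30 mV. The function names, the 1 ms time step, and the regular-spiking parameter values are illustrative assumptions; the fixed-point arithmetic of the hardware designs in this report is not reproduced here.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    """Advance the state (v, u) by one forward-Euler step of length dt (ms).

    Returns the new (v, u) and a flag telling whether the neuron spiked.
    The default parameters are the commonly quoted regular-spiking values
    (an assumption, not taken from this report).
    """
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
    du = a * (b * v - u)
    v_new = v + dt * dv
    u_new = u + dt * du
    spiked = v_new >= 30.0
    if spiked:
        # After-spike reset of the Izhikevich model.
        v_new = c
        u_new = u_new + d
    return v_new, u_new, spiked


def count_spikes(current, steps):
    """Simulate from a resting-like state and count the spikes produced."""
    v, u = -65.0, -13.0  # u = b * v for the default parameters
    spikes = 0
    for _ in range(steps):
        v, u, spiked = izhikevich_step(v, u, current)
        spikes += spiked
    return spikes
```

Each call to `izhikevich_step` corresponds to one "update" in the sense used throughout this report: a single solution of the difference equation for u and v.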

Benchmark

The paper "Neural Spiking Dynamics in Asynchronous Digital Circuits", which appeared in IJCNN in August 2013, was chosen as the benchmark for electronic neurons (the figures below are from its Table III). It was chosen because it is quite recent and the neuron it proposed was the best such device at the time it was written.

BENCHMARK

Area per neuron (sq. um, 65 nm technology)   29,500
Energy per update (nJ)                       0.7
Time required per update (ns)                54.0

The operating voltage is 1.2 volts. The Synopsys 90 nm educational library was used in the synthesis. The fan-out capacitance per pin is taken to be 4 fF.

Designs and their description

The following table lists the names of the designs and their descriptions. Henceforth, these names are used in all figures and tables to refer to the corresponding designs.

Name of Design   Description

pmultadd         A parallel neuron with a carry-lookahead adder and a multiplier.
                 The multiplier tree and the adder are entirely combinational.

pmultacc         A parallel neuron with a carry-lookahead adder and a fixed-point
                 fused multiplier-adder. The multiplier-adder tree and the adder
                 are entirely combinational.

pipeaddneuron    A parallel many-in-one neuron with a pipelined multiplier and a
                 combinational carry-lookahead adder. This device can emulate 15
                 separate neurons with the same configuration (the same parameters
                 e, f, g, t, a, b, c and d).

pipeaccneuron    A parallel many-in-one neuron with a pipelined fixed-point fused
                 multiplier-adder and a combinational carry-lookahead adder. This
                 device can emulate 16 separate neurons with the same configuration
                 (the same parameters e, f, g, t, a, b, c and d).

psaddneuron      A neuron with a serial adder and a serial multiplier. The
                 multiplier takes one operand in parallel and the other in serial.
                 The adder and multiplier are wrapped with a converter that
                 converts the parallel inputs to serial and the serial result back
                 to parallel.

psaccneuron      A neuron with a serial adder and a serial fixed-point fused
                 multiplier-adder. The multiplier-adder takes one operand in
                 parallel and the other two in serial. The adder and
                 multiplier-adder are wrapped with a converter that converts the
                 parallel inputs to serial and the serial result back to parallel.

Table of Results

This table presents the essential characteristics of all the designs.

Name of the Design   Area per Neuron (sq. um)   Time per Update (ns)   Energy per Update (nJ)
pmultadd             24,000 (12,500)            17.9                   0.1
pmultacc             26,800 (14,000)            11.9                   0.1
psaddneuron          16,600 (8,660)             152.5                  1.7
psaccneuron          18,600 (9,700)             87.0                   0.7
pipeaddneuron        5,550 (2,890)              92.4                   4.1
pipeaccneuron        6,270 (3,270)              55.0                   3.1
BENCHMARK            29,500                     54.0                   0.7

(Multiplying factor for the 65 nm estimate = (65/90)^2 = 0.5216 approx.)

1. The numbers in the 'Area per Neuron' column are for 90 nm technology; the numbers in parentheses are the same areas scaled to 65 nm technology (i.e. multiplied by (65/90)^2), the technology of the benchmark.

2. The 'Time per Update' is NOT equal to the clock period, or to the time it takes to complete one addition or multiplication. It is the time required to update the neuron's state, i.e. to change the values of the state variables u and v by solving the Izhikevich difference equation once.

3. The 'Energy per Update' values were obtained by multiplying the time required per update by the average power. The average power is heavily influenced by the input test case, as it dictates the switching activity in the device. Hence, to obtain a reasonable estimate of the maximum power consumed, an input test case must be framed such that the switching is maximum. The figures in this table are based on a default test case chosen by the compiler and are therefore not a good estimate of the device's power characteristics.
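The 65 nm figures in parentheses follow from scaling the area with the square of the feature size. A minimal sketch of that conversion, using the 90 nm areas from the table of results; the function and variable names are illustrative, and the scaling is only the first-order estimate described in note 1.

```python
def scale_area(area_um2, from_nm=90.0, to_nm=65.0):
    """Estimate the area in a target node by scaling with (to_nm / from_nm)^2."""
    return area_um2 * (to_nm / from_nm) ** 2


# Area per neuron in 90 nm technology (sq. um), from the table of results.
area_90nm = {
    "pmultadd": 24000,
    "pmultacc": 26800,
    "psaddneuron": 16600,
    "psaccneuron": 18600,
    "pipeaddneuron": 5550,
    "pipeaccneuron": 6270,
}

# First-order 65 nm estimates for every design.
area_65nm = {name: scale_area(area) for name, area in area_90nm.items()}
```

The results agree with the parenthesized values in the table to within rounding.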

This table presents a detailed set of features of the designs.

Name of Design            pmultadd          pmultacc          psaddneuron       psaccneuron       pipeaddneuron     pipeaccneuron
Total Area (sq. um)       24,000 (12,500)   26,800 (14,000)   16,600 (8,660)    18,600 (9,700)    83,200 (43,400)   100,000 (52,200)
Area per Neuron (sq. um)  24,000 (12,500)   26,800 (14,000)   16,600 (8,660)    18,600 (9,700)    5,550 (2,890)     6,270 (3,270)
Number of Neurons         1                 1                 1                 1                 15                16
Power (uW)                6.80e+03          9.29e+03          1.14e+04          8.24e+03          4.39e+04          5.66e+04
Minimum Clk Period (ns)   2.57              2.97              0.66              0.64              0.88              0.86
Clk Multiplier            7                 4                 33 x 7 = 231      34 x 4 = 136      15 x 7 = 105      16 x 4 = 64

'Clk Multiplier' is the number of clock cycles required to update the neuron's

state. 'Minimum Clk Period' is the smallest clock period possible without creating

a negative slack in the datapath.
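The 'Time per Update' and 'Energy per Update' figures can be recovered from the clock multiplier, the minimum clock period, and the average power. The sketch below recomputes them from the values in the detailed table above; the names are illustrative, and the energy inherits the caveat about the default power test case.

```python
# (clock multiplier, minimum clock period in ns, average power in uW),
# taken from the detailed table of features.
designs = {
    "pmultadd":      (7,   2.57, 6.80e3),
    "pmultacc":      (4,   2.97, 9.29e3),
    "psaddneuron":   (231, 0.66, 1.14e4),
    "psaccneuron":   (136, 0.64, 8.24e3),
    "pipeaddneuron": (105, 0.88, 4.39e4),
    "pipeaccneuron": (64,  0.86, 5.66e4),
}


def time_per_update_ns(clk_multiplier, clk_period_ns):
    """Clock cycles per state update times the clock period."""
    return clk_multiplier * clk_period_ns


def energy_per_update_nj(time_ns, power_uw):
    """Energy = time x average power; ns * uW = 1e-15 J, so scale by 1e-6 to get nJ."""
    return time_ns * power_uw * 1e-6


results = {}
for name, (multiplier, period_ns, power_uw) in designs.items():
    t = time_per_update_ns(multiplier, period_ns)
    results[name] = (t, energy_per_update_nj(t, power_uw))
```

For example, psaddneuron needs 231 cycles of 0.66 ns, i.e. about 152.5 ns per update, which at 11.4 mW is about 1.7 nJ.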

Serial versus Parallel operation

An important result is that the serial neurons are much worse than their parallel counterparts in terms of time per update and energy per update, while being only about 30% smaller:

Comparison of serial vs. parallel neurons with a multiplier and adder:

Name of the Design   Area per Neuron (sq. um)   Time per Update (ns)   Energy per Update (nJ)
pmultadd             24,000 (12,500)            17.9                   0.1
psaddneuron          16,600 (8,660)             152.5                  1.7

Comparison of serial vs. parallel neurons with a multiplier-adder and adder:

Name of the Design   Area per Neuron (sq. um)   Time per Update (ns)   Energy per Update (nJ)
pmultacc             26,800 (14,000)            11.9                   0.1
psaccneuron          18,600 (9,700)             87.0                   0.7

The area advantage of the serial neuron over the parallel neuron begins at a word length of about 10 bits. The following table lists the area of both neurons for several word lengths:

Area for different Word Lengths in 90 nm Technology (sq. um)

Word Length (bits)   2      3      4      5      8      10      12      14
pmultadd             733    1470   2310   3320   6380   10400   13400   17200
psaddneuron          4080   4820   5620   6580   8370   10500   12300   13600
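The crossover point can be read from the table programmatically. The sketch below finds the first tabulated word length at which the serial neuron is smaller and linearly interpolates where the two areas are equal; the interpolation between tabulated points is an assumption made for illustration, not a measured result.

```python
word_lengths = [2, 3, 4, 5, 8, 10, 12, 14]
pmultadd_area = [733, 1470, 2310, 3320, 6380, 10400, 13400, 17200]
psaddneuron_area = [4080, 4820, 5620, 6580, 8370, 10500, 12300, 13600]


def first_serial_win(lengths, parallel, serial):
    """Smallest tabulated word length at which the serial design is smaller."""
    for n, p, s in zip(lengths, parallel, serial):
        if s < p:
            return n
    return None


def interpolated_crossover(lengths, parallel, serial):
    """Word length at which both areas are equal, by linear interpolation."""
    diffs = [p - s for p, s in zip(parallel, serial)]
    for i in range(1, len(diffs)):
        if diffs[i - 1] < 0 <= diffs[i]:
            span = lengths[i] - lengths[i - 1]
            return lengths[i - 1] + span * (-diffs[i - 1]) / (diffs[i] - diffs[i - 1])
    return None


first_win = first_serial_win(word_lengths, pmultadd_area, psaddneuron_area)
crossover = interpolated_crossover(word_lengths, pmultadd_area, psaddneuron_area)
```

With the tabulated areas, the two designs are nearly equal at 10 bits and the serial neuron is smaller from 12 bits onward.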

In terms of delay, the time per update of a parallel neuron is always better than that of a serial neuron:

[Graph: Time per Update versus word length for the serial and parallel neurons]

The slower operation of the serial neuron compared with the parallel neuron is caused by the large clock-to-Q propagation time of the flip-flops, as can be seen from this snippet of the timing analysis report:

Point                              Fanout   Cap      Trans    Incr       Path     Attributes
--------------------------------------------------------------------------------------------
clock ideal_clock1 (rise edge)                                0.0000     0.0000
clock network delay (propagated)                              0.2883     0.2883
addr/sum_reg_0_/CLK (DFFX2)                          0.1102   0.0000     0.2883 r
addr/sum_reg_0_/QN (DFFX2)                           0.0552   0.1560     0.4443 r
addr/n133 (net)                    1        8.4611            0.0000     0.4443 r
U484/IN2 (NOR2X4)                                    0.0351   0.0001 &   0.4714 f
U484/QN (NOR2X4)                                     0.0345   0.0221     0.4935 r
n181 (net)                         1        3.8303            0.0000     0.4935 r
U861/IN3 (NAND4X0)                                   0.0358   0.0007 &   0.4942 r
U861/QN (NAND4X0)                                    0.0774   0.0496     0.5438 f

Note the cell names (DFFX2, NOR2X4, NAND4X0) and the incremental delay each one introduces in the path. Common gates such as NAND and NOR have a delay of about 0.02 ns, while the flip-flops have a clock-to-Q delay of 0.15-0.2 ns (about 8 times that of a NAND, XOR, etc.). A full adder has 2 XOR gates in cascade, and in the serial adder and multiplier(-adder) there is only one full adder in each pipeline stage. In units of one gate delay, the minimum serial clock period is therefore the clock-to-Q delay of the DFF plus the delay of the full adder: 8 (for the DFF) + 2 = 10. For the parallel multiplier, the minimum clock period is 8 (for the DFF) + 2 * 16 (there are 16 stages of multiplier tree reduction) = 40. Hence, for 16-bit neurons the minimum serial clock period is only one-fourth that of the parallel one. Consider the general case of N-bit words:

No. of stages in the multiplier tree reduction: N
No. of clk cycles required to evaluate using a serial multiplier: 2N + 1
For Serial Multiplier, time per operation = (8 + 2) * (2N + 1)
For Parallel Multiplier, time per operation = 8 + 2 * N

This calculation predicts that the parallel multiplier is faster for every N (20N + 10 versus 2N + 8 gate delays), by a factor that approaches 10 for large N. This agrees with the graph of Time per Update versus word length, in which parallel is always better than serial.
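The two expressions above are easy to tabulate. The sketch below evaluates them in units of one gate delay, using the constants from the text (8 for the flip-flop, 2 for a full adder); the function names are illustrative, and the model ignores setup time, wiring, and everything else outside these two terms.

```python
DFF_DELAY = 8  # clock-to-Q delay of a flip-flop, in gate delays (from the text)
FA_DELAY = 2   # full adder: two cascaded XOR gates


def serial_time(n_bits):
    """Time per multiplication for the serial multiplier, in gate delays."""
    return (DFF_DELAY + FA_DELAY) * (2 * n_bits + 1)


def parallel_time(n_bits):
    """Time per multiplication for the parallel multiplier, in gate delays."""
    return DFF_DELAY + FA_DELAY * n_bits


def serial_to_parallel_ratio(n_bits):
    """How many times slower the serial multiplier is under this model."""
    return serial_time(n_bits) / parallel_time(n_bits)
```

For the 16-bit neurons, `serial_time(16)` is 330 gate delays against 40 for `parallel_time(16)`, a ratio of 8.25 under this model.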

Multiplier versus Multiplier-Adder

A neuron with a fused multiplier-adder performs much better in terms of time per update than its counterpart with just a multiplier, although it is slightly larger.

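Comparison of parallel neurons:

Name of the Design   Area per Neuron (sq. um)   Time per Update (ns)   Energy per Update (nJ)
pmultadd             24,000 (12,500)            17.9                   0.1
pmultacc             26,800 (14,000)            11.9                   0.1

Comparison of serial neurons:

Name of the Design   Area per Neuron (sq. um)   Time per Update (ns)   Energy per Update (nJ)
psaddneuron          16,600 (8,660)             152.5                  1.7
psaccneuron          18,600 (9,700)             87.0                   0.7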

Solving the difference equation takes 4 stages in a neuron with a fused multiplier-adder and an adder, whereas a neuron with a multiplier and an adder requires 7 stages. Replacing the multiplier with a fused multiplier-adder does not increase the minimum clock period much, because the extra stage of addition is shielded by a pipeline stage (in a serial or pipelined neuron) or at most introduces an extra layer of full adders (in parallel neurons). Indeed, the minimum clock period of pmultacc is 2.97 ns versus 2.57 ns for pmultadd. Hence, using a fused multiplier-adder improves speed by about 1.5 times (parallel) to 1.75 times (serial), close to the ideal ratio of 7/4.
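The speed gain of the fused design can be checked from the cycle counts and minimum clock periods reported earlier. A minimal sketch, with illustrative names; the cycle counts are the 'Clk Multiplier' values and the periods are the 'Minimum Clk Period' values of each design pair.

```python
def speedup(cycles_mult, period_mult_ns, cycles_fused, period_fused_ns):
    """Update time of the multiplier design divided by that of the fused design."""
    return (cycles_mult * period_mult_ns) / (cycles_fused * period_fused_ns)


# Parallel pair: pmultadd (7 cycles at 2.57 ns) vs. pmultacc (4 cycles at 2.97 ns).
parallel_speedup = speedup(7, 2.57, 4, 2.97)

# Serial pair: psaddneuron (231 cycles at 0.66 ns) vs. psaccneuron (136 cycles at 0.64 ns).
serial_speedup = speedup(231, 0.66, 136, 0.64)

# Pipelined pair: pipeaddneuron (105 cycles at 0.88 ns) vs. pipeaccneuron (64 cycles at 0.86 ns).
pipelined_speedup = speedup(105, 0.88, 64, 0.86)
```

The parallel pair gains less than the ideal 7/4 because the clock period of pmultacc is longer than that of pmultadd.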

Breakdown of Area

For simplicity, the area is broken down into Adder, Multiplier (or fused multiplier-adder instead) and Clock Tree (including D flip-flops, memory and the entire clock-tree interconnect).

IC Compiler creates power and area reports only for modules that were defined by the user. A register is a fundamental data type, not a module, and all the clock-tree apparatus is instantiated by the compiler. Since the user does not define a register or a clock-gating network as a module, neither appears in the area or power report. Hence, the only way to measure their contribution is to subtract the contribution of the adder and multiplier from the total.

In the pipelined neurons, the area is dominated by the clocking apparatus. This is expected because of the many RAMs that store the state variables of the emulated neurons. Note that the multiplier also occupies about half the area because of the registers in every stage of its pipeline.

The number of registers is much smaller in the parallel neurons, so most of their area is occupied by the (fused) multiplier(-adder) tree.

In the serial neurons the area is distributed more evenly.

Breakdown of Power Consumption

These power figures may be less reliable than the other estimates in this document because the input test cases were left at the default chosen by the synthesis compiler. Hence, the power reported here for the combinational parts may not be their maximum.

In the pipelined neurons, the power breakdown follows a pattern similar to that of area. The largest consumer is the clock tree, followed by the multiplier (the adder's contribution is about 0.1%, so small that it is not represented in the pie chart below).

In the parallel neurons, the clock tree consumes the most power despite being much smaller than the multiplier. This is because the multiplier tree switches only as the input test case dictates, whereas the clock tree switches every cycle and hence consumes its maximum possible power. However, it is probable that the default test case chosen by the compiler leads to minimum power consumption by the multiplier tree, which reduces the multiplier's contribution in this chart.

For the serial neurons the power consumption is more evenly distributed, in agreement with their area distribution.

SUMMARY

1. The following three designs have the best characteristics compared to the benchmark (power is not considered because of its associated ambiguity):

   pipeaccneuron: about one-ninth the size at about the same speed.

   pmultacc: about half the size at about 4.5 times the speed.

   pmultadd: less than half the size at about 3 times the speed.

2. The parallel neuron is always faster and more energy-efficient than its serial counterpart. Switching to serial at word lengths beyond 10 bits leads to a modest reduction in area but a drastic reduction in speed and energy efficiency.

3. A neuron with a fused multiplier-adder and an adder is always faster than one with a multiplier and an adder: a speed improvement of about 1.5 to 1.75 times is achieved for an area increase of about 12%.
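The comparison with the benchmark in item 1 can be reproduced from the table of results. The sketch below uses the 65 nm area estimates and the times per update of the three designs; the names are illustrative.

```python
BENCHMARK = {"area_um2": 29500, "time_ns": 54.0}

# (area per neuron scaled to 65 nm in sq. um, time per update in ns),
# from the table of results.
best_designs = {
    "pipeaccneuron": (3270, 55.0),
    "pmultacc": (14000, 11.9),
    "pmultadd": (12500, 17.9),
}


def compare(area_um2, time_ns):
    """Return (area as a fraction of the benchmark, speed relative to the benchmark)."""
    return area_um2 / BENCHMARK["area_um2"], BENCHMARK["time_ns"] / time_ns


ratios = {name: compare(*values) for name, values in best_designs.items()}
```

With these values, pipeaccneuron comes out at roughly 0.11 of the benchmark area and 0.98 of its speed, pmultacc at roughly 0.47 of the area and 4.5 times the speed, and pmultadd at roughly 0.42 of the area and 3.0 times the speed.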
