Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Sayeh Sharify, Alberto Delmas Lascorz, Patrick Judd, Andreas Moshovos
Department of Electrical and Computer Engineering, University of Toronto

Email: {sayeh, delmasl1, judd, moshovos}@ece.utoronto.ca

Abstract—Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs) is presented. In LM every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's execution time scales inversely proportionally with the precisions of both weights and activations. For fully-connected layers LM's performance scales inversely proportionally with the precision of the weights. LM targets area-constrained System-on-a-Chip designs such as those found on mobile devices that cannot afford the multi-megabyte buffers that would be needed to store each layer on-chip during processing. Experiments on image classification CNNs show that on average across all networks studied, and assuming that weights are supplied via a High Bandwidth Memory v2 (HBM2) interface, a configuration of LM outperforms a state-of-the-art bit-parallel accelerator [1] by 2.34× without any loss in accuracy while being 2.23× more energy efficient. Moreover, LM can trade off accuracy for additional improvements in execution performance and energy efficiency.

I. INTRODUCTION

Deep neural networks (DNNs) have become the state-of-the-art technique in many recognition tasks such as object [2] and speech recognition [3]. The high computational bandwidth demands and energy consumption of DNNs motivated several special purpose architectures such as the state-of-the-art DaDianNao (DaDN) data-parallel accelerator [1]. To maximize performance DaDN, as proposed, uses 36MB of on-chip eDRAM to hold all input (weights and activations) and output data (activations) per layer. Unfortunately, such large on-chip buffers are beyond the reach of embedded and mobile system-on-chip (SoC) devices.

This work presents Loom (LM), a hardware accelerator for inference with Convolutional Neural Networks (CNNs) targeting embedded systems where the bulk of the data processed cannot be held on chip and has to be fetched from off-chip memories. LM exploits the precision requirement variability of modern CNNs to reduce the off-chip network footprint, increase bandwidth utilization, and deliver performance which scales inversely proportionally with precision for both convolutional (CVLs) and fully-connected (FCLs) layers. Ideally, compared to a conventional DaDN-like data-parallel accelerator that uses a fixed precision of 16 bits, LM achieves a speedup of 256/(Pa × Pw) for CVLs and 16/Pw for FCLs, where Pw and Pa are the precisions of weights and activations respectively.

LM processes both activations and weights bit-serially while compensating for the loss in computation bandwidth by exploiting parallelism. Judicious reuse of activations or weights enables LM to improve performance and energy efficiency over conventional bit-parallel designs without requiring a wider memory interface.

We evaluate LM on an SoC with a High Bandwidth Memory v2 (HBM2) interface, comparing against a DaDN-like accelerator (BASE). Both accelerators are configured so that they can utilize the full bandwidth of HBM2. On a set of image classification CNNs, on average LM yields a speedup of 2.37×, 1.74×, and 2.34× over BASE for the convolutional, fully-connected, and all layers respectively. The energy efficiency of LM over BASE is 2.26×, 1.67× and 2.23× for the aforementioned layers respectively. LM enables trading off accuracy for additional improvements in performance and energy efficiency. For example, accepting a 1% relative loss in accuracy, LM yields 2.50× higher performance and 2.39× more energy efficiency than BASE.

The rest of this document is organized as follows: Section II illustrates the key concepts behind LM via an example. Section III reviews the BASE architecture and presents an equivalent Loom configuration. The evaluation methodology and experimental results are presented in Section IV. Section V reviews related work, and Section VI concludes.

II. Loom: A SIMPLIFIED EXAMPLE

This section explains how LM would process CVLs and FCLs assuming 2-bit activations and weights.

Conventional Bit-Parallel Processing: Figure 1a shows a bit-parallel processing engine which multiplies two input activations with two weights, generating a single 2-bit output activation per cycle. The engine can process two new 2-bit weights and/or activations per cycle, for a throughput of two 2b × 2b products per cycle.

Loom's Approach: Figure 1b shows an equivalent LM engine comprising four subunits organized in a 2 × 2 array. Each subunit accepts 2 bits of input activations and 2 bits of weights per cycle. The subunits along the same column share the activation inputs while the subunits along the same row share their weight inputs. In total, this engine accepts 4 activation and 4 weight bits, equaling the input bandwidth of the bit-parallel engine. Each subunit has two 1-bit Weight Registers (WRs), one 2-bit Output Register (OR), and can perform two 1b × 1b products which it can accumulate into its OR.

Fig. 1. Processing an example Fully-Connected Layer using LM's Approach. (a) Bit-Parallel Engine processing a 2b × 2b layer over two cycles. (b) Cycle 1: Load LSB of weights from filters 0 and 1 into the left WRs. (c) Cycle 2: Load LSB of weights from filters 2 and 3 into the right WRs. (d) Cycle 3: Load MSB of weights from filters 0 and 1 into the left WRs. (e) Cycle 4: Load MSB of weights from filters 2 and 3 into the right WRs. (f) Cycle 5: Multiply MSB of weights from filters 2 and 3 with MSB of a0 and a1.

Figure 1b through Figure 1f show how LM would process an FCL. As Figure 1b shows, in cycle 1, the left column subunits receive the least significant bits (LSBs) a0/0 and a1/0 of activations a0 and a1, and w00/0, w01/0, w10/0, and w11/0, the LSBs of four weights from filters 0 and 1. Each of these two subunits calculates two 1b × 1b products¹ and stores their sum into its OR. In cycle 2, as Figure 1c shows, the left column subunits now multiply the same weight bits with the most significant bits (MSBs) a0/1 and a1/1 of activations a0 and a1 respectively, accumulating these into their ORs. In parallel, the two right column subunits load a0/0 and a1/0, the LSBs of the input activations a0 and a1, and multiply them by the LSBs of weights w20/0, w21/0, w30/0, and w31/0 from filters 2 and 3. In cycle 3, the left column subunits now load and multiply the LSBs a0/0 and a1/0 with the MSBs w00/1, w01/1, w10/1, and w11/1 of the four weights from filters 0 and 1. In parallel, the right subunits reuse their WR-held weights w20/0, w21/0, w30/0, and w31/0 and multiply them by the most significant bits a0/1 and a1/1 of activations a0 and a1 (Figure 1d). As Figure 1e illustrates, in cycle 4, the left column subunits multiply their WR-held weights by a0/1 and a1/1, the MSBs of activations a0 and a1, and finish the calculation of output activations o0 and o1. Concurrently, the right column subunits load w20/1, w21/1, w30/1, and w31/1, the MSBs of the weights from filters 2 and 3, and multiply them with a0/0 and a1/0. In cycle 5, as Figure 1f shows, the right subunits complete the multiplication of their WR-held weights with a0/1 and a1/1, the MSBs of the two activations. By the end of this cycle, output activations o2 and o3 are ready as well.

In total it took 4+1 cycles to process 32 1b × 1b products (4, 8, 8, 8, and 4 products in cycles 1 through 5, respectively).¹ Notice that at the end of the fifth cycle, the left column subunits are idle, thus another set of weights could have been loaded into the WRs allowing a new set of outputs to commence computation. In the steady state, when the input activations and the weights are represented in two bits, this engine produces 8 1b × 1b terms every cycle, thus matching the two 2b × 2b per-cycle throughput of the bit-parallel engine.

¹ In reality the product and accumulation would take place in the subsequent cycle. For clarity, we do not describe this in detail. It would only add an extra cycle to the processing pipeline per layer.
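To make the bit-level accounting concrete, the following minimal Python sketch (ours, not part of the design) reproduces the scheme each subunit implements: a weight bit is held in a WR while activation bits stream past, each 1b × 1b AND result is shifted to its significance, and the number of bit-serial steps is Pw × Pa regardless of how many lanes run in parallel. Unsigned values are assumed.

```python
# Minimal sketch (ours) of the bit-serial multiply-accumulate described above.

def bit(x, i):
    """Return bit i of a non-negative integer x."""
    return (x >> i) & 1

def bit_serial_dot(weights, activations, p_w, p_a):
    """Dot product computed from 1b x 1b products only, as LM's subunits do.

    Returns the result and the number of bit-serial steps, which is
    p_w * p_a independent of how many lanes are processed per step.
    """
    acc, steps = 0, 0
    for wb in range(p_w):            # one weight bit held in the WRs ...
        for ab in range(p_a):        # ... reused against every activation bit
            partial = sum(bit(w, wb) & bit(a, ab)
                          for w, a in zip(weights, activations))
            acc += partial << (wb + ab)   # shift the 1b product to its weight
            steps += 1
    return acc, steps

if __name__ == "__main__":
    w = [3, 1]          # 2-bit weights of one filter
    a = [2, 3]          # 2-bit activations
    assert bit_serial_dot(w, a, p_w=2, p_a=2) == (sum(x * y for x, y in zip(w, a)), 4)
    print(bit_serial_dot(w, a, p_w=2, p_a=2))   # (9, 4): correct result in Pw x Pa steps
```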

If the weights could be represented using only one bit, LM would be producing two output activations per cycle, twice the bandwidth of the bit-parallel engine. In general, if the bit-parallel hardware was using Pbase bits to represent the weights while only Pw bits were actually required, for the FCLs the LM engine would outperform the bit-parallel engine by Pbase/Pw. Since there is no weight reuse in FCLs, Cn cycles are required to load a different set of weights to each of the Cn columns. Thus having activations that use fewer than Cn bits would not improve performance (but could improve energy efficiency).

Convolutional Layers: LM processes CVLs mostly similarly to FCLs, but exploits weight reuse across different windows to exploit a reduction in precision for both weights and activations. Specifically, in CVLs the subunits across the same row share the same weight bits, which they load in parallel into their WRs in a single cycle. These weight bits are multiplied by the corresponding activation bits over Pa cycles. Another set of weight bits needs to be loaded every Pa cycles, where Pa is the input activation precision. Here LM exploits weight reuse across multiple windows by having each subunit column process a different set of activations. Assuming that the bit-parallel engine uses Pbase bits to represent both input activations and weights, LM will outperform the bit-parallel engine by Pbase²/(Pw × Pa), where Pw and Pa are the weight and activation precisions respectively.
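The two expressions above are easy to tabulate; a small helper (ours, with the name ideal_speedup chosen for illustration):

```python
# Sketch (ours) evaluating the ideal LM speedup over a Pbase-bit bit-parallel baseline.

def ideal_speedup(p_base, p_w, p_a=None, layer="conv"):
    """Ideal speedup: Pbase^2/(Pw*Pa) for CVLs, Pbase/Pw for FCLs."""
    if layer == "conv":
        return (p_base * p_base) / (p_w * p_a)
    if layer == "fc":
        return p_base / p_w
    raise ValueError(layer)

# Example using the AlexNet precisions reported later in Tables I and II
# (first CVL: 9b activations, 11b weights; first FCL: 10b weights):
print(ideal_speedup(16, 11, 9, "conv"))   # ~2.59x over a 16-bit baseline
print(ideal_speedup(16, 10, layer="fc"))  # 1.6x
```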


Fig. 2. The two CNN accelerators. (a) Baseline design. (b) Loom.

III. Loom ARCHITECTURE

This section describes the baseline DaDN-like design, how it was configured to work with an HBM2 memory, and finally the Loom architecture.

A. Data Supply and Baseline System

Our baseline design (BASE) is an appropriately configured data-parallel engine inspired by the DaDN accelerator [1]. DaDN uses 16-bit fixed-point activations and weights. A DaDN chip integrates 16 tiles where each tile processes 16 filters concurrently, and 16 weight and activation products per filter. In total, a DaDN chip processes 16 × 16 = 256 filters and 4K products concurrently, requiring 8KB of weight and 32B of activation inputs (16 activations are reused by all 256 filters) per cycle. Given the 1GHz operating frequency, sustaining DaDN's compute bandwidth requires 8TB/sec and 32GB/sec of weight and input activation bandwidth respectively. DaDN uses 32MB weight and 4MB activation eDRAMs for this purpose. Such large on-chip memories are beyond the reach of modern embedded SoC designs. Given that there is no weight reuse in FCLs, all weights have to be supplied from an off-chip memory.² Accordingly, BASE is a DaDN compute engine configured to match the external weight memory's bandwidth. Assuming a High Bandwidth Memory v2 (HBM2) interface and current commercial offerings, weights can be read at a rate of 256GB/s [4]. Thus BASE can expect to process up to 128 weights per clock cycle; a single tile that processes 16 weights from each of 8 filters suffices. An appropriately sized Weight Buffer (WB) can keep the HBM2 interface busy while tolerating its latency. The WB will be the same for both BASE and LM and will be Mlat × 128B, where Mlat is the latency of the external memory in cycles (for example, assuming a 40ns Mlat, a WB of approximately 5KB would be sufficient).

² Since there is weight reuse in CVLs, it may be possible to boost weight supply bandwidth with a smaller than 32MB on-chip weight memory (WM) for CVLs. However, off-chip memory bandwidth will remain a bottleneck for FCLs. The exploration of such designs is left for future work.

Given the relatively low activation memory (AM) bandwidth and footprint, we assume that activations can be stored on-chip. The AM can be dedicated or shared among multiple compute engines. It needs to sustain a 32B/cycle bandwidth.

Figure 2a illustrates the BASE design, which processes eight filters concurrently calculating 16 input activation and weight products per filter, for a total of 128 products per cycle. Each cycle, the design reduces the 16 products of each filter into a single partial output activation, for a total of eight partial output activations for the whole chip. Internally, the chip has an input activation buffer (ABin) to provide 16 activations per cycle through 16 activation lanes, and an output activation buffer (ABout) to accept eight partial output activations per cycle. In total, 128 16b × 16b multipliers calculate the 128 activation and weight products, and eight 16-input 32b adder trees produce the partial output activations. All inter-layer activation outputs except for the initial input and the final output are stored in a 4MB Activation Memory (AM) which is connected to the ABin and ABout buffers. Off-chip accesses are needed only for reading the input image and the weights, and for writing the final output.
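The sizing quoted in this subsection follows directly from the stated HBM2 bandwidth, the 1GHz clock, and the WB = Mlat × 128B rule; a back-of-the-envelope sketch (ours, reusing only the numbers given above):

```python
# Sketch (ours) of the BASE/LM sizing arithmetic using the figures stated in the text.

CLOCK_HZ     = 1e9          # 1 GHz target clock
HBM2_BW      = 256e9        # bytes/s of weight bandwidth from HBM2
WEIGHT_BYTES = 2            # 16-bit fixed-point weights

bytes_per_cycle       = HBM2_BW / CLOCK_HZ              # 256 B of weights per cycle
weights_per_cycle     = bytes_per_cycle / WEIGHT_BYTES  # 128 weights/cycle for BASE
weight_bits_per_cycle = bytes_per_cycle * 8             # 2048 = 2K bits/cycle, used by LM below

mlat_cycles = 40e-9 * CLOCK_HZ                          # assumed 40 ns latency -> 40 cycles
wb_bytes    = mlat_cycles * 128                         # WB = Mlat x 128B ~= 5 KB

print(weights_per_cycle, weight_bits_per_cycle, wb_bytes)   # 128.0 2048.0 5120.0
```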

B. Loom

Targeting a 1GHz clock frequency and an HBM2 interface, LM can expect to sustain an input bandwidth of up to 2K weight bits per cycle. Accordingly, LM is configured to process 128 filters concurrently and 16 weight bits per filter per cycle, for a total of 128 × 16 = 2048 weight bits per cycle. LM also accepts 256 1-bit input activations, each of which it multiplies with 128 1-bit weights, thus matching the computation bandwidth of BASE in the worst case where both activations and weights need 16 bits. Figure 2b shows the Loom design. It comprises 2K Serial Inner-Product Units (SIPs) organized in a 128 × 16 grid. Every cycle, each SIP multiplies 16 1b input activations with 16 1b weights and reduces these products into a partial output activation.


Fig. 3. LM's SIP.

The SIPs along the same row share a common 16b weight bus, and the SIPs along the same column share a common 16b activation bus. Accordingly, as in BASE, the SIP array is fed by a 2Kb weight bus and a 256b activation input bus. Similar to BASE, LM has an ABout and an ABin. LM processes both activations and weights bit-serially.

Reducing Memory Footprint and Bandwidth: Since both weights and activations are processed bit-serially, LM can store weights and activations in a bit-interleaved fashion and using only as many bits as necessary, thus boosting the effective bandwidth and storage capacity of the external weight memory and the on-chip AM. For example, given 2K 13b weights to be processed in parallel, LM would pack first their bit 0 onto consecutive rows, then their bit 1, and so on up to bit 12; BASE would store them using 16 bits each instead. A transposer can rotate the output activations prior to writing them to AM from ABout. Since each output activation entails inner products with tens to hundreds of inputs, the transposer demand will be low.
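A hedged sketch of this bit-interleaved layout, assuming NumPy and unsigned fixed-point weights (the helper names are ours):

```python
# Sketch (ours) of the bit-plane ("bit-interleaved") weight layout described above:
# bit 0 of every weight is stored first, then bit 1, and so on, so only Pw planes
# are fetched instead of full 16-bit words.
import numpy as np

def pack_bit_planes(weights, p_w):
    """Return a (p_w, n) array of bits: row b holds bit b of every weight."""
    w = np.asarray(weights, dtype=np.uint16)
    return np.stack([(w >> b) & 1 for b in range(p_w)]).astype(np.uint8)

def unpack_bit_planes(planes):
    """Inverse of pack_bit_planes."""
    return sum(planes[b].astype(np.uint16) << b for b in range(planes.shape[0]))

if __name__ == "__main__":
    w = np.random.randint(0, 2**13, size=2048)        # 2K weights, 13b each
    planes = pack_bit_planes(w, p_w=13)
    assert np.array_equal(unpack_bit_planes(planes), w)
    print(planes.size / (16 * w.size))                 # 0.8125: 13/16 of a 16b layout
```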

Next we explain how LM processes FCLs and CVLs.

Convolutional Layers: Processing starts by reading in parallel 2K weight bits from the off-chip memory, loading 16 bits to all WRs per SIP row. The loaded weights are multiplied by 16 corresponding activation bits per SIP column bit-serially over PLa cycles, where PLa is the activation precision for this layer L. Then, the second bit of the weights is loaded into the WRs and multiplied with another set of 16 activation bits per SIP row, and so on. In total, the bit-serial multiplication takes PLa × PLw cycles, where PLw is the weight precision for this layer L. Whereas BASE would process 16 sets of 16 activations and 128 filters over 256 cycles, LM processes them concurrently but bit-serially over PLa × PLw cycles. If PLa and/or PLw are less than 16, LM will outperform BASE by 256/(PLa × PLw); otherwise, LM will match BASE's performance.

Fully-Connected Layers: Processing starts by loading the LSBs of a set of weights into the WR registers of the first SIP column and multiplying the loaded weights with the LSBs of the corresponding activations. In the second cycle, while the first column of SIPs is still busy multiplying the LSBs in its WRs by the second bit of the activations, the LSBs of a new set of weights can be loaded into the WRs of the second SIP column. Each weight bit is reused for 16 cycles, multiplying with bits 0 through 15 of the input activations. Thus, there is enough time for LM to keep any single column of SIPs busy while loading new sets of weights to the other 15 columns.

For example, as shown in Figure 2b, LM can load a single bit of 2K weights to SIP(0,0)..SIP(0,127) in cycle 0, then load a single bit of the next 2K weights to SIP(1,0)..SIP(1,127) in cycle 1, and so on. After the first 15 cycles, all SIPs are fully utilized. It will take PLw × 16 cycles for LM to process 16 sets of 16 activations and 128 filters, while BASE processes them in 256 cycles. Thus, when PLw is less than 16, LM will outperform BASE by 16/PLw, and it will match BASE's performance otherwise.

Processing Layers with Few Outputs: For LM to keep all the SIPs busy, an output activation must be assigned to each SIP. This is possible as long as the layer has at least 2K outputs. However, in the networks studied some FCLs have only 1K output activations. To avoid underutilization, LM implements SIP cascading, in which the SIPs along each row can form a daisy chain, where the output of one can feed into an input of the next via a multiplexer. This way, the computation of an output activation can be sliced along the bit dimension over the SIPs in the same row. In this case, each SIP processes only a portion of the input activations, resulting in several partial output activations along the SIPs on the same row. Over the next Sn cycles, where Sn is the number of bit slices used, the Sn partial outputs can be reduced into the final output activation.

Other Layers: Similar to DaDN, LM processes the additional layers needed by the studied networks. To do so, LM incorporates units for MAX pooling as in DaDN. Moreover, to apply nonlinear activations, an activation functional unit is present at the output of ABout. Given that each output activation typically takes several cycles to compute, it is not necessary to use more such functional units than BASE does.

Total computational bandwidth: In the worst case, where both activations and weights use 16b precisions, a single 16b × 16b product that would have taken BASE one cycle to produce now takes LM 256 cycles. Since BASE calculates 128 products per cycle, LM needs to calculate the equivalent of 256 × 128 16b × 16b products every 256 cycles. LM has 128 × 16 = 2048 SIPs, each producing 16 1b × 1b products per cycle. Thus, over 256 cycles, LM produces 2048 × 16 × 256 1b × 1b products, matching BASE's compute bandwidth.
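The worst-case throughput argument can be checked with a few lines of arithmetic (our sketch, using nothing beyond the numbers above):

```python
# Consistency check (ours): over 256 cycles LM's 2048 SIPs produce as many
# 1b x 1b products as BASE's 128 full-precision multipliers do in the same time.

SIPS          = 128 * 16           # 2048 serial inner-product units
LM_PRODUCTS   = SIPS * 16 * 256    # 16 1b x 1b products per SIP per cycle, 256 cycles
BASE_PRODUCTS = 128 * 256 * 256    # 128 16b x 16b products/cycle, each worth 256 1b terms

assert LM_PRODUCTS == BASE_PRODUCTS   # both equal 8,388,608 1b x 1b products
```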


Convolutional layers: per-layer activation precision in bits (network weight precision in bits).

100% Accuracy:
NiN: 8-8-8-9-7-8-8-9-9-8-8-8 (11)
AlexNet: 9-8-5-5-7 (11)
GoogLeNet: 10-8-10-9-8-10-9-8-9-10-7 (11)
VGG S: 7-8-9-7-9 (12)
VGG M: 7-7-7-8-7 (12)
VGG 19: 12-12-12-11-12-10-11-11-13-12-13-13-13-13-13-13 (12)

99% Accuracy:
NiN: 8-8-7-9-7-8-8-9-9-8-7-8 (10)
AlexNet: 9-7-4-5-7 (11)
GoogLeNet: 10-8-9-8-8-9-10-8-9-10-8 (10)
VGG S: 7-8-9-7-9 (11)
VGG M: 6-8-7-7-7 (12)
VGG 19: 9-9-9-8-12-10-10-12-13-11-12-13-13-13-13-13 (12)

TABLE I
PER LAYER ACTIVATION PRECISIONS AND PER NETWORK WEIGHT PRECISION PROFILES FOR THE CONVOLUTIONAL LAYERS.

Fully connected layers: per-layer weight precision in bits (100% Accuracy / 99% Accuracy).
AlexNet: 10-9-9 / 9-8-8
GoogLeNet: 7 / 7
VGG S: 10-9-9 / 9-9-8
VGG M: 10-8-8 / 9-8-8
VGG 19: 10-9-9 / 10-9-8

TABLE II
PER LAYER WEIGHT PRECISIONS FOR FULLY-CONNECTED LAYERS.

SIP: Bit-Serial Inner-Product Units: Figure 3 shows LM's Bit-Serial Inner-Product Unit (SIP). Every clock cycle, each SIP multiplies 16 single-bit activations by 16 single-bit weights to produce a partial output activation. Internally, each SIP has 16 1-bit Weight Registers (WRs), 16 2-input AND gates to multiply the weights in the WRs with the incoming input activation bits, and a 16-input 1b adder tree that sums these partial products. Accu.1 accumulates and shifts the output of the adder tree over PLa cycles. Every PLa cycles, Accu.2 shifts the output of Accu.1 and accumulates it into the OR. After PLa × PLw cycles, the Output Register (OR) contains the inner product of an activation and weight set. In each SIP, a multiplexer after Accu.1 implements cascading. To support signed 2's complement activations, a negation block is used to subtract the sum of the input activations corresponding to the most significant bit of the weights (MSB) from the partial sum when the MSB is 1. Each SIP also includes a comparator (max) to support max pooling layers.
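The following simplified model (ours) mirrors the two-level accumulation and the MSB negation rule just described, treating the weights as 2's complement signed and the activations as unsigned for illustration; cascading and the max comparator are omitted:

```python
# Sketch (ours) of the SIP's Accu.1/Accu.2 structure and MSB negation.

def sip_inner_product(weights, activations, p_w, p_a):
    """Bit-serial inner product over 16 lanes, one weight/activation bit at a time."""
    assert len(weights) == len(activations) == 16
    w_bits = [w & ((1 << p_w) - 1) for w in weights]   # 2's-complement bit patterns
    out_reg = 0                                        # OR
    for wb in range(p_w):                              # weight bit held in the WRs
        accu1 = 0
        for ab in range(p_a):                          # activation bits stream in
            adder_tree = sum(((w >> wb) & 1) & ((a >> ab) & 1)
                             for w, a in zip(w_bits, activations))
            accu1 += adder_tree << ab                  # Accu.1: shift per activation bit
        if wb == p_w - 1:
            out_reg -= accu1 << wb                     # negation block: weight MSB weighs negatively
        else:
            out_reg += accu1 << wb                     # Accu.2: fold into the OR
    return out_reg

if __name__ == "__main__":
    import random
    w = [random.randrange(-128, 128) for _ in range(16)]   # 8b signed weights
    a = [random.randrange(0, 256) for _ in range(16)]      # 8b unsigned activations
    assert sip_inner_product(w, a, p_w=8, p_a=8) == sum(x * y for x, y in zip(w, a))
```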

Tuning the Performance, Area and Energy Trade-off: It is possible to trade off some of the performance benefits to reduce the number of SIPs and the respective area overhead by processing more than one activation bit per cycle. Using this method, LM requires fewer SIPs to match BASE's throughput. The evaluation section considers 2-bit and 4-bit LM configurations, denoted LM2b and LM4b, respectively, which need 8 and 4 SIP columns, respectively. Since activation precisions are now forced to be a multiple of 2 or 4, respectively, these configurations give up some of the performance potential. For example, for LM4b reducing PLa from 8 to 5 bits produces no performance benefit, whereas for LM1b it would improve performance by 1.6×.

IV. EVALUATION

This section evaluates Loom's performance, energy, and area, and explores the trade-off between accuracy and performance, comparing against BASE and Stripes*.³

³ Stripes* is a configuration of [5] that is appropriately scaled to match the 256GB/s bandwidth of the HBM2 interface.

A. Methodology

Performance, Energy, and Area Methodology: The measurements were collected over layouts of all designs as follows: the designs were synthesized for worst case, typical case, and best case corners with the Synopsys Design Compiler [6] using a TSMC 65nm library. Layouts were produced with Cadence Encounter [7] using the typical corner case synthesis results, which were more pessimistic for LM than the worst case scenario. Power results are based on the actual data-driven activity factors. The clock frequency of all designs is set to 980 MHz, matching the original DaDianNao design [1]. The ABin and ABout SRAM buffers were modeled with CACTI [8], and the AM eDRAM area and energy were modeled with Destiny [9]. Execution time is modeled via a custom cycle-accurate simulator.

Weight and Activation Precisions: The methodology of Judd et al. [10] was used to generate per layer precision profiles. Tables I and II indicate precisions for convolutional and fully-connected layers, respectively. Caffe [11] was used to measure how reducing the precision of each layer affects the network's overall top-1 prediction accuracy over 5000 images. The network models and trained networks are taken from the Caffe Model Zoo [12]. Since LM's performance for the CVLs depends on both PLa and PLw, we adjust them independently: we use per layer activation precisions and a weight precision common across all CVLs (we found little inter-layer variability for weight precisions, but additional per layer exploration is warranted). Since LM's performance for FCLs depends only on PLw, we only adjust the weight precision for FCLs.

Table I reports the per layer precisions of input activations and the per network precisions of weights for the CVLs. The precisions that guarantee no accuracy loss for input activations vary from 5 to 13 bits, and for weights vary from 10 to 12. When a 99% accuracy is still acceptable, the activation and weight precisions can be as low as 4 and 10 bits, respectively. Table II shows that the per layer weight precisions for the FCLs vary from 7 to 10 bits.
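As an illustration of that per-layer exploration, here is a hedged sketch in the spirit of [10] (not the authors' code); accuracy_with_precision is a hypothetical hook that would wrap a framework such as Caffe with the chosen layer quantized to the given precision:

```python
# Sketch (ours) of a per-layer activation precision search against an accuracy threshold.

def profile_activations(layers, accuracy_with_precision,
                        baseline_acc, threshold=1.00, max_bits=16):
    """Return the smallest activation precision per layer keeping
    accuracy >= threshold * baseline_acc (use threshold=0.99 for the 99% profiles)."""
    profile = {}
    for layer in layers:
        best = max_bits
        for bits in range(max_bits, 0, -1):
            acc = accuracy_with_precision(layer, bits)   # hypothetical evaluation hook
            if acc >= threshold * baseline_acc:
                best = bits                              # still accurate enough: keep lowering
            else:
                break                                    # too low: stop at the previous precision
        profile[layer] = best
    return profile
```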

B. Results

Performance: Figure 4 shows the performance of Stripes* and the Loom configurations for CVLs relative to BASE with the precision profiles of Tables I and II. With no accuracy loss (100% accuracy), LM1b improves the performance of CVLs by a factor of 2.50× on average over BASE, compared to a 1.84× improvement with Stripes*. Similarly, LM2b and LM4b achieve, on average, speedups of 2.37× and 2.22× over BASE on the CVLs, respectively. As expected, LM2b and LM4b offer slightly lower performance than LM1b; however, given that the power consumption of LM2b and LM4b is lower than that of LM1b, this can be a good trade-off. The performance loss of LM2b and LM4b is due to rounding up activation precisions to be a multiple of 2 and 4, respectively.

Fig. 4. LM's performance relative to BASE for convolutional layers with 100% accuracy.

            Fully-connected layers (Perf / Eff)         Convolutional layers (Perf / Eff)
Network     1-bit        2-bit        4-bit             1-bit        2-bit        4-bit
NiN         -            -            -                 3.63 / 2.96  3.35 / 3.20  2.99 / 3.18
AlexNet     1.85 / 1.51  1.85 / 1.76  1.85 / 1.97       3.74 / 3.05  3.28 / 3.13  3.12 / 3.32
GoogLeNet   2.25 / 1.84  2.27 / 2.16  2.28 / 2.42       2.13 / 1.74  2.12 / 2.02  1.99 / 2.11
VGG S       1.78 / 1.46  1.78 / 1.70  1.79 / 1.90       2.74 / 2.24  2.58 / 2.46  2.37 / 2.53
VGG M       1.79 / 1.47  1.80 / 1.72  1.80 / 1.92       2.83 / 2.31  2.59 / 2.47  2.63 / 2.80
VGG 19      1.63 / 1.33  1.63 / 1.56  1.63 / 1.74       1.79 / 1.47  1.72 / 1.64  1.56 / 1.66
geomean     1.85 / 1.51  1.85 / 1.77  1.86 / 1.98       2.85 / 2.22  2.54 / 2.42  2.38 / 2.53

TABLE III
EXECUTION TIME AND ENERGY EFFICIENCY IMPROVEMENTS FOR FULLY-CONNECTED AND CONVOLUTIONAL LAYERS WITH 99% ACCURACY.

Figure 5 shows the performance of Stripes* and the Loom configurations for FCLs. Since for the FCLs the performance improvement comes only from the lower precision of the weights, rounding up the activation precision does not affect the performance of the designs. Hence all three LM configurations outperform BASE on average by a factor of about 1.74×, while Stripes* merely matches BASE's performance. However, due to its shorter per-layer initiation interval, LM4b performs slightly better than LM2b and LM1b on the FCLs. Since GoogLeNet has only one small fully-connected layer, the initiation interval has a larger effect on the performance of its fully-connected layer; thus, the performance variation across the Loom configurations is higher for GoogLeNet.

Fig. 5. LM's performance relative to BASE for fully-connected layers with 100% accuracy.

Table III illustrates the performance and energy efficiency for FCLs and CVLs with an up to 1% loss in accuracy (99% accuracy). The average speedups for the FCLs with LM1b, LM2b, and LM4b are 1.85×, 1.85×, and 1.86×, respectively. The respective speedups for the CVLs are 2.85×, 2.54× and 2.38×.

Energy Efficiency: Figure 6 shows the energy efficiency of Stripes*, LM1b, LM2b, and LM4b relative to BASE for CVLs using the 100% accuracy profiles of Table I. Since the number of SIPs in LM1b, LM2b, and LM4b is 2K, 1K, and 512, respectively, the power consumption of LM4b is lower than that of LM2b and LM1b, so for all networks LM4b has higher energy efficiency than LM2b and LM1b. The LM1b, LM2b, and LM4b accelerators for CVLs achieve on average energy efficiencies of 2.04×, 2.26×, and 2.36× over BASE, compared to a 1.61× improvement with Stripes*.

Fig. 6. LM's energy efficiency relative to BASE for convolutional layers with 100% accuracy.

Figure 7 shows the energy efficiency of Stripes* and the Loom configurations for FCLs with no accuracy loss. Since Stripes* does not improve performance for FCLs and consumes more energy than BASE, its energy efficiency for FCLs is less than one (0.87×). All three configurations of Loom have the same performance improvement for FCLs. However, as the power consumption of LM4b is lower than that of the two other configurations, it has the highest energy efficiency. Similarly, the LM2b design is more energy efficient than LM1b. The energy efficiency improvements of LM1b, LM2b, and LM4b over BASE are 1.43×, 1.67×, and 1.86×, respectively.

With the 99% accuracy profiles, the LM1b, LM2b, and LM4b energy efficiency improves to 2.22×, 2.42×, and 2.53× for the CVLs and 1.51×, 1.77×, and 1.98× for the FCLs (Table III). On average, over the whole network, LM1b, LM2b and LM4b improve energy efficiency by factors of 2.19×, 2.39×, and 2.50× over BASE.

These energy measurements do not include the off-chip memory accesses, as an appropriate model for HBM2 is not available to us. However, since LM uses lower precisions for representing the weights, it will transfer less data from off-chip. Thus our evaluation is conservative and the efficiency of LM will be even higher.

Area Overhead: Post-layout measurements were used to measure the area of BASE and Loom. The LM1b configuration requires 1.31× more area than BASE while achieving on average a 2.47× speedup. LM2b and LM4b reduce the area overhead to 1.23× and 1.14× while still improving execution time by 2.34× and 2.20×, respectively. Thus LM exhibits better performance vs. area scaling than BASE.

C. Dynamic Precisions

To further improve the performance of Loom, similar to [13], the precision required to represent the input activations and weights can be determined at runtime. This enables Loom to exploit smaller precisions without any accuracy loss, as it explores the weight and activation precisions at a finer granularity. In this experiment, the activation precisions are adjusted per group of 16 activations that are broadcast to the same column of SIPs. Figure 8 shows the performance of the Loom configurations relative to BASE.

Fig. 7. LM's energy efficiency relative to BASE for fully-connected layers with 100% accuracy.

Fig. 8. Relative performance of LM using dynamic precisions for activations with 100% accuracy. Solid colors: performance not using dynamic precisions.

Exploiting the dynamic precision technique improves performance on average by 3.32×, 3.18×, and 2.82× for LM1b, LM2b, and LM4b, compared to the 2.44× average improvement with Stripes*.
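A minimal sketch (ours) of the per-group precision detection this technique relies on, assuming unsigned (e.g., post-ReLU) activations and groups of 16:

```python
# Sketch (ours): a group of activations only needs as many bits as its largest value.

def group_precision(activations):
    """Bits needed to represent every activation in the group."""
    return max(1, max(a.bit_length() for a in activations))

if __name__ == "__main__":
    group = [3, 17, 0, 9] + [1] * 12       # one group of 16 activations
    print(group_precision(group))          # 5 bits, instead of the layer's static precision
```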

V. RELATED WORK

Bit-serial neural network (NN) hardware has been proposed before [14], [15]. While its performance scales with the input data precision, it is slower than an equivalently configured bit-parallel engine. For example, one design [14] takes 4 × p cycles per weight and activation product, where p is the precision of the weights.

In recent years, several DNN hardware accelerators have been proposed; however, in the interest of space we limit attention to those most related to this work.

Page 8: Loom: Exploiting Weight and Activation Precisions to ... · Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks Sayeh Sharify, Alberto Delmas

Stripes [5], [16] processes activations bit-serially and reduces execution time on CVLs only. Loom outperforms Stripes on both CVLs and FCLs: it exploits both weight and activation precisions in CVLs and weight precision in FCLs. Pragmatic's performance for the CVLs depends only on the number of activation bits that are 1 [17], but it does not improve performance for FCLs. Further performance improvement may be possible by combining Pragmatic's approach with LM's. Proteus exploits per layer precisions to reduce memory footprint and bandwidth, but requires crossbars per input weight to convert from the storage format to the one used by the bit-parallel compute engines [18]. Loom obviates the need for such a conversion and the corresponding crossbars. Hardwired NN implementations, where the whole network is implemented directly in hardware, naturally exploit per layer precisions [19]. Loom does not require that the whole network fit on chip, nor does it hardwire the per layer precisions at design time.

VI. CONCLUSION

This work presented Loom, a hardware inference accelerator for DNNs whose execution time for the convolutional and fully-connected layers scales inversely proportionally with the precision p used to represent the input data. LM can trade off accuracy vs. performance and energy efficiency on the fly. The experimental results show that, on average, LM is 2.34× faster and 2.23× more energy-efficient than a conventional bit-parallel accelerator. We targeted the available HBM2 interface and devices; however, we expect that LM will scale well to future HBM revisions.

REFERENCES

[1] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, Dec 2014, pp. 609–622.

[2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013.

[3] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.

[4] J. Hruska, "Samsung announces mass production of next-generation HBM2 memory," https://www.extremetech.com/extreme/221473-samsung-announces-mass-production-of-next-generation-hbm2-memory, 2016.

[5] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. of the 49th Annual IEEE/ACM Intl. Symposium on Microarchitecture, 2016.

[6] Synopsys, "Design Compiler," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.

[7] Cadence, "Encounter RTL Compiler," https://www.cadence.com/content/cadence-www/global/en_US/home/training/all-courses/84441.html.

[8] N. Muralimanohar and R. Balasubramonian, "CACTI 6.0: A tool to understand large caches."

[9] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, "DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches," in Design, Automation & Test in Europe Conference & Exhibition, March 2015.

[10] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, "Reduced-precision strategies for bounded memory in deep neural nets," arXiv:1511.05236v4 [cs.LG], 2015.

[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[12] Y. Jia, "Caffe model zoo," https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.

[13] A. Delmas, P. Judd, S. Sharify, and A. Moshovos, "Dynamic stripes: Exploiting the dynamic precision requirements of activation values in neural networks," arXiv preprint arXiv:1706.00504, 2017.

[14] B. Svensson and T. Nordstrom, "Execution of neural network algorithms on an array of bit-serial processors," in Pattern Recognition, 1990. Proceedings., 10th International Conference on, vol. 2. IEEE, 1990.

[15] A. F. Murray, A. V. Smith, and Z. F. Butler, "Bit-serial neural networks," in Neural Information Processing Systems, 1988, pp. 573–583.

[16] P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," Computer Architecture Letters, 2016.

[17] J. Albericio, P. Judd, A. D. Lascorz, S. Sharify, and A. Moshovos, "Bit-pragmatic deep neural network computing," arXiv:1610.06920 [cs.LG], 2016.

[18] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, "Proteus: Exploiting numerical precision variability in deep neural networks," in Proceedings of the 2016 International Conference on Supercomputing. ACM, 2016, p. 23.

[19] T. Szabo, L. Antoni, G. Horvath, and B. Feher, "A full-parallel digital implementation for pre-trained NNs," in IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 2, 2000, pp. 49–54.

