
SOFTWARE–HARDWARE CODESIGN FOR EFFICIENT NEURAL NETWORK ACCELERATION

Kaiyuan Guo, Tsinghua University and DeePhi
Song Han, Stanford University and DeePhi
Song Yao, DeePhi
Yu Wang, Tsinghua University and DeePhi
Yuan Xie, University of California, Santa Barbara
Huazhong Yang, Tsinghua University

Designers making deep learning computing more efficient cannot rely solely on hardware. Incorporating software-optimization techniques such as model compression leads to significant power savings and performance improvement. This article provides an overview of DeePhi's technology flow, including compression, compilation, and hardware acceleration. Two accelerators achieve extremely high energy efficiency for both client and datacenter applications with convolutional and recurrent neural networks.

"Deep learning" and "neural network" are the current AI keywords. Deep learning is showing dominant performance in applications such as image classification [1] and speech recognition [2], which makes it the top candidate for real-world AI applications. However, today's computational efficiency is still not enough, and the computational complexity of neural networks far exceeds traditional computer vision algorithms, so we cannot employ deep learning in many cases. To address this problem, researchers around the world have been working on customized hardware acceleration solutions [3]. There will be an unprecedented battle for deep learning hardware.

We believe that, to build an efficient system for deep learning, we must consider software–hardware codesign, because software and hardware are coupled in deep learning.

Considering optimization in both software and hardware, we propose a new design flow (see Figure 1). Three factors affect how efficiently deep learning algorithms can be computed: workload, peak performance, and efficiency.

A smaller workload with the same precision is always welcome. However, changing the workload can affect the hardware design. For example, replacing direct 2D convolution with a fast algorithm such as Winograd in a convolutional neural network (CNN) changes the ratio between multiplications and additions and also changes the data access pattern. Furthermore, exploiting the sparsity in neural networks changes even the data description format and the entire computing system, moving from dense to sparse matrices.
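To make the workload tradeoff concrete, the following sketch compares direct 1D convolution with the Winograd F(2,3) transform, which produces two outputs of a 3-tap filter with four multiplications instead of six. This is a minimal illustration of the general technique, not DeePhi's implementation; the 2D case nests the same transforms.

```python
import numpy as np

def direct_f23(d, g):
    """Direct 1D convolution: two outputs of a 3-tap filter, six multiplications."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with only four multiplications.

    The filter-side additions and divisions by 2 can be precomputed once per
    kernel, so only the four products count toward the per-tile workload.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)   # input tile
g = np.random.randn(3)   # filter taps
assert np.allclose(direct_f23(d, g), winograd_f23(d, g))
```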

A higher peak performance is always wanted. However, because peak performance is usually proportional to the number of computation units and the system frequency, a higher peak performance often results in higher cost and power. One way to increase the peak performance while lowering the cost is to simplify the operations, for example, by using fewer bits to represent data and weights in neural networks. The robustness of deep learning algorithms makes it possible to use 16-bit, 8-bit, and even fewer-bit fixed-point operations to replace 32-bit floating-point operations while introducing negligible accuracy loss. This tradeoff between peak performance and variable precision influences both algorithm and hardware design.

Efficiency reflects how well we use the computation units. An elegant memory system design that feeds the computing units with enough data is the key to high efficiency. To achieve this, we need to tackle both on-chip memory and external memory system design. For the on-chip memory, it is necessary to exploit data locality and data reuse to keep data in the cache as long as possible. For the external memory, increasing the bandwidth helps increase efficiency but also leads to higher cost and power. With the same theoretical bandwidth, we need to increase the burst length to fully utilize it, that is, organize data storage to match the hardware's requirements. The data simplification method also reduces the data bit width and thus the bandwidth requirement.
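As a rough illustration of on-chip data reuse, the sketch below blocks a matrix multiplication so that each tile is loaded once into a notional on-chip buffer and reused for a whole output block; the tile size is an arbitrary illustrative value, not a parameter of our accelerators.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply illustrating on-chip data reuse.

    Each (tile x tile) block of A and B is "loaded" once and reused for a whole
    block of C, which is what an accelerator's input/output buffers enable.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            acc = np.zeros((min(tile, M - i0), min(tile, N - j0)))
            for k0 in range(0, K, tile):
                a_blk = A[i0:i0 + tile, k0:k0 + tile]   # stays in the "input buffer"
                b_blk = B[k0:k0 + tile, j0:j0 + tile]
                acc += a_blk @ b_blk                    # partial sums accumulate on chip
            C[i0:i0 + tile, j0:j0 + tile] = acc         # one burst write per output tile
    return C

A = np.random.randn(200, 300)
B = np.random.randn(300, 150)
assert np.allclose(tiled_matmul(A, B), A @ B)
```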

Taking all three factors into account helps us design a highly efficient deep learning system. Furthermore, because deep learning is evolving rapidly, taping out a fixed design might not be a good choice for a commercial product. In this case, general-purpose processors or specialized hardware with enough flexibility for reprogramming are preferable. Field-programmable gate arrays (FPGAs), with their inherent reconfigurability, let us explore all three levels of the design and incorporate state-of-the-art deep learning techniques into a product within a short design time. Thus, FPGAs have the potential to become a mainstream deep learning processing platform.

From Model to Instructions

No standard state-of-the-art neural network model exists. For CNNs, early models first applied several convolution (Conv) layers sequentially to the input image to generate low-dimension features, and then several fully connected layers as the classifier. Current networks, such as ResNet [1] and the inception module in GoogLeNet [4], use different branches and parallel layers to achieve multiscale sampling and avoid vanishing gradients. The model size ranges from fewer than 10 layers to more than 100 layers for different tasks. For recurrent neural networks (RNNs), there are also many variants, such as long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional RNNs used in speech recognition [2], and sequence-to-sequence learning used in neural machine translation (NMT) [5].

A system must be flexible enough to execute different neural network models. To achieve this, a flexible description is necessary. Caffe, TensorFlow, and other deep learning frameworks provide efficient interfaces on CPU and GPU platforms. However, for specialized systems, we need a tool and an intermediate representation to bridge these frameworks and the hardware. We design the customized hardware considering the patterns of neural network computation to achieve high efficiency while leaving the interface flexible. In this way, we can map different networks onto it. Meanwhile, algorithm researchers and hardware developers can work simultaneously, making product iteration fast and efficient.
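As a hypothetical illustration, such an intermediate representation can be as simple as a list of layer nodes with operator types, shapes, and connections, exported from a framework graph and consumed by the hardware compiler; the node fields below are invented for this sketch and are not DeePhi's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LayerNode:
    """One node of a framework-neutral layer graph (illustrative fields only)."""
    name: str
    op: str                      # e.g., 'conv2d', 'pool', 'fc', 'lstm'
    inputs: List[str]            # names of producer nodes
    shape: tuple                 # output shape, e.g., (channels, height, width)
    attrs: dict = field(default_factory=dict)

# A Caffe or TensorFlow graph would be translated into a list like this,
# which the compiler then tiles, quantizes, and lowers to instructions.
graph = [
    LayerNode('data',  'input',  [],        (3, 224, 224)),
    LayerNode('conv1', 'conv2d', ['data'],  (64, 224, 224), {'kernel': 3, 'relu': True}),
    LayerNode('pool1', 'pool',   ['conv1'], (64, 112, 112), {'kind': 'max', 'size': 2}),
    LayerNode('fc1',   'fc',     ['pool1'], (1000,)),
]
```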

We implement an instruction interface for our hardware. For CPUs or GPUs, the instructions are fine grained, usually with scalar- or vector-level operations. Fine-grained instructions are highly flexible, but considering the specialty of neural networks, this might not be an efficient interface. For example, neural network computation is usually full of loops, so we partition the loops into small blocks such that each block can be done by the hardware. For a CNN, each block can be a set of 2D convolutions, whereas for an RNN, each block can be a vector-matrix multiplication. The operations of each block can be represented by one instruction, which reduces the number of instructions while maintaining hardware efficiency. We also use instructions to describe data transfers between the on-chip cache and off-chip memory, which lets the compiler do static scheduling to balance computation and I/O.

Figure 1. Our proposed design flow, from application and model design through compression (pruning and quantization), hardware design, and compilation into instructions, driven by workload, peak performance, and efficiency.
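The sketch below shows what such a coarse-grained instruction stream might look like: one instruction per block of computation or per bulk data transfer, so the compiler can statically interleave computation with I/O. The instruction types and fields are hypothetical, chosen only to illustrate the granularity described above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Load:            # burst-load a tile from off-chip DRAM into an on-chip buffer
    dram_addr: int
    buf_addr: int
    length: int

@dataclass
class ConvBlock:       # one block: a set of 2D convolutions accumulated on chip
    in_buf: int
    weight_buf: int
    out_buf: int
    in_channels: int
    out_channels: int
    kernel: Tuple[int, int] = (3, 3)
    pool: bool = False           # optional pooling fused into the pipeline
    relu: bool = True            # optional nonlinearity, bypassable

@dataclass
class Save:            # burst-store an output tile back to DRAM
    buf_addr: int
    dram_addr: int
    length: int

# A convolution layer then compiles to a short static sequence such as:
program = [
    Load(dram_addr=0x0000, buf_addr=0, length=4096),   # input tile
    Load(dram_addr=0x8000, buf_addr=1, length=1152),   # 3x3 weights
    ConvBlock(in_buf=0, weight_buf=1, out_buf=2,
              in_channels=16, out_channels=8),
    Save(buf_addr=2, dram_addr=0x10000, length=2048),
]
```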

Consider the design flow shown in Figure 1. First, the deep learning algorithm is designed for the target application; for this flow, the main task is to design the neural network model. Then, the model is optimized to be ready for hardware acceleration. This step usually includes model compression and data quantization to reduce the workload and increase the peak performance of the hardware design. Both steps are done by automatic tools, but developers need to pick the best configuration, weighing the accuracy loss against the hardware performance gain. Next, the hardware is designed according to the chosen optimization strategy. These three steps are iterated to ensure that the target application's requirements are met. After hardware design, a customized compiler converts the neural network model into the instructions executed at runtime. The compiler automatically performs further scheduling optimization to increase hardware efficiency.

Aristotle: The CNN Accelerator

CNNs are widely used for image and video processing. One of the most popular CNN applications is object detection, but CNN's high computational complexity makes it an impractical choice for mobile platforms such as smartphones or drones. To solve this problem, we designed the Aristotle architecture for energy-efficient CNN acceleration.

A CNN mainly comprises several convolution layers. Within each layer, there are n input feature maps Ni(x, y) and m output feature maps Mj(x, y). Each feature map is a 2D image. Equation 1 describes the computation within one convolution layer. The * denotes the 2D convolution operation, and W is the convolution kernel. The bias for each output feature map is bj. Function f is a nonlinear function applied to each pixel (for example, ReLU or sigmoid).

$M_j = f\left(\sum_{i=1}^{n} W_{ij} * N_i + b_j\right) \quad (1)$

A CNN also uses pooling layers for down-sampling the feature maps, usually max pooling or average pooling. With pooling layers, the size of the feature maps is reduced. This increases the receptive field of each neuron (pixel) in the feature map, so larger features in the original image can be extracted.
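For reference, the following minimal NumPy sketch implements Equation 1 and 2 × 2 max pooling directly; it mirrors the mathematics, not the accelerator's dataflow, and uses correlation for the 2D '*' operation, as is conventional in CNN frameworks.

```python
import numpy as np

def conv_layer(N, W, b, f=lambda x: np.maximum(x, 0)):
    """Reference for Equation 1: M_j = f(sum_i W_ij * N_i + b_j).

    N: input feature maps, shape (n, H, W)
    W: kernels, shape (n, m, k, k); '*' computed as valid 2D correlation
    b: biases, shape (m,); f: element-wise nonlinearity (ReLU by default)
    """
    n, H, Wd = N.shape
    _, m, k, _ = W.shape
    out = np.zeros((m, H - k + 1, Wd - k + 1))
    for j in range(m):
        for i in range(n):
            for y in range(H - k + 1):
                for x in range(Wd - k + 1):
                    out[j, y, x] += np.sum(N[i, y:y + k, x:x + k] * W[i, j])
        out[j] += b[j]
    return f(out)

def max_pool2x2(M):
    """2x2 max pooling with stride 2 on feature maps of shape (m, H, W)."""
    m, H, Wd = M.shape
    return M[:, :H - H % 2, :Wd - Wd % 2].reshape(m, H // 2, 2, Wd // 2, 2).max(axis=(2, 4))

maps = conv_layer(np.random.randn(3, 8, 8), np.random.randn(3, 4, 3, 3), np.zeros(4))
print(max_pool2x2(maps).shape)   # (4, 3, 3)
```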

Figure 2 shows the proposed architecture. We implement the architecture on a Xilinx XC7Z020 system on chip on a customized board. The board is 5 cm × 5 cm (see Figure 2a), consumes about 3 W at runtime, and can fit into small robots.

Figure 2b shows the system architecture. A common computation system includes a CPU and the external memory, which is the top white part. To accelerate the CNN, we implement the bottom gray part on the FPGA. Data and instruction communication between the CPU and the FPGA is achieved with a shared-memory scheme. The FPGA-based accelerator accesses the external memory through the direct memory access (DMA) module. The host CPU accesses the accelerator's status registers and sends control signals through the general-purpose port by memory mapping.

At runtime, the accelerator sequentially reads all the instructions and executes them automatically. The host CPU does no scheduling work and only waits for the accelerator to finish. For software developers, calling the CNN accelerator is like launching a new thread. In real applications, the CNN is usually only part of the algorithm; the CPU schedules the overall flow and handles the non-CNN parts.

The basic unit for CNN computation is the processing element (PE). Figure 2c shows the PE's architecture. As Equation 1 shows for each output channel Mj, the convolution layer sums 2D convolution results. We therefore implement multiple 2D convolvers in a single PE and add their results with an adder tree. We use a line-buffer design for the convolvers so that the 2D convolution can be processed in a pipelined manner, achieving a throughput of one pixel per cycle. Because the hardware resources are limited, we might not be able to sum all the convolution results with the adder tree, so the output buffer feeds intermediate results back to the PE for accumulation. Nonlinear and pooling operations are integrated in the pipeline and can be bypassed if needed.

To fully utilize the data locality of CNN computation, the feature maps in the input buffer are shared by all the PEs. The same feature maps are combined with different convolution kernels and biases to calculate different output feature maps. The address space of the input and output buffers is exposed in the instruction interface, and for a given convolution layer, the compiler manages the on-chip cache to minimize external memory access.

Besides the hardware architecture design, we also perform software-level optimization. We reduce the bit width of the data in the CNN model so that the limited logic and memory resources on the Zynq 7020 effectively go further. To fully utilize the limited bit width, we use a fixed-point format but allow the radix point to vary among layers. This strategy adapts the representation to each layer's dynamic range and prevents overflow. Figure 3 shows our experimental results on state-of-the-art networks. Because 8-bit quantization brings negligible accuracy loss for all these networks, we adopt an 8-bit hardware design.
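The sketch below illustrates this layer-wise fixed-point scheme: each layer's data gets its own radix-point position chosen so that the layer's largest magnitude still fits in the 8-bit range. The simple range-based selection rule is an assumption for illustration; the actual tool can also search for the setting that best preserves accuracy.

```python
import numpy as np

def quantize_layer(x, bits=8):
    """Quantize one layer's data to fixed point with a per-layer radix point.

    The fractional length is chosen so the largest magnitude in the layer
    still fits in the signed `bits`-bit range (illustrative selection rule).
    Returns the integer codes and the fractional length.
    """
    max_abs = np.max(np.abs(x)) + 1e-12
    frac = int(np.floor(np.log2((2 ** (bits - 1) - 1) / max_abs)))
    scale = 2.0 ** frac
    q = np.clip(np.round(x * scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1).astype(np.int8)
    return q, frac

def dequantize(q, frac):
    return q.astype(np.float32) / (2.0 ** frac)

# Layers with different dynamic ranges get different radix points.
for name, data in [("conv1", np.random.randn(1000) * 0.05),
                   ("fc8",   np.random.randn(1000) * 4.0)]:
    q, frac = quantize_layer(data)
    err = np.max(np.abs(dequantize(q, frac) - data))
    print(f"{name}: fractional bits = {frac}, max error = {err:.4f}")
```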

Figure 2. Aristotle architecture for the convolutional neural network (CNN) accelerator. (a) System board (5.0 cm × 5.0 cm). (b) Overall system architecture: host CPU, external memory, DMA, and the computing complex of PEs with input/output buffers and a controller. (c) Processing element architecture: convolvers, adder tree, ReLU and pooling units, bias and data shifts, and buffers.

Figure 3. Quantization results for different CNN models (GoogLeNet, VGG-16, SqueezeNet, and VGG-CNN-F at fp-32, 16-bit, 8-bit, and 6-bit precision). (a) Top-1 classification accuracy (%). (b) Top-5 classification accuracy (%).


The implementation on the XC7Z020 includes two PEs, each comprising sixteen 3 × 3 convolvers working at 214 MHz. This design uses 272 Kbytes of on-chip block RAM, 198 digital signal processing (DSP) units, and about 27,000 lookup tables. The lookup table and DSP costs equal about 413,000 gates for an ASIC design. We use three applications to evaluate system performance: face alignment, object detection using the YOLO (You Only Look Once) algorithm [6], and image classification using the VGG (Visual Geometry Group) network [7]. The same applications are also realized on the Nvidia TK1 and TX1 platforms with the latest cuDNN library. Figure 4 shows the performance comparison, including estimates for our next-generation Aristotle on the XC7Z020 and ZU2CG. The proposed architecture on the FPGA offers performance similar to that of the mobile GPUs on these applications, but the FPGA consumes about 3 W versus 15 W for the GPU. Note that the peak performance is 326 Gflops for the TK1 and 1 Tflops for the TX1, whereas Aristotle's peak performance is only 123 GOPS. This shows that Aristotle is much more efficient than the TK1 and TX1.

Figure 4. Performance comparison between mobile GPUs and Aristotle on different FPGA platforms, measured in milliseconds per frame for VGG16, YOLO tiny, and face alignment.

Descartes: The Sparse RNN/LSTM Accelerator

RNNs and LSTMs are widely used in speech recognition, natural language processing, question answering, and machine translation [2], [5], [8]-[10]. An LSTM network accepts an input sequence x = (x1, ..., xT) and computes an output sequence y = (y1, ..., yT) by feeding the input vectors sequentially to the computation graph shown in Figure 5. In each time step t, a memory vector ct is calculated and used in time step t + 1. Two main types of computation are involved in this graph: matrix-vector multiplication and element-wise operations.

Figure 5. Computation graph of a single long short-term memory (LSTM) layer, with input, forget, and output gates (i, f, o), candidate g, and the cell memory vector. σ denotes the sigmoid function applied to each element of a vector, and each gate computes y = σ(Wx + b).
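A minimal NumPy sketch of one LSTM time step follows, showing the two kinds of computation named above: four matrix-vector products for the gates and the candidate, then element-wise updates of the memory vector. This is the textbook formulation of Figure 5; the deployed models may add peepholes or projection layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: four matrix-vector products plus element-wise ops.

    W: dict of weight matrices of shape (hidden, input + hidden) for gates
    'i', 'f', 'o' and candidate 'g'; b: dict of bias vectors.
    """
    z = np.concatenate([x_t, h_prev])           # gates see both input and state
    i = sigmoid(W['i'] @ z + b['i'])            # input gate
    f = sigmoid(W['f'] @ z + b['f'])            # forget gate
    o = sigmoid(W['o'] @ z + b['o'])            # output gate
    g = np.tanh(W['g'] @ z + b['g'])            # candidate cell update
    c_t = f * c_prev + i * g                    # element-wise memory update
    h_t = o * np.tanh(c_t)                      # output for this time step
    return h_t, c_t

hidden, n_in = 1024, 128                        # 1,024 memory cells; input size arbitrary here
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, n_in + hidden)) * 0.01 for k in 'ifog'}
b = {k: np.zeros(hidden) for k in 'ifog'}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```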

State-of-the-art RNN and LSTM models are both computationally and memory intensive, making them power hungry and increasing a datacenter's total cost of ownership. Accessing memory is more than two orders of magnitude more energy consuming than ALU operations, so it is critical to reduce memory references.

To address this problem, we first present an effective model compression algorithm for LSTM, which consists of pruning and quantization. The basic idea of the pruning strategy is to zero out the weights with the smallest absolute values. The loss of accuracy can be compensated by retraining the remaining network with back propagation. Note that we do the pruning in a load-balance-aware way, which balances the number of nonzero weights in different submatrices. So, when different submatrices are assigned to different processing elements in hardware, the workload is balanced, and the synchronization overhead among the processing elements is kept low.
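The sketch below illustrates load-balance-aware magnitude pruning: rather than keeping the globally largest weights, each submatrix destined for a different PE keeps the same number of nonzeros, so the PEs finish their sparse work at roughly the same time. Retraining to recover accuracy is omitted, and the row-wise split is a simplified stand-in for the real partitioning.

```python
import numpy as np

def load_balanced_prune(W, n_pes=32, keep_ratio=0.1):
    """Zero out small weights so every PE's submatrix has the same nonzero count.

    W is split row-wise into n_pes submatrices (one per processing element).
    Within each submatrix the weights with the smallest absolute values are
    pruned, so all PEs receive equal work during sparse matrix-vector multiply.
    """
    rows_per_pe = W.shape[0] // n_pes
    pruned = np.zeros_like(W)
    for p in range(n_pes):
        sub = W[p * rows_per_pe:(p + 1) * rows_per_pe]
        keep = max(1, int(keep_ratio * sub.size))
        thresh = np.sort(np.abs(sub).ravel())[-keep]      # magnitude cutoff for this PE
        mask = np.abs(sub) >= thresh
        pruned[p * rows_per_pe:(p + 1) * rows_per_pe] = sub * mask
    return pruned

W = np.random.randn(1024, 512)
Wp = load_balanced_prune(W)
per_pe = [np.count_nonzero(Wp[i * 32:(i + 1) * 32]) for i in range(32)]
print(min(per_pe), max(per_pe))   # nonzero counts are (nearly) equal across PEs
```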

Quantization is the second step of LSTM compression. We encode the weights with a 12-bit fixed-point format and the index with 4 bits; thus, a single weight takes 2 bytes. On a real-world speech dataset, such as the one used in our experiments, we pruned away 90 percent of the parameters and quantized the model to 12 bits with negligible loss of accuracy. The overall compression ratio is 2 × 10 = 20 times.
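As an illustration of where the 2x quantization factor in the 20x overall ratio comes from, the sketch below packs a 12-bit fixed-point weight and a 4-bit relative index into 2 bytes; the exact bit layout is an assumption made for this example.

```python
import struct

def pack_entry(weight_q, rel_index):
    """Pack a 12-bit signed fixed-point weight and a 4-bit relative index into 2 bytes.

    weight_q: integer in [-2048, 2047] (12-bit signed code)
    rel_index: number of zeros skipped since the previous nonzero, in [0, 15]
    The layout (index in the top nibble) is illustrative only.
    """
    assert -2048 <= weight_q <= 2047 and 0 <= rel_index <= 15
    word = ((rel_index & 0xF) << 12) | (weight_q & 0xFFF)
    return struct.pack('<H', word)

def unpack_entry(buf):
    (word,) = struct.unpack('<H', buf)
    rel_index = word >> 12
    weight_q = word & 0xFFF
    if weight_q >= 2048:          # sign-extend the 12-bit field
        weight_q -= 4096
    return weight_q, rel_index

assert unpack_entry(pack_entry(-37, 5)) == (-37, 5)
```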

We designed a scheduler that can effectively schedule the complex LSTM operations using sparse matrix-vector multiplication as the basic building block, with memory references fully overlapped with computation. Because our design targets server applications, we fix this schedule strategy in hardware to offer the best efficiency.

The irregular computation pattern after compression poses challenges for the hardware design. Therefore, we designed a hardware architecture called Descartes that can work directly on the compressed model [11]. It comprises the components shown in Figure 6. The input vector is buffered in a first-in, first-out (FIFO) queue, and the matrix is encoded in a compressed sparse column format, which contains the pointer, matrix value, and index value. Descartes first reads the pointer, then the index and weight. The decoded weights are fed to the ALU, multiplied with the input vector, and added to the activation buffer at the location specified by the index. This completes the matrix-vector multiplication. On the right side is the element-wise unit, which contains a pointwise multiplier, an adder tree, and sigmoid and tanh units. These are not on the critical path compared with the matrix-vector multiplication units. The sigmoid and tanh units are implemented with a lookup table.
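In software terms, the per-PE computation corresponds to the following compressed-sparse-column matrix-vector product: walk the column pointers, read each (value, index) pair, multiply by the matching input element, and accumulate into the activation buffer. The sketch mirrors that dataflow at a high level and is not a model of the RTL.

```python
import numpy as np

def csc_spmv(col_ptr, row_idx, values, x, n_rows):
    """y = A @ x for A stored in compressed sparse column (CSC) form.

    col_ptr[j]..col_ptr[j+1] delimit column j's nonzeros (the "pointer" read),
    row_idx/values hold their positions and weights (the "index" and "weight"
    reads), and y plays the role of the activation buffer being accumulated.
    """
    y = np.zeros(n_rows)
    for j in range(len(col_ptr) - 1):                 # one input element per column
        xj = x[j]
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj           # accumulate at the decoded location
    return y

# Build a small random sparse matrix, encode it in CSC, and check the result.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6)) * (rng.random((8, 6)) < 0.3)
col_ptr, row_idx, values = [0], [], []
for j in range(A.shape[1]):
    nz = np.nonzero(A[:, j])[0]
    row_idx.extend(nz)
    values.extend(A[nz, j])
    col_ptr.append(len(values))
x = rng.standard_normal(6)
assert np.allclose(csc_spmv(col_ptr, row_idx, values, x, 8), A @ x)
```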

Descartes achieves high efficiency through load balancing and by partitioning both the computation and the storage. Descartes also supports processing multiple speech inputs concurrently, shown as different channels in Figure 6.

Figure 6. Descartes architecture. (a) Overall system architecture: host CPU, external memory, PCIe and memory controllers, and the ESE accelerator with its channels of PEs. (b) Processing element architecture: pointer read, sparse-matrix read, multiply-accumulate units, activation buffer, adder tree, pointwise multiplier, and sigmoid/tanh units.

Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, the Descartes architecture has a processing power of 282 GOPS working directly on a compressed LSTM network, corresponding to 2.52 TOPS on an uncompressed network, where each operation is an addition or multiplication. The LSTM model is pruned to 10 percent nonzeros; taking padding zeros into account, it is 12.2 percent nonzeros. With 32 PEs, Descartes can process a speech recognition LSTM with 1,024 hidden memory cells in 82.7 microseconds. We evaluate the same LSTM on a Core i7-5930K CPU and a Pascal Titan X GPU. On the CPU and GPU, the four LSTM gates are merged into a single matrix, which improves CPU/GPU computation resource utilization. We used MKL CBLAS and MKL SPBLAS for the CPU and cuBLAS and cuSPARSE for the GPU. Figure 7a shows the speedup over the CPU/GPU dense and sparse implementations. Descartes even achieves about 2 times speedup over the latest Pascal Titan X GPU.

Figure 7. Performance comparison between Descartes and the CPU/GPU. (a) Matrix multiplication speed, normalized to the speed of dense matrix multiplication on the CPU; W1, W2, and W3 are the three main matrices in an LSTM layer, expressed as (rows × columns × sparsity). (b) Power efficiency in GOPS/W.

The Descartes architecture can run both the sparse model and the dense model. Figure 8 shows the speedup with sparsity. Working on the compressed model with 90 percent of the parameters pruned away, Descartes is 6.2 times faster than the baseline uncompressed dense model.

Figure 8. Speedup versus the percentage of parameters pruned away, with and without load balancing. Running the sparse LSTM model is 6.2 times faster than running the dense model (5.5 times without load balancing).

Descartes costs 294,000 lookup tables, 1,505 DSPs, and 4.18 Mbytes of SRAM on the FPGA. The lookup table and DSP costs equal about 5,080,000 gates for an ASIC design. The power consumption of Descartes is 41 W. The CPU consumes 111 W for the dense implementation and 38 W for the sparse implementation; the GPU consumes 202 W for dense matrix multiplication and 136 W for the sparse implementation. Figure 7b shows the energy-efficiency comparison: Descartes achieves 10 to 200 times better energy efficiency than the other platforms.

This article discusses efficient methods and hardware for deep learning. As our experiments show, model optimizations such as quantization and sparsification greatly benefit hardware design. To achieve the best energy efficiency, an ASIC is always the optimal choice, but taping out a design for specific networks is currently unwise given how quickly deep learning algorithms iterate. CPUs and GPUs can always support deep learning, but they are not that energy efficient. FPGAs strike the balance between efficiency and flexibility. The proposed Aristotle and Descartes architectures are just the beginning of the deep learning era. We will explore more efficient methods for software–hardware codesign for deep learning in the future.

Acknowledgments

This work represents the combined efforts of many talented full-time and intern engineers at DeePhi, led by Dongliang Xie, Hong Luo, and Lingzhi Sui.

References
1. K. He et al., "Deep Residual Learning for Image Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.



2. D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," Proc. 33rd Int'l Conf. Machine Learning, 2016, pp. 173–182.
3. J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
4. C. Szegedy et al., "Going Deeper with Convolutions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
5. I. Sutskever et al., "Sequence to Sequence Learning with Neural Networks," Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
6. J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 779–788.
7. K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
8. A. Hannun et al., "Deep Speech: Scaling up End-to-End Speech Recognition," arXiv preprint arXiv:1412.5567, 2014.
9. A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
10. S. Antol et al., "VQA: Visual Question Answering," Proc. IEEE Int'l Conf. Computer Vision, 2015, pp. 2425–2433.
11. S. Han et al., "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA," Proc. ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.

Kaiyuan Guo is a PhD candidate in the Department of Electronic Engineering at Tsinghua University and an intern at DeePhi. His research interests include hardware acceleration of deep learning and SLAM. Guo received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].

Song Han is a founder of DeePhi and a PhD candidate in the Department of Electrical Engineering at Stanford University. His research interests include deep learning model compression and hardware acceleration. Han received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].

Song Yao is the CEO and a founder of DeePhi. His research interests include 3D ICs, compiler design, and hardware acceleration of deep learning. Yao received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].

Yu Wang is a founder of DeePhi and an associate professor in the Department of Electronic Engineering at Tsinghua University. His research interests include brain-inspired computing system design with both CMOS and emerging devices. Wang received a PhD in electronic engineering from Tsinghua University. Contact him at [email protected].

Yuan Xie is a professor leading the Scalable and Energy-Efficient Architecture Lab (SEAL) at the University of California, Santa Barbara. His research interests include computer architecture, electronic design automation (EDA), VLSI design, and embedded systems design. Xie received a PhD in electrical engineering from Princeton University. He is an IEEE Fellow. Contact him at [email protected].

Huazhong Yang is a professor in the Department of Electronic Engineering at Tsinghua University. His research interests include mixed-signal circuit design, EDA, VLSI design, and IoT applications. Yang received a PhD in electronic engineering from Tsinghua University. Contact him at [email protected].
