SOFTWARE–HARDWARE CODESIGN FOR EFFICIENT NEURAL NETWORK ACCELERATION
DESIGNERS MAKING DEEP LEARNING COMPUTING MORE EFFICIENT CANNOT RELY SOLELY
ON HARDWARE. INCORPORATING SOFTWARE-OPTIMIZATION TECHNIQUES SUCH AS MODEL
COMPRESSION LEADS TO SIGNIFICANT POWER SAVINGS AND PERFORMANCE
IMPROVEMENT. THIS ARTICLE PROVIDES AN OVERVIEW OF DEEPHI’S TECHNOLOGY FLOW,
INCLUDING COMPRESSION, COMPILATION, AND HARDWARE ACCELERATION. TWO
ACCELERATORS ACHIEVE EXTREMELY HIGH ENERGY EFFICIENCY FOR BOTH CLIENT AND
DATACENTER APPLICATIONS WITH CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS.
“Deep learning” and “neural network” are the current AI keywords. Deep learning is showing dominant performance in applications such as image classification1 and speech recognition,2 which makes it the top candidate for real-world AI applications. However, today’s computational efficiency is still not enough, and the computational complexity of neural networks far exceeds that of traditional computer vision algorithms, so we cannot employ deep learning in many cases. To address this problem, researchers around the world have been working on customized hardware acceleration solutions.3 There will be an unprecedented battle for deep learning hardware.
We believe that, to build an efficient system for deep learning, we must consider software–hardware codesign, because software and hardware are coupled in deep learning.
Considering optimization in both software and hardware, we propose a new design flow (see Figure 1). Three factors affect how efficiently deep learning algorithms can be computed: workload, peak performance, and efficiency.
A smaller workload with the same precision is always welcome. However, changing the workload can affect the hardware design. For example, replacing direct 2D convolution with a fast algorithm such as Winograd in a convolutional neural network (CNN) changes the ratio between multiplication and addition and also changes the data access pattern. Furthermore, exploiting the sparsity in neural networks changes even the data description format and the entire computing system, that is, from dense to sparse matrices.
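To make the changed multiplication-to-addition ratio concrete, the following minimal sketch compares a direct 3-tap filter with the 1D Winograd F(2, 3) form, which produces the same two outputs with four multiplications instead of six at the cost of extra additions. The function names are illustrative only, and the sketch does not represent the accelerators described in this article.

```python
import numpy as np

def direct_fir(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return np.array([d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
                     d[1] * g[0] + d[2] * g[1] + d[3] * g[2]])

def winograd_f23(d, g):
    """Same two outputs with 4 multiplications (m1..m4) and more additions."""
    # Filter transform: depends only on the kernel, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Data transform followed by the four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
print(direct_fir(d, g), winograd_f23(d, g))
```

The data transform also reads inputs in a different order than a sliding window does, which is one reason the data access pattern changes along with the arithmetic mix.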
A higher peak performance is always wanted. However, because peak performance is usually proportional to the computation
Kaiyuan Guo
Tsinghua University and DeePhi
Song Han
Stanford University and DeePhi
Song Yao
DeePhi
Yu Wang
Tsinghua University and DeePhi
Yuan Xie
University of California,
Santa Barbara
Huazhong Yang
Tsinghua University
Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
unit number and system frequency, a higher peak performance often results in higher cost and power. One way to increase the peak performance while lowering the cost is to simplify the operation, for example, by using fewer bits to represent data and weights in neural networks. The robustness of deep learning algorithms makes it possible to use 16-bit, 8-bit, and even fewer-bit fixed-point operations to replace 32-bit floating-point operations while introducing negligible accuracy loss. This tradeoff between peak performance and variable precision influences both algorithm and hardware design.
Efficiency reflects how well we use the computation units. An elegant memory system design that feeds the computing units with enough data is the key to high efficiency. To achieve this, we need to tackle both on-chip memory and external memory system design. For the on-chip memory part, it is necessary to exploit data locality and data reuse so that data stays in the cache as long as possible. For the external memory part, increasing the bandwidth helps increase the efficiency but also leads to higher cost and power. With the same theoretical bandwidth, we need to increase the burst length to fully utilize it, that is, organize data storage to match hardware requirements. The data simplification method also reduces the data bit-width and thus reduces the bandwidth requirement.
Taking all three factors into account helps us design a highly efficient deep learning system. Furthermore, because deep learning is evolving rapidly, taping out a certain design might not be a good choice for a commercial product. In this case, general-purpose processors or specialized hardware with enough flexibility for reprogramming are preferable. Field-programmable gate arrays (FPGAs), with their inherent reconfigurability, let us explore all three levels of the design and incorporate state-of-the-art deep learning techniques into a product within a short design time. Thus, FPGAs have the potential to become a mainstream deep learning processing platform.
From Model to Instructions
No standard state-of-the-art neural network model exists. For CNNs, early models first applied several convolution (Conv) layers sequentially to the input image to generate low-dimension features, and then several fully connected layers as the classifier. Current networks, such as ResNet1 and the inception module in GoogLeNet,4 use different branches and parallel layers in the network to achieve multiscale sampling and avoid vanishing gradients. The model size ranges from fewer than 10 layers to more than 100 layers for different tasks. For recurrent neural networks (RNNs), there are also many variants, such as long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional RNNs used in speech recognition,2 and sequence-to-sequence learning used in neural machine translation (NMT).5
A system must be flexible enough to execute different neural network models. To achieve this, a flexible description is necessary. Caffe, TensorFlow, and other deep learning frameworks provide efficient interfaces on CPU and GPU platforms. However, for specialized systems, we need a tool and an intermediate representation to bridge these frameworks and the hardware. We design the customized hardware around the patterns of neural network computation to achieve high efficiency while leaving the interface flexible. In this way, we can map different networks onto it. Meanwhile, algorithm researchers and hardware developers can work simultaneously, making the iteration of products fast and efficient.
We implement an instruction interface for our hardware. For CPUs or GPUs, the instructions are fine grained, usually with scalar- or vector-level operations. Fine-grained
Figure 1. Our proposed design flow. (Diagram labels: application, platform, model design, compression by pruning and quantization, hardware design, compile, instruction; the three factors workload, peak performance, and efficiency lead to performance.)
MARCH/APRIL 2017 3
instruction is highly flexible, but considering the specialty of neural networks, this might not be an efficient interface. For example, neural network computation is usually full of loops, so we partition the loops into small blocks such that each block can be done by hardware. For a CNN, each block can be a set of 2D convolutions, whereas for an RNN, each block can be a set of vector-matrix multiplications. The operations for each block can be represented by one instruction, which reduces the instruction size while maintaining the hardware efficiency. We also use instructions to describe data transfers between on-chip cache and off-chip memory, which lets the compiler do static scheduling to achieve a balance between computation and I/O.
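The loop partitioning described above can be sketched as a toy compiler pass that tiles one convolution layer into coarse-grained instructions. The opcodes `LOAD`, `CONV`, and `SAVE`, the `Instr` record, and the block sizes are hypothetical names chosen for illustration; the article does not specify the actual instruction set.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instr:
    op: str        # "LOAD", "CONV", or "SAVE" (hypothetical opcodes)
    in_ch: range   # block of input feature maps the instruction touches
    out_ch: range  # block of output feature maps the instruction touches

def compile_conv_layer(n_in: int, n_out: int,
                       in_blk: int, out_blk: int) -> List[Instr]:
    """Tile the (output channel, input channel) loop nest into blocks.

    Each block of 2D convolutions becomes one CONV instruction, and data
    movement between off-chip memory and on-chip buffers is explicit.
    """
    prog: List[Instr] = []
    for o in range(0, n_out, out_blk):
        outs = range(o, min(o + out_blk, n_out))
        for i in range(0, n_in, in_blk):
            ins = range(i, min(i + in_blk, n_in))
            prog.append(Instr("LOAD", ins, outs))  # fetch inputs and kernels
            prog.append(Instr("CONV", ins, outs))  # one block of 2D convolutions
        # Partial sums stay on chip until every input block has been consumed.
        prog.append(Instr("SAVE", range(0, n_in), outs))
    return prog

prog = compile_conv_layer(n_in=64, n_out=128, in_blk=16, out_blk=32)
print(len(prog), "coarse-grained instructions for the layer")
```

Because loads and saves are ordinary instructions in the stream, a static scheduler can reorder or overlap them with computation at compile time rather than relying on runtime control.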
Consider the design flow shown in Figure 1. First, the deep learning algorithm is designed for the target application. For this design flow, the main target is to design the neural network model. Then, the model is optimized to be ready for hardware acceleration. This step usually includes model compression and data quantization to reduce the workload and increase the peak performance of the hardware design. Both of these steps are done by automatic tools, but developers need to make the final choice, weighing the accuracy loss against the hardware performance gain. Next, the hardware is designed according to the optimization strategy used. These three steps are done iteratively to ensure that the target application’s requirement is met. After hardware design, we use a customized compiler to convert the neural network model to instructions to be executed at runtime. Further optimization on scheduling is automatically done in the compiler to increase the hardware efficiency.
Aristotle: The CNN Accelerator
CNNs are widely used for image and video processing. One of the most popular CNN applications is object detection. But a CNN’s high computation complexity makes it an impractical choice for mobile platforms such as smartphones or drones. To solve this problem, we designed the Aristotle architecture for energy-efficient CNN acceleration.
A CNN mainly comprises several convolution layers. Within each layer, there are n input feature maps N_i(x, y) and m output feature maps M_j(x, y). Each feature map is a 2D image. Equation 1 describes the computation within one convolution layer. The * denotes the 2D convolution operation, and W_ij is the convolution kernel that connects input map i to output map j. The bias for each output feature map is b_j. Function f is a nonlinear function applied to each pixel (for example, ReLU or sigmoid).
$$M_j = f\left(\sum_{i=1}^{n} W_{ij} * N_i + b_j\right) \qquad (1)$$
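Equation 1 can be checked against a direct reference implementation. The sketch below is a plain NumPy version under two assumptions that the article does not state: the 2D convolution is computed as a sliding-window product sum without flipping the kernel (the usual CNN convention), and only the "valid" output region is produced. The array layout and function names are illustrative.

```python
import numpy as np

def conv2d_valid(img, k):
    """Sliding-window product sum over the valid region (no kernel flip)."""
    kh, kw = k.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * k)
    return out

def conv_layer(N, W, b, f=lambda v: np.maximum(v, 0.0)):
    """Equation 1: M_j = f(sum_i W_ij * N_i + b_j).

    N: (n, height, width) input feature maps
    W: (n, m, kh, kw) kernels, W[i, j] connects input i to output j
    b: (m,) biases; f defaults to ReLU
    """
    n, m = W.shape[0], W.shape[1]
    maps = []
    for j in range(m):
        acc = sum(conv2d_valid(N[i], W[i, j]) for i in range(n))
        maps.append(f(acc + b[j]))
    return np.stack(maps)

N = np.ones((2, 4, 4))
W = np.ones((2, 3, 3, 3))
b = np.array([0.0, 1.0, -100.0])
print(conv_layer(N, W, b).shape)
```

The inner sum over i is the part that the adder tree in a processing element performs in hardware, while each call to `conv2d_valid` corresponds to one 2D convolver.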
A CNN also uses pooling layers for down-sampling on the feature maps, usually max pooling or average pooling. With pooling layers, the size of feature maps is reduced. This helps increase the receptive field of each neuron (pixel) in the feature map. Thus, larger features in the original image can be extracted.
Figure 2 shows the proposed architecture. We implement the architecture on a Xilinx XC7Z020 system on chip on a customized board. The board is 5 cm × 5 cm (see Figure 2a) with about 3 W runtime power consumption, and it can fit into small robots.
Figure 2b shows the system architecture. A common computation system includes a CPU and the external memory, which is the top white part. To accelerate the CNN, we implement the bottom gray part on the FPGA. Data and instruction communication between the CPU and the FPGA is achieved with a shared memory scheme. The FPGA-based accelerator accesses the external memory through the direct memory access module. The host CPU accesses the status registers of the FPGA accelerator and sends control signals through the general-purpose port by memory mapping.
At runtime, the accelerator sequentially reads all the instructions and executes them automatically. The host CPU does no scheduling work and waits only for the accelerator to finish. For software developers, calling the CNN accelerator is like establishing a new thread. In real applications, the CNN is usually a part of the algorithm. The CPU schedules the flow of the algorithm and handles the non-CNN parts.
The basic unit for CNN computation is the processing element (PE). Figure 2c shows
HOT CHIPS
4 IEEE MICRO
the PE’s architecture. As described by the equation for each output channel M_j, the convolution layer sums the 2D convolution results. So, we implement multiple 2D convolvers in a single PE and add the results together with an adder tree. We implement a line buffer design for the convolvers such that the 2D convolution can be processed in a pipelined manner, achieving a throughput of 1 pixel per cycle. Because the hardware resource is limited, we might not be able to sum all the convolution results with the adder tree in one pass. So, the output buffer feeds intermediate results back to the PE for accumulation. Nonlinear and pooling operations are integrated in the pipeline and can be bypassed if needed.
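The accumulation through the output buffer can be sketched as follows. This is a behavioral illustration only: it assumes the per-channel 2D convolution results are already available as an array, that the bias is loaded into the output buffer before the first pass, and that ReLU is applied after the last pass. None of these ordering details are stated in the article, and the function names are hypothetical.

```python
import numpy as np

def pe_pass(partial, conv_results):
    """One pass through the PE: the adder tree sums this group's convolver
    outputs and adds the intermediate result fed back from the output buffer."""
    return partial + np.sum(conv_results, axis=0)

def accumulate_output_map(conv_results, convolvers_per_pe, bias):
    """Build one output feature map from more input channels than the PE has
    convolvers, by iterating over groups of input channels."""
    out_buf = np.full(conv_results.shape[1:], bias, dtype=float)
    for start in range(0, conv_results.shape[0], convolvers_per_pe):
        group = conv_results[start:start + convolvers_per_pe]
        out_buf = pe_pass(out_buf, group)
    return np.maximum(out_buf, 0.0)  # ReLU stage; bypassable in the pipeline

rng = np.random.default_rng(0)
per_channel = rng.standard_normal((40, 5, 5))  # 40 input channels, 16 convolvers
result = accumulate_output_map(per_channel, convolvers_per_pe=16, bias=0.5)
print(result.shape)
```

With 40 input channels and 16 convolvers, the map is completed in three passes, and the result matches a single full-width summation.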
To fully utilize the data locality of CNN computation, the feature maps in the input buffer are shared by all the PEs. The same feature maps use different convolution kernels and biases to calculate different output feature maps. The address space of the input and output buffer is available in the instruction interface. For a certain convolution layer, the compiler specially manages the on-chip cache to minimize the external memory access.
Besides the hardware architecture design, we also do software-level optimization. We reduce the bit-width of data in the CNN model, so that the limited logic and memory resources on the Zynq 7020 go further. To fully utilize the limited bit-width, we use a fixed-point format but allow the radix point of data to vary among different layers. This strategy adjusts the data to different dynamic ranges in different layers and prevents overflow. Figure 3 shows our experimental results on state-of-the-art networks. We see that 8-bit quantization brings negligible accuracy loss for all these networks, so we adopt an 8-bit hardware design.
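The per-layer radix point can be illustrated with a small sketch. It assumes a signed format with one sign bit and picks, for each tensor, the largest number of fractional bits that still covers the tensor's maximum magnitude. The selection rule and the function names are assumptions made for illustration; the article does not describe how the radix point is actually chosen.

```python
import numpy as np

def choose_frac_bits(x, bits=8):
    """Most fractional bits whose range still covers max|x| (assumed rule)."""
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return bits - 1
    int_bits = int(np.floor(np.log2(max_abs))) + 1  # bits left of the radix point
    return bits - 1 - int_bits                      # one bit is kept for the sign

def quantize(x, bits=8, frac_bits=None):
    """Round to signed fixed point and return the dequantized values."""
    if frac_bits is None:
        frac_bits = choose_frac_bits(x, bits)
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q / scale, frac_bits

small_layer = np.array([0.5, -0.9, 0.1])   # narrow dynamic range
large_layer = np.array([100.0, -3.2])      # wide dynamic range
for name, layer in [("small", small_layer), ("large", large_layer)]:
    values, frac = quantize(layer)
    print(name, "fractional bits:", frac, "values:", values)
```

The two example tensors end up with different radix positions, so the same 8 bits give fine resolution where values are small and enough range where they are large.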
Figure 2. Aristotle architecture for the convolutional neural network (CNN) accelerator. (a) System board (5.0 cm × 5.0 cm). (b) Overall system architecture: host CPU and external memory, and a computing complex with DMA, controller, input buffer, output buffer, and multiple PEs. (c) Processing element architecture: convolvers fed with data, weights, and bias (with data shift and bias shift), an adder tree, ReLU and pooling stages, and an intermediate-data path between the input and output buffers.
Figure 3. Quantization results for different CNN models (GoogLeNet, VGG-16, SqueezeNet, and VGG-CNN-F) at fp-32, 16-bit, 8-bit, and 6-bit precision. (a) Top-1 classification accuracy (%). (b) Top-5 classification accuracy (%).
The implementation on the XC7Z020 includes two PEs, each comprising 16 3 × 3 convolvers working at 214 MHz. This design uses 272 Kbytes of on-chip block RAM, 198 digital signal processing (DSP) units, and about 27,000 lookup tables. The lookup table and DSP costs equal about 413,000 gates for an ASIC design. We use three applications to evaluate the system performance: face alignment, object detection using the YOLO (You Only Look Once) algorithm,6 and image classification using the VGG (Visual Geometry Group) network.7
The same applications are also realized on the Nvidia TK1 and TX1 platforms with the latest cuDNN library. Figure 4 shows the performance comparison, including an estimate for our next-generation Aristotle on the XC7Z020 and ZU2CG. The proposed architecture on the FPGA can offer similar performance to the mobile GPU on these applications. However, the power consumption of the FPGA is about 3 W, versus 15 W for the GPU. Note that the peak performance is 326 Gflops for TK1 and 1 Tflops for TX1, whereas Aristotle’s peak performance is only 123 GOPS. Delivering similar performance from a much lower peak shows that Aristotle uses its computation units much more efficiently than TK1 and TX1.
Descartes: The Sparse RNN/LSTM Accelerator
RNNs and LSTM are widely used in speech recognition, natural language processing, question answering, and machine translation.2,5,8–10 An LSTM network accepts an input sequence x = (x_1, …, x_T) and computes an output sequence y = (y_1, …, y_T) by feeding the input vectors sequentially to the computation graph shown in Figure 5. In each time step t, a memory vector c_t is calculated and used in time step t + 1. Two main types of computation are involved in this graph: matrix-vector multiplication and element-wise operation.
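The two kinds of computation can be seen in a single time step of a textbook LSTM cell. The sketch below assumes the standard formulation without peephole connections or a projection layer, concatenates the input with the previous output, and uses hypothetical names (`lstm_step`, the gate keys `i`, `f`, `o`, `g`); the model used in the experiments may differ in these details.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (assumed standard form).

    W: dict of gate matrices, each of shape (hidden, input + hidden)
    b: dict of gate biases, each of shape (hidden,)
    """
    z = np.concatenate([x_t, h_prev])
    # Matrix-vector multiplications: one per gate, y = sigma(Wx + b).
    i = sigmoid(W["i"] @ z + b["i"])   # input gate
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate
    o = sigmoid(W["o"] @ z + b["o"])   # output gate
    g = np.tanh(W["g"] @ z + b["g"])   # candidate cell input
    # Element-wise operations on vectors of length `hidden`.
    c_t = f * c_prev + i * g           # memory vector passed to step t + 1
    h_t = o * np.tanh(c_t)             # output of this time step
    return h_t, c_t

hidden, n_input = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, n_input + hidden)) for k in "ifog"}
b = {k: np.zeros(hidden) for k in "ifog"}
h, c = lstm_step(rng.standard_normal(n_input), np.zeros(hidden), np.zeros(hidden), W, b)
print(h.shape, c.shape)
```

The four gate products dominate the arithmetic and the weight traffic, which is why the compression and the hardware both focus on the matrix-vector multiplication.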
State-of-the-art RNN and LSTM models are both computationally and memory intensive, making them power hungry and increasing a datacenter’s total cost of ownership. Accessing memory consumes more than two orders of magnitude more energy than ALU operations, so it’s critical to reduce memory references.
To address this problem, we first present an effective model compression algorithm for LSTM, which consists of pruning and quantization. The basic idea of the pruning strategy is to zero out the weights with the smallest absolute values. The loss of accuracy can be compensated by retraining the remaining network with back propagation. Note that we do the pruning in a load-balance-aware way, which balances the number of nonzero weights in different submatrices. So, when different submatrices are assigned to different processing elements in hardware, the workload is
Figure 4. Performance comparison between mobile GPU and Aristotle on different FPGA platforms, measured by milliseconds per frame. Bar labels as extracted, listed in legend order (TK1, TX1 (FP32), TX1 (FP16), 7Z020, 7Z020 (v2), ZU2CG (v2)):
VGG16: 347, 153, 96.5, 364, 176, 117
YOLO tiny: 150, 59.4, 42.4, 88, 53, 35
Face alignment: 14.3, 9.04, 2.18, 2.54, 1.66, 1.11
Figure 5. Computation graph of a single long short-term memory (LSTM) layer, from input to output through the gates i, f, o, and g and the cell (memory vector). Legend: element-wise multiplication; general function y = h(x); matrix-vector product y = Wx; gate y = σ(Wx + b). σ denotes the sigmoid function on each element of a vector.
balanced. Thus, the synchronization overhead among different processing elements is kept low.
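A minimal sketch of load-balance-aware magnitude pruning follows. It assumes that the submatrix of a PE consists of interleaved rows (row r goes to PE r mod n_pe) and that every submatrix keeps the same fraction of its largest-magnitude weights. The partitioning scheme and the function name are assumptions for illustration, and the retraining step is omitted.

```python
import numpy as np

def prune_load_balanced(W, keep_ratio, n_pe):
    """Zero the smallest-magnitude weights separately in each PE's submatrix,
    so that every PE ends up with the same number of nonzeros."""
    W = W.copy()
    for pe in range(n_pe):
        sub = W[pe::n_pe]                 # view of the rows assigned to this PE
        k = int(round(sub.size * keep_ratio))
        if k == 0:
            sub[:] = 0.0
            continue
        thresh = np.sort(np.abs(sub), axis=None)[-k]  # k-th largest magnitude
        sub[np.abs(sub) < thresh] = 0.0   # writes through the view into W
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
P = prune_load_balanced(W, keep_ratio=0.1, n_pe=4)
print([int(np.count_nonzero(P[pe::4])) for pe in range(4)])
```

Pruning the whole matrix against one global threshold would reach the same overall sparsity but could leave some PEs with many more nonzeros than others, and the slowest PE would then set the pace for all.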
Quantization is the second step for LSTM compression. We encode the weights with a 12-bit fixed-point format and the index with 4 bits; thus, a single weight takes 2 bytes. On a real-world speech dataset, such as the one used in our experiments, we pruned away 90 percent of the parameters and quantized the model to 12 bits, resulting in negligible loss of accuracy. The overall compression ratio is 2 × 10 = 20 times.
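The arithmetic behind the 20-times figure can be written out as a short calculation. The 32-bit baseline is an assumption (the article does not state the uncompressed weight format here), and the estimate ignores the column pointers and any padding zeros of the sparse format.

```python
dense_bits_per_weight = 32        # assumed 32-bit floating-point baseline
sparse_bits_per_weight = 12 + 4   # 12-bit value plus 4-bit index, from the text
keep_ratio = 0.10                 # 90 percent of the parameters pruned away

quant_gain = dense_bits_per_weight / sparse_bits_per_weight  # storage per weight
prune_gain = 1.0 / keep_ratio                                # fewer weights stored
compression_ratio = quant_gain * prune_gain
print(f"quantization x{quant_gain:g}, pruning x{prune_gain:g}, "
      f"overall x{compression_ratio:g}")
```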
We designed a scheduler that can effectively schedule the complex LSTM operations using sparse matrix-vector multiplication as the basic building block, with memory references fully overlapped with computation. Because our design targets server applications, we fix this scheduling strategy in hardware to offer the best efficiency.
The irregular computation pattern after compression poses challenges for the hardware design. Therefore, we design a hardware architecture called Descartes that can work directly on the compressed model.11 It is composed of the components shown in Figure 6. The input vector is buffered in a first-in, first-out (FIFO) buffer, and the matrix is encoded in a compressed sparse column format, which contains the pointer, matrix value, and index value. Descartes first reads out the pointer, then the index and weight. The decoded weights are fed to the ALU to be multiplied with the input vector and added to the activation buffer; the location is specified by the index. This completes the matrix-vector multiplication. On the right side is the element-wise unit, which contains a point-wise multiplier, adder tree, and Sigmoid and Tanh units. These are not on the critical path compared with the matrix-vector multiplication units. Sigmoid and Tanh units are implemented with a lookup table.
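The read order described above (pointer, then index and weight, then multiply and accumulate into the activation buffer) corresponds to a column-wise sparse matrix-vector product. The sketch below uses a plain compressed sparse column layout with absolute row indices and floating-point values; the 4-bit index and 12-bit weight encoding of the real design are not modeled, and the function names are illustrative.

```python
import numpy as np

def dense_to_csc(W):
    """Encode W as (pointer, index, value) arrays, column by column."""
    ptr, idx, val = [0], [], []
    for col in range(W.shape[1]):
        rows = np.nonzero(W[:, col])[0]
        idx.extend(rows.tolist())
        val.extend(W[rows, col].tolist())
        ptr.append(len(idx))            # where the next column starts
    return np.array(ptr), np.array(idx, dtype=int), np.array(val, dtype=float)

def spmv_csc(ptr, idx, val, x, n_rows):
    """y = W x using only the stored nonzeros."""
    act = np.zeros(n_rows)              # plays the role of the activation buffer
    for col, x_j in enumerate(x):       # input elements arrive one at a time
        for k in range(ptr[col], ptr[col + 1]):   # read pointer, then entries
            act[idx[k]] += val[k] * x_j           # index selects the location
    return act

W = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
ptr, idx, val = dense_to_csc(W)
print(spmv_csc(ptr, idx, val, x, n_rows=W.shape[0]))
```

Each input element is broadcast against one column, so the amount of work per input depends on how many nonzeros that column holds, which is where balanced pruning pays off.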
Descartes achieves high efficiency by load balancing and partitioning both the computation and storage. Descartes also supports processing multiple streams of speech data concurrently, shown as different channels in Figure 6.
Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, the Descartes architecture has a processing power of 282 GOPS, working directly on a compressed LSTM network, corresponding to 2.52 TOPS on an uncompressed network in which each operation is an add or multiplication. The LSTM model is pruned to 10 percent nonzeros; taking padding zeros into account, it’s 12.2 percent nonzeros. With 32 PEs, Descartes can process a speech recognition LSTM with 1,024 hidden memory cells in 82.7 microseconds. We evaluate the same LSTM on the Core i7-5930K CPU and Pascal Titan X GPU. On the CPU and GPU, the four LSTM gates are merged into a single matrix, which improves the CPU/GPU computation resource utilization. We used MKL CBLAS and MKL SPBLAS for the CPU and cuBLAS and cuSPARSE for the GPU. Figure 7a shows
Figure 6. Descartes architecture. (a) Overall system architecture: software program, CPU, and external memory connected through the memory controller, PCIE controller, and data bus to the FPGA, which holds the input buffer, output buffer, ESE controller, and the ESE accelerator with channels 0 to N, each containing multiple PEs. (b) Processing element architecture: FIFO, PtrRead, SpMatRead, Mul/Add, and ActBuf units, followed by a point-wise multiplier, adder tree, and sigmoid/tanh units, with PEs 0 to n grouped into channels 0 to m.
the speedup over CPU/GPU dense/sparse implementations. Descartes even achieves about 2 times speedup over the latest Pascal Titan X GPU.
The Descartes architecture can run both the sparse model and the dense model. Figure 8 shows the speedup with sparsity. Working on the compressed model with 90 percent of the parameters pruned away, Descartes is 6.2 times faster than the baseline uncompressed dense model.
Descartes costs 294,000 lookup tables, 1,505 DSPs, and 4.18 Mbytes of SRAM on an FPGA. The lookup table and DSP costs equal about 5,080,000 gates for an ASIC design. The power consumption of Descartes is 41 W. The power consumption of the CPU is 111 W for the dense implementation and 38 W for the sparse implementation. The GPU consumes 202 W for dense matrix multiplication and 136 W for the sparse implementation. Figure 7b shows the energy-efficiency comparison. Descartes achieves 10 to 200 times better energy efficiency than the other platforms.
This article discusses efficient methods and hardware for deep learning. As shown by our experiments, model optimization such as quantization and sparsification greatly benefits hardware design. To achieve the best energy efficiency, an ASIC is always the optimal choice. But for now, it’s a bad idea to tape out a design for particular networks, considering the fast iteration of deep learning algorithms. The CPU and GPU can always support deep learning, but they are not that energy efficient. The FPGA strikes the balance between the two. The proposed Aristotle and Descartes architectures are only a beginning in the deep learning era. We will explore more efficient methods for software–hardware codesign for deep learning in the future.
Acknowledgments
This work represents the combined efforts of many talented full-time and intern engineers led by Dongliang Xie, Hong Luo, and Lingzhi Sui at DeePhi.
References
1. K. He et al., “Deep Residual Learning for Image Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
Figure 7. Performance comparison between Descartes and CPU/GPU. (a) Matrix multiplication speed, normalized with the speed of dense matrix multiplication on the CPU. W1 (4,096 × 153 × 11.7%), W2 (4,096 × 512 × 11.4%), and W3 (512 × 1,024 × 10.0%) are the three main matrices in an LSTM layer, expressed as (row × col × sparsity). (b) Power efficiency, plotted as energy efficiency (GOPS/W) for CPU (dense), CPU (sparse), GPU (dense), GPU (sparse), and Descartes; bar labels as extracted, in that order: 0.31, 1.5, 4.3, 5.3, 61.3.
Figure 8. Running the sparse LSTM model is 6.2 times faster than running the dense model. The plot shows speedup versus parameters pruned away (%), with and without load balance, with annotated speedups over dense of 5.5× and 6.2×.
2. D. Amodei et al., “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,” Proc. 33rd Int’l Conf. Machine Learning, 2016, pp. 173–182.
3. J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” Proc. ACM/SIGDA Int’l Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
4. C. Szegedy et al., “Going Deeper with Convolutions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
5. I. Sutskever et al., “Sequence to Sequence Learning with Neural Networks,” Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
6. J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 779–788.
7. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
8. A. Hannun et al., “Deep Speech: Scaling up End-to-End Speech Recognition,” arXiv preprint arXiv:1412.5567, 2014.
9. A. Karpathy and L. Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
10. S. Antol et al., “VQA: Visual Question Answering,” Proc. IEEE Int’l Conf. Computer Vision, 2015, pp. 2425–2433.
11. S. Han et al., “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA,” Proc. ACM/SIGDA Int’l Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
Kaiyuan Guo is a PhD candidate in the Department of Electronic Engineering at Tsinghua University and an intern at DeePhi. His research interests include hardware acceleration of deep learning and SLAM. Guo received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].
Song Han is a founder of DeePhi and a PhD candidate in the Department of Electrical Engineering at Stanford University. His research interests include deep learning model compression and hardware acceleration. Han received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].
Song Yao is the CEO and a founder of DeePhi. His research interests include 3D IC, compiler designs, and hardware acceleration of deep learning. Yao received a BS in electronic engineering from Tsinghua University. Contact him at [email protected].
Yu Wang is a founder of DeePhi and an associate professor in the Department of Electronic Engineering at Tsinghua University. His research interests include brain-inspired computing system design with both CMOS and emerging devices. Wang received a PhD in electronic engineering from Tsinghua University. Contact him at [email protected].
Yuan Xie is a professor leading the Scalable and Energy-Efficient Architecture Lab (SEAL) at the University of California, Santa Barbara. His research interests include computer architecture, electronics design automation (EDA), VLSI design, and embedded systems design. Xie received a PhD in electrical engineering from Princeton University. He is an IEEE Fellow. Contact him at [email protected].
Huazhong Yang is a professor in the Department of Electronic Engineering at Tsinghua University. His research interests include mixed-signal circuit design, EDA, VLSI design, and IoT applications. Yang received a PhD in electronic engineering from Tsinghua University. Contact him at [email protected].