
Articles | https://doi.org/10.1038/s42256-018-0001-4

Long short-term memory networks in memristor crossbar arrays

Can Li1,5, Zhongrui Wang1, Mingyi Rao1, Daniel Belkin1, Wenhao Song1, Hao Jiang1, Peng Yan1, Yunning Li1, Peng Lin1, Miao Hu2, Ning Ge3, John Paul Strachan2, Mark Barnell4, Qing Wu4, R. Stanley Williams2, J. Joshua Yang1* and Qiangfei Xia1*

1Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA. 2Hewlett Packard Labs, Palo Alto, CA, USA. 3HP Labs, HP Inc., Palo Alto, CA, USA. 4Air Force Research Laboratory, Information Directorate, Rome, NY, USA. 5Present address: Hewlett Packard Labs, Palo Alto, CA, USA. *e-mail: [email protected]; [email protected]

SUPPLEMENTARY INFORMATION

In the format provided by the authors and unedited.

Nature Machine Intelligence | www.nature.com/natmachintell


Supplementary Figures

Supplementary Figure 1. The two-pulse memristor conductance update scheme. a, The two-pulse scheme for decreasing a memristor conductance, which includes a complete RESET cycle (to a lower conductance state) followed by a SET cycle (to a higher conductance state). The change in the gate voltage determines the change in conductance. b, The scheme for increasing a memristor conductance, in which the RESET cycle in (a) is skipped. c, 20 cycles of potentiation/depression (overlaid), each cycle consisting of 200 updates, for two memristors (blue and red, respectively). Although the intercepts are mismatched (we intentionally picked two memristors with a more noticeable mismatch), the slopes (ΔG/ΔVgate) of the conductance changes match closely. d, Weight update error expressed as the normalized standard deviation during the two-pulse conductance update. The analysis uses data from all the devices in a 128×64 array, two of which are shown in (c).

[Figure panels. a, b: clk, TE, BE and gate waveforms for the RESET cycle (Vreset) and the SET cycle (Vset, Vg,pre+ΔVg, VDD). c: conductance (μS, 0–1400) versus pulse number (0–200) for gate voltages swept 0.6 → 1.1 → 1.6 → 1.1 → 0.6 V. d: s.d. of G update error / G range (0–5%) versus target conductance (0–1000 μS).]
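
As an illustration of this update rule (not the authors' code), here is a minimal Python sketch, assuming the SET-cycle conductance is linear in the gate voltage with a device-independent slope, as panel (c) suggests; the function and argument names are ours:

```python
def two_pulse_update(vg_pre, delta_g, slope):
    """Sketch of the two-pulse update in Supplementary Figure 1.

    Assumes the conductance reached by a SET pulse is linear in the
    gate voltage with a shared slope (panel c), so a target change
    delta_g translates to a gate-voltage change delta_g / slope.
    """
    needs_reset = delta_g < 0      # decrease: full RESET cycle first (panel a)
    delta_vg = delta_g / slope     # slope = dG/dVgate, in S/V
    vg_new = vg_pre + delta_vg     # gate voltage for the following SET cycle
    return vg_new, needs_reset     # an increase skips the RESET cycle (panel b)
```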


Supplementary Figure 2: The software architecture developed for the experiment. The backend virtual class implements the matrix multiplication operations in both the forward and backward passes, as well as the weight update operations. The Software backend performs the matrix multiplication and weight update with MATLAB built-in functions, with the weights represented by 32-bit floating-point numbers. Simu Array performs the operations in a crossbar array simulator, in which non-ideal factors such as random conductance update errors, conductance upper and lower bounds, and array wire resistances can be taken into consideration. Finally, Real Array performs the matrix multiplication and weight update (weights represented by memristor conductances) experimentally in the memristor crossbar array. The code is deposited at http://github.com/lican81/memNN.

[Diagram: a Model holds Layers (Dense, Recurrent, LSTM), an Optimizer (SGD, RMSprop), a Loss (MSE, Cross Entropy) and a Backend. The Backend virtual class stores the weight/conductance matrix and exposes multiply(vector), multiply_reverse(vector) and update(delta_weight_matrix); its implementations are Software, Simu Array and Real Array (the latter via firmware on an MCU).]
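
For illustration, a minimal Python sketch of this backend interface (the experiment's software used MATLAB, per the caption; this rendering and its class names are our own, not the repository's API):

```python
from abc import ABC, abstractmethod
import numpy as np

class Backend(ABC):
    """Virtual class from Supplementary Figure 2: forward/backward
    matrix multiplication plus weight update, hiding whether the
    weights live in software, a simulator or a real crossbar."""

    @abstractmethod
    def multiply(self, vector):             # forward pass: y = W @ x
        ...

    @abstractmethod
    def multiply_reverse(self, vector):     # backward pass: dx = W.T @ dy
        ...

    @abstractmethod
    def update(self, delta_weight_matrix):  # apply weight increments
        ...

class SoftwareBackend(Backend):
    """FP32 reference backend (the 'Software' box in the diagram)."""
    def __init__(self, shape):
        self.W = np.zeros(shape, dtype=np.float32)

    def multiply(self, vector):
        return self.W @ vector

    def multiply_reverse(self, vector):
        return self.W.T @ vector

    def update(self, delta_weight_matrix):
        self.W += delta_weight_matrix.astype(np.float32)
```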


Supplementary Figure 3: Additional data for the regression experiment. a, Conductance map of the 34×60 memristor array in the LSTM layer after the in-situ training. b, Measured conductances of the 32 memristors in the fully connected layer after training. c, Map of the synaptic weights calculated from the conductances shown in (a). d, Synaptic weights calculated from the conductances shown in (b).

[Panels a, b: conductance (µS, roughly 0–800) maps/values; panels c, d: weight value (µS, roughly −400 to +400) maps/values, with row/column indices as described in the caption.]
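
As a sketch of how the weights in (c) and (d) can be derived from the conductances, assuming each signed weight is the difference of a (G+, G−) device pair on adjacent rows (the exact pairing convention is our assumption, not specified in this figure):

```python
import numpy as np

def conductances_to_weights(g):
    """Convert a conductance map (e.g. 34x60, panel a) to signed synaptic
    weights (e.g. 17x60, panel c), assuming adjacent rows form (G+, G-)
    differential pairs so that W = G+ - G- can represent negative weights."""
    return g[0::2, :] - g[1::2, :]
```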


Supplementary Figure 4: Pre-processing of the gait identification dataset. a, One frame from the raw video. b, The silhouette extracted from the video (ref. 39 in the main text), which was further converted to a width profile vector. Each dimension of the width profile vector represents the width of the silhouette at the corresponding height. c, The width profile vectors for each frame in the video. d, The total width of the width profile in each frame shows a periodic trend; after low-pass filtering of its spectrum and an inverse Fourier transformation, this signal is used to detect the gait cycles. e, One video is divided into multiple samples according to the gait cycles.

[Panels. b: silhouette with height axis 16–128 and width (px, 0–60); c: width profile versus frame # (up to 160), width (px) colour scale 0–60; d: width summation (px, roughly 2000–5000) versus frame # (0–180); e: segmentation into Samples #1–#9.]
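
A minimal sketch of the cycle-detection step in (d), assuming binary silhouette frames; the cutoff (`keep_bins`) and the boundary rule are illustrative choices, not parameters from the paper:

```python
import numpy as np

def detect_gait_cycles(silhouettes, keep_bins=5):
    """silhouettes: (T, H, W) binary array of extracted silhouettes.

    Returns the low-pass-filtered total-width signal, whose sign changes
    are used here to delimit gait cycles (Supplementary Figure 4d)."""
    # Width profile: silhouette width at each height, per frame (panel c).
    width_profiles = silhouettes.sum(axis=2)          # shape (T, H)
    total_width = width_profiles.sum(axis=1)          # shape (T,)

    # Low-pass the spectrum, then invert the FFT to keep the periodic trend.
    spectrum = np.fft.rfft(total_width - total_width.mean())
    spectrum[keep_bins:] = 0.0                        # crude low-pass filter
    smooth = np.fft.irfft(spectrum, n=total_width.size)

    # Cycle boundaries: sign changes of the smoothed, zero-mean signal.
    boundaries = np.flatnonzero(np.diff(np.sign(smooth)) != 0)
    return smooth, boundaries
```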


Supplementary Figure 5: Additional data for the classification experiment. a, Conductance map of the 128×56 memristors in the LSTM layer after the in-situ training. b, Conductance map of the 28×8 fully connected layer after training. c, Map of the synaptic weights calculated from the conductances shown in (a). The LSTM synaptic weights comprise the weights (Wa, Wi, Wf and Wo) connected to the input and the recurrent weights (Ua, Ui, Uf and Uo) connected to the LSTM outputs from the previous time step. d, Map of the synaptic weights calculated from the conductances shown in (b).

[Panels a, b: conductance (µS) maps, colour scale 0–800 µS; panels c, d: weight value (µS) maps, colour scale −400 to +400 µS, with row/column indices as in the caption.]


Supplementary Figure 6: Output from the human identification inference. a, Raw electrical current outputs after the in-situ training. Different curves represent the current outputs from different columns (col1 to col8). The column with the maximum current output is identified as the inference result of the memristor RNN. b, The Bayesian probability computed from the data in (a) by the softmax function.

[Panels. a: current outputs (roughly −400 to +400 µA) of columns col1–col8 for Persons #1–#8; b: Bayesian probability (0–1) for the same columns and persons.]
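
The conversion from (a) to (b) is the standard softmax; a minimal sketch follows, where the current scaling factor `beta` is our assumption, since the normalization of the raw currents before the softmax is not specified:

```python
import numpy as np

def softmax_probability(currents, beta=1.0):
    """Map the eight column currents to class probabilities
    (Supplementary Figure 6b). `beta` rescales the raw currents."""
    z = beta * np.asarray(currents, dtype=float)
    z -= z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```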


Supplementary Figure 7: Comparisons among training on the experimental crossbar array, software training with 32-bit floating point, and training on simulated crossbar arrays with various weight update errors. a, Comparison of the mean square error (MSE) between the prediction made by the LSTM network and the ground truth, used as the loss for the airline passenger number regression experiment. The dashed horizontal line represents the result we acquired from the experiment after 800 epochs of training. The MSE loss becomes less predictable and, in general, larger with increasing conductance update error. b, Accuracy comparison between software training and the crossbar array simulations with various random weight update errors. Larger memristor conductance update errors yield worse prediction accuracy, but the differences for update errors smaller than 0.6% are not statistically significant. Both boxplots show the statistics of 50 runs of software/simulated training with the parameters indicated on the x-axis. The red pluses indicate outliers beyond ±2.7 s.d.

[Panels. a: MSE loss (0.2–2, logarithmic scale) versus s.d. of conductance update error / conductance range (%), for FP32 and 0.2–3%, with "Experiment" and "Experiment (Max)" reference lines; b: accuracy (0–100%) versus s.d. of conductance update error (%), for FP32 and 0.2–3%.]


Supplementary Table

Supplementary Table 1: Summary of on-crossbar and off-crossbar operations and memories

| | Inference only: M=14, N=50 (this work) | Inference only: M=512, N=256 (larger network) | Training: M=14, N=50 (this work) | Training: M=512, N=256 (larger network) |
|---|---|---|---|---|
| On-crossbar operations | 7,168 AOP | 3,146 kAOP | 7,168 AOP | 3,146 kAOP |
| Off-crossbar operations | 70 AOP + 70 tanh/sigmoid | 2.56 kAOP + 2.56k tanh/sigmoid | 294 AOP (bp) + 3,584 AOP (outer product) | 11 kAOP (bp) + 1,573 kAOP (outer product) |
| On-crossbar memory | 7,168 Byte | 3 MByte | 7,168 Byte | 3 MByte |
| Off-crossbar memory | 14 Byte | 256 Byte | 7,168 Byte (gate voltage) + 3,000 Byte (I/O history) | 3 MByte (gate voltage) + 68.5 kByte (I/O history) |

Note: AOP stands for analog operation; bp, backpropagation.


Supplementary Notes

Supplementary Note 1: Summary of on-crossbar and off-crossbar operations and memories

We have summarized the specific numbers of operations and memories for networks of different sizes (Supplementary Table 1). We assume that all the parameters, including the input/output (I/O) histories and the gate voltage matrices, are stored at 8 bits (1 byte). For an LSTM layer with input dimension N and M hidden units, the size of the input vector is (M + N + 1) and the size of the output vector is 4M. Because two memristors are used to represent one weight, so as to allow for negative values, the size of the memristor crossbar array is 2(M + N + 1) × (4M). For each temporal step, the inference described in Eq. 1 (in the main text, as are all the equations referenced below) involves (M + N + 1) × (4M) multiply-accumulate operations (MACs) performed on-crossbar, and the inference described in Eq. 2 involves 3M sigmoid operations, 2M tanh operations, 3M multiplications and 2M additions, all performed off-crossbar. As the problem size grows, the on-crossbar operations dominate the computation, since their count scales as O(M²), while the off-crossbar operations scale as O(M). Most model parameters, O(M²) of them, are stored in the crossbar array, with O(M) sample-and-hold (S/H) circuits or registers storing the intermediate signals.
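
To make the on-/off-crossbar split concrete, here is a minimal numpy sketch of one LSTM time step partitioned as above; the gate ordering and the variable names (a, i, f, o) are our labeling, not code from the paper:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step. W has shape (M+N+1, 4M): the weights for the
    concatenated [x, h_prev, 1] input, covering all four gates.

    The single matrix product below is the on-crossbar part (Eq. 1);
    everything after it is the O(M) off-crossbar part (Eq. 2)."""
    M = h_prev.size
    u = np.concatenate([x, h_prev, [1.0]])   # input vector, length M + N + 1

    z = u @ W                                # on-crossbar: (M+N+1) x 4M MACs

    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    a = np.tanh(z[0:M])                      # candidate (one of the 2M tanh)
    i = sig(z[M:2*M])                        # input gate   (3M sigmoids total)
    f = sig(z[2*M:3*M])                      # forget gate
    o = sig(z[3*M:4*M])                      # output gate

    c = f * c_prev + i * a                   # 2M multiplications, M additions
    h = o * np.tanh(c)                       # M multiplications, M tanh
    return h, c
```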

The training process is more complicated. In the present work, only Eq. 18 can be performed in the crossbar array, while all the others are performed in software. Among those, Eqs. 12–17 involve 16M multiplications and 5M subtractions in software. Specifically:

Eq. 12: 2M multiplications plus M derivatives of the sigmoid function (= M subtractions + M multiplications).

Eq. 13: 2M multiplications, M derivatives of the tanh function (= M subtractions + M multiplications) and M additions.

Eq. 14: 2M multiplications plus M derivatives of the tanh function (= M subtractions + M multiplications).

Eq. 15: 2M multiplications plus M derivatives of the sigmoid function (= M subtractions + M multiplications).

Eq. 16: 2M multiplications plus M derivatives of the sigmoid function (= M subtractions + M multiplications).

Eq. 17: M multiplications.

Eq. 18: error backpropagation, (M + N + 1) × (4M) MACs, which can be performed on-crossbar.

Eq. 20: weight gradient calculated by an outer product, (M + N + 1) × (4M) multiplications, performed off-crossbar. (Eq. 19 is for the fully connected layer.)

For SGD without momentum, Eqs. 21–22 amount to accumulating the weight gradient calculated by Eq. 20, scaled by the learning rate. The two steps can be combined and accumulated directly into the gate voltage matrix by a further scaling with a coefficient ΔVgate/ΔG. In this case there are 2(M + N + 1) × (4M) multiplication operations, and an auxiliary memory (which could potentially be implemented in the memristor array) is required to store the 2(M + N + 1) × (4M) gate voltage values. In addition, the history of the input values and the activations also needs to be stored, which requires (N + 5M) × T bytes of memory, where T is the depth of the temporal sequence and the array is updated after each inference (batch size = 1).
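
A minimal sketch of this combined update, where `dv_dg` denotes the (assumed linear) ΔVgate/ΔG coefficient and the factor-of-two mapping onto differential device pairs is omitted for brevity:

```python
import numpy as np

def accumulate_gate_voltages(v_gate, delta, u, lr, dv_dg):
    """Fold Eqs. 20-22 into one step: the outer product of the input
    vector u (length M+N+1) and the error vector delta (length 4M)
    gives the weight gradient, which is scaled by the learning rate
    and by dVgate/dG, then accumulated into the gate voltage matrix.

    In the differential-pair array each weight increment would map to
    two gate voltage entries; that doubling is omitted in this sketch."""
    dW = np.outer(u, delta)        # Eq. 20: (M+N+1) x 4M multiplications
    v_gate += -lr * dv_dg * dW     # Eqs. 21-22 folded into gate voltages
    return v_gate
```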

It is clear that for an inference-only system most of the operations are performed, and most of the parameters are stored, on-crossbar, whereas for training a large number of operations remain off-chip.


Supplementary Note 2: Additional discussion on the output current for a large array

The size of the memristor array may be limited by the output current, but there are several potential solutions, each with its own tradeoffs. First, the read current can be lowered by reducing the input voltage amplitude (or the pulse duty cycle) applied to the row wires to a very low level, as long as the input signal-to-noise ratio (SNR) remains acceptable. The current should not be lowered below about a micro-ampere, because doing so generally increases the sensing overhead. A high sensing current level, on the other hand, does not necessarily lead to a greatly increased energy consumption, because the total sensing time is reduced. Second, the memristor device conductance can be lowered by device/materials engineering, subject to the constraint that the I–V characteristic should remain as linear as possible, or different weight updating schemes can be employed. There are also architectural solutions that involve tiling many smaller crossbar arrays to handle larger computations (ref. S1).


Supplementary Note 3: Additional discussion on weight update overhead

In our programming scheme, we estimate that the actual time consumed for RESET is around 1 s, and that for SET is about 2 s, for updating the entire 128×64 array. Most of that time is spent on the serial communication between the off-chip computer and the microcontroller. An individual device can be switched within 5 ns (ref. S2) or even faster. Therefore, for an on-chip integrated system, a parallel weight update of a 128×64 array will take less than 5 ns × 64 (RESET) + 5 ns × 128 (SET) = 960 ns, functionally equivalent to 7.95 Gbps (for a 1-bit cell) or 47.68 Gbps (considering that a memristor is a 6-bit-equivalent cell, ref. S3). Thus, the weight update scheme is not a limiting factor for the throughput. The 5 ns switching time cited above was limited by our available measurement apparatus; it has been reported that a memristor can be switched within 85 ps (ref. S4), which, if implemented, would reduce the full-array update time for a 128×64 array to 16.32 ns, equivalent to 0.46 Tbps (1 bit) or 2.74 Tbps (6 bit).

Considering an average conductance of 500 μS, the power consumption for the entire 128×64 array is (2.5 V)² × 500 μS × 64 = 0.18 W for SET and (1.7 V)² × 500 μS × 128 = 0.09 W for RESET. However, the energy required to update one cell is very low: (2.5 V)² × 500 μS × 5 ns = 15.6 pJ per 6-bit cell for SET and (1.7 V)² × 500 μS × 5 ns = 7.2 pJ per 6-bit cell for RESET. As a comparison, DRAM and SRAM at the 16 nm technology node require >100 pJ per single-bit flip (ref. S5). The energy can be further reduced by lowering the operation voltage and shortening the write pulse, together with a lower conductance, in an improved memristor device.
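
These timing and per-cell energy figures follow from simple arithmetic; the short script below reproduces them (we note that matching the quoted 7.95 Gbps from 8,192 bits per 960 ns requires binary prefixes, i.e. dividing by 2^30 and 2^40, which is our reading of the text's units):

```python
T_SWITCH = 5e-9          # per-pulse switching time (ref. S2)
CELLS = 128 * 64         # array size

def update_time(t_switch, n_reset=64, n_set=128):
    """Full-array update: 64 RESET pulses then 128 SET pulses, as in the text."""
    return t_switch * (n_reset + n_set)

def write_rate(t_total, bits_per_cell):
    """Equivalent write bandwidth in bits/s for one full-array update."""
    return CELLS * bits_per_cell / t_total

t = update_time(T_SWITCH)                  # 960 ns
print(write_rate(t, 1) / 2**30)            # ~7.95 "Gbps" (1-bit cell)
print(write_rate(t, 6) / 2**30)            # ~47.68 "Gbps" (6-bit cell, ref. S3)

t_fast = update_time(85e-12)               # ~16.32 ns (85 ps switching, ref. S4)
print(write_rate(t_fast, 1) / 2**40)       # ~0.46 "Tbps"
print(write_rate(t_fast, 6) / 2**40)       # ~2.74 "Tbps"

# Per-cell write energy: E = V^2 * G * t_pulse, with G = 500 uS on average
print(2.5**2 * 500e-6 * T_SWITCH)          # ~15.6 pJ per cell (SET)
print(1.7**2 * 500e-6 * T_SWITCH)          # ~7.2 pJ per cell (RESET)
```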


Supplementary References

S1. Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J. P., Hu, M., Williams, R. S. & Srikumar, V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proc. 43rd International Symposium on Computer Architecture (ISCA), Seoul, South Korea, 18–22 June 2016 (IEEE).

S2. Jiang, H. et al. Sub-10 nm Ta channel responsible for superior performance of a HfO2 memristor. Scientific Reports 6, 28525 (2016).

S3. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nature Electronics 1, 52 (2018).

S4. Choi, B. J. et al. High-speed and low-energy nitride memristors. Advanced Functional Materials 26, 5290 (2016).

S5. Wojcicki, T. VLSI: Circuits for Emerging Applications 414 (CRC Press, 2014).

