FPGA Implementation of Feed-Forward Neural Networks for Smart Devices Development

Stefan Oniga1, Alin Tisan1, Daniel Mic1, Claudiu Lung1, Ioan Orha1, Attila Buchman1, Andrei Vida-Ratiu1

1 Electronic and Computer Engineering Department, North University of Baia Mare, Baia Mare, Romania [email protected]

Abstract – This paper presents the results obtained in the implementation of Feed-Forward Artificial Neural Networks (FF-ANN) with one or several layers, used in the development of smart devices that need learning capability and adaptive behavior. The networks were implemented using ANN-specific blocks created by the authors with the System Generator software. The training and the testing of the networks were conducted using sets of 150 training and test vectors with 7 elements each.

I. INTRODUCTION

Artificial neural networks were inspired by biology and require parallel computation capabilities. A realistic model can reach as many as one million neurons and thousands of billions of connections. Microprocessors and DSPs are not well suited to this kind of parallel computation. Designing fully parallel models is possible using ASIC devices, but these are expensive, their development takes time, and the resulting ANN is suited to a single type (class) of applications. Due to the advantages they offer, FPGA devices are used more and more to implement ANNs [3].

FPGA devices offer not only parallel computing but also the flexibility afforded by reprogrammability, a shorter production cycle, and lower costs. They also allow various ANN topologies to be tested on the same hardware.

The method presented in this paper allows an artificial neural network to be implemented simply by using the libraries supplied with the System Generator software (Xilinx Toolbox) and/or the ANN-specific library created by the authors. The Xilinx library offers the designer a large variety of blocks with which any circuit can be modeled, but modeling a neural network with them is rather difficult; this is why a library of ANN-specific blocks is very useful.

In our research [1], an ANN block library was developed whose blocks can be configured by the network designer and are directly implementable. The blocks range from simple elements, such as multiply-accumulate units and activation functions, to complete one-layer networks with a variable number of neurons. All these elements have parameters accessible through the graphical interface or loadable from the Matlab environment or from a specified file.

II. NEURON MODEL

Artificial neural networks are inspired by and modelled on biological neural networks, whose complexity, in the case of the human brain, reaches 10^11 neurons, each with an average of 10^3 to 10^4 connections. The transmission of signals between biological neurons through the synapses is a complex chemical process in which information-carrying substances (called neurotransmitters) are generated. Their effect is a rise or a fall in the potential inside the receiving cell. If the potential reaches a certain threshold, the neuron fires and produces nervous impulses that are transmitted along the axons to other neurons. This is the characteristic that the model proposed by McCulloch and Pitts tries to reproduce.

This model has N inputs, each with an assigned weight w_i (i = 1..N). The main component of the neuron is the weighted adder, which calculates the net input of the neuron as in (1):

a = \sum_{i=1}^{N} w_i x_i \qquad (1)

The output signal of the neuron is a function of these values:

y(x) = f(a - \theta) = f\left( \sum_{i=1}^{N} w_i x_i - \theta \right) \qquad (2)

where θ represents the value of the activation threshold of the neuron.

Function f is called the activation function. Initially, the function proposed by the McCulloch-Pitts model was the threshold function, but other functions are widely used as well, such as the linear, saturation, or sigmoid functions.
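As an illustration, the neuron of (1) and (2) can be sketched in a few lines of Matlab; all names and values below are illustrative, not taken from the implementation:

% Minimal sketch of the McCulloch-Pitts neuron of Eqs. (1)-(2),
% with a threshold activation; values are illustrative.
x     = [0.5; -1.0; 0.25];     % N = 3 inputs
w     = [0.8;  0.2; -0.5];     % weights w_i
theta = 0.1;                   % activation threshold
a     = sum(w .* x);           % net input, Eq. (1)
y     = double(a - theta >= 0) % threshold activation, Eq. (2)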

The net input computation block performs the x_i w_i multiplications and sums up the results. The outputs are then obtained in the neuron state memory block by applying the activation function to the sum of the products.

A neuronal processing element must therefore comprise the following main blocks: an input buffer (data memory), a weights memory, a multiply-accumulator, the activation function, an output buffer, and the control block.


Figure 1. The MAC block designed with Xilinx blocks: a multiplier, a cast converter, an accumulator, and an output register, with Data, Weight, Reset Acc, and Reg_en inputs.

Equation (1) can be implemented through the multiply and accumulate (MAC) operation that is fundamental in digital signal processing and many other applications, using:

• one MAC block (neuron parallelism);
• N MAC blocks (for a complete implementation, i.e., synapse parallelism);
• 1 to N MAC blocks (in the case of a partially parallel implementation).

A minimal Matlab sketch of the single-MAC case is given below.
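In the single-MAC configuration, the products are accumulated serially, one per clock cycle; the following sketch emulates that behaviour (illustrative names and data, not the System Generator model itself):

% Serial MAC emulation: one multiply-accumulate per cycle.
N   = 7;               % number of neuron inputs
x   = randn(N, 1);     % input data
w   = randn(N, 1);     % weights
acc = 0;               % accumulator, reset before each neuron
for i = 1:N
    acc = acc + w(i) * x(i);   % one MAC operation per cycle
end
a = acc;               % net input available after N cycles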

A. Implementation of the multiply and accumulate block

Because there is no MAC block in the Xilinx Blockset Library, the authors created this block from existing Xilinx blocks; one possible solution is presented in Fig. 1.

The multiplier can be implemented very efficiently using the dedicated multiplier blocks already present in FPGA devices. Its precision can be set either to the maximum or to a value chosen by the user.

The accumulator is implemented using the accumulator block from the Xilinx library. The precision of the accumulator depends on the number of accumulation operations, that is, on the number of inputs the neuron has, and it must be large enough to avoid overflow. Thus the number of bits of the accumulator is a parameter of the MAC block in the System Generator (3):

nr_bits_acc = ceil(log2(p)) + (n + m) \qquad (3)

where p represents the number of accumulation operations and n + m is the number of bits of the accumulator input data (the product of an n-bit input and an m-bit weight).
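For example, with illustrative word lengths of n = 8 bits for the data, m = 12 bits for the weights, and p = 7 accumulations, Eq. (3) can be evaluated directly:

% Accumulator width per Eq. (3); the word lengths are illustrative.
p = 7;                    % number of accumulation operations
n = 8;  m = 12;           % data and weight word lengths, in bits
nr_bits_acc = ceil(log2(p)) + (n + m)   % = 3 + 20 = 23 bits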

The resources used to implement the MAC depend on a very large number of factors, among which are the number of input bits, the number of weight bits, the precision of the multiplier, the number of bits the accumulator works on, and, last but not least, the way the multiplier is implemented (with dedicated multipliers or with CLB logic resources).

These resources depend mostly on the parameters mentioned above, which is why, depending on the application, they must be adjusted in order to achieve an optimal resources/performance ratio.

For example, if the precision of the accumulator can be reduced from 23 to 20 bits and the precision requested for the outputs is 3 bits, the MAC block can be modified by inserting, between the accumulator and the output, a converter block that reduces the number of bits on which the output word is represented, thus decreasing the number of flip-flops needed to implement the output register.

Through various optimizations, presented in [6], the resources can be decreased to 10 slices for a MAC implemented with a dedicated multiplier in VIRTEX-II.

The precision of the weights representation is one of the most important choices when implementing an ANN in an FPGA. A higher precision means fewer quantization errors in the final implementation, while a lower precision leads to a simpler circuit, higher speed, and a reduction of both the resources needed for the implementation and the power required. One way to resolve this trade-off is to determine the minimum precision needed to solve the given problem. Traditionally, the minimum precision is determined by successive trials, simulating the solutions before implementation. Several authors have studied the minimum-precision problem, with conclusions that differ as a function of the ANN type and of the application it is used for [3], [4].

B. Neuron implemented using the ANN Blockset library

The ANN Blockset library created by the authors contains a first set of blocks specific to neural networks. The neuron implemented using the ANN Blockset library is presented in Fig. 2. The main parameters of the neuron that can be set from the user interface are: the number of neuron inputs; the number of bits used to represent the input data, the weights, and the outputs; the type and precision of the multiplier; the number of bits allocated to the accumulator; and the files loaded into the weights memory and the activation block.

III. FEED-FORWARD NEURAL NETWORK IMPLEMENTATIONS

FF-ANNs, also called forward-propagation networks, are among the most widely used types of ANN in applications that associate a set of input patterns with a set of target patterns [2].

There is no rule that establishes the number of layers of a neural network or the number of neurons per layer; these are usually determined experimentally. As a general rule, it is recommended that the network be as simple as possible. In most cases, one hidden layer is enough to solve the problem, and the number of hidden layers is generally no greater than two. The number of neurons in the input and output layers depends on the application, while the number of neurons in the hidden layers is determined experimentally, by trials. Too many neurons only lead to a substantial increase in training time (which can run to days) and to a loss of the network's generalization capacity: the ANN then responds very well to the data it has learned but performs poorly on the test data. This is why it is advisable to start with a small number of neurons and increase the number only if necessary.

Figure 2. Neuron implemented using the ANN Blockset library (weight memory, data memory, MAC, activation function, and control logic).

A. Implementation of a layer of FF ANN with neuron parallelism

This implementation represents a compromise between speed performance, on one hand, and the implementation resources needed and flexibility, on the other. This is why this type of parallelization was chosen for the implementation.

In this experiment, the ANN training phase was implemented in software, while the propagation phase was implemented in hardware.

The implemented neural networks were trained with several training methods. One of the methods tested is the Levenberg-Marquardt method, one of the fast training algorithms for FF-BP ANNs.

The weights of the trained network can be saved in a weights file. After training, the network can also be loaded into Simulink for simulation. This model can serve as a reference for the model that will be developed using System Generator.
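A sketch of this software training phase, assuming the Matlab Neural Network Toolbox; the data sizes are illustrative and nothing below is taken from the authors' scripts:

% Software training of a small FF network with Levenberg-Marquardt.
P = randn(7, 150);                   % 150 training vectors, 7 elements
T = double(rand(7, 150) > 0.5);      % illustrative target patterns
net = feedforwardnet(7, 'trainlm');  % one hidden layer, LM training
net = train(net, P, T);              % training phase (software)
W1 = net.IW{1};  b1 = net.b{1};      % trained weights and biases
save('weights.mat', 'W1', 'b1');     % weights file for the hardware model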

In order to implement a neuronal layer with neuron parallelism, a few particularities must be observed. The structure of the neuron presented in Fig. 2 is modified so that it contains only one MAC block and one weights memory.

Using the modified neurons, a data memory, and a control logic block, a neuronal layer with any number of neurons can be implemented. Fig. 3 presents a neuronal layer consisting of 7 neurons. Because the neurons in the next layer do not need all the data simultaneously, a multiplexer transmits the output data of the layer sequentially; consequently, a single activation function block can be shared by all the neurons in the layer.

For the implementation, the representation precision of the weights, which were calculated in double precision, is reduced. By default, the ROM memory block used to store the weights reduces the number of bits by rounding to the closest value representable in the specified number of bits (the "round" Matlab function). When the maximum representable value is exceeded, saturation to the extreme values is applied.
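This default behaviour can be modelled in plain Matlab as follows (a simplified sketch with illustrative word lengths, not the ROM block itself):

% Simplified model of the default weight quantization:
% round to nearest on nb bits (bp of them fraction bits),
% then saturate on overflow. Word lengths are illustrative.
nb = 12;  bp = 10;                   % total and fraction bits
ponderi = [0.7312; -1.05; 2.9];      % double-precision weights
scale = 2^bp;
q     = round(ponderi * scale);      % quantization by rounding
qmax  =  2^(nb-1) - 1;               % largest representable code
qmin  = -2^(nb-1);                   % smallest representable code
q     = min(max(q, qmin), qmax);     % saturation to the extremes
pond_q = q / scale                   % quantized weights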

Figure 3. The hardware model of a neuronal layer consisting of 7 neurons.

TABLE I. NUMBER OF BITS USED IN THE WEIGHTS IMPLEMENTATION
(cell values: number of erroneous vectors)

Quantization   No. of bits:   8    9   10   11   12   13   14   15   16
round                        109   50   44   60   37   39   41   41   41
fix                          123   78   19   22   35   37   38   39   41
ceil                         150  150  141  101   77   53   49   48   46
floor                        150  147   82   10   23   24   30   38   41
convergent                   109   50   44   60   37   39   41   41   41

To change the way the weights' precision is reduced, an explicit precision reduction can be performed with Matlab functions before the weights are loaded into memory. For example:

q = quantizer('fixed', 'fix', 'saturate', [nb bp]);
pond = quantize(q, ponderi);    (4)

allows the weights to be represented in fixed-point format. Here q holds the quantization parameters: the quantization is performed by rounding toward zero ('fix'), on the number of bits specified by nb, of which bp are used for the fraction part, and overflow is treated by saturation.

The rounding modes are the usual Matlab ones: round, fix, ceil, floor, or convergent.

For overflow, one can choose between wrap and saturate. Testing various solutions for reducing the precision, we obtained the results presented in Table I. The minimal number of erroneous vectors (10) was obtained for rounding toward minus infinity (floor), with an 11-bit representation of the weights.

Simulating the network for various weight representation precisions, we obtained the graph presented in Fig. 4, which shows the number of erroneous vectors as a function of the number of bits per weight.
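Such a sweep can be reproduced with a self-contained toy model (a random-data stand-in for the real network and test set, not the authors' experiment):

% Toy sweep of weight precision vs. error count (cf. Table I, Fig. 4).
rng(0);
W  = randn(7, 7);                    % toy 7-neuron layer weights
X  = randn(7, 150);                  % 150 test vectors of 7 elements
yr = sign(W * X);                    % reference double-precision outputs
nbits  = 8:16;  bp = 6;              % fraction bits are illustrative
errors = zeros(size(nbits));
for k = 1:numel(nbits)
    qmax =  2^(nbits(k)-1) - 1;
    qmin = -2^(nbits(k)-1);
    Wq = min(max(round(W * 2^bp), qmin), qmax) / 2^bp;  % round+saturate
    errors(k) = sum(any(sign(Wq * X) ~= yr, 1));        % erroneous vectors
end
plot(nbits, errors)
xlabel('Number of bits per weight'), ylabel('Number of erroneous vectors')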

The resources used by one neuron are 14 slices, one dedicated multiplier, and one Block RAM memory, while the maximum operating frequency of the circuit after optimization is 186.916 MHz.

In order to verify that the network functions properly, a post-implementation simulation was conducted with the ModelSim software, using the testbench and the test vectors generated by the Simulink/System Generator model. The clock frequency used is 100 MHz.

Figure 4. The errors of the software model for the ANN FF-BP-LM 1x7 (x-axis: number of bits per weight, 8 to 16; y-axis: number of erroneous vectors, 0 to 150).

Figure 5. FF-BP-LM 2x7 neural network with neuron parallelism

B. Implementation of the FF BP ANN with two layers

The model of the neural network, built with hardware-implementable blocks created by the authors together with blocks from the Xilinx library, is presented in Fig. 5. Testing various options for reducing the weights representation precision, the minimal number of errors (5) is obtained for rounding to the closest value, with a 12-bit representation. This corresponds to a correct association rate of 99.52%.
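The quoted rate follows if the 5 errors are counted over all elements of the 150 seven-element test vectors, an interpretation consistent with the numbers though not stated explicitly:

% Consistency check of the 99.52% association rate (inferred).
errors   = 5;
elements = 150 * 7;                          % vectors x elements each
rate = 100 * (elements - errors) / elements  % = 99.52 percent correct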

The maximum frequency at which complete 7-element input vectors can feed the network can be calculated with the following formula:

f_{max} = \frac{f_{clk}}{(M_1 + 1)(M_2 + 1)} \qquad (5)

where M_1 and M_2 represent the number of connections per neuron in layers 1 and 2, respectively. The maximum frequency of the input signal is f_max = 135.465 MHz / 64 ≈ 2.117 MHz.
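A one-line check of Eq. (5) for the implemented 2x7 network (M_1 = M_2 = 7, with f_clk taken from the reported synthesis result):

% Maximum input vector rate per Eq. (5).
f_clk = 135.465e6;          % reported maximum clock frequency, Hz
M1 = 7;  M2 = 7;            % connections per neuron, layers 1 and 2
f_max = f_clk / ((M1 + 1) * (M2 + 1))   % about 2.117 MHz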

IV. CONCLUSIONS

This paper tested the development method presented in [5] and studied the effect the block parameters have on the network. The results obtained prove the correctness of the method, and the study conducted is of real help to those who wish to implement neural networks in FPGAs. FF propagation networks with one or two layers were implemented. The networks were trained by two different methods: the Hebbian method and the Levenberg-Marquardt method.

The study of the effect of the number of bits used to represent the weights shows that 11-12 bits are usually enough to obtain the same precision as the software model, which uses 64-bit weights. Another conclusion concerns the importance of the way the weights' precision is reduced: with an adequate reduction, the hardware model can even achieve better precision than the software one.

Considerable attention was given to the influence of the component blocks' parameters on the resources used and on the maximum operating frequency. The multiplier latency, the quantization method, and the overflow handling have the biggest influence on the resources used. It was found that the greatest advantage comes from a latency of 1, quantization by truncation, and wrap for overflow. In this way, the resources needed for a single MAC were reduced from approximately 60 to 12 slices.

The operating frequency is very important when evaluating the performance of a neural network; this is why the equations that allow the maximum operating frequency to be determined were derived.

REFERENCES

[1] A. Tisan, S. Oniga, A. Buchman, C. Gavrincea, "Architecture and Algorithms for Synthesizable Neural Networks with On-Chip Learning", International Symposium on Signals, Circuits and Systems, ISSCS 2007, Iasi, Romania, July 12-13, 2007, vol. 1, pp. 265-268.

[2] J. Torresen, S. Tomita, "A Review of Implementation of Backpropagation Neural Networks", Chapter 2 in N. Sundararajan and P. Saratchandran (eds.), Parallel Architectures for Artificial Neural Networks, IEEE CS Press, 1998, ISBN 0-8186-8399-6.

[3] J. Zhu, P. Sutton, "FPGA Implementations of Neural Networks - a Survey of a Decade of Progress", Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL 2003), Lisbon, Sep. 2003.

[4] S. Draghici, "On the capabilities of neural networks using limited precision weights", Neural Networks, vol. 15, no. 3, April 2002, pp. 395-414.

[5] S. Oniga, "A New Method for FPGA Implementation of Artificial Neural Network Used in Smart Devices", International Computer Science Conference microCAD 2005, Miskolc, Hungary, March 2005, pp. 31-36.

[6] S. Oniga, A. Tisan, D. Mic, A. Buchman, A. Vida-Ratiu, "Optimizing FPGA Implementation of Feed-Forward Neural Networks", Proceedings of the 11th International Conference on Optimization of Electrical and Electronic Equipment, OPTIM 2008, Brasov, Romania, May 22-23, 2008, pp. 31-36.

