A Reconfigurable, Generic and Programmable Feed-Forward Neural-Network Implementation in FPGA

Ayman Youssef, Karim Mohammed, Amin Nassar Electronics and Electrical Communications Department

Cairo University Giza, Egypt

Email: [email protected] [email protected]

Abstract—This paper presents a new reconfigurable, generic hardware implementation of multilayer feed-forward Neural-Networks (NNs) using field-programmable gate arrays (FPGAs). Implementations of feed-forward Neural-Networks face two major issues: 1) the limited resources available on the FPGA compared to the large number of multiplications required by Neural-Networks, and 2) the limited reusability of the design when applied to Neural-Network applications with different architectures. Our proposed implementation addresses both issues. The design reduces resource requirements by time-sharing; the time-shared resources are arranged in a scalable, configurable processing unit. The scalability allows the user to implement the design with a variable number of neurons, from a single neuron up to the maximum number of neurons in any layer. The design also gives the user the ability to reconfigure it for different applications with programming-like ease and flexibility, and a GUI was implemented to allow automatic configuration of the design for different applications.

Keywords—Field-programmable gate array (FPGA); Lookup Table (LUT); Digital Signal Processor (DSP); hardware implementation; layer multiplexing; Neural-Networks (NNs).

I. INTRODUCTION

There are many applications and problems that a human brain can solve more effectively than ordinary software; this stimulated the development of machines that emulate the human neural process and led to the birth of artificial Neural-Networks (ANNs). Neural-Network solutions have some clear advantages over other solution techniques in certain areas [1]. First, neural networks can learn from the input sequence and the environment. Second, a neural network is an essentially nonlinear system and is therefore better suited to solving essentially nonlinear problems. Third, solutions developed by ANNs tend to be fault tolerant: the distributed nature of the computation tends to make the effects of isolated faults non-catastrophic. Finally, ANN calculations are highly parallel; this inherent parallelism makes them well suited to hardware implementations, particularly VLSI implementations.

Implementation of Neural-Networks can be accomplished either in software or in hardware. Software solutions are naturally fast to develop and very flexible to reprogram; their problem is that a general-purpose processor takes too much time to execute the application, because an ANN application by definition involves a large number of multiplications, one for each of the many connections in the network. These operations are handled very inefficiently by the general-purpose ALU of a processor. This opened the way for dedicated hardware implementations on Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). ASIC implementations allow a very efficient implementation of the ANN; however, they offer no reconfigurability to the user: once a design is fabricated, it cannot be changed. That is why FPGA implementation is the most suitable hardware platform for Neural-Network applications: it allows the user to reconfigure the device for different applications and designs while preserving the parallel architecture of the Neural-Network. FPGA implementations face several challenges. First, NNs require large resources, since they involve many multiplications and nonlinear excitation functions. Second, the design needs to be generic enough to solve multiple applications. Third, the design cycle must be short and the design must be able to solve real-time problems in practice. Numerous works have reported high-speed, low-cost FPGA feed-forward Neural-Network implementations. P. Domingos [2] implemented an Artificial Neural-Network system and compared its results with a software solution; the speed of the design is very high compared to the software solution, but no hardware resource-reuse technique was implemented, so the hardware resources are used ineffectively. To address this issue, layer multiplexing and neuron reuse are often used for effective resource utilization at the cost of speed; this allows larger Neural-Networks to be practically implemented on FPGA [3].

A multilayer perceptron Neural-Network on FPGA (ARC_PCI board) with a parameterized number of layers and neurons, performing the feed-forward computation, has also been reported. The reuse technique reduced the hardware resource usage of the FPGA, and the parameterization made the design more flexible for diverse applications; however, this design did not give the user the ability to choose the number of neurons used to implement a Neural-Network, i.e., the ability to trade off between application speed and FPGA area (the number of neurons used to implement the application) [4].

This paper presents a novel feed-forward design that takes a step beyond the state of the art. We integrate resource-reuse techniques and pipelining to implement a generic feed-forward network on FPGA with high speed and low cost. The design is parameterized and gives the user the ability to choose the number of neurons, the number of layers, and the range of both the weights and the input samples. The design can also be programmed to solve problems with different numbers of neurons, ranging from a single neuron up to the number of neurons in the largest layer. This gives the user the ability to trade off between the speed and the resource utilization of the network, and to choose the appropriate specifications for the application. We also implemented a compiler and a GUI that allow the user to reconfigure the design for different applications. The test application was implemented with two design configurations, and a table comparing these configurations is presented, together with a comparison of the FPGA and DSP solutions in terms of speed and the number of neurons used to implement the application.

II. SINGLE NEURON DESIGN

The main block in a Neural-Network architecture is the single neuron. The neuron architecture is very simple: it consists of a multiplier and an activation function (Fig. 1).

Figure (1) Single neuron

Simple as it seems, the neuron requires a large number of multipliers to calculate its output. This number can be reduced by performing the multiplications serially using an accumulator and a single multiplier (Fig. 2). This reduces the resources effectively, as there may not otherwise be enough multipliers on the FPGA to implement the Neural-Network.

Figure (2) Neuron implementation using multiply-accumulate
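As an illustration, a minimal behavioral sketch of this multiply-accumulate scheme follows, written in plain Python rather than the paper's VHDL; the names are illustrative only:

```python
# Behavioral sketch of the Fig. 2 neuron: one time-shared multiplier feeding
# an accumulator, consuming one input/weight pair per step.

def mac_neuron(inputs, weights, activation):
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w            # the single shared multiplier + accumulator
    return activation(acc)      # activation applied once, after accumulation
```

The fully parallel neuron would instead need one multiplier per input; serializing trades cycles for multipliers, which is exactly the resource/speed trade-off the design exposes.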

The neuron also needs an activation function. In this design we implement a LUT (lookup table) to realize this function, and the design contains only one LUT through which all neuron outputs pass. This too is done for effective resource utilization, as the LUT requires a lot of resources.

III. NEURAL-NETWORK IMPLEMENTATION

The Neural-Network implementation is shown in Fig. 3.

Figure (3) Novel ANN processor

It consists of: 1) a control unit; 2) a generic number of neurons; 3) an activation-function LUT; and 4) three memories (a code memory, a weight memory, and an input memory for the input samples).
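For illustration, a hypothetical software model of these components could be organized as follows; the field names, types, and layouts are our assumptions, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical top-level model of the four components listed above.

@dataclass
class ANNProcessor:
    num_neurons: int                                       # generic: 1 .. widest layer
    code_mem: List[str] = field(default_factory=list)      # control-unit program
    weight_mem: List[List[int]] = field(default_factory=list)  # one row per address
    input_mem: List[int] = field(default_factory=list)     # samples and layer outputs
    act_lut: List[int] = field(default_factory=list)       # shared activation LUT
```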

A. Control unit.

The control unit controls all of the design components. It runs a user-defined program: it fetches its input from the code memory and executes each instruction accordingly. There are three instructions in our instruction set: 1) "loadweinp", which loads the weight and the input of a neuron and then activates the neuron by enabling the accumulator register; it takes two arguments, the address in the weight memory and the address in the input memory. 2) "store", which stores the output of a single neuron cell in a specified memory location; it takes two arguments, the neuron number and the memory address to save it in. 3) "reset", which resets all the memory registers and is always used when a layer computation is finished. Using these three instructions the design can be programmed to solve any Neural-Network application, and the user can choose the number of neurons used to implement the application, ranging from a single neuron up to the maximum number of neurons in a layer; this allows the user to configure the design with different speed and area specifications. The three instructions work as follows:

1) "loadweinp" This instruction (e.g. loadweinp 1,1) has two arguments: The first argument is the address of the input memory to be loaded into the neurons-given that all neurons are loaded with the same input samples, the second argument is the weight memory address. This loads all the neurons with their corresponding weights, this is done at the same time using two distinct data buses -one for the input memory and one for the weight memory- to increase the design speed. After loading the inputs and weights, the instruction enables the accumulator register in the neurons to activate them.

2) "store" This instruction (e.g. store 0,2) has two arguments: the first argument is the neuron number whose output we want to save, the second argument is the memory location where the data is stored. The instruction works as follows it takes the neuron output and passes it to the activation function, and then the output is stored in memory.

3) "Reset" This instruction (reset) resets all the neurons and clears the accumulators in the neuron cells, it is used in two cases: 1) the beginning of the application. 2) When a layer computation is finished.

TABLE 1. Instructions and their corresponding cycle budget

Instruction   Number of clock cycles
loadweinp     4
store         5
reset         2

B. Code memory.

The code RAM holds the control-unit instructions. The memory cell also has two generic parameters: the memory size and the memory width. This allows the input memory size, the weight memory size, and the number of neurons to be generic. The memory contents are generated using a C# GUI whose underlying parser takes the instructions and generates the binary code for the memory. The GUI is shown in Fig. 4.

C. Weight memory.

The weight memory is where all the neuron weights are saved; the weights are loaded into all the design's neurons at the same time to increase the design speed. The weights were calculated in MATLAB and then converted to binary using the C# memory GUI. Each memory address contains the weights of the neurons.

D. Input memory.

The input memory is where the input samples are saved; it is parameterized in both the memory width and the memory size.

E. Sigmoid activation function.

The design contains a lookup table (LUT) holding the values of one activation function; this LUT is used by all the neurons in the design for effective resource utilization. A MATLAB GUI was implemented to generate the LUT data according to the activation function chosen by the user.
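A sketch of what such LUT generation involves is given below, written in Python for illustration; the table depth, input range, and fixed-point output scaling are assumptions, not the paper's parameters:

```python
import math

# Tabulate sigmoid(x) over [lo, hi) as fixed-point integers, once,
# so a single LUT can serve every neuron in the design.

def build_sigmoid_lut(depth=256, lo=-8.0, hi=8.0, frac_bits=8):
    step = (hi - lo) / depth
    scale = (1 << frac_bits) - 1          # keep outputs within frac_bits bits
    return [round(scale / (1.0 + math.exp(-(lo + i * step))))
            for i in range(depth)]

lut = build_sigmoid_lut()                 # one table, shared by all neurons
```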

F. Compiler and GUI.

We implemented two GUI components using C#. The first GUI, shown in Fig. 4, compiles the program that will run on the FPGA, generating the binary code.
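A minimal sketch of this compile step, translating mnemonics into code-memory words, might look as follows; the opcode values and field widths are invented for illustration, since the paper does not specify the encoding:

```python
# Hypothetical assembler: one instruction mnemonic -> one code-memory word.
OPCODES = {"loadweinp": 0b00, "store": 0b01, "reset": 0b10}

def assemble(line, addr_bits=8):
    parts = line.replace(",", " ").split()
    a, b = (int(p) for p in parts[1:3]) if len(parts) == 3 else (0, 0)
    return (OPCODES[parts[0].lower()] << 2 * addr_bits) | (a << addr_bits) | b

binary = [assemble(l) for l in ["loadweinp 1,1", "store 0,2", "reset"]]
```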


Figure (4) Compiler GUI

The second GUI is the memory GUI (Fig. 5). It allows the user to enter the data needed for the weight memory and the input memory. It also allows the user to choose the number of bits representing the integer part and the fraction part of the weights, so the user can adjust the design to the accuracy needed by the application.
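A sketch of this fixed-point conversion is shown below; the bit widths and the two's-complement packing are assumptions for illustration, not the paper's defaults:

```python
# Quantize a weight with user-chosen integer and fraction bit counts.

def to_fixed(value, int_bits=3, frac_bits=12):
    scaled = round(value * (1 << frac_bits))
    total = 1 + int_bits + frac_bits                  # sign + integer + fraction
    lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    scaled = max(lo, min(hi, scaled))                 # saturate on overflow
    return scaled & ((1 << total) - 1)                # two's-complement pattern

word = to_fixed(-0.731)   # e.g. one trained weight exported from MATLAB
```

More fraction bits give higher accuracy at the cost of wider memories and multipliers, which is the trade-off the GUI exposes to the user.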

Figure (5) Memory GUI

IV. TESTING AND RESULTS

The design was tested on a time-series prediction application.

A. Time series prediction using a Neural-Network.

The application we tested the design on is time series prediction, a very important problem in many real-time applications: predicting new time-series values from previous ones. The time series we predict here is the Mackey-Glass time series. A simple 2-4-1 Neural-Network is implemented and trained in MATLAB; the weight values are then loaded into our design GUI and converted to a binary representation with the accuracy specified by the user. A functional simulation is then run in the Xilinx ISE design suite 12.1, the outputs of the Neural-Network are saved to a text file, and finally they are loaded into MATLAB and drawn as the red and green points in Fig. 6. The network is implemented with two design settings to test the design's ability to fit larger applications in a small area. The first setting configures the number of neurons in the design to 4 (layer multiplexing); the results of this setting are shown as green points. The second setting uses only one neuron; its results are shown as red points. This is an example of how the design can be reconfigured for different speed and area specifications.
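For reference, a floating-point model of such a 2-4-1 forward pass is sketched below; the trained weights and biases would come from MATLAB, and using the sigmoid in both layers is an assumption, since the paper leaves the activation user-selected:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_2_4_1(x, w_hidden, b_hidden, w_out, b_out):
    """x: 2 inputs; w_hidden: 4 rows of 2 weights; w_out: 4 output weights."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

Comparing this reference against the quantized VHDL output is what Fig. 6 visualizes.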

Figure (6) Fixed-point simulation versus results from VHDL (plot title: "fixed point simulation vs vhdl"; x-axis 0 to 2000 samples, y-axis -0.4 to 0.3)

A comparison was made between the speed of the two design settings, the speed of software (MATLAB), and the speed of a general-purpose DSP. Table 2 shows the number of clock cycles necessary for the DSP and the two FPGA settings.

TABLE 2. Application clock cycles

               DSP      FPGA, 4 neurons   FPGA, 1 neuron
Clock cycles   29,640   62                92

The following table compares the minimum time needed (working at maximum speed) to predict one sample of the time series, for the MATLAB software, the DSP solution, and our two design settings.


TABLE 3. Design time

       MATLAB      DSP        FPGA, 4 neurons   FPGA, 1 neuron
Time   46.446 ms   29.64 us   0.442857 us       0.65714 us

This comparison shows how fast our design is relative to the DSP; it also shows the speed overhead incurred by using 1 neuron instead of 4 neurons. Our solution is 60 times faster than the DSP solution. The design was synthesized using the Xilinx ISE design suite on a Virtex-6 XC6VLX75T device, where a maximum frequency of 148 MHz was found; the synthesis results for the 1-, 2-, and 4-neuron configurations follow.

TABLE 4. Synthesis results

                                 1 neuron   2 neurons   4 neurons
Number of slice registers        216        248         344
Number of slice LUTs             331        373         500
Number of fully used LUT pairs   139        195         363
Number of block RAMs             3          3           3
Number of DSP48E1s               1          2           4

The one-neuron design uses only one multiplier instead of four (i.e., a 75% reduction in multiplier resources), together with a reduction in the other resources, at the cost of only about 30% overhead in speed. This is very important when the FPGA board does not have enough multipliers for the application, in which case the multiplier reduction translates into a reduction in slice resources, or when a new, more efficient implementation of the multipliers is needed.

V. CONCLUSION

In this paper we presented a novel feed-forward Neural-Network implementation. The design is generic, reconfigurable, and programmable, which gives the user the ability to trade off between speed and area; the design is 60 times faster than a DSP solution, and a GUI was implemented to reconfigure and program the design. The design was tested on a time-series prediction application, and a comparison was made between the different design configurations. It was shown that a user can obtain a 75% reduction in the number of multipliers with only about 30% overhead in speed. The design was simulated and synthesized with the Xilinx ISE suite 12.1 on a Virtex-6 XC6VLX75T device, and a maximum frequency of 148 MHz was reported.

REFERENCES

[1] S. Haykin, "Neural Networks and Learning Machines," Pearson International Edition, 3rd ed.

[2] P. O. Domingos, F. M. Silva, and H. C. Neto, "An Efficient and Scalable Architecture for Neural Networks with Backpropagation Learning," Field Programmable Logic and Applications, pp. 89-94, Aug. 2005.

[3] S. Himavathi, D. Anitha, and A. Muthuramalingam, "Feedforward Neural Network Implementation in FPGA Using Layer Multiplexing for Effective Resource Utilization," IEEE Transactions on Neural Networks, vol. 18, no. 3, May 2007.

[4] D. Ferrer, R. González, R. Fleitas, J. Pérez Acle, and R. Canetti, "NeuroFPGA—Implementing Artificial Neural Networks on Programmable Logic Devices," Design, Automation and Test in Europe, vol. 3, pp. 30218, 2004.
