Design space exploration for image processing architectures on FPGA targets

Chandrajit Pal, Avik Kotal, Asit Samanta, Amlan Chakrabarti, Ranjan Ghosh

University College of Science, Technology and Agriculture, University of Calcutta,

92, APC Road, Kolkata, India http://www.caluniv.ac.in/

Abstract. With the emergence of embedded applications in image and video processing, communication and cryptography, the improvement of pictorial information for better human perception, such as de-blurring and de-noising, is gaining renewed research importance in fields such as satellite imaging, medical imaging and mobile applications. These developments rest primarily on advances in semiconductor technology leading to FPGA-based programmable logic devices, which combine the advantages of custom hardware and dedicated DSP resources. In addition, FPGAs provide a powerful reconfiguration feature and are therefore an ideal target for rapid prototyping. We have exploited the hardware parallelism of FPGA technology, which leads to higher computational density and throughput, and have observed better performance than can be obtained by merely porting image processing software algorithms to hardware. In this paper we present an elaborate review, based on our expertise and experience, of the transformations that must be applied to an image processing software algorithm, including the optimization techniques that make its operation in hardware comparatively faster.

Keywords: IP (Intellectual Property), FPGA (Field Programmable Gate Array), non-recurring engineering (NRE) costs, FPGA-in-the-loop (FIL).

1 Introduction

Human beings have historically relied on their vision for tasks ranging from basic instinctive survival skills to detailed and elaborate analysis of works of art. Our ability to guide our actions and engage our cognitive abilities based on visual input is a remarkable trait of the human species, and much of how exactly we do what we intend to do, and seem to do it so well, remains to be explored. The need to extract information from images and interpret their contents has been one of the driving factors in the development of image processing and computer vision over the past decades. Digital image processing (DIP) is an ever growing area with a variety of applications including medicine, video surveillance and many more. To implement the upcoming sophisticated DIP algorithms and to process the large amount of data captured from sources such as satellites or medical instruments, intelligent high-speed real-time systems have become imperative [1]. Image processing algorithms implemented in hardware (instead of software) have recently emerged as the most viable solution for improving the performance of image processing systems. Our goal is to familiarize application programmers with the state of the art in compiling high-level programs to FPGAs, and to survey the relevant research work on FPGAs. The outstanding features that FPGAs offer, such as optimization, high computational density and low cost, make them an increasingly preferred choice of experts in the image processing field today.

Technological advancement in the manufacture of semiconductor ICs offers opportunities to implement a wider range of imaging operations in real time, while implementations of existing ones need improvement. The arrival of reconfigurable hardware devices together with system-level hardware description languages has further accelerated the design and implementation of image processing algorithms in hardware. Due to the possibility of fine-grained parallelism in imaging operations, FPGA circuits are capable of competing with other computation-based implementation environments. This advancement has now made it possible to design complete embedded systems on a chip (SoC) by combining sensor, signal processing and memory onto a single substrate. With the ideal use of System-on-a-Programmable-Chip (SOPC) technology, FPGAs prove to be a very efficient, cost-effective and attractive methodology for design verification [2].

In this paper we survey various hardware implementations of image processing algorithms and show how the DSP design environment from Xilinx can be used to develop hardware-based computer vision algorithms from a system-level approach, making it suitable for developing co-design environments with an emphasis on the salient features of FPGAs. Section 2 highlights the drawbacks of other hardware implementation alternatives and sets the basis for explaining the advantages of FPGAs while evaluating several significant parameters. Section 3 summarizes the related research on FPGA implementation of image processing algorithms. Section 4 deals with the main contributions of the Xilinx DSP design environment, with application examples and hardware architectures. Section 5 deals with the results and discussion, and finally Section 6 concludes the work with a discussion and a projection towards future work.

2 Software paradigm to hardware (FPGA)

In general, sophisticated image processing algorithms are so computationally intensive that general-purpose CPUs cannot satisfy real-time constraints [3]. Software provides flexibility and re-programmability, but it leads to sequential execution of instructions and adds compiler overhead for identifying and executing multi-threaded components. Execution in customized hardware, however, is inherently parallel by virtue of its architecture, and as a result the independent instructions of the algorithm can be executed in parallel

Lecture Notes in Computer Science: Authors' Instructions 3

subject to the availability of suitable hardware components, thereby increasing the speed of execution. Gains are made in two ways when comparing a hardware implementation with its software counterpart.

Firstly, a software implementation is constrained to execute only one instruction at a time. Although the instruction fetch/decode/execute cycle may be pipelined, and modern processors allow different threads to be executed on separate cores, software is sequential by nature. A hardware implementation, on the other hand, is fundamentally parallel, with each operation or instruction implemented on a separate hardware module. In fact, a hardware system must be explicitly programmed to perform operations sequentially if necessary. If an algorithm can be implemented in parallel to make efficient use of the available hardware, considerable performance gains can be achieved.

Secondly, a serial implementation is memory bound, with data communicated from one operation to the next through memory. As a result, a software processor spends a significant proportion of its time reading its input data from memory and writing the results of each operation (including intermediate operations) to memory.

Traditional digital signal processors are microprocessors designed for a special purpose. They are well suited to algorithm-intensive tasks but are limited in performance by clock rate and the sequential nature of their internal design. This limits the maximum number of operations per unit time that they can carry out on the incoming data samples. Typically, three or four clock cycles are required per arithmetic logic unit (ALU) operation, which leads to lower throughput. Multicore architectures may increase performance, but these are still limited. Designing with traditional signal processors therefore necessitates the reuse of architectural elements for algorithm implementation. To increase the performance of a system, the number of processing elements needs to be increased, which has the negative effect of shifting the focus from signal processing to the task overhead of controlling multiple processing elements.

A solution to this increasing complexity of DSP (Digital Signal Processing) implementations (e.g. digital filter design for multimedia applications) came with the introduction of FPGA technology, developed as a means to combine and concentrate discrete memory and logic, thus enabling higher integration, higher performance and increased flexibility through massively parallel structures containing a uniform array of configurable logic blocks (CLBs), memory and DSP slices along with other elements [4],[5].

With the constant advancement of semiconductor technologies, FPGAs are becoming sufficiently powerful to support real-time image processing due to their high logic density, generic architecture and considerable on-chip memory. Moreover, the straightforward reconfiguration procedure allows designers to configure the hardware as many times as needed without extra cost, i.e. it gives the ability to tailor the implementation to match system requirements. These benefits allow hardware design to keep meeting the application-specific (vertical) requirements of time-critical and computationally complex applications. In addition, the very high-speed I/O of an FPGA further reduces cost and minimizes bottlenecks by maximizing data flow from capture through the processing chain to the final output. Sometimes constant upgrading of the device is required, for which ASICs (Application Specific Integrated Circuits) do not fit well, since an ASIC cannot be changed once it has been fabricated [6].

Most machine vision algorithms are dominated by low and intermediate level image processing operations, many of which are inherently parallel. This makes them amenable to a parallel hardware implementation on an FPGA, which has the potential to significantly accelerate the image processing component of a machine vision system.

On an FPGA system, each operation is implemented in parallel on a separate hardware component, allowing data to pass directly from one operation to another and significantly reducing or even eliminating the memory overhead. An FPGA implementation results in a smaller and significantly lower-power design that combines the flexibility and programmability of software with the speed and parallelism of hardware [7]. Hence, we choose an FPGA platform to rapidly prototype and evaluate our design methodology.

2.1 Evaluating FPGA with its advantages and disadvantages as a platform suitable for digital image processing applications

Benefits of FPGA: Several advantages make the FPGA a preferred choice, as it offers a convenient and flexible platform on which real-time machine vision systems can be implemented.

- In general, image processing algorithms require multiple iterative passes over data sets, as will be elaborated in the subsequent sections. On a general-purpose computer these are sequential operations requiring multiple passes, whereas in an FPGA they can be fused into one pass. An FPGA can operate on multiple image windows in parallel, as well as perform multiple operations within one window in parallel.
- Optimization techniques such as loop unrolling and loop fusion help to utilize the FPGA resources effectively while maintaining acceleration by removing many redundant operations.
- Any digital logic circuitry can be configured differently according to the need and the application at hand. Rapid prototyping is therefore possible, which helps to test any architectural design within a short time to market. Its software-like flexibility to reprogram and easy upgradeability allow solutions to evolve quickly.
- The inherently parallel configurable components and parallel programmable I/O of an FPGA allow it to read from, process and write to memory banks simultaneously. As a result, operations such as convolution, correlation and digital FIR filtering can be done much faster using pipelining and parallelism.
- The reconfigurability and reusability of FPGAs help to develop image processing IP cores, and thus to build cost-effective smart systems. These IPs can be integrated quickly without modification or repeated verification, which reduces the time to market and the non-recurring engineering (NRE) costs.
- The high logic and computational density of the FPGA, together with a low development cost, allows even the lowest-volume consumer electronics market to bear the development cost of an FPGA. They are useful for low-volume applications, unlike ASICs.
- Since a hardware description language is used for designing the RTL model, the design gains the flexibility and configurability of the FPGA together with the speed and parallelism of a hardware implementation [8].

Shortcomings of FPGA: The limitations of FPGAs encountered in image processing operations are noted below.

- Hardware supports inherently parallel operations by virtue of its architecture, and as a result offers much greater speed than software execution, but at the cost of increased development time and the specialized skill required of the design engineer.
- As an FPGA is used for product prototyping, its timing paths cannot be fixed and optimized in advance, since they change with programming. As a result it operates at a much lower clock speed than an ASIC.
- Since FPGAs are general purpose and programmable, they require a large chip (silicon) area and consume more power.
- Floating-point operations are not cost effective on an FPGA, and complex mathematical operations such as division and direct multiplication are also computationally expensive. It therefore remains a good choice for designers to reformulate their algorithms to avoid this complexity [9].

Nevertheless, the advantages outnumber the limitations, and the FPGA will continue to be a preferable choice for the designer community in the days to come.
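As a minimal sketch of the kind of reformulation meant in the last shortcoming, a division by a constant (for example a filter normalization factor) can be replaced by a multiplication with a precomputed fixed-point reciprocal followed by a right shift. The function names, the 16-bit fraction width and the example divisor 273 are illustrative assumptions and are not taken from our design.

```python
FRAC_BITS = 16  # assumed number of fractional bits for the stored reciprocal


def make_reciprocal(divisor: int, frac_bits: int = FRAC_BITS) -> int:
    """Precompute round(2**frac_bits / divisor) as an integer constant."""
    return ((1 << frac_bits) + divisor // 2) // divisor


def divide_by_constant(value: int, reciprocal: int, frac_bits: int = FRAC_BITS) -> int:
    """Approximate value // divisor with one multiply and one shift."""
    return (value * reciprocal) >> frac_bits


if __name__ == "__main__":
    divisor = 273  # hypothetical sum of kernel weights
    recip = make_reciprocal(divisor)
    # Compare against exact integer division over a range of accumulator values.
    worst = max(abs(divide_by_constant(v, recip) - v // divisor)
                for v in range(divisor * 256))
    print("reciprocal constant:", recip, "worst-case error:", worst)
```

The approximation trades a small, bounded error for the removal of a hardware divider; a wider fraction reduces the error at the cost of a wider multiplier.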

2.2 Algorithm to hardware design flow

The work flow graph in Fig. 1 shows the basic steps of implementing an image processing algorithm in hardware. The first step requires a detailed understanding of the algorithm and its subsequent software implementation. Secondly, the design should be optimized from both the algorithm viewpoint (e.g. using algebraic transforms) and the hardware viewpoint (using efficient storage schemes and adjusting fixed-point computation specifications). Finally, the overall evaluation in terms of speed, resource utilization and image fidelity decides whether additional adjustments to the design decisions are needed. Once this is done, FPGA-in-the-loop verification is carried out, which enables us to run the test cases faster. It also opens the possibility of exploring more test cases and performing extensive regression testing on our designs, ensuring that the algorithm will behave as expected in the real world. A good software design does not necessarily correspond to a good hardware design, which is why the steps shown in Fig. 1 should be followed.

Fig. 1. Algorithm to hardware design flow graph.

3 Background and Related Work

Since 2000 there has been a good amount of research on using the FPGA as a suitable prototyping platform for realizing image and video processing algorithms. Digital image processing algorithms are normally categorized into three types: low, intermediate and high level. Low-level operations are computationally intensive and operate on individual pixels and sometimes on their neighborhoods, involving geometric operations etc. [7]. Intermediate-level operations include the conversion of pixel data into different representations, such as histograms, segmentation and thresholding, and the operations related to these. High-level algorithms try to extract meaningful information from the image, such as object identification and classification. As we move up from low to high level operations there is an obvious decrease in the exploitable data parallelism, due to a shift from pixel data to more descriptive and informative representations. Here we focus on low-level (local filter) algorithms to show the capabilities of the FPGA for the computationally intensive tasks found in low and intermediate level operations. A well-known class of low-level computationally intensive tasks is image filtering based on convolution, on which several related research works have been carried out.

Paper [10] presents various hardware convolution architectures related to look-up tables (LUT), distributed arithmetic and Multiplierless Convolution (MC), and stresses the use of the MC architecture since it is simple to implement and the multiplication operation can be replaced by an addition operation. However, such a realization is possible if and only if each coefficient value is a power of 2, and it is only favorable for small convolution kernels, so it loses robustness.

Paper [11] presents various area-efficient 2D shift-variant convolution architectures. The authors propose some novel FPGA-efficient architectures for generating a moving window over a row-wise print path. Their moving window architectures include row major, column major and moving window with rotation stage variants. However, their main architectural drawback is the memory overhead, including an elevated memory bus bandwidth requirement, since multiple rows must be fetched from external memory while processing a single row. Secondly, more than one clock pulse is required for processing a single pixel.

Paper [12] presents three different architectures for dealing with filter kernels whose coefficient values vary. Their pipeline architecture and their convolve-and-gather architecture are worth noting. However, they suffer from some initial fixed redundant clock cycles used for buffering before the first convolution occurs, and from elevated pipeline complexity, which comes from the construction of various segments meant for varying filter kernel coefficients.

Paper [13] discusses a multiple-window partial buffering scheme for two-dimensional convolution. The buffering strategy shows a good balance between on-chip resource utilization and external memory bus bandwidth, suitable for low-cost FPGA implementation.

Paper [14] presents an optimized implementation of discrete linear convolution: a direct method of reducing convolution processing time with computational hardware implementing the discrete linear convolution of two finite-length sequences. The implementation is advantageous with respect to operation, power and area optimization. Their claim that the architecture is capable of computing a real-time image processing algorithm for a particular application is doubtful, since no validation results are given. Moreover, for convolvers of large size it is recommended to use dedicated DSP blocks, either as a hard core or from a software library, while designing the RTL for better performance.

Paper [15] presents a hardware architecture for 2D linear and morphological filtering applied to video processing applications. However, video processing algorithm verification should not be done over USB, since it is much slower than (point-to-point) Ethernet. Moreover, the authors use a much slower clock frequency (10 MHz) for processing, which is atypically low.

4 Hardware convolution architectures

The convolution equation is given by

$$ y[m,n] = \sum_{i}\sum_{j} h[i,j]\, x[m-i,\,n-j] \qquad (1) $$

where (m,n) are pixel positions, y[m,n] is the filtered output, h[m,n] denotes the filter response function and x[m,n] is the image to be filtered. The sums run over the filter window, whose size is denoted by [a,b] [16]. The process is illustrated in Fig. 2.
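As a software reference for equation 1, the following minimal sketch computes a direct 2-D convolution with zero padding at the image border. The function name, the list-of-lists image representation and the border handling are assumptions made for illustration; they do not describe the hardware implementation.

```python
def convolve2d(image, kernel):
    """Direct 2-D convolution with zero padding; output has the image size."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2  # centre of the filter window
    out = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # Convolution indexes the image at x[m - i, n - j],
                    # with i and j measured from the window centre.
                    r = m - (i - r_off)
                    c = n - (j - c_off)
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[i][j] * image[r][c]
            out[m][n] = acc
    return out


if __name__ == "__main__":
    impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Convolving a unit impulse reproduces the kernel itself.
    print(convolve2d(impulse, kernel))
```

Each output pixel of a P x P window costs P*P multiplications in this form, which is the cost the architectures in this section try to parallelize or reduce.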

Fig. 2. Working procedure of a sliding window architecture.

Fig. 3. Complete parallel hardware architecture of a 3x3 filter kernel implementation, shown for simplicity. The kernel mask actually implemented is 5x5.

Here we discuss five convolution hardware architectures: the fully parallel architecture, an optimized version with MAC FIR filters, two separable kernel architectures, and a pipelined architecture capable of removing some redundant operations. All of them have been designed to implement equation 1.

Fig. 3 shows the buffer lines, which store the image pixels prior to convolution and thereby save the additional time needed to fetch them from external memory. Instead of sliding the kernel over the image, this technique feeds the image through the window. This very common architecture uses 2 buffer lines together with some memory registers to load a 3x3 neighborhood. The convolution operation needs 9 multiplications and 8 additions, which makes this the generic architecture with the highest complexity. It computes a new output pixel at every clock cycle after an initial delay, but consumes more resources.

In Fig. 4, the buffer line consists of a single-port RAM, as shown in unit (2.a) of Fig. 4; the counter in it is incremented to write the current pixel data and to read it subsequently. The output of each of the five buffers of unit-1 connects to the respective input of unit-2. Each of the five parallel sub-circuits of unit-2 consists of a MAC FIR engine; one such unit is shown in detail in unit-2.a of Fig. 4, depicting the ASR (Addressable Shift Register) that implements the input delay buffer. The address port runs n times faster than the data port, where n is the number of filter taps. The ROM and ASR addresses are produced by the counter, which counts from 0 to n-1 and then repeats. Pipeline registers r0 to r2 increase performance. A capture register is required for streaming operation, and a down sampler reduces the capture register sample period to the output sample period. The filter coefficients are stored in ROM. The five outputs of the five MAC engines are sequentially added to get the result, whose absolute value is computed before the data is narrowed to 8 bits. The blue colored block is elaborated in unit-2.b (Fig. 4) as the multiply-accumulate (MAC) engine. Enabling the 'Pipeline to Greatest Extent Possible' mask configuration parameter ensures that the internal pipeline stages of the dedicated multipliers are used [17]. The yellow box is elaborated in unit 2.c (Fig. 4); it calculates the absolute value before multiplying with the scaling factor, which is the sum of the weights of the filter coefficients.

This architecture has the advantage of using very few hardware resources, but it needs 5 clock cycles to process each pixel, because the underlying 5-tap MAC FIR filters are clocked 5 times faster than the input rate. The throughput of the design is therefore 100 MHz/5 = 20 million pixels per second. For a 64x64 image this is 20x10^6/(64x64) = 4883 frames/sec. For our experiment the image size is 150x150, which gives 889 frames/sec.

For linear operations, convolution has some useful properties such as commutativity. A separable PxP kernel can therefore be redefined as the convolution of a Px1 kernel (Q1) with a 1xP kernel (Q2). As a result the equation can be formulated as

$$ I * Q_1 * Q_2 = I * Q_2 * Q_1 \qquad (2) $$

Fig. 5 and Fig. 6 implement the right-hand and left-hand sides of equation 2 respectively, and together show the design of the separable convolution kernel architecture. In Fig. 5 the column convolution is carried out in the first section of the hardware, before the row buffering scheme. The row buffering is shown in the detailed architecture in unit 1.a of Fig. 4, as explained previously, and the row convolution in unit 4.a of Fig. 4. The partially processed pixels from the column convolution are passed through the row convolution section to obtain the filtered pixel. The design is capable of processing (100x10^6)/(256x256) = 1526 frames/sec for a 256x256 image, where 100 is the frequency of the FPGA board in MHz, and (100x10^6)/(150x150) = 4444 frames/sec for a 150x150 image.
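The frame-rate figures quoted in this section follow from the clock frequency, the number of clock cycles spent per pixel and the image size. The short sketch below reproduces that arithmetic; the function name is hypothetical and start-up latency is ignored, as in the figures above.

```python
def frames_per_second(clock_hz, cycles_per_pixel, width, height):
    """Frame rate of a streaming design, ignoring start-up latency."""
    pixels_per_second = clock_hz / cycles_per_pixel
    return pixels_per_second / (width * height)


if __name__ == "__main__":
    CLOCK_HZ = 100e6  # board clock used in the text
    cases = (("MAC FIR (Fig. 4)", 5, 64), ("MAC FIR (Fig. 4)", 5, 150),
             ("separable (Fig. 5, 6)", 1, 256), ("separable (Fig. 5, 6)", 1, 150))
    for name, cycles, size in cases:
        fps = frames_per_second(CLOCK_HZ, cycles, size, size)
        print(f"{name}: {size}x{size} -> {fps:.0f} frames/sec")
```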

This architecture is capable of processing 1 pixel/clock cycle, and its complexity is reduced from O(N^2) for normal convolution, as discussed, to O(2N).
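The equivalence in equation 2 and the reduction in multiplications can be checked in software. The sketch below is an illustration under stated assumptions (zero padding at the borders, an integer 5-tap kernel [1, 4, 6, 4, 1] chosen only as an example, hypothetical function names); it compares a column pass followed by a row pass with the full 2-D convolution using the outer-product kernel.

```python
def conv_columns(image, q1):
    """Convolve every column with the P x 1 kernel q1 (zero padding)."""
    rows, cols, off = len(image), len(image[0]), len(q1) // 2
    return [[sum(q1[i] * image[m - (i - off)][n]
                 for i in range(len(q1)) if 0 <= m - (i - off) < rows)
             for n in range(cols)] for m in range(rows)]


def conv_rows(image, q2):
    """Convolve every row with the 1 x P kernel q2 (zero padding)."""
    rows, cols, off = len(image), len(image[0]), len(q2) // 2
    return [[sum(q2[j] * image[m][n - (j - off)]
                 for j in range(len(q2)) if 0 <= n - (j - off) < cols)
             for n in range(cols)] for m in range(rows)]


def conv_full(image, kernel):
    """Reference P x P convolution (zero padding), P*P multiplies per pixel."""
    rows, cols = len(image), len(image[0])
    r_off, c_off = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            for i, k_row in enumerate(kernel):
                for j, k in enumerate(k_row):
                    r, c = m - (i - r_off), n - (j - c_off)
                    if 0 <= r < rows and 0 <= c < cols:
                        out[m][n] += k * image[r][c]
    return out


if __name__ == "__main__":
    q = [1, 4, 6, 4, 1]                           # example 1-D kernel
    kernel = [[a * b for b in q] for a in q]      # separable 5 x 5 kernel
    image = [[(3 * r + 5 * c) % 11 for c in range(7)] for r in range(6)]
    same = conv_rows(conv_columns(image, q), q) == conv_full(image, kernel)
    print("separable result equals full 2-D result:", same)
    print("multiplies per pixel:", len(q) ** 2, "vs", 2 * len(q))
```

For a 5 x 5 kernel the separable form needs 10 multiplications per pixel instead of 25, which is the saving the architectures in Fig. 5 and Fig. 6 exploit.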

The architecture in Fig. 7 requires only five multiplications and two 4-operand additions; in other words, it removes redundant operations. In contrast to the other designs, it has three multiply-add pipelines, which allow it to operate with three mask columns. Note that this architecture selects five predefined input operands for the output adder (see the input connections of this adder in Fig. 7). This architecture also processes 1 pixel/clock cycle.

Note that the architecture shown in Fig. 4 needs 5 clock cycles to process 1 pixel, as shown in the timing diagram in Fig. 8. The remaining architectures, in Figures 3, 5, 6 and 7, process 1 pixel/clock cycle, as shown in the timing diagrams in Fig. 12, 9, 10 and 11 respectively. The hardware resource utilization of the architectures discussed in Section 4 is shown in Table 1.

5 Results and Timing Diagram

The hardware architectures have been applied to verify an edge-preserving bilateral filter, which involves the execution of multiple convolution operations in a parallel, pipelined fashion. The denoised images are shown in Fig. 13 and 14, which give the filter output for an image size of 150x150 with additive Gaussian noise. The filter settings are σs=20, σr=50 and σ=12, where σs and σr are the domain and range kernel standard deviations and σ is the standard deviation of the additive white Gaussian noise.

Some considerations remain when planning to implement complex image processing algorithms in real time. One such issue is that each frame of a video sequence must be processed within 33 ms in order to achieve a speed of 30 frames per second (fps). In order to make correct design decisions, we use a well-known standard formula

given by:

$$ t_{frame} = \frac{C}{f} = \frac{1}{f}\left(\frac{M \cdot N}{n_{core}\, t_p} + \xi\right) \qquad (3) $$

where t_frame is the processing time for one frame, C is the total number of clock cycles required to process one frame of M pixels, f is the maximum clock frequency at which the design can run, n_core is the number of processing units, t_p is the pixel-level throughput with one processing unit (0 < t_p ≤ 1), N is the number of iterations in an iterative algorithm and ξ is the overhead (latency) in clock cycles for one frame [3].

Fig. 4. Hardware blocks showing the filtering hardware architecture of a 5x5 filter kernel implementation [18].

We have tested the convolution architectures discussed above on a single-image filtering application and have calculated the processing time using the well-known equation 3 [3].
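The frame-time calculation can be reproduced with a few lines of code. The sketch below assumes that equation 3 has the form t_frame = (M*N/(n_core*t_p) + latency)/f, which is consistent with the symbol definitions and with the numbers reported in this section; the function and parameter names are hypothetical.

```python
def frame_time(m_pixels, n_iter, t_p, n_core, latency_cycles, f_hz):
    """Processing time of one frame in seconds (assumed form of equation 3)."""
    cycles = m_pixels * n_iter / (t_p * n_core) + latency_cycles
    return cycles / f_hz


if __name__ == "__main__":
    # Values used in the text: 150 x 150 image, one pass, one pixel per clock,
    # 350 cycles of latency, 100 MHz clock, one processing unit.
    t_hw = frame_time(150 * 150, 1, 1.0, 1, 350, 100e6)
    t_sw = 0.008  # measured software execution time reported in the text
    print(f"hardware frame time: {t_hw * 1e3:.4f} ms")
    print(f"real-time budget at 30 fps: {1e3 / 30:.1f} ms")
    print(f"speed-up over software: {t_sw / t_hw:.1f}x")
```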

For a 150 x 150 resolution image, M = 22500, N = 1, t_p = 1 (i.e. one pixel processed per clock pulse), ξ = 350 (the latency in clock cycles), f = 100 MHz and n_core = 1. Therefore t_frame = 0.00022 seconds ≈ 0.2 ms, which is much less than the 33 ms available per frame at real-time video rate. We measured the same execution in software and it took 0.008 second. The acceleration in hardware is therefore 0.008/0.00022 ≈ 36x (about 40x if t_frame is rounded to 0.2 ms). From Table 1 it is clear that the architectures in Fig. 5, 6 and 7 are the most suitable with respect to resource usage. We have also measured the power consumption of the individual hardware architectures, as shown in Table 2. From the data it is clear that the normal convolution hardware in Fig. 4 and the separable hardware architectures in Fig. 5 and 6 consume the least power.

Fig. 5. Hardware blocks showing the filtering hardware architecture for separable kernel. Right hand side of Eqn. 2.

6 Discussions and Future Directions

In this paper we have briefly discussed our motivation for realizing computer vision algorithms in hardware and have presented various efficient convolution architectures. They give almost identical results, with only minute changes in the PSNR of the filtered output images obtained by applying Gaussian filtering to a noisy image, as shown in Fig. 13. We have also tested our architectures on a particular edge-preserving algorithm, where they produced good results (with enhanced PSNR, as shown in Fig. 13). It has been shown that the Xilinx System Generator (XSG) environment can be used to develop hardware-based computer vision algorithms from a system-level approach, making it suitable for developing co-design environments. We have also used FPGA-in-the-loop (FIL) verification [19] to verify our design. This approach also ensures that the algorithm will behave as expected in the real world. In future we need to explore more high-level techniques and approaches to circuit optimization with energy efficiency.
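For reference, the PSNR used to compare the filtered outputs can be computed as in the following sketch, which uses the standard definition for 8-bit images. The function name and the list-of-lists image representation are assumptions for illustration and do not describe the tooling used for our measurements.

```python
import math


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_px, test_px in zip(ref_row, test_row):
            squared_error += (ref_px - test_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    clean = [[100, 110], [120, 130]]
    noisy = [[101, 108], [123, 130]]
    print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```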

Table 1. DEVICE UTILIZATION OF THE VARIOUS OPTIMIZED HARDWARE ARCHITECTURES FOR IMAGE SIZE 150x150 FOR VIRTEX 5 LX110T OpenSPARC EVALUATION PLATFORM

Utilization for image size 150x150; percentages are of the available resources.

Resource (available)        Normal convolution   Fully parallel          SSDC* hardware   Architecture
                            hardware (Fig. 4)    architecture (Fig. 3)   (Fig. 5 and 6)   in Fig. 7
Occupied slices (17,280)    525 (4%)             1586 (9%)               623 (4%)         740 (4%)
Slice LUTs (69,120)         1062 (2%)            2922 (4%)               1593 (3%)        1595 (2%)
Block-RAM/FIFO (148)        7 (5%)               6 (4%)                  6 (4%)           6 (4%)
Flip Flops (69,120)         4041 (6%)            4042 (6%)               810 (2%)         1890 (3%)
IOBs (640)                  1 (1%)               1 (1%)                  1 (1%)           1 (1%)
Mults/DSP48s (64)           5 (8%)               0 (0%)                  0 (0%)           0 (0%)
BUFGs/BUFCTRLs (32)         2 (6%)               2 (6%)                  2 (6%)           2 (6%)

*SSDC = Separable Single Dimensional Convolution

Table 2. POWER CONSUMPTION OF THE VARIOUS OPTIMIZED HARDWARE ARCHITECTURES FOR IMAGE SIZE 150x150 FOR VIRTEX 5 LX110T OpenSPARC EVALUATION PLATFORM

Power consumption for image size 150x150.

Architecture                                   Static power (W)   Dynamic power (W)   Total power (W)
Normal convolution hardware in Fig. 4          0.703              0.041               0.744
Separable hardware architecture in Fig. 5, 6   0.702              0.025               0.728
Architecture in Fig. 7                         1.188              0.072               1.26
Fully parallel architecture in Fig. 3          1.188              0.068               1.26


Fig. 6. Hardware blocks showing the filtering hardware architecture for separable kernel. Left hand side of Eqn. 2.

Acknowledgment

This work has been supported by the Department of Science and Technology, Govt. of India under grant No. DST/INSPIRE FELLOWSHIP/2012/320, as well as by a grant from TEQIP phase 2 (COE), University of Calcutta for the experimental equipment. The authors wish to thank Dr. Kunal Narayan Chaudhury for his help with some of the theoretical aspects.

References

1. Gribbon, K. Bailey, D. Johnston, C.: Design Patterns for Image Processing Algo-

rithm Development on FPGAs.TENCON 2005 - 2005 IEEE Region 10 Conference doi: 10.1109/TENCON.2005.301109 147, 1-6 (2005).

2. Li, Ye Yao, Qingming Tian, Bin Xu, Wencong: Fast double-parallel image pro-

cessing based on FPGA:Proceedings of 2011 IEEE International Conference on Vehicular Electronics and Safety pp. 97-102. doi: 10.1109/ICVES.2011.5983754

(2011) 3. Wenqian Wu and Acton, S.T. and Lach, J, Real-Time Processing of Ultra-sound

Images with Speckle Reducing Anisotropic Di usion. Fortieth Asilomar Conference on Signals, Systems and Computers, 2006. ACSSC '06, pp:1458-1464,doi=10.1109/ACSSC.2006.355000, 2006.

15 Fig. 7. An optimized convolution architecture developed to work with kernels like Gaussian, high pass filters, point and line detection etc.


Fig. 8. Simulation results showing the time interval taken to process the image pixels for the normal convolution hardware architecture in Fig. 4, where 5 clock pulses are needed to process each pixel. Each clock pulse lasts 10 ns.

Fig. 9. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns, and each pixel requires one clock pulse to process. This timing diagram is followed by all the architectures except for Fig. 4. It implements the right-hand side of Eqn. 2.

Fig. 10. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns, and each pixel requires one clock pulse to process. This timing diagram is followed by all the architectures except for Fig. 6. It implements the left-hand side of Eqn. 2.
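The per-pixel clock counts quoted in these timing captions translate directly into a per-frame processing time. The following minimal Python sketch works through that arithmetic for the 150x150 test image, assuming the 10 ns clock period stated in the captions and ignoring pipeline fill latency and line-buffer priming (an assumption made here for simplicity). The function and constant names are illustrative.

```python
# Values taken from the captions: 10 ns clock period, 150x150 image.
CLOCK_PERIOD_NS = 10
IMAGE_WIDTH = 150
IMAGE_HEIGHT = 150


def frame_time_us(clocks_per_pixel, width=IMAGE_WIDTH, height=IMAGE_HEIGHT,
                  period_ns=CLOCK_PERIOD_NS):
    """Steady-state time to process one frame, in microseconds.

    Assumption: start-up latency (pipeline fill, line buffers) is ignored.
    """
    pixels = width * height
    return pixels * clocks_per_pixel * period_ns / 1000.0


normal_us = frame_time_us(5)     # normal convolution hardware: 5 clocks per pixel
pipelined_us = frame_time_us(1)  # architectures needing 1 clock per pixel
speedup = normal_us / pipelined_us

print(f"5 clocks/pixel: {normal_us:.0f} us per frame")
print(f"1 clock/pixel:  {pipelined_us:.0f} us per frame")
print(f"speed-up:       {speedup:.1f}x")
```

Under these assumptions a frame takes 1125 us at 5 clocks per pixel and 225 us at 1 clock per pixel, a five-fold difference.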

Fig. 11. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns, and each pixel requires one clock pulse to process. This timing diagram is followed by all the architectures except for Fig. 7.

Fig. 12. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns, and each pixel requires one clock pulse to process. This timing diagram is followed by all the architectures except for Fig. 3, which is a fully parallel architecture.

4. Reg Zatrepalek, Hardent Inc.: Using FPGAs to solve tough DSP design challenges, 23rd July 2007. http://www.eetimes.com/document.asp?piddl_msgpage=2&doc_id=1279776&page_number=1

5. J.A. Kalomiros; J. Lygouras: Design and evaluation of a hardware/software FPGA-based system for fast image processing. Microprocessors and Microsystems, vol. 32, issue 2, pp. 95-106 (2008).

6. A. E. Nelson: Implementation of image processing algorithms on FPGA hardware, May 2000. http://www.isis.vanderbilt.edu/sites/default/files/Nelson_T_0_0_2000_Implementa.pdf

7. D. Bailey: Machine Vision Handbook (2012). doi: 10.1007/978-1-84996-169-1, ISBN: 978-1-84996-168-4.

8. Daggu Venkateshwar Rao, et al.: Implementation and Evaluation of Image Processing Algorithms on Reconfigurable Architecture using C-based Hardware Descriptive Languages. Available: www.gbspublisher.com/ijtacs/1002.pdf

9. Kuon, Ian; Tessier, Russell; Rose, Jonathan: FPGA Architecture: Survey and Challenges, pp. 135-253 (2007). doi: 10.1561/1000000005

10. Wiatr, K.; Jamro, E.: Implementation image data convolutions operations in FPGA reconfigurable structures for real-time vision systems. Proceedings International Conference on Information Technology: Coding and Computing (Cat. No.PR00540), pp. 152-157. doi: 10.1109/ITCC.2000.844199


Fig. 13. Gaussian-filtered output for an image of size 150x150, applied to a noisy image with noise variance σ² = 0.005. Filter setting: σ_s = 20 (domain kernel standard deviation). The filtered images (a), (b), (c), (d) and (e) correspond to the architectures shown in Figs. 4, 5, 6, 7 and 3, respectively.

Fig. 14. Filter output for a checkerboard image of size 150x150 corrupted by additive Gaussian noise. Filter settings: σ_s = 20 and σ_r = 50, with σ = 12 for the additive Gaussian noise [18].
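The Gaussian filter of Fig. 13 is produced by both the direct and the separable architectures, which is possible because a 2-D Gaussian kernel factors into a row pass followed by a column pass. The following minimal Python sketch illustrates that equivalence in software. It uses σ_s = 20 from the caption, but the kernel radius, the small test image and all function names are assumptions chosen for illustration; it is not a model of the hardware itself.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian taps for offsets -radius..radius."""
    taps = [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def convolve_2d(image, kernel):
    """Direct 2-D filtering over the valid region (no border padding).
    The kernel is symmetric, so correlation and convolution coincide."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def convolve_separable(image, taps):
    """Row pass followed by column pass with the same 1-D taps."""
    n = len(taps)
    horiz = [[sum(taps[j] * row[c + j] for j in range(n))
              for c in range(len(row) - n + 1)]
             for row in image]
    return [[sum(taps[i] * horiz[r + i][c] for i in range(n))
             for c in range(len(horiz[0]))]
            for r in range(len(horiz) - n + 1)]


# Hypothetical 6x6 test image and a 3-tap kernel (radius 1 is an assumption).
image = [[(3 * r + 5 * c) % 17 for c in range(6)] for r in range(6)]
taps = gaussian_kernel_1d(sigma=20.0, radius=1)
kernel_2d = [[a * b for b in taps] for a in taps]  # outer product of the taps

full = convolve_2d(image, kernel_2d)
sep = convolve_separable(image, taps)
max_diff = max(abs(full[r][c] - sep[r][c])
               for r in range(len(full)) for c in range(len(full[0])))

n = len(taps)
print(f"max difference between direct and separable outputs: {max_diff:.2e}")
print(f"multiplications per output pixel: direct {n * n}, separable {2 * n}")
```

For an n-tap kernel the direct form needs n x n multiplications per output pixel while the separable form needs 2n, which is the saving that the separable hardware exploits.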

11. Cardells-Tormo, F.; Molinet, P.: Area-efficient 2-D shift-variant convolvers for FPGA-based digital image processing. IEEE Workshop on Signal Processing Systems Design and Implementation, 2005, pp. 209-213. doi: 10.1109/SIPS.2005.1579866


12. Sriram, Vinay; Kearney, David: A FPGA implementation of variable kernel convolution, pp. 105-109 (2007). doi: 10.1109/.45

13. Hui Zhang; Mingxin Xia; Guangshu Hu: A Multiwindow Partial Buffering Scheme for FPGA-Based 2-D Convolvers, vol. 54, issue 2, pp. 200-204 (2007).

14. Mohammad, Khader; Agaian, Sos: Efficient FPGA implementation of convolution, issue: October, pp. 3478-3483 (2009).

15. Ramírez, Juan Manuel; Flores, Emmanuel Morales; Martínez-Carballido, Jorge; Enriquez, Rogerio: An FPGA-based Architecture for Linear and Morphological Image Filtering, issue 3, pp. 90-95 (2010).

16. Rafael C. Gonzalez; Richard E. Woods: Digital Image Processing, 3rd Edition. Pearson (2008), ISBN-13: 9788131726952.

17. James Hwang; Jonathan Ballagh: Building Custom FIR Filters Using System Generator. Springer Berlin Heidelberg, 2002, series vol. 2438, pp. 1101-1104.

18. Chandrajit Pal; K.N. Chaudhury; Asit Samanta; Amlan Chakrabarti; Ranjan Ghosh: Hardware software co-design of a fast bilateral filter in FPGA. India Conference (INDICON), 2013 Annual IEEE, pp. 1-6, ISBN: 978-1-4799-2274-1. doi: 10.1109/INDCON.2013.6726034

19. www.mathworks.com/products/hdl-verifier
