
A digital array based bit serial processor for arbitrary window size kernel convolution in vision sensors

Mehdi Habibi*, Alireza Bafandeh, Muhammad Ali Montazerolghaem

Department of Electrical Engineering, University of Isfahan, Isfahan, Iran

Article info

Article history:
Received 5 February 2013
Received in revised form 29 November 2013
Accepted 29 November 2013

Keywords: Array-based; Bit serial; Vision sensors; Kernel convolution; Digital processing

Abstract

The high speed and in-pixel processing of image data in smart vision sensors is an important solution for real time machine vision tasks. Diverse architectures have been presented for array based kernel convolution processing, many of which use analog processing elements to save space. In this paper a digital array based bit serial architecture is presented to perform certain image filtering tasks in the digital domain and hence gain higher accuracies than the analog methods. The presented method benefits from more diverse convolution options, such as arbitrary size kernel windows, compared with the digital pulse based approaches. The proposed digital cell structure is compact enough to fit inside an image sensor pixel. When incorporated in a vision chip, resolutions of up to 12 bit accuracy can be obtained in kernel convolution functions with 35 × 28 μm² layout area usage per pixel in a 90 nm technology. Still higher accuracies can be obtained with larger pixels. The power consumption of the approach is approximately 10 nW/pixel at a frame rate of 1 kfps.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

With the ever increasing demand for complex vision systems, the use of powerful processors with higher processing speed and lower power dissipation is necessary. In-pixel digital processing solutions are advantageous compared with external digital image processors for two main reasons: the processing speed can be increased significantly and the dynamic power dissipation can be reduced considerably. The reduction of dynamic power is possible since, with multiple in-pixel digital processors, the clock frequency can be reduced. With a lower clock frequency the supply voltage can in turn be lowered, resulting in an overall reduction of dynamic power dissipation [1].

CMOS image sensors have the ability to perform all or part of the required processing inside each pixel using active MOSFET devices [2]. These types of image sensors with processing capabilities are known as vision chips. Since image processing and especially machine vision tasks are diverse, usually only low level and fundamental tasks are implemented in the sensor structure. Kernel convolution is one of the most important low level image processing functions used in machine vision tasks. It is effective in noise removal, edge detection, image sharpness and softness adjustment, directional filters, image compression and many other applications.

Different vision chips have been presented that perform processing using analog photodiode data [3]. However, analog processing circuits usually lack accuracy since the current and voltage parameters typically vary by 1–5% of the full swing value depending on the device geometry [4]. Thus most vision chips which perform some type of in-pixel signal processing in the analog domain have a pixel accuracy of approximately 7–8 bits [5,6].

If the pixel data is prepared in binary form, the digital binary processing of data can be accomplished without the drawbacks of the analog approach. To this day, many image sensors with in-pixel ADC structures have been presented to provide high speed and accurate data output [7–9]. Furthermore, using multi-tier chip technology, the combination of a digital pixel sensor and FPGA die can result in programmable vision sensors without sacrificing the pixel fill factor [10–12]. An important issue is that it is difficult for analog circuits to benefit from device scaling due to device mismatches and the resulting accuracy issues. Device scaling is, however, advantageous in the digital approaches since it reduces the circuit size and allows the processor to be integrated inside each pixel [13,14].

It should be noted that digital pulse based schemes have been presented that are appropriate and effective for dedicated image processing tasks such as cellular neural vision systems [15]. The sensors presented in [16,17] can perform addition and multiplication directly on pulse trains by using pulse addition or pulse division (reduction). The direct processing of pulse trains suffers from the drawback that the convolution options are relatively limited and kernel convolution windows larger than 3×3 pixels are relatively cumbersome to implement. This is a limiting factor in adaptive machine vision tasks where the kernel size needs to be adjusted over a wide dynamic range.

INTEGRATION, the VLSI journal

0167-9260/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.vlsi.2013.11.007

* Corresponding author. E-mail address: [email protected] (M. Habibi).

In this paper a kernel convolution vision chip is presented which can process image data and perform desired filtering functions in real time. The processing is performed in the digital domain and hence higher accuracy results are obtained. The presented processing elements are compact enough to fit inside each pixel of the corresponding image sensor and thus the processing is carried out in parallel among all pixels. This results in a high processing speed, especially in large array size sensors. To obtain a compact design, bit serial processing is used inside each pixel. With the given architecture, arbitrary size window kernel convolution is possible with just a simple routing and a few bits of interconnect between adjacent pixels. This is in contrast to most programmable kernel convolution image sensors, where the size of the kernel window is usually limited to 3×3 cells.

The rest of the paper is organized as follows: In Section 2 the basic principle of kernel convolution is explained. The proposed architecture is presented in Section 3. The functionality, power dissipation and processing speed analyses of the structure are given in Section 4. In Section 5, simulation results are presented and the effectiveness of the approach is shown. Some concluding remarks are given in the final section.

Fig. 1. (a) Basic kernel convolution principle. (b) Different image filtering functions implemented by different convolution kernels.

Fig. 2. Different hardware solutions used for kernel convolution. The processor is based on (a) pixel based analog multipliers, (b) pixel based switch capacitor blocks, (c) digital event based computation block and (d) digital row based computation block.

2. Image kernel convolution

The kernel convolution procedure involves the weighted sum computation of each pixel and its neighbors at a specific distance (usually nearest neighbors). The task is performed by multiplication and summation of the 2D matrix known as the kernel on each of the image pixels and its neighbors, as shown in Fig. 1(a), where $P_{ij}$ is the original pixel value at location (i, j), $a_{nm}$ are the 3×3 kernel coefficients and $P'_{ij}$ is the filtered image pixel value at location (i, j). Different kernel values can perform different image processing functions. Some different image filtering functions are illustrated in Fig. 1(b). As can be seen in the figure, the Gaussian kernel can be used to remove high frequency components of the data and thus soften the image. The Laplacian kernel can extract high frequency image edge data. Since noise data also contains high frequency components, the kernel obtained by the combination of Laplacian and Gaussian kernels (known as LOG or Laplacian of Gaussian) is more effective for edge data extraction. While specific kernels can directly extract edges, the Gaussian kernel itself can also be used for this purpose. This can be performed by subtracting two Gaussian filtered images with different variance values.
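The weighted sum described above can be sketched in software as follows. This is only an illustrative reference model, not the hardware of this work: the function name `convolve3x3`, the choice to leave border pixels at zero and the sample values are assumptions made for the example. The kernel is applied without flipping, which gives the same result as convolution for the symmetric kernels discussed here.

```python
def convolve3x3(image, kernel):
    """Weighted sum of every interior pixel and its 8 nearest neighbors.

    Border pixels are left at 0 (an assumption made for this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = 0
            for n in range(3):
                for m in range(3):
                    # a_nm multiplied by the neighbor at offset (n-1, m-1)
                    acc += kernel[n][m] * image[i + n - 1][j + m - 1]
            out[i][j] = acc
    return out


# Hypothetical 3x3 image window and an unnormalized smoothing kernel
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
smooth = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]
result = convolve3x3(image, smooth)
center = result[1][1]          # unnormalized weighted sum
normalized = center / 16       # divide by the sum of the kernel coefficients
```

The division by the coefficient sum corresponds to a normalization step applied after the integer weighted sum.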

The conceptual architectures of some different kernel convolution processors are shown in Fig. 2. As depicted in Fig. 2(a), the approach used in [6,16,18] calculates its output result by using analog current multipliers to multiply neighbor signal values with the desired coefficients and finally sum up the products. The obtained result itself is an analog signal, except in [16] where the analog signal is subsequently converted into a pulse train. The approach in Fig. 2(b) uses analog switch-capacitor blocks to perform signal amplification and summation, thus providing the required product and sum functionalities for kernel convolution [3]. Another method, shown in Fig. 2(c), performs kernel convolution based on pixel events [19]. In this method, whenever a pixel needs to clarify its output value, an event is generated. The digital event based processor then calculates the resulting value of that pixel considering the state of its neighbor cells and produces appropriate event based output results. This method essentially evaluates the kernel convolution result one pixel at a time and the processing is not carried out at all pixels simultaneously. Similarly, in [20] a 2D processor is used to load each pixel and its neighbor data into a temporary storage area and evaluate the corresponding result. Again the processing is performed one pixel at a time. In Fig. 2(d) a digital row based processor is used to evaluate the output results row by row [21]. Since processing is carried out one row at a time, the computation speed will be higher than the methods which perform convolution one pixel at a time but slower than the methods where the processing is carried out at all pixels simultaneously.

Fig. 3. Conceptual block diagram of three adjacent row cells in the (a) horizontal and (b) vertical convolution configurations.

3. Proposed structure

The hardware presented in this work can be considered as an array processor. This essentially means that every pixel of the smart vision sensor is associated with a corresponding processor. In the proposed structure the kernel convolution is commenced at all pixels simultaneously and in parallel fashion. The parallel processing of data has the advantages of higher processing speed and lower power consumption. In contrast to analog vision chips, in this work digital hardware is incorporated in the design to increase accuracy. Since the amount of available hardware area in each pixel is relatively limited, the digital hardware cannot be too complex. Thus in the proposed design, two dimensional kernel convolution is performed by first initiating a one dimensional horizontal convolution and then continuing the procedure with a vertical convolution. For the multiplication steps, “Left Shift” functions are used to provide coefficients of 1, 2, 4 and 8. In practice a final normalization step will be necessary to provide the actual required floating-point coefficients. The summation procedure is accomplished by two full adders and in bit-serial style within every pixel. Although this procedure does not cover all possible kernel matrices, a wide range of symmetric options can be realized by the technique. In the analysis section the kernels producible by this approach will be discussed.

Fig. 4. (a) The regular interconnection required between adjacent neighbor pixels. (b) The complete structure of a single pixel.

Fig. 5. Internal structures of the pixel processor. (a) Bit serial adder, (b) 3 bit line delay adjustment stage, (c) Result shift register, (d) Upper/lower (Left/Right) shift register.

Fig. 6. Control lines function and their purpose in the kernel processing.

The main idea behind the approach is to use three shift registers within every cell. For horizontal kernel convolution, the digital data is shifted to the left and right throughout the pixel array using two of these shift registers. In this way, the data of each pixel will be accessible to all other pixels (depending on the convolution window size) without any complex digital bus interconnects. Furthermore, large size convolution windows can be easily implemented, an advantage that is hard to find in previous vision chip designs. The third shift register is named the result shift register, which stores the final filtered result related to the corresponding pixel. The same three shift registers are used in vertical convolution for shifting the data towards the upper and lower cells and also for storing the final result value.
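As a rough behavioral illustration of the horizontal-then-vertical procedure with shift-based coefficients, the following sketch may help. It is a word-level software model under stated assumptions, not the bit-serial circuit itself: the names `pass_1d` and `separable_convolve`, the zero padding at the array borders and the encoding of coefficients as left-shift amounts (0, 1, 2, 3 for coefficients 1, 2, 4, 8) are all choices made for the example.

```python
def pass_1d(line, shifts):
    """One 1D convolution pass along a row or column.

    shifts[0] is the left-shift amount of the center tap and shifts[k]
    that of the two taps at distance k (symmetric kernel).
    Missing neighbors at the borders are treated as 0 (assumption).
    """
    n = len(line)
    out = []
    for i in range(n):
        acc = line[i] << shifts[0]
        for k in range(1, len(shifts)):
            left = line[i - k] if i - k >= 0 else 0
            right = line[i + k] if i + k < n else 0
            acc += (left + right) << shifts[k]
        out.append(acc)
    return out


def separable_convolve(image, row_shifts, col_shifts):
    """Horizontal pass on every row, then vertical pass on every column."""
    horiz = [pass_1d(row, row_shifts) for row in image]
    cols = [pass_1d(list(col), col_shifts) for col in zip(*horiz)]
    return [list(r) for r in zip(*cols)]


# Hypothetical data: coefficients [1 2 1] in both directions
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filtered = separable_convolve(image, row_shifts=[1, 0], col_shifts=[1, 0])
```

With these settings the center pixel receives the same unnormalized sum as a direct 3×3 convolution with the kernel obtained from [1 2 1] in both directions.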

Fig. 7. Possible (a) 3×3 and (b) 5×5 cell wide kernels which can be processed by the given approach.

Fig. 8. Number of clock cycles required per frame to process different kernel convolutions. (Axes: clock count versus kernel size; legend: bit elimination, processing phase.)

In Fig. 3(a) the conceptual block diagram of three adjacent row cells is shown in the horizontal convolution configuration. Initially the pixel data is placed on all three shift registers at each pixel. The data is shifted to the left and right adjacent cells using two of the shift registers. For example, in a 3×3 kernel window the data of each pixel has to be shifted completely only to the adjacent cell. For larger kernel windows the shifting has to be continued to reach cells further away. As the data is being shifted, the bit serial addition is performed at each pixel and the previous value of the result shift register is summed up with the content of the two shift left and right registers. Fig. 3(b) shows how the same registers are used in the vertical convolution step configuration. In this step, the upper and lower shift registers are loaded with the pixel's product value while the result shift register preserves its content from the horizontal shift phase.
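The bit serial addition mentioned above can be modeled as a full adder that consumes one bit of each operand per clock, least significant bit first, and keeps its carry between clocks. The sketch below is a behavioral illustration only; the helper names, the 15 bit word length used in the example and the way two adder stages are chained to sum three registers are assumptions for the example, not a description of the circuit in Fig. 5.

```python
def to_bits(value, length):
    """Fixed-length word as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(length)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams with a single full adder.

    One loop iteration models one clock; the carry is held between
    clocks. The final carry is dropped, as in a fixed-length register.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


# Hypothetical register contents, 15 bit words
L = 15
left, right, result = 100, 23, 200
neighbors = bit_serial_add(to_bits(left, L), to_bits(right, L))
new_result = from_bits(bit_serial_add(to_bits(result, L), neighbors))
```

Because only one bit per operand is handled per clock, the adder hardware does not grow with the word length; only the registers do.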

Fig. 4(a) shows the regular interconnection required between the neighbor pixels. It should be noted that the interconnections between the pixels are single bit data lines; thus a relatively simple routing is required, which significantly reduces the area cost and simplifies the in-pixel processor design. This simple interconnection is satisfactory even for the computation of kernel windows which are larger than 3×3 pixels wide, since data can be transferred to pixels further away using the left/right and up/down shift register approach.

The complete structure of a single pixel is shown in Fig. 4(b). For the sake of clarity the input and output ports of each pixel are only shown at the left and right sides in this figure. In the actual layout, the ports are placed as shown in Fig. 4(a). The pixel structure consists of the left (or up) shift register, the right (or down) shift register, the result shift register, the full adder cell, and the multiplexers required for configuring the structure for the data input, horizontal convolution and vertical convolution phases and for selecting the required coefficient from the three shift registers. The x1, x2, x4 and x8 outputs from the shift registers are actually appropriate output extensions from the shift register outputs. For example, the x8 output extension contains three additional latch bits in the least significant positions compared with the x1 extension. The three bit shift register of Fig. 4(b) is used to provide additional latch stages in the processing feedback path to adjust the bit position of the product sum result.

The internal structure of each functional block is shown in Fig. 5. The initial data is loaded in all three shift registers using the “Data” bit line and by activating the “Data C” control line (bit control line “4”). With the completion of a row based convolution, the shift registers will eventually contain the data of the neighbor pixels and the result shift register will contain the actual pixel's convolution result. In order to continue with the column based filtering, the shift registers' data should be updated and synchronized with the result register. This is accomplished with bit control line “5”, which allows the content of the result register to be simultaneously copied to the two shift registers. Control bit 7 is used to throw away least significant bits whenever necessary, to avoid overflows. To increase accuracy, the bit elimination phase is left for the latter steps of the kernel convolution and is not performed on the initial data set. This control line can also be used to produce 1/2 coefficients. The control lines' functions and their purpose in the kernel processing are illustrated in Fig. 6.

Table 1
Comparison of processing speed, critical path propagation delay and power dissipation of a conventional processor and the array based processor.

| | Conventional digital processor | Array based digital processor |
|---|---|---|
| Processing speed (fps) | $f_{clkA} / (\alpha_A \times m \times n)$ | $f_{clkB} / \alpha_B$ |
| Critical path propagation delay (s) | $\beta_A / (V_{DDA} - V_T)$ | $\beta_B / (V_{DDB} - V_T)$ |
| Power dissipation (W) | $P_{staticA} + C_{LA} \times f_{clkA} \times V_{DDA}^2$ | $P_{staticB} + C_{LB} \times f_{clkB} \times V_{DDB}^2 \times m \times n$ |

Fig. 9. Voltage transfer curve of a minimum size CMOS inverter in the incorporated 90 nm technology under different process conditions. The supply potential is 1 V in (a) and 0.5 V in (b). (Axes: output voltage (V) versus input voltage (V). Panels in each row: fast nMOS/slow pMOS, nominal nMOS/nominal pMOS, slow nMOS/fast pMOS, each annotated with its worst case NMH and NML values.)






The presented hardware benefits from several key factors compared with the methods given in Fig. 2. One advantage is that, since the processing is performed in the digital domain, it is possible to increase accuracy and the approach can be implemented in compact deep submicron technologies. Another advantage is that the architecture is compact enough to fit inside each pixel of an image sensor and it can be used to construct a high throughput parallel processing array. Furthermore, with the given approach it is possible to perform arbitrary size kernel convolutions.

Fig. 10. The layout of a single processor pixel in a 90 nm digital 1 poly 9 metal process.

Fig. 11. Functional evaluation of 3×3 processor cells. (a) A specific kernel, (b) An arbitrary input image data window, (c) Control waveforms.

4. Structure analyses

In this section several different aspects of the proposed design, such as accuracy, kernel convolution options, processing speed and operation voltage reduction constraints, will be investigated and analyzed.

4.1. Accuracy

In the presented design, higher output accuracies can be obtained simply by extending the lengths of the three shift registers. If a pixel output result with an accuracy of $N$ bits is desired, since 3 extra bits are required for the coefficients, the shift register lengths should be equal to $N + 3$. Most of the layout area is occupied by the three shift registers, and with longer shift registers the area overhead of the adders/multiplexers does not change (since the processing inside each pixel is performed serially). Thus the consumed area per pixel can be approximated by $reg_A \times (N + 3) \times 3$, where $reg_A$ is the area consumed by a 1 bit shift register. In this work the size of each shift register is limited to 15 bits due to pixel layout area limitation. With larger areas, a higher number of register bits and subsequently higher accuracies can be obtained.
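As a quick check of these relations, taking $N = 12$ (the output accuracy reported for this design) reproduces the 15 bit register length, and the register area follows directly. The symbol $A_{pixel}$ is introduced here only for illustration and denotes the approximate register area per pixel:

$$L = N + 3 = 12 + 3 = 15, \qquad A_{pixel} \approx reg_A \times (N + 3) \times 3 = 45\, reg_A$$

Each additional bit of output accuracy therefore adds roughly three 1 bit register stages per pixel under this approximation.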

4.2. Convolution kernels

With the proposed architecture, it is possible to multiply by factors of 1/2, 1, 2, 4 and 8 in the sum-of-products term at every step. Furthermore, while the horizontal and vertical kernels can be used individually to perform 1D convolution, the 2D filtering procedure is performed using two consecutive vertical and horizontal convolution steps, and thus the equivalent 2D kernels which are implementable by the architecture are symmetric. If $k_r$ and $k_c$ are taken as the row and column kernel matrices, the final 2D kernel can be expressed as:

$$\text{kernel} = k_r^{T} \times k_c \qquad (1)$$

For example, in the 3×3 kernel case, if $k_r = [A_1\ A_0\ A_1]$ and $k_c = [B_1\ B_0\ B_1]$, then the resulting kernel matrix will be:

$$\begin{bmatrix} A_1B_1 & A_1B_0 & A_1B_1 \\ A_0B_1 & A_0B_0 & A_0B_1 \\ A_1B_1 & A_1B_0 & A_1B_1 \end{bmatrix} \qquad (2)$$

The possible kernel values for 3 cell wide $k_r$ or $k_c$ are illustrated in Fig. 7(a). $k_r$ and $k_c$ need not be the same matrices and can be chosen from any of the given combinations. By choosing all possible permutations of $k_r$ and $k_c$, the set of 3×3 kernels implementable by the structure is obtained. Possible combinations of 5×5 kernels are also illustrated in Fig. 7(b).

It should be noted that it is possible to combine different row and column kernel sizes to achieve rectangular kernels, for example in directional filtering applications.
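The construction of the 2D kernel from its row and column vectors in (1) is an outer product, which can be sketched as follows. The function name `separable_kernel` and the coefficient values are illustrative assumptions; any entries from the set of allowed factors could be substituted.

```python
def separable_kernel(k_r, k_c):
    """Outer product k_r^T x k_c: element (i, j) equals k_r[i] * k_c[j]."""
    return [[a * b for b in k_c] for a in k_r]


# Hypothetical coefficients: k_r = [A1 A0 A1], k_c = [B1 B0 B1]
k_r = [1, 2, 1]
k_c = [1, 4, 1]
kernel = separable_kernel(k_r, k_c)
# kernel == [[1, 4, 1],
#            [2, 8, 2],
#            [1, 4, 1]]

# Different row and column lengths give a rectangular kernel
rect = separable_kernel([1, 2, 4, 2, 1], [1, 2, 1])
shape = (len(rect), len(rect[0]))
```

The entries follow the pattern of (2), with each row scaled by one element of $k_r$.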

Fig. 12. Signal control waveforms for a 5�5 cell matrix kernel convolution. The figures of (a) and (b) correspond to different kernel cells with different bit elimination steps.






4.3. Processing speed

With each kernel configuration a different bit elimination sequence will be required; for example, a kernel with higher cell values will require more bit elimination steps than a kernel with lower cell values. It should be noted that the different bit elimination combinations subsequently affect the final normalization factor that brings the output data value within the desired range.

The total number of bits required to be omitted, $d$, for a given kernel can be calculated as follows:

$$d = n + D - L, \qquad M = M_1 \times M_2 \qquad (3)$$

where $L$ represents the shift register's length, $D$ represents the binary input data length, $M_1$ and $M_2$ are the sums of the row and column coefficients respectively, and $n$ is the length of $M = M_1 \times M_2$ in binary format.
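A small numerical sketch of (3) is given below. The function name `bits_to_omit` and the example coefficient sets are hypothetical, and clamping negative values to zero (no elimination needed when the register is long enough) is an assumption added for the example rather than part of (3).

```python
def bits_to_omit(row_coeffs, col_coeffs, data_len, reg_len):
    """Number of bits to eliminate: d = n + D - L, following (3).

    n is the binary length of M = M1 * M2, where M1 and M2 are the sums
    of the row and column coefficients. Negative results are clamped to
    0 (assumption: no elimination is needed in that case).
    """
    m = sum(row_coeffs) * sum(col_coeffs)
    n = m.bit_length()
    return max(0, n + data_len - reg_len)


# 8 bit input data, 15 bit shift registers
d_high = bits_to_omit([8, 8, 8], [8, 8, 8], data_len=8, reg_len=15)
d_low = bits_to_omit([1, 2, 1], [1, 2, 1], data_len=8, reg_len=15)
```

For the first coefficient set M = 24 × 24 = 576, which is 10 bits long, so three bits have to be eliminated; for the second set M = 16 and no elimination is needed.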

On the other hand, since the number of sequential addition steps required for each row/column convolution is equal to $(N-1)/2$ for an $N \times N$ kernel, $(N-1)$ sequential addition steps are required for the complete convolution. Multiplying this value by the register length, $L$, gives the number of clock cycles. In practice some additional clock cycles are required for process initiation and bit elimination. Thus the final number of clock pulses, $N_{Clk}$, required to accomplish a 2D filter kernel in each frame can be evaluated by:

$$N_{Clk} = 1 + L \times (N - 1) + d \qquad (4)$$

where $L$ is the length of the shift registers, $N$ is the size of the filter and $d$ is the number of deleted bits. In Fig. 8 the number of clock cycles per frame is plotted for different kernel sizes assuming the highest possible cell values. The total clock count is composed of the clock cycles required to complete the summation steps and the clock cycles required for bit elimination.
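Equation (4) can be evaluated directly, as in the following sketch. The function name `clock_cycles` and the sample values (15 bit registers, with the number of deleted bits chosen only for illustration) are assumptions.

```python
def clock_cycles(reg_len, kernel_size, omitted_bits):
    """N_Clk = 1 + L * (N - 1) + d, following (4)."""
    return 1 + reg_len * (kernel_size - 1) + omitted_bits


# Illustrative values: L = 15, d = 4 for both kernel sizes
n_clk_3x3 = clock_cycles(reg_len=15, kernel_size=3, omitted_bits=4)
n_clk_5x5 = clock_cycles(reg_len=15, kernel_size=5, omitted_bits=4)
```

The summation term grows linearly with the kernel size, so doubling the number of addition steps roughly doubles the clock count for a fixed register length.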

4.4. Clock frequency

Fig. 13. Ideal and hardware filtered images of the moon image under different sigma values. In (a) the complete image is shown while (b) shows a zoomed in portion. (Rows: ideal, actual; columns: sigma = 0.2, 1, 2, 10.)

In a conventional digital processor, the processing speed is mainly limited by the maximum allowable clock frequency. For example, in a typical kernel convolution procedure, since the result is calculated one pixel at a time, a high clock frequency is required to obtain high processing speeds. In the proposed array based processor, however, the kernel convolution is commenced at all pixels simultaneously and the processing speed is ultimately limited by the rate at which the photodiodes can produce valid image data (assumed 1 kHz in this work). The required clock frequency is fairly low in the array based approach. If the clock duty cycle is assumed to be 50%, then the maximum allowable critical path propagation delay will be equal to half of the operation clock pulse period. Thus with low clock frequencies the architecture can endure longer propagation delays and, as will be shown in the next subsection, the supply voltage can be reduced to some extent. The reduction of the power supply can help reduce dynamic power dissipation. Table 1 presents a formulated summary of the relationship between processing speed, clock frequency, critical path propagation delay and power dissipation. In this table, indices “A” indicate parameters related to the conventional processor and indices “B” indicate those of the array based approach. $f_{clk}$ is the clock frequency, $\alpha$ is the number of clock cycles required to compute the result of kernel convolution for one pixel, $m$ and $n$ are the image dimensions, $\beta$ is a coefficient that relates propagation delay to supply voltage, $V_{DD}$ is the supply potential, $V_T$ is the threshold voltage, $P_{static}$ is the static power dissipation due to leakage currents and $C_L$ is the capacitive load of each processing unit. It is noted that $\alpha_B$ in Table 1 is the equivalent of $N_{Clk}$ in (4).
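The processing speed relations of Table 1 can be inverted to estimate the clock frequency needed for a target frame rate, as in the sketch below. The function names are hypothetical, and the value of 9 clock cycles per pixel assumed for the conventional processor is purely illustrative and not taken from this work.

```python
def conventional_clock_hz(fps, alpha_a, m, n):
    """f_clkA needed when pixels are processed one at a time."""
    return fps * alpha_a * m * n


def array_clock_hz(fps, alpha_b):
    """f_clkB needed when all pixels are processed in parallel."""
    return fps * alpha_b


def max_critical_path_delay_s(f_clk_hz):
    """Half of the clock period, assuming a 50% duty cycle."""
    return 0.5 / f_clk_hz


fps = 1000                      # 1 kfps
f_b = array_clock_hz(fps, alpha_b=35)
f_a = conventional_clock_hz(fps, alpha_a=9, m=200, n=200)  # alpha_a is hypothetical
t_max = max_critical_path_delay_s(f_b)
```

Under these assumptions the array based clock is 35 kHz, while the pixel-at-a-time processor would need 360 MHz for a 200 × 200 image; at 35 kHz the allowable critical path delay is on the order of 14 μs.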

4.5. Supply voltage considerations

The reduction of the power supply is limited not only by the acceptable propagation delay but also by the noise margin. When the supply voltage decreases, the propagation delay will increase. The noise margin (NM) will also degrade at low supply voltages. NM is defined as the minimum of the high ($NM_H$) and low ($NM_L$) noise margins. $NM_H$ and $NM_L$ are evaluated by the following equation:

$$NM_H = V_{OH} - V_{IH}, \qquad NM_L = V_{IL} - V_{OL} \qquad (5)$$

where $V_{IL}$ and $V_{IH}$ are the low and high input potentials at the turning points of the gate's voltage transfer curve respectively and, similarly, $V_{OH}$ and $V_{OL}$ correspond to the output voltages. The turning point of a voltage transfer curve is the input and output voltage at which the absolute slope of the curve equals unity. For correct circuit operation, the noise margin should be a positive value. It is shown that if a basic inverter cell has a positive noise margin then all standard cells derived from this cell will function as desired [22].
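Equation (5) and the definition of NM as the smaller of the two margins can be written compactly as below. The function name `noise_margins` and the turning point voltages in the example are hypothetical values chosen for illustration; they are not read from Fig. 9.

```python
def noise_margins(v_oh, v_ol, v_ih, v_il):
    """Return (NMH, NML, NM) following (5), with NM = min(NMH, NML)."""
    nmh = v_oh - v_ih
    nml = v_il - v_ol
    return nmh, nml, min(nmh, nml)


# Hypothetical turning point voltages for a 0.5 V supply
nmh, nml, nm = noise_margins(v_oh=0.48, v_ol=0.02, v_ih=0.30, v_il=0.18)
circuit_ok = nm > 0   # a positive noise margin is required
```

In a full analysis the four voltages would be extracted from the unity-slope points of each simulated transfer curve, and the smallest NM over all corners would be taken as the worst case.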

Different pMOS and nMOS corners and voltage/temperature conditions can significantly affect the parameters in (5). Fig. 9 investigates the noise margin of a standard cell CMOS inverter under different process, voltage and temperature conditions for the incorporated 90 nm process. The temperature is swept from −55 °C to 125 °C for each case. The shown markers indicate the location of the turning point in worst case conditions, where the noise margin will be the least. As can be seen, when the supply voltage is reduced to 0.5 V, in the worst case conditions a noise margin of 0.14 V will still exist, which is appropriate since it is close to the threshold value. In the simulation results it will be observed that the critical path propagation delay does not limit the amount of supply voltage reduction and it is the noise margin that eventually limits the minimum supply voltage value.

Fig. 14. Ideal and hardware filtered images of a camera man image under different sigma values. (a) shows the complete image while (b) shows a zoomed in portion. (Rows: ideal, actual; columns: sigma = 0.2, 1, 2, 10.)

5. Simulation results

For the purpose of simulation and to evaluate the performance of the proposed approach, the given design is implemented in a 90 nm standard CMOS technology. Fig. 10 shows the layout of a single processor pixel. In the 1 poly 9 metal process, the pixel dimensions only extend to 35 μm × 28 μm for 12 bit output result accuracy. Higher output bits require longer shift registers and hence larger pixel cells. Each processor cell has 11 signal ports which are common among all pixels. These signals include the supply connections VDD and GND together with the control signals Control(0:7), Clk and reset, which are routed horizontally throughout every row. In addition, 8 data signal bits connect neighbor processor cells together and provide data transfer for kernel window convolutions. These ports are located at the upper, lower, left and right edges of the corresponding pixels. The actual vision sensor pixel will include both the processor pixel cell and the photodiode/in-pixel ADC frontend section. The digital photo data producing section can vary in size with different architectures and fill factors, but typical values are approximately 10 μm × 10 μm.

Fig. 11 shows a functional evaluation of a 3×3 processor cell array. In this evaluation the input data is presented by the arbitrary data window of Fig. 11(a). This is the data which is produced by the ADC circuit of each pixel. The digital data length of each pixel should be 8 bits. A specific 3×3 kernel matrix value is also given in Fig. 11(b). Fig. 11(c) shows both the control waveform required to run the processor array and the data values produced in the result and shift registers of the pixel in the window center. As the figure shows, the processing is completed in 32 clock cycles. The result register is 15 bits but only the 12 most significant positions represent the actual result data (as shown in Fig. 5, 3 bits are left out in the final output port). In the last clock phase, the result register shows a content of 0x5B0. Since the 3 LSBs are not part of the output result, throwing these 3 bits away gives a result of B6 (hex). On the other hand, since the bit elimination steps have removed 4 LSBs of the output result, the equivalent final output will be equal to 2912. This is in good agreement with the actual result of 2991. As stated earlier, the result normalization step which produces floating-point values in the desired data range is not performed on chip, and the number of eliminated bits should also be accounted for in the final normalization factor. The waveform of Fig. 11 shows the shift register content of only the center pixel and how the final data result is built up inside that pixel. It should be noted that the same procedure is commenced simultaneously in all other pixels of the complete array.
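The arithmetic in the example above can be retraced with a few shift operations. The variable names are chosen only for this sketch; the register content, the number of dropped and eliminated bits and the reference value 2991 are the ones quoted in the text.

```python
result_register = 0x5B0              # final 15 bit result register content
output_word = result_register >> 3   # the 3 LSBs are not part of the output
equivalent = output_word << 4        # restore the weight of the 4 eliminated LSBs

ideal = 2991                         # exact convolution result quoted in the text
relative_error = abs(ideal - equivalent) / ideal
```

The output word is 0xB6 (182 decimal), the equivalent value is 2912, and the difference from 2991 is about 2.6%, which reflects the truncation of least significant bits during bit elimination.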

The signal control waveforms for a 5×5 cell kernel convolution are shown in Fig. 12. The number of bits that should be omitted at each stage differs depending on the kernel matrix coefficient values. The control waveform of Fig. 12(a) does not have a bit omission phase in the first step due to the low coefficient values used in the first neighbor cells; however, the waveform shown in Fig. 12(b) includes this step due to higher coefficient factors.

For qualitative and quantitative comparisons, the kernel convolu-tion capabilities of the presented hardware are compared with ideal2D Gaussian filters. Since 3�3 kernel windows are relatively commonamong previously reported vision sensors, to show the potential of thetechnique, 5�5 cell kernels are used and comparisons are performed

Fig. 15. The absolute mean error between ideal Gaussian filtering and proposedhardware. Comparison is performed with (a) 5�5 wide approximated kernel,(b) 3�3 wide approximated kernel and (c) combination of 5�5 and 3�3 widekernels.

Table 2. Performance evaluation of the proposed structure under different kernel window sizes and supply voltage values.

| Parameter | 3×3 kernel @ VDD = 1 V | 5×5 kernel @ VDD = 1 V | 3×3 kernel @ VDD = 0.5 V | 5×5 kernel @ VDD = 0.5 V |
|---|---|---|---|---|
| Clock frequency required for 1 kfps | 35 kHz | 70 kHz | 35 kHz | 70 kHz |
| Critical path propagation delay | 1.3 ns | 1.3 ns | 9.2 ns | 9.2 ns |
| Average power dissipation per pixel @ 1 kfps | 14 nW/pixel | 25 nW/pixel | 4 nW/pixel | 7 nW/pixel |






under different kernel sigma values. For this purpose an array of interconnected processing elements forming a 200×200 pixel plane is used to perform Gaussian filtering functions and, for each sigma value, among all possible kernel matrix options, the one that produces the least error is chosen for comparison with the ideal Gaussian filtering procedure. The data obtained from hardware simulation and from ideal software Gaussian filtering are both normalized and mapped to the intensity scale of 0 to 255. Fig. 13 shows the moon image filtered under different sigma values. In Fig. 13(a) the complete 200×200 pixel image is shown, while Fig. 13(b) shows a zoomed-in portion for qualitative comparison of the actual result produced by the hardware and the ideal case. Fig. 14 repeats the same procedure for the cameraman image. In Fig. 15(a) and (b), the average pixel error between the hardware result and ideal Gaussian filtering is shown. Similar to the previous case, for each sigma value, the hardware kernel window of the same size which produces the least error is chosen for comparison. As can be seen, because the kernel coefficients are limited to multiple-of-2 factors, the implementation of lower sigma kernels produces higher errors. As shown in Fig. 15(c), it was further observed that ideal 5×5 cell Gaussian kernels with low sigma values (up to 0.8) were best approximated with 3×3 hardware kernels. It should be noted that the error presented here is just the error between the approximated kernel and the ideal Gaussian kernel (on the scale of 0 to 255); the actual accuracy of the approximated kernel remains at 12 bits and the output is free from the random and fixed pattern noise errors observed in analog processing hardware. This means that in the proposed approach consecutive executions of the same process will produce exactly the same results represented by 12 bits, unlike the analog technique, where subsequent executions of the same process will suffer from errors that change with time and from one pixel to another.
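The comparison procedure can be illustrated with a small software model. This is a minimal sketch under several assumptions that are not taken from the paper: the integer kernel `APPROX_3X3` is a hypothetical example whose coefficients are powers of two, the test image is a synthetic 16×16 pattern rather than the moon or cameraman images, normalization is done by dividing by the kernel sum, and a single fixed kernel is used instead of searching all kernel options for each sigma.

```python
import math

def gaussian_kernel(size, sigma):
    """Ideal 2D Gaussian kernel of odd width `size` (left unnormalized)."""
    half = size // 2
    return [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

def filter_normalized(image, kernel):
    """Apply the kernel over the valid region and divide by the kernel sum,
    so that 0..255 inputs stay on the 0..255 scale (assumed normalization).
    The kernels here are symmetric, so correlation equals convolution."""
    k = len(kernel)
    total = sum(sum(row) for row in kernel)
    out = []
    for r in range(len(image) - k + 1):
        out_row = []
        for c in range(len(image[0]) - k + 1):
            acc = sum(kernel[i][j] * image[r + i][c + j]
                      for i in range(k) for j in range(k))
            out_row.append(acc / total)
        out.append(out_row)
    return out

def mean_abs_error(a, b):
    """Average absolute pixel difference between two equally sized images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Hypothetical integer kernel with power-of-two coefficients (assumption).
APPROX_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

# Synthetic 8-bit test pattern (assumption; the paper uses real images).
IMAGE = [[(37 * r + 91 * c) % 256 for c in range(16)] for r in range(16)]

def error_for_sigma(sigma):
    """Mean absolute error between ideal Gaussian filtering and APPROX_3X3."""
    ideal = filter_normalized(IMAGE, gaussian_kernel(3, sigma))
    approximated = filter_normalized(IMAGE, APPROX_3X3)
    return mean_abs_error(ideal, approximated)

for sigma in (0.6, 0.85, 1.2):
    print(f"sigma = {sigma}: mean absolute error = {error_for_sigma(sigma):.3f}")
```

In this toy model the example kernel coincides with the ideal 3×3 Gaussian at sigma = sqrt(1 / (2 ln 2)) ≈ 0.85, so the error is smallest near that value and grows as sigma moves away from it. Selecting the best kernel for each sigma, as done in the evaluation above, would require repeating the comparison over a set of candidate kernels.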

Circuit performance evaluations are presented in Table 2. As the table shows, the critical path propagation delay of the architecture is relatively low and each cycle of the processing is completed in a short period of time. However, the actual frame rate will eventually be limited by the photodiode noise level. At higher illumination levels, the frame rate can be increased, while lower illumination levels require longer exposure times and hence the frame rate should be decreased. High speed vision sensors usually operate at a frame rate of 1 kfps. Since the processing time of the proposed structure is much lower than the clock period, the supply voltage can be decreased to reduce power dissipation. As shown earlier, in the presented structure the amount of supply voltage reduction is eventually limited by the noise margin of the gates for correct static operation (and not by the critical path delay). In Table 2, the voltage level is decreased to 0.5 V. As the table shows, for the reduced voltage case, the propagation delay is still significantly lower than the clock period and the power dissipation is reduced considerably. It should be noted that the evaluations of power dissipation and critical path propagation delay are performed at worst case corner and temperature conditions.
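The margins implied by Table 2 can be checked numerically. The sketch below uses only the values listed in Table 2; the first-order model in which dynamic power scales with the square of the supply voltage at a fixed clock frequency is an assumption introduced here for comparison (it ignores leakage and short-circuit currents), and the function names are illustrative.

```python
FRAME_RATE_HZ = 1_000  # 1 kfps, as in Table 2

# (kernel, VDD in V) -> (clock in Hz, critical path delay in s, power in W/pixel)
TABLE_2 = {
    ("3x3", 1.0): (35e3, 1.3e-9, 14e-9),
    ("5x5", 1.0): (70e3, 1.3e-9, 25e-9),
    ("3x3", 0.5): (35e3, 9.2e-9, 4e-9),
    ("5x5", 0.5): (70e3, 9.2e-9, 7e-9),
}

def cycles_per_frame(clock_hz):
    """Clock cycles available for one frame at the target frame rate."""
    return clock_hz / FRAME_RATE_HZ

def timing_slack_ratio(clock_hz, delay_s):
    """How many times the clock period exceeds the critical path delay."""
    return (1.0 / clock_hz) / delay_s

def scaled_power(power_ref_w, vdd_ref, vdd_new):
    """First-order estimate assuming P is proportional to VDD^2 (assumption)."""
    return power_ref_w * (vdd_new / vdd_ref) ** 2

for (kernel, vdd), (clock_hz, delay_s, power_w) in TABLE_2.items():
    print(f"{kernel} @ {vdd} V: {cycles_per_frame(clock_hz):.0f} cycles/frame, "
          f"clock period = {timing_slack_ratio(clock_hz, delay_s):.0f} x critical path")

for kernel in ("3x3", "5x5"):
    estimate = scaled_power(TABLE_2[(kernel, 1.0)][2], 1.0, 0.5)
    reported = TABLE_2[(kernel, 0.5)][2]
    print(f"{kernel}: quadratic estimate {estimate * 1e9:.2f} nW, "
          f"Table 2 value {reported * 1e9:.0f} nW")
```

Even at 0.5 V the clock period exceeds the critical path delay by more than three orders of magnitude, which is consistent with the statement that the noise margin, rather than timing, limits the supply reduction. The quadratic estimate lands slightly below the reported values, as would be expected if a part of the dissipation does not scale with the square of the supply voltage.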

A comparison of the presented sensor with previous kernel convolution chips is given in Table 3. As the table shows, while the pixel size of the method is of the same order as that of the previous chips, it shows relatively lower power dissipation and higher accuracy and, more importantly, the presented approach is capable of performing arbitrary window size kernel convolutions. It should be noted that the approach presented in [19] is based on an event processing system which computes the kernel convolution one pixel at a time; consequently, the eventual processing speed is relatively lower than that of architectures based on parallel processing (especially when the event generation rate is high). However, since processing is performed whenever an event is generated by a specific pixel, a specific frame rate is not reported for these types of sensors and the processing speed is expressed in events per second.

Compared with the digital methods, the presented technique requires a much lower clock frequency to achieve the required processing frame rate, thus decreasing dynamic power dissipation. Although the deep submicron implementation of the processor is considered an advantage of a fully digital design, to obtain a rough power dissipation comparison based solely on the given architecture, the evaluation is also performed using a 0.35 μm MOSFET model file (post layout extractions are not performed in this case) and the power usage for this particular condition is also reported in Table 3.

6. Conclusions

In this paper an array based digital bit serial kernel convolution processor was designed and presented. These processors are an important part of high speed and smart vision sensors and help the initial and time consuming steps of the vision data processing to be performed in real time and with lower power dissipation. Unlike most conventional kernel convolution vision chips, the specific bit serial design of the processor allows convolution windows with sizes larger than 3×3 cells and there is no limitation on the kernel size. The only constraint is determined by the output result register bit length. For a 12 bit un-normalized output result, a pixel area of 35×28 μm² is required. For higher bit counts, larger cells will be required. The kernel set provided by the approach can be used to implement many symmetric filter types including vertical and horizontal directional and 2D kernels. The simulation results show that Gaussian filters can be approximated effectively with limited error. Judging by the results, the obtained operating conditions and the low power consumption of the approach make it an appropriate choice for kernel convolution tasks in smart CMOS image sensors and vision chips.

Table 3. Comparison of the proposed kernel convolution processor with previously reported structures.

| Processor | Technology | Window size | Accuracy | Power dissipation | Processing speed | Pixel size | Operation method |
|---|---|---|---|---|---|---|---|
| [19] | 0.35 μm | 32×32 | 6 bit | 48 μW | 37 Meps (a), fclk = 100 MHz | 58 μm × 54 μm | Digital event based |
| [21] | 65 nm | 5×5 | 8 bit | 0.9 μW | 40 fps @ fclk = 20 MHz | 10.8 μm × 10.8 μm | Digital row based |
| [18] | 0.35 μm | 3×3 | 8 bit (b) | 70 nW | 100 fps | 35 μm × 35 μm | Analog |
| [6] | 0.35 μm | 3×3 | 7–8 bit (b) | 350 μW | 10 kfps | 75.7 μm × 73.3 μm | Analog |
| [3] | 0.35 μm | 3×3 | 7–8 bit (b) | 1 μW | 30 fps | 35 μm × 35 μm | Analog |
| [16] | 0.35 μm | 3×3 | NA | 50 nW | Pulse train output | 160 μm × 160 μm | Analog (pulse train output) |
| Presented work | 90 nm | unlimited | 12 bit | ≈10 nW (VDD = 0.5 V (c)) | 1 kfps @ fclk = 70 kHz | 35 μm × 28 μm | Digital pixel based |
| Presented work | 0.35 μm | unlimited | 12 bit | ≈105 nW (VDD = 1.8 V (c)) | 1 kfps @ fclk = 70 kHz | – | Digital pixel based |

(a) The speed of this sensor is expressed in events per second and represents the maximum number of pixels which can clarify their value per unit time.
(b) Considering the inaccuracy and FPN in the analog signal.
(c) Reduced supply potential.






References

[1] R. Jevtic, C. Carreras, A complete dynamic power estimation model for data-paths in FPGA DSP designs, Integr. VLSI J. 45 (2012) 172–185.

[2] M. Habibi, Analysis, enhancement, and sensitivity improvement of the correlation image sensor, IEEE Trans. Instrum. Meas. 61 (2012) 708–718.

[3] N. Massari, M. Gottardi, L. Gonzo, D. Stoppa, A. Simoni, A CMOS image sensor with programmable pixel-level analog processing, IEEE Trans. Neural Netw. 16 (2005) 1673–1684.

[4] J.R. Baker, CMOS Circuit Design, Layout, and Simulation, Wiley-IEEE Press, 2010.

[5] A. Zarandy, T. Fulop, Approaching object detector mouse retina circuit model analysis and implementation on cellular sensor-processor array, Int. J. Circuit Theory Appl. 40 (2012) 1249–1264.

[6] J.F. Lopez, F.V. Fernandez, J.M. Lopez-Villegas, J.M. de la Rosa, ACE16k based stand-alone system for real-time pre-processing tasks, Proc. SPIE 5837 (2005) 872–879.

[7] A. Kitchen, A. Bermak, A. Bouzerdoum, A digital pixel sensor array with programmable dynamic range, IEEE Trans. Electron. Devices 52 (2005) 2591–2601.

[8] S. Kleinfelder, S.H. Lim, X. Liu, A. El Gamal, A 10,000 frames/s CMOS digital pixel sensor, IEEE J. Solid-State Circuits 36 (2001) 2049–2059.

[9] A. Bermak, Y.F. Yung, A DPS array with programmable resolution and reconfigurable conversion time, IEEE Trans. VLSI Syst. 14 (2006) 15–22.

[10] A. Joginipally, A. Varela, R. Schott, Z. Fitzsimmons, Efficient FPGA implementation of steerable Gaussian smoothers, in: Proceedings of the 44th IEEE Southeastern Symposium on System Theory, 2012, pp. 78–82.

[11] P. Brylski, M. Strzelecki, Parallel digital image processor implemented in FPGA technology, in: Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, 2011, pp. 25–28.

[12] C. Gonzalez, S. Sánchez, A. Paz, J. Resano, D. Mozos, A. Plaza, Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing, Integr. VLSI J. 46 (2013) 89–218.

[13] C. Shoushun, A. Bermak, W. Yan, D. Martinez, Adaptive-quantization digital image sensor for low-power image compression, IEEE Trans. Circuits Syst. I: Regul. Pap. 54 (2007) 13–25.

[14] P.S. Mandolesi, P. Julian, A.G. Andreou, A scalable and programmable simplicial CNN digital pixel processor architecture, IEEE Trans. Circuits Syst. 51 (2004) 988–996.

[15] S. Scholze, H. Eisenreich, S. Hoppner, G. Ellguth, S. Henker, M. Ander, S. Hänzsche, J. Partzsch, C. Mayr, R. Schüffny, A 32 GBit/s communication SoC for a waferscale neuromorphic system, Integr. VLSI J. 45 (2012) 61–75.

[16] Y. Huang, E.M. Drakakis, C. Toumazou, P. Degenaar, A CMOS image sensor with spiking pixels for retinal stimulation, in: Proceedings of the IEEE Circuits and Systems Conference, 2008, pp. 1548–1551.

[17] R. Berner, T. Delbruck, Event-based pixel sensitive to changes of color and brightness, IEEE Trans. Circuits Syst. 58 (2011) 1581–1590.

[18] W. Jendernalik, G. Blakiewicz, J. Jakusz, S. Szczepanski, R. Piotrowski, An analog sub-miliwatt CMOS image sensor with pixel-level convolution processing, IEEE Trans. Circuits Syst. 60 (2013) 279–289.

[19] L. Camunas-Mesa, C. Zamarreno-Ramos, A. Linares-Barranco, A.J. Acosta-Jimenez, T. Serrano-Gotarredona, B. Linares-Barranco, An event-driven multi-kernel convolution processor module for event-driven vision sensors, IEEE J. Solid-State Circuits 47 (2012) 504–517.

[20] P.Y. Hsiao, S.S. Chou, F.C. Huang, Generic 2-D Gaussian smoothing filter for noisy image processing, in: Proceedings of the TENCON IEEE Region Conference, 2007, pp. 1–4.

[21] H. Zhu, T. Shibata, A real-time motion-feature-extraction image processor employing digital-pixel-sensor-based parallel architecture, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2012, pp. 1612–1615.

[22] N. Weste, D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th Edition, Addison-Wesley, 2010.

Mehdi Habibi was born in 1981. He received his B.S., M.S. and Ph.D. degrees in Electrical Engineering from Isfahan University of Technology, Isfahan, Iran in 2003, 2005 and 2010 respectively. He is currently an Assistant Professor at the University of Isfahan, Department of Electrical Engineering. He was a Member of the IUT Robocup team and earned the third ranks in the 2002 small size and 2003 middle size international Robocup leagues in Germany and Italy.

His research interests include CMOS vision sensors and microelectronics circuit design.

Alireza Bafandeh received his B.Sc. degree in Electrical Engineering from University of Isfahan, Isfahan, Iran, in 2012. He is currently working toward his M.Sc. degree in Microelectronic Circuit Design at the Amirkabir University of Technology.

His current research interests include analog and mixed-signal circuits and systems, digital system design, digital calibration of sigma-delta ADCs.

Muhammad Ali Montazerolghaem received his B.Sc. degree in Electrical Engineering from University of Isfahan, Isfahan, Iran, in 2012. He is currently working toward his M.Sc. degree in Microelectronic Circuit Design at the Amirkabir University of Technology.

His current research interests include analog and mixed-signal circuits and systems, digital calibration of pipelined ADCs.


