Parallel image-processing system based on the TMS32010 digital signal processor

K.N. Ngan, PhD; A.A. Kassim, BEng; H.S. Singh, PhD

Indexing terms: Image processing, Picture processing and pattern recognition, Signal processing, Algorithms

Abstract: A parallel image processor (PIP) consisting of eight Texas Instruments TMS32010 digital signal processors is described. The architecture is designed for image-processing applications, and two common pattern-recognition algorithms, i.e. edge detection followed by thinning, are implemented, achieving a total processing time of less than one second for a 256 x 256 pixel image. The advantages and limitations of using the TMS32010 as a fast signal processor are described. Problems encountered in programming the parallel processors and ways to overcome them are highlighted.

1 Introduction

The TMS32010 [1] is a widely used 32-bit microprocessor for digital signal-processing (DSP) applications. It differs from general-purpose microprocessors in that it has an on-chip multiplier, computes a pipelined running sum of products, and performs data scaling, memory shifts and pointer incrementation in parallel with other operations. These features make it very useful in many aspects of signal processing. Most of the instructions require only one cycle and, with a 200 ns cycle time, the processor is capable of executing five million instructions per second. It is therefore well suited to digital signal processing, especially image processing, where a large amount of data is to be processed.

TMS32010s have been used in multiprocessor configurations for various DSP applications such as fast Fourier transform (FFT) and speech filterbank [2], digital correlation and spectrum analysis [3], acoustic echo cancellation [4] and adaptive filtering [5]. However, little work has been done, or at least published, on designing a parallel processing system based on the TMS32010 for digital image processing. The work reported in this paper was carried out with this objective in mind.

An image is basically a two-dimensional data array, and image processing involves performing the same operation over image segments of the same size until the whole image is processed. It is therefore amenable to parallel processing, where several processors process the image segments in synchronism, in one operation cycle, thereby cutting down processing time by a factor determined by the number of processors used.

Paper 5172E (C2), first received 4th March and in revised form 5th November 1986

The authors are with the Department of Electrical Engineering, National University of Singapore, Kent Ridge, Singapore 0511

The PIP architecture described in this paper is of the master-slave configuration, where an MC68000-based single-board computer [6] controls the parallel system consisting of eight TMS32010s. The common program memory of the TMS32010s is RAM, which is downloaded with the program instructions before the processing begins. The image to be processed is then loaded into an image buffer which is divided into eight segments, each to be processed by one TMS32010. The processed results are stored in another image buffer for display. Different processes can thus be implemented by downloading the respective programs into the program memory before the processing is initiated.

To evaluate the parallel processing system, two image-processing algorithms for pattern recognition, i.e. edge detection and thinning, have been implemented. The advantages and limitations of the PIP and the ways to overcome them will be discussed.

2 Parallel image-processor (PIP) architecture

The parallel image processor, as shown in Fig. 1, consists of an MC68000-based single-board computer linked to a host computer via a serial line, a frame-grabber/display unit and eight TMS32010-based parallel-processing units. The frame grabber loads a digitised image from a video camera into either of the two buffers. The frame display constantly generates a TV signal from its own 64 kilobyte buffer, which can be loaded with the data from either the image or result buffer.

The PIP can operate either in the parallel mode or the nonparallel mode. In the nonparallel mode, the image and result buffers and the program memory of the TMS32010s are accessible as part of the MC68000 memory map while the TMS32010s are in the inactive or reset state. In the parallel mode, the TMS32010s form eight processing units, as shown in Fig. 2, with a common program memory provided by the 4 kiloword high-speed static RAM (70 ns 61HL64). Each of the two buffers is divided into eight segments, and each image buffer segment is paired with the corresponding result buffer segment. These paired buffer segments are isolated from each other and also from the MC68000 busbar and become accessible to a TMS32010 processing unit through one of its I/O ports.

The parallel mode is initialised by the MC68000 after the program memory is loaded with a TMS32010 program and the image buffer with an image. The 4 kiloword program memory is mapped to the directly addressable memory space of the TMS32010s. The program memory is common to all the processors, and the address and control signals of one processor, processor no. 1 (Fig. 3), are used to access the program memory and the buffer segments. The TMS32010s run in synchronism and the program instructions are broadcast to all the processors. This mode of parallel operation is identical to a single-instruction multiple-data (SIMD) system.

IEE PROCEEDINGS, Vol. 134, Pt. E, No. 2, MARCH 1987 119

Fig. 1 Block diagram of the parallel image processor (PIP)
[Blocks shown: host computer, RS-232C link, MC68000 controller, address and data busbars, parallel-processing units (TMS32010s), TV camera, frame grabber, timing, frame display (D/A), video output, TV monitor]

Each pair of the corresponding segments of the two buffers is made completely accessible to a TMS32010 processing unit. Each unit is capable of reading the contents of the adjacent paired buffer segments as well as its own but can only write into its own paired buffer segment. This capability of accessing the contents of the adjacent segments is particularly useful in image processing, where the data are processed in overlapping blocks. An address counter is used to provide a valid address to the buffer segments while the data are moved in and out of the 144-word processor internal memory through one of eight ports. As shown in Fig. 3, a data selector routes the data from one of the data busbars of the three paired buffer segments to the TMS32010. Since all the processors access the same location of the buffer segments at the same time, only one address counter is used and its output is broadcast to all the buffer segments. This counter is again controlled by processor no. 1, which controls the program memory.

Fig. 2 Processing unit no. 1
[Blocks shown: TMS32010, program memory (4K x 16), address counter (clock, load), memory decoder, memory I/O address decoder, image buffer segment 1 (8K x 8, 00000-01FFF), result buffer segment 1 (8K x 8, 10000-11FFF), address and data busbars, data output of buffer segment 2, connection to the data selector of processor 2]

Fig. 3 Parallel processing units
Legend: tristate buffers with an active low-output enable, bidirectional for data busbars
[Blocks shown: 4K x 16 program memory, MC68000 controller, address decoder, image buffer segments no. 1 to 3 (8K x 8 each; 00000-01FFF, 02000-03FFF, 04000-05FFF), result buffer segments no. 1 to 3 (8K x 8 each; 10000-11FFF, 12000-13FFF, 14000-15FFF)]

To access a particular location of the three sets of 16 kilobyte paired buffer segments, a 16-bit address A0-A15 is loaded into the address counter. Address line A15 is used to distinguish between the image and result buffer segments while A13 and A14 are used to control the data selector. The address counter is hardwired to autoincrement every time an access of the buffers is made. To access consecutive locations in a block, only the first address of the block needs to be provided, and this greatly reduces the software overhead.

The TMS32010s process the image data of the image buffer segments and store the results in the corresponding result buffer segments. At the end of the parallel program, the TMS32010s interrupt the MC68000, which then sets the system back into the nonparallel mode. The MC68000 then interrupts the frame display unit, which displays the next frame using the data from the result buffer while updating its own buffer with the processed image. The processed image will continually be displayed on the monitor until the next frame has been processed.
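The address-counter scheme of this section can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the polarity of A15 (0 for image, 1 for result) and the meaning of each data-selector code are not stated in the text, and the wrap-around inside one segment is a simplification of the model rather than a documented hardware property.

```python
SEGMENT_BYTES = 8 * 1024  # each buffer segment is 8K x 8


def decode_address(addr):
    """Split a 16-bit address-counter value into its three fields."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address counter is 16 bits wide")
    offset = addr & 0x1FFF          # A0-A12: byte within the 8 KB segment
    selector = (addr >> 13) & 0x3   # A13-A14: data-selector setting
    is_result = (addr >> 15) & 0x1  # A15: assumed 0 = image, 1 = result
    return offset, selector, is_result


def read_block(segment, start, length):
    """Model the autoincrementing counter: one start address, `length` reads."""
    counter = start
    out = []
    for _ in range(length):
        out.append(segment[counter & 0x1FFF])  # model only: stay in one segment
        counter = (counter + 1) & 0xFFFF       # hardware autoincrement per access
    return out
```

Only the first address of a block is supplied to `read_block`; every later address comes from the increment, which is the overhead saving described above.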

3 Loading routine

Each TMS32010 processor has to process 8 kilobytes of data, i.e. 32 of the 256 lines in the entire image. These data have to be moved into the internal memory of the TMS32010s in blocks. Part of the 144-word internal memory is used for the variables and constants of the image-processing programs. The loading of the data into the processor is the main overhead and so has to be as efficient as possible.

A block of 32 points of a line is processed at one go. To facilitate this process, the 32 points and all their surrounding neighbours, as shown in Fig. 4, are loaded into the internal memory. The points X(i + 0, j + 0) to X(i + 31, j + 0) are the points to be processed and the rest are their neighbouring points.

X(i-1, j-1)  X(i+0, j-1)  X(i+1, j-1)  ...  X(i+31, j-1)  X(i+32, j-1)
X(i-1, j+0)  X(i+0, j+0)  X(i+1, j+0)  ...  X(i+31, j+0)  X(i+32, j+0)
X(i-1, j+1)  X(i+0, j+1)  X(i+1, j+1)  ...  X(i+31, j+1)  X(i+32, j+1)

Fig. 4 Block which is loaded into the processor memory

As shown in Fig. 5, each buffer segment is divided into vertical columns of width 32 bytes. The blocks of the leftmost vertical column are loaded and processed starting at the top left corner, and the process is repeated for the remaining columns. When the first block of each vertical column is being processed, three blocks are loaded: the block in question, the block above and the block below. For the remaining blocks of the column, only the block below needs to be loaded into the internal memory space, overwriting the block above the previously processed block. When the last block of a vertical column is to be processed, the block below is loaded from the buffer segment below. If the loading/processing procedure had been done horizontally, then three blocks would have had to be loaded for each block to be processed.

Fig. 5 Buffer segment divided into blocks of 32 bytes
[Horizontal axis: byte offset within a line, 00 to FF hex, in eight columns of 32 bytes; vertical axis: line number, 00 to 1F hex; the blocks are numbered down each column, 1 to 32 in the first column through 225 to 256 in the last]
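The saving from the column-wise order can be checked with a simple count. The sketch below is a counting model only, with a hypothetical function name; it treats each row of a block as one load and ignores the two extra neighbouring points at the ends of each row.

```python
def block_loads(columns=8, blocks_per_column=32, by_column=True):
    """Count the block-row loads needed to process one buffer segment."""
    if by_column:
        # First block of a column: load the block above, the block itself
        # and the block below. Every later block reuses two of the three
        # rows already in internal memory and loads only the row below.
        return columns * (3 + (blocks_per_column - 1))
    # Horizontal traversal: consecutive blocks share no rows, so all
    # three rows are loaded for every block.
    return columns * blocks_per_column * 3
```

With the segment geometry of Fig. 5 (eight columns of 32 blocks), this model gives 272 loads for the column-wise order against 768 for the horizontal order.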

4 Parallel programming

The architecture of the parallel image processor imposes a restriction on its programming. If the data being tested in a decision process differ between the parallel processors, then different actions would have to be taken by them. This is clearly impossible in an SIMD machine. The data differ only if they come directly from the buffer segments or are derived from them, and so the restriction only involves such data. To overcome this problem, all the possible actions of the decision process are executed and the results stored. A pointer is then generated from the input data to determine which result to use. This technique is used in the edge-detection routine described below.

5 Image processing

To evaluate the parallel-processing system, two pattern-recognition algorithms were implemented. The algorithms are the Sobel gradient operators for edge detection and a thinning algorithm. They are typical image-processing tasks performed by a computer. The implementation of the algorithms on the PIP highlights the advantages and limitations of the system and the ways to overcome them.

6 Edge detection: Sobel gradients

Edge detection is carried out by computing the Sobel gradient functions [7]. Consider an image pixel X(i, j) and its neighbouring pixels as in Fig. 6. The Sobel gradients, G1 and G2, of the point X(i, j) are given by eqns. 1 and 2:

X(i-1, j-1)  X(i+0, j-1)  X(i+1, j-1)
X(i-1, j+0)  X(i+0, j+0)  X(i+1, j+0)
X(i-1, j+1)  X(i+0, j+1)  X(i+1, j+1)

Fig. 6 Matrix of a pixel X(i + 0, j + 0) and its surrounding pixels

$$G_1 = \tfrac{1}{4}\bigl[\{X(i-1, j+1) + 2X(i+0, j+1) + X(i+1, j+1)\} - \{X(i-1, j-1) + 2X(i+0, j-1) + X(i+1, j-1)\}\bigr] \tag{1}$$

$$G_2 = \tfrac{1}{4}\bigl[\{X(i+1, j-1) + 2X(i+1, j+0) + X(i+1, j+1)\} - \{X(i-1, j-1) + 2X(i-1, j+0) + X(i-1, j+1)\}\bigr] \tag{2}$$

The gradient G is then given by an operator Op on G1 and G2, as given in eqn. 3:

$$G = \mathrm{Op}[G_1, G_2] = [\,\ldots\ |G_2|\,] \tag{3}$$

If the gradient is greater than a predefined threshold T, then an edge exists and the point X(i, j) is set to the peak white value 255; otherwise it is set to zero, which is displayed as peak black.
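Eqns. 1 and 2 and the threshold test can be written out as a short sketch. The function names are hypothetical, the image is assumed to be indexed as X[j][i] with i running along a line, and the default operator (the larger of the two moduli) is an assumption made for illustration only, since the full form of Op in eqn. 3 is not available here.

```python
def sobel_gradients(X, i, j):
    """Return (G1, G2) of eqns. 1 and 2 for pixel X(i, j); X is indexed X[j][i]."""
    def x(di, dj):
        return X[j + dj][i + di]

    # G1: line below minus line above, centre points weighted by 2.
    g1 = ((x(-1, 1) + 2 * x(0, 1) + x(1, 1))
          - (x(-1, -1) + 2 * x(0, -1) + x(1, -1))) / 4
    # G2: column to the right minus column to the left.
    g2 = ((x(1, -1) + 2 * x(1, 0) + x(1, 1))
          - (x(-1, -1) + 2 * x(-1, 0) + x(-1, 1))) / 4
    return g1, g2


def edge_pixel(X, i, j, threshold, op=lambda a, b: max(abs(a), abs(b))):
    """Threshold test; `op` is an ASSUMED stand-in for Op of eqn. 3."""
    g1, g2 = sobel_gradients(X, i, j)
    return 255 if op(g1, g2) > threshold else 0
```

For a test image whose lines are all [0, 0, 100, 100], the pixel at i = 1, j = 1 has G1 = 0 and G2 = 100, so it is marked as an edge for any threshold below 100.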

Before the processing begins, a set of 32 points and their surrounding neighbours are loaded into the data memory. Three data memory locations are used as pointers to the rows. These pointers are updated as the processing proceeds until all the points have been processed. The address counter is loaded with the address of the result buffer segment corresponding to the block being processed, after which the results are sent directly to the output port without storage.

The arithmetic of eqns. 1 and 2 is easily done by the TMS32010s, but the problem lies in eqn. 3, where the moduli of G1 and G2 have to be computed. The modulus of a number is the negative of the number if it is less than zero; otherwise it is the number itself. The comparison cannot be executed by the parallel processors for the reasons mentioned earlier. To solve the problem, both answers, that is G1 and -G1 (or G2 and -G2), are evaluated and stored in consecutive data-memory locations, say N and N + 1. The sign bit of G1 (or G2) is then added to N to generate the location of the answer. Thus if G1 is negative, the sign bit is one and the answer is in location N + 1, which holds -G1.
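The table-lookup modulus can be mimicked in a few lines. This is a sketch of the idea rather than the TMS32010 program itself: the function name is hypothetical, a two-element list stands in for data-memory locations N and N + 1, and a 16-bit two's-complement word is assumed.

```python
def branchless_abs(g, bits=16):
    """Modulus of g without a data-dependent branch."""
    table = [g, -g]                 # locations N and N + 1
    sign = (g >> (bits - 1)) & 1    # sign bit: 0 if g >= 0, 1 if g < 0
    return table[sign]              # pointer = N + sign bit
```

Every processor executes exactly the same instruction sequence; only the pointer value differs with the data, which is what the SIMD arrangement requires.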

7 Thinning

The edge produced by the Sobel gradient operators is often more than one pixel thick and so a thinning operation has to be applied to reduce this to one pixel. The loading procedure is the same as for the edge-detection algorithm explained above, but the results are stored back into the same buffer. The method used is a derivative of the classical thinning algorithm [8]. The algorithm involves comparing an edge point that lies on a given side of the border with a set of six masks. This is repeated successively from the north, south, east and west.

The restriction of the parallel system forces all the masks to be tested with each and every pixel, regardless of whether it is an edge pixel or not. The pixel is then set to be a thinned edge pixel if and only if it is an edge pixel and it and its surrounding pixels match any one of the masks. Fig. 7 shows the masks to be matched. Fig. 7a and b represent two masks while the rest of the masks are obtained by rotating Fig. 7c by 0°, 90°, 180° and 270°. The point P, which corresponds to X(i + 0, j + 0) of Fig. 6, is the pixel being considered while the rest are its neighbouring points. If P is an edge pixel then for P to be a thinned edge pixel, at least one of the points marked A or B must be an edge pixel.

(a)        (b)        (c)
A A A      B 0 A      B 0 A
0 P 0      B P A      0 P A
B B B      B 0 A      A A A

Fig. 7 Three masks to be matched

The problem is solved as a Boolean operation. Boolean operations can be carried out by representing all edge pixels with the Boolean '1' and the rest with '0'. The masks of Fig. 7a and b and those obtained by rotating Fig. 7c are represented by the Boolean expressions NOT[M1] to NOT[M6], respectively.

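$$H = \text{logical OR of all points except } P \text{ or } X(i, j) \tag{4}$$

$$M_1 = \mathrm{NOT}[H] \text{ .OR. } X(i-1, j+0) \text{ .OR. } X(i+1, j+0) \tag{5}$$

$$M_2 = \mathrm{NOT}[H] \text{ .OR. } X(i+0, j-1) \text{ .OR. } X(i+0, j+1) \tag{6}$$

$$M_3 = \mathrm{NOT}[H] \text{ .OR. } X(i+0, j-1) \text{ .OR. } X(i-1, j+0) \tag{7}$$

$$M_4 = \mathrm{NOT}[H] \text{ .OR. } X(i+0, j-1) \text{ .OR. } X(i+1, j+0) \tag{8}$$

$$M_5 = \mathrm{NOT}[H] \text{ .OR. } X(i+1, j+0) \text{ .OR. } X(i+0, j+1) \tag{9}$$

$$M_6 = \mathrm{NOT}[H] \text{ .OR. } X(i+0, j+1) \text{ .OR. } X(i-1, j+0) \tag{10}$$

$$M = M_1 \text{ .AND. } M_2 \text{ .AND. } M_3 \text{ .AND. } M_4 \text{ .AND. } M_5 \text{ .AND. } M_6 \tag{11}$$

$$P = \mathrm{NOT}[M] \text{ .AND. } X(i+0, j+0) \tag{12}$$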

The point P is a thinned edge pixel and set to 1 if and only if there is a match with one of the masks and X(i, j) is an edge pixel. Whenever such a match occurs, M of eqn. 11 becomes 0 and P of eqn. 12 becomes 1 if and only if X(i + 0, j + 0) is 1. The condition that X(i, j) should be an edge pixel is necessary since each and every point has to be tested. In nonparallel systems, the comparisons are made only with the edge pixels and hence result in less computation but, as seen in the results, the processing times are nevertheless short because of the high operating speed of the TMS32010 and the parallel structure.

The thinning algorithm calls for testing an edge pixel on each side of the border in turn. This is easily incorporated into the final expression, eqn. 12. When the test is done on the right border, which occurs when X(i + 1, j + 0) is 0, the result P is as in eqn. 13:

$$P = [P \text{ .AND. } \mathrm{NOT}[X(i+1, j+0)]] \text{ .OR. } [X(i, j) \text{ .AND. } X(i+1, j+0)] \tag{13}$$
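The Boolean formulation of eqns. 4 to 13 can be exercised on single 3 x 3 neighbourhoods with the sketch below. The function names are hypothetical, and the pair of neighbours tested for each mask is an assumption inferred from the zero positions of the masks in Fig. 7 (left and right, above and below, then the four adjacent pairs obtained by rotation); because eqn. 11 combines all six terms, the order of the pairs does not affect the result.

```python
def neighbourhood(rows):
    """Map a 3x3 list of 0/1 rows (top to bottom) to {(di, dj): value}."""
    return {(di, dj): rows[dj + 1][di + 1]
            for dj in (-1, 0, 1) for di in (-1, 0, 1)}


def match_masks(n):
    """Eqns. 4-12: 1 if the centre is an edge pixel matching any mask."""
    h = 0
    for offset, value in n.items():
        if offset != (0, 0):
            h |= value                       # eqn. 4: OR of the 8 neighbours
    not_h = 1 - h
    left, right = n[(-1, 0)], n[(1, 0)]
    up, down = n[(0, -1)], n[(0, 1)]
    # ASSUMED operand pairs, taken from the zero positions of each mask.
    pairs = [(left, right), (up, down), (up, left),
             (up, right), (right, down), (down, left)]
    m = 1
    for a, b in pairs:
        m &= not_h | a | b                   # eqns. 5-10 combined by eqn. 11
    return (1 - m) & n[(0, 0)]               # eqn. 12


def right_border_pass(n):
    """Eqn. 13: only pixels on the right border can be removed in this pass."""
    p = match_masks(n)
    x, right = n[(0, 0)], n[(1, 0)]
    return (p & (1 - right)) | (x & right)
```

As a usage example, a pixel on a one-pixel-wide vertical line is kept, an interior pixel of a solid region is left untouched by the right-border pass, and a pixel on the right border of a thick region is removed.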

The thinning process is carried out on the entire image segment on each side of the border in turn, which means that the process has to be repeated four times. This accounts for the long processing time of 652 ms. There are simpler, less rigorous thinning algorithms which do not require processing to be done on each side of the border. Such algorithms require much shorter processing times, which could be as little as a quarter of that of the algorithm used. However, the program above shows how such a complex algorithm can be implemented on the PIP.

8 Results

Programs for edge detection and thinning based on the algorithms presented above have been written and tested on the PIP. Their performances are listed in Table 1.

Table 1: Processing times of the pattern-recognition algorithms

Operation          Time taken, ms   Loading time, ms   Total time taken, ms
Edge detection         105.80            17.95               123.75
Thinning               580.00            71.80               651.80
Complete program       685.80            89.75               775.55

From Table 1, it can be seen that the time taken to produce a thinned 256 x 256 pixel edge image is 775.55 ms, of which 11.6% was used for loading. The loading time of the image data for the thinning process is four times that for the edge-detection process. This is because the thinning routine is repeated four times, once for each of the four sides of the border in turn. The results of edge detection and thinning are shown in Figs. 9 and 10, respectively, and Fig. 8 contains the original image.
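The figures quoted from Table 1 can be reproduced with a few lines of arithmetic. The sketch below simply recomputes the totals from the per-operation times in the table; the variable and function names are hypothetical.

```python
# Times in ms copied from Table 1: (processing time, loading time)
times = {"edge detection": (105.80, 17.95), "thinning": (580.00, 71.80)}


def summarise(times):
    """Return (total time in ms, loading share in %) for the complete program."""
    processing = sum(p for p, _ in times.values())
    loading = sum(l for _, l in times.values())
    total = processing + loading
    return round(total, 2), round(100 * loading / total, 1)


# Ratio of thinning loading time to edge-detection loading time.
loading_ratio = round(times["thinning"][1] / times["edge detection"][1], 2)
```

The total of 775.55 ms, the loading share of 11.6% and the loading ratio of 4 all agree with the values stated in the text.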

9 Discussion

The TMS32010 is primarily meant for digital signal processing and lacks many of the usual features of commercial 32-bit general-purpose microprocessors, largely because of its large on-chip multiplier. The main problem of the TMS32010 is its limited addressing capability, which requires the comparatively large 16 kilobyte buffer segments to be accessed through I/O ports. It is because of this that blocks of data from the buffer segments have to be moved in and out of the internal memory rather than being accessed directly. This is the main overhead on the system performance and is minimised by hardwiring the address counters to autoincrement, so that only the starting address of the block needs to be supplied.

Fig. 8 Original image

Fig. 9 Edge-detected image

Fig. 10 Thinned image

The TMS32010s are made to run in synchronism, which results in a very simple system. An alternative to such a system is a multiple-instruction multiple-data (MIMD) type system, where a number of TMS32010s operate independently of each other. For image processing, however, the processors should be able to communicate with one another to facilitate the exchange of image data. This requires complicated interprocessor communications and also individual program memories, which demand a complex and expensive system.

The synchronous nature of the system, however, imposes a major restriction on programming. This restriction occurs whenever a decision has to be made on data which can differ between the processing units, since such a situation means that each processor would have to take different actions, which is impossible. Two such situations have been highlighted in the two image-processing programs described earlier.

The TMS32010 itself is already a very fast device, and having eight of them operating in parallel further reduces the processing time by a factor of eight. The present system configuration of the PIP does not attain the real-time processing speed of 33 ms, but it is capable of doing so with the addition of many more processors and minor modifications of the design. Nevertheless, for many non-real-time image-processing applications such as pattern recognition, the speed is adequate. Moreover, as compared with other commercial systems with similar capabilities, our design is simpler, low-cost, flexible and easily interfaced to any host computer system.

We have so far only developed pattern-recognition software, but the system is capable of much more, such as image transformations, digital filtering, data compression and many other image-processing operations.

10 Conclusion

A parallel image processor based on the TMS32010 DSP chips has been developed which is able to process images several orders of magnitude faster than a conventional image-processing system based on a general-purpose microprocessor. Two pattern-recognition algorithms were implemented, achieving a total processing time of 775 ms.

11 References

1 'TMS32010 User's Guide' (Texas Instruments Corp., 1983)

2 MORGAN, D.R., and SILVERMAN, H.F.: 'An investigation into the efficiency of a parallel TMS320 architecture: DFT and speech filterbank applications'. Proceedings of ICASSP 1985, pp. 42.1.1-42.1.4

3 GANESAN, S., AHMAD, M.O., and SWAMY, M.N.S.: 'A multi-microprocessor system with distributed common memory for real-time digital correlation and spectrum analysis'. Proceedings of ICASSP 1985, pp. 42.3.1-42.3.4

4 DEGRYSE, D., DRUILHE, F., and GILLOIRE, A.: 'A multiprocessor structure for signal processing application to acoustic echo cancellation'. Proceedings of ICASSP 1985, pp. 42.4.1-42.4.4

5 MILLER, T.K., and ALEXANDER, S.T.: 'An implementation of LMS adaptive filter using an SIMD multiprocessor ring architecture'. Proceedings of ICASSP 1985, pp. 42.2.1-42.2.4

6 'MC68000 16/32-bit microprocessor, programmer reference manual' (Prentice-Hall Inc., Englewood Cliffs, New Jersey, USA, 1984, 4th edn.)

7 ROSENFELD, A., and KAK, A.C.: 'Digital picture processing, volume II' (Academic Press, New York, 1982), pp. 232-240

8 PAVLIDIS, T.: 'Algorithms for graphics and image processing' (Computer Science Press, Rockville, USA, 1982), pp. 195-201
