Fast algorithm and its systolic realisation for distance transformation


C.-H. Chen, D.-L. Yang

Abstract: Distance transformation has been widely applied to image matching and shape analysis. The hardware implementation of distance transformation is necessary because real-time (video rate) processing is required for most applications. The paper proposes a four-pass algorithm for distance transformation. Its computation complexity is the same as that of a two-pass raster scan algorithm, but the data dependencies are simpler and therefore more suitable for designing hardware architectures. Systolic arrays are very amenable to VLSI implementation. They are especially suited to a special class of computation-bound algorithms with regular, localised data flow. In the paper the authors design a systolic array for the proposed four-pass algorithm and use the multilevel pipelining technique to improve the performance. Its speed is 2.4 times, and the cost is 2/3 times, that of a systolic array designed using the two-pass raster scan algorithm.

1 Introduction

Consider a binary image consisting of object pixels and background pixels. Distance transformation (DT) produces a mapping from a double-valued image to a multivalued image where every pixel has a value corresponding to the distance to the nearest background pixel.

Computing the distance between two pixels is essentially a global operation which is prohibitively costly and unsuitable for VLSI implementation. Therefore algorithms that consider only the local neighbourhood, but still give a reasonable approximation to the Euclidean distance, are necessary. A number of different DTs [1] have been developed, but the chamfer DT [2] has better accuracy than the others. In this paper, a 3-4 DT is adopted to design a hardware architecture; however, the result can be applied to other metrics with only simple modification.

© IEE, 1996
IEE Proceedings online no. 19960380
Paper first received 7th November 1994 and in revised form 15th January 1995
The authors are with the Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, Republic of China

As for DT computation, the local distance, which measures the distance from a central pixel to its corresponding neighbour, is propagated sequentially [2-4] or in parallel [2, 5, 6]. The sequential propagation algorithm is more suitable for a VLSI implementation than the parallel propagation algorithm because the computation complexity of the sequential propagation algorithm is of $O(N^2)$, whereas it is of $O(N^3)$ for the parallel propagation algorithm, where N is the image size. For parallel propagation algorithms, the column array processors [7] can be used to complete DT but many iterations are required. A systolic array designed using the sequential two-pass raster scan algorithm has been developed in [8]. In their design, for an N × N image, N processing elements (PEs) are required to form a systolic array for processing the whole image. The number of PEs can be reduced to half if a feedback path from the output of the $\lceil N/2 \rceil$th PE is provided to the input of the first PE. Each PE is realised by six sets of an adder cascaded by a comparator. The total execution time of their systolic array for two scans of the whole image is $8N + 4(N - 1)$ and the duration of a clock cycle is equal to the delays of one comparator plus one adder. In this paper, a four-pass algorithm of DT is first proposed; its computation complexity is the same as that of the two-pass raster scan algorithm but its data dependencies are simpler and therefore more suitable for designing hardware architectures. Then, we design a systolic array using the proposed four-pass algorithm. The duration of a clock cycle is the same as for the systolic array designed in [8]. The total number of PEs is equal to N and each PE is realised by two sets of an adder cascaded by a comparator. The total execution time of our systolic array for four pass scans of the whole image is $5N$. Its speed is 2.4 times and its cost is 2/3 times that of the systolic array designed in [8].
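As a quick check of the quoted figures, the two ratios can be worked out from the execution times and PE counts stated above. This is only a sketch under two assumptions that the text does not spell out: that cost is measured as the total number of adder-comparator sets, and that the two-pass design is taken in its halved form with N/2 PEs.

```latex
\begin{align*}
\frac{T_{\text{two-pass}}}{T_{\text{four-pass}}}
  &= \frac{8N + 4(N-1)}{5N} = \frac{12N - 4}{5N} \approx 2.4 \quad (N \gg 1) \\
\frac{\text{cost}_{\text{four-pass}}}{\text{cost}_{\text{two-pass}}}
  &= \frac{N \times 2}{(N/2) \times 6} = \frac{2N}{3N} = \frac{2}{3}
\end{align*}
```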

2 Algorithms

2.1 Two-pass algorithm

According to [9, 10], DT on a binary image $I$ (object $O$, background $B$) produces a mapping from a double-valued function $f_1$ in a space $S$ to a multivalued function $f_2$ in the same space $S$. The binary function $f_1$ is defined as

$$f_1(p) = \begin{cases} M, & p \in O \\ 0, & p \in B \end{cases} \qquad (1)$$

where $M$ is a large integer value greater than the maximum possible distance value of any binary object that could exist in $I$. The values of the function $f_2$ replace those of $f_1$ by minimising a distance metric $\mathrm{dist}(p, p_b)$ over the binary image $I$, where $p$ and $p_b$ are two pixels in the space $S$ such that $p \in I$ and $p_b \in B$. The distance values of $f_2(p)$, for $p \in I$, are given by

$$f_2(p) = \min_{p_b \in B} \mathrm{dist}(p, p_b) \qquad (2)$$

IEE Proc.-Comput. Digit. Tech., Vol. 143, No. 3, May 1996 168

In general, DTs are defined by the local distances between a central pixel and all the pixels in a small neighbourhood around it. The local distances are then propagated over the image, sequentially or in parallel. A convenient way to illustrate a DT is a 'mask', where each pixel in the neighbourhood is marked by its local distance as shown in Fig. 1. In this paper the 3 × 3 neighbourhood is considered. The most common values of a and b for various DTs are listed in Table 1. The maximum error that can occur between the DT and the Euclidean distance in an N × N image is also listed [11].

Table 1: Suggested integer local distances

| a | b | Name | Max. error (%) |
|---|---|------|----------------|
| 1 | x | city-block | 58.6 |
| 1 | 1 | chessboard | 41.4 |
| 2 | 3 | 2-3 DT | 13.4 |
| 3 | 4 | 3-4 DT | 8.1 |
| 5 | 7 | 5-7 DT | 8.3 |

Fig. 1 Mask for 3 × 3 weighted DT

Fig. 2 Illustration of an iterative procedure for parallel propagation: binary image

Fig. 3 Illustration of an iterative procedure for parallel propagation: 3-4 DT map after one iteration

Fig. 4 Illustration of an iterative procedure for parallel propagation: complete 3-4 DT map

In parallel propagation, the distance value at a cell is a function of its neighbours' values in the previous iteration. We can rewrite eqn. 2 for the iterative process as

$$f_2^{j}(p) = \min\left[ f_2^{j-1}(p),\ \min_{p_n \in N_p}\left( f_2^{j-1}(p_n) + W_n \right) \right] \qquad (3)$$

where $j$ stands for the iteration, $N_p$ is the neighbourhood of $p$ defined by the mask, $p_n$ is a pixel in that neighbourhood and $W_n$ stands for the local distance given by the mask. Propagation of distances is an iterative procedure, where the number of iterations corresponds to the distance from the point sources and depends on the maximal distance value of the object $O$. Figs. 2-4 illustrate the parallel propagation of the 3-4 DT. Typically, for such parallel algorithms, the time complexity is of the order of $O(N)$ for an image of size N × N, but these require an expensive architecture with one processor per pixel, that is, a total time-processor complexity or numerical complexity of the order of $O(N^3)$. For a 3 × 3 mask, it needs eight additions and eight minimisation operations per pixel in the integer range. The total number of operations is

$$16\ (\text{operations/pixel}) \times N^2\ (\text{pixels/iteration}) \times N_i\ (\text{iterations/frame}) = 16 \times N_i \times N^2\ (\text{operations/frame})$$

where $N_i$ is the number of iterations.

However, for such a simple propagation transform, a sequential implementation provides equivalent results. In sequential propagation, an object pixel updates its distance value by comparing itself to some of its neighbours which already have added local distances. The neighbourhood is defined by the submask, which is a subset of the complete mask. Propagation occurs 'from the previously processed neighbours into the presently processed pixel', that is, the concept of causality. Now the time complexity is of the order of $O(N^2)$ and only one processor is required, leading to a time-processor complexity of the order of $O(N^2)$ [10]. For sequential propagation, we can rewrite eqn. 3 as

$$f_2^{j}(p) = \min\left[ f_2^{j-1}(p),\ \min_{p_n \in N_p'}\left( f_2^{j}(p_n) + W_n \right) \right] \qquad (4)$$

where $N_p'$ is the neighbourhood of $p$ defined by the submask used in the present pass, and the other symbols are the same as in eqn. 3. In the initial iteration (i.e. $j = 0$), object pixels are thereby assigned the value $M$ (eqn. 1). We require that $M$ be larger than any distance value that could exist in $I$. This is necessary for correctly using the minimisation operation in eqn. 4 to propagate distance. Since an iteration corresponds to a raster scan pass, the maximal number of iterations is fixed and independent of the object $O$; it depends only on the number of masks used and the size of the image.
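The parallel propagation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `parallel_dt_34`, the choice of `M` as `b * (rows + cols)`, and the rule that neighbours outside the image are simply ignored are all assumptions made here.

```python
def parallel_dt_34(image, a=3, b=4):
    """Parallel propagation with the full 3 x 3 mask, iterated until stable.

    image: list of rows, 1 = object pixel, 0 = background pixel.
    Returns (distance_map, number_of_iterations_that_changed_a_value).
    """
    n_rows, n_cols = len(image), len(image[0])
    big = b * (n_rows + n_cols)  # plays the role of M (assumed large enough)
    dist = [[big if v else 0 for v in row] for row in image]
    # Full mask: (dx, dy, local distance) for the eight neighbours.
    mask = [(dx, dy, a if dx == 0 or dy == 0 else b)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dx, dy) != (0, 0)]
    iterations = 0
    while True:
        new = [row[:] for row in dist]
        for y in range(n_rows):
            for x in range(n_cols):
                for dx, dy, w in mask:
                    xn, yn = x + dx, y + dy
                    if 0 <= xn < n_cols and 0 <= yn < n_rows:
                        # Neighbour values are read from the previous iteration only.
                        new[y][x] = min(new[y][x], dist[yn][xn] + w)
        if new == dist:  # nothing changed: the map is complete
            return dist, iterations
        dist, iterations = new, iterations + 1
```

On a 5 × 5 image whose only background pixel is the centre, the map settles after two changing iterations, which matches the statement that the iteration count follows the maximal distance from the point sources.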

Fig. 5 Submasks of two-pass algorithm for 3-4 DT: submask used in forward raster scan

Fig. 6 Submasks of two-pass algorithm for 3-4 DT: submask used in backward raster scan

The well known algorithm described by Rosenfeld and Pfaltz [3] and Borgefors [2] is a two-pass transformation using two complementary raster scans and the associated four-neighbour submasks (partial neighbourhoods) as shown in Figs. 5 and 6. Here and subsequently, we specify a raster scan as 'from left to right and from top to bottom', which means that the raster submask is horizontal, scanning along the raster line from left to right, then starting again at the left-hand end of the next line, and so on to the bottom of the image $I$. An example is shown in Figs. 7 and 8.

Fig. 7 Illustration of two-pass algorithm for 3-4 DT: 3-4 DT map after forward raster scan

Fig. 8 Illustration of two-pass algorithm for 3-4 DT: 3-4 DT map after backward raster scan
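A minimal Python sketch of the two-pass, four-neighbour scan may help make the submasks concrete. The function name `two_pass_dt_34`, the value used for `M`, and the border rule (neighbours outside the image are skipped) are assumptions of this sketch; the submask weights follow the 3-4 DT described in the text.

```python
def two_pass_dt_34(image, a=3, b=4):
    """Two-pass chamfer DT: forward raster scan, then backward raster scan.

    image: list of rows, 1 = object pixel, 0 = background pixel.
    """
    n_rows, n_cols = len(image), len(image[0])
    big = b * (n_rows + n_cols)  # plays the role of M (assumed large enough)
    dist = [[big if v else 0 for v in row] for row in image]
    # Forward submask: the four neighbours already visited by a
    # left-to-right, top-to-bottom raster, as (dx, dy, local distance).
    forward = [(-1, -1, b), (0, -1, a), (1, -1, b), (-1, 0, a)]
    # Backward submask: the point reflection of the forward one.
    backward = [(-dx, -dy, w) for dx, dy, w in forward]

    def scan(submask, ys, xs):
        for y in ys:
            for x in xs:
                for dx, dy, w in submask:
                    xn, yn = x + dx, y + dy
                    if 0 <= xn < n_cols and 0 <= yn < n_rows:
                        dist[y][x] = min(dist[y][x], dist[yn][xn] + w)

    scan(forward, range(n_rows), range(n_cols))
    scan(backward, range(n_rows - 1, -1, -1), range(n_cols - 1, -1, -1))
    return dist
```

Because each pixel reads values already updated in the same scan, two scans suffice regardless of the object size, in contrast to the iteration count of parallel propagation.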

2.2 Proposed four-pass algorithm

The four-pass, two-neighbour algorithm produces the same results as the two-pass, four-neighbour algorithm. In the four-pass algorithm, we use four submasks each with two neighbours, decomposed from the submasks used in the two-pass algorithm. The four complementary rasters are shown in Fig. 9.

Fig. 9 Four complementary rasters
a From left to right and from top to bottom
b From top to bottom and from right to left
c From right to left and from bottom to top
d From bottom to top and from left to right

These four complementary rasters can be scanned in arbitrary order, but we adopt the circular clockwise (or counterclockwise) order, since it is convenient for pipelining between systolic arrays (SAs) within the system. The four-pass algorithm for the circular clockwise order is described as follows:

Initial phase:

$$f_2^0(x,y) = f_1(x,y)$$

The first pass: for y = 1, ..., N do, for x = 1, ..., N do

$$f_2^1(x,y) = \min\left[ f_2^0(x,y),\ f_2^1(x-1,y-1) + 4,\ f_2^1(x-1,y) + 3 \right]$$

The second pass: for x = N, ..., 1 do, for y = 1, ..., N do

$$f_2^2(x,y) = \min\left[ f_2^1(x,y),\ f_2^2(x+1,y-1) + 4,\ f_2^2(x,y-1) + 3 \right]$$

The third pass: for y = N, ..., 1 do, for x = N, ..., 1 do

$$f_2^3(x,y) = \min\left[ f_2^2(x,y),\ f_2^3(x+1,y+1) + 4,\ f_2^3(x+1,y) + 3 \right]$$

The fourth pass: for x = 1, ..., N do, for y = N, ..., 1 do

$$f_2^4(x,y) = \min\left[ f_2^3(x,y),\ f_2^4(x-1,y+1) + 4,\ f_2^4(x,y+1) + 3 \right]$$

where $f_2^j(x,y)$ is the distance value of the pixel in position (x,y) for the jth raster. An example is shown in Figs. 10-13.

Fig. 10 Illustration of four-pass algorithm for 3-4 DT: 3-4 DT map after first raster scan

Fig. 11 Illustration of four-pass algorithm for 3-4 DT: 3-4 DT map after second raster scan

Fig. 12 Illustration of four-pass algorithm for 3-4 DT: 3-4 DT map after third raster scan

Fig. 13 Illustration of four-pass algorithm for 3-4 DT: 3-4 DT map after final raster scan
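The four passes can be sketched in Python as follows, taking the clockwise order and the two-neighbour submasks that the scan directions imply. The names `four_pass_dt_34` and `chamfer_34_reference` are hypothetical, the border rule (out-of-image neighbours are skipped) and the value of `M` are assumptions, and the reference function uses the closed-form 3-4 chamfer distance between two pixels, which is an addition of this sketch rather than something taken from the paper.

```python
def four_pass_dt_34(image, a=3, b=4):
    """Four-pass, two-neighbour chamfer DT in clockwise raster order.

    image: list of rows, 1 = object pixel, 0 = background pixel.
    x indexes columns and y indexes rows, both 0-based here.
    """
    n_rows, n_cols = len(image), len(image[0])
    big = b * (n_rows + n_cols)  # plays the role of M (assumed large enough)
    dist = [[big if v else 0 for v in row] for row in image]
    xs_up, xs_down = range(n_cols), range(n_cols - 1, -1, -1)
    ys_up, ys_down = range(n_rows), range(n_rows - 1, -1, -1)
    # Each pass: (pixel visiting order, diagonal neighbour, axial neighbour).
    passes = [
        ([(x, y) for y in ys_up for x in xs_up], (-1, -1), (-1, 0)),
        ([(x, y) for x in xs_down for y in ys_up], (1, -1), (0, -1)),
        ([(x, y) for y in ys_down for x in xs_down], (1, 1), (1, 0)),
        ([(x, y) for x in xs_up for y in ys_down], (-1, 1), (0, 1)),
    ]
    for order, diag, axial in passes:
        for x, y in order:
            for (dx, dy), w in ((diag, b), (axial, a)):
                xn, yn = x + dx, y + dy
                if 0 <= xn < n_cols and 0 <= yn < n_rows:
                    dist[y][x] = min(dist[y][x], dist[yn][xn] + w)
    return dist


def chamfer_34_reference(image, a=3, b=4):
    """Brute-force check: closed-form chamfer distance to every background pixel.

    Requires at least one background pixel in the image.
    """
    background = [(x, y) for y, row in enumerate(image)
                  for x, v in enumerate(row) if v == 0]
    return [[min(b * min(abs(x - bx), abs(y - by))
                 + a * abs(abs(x - bx) - abs(y - by))
                 for bx, by in background)
             for x in range(len(row))] for y, row in enumerate(image)]
```

Each pass reads only two already-updated neighbours per pixel, which is the simplification of the data dependencies that the hardware design later relies on.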

Both two-pass and four-pass algorithms have the same computation complexity. The two-pass algorithm needs four additions and four minimisation operations per pixel in the integer range, and two rasters to propagate the local distance over the image I. For an N x N image, the total number of operations is

$$2\ (\text{rasters/frame}) \times N^2\ (\text{pixels/raster}) \times 8\ (\text{operations/pixel}) = 16 \times N^2\ (\text{operations/frame})$$


The four-pass algorithm needs two additions and two minimisation operations per pixel in the integer range, and four rasters to propagate the local distance over the image $I$. For an N × N image, the total number of operations is equal to that of the two-pass algorithm, as given below:

$$4\ (\text{rasters/frame}) \times N^2\ (\text{pixels/raster}) \times 4\ (\text{operations/pixel}) = 16 \times N^2\ (\text{operations/frame})$$

As regards the data dependencies between the central pixel and its neighbours, the four-pass algorithm is simpler than the two-pass algorithm. In the design stage, simpler data dependencies offer advantages in arranging timing and allocating processors, thus reducing computation time and hardware cost.

Table 2: Data dependencies and the number of basic operations per pixel of different DT algorithms

| | Parallel propagation | Two-pass | Four-pass |
|---|---|---|---|
| Data dependencies/pixel | 8 | 4 | 2 |
| Operations/frame | 16 × N × N × N_i | 16 × N × N | 16 × N × N |

In Table 2, we compare the number of data dependencies per pixel and the total number of operations per frame for the parallel propagation algorithm, the two-pass algorithm and the four-pass algorithm. The parallel propagation algorithm has the largest data dependencies and total operations. The two-pass and four-pass algorithms have the same total number of operations per frame, but the four-pass algorithm has the smallest data dependencies.

3 Design of systolic arrays

In this Section, we adopt the canonical mapping methodology [10] and the multilevel pipelining technique to design an SA. First, the dependence graph (DG) of the four-pass algorithm is built. Next, we select the fastest scheduling and cheapest projection vectors that satisfy the conditions that every edge of the DG has one or more delays, and that an equitemporal hyperplane is not projected to the same PE, to form the corresponding signal flow graph (SFG). Subsequently, the delay and operation of one node in the SFG are combined with the module operation to form a PE.

Fig. 14 DG of four-pass algorithm

3.1 Mapping algorithm to DG

The macro-operation of the four-pass algorithm is repeated below:

$$f_2^1(x,y) = \min\left[ f_2^0(x,y),\ f_2^1(x-1,y-1) + 4,\ f_2^1(x-1,y) + 3 \right]$$

where we only show the first pass, since the rest have similar forms. This algorithm has been expressed in a local form, in which each pixel is directly dependent on its own neighbourhood. The DG corresponding to this algorithm is shown in Fig. 14. Note that the input and output of each node are omitted. Fig. 15 shows the complete data dependencies for one node. The edge vectors $\{b_i\}$, where i = 1, 2, illustrate that $f_2^1(x,y)$ is directly dependent upon $f_2^1(x-1,y-1)$ and $f_2^1(x-1,y)$, and each node denotes the macro-operation given above.

Fig. 15 DG of four-pass algorithm: dependencies of one node

Fig. 16 Illustration of mapping DG to SFG for four-pass algorithm

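3.2 Mapping DG to SFG

We choose the schedule vector $\vec{s} = [1\ 1]^T$, the projection vector $\vec{d} = [0\ 1]^T$ and the processor basis $P = [1\ 0]^T$, which is orthogonal to $\vec{d}$. Thus, we can describe the mapping from DG to SFG by an algebraic transformation given as follows:

$$\vec{d} = [0\ 1]^T, \quad P = [1\ 0]^T, \quad \vec{s} = [1\ 1]^T$$

Node mapping:

$$n = P^T c = [1\ 0] \begin{bmatrix} x \\ y \end{bmatrix} = x$$

Arc mapping:

$$\begin{bmatrix} \vec{s}^T \\ P^T \end{bmatrix} b_1 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} \vec{s}^T \\ P^T \end{bmatrix} b_2 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

I/O mapping:

$$t(c) = \vec{s}^T c = [1\ 1] \begin{bmatrix} x \\ y \end{bmatrix} = x + y$$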

where the input mapping and output mapping have the same result. The resulting SFG is a horizontal linear array shown in Fig. 16, where each dashed line represents a linear schedule hyperplane; the nodes on one hyperplane are calculated in the same time step. The node mapping result 'n = x' means that any node (x,y) in the DG is mapped to node 'x' in the SFG, which will be implemented as PE_x. The same applies to the mapping of the inputs and outputs in the DG to node positions in the SFG. The resulting 't(c) = x + y' is the computation time step: any input data will be transmitted into the PE before t(c) and any output data will be completely calculated after t(c). Table 3 illustrates the I/O mapping.

Table 3: Timing scheme of the SA for the four-pass algorithm

| Time step | PE_1 | PE_2 | PE_3 | PE_4 |
|---|---|---|---|---|
| t1 | A1, C1 | | | |
| t2 | A1, C1; A2, C2 | A1, C1 | | |
| t3 | A1, C1; A2, C2 | A1, C1; A2, C2 | A1, C1 | |
| t4 | A1, C1; A2, C2 | A1, C1; A2, C2 | A1, C1; A2, C2 | A1, C1 |
| t5 | A2, C2 | A1, C1; A2, C2 | A1, C1; A2, C2 | A1, C1; A2, C2 |
| t6 | | A2, C2 | A1, C1; A2, C2 | A1, C1; A2, C2 |
| t7 | | | A2, C2 | A1, C1; A2, C2 |
| t8 | | | | A2, C2 |

The schedule vector $\vec{s} = [1\ 1]^T$ chosen is not the fastest selection if we only consider the stage of mapping the DG to the SFG; $\vec{s} = [1\ 0]^T$ is faster. However, when we use the multilevel pipelining technique, the SA for the schedule vector $\vec{s} = [1\ 1]^T$ can be pipelined at the system level but the $\vec{s} = [1\ 0]^T$ case cannot. The choice $\vec{s} = [1\ 1]^T$ therefore results in the best performance for the four-pass algorithm.
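The node, arc and time mappings for the chosen vectors can be checked with a short Python sketch. The function name `map_dg_to_sfg` is hypothetical, and the edge vectors b1 = (1, 1) and b2 = (1, 0) are taken here as the displacement from the source node to the dependent node, which is an assumption about the sign convention.

```python
def dot(u, v):
    """Inner product of two equal-length integer tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def map_dg_to_sfg(n, s=(1, 1), d=(0, 1), p=(1, 0)):
    """Apply schedule vector s, projection vector d and processor basis p.

    Returns (arc, nodes): arc maps each edge name to (delay, PE displacement),
    nodes maps each DG node (x, y) to (PE index, time step).
    """
    edges = {"b1": (1, 1), "b2": (1, 0)}  # first-pass dependencies (assumed sign)
    if dot(s, d) == 0:
        raise ValueError("nodes on one hyperplane would share a PE")
    if dot(p, d) != 0:
        raise ValueError("processor basis must be orthogonal to d")
    arc = {name: (dot(s, b), dot(p, b)) for name, b in edges.items()}
    if any(delay < 1 for delay, _ in arc.values()):
        raise ValueError("every edge needs at least one delay")
    nodes = {(x, y): (dot(p, (x, y)), dot(s, (x, y)))
             for x in range(1, n + 1) for y in range(1, n + 1)}
    return arc, nodes
```

With the default vectors, node (2, 2) maps to PE_2 at time step 4, in line with 'n = x' and 't(c) = x + y', and no two nodes share both a PE and a time step.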

Fig. 17 Illustration of resulting SFG for four-pass algorithm

3.3 Mapping SFG onto SA

In this stage, we combine the module operation with the delays, output edges and I/O edges of any node in the SFG to form a PE. The resulting SA is shown in Fig. 17. For the PE level pipelining, the macro-operation of the four-pass algorithm is decomposed into two additions and two minimisation operations as follows:

$$A_1(x,y) = f_2^1(x-1,y-1) + 4, \qquad A_2(x,y) = f_2^1(x-1,y) + 3$$

$$f_2^1(x,y) = \min\left[ f_2^0(x,y),\ A_1(x,y),\ A_2(x,y) \right]$$

The above basic operations imply that all the additions involved can be calculated at the same time, since they do not have any dependencies between each other. However, the minimisation operations involved must be calculated sequentially. The minimisation operations can be written in the recursive form shown below.

$$C_0(x,y) = f_2^0(x,y), \qquad C_k(x,y) = \min\left[ C_{k-1}(x,y),\ A_k(x,y) \right],\ k = 1, 2, \qquad f_2^1(x,y) = C_2(x,y)$$

where $C_k(x,y)$ is dependent on $C_{k-1}(x,y)$ and $A_k(x,y)$. One addition and one minimisation operation of the same subindex are combined in the same clock cycle, in which the addition is followed by the minimisation operation. It can be implemented as an adder followed by a comparator. The timing scheme shown in Table 3 lists the computation of every PE at the corresponding time step. We can see that the mapping result of t(c) in the I/O mapping is 'x + y', which can be verified by the corresponding time step at every position denoted by 'C2'. The total computation time is obtained from the time mapping result by replacing x and y in 'x + y' with the image size. It can be verified by the last time step. For checking the causality condition, we take all operations of $f_2^1(2,2)$ as an example. They were calculated by PE_2 from t3 to t4, where all additions (A1 at t3 and A2 at t4) were calculated after the corresponding minimisation operations, i.e. C2 of PE_1 and itself. Fig. 18 shows the corresponding design of the PE, which is simple, regular and suitable for VLSI implementation.
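The timing scheme can be mimicked by a clock-by-clock Python model of the first pass. This sketch assumes the schedule read off Table 3: PE_x performs A1 and C1 of node (x, y) at clock x + y - 1 and A2 and C2 at clock x + y. The function name `simulate_first_pass`, the 1-based node indices, the square N × N input and the treatment of out-of-image neighbours as contributing nothing are all assumptions of this sketch.

```python
def simulate_first_pass(f0, a=3, b=4):
    """Clock-by-clock model of the first pass on a linear array of N PEs.

    f0[y-1][x-1] holds the initial value of pixel (x, y): 0 for background,
    a large M for object. Returns (f1, clocks_used).
    """
    n = len(f0)
    done = {}  # (x, y) -> final value, readable from the clock after its C2
    c1 = {}    # (x, y) -> intermediate result of the first comparator

    def operand(x, y):
        if not (1 <= x <= n and 1 <= y <= n):
            return float("inf")  # neighbour outside the image: no contribution
        return done[(x, y)]      # a KeyError here would mean causality is violated

    for t in range(1, 2 * n + 1):
        finished = {}
        for x in range(1, n + 1):          # PE_x handles every node (x, y)
            y1 = t - x + 1                 # node doing A1 and C1 at clock t
            if 1 <= y1 <= n:
                a1 = operand(x - 1, y1 - 1) + b
                c1[(x, y1)] = min(f0[y1 - 1][x - 1], a1)
            y2 = t - x                     # node doing A2 and C2 at clock t
            if 1 <= y2 <= n:
                a2 = operand(x - 1, y2) + a
                finished[(x, y2)] = min(c1[(x, y2)], a2)
        done.update(finished)              # C2 outputs become visible next clock
    f1 = [[done[(x, y)] for x in range(1, n + 1)] for y in range(1, n + 1)]
    return f1, 2 * n
```

The model finishes one pass in 2N clocks (t8 for N = 4, as in Table 3) and never reads an operand before its C2 has completed, which is the causality condition checked in the text for node (2, 2).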

Fig. 18 PE of SA for four-pass algorithm

Fig. 19 Illustration of system level pipelining for four-pass algorithm

In the system level pipelining, the SA of a pass can be fired up when its previous pass has been half executed, as shown in Fig. 19, where the computation time steps for the first pass and the second pass are indicated. It can be observed that all four rasters can be implemented by only one SA, because it satisfies the causality condition and does not overlap any PE for the same time step.

The duration of a clock cycle of the proposed linear systolic array with N PEs may be decreased if the system is too large to fit on a chip and the memory bandwidth is taken into account. A smaller linear systolic array with N/m PEs can be used if each pass is divided into m subpasses. The total number of clocks increases to 4mN + N/m. The duration of a clock cycle equals the propagation delays of an adder plus a comparator. There are many methods to obtain high-speed adders and comparators [13, 14]. Here, an 8-bit parallel adder cascading an 8-bit parallel comparator has been simulated in the SPICE circuit simulator. The simulation result of the clock duration is about 20.88 ns for a 7.25 µm PFET and 4 µm NFET CMOS process.
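The clock count and the simulated clock period can be combined into a rough frame-time estimate. This is a sketch only: the function names are hypothetical, m is assumed to divide N exactly, and the 20.88 ns default is the simulated adder-plus-comparator delay quoted above, not a measured figure.

```python
def total_clocks(n, m=1):
    """Clocks for four passes on N/m PEs: 4mN + N/m (5N when m = 1)."""
    if n % m:
        raise ValueError("this sketch assumes m divides n exactly")
    return 4 * m * n + n // m


def frame_time_us(n, m=1, clock_ns=20.88):
    """Estimated time per frame in microseconds for an n x n image."""
    return total_clocks(n, m) * clock_ns / 1000.0
```

For example, `total_clocks(512)` gives 2560 clocks for the full array, while halving the array with `m=2` raises the count to 4352, showing the trade between PE count and execution time.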

4 Conclusions

This paper first develops the four-pass algorithm for the 3-4 DT. The computation complexity of the four-pass algorithm is the same as that of the conventional two-pass algorithm, but less than that of parallel propagation. Its data dependencies are fewer than for the conventional two-pass algorithm and it is therefore more suitable for hardware design. The SA for the four-pass algorithm was designed. Its speed is 2.4 times and its cost is 2/3 times that of a systolic array designed using the two-pass raster scan algorithm. The SA designed in this paper has the properties of modularity, regularity, local interconnection, a high degree of pipelining and highly synchronised multiprocessing, and is very amenable to VLSI implementation.

5 References

1 GONZALEZ, R.C., and WOODS, R.E.: 'Digital image processing' (Addison-Wesley, 1992), Reprinted with corrections, pp. 45-47

2 BORGEFORS, G.: 'Distance transformations in digital images', CVGIP, 1986, 34, pp. 344-371

3 ROSENFELD, A., and PFALTZ, J.L.: 'Distance function on digital pictures', Pattern Recognit., 1968, 1, (1), pp. 33-61

4 WANG, X., and BERTRAND, G.: 'Some sequential algorithms for a generalized distance transformation based on Minkowski operation', IEEE Trans., 1992, PAMI-14, (11), pp. 1114-1121

5 BORGEFORS, G., HARTMANN, T., and TANIMOTO, S.L.: 'Parallel distance transforms on pyramid machines: Theory and implementation', Signal Process., 1990, 21, (1), pp. 61-86

6 ROSENFELD, A.: 'Parallel image processing using cellular arrays', Computer, 1983, pp. 14-20

7 ZHAO, D., and DAUT, D.G.: 'A real-time column array processor architecture for images', IEEE Trans., 1992, CSVT-2, (1), pp. 38-48

8 SHIH, F.Y., KING, C.T., and PU, C.C.: 'Pipeline architectures for recursive morphological operations', IEEE Trans., 1993, IP-4, (1), pp. 11-18

9 MEHER, P.K., SATAPATHY, I.K., and PANDA, G.: 'Efficient systolic solution for a new prime factor discrete Hartley transform algorithm', IEE Proc. G, 1993, 140, (2), pp. 135-139

10 LEYMARIE, F., and LEVINE, M.D.: 'Fast raster scan distance propagation on the discrete rectangular lattice', CVGIP: Image Underst., 1992, 55, (1), pp. 84-94

11 BORGEFORS, G.: 'Distance transformations in arbitrary dimensions', CVGIP, 1984, 27, pp. 321-345

12 KUNG, S.Y.: 'VLSI array processors' (Prentice-Hall, 1988), pp. 110-294

13 SHOJI, M.: 'CMOS digital circuit technology' (Prentice-Hall, 1987)

14 WESTE, N.H.E., and ESHRAGHIAN, K.: 'Principles of CMOS VLSI design' (Addison-Wesley, 1993)
