
Connected Component Labelling,

an embarrassingly sequential algorithm

Platform Parallel Netherlands

GPGPU-day, 20 June 2013

Jaap van de Loosdrecht

NHL Centre of Expertise in Computer Vision

Van de Loosdrecht Machine Vision BV

Limerick Institute of Technology

Overview

• Introduction and background

• Connected Component Labelling

• Sequential

• Few-core

• Many-core

• Kalentev et al. approach

• Suggestions for extending

• Suggestions for optimizing

• Summary and conclusions

• Future work on CCL

• References

• Future of intelligent cameras

• Questions

Introduction

• Manager NHL Centre of Expertise in Computer Vision

• University of Applied Sciences, Leeuwarden

• 4 FTE

• Since 1996: 180 industrial projects

• Managing director Van de Loosdrecht Machine Vision BV

• VisionLab: development environment for Computer Vision with

Pattern matching, Neural networks and Genetic algorithms

• Portable library (ANSI C++)

Windows, Linux and Android

x86, x64, ARM and PowerPC

• Student Limerick Institute of Technology (Ireland)

• Research master project, 1 September 2011 – 1 September 2013

Research master project

“Accelerating sequential computer vision algorithms using

commodity parallel hardware”

Apply parallel programming techniques to meet the challenges posed

in computer vision by the limits of sequential architectures

Distinctive: investigate how to speed up a whole library by

parallelizing the algorithms in an economical way and execute them

on multiple platforms

• Generic library, 100,000 lines of ANSI C++

• Portability and vendor independency

• OpenMP for CPU, OpenCL for GPU

• Variance in execution times

• Run-time prediction of whether parallelization is beneficial

Computer vision algorithms and parallelization

Classification image operators

• Low level image operators

• Point operators

• Local neighbour operators

• Global operators

• Connectivity based operators

• High level image operators

• Often built on the low level operators

• “Specials”

• Pattern matcher, neural network, genetic algorithm, etc.

Idea: start with low level image operators, design and implement

skeletons for parallelizing representatives of each class
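As an illustration of what such a skeleton could look like for the simplest class, the point operators, here is a minimal sketch. The names `PointOperator` and `Threshold` are hypothetical and not taken from VisionLab; the only assumption is that a point operator maps each pixel independently, so a single OpenMP pragma parallelizes every operator built on the skeleton.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical skeleton for point operators: applies op to every pixel.
// The pragma takes effect when compiled with OpenMP enabled (for example
// -fopenmp) and is ignored otherwise, so the same source builds sequentially.
template <typename Pixel, typename Op>
void PointOperator(std::vector<Pixel>& image, Op op) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(image.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        image[i] = op(image[i]);
    }
}

// One representative built on the skeleton: pixels in [low, high] become
// object pixels (1), all others background (0).
inline void Threshold(std::vector<std::uint8_t>& image,
                      std::uint8_t low, std::uint8_t high) {
    assert(low <= high);
    PointOperator(image, [low, high](std::uint8_t p) -> std::uint8_t {
        return (p >= low && p <= high) ? 1 : 0;
    });
}
```

Connectivity based operators such as blob labelling do not fit this pattern, because the result for one pixel depends on pixels arbitrarily far away.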

Demonstration Label Blobs

• Open image cells.jl

• Show image contents

• ThresholdIsoData


• Explain background/objects, white/black and 0/1

• LabelBlobs, show image contents


• Explain 3 used colours

• BlobAnalyse

• Explain table

• Explain successive label numbering

Screen shot demo


Label blobs iterative algorithm

• Give each object pixel a unique positive value

Classical sequential approach Haralick and Shapiro (1992)

• Binary image: all 13 object pixels have the value 1

[Figure: the binary image and the same image after its object pixels have been given the unique values 1 to 13]


Label blobs iterative algorithm

• Repeat until no changes

• Down pass (top left to bottom right):

give each object pixel the minimum value found among itself and its 8 neighbours (object pixels only)

• Up pass (bottom right to top left):

give each object pixel the minimum value found among itself and its 8 neighbours (object pixels only)

[Figure: labels of the example image during the iteration (a mix of the values 1 and 3) and after convergence (all object pixels have the value 1)]
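The iterative scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the VisionLab implementation: the image is a row-major array with 0 as background, eight connectivity is used, and the names `LabelImage`, `MinInNeighbourhood` and `LabelBlobsIterative` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using LabelImage = std::vector<std::int32_t>;  // row-major, 0 = background

// Smallest label among pixel (x, y) and its 8-connected object neighbours.
inline std::int32_t MinInNeighbourhood(const LabelImage& img, int w, int h,
                                       int x, int y) {
    std::int32_t best = img[y * w + x];
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            const std::int32_t v = img[ny * w + nx];
            if (v != 0) best = std::min(best, v);  // background is ignored
        }
    }
    return best;
}

// Labels a binary image in place; returns the number of down/up iterations.
inline int LabelBlobsIterative(LabelImage& img, int w, int h) {
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    std::int32_t next = 1;
    for (auto& p : img) {
        if (p != 0) p = next++;  // unique positive value per object pixel
    }
    auto relax = [&](int x, int y) {
        std::int32_t& p = img[y * w + x];
        if (p == 0) return false;
        const std::int32_t m = MinInNeighbourhood(img, w, h, x, y);
        if (m == p) return false;
        p = m;
        return true;
    };
    int iterations = 0;
    bool changed = true;
    while (changed) {  // repeat until no changes
        changed = false;
        ++iterations;
        for (int y = 0; y < h; ++y)           // down pass
            for (int x = 0; x < w; ++x)
                changed |= relax(x, y);
        for (int y = h - 1; y >= 0; --y)      // up pass
            for (int x = w - 1; x >= 0; --x)
                changed |= relax(x, y);
    }
    return iterations;
}
```

Each blob ends up carrying the smallest initial value of its pixels, so the final labels are unique per blob but not necessarily successive.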

Sequential version

He, Chao, and Suzuki (2008): the two-pass approach gives the best performance

• Pass1: equivalent labels are stored in equivalence table

(neighbourhood search)

• Resolving equivalences with search algorithm

• Pass2: assign label to pixel (lookup table)

• Analysis of execution time (VisionLab) in µs on Core i7-2640M for ‘typical image’ cells.jl

Image size   Pass1 (µs)   Resolving equivalences (µs)   Pass2 (µs)   Total (µs)   Pass1/Total
256x256      134          1                             43           178          0.75
512x512      405          2                             159          566          0.71
1024x1024    1358         3                             629          1990         0.68
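To make the three stages measured above concrete, here is a minimal sketch of a classical two-pass labelling with an equivalence table. It is not the run-based algorithm of He, Chao, and Suzuki and not the VisionLab code; the table is kept as a simple union-find forest, eight connectivity is assumed, and the names `EquivalenceTable` and `LabelBlobsTwoPass` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Equivalence table kept as a union-find forest indexed by provisional label.
struct EquivalenceTable {
    std::vector<std::int32_t> parent{0};  // entry 0 is the background
    std::int32_t NewLabel() {
        parent.push_back(static_cast<std::int32_t>(parent.size()));
        return parent.back();
    }
    std::int32_t Find(std::int32_t label) const {
        while (parent[label] != label) label = parent[label];
        return label;
    }
    void Merge(std::int32_t a, std::int32_t b) {
        a = Find(a);
        b = Find(b);
        if (a < b) parent[b] = a; else parent[a] = b;  // smaller label is root
    }
};

// img: row-major, 0 = background, non-zero = object. Overwritten with
// successive labels 1..N; returns N.
inline std::int32_t LabelBlobsTwoPass(std::vector<std::int32_t>& img, int w, int h) {
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    EquivalenceTable table;
    const int dx[4] = {-1, -1, 0, 1};   // already visited neighbours:
    const int dy[4] = {0, -1, -1, -1};  // W, NW, N, NE
    // Pass 1: provisional labels, equivalences stored in the table.
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const int i = y * w + x;
            if (img[i] == 0) continue;
            std::int32_t label = 0;
            for (int k = 0; k < 4; ++k) {
                const int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= w) continue;
                const std::int32_t n = img[ny * w + nx];
                if (n == 0) continue;
                if (label == 0) label = n;
                else if (n != label) table.Merge(label, n);
            }
            img[i] = (label != 0) ? label : table.NewLabel();
        }
    }
    // Resolve equivalences into a lookup table with successive numbers.
    std::vector<std::int32_t> lut(table.parent.size(), 0);
    std::int32_t blobs = 0;
    for (std::size_t l = 1; l < lut.size(); ++l) {
        const std::int32_t root = table.Find(static_cast<std::int32_t>(l));
        if (lut[root] == 0) lut[root] = ++blobs;
        lut[l] = lut[root];
    }
    // Pass 2: assign the final label to each pixel via the lookup table.
    for (auto& p : img) p = lut[p];
    return blobs;
}
```

The sketch shows why Pass1 dominates the total: it performs the neighbourhood search, whereas Pass2 is a single table lookup per pixel and the table itself is tiny compared with the image.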

Parallel version

• Rosenfeld and Pfaltz (1966): CCL cannot be implemented with parallel

local operations

• Hawick, Leist and Playne (2010): Label Equivalence best performance

• Kalentev, Rai, Kemnitz, and Schneider (2011): alternative Label

Equivalence approach

• Store equivalence table in image

• No atomic operations

• Claimed to be efficient in the number of iterations needed:

on average 5 iterations on their test set

• Algorithm

• Initial pass

• Multiple iterations

• Link pass (neighbourhood search)

• Label equalize pass (neighbourhood search)

• Final pass

Kalentev et al. approach

It is expected that

• Both passes of an iteration have a complexity similar to Pass1

• Initial and final pass have a complexity similar to Pass2

Analysis

• On average the Kalentev et al. approach needs 5 iterations

• One simple initial pass

• 10 neighbourhood search passes

• One simple final pass

• Extra post processing step with two simple passes

Estimation

• Sequential version 1 unit of execution time

• Kalentev et al. 8.2 units of (sequential) execution time
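The slides do not show how the figure of 8.2 was obtained. One plausible reading, given here only as an assumption, charges each neighbourhood search pass the cost of Pass1 and each simple pass the cost of Pass2, using the sequential timings measured for cells.jl. The function name `EstimateRelativeCost` is hypothetical.

```cpp
#include <cassert>

// Estimated cost of a Kalentev-style run, expressed in units of one complete
// sequential labelling. Assumption: a neighbourhood search pass costs as much
// as Pass1 and a simple pass as much as Pass2.
inline double EstimateRelativeCost(double pass1, double pass2, double total,
                                   int iterations) {
    assert(total > 0.0 && iterations >= 0);
    const int searchPasses = 2 * iterations;  // Link + LabelEqualize per iteration
    const int simplePasses = 1 + 1 + 2;       // initial, final, 2 post processing
    return (searchPasses * pass1 + simplePasses * pass2) / total;
}
```

With 5 iterations and the measured Pass1, Pass2 and Total values for the three image sizes, this gives roughly 8.5, 8.3 and 8.1, which is in the same range as the 8.2 units quoted above.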

Kalentev et al. approach

• Different approaches are needed for few-core CPUs and many-core GPUs

• GPU approach will suffer from branch divergence

Few-core approach on Core i7-2600 CPU @ 3.4 GHz (quad-core)

Host code framework suggested by Kalentev et al.

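WriteBuffer(image);                       // copy the binary image to the device
int notDone = 1;
RunKernel("InitLabels", image);           // give each object pixel a unique label
WriteBuffer(notDone);
while (notDone == 1) {
    notDone = 0;
    WriteBuffer(notDone);                 // reset the flag on the device
    RunKernel("Link", image, notDone);    // sets notDone when more work remains
    RunKernel("LabelEqualize", image);
    ReadBuffer(notDone);                  // fetch the flag for the loop test
} // while notDone
ReadBuffer(image);                        // copy the labelled image back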
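The control flow of this host loop can be tried out without a GPU by running the three kernels as ordinary functions. The sketch below is a sequential emulation under stated assumptions: the function names mirror the kernel names in the host code, but the bodies are one reading of the label equivalence idea (labels stored in the image itself, a label being the index of its root pixel plus one) and not the kernels of Kalentev et al. or of VisionLab. In particular, this `LabelEqualize` only follows label chains and does no neighbourhood search.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Labels = std::vector<std::int32_t>;  // row-major, 0 = background

// "InitLabels": every object pixel gets its own index + 1 as label.
inline void InitLabels(Labels& img) {
    for (std::size_t i = 0; i < img.size(); ++i)
        img[i] = (img[i] != 0) ? static_cast<std::int32_t>(i) + 1 : 0;
}

// "Link": if a pixel sees a smaller label among its 8 neighbours, record the
// equivalence at the root pixel of its own label. Returns the notDone flag.
inline bool Link(Labels& img, int w, int h) {
    bool notDone = false;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const std::int32_t label = img[y * w + x];
            if (label == 0) continue;
            std::int32_t smallest = label;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    const std::int32_t n = img[ny * w + nx];
                    if (n != 0 && n < smallest) smallest = n;
                }
            }
            if (smallest < label) {
                std::int32_t& root = img[label - 1];  // pixel owning this label
                if (smallest < root) root = smallest;
                notDone = true;
            }
        }
    }
    return notDone;
}

// "LabelEqualize": replace each label by the root of its equivalence chain.
inline void LabelEqualize(Labels& img) {
    for (auto& p : img) {
        if (p == 0) continue;
        std::int32_t label = p;
        while (img[label - 1] != label) label = img[label - 1];
        p = label;
    }
}

// Same loop structure as the host code, without the buffer transfers.
inline void LabelBlobsLabelEquivalence(Labels& img, int w, int h) {
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    InitLabels(img);
    bool notDone = true;
    while (notDone) {
        notDone = Link(img, w, h);
        LabelEqualize(img);
    }
}
```

The resulting labels are unique per blob but not successive: a blob is labelled with the index of its first pixel plus one.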

Suggestions for extending Kalentev et al. approach

• InitLabel kernel is extended to set the border pixels of the image to the

background value

• Link kernels are implemented for both four and eight connectivity

• Post processing step with two passes is added in order to make the

labelling of the blobs successive
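The last suggestion, making the labels successive in two passes, could look as follows when written sequentially. This is a sketch only: the name `MakeLabelsSuccessive` is hypothetical, and it assumes that no label exceeds the number of pixels, which holds when labels are derived from pixel indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Renumbers arbitrary positive blob labels to 1..N in order of first
// appearance; 0 stays background. Returns N.
inline std::int32_t MakeLabelsSuccessive(std::vector<std::int32_t>& labels) {
    std::vector<std::int32_t> lut(labels.size() + 1, 0);
    std::int32_t count = 0;
    // Pass 1: hand out the next successive number to each label not seen before.
    for (const std::int32_t l : labels) {
        assert(l >= 0 && static_cast<std::size_t>(l) < lut.size());
        if (l != 0 && lut[l] == 0) lut[l] = ++count;
    }
    // Pass 2: rewrite every pixel through the lookup table.
    for (auto& l : labels) l = lut[l];
    return count;
}
```

Pass 1 has a sequential dependency on the running counter, while pass 2 is an independent lookup per pixel.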

Suggestions for optimizing Kalentev et al. approach

• Each iteration has a Link pass and a LabelEqualize pass. For the last

iteration the LabelEqualize pass is redundant

• Many of the kernel execute, read buffer and write buffer commands

can be started asynchronously and synchronized using events

• The write to the “IsNotDone” buffer can be done in parallel to the

LabelEqualize pass

• Except for the second pass of the post processing step, all kernels can be

vectorized

• InitLabel kernel: straightforward

• Other kernels: a quick test whether all pixels in the vector are background

pixels

• Beneficial for processing background pixels

• Little extra overhead for object pixels
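The quick background test used for vectorization can be illustrated in plain C++. The sketch assumes a vector width of 4, mirroring an OpenCL `int4`, and the name `ProcessVectorized` is hypothetical; a real kernel would load the four values as one vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kVectorWidth = 4;  // assumption: mirrors an OpenCL int4

// Calls perPixel(index) for every object pixel, skipping groups of four
// pixels that are all background. Returns the number of skipped groups.
template <typename PerPixel>
std::size_t ProcessVectorized(const std::vector<std::int32_t>& labels,
                              PerPixel perPixel) {
    std::size_t skipped = 0;
    std::size_t i = 0;
    for (; i + kVectorWidth <= labels.size(); i += kVectorWidth) {
        // One OR over the group replaces four separate background tests.
        const std::int32_t any =
            labels[i] | labels[i + 1] | labels[i + 2] | labels[i + 3];
        if (any == 0) {
            ++skipped;
            continue;
        }
        for (std::size_t k = 0; k < kVectorWidth; ++k) {
            if (labels[i + k] != 0) perPixel(i + k);
        }
    }
    for (; i < labels.size(); ++i) {  // tail that does not fill a whole group
        if (labels[i] != 0) perPixel(i);
    }
    return skipped;
}
```

Groups that contain an object pixel pay for one extra OR and comparison, which is the small overhead mentioned above.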

Core i7-2600 with GTX 560 Ti (OEM)



Summary and conclusions

Connected component labelling

• Different approaches are needed for few-core and many-core platforms

• Few-core approach: reasonable speedups on CPUs

• Many-core approach: reasonable speedups on GPUs

• Suggestions for extending Kalentev et al. approach

• Suggestions for optimizing Kalentev et al. approach

Future work on Connected Component Labelling

• Parallelize few-core label repair step

• Implement and benchmark an OpenCL implementation of the few-core

approach

• Research into finding the break-even point between the few-core and

many-core approaches

• Implement and benchmark approach suggested by Stava and

Benes (2011), only H/W ^2

References

• Van de Loosdrecht, J., 2013. Accelerating sequential computer vision algorithms using commodity parallel hardware. Research master project at Limerick Institute of Technology. Expected to be published in autumn 2013 at www.vdlmv.nl/thesis.

• Haralick, R.M. and Shapiro, L.G., 1992. Computer and Robot Vision. Volume I and Volume II. Reading: Addison-Wesley Publishing Company.

• He, L., Chao, Y. and Suzuki, K., 2008. A Run-Based Two-Scan Labeling Algorithm. IEEE Transactions on Image Processing, 17(5), pp.749-56.

• Rosenfeld, A. and Pfaltz, J.L., 1966. Sequential Operations in Digital Picture Processing. Journal of the ACM, 13(4), pp.471-94.

• Hawick, K.A., Leist, A. and Playne, D.P., 2010. Parallel graph component labeling with GPUs and CUDA. Parallel Computing, 36(12), pp.655-78.

• Kalentev, O., Rai, A., Kemnitz, S. and Schneider, S., 2011. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4), pp.615-20.

• Stava, O. and Benes, B., 2011. Connected Component Labeling in CUDA. In: Hwu, W.W. ed. 2011. GPU Computing Gems, Emerald Edition. Burlington: Morgan Kaufmann. Ch.35.

Future: Intelligent camera with heterogeneous computing

XIMEA Currera G

• AMD T-56N

• Dual-core x64 1.6 GHz

• 80 core GPU 500 MHz

• 2 GB DDR3

• 32 GB SSD

• 4 USB-3, 1 USB-2

• HDMI

• PoE Gigabit ethernet

• Micro PLC

• 8 digital I/Os

• Many image sensors

<= 5M pixel

Prototype XIMEA Currera G


Questions ?

Jaap van de Loosdrecht

NHL Centre of Expertise in Computer Vision

[email protected]

www.nhl.nl/computervision

Van de Loosdrecht Machine Vision BV

[email protected]

www.vdlmv.nl
