Connected Component Labelling,
an embarrassingly sequential algorithm
Platform Parallel Netherlands
GPGPU-day, 20 June 2013
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
Van de Loosdrecht Machine Vision BV
Limerick Institute of Technology
Overview
• Introduction and background
• Connected Component Labelling
• Sequential
• Few-core
• Many-core
• Kalentev et al. approach
• Suggestions for extending
• Suggestions for optimizing
• Summary and conclusions
• Future work on CCL
• References
• Future of intelligent cameras
• Questions
Introduction
• Manager NHL Centre of Expertise in Computer Vision
• University of Applied Sciences, Leeuwarden
• 4 FTE
• Since 1996: 180 industrial projects
• Managing director Van de Loosdrecht Machine Vision BV
• VisionLab: development environment for computer vision with
pattern matching, neural networks and genetic algorithms
• Portable library (ANSI C++)
Windows, Linux and Android
x86, x64, ARM and PowerPC
• Student Limerick Institute of Technology (Ireland)
• Research master project, 1 September 2011 – 1 September 2013
Research master project
“Accelerating sequential computer vision algorithms using
commodity parallel hardware”
Apply parallel programming techniques to meet the challenges posed
in computer vision by the limits of sequential architectures
Distinctive: investigate how to speed up a whole library by
parallelizing the algorithms in an economical way and executing them
on multiple platforms
• Generic library, 100,000 lines of ANSI C++
• Portability and vendor independence
• OpenMP for CPU, OpenCL for GPU
• Variance in execution times
• Run-time prediction of whether parallelization is beneficial
Computer vision algorithms and parallelization
Classification of image operators
• Low level image operators
• Point operators
• Local neighbour operators
• Global operators
• Connectivity based operators
• High level image operators
• Often built on the low level operators
• “Specials”
• Pattern matcher, neural network, genetic algorithm, etc.
Idea: start with the low level image operators, and design and implement
skeletons for parallelizing representatives of each class
Demonstration Label Blobs
• Open image cells.jl
• Show image contents
• ThresholdIsoData
• Show image contents
• Explain background/objects, white/black and 0/1
• LabelBlobs, show image contents
• Explain the 3 colours used
• BlobAnalyse
• Explain table
• Explain successive label numbering
Screen shot demo
Label blobs iterative algorithm
• Give each object pixel a unique positive value
Classical sequential approach: Haralick and Shapiro (1992)
• Binary image:
[Figure: binary image with object pixels set to 1, and the same image with a unique positive label for each object pixel]
Label blobs iterative algorithm
• Repeat until no changes
• Down pass (top left to bottom right):
give each object pixel the minimum label of itself and its 8 neighbours
• Up pass (bottom right to top left):
give each object pixel the minimum label of itself and its 8 neighbours
(background pixels are ignored)
[Figure: label image during the iterations (labels 1 and 3) and after convergence (all object pixels labelled 1)]
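The iterative algorithm above can be sketched in C++ as follows. This is a minimal illustration, not the VisionLab implementation; the function name labelIterative and the row-major int image with 0 as background are assumptions.

```cpp
#include <cassert>
#include <vector>

// Sketch of the classical iterative labelling (after Haralick and Shapiro):
// unique initial labels, then alternating down and up raster passes that
// propagate the minimum label over 8-connected object pixels until stable.
// Assumption: row-major image, 0 = background, non-zero = object.
std::vector<int> labelIterative(const std::vector<int>& bin, int w, int h) {
    assert(static_cast<int>(bin.size()) == w * h);
    std::vector<int> lab(bin.size(), 0);
    int next = 1;
    for (int i = 0; i < w * h; ++i)
        if (bin[i] != 0) lab[i] = next++;  // unique positive value per object pixel

    // Give object pixel (x, y) the minimum label of itself and its object
    // 8-neighbours; returns true if the label changed.
    auto relax = [&](int x, int y) {
        int& l = lab[y * w + x];
        if (l == 0) return false;           // background stays 0
        int m = l;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int nx = x + dx, ny = y + dy;
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                int n = lab[ny * w + nx];
                if (n != 0 && n < m) m = n;
            }
        if (m == l) return false;
        l = m;
        return true;
    };

    bool changed = true;
    while (changed) {                       // repeat until no changes
        changed = false;
        for (int y = 0; y < h; ++y)         // down pass: top left to bottom right
            for (int x = 0; x < w; ++x)
                if (relax(x, y)) changed = true;
        for (int y = h - 1; y >= 0; --y)    // up pass: bottom right to top left
            for (int x = w - 1; x >= 0; --x)
                if (relax(x, y)) changed = true;
    }
    return lab;
}
```

Each blob ends up with the smallest of its initial labels, so the result is unique per blob but not successive (for example 1 and 3).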
Sequential version
He, Chao and Suzuki (2008): the two-pass approach gives the best performance
• Pass1: equivalent labels are stored in an equivalence table
(neighbourhood search)
• Resolving equivalences with a search algorithm
• Pass2: assign the final label to each pixel (lookup table)
• Analysis of execution time (VisionLab) in µs on Core i7-2640M
for the ‘typical’ image cells.jl

Image size   Pass1 (µs)   Resolving equivalences (µs)   Pass2 (µs)   Total (µs)   Pass1/Total
256x256      134          1                             43           178          0.75
512x512      405          2                             159          566          0.71
1024x1024    1358         3                             629          1990         0.68
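A minimal two-pass sketch in C++ illustrates the split in the table: Pass1 does the neighbourhood search and the equivalence bookkeeping, while Pass2 is a plain lookup. Note that this is a simple pixel-based variant with a union-find equivalence table, not the run-based algorithm of He, Chao and Suzuki; the names labelTwoPass, findRoot and unite are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Representative (smallest provisional label) of an equivalence class.
static int findRoot(std::vector<int>& parent, int l) {
    while (parent[l] != l) {
        parent[l] = parent[parent[l]];   // path halving
        l = parent[l];
    }
    return l;
}

// Record that labels a and b are equivalent; the smaller root wins.
static void unite(std::vector<int>& parent, int a, int b) {
    a = findRoot(parent, a);
    b = findRoot(parent, b);
    if (a < b) parent[b] = a;
    else if (b < a) parent[a] = b;
}

// Assumption: row-major image, 0 = background, 8-connectivity.
std::vector<int> labelTwoPass(const std::vector<int>& bin, int w, int h) {
    assert(static_cast<int>(bin.size()) == w * h);
    std::vector<int> lab(bin.size(), 0);
    std::vector<int> parent(1, 0);       // equivalence table, entry 0 = background
    // Pass1: provisional labels from already visited neighbours W, NW, N, NE.
    const int dx[4] = {-1, -1, 0, 1}, dy[4] = {0, -1, -1, -1};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if (bin[y * w + x] == 0) continue;
            int cur = 0;
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= w) continue;
                int n = lab[ny * w + nx];
                if (n == 0) continue;
                if (cur == 0) cur = n;
                else unite(parent, cur, n);
            }
            if (cur == 0) {              // no labelled neighbour: new label
                cur = static_cast<int>(parent.size());
                parent.push_back(cur);
            }
            lab[y * w + x] = cur;
        }
    // Resolving equivalences: successive final label per class.
    std::vector<int> finalLabel(parent.size(), 0);
    int next = 0;
    for (int l = 1; l < static_cast<int>(parent.size()); ++l) {
        int r = findRoot(parent, l);
        finalLabel[l] = (r == l) ? ++next : finalLabel[r];  // r < l already numbered
    }
    // Pass2: lookup table.
    for (int& v : lab) v = finalLabel[v];
    return lab;
}
```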
Parallel version
• Rosenfeld and Pfaltz (1966): CCL cannot be implemented with parallel
local operations
• Hawick, Leist and Playne (2010): Label Equivalence gives the best performance
• Kalentev, Rai, Kemnitz, and Schneider (2011): alternative Label
Equivalence approach
• Store equivalence table in image
• No atomic operations
• Claimed to be efficient in terms of the number of iterations needed:
on average 5 iterations on their test set
• Algorithm
• Initial pass
• Multiple iterations
• Link pass (neighbourhood search)
• Label equalize pass (neighbourhood search)
• Final pass
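The passes listed above can be simulated sequentially in C++. The sketch keeps the key idea of storing the equivalence table in the image itself, so no atomic operations are needed; it assumes that a label is the linear index of a pixel of the same blob, forces border pixels to background and supports 4- and 8-connectivity. The kernel bodies and the names initLabels, linkPass, labelEqualize and labelKalentev are illustrative assumptions, not the published CUDA kernels.

```cpp
#include <cassert>
#include <vector>

// Neighbour offsets (dx, dy); the first four entries give 4-connectivity.
static const int kNbr[8][2] = {{0, -1}, {-1, 0}, {1, 0}, {0, 1},
                               {-1, -1}, {1, -1}, {-1, 1}, {1, 1}};

// Initial pass: border pixels become background, every other object pixel
// gets its own linear index as label (always > 0 inside the border).
void initLabels(std::vector<int>& img, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int i = y * w + x;
            bool border = x == 0 || y == 0 || x == w - 1 || y == h - 1;
            img[i] = (border || img[i] == 0) ? 0 : i;
        }
}

// Link pass: if a neighbour carries a smaller label, lower the equivalence
// entry img[label]; the table lives in the label image itself.
bool linkPass(std::vector<int>& img, int w, int h, bool eight) {
    bool notDone = false;
    int count = eight ? 8 : 4;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int label = img[y * w + x];
            if (label == 0) continue;
            int minLabel = label;
            for (int k = 0; k < count; ++k) {
                int n = img[(y + kNbr[k][1]) * w + (x + kNbr[k][0])];
                if (n != 0 && n < minLabel) minLabel = n;
            }
            if (minLabel < label) {
                if (minLabel < img[label]) img[label] = minLabel;
                notDone = true;
            }
        }
    return notDone;
}

// LabelEqualize pass: replace every label by the root of its chain
// (labels strictly decrease along a chain, so the loop terminates).
void labelEqualize(std::vector<int>& img) {
    for (int& l : img) {
        if (l == 0) continue;
        int r = l;
        while (img[r] != r) r = img[r];
        l = r;
    }
}

// Host loop: iterate Link and LabelEqualize until Link changes nothing.
void labelKalentev(std::vector<int>& img, int w, int h, bool eight) {
    assert(w >= 3 && h >= 3 && static_cast<int>(img.size()) == w * h);
    initLabels(img, w, h);
    bool notDone = true;
    while (notDone) {
        notDone = linkPass(img, w, h, eight);
        if (notDone) labelEqualize(img);  // final LabelEqualize would be redundant
    }
}
```

With this scheme every blob ends up labelled with the index of its first pixel in raster order, so the labels are unique but not successive.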
Kalentev et al. approach
It is expected that
• Both passes of an iteration have a complexity similar to Pass1
• The initial and final passes have a complexity similar to Pass2
Analysis
• On average the Kalentev et al. approach needs 5 iterations
• One simple initial pass
• 10 neighbourhood search passes
• One simple final pass
• Extra post-processing step with two simple passes
Estimation
• Sequential version 1 unit of execution time
• Kalentev et al. 8.2 units of (sequential) execution time
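One way to arrive at the 8.2 units, assuming (from the table of the sequential version) that Pass1 takes roughly 0.7 and Pass2 roughly 0.3 of the sequential time, that each neighbourhood search pass costs about as much as Pass1 and each simple pass (initial, final and the two post-processing passes) about as much as Pass2, and ignoring the resolving step:

```latex
T_{\mathrm{Kalentev}} \approx 10\,T_{\mathrm{Pass1}} + 4\,T_{\mathrm{Pass2}}
  \approx 10 \times 0.7 + 4 \times 0.3 = 7 + 1.2 = 8.2
```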
Kalentev et al. approach
• Different approaches are needed for the few-core CPU and the
many-core GPU
• The GPU approach will suffer from branch divergence
Few-core approach on Core i7-2600 CPU @ 3.4 GHz (quad-core)
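Host code framework suggested by Kalentev et al.
WriteBuffer(image);                   // copy the binary image to the device
int notDone = 1;
RunKernel("InitLabels", image);
WriteBuffer(notDone);
while (notDone == 1) {
    notDone = 0;
    WriteBuffer(notDone);             // reset the flag on the device
    RunKernel("Link", image, notDone);  // Link sets notDone to 1 while labels are still merged
    RunKernel("LabelEqualize", image);
    ReadBuffer(notDone);              // fetch the flag from the device
} // while notDone
ReadBuffer(image);                    // copy the label image back to the host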
Suggestions for extending Kalentev et al. approach
• InitLabel kernel is extended to set the border pixels of the image to the
background value
• Link kernels are implemented for both four and eight connectivity
• A post-processing step with two passes is added to make the
labelling of the blobs successive
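The two-pass post-processing step for successive labels can be sketched in C++. The sketch assumes, hypothetically, that after the iterations every object pixel holds the linear index of its blob's root pixel and that the root holds its own index; makeSuccessive is an invented name. Pass 1 builds a lookup table that numbers the roots in raster order, pass 2 relabels every pixel through that table.

```cpp
#include <cassert>
#include <vector>

// Hypothetical post-processing: turn root-index labels into successive
// labels 1, 2, 3, ... in raster order of the roots. 0 = background.
// Returns the number of blobs.
int makeSuccessive(std::vector<int>& img) {
    std::vector<int> newLabel(img.size(), 0);
    int count = 0;
    // Pass 1: lookup table; a root is an object pixel labelled with its own index.
    for (int i = 0; i < static_cast<int>(img.size()); ++i)
        if (img[i] != 0 && img[i] == i) newLabel[i] = ++count;
    // Pass 2: relabel every object pixel through the lookup table.
    for (int& l : img) {
        if (l == 0) continue;
        assert(newLabel[l] != 0);  // every label must refer to a root
        l = newLabel[l];
    }
    return count;
}
```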
Suggestions for optimizing Kalentev et al. approach
• Each iteration has a Link pass and a LabelEqualize pass. For the last
iteration the LabelEqualize pass is redundant
• Many of the kernel-execute, read-buffer and write-buffer commands
can be started asynchronously and synchronized using events
• The write to the “IsNotDone” buffer can be done in parallel with the
LabelEqualize pass
• Except for the second pass of the post-processing step, all kernels can be
vectorized
• InitLabel kernel: straightforward
• Other kernels: a quick test whether all pixels in the vector are background
pixels
• Beneficial for processing background pixels
• Little extra overhead for object pixels
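The quick background test for vectorized kernels can be illustrated with a scalar C++ emulation of a 4-wide vector (in OpenCL this would be an int4), here applied to LabelEqualize. The function name labelEqualizeVec4, the vector width of 4 and the representation of labels as chains of pixel indices ending in a root that holds its own index are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of a vectorized LabelEqualize kernel: each step handles
// a vector of 4 pixels and first tests whether all four are background.
// Assumption: labels are non-negative, image size is a multiple of 4.
void labelEqualizeVec4(std::vector<int>& img) {
    assert(img.size() % 4 == 0);
    for (std::size_t v = 0; v < img.size(); v += 4) {
        // Quick test: a vector of only background pixels costs one check.
        if ((img[v] | img[v + 1] | img[v + 2] | img[v + 3]) == 0) continue;
        for (std::size_t k = v; k < v + 4; ++k) {  // per-pixel work
            int l = img[k];
            if (l == 0) continue;
            while (img[l] != l) l = img[l];        // follow chain to the root
            img[k] = l;
        }
    }
}
```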
Core i7-2600 with GTX 560 Ti (OEM)
Summary and conclusions
Connected component labelling
• Different approaches are needed for few-core and many-core processors
• Few-core approach: reasonable speedups on CPUs
• Many-core approach: reasonable speedups on GPUs
• Suggestions for extending Kalentev et al. approach
• Suggestions for optimizing Kalentev et al. approach
Future work on Connected Component Labelling
• Parallelize few-core label repair step
• Implement and benchmark an OpenCL implementation of the few-core
approach
• Research the break-even point between the few-core and many-core
approaches
• Implement and benchmark the approach suggested by Stava and
Benes (2011), only H/W ^2
References
• Van de Loosdrecht, J., 2013. Accelerating sequential computer vision algorithms using commodity parallel hardware. Research master project at Limerick Institute of Technology. Expected to be published in autumn 2013 at www.vdlmv.nl/thesis.
• Haralick, R.M. and Shapiro, L.G., 1992. Computer and Robot Vision. Volume I and Volume II. Reading: Addison-Wesley Publishing Company.
• He, L., Chao, Y. and Suzuki, K., 2008. A Run-Based Two-Scan Labeling Algorithm. IEEE Transactions on Image Processing, 17(5), pp.749-56.
• Rosenfeld, A. and Pfaltz, J.L., 1966. Sequential Operations in Digital Picture Processing. Journal of the ACM, 13(4), pp.471-94.
• Hawick, K.A., Leist, A. and Playne, D.P., 2010. Parallel graph component labeling with GPUs and CUDA. Parallel Computing, 36(12), pp.655-78.
• Kalentev, O., Rai, A., Kemnitz, S. and Schneider, S., 2011. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4), pp.615-20.
• Stava, O. and Benes, B., 2011. Connected Component Labeling in CUDA. In: Hwu, W.W. ed. 2011. GPU Computing Gems, Emerald edition. Burlington: Morgan Kaufmann. Ch.35.
Future: Intelligent camera with heterogeneous computing
XIMEA Currera G
• AMD T-56N
• Dual-core x64 1.6 GHz
• 80-core GPU 500 MHz
• 2 GB DDR3
• 32 GB SSD
• 4 USB-3, 1 USB-2
• HDMI
• PoE Gigabit Ethernet
• Micro PLC
• 8 digital I/Os
• Many image sensors
up to 5 megapixels
Prototype XIMEA Currera G
Questions ?
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
www.nhl.nl/computervision
Van de Loosdrecht Machine Vision BV
www.vdlmv.nl