
Fast, Processor-Cardinality Agnostic PRNG with a Tracking Application

Andrew Janowczyk & Sharat Chandran
ViGIL, Department of Computer Science and Engineering

Indian Institute of Technology Bombay
{andrew,sharat}@iitb.ac.in

Srinivas Aluru
Iowa State University

[email protected]

Abstract

As vision algorithms mature with increasing inspiration from the learning community, statistically independent pseudo random number generation (PRNG) becomes increasingly important. At the same time, execution time demands have seen algorithms being implemented on evolving parallel hardware such as GPUs. The Mersenne Twister (MT) [7] has proven to be the current state of the art for generating high quality random numbers, and the Nvidia-provided software for parallel MT is in widespread use.

While execution time is important, development time is also critical. As processor cardinality changes, a foundation for generating simulations that vary only in execution time, and not in the actual result, is useful; otherwise development time is impacted.

In this paper, we present an implementation on the GPU of the Lagged Fibonacci Generator (LFG), considered to be of quality equal [7] to MT. Unlike MT, LFG has the important processor-cardinality agnostic capability: as the number of processing resources changes, the overall sequence of random numbers remains the same. This feature notwithstanding, our basic implementation is roughly as fast as the parallel MT; an in-memory version is actually 25% faster in execution time. Both parallel MT and parallel LFG show enormous speedup over their sequential counterparts. Finally, a prototype particle filter tracking application shows that our method works not just in parallel computing theory, but also in practice for vision applications, providing a decrease of 60% in execution time.

1. Introduction

Random numbers are used in scientific applications to model noise, as well as to enable algorithms to reach their optimal solution efficiently. Conventional random numbers are uniformly distributed, as they can later be adjusted based on importance sampling principles.

One way to acquire such numbers is to measure random physical quantities using modern sensors. However, it is easily recognized that these physical methods are not very suitable for computer simulation, since they are hard to control and non-repeatable. In sequential computing, therefore, there is a rich theory of high quality Pseudo-Random Number Generators (PRNG) [7, 4]. The current state of the art PRNG is the Mersenne Twister [7], where tests such as the DIEHARD suite [11, 3, 6] show statistical independence.

1.1. Parallel Requirements

As vision methods get implemented on parallel processors (in our case, the GPU), three requirements other than statistical independence and repeatability emerge:

1. Linear Speedup. The time to generate a random number on each processor should be constant, and preferably the same as that of the sequential processor.

2. (Parallel) Execution Time. Often it is not possible to get linear speedup, for two reasons. First, communication between processing elements is necessary; second, the algorithm tends to have some intrinsically sequential portions which limit the speedup (e.g., see Amdahl's Law). Also significant, and less frequently mentioned, is the end-to-end time involved in I/O (e.g., loading images) and other operations. Thus, at the end of the day, the effective clock time is also important. An overall speed improvement of 60% with respect to the best sequential time is considered useful by parallel processing practitioners.

3. Processor-Cardinality Agnosticism. For a uniprocessor PRNG, providing the same initial value, called a seed, should produce the same output sequence. We now take this requirement one step further by asking that the same output sequence be presented regardless of the processor configuration in a multiprocessor system. This property is termed processor-cardinality agnosticism (PCA) and has not received the attention it deserves.

PCA is incredibly important when running simulations that compare two methods for parallel running time. Especially in development, as processing elements change, the only variable should be the methods and not the stream of random numbers provided to them. That is, while developing a simulation on a single-CPU machine that uses these random numbers, the simulation will be exactly the same if the number of processors is increased or decreased; the only noticeable change should be the overall execution time. PCA is implicit in most deterministic parallel algorithms. In the sorting problem, since comparisons are transitive, adding more processors does not change the end result: the precise order of comparison is irrelevant for correctness. However, this is not the case in general practice for generating random numbers. (In theory we could also rewrite the sequential algorithm to emulate the parallel algorithm; parallel algorithms, however, usually come much later than the sequential algorithm.)

Our main focus in this paper is PCA for random numbers, which is not provided by existing methods in the community. This indifference to the number of processors used is provided in our implementation without sacrificing the other requirements (speedup and parallel time). Once random numbers are generated, they can be used directly in many computer vision algorithms; the one discussed in this paper as a proof of concept is particle filtering.

1.2. Previous Work

A survey of work on PRNGs in the context of GPUs appears in [11]. Please refer to that paper, as it shows the shortcomings of existing work on GPUs, such as [10].

An important GPU-based PRNG that [11] does not consider in their comparison is the parallel version of MT. The primary reason for this is their belief that such a generator is intrinsically sequential (to quote, "Although there are efforts to parallelize such sequential random number generators on GPUs such an approach remains fundamentally difficult"). In stark contrast, the important result this paper presents is an explicit parallel implementation of the LFG on the GPU with speed comparable to MT. Indeed, MT is the most popular implementation on GPUs, and therefore our attention is aimed at parallel MT, which generates random numbers at approximately 1.8 billion random numbers per second (brps), incidentally vastly outperforming the method presented in [11].

Parallel MT has the following two significant drawbacks.

1. Preprocessing and Setup Time: [7] states that in order to use this algorithm in parallel, multiple instances of the algorithm should be run with different seeds. To ensure that these instances do not produce lower quality random numbers when their outputs are merged, a significant amount [8] of setup time is needed to compute acceptable initial startup values. This setup time is most often hidden by providing a file (generated offline) for specific processor configurations; the file contains initial information for the associated number of processors (or blocks). A severe drawback is that it takes about 5 seconds to generate the initial information for one of these processors if it is not already present. When no file is present and a 128-processor configuration is desired, our experiments indicate that it takes over 10 minutes to generate the initialization information.

2. Processor-Cardinality Agnosticism: Fundamentally, MT does not satisfy the PCA property. The initial values are directly based on the number of processors that will perform the computations. If requirements are modified (number size in bits, number of processors, and so on), the entire initial value set must be recomputed, resulting in a new output sequence. That is, the processors in parallel MT do not work together on producing a single sequence of numbers; in actuality, each runs a separate instance of the algorithm with different initial values, and all of the outputs are then combined to form one sequence.

1.3. Contributions

In this paper the Lagged Fibonacci PRNG is implemented on the GPU using CUDA. This generator was previously ignored due to its apparent sequential nature (see Equation 1). We overcome this; specifically:

We simultaneously provide all three ideal properties (§1.1): speed (1.26 brps) comparable to parallel MT (1.8 brps), linear speedup, and PCA. See the results section for details on when our algorithm is comparable to parallel MT, and when we outperform it.

We show that this method works in practice with a particle filter that utilizes GPU technology. A significant reduction in development time was witnessed, as there was a near-seamless transition between the uniprocessor version and the GPU version of the particle filter. Although performance gains (as opposed to development time, which will vary by developer) of over 30% were detected, the emphasis in this paper is not on improving the state of the art in tracking but on providing a useful parallel PRNG.

2. Leap Frog Generator

The complete theory of the LFG appears in [1]. This section provides a shortened version of the main concepts; the reader should feel free to skip to the end of this section (Equation 4) to get to the bottom line of the theory.

LFGs are created using the following recurrence formula:

    x_k = (x_{k-p} \oplus x_{k-p+q}) \bmod m    (1)


where ⊕ represents exclusive-or (xor) and m is the maximum sample value we wish to compute. p and q are the parameters which control how x_k is generated: q is referred to as the lag, while p is the number of previous values that must be kept in order to generate the next p − q numbers. The period of this generator is 2^p − 1. Indexing of k begins from zero.
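For concreteness, here is a minimal sequential sketch of recurrence (1) in C-style host code, assuming m = 2^32 so that 32-bit unsigned arithmetic performs the modulus implicitly (this choice of m is justified below); the type and helper names are ours, not the authors'.

#include <stdint.h>
#include <stddef.h>

/* Minimal sequential LFG per recurrence (1). A ring buffer of the last
   p outputs suffices: x_{k-p} sits at idx, and x_{k-p+q} sits q slots
   ahead of it. */
enum { LFG_P = 607, LFG_Q = 273 };       /* one {p,q} pair used later */

typedef struct {
    uint32_t state[LFG_P];               /* last p values, seeded elsewhere */
    size_t   idx;                        /* position of x_{k-p} */
} lfg_t;

static uint32_t lfg_next(lfg_t *g) {
    uint32_t x = g->state[g->idx] ^ g->state[(g->idx + LFG_Q) % LFG_P];
    g->state[g->idx] = x;                /* overwrite the oldest value */
    g->idx = (g->idx + 1) % LFG_P;
    return x;                            /* mod 2^32 is implicit */
}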

Equation 1 is not in the proper form to convert a sequential algorithm into a PCA parallel algorithm. To perform this conversion, we expand the recursion a few steps and look for a direct equation for the kth number in the sequence.

Figure 1. Recursive expansion of the core LFG equation.

From Figure 1 it can be seen that the coefficients of each term at a level coincide with the associated level of Pascal's triangle. As a result, this pattern can be used to compress the nth level of the recursion of (1) down to the following new equation:

    x_k = \bigoplus_{i=0}^{n} \bigoplus_{j=1}^{\binom{n}{i}} x_{k-np+iq} \bmod m    (2)

where \binom{n}{i} is the binomial coefficient and n is the row in Pascal's triangle.

The next step in simplifying the equation is to select m such that the exclusive-or operation preceding it can never yield a number greater than m, allowing us to ignore the modulus completely. If m is selected to be equal to 2^l, where l is the number of bits in each piece of output, the modulus operation can be dropped. As a trivial proof, consider a < 2^l and b < 2^l: the bitwise operation a ⊕ b can never yield a number equal to or greater than 2^l, implying that the modulus operation has effectively already been performed. The only numbers that then need this requirement explicitly enforced are the initial values, and the registers where we store them automatically enforce that they are less than 2^l, simply by not being able to hold any additional bits.

Using properties specific to the xor operation, the equation can be reduced further to its usable parallel equivalent. The first property exploited is that xor-ing any binary vector a with itself repeatedly leaves either a or 0: an even number of occurrences of a cancels to 0, while an odd number leaves a.

This goes to show that if the value of n is carefully chosen so that most of the coefficients are even, the "internal" xor operation

    \bigoplus_{j=1}^{\binom{n}{i}} x_{k-np+iq}    (3)

will often evaluate to zero, not only directly reducing the number of terms in the computation of x_k, but also reducing the quantity of numbers that must be stored on each processor.

Figure 2. Pascal's triangle: odd coefficient terms highlighted in black.

If n is chosen to be a power of two, the optimal method is discovered. An analysis of Pascal's triangle in Figure 2 should be enough to convince the reader of the reasoning: when n is a power of two, there are only two odd elements in the nth level of Pascal's triangle, namely the first and the last. Using (3), all even elements can be eliminated, as they simply evaluate to zero. This leaves i = 0 and i = n as the only two terms computed, because \binom{n}{0} = \binom{n}{n} = 1, which again are the only two odd terms in the nth row.

As previously mentioned, each processor should be able to generate the next item in the sequence based solely on the numbers already generated and stored on that same processor. Applying this restriction to equation (2), −np + iq has to be a multiple of N, where N is the number of processors available, ensuring that the required values reside on the same processor. As previously shown, i = 0 and i = n are the only two terms computed when n is a power of two, so selecting n to also equal N gives the two computed terms the offsets −Np and −Np + Nq, which are both multiples of N. As a result the entire formula becomes

    x_k = x_{k-Np} \oplus x_{k-Np+Nq}    (4)

In conclusion, assuming that each processor has previously generated p random numbers, the above formula enables each processor to independently produce numbers in PCA fashion, and at the same speed as a single processor. The sole requirement of the algorithm is that the number of processors must be a power of 2. In normal cases this could be quite a handicap, but as the discussion in the implementation section will show, the GPU allows this restriction to be relaxed.

2.1. GPU Implementation

Clearly an initialization stage must occur for each processor. From Equation (4), this involves finding seeds for p random numbers and then using the LFG algorithm to generate a total of N * p numbers in the sequence, where N is the number of processors. These numbers are then fed directly into the individual processors, and parallel generation can begin. While this may appear to be a small issue, these numbers can be saved to a file and used for any processor configuration, as long as the new number of processors is less than or equal to the N in N * p. These data files are comparable in size to the ones used by the Mersenne Twister to initialize itself, so they are not viewed as a significant handicap. Alternatively, instead of storing the numbers in a file, these N * p numbers can be generated in parallel, achieving true processor-cardinality agnosticism.

In this implementation, the first p seeds are generated using a standard sequential generator on the CPU. The next N * (p − 1) seeds can be generated on the GPU in sets of p − q (the maximum parallelization, or step size). In situations where a very large number of processors, or a very large p, is desired (see below for a discussion of p), it becomes beneficial to take advantage of this. In our specific case, since N * p is quite small, the generation was implemented on the CPU, because the speed difference between the parallelized and uniprocessor versions was negligible. The next important aspect of implementing the LFG algorithm is the choice of both p and q. Thankfully, a large amount of research has already gone into determining the quality of their randomness, and it need not be re-evaluated: [1, 5] state that for the values of p used in this work, the LFG exhibits satisfactory randomness, passing all statistical tests. The values used, listed as {p, q} pairs, are {607, 273}, {1279, 418}, and {1279, 861}. The main reason these three pairs were considered is their different lag sizes.
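A host-side sketch of this initialization, again assuming m = 2^32 and with hypothetical names (the paper does not show its seeding code): seed the first p values with an ordinary CPU generator, then run recurrence (1) forward so that value i of the combined sequence later belongs to processor i mod N.

#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Fill table[0 .. N*p-1] with startup values for N processors.
   Illustrative only: any reasonable CPU seeding of the first p values
   works; what matters for PCA is that the expansion follows (1). */
void lfg_init_table(uint32_t *table, size_t N, size_t p, size_t q,
                    unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < p; ++i)           /* first p seeds on the CPU */
        table[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    for (size_t k = p; k < N * p; ++k)       /* expand via recurrence (1) */
        table[k] = table[k - p] ^ table[k - p + q];
}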

Using the CUDA toolkit, a C++ derivative for proprietary Nvidia cards, it is possible to select not only the number of blocks but also the number of threads run in each block. In the most simplistic terms, a block can be thought of as a single processor: it cannot share any information with the other blocks (processors) unless it uses global memory (which is notoriously slow). CUDA provides a level of abstraction and expandability, as it allows a block count significantly greater than the number of physical processors on the card; a scheduler pushes the next block to the next available processor. This means that if the same compiled code is run on a card with greater processing power, the scheduler automatically takes advantage of it and runs more blocks at a time. In the case of the LFG, the number of blocks chosen by the developer must meet the LFG's processor requirement; specifically, it must be a power of two.

It is also possible to select the number of threads run in each block, with the maximum being 512. The threads are executed in parallel as a group on each processor based on the warp size (32 threads per warp); one instruction is issued across a warp of threads, which all perform the same operation on different data. In the case of the LFG, the maximum number of threads that can be used is the total step size, p − q. It is impossible to have p − q + 1 threads, because the (p − q + 1)th thread would need to access information that has not been generated yet (in this case x_{p+1}).

To generate the next p − q numbers, the previous p numbers must be stored on the processor. The naive way to accomplish this uses two buffers: one buffer of size p generates the second buffer, the last q values are copied over, and the buffers are swapped, for a total overhead of 2p per processor. Using a bit of modular arithmetic, the memory requirement can be reduced to exactly 2p − q by using a single buffer in which the read and write positions wrap around. This not only reduces the memory overhead but also provides a small efficiency increase, as the sketch below illustrates.
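Putting the pieces together, here is a sketch of the generation kernel written from the paper's description (not the authors' released code): one block per LFG "processor", p − q threads per block, and a single wrap-around buffer of 2p − q words in shared memory. The output layout and all names are our assumptions.

#include <stdint.h>

#define P 607
#define Q 273
#define STEP (P - Q)           /* 334: values produced per block per step */
#define BUF  (2 * P - Q)       /* single wrap-around buffer, as described */

/* Launch with gridDim.x = N (a power of two) and blockDim.x = STEP.
   init holds the N*P startup values in global-sequence order; block b
   owns values b, b+N, b+2N, ... of the combined stream. */
__global__ void lfg_kernel(const uint32_t *init, uint32_t *out, int steps) {
    __shared__ uint32_t buf[BUF];
    const int t = threadIdx.x;
    const int N = gridDim.x;

    for (int i = t; i < P; i += STEP)        /* load this block's history */
        buf[i] = init[(size_t)i * N + blockIdx.x];
    __syncthreads();

    int base = 0;                            /* index of the oldest value */
    for (int s = 0; s < steps; ++s) {
        /* y[p+t] = y[t] ^ y[t+q]: every read is from old history, so all
           STEP threads fire independently (recurrence (4) per processor). */
        uint32_t x = buf[(base + t) % BUF] ^ buf[(base + t + Q) % BUF];
        buf[(base + P + t) % BUF] = x;
        /* interleaved store reproduces the single PCA global sequence */
        out[((size_t)s * STEP + t) * N + blockIdx.x] = x;
        __syncthreads();
        base = (base + STEP) % BUF;
    }
}

Because the store index depends only on a value's position in the global stream, running this kernel with a different power-of-two N writes exactly the same sequence, which is the PCA property in action.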

3. Particle Filtering

Particle filtering is a sequential Monte Carlo technique using 'particle' (or point mass) representations of probability densities. Particle filters recursively estimate the state of a system, which can be non-linear and non-Gaussian. The core of the algorithm relies on a large quantity of good quality random numbers to estimate the next likely position of the system. A basic particle filter was implemented on the GPU using the sequences generated by the LFG above. The particle filter presented here is based very loosely on [2]; its main goal is to show that low level particle filtering techniques are well suited to a GPU environment using our Leap Frog Generator. With these low level basics in place, various higher level algorithms can be implemented with ease.

3.1. Theory

A particle filter consists of two parts: an observation vector at time t, labeled z_t, and a state vector at time t, labeled s_t. The system transition distribution p(s_t | s_{t-1}) describes the dynamics of the system as it moves from s_{t-1} to s_t. The system is assumed to be a first order Markov process, so s_t depends only on s_{t-1}. The observation likelihood distribution p(z_t | s_t) returns the likelihood of the observation z_t given the state s_t. The time series z_{1:t} is assumed to be conditionally independent given the unobserved sufficient state s_t. A first-order kinematic model with uniformly random perturbations is chosen as the motion model, i.e., the state dynamic model. Explicitly:

    s_t = F_{t-1} s_{t-1} + w_{t-1}    (5)

    s_{t-1} = [\, x_{t-1} \;\; y_{t-1} \;\; v_{x,t-1} \;\; v_{y,t-1} \,]^{\top}

    w_{t-1} = [\, w_{x,t-1} \;\; w_{y,t-1} \;\; w_{v_x,t-1} \;\; w_{v_y,t-1} \,]^{\top}

    F_{t-1} = \begin{pmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

where v_x and v_y are the velocities in the x and y directions, and w_{t-1} is the associated noise at time t − 1.

The process entails first generating initial particles in a detection zone to locate new objects to track. Next, a detection is performed to determine which of these particles have landed on the unknown foreground. Finally, these particles are propagated based on (5), with re-sampling to replace particles whose estimate of the motion model was incorrect.

Re-sampling keeps the system filled with live particles that have a good model estimate. As particles become inaccurate, they essentially die and are replaced by copies of live particles whose location and velocity have been slightly perturbed by noise. This makes the system more robust to changes in the target's velocity, by providing a variety of particles that are all propagated differently.

Figure 3 is presented as an example of the entire process across two frames. The foreground object to be tracked is the large circle, with the large arrow representing its velocity (in this case, without loss of generality, horizontally to the left). The blue, green, and red circles represent particles that are alive, dead, and re-sampled, respectively; the smaller arrows indicate their individual trajectories. To progress from the frame on the left to the frame on the right, we first perform the detection (based on any selected background detection algorithm), resulting in the alive and dead coloring shown. Then the propagation and re-sampling are performed. Some of the particles that were once considered good models of the target no longer lie inside its boundaries; these are now labeled as dead (green) and are no longer considered in future iterations. The live particles (blue), on the other hand, create copies of themselves with their values slightly perturbed by system noise. There are now a larger number of particles correctly "sitting" on the object, with increased variance in their associated velocities. As mentioned before, this makes the overall system more robust: as the object alters its course or velocity, there is a higher probability that one of the particles will have a close to correct velocity. This "correct" particle serves as a template for the system to re-sample, thus replenishing itself and, as a final result, adapting smoothly.

Figure 3. Particle filter example across two frames. The foreground object to be tracked is the large circle. Blue, green, and red circles indicate alive, dead, and re-sampled particles, respectively.

3.2. GPU Implementation

Particle filtering is well suited to implementation on a multiprocessor system. Since each of the individual functions of the particle filter breaks down into either pixel-based or particle-based calculations, they can each be computed in parallel, as their calculations are independent of each other.

A main point was to ensure that the particle's information structure fit inside the supported coalescing data sizes. According to [9], these sizes must be 4, 8, or 16 bytes. If a structure does not fall within this strict limit, loading and storing each particle would take a significant amount of time, as each load would occur as a serial operation instead of a parallel one, as discussed quite thoroughly in [9]. In our implementation we were able to shrink the memory requirement for each particle to fit perfectly into an 8-byte structure.
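The paper does not list the particle's exact fields, so the following 8-byte layout is only a plausible example; __align__(8) lets a warp fetch each particle in a single coalesced 8-byte transaction, per [9].

/* Hypothetical 8-byte particle: 16-bit position and velocity. */
struct __align__(8) Particle {
    short x, y;      /* position, in pixels */
    short vx, vy;    /* velocity, in pixels per frame */
};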

The first part of the implementation was to determine a basic background in order to perform background subtraction. Since the background will ultimately reside in the device memory of the graphics card, the background model can be computed by simply pushing a number of frames from the source to the graphics card and having a running average performed until the background model is accurate. This is done efficiently because each background update involves only two pieces of information, the previous background and the current frame, so all of the pixels can be processed fully in parallel.

The next task was to initialize the particles. For our experiments, 8000 particles were used. Initialization involved allocating the memory, assigning initial values, and randomly placing the particles over a small window on the right hand side of the image. The random placement was accomplished using numbers generated by the LFG algorithm. This area became the detection area: any new object in front of the background triggers a difference in the background subtraction and thus begins the true workings of the system. Since the particles are independent of each other and also reside in device memory, it was beneficial to have the GPU perform the initialization.

Afterwards, detection for the 8000 particles takes place. This consists of a background subtraction at each particle's location against the current frame. Once the current frame is copied to the device, there is no issue with performing the detection across many processors, because both the background model and the particles are present in global device memory. The most difficult part of the implementation was the re-sampling, the main reason being that it is impossible to know in advance which particles die and which particles hit their target. In theory this is insignificant, but in terms of implementation it is a problem, as we are restricted to very discrete locations for storing the re-sampled and live particles. In this specific case, the graphics card that was available did not support asynchronous operations, which prevented each particle from allocating its own memory without race conditions; another technique had to be applied. More clearly, if each alive particle generates five new particles based on its current state, it is difficult under the current CUDA frameworks to determine their storage locations in parallel.

To convert this re-sampling function into one that does not require asynchronous operations, it became necessary to sort the particles based on their state. Once sorted, the next stage was to determine the total number of alive particles and the total number of free spaces available in the system. Each of the free slots was then re-sampled from the alive particles, based on a multiple of the number of free spaces, to completely refill the system. The noise used in the system was based on numbers generated by the LFG algorithm. This specific function suffered the largest efficiency cost of the GPU implementation, but could easily be brought back to more acceptable standards using newer graphics cards that support the asynchronous operations.
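A sketch of this sort-based work-around, written with the modern Thrust library purely for brevity (Thrust post-dates this 2008 paper, so the original code necessarily hand-rolled these steps; the per-particle alive flag array is also our assumption):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/count.h>
#include <thrust/functional.h>

/* Sort so alive particles (flag 1) are contiguous at the front, then
   count them. Uses the Particle struct sketched above. */
int resample(unsigned char *alive_flags, Particle *particles, int n) {
    thrust::device_ptr<unsigned char> key(alive_flags);
    thrust::device_ptr<Particle>      val(particles);
    thrust::sort_by_key(key, key + n, val, thrust::greater<unsigned char>());
    int n_alive = (int)thrust::count(key, key + n, (unsigned char)1);
    return n_alive;   /* the copy/perturb step, omitted here, refills slots
                         n_alive..n-1 from particles 0..n_alive-1 */
}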

Lastly, after the particles are re-sampled, they are propagated. This involves applying (5) to each of the 8000 particles in the system. This step is easily implemented, as it requires no cross-particle communication, and was thus greatly accelerated in the parallel environment.
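A propagation sketch applying (5) to each particle, with the LFG supplying the perturbations (the noise scaling and the pre-generated noise buffer are illustrative assumptions; velocity noise is omitted for brevity):

#include <stdint.h>

__global__ void propagate(Particle *p, const uint32_t *lfg_noise, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    /* map two LFG samples to small signed perturbations in [-3, 4] */
    int wx = (int)(lfg_noise[2 * i]     & 7u) - 3;
    int wy = (int)(lfg_noise[2 * i + 1] & 7u) - 3;
    p[i].x += p[i].vx + wx;   /* x_t = x_{t-1} + dt*vx + wx, dt = 1 frame */
    p[i].y += p[i].vy + wy;
}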

4. Results

First, results are documented that are useful independent of the specific vision application; specific details for the particle tracker follow. All tests were run on a Hewlett Packard xw9400 workstation running Fedora, with an Nvidia GeForce 8800GTS having 340MB of onboard RAM. Timings in seconds were calculated using the standard C clock() function divided by CLOCKS_PER_SEC.
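In code, the timing convention is simply the following (run_experiment is a hypothetical stand-in for the kernel launches being measured):

#include <stdio.h>
#include <time.h>

void timed_run(void (*run_experiment)(void)) {
    clock_t t0 = clock();
    run_experiment();
    printf("%.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
}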

4.1. LFG Generator

As stated before, the Mersenne Twister is able to produce random numbers at about 1.8 billion samples per second. Upon examining the code provided with the NVIDIA SDK, it was determined that these numbers are generated and left on the device, not copied back to the host computer's (i.e., CPU) memory.

Under the same operating conditions, this LFG implementation was able to generate random numbers at an average speed of about 1.26 billion samples per second, with a maximum obtainable speed of 1.28 billion. This was using p = 607 and q = 273, with 128 blocks and the maximum parallelization of p − q (334) threads. The output sequence was saved to a file in binary format and compared against sequences generated by different configurations, simulating varying numbers of processors and threads, as well as against the most basic LFG running on a single CPU. The Unix command diff was used to compare these binary files, and no differences were detected.
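The comparison amounts to dumping each configuration's stream and diffing (the file names below are hypothetical):

#include <stdio.h>
#include <stdint.h>

/* Dump one configuration's output; afterwards, on the shell:
     diff lfg_blocks128.bin lfg_blocks32.bin
   An empty diff confirms the PCA property for that pair. */
void dump_samples(const char *path, const uint32_t *samples, size_t count) {
    FILE *f = fopen(path, "wb");
    if (f) { fwrite(samples, sizeof *samples, count, f); fclose(f); }
}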

Upon running the standard DIEHARD tests on the sequences, results were obtained that were similar to [5]; the sequences thus pass all of the statistical tests, as previously concluded in the literature.

Different (p, q) configurations were experimented with, as listed earlier in the implementation section. Initially it was thought that a greater step size (p − q) would improve the overall speed, because each iteration of a block would produce more numbers. The case p = 1279, q = 418 has a step size of 861, but was quickly ruled out, as the physical limitations of the card prevent more than 512 threads from being launched. The next attempt, p = 1279 and q = 861, did not provide any significant speed increase, simply because each block had more threads to run (more sets of warps), and the overhead of launching a new block is not significant enough to justify leaving an old block in place for a longer period of time.

                          CPU Reference   Standard Impl.   Inline Impl.
    Mersenne Twister      9.65 x 10^6     1.8 x 10^9       2.59 x 10^9
    Leap Frog Generator   2.43 x 10^7     1.26 x 10^9      3.5 x 10^9

Table 1. Speed comparisons of sample generation, in samples per second.

Since each block must write its final output to the complete sequence in device memory, the minimum number of threads recommended by the programming guide [9] to hide the memory read/write latency is 192. Since there are no statistically proven generators with a lag this low, no experiments were run to test this directly. As the actual number of threads was much greater than this minimum requirement, no memory lag due to reading and writing was experienced.

Another approach involved decreasing the number of threads from p − q (the maximum possible) to a multiple of the warp size (32 threads). This ensures there are no partially empty warps, so the computational resources are fully used. While this did provide a small speed increase, it was not enough to influence overall performance, which may simply mean that the scheduler is able to take such things into account when allocating resources and respond accordingly.

Finally, an experiment was run to determine the possible output speed if the values did not need to be recorded to device memory. This is useful, for example, when a modeling process can occur inline with the generation of new random numbers, so that an algorithm no longer needs any input from the host machine. Simply by removing the requirement to write to device memory, the computational speed jumped to about 3.5 billion random numbers per second. This is quite significant, because the Mersenne Twister under the same criteria is only capable of producing 2.59 billion random numbers per second, allowing us to claim a speed improvement of over 25%. This is easily explained: the MT code performs many steps to generate each sequential number, while the LFG needs only one binary operation (the xor), which is supported directly in hardware.
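A sketch of such an inline-consumption variant, reusing the constants from the kernel in §2.1; the xor digest is purely illustrative of "consuming" each sample in registers rather than storing the stream:

__global__ void lfg_consume(const uint32_t *init, uint32_t *digest, int steps) {
    __shared__ uint32_t buf[BUF];
    const int t = threadIdx.x;
    const int N = gridDim.x;
    for (int i = t; i < P; i += STEP)
        buf[i] = init[(size_t)i * N + blockIdx.x];
    __syncthreads();
    uint32_t local = 0;
    int base = 0;
    for (int s = 0; s < steps; ++s) {
        uint32_t x = buf[(base + t) % BUF] ^ buf[(base + t + Q) % BUF];
        buf[(base + P + t) % BUF] = x;
        local ^= x;                 /* consume in registers: no global store */
        __syncthreads();
        base = (base + STEP) % BUF;
    }
    digest[(size_t)blockIdx.x * STEP + t] = local;  /* one word per thread */
}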

According to the Nvidia specifications for the GeForce 8800GTS, there are 12 multiprocessors on the card. It is thought that if this algorithm were run on a graphics card whose number of multiprocessors is a power of 2 (the next logical step being 16), a significant increase in speed could be expected.

                          Time (seconds)   Frames per second (fps)
    Baseline              103.7            167.1
    C rand() function     463.9            37.5
    Leap Frog Generator   190.9            91.1

Table 2. Particle filtering comparisons for 17,400 frames. C rand() was run sequentially on the CPU, while the LFG was GPU based.

4.2. Particle Filter Integration

The combined LFG and particle filter were developed using C++, OpenCV (for basic image loading and smoothing functions), and the CUDA toolkit. All tests were run on the same aforementioned hardware.

The parallel LFG and the particle filter have proven to be a happy union of technologies. In most of the experiments, a speed increase of approximately 60% was observed over the sequential random number generator and sequentially run particle filter. This processing time includes the initialization of the LFG, in which it generates its own N * p numbers (rather than loading them from a previously created file).

Figure 4. Output sequence across 3 frames showing the particle filter adjusting to a new direction of movement. Alive particles in blue, dead particles in green, and "ground truth" as a red box.

What is most important to note is that regardless of the number of processors used in the simulation, be it one or 4096, the actual marked-up image output of the particle filter was exactly the same; the only noticeable difference was the execution time, shortened by 60%.

One of the major bottlenecks that exists only in the GPU based version of the particle filter is the necessity of copying each image frame to the device, and afterwards copying the final image back from the device. Benchmarks determined that for the host machine mentioned, performing a simple addition operation at each point of a 320x200 image takes 1.83 seconds. On the other hand, allocating 320x200 pixels on the graphics card and copying the data to it took only 0.531 seconds; the same mathematical function then took 0.159 seconds, and copying the information back took 0.148 seconds. The equivalent total operation thus takes only 0.848 seconds, which is still a 46% improvement over the sequential approach, completely justifying the use of the graphics card.

A baseline was created for comparison by removing all code specific to particle filtering, leaving only basic image loading, pre-processing, and output. This skeleton was able to process 17,400 frames of size 320x240 in 104.1 seconds. The sequential particle filter using the standard C rand() call ran in approximately 463.9 seconds; the GPU implemented version ran in 190.9 seconds. The GPU process entailed loading a frame, performing the detection and coloring in one kernel, propagating in another kernel, and re-sampling in yet another kernel. While it is easier to work with separate kernels for viewing/debugging purposes, occasionally pure speed may be desired. In this specific case, combining all of the kernels into one, which is possible since each particle's operations are independent of the others until the final re-sampling stage, gives a visible speed benefit of only about 1%. Either way, the GPU enabled version operates at around 91.1 frames per second, ensuring its real-time capabilities.

As particle filtering has shown before, the greater the number of particles and the higher the variance present in the system, the more robust the overall system will be. The GPU version with the LFG is able to handle a massive number of particles quickly and easily, so highly accurate results can be expected even from the most naive of particle filters. Since each frame is handled so efficiently, time remains for higher level algorithms to be run without fear of breaching the real-time capabilities.

5. Conclusions

An implementation of a PRNG on the GPU that matches or outperforms the prior state of the art has been presented. Its output is processor-cardinality agnostic; that is, it is independent of the number of processors present in the computer configuration. Specifically, when combining such a random number generator with a particle filter tracking application that is also run in parallel, the only significant change with respect to a CPU implementation was the speed of execution. The actual output of the combination (down to the last particle's position and velocity) remained the same in experiments regardless of the configuration of the computation environment. If applications that involve random numbers do not recurrently need input from the host machine, our random number generator will be valuable in ensuring that specific results are reproducible: behavior observed on a less capable machine can be investigated further on a much more powerful machine, with confidence that the results will not change.

We believe that this result changes prior thinking about parallel PRNGs. The particle filter computer vision application was chosen because it allows even the most naive particle filter to perform incredibly well in a GPU environment. Needless to say, the parallel PRNG will work in other applications as well.

References

[1] S. Aluru. Lagged Fibonacci random number generators for distributed memory parallel computers. J. Parallel Distrib. Comput., 45(1):1–12, 1997.

[2] S. Arulampalam, S. Maskell, and N. Gordon. A tutorial on particle filters for online nonlinear non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50:174–188, 2002.

[3] R. B. D'Agostino and M. A. Stephens, editors. Goodness-of-Fit Techniques. Marcel Dekker, Inc., New York, USA, 1986.

[4] D. E. Knuth. The Art of Computer Programming. Addison-Wesley Longman Publishing Co., Inc., Boston, USA, 1997.

[5] G. Marsaglia. A current view of random number generators. Pages 3–10, 1985.

[6] G. Marsaglia. The Marsaglia Random Number CDROM, including the DIEHARD battery of tests of randomness, 1995. http://www.stat.fsu.edu/pub/diehard.

[7] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul., 8(1):3–30, 1998.

[8] M. Matsumoto and T. Nishimura. Dynamic creation of pseudorandom number generators. In Monte Carlo and Quasi-Monte Carlo Methods 1998, pages 56–69, 2000.

[9] NVIDIA. CUDA Compute Unified Device Architecture Programming Guide, version 1.1, 2007.

[10] M. Sussman, W. Crutchfield, and M. Papakipos. Pseudorandom number generation on the GPU. In Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 87–94, New York, USA, 2006. ACM.

[11] S. Tzeng and L.-Y. Wei. Parallel white noise generation on a GPU via cryptographic hash. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 79–87, New York, USA, 2008. ACM.


