7/31/2019 Parallel Hashing 1
http://slidepdf.com/reader/full/parallel-hashing-1 1/42
Parallel Hashing
John Erol Evangelista
Definition of Terms
• GPU. Graphical Processing Unit
• Parallel Architecture. Architecture where calculations are done simultaneously
• Serial Architecture. Architecture where calculations are done serially
• Voxel. 3D Analog of Pixel
• Kernels. Programs that run on the GPU.
Definition of Terms
• Threads. Smallest unit of processing.
• Latency. Time delay.
• Cache. Storage of data for faster access.
• Race condition. Output is dependent on the timing of events.
GPU
• Graphics Processing Unit
• Its highly parallel architecture was recognized for its fast number-crunching abilities, giving rise to techniques for applying GPUs to non-graphical purposes.
Data Structures
• Applications rely on data structures that can be both built and used efficiently in a parallel environment.
• Defining parallel-friendly data
structures that can be efficiently
created, updated and accessed is a
significant research challenge.
Voxel
• 3D analog of the pixel
• Number of expected occupied voxels: O(N²).
• Storing the full N³ grid is extremely wasteful since most of the grid is empty.
Hash Table
• Popular for these types of data (voxels)
since they can be constructed to answer
queries in O(1) memory accesses.
Application
Figure 1.2. GPU hash tables are being constructed and queried every frame to perform Boolean intersections for these two animated models. Blue parts of one model represent voxels inside the other model, while green parts mark surface intersections. These images were produced using a 128³ voxel grid for point clouds of approximately 160k points. We achieve frame rates between 25–29 fps on a GTX 280, with the actual computation of the intersection and flood fill requiring between 15–19 ms. Most of the time per frame is devoted to actual rendering of the meshes.
Hash Tables
Figure 1.3. While allocating storage for the value of every possible key in an array allows directly indexing into the structure, it is wasteful when the array is mostly unused (top). A hash table can be used instead, which allocates far less space than the array (bottom). In this example, each slot holds both a key and its value. The table is indexed into using a hash function h(k). Because multiple keys may map to the same location, the key contained in the slot and the query key are compared on a retrieval to ensure the right value is returned.
Hash Tables
• Need to be adapted to a parallel environment
• Serialization
• Memory accesses are slow
• Many probes may be required
CUDA
• stands for “Compute Unified Device Architecture”
• provides essential functionality for parallel applications, such as scattered memory writes and atomic operations
CUDA C
• high-level GPU programming language
that extends C with extra constructs for
dealing with the hardware.
How it works
• Programs that run on the GPU are
called kernels and typically consist of
just a few small functions.
• Kernels are executed in parallel by threads, each performing the same instructions on different data.
• e.g., a program computing the hash function value of every input key.
Limitation
• Copying data to and from the GPU is very expensive.
• Kernels do not have access to the host system’s memory.
• Solution: use data structures that can be built and used entirely in parallel, allowing data to stay on the GPU while it is being processed.
How it works
• Threads are grouped into thread blocks
of up to 512 threads, which are assigned
to different streaming multiprocessors (SM) for execution.
• Thread blocks are queued up for the SMs and fed in as other thread blocks finish.
How it works
• Thread blocks can complete execution before others are even started, so there is no way to globally synchronize all the threads without ending the kernel.
• Threads in the same block can locally synchronize using execution barriers, guaranteeing that they have all reached the same point before continuing.
How it works
• Multiple thread blocks can be handled by SMs simultaneously, but there is a
hard limit on the number of threads the
SM can handle.
How it works
• Each SM breaks its thread blocks into groups of 32 consecutive threads called warps.
• SMs manage when each of their warps will be executed on their SIMD cores, with each step running the same instruction in lockstep, even when a branch occurs.
Types of memory
• low-latency shared memory
• high-latency global memory
Low latency memory
• used as cache for global memory
• scratchpad for threads working in the same thread block
• fast but small
• partitioned; does not persist between
kernel operations
Global Memory
• Abundant and shared but slow
• To hide latency, SMs automatically context switch to other warps while memory transactions are being performed
• reads up to 128-byte segments of memory with a single transaction
• memory requests of threads in a warp are coalesced together into fewer transactions
Atomic Operations
• used when race conditions are difficult or impossible to avoid
• perform a series of actions that cannot be interrupted
• e.g. incrementing a counter
Fermi architecture
• higher compute capabilities, more functionality
• efficient atomic operations and a cached memory hierarchy to further reduce latency when accessing global memory
Hashing on GPU
• Open Addressing
• While they can be very fast for bothconstruction and retrieval on a GPU,
problems arise when trying to make a
compact table: in the worst case, the
whole table would have to be
traversed to terminate a query.
Hashing on a GPU
• Chaining
• the number of probes increases greatly as the number of slots shrinks
• linked lists are horribly inefficient on a GPU
Hashing on a GPU
• Collision-free hashing
• a large enough table gives a constant probability of no collision
• increased construction time, and inherently sequential in some implementations
Hashing on a GPU
• Multiple-choice Hashing
• choose the candidate slot that has the lowest occupancy
• Cuckoo Hashing
• variation of open addressing that limits the slots an item can fall into
• uses multiple hash functions
Performance Metrics
• Construction time
• Retrieval efficiency
• Memory usage
Open Addressing
• Race conditions may occur (multiple threads attempting to insert an item into the same location simultaneously)
Open Addressing
Figure 3.1. Examples of linear probing (left) and quadratic probing (right).
Open Addressing
• The parallel construction assigns each input item to a thread, then has each thread simultaneously probe the hash table for empty slots
• atomic check-and-set operations force serialization of access to the table
Parameters
• Number of slots: S_T ≥ N, where S_T is the number of slots and N is the number of items in the input; typically S_T ≈ 1.25N
• Probe sequence:

Probing scheme      Hash function
Linear probing      h(k) = g(k) + iteration
Quadratic probing   h(k) = g(k) + c0 · iteration + c1 · iteration²
Double hashing      h(k) = g(k) + jump(k) · iteration

Table 3.1. Open addressing hashing schemes
Parameters
• Maximum allowed length of probe sequence. Used to terminate a probe sequence that is taking too much time: min(N, 10000).
Hash Function
• Perfect hash function. Benefits are minimal, since the hash tables can be constructed in a way that effectively limits the number of probes required to find an item to just one or two.
• Simple randomized hash functions work well in practice.
Hash Function
• g(k) = (f(a, k) + b) mod p mod S_T
• where a and b are randomly generated constants, p is a prime number, and S_T is the number of slots available in the hash table
Implementation
Algorithm 3.1. Process for creating an open addressing hash table.
1: allocate enough memory for table[], which will contain S_T 64-bit slots
2: repeat
3:   fill each slot with ∅
4:   generate a new hash function for the current attempt
5:   for all key-value pairs (k, v) in the input do
6:     repeat
7:       atomically check-and-set table[location]
8:       advance location to next location in probe sequence
9:     until ∅ is found or max probes hit
10:  end for
11: until hash table is built
Listing 3.1. Parallel insertion of items into an open addressing table.
__device__ bool insert_entry(const unsigned key,
                             const unsigned value,
                             const unsigned table_size,
                             Entry *table) {
  // Manage the key and its value as a single 64-bit entry.
  Entry entry = ((Entry)key << 32) + value;

  // Figure out where the item needs to be hashed into.
  unsigned index = hash_function(key);
  unsigned double_hash_jump = jump_function(key) + 1;

  // Keep trying to insert the entry into the hash table
  // until an empty slot is found.
  Entry old_entry;

  for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
    // Move the index so that it points somewhere within the table.
    index %= table_size;

    // Atomically check the slot and insert the key if empty.
    old_entry = atomicCAS(table + index, SLOT_EMPTY, entry);

    // If the slot was empty, the item was inserted safely.
    if (old_entry == SLOT_EMPTY) return true;

    // Move the insertion index.
    if (method == LINEAR)         index += 1;
    else if (method == QUADRATIC) index += attempt * attempt;
    else                          index += attempt * double_hash_jump;
  }

  return false;
}
Parallel Retrieval
• Follows same search pattern as
construction
Construction Rates
Figure 3.2. Effect of input size on construction and retrieval rates for tables containing 1.25N slots on both the GTX 280 (top) and 470 (bottom).
Memory Usage
Figure 3.3. Effect of the table size on construction and retrieval rates for tables containing 10 million items.
Limitations
• Performance drops significantly for
compact tables
• High variability in probe sequence length
• No support for removing items from the table
Sources
• Alcantara, D. Efficient Hash Tables on a GPU.