7/31/2019 Parallel Hashing 1
http://slidepdf.com/reader/full/parallel-hashing-1 1/42
Parallel Hashing
John Erol Evangelista
Definition of Terms
• GPU. Graphical Processing Unit
• Parallel Architecture. Architecture where calculations are done simultaneously
• Serial Architecture. Architecture where calculations are done serially
• Voxel. 3D Analog of Pixel
• Kernels. Programs that run on the GPU.
Definition of Terms
• Threads. Smallest unit of processing.
• Latency. Time delay.
• Cache. Storage of data for faster access.
• Race condition. Output is dependent on the timing of events.
GPU
• Graphics Processing Unit
• Its highly parallel architecture was recognized for its fast number-crunching abilities, giving rise to techniques for applying GPUs to non-graphical purposes.
Data Structures
• Applications rely on data structures that can be both built and used efficiently in a parallel environment.
• Defining parallel-friendly data
structures that can be efficiently
created, updated and accessed is a
significant research challenge.
Voxel
• 3D analog of the pixel
• Number of expected occupied voxels: O(N²).
• Storing the full N³ grid is extremely wasteful since most of the grid is empty.
Hash Table
• Popular for these types of data (voxels)
since they can be constructed to answer
queries in O(1) memory accesses.
Application
Figure 1.2. GPU hash tables are being constructed and queried every frame to perform Boolean intersections for these two animated models. Blue parts of one model represent voxels inside the other model, while green parts mark surface intersections. These images were produced using a 128³ voxel grid for point clouds of approximately 160k points. We achieve frame rates between 25–29 fps on a GTX 280, with the actual computation of the intersection and flood fill requiring between 15–19 ms. Most of the time per frame is devoted to actual rendering of the meshes.
Hash Tables
Figure 1.3. While allocating storage for the value of every possible key in an array allows directly indexing into the structure, it is wasteful when the array is mostly unused (top). A hash table can be used instead, which allocates far less space than the array (bottom). In this example, each slot holds both a key and its value. The table is indexed into using a hash function h(k). Because multiple keys may map to the same location, the key contained in the slot and the query key are compared on a retrieval to ensure the right value is returned.
Hash Tables
• Need to be adapted to a parallel environment
• Serialization
• Memory accesses are slow
• Many probes may be required
CUDA
• stands for “Compute Unified Device Architecture”
• provides essential functionality for parallel applications, such as scattered memory writes and atomic operations
CUDA C
• high-level GPU programming language
that extends C with extra constructs for
dealing with the hardware.
How it works
• Programs that run on the GPU are
called kernels and typically consist of
just a few small functions.
• Kernels are executed in parallel by threads, each performing the same instructions on different data.
• e.g., a program computing the hash function value of every input key.
Limitation
• Copying data to and from the GPU is very expensive.
• Kernels do not have access to the host system’s memory.
• Solution: use data structures that can be built and used entirely in parallel, allowing data to stay on the GPU while it is being processed.
How it works
• Threads are grouped into thread blocks
of up to 512 threads, which are assigned
to different streaming multiprocessors (SM) for execution.
• Thread blocks are queued up for the SMs and fed in as other thread blocks finish.
How it works
• Thread blocks can complete execution before others are even started, so there is no way to globally synchronize all the threads without ending the kernel.
• Threads in the same block can locally synchronize using execution barriers, guaranteeing that they have all reached the same point before continuing.
How it works
• Multiple thread blocks can be handled by SMs simultaneously, but there is a
hard limit on the number of threads the
SM can handle.
How it works
• Each SM breaks its thread blocks into groups of 32 consecutive threads called warps.
• SMs manage when each of their warps will be executed on their SIMD cores, with each step running the same instruction in lockstep, even when a branch occurs.
Types of memory
• low-latency shared memory
• high-latency global memory
Low latency memory
• used as cache for global memory
• scratchpad for threads working in the same thread block
• fast but small
• partitioned; does not persist between
kernel operations
Global Memory
• Abundant and shared but slow
• To hide latency, SMs automatically context switch to other warps while memory transactions are being performed
• reads up to 128-byte segments of memory with a single transaction
• memory requests of threads in a warp are coalesced together into fewer transactions
Atomic Operations
• used when race conditions are difficult or impossible to avoid
• perform a series of actions that cannot be interrupted
• e.g. incrementing a counter
Fermi architecture
• higher compute capabilities, more functionality
• efficient atomic operations and a cached memory hierarchy to further reduce latency when accessing global memory
Hashing on GPU
• Open Addressing
• While they can be very fast for bothconstruction and retrieval on a GPU,
problems arise when trying to make a
compact table: in the worst case, the
whole table would have to be
traversed to terminate a query.
Hashing on a GPU
• Chaining
• the number of probes increases greatly as the number of slots shrinks
• linked lists are horribly inefficient on a GPU
Hashing on a GPU
• Collision-free hashing
• a large enough table gives a constant probability of no collision
• increased construction time, and inherently sequential in some implementations
Hashing on a GPU
• Multiple-choice Hashing
• choose the candidate slot that has the lowest occupancy
• Cuckoo Hashing
• variation of open addressing that limits the slots an item can fall into
• uses multiple hash functions
Performance Metrics
• Construction time
• Retrieval efficiency
• Memory usage
Open Addressing
• Race conditions may occur (multiple threads attempting to insert an item into the same location simultaneously)
Open Addressing
Figure 3.1. Examples of linear probing (left) and quadratic probing (right).
Open Addressing
• The parallel construction assigns each input item to a thread, then has each thread simultaneously probe the hash table for empty slots
• atomic check-and-set operations force serialization of access to the table
Parameters
• Number of slots: S_T ≥ N, where S_T is the number of slots and N is the number of items in the input; typically S_T ≈ 1.25N
• Probe sequence:

Probing scheme      Hash function
Linear probing      h(k) = g(k) + iteration
Quadratic probing   h(k) = g(k) + c0 · iteration + c1 · iteration²
Double hashing      h(k) = g(k) + jump(k) · iteration

Table 3.1. Open addressing hashing schemes
Parameters
• Maximum allowed length of probe sequence. Used to terminate a probe sequence that is taking too much time: min(N, 10000).
Hash Function
• Perfect hash function. Benefits are minimal, since the hash tables can be constructed in a way that effectively limits the number of probes required to find an item to just one or two.
• Simple randomized hash functions work well in practice.
Hash Function
• g(k) = (f(a, k) + b) mod p mod S_T
• where a and b are randomly generated constants, p is a prime number, and S_T is the number of slots available in the hash table
Implementation
Algorithm 3.1. Process for creating an open addressing hash table.
1: allocate enough memory for table[], which will contain S_T 64-bit slots
2: repeat
3:   fill each slot with ∅
4:   generate a new hash function for the current attempt
5:   for all key-value pairs (k, v) in the input do
6:     repeat
7:       atomically check-and-set table[location]
8:       advance location to next location in probe sequence
9:     until ∅ is found or max probes hit
10:  end for
11: until hash table is built
Listing 3.1. Parallel insertion of items into an open addressing table.
__device__ bool insert_entry(const unsigned key,
                             const unsigned value,
                             const unsigned table_size,
                             Entry *table) {
  // Manage the key and its value as a single 64-bit entry.
  Entry entry = ((Entry)key << 32) + value;

  // Figure out where the item needs to be hashed into.
  unsigned index = hash_function(key);
  unsigned double_hash_jump = jump_function(key) + 1;

  // Keep trying to insert the entry into the hash table
  // until an empty slot is found.
  Entry old_entry;

  for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
    // Move the index so that it points somewhere within the table.
    index %= table_size;

    // Atomically check the slot and insert the key if empty.
    old_entry = atomicCAS(table + index, SLOT_EMPTY, entry);

    // If the slot was empty, the item was inserted safely.
    if (old_entry == SLOT_EMPTY) return true;

    // Move the insertion index.
    if (method == LINEAR)         index += 1;
    else if (method == QUADRATIC) index += attempt * attempt;
    else                          index += attempt * double_hash_jump;
  }

  return false;
}
Parallel Retrieval
• Follows same search pattern as
construction
Construction Rates
Figure 3.2. Effect of input size on construction and retrieval rates for tables containing 1.25N slots on both the GTX 280 (top) and 470 (bottom).
Memory Usage
Figure 3.3. Effect of the table size on construction and retrieval rates for tables containing 10 million items.
Limitations
• Performance drops significantly for
compact tables
• High variability in probe sequence length
• No support for removing items from the table
Sources
• Alcantara, D. Efficient Hash Tables on a GPU.