On Dynamic Load Balancing on Graphics Processors

Daniel Cederman and Philippas Tsigas, Chalmers University of Technology

Overview

• Motivation

• Methods

• Experimental evaluation

• Conclusion

The problem setting

[Figure: a set of tasks to be distributed; labels: Offline, Online]

Static Load Balancing

[Figure: four processors, each assigned one task; the tasks give rise to subtasks]

Dynamic Load Balancing

[Figure: subtasks]

Task sharing

[Flowchart: each processor tries to get a task from the Task Set (Acquire Task). Got task? No, retry. Otherwise it performs the task, adds any new tasks to the Task Set (Add Task), and checks the condition "Work done?". No, continue.]
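The loop in the flowchart can be sketched in host-side C++ as follows. This is only an illustration under stated assumptions: the `Task` type, the splitting rule, the mutex-protected `TaskSet`, and the `pending_` counter used to detect "Work done?" are all hypothetical and do not come from the slides.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Hypothetical task: a chunk of `size` items that is split until size == 1.
struct Task { int size; };

// Shared task set. A mutex-protected vector stands in for whichever
// queue or deque implementation is actually used.
class TaskSet {
public:
    void add(Task t) {
        pending_.fetch_add(1);                 // count the task before it becomes visible
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(t);
    }
    bool try_get(Task& out) {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    void finish() { pending_.fetch_sub(1); }   // one task fully processed
    bool work_done() const { return pending_.load() == 0; }
private:
    std::mutex m_;
    std::vector<Task> tasks_;
    std::atomic<int> pending_{0};
};

// Loop run by every worker; returns how many unit tasks it performed.
int worker(TaskSet& set) {
    int performed = 0;
    Task t;
    while (!set.work_done()) {                 // "Work done?"
        if (!set.try_get(t)) continue;         // "Got task?" -> "No, retry"
        if (t.size > 1) {                      // "New tasks" -> "Add task"
            set.add({t.size / 2});
            set.add({t.size - t.size / 2});
        } else {
            ++performed;                       // "Perform task"
        }
        set.finish();                          // children were added before this call
    }
    return performed;
}
```

Because child tasks are counted before their parent is marked finished, the counter cannot reach zero while work remains, so several threads can run `worker` on the same `TaskSet` once a root task has been added.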

System Model

• CUDA

• Global Memory

• Gather and scatter

• Compare-And-Swap

• Fetch-And-Inc

• Multiprocessors

• Maximum number of concurrent thread blocks

[Figure: three multiprocessors, each running a thread block, all sharing global memory]

Synchronization

• Blocking

• Uses mutual exclusion to only allow one process at a time to access the object.

• Lock-free

• Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.

• Wait-free

• Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
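The three categories can be illustrated with a shared counter. The sketch below is host-side C++ with hypothetical type names; on the GPU the corresponding CUDA primitives would be `atomicCAS` and `atomicAdd`. The wait-free variant assumes the hardware provides fetch-and-add as a single atomic instruction.

```cpp
#include <atomic>
#include <mutex>

// Blocking: mutual exclusion admits one thread at a time. A thread that
// stalls while holding the lock stalls every other thread as well.
struct BlockingCounter {
    std::mutex m;
    int value = 0;
    int increment() {
        std::lock_guard<std::mutex> g(m);
        return value++;
    }
};

// Lock-free: a CAS retry loop. A thread may have to retry, but a failed
// CAS means some other thread's increment succeeded in the meantime.
struct LockFreeCounter {
    std::atomic<int> value{0};
    int increment() {
        int old = value.load();
        while (!value.compare_exchange_weak(old, old + 1)) {
            // `old` now holds the freshly observed value; try again
        }
        return old;
    }
};

// Wait-free (assuming hardware fetch-and-add): every call completes in a
// bounded number of its own steps, with no retry loop.
struct WaitFreeCounter {
    std::atomic<int> value{0};
    int increment() { return value.fetch_add(1); }
};
```

Each `increment` returns the value the counter held before the call, which is also what Fetch-And-Inc returns.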

Load Balancing Methods

• Blocking Task Queue

• Non-blocking Task Queue

• Task Stealing

• Static Task List

Blocking queue
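A minimal sketch of a blocking task queue, assuming a bounded ring buffer guarded by a spin lock built from Compare-And-Swap. The class name, capacity, and `int` task type are hypothetical, and the sketch is host-side C++ rather than the implementation evaluated in the talk.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kQueueCapacity = 1024;   // hypothetical capacity

class BlockingQueue {
public:
    // Returns false if the queue is full.
    bool enqueue(int task) {
        lock();
        bool ok = (tail_ - head_) < kQueueCapacity;
        if (ok) slots_[tail_++ % kQueueCapacity] = task;
        unlock();
        return ok;
    }
    // Returns false if the queue is empty.
    bool dequeue(int& task) {
        lock();
        bool ok = head_ != tail_;
        if (ok) task = slots_[head_++ % kQueueCapacity];
        unlock();
        return ok;
    }
private:
    // Spin until the CAS flips the flag from 0 (free) to 1 (held).
    void lock() {
        int expected = 0;
        while (!locked_.compare_exchange_weak(expected, 1)) expected = 0;
    }
    void unlock() { locked_.store(0); }

    std::atomic<int> locked_{0};
    std::size_t head_ = 0, tail_ = 0;          // only touched while holding the lock
    int slots_[kQueueCapacity];
};
```

Every operation passes through the single lock, so all processors serialize on it no matter how many tasks are available.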

Non-blocking Queue

[Figure: queue holding tasks T1 to T4]

Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]

[Animation frames: a fifth task, T5, is added to the queue holding T1 to T4]
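The idea of a queue without locks can be sketched as below. This is a deliberately simplified bounded queue with no wraparound, not the Tsigas and Zhang algorithm: slots are reserved with fetch-and-add and claimed with Compare-And-Swap. The names, the capacity, and the restriction to non-negative `int` tasks are assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kSlots = 1024;   // hypothetical capacity, slots are never reused
constexpr int kEmpty = -1;             // marks a slot that has not been written yet

class SimpleNonBlockingQueue {
public:
    SimpleNonBlockingQueue() {
        for (auto& s : slots_) s.store(kEmpty);
    }

    // Reserve a slot with fetch-and-add, then publish the task into it.
    // Tasks must be non-negative so they cannot be mistaken for kEmpty.
    bool enqueue(int task) {
        std::size_t i = tail_.fetch_add(1);
        if (i >= kSlots) return false;             // out of room
        slots_[i].store(task);
        return true;
    }

    // Claim the head slot with CAS. Losing the race means another
    // dequeuer made progress, so the loop is lock-free.
    bool dequeue(int& task) {
        std::size_t h = head_.load();
        for (;;) {
            if (h >= kSlots || h >= tail_.load()) return false;  // empty
            int t = slots_[h].load();
            if (t == kEmpty) return false;         // reserved but not yet published
            if (head_.compare_exchange_weak(h, h + 1)) {
                task = t;
                return true;
            }
            // CAS failure refreshed h with the current head; retry
        }
    }
private:
    std::atomic<std::size_t> head_{0}, tail_{0};
    std::atomic<int> slots_[kSlots];
};
```

No thread ever waits for another to release anything: an enqueuer owns the slot it reserved, and a dequeuer either claims the head or observes that someone else did.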

Task stealing

Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]

[Figure: task queues, one shown holding T1, T4 and T5]
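A sketch of the kind of double-ended queue used for task stealing, following the structure of the deque in the cited Arora, Blumofe and Plaxton paper: the owner pushes and pops at the bottom, thieves take from the top, and a tag packed next to the top index guards against reuse of the same index. The class name, capacity, `int` task type, and the absence of an overflow check are assumptions of this host-side C++ example.

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uint32_t kDequeSize = 1024;  // hypothetical capacity, not checked
constexpr int kNoTask = -1;                 // returned when nothing was obtained

class WorkStealingDeque {
public:
    // Owner only: add a task at the bottom.
    void push_bottom(int task) {
        std::uint32_t b = bottom_.load();
        tasks_[b].store(task);
        bottom_.store(b + 1);
    }

    // Any thief: take the oldest task. kNoTask if empty or the race was lost.
    int steal_top() {
        std::uint64_t old_age = age_.load();
        std::uint32_t b = bottom_.load();
        std::uint32_t top = top_of(old_age);
        if (b <= top) return kNoTask;
        int task = tasks_[top].load();
        std::uint64_t new_age = pack(tag_of(old_age), top + 1);
        if (age_.compare_exchange_strong(old_age, new_age)) return task;
        return kNoTask;
    }

    // Owner only: take the newest task.
    int pop_bottom() {
        std::uint32_t b = bottom_.load();
        if (b == 0) return kNoTask;
        --b;
        bottom_.store(b);
        int task = tasks_[b].load();
        std::uint64_t old_age = age_.load();
        if (b > top_of(old_age)) return task;       // no thief can reach this slot
        // Last task or already empty: reset the deque and bump the tag.
        bottom_.store(0);
        std::uint64_t new_age = pack(tag_of(old_age) + 1, 0);
        if (b == top_of(old_age) &&
            age_.compare_exchange_strong(old_age, new_age)) return task;
        age_.store(new_age);                        // a thief took the last task
        return kNoTask;
    }
private:
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t top) {
        return (static_cast<std::uint64_t>(tag) << 32) | top;
    }
    static std::uint32_t top_of(std::uint64_t age) { return static_cast<std::uint32_t>(age); }
    static std::uint32_t tag_of(std::uint64_t age) { return static_cast<std::uint32_t>(age >> 32); }

    std::atomic<std::uint32_t> bottom_{0};
    std::atomic<std::uint64_t> age_{0};             // tag in the high half, top in the low half
    std::atomic<int> tasks_[kDequeSize];
};
```

In a task-stealing scheme each processor would own one such deque and call `steal_top` on another processor's deque only when its own `pop_bottom` comes back empty, so the common case involves no contention.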

Static Task List
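The slide only names this method. One plausible reading, sketched below in host-side C++, is a pair of arrays: each pass reads a fixed input list, every block handles a statically assigned share of it, and new tasks are appended to an output list with Fetch-And-Inc; the lists are then swapped and the pass is repeated until no tasks remain. The strided assignment, the task-splitting rule, and all names are assumptions of this example.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct TaskList {
    explicit TaskList(std::size_t capacity) : tasks(capacity), count(0) {}
    // Fetch-and-inc reserves a unique index, so concurrent appends do not collide.
    void append(int task) { tasks[count.fetch_add(1)] = task; }
    std::vector<int> tasks;
    std::atomic<std::size_t> count;
};

// Work of one block in one pass: entries block_id, block_id + num_blocks, ...
// Hypothetical task: a value n > 1 spawns n/2 and n - n/2; a value of 1 is finished.
void run_block(const TaskList& in, TaskList& out,
               std::size_t block_id, std::size_t num_blocks) {
    std::size_t n = in.count.load();
    for (std::size_t i = block_id; i < n; i += num_blocks) {
        int t = in.tasks[i];
        if (t > 1) {
            out.append(t / 2);
            out.append(t - t / 2);
        }
    }
}

// Driver: repeats passes until a pass produces no new tasks; returns the pass count.
int run_static_list(int root, std::size_t num_blocks) {
    std::size_t capacity = root > 0 ? static_cast<std::size_t>(root) : 1;
    TaskList a(capacity), b(capacity);
    TaskList* in = &a;
    TaskList* out = &b;
    in->append(root);
    int passes = 0;
    while (in->count.load() > 0) {
        out->count.store(0);
        // On a GPU these calls would be the thread blocks of one kernel launch.
        for (std::size_t block = 0; block < num_blocks; ++block)
            run_block(*in, *out, block, num_blocks);
        std::swap(in, out);
        ++passes;
    }
    return passes;
}
```

The only synchronization inside a pass is the Fetch-And-Inc on the output counter; the price is that a block cannot pick up extra work within a pass if its share turns out to be small.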

Octree Partitioning

• Bandwidth bound


Four-in-a-row

• Computation intensive

Graphics Processors

8800GT

• 14 Multiprocessors

• 57 GB/sec bandwidth

9600GT

• 8 Multiprocessors

• 57 GB/sec bandwidth

Blocking Queue – Octree/9600GT

[Figure: Time (ms) plotted against Threads and Blocks]

Blocking Queue – Octree/8800GT

[Figure: Time (ms) plotted against Threads and Blocks]

Blocking Queue – Four-in-a-row

[Figure: Time (ms) plotted against Threads and Blocks]

Non-blocking Queue – Octree/9600GT

[Figure: Time (ms) plotted against Threads and Blocks]

Non-blocking Queue – Octree/8800GT

[Figure: Time (ms) plotted against Threads and Blocks]

Non-blocking Queue – Four-in-a-row

[Figure: Time (ms) plotted against Threads and Blocks]

Task stealing – Octree/9600GT

[Figure: Time (ms) plotted against Threads and Blocks]

Task stealing – Octree/8800GT

[Figure: Time (ms) plotted against Threads and Blocks]

Task stealing – Four-in-a-row

[Figure: Time (ms) plotted against Threads and Blocks]

Static List

[Figure: Octree 9600GT, Octree 8800GTS and Four-in-a-row plotted against Threads/Block, 8 to 128]

Octree Comparison

[Figure: Blocking Queue, Non-Blocking Queue, Static List and Work Stealing plotted against Particles (thousands), 100 to 500]

Previous work

• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003

• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998

• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005

Conclusion

• Synchronization plays a significant role in dynamic load-balancing

• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well

• Locks perform poorly

• Atomic operations such as CAS and FAA, introduced in the new GPUs, are a welcome addition

• Work stealing could outperform static load balancing

Thank you!

http://www.cs.chalmers.se/~dcs