A Dynamic Scheduling Framework for Emerging Heterogeneous Systems
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University, Columbus, Ohio - 43210
Motivation
• Today, heterogeneous architectures are very common
– E.g., today's desktops & notebooks
– Multi-core CPU + graphics card on PCI-E, AMD APUs, …
• 3 of the top 5 supercomputers are heterogeneous (as of Nov 2011)
– Use multi-core CPUs and a GPU (C2050) on each node
• Application development for multi-core CPU and GPU usage is still independent
– Resources may be under-utilized
• Can a multi-core CPU and a GPU be used simultaneously for a single computation?
Outline
• Challenges Involved
• Analyzing Architectural Tradeoffs and Communication Patterns
• Cost Model for Choosing Chunk Size
• Optimized Dynamic Work Distribution Schemes
• Experimental Results
• Conclusions
Challenges Involved
• CPU / GPU vary in compute power, memory sizes, and latencies
• CPU / GPU relative performance varies across
– Each application
– Every combination of CPU and GPU
– Different problem sizes
• Effective work distribution is critical for performance
– Manual or static distribution is extremely cumbersome
• Dynamic distribution schemes are essential; they must
– Consider tradeoffs due to heterogeneity of CPU and GPU
– Adapt to varying CPU/GPU performance across applications
– Adapt to different problem sizes and combinations of CPU and GPU
Contributions
• A general dynamic scheduling framework for data-parallel loops on heterogeneous CPU/GPU systems
– Identifies critical factors from architectural tradeoffs and communication patterns
• Identify "chunk size" as a key factor for performance
– Developed a cost model for heterogeneous systems
• Derived two optimized dynamic scheduling schemes
– Non-Uniform-Chunk Distribution Scheme (NUCS)
– Two-Level Hybrid Distribution Scheme
• Using four applications representing two distinct communication patterns, we show:
– A 35-75% performance improvement using our dynamic schemes
– Performance within 7% of the best static distribution
Analysis of Architectural Tradeoffs and Communication Patterns
CPU – GPU Architectural Tradeoffs
Important observations based on architecture
• Each CPU core is slower than a GPU
• GPUs have a smaller memory capacity than the CPU
• GPU memory latency is very high
• CPU memory latency is comparatively much lower
Required Optimizations
• Minimize GPU memory transfer overheads
• Minimize number of GPU Kernel invocations
• Reduce potential resource idle time
Analysis of Communication Patterns
• We analyze communication patterns in data-parallel loops
– Divide the input dataset into a large number of chunklets
– Chunklets can be scheduled in arbitrary order (data parallel)
– Processing each element involves local and/or global updates
– Global updates involve only associative/commutative operations
– Global updates avoid races by privatizing global elements
– Global elements may be shared by all or a subset of processing elements
• In this work, we consider two distinct communication patterns:
– Generalized Reduction Structure
– Structured Grids (Stencil Computations)
Generalized Reduction Computations
{* Outer sequential loop *}
While (unfinished) {
  {* Reduction loop *}
  Foreach (element e in chunklet) {
    (i, val) = compute(e)
    RObj(i) = Reduc(RObj(i), val)
  }
}
The reduction object RObj resides in shared memory; Reduc is a commutative/associative operation.
• Similar to the Map-Reduce model, but with only one stage: Reduction
• The reduction object, RObj, is exposed to the programmer
• The reduction object is in shared memory [race conditions]
• The reduction operation, Reduc, is associative or commutative
• All updates are global updates
• Global elements are shared by all processing elements
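The loop above can be sketched in Python (a sketch, not the authors' implementation; `compute` and `reduc` below are hypothetical stand-ins, and each worker privatizes its reduction object to avoid races):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def compute(e):
    # Hypothetical compute step: map element e to (index, value).
    return e % 4, float(e)

def reduc(acc, val):
    # Associative/commutative reduction operation (here: sum).
    return acc + val

def process_chunklet(chunklet):
    # Each worker updates a private reduction object -> no races.
    robj = defaultdict(float)
    for e in chunklet:
        i, val = compute(e)
        robj[i] = reduc(robj[i], val)
    return robj

def generalized_reduction(chunklets, workers=4):
    final = defaultdict(float)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for private in pool.map(process_chunklet, chunklets):
            # Combination step: merge privatized copies; ordering is
            # irrelevant because reduc is associative/commutative.
            for i, val in private.items():
                final[i] = reduc(final[i], val)
    return dict(final)
```

Because chunklets commute under `reduc`, they can be scheduled in any order, which is exactly what makes dynamic CPU/GPU distribution safe.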
Structured Grid Computations
For i = 1 to num_rows_chunklet {
  For j = 1 to y-1 {
    B[i,j] = C0 * (A[i,j] + A[i+1,j] +
                   A[i-1,j] + A[i,j+1] + A[i,j-1])
  }
}
Example: 2-D, 5-point Stencil Kernel
• Stencil kernels are instances of structured grids• Involves nearest neighbor computations• Input partitioned along rows for parallelization
For i = 1 to num_rows_chunklet {
  For j = 1 to y {
    if (is_local_row(i)) { /* local update */
      B[i,j]   += C0 * A[i,j];
      B[i+1,j] += C0 * A[i,j];
      B[i-1,j] += C0 * A[i,j];
      B[i,j+1] += C0 * A[i,j];
      B[i,j-1] += C0 * A[i,j];
    } else { /* global update */
      Reduc(offset) = Reduc(offset) op A[i,j];
    }
  }
}
Rewriting the Stencil Kernel as a Reduction
• Rewritten as a reduction while maintaining correctness
• Processing involves both local and global updates
• Global elements are shared by only a subset of processing elements
Basic Distribution Scheme & Optimization Goals
• Global Work Queue
• Idle processor consumes work from the queue
• FCFS policy
• A fast worker ends up processing more data than a slow worker
• A slow worker still processes a reasonable portion of the data
[Figure: Uniform distribution scheme: a master/job scheduler feeds fast workers (Worker 1 … Worker n) and slow workers (Worker 1 … Worker n) from a global work queue]
Optimization Goals
1. Ensure a sufficient number of chunks
2. Minimize GPU data transfer and kernel invocation overhead
3. Minimize the number of global element allocations
4. Minimize the number of distinct processes that share a global element
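A minimal simulation of the uniform FCFS scheme, assuming hypothetical per-worker speeds (chunks per step) standing in for CPU vs. GPU throughput:

```python
import queue
import threading
import time

def run_uniform_scheme(num_chunks, worker_speeds):
    """FCFS global work queue: each idle worker pulls the next chunk.
    worker_speeds maps a worker name to a relative speed; faster
    workers naturally end up consuming more chunks."""
    work = queue.Queue()
    for chunk_id in range(num_chunks):
        work.put(chunk_id)
    processed = {name: [] for name in worker_speeds}
    lock = threading.Lock()

    def worker(name, speed):
        while True:
            try:
                chunk = work.get_nowait()
            except queue.Empty:
                return  # queue drained: worker becomes idle
            # Simulate processing time inversely proportional to speed.
            time.sleep(0.001 / speed)
            with lock:
                processed[name].append(chunk)

    threads = [threading.Thread(target=worker, args=(n, s))
               for n, s in worker_speeds.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Every chunk is handed out exactly once, so correctness does not depend on the CPU/GPU speed ratio; only the load split does.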
Cost Model for Choosing Chunk Size
Chunk Size: A Key Factor
Chunk size affects two factors that directly impact performance:
1. GPU Kernel Invocation and Data Transfer cost
2. Resource Idle time due to heterogeneous processing elements
[Figure: Overhead percentage vs. number of chunks (32 to 512), showing the tradeoff between device invocation & transfer cost and resource idle time]
Cost Model for Choosing Chunk Size
Idle Time
• Happens at the last iteration of the processing
• The slower processor takes more time, while the faster processor is idle
• The GPU, being the faster processor, will be idle at the end

GPU Kernel Call & Transfer Overheads (GKT)
• Each data transfer has 3 factors: latency, transfer cost, and kernel invocation
• The 1st and 3rd factors depend on the number of chunks (chunk size)
• The 2nd factor is constant for the entire dataset size

Goal: Minimize(Sum(Idle time, GKT)) for the entire processing
We show that the chunk size is proportional to the square root of the total processing time.
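The square-root claim can be derived in a few lines (a sketch; the symbols K, S, N, c_f, and t_e are notation introduced here, not from the slides):

```latex
% Notation: K chunks of size S over a dataset of N elements;
% c_f = fixed per-chunk cost (latency + kernel invocation);
% t_e = per-element processing time, so total processing time T = N t_e.
\mathrm{GKT}(K) \approx K\, c_f,
\qquad
\mathrm{Idle}(K) \approx \frac{T}{K}
\quad \text{(time to finish the last chunk).}

% Minimize the sum of the two overheads:
\frac{d}{dK}\!\left( K c_f + \frac{T}{K} \right)
  = c_f - \frac{T}{K^2} = 0
\;\Rightarrow\;
K^{*} = \sqrt{\frac{T}{c_f}}.

% Corresponding chunk size, using T = N t_e:
S^{*} = \frac{N}{K^{*}}
      = N\sqrt{\frac{c_f}{T}}
      = \frac{\sqrt{c_f\, T}}{t_e}
\;\propto\; \sqrt{T}.
```

With c_f and t_e fixed for a given machine and application, the optimal chunk size grows as the square root of the total processing time, matching the claim above.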
Optimized Dynamic Distribution Schemes
Non-Uniform Chunk Size Scheme
• Start with the initial chunk size indicated by the cost model
• If a CPU core requests work, a chunk of the initial size is forwarded
• If the GPU requests work, a larger chunk is formed by merging smaller chunks
• Minimizes GPU data transfer and device invocation overhead
• At the end of processing, idle time is also minimized
[Figure: NUCS work distribution: the initial data division produces chunks 1 … K; the job scheduler forwards small chunks to CPU workers and merges chunks into a large chunk for GPU workers]
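The NUCS request path can be sketched as follows (a sketch under assumptions: equal-size initial chunks held in a deque, and a hypothetical `gpu_merge_factor` controlling how many chunks are merged per GPU request):

```python
from collections import deque

def nucs_request(chunks, worker_type, gpu_merge_factor=4):
    """Non-Uniform Chunk Size (NUCS) request handler.
    chunks: deque of equal-size chunks (lists), sized per the cost model.
    A CPU request gets one chunk; a GPU request gets up to
    gpu_merge_factor chunks merged into one larger chunk, amortizing
    transfer and kernel-invocation overhead."""
    if not chunks:
        return None  # all work has been handed out
    if worker_type == "cpu":
        return chunks.popleft()
    # GPU: merge several small chunks into one large chunk.
    merged = []
    for _ in range(min(gpu_merge_factor, len(chunks))):
        merged.extend(chunks.popleft())
    return merged
```

Near the end of processing the queue runs short, so GPU requests automatically receive smaller merged chunks, which is what keeps the final idle time low.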
Two-Level Hybrid Scheme
• In the first level, data is dynamically distributed between the CPU and GPU
• Allows coarse-grained distribution of data
• Reduces the number of global updates
[Figure: Two-level hybrid scheme: a dynamic first-level split into a CPU chunk and a GPU chunk, each then divided statically among that device's threads (Thread 0 … Thread p-1, Thread k)]
• In the second level, the chunk is statically and equally distributed among the CPU cores and among the GPU cores
• Reduces the number of subsets that share global elements (from P^2 to P-1)
• Reduces combination overhead
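The second-level static split can be sketched as follows (`second_level_split` and its arguments are names introduced here for illustration, not from the paper):

```python
def second_level_split(chunk, num_threads):
    """Second level of the hybrid scheme: statically and equally divide
    a dynamically obtained first-level chunk among the threads of one
    device. With contiguous equal partitions, only adjacent partitions
    share boundary (global) elements, so far fewer thread subsets share
    a global element than under fully dynamic per-thread scheduling."""
    base, extra = divmod(len(chunk), num_threads)
    parts, start = [], 0
    for t in range(num_threads):
        # Spread any remainder over the first `extra` threads.
        size = base + (1 if t < extra else 0)
        parts.append(chunk[start:start + size])
        start += size
    return parts
```

Each device applies this split to the chunk it won in the first-level dynamic distribution, combining dynamic load balance across devices with cheap static scheduling within a device.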
Experimental Results
April 21, 2023
Experimental Setup
Applications for Generalized Reduction Structure
• K-Means Clustering [6.4 GB]
• Principal Component Analysis (PCA) [8.5 GB]
Applications for Structured Grid Computations
• 2-D Jacobi Kernel [1 GB]
• Sobel Filter [1 GB]
Environment 1 (CPU-centric)
• AMD Opteron
• 8 CPU cores
• 16 GB Main Memory
• Nvidia GeForce 9800 GTX
• 512 MB Device Memory

Environment 2 (GPU-centric)
• AMD Opteron
• 4 CPU cores
• 8 GB Main Memory
• Nvidia Tesla C1060
• 4 GB Device Memory
Experimental Goals
• Validate the accuracy of cost model for choosing chunk size
• Evaluate the performance gain from using optimized work distribution schemes (using CPU+GPU simultaneously)
• Study the overheads of dynamic distribution compared to the best static distribution
Accuracy of the Cost Model for Choosing Chunk Size
[Figure: Normalized relative speedup vs. number of chunks (16 to 256) for K-Means, PCA, Jacobi, and Sobel, with the predicted chunk size marked for each application]
• Each application achieves best performance at different chunk sizes
• Poor chunk size selection can impact the performance significantly
• “Predicted” chunk size is always close to the chunk size with best performance
Performance Gains from Using CPU&GPU
[Figure: Relative speedup of CPU-only, GPU-only, and CPU+GPU versions for k-means, PCA, Jacobi, and Sobel in ENV 1 (left) and ENV 2 (right)]
• For K-Means and PCA, the CPU+GPU version uses NUCS
• For Jacobi and Sobel, the CPU+GPU version uses the Two-Level Hybrid Scheme
• In both ENV 1 & ENV 2, performance improvements ranging from 37% to 75% can be achieved
• Shows that our dynamic scheduling framework can adapt to different hardware configurations of CPU and GPU
Scheduling Overheads of Dynamic Distribution Schemes
[Figure: Relative speedup of Naïve-Static, M-OPT Static, and Dynamic schemes for K-Means (left) and Sobel Filter (right) across thread configurations (1+1, 2+1, 4+1, 8+1 CPU+GPU)]
• We compare the dynamic schemes with static schemes
• "Naïve" static: distributes work equally between the CPU and GPU
• "M-OPT" static: obtained from an exhaustive search for every problem size, application, and hardware configuration
• Dynamic schemes: at most 7% slower than "M-OPT" static
• Significantly better than "Naïve" static
Conclusions
• We present a dynamic scheduling framework for data-parallel loops on heterogeneous systems
• Analyzed architectural and communication-pattern tradeoffs to infer critical constraints for dynamic scheduling
• Developed a cost model for choosing the optimal chunk size in a heterogeneous setup
• Developed two instances of optimized work distribution schemes
– NUCS & Two-Level Hybrid Scheme
• Our evaluation includes 4 applications representing 2 distinct communication patterns
• We show up to 75% improvement from using the CPU & GPU simultaneously
Thank You!
Questions?
Contacts: Vignesh Ravi - [email protected]
Gagan Agrawal - [email protected]