Date posted: 11-Aug-2015
Category: Software
Uploaded by: clay-chang
The pocl Kernel Compiler
Clay Chang
CPU versus GPU
CPU
• Sophisticated control
• Branch prediction
• Out-of-order execution
• Large cache
GPU
• Little control
• No or limited branch prediction
• Simple execution
• Small or no cache
• Lots of ALUs
OpenCL as the Portable API
Why OpenCL for CPU
• Multi-core CPUs are out there, e.g. the MediaTek tri-cluster 10-core SoC
• The mobile GPU is already busy: ~25% occupied by the system UI in Android
• Not every program runs well on a GPU, e.g. kernels with heavy branch divergence
• OpenCL makes it easy to exploit multi-core and SIMD (imagine writing pthreads + SIMD in assembly or intrinsics instead)
Running OpenCL Kernels on CPU
One thread per work-item?
• Thousands of threads would be created
• Context-switching problems
• How to synchronize the threads?
How about running one work-group on a CPU thread?
Related Works
Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors.
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Clover (http://people.freedesktop.org/~steckdenis/clover)
Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
What is pocl
• POrtable Computing Language
• An efficient implementation of the OpenCL standard which can be easily adapted for new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11
Components in pocl
The pocl Kernel Compiler
[Figure: OpenCL kernel source → Clang / LLVM → pocl kernel compiler. Labels: clBuildProgram(…), clEnqueueNDRangeKernel(…, local_size, …), single work-item kernel, transformed kernel.]
pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation
Work-group Function Generation
work_group_function() {
  for (int i = 0; i < work_group_size; i++) {
    /* kernel body (single work-item) */
  }
}
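The wrapping described above can be sketched in plain C. This is only an illustration under assumptions: the names vec_add_work_item, vec_add_work_group and run_ndrange are hypothetical, the work-item ID is passed as a parameter instead of coming from get_global_id(0), and pocl itself performs this transformation on LLVM IR rather than on C source.

```c
#include <assert.h>
#include <stddef.h>

/* Single work-item kernel body: c[gid] = a[gid] + b[gid].
 * In OpenCL C, gid would come from get_global_id(0); here it is a parameter. */
static void vec_add_work_item(const float *a, const float *b, float *c,
                              size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Work-group function: one call executes every work-item of one work-group
 * by looping over the local IDs (the WI-loop). */
static void vec_add_work_group(const float *a, const float *b, float *c,
                               size_t group_id, size_t local_size)
{
    for (size_t lid = 0; lid < local_size; lid++) {   /* WI-loop */
        size_t gid = group_id * local_size + lid;
        vec_add_work_item(a, b, c, gid);
    }
}

/* Hypothetical stand-in for the runtime: runs the work-groups one after
 * another. A real runtime could hand each group to a different CPU thread. */
static void run_ndrange(const float *a, const float *b, float *c,
                        size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    for (size_t g = 0; g < global_size / local_size; g++)
        vec_add_work_group(a, b, c, g, local_size);
}
```

Calling run_ndrange(a, b, c, 8, 4) on 8-element arrays runs two work-groups of four work-items each. Because the WI-loop iterations are independent, the backend is free to vectorize them.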
What if there are barriers?
[Figure labels: WI-loop; clEnqueueNDRangeKernel(…, group_size, …)]
Semantics of barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”
Example that violates this rule (only the odd work-items reach the barrier):
if (tid % 2) { … barrier(); … }
Kernel Without barriers
• A node in a CFG is a basic block (BB)
• BB: a branchless sequence of instructions
• A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl kernel compiler generates a WI-loop around the CFG
Types of Barriers
• Unconditional barriers: barriers that dominate the exit node
• Conditional barriers: barriers placed inside an if-else or a for-loop (b-loop)
Kernel with unconditional barriers
pocl Kernel Compiler creates WI-loops before and after the barrier
This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel to the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
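The effect of splitting a kernel at an unconditional barrier can be sketched in C. The example assumes a hypothetical kernel that reverses the elements of each work-group through a local buffer; reverse_work_group and MAX_LOCAL_SIZE are illustrative names, not pocl identifiers, and the real compiler forms these regions on the LLVM IR control-flow graph.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound on the local size */

/* Assumed single work-item kernel (OpenCL C, shown for reference only):
 *
 *     tmp[lid] = data[base + lid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     data[base + lid] = tmp[n - 1 - lid];
 *
 * The barrier splits the kernel into two parallel regions, and each region
 * gets its own WI-loop. */
static void reverse_work_group(int *data, size_t group_id, size_t local_size)
{
    int tmp[MAX_LOCAL_SIZE];               /* stands in for __local memory */
    size_t base = group_id * local_size;

    assert(local_size <= MAX_LOCAL_SIZE);

    /* Parallel region 1: from the kernel entry to the barrier. */
    for (size_t lid = 0; lid < local_size; lid++)
        tmp[lid] = data[base + lid];

    /* The barrier is satisfied here: every work-item has finished region 1
     * before any work-item starts region 2. */

    /* Parallel region 2: from the barrier to the kernel exit. */
    for (size_t lid = 0; lid < local_size; lid++)
        data[base + lid] = tmp[local_size - 1 - lid];
}
```

A single WI-loop around the whole body would be wrong here, because work-item 0 would read tmp[n - 1] before work-item n - 1 had written it.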
A CFG with Two Conditional barriers
Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.
A CFG with Two Conditional barriers – After Tail Duplication
WI-loop creation becomes easier!
[Figure: CFG after tail duplication, with barrier nodes and branches marked “?”]
“Peel” the First Loop Iteration
No more ambiguous branches in WI-loops!
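A much simplified C sketch of tail duplication combined with peeling, for a hypothetical kernel with one conditional barrier. The kernel, the function name cond_barrier_work_group and the buffer layout are assumptions made for illustration; pocl works on the LLVM IR control-flow graph, not on C source like this.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound on the local size */

/* Assumed single work-item kernel (OpenCL C, shown for reference only):
 *
 *     if (*mode != 0) {                    // uniform across the work-group
 *         tmp[lid] = data[lid] * *mode;    // A
 *         barrier(CLK_LOCAL_MEM_FENCE);
 *         data[lid] = tmp[(lid + 1) % n];  // B
 *     }
 *     data[lid] += 1;                      // C (the tail)
 *
 * After tail duplication there are two separate paths:
 *     entry -> A -> barrier -> B -> C'     (C' is the duplicated tail)
 *     entry -> C
 */
static void cond_barrier_work_group(int *data, const int *mode, size_t n)
{
    int tmp[MAX_LOCAL_SIZE];               /* stands in for __local memory */
    size_t lid;

    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Peeled first iteration: work-item 0 evaluates the branch. The barrier
     * rule guarantees that all other work-items take the same direction. */
    int take_barrier_path = (*mode != 0);

    if (take_barrier_path) {
        tmp[0] = data[0] * *mode;          /* A for the peeled work-item 0 */
        for (lid = 1; lid < n; lid++)      /* A for the remaining work-items */
            tmp[lid] = data[lid] * *mode;

        /* Barrier reached by the whole group. */

        for (lid = 0; lid < n; lid++) {    /* region from barrier to exit */
            data[lid] = tmp[(lid + 1) % n];    /* B */
            data[lid] += 1;                    /* C' (duplicated tail) */
        }
    } else {
        data[0] += 1;                      /* C for the peeled work-item 0 */
        for (lid = 1; lid < n; lid++)      /* C for the remaining work-items */
            data[lid] += 1;
    }
}
```

Once the peeled work-item has fixed the direction, each WI-loop contains only straight-line code for its region, so the loop never has to decide which barrier comes next.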
Barriers in Kernel Loops
Insert implicit barriers at:
1. The end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
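The result of this treatment can be sketched in C with a hypothetical tree-reduction kernel whose loop body contains a barrier. The function name reduce_work_group and the power-of-two restriction are assumptions of this example. The loop counter is kept as a single shared variable because it has the same value in every work-item, which is a simplification of what a compiler would generate.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound on the local size */

/* Assumed single work-item kernel (OpenCL C, shown for reference only):
 *
 *     tmp[lid] = in[lid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     for (s = n / 2; s > 0; s >>= 1) {     // b-loop: contains a barrier
 *         if (lid < s) tmp[lid] += tmp[lid + s];
 *         barrier(CLK_LOCAL_MEM_FENCE);
 *     }
 *
 * With barriers around the loop header and latch, the kernel loop stays
 * outermost and each region between barriers gets its own WI-loop. */
static int reduce_work_group(const int *in, size_t n)
{
    int tmp[MAX_LOCAL_SIZE];               /* stands in for __local memory */

    /* This example assumes n is a power of two. */
    assert(n > 0 && n <= MAX_LOCAL_SIZE && (n & (n - 1)) == 0);

    /* Region before the loop (ends at the barrier in the pre-header). */
    for (size_t lid = 0; lid < n; lid++)
        tmp[lid] = in[lid];

    /* b-loop: executed once for the whole work-group. */
    for (size_t s = n / 2; s > 0; s >>= 1) {
        /* WI-loop for the region between the header and the barrier. */
        for (size_t lid = 0; lid < n; lid++)
            if (lid < s)
                tmp[lid] += tmp[lid + s];
        /* Barrier: all work-items finish this stride before the next one. */
    }
    return tmp[0];
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8} the function returns 36, the same value the work-group would leave in tmp[0].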
Horizontal Inner-Loop Parallelization
More parallelization after loop interchange
blockWidth unknown until runtime
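The loop interchange can be illustrated with a C sketch, assuming a hypothetical kernel in which every work-item sums one row of blockWidth elements. The function names and the data layout are assumptions of this example. The interchange is only valid here because the trip count is the same for every work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: the WI-loop is outermost, and the kernel's own loop
 * (trip count block_width, known only at run time) is innermost. */
static void row_sum_wi_outer(const int *in, int *out,
                             size_t local_size, size_t block_width)
{
    for (size_t lid = 0; lid < local_size; lid++) {       /* WI-loop */
        int acc = 0;
        for (size_t j = 0; j < block_width; j++)          /* kernel loop */
            acc += in[lid * block_width + j];
        out[lid] = acc;
    }
}

/* After interchange: the kernel loop is outermost and the WI-loop is
 * innermost. The private accumulator becomes one slot per work-item;
 * here out[] doubles as that storage. The innermost iterations are known
 * to be independent, so they are candidates for vectorization. */
static void row_sum_wi_inner(const int *in, int *out,
                             size_t local_size, size_t block_width)
{
    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = 0;
    for (size_t j = 0; j < block_width; j++)              /* kernel loop */
        for (size_t lid = 0; lid < local_size; lid++)     /* WI-loop */
            out[lid] += in[lid * block_width + j];
}
```

Both functions compute the same sums. They differ only in which loop is innermost, and the second form gives the vectorizer a loop over work-items to work with.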
Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime is confined to the first parallel region (it is a temporary variable)
3. b's lifetime spans both parallel regions
Context Array
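A minimal C sketch of the idea, assuming a hypothetical kernel with the two variables a and b described above. The names context_work_group and b_ctx are illustrative. A variable that lives in only one region stays a plain scalar inside the WI-loop, while a variable that crosses the barrier gets one slot per work-item.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound on the local size */

/* Assumed single work-item kernel (OpenCL C, shown for reference only):
 *
 *     int a = in[lid] * 2;          // temporary, dead after region 1
 *     int b = a + 1;                // still needed after the barrier
 *     tmp[lid] = a;
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = b + tmp[(lid + 1) % n];
 */
static void context_work_group(const int *in, int *out, size_t n)
{
    int tmp[MAX_LOCAL_SIZE];     /* stands in for __local memory */
    int b_ctx[MAX_LOCAL_SIZE];   /* context array: one copy of b per work-item */

    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Parallel region 1. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;     /* plain scalar: not used after this region */
        b_ctx[lid] = a + 1;      /* saved so region 2 can read it */
        tmp[lid] = a;
    }

    /* Barrier. */

    /* Parallel region 2: each work-item restores its own b. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = b_ctx[lid] + tmp[(lid + 1) % n];
}
```

Keeping a as a scalar lets it live in a register, and only b pays the cost of a store and a load through the context array.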
References
Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation" in International Journal of Parallel Programming, Springer, August 2014.
http://github.com/pocl/pocl