Characterization and exploitation of nested parallelism
and concurrent kernel execution to accelerate high
performance applications
A Dissertation Presented
by
Fanny Nina Paravecino
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Computer Engineering
Northeastern University
Boston, Massachusetts
March 2017
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Dissertation Signature Page
Dissertation Title:
Characterization and exploitation of nested parallelism and concurrent
kernel execution to accelerate high performance applications
Author: Fanny Nina Paravecino NUID: 001160686
Department: Electrical and Computer Engineering
Approved for Dissertation Requirements of the Doctor of Philosophy Degree
Dissertation Advisor
Dr. David Kaeli    Signature    Date
Dissertation Committee Member
Dr. Qianqian Fang    Signature    Date
Dissertation Committee Member
Dr. Ningfang Mi    Signature    Date
Dissertation Committee Member
Dr. Norm Rubin    Signature    Date
Department Chair
Dr. Miriam Leeser    Signature    Date
Associate Dean of Graduate School:
Dr. Sara Wadia-Fascetti    Signature    Date
To science and the pursuit of answers through research.
Contents
List of Figures vi
List of Tables viii
List of Programs x
List of Acronyms xi
Acknowledgments xiii
Abstract of the Dissertation xiv
1 Introduction 1
1.1 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Advanced Parallel Features . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Characterization of Advanced Parallel Features . . . . . . . . . . . . . . . 4
1.3 Challenges in Exploiting Parallel Execution Features . . . . . . . . . . . . 5
1.3.1 Nested Parallelism Challenges . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Concurrent Kernel Execution Challenges . . . . . . . . . . . . . . 7
1.3.3 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 CUDA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 GPU Computing Architecture . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Fermi Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Kepler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Maxwell Architecture . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Pascal Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Related work 24
3.1 Characterization of GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Modern GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Multiple Levels of Concurrency . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . 27
4 Characterization of advanced parallel features 29
4.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Control Flow Instructions . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Parallel Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Child Kernel Launching and Synchronization . . . . . . . . . . . . 35
4.1.4 Memory Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Resource Contention . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Exploitation of advanced parallel features 44
5.1 Dependent Nested Loop Workloads . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Selective Matrix Addition . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Parallel Recursive Workloads . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 Breadth-First Search Algorithm . . . . . . . . . . . . . . . . . . . 47
5.2.2 Prim algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Validation with real-world applications 61
6.1 Connected Component Labeling . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Level-Set Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.3 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 Summary of Analysis for Real-world Applications . . . . . . . . . . . . . 67
7 Summary 70
7.1 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography 73
List of Figures
2.1 Layers of abstraction between software application and GPU hardware. . . 13
2.2 The CUDA model: a kernel, grid, and threads per block. . . . . . . . . . . 14
2.3 The CUDA memory hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Branch divergence in the GPU . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Work flow of the Grid Management Unit to dispatch, pause, and hold
pending and suspended grids. . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Dynamic Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Hyper-Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Control flow graphs of Program 4.1 for Kepler GTX Titan and Maxwell
GTX Titan X. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Execution time of Sequential, non-nested parallelism and nested parallelism
kernels on GTX Titan - Kepler architecture (lower is better). . . . . . . . . 37
4.3 Execution time of non-nested parallelism and nested parallelism across four
GPUs (2 Kepler and 2 Maxwell GPUs). . . . . . . . . . . . . . . . . . . . . 38
4.4 Execution time of sequential execution of kernels versus concurrent kernel
execution for two different GPUs while varying input size (lower is better). 40
4.5 Execution time of sequential execution of kernels versus concurrent kernel
execution for two different GPUs with persistent threads execution (lower is
better). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Resource utilization for non-persistent thread kernels using different input
data sets for the Maxwell GTX Titan X (lower is better). . . . . . . . . . . 43
4.7 Resource utilization for persistent thread kernels using different input data
sets for Maxwell GTX Titan X (lower is better). . . . . . . . . . . . . . . . 43
5.1 Speedup evaluation of nested parallelism implementation compared to non-
nested parallelism implementation for Selective Matrix Add for the Kepler
GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Graph representation using an adjacency list. . . . . . . . . . . . . . . . . 48
5.3 BFS operations while traversing a graph with six vertices, starting at source
vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 BFS speedup analysis of naive nested parallelism and optimized nested
parallelism versus non-nested parallelism implementation on Kepler GTX
Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 MST Tree of graph G = (V,E), where V = {0, 1, 2, 3, 4, 5}, starting at
source vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 Prim algorithm step by step work flow. Given a graph G = (V,E) with
an initial source vertex 0; find Minimum Spanning Tree (MST) using Prim
algorithm, where iteration 0 is described as the initialization of a MST tree
with a source vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1 Speedup comparison of the nested parallelism and non-nested parallelism
implementations, running CCL on a Kepler GTX Titan. . . . . . . . . . . . 63
6.2 Speedup comparison of the nested parallelism and non-nested parallelism
implementations, running Level-Set segmentation on a Kepler GTX Titan. . 68
List of Tables
2.1 NVIDIA GPU technology evolution [25] . . . . . . . . . . . . . . . . . . . 12
2.2 Fermi chip GF110 versus Kepler chip GK110 [41] . . . . . . . . . . . . . 19
2.3 A comparison of the features available on the four generations of NVIDIA
GPUs considered in this thesis [25]. . . . . . . . . . . . . . . . . . . . . . 23
4.1 Irregular Applications from two different GPU benchmarks which exhibit
control flow dependent nested loops. . . . . . . . . . . . . . . . . . . . . . 30
4.2 Recursive applications which exhibit parallel recursion. . . . . . . . . . . . 30
5.1 Irregular and recursive applications, with potential for exploiting advanced
parallel features in modern GPUs. . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Dynamic metrics for non-nested parallelism Selective Matrix Addition for
different input sets on Kepler GTX Titan. . . . . . . . . . . . . . . . . . . 45
5.3 Execution time of selective Matrix Add with different input sets for non-
nested parallelism and nested parallelism implementations on Kepler GTX
Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Runtime execution analysis of Breadth-First Search with different input sets
from the DIMACS Challenge Ninth [87] and Tenth [88] for naive nested
parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 54
5.5 Runtime execution analysis of Breadth-First Search with different input sets
from DIMACS Challenge Ninth [87] and Tenth [88] for optimized nested
parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 54
5.6 Runtime execution analysis of Prim’s algorithm with different input sets
from the DIMACS Challenge Ninth [87] and Tenth [88] for non-nested
parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 56
5.7 Runtime execution analysis of Prim algorithm with different input sets from
the DIMACS Challenge Ninth [87] and Tenth [88] for optimized nested
parallelism implementation on the Kepler GTX Titan. . . . . . . . . . . . . 60
6.1 Dynamic metrics for non-nested parallelism of CCL for different input sets
on a Kepler GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Dynamic metrics for nested parallelism of CCL for different input sets on a
Kepler GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Execution time of selective Matrix Add with different input sets for non-
nested parallelism and nested parallelism implementations on the Kepler
GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
List of Programs
4.1 Micro-benchmark Kernel with irregular nested loop execution. . . . . . . . 31
4.2 Fibonacci recursive scheme. . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Fibonacci parallel recursive scheme in CUDA. . . . . . . . . . . . . . . . 35
4.4 Micro-benchmark Kernel with irregular nested loop execution. . . . . . . . 41
5.1 Graph input using the DIMACS challenge structure for file storage. . . . . . 49
5.2 Breadth-first Search (BFS) recursive implementation on a CPU; graph is
a global variable which contains the vertex array and edge array. . . . . . . . 50
5.3 BFS non-recursive implementation on GPU. . . . . . . . . . . . . . . . . 52
5.4 BFS optimized nested parallelism implementation on GPU. . . . . . . . . 53
5.5 Non-nested parallelism implementation of Prim’s algorithm on a GPU. . . . 57
5.6 Optimized nested parallelism implementation of Prim’s algorithm on a GPU. 58
List of Acronyms
GPGPU General-Purpose computing on Graphics Processing Units. The use of graphics
processing units (GPUs) to perform computation in applications traditionally
handled by the central processing unit (CPU).
GPU Graphics Processing Unit. The graphics processor in the system.
CCL Connected Component Labeling. An image segmentation algorithm that labels
points connected by a similarity function.
LSS Level Set Segmentation.
SC Spectral Clustering.
SIMD Single-Instruction Multiple Data.
SIMT Single Instruction Multiple Thread.
API Application Programming Interface.
ISA Instruction Set Architecture.
TLP Thread-level Parallelism.
PTX Parallel Thread Execution.
MPI Message Passing Interface.
CUDA NVIDIA’s Compute Unified Device Architecture Framework.
OpenCL Open Computing Language.
SM Streaming Multiprocessor.
ECC Error Correcting Codes.
CTA Cooperative Thread Arrays.
PT Persistent Threads.
PDE Partial Differential Equation.
PDEs Partial Differential Equations.
BFS Breadth-first Search.
MST Minimum Spanning Tree.
Acknowledgments
It would not have been possible to write this doctoral thesis without the help and support
of the kind people around me, to only some of whom it is possible to give particular mention
here.
First of all, I would like to thank my parents, Fani and Dante, for their endless support
through every single step of this journey. I thank my brother, Reykjavil, and my sister,
Lisbeth, for keeping me on the path and making me believe that everything is possible. I
thank my boyfriend, Jose, for his unlimited love and unwavering support, for which my mere
expression of thanks likewise does not suffice.
This thesis would not have been possible without the help, support and patience of
my colleagues and collaborators. A special thanks to all my colleagues in the NUCAR group,
especially to Leiming, Fritz, Julian and Xiangyu, for their contributions towards the concepts
and ideas, and for keeping me company on the doctoral journey. I would also like to thank our
collaborators Dr. Qianqian Fang, Dr. Norm Rubin (NVIDIA) and Dr. Ningfang Mi for their
constructive feedback on this dissertation.
It is with my deepest gratitude and warmest affection that I dedicate this thesis to my advisor
Dr. David Kaeli who has been a constant source of knowledge and inspiration.
Abstract of the Dissertation
Characterization and exploitation of nested parallelism and
concurrent kernel execution to accelerate high performance
applications

by
Fanny Nina Paravecino
Doctor of Philosophy in Computer Engineering
Northeastern University, March 2017
Dr. David Kaeli, Adviser
Over the past decade, GPU computing has evolved from being a simple task of mapping
data-parallel kernels to Single Instruction Multiple Thread (SIMT) hardware, to a more
complex challenge, mapping multiple complex, and potentially irregular, kernels to more
powerful and sophisticated many-core engines. Moreover, recent advances in GPU
architectures, including support for advanced features such as nested parallelism and
concurrent kernel execution, further complicate the mapping task.
Improving application performance is a central concern for software developers. To
start with, the programmer needs to be able to identify where opportunities for optimization
reside. Many times the right optimization is tied to the underlying nature of the application
and the specific algorithms used. The task of tuning kernels to exploit hardware features can
become an endless manual process. There is a growing need to develop characterization
techniques that can help the programmer identify opportunities to exploit new hardware
features, and to port a broader range of applications to GPUs efficiently.
In this thesis, we present novel approaches to characterize application behavior that
can exploit nested parallelism and concurrent kernel execution introduced on recent GPU
architectures. To identify bottlenecks that can be improved through the exploitation of
nested parallelism and concurrent kernel execution, we propose a set of metrics for a range
of GPU kernels.
For nested parallelism, our approach focuses on irregular and recursive kernel applica-
tions. For irregular applications we define, implement, and evaluate three main runtime
components: i) control flow workload analysis, ii) child kernel launching, and iii) child
kernel synchronization. For recursive kernel applications, we define, implement, and eval-
uate: i) degree of thread-level parallelism, ii) work efficiency, and iii) overhead of kernel
launches. For concurrent kernel execution, our characterization captures a kernel’s launch
configuration, the resource consumption, and the degree of overlapped execution. Our pro-
posed metrics help us to better understand when to exploit nested parallelism and concurrent
kernel execution.
We demonstrate the utility of our framework of metrics by focusing on a diverse set
of workloads that include both irregular and recursive program behavior. This suite of
workloads includes: i) a set of microbenchmarks that specifically target the set of new
GPU features discussed in this thesis, ii) the NUPAR suite, iii) the Lonestar suite and iv)
real-world applications. By using our framework, we are able to speed up applications by
5x to 23x as compared to GPU implementations that do not use these advanced parallel features.
Chapter 1
Introduction
In 1965 Gordon Moore proposed Moore’s Law, which states that the number of tran-
sistors on a microprocessor doubles roughly every 18 months [1]. Since 1965, Moore’s
Law has been shown to be remarkably accurate, and microprocessors have doubled their
capabilities every one to two years. However, the translation of increased transistor density
into improved application performance remains a challenging endeavour. There is no silver
bullet that automatically optimizes software, programming frameworks, and algorithms so
that they can benefit from advances in hardware.
In many areas, performance improvements have been possible only due to modifications
in algorithms, providing substantial performance gains that are much higher than those
enabled by increasing processor speed alone. There are still many challenges that need
to be addressed through the discovery of new parallel algorithms, specifically designed to
take advantage of the potential power of parallel hardware, while avoiding some of the
bottlenecks that can occur on these platforms.
In this thesis, we will explore different mechanisms to understand the behavior of the
parallel code (i.e., kernels) at different stages of the computing stack, including multiple
compilation levels, as well as runtime execution. This work will define a characterization
process of parallel execution that will guide and inform the programmer on how best to
exploit new parallel features. Equipped with this knowledge, the programmer can then
exploit parallelism at different grains of concurrency. We test our characterization process
on a broad set of parallel applications, demonstrating the utility of this knowledge to tune
applications to effectively exploit two recently introduced parallelization features: 1) nested
parallelism and 2) concurrent kernel execution. We will also present a tuning mechanism to
further improve application throughput.
1.1 Parallel Programming
Parallel programming provides a myriad of advantages over sequential programming,
such as increased application throughput, improved utilization of hardware resources, and
enhanced concurrent execution [2]. Given the wide range of parallel computing hardware
platforms available to-date, spanning massively parallel supercomputers to multicore smart-
phones, parallel execution has become the most effective path to improve performance. The
need for high performance has been amplified by the rate at which raw data is being
generated today, a rate that will continue to grow for the foreseeable future.
Commonly, the easiest way to write parallel code is using a framework such as OpenMP.
OpenMP is a simple, directive-based interface that offers incremental parallelization, which
allows loops in serial code to be executed concurrently without changing their structure [3].
However, using OpenMP does not solve the problem of load imbalance, and the resulting
performance gain is limited by Amdahl's law [4], which states that the achievable speedup
is bounded by the fraction of the code that can be parallelized.
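For reference, Amdahl's law can be written as follows, where $p$ denotes the fraction of the program that can be parallelized and $N$ the number of processors:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

Even with an unlimited number of processors, the serial fraction $(1 - p)$ bounds the overall speedup.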
When working with a distributed system, Message Passing Interface (MPI) [5] provides
an effective programming model for expressing parallelization. MPI is commonly used on
distributed memory systems that leverage message passing. However, one notable trend we
are witnessing in the field of parallel scientific computing is the dramatic increase in the
number of applications that utilize GPUs. Based on Flynn’s widely used taxonomy [6, 3], the
large number of cores on the Graphics Processing Unit (GPU) enable us to launch thousands
of compute threads to execute in Single-Instruction Multiple Data (SIMD) fashion. SIMD
provides parallelism by operating on multiple data streams concurrently [3]. Applications
for GPUs are commonly developed using programming frameworks such as Khronos's Open
Computing Language (OpenCL) [7, 8, 9] and NVIDIA's Compute Unified Device
Architecture (CUDA) [10]. Both OpenCL and CUDA are based on the high-level
programming constructs of the C and C++ languages. The data parallel and computationally
intensive portions of an application are offloaded to the GPU for accelerated execution.
These programming frameworks offer a rich set of runtime APIs, and allow the developer to
write optimized kernels for execution on GPUs.
Researchers and developers have enthusiastically adopted the CUDA programming
model and GPU computing for a diverse range of applications [11, 12, 13, 14]. Given the
varying degrees of parallelism present in many applications, we are motivated to explore
advanced parallel features on the GPU.
1.1.1 Advanced Parallel Features
Recent advances in GPU architectures have pushed back a number of computational barriers,
enabling researchers to leverage parallel computing to improve application throughput.
Graphics hardware has substantially evolved over the years to include more functionality
and programmability. NVIDIA’s previous generation of GPUs, the Fermi family, has been
used in a number of applications, promising peak single-precision floating-point performance
of up to 1.5 TFLOPS. However, NVIDIA's Kepler GK110 GPU offers more than 4.29 TFLOPS
of single-precision computing capability. The newest features provided on Kepler enable
programmers to move a wider range of applications to the CUDA framework.
Given the new hardware features provided on recent hardware, exploiting these features
to improve overall execution throughput has become paramount. Thread-level parallelism
provides impressive speedups for applications ported to the GPU. Moreover, the addition of
nested parallelism improves the throughput of conditional-loop execution, which requires
working at a finer thread granularity. Another new feature is concurrent kernel execution,
which improves the utilization and runtime of multiple kernels, removing the overhead
due to context switching. There is also a performance advantage provided by performing
back-to-back kernel launches. In the CUDA API, kernel invocations are asynchronous. If a
developer can call a kernel (or kernels) multiple times without any intervening synchroniza-
tion (i.e., memory transfers or dependency checking), then the multiple kernel calls will be
batched in the CUDA driver, and the application can overlap kernel execution on the GPU.
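The overlap described above can be sketched with CUDA streams. The kernel names, launch sizes, and per-element work here are illustrative, not taken from the thesis:

```cuda
// Sketch: launching independent kernels into separate CUDA streams so the
// hardware can overlap their execution if resources allow.
__global__ void kernelA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void kernelB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void launchConcurrent(float *dA, float *dB, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    int threads = 256, blocks = (n + threads - 1) / threads;
    // No intervening synchronization between the two launches: the driver
    // batches them, and the GPU may execute them concurrently.
    kernelA<<<blocks, threads, 0, s0>>>(dA, n);
    kernelB<<<blocks, threads, 0, s1>>>(dB, n);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

Because the two kernels are placed in different streams and share no data, the runtime imposes no ordering between them; serializing work into a single stream would forfeit the overlap.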
Given the level of sophistication provided in the modern GPUs, we have focused our
work on the characterization of advanced parallel features in order to guide the improvement
of application throughput. We consider optimization of applications for two new features
available on NVIDIA Kepler GPUs and more recent GPU generations:
• Nested Parallelism: modern GPUs add the capability to launch child kernels within a parent kernel. A pattern commonly found in many sequential algorithms is nested
loops. Nested parallelism allows us to implement a nested loop with variable amounts
of parallelism.
• Concurrent Kernel Execution: modern GPUs provide the ability to run multiple kernels, assigned to different streams, concurrently. The Kepler, Maxwell and Pascal
architectures support up to 32 concurrent streams (as compared to 16 on the Fermi).
Each stream is assigned to a different hardware queue.
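A minimal sketch of nested parallelism is shown below. The kernel names and the row-based work distribution are hypothetical; device-side launches require a GPU of compute capability 3.5 or higher, compiled with nvcc flags such as -arch=sm_35 -rdc=true -lcudadevrt:

```cuda
// Sketch: a parent kernel launches a child grid sized by a data-dependent,
// per-row amount of work, which a fixed launch configuration cannot express.
__global__ void childKernel(float *row, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) row[i] *= 2.0f;   // per-element work on one row
}

__global__ void parentKernel(float *data, const int *rowLen,
                             int pitch, int nRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nRows && rowLen[r] > 0) {
        // The child grid's size varies with this row's length.
        int threads = 128;
        int blocks = (rowLen[r] + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data + r * pitch, rowLen[r]);
    }
}
```

The key point is that the amount of parallelism in each child launch is decided at runtime by the data (rowLen[r]), rather than fixed on the host before the kernel begins.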
1.2 Characterization of Advanced Parallel Features
The utilization of high performance computing resources has also been hampered by the
relative dearth of system software and tools for monitoring and optimizing performance.
Profilers have evolved to provide application execution insights to the developer in order to
improve application throughput. However, profilers are tightly tied to specific hardware and
do not support the latest advanced parallel features, which makes tuning applications
targeting modern GPUs a challenge.
New approaches to profiling/instrumentation are needed to understand application in-
teraction with the latest hardware features. Binary instrumentation can be used on a GPU
for performance debugging, correctness checks, workload characterization, and runtime
optimization. Such techniques typically involve inserting code at the instruction level of an
application during back-end compilation; binary translation is then able to gather data-dependent
application behavior.
Given the presence of data-dependent behavior of an application, we can characterize
different execution patterns. Our focus is to characterize dynamically available parallelism
with the aim of evaluating implementations designed to exploit these execution patterns using
advanced parallel features such as nested parallelism. Our characterization approach evaluates
the potential for optimization by analysing the impact on control, memory and synchro-
nization behavior on a GPU. As an illustrative example, our study targets a comprehensive
understanding of the overhead of current nested parallelism supported on GPUs in terms of
kernel launch, control flow, nested synchronization and algorithm overhead.
We also consider another form of parallelism available on modern GPUs: concurrent
kernel execution. Just as a typical CPU application can consist of multiple functions, it
is also common to have multiple GPU kernels present in a single GPU application. A
GPU kernel is a function executed on a GPU device. Managing efficient concurrent kernel
execution using independent thread blocks is cumbersome at best. In particular, this thesis
targets a detailed understanding of the run-time costs of concurrent kernel execution in terms
of kernel launch configuration, resource contention, and overlapped computation.
1.3 Challenges in Exploiting Parallel Execution Features
The software implementation of a GPU application can dramatically influence the
application’s performance. For example, performance will suffer if kernels are stalled due
to control dependence. Delays also occur when data dependencies are encountered. GPU
stream processors are more difficult to utilize effectively if the targeted applications present
dynamic and frequent data dependencies (commonly present in sorting, recursion, dynamic
programming and evolutionary programming).
Along with the challenges of dynamic and global dependencies, many applications
involve the execution of multiple kernels. The current generation of NVIDIA GPUs already
supports concurrent execution of kernels using Hyper-Q technology, allowing concurrent
execution of kernels from the same application or different applications. In this thesis,
we characterize concurrent kernel execution, and explore how to improve resource utilization
and minimize kernel launch overhead. Presently, it is difficult to modify an application to
effectively leverage nested parallelism and concurrent kernel execution. Addressing this gap
is the major focus of this thesis.
1.3.1 Nested Parallelism Challenges
Depending on the application characteristics and the parallelization strategy, a kernel
can exhibit a range of dynamic behaviors. The dynamic behavior is highly correlated to data-
dependent parallel execution. Data dependencies are found in parallel loops and recursive
calls, both of which are forms of nested Thread-level Parallelism (TLP).
Nested TLP can present a range of control flow behaviors. Explicit control flow constructs
such as if-then-else or for-loop are fundamental to any high-level
programming language. In kernels with complex control flow, SIMD threads can
follow different paths of execution, causing thread divergence. Thread divergence would
seem to cause a paradox, since all threads in a basic group (e.g., a warp) must execute
the same instruction on each cycle. If the threads in a warp diverge, the warp serially
executes each branch path, disabling threads that do not take that path. Warp divergence can
dramatically degrade application performance.
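A minimal kernel illustrating such a data-dependent branch is shown below; the kernel name and branch condition are illustrative, not taken from the thesis:

```cuda
// Sketch of warp divergence: threads in the same 32-thread warp take
// different branch paths, so the warp executes both paths serially,
// masking off the threads on the inactive path each time.
__global__ void divergentKernel(int *out, const int *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Data-dependent branch: within one warp, some threads may take the
    // 'then' path and others the 'else' path.
    if (in[i] % 2 == 0)
        out[i] = in[i] * in[i];   // path A
    else
        out[i] = in[i] + 1;       // path B
}
```

If the input interleaves even and odd values, nearly every warp executes both paths; if the input is sorted so that warps see uniform values, the same kernel runs with no divergence at all, which is why characterizing divergent versus convergent paths matters.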
Understanding control flow effects is a key step towards characterization of nested
parallelism. We have faced the following challenges when trying to exploit dynamic
parallelism:
• For control flow analysis, it is important to quantify the impact of thread divergence by categorizing divergent and convergent paths in order to understand how performance is
impacted. Control flow divergence effects can severely impact our ability to leverage
nested parallelism. On-the-fly analysis of control flow workload provides a better
understanding for data-dependent applications. In previous work, control flow analysis
has been performed statically.
• To properly characterize child kernel launches, we need to understand kernel launch parameters and device runtime management. There presently are no tools or profilers
that can properly analyze nested-kernel launch overhead.
• Nested parallelism requires that parent kernels and child kernels explicitly synchronize with each other in order to ensure consistent application execution. In order to perform
child kernel synchronization, the device runtime has to save the state of parent kernels
when they are suspended and yield to the child kernels at the synchronization points.
To our knowledge, there are no tools available that can measure dynamic child kernel
synchronization.
1.3.2 Concurrent Kernel Execution Challenges
Enabling multiple kernels to execute concurrently on GPUs leads to the physical shar-
ing of compute resources. Concurrent kernel execution can increase overall application
throughput and can also reduce energy consumption. In order to deliver a performance
improvement, there must be sufficient resources on the GPU to launch concurrent kernels.
In other words, concurrent kernel execution provides performance improvement through
overlapped kernel computation. In order to achieve overlapped kernel computation, we
need to understand the sources of resource contention and the effects of the kernel launch
configuration.
Resource contention is heavily dependent on the application input. For example, a
small input set might not stress the memory, whereas a large input set might. At the same
time, resource contention is dependent on the amount of GPU hardware resources available.
An application binary compiled and optimized for one GPU may perform poorly on another
GPU due to resource contention.
The kernel launch configuration can give us clues leading to resource contention. Each
kernel is launched with a set of variables called the launch configuration variables. Com-
monly, these variables include the number of threads per block, the number of thread-blocks
per grid, the usage of shared memory, and the number of registers used. Most of the time,
these variables are dictated by the number of data elements the kernel operates on. Depending
on the GPU architecture, the resource usage based on these variables can change
dramatically. Having a better understanding of the resource contention is a key step towards
the characterization of concurrent kernel execution. We face the following challenges when
trying to exploit concurrent kernel execution:
• To properly understand resource contention, we need to have better control of the resources utilized by the kernel. We can bring software threads closer to the actual hardware thread execution by implementing persistent threads. Persistent threads break the mapping of one software thread to one data element; instead, the mapping is defined dynamically by the availability of resources on the GPU. There is no general way to map any kernel to persistent threads, and persistent threads will not always provide the best performance for every kernel.
• Resource contention varies dramatically across different GPU architectures, driver versions, and CUDA frameworks. Furthermore, the compiler and driver can have a significant impact on kernel performance. To properly exploit concurrent kernel execution, we need to understand the interaction between the hardware, driver, compiler, and CUDA framework, which, unfortunately, is not disclosed by hardware vendors.
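The persistent-thread mapping described above can be sketched on the host side. The following is a minimal illustration, with a hypothetical helper name; it models only the index arithmetic, not the GPU implementation developed in this thesis. In a flat kernel, thread i processes element i, so the grid size is dictated by the input; a persistent-thread kernel launches only as many threads as the hardware can keep resident, and each thread strides over the data:

```cpp
#include <cstddef>
#include <vector>

// Lists the elements one persistent thread (out of `total_threads` resident
// threads) would process over an input of `n` elements. The grid size is
// fixed by the hardware, and each thread strides until the data is exhausted.
std::vector<std::size_t> persistent_elements(std::size_t thread_id,
                                             std::size_t total_threads,
                                             std::size_t n) {
    std::vector<std::size_t> work;
    for (std::size_t i = thread_id; i < n; i += total_threads)
        work.push_back(i);
    return work;
}
```

For example, with four resident threads and ten elements, thread 1 would process elements 1, 5, and 9; resource usage is now controlled by the launch, not by the input size.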
1.3.3 Benchmark Suite
Many applications, both academic research prototypes and industrial products, have been accelerated using parallel frameworks to achieve significant parallel speedup. Such applications encompass a variety of problem domains, including security surveillance, numerical linear algebra, and graph theory problems, among others. From these many applications, we select a set of representative real-world applications to focus our discussion.
There has been considerable growth of interest in image segmentation problems for security surveillance. This interest has created an increased need for performant image segmentation kernels. Different approaches to image segmentation have used GPU computing in a wide variety of applications [15, 14, 16, 17]. Among the different image segmentation approaches, Connected Component Labeling (CCL) and Level Set Segmentation (LSS) are the most well-known applications.
CCL is a widely used image segmentation algorithm. It connects neighboring pixels based on their similarities. The dependencies between neighboring pixels and the continuous propagation of connectivity between pixels make CCL a highly sequential application. CCL is a great candidate for the characterization of nested parallelism due to its dynamic propagation of connected components.
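The label-propagation dependencies in CCL can be seen in a minimal sequential sketch using union-find on a binary image with 4-connectivity. This illustrates the algorithm family only; it is not the accelerated GPU implementation presented later in this thesis, and the helper names are hypothetical:

```cpp
#include <vector>

// Union-find over pixel indices; path halving keeps find() cheap.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        for (int i = 0; i < n; ++i) parent[i] = i;
    }
    int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Counts connected foreground components (4-connectivity) in a binary image.
// Each union depends on labels already propagated from earlier pixels, which
// is the sequential dependency discussed above.
int count_components(const std::vector<std::vector<int>>& img) {
    int h = (int)img.size(), w = (int)img[0].size();
    UnionFind uf(h * w);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if (!img[y][x]) continue;
            if (x > 0 && img[y][x - 1]) uf.unite(y * w + x, y * w + x - 1);
            if (y > 0 && img[y - 1][x]) uf.unite(y * w + x, (y - 1) * w + x);
        }
    int count = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (img[y][x] && uf.find(y * w + x) == y * w + x) ++count;
    return count;
}
```

A GPU formulation must replace this serial sweep with iterative, data-dependent propagation, which is exactly what makes CCL interesting for nested parallelism.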
LSS is an evolutionary image segmentation algorithm. Given an initial curve C, LSS expands C, or contracts C, based on the evolution of the function f. The expansion of the curve is an outward evolution, and the contraction of the curve is an inward evolution. Every evolution cycle depends on the previous cycle in terms of computing the curve. The dependencies between multiple pixels make LSS a great candidate for characterizing nested parallelism and concurrent kernel execution together.
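The cycle-to-cycle dependency can be made concrete with a one-dimensional sketch of a single level-set evolution step. This is a simplified textbook-style update, not the LSS implementation evaluated in this thesis; the speed F, time step dt, and helper name are illustrative assumptions:

```cpp
#include <cmath>
#include <vector>

// One explicit evolution cycle of a 1D level-set function phi:
//   phi_{t+1}[i] = phi_t[i] - dt * F * |d(phi)/dx|,
// where F > 0 drives an outward evolution of the region {phi < 0} and
// F < 0 drives an inward evolution. Each cycle reads the result of the
// previous one, which is the dependency discussed above.
std::vector<double> evolve(const std::vector<double>& phi, double F, double dt) {
    std::vector<double> next(phi.size());
    for (std::size_t i = 0; i < phi.size(); ++i) {
        std::size_t l = (i == 0) ? i : i - 1;               // clamp at borders
        std::size_t r = (i + 1 == phi.size()) ? i : i + 1;
        double grad = (phi[r] - phi[l]) / double(r - l);    // central difference
        next[i] = phi[i] - dt * F * std::fabs(grad);
    }
    return next;
}
```

Because cycle t+1 cannot start before cycle t finishes, only the work inside one cycle parallelizes trivially; extracting further concurrency between cycles is where nested parallelism and concurrent kernels come in.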
To analyze how best to accelerate recursion, we have explored graph theoretic algorithms, including breadth-first search (BFS) and Prim's algorithm. In addition, we evaluated selected Lonestar [18] and NUPAR [19] benchmarks in this thesis. In summary, we have used two real applications and four different benchmark applications as we developed the characterization schemes in this thesis. Next, we outline the contributions and describe the organization of the remainder of this thesis.
1.4 Contributions of the Thesis
In this thesis, a number of key contributions towards the deep analysis and exploitation
of advanced parallel features are presented. The key contributions are summarized below:
• We characterize parallel applications, identifying when we can leverage the nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism. To the best of our knowledge, ours is the first work to define, implement, and evaluate these three components in combination: i) control flow workload analysis, ii) child kernel launching, and iii) child kernel synchronization.
• We develop NVIDIA SASS instrumentation handlers to characterize data-dependent application behavior. We use the NVIDIA assembly code SASS Instrumentor (SASSI) to evaluate dynamic application behavior. We provide a handler to profile and measure binary execution for control-flow-dependent loops. Our handler can collect and measure the control-dependent loop efficiency.
• We characterize recursive parallel workloads, identifying when we can leverage the nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism in the case of parallel recursion. We evaluate three components: i) the degree of thread-level parallelism, ii) the work efficiency, and iii) the overhead of kernel launches. Furthermore, we propose a new approach to increase thread-level parallelism in order to increase work efficiency and reduce the number of recursive kernel launches.
• We characterize the execution of concurrent kernels on NVIDIA GPUs (Kepler and Maxwell families). Our characterization captures a kernel's launch configuration, its resource consumption, and the degree of overlapped execution. Our proposed metrics help us to better understand when to use concurrent kernel execution.
• We propose, implement, and evaluate kernels with persistent threads as a mechanism to control resource contention for concurrent kernel execution on GPUs. Our results show that kernels with persistent threads can be useful for identifying peak resource contention. Unfortunately, this does not directly lead to an overall performance improvement.
• Our proposed workload metrics for irregular applications and parallel recursive kernels have been applied to a number of CUDA kernels taken from the problem domains of image processing, linear algebra, and graph theory. For these performance-hungry applications, we achieve 1.3x to more than 100x speedup, as compared to flat GPU kernels.
• We compare state-of-the-art image segmentation applications, including connected component labeling and level set segmentation, exploring both nested parallelism and concurrent kernel execution. Our accelerated connected component labeling has been presented at the International Conference on Computer Vision and Graphics (ICCVG) [15]. In addition, it has also been presented at the GPU Technology Conference (GTC) [20]. Our work on fast level set segmentation exploiting advanced parallel features has been presented at the Irregular Applications: Architectures and Algorithms Workshop (IA3) [21] and featured as a poster at the Programming and Tuning Massively Parallel Systems Summer School (PUMPS). Furthermore, our accelerated connected component labeling has been ported to OpenCL. We have analyzed the benefits of advanced parallel features on AMD cards, and this work has been presented at the 3rd International Workshop on OpenCL (IWOCL) [22]. Both of these real-world applications are part of the NUPAR benchmark suite presented at the International Conference on Performance Engineering (ICPE) [19].
1.5 Organization of Thesis
The central focus of this work is to characterize nested parallelism and concurrent kernel
execution in a systematic way that works well for any GPU and any application. The
remainder of the thesis is organized as follows: Chapter 2 presents background information
on GPU architecture, specifically the NVIDIA GPU architecture, the parallel framework
CUDA, and the NVIDIA SASSI instrumentation framework. In Chapter 3, we present
related work in the area of characterization of parallel kernels, nested parallelism, and
concurrent kernel execution in GPU devices. In Chapter 4, we discuss the characterization
of nested parallelism for conditional nested loop, parallel recursion, and concurrent kernel
execution in NVIDIA Kepler and Maxwell architectures. Next, in Chapter 5 we present our
benchmark kernels that are used throughout this thesis to leverage advanced parallel features.
In Chapter 6, we present real applications that leverage our framework to effectively exploit
advanced parallel features. In Chapter 7, we conclude the thesis and summarize our work.
We also suggest directions for future work.
Chapter 2
Background
As we enter the era of GPU computing, demanding applications with substantial parallelism can leverage the massive parallelism of GPUs to achieve superior performance and efficiency. Today, GPU computing enables applications that were previously thought to be infeasible because of long execution times. Enjoying the benefits of Moore's Law [1, 23, 24], NVIDIA GPUs have evolved steadily since 2001. Table 2.1 shows the evolution of NVIDIA graphics cards since the first programmable GPU was released.
Date Product Transistors CUDA cores
2001 GeForce 3 60 million -
2002 GeForce FX 125 million -
2004 GeForce 6800 222 million -
2006 GeForce 8800 681 million 128 (First support for CUDA Programming)
2007 Tesla T8, C870 681 million 128
2008 GeForce GTX 280 1.4 billion 240
2008 Tesla T10, S1070 1.4 billion 240
2009 Fermi 3.0 billion 512
2012 GK104 Kepler 3.5 billion 1536
2012 GK110 Kepler 7.0 billion 2688
2014 GM204 Maxwell 5.2 billion 2816
Table 2.1: NVIDIA GPU technology evolution [25]
With the rapid evolution of the GPU from a configurable graphics processor to a general purpose programmable parallel processor, the ubiquity of GPUs in every PC, laptop, desktop, and smartphone was imminent. A large community of researchers and developers has adopted the CUDA programming framework for a diverse range of applications [26, 27, 28].
The CUDA runtime on an NVIDIA GPU enables us to execute programs developed in high-level languages, including C, C++, Fortran, OpenCL, DirectCompute, and others [26, 25, 2]. CUDA tries to preserve elements of common sequential programming and extend them to parallel thread execution. CUDA presents a Single Instruction Multiple Thread (SIMT) abstraction with a straightforward set of configurations for expressing parallelism.
2.1 CUDA Model
The CUDA model acts as a bridge between an application and its implementation on the available hardware [29]. There are a number of different layers that lie between the application and the hardware. Figure 2.1 shows the different layers of abstraction between a software implementation and the hardware level. The programming model provides a logical view of the underlying computing architecture.
[Figure: software application → CUDA runtime → CUDA driver → GPU hardware.]
Figure 2.1: Layers of abstraction between software application and GPU hardware.
CUDA enables the developer to write parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores. CUDA divides execution hierarchically, using parallel abstractions such as kernels, blocks, and threads per block (see Figure 2.2). A kernel executes a sequential program on a set of parallel threads. Each thread has its own registers and private local memory. Each block allows communication among its threads through shared memory. Blocks communicate between themselves using global memory. This memory hierarchy is illustrated in Figure 2.3.
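The kernel/grid/block hierarchy determines each thread's position in the computation. The following host-side helpers mirror the standard 1D index arithmetic a CUDA thread performs (blockIdx.x * blockDim.x + threadIdx.x); the function names are illustrative, not a CUDA API:

```cpp
#include <cstddef>

// Global index of a thread, given its block index, the block size, and its
// index within the block; mirrors blockIdx.x * blockDim.x + threadIdx.x.
inline std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Number of blocks needed so that every one of n data elements gets a
// thread: the ceiling of n / block_dim.
inline std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}
```

For instance, thread 5 of block 2 with 256-thread blocks covers element 517, and 1000 elements require 4 blocks of 256 threads (the last block is partially idle, a detail that matters for the launch-configuration discussion in Chapter 1).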
Figure 2.2: The CUDA model: a kernel, grid, and threads per block.
Other memories in the CUDA memory model include:
• texture memory, specialized for 2D read-only coalesced accesses;
• constant memory, designed to support read-only accesses from threads across different blocks.
As mentioned in Chapter 1, the CUDA model follows a SIMT architecture to manage and execute threads in groups of 32 called warps. Even though all threads in a warp must execute the same instructions, and the GPU is a SIMD architecture, there are some key features that differentiate a GPU from a traditional SIMD machine:
Figure 2.3: The CUDA memory hierarchy.
• Each thread in the warp has its own instruction address counter.
• Each thread has its own register state.
• Each thread can have an independent execution path.
Although the CUDA model enables each thread in a warp to display different execution behavior, divergent behavior degrades performance, since the divergent paths within a warp are executed serially. Control flow instructions (e.g., if-then-else, for, while) are among the fundamental constructs in CUDA programming that cause this undesired behavior, called warp divergence.
2.1.1 Divergence
The use of control flow instructions is unavoidable in any application. Modern CPUs include complex hardware to perform branch prediction [30, 31]. Hardware branch predictors
speculate the direction of conditional control flow in programs [32, 33, 34]. If the predictor
is correct, branch execution incurs little or no performance penalty. If the prediction is
not correct, the CPU stalls for a number of cycles as the instruction pipeline is flushed,
and instruction fetching resumes at the correct program counter. In comparison, GPUs are
high-throughput, but lack complex branch prediction mechanisms [35, 36, 37]. Execution
on an NVIDIA GPU using the CUDA execution model assumes that all threads in a warp
must execute identical instructions on the same cycle. Executing complex control flow
typically results in divergent execution between the threads in the same warp [38].
Recent GPUs are designed to better handle control flow. The modern GPU hardware
supports condition codes (CC) and CC registers that contain the 4-bit state vector (sign,
carry, zero, overflow) used in integer comparisons [39]. The CC registers can direct the flow
of execution via predication or divergence. Predication allows (or suppresses) the execution
of instructions on a per-thread basis within a warp, while divergence supports conditional
execution of longer instruction sequences.
Due to the additional overhead of managing divergence and convergence, the compiler uses predication for short instruction sequences. The effect of most instructions can be predicated on a condition; if the condition is not true, the instruction is suppressed. Predication works well for small fragments of conditional code, especially for if statements with no corresponding else. For larger conditional code segments, predication becomes inefficient because every instruction is executed, regardless of whether it will affect the computation.
When the length of the conditional code fragment is long and the cost of predication would
exceed the benefits, the compiler will generate conditional branches. If the threads in a warp
diverge due to a data-dependent conditional branch, the warp serially executes each branch
path taken, disabling threads that are not on that path. Once all paths complete, all threads
re-converge to the original execution path. Figure 2.4 illustrates how warp divergence is
handled on a GPU.
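The trade-off between divergence and predication can be captured in a toy cost model. The cycle counts and function names below are illustrative assumptions, not measured hardware behavior: under divergence, the warp pays for each path that at least one thread takes; under predication, both paths are always issued.

```cpp
#include <vector>

// Toy issue-slot cost for a warp executing `if (c) A; else B;`, where the
// two paths take len_if and len_else instruction slots. Under divergence the
// warp serially executes each path that at least one of its threads takes.
int divergence_cost(const std::vector<bool>& taken, int len_if, int len_else) {
    bool any_if = false, any_else = false;
    for (bool t : taken) {
        if (t) any_if = true;
        else   any_else = true;
    }
    return (any_if ? len_if : 0) + (any_else ? len_else : 0);
}

// Under predication every instruction of both paths is issued regardless of
// the per-thread condition.
int predication_cost(int len_if, int len_else) {
    return len_if + len_else;
}
```

A uniform warp (all threads take the same path) pays only for that path under divergence, while predication always pays for both; for short paths the fixed predication cost is still cheaper than the divergence-management overhead, which is why the compiler predicates short fragments.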
Although warp divergence can have a negative impact on application throughput, this
impact varies dramatically across GPU architectures. In the following sections we address
divergent execution for the latest GPU generations, from the Fermi GPU architecture to the
Pascal GPU architecture.
Figure 2.4: Branch divergence in the GPU.
2.2 GPU Computing Architecture
The Streaming Multiprocessor (SM) is the centerpiece of the NVIDIA GPU architecture. A thread block is scheduled on a single SM, and once it is scheduled on the SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. Registers and shared memory are scarce resources in the SM, and these resources have to be partitioned among all threads resident on an SM. Each SM contains hundreds of CUDA cores, and each GPU device contains tens of SMs.
Logically, all threads in a block run in parallel, but not all threads can execute physically at the same time. Therefore, different blocks may make progress at different rates. Since warps are the atomic unit of execution on the GPU, many warps can be scheduled on an SM, but depending on the SM resource availability, not all scheduled warps will be active. If a warp is idle, the SM schedules another warp from any block that is resident on the same SM. The benefit of this switching between concurrent warps is that it incurs essentially no overhead. Given the importance of determining the right warp granularity, we would like to quickly find the best grid configuration for any application.
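The partitioning of scarce SM resources can be made concrete with a small residency calculation. This is a hedged sketch, not a vendor API: the limits below are the Kepler GK110 per-SM values from Table 2.2, and `resident_blocks` is a hypothetical helper that takes the tightest of the register, shared-memory, thread, and block limits.

```cpp
#include <algorithm>
#include <cstddef>

// Per-SM hardware limits (illustrative; Kepler GK110 values from Table 2.2).
struct SmLimits {
    std::size_t registers;     // 32-bit registers per SM
    std::size_t shared_bytes;  // shared memory per SM
    std::size_t max_threads;   // resident threads per SM
    std::size_t max_blocks;    // resident blocks per SM
};

// How many thread blocks can be resident on one SM, given the per-block
// demands of a kernel. The tightest limit wins, which is why the launch
// configuration variables (threads/block, registers, shared memory)
// determine occupancy and, ultimately, resource contention.
std::size_t resident_blocks(const SmLimits& sm,
                            std::size_t threads_per_block,
                            std::size_t regs_per_thread,
                            std::size_t shared_per_block) {
    std::size_t by_regs = sm.registers / (threads_per_block * regs_per_thread);
    std::size_t by_shared = shared_per_block
                                ? sm.shared_bytes / shared_per_block
                                : sm.max_blocks;
    std::size_t by_threads = sm.max_threads / threads_per_block;
    return std::min({by_regs, by_shared, by_threads, sm.max_blocks});
}
```

For example, a kernel using 256 threads per block, 32 registers per thread, and 4 KB of shared memory per block is register- and thread-limited to 8 resident blocks on such an SM; raising any one demand can shrink residency and change how kernels contend when run concurrently.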
2.2.1 Fermi Architecture
The NVIDIA Fermi (chip GF110) GPU was released in 2009. Fermi introduced an increased number of CUDA cores per SM, a larger and configurable shared memory, and Error Correcting Codes (ECC) on main memory and caches. Each SM in Fermi has 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). Fermi has a 64-KByte register file, an instruction cache, two multi-thread warp schedulers, and two instruction dispatch units [40].
The SIMT instructions control the execution of an individual thread, including arithmetic,
memory access, and branch/control flow instructions. Fermi extends SIMT to control flow
with support for indirect branches and function-call instructions. With the improvements
introduced in the Fermi Parallel Thread Execution (PTX) 2.0 Instruction Set Architecture
(ISA), individual thread control flow can predicate instructions.
2.2.2 Kepler Architecture
A number of new features were introduced in Kepler as compared to the earlier Fermi GPU architecture. Table 2.2 compares some of these features for Fermi (an instance of chip GF110) and Kepler (an instance of chip GK110).
Kepler GK110 comprises up to 15 Kepler SM (SMX) units. Each SMX has four warp schedulers and eight instruction dispatch units, so it can issue and execute four warps simultaneously. Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 load/store units, and 32 special function units, which can compute operations such as sine, cosine, reciprocal, and square root per thread per clock [42]. Kepler GK110 can provide up to 4.29 TFLOPS single-precision and 1.43 TFLOPS double-precision floating point performance [43].
In addition to an increase in the number of CUDA cores per SM and a dramatic increase
in the number of registers per thread, Kepler (compute capability 3.5 or higher) introduced a
number of new features to further simplify parallel program design.
Fermi (chip GF110) Kepler (chip GK110)
SPs per SM 32 192
Threads per SM 1536 2048
Thread blocks per SM 8 16
Warp schedulers per SM 2 4
Dispatch Units per SM 2 8
Shared Memory/L1 cache 16/48KB 16/32/48KB
32-bit Registers per SM 32K 64K
Registers per thread 63 255
Table 2.2: Fermi chip GF110 versus Kepler chip GK110 [41]
2.2.2.1 Dynamic Parallelism
Dynamic parallelism is an extension to the CUDA programming model that enables CUDA kernels to create, and synchronize with, new kernels entirely on the GPU. With this feature, any kernel can launch a child kernel and manage inter-kernel dependencies [35].
To manage the execution of dynamic parallelism, the CUDA model added a new component known as the Grid Management Unit (GMU) [44, 42, 45], which is able to dispatch new grids, as well as pause their dispatch. The GMU can also queue pending grids and suspend running grids. A grid includes all thread-blocks associated with a kernel. Grids are launched in the order in which they are received.
In previous GPU generations, the host launched work through the Compute Work Distributor (CWD) unit [42, 2], which tracked issued blocks and sent them to the SMs for execution. In Kepler and more recent GPU generations, the GPU launches work from the host or the device through the GMU. The GMU communicates with the CWD over a bidirectional link to prioritize or suspend/pause grids. The GMU also has a direct connection to the SMs to support dynamic parallelism, and through this connection device kernels can dispatch child grids.
The aim of the GMU is to effectively manage grid dispatching, in such a way that, if we need to free up resources for child kernels to execute, the GMU will suspend parent kernel grids [42, 45]. The device runtime will reschedule the grids on different SMs in order to better manage resources. Figure 2.5 illustrates the GMU's interaction with the CWD and the SMs.
Figure 2.5: Work flow of the Grid Management Unit to dispatch, pause, and hold pending
and suspended grids.
Dynamic parallelism enables work to be created directly on the GPU. This can remove the need to transfer execution control and data between the host and the device. Child kernel launch decisions are made at runtime by threads executing on the device. The CUDA model controls the synchronization and communication between a parent kernel and its child kernels. The local memory and registers associated with a parent thread remain accessible only by the parent thread; they are not accessible by other threads or any child threads. Communication with a child thread occurs only through global memory.
Using dynamic parallelism, data-dependent parallel work can be generated inline within
a kernel at runtime. These kernels take advantage of the GPU’s hardware scheduler and load
balancer to dynamically adapt execution to make data-driven decisions. Figure 2.6 shows
how dynamic parallelism works on a GPU.
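The data-driven launch decision can be mimicked on the host as a CPU analogue. The sketch below is not CUDA launch syntax (a real device-side launch uses childKernel<<<grid, block>>>(...)); the function names, threshold, and tiling are illustrative assumptions. The point it models is that the parent inspects its data at runtime, conditionally spawns child work, and communicates with the child only through a shared buffer, mirroring the global-memory-only rule above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// "Child kernel": refines one subrange of the data (here, doubling it).
void child_kernel(const std::vector<int>& in, std::vector<int>& out,
                  std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo; i < hi; ++i) out[i] = in[i] * 2;
}

// "Parent kernel": scans the input in tiles and, wherever a tile contains a
// value above the threshold, launches a child over that tile. The decision
// is data-dependent and made while the parent is running, which is what
// dynamic parallelism makes possible on the device.
void parent_kernel(const std::vector<int>& in, std::vector<int>& out,
                   int threshold, std::size_t tile) {
    for (std::size_t lo = 0; lo < in.size(); lo += tile) {
        std::size_t hi = std::min(lo + tile, in.size());
        bool needs_child = false;
        for (std::size_t i = lo; i < hi; ++i)
            if (in[i] > threshold) { needs_child = true; break; }
        if (needs_child) child_kernel(in, out, lo, hi);  // device-side launch
    }
}
```

On a GPU, the recursion and the parent/child synchronization points are exactly where the state-saving and launch overheads characterized in this thesis arise.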
[Figure 2.6: A __global__ parent kernel that conditionally launches a child kernel from the device, e.g., if (condition) childKernel<<<...>>>(...);]
2.2.2.2 Hyper-Q

Figure 2.7: The Hyper-Q feature: multiple streams mapped to independent hardware work queues.
Hyper-Q increases the total number of work queues between the host and the device by allowing 32 simultaneous hardware-managed connections (as compared to the single connection available with Fermi). Figure 2.7 illustrates the Hyper-Q feature in Kepler.
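The effect of the extra queues can be shown with a toy scheduling model. This is a deliberate simplification under stated assumptions: kernels from different streams are independent, and with one queue per stream there are enough SM resources for them to fully overlap; real overlap is bounded by resource contention, as later chapters show.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Toy makespan model. Each kernel is a (stream id, duration) pair. With a
// single hardware queue (Fermi-like), independent kernels from different
// streams still issue back-to-back, so the makespan is the sum of all
// durations. With one hardware queue per stream (Hyper-Q-like, up to 32),
// kernels from different streams overlap, so the makespan is the longest
// per-stream total.
double makespan(const std::vector<std::pair<int, double>>& kernels,
                bool one_queue_per_stream) {
    if (!one_queue_per_stream) {
        double total = 0.0;
        for (const auto& k : kernels) total += k.second;
        return total;
    }
    std::vector<double> per_stream;
    for (const auto& k : kernels) {
        if ((std::size_t)k.first >= per_stream.size())
            per_stream.resize(k.first + 1, 0.0);
        per_stream[k.first] += k.second;
    }
    return *std::max_element(per_stream.begin(), per_stream.end());
}
```

Three independent kernels of durations 1, 2, and 1 in three streams take 4 time units through a single queue but only 2 with per-stream queues, which is the throughput opportunity concurrent kernel execution exposes.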
2.2.3 Maxwell Architecture
NVIDIA’s Maxwell generation provides only few enhancements to the previous GPU
generation, with a focus on energy efficiency. In addition to providing new features that
include dynamic parallelism and concurrent kernel execution, the Maxwell generation
delivers 2x the performance per watt as compared to the Kepler generation [46].
The Maxwell GTX 980 Ti (chip GM200) comprises 22 Maxwell SMs (SMM). Each
SMM has 128 CUDA cores, four warp schedulers, eight instruction dispatch units, and
eight texture units. Overall, the Maxwell SM looks very similar to a Kepler SM, except that
Maxwell provides fewer CUDA cores per SM.
Another major change, as compared to the Kepler architecture, is in the memory hierarchy. Shared memory and the L1 cache are no longer combined: shared memory is dedicated, and the L1 cache is combined with the texture cache. The Maxwell GTX 980 Ti ships with up to 96 KB in its shared memory unit, and 48 KB for the combined L1/texture cache.
2.2.4 Pascal Architecture
NVIDIA introduced the Pascal architecture in 2016. The NVIDIA GTX 1080, which includes a Pascal GP104, comprises 7.2 billion transistors and 2560 single-precision CUDA cores. GDDR5X memory is introduced with the GP104, providing a 256-bit memory interface and delivering 43% higher memory bandwidth than NVIDIA's prior GeForce GTX 980 GPU.
The GP104 GPU consists of four Graphics Processing Clusters (GPCs), 20 Pascal SMs, and eight memory controllers. Each GPC has a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, four warp schedulers, eight instruction dispatch units, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units [47]. A comparative feature analysis of four NVIDIA GPU generations is presented in Table 2.3.
generations is presented in Table 2.3.
GPU GTX 590 GTX Titan GTX 980 Ti GTX 1080
Family Fermi Kepler Maxwell Pascal
Chip GF110 GK110 GM200 GP104
Compute Capability 2.0 3.5 5.2 6.1
SM 16 14 22 20
CUDA cores 32 192 128 128
Total cores 512 2688 2816 2560
Global Mem. 1474 MB 6083 MB 6083 MB 8113 MB
Shared Mem. 48 KB 48 KB 48 KB 48 KB
Threads/SM 1536 2048 2048 2048
Threads/block 1024 1024 1024 1024
Clock rate 1.26 GHz 0.88 GHz 1.29 GHz 1.84 GHz
TFLOPS 1.5 4.29 6.50 9.00
Table 2.3: A comparison of the features available on the four generations of NVIDIA GPUs
considered in this thesis [25].
Chapter 3
Related work
In this chapter, we review related work in the areas of GPU characterization, with special
emphasis on modern GPU features. We focus our literature review on advanced parallel
features for multiple levels of concurrency, and different grains of parallelism.
3.1 Characterization of GPUs
There have been several studies focusing on GPU characterization to better understand the improvements made during the evolution of these devices. This evolution started with GPUs as rendering tools, and spans to today, where GPUs act as advanced general purpose accelerators [48, 49, 50, 51].
An early characterization study by Jia et al. [48] in 2012 focused on characterizing cache memories on GPUs. Starting with the NVIDIA Fermi and the AMD Fusion, GPU vendors have included demand-fetching in their data caches. Earlier GPU generations were focused on graphics rendering, providing local memories instead of demand-fetched caches. With the introduction of demand-fetched caches, new challenges arrived: 1) understanding the benefits of cache memories, and 2) a lack of intuition for developers on how to use them efficiently. Jia et al. addressed these two problems and provided a mechanism to efficiently utilize cache memories.
Wong et al. [49] presented a characterization of Tesla GPUs through the execution of a set of microbenchmarks. Their analysis provided insights into the characteristics of the GPUs beyond the information provided by NVIDIA. Another attempt to characterize the internals of a GPU was presented by Torres et al. [50]. In their study, they focused on the impact of CUDA tuning techniques on the Fermi architecture. Jiao et al. [51] presented a characterization study of GPUs to evaluate power efficiency and the correlation between application performance and power consumption.
A large body of work studies how to leverage GPUs effectively by understanding their characteristics, for both older [52, 53] and modern [54, 55, 56] generations of GPUs. While Kerr et al. [52] focused on understanding the behavior of PTX 1.4, Lee et al. [53] developed an exhaustive performance analysis to capture performance gaps between an NVIDIA GTX280 Tesla architecture and an Intel Core i7-960. In this thesis, we focus our attention primarily on the characterization of more modern GPUs.
3.1.1 Modern GPUs
Kayiran et al. [54] explored the impact of memory accesses during concurrent thread execution and the resulting application performance. They provided a thorough evaluation of 31 applications, ranging from the CUDA SDK to Map-Reduce problems, to understand resource contention in caches, networks, and memory. Furthermore, they proposed a dynamic Cooperative Thread Array (CTA) scheduling mechanism, which regulates thread-level parallelism by allocating an optimal number of CTAs per application.
Mei et al. [55] provided a microbenchmark suite to dissect the device memory hierarchy and characterize the organization of the cache systems of different GPUs on the Fermi, Kepler, and Maxwell architectures. Ukidave et al. [19] provided a set of application benchmarks to analyze the latest features of modern GPUs, such as nested parallelism, concurrent kernel execution, atomic operations, and shuffling.
In the next section, we review characterization of multiple levels of concurrency and
thread granularity on modern GPUs.
3.2 Multiple Levels of Concurrency
3.2.1 Nested Parallelism
One of the earliest characterizations of nested parallelism was presented by DiMarco et al. [57] in 2013. They aimed to quantify the performance gains of the dynamic parallelism introduced by CUDA 5 and the Kepler architecture. Their exploration covered two applications: K-means and hierarchical clustering. Their results showed that finer granularity of TLP provides a more efficient way to leverage nested parallelism than just avoiding CPU-GPU synchronization.
In 2014, Wang et al. [58] presented an evaluation of the impact of nested parallelism in unstructured GPU applications for the Kepler architecture. Irregular applications suffer from workload imbalance, which provides a good target for optimization using fine-grained threads contained in coarse-grained blocks. Their characterization focused on control flow and memory access measurements. Two metrics were proposed in their study: i) warp execution efficiency, and ii) load/store replay overhead. Although they provided a thorough analysis of nested parallelism for control flow instructions and memory accesses, they did not take into consideration the synchronization cost between parent and child kernels when evaluating the benefits of nested parallelism. Furthermore, they did not take into consideration a finer-grained classification of control flow divergence and its impact on application performance.
In 2015, Wang et al. [59] continued their work on characterizing nested parallelism in GPUs. They proposed Dynamic Thread Block Launch (DTBL), a new execution model to support irregular applications on GPUs. DTBL allows coalesced allocation of child kernels and parent kernels.
Yang et al. [60] analyzed a set of optimized parallel benchmark applications that contain loops. Their analysis covered the degree of TLP, and they proposed a framework called CUDA-NP to exploit nested parallelism in CUDA. CUDA-NP is a pragma-based compiler approach that generates GPU kernels with nested parallelism. Basically, their approach reads OpenMP-like pragma directives in the input kernels and creates the respective child kernels with a grid configuration based on the parallel-loop TLP degree. However, they did not analyze the implications of parent-child synchronization. Furthermore, they relied on the developer's knowledge to identify potential parallel loops that can exploit nested parallelism, without providing any insight about the behavior of the architectures.
Further studies [61, 62, 63] characterized nested parallelism based on the irregularity of an application. Applications containing parallel loops and recursive calls are suitable for leveraging nested parallelism. Zhang et al. [61] adapted two irregular, data-driven problems (breadth-first search and single-source shortest path) to leverage nested parallelism. Li et al. [62] proposed parallelization templates to leverage nested parallelism for tree and graph problems. These types of problems present irregular nested loops and parallel recursive computation. Wang et al. [63] provided insights on leveraging nested parallelism for general irregular applications. However, none of these approaches provided a holistic analysis of the implications of leveraging nested parallelism and its effects across different architecture/compiler versions.
3.2.2 Concurrent Kernel Execution
In early GPU architectures, concurrent kernel execution was poorly supported. In 2011, Wang et al. [64] proposed a mechanism to exploit concurrent kernel execution through manual context funnelling. They compared CUDA 4's automatic context funnelling against their approach on Fermi architectures. They showed that manual control of shared resources might provide slight improvements in application performance. However, they did not discuss resource contention based on the interplay between concurrent kernels.
In 2012, Wende et al. [65] provided a kernel reordering mechanism to exploit concurrent
kernel execution on Fermi architectures. Their execution model partitions kernels into
small-scale computations and, using producer-consumer principles, manages GPU kernel
invocations after reordering them. Later, in 2014, Wende et al. [66] continued their work
on exploiting concurrent kernel execution, and proposed a characterization of the NVIDIA
Hyper-Q feature on the Kepler architecture, using an offloading mechanism to run multiple
kernels simultaneously. Their analysis used synthetic benchmarks to develop a performance
evaluation, complementing their previous work on kernel reordering.
Gregg et al. [67] proposed a kernel scheduling mechanism called KernelMerge that
allows two OpenCL kernels to run concurrently on AMD cards. KernelMerge takes kernel
configuration into consideration and investigates the interaction between concurrent
kernels to analyze interference when sharing resources.
Since the Kepler architecture, NVIDIA has provided a modern hardware design that ade-
quately supports concurrent kernel execution. In 2014, Jog et al. [68] took the next logical
step and proposed an application-aware memory system for fair and efficient execution of
concurrent applications. Their approach takes memory awareness into consideration by
providing a new scheduling mechanism that serves memory requests in a round-robin fash-
ion. They considered four metrics based on the Instructions Per Cycle of each application.
However, they did not consider resource contention on registers, nor the grid configu-
ration. Furthermore, they focused on memory-bound applications, and did not discuss
arithmetic-bound applications.
In 2016, Luley et al. [69] proposed a framework to exploit NVIDIA's Hyper-Q. Their
framework oversubscribes kernels and defragments memory transfers to effectively overlap
accesses with computation. Furthermore, they proposed multiple mechanisms to reorder
kernels with the aim of improving application throughput. Although they studied the
impact of memory transfers, they did not analyze resource contention between concurrent
kernels, which can be a key bottleneck when attempting to leverage concurrent kernel
execution.
Chapter 4
Characterization of advanced parallel
features
Acceleration of high performance applications that exhibit complex and irregular execu-
tion behavior is an ever-growing open problem. A naive port of an irregular application to a
parallel platform often leads to underutilization of hardware resources, significantly limiting
performance. In this chapter, we present a characterization of advanced parallel features on
a GPU that can be effectively exploited to tune any application with a high degree of irregularity.
4.1 Nested Parallelism
Irregularity in an application can result in poor workload balance when attempting
to exploit fine-grained thread-level parallelism. We next consider examples of high-level
language behavior that can suffer from a lack of inherent thread-level parallelism.
A number of irregular applications contain control-flow dependent nested loops. This
kind of irregularity can inhibit thread-level parallelism, since independence can only be de-
duced at runtime. Because many loops tend to be data dependent, GPU hardware vendors
introduced support for nested parallelism, leveraging nested TLP through the addition of a
new level of parallelism. We have studied a number of irregular applications to identify how
frequently control-flow dependent nested loops are used. Table 4.1 shows characterization
data from two different GPU benchmark suites, where control flow dependent nested loops
occur.
Application                      Benchmark Suite   Control Flow Dependent Nested Loops
Barnes Hut                       Lonestar [18]     6
Delaunay Mesh Refinement         Lonestar [18]     7
Points-to Analysis               Lonestar [18]     31
Survey Propagation               Lonestar [18]     7
Single-Source Shortest Paths     Lonestar [18]     2
Connected Component Labeling     NUPAR [19]        1
Level Set Segmentation           NUPAR [19]        1

Table 4.1: Irregular applications from two different GPU benchmark suites which exhibit
control flow dependent nested loops.
We have also explored recursive algorithm patterns that can benefit from nested par-
allelism. Parallel recursion is a way to efficiently execute recursive algorithms that can
spawn multiple threads per recursive call. Before the introduction of nested parallelism on
the GPU, recursive solutions required combined GPU and CPU intervention, or an imple-
mentation of the GPU kernel devoid of recursive kernel calls. However, constant
communication between the CPU and the GPU produces memory copies and results in
communication overhead. In addition, most recursive solutions are data dependent, so it is
challenging to anticipate the amount of overhead that will be introduced. On the other
hand, we cannot always use a single GPU kernel call version for all recursive algorithms.
Table 4.2 shows a list of recursive kernels that can be expressed as parallel recursion.
Application            Benchmark Suite   Control Flow Dependent Nested Loops   Recursive Calls
Breadth-First Search   Lonestar          0                                     1
Prim's Algorithm       -                 1                                     1

Table 4.2: Recursive applications which exhibit parallel recursion.
__global__ void singleKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (A[idx * cols] == 1)
    {
        for (int i = 0; i < cols; i++)
            C[idx * cols + i] = A[idx * cols + i] + B[idx * cols + i];
    }
}

Program Listing 4.1: Micro-benchmark kernel with irregular nested loop execution.
With nested parallelism, a recursive solution can be naturally ported to the GPU and can
avoid CPU-GPU communication overhead. Nonetheless, recursive spawning of threads
does not always produce enough TLP to exploit the GPU, and it can lead to substantial
kernel launch overhead and hardware underutilization.
Exploiting nested parallelism, whether in the presence of control-flow dependent nested
loops or parallel recursion, is not straightforward. Nested loops can include control flow
divergence, and a recursive solution can lead to poor TLP and low warp efficiency. At the
same time, nested synchronization can result in a large number of thread stalls and global
communication between parent and child kernels. Next, we explore each of these factors
and present metrics to quantify their impact on kernel performance.
4.1.1 Control Flow Instructions
Mapping parallel programs exhibiting arbitrary control flow onto parallel units can be a
difficult task. There is generally no guarantee that parallel units will execute the same control
flow path. For instance, Program 4.1 presents a micro-benchmark kernel that executes a
loop based on input parameter data. Figure 4.1 illustrates the dynamic execution of the
micro-benchmark for two architectures: a Kepler GTX Titan and a Maxwell GTX Titan Ti.
Both execution examples are run with the same input parameters, the same NVIDIA
driver, and the same CUDA version. However, the number of instructions executed varies
along the control flow path.
Figure 4.1: Control flow graphs of Program 4.1 for Kepler GTX Titan and Maxwell GTX
Titan Ti.
CUDA binary tools such as nvdisasm [45] and cuobjdump [45] have been widely
used to produce control flow graphs (CFGs). However, nvdisasm and cuobjdump gather
kernel behavior statically, and do not allow dynamic analysis of an application's irregular-
ity. On the other hand, the SASS Instrumentation tool (SASSI) [70] allows the dynamic
collection of metrics at execution time. Moreover, SASSI is able to retrieve developer-
specified metrics about the control flow instructions executed at runtime. SASSI, together
with nvprof [45], allows us to collect the following runtime metrics:
1. instExec: Number of instructions executed. Reported by nvprof.
2. warpDivEff : Ratio of the average active threads per warp and the maximum number
of threads per warp supported on a multiprocessor, expressed as percentage. Reported
by nvprof.
3. cfExecuted: Number of executed control-flow instructions. Reported by nvprof.
4. cfDependentNestedLoop: Number of instructions executed inside a control-flow
dependent loop. Reported by our handler, injected using SASSI.
These metrics are intrinsically related to the execution of kernel control flow and capture
the efficiency of the warp execution. We evaluate the percentage of instructions executed
inside loops to identify potential hotspots and opportunities to exploit nested TLP. We
compute the ratio of instructions executed inside loop bodies, as a fraction of all instructions
executed, in order to compute the impact of instructions inside these common control flow
structures.
loopInstExec = cfDependentNestedLoop / instExec    (4.1)
We also consider the amount of idle resources due to warp divergence. warpDivEff
allows us to compute the reciprocal metric, which measures the threads that sit idle until a
loop execution ends:

warpDivIdle = 1 - (warpDivEff / 100)

Next, we propose loop warp efficiency, which takes into account the ratio of instructions
executed during loop execution (i.e., loopInstExec). The product warpDivIdle * loopInstExec
gives the fraction of loop warp threads that are idle (loopWarpThreadsIdle). To obtain an
efficiency metric, we compute its reciprocal by subtracting it from 1 and multiplying by
100 to express it as a percentage:

loopWarpEff = (1 - loopWarpThreadsIdle) * 100    (4.2)
Our proposed metrics are specifically designed to measure the workload imbalance generated
by irregular applications. These applications have data-dependent workloads and unpredictable
control flow behavior that cause severe workload imbalance and, eventually, poor GPU
utilization.
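Equations 4.1 and 4.2 can be expressed directly in code. The sketch below assumes the counters have already been gathered with nvprof and SASSI as described above; the function names are illustrative, not part of any tool's API.

```c
/* Equation 4.1: fraction of all executed instructions that fall inside
 * control-flow dependent loops. */
double loop_inst_exec(unsigned long cfDependentNestedLoop,
                      unsigned long instExec)
{
    return (double)cfDependentNestedLoop / (double)instExec;
}

/* Equation 4.2: loop warp efficiency, as a percentage. warpDivEff is
 * nvprof's warp execution efficiency, also in percent. */
double loop_warp_eff(double warpDivEff, double loopInstExec)
{
    double warpDivIdle = 1.0 - warpDivEff / 100.0;            /* idle fraction   */
    double loopWarpThreadsIdle = warpDivIdle * loopInstExec;  /* idle inside loops */
    return (1.0 - loopWarpThreadsIdle) * 100.0;
}
```

For example, if half of all instructions execute inside loops (loopInstExec = 0.5) and warpDivEff is 80%, then loopWarpEff is 90%.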
4.1.2 Parallel Recursion
Recursion is a method of making self-referential calls, commonly used to solve
problems by breaking them into smaller sub-problems using a divide-and-conquer
strategy. For instance, Program 4.2 illustrates a simple recursive program which implements
the Fibonacci sequence [71]. In a recursive solution, the problem is broken into a base case
int fib(int n) {
    if (n == 0 || n == 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

Program Listing 4.2: Recursive Fibonacci sequence in C.
__global__ void fib_kernel_par_rec(int n, unsigned long int *vFib) {
    if (n == 0 || n == 1)
        return;

    fib_kernel_par_rec<<<1, 1>>>(n - 2, vFib);
    fib_kernel_par_rec<<<1, 1>>>(n - 1, vFib);
    cudaDeviceSynchronize();

    vFib[n] = vFib[n - 1] + vFib[n - 2];
}

Program Listing 4.3: Fibonacci parallel recursive scheme in CUDA.
1. TLPDegree: Degree of thread-level parallelism per recursive call. Threads are
grouped into units called CUDA blocks, also known as Cooperative Thread Arrays
(CTAs) [72]. TLPDegree is the number of threads synchronized across the CTA.

2. workEfficiency: Ratio of the number of operations executed that contribute to
solving the problem, divided by the total operations executed on the GPU. The
goal of this metric is to provide a measure of the number of non-redundant (vs.
redundant plus non-redundant) operations executed per GPU kernel. For instance,
a work efficiency of 100% indicates that no redundant operations are executed.

3. depthKernelRecursion: Number of nested kernel calls.
Our proposed metrics are specifically designed to measure the efficiency of parallel recursive
applications. These applications have data-dependent workloads, nested kernel calls, and
irregular parallel recursion, which lead to unbalanced workload execution, low work
efficiency, and eventually poor GPU utilization.
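To make workEfficiency concrete for the Fibonacci scheme of Program 4.3, the sketch below counts each recursive call as one operation (a simplifying assumption of ours) and treats the n + 1 distinct Fibonacci values as the non-redundant work.

```c
/* Total recursive calls made by the naive scheme of Program 4.3:
 * calls(n) = calls(n-1) + calls(n-2) + 1, with calls(0) = calls(1) = 1. */
unsigned long naive_calls(int n)
{
    if (n <= 1)
        return 1;
    return naive_calls(n - 1) + naive_calls(n - 2) + 1;
}

/* workEfficiency as a percentage: only n + 1 distinct Fibonacci values
 * contribute to the solution; every other call recomputes one of them. */
double work_efficiency(int n)
{
    return 100.0 * (double)(n + 1) / (double)naive_calls(n);
}
```

For fib(5), the scheme performs 15 calls to produce 6 distinct values, a work efficiency of 40%; the efficiency drops rapidly as n grows, which is exactly the redundancy this metric is designed to expose.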
4.1.3 Child Kernel Launching and Synchronization
Nested parallelism in CUDA allows explicit synchronization with child kernels by call-
ing the cudaDeviceSynchronize Application Programming Interface (API). When used,
the parent thread block will wait until the child threads finish their execution. cudaDeviceSynchronize
is expensive, and should be used sparingly. However, for many irregular applications the parent
thread will require results of the child threads to continue execution. We characterize the
overhead of device synchronization by measuring its impact on the overall performance.
Once a potential nested parallelism hotspot was identified by our metrics, we imple-
mented a nested parallelism kernel, and compared it to the non-nested parallelism kernel, as
well as a sequential implementation of the kernel, in order to characterize the overhead of
child kernel launching.
Figure 4.2 shows the execution time of three different implementations (i.e., sequential,
non-nested parallelism, and nested parallelism) of our micro-benchmark kernel. The micro-
benchmark computes the addition of two matrices if the value of the first element in a row
matches the condition in the first control flow instruction. We defined a set of experiments,
varying the input sizes in terms of rows and columns. In addition, we controlled the level of
divergence, starting at 12.5% and increasing it up to 75%. We argue that higher divergence
leads to better exploitation of nested parallelism, although the degree of divergence is data
dependent.
We expected that small input sets would lead to poor performance on a GPU due to low
utilization of the high TLP available. However, nested parallelism starts to outperform
non-nested parallelism as the degree of TLP increases, especially in the presence of a high
degree of divergence.
In order to characterize the behavior of nested parallelism across different GPU archi-
tectures, we used two Kepler and two Maxwell GPUs, running with the same input
sets, the same NVIDIA driver, and the same CUDA version. Figure 4.3 shows the execution
time for different input sets, with data values generating a 75% degree of divergence,
across the four different GPUs.
Although the Kepler GT 730 has the same number of CUDA cores per SM as the Kepler
GTX Titan, it has fewer SMs: the GTX Titan has 15 SMs, while the GT 730 has only 2.
The number of SMs has a high impact on our ability to exploit nested parallelism. For
instance, a launched child kernel will have to allocate its blocks on the remaining available
SMs on the device. If the device does not have enough free SMs, then the benefits of nested
parallelism will not be realized.
We present Equation 4.3 to characterize kernel overhead across different architectures,
Figure 4.2: Execution time of Sequential, non-nested parallelism and nested parallelism
kernels on GTX Titan - Kepler architecture (lower is better).
based on SM usage. NumberThreadsPerBlock and NumberBlocks are application
specific, and MaxThreadsPerSM is architecture specific. If SMUsage surpasses the
number of SMs available on the GPU, it will prevent us from effectively leveraging nested
parallelism. We have also found benefit in using persistent threads to control
SMUsage.
SMUsage = (NumberThreadsPerBlock * NumberBlocks) / MaxThreadsPerSM    (4.3)
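Equation 4.3 can be sketched as follows. The can_leverage_nested helper is our own illustrative check, not part of the CUDA API; the figures in the usage note (2048 maximum resident threads per SM on Kepler, 15 SMs on the GTX Titan) correspond to the architectures discussed above.

```c
/* Equation 4.3: number of SMs a grid occupies, assuming each SM can
 * hold up to maxThreadsPerSM resident threads. */
double sm_usage(int threadsPerBlock, int numBlocks, int maxThreadsPerSM)
{
    return (double)threadsPerBlock * (double)numBlocks / (double)maxThreadsPerSM;
}

/* A child grid can only make progress if the parent grid leaves enough
 * SMs free on the device. */
int can_leverage_nested(double parentUsage, double childUsage, int numSMs)
{
    return parentUsage + childUsage <= (double)numSMs;
}
```

For example, a parent grid of 120 blocks of 256 threads yields sm_usage(256, 120, 2048) = 15.0, fully occupying a 15-SM GTX Titan and leaving no room for child kernels, while a parent occupying 7 SMs still leaves room for a child occupying 4.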
In our analysis, we characterized cudaDeviceSynchronize API calls for kernel
synchronization using CUDA counters, such as clocks and the frequency rate. We verified
that synchronization can negatively impact application performance when an application
launches a small number of threads per block and a reduced number of blocks per kernel (i.e.
Figure 4.3: Execution time of non-nested parallelism and nested parallelism across four
GPUs (two Kepler and two Maxwell GPUs).
poor TLP). However, we found that kernel synchronization overhead can be hidden by
increasing the TLP and loopWarpEff.
4.1.4 Memory Overhead
When using nested parallelism, global memory on the GPU is the only channel of com-
munication between the parent and child kernels, and it may also be used by the device run-
time for child kernel launches. The device runtime keeps track of kernel launches by cre-
ating a pool for all launches. Kernels that are not able to launch due to a lack of available
resources remain in the pool of pending kernels. The size of this pool is referred to as th