  • Characterization and exploitation of nested parallelism

    and concurrent kernel execution to accelerate high

    performance applications

    A Dissertation Presented

    by

    Fanny Nina Paravecino

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in

    Computer Engineering

    Northeastern University

    Boston, Massachusetts

    March 2017

  • NORTHEASTERN UNIVERSITY Graduate School of Engineering

    Dissertation Signature Page

    Dissertation Title:

    Characterization and exploitation of nested parallelism and concurrent

    kernel execution to accelerate high performance applications

    Author: Fanny Nina Paravecino NUID: 001160686

    Department: Electrical and Computer Engineering

    Approved for Dissertation Requirements of the Doctor of Philosophy Degree

    Dissertation Advisor

    Dr. David Kaeli Signature Date

    Dissertation Committee Member

    Dr. Qianqian Fang Signature Date

    Dissertation Committee Member

    Dr. Ningfang Mi Signature Date

    Dissertation Committee Member

    Dr. Norm Rubin Signature Date

    Department Chair

    Dr. Miriam Leeser Signature Date

    Associate Dean of Graduate School:

    Dr. Sara Wadia-Fascetti Signature Date

  • To the science and the pursuit of answers through research.


  • Contents

    List of Figures vi

    List of Tables viii

    List of Programs x

    List of Acronyms xi

    Acknowledgments xiii

    Abstract of the Dissertation xiv

    1 Introduction 1

    1.1 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.1.1 Advanced Parallel Features . . . . . . . . . . . . . . . . . . . . . . 3

    1.2 Characterization of Advanced Parallel Features . . . . . . . . . . . . . . . 4

    1.3 Challenges in Exploiting Parallel Execution Features . . . . . . . . . . . . 5

    1.3.1 Nested Parallelism Challenges . . . . . . . . . . . . . . . . . . . . 6

    1.3.2 Concurrent Kernel Execution Challenges . . . . . . . . . . . . . . 7

    1.3.3 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    1.5 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


    2 Background 12

    2.1 CUDA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.1.1 Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.2 GPU Computing Architecture . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.2.1 Fermi Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.2.2 Kepler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.2.3 Maxwell Architecture . . . . . . . . . . . . . . . . . . . . . . . . 22

    2.2.4 Pascal Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    3 Related work 24

    3.1 Characterization of GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    3.1.1 Modern GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.2 Multiple Levels of Concurrency . . . . . . . . . . . . . . . . . . . . . . . 26

    3.2.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    3.2.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . 27

    4 Characterization of advanced parallel features 29

    4.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.1.1 Control Flow Instructions . . . . . . . . . . . . . . . . . . . . . . 31

    4.1.2 Parallel Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    4.1.3 Child Kernel Launching and Synchronization . . . . . . . . . . . . 35

    4.1.4 Memory Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . . . . . 39

    4.2.1 Resource Contention . . . . . . . . . . . . . . . . . . . . . . . . . 39

    5 Exploitation of advanced parallel features 44

    5.1 Dependent Nested Loop Workloads . . . . . . . . . . . . . . . . . . . . . 45

    5.1.1 Selective Matrix Addition . . . . . . . . . . . . . . . . . . . . . . 45

    5.2 Parallel Recursive Workloads . . . . . . . . . . . . . . . . . . . . . . . . . 46


    5.2.1 Breadth-First Search Algorithm . . . . . . . . . . . . . . . . . . . 47

    5.2.2 Prim algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    6 Validation with real-world applications 61

    6.1 Connected Component Labeling . . . . . . . . . . . . . . . . . . . . . . . 61

    6.2 Level-Set Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    6.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    6.2.2 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    6.2.3 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    6.3 Summary of Analysis for Real-world Applications . . . . . . . . . . . . . 67

    7 Summary 70

    7.1 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    Bibliography 73


  • List of Figures

    2.1 Layers of abstraction between software application and GPU hardware. . . 13

    2.2 The CUDA model: a kernel, grid, and threads per block. . . . . . . . . . . 14

    2.3 The CUDA memory hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.4 Branch divergence in the GPU . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.5 Work flow of the Grid Management Unit to dispatch, pause, and hold

    pending and suspended grids. . . . . . . . . . . . . . . . . . . . . . . . . . 20

    2.6 Dynamic Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.7 Hyper-Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    4.1 Control flow graphs of Program 4.1 for Kepler GTX Titan and Maxwell

    GTX Titan Ti. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    4.2 Execution time of Sequential, non-nested parallelism and nested parallelism

    kernels on GTX Titan - Kepler architecture (lower is better). . . . . . . . . 37

    4.3 Execution time of non-nested parallelism and nested parallelism across four

    GPUs (2 Kepler and 2 Maxwell GPUs). . . . . . . . . . . . . . . . . . . . 38

    4.4 Execution time of sequential execution of kernels versus concurrent kernel

    execution for two different GPUs while varying input size (lower is better). 40

    4.5 Execution time of sequential execution of kernels versus concurrent kernel

    execution for two different GPUs with persistent threads execution (lower is

    better). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


    4.6 Resource utilization for non-persistent thread kernels using different input

    data sets for the Maxwell GTX Titan X (lower is better). . . . . . . . . . . 43

    4.7 Resource utilization for persistent thread kernels using different input data

    sets for Maxwell GTX Titan X (lower is better). . . . . . . . . . . . . . . . 43

    5.1 Speedup evaluation of nested parallelism implementation compared to non-

    nested parallelism implementation for Selective Matrix Add for the Kepler

    GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    5.2 Graph representation using an adjacency list. . . . . . . . . . . . . . . . . 48

    5.3 BFS operations while traversing a graph with six vertices, starting at source

    vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    5.4 BFS speedup analysis of naive nested parallelism and optimized nested

    parallelism versus non-nested parallelism implementation on Kepler GTX

    Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.5 MST Tree of graph G = (V,E), where V = {0, 1, 2, 3, 4, 5}, starting at

    source vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.6 Prim algorithm step by step work flow. Given a graph G = (V,E) with

    an initial source vertex 0; find Minimum Spanning Tree (MST) using Prim

    algorithm, where iteration 0 is described as the initialization of a MST tree

    with a source vertex 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    6.1 Speedup comparison of the nested parallelism and non-nested parallelism

    implementations, running CCL on a Kepler GTX Titan. . . . . . . . . . . . 63

    6.2 Speedup comparison of the nested parallelism and non-nested parallelism

    implementations, running Level-Set segmentation on a Kepler GTX Titan. . 68


  • List of Tables

    2.1 NVIDIA GPU technology evolution [25] . . . . . . . . . . . . . . . . . . . 12

    2.2 Fermi chip GF110 versus Kepler chip GK110 [41] . . . . . . . . . . . . . 19

    2.3 A comparison of the features available on the four generations of NVIDIA

    GPUs considered in this thesis [25]. . . . . . . . . . . . . . . . . . . . . . 23

    4.1 Irregular Applications from two different GPU benchmarks which exhibit

    control flow dependent nested loops. . . . . . . . . . . . . . . . . . . . . . 30

    4.2 Recursive applications which exhibit parallel recursion. . . . . . . . . . . . 30

    5.1 Irregular and recursive applications, with potential for exploiting advanced

    parallel features in modern GPUs. . . . . . . . . . . . . . . . . . . . . . . 44

    5.2 Dynamic metrics for non-nested parallelism Selective Matrix Addition for

    different input sets on Kepler GTX Titan. . . . . . . . . . . . . . . . . . . 45

    5.3 Execution time of selective Matrix Add with different input sets for non-

    nested parallelism and nested parallelism implementations on Kepler GTX

    Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    5.4 Runtime execution analysis of Breadth-First Search with different input sets

    from the DIMACS Challenge Ninth [87] and Tenth [88] for naive nested

    parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 54


    5.5 Runtime execution analysis of Breadth-First Search with different input sets

    from DIMACS Challenge Ninth [87] and Tenth [88] for optimized nested

    parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 54

    5.6 Runtime execution analysis of Prim’s algorithm with different input sets

    from the DIMACS Challenge Ninth [87] and Tenth [88] for non-nested

    parallelism implementation on Kepler GTX Titan. . . . . . . . . . . . . . . 56

    5.7 Runtime execution analysis of Prim algorithm with different input sets from

    the DIMACS Challenge Ninth [87] and Tenth [88] for optimized nested

    parallelism implementation on the Kepler GTX Titan. . . . . . . . . . . . . 60

    6.1 Dynamic metrics for non-nested parallelism of CCL for different input sets

    on a Kepler GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    6.2 Dynamic metrics for nested parallelism of CCL for different input sets on a

    Kepler GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    6.3 Execution time of selective Matrix Add with different input sets for non-

    nested parallelism and nested parallelism implementations on the Kepler

    GTX Titan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


  • List of Programs

    4.1 Micro-benchmark Kernel with irregular nested loop execution. . . . . . . . 31

    4.2 Fibonacci recursive scheme. . . . . . . . . . . . . . . . . . . . . . . . . . 34

    4.3 Fibonacci parallel recursive scheme in CUDA. . . . . . . . . . . . . . . . 35

    4.4 Micro-benchmark Kernel with irregular nested loop execution. . . . . . . . 41

    5.1 Graph input using DIMACS challenge structure for file storing. . . . . . . 49

    5.2 Breadth-first Search (BFS) recursive implementation on a CPU, graph is

    a global variable which contains vertex array and edge array. . . . . . . . . 50

    5.3 BFS non-recursive implementation on GPU. . . . . . . . . . . . . . . . . 52

    5.4 BFS optimized nested parallelism implementation on GPU. . . . . . . . . 53

    5.5 Non-nested parallelism implementation of Prim’s algorithm on a GPU. . . . 57

    5.6 Optimized nested parallelism implementation of Prim’s algorithm on a GPU. 58


  • List of Acronyms

    GPGPU General-Purpose computing on Graphics Processing Units. The use of graphics processing units (GPUs) to perform computation in applications traditionally handled by the central processing unit (CPU).

    GPU Graphics Processing Unit. The graphics processor in the system.

    CCL Connected Component Labeling. An image segmentation algorithm that connects points according to a similarity function.

    LSS Level Set Segmentation.

    SC Spectral Clustering.

    SIMD Single-Instruction Multiple Data.

    SIMT Single Instruction Multiple Thread.

    API Application Programming Interface.


    ISA Instruction Set Architecture.

    TLP Thread-level Parallelism.

    PTX Parallel Thread Execution.


    MPI Message Passing Interface.

    CUDA NVIDIA’s Compute Unified Device Architecture Framework.

    OpenCL Open Computing Language.

    SM Streaming Multiprocessor.

    ECC Error Correcting Codes.

    CTA Cooperative Thread Arrays.

    PT Persistent Threads.

    PDE Partial Differential Equation.

    PDEs Partial Differential Equations.

    BFS Breadth-first Search.

    MST Minimum Spanning Tree.


  • Acknowledgments

    It would not have been possible to write this doctoral thesis without the help and support

    of the kind people around me, to only some of whom it is possible to give particular mention

    here.

    First of all, I would like to thank my parents Fani, and Dante for their endless support

    through every single step of this journey. I thank my brother Reykjavil, and sister Lisbeth

    for keeping me on the path and making me believe that everything is possible. I thank my

    boyfriend Jose, for his unlimited love and throughout support, for which my mere expression

    of thanks likewise does not suffice.

    This thesis would not have been possible without the help, support and patience of

    my colleagues and collaborators. A special thanks to all my colleagues at NUCAR group.

    Specially, to Leiming, Fritz, Julian and Xiangyu for their contributions towards the concepts,

    ideas, and for keeping company on the doctoral journey. I would also like to thank our

    collaborators Dr. Qianqian Fang, Dr. Norm Rubin (NVIDIA) and Dr. Ningfang Mi for their

    constructive feedback on this dissertation.

    It is with my deepest gratitude and warmest affection that I dedicate this thesis to my advisor

    Dr. David Kaeli who has been a constant source of knowledge and inspiration.


  • Abstract of the Dissertation

    Characterization and exploitation of nested parallelism and

    concurrent kernel execution to accelerate high performance

    applications

    by

    Fanny Nina Paravecino

    Doctor of Philosophy in Computer Engineering

    Northeastern University, March 2017

    Dr. David Kaeli, Adviser

    Over the past decade, GPU computing has evolved from being a simple task of mapping

    data-parallel kernels to Single Instruction Multiple Thread (SIMT) hardware, to a more

    complex challenge, mapping multiple complex, and potentially irregular, kernels to more

    powerful and sophisticated many-core engines. Further, recent advances in GPU architec-

    tures, including support for advanced features such as nested parallelism and concurrent

    kernel execution, further complicate the mapping task.

    Improving application performance is a central concern for software developers. To

    start with, the programmer needs to be able to identify where opportunities for optimization

    reside. Many times the right optimization is tied to the underlying nature of the application

    and the specific algorithms used. The task of tuning kernels to exploit hardware features can

    become an endless manual process. There is a growing need to develop characterization


    techniques that can help the programmer identify opportunities to exploit new hardware

    features, and to port a broader range of applications to GPUs efficiently.

    In this thesis, we present novel approaches to characterize application behavior that

    can exploit nested parallelism and concurrent kernel execution introduced on recent GPU

    architectures. To identify bottlenecks that can be improved through the exploitation of

    nested parallelism and concurrent kernel execution, we proposed a set of metrics for a range

    of GPU kernels.

    For nested parallelism, our approach focuses on irregular and recursive kernel applica-

    tions. For irregular applications we define, implement, and evaluate three main runtime

    components: i) control flow workload analysis, ii) child kernel launching, and iii) child

    kernel synchronization. For recursive kernel applications, we define, implement, and eval-

    uate: i) degree of thread-level parallelism, ii) work efficiency, and iii) overhead of kernel

    launches. For concurrent kernel execution, our characterization captures a kernel’s launch

    configuration, the resource consumption, and the degree of overlapped execution. Our pro-

    posed metrics help us to better understand when to exploit nested parallelism and concurrent

    kernel execution.

    We demonstrate the utility of our framework of metrics by focusing on a diverse set

    of workloads that include both irregular and recursive program behavior. This suite of

    workloads includes: i) a set of microbenchmarks that specifically target the set of new

    GPU features discussed in this thesis, ii) the NUPAR suite, iii) the Lonestar suite and iv)

    real-world applications. By using our framework, we are able to speed up applications by 5x to more than 23x as compared to GPU implementations that do not exploit these advanced parallel features.


  • Chapter 1

    Introduction

    In 1965 Gordon Moore proposed Moore’s Law, which states that the number of tran-

    sistors on a microprocessor doubles roughly every 18 months [1]. Since 1965, Moore’s

    Law has been shown to be remarkably accurate, and microprocessors have doubled their

    capabilities every one to two years. However, the translation of increased transistor density

    into improved application performance remains a challenging endeavour. There is no silver

    bullet that automatically optimizes software, programming frameworks, and algorithms so

    that they can benefit from advances in hardware.

    In many areas, performance improvements have been possible only due to modifications

    in algorithms, providing substantial performance gains that are much higher than those

    enabled by increasing processor speed alone. There are still many challenges that need

    to be addressed through the discovery of new parallel algorithms, specifically designed to

    take advantage of the potential power of parallel hardware, while avoiding some of the

    bottlenecks that can occur on these platforms.

    In this thesis, we will explore different mechanisms to understand the behavior of the

    parallel code (i.e., kernels) at different stages of the computing stack, including multiple

    compilation levels, as well as runtime execution. This work will define a characterization

    process of parallel execution that will guide and inform the programmer on how best to

    exploit new parallel features. Equipped with this knowledge, the programmer can then

    exploit parallelism at different grains of concurrency. We test our characterization process

    on a broad set of parallel applications, demonstrating the utility of this knowledge to tune


    applications to effectively exploit two recently introduced parallelization features: 1) nested

    parallelism and 2) concurrent kernel execution. We will also present a tuning mechanism to

    further improve application throughput.

    1.1 Parallel Programming

    Parallel programming provides a myriad of advantages over sequential programming,

    such as increased application throughput, improved utilization of hardware resources, and

    enhanced concurrent execution [2]. Given the wide range of parallel computing hardware

    platforms available to-date, spanning massively parallel supercomputers to multicore smart-

    phones, parallel execution has become the most effective path to improve performance. The

    need for high performance has been amplified due to the rate at which raw data is being

    generated today and is rapidly growing for the foreseeable future.

    Commonly, the easiest way to write parallel code is using a framework such as OpenMP.

    OpenMP is a simple, directive-based interface that offers incremental parallelization, which

    allows loops in serial code to be executed concurrently without changing their structure [3]. However, using OpenMP does not solve the problem of load imbalance, and the resulting performance gain is limited by Amdahl's law [4], which states that the achievable speedup is bounded by the portion of the code that must remain serial.
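    For reference, Amdahl's law bounds the overall speedup on N processors as

        S(N) = 1 / ((1 - p) + p / N),

    where p is the fraction of the execution time that can be parallelized; even with an unlimited number of processors, the speedup cannot exceed 1 / (1 - p).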

    When working with a distributed system, Message Passing Interface (MPI) [5] provides

    an effective programming model for expressing parallelization. MPI is commonly used on

    distributed memory systems that leverage message passing. However, one notable trend we

    are witnessing in the field of parallel scientific computing is the dramatic increase in the

    number of applications that utilize GPUs. Based on Flynn’s widely used taxonomy [6, 3], the

    large number of cores on the Graphics Processing Unit (GPU) enables us to launch thousands

    of compute threads to execute in Single-Instruction Multiple Data (SIMD) fashion. SIMD

    provides parallelism by operating on multiple data streams concurrently [3]. Applications

    for GPUs are commonly developed using programming frameworks such as Khronos's Open Computing Language (OpenCL) [7, 8, 9] and NVIDIA's Compute Unified Device Architecture Framework (CUDA) [10]. Both OpenCL and CUDA are based on the high-level


    programming constructs of the C and C++ languages. The data parallel and computationally

    intensive portions of an application are offloaded to the GPU for accelerated execution.

    These programming frameworks offer a rich set of runtime APIs, and allow the developer to

    write optimized kernels for execution on GPUs.

    Researchers and developers have enthusiastically adopted the CUDA programming

    model and GPU computing for a diverse range of applications [11, 12, 13, 14]. Given the

    varying degrees of parallelism present in many applications, we are motivated to explore

    advanced parallel features on the GPU.

    1.1.1 Advanced Parallel Features

    Recent advances in GPU architectures have pushed a number of computational barriers,

    enabling researchers to leverage parallel computing to improve application throughput.

    Graphics hardware has substantially evolved over the years to include more functionality

    and programmability. NVIDIA’s previous generation of GPUs, the Fermi family, has been

    used in a number of applications, promising peak single-precision floating-point performance of up to 1.5 TFLOPS. However, NVIDIA's Kepler GK110 GPU offers more than 4.29 TFLOPS

    of single-precision computing capability. The newest features provided on Kepler enable

    programmers to move a wider range of applications to the CUDA framework.

    Given the new hardware features provided on recent hardware, exploiting these features

    to improve overall execution throughput has become paramount. Thread-level parallelism

    provides impressive speedups for applications ported to the GPU. Moreover, the addition of

    nested parallelism improves the throughput of conditional-loop execution, which requires working at a finer thread granularity. Another new feature is concurrent kernel execution, which improves the utilization and runtime of multiple kernels, removing the overhead due to context switching. There is also a performance advantage provided by performing

    back-to-back kernel launches. In the CUDA API, kernel invocations are asynchronous. If a

    developer can call a kernel (or kernels) multiple times without any intervening synchroniza-

    tion (i.e., memory transfers or dependency checking), then the multiple kernel calls will be

    batched in the CUDA driver, and the application can overlap kernel execution on the GPU.
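    As a minimal sketch of this idea (the kernel, names, and sizes below are illustrative and are not taken from the thesis), two independent kernels launched into separate CUDA streams, with no intervening synchronization, may be overlapped on hardware that supports concurrent kernel execution:

        #include <cuda_runtime.h>

        __global__ void scale(float *data, float factor, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= factor;                  // independent, data-parallel work
        }

        void launch_concurrently(float *d_a, float *d_b, int n) {
            cudaStream_t s0, s1;
            cudaStreamCreate(&s0);
            cudaStreamCreate(&s1);

            dim3 block(256);
            dim3 grid((n + block.x - 1) / block.x);

            // Kernel launches are asynchronous: with no synchronization between
            // them, the driver can batch the calls and the GPU may overlap them.
            scale<<<grid, block, 0, s0>>>(d_a, 2.0f, n);
            scale<<<grid, block, 0, s1>>>(d_b, 0.5f, n);

            cudaStreamSynchronize(s0);
            cudaStreamSynchronize(s1);
            cudaStreamDestroy(s0);
            cudaStreamDestroy(s1);
        }

    Whether the two grids actually overlap depends on the resources each kernel requests, which is exactly the resource-contention question examined later in this thesis.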

    Given the level of sophistication provided in the modern GPUs, we have focused our


    work on the characterization of advanced parallel features in order to guide the improvement

    of application throughput. We consider optimization of applications for two new features

    available on NVIDIA Kepler GPUs and more recent GPU generations:

    • Nested Parallelism: modern GPUs add the capability to launch child kernels within a parent kernel. A pattern commonly found in many sequential algorithms is the nested loop. Nested parallelism allows us to implement a nested loop with variable amounts

    of parallelism.

    • Concurrent Kernel Execution: modern GPUs provide the ability to run multiple kernels, assigned to different streams, concurrently. The Kepler, Maxwell and Pascal

    architectures support up to 32 concurrent streams (as compared to 16 on the Fermi).

    Each stream is assigned to a different hardware queue.

    1.2 Characterization of Advanced Parallel Features

    The utilization of high performance computing resources has also been hampered by the

    relative dearth of system software and of tools for monitoring and optimizing performance.

    Profilers have evolved to provide application execution insights to the developer in order to

    improve application throughput. However, profilers are tightly tied to specific hardware and

    do not support the latest advanced parallel features, which makes tuning the applications

    targeting modern GPUs a challenge.

    New approaches to profiling/instrumentation are needed to understand application in-

    teraction with the latest hardware features. Binary instrumentation can be used on a GPU

    for performance debugging, correctness checks, workload characterization, and runtime

    optimization. Such techniques typically involve inserting code at the instruction level of an

    application during back-end compilation. Binary translation is able to gather data-dependent

    application behavior.

    Given the presence of data-dependent behavior of an application, we can characterize

    different execution patterns. Our focus is to characterize dynamically available parallelism

    with the aim to evaluate implementations designed to exploit the execution patterns using ad-

    vanced parallel features such as nested parallelism. Our characterization approach evaluates


    the potential for optimization by analysing the impact on control, memory and synchro-

    nization behavior on a GPU. As an illustrative example, our study targets a comprehensive

    understanding of the overhead of current nested parallelism supported on GPUs in terms of

    kernel launch, control flow, nested synchronization and algorithm overhead.

    We also consider another form of parallelism available on modern GPUs: concurrent

    kernel execution. Just as a typical CPU application can consist of multiple functions, it

    is also common to have multiple GPU kernels present in a single GPU application. A

    GPU kernel is a function executed on a GPU device. Managing efficient concurrent kernel

    execution using independent thread blocks is cumbersome at best. In particular, this thesis

    targets a detailed understanding of the run-time costs of concurrent kernel execution in terms

    of kernel launch configuration, resource contention, and overlapped computation.

    1.3 Challenges in Exploiting Parallel Execution Features

    The software implementation of a GPU application can dramatically influence the

    application’s performance. For example, performance will suffer if kernels are stalled due

    to control dependence. Delays also occur when data dependencies are encountered. GPU

    stream processors are more difficult to utilize effectively if the targeted applications present

    dynamic and frequent data dependencies (commonly present in sorting, recursion, dynamic

    programming and evolutionary programming).

    Along with the challenges of dynamic and global dependencies, many applications

    involve the execution of multiple kernels. The current generation of NVIDIA GPUs already

    supports concurrent execution of kernels using Hyper-Q technology, allowing concurrent

    execution of kernels from the same application or different applications. In this thesis,

    we characterize concurrent kernel execution, and explore how to improve resource utilization

    and minimize kernel launch overhead. Presently, it is difficult to modify an application to

    effectively leverage nested parallelism and concurrent kernel execution. Addressing this gap

    is the major focus of this thesis.


    1.3.1 Nested Parallelism Challenges

    Depending on the application characteristics and the parallelization strategy, a kernel

    can exhibit a range of dynamic behaviors. The dynamic behavior is highly correlated to data-

    dependent parallel execution. Data dependencies are found in parallel loops and recursive

    calls. Parallel loops and recursive calls are forms of nested Thread-level Parallelism (TLP).

    Nested TLP can present a range of control flow behaviors. Explicit control flow con-

    structs such as if-then-else or for-loop are fundamental constructs in any high-

    level programming language. In kernels with complex control flow, SIMD threads can

    follow different paths of execution, causing thread divergence. Thread divergence would

    seem to cause a paradox, since all threads in a basic group (e.g., a warp) must execute

    the same instruction on each cycle. If the threads in a warp diverge, the warp serially

    executes each branch path, disabling threads that do not take that path. Warp divergence can

    dramatically degrade application performance.

    Understanding control flow effects is a key step towards characterization of nested

    parallelism. We have faced the following challenges when trying to exploit dynamic

    parallelism:

    • For control flow analysis, it is important to quantify the impact of thread divergence by categorizing divergent and convergent paths in order to understand how performance is

    impacted. Control flow divergence effects can severely impact our ability to leverage

    nested parallelism. On-the-fly analysis of control flow workload provides a better

    understanding for data-dependent applications. In previous work, control flow analysis

    has been performed statically.

    • To properly characterize child kernel launches, we need to understand kernel launch parameters and device runtime management. There presently are no tools or profilers that can properly analyze nested-kernel launch overhead.

    • Nested parallelism requires that parent kernels and child kernels explicitly synchronize with each other in order to assure consistent application execution. In order to perform

    child kernel synchronization, the device runtime has to save the state of parent kernels

    when they are suspended and yield to the child kernels at the synchronization points.


    To our knowledge, there are no tools available that can measure dynamic child kernel

    synchronization.

    1.3.2 Concurrent Kernel Execution Challenges

    Enabling multiple kernels to execute concurrently on GPUs leads to the physical shar-

    ing of compute resources. Concurrent kernel execution can increase overall application

    throughput and can also reduce energy consumption. In order to deliver performance im-

    provement there needs to be sufficient resources on the GPU to launch concurrent kernels.

    In other words, concurrent kernel execution provides performance improvement through

    overlapped kernel computation. In order to achieve overlapped kernel computation, we

    need to understand the sources of resource contention and the effects of the kernel launch

    configuration.

    Resource contention is heavily dependent on the application input. For example, a

    small input set might not stress the memory, whereas a large input set might. At the same

    time, resource contention is dependent on the amount of GPU hardware resources available.

    An application binary compiled and optimized for one GPU may perform poorly on another

    GPU due to resource contention.

    The kernel launch configuration can give us clues leading to resource contention. Each

    kernel is launched with a set of variables called the launch configuration variables. Com-

    monly, these variables include the number of threads per block, the number of thread-blocks

    per grid, the usage of shared memory, and the number of registers used. Most of the time,

    these variables are dictated by the number of data elements the kernel operates on. De-

    pending on the GPU architecture, the resource usage based on these variables can change

    dramatically. Having a better understanding of the resource contention is a key step towards

    the characterization of concurrent kernel execution. We face the following challenges when

    trying to exploit concurrent kernel execution:

    • To properly understand resource contention, we need to have better control of the resources utilized by the kernel. We can bring software threads closer to the actual

    hardware thread execution by implementing persistent threads. Persistent threads

    break the mapping of one software thread to one data element; instead, the mapping is


    dynamically defined by the availability of resources on the GPU. There is no general

    way to map any kernel to persistent threads; persistent threads will not always provide

    the best performance for every kernel. (A brief sketch of a persistent-threads kernel appears after this list.)

    • Resource contention varies dramatically across different GPU architectures, driver versions, and CUDA frameworks. Furthermore, the compiler and driver can have

    significant impact on kernel performance. To properly exploit concurrent kernel

    execution, we need to understand hardware, driver, compiler and CUDA framework

    interactions, which unfortunately are not disclosed by hardware vendors.
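    As a hedged sketch of the persistent-threads idea referred to above (a grid-stride formulation; the kernel, names, and block counts are illustrative and are not the implementation evaluated in this thesis), the launch is sized to the hardware rather than to the data, and each thread loops until the whole input is consumed:

        #include <cuda_runtime.h>

        __global__ void scale_persistent(float *data, float factor, int n) {
            // Grid-stride loop: the grid is sized to the GPU, not to n, so each
            // software thread processes as many elements as resources allow.
            for (int i = blockIdx.x * blockDim.x + threadIdx.x;
                 i < n;
                 i += blockDim.x * gridDim.x) {
                data[i] *= factor;
            }
        }

        void launch_persistent(float *d_data, int n, cudaStream_t stream) {
            int device = 0, numSMs = 0;
            cudaGetDevice(&device);
            cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

            // Launch configuration variables: blocks per grid, threads per block,
            // dynamic shared memory bytes, and the stream to submit into.
            dim3 block(256);
            dim3 grid(numSMs * 4);   // a few resident blocks per SM (architecture dependent)
            scale_persistent<<<grid, block, 0, stream>>>(d_data, 2.0f, n);
        }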

    1.3.3 Benchmark Suite

    Many applications, spanning academic research and industrial products, have been accelerated using parallel frameworks to achieve significant parallel speedups. Such applications

    encompass a variety of problem domains, including security surveillance, numerical linear

    algebra, graph theory problems, among others. Of these many applications, we select a set

    of representative real-world applications to focus our discussion.

    There has been considerable growth in interest in image segmentation for security surveillance. This interest has created an increased need for performant image segmen-

    tation kernels. Different approaches of image segmentation have used GPU computing in

    a wide variety of applications [15, 14, 16, 17]. Among the different image segmentation

    approaches, Connected Component Labeling (CCL), and Level Set Segmentation (LSS) are

    the most well-known applications.

    CCL is a widely used image segmentation algorithm. It connects neighboring pixels

    based on their similarities. The dependencies between the neighboring pixels and continuous

    propagation of connectivity between pixels makes CCL a highly sequential application. CCL

    is a great candidate for characterization of nested parallelism due to its dynamic propagation

    of connected components.

    LSS is an evolutionary image segmentation algorithm. Given an initial curve C, LSS

    expands C, or contracts C, based on the evolution of the function f . The expansion of

    the curve is an outward evolution, and the contraction of the curve is an inward evolution.

    Every evolution cycle depends on the previous cycle in terms of computing the curve. The


    dependencies between multiple pixels makes LSS a great candidate for characterization of

    nested parallelism and concurrent kernel execution together.

    To analyze how best to accelerate recursion, we have explored graph theoretic algorithms,

    including BFS and Prim algorithm. In addition, we evaluated selected Lonestar [18], and

    NUPAR benchmarks [19] in this thesis. In summary, we have used two real applications

    and four different benchmark applications as we developed characterization schemes in this

    thesis. Next, we outline the contributions and describe the organization of the remainder of

    this thesis.

    1.4 Contributions of the Thesis

    In this thesis, a number of key contributions towards the deep analysis and exploitation

    of advanced parallel features are presented. The key contributions are summarized below:

    • We characterize parallel applications, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three

    workload components that can guide the developer on how best to leverage nested

    parallelism. To the best of our knowledge, ours is the first work to define, implement, and evaluate these three components in combination: i) control flow workload analysis, ii)

    child kernel launching, and iii) child kernel synchronization.

    • We develop NVIDIA SASS instrumentation handlers to characterize data-dependent application behavior. We use an NVIDIA assembly code SASS Instrumentor (SASSI)

    to evaluate dynamic application behavior. We provide a handler to profile and measure

    binary execution for control flow dependent loops. Our handler can collect and

    measure the control dependent loop efficiency.

    • We characterize recursive parallel workloads, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define

    three workload components that can guide the developer on how best to leverage

    nested parallelism in the case of parallel recursion. We evaluate three components: i)

    the degree of thread-level parallelism, ii) the work efficiency, and iii) the overhead of


    kernel launches. Furthermore, we propose a new approach to increase thread-level

    parallelism in order to increase work efficiency and reduce the number of recursive

    kernel launches.

    • We characterize the execution of concurrent kernels on NVIDIA GPUs (Kepler and Maxwell families). Our characterization captures a kernel's launch configuration, the

    resource consumption, and the degree of overlapped execution. Our proposed metrics

    help us to better understand when to use concurrent kernel execution.

    • We propose, implement, and evaluate kernels with persistent threads as a mechanism to control resource contention for concurrent kernel execution on GPUs. Our results

    show that kernels with persistent threads can be beneficial to identify peak resource

    contention. Unfortunately, this does not directly lead to an overall performance

    improvement.

    • Our proposed workload metrics for irregular applications and parallel recursive kernels have been applied to a number of CUDA kernels taken from the problem domains of

    image processing, linear algebra, and graph theory problems. For these performance-

    hungry applications, we achieve 1.3x to more than 100x speedup, as compared to flat GPU kernels.

    • We compare state-of-the-art image segmentation applications, including connected component labeling, and level set segmentation, exploring both nested parallelism

    and concurrent kernel execution. Our accelerated connected component labeling has

    been presented at the International Conference on Computer Vision and Graphics

    (ICCVG) [15]. In addition, it has also been presented at the GPU Technology Conference (GTC) [20]. Our work on fast level set segmentation exploiting advanced parallel features has been presented at the Irregular Applications: Architectures and

    Algorithms Workshop (IA3) [21] and featured as a poster in Programming and Tuning

    Massively Parallel Systems Summer School (PUMPS). Furthermore, our accelerated

    connected component labeling has been ported to OpenCL. We have analyzed the

    benefits of advanced parallel features on AMD cards, and it has been presented at the


    3rd International Workshop on OpenCL (IWOCL) [22]. Both of these real-world ap-

    plications are part of the NUPAR benchmark presented at the International Conference

    on Performance Engineering (ICPE) [19].

    1.5 Organization of Thesis

    The central focus of this work is to characterize nested parallelism and concurrent kernel

    execution in a systematic way that works well for any GPU and any application. The

    remainder of the thesis is organized as follows: Chapter 2 presents background information

    on GPU architecture, specifically the NVIDIA GPU architecture, the parallel framework

    CUDA, and the NVIDIA SASSI instrumentation framework. In Chapter 3, we present

    related work in the area of characterization of parallel kernels, nested parallelism, and

    concurrent kernel execution in GPU devices. In Chapter 4, we discuss the characterization

    of nested parallelism for conditional nested loop, parallel recursion, and concurrent kernel

    execution in NVIDIA Kepler and Maxwell architectures. Next, in Chapter 5 we present our

    benchmark kernels that are used throughout this thesis to leverage advanced parallel features.

    In Chapter 6, we present real applications that leverage our framework to effectively exploit

    advanced parallel features. In Chapter 7, we conclude the thesis and summarize our work.

    We also suggest directions for future work.


  • Chapter 2

    Background

    As we enter the era of GPU computing, demanding applications with substantial par-

    allelism can leverage the massive parallelism of GPUs to achieve superior performance

    and efficiency. Today GPU computing enables applications that were previously thought

    to be infeasible because of long execution times. By enjoying the benefits of Moore’s

    Law [1, 23, 24], NVIDIA GPUs have evolved since 2001. Table 2.1 shows the evolution of

    NVIDIA graphic cards since the first programmable GPU was released.

    Date Product Transistors CUDA cores

    2001 GeForce 3 60 million -

    2002 GeForce FX 125 million -

    2004 GeForce 6800 222 million -

    2006 GeForce 8800 681 million 128 (First support for CUDA Programming)

    2007 Tesla T8, C870 681 million 128

    2008 GeForce GTX 280 1.4 billion 240

    2008 Tesla T10, S1070 1.4 billion 240

    2009 Fermi 3.0 billion 512

    2012 GK104 Kepler 3.5 billion 1536

    2012 GK110 Kepler 7.0 billion 2688

    2014 GM204 Maxwell 5.2 billion 2816

    Table 2.1: NVIDIA GPU technology evolution [25]


    With the rapid evolution of GPUs from a configurable graphics processor to a general

    purpose programmable parallel processor, the ubiquity of GPUs in every PC, laptop, desktop,

    and smartphone was imminent. A large community of researchers and developers have

    adopted the CUDA programming framework for a diverse range of applications [26, 27, 28].

    The CUDA runtime on an NVIDIA GPU enables us to execute programs developed

    in high-level languages, including C, C++, Fortran, OpenCL, DirectCompute, and oth-

    ers [26, 25, 2]. The nature of CUDA is to try to preserve elements of common sequential

    programming and extend them to a parallel thread execution. CUDA presents a Single

    Instruction Multiple Thread (SIMT) abstraction with a straightforward set of configurations

    for expressing parallelism.

    2.1 CUDA Model

    The CUDA model acts as a bridge between an application and its implementation

    on available hardware [29]. There are a number of different layers that lie between the

    application and the hardware. Figure 2.1 shows the different layers of abstraction between a

    software implementation and hardware level. The programming model provides a logical

    view of the specific computing architectures.


    Figure 2.1: Layers of abstraction between software application and GPU hardware.

    CUDA enables the developer to write parallel code that can run across tens of thousands

    of concurrent threads and hundreds of processor cores. CUDA decomposes execution


    hierarchically, using parallel abstractions such as kernels, blocks, and threads per block (see

    Figure 2.2). A kernel executes a sequential program on a set of parallel threads. Each thread

    has its own registers and private local memory. Each block allows communication among its

    threads through shared memory. Blocks communicate between themselves using global

    memory. This memory hierarchy is illustrated in Figure 2.3.


    Figure 2.2: The CUDA model: a kernel, grid, and threads per block.

    Other memory spaces included in the CUDA memory model are:

    • texture memory, specialized for 2D read-only coalesced accesses

    • constant memory, designed to support read-only accesses from different threads across blocks.
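    To make the kernel/grid/block decomposition concrete, the following minimal sketch (vector addition; the code is illustrative and not taken from the thesis) shows how each thread derives a global index from its block and thread coordinates and processes one element:

        #include <cuda_runtime.h>

        __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
            // Global index built from block and thread coordinates.
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                c[i] = a[i] + b[i];                     // one element per thread
        }

        int main() {
            const int n = 1 << 20;
            size_t bytes = n * sizeof(float);
            float *a, *b, *c;
            cudaMallocManaged(&a, bytes);               // unified memory, for brevity
            cudaMallocManaged(&b, bytes);
            cudaMallocManaged(&c, bytes);
            for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

            dim3 block(256);                            // threads per block
            dim3 grid((n + block.x - 1) / block.x);     // thread blocks per grid
            vecAdd<<<grid, block>>>(a, b, c, n);        // one launch = one grid
            cudaDeviceSynchronize();

            cudaFree(a); cudaFree(b); cudaFree(c);
            return 0;
        }

    Each thread's registers and local memory remain private; threads within a block could additionally cooperate through __shared__ memory, and separate blocks only through global memory, matching the hierarchy in Figure 2.3.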

    As mentioned in Chapter 1, the CUDA model follows a SIMT architecture to manage

    and execute threads in groups of 32, called warps. Even though all threads in a warp must

    execute the same instructions, and the GPU is a SIMD architecture, there are some key

    features that differentiate a GPU from traditional SIMD:



    Figure 2.3: The CUDA memory hierarchy.

    • Each thread in the warp has its own instruction address counter.

    • Each thread has its own register state.

    • Each thread can have an independent execution path.

    Although the CUDA model enables each thread in a warp to display a different execution behavior, divergent behavior degrades performance, since divergent paths within a warp are executed serially. Control flow instructions (e.g., if-then-else, for, while) are the fundamental constructs in CUDA programming that cause this undesired behavior, called warp divergence.

    2.1.1 Divergence

    The use of control flow instructions is unavoidable in any application. Modern CPUs in-

    clude complex hardware to perform branch prediction [30, 31]. Hardware branch predictors

    speculate the direction of conditional control flow in programs [32, 33, 34]. If the predictor

    is correct, branch execution incurs little or no performance penalty. If the prediction is


    not correct, the CPU stalls for a number of cycles as the instruction pipeline is flushed,

    and instruction fetching resumes at the correct program counter. In comparison, GPUs are

    high-throughput, but lack complex branch prediction mechanisms [35, 36, 37]. Execution

    on an NVIDIA GPU using the CUDA execution model assumes that all threads in a warp

    must execute identical instructions on the same cycle. Executing complex control flow

    typically results in divergent execution between the threads in the same warp [38].

    Recent GPUs are designed to better handle control flow. The modern GPU hardware

    supports condition codes (CC) and CC registers that contain the 4-bit state vector (sign,

    carry, zero, overflow) used in integer comparisons [39]. The CC registers can direct the flow

    of execution via predication or divergence. Predication allows (or suppresses) the execution

    of instructions on a per-thread basis within a warp, while divergence supports conditional

    execution of longer instruction sequences.

    Due to the additional overhead of managing divergence and convergence, the compiler

    uses predication for short instruction sequences. The effect of most instructions can be pred-

    icated on a condition; if the condition is not true, the instruction is suppressed. Predication

    works well for small fragments of conditional code, especially for if statements with no

    corresponding else. For larger conditional code segments, predication becomes inefficient

    because every instruction is executed, regardless of whether it will affect the computation.

    When the length of the conditional code fragment is long and the cost of predication would

    exceed the benefits, the compiler will generate conditional branches. If the threads in a warp

    diverge due to a data-dependent conditional branch, the warp serially executes each branch

    path taken, disabling threads that are not on that path. Once all paths complete, all threads

    re-converge to the original execution path. Figure 2.4 illustrates how warp divergence is

    handled on a GPU.
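    A minimal illustration of this behavior (the code is illustrative and not from the thesis): when the branch condition depends on the thread index, threads within the same 32-thread warp take both paths and the paths are serialized, whereas branching at warp granularity keeps each warp on a single path.

        __global__ void divergent(int *out) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            // Odd and even threads of the same warp take different paths,
            // so the warp executes the if-path and the else-path back to back.
            if (tid % 2 == 0)
                out[tid] = 2 * tid;
            else
                out[tid] = tid + 1;
        }

        __global__ void convergent(int *out) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            // Branching on the warp index keeps all 32 threads of a warp together.
            if ((tid / 32) % 2 == 0)
                out[tid] = 2 * tid;
            else
                out[tid] = tid + 1;
        }

    For branches as short as these the compiler may simply predicate both paths, as described above; the serialization cost becomes visible when the divergent code regions are longer.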

    Although warp divergence can have a negative impact on application throughput, this

    impact varies dramatically across GPU architectures. In the following sections we address

    divergent execution for the latest GPU generations, from the Fermi GPU architecture to the

    Pascal GPU architecture.



    Figure 2.4: Branch divergence in the GPU

    2.2 GPU Computing Architecture

    A Streaming Multiprocessor (SM) is the centrepiece of the NVIDIA GPU architecture.

    A thread block is scheduled on a single SM, and once it is scheduled on the SM, it remains

    there until execution completes. An SM can hold more than one thread block at the same

    time. Registers and shared memory are scarce resources in the SM. These resources have to

    be partitioned among all threads resident on an SM. Each SM contains hundreds of CUDA

    cores, and each GPU device contains tens of SMs.

    Logically, all threads in a block run in parallel, but not all threads can execute physically

    at the same time. Therefore, different blocks may make progress at different rates. Since warps

    are the atomic unit of execution on the GPU, many warps can be scheduled in an SM, but

    depending on the SM resource availability, not all scheduled warps will be active. If a warp

    is idle, then the SM schedules another warp from any block that is resident on the same

    SM. The benefit of this switching between concurrent warps is that it incurs essentially no overhead, since warp state stays resident on the SM, allowing stalls in one warp to be hidden by executing another.

    Given the importance of determining the right warp granularity, we would like to quickly


    find the best grid configuration for any application.
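    One practical starting point (a sketch using the CUDA occupancy API; the API call and kernel here are illustrative additions and are not discussed in the thesis text) is to ask the runtime for a block size that maximizes occupancy for a given kernel and then derive the grid size from the data:

        #include <cuda_runtime.h>

        __global__ void myKernel(float *data, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= 2.0f;
        }

        void launch_with_suggested_config(float *d_data, int n) {
            int minGridSize = 0, blockSize = 0;
            // Suggests a block size that maximizes occupancy for this kernel,
            // given its register and shared-memory usage on the current device.
            cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

            int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n
            myKernel<<<gridSize, blockSize>>>(d_data, n);
            cudaDeviceSynchronize();
        }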

    2.2.1 Fermi Architecture

    The NVIDIA Fermi (chip GF110) GPU was released in 2009. Fermi introduced an

    increased number of CUDA cores per SM, higher space for shared memory, configurable

    shared memory, and Error Correcting Codes (ECC) on main memory and caches. Each SM

    in Fermi has 32 CUDA processor cores, 16 load/store units, and four special function units

    (SFUs). Fermi has a 64-KByte register file, an instruction cache, two multi-thread warp

    schedulers and two instruction dispatch units [40].

    The SIMT instructions control the execution of an individual thread, including arithmetic,

    memory access, and branch/control flow instructions. Fermi extends SIMT to control flow

    with support for indirect branches and function-call instructions. With the improvements

    introduced in the Fermi Parallel Thread Execution (PTX) 2.0 Instruction Set Architecture

    (ISA), individual thread control flow can predicate instructions.

    2.2.2 Kepler Architecture

    A number of new features were introduced in Kepler as compared to the earlier GPU

    Fermi architecture. Table 2.2 compares some of these features for Fermi (instance of chip

    GF110) and Kepler (instance of chip GK110).

    Kepler GK110 comprises up to 15 Kepler SM (SMX) units; each SMX has four warp schedulers and eight instruction dispatch units, so it can issue and execute four warps simultaneously.

    Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 load/store

    units, and 32 special function units, which compute sine, cosine, reciprocal, or square root operations

    per thread per clock [42]. Kepler GK110 can provide up to 4.29 TFLOPS single-precision

    and 1.43 TFLOPS double-precision floating point performance [43].

    In addition to an increase in the number of CUDA cores per SM and a dramatic increase

    in the number of registers per thread, Kepler (compute capability 3.5 or higher) introduced a

    number of new features to further simplify parallel program design.


    Fermi (chip GF110) Kepler (chip GK110)

    SPs per SM 32 192

    Threads per SM 1536 2048

    Thread blocks per SM 8 16

    Warp schedulers per SM 2 4

    Dispatch Units per SM 2 8

    Shared Memory/L1 cache 16/48KB 16/32/48KB

    32-bit Registers per SM 32K 64K

    Registers per thread 63 255

    Table 2.2: Fermi chip GF110 versus Kepler chip GK110 [41]

    2.2.2.1 Dynamic Parallelism

    Dynamic parallelism is an extension to the CUDA programming model, enabling CUDA

    kernels to create, and synchronize, new kernels entirely on the GPU. With this feature, any

    kernel can launch a child kernel and manage inter-kernel dependencies [35].

    In order to manage the execution of dynamic parallelism, the CUDA model added a

    new feature known as the Grid Management Unit (GMU) [44, 42, 45], which is able to

    dispatch, as well as pause the dispatch, of new grids. The GMU can also queue pending,

    and suspend running, grids. A grid includes all thread-blocks associated with the kernel.

    Grids are launched in the order that they are received.

In previous GPU generations, the host launched work through the Compute Work Distributor (CWD) unit [42, 2]; the CWD tracks the issued blocks and sends them to the SMs for execution. In Kepler and more recent GPU generations, work launched from the host or the device passes through the GMU. The GMU communicates with the CWD using a bidirectional link to prioritize or suspend/pause grids. Also, the GMU has a direct connection to the SMs to support dynamic parallelism, and through this connection device kernels can dispatch child grids.

The aim of the GMU is to effectively manage grid dispatching, in such a way that, if resources need to be freed up for child kernels to execute, the GMU will suspend parent kernel grids [42, 45]. The device runtime will reschedule the grids on different SMs in order to better manage resources. Figure 2.5 illustrates the GMU's interaction with the CWD and the SM.

    Figure 2.5: Work flow of the Grid Management Unit to dispatch, pause, and hold pending

    and suspended grids.

Dynamic parallelism enables work to be created directly on the GPU. This can remove the need to transfer execution control and data between the host and the device. The child kernel launch decisions are made at runtime by threads executing on the device. The CUDA model controls the synchronization and communication between a parent kernel and its child kernels.

    The local memory and registers associated with a parent thread are still only accessible by the

    parent thread, and are not accessible by other threads or any child threads. Communication

    with a child thread is only through global memory.


    Using dynamic parallelism, data-dependent parallel work can be generated inline within

    a kernel at runtime. These kernels take advantage of the GPU’s hardware scheduler and load

    balancer to dynamically adapt execution to make data-driven decisions. Figure 2.6 shows

how dynamic parallelism works on a GPU.

Figure 2.6: Dynamic parallelism on the GPU: a kernel launching child kernels from device code.
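The essential pattern behind the figure is that a parent kernel inspects its data and, when additional work is discovered, launches a child grid directly from device code. The sketch below is a minimal, hypothetical illustration of this pattern (the kernel names and data layout are invented for exposition); it requires compute capability 3.5 or higher and compilation with relocatable device code (nvcc -rdc=true).

__global__ void childKernel(float *data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[i] *= 2.0f;                  // work discovered at runtime
}

__global__ void parentKernel(float *data, int *workCount, int numTasks)
{
    int task = blockIdx.x * blockDim.x + threadIdx.x;
    if (task < numTasks && workCount[task] > 0)
    {
        int threads = 128;
        int blocks  = (workCount[task] + threads - 1) / threads;
        // The launch decision is made on the device, based on the data.
        childKernel<<<blocks, threads>>>(data, workCount[task]);
        cudaDeviceSynchronize();          // optional: wait for the child grid to finish
    }
}

Because a parent and its children can only communicate through global memory, the child in this sketch writes its results directly back to the data array.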

2.2.2.2 Hyper-Q

Hyper-Q increases the total number of work queues between the host and the device by allowing 32 simultaneous hardware-managed connections (as compared to the single connection available with Fermi). Figure 2.7 illustrates the Hyper-Q feature in Kepler.

Figure 2.7: Hyper-Q: kernels submitted to up to 32 streams are dispatched through separate hardware work queues.
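To benefit from Hyper-Q, an application only needs to submit independent kernels to different CUDA streams; the hardware work queues then keep the streams from serializing against one another. The host-side sketch below is a hypothetical minimal example (kernel and function names are placeholders, and it assumes numStreams is at most 32 and divides n evenly).

#include <cuda_runtime.h>

__global__ void smallKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

void launchConcurrentKernels(float *d_data, int n, int numStreams)
{
    cudaStream_t streams[32];
    int chunk = n / numStreams;
    for (int s = 0; s < numStreams; s++) {
        cudaStreamCreate(&streams[s]);
        // Kernels in different streams may execute concurrently when
        // enough SM resources are available.
        smallKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + s * chunk, chunk);
    }
    for (int s = 0; s < numStreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}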

    2.2.3 Maxwell Architecture

NVIDIA's Maxwell generation provides only a few enhancements over the previous GPU generation, with a focus on energy efficiency. In addition to supporting features such as dynamic parallelism and concurrent kernel execution, the Maxwell generation delivers 2x the performance per watt of the Kepler generation [46].

    The Maxwell GTX 980 Ti (chip GM200) comprises 22 Maxwell SMs (SMM). Each

    SMM has 128 CUDA cores, four warp schedulers, eight instruction dispatch units, and

    eight texture units. Overall, the Maxwell SM looks very similar to a Kepler SM, except that

    Maxwell provides fewer CUDA cores per SM.

    Another major change, as compared to the Kepler architecture, is in the memory hierar-

chy. Shared memory and the L1 cache are no longer combined. Shared memory is dedicated, and the L1 cache is combined with the texture cache. The Maxwell GTX 980 Ti ships with up to 96 KB in its shared memory unit, and 48 KB for the L1/texture cache.

    2.2.4 Pascal Architecture

NVIDIA introduced the Pascal architecture in 2016. The NVIDIA GTX 1080, which includes a Pascal GP104 chip, comprises 7.2 billion transistors and 2560 single-precision CUDA cores.


GDDR5X memory was introduced with the GP104, providing a 256-bit memory interface and

    delivering 43% higher memory bandwidth than NVIDIA’s prior GeForce GTX 980 GPU.

The GP104 GPU consists of four Graphics Processing Clusters (GPCs), 20 Pascal SMs, and eight memory controllers. Each GPC has a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, four warp schedulers, eight instruction dispatch units, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage,

    and eight texture units [47]. A comparative feature analysis between four NVIDIA GPU

    generations is presented in Table 2.3.

    GPU                   GTX 590    GTX Titan   GTX 980 Ti   GTX 1080
    Family                Fermi      Kepler      Maxwell      Pascal
    Chip                  GF110      GK110       GM200        GP104
    Compute Capability    2.0        3.5         5.2          6.1
    SMs                   16         14          22           20
    CUDA cores per SM     32         192         128          128
    Total cores           512        2688        2816         2560
    Global Mem.           1474 MB    6083 MB     6083 MB      8113 MB
    Shared Mem.           48 KB      48 KB       48 KB        48 KB
    Threads/SM            1536       2048        2048         2048
    Threads/block         1024       1024        1024         1024
    Clock rate            1.26 GHz   0.88 GHz    1.29 GHz     1.84 GHz
    TFLOPS                1.5        4.29        6.50         9.00

Table 2.3: A comparison of the features available on the four generations of NVIDIA GPUs considered in this thesis [25].
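Several of the per-device values in Table 2.3 (compute capability, SM count, clock rate, and memory sizes) can also be queried at runtime through the CUDA device management API. The following host-side sketch is a minimal illustration and is not part of the tooling used in this thesis.

#include <cstdio>
#include <cuda_runtime.h>

void printDeviceSummary()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("%s: CC %d.%d, %d SMs, %.2f GHz, %.0f MB global, %zu KB shared/block\n",
               p.name, p.major, p.minor, p.multiProcessorCount,
               p.clockRate / 1.0e6,                         // clockRate is reported in kHz
               p.totalGlobalMem / (1024.0 * 1024.0),
               p.sharedMemPerBlock / 1024);
    }
}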


  • Chapter 3

    Related work

    In this chapter, we review related work in the areas of GPU characterization, with special

    emphasis on modern GPU features. We focus our literature review on advanced parallel

    features for multiple levels of concurrency, and different grains of parallelism.

    3.1 Characterization of GPUs

    There have been studies focusing on GPU characterization to better understand the

    improvements during the evolution of these devices. This evolution started with GPUs as

    rendering tools, and spans to today where GPUs act as advanced general purpose accelera-

    tors [48, 49, 50, 51].

    An early characterization study by Jia et al. [48] in 2012 focused on characterizing cache

    memories on GPUs. Starting with the NVIDIA Fermi and the AMD Fusion, GPU vendors

    have included demand-fetching in their data caches. Earlier, GPU generations were focused

    on graphics rendering, providing local memories instead of demand-fetched caches. With

    the introduction of demand-fetch caches, a new challenges arrived: 1) understanding the

    benefits of cache memories and 2) a lack of intuition for developers to efficiently use them.

    They addressed these two problems and provided a mechanism to efficiently utilize cache

    memories.

    Wong et al. [49] presented a characterization of Tesla GPUs through the execution of a

    set of microbenchmarks. Their analysis provided insights about the characteristics of the


    GPUs beyond the information provided by NVIDIA. Another attempt to characterize the

    internals of a GPU was presented by Torres et al. [50]. In their study, they focused on the

    impact of the CUDA tuning techniques on the Fermi architecture. Jiao et al. [51] presented

    a characterization study of GPUs to evaluate power efficiency, and the correlation between

    application performance and power consumption.

    A large body of work studies how to leverage GPUs effectively through understanding

    their characteristics for older [52, 53] and modern [54, 55, 56] generations of GPUs. While

    Kerr et al. [52] focused on understanding the behavior of PTX 1.4, Lee et al. [53] developed

    an exhaustive performance analysis to capture performance gaps between an NVIDIA

GTX280 Tesla architecture and an Intel Core i7-960. In this thesis, we focus our attention primarily on the characterization of more modern GPUs.

    3.1.1 Modern GPUs

Kayiran et al. [54] explored the impact of memory accesses during concurrent thread execution on the resulting application performance. They provided a

    thorough evaluation of 31 applications - from the CUDA SDK to Map-Reduce problems

    - to understand resource contention in caches, networks and memory. Furthermore, they

    proposed a dynamic Cooperative Thread Arrays (CTA) scheduling mechanism, which

    regulates thread level parallelism by allocating an optimal number of CTAs per application.

Mei et al. [55] provided microbenchmarks to dissect the device memory hierarchy and characterize the organization of the cache systems of different GPUs on the Fermi, Kepler, and Maxwell architectures. Ukidave et al. [19] provided a set of application benchmarks to analyze the

    latest features on modern GPUs, such as nested parallelism, concurrent kernel execution,

    atomic operations, and shuffling.

    In the next section, we review characterization of multiple levels of concurrency and

    thread granularity on modern GPUs.


    3.2 Multiple Levels of Concurrency

    3.2.1 Nested Parallelism

    One of the earliest characterizations of nested parallelism was presented by DiMarco et

    al. [57] in 2013. They aimed to quantify the performance gains of dynamic parallelism pre-

    sented by CUDA 5 and the Kepler architecture. Their exploration covered two applications:

    K-means and hierarchical clustering. Their results showed that finer granularity of TLP

    provides a more efficient way to leverage nested parallelism than just avoiding CPU-GPU

    synchronization.

    In 2014, Wang et al. [58] presented an evaluation of the impact of nested parallelism

    in unstructured GPU applications for the Kepler architecture. Irregular applications suffer

    from workload imbalance, which provides a good target for optimization using fine-grained

    threads contained in coarse-grained blocks. Their characterization focused on control flow

    and memory access measurements. Two metrics were proposed in their study: i.) warp

execution efficiency, and ii.) load/store replay overhead. Although they provided a thorough analysis of nested parallelism for control flow instructions and memory accesses, they did not take into consideration the synchronization cost between parent and child kernels when evaluating the benefits of nested parallelism. Furthermore, they did not take into consideration a finer-grained classification of control flow divergence and its impact on application performance.

    In 2015, Wang et al. [59] continued their work on characterizing nested parallelism in

    GPUs. They proposed a new mechanism called Dynamic Thread Block Launch (DTBL), a

    new execution model to support irregular applications on GPUs. DTBL allows coalesced

    allocation of child kernels and parent kernels.

    Yang et al. [60] analyzed a set of optimized parallel benchmark applications that contain

    loops. Their analysis covers the degree of TLP and proposed a framework called CUDA-NP

    to exploit nested parallelism in CUDA. CUDA-NP is a pragma-based compiler approach that

    generates GPU kernels with nested parallelism. Basically, their approach reads the OpenMP-

    like pragma directives in the input kernels and creates the respective child kernels with a

    grid configuration based on the parallel-loop-TLP degree. However, they did not analyze


the implications of parent-child synchronization. Furthermore, they relied on the developer's knowledge to identify potential parallel loops that can exploit nested parallelism, without providing any insight into the behavior of the underlying architectures.

    Further studies [61, 62, 63] characterized nested parallelism based on the irregularity

    of an application. Applications containing parallel loops and recursive calls are suitable

    to leverage nested parallelism. Zhang et al. [61] adapted two irregular and data-driven

problems (breadth-first search and single-source shortest path) to leverage nested parallelism. Li et al. [62] proposed parallelization templates to leverage nested parallelism for tree and graph problems. These types of problems present irregular nested loops and parallel recursive computation. Wang et al. [63] provided insights on leveraging nested parallelism for general irregular applications. However, none of these approaches provided a holistic analysis of the implications of leveraging nested parallelism and its effects across different architecture/compiler versions.

    3.2.2 Concurrent Kernel Execution

    In early GPU architectures, concurrent kernel execution was poorly supported. In 2011,

    Wang et al. [64] proposed a mechanism to exploit concurrent kernel execution through

    manual context funnelling. They compared CUDA 4 automatic context funnelling versus

    their approach for Fermi architectures. They showed that manual control of shared resources

    might provide slight improvements in application performance. However, they did not

    discuss resource contention based on the interplay between concurrent kernels.

    In 2012, Wende et al. [65] provided a kernel reordering mechanism to exploit concurrent

    kernel execution for Fermi architectures. Their execution model is designed to partition

    kernels into small-scale computations, and by using producer-consumer principles, manage

    GPU kernel invocations after reordering them. Later, in 2014 Wende et al. [66] continued

    their work on exploitation of concurrent kernel execution, and proposed a characterization of

    NVIDIA Hyper-Q feature for Kepler architecture, using an offloading mechanism to allow

    running multiple kernels simultaneously. Their analysis explored synthetic benchmarks

    and developed a performance evaluation, complementing their previous work on kernel

    reordering.


    Gregg et al. [67] proposed a kernel scheduler mechanism called KernelMerge that

allows two OpenCL kernels to run concurrently on AMD cards. KernelMerge takes the kernel configuration into consideration and investigates the interaction between concurrent kernels to analyze interference when sharing resources.

    Since the Kepler architecture, NVIDIA provides a modern hardware design to adequately

    support concurrent kernel execution. In 2014, Jog et al. [68]—moving to the next logical

    step—proposed an Application-aware memory system for fair and efficient execution of

    concurrent applications. Their approach takes into consideration memory awareness by

    providing a new scheduling mechanism for serving memory requests in a round-robin fash-

    ion. They considered four metrics based on the Instructions Per Cycle for each application.

    However, they did not consider resources contention on registers, nor the grid configu-

    ration. Furthermore, they focused on memory-bound applications, and did not discuss

    arithmetic-bound applications.

    In 2016, Luley et al. [69] proposed a framework to exploit NVIDIA’s Hyper-Q. Their

    framework oversubscribes kernels and defragments memory transfers to effectively overlap

    accesses with computations. Furthermore, they proposed multiple mechanisms to reorder

    kernels with the aim to improve application throughput. Although, they have studied the

    impact of memory transfers, they have not analyzed resource contention between concurrent

    kernels, which can be a key bottleneck when attempting to leverage concurrent kernel

    execution.


  • Chapter 4

    Characterization of advanced parallel

    features

    Acceleration of high performance applications that exhibit complex and irregular execu-

tion behavior is an ever-growing open problem. A naive port of an irregular application to a

    parallel platform often leads to underutilization of hardware resources, significantly limiting

    performance. In this chapter, we present a characterization of advanced parallel features on

a GPU that can be effectively exploited to tune any application with a high degree of irregularity.

    4.1 Nested Parallelism

    Irregularity in an application can result in poor workload balance when attempting

    to exploit fine-grained thread-level parallelism. We next consider examples of high-level

    language behavior that can suffer from a lack of inherent thread-level parallelism.

    A number of irregular applications contain control-flow dependent nested loops. This

kind of irregularity can inhibit thread-level parallelism, since independence can only be de-

    duced at runtime. Because many loops tend to be data dependent, GPU hardware vendors

    introduced support for nested parallelism, leveraging nested TLP through the addition of a

    new level of parallelism. We have studied a number of irregular applications to identify how

    frequently control flow dependent nested loops are used. Table 4.1 shows characterization


    data from two different GPU benchmark suites, where control flow dependent nested loops

    occur.

    Application                     Benchmark Suite   Number of Control Flow
                                                      Dependent Nested Loops
    Barnes Hut                      Lonestar [18]     6
    Delaunay Mesh Refinement        Lonestar [18]     7
    Points-to Analysis              Lonestar [18]     31
    Survey Propagation              Lonestar [18]     7
    Single-Source Shortest Paths    Lonestar [18]     2
    Connected Component Labeling    NUPAR [19]        1
    Level Set Segmentation          NUPAR [19]        1

Table 4.1: Irregular applications from two different GPU benchmark suites which exhibit control flow dependent nested loops.

    We have also explored recursive algorithm patterns that can benefit from nested par-

    allelism. Parallel recursion is a solution to efficiently execute recursive algorithms which

    exhibit the ability to spawn multiple threads per recursive call. Before the introduction of

    nested parallelism on the GPU, recursive solutions required GPU and CPU intervention -

or an implementation of the GPU kernel devoid of recursive kernel calls. However, constant communication between the CPU and the GPU produces memory copies and results in communication overhead. In addition, most recursive solutions are data dependent, and therefore it is challenging to anticipate the amount of overhead that will be introduced. On the other hand, we cannot always use a single GPU kernel call version for all recursive algorithms. Table 4.2 shows a list of recursive kernels that can be expressed as parallel recursion.

    Application             Benchmark Suite   Number of Control Flow    Number of
                                              Dependent Nested Loops    Recursion Calls
    Breadth-First Search    Lonestar          0                         1
    Prim's Algorithm        -                 1                         1

Table 4.2: Recursive applications which exhibit parallel recursion.


__global__ void singleKernel(int *A, int *B, int *C, int rows, int cols)
{
    // One thread per row; the grid is assumed to launch exactly 'rows' threads.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Data-dependent condition: only some rows execute the inner loop.
    if (A[idx * cols] == 1)
    {
        for (int i = 0; i < cols; i++)
            C[idx * cols + i] = A[idx * cols + i] + B[idx * cols + i];
    }
}

Program Listing 4.1: Micro-benchmark kernel with irregular nested loop execution.

    With nested parallelism, a recursive solution can be naturally ported to the GPU and can

    avoid CPU-GPU communication overhead. Nonetheless, the recursive spawning of threads

does not always generate enough TLP to fully exploit the GPU, and it can lead to substantial kernel

    launch overhead and hardware underutilization.

Exploiting nested parallelism, either in the presence of control-flow dependent nested loops or parallel recursion, is not straightforward. A nested loop can include control flow divergence, and a recursive solution can lead to poor TLP and low warp efficiency. At the

    same time, nested synchronization can turn into a large number of thread stalls and global

    communication between parent-child kernels. Next, we explore each of these factors and

    present metrics to quantify their impact on kernel performance.

    4.1.1 Control Flow Instructions

    Mapping parallel programs exhibiting arbitrary control flow onto parallel units can be a

    difficult task. There is generally no guarantee that parallel units will execute the same control

    flow path. For instance, Program 4.1 presents a micro-benchmark kernel that executes a

    loop based on an input parameter data. Figure 4.1 illustrates the dynamic execution of the

    micro-benchmark for two architectures: a Kepler GTX Titan, and a Maxwell GTX Titan Ti.

    Both of the execution examples are run with the same input parameters, the same NVIDIA

    driver, and the same CUDA version. However, the number of instructions executed varies

    along the control flow path.


    Figure 4.1: Control flow graphs of Program 4.1 for Kepler GTX Titan and Maxwell GTX

    Titan Ti.

CUDA binary tools such as nvdisasm [45] and cuobjdump [45] have been widely used to produce control flow graphs (CFGs). However, nvdisasm and cuobjdump gather

    kernel behavior statically, and do not allow dynamic analysis of an application’s irregular-

    ity. On the other hand, the SASS Instrumentation tool (SASSI) [70] allows the dynamic

    collection of metrics during execution time. Moreover, SASSI is able to retrieve developer-

specified metrics about the control flow instructions executed at runtime. SASSI, used alongside nvprof [45], allows us to collect the following runtime metrics:

1. instExec: Number of instructions executed. Reported by nvprof.

2. warpDivEff: Ratio of the average number of active threads per warp to the maximum number of threads per warp supported on a multiprocessor, expressed as a percentage. Reported

    by nvprof.

    3. cfExecuted: Number of executed control-flow instructions. Reported by nvprof.

    4. cfDependentNestedLoop: Number of instructions executed inside a control flow

    loop-instruction. Reported by our handler and injected using SASSI.


    These metrics are intrinsically related to the execution of kernel control flow and capture

    the efficiency of the warp execution. We evaluate the impact of the percentage of instructions

executed inside loops to identify potential hotspots and opportunities to exploit nested TLP. We compute the ratio of instructions executed inside of loop bodies, as a fraction of all instructions executed, in order to compute the impact of instructions inside these

    common control flow structures.

\[ loopInstExec = \frac{cfDependentNestedLoop}{instExec} \qquad (4.1) \]

We also consider the amount of idle resources due to warp divergence. warpDivEff allows us to compute the reciprocal metric, which measures the threads left idle while waiting for a loop execution to end. This can be computed as warpDivIdle = 1 - (warpDivEff / 100). Next, we propose the loop warp efficiency, which takes into account the ratio of instructions executed during loop execution (i.e., loopInstExec). The product warpDivIdle * loopInstExec gives us the fraction of loop warp threads that are idle (loopWarpThreadsIdle); to obtain an efficiency metric, we take its reciprocal by subtracting it from 1 and multiply by 100 to express it as a percentage:

\[ loopWarpEff = (1 - loopWarpThreadsIdle) \times 100 \qquad (4.2) \]
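As a concrete illustration of Equations 4.1 and 4.2, the short helper below computes the derived metrics from the raw counters (the function name is a placeholder; the inputs correspond to the nvprof/SASSI measurements listed above).

#include <cstdio>

// instExec, cfDependentNestedLoop: dynamic instruction counts.
// warpDivEff: warp divergence efficiency as a percentage (0..100).
void reportLoopMetrics(double instExec, double cfDependentNestedLoop, double warpDivEff)
{
    double loopInstExec        = cfDependentNestedLoop / instExec;     // Eq. 4.1
    double warpDivIdle         = 1.0 - (warpDivEff / 100.0);
    double loopWarpThreadsIdle = warpDivIdle * loopInstExec;
    double loopWarpEff         = (1.0 - loopWarpThreadsIdle) * 100.0;  // Eq. 4.2

    printf("loopInstExec = %.3f, loopWarpEff = %.2f%%\n", loopInstExec, loopWarpEff);
}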

    Our proposed metrics are specifically designed to measure workload imbalance generated

by irregular applications. These applications have data-dependent workloads and unpredictable control flow behavior, which cause severe workload imbalance and, eventually, poor GPU utilization.

    4.1.2 Parallel Recursion

Recursion is a method of making self-referential calls, commonly used to compute

    problems through breaking them into smaller sub-problems, and using a divide-and-conquer

    strategy. For instance, Program 4.2 illustrates a simple recursive program which implements

the Fibonacci sequence [71]. In a recursive solution, the problem is broken into a base case and a set of recursive calls on smaller sub-problems.


int fib(int n)
{
    if (n == 0 || n == 1)
        return n;                       // base cases
    return fib(n - 1) + fib(n - 2);     // recursive case
}

Program Listing 4.2: Sequential recursive implementation of the Fibonacci sequence.


__global__ void fib_kernel_par_rec(int n, unsigned long int *vFib)
{
    if (n == 0 || n == 1)
        return;    // base cases: vFib[0] and vFib[1] are assumed to be initialized by the host

    // Child kernel launches; a single-thread launch configuration is assumed here.
    fib_kernel_par_rec<<<1, 1>>>(n - 2, vFib);
    fib_kernel_par_rec<<<1, 1>>>(n - 1, vFib);
    cudaDeviceSynchronize();             // wait for both child kernels to finish
    vFib[n] = vFib[n - 1] + vFib[n - 2];
}

Program Listing 4.3: Fibonacci parallel recursive scheme in CUDA.

1. TLPDegree: Number of threads synchronized across a group of threads called a CUDA block, also known as a Cooperative Thread Array (CTA) [72].

2. workEfficiency: Ratio of the number of operations executed that contribute to solving the problem, divided by the total number of operations executed on the GPU. The goal of this metric is to measure the fraction of non-redundant operations (out of redundant + non-redundant operations) executed per GPU kernel. For instance, a work efficiency of 100% indicates that no redundant operations were executed.

    3. depthKernelRecursion: Number of nested kernel calls.

    Our proposed metrics are specifically designed to measure the efficiency of recursive

execution by parallel recursive applications. These applications have data-dependent workloads, nested kernel calls, and irregular parallel recursion, which lead to unbalanced workload execution, low work efficiency, and, eventually, poor GPU utilization.
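To make workEfficiency and depthKernelRecursion concrete, consider the parallel recursive Fibonacci scheme of Program 4.3. The host-side sketch below is a hypothetical illustration that uses kernel launch counts as a proxy for operations; it is not part of the measurement infrastructure described in this thesis.

#include <cstdio>

// Number of launches performed by the naive recursive scheme of Program 4.3.
static long launches(int n)
{
    if (n == 0 || n == 1) return 1;
    return 1 + launches(n - 1) + launches(n - 2);
}

int main(void)
{
    int n = 20;
    long total  = launches(n);   // every launch, including redundant recomputations
    long useful = n + 1;         // only fib(0) .. fib(n) are actually needed
    printf("workEfficiency (proxy) = %.2f%%\n", 100.0 * useful / total);
    printf("longest chain of nested launches = %d\n", n - 1);
    return 0;
}

For n = 20, the total number of launches vastly exceeds the n + 1 distinct values needed, so the work efficiency of the naive recursion is well below 1%.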

    4.1.3 Child Kernel Launching and Synchronization

Nested parallelism in CUDA allows a parent kernel to explicitly synchronize with its child kernels by calling the cudaDeviceSynchronize API. When used, the parent thread block will wait until the child threads finish their execution. cudaDeviceSynchronize is expensive and should be avoided when possible. However, for many irregular applications the parent


    thread will require results of the child threads to continue execution. We characterize the

    overhead of device synchronization by measuring its impact on the overall performance.

Once a potential nested parallelism hotspot was identified by our metrics, we implemented a nested parallelism kernel and compared it to the non-nested parallelism kernel, as well as to a sequential implementation of the kernel, in order to characterize the overhead of child kernel launching.
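As an illustration of this transformation, the sketch below shows one possible nested parallelism variant of the micro-benchmark in Program 4.1 (a hypothetical reconstruction, not the exact kernel evaluated in this thesis). The parent thread launches a child grid over the columns of a row instead of iterating over them serially; it requires compilation with relocatable device code (-rdc=true) for a device of compute capability 3.5 or higher.

// Child kernel: one thread per column element of the selected row.
__global__ void addRowChild(int *A, int *B, int *C, int rowStart, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < cols)
        C[rowStart + i] = A[rowStart + i] + B[rowStart + i];
}

// Parent kernel: one thread per row; a child grid is launched only for rows
// that satisfy the data-dependent condition.
__global__ void nestedKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows && A[idx * cols] == 1)
    {
        int threads = 256;
        int blocks  = (cols + threads - 1) / threads;
        addRowChild<<<blocks, threads>>>(A, B, C, idx * cols, cols);
    }
}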

Figure 4.2 shows the execution time of the three different implementations (i.e., sequential,

    non-nested parallelism and nested parallelism) for our micro-benchmark kernel. The micro-

    benchmark computes the addition of two matrices if the value of the first element in the row

matches the condition in the first control flow instruction. We defined a set of experiments,

    varying the input sizes in terms of rows and columns. In addition, we controlled the level of

    divergence starting from 12.5% and increasing it up to 75%. We argue that higher divergence

    leads to better exploitation of nested parallelism, but this divergence is going to be data

    dependent.

We expected that small input sets would lead to poor performance on a GPU due to low

    utilization of the high TLP available. However, nested parallelism starts outperforming

    non-nested parallelism as the degree of TLP increases, especially in the presence of a high

    degree of divergence.

    In order to characterize the behavior of nested parallelism across different GPU archi-

    tectures, we used two Kepler and two Maxwell architectures, running with the same input

    sets, the same NVIDIA driver, and the same CUDA version. Figure 4.3 shows the runtime

    execution for different input sets, with data values generating a degree of 75% divergence

    across the four different GPUs.

Although the Kepler GT 730 has the same number of CUDA cores per SM as the Kepler GTX Titan, it has a much smaller number of SMs: while the GTX Titan has 15 SMs, the GT 730 has only 2 SMs. The number of SMs has a high impact on our ability to exploit nested parallelism. For instance, a launched child kernel will have to allocate its blocks on the remaining available SMs on the device. If the device does not have enough free SMs, then the benefits of nested parallelism will not be realized.

Figure 4.2: Execution time of sequential, non-nested parallelism, and nested parallelism kernels on the GTX Titan (Kepler architecture); lower is better.

We present Equation 4.3 to characterize kernel overhead across different architectures, based on SM usage. NumberThreadsPerBlock and NumberBlocks are application specific, and MaxThreadsPerSM is architecture specific. If SMUsage surpasses the available number of SMs on the GPU, it will prevent us from effectively leveraging nested parallelism. We have found that we also benefit from the use of persistent threads to control SMUsage.

\[ SMUsage = \frac{NumberThreadsPerBlock \times NumberBlocks}{MaxThreadsPerSM} \qquad (4.3) \]
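A small host-side helper can apply Equation 4.3 and compare the result against the number of SMs reported by the device; the function name below is a placeholder, and the thread limit and SM count are read from cudaDeviceProp.

#include <cuda_runtime.h>

// Returns true if the planned grid, per Equation 4.3, does not require
// more SMs than the device provides.
bool fitsOnDevice(int threadsPerBlock, int numBlocks, int device)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, device);
    double smUsage = (double)threadsPerBlock * numBlocks
                     / p.maxThreadsPerMultiProcessor;        // Eq. 4.3
    return smUsage <= p.multiProcessorCount;
}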

    In our analysis, we characterized cudaDeviceSynchronize API calls for kernel

    synchronization using CUDA counters, such as clocks and the frequency rate. We verified

    that synchronization can negatively impact application performance when an application

launches a small number of threads per block and a reduced number of blocks per kernel (i.e., poor TLP). However, we found that kernel synchronization can be hidden by increasing the TLP and loopWarpEff.

Figure 4.3: Execution time of non-nested parallelism and nested parallelism across four GPUs (2 Kepler and 2 Maxwell GPUs).

    4.1.4 Memory Overhead

When using nested parallelism, global memory on the GPU is the only channel of communication between the parent and child kernels, and it may also be tied up by the device runtime for child kernel launches. The device runtime keeps track of kernel launches by creating a pool for all launches. Kernels that are not able to launch due to a lack of available resources remain in the pool of pending kernels. The size of this pool is referred to as th

