Developing an OpenMP Offloading Runtime for UVM-Capable GPUs
Hashim Sharif and Vikram Adve
University of Illinois at Urbana-Champaign
Hal Finkel
Argonne National Laboratory
Lingda Li
Brookhaven National Laboratory
Heterogeneous Programming
[Illustration: "many core" CPUs and GPUs]
➢ Heterogeneous programming allows optimizing application subcomponents for their specific computation needs
➢ OpenMP 4.0/4.5 supports heterogeneous programming
CUDA Unified Virtual Memory
New technology!
➢ UVM provides a single memory space accessible by all GPUs and CPUs in the system
➢ Pascal introduces a page migration engine that enables automatic page migration on data access
➢ Pascal UVM is limited only by the overall system memory size
CUDA UVM Example

CUDA Code (explicit data copies):
int main( int argc, char* argv[] ) {
  // Allocate memory for each vector on the GPU
  cudaMalloc(&X, bytes);
  cudaMalloc(&Y, bytes);
  cudaMalloc(&Z, bytes);
  // Copy host vectors to the device (destination first)
  cudaMemcpy(X, x, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(Y, y, bytes, cudaMemcpyHostToDevice);
  // Execute the kernel
  vecAdd<<<gridSize, blockSize>>>(X, Y, Z);
  // Copy the vector-add result back to the host
  cudaMemcpy(z, Z, bytes, cudaMemcpyDeviceToHost);
}

CUDA UVM Code (no explicit copies):
int main( int argc, char* argv[] ) {
  // Allocate memory in the unified virtual address space
  cudaMallocManaged(&X, bytes);
  cudaMallocManaged(&Y, bytes);
  cudaMallocManaged(&Z, bytes);
  // NOTE: No need for explicit memory copies
  // Execute the kernel
  vecAdd<<<gridSize, blockSize>>>(X, Y, Z);
}
OpenMP Offloading
[Diagram: Processors X and Y, each with its own cache, share data in RAM; an accelerator with separate device memory is attached over a bus]
➢ OpenMP threads share memory
➢ OpenMP 4 includes directives for mapping data to device memory
OpenMP Target Directives - Saxpy

#pragma omp target data map(to: x[0:N])     // map specifies data movement
#pragma omp target data map(tofrom: y[0:N])
{
  #pragma omp target                        // directs target compilation
  #pragma omp teams
  #pragma omp distribute parallel for
  for (long int i = 0; i < N; i++)
    y[i] = a*x[i] + y[i];
}
OpenMP Offloading Runtime - libomptarget
Clang/LLVM produces a single binary containing host code (for non-offloading constructs) and device code (for target constructs).
➢ Host OpenMP runtime (libomp): serves OpenMP runtime calls for non-target regions
➢ Offloading runtime (libomptarget): device-agnostic OpenMP runtime implementing support for target offload regions
➢ CUDA device plugin (libomptarget, over the CUDA Driver API): implements device-specific operations, e.g. data copies and kernel execution
An OpenMP Framework for UVM
Goals
➢ Improved performance
➢ Performance portability
Design Considerations
➢ How can we leverage OpenMP target constructs?
➢ Does performance scale with large datasets?
An OpenMP Framework for UVM
Compiler Technology
➢ Developed as LLVM transformations
➢ Extracts important application-specific information, e.g. data access probability
OpenMP Runtime
➢ Our implementation extends the libomptarget library; more specifically, we extend the CUDA device plugin
➢ Developed a UVM-compatible offloading plugin
➢ Includes UVM-specific performance optimizations
Optimizations - Prefetching
#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
#pragma omp target
#pragma omp teams
#pragma omp distribute parallel for
for (long int i = 0; i < N; i++)
y[i] = a*x[i] + y[i];
}
ExecuteTargetRegion()
Runtime
Data pages are fetched on demand, which incurs page-fault processing overhead
Optimizations - Prefetching
#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
#pragma omp target
#pragma omp teams
#pragma omp distribute parallel for
for (long int i = 0; i < N; i++)
y[i] = a*x[i] + y[i];
}
PrefetchToDevice(x)
PrefetchToDevice(y)
ExecuteTargetRegion()
PrefetchToHost(y)
Runtime
Prefetching the data pages that will be used avoids the page-fault processing overhead
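On UVM-capable devices, prefetch requests like `PrefetchToDevice`/`PrefetchToHost` can be issued through CUDA's `cudaMemPrefetchAsync` API. A minimal sketch of the idea (the kernel, sizes, and launch configuration are illustrative, not the runtime's actual code):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(float *x, float *y, float *z, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) z[i] = x[i] + y[i];
}

int main() {
  const int N = 1 << 20;
  size_t bytes = N * sizeof(float);
  float *X, *Y, *Z;
  cudaMallocManaged(&X, bytes);
  cudaMallocManaged(&Y, bytes);
  cudaMallocManaged(&Z, bytes);

  int device;
  cudaGetDevice(&device);

  // Prefetch inputs to the device before the kernel runs,
  // avoiding page faults during execution.
  cudaMemPrefetchAsync(X, bytes, device, 0);
  cudaMemPrefetchAsync(Y, bytes, device, 0);

  vecAdd<<<(N + 255) / 256, 256>>>(X, Y, Z, N);

  // Prefetch the result back to the host (cudaCpuDeviceId).
  cudaMemPrefetchAsync(Z, bytes, cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();
  return 0;
}
```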
Prefetching Workflow - Compiler
Goal: the compiler extracts the access probabilities associated with the OpenMP-mapped data.
App source → Build → LLVM IR → Profile Application → LLVM Pass: Extract Access Probability → LLVM Pass: Add Access Probability → Transformed IR → Build → Binary
Prefetching Workflow - Runtime
Executable → API calls including access probabilities → Offloading Runtime (libomptarget) → CUDA Device Plugin (libomptarget)
➢ The plugin uses a cost model to determine the profitability of prefetching
➢ Data with a high access probability is prefetched from RAM into device memory
Memory Oversubscription
➢ With device memory oversubscription, naïve data prefetching leads to memory thrashing
➢ Solution: pipeline partial prefetches with partial compute, so that each in-flight [Prefetch (partial) → Compute (partial)] stage fits in device memory
Pipelining Workflow - Compiler
App source → Build → LLVM IR → LLVM Pass: Extract OpenMP Loop Bounds → LLVM Pass: Extract Memory Access Expressions → LLVM Pass: Add OpenMP Region Info → Transformed IR → Build → Binary
➢ Loop bounds are required for chunking the iteration space
➢ Memory access expressions are required for chunking the data prefetches
Pipelining Workflow - Runtime
Application Binary → tgt_run_target() → runtime API call including OpenMP region info → libomptarget → Device Plugin
➢ The application calls into the runtime to invoke the target offload region
➢ If device memory is being oversubscribed, the device plugin chunks a kernel into multiple smaller kernels:

chunkCount = getChunkCount()
for (i = 0; i < chunkCount; i++) {
    prefetchChunk()
    executeChunk()
    synchronizeTask()
}
Experiments
➢Do the optimizations help improve performance for computational kernels?
➢Do the optimizations scale with increasing dataset sizes?
Experiments
➢Baseline: Demand Paging
➢Comparisons: Prefetching, Pipelined Prefetching
➢Benchmarks: Saxpy, Kmeans (Rodinia)
Experiments
Experimental Setup
Hardware
➢ SummitDev cluster at ORNL
➢ Pascal P100 GPU cards with 16 GB device memory
➢ NVLink interconnect
Software
➢ Clang/LLVM (clang-ykt project)
➢ libomp, libomptarget (clang-ykt project)
Results - SAXPY
[Chart: normalized execution time (%) vs. log(N) elements, N = 2^27 to 2^31, comparing Paging, Prefetch, and Pipelining+Prefetching; for the largest N the data does not fit in device memory]
Kmeans – 10 Iterations
[Chart: normalized execution time (%) vs. log(N) points, N = 2^27 to 2^31, comparing Paging, Prefetch, and Pipelining+Prefetching; for the largest N the data does not fit in device memory]
Kmeans – 20 Iterations
[Chart: normalized execution time (%) vs. log(N) points, N = 2^27 to 2^31, comparing Paging, Prefetch, and Pipelining; time axis up to 200%]
➢ Demand paging performs better here: the hardware learns to use a more suitable eviction policy
➢Developed an OpenMP Framework for UVM-capable GPUs
➢Developed optimizations to reduce page fault processing overhead
➢Prefetching
➢Pipelining
➢Optimizations enable reasonable improvements on benchmarks
➢Future Work: Developing more sophisticated pipelining strategies
Summary
Thanks & Questions?
• The LLVM community (including our many contributing vendors)
• The SOLLVE project (part of the Exascale Computing Project)
• OLCF, ORNL for access to the SummitDev system.
• ALCF, ANL, and DOE
Acknowledgements
References
• https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
• http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf
• https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view
• https://www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
• http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
• https://llvm-hpc3-workshop.github.io/slides/Bertolli.pdf