Page 1

NVIDIA Research

Thrust

Jared Hoberock and Nathan Bell

Page 2

© 2008 NVIDIA Corporation

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}

Diving In

Page 3

Objectives

- Programmer productivity: rapidly develop complex applications; leverage parallel primitives
- Encourage generic programming: don't reinvent the wheel (e.g. one reduction to rule them all)
- High performance with minimal programmer effort
- Interoperability: integrates with CUDA C/C++ code

Page 4

What is Thrust?

C++ template library for CUDA that mimics the Standard Template Library (STL).

Containers:
- thrust::host_vector<T>
- thrust::device_vector<T>

Algorithms:
- thrust::sort()
- thrust::reduce()
- thrust::inclusive_scan()
- etc.

Page 5

Containers

Make common operations concise and readable; hides cudaMalloc, cudaMemcpy and cudaFree.

// allocate host vector with two elements
thrust::host_vector<int> h_vec(2);

// copy host vector to device
thrust::device_vector<int> d_vec = h_vec;

// manipulate device values from the host
d_vec[0] = 13;
d_vec[1] = 27;
std::cout << "sum: " << d_vec[0] + d_vec[1] << std::endl;

// vector memory automatically released w/ free() or cudaFree()

Page 6

Containers

Compatible with STL containers (vector, list, map, ...), which eases integration.

// list container on host
std::list<int> h_list;
h_list.push_back(13);
h_list.push_back(27);

// copy list to device vector
thrust::device_vector<int> d_vec(h_list.size());
thrust::copy(h_list.begin(), h_list.end(), d_vec.begin());

// alternative method
thrust::device_vector<int> d_vec(h_list.begin(), h_list.end());

Page 7

Iterators

Sequences are defined by a pair of iterators.

// allocate device vector
thrust::device_vector<int> d_vec(4);

d_vec.begin(); // returns iterator at first element of d_vec
d_vec.end();   // returns iterator one past the last element of d_vec

// [begin, end) pair defines a sequence of 4 elements

Page 8

Iterators

Iterators act like pointers.

// allocate device vector
thrust::device_vector<int> d_vec(4);

thrust::device_vector<int>::iterator begin = d_vec.begin();
thrust::device_vector<int>::iterator end   = d_vec.end();

int length = end - begin; // compute size of sequence [begin, end)

end = d_vec.begin() + 3; // define a sequence of 3 elements

Page 9

Iterators

Use iterators like pointers.

// allocate device vector
thrust::device_vector<int> d_vec(4);

thrust::device_vector<int>::iterator begin = d_vec.begin();

*begin = 13;       // same as d_vec[0] = 13;
int temp = *begin; // same as temp = d_vec[0];

begin++; // advance iterator one position

*begin = 25; // same as d_vec[1] = 25;

Page 10

Iterators

Iterators track their memory space (host or device), which guides algorithm dispatch.

// initialize random values on host
thrust::host_vector<int> h_vec(1000);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// copy values to device
thrust::device_vector<int> d_vec = h_vec;

// compute sum on host
int h_sum = thrust::reduce(h_vec.begin(), h_vec.end());

// compute sum on device
int d_sum = thrust::reduce(d_vec.begin(), d_vec.end());

Page 11

Iterators

Convertible to raw pointers.

// allocate device vector
thrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector's memory
int * ptr = thrust::raw_pointer_cast(&d_vec[0]);

// use ptr in a CUDA C kernel
my_kernel<<<N/256, 256>>>(N, ptr);

// Note: ptr cannot be dereferenced on the host!

Page 12

Iterators

Wrap raw pointers with device_ptr.

int N = 10;

// raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

// access device memory through device_ptr
dev_ptr[0] = 1;

// free memory
cudaFree(raw_ptr);

Page 13: An Overview for the Library to Thrust

© 2008 NVIDIA Corporation

C++ supports namespacesThrust uses thrust namespace

thrust::device_vector

thrust::copy

STL uses std namespacestd::vector

std::list

Avoids collisionsthrust::sort()

std::sort()

For brevityusing namespace thrust;

Namespaces

13

Page 14

Recap

Containers:
- Manage host & device memory
- Automatic allocation and deallocation
- Simplify data transfers

Iterators:
- Behave like pointers
- Keep track of memory spaces
- Convertible to raw pointers

Namespaces:
- Avoid collisions

Page 15

C++ Background

Function templates

// function template to add numbers (type of T is variable)
template <typename T>
T add(T a, T b)
{
    return a + b;
}

// add integers
{
    int x = 10, y = 20, z;
    z = add<int>(x, y); // type of T explicitly specified
    z = add(x, y);      // type of T determined automatically
}

// add floats
{
    float x = 10.0f, y = 20.0f, z;
    z = add<float>(x, y); // type of T explicitly specified
    z = add(x, y);        // type of T determined automatically
}

Page 16

C++ Background

Function objects (functors)

// templated functor to add numbers
template <typename T>
class add
{
  public:
    T operator()(T a, T b)
    {
        return a + b;
    }
};

{
    int x = 10, y = 20, z;
    add<int> func;  // create an add functor for T=int
    z = func(x, y); // invoke functor on x and y
}

{
    float x = 10, y = 20, z;
    add<float> func; // create an add functor for T=float
    z = func(x, y);  // invoke functor on x and y
}

Page 17

C++ Background

Generic algorithms

// apply function f to sequences x, y and store result in z
template <typename T, typename Function>
void transform(int N, T * x, T * y, T * z, Function f)
{
    for (int i = 0; i < N; i++)
        z[i] = f(x[i], y[i]);
}

const int N = 100;
int x[N]; int y[N]; int z[N];

add<int> func; // add functor for T=int

transform(N, x, y, z, func); // compute z[i] = x[i] + y[i]

transform(N, x, y, z, add<int>()); // equivalent

Page 18

Algorithms

Thrust provides many standard algorithms:
- Transformations
- Reductions
- Prefix sums
- Sorting

Generic definitions over general types:
- Built-in types (int, float, ...)
- User-defined structures

General operators:
- reduce with plus operator
- scan with maximum operator

Page 19

Algorithms

General types and operators

#include <thrust/reduce.h>

// declare storage
device_vector<int>   i_vec = ...
device_vector<float> f_vec = ...

// sum of integers (equivalent calls)
reduce(i_vec.begin(), i_vec.end());
reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());

// sum of floats (equivalent calls)
reduce(f_vec.begin(), f_vec.end());
reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());

// maximum of integers
reduce(i_vec.begin(), i_vec.end(), 0, maximum<int>());

Page 20

Algorithms

General types and operators

struct negate_float2
{
    __host__ __device__
    float2 operator()(float2 a)
    {
        return make_float2(-a.x, -a.y);
    }
};

// declare storage
device_vector<float2> input  = ...
device_vector<float2> output = ...

// create functor
negate_float2 func;

// negate vectors
transform(input.begin(), input.end(), output.begin(), func);

Page 21

Algorithms

General types and operators

// compare x component of two float2 structures
struct compare_float2
{
    __host__ __device__
    bool operator()(float2 a, float2 b)
    {
        return a.x < b.x;
    }
};

// declare storage
device_vector<float2> vec = ...

// create comparison functor
compare_float2 comp;

// sort elements by x component
sort(vec.begin(), vec.end(), comp);

Page 22

Algorithms

Operators with state

// predicate with internal state: true when x exceeds the threshold
struct is_greater_than
{
    int threshold;

    is_greater_than(int t) { threshold = t; }

    __host__ __device__
    bool operator()(int x) { return x > threshold; }
};

device_vector<int> vec = ...

// create predicate functor (returns true for x > 10)
is_greater_than pred(10);

// count number of values > 10
int result = count_if(vec.begin(), vec.end(), pred);

Page 23

Recap

Algorithms are generic:
- Support general types and operators

Statically dispatched based on iterator type:
- Memory space is known at compile time

Have default arguments:
- reduce(begin, end)
- reduce(begin, end, init, binary_op)

Page 24

Fancy Iterators

Behave like "normal" iterators; algorithms don't know the difference.

Examples:
- constant_iterator
- counting_iterator
- transform_iterator
- permutation_iterator
- zip_iterator

Page 25

Fancy Iterators

constant_iterator mimics an infinite array filled with a constant value.

// create iterators
constant_iterator<int> begin(10);
constant_iterator<int> end = begin + 3;

begin[0]   // returns 10
begin[1]   // returns 10
begin[100] // returns 10

// sum of [begin, end)
reduce(begin, end); // returns 30 (i.e. 3 * 10)

Page 26

Fancy Iterators

counting_iterator mimics an infinite array with sequential values.

// create iterators
counting_iterator<int> begin(10);
counting_iterator<int> end = begin + 3;

begin[0]   // returns 10
begin[1]   // returns 11
begin[100] // returns 110

// sum of [begin, end)
reduce(begin, end); // returns 33 (i.e. 10 + 11 + 12)

Page 27

Fancy Iterators

transform_iterator
- Yields a transformed sequence
- Facilitates kernel fusion

(Figure: a function F applied lazily, element by element, to the sequence X, Y, Z.)

Page 28

Fancy Iterators

transform_iterator conserves memory capacity and bandwidth.

// initialize vector
device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterators (types omitted)
begin = make_transform_iterator(vec.begin(), negate<int>());
end   = make_transform_iterator(vec.end(),   negate<int>());

begin[0] // returns -10
begin[1] // returns -20
begin[2] // returns -30

// sum of [begin, end)
reduce(begin, end); // returns -60 (i.e. -10 + -20 + -30)

Page 29

Fancy Iterators

zip_iterator
- Looks like an array of structs (AoS)
- Stored as a structure of arrays (SoA)

(Figure: the arrays [A, B, C] and [X, Y, Z] viewed as one sequence of pairs.)

Page 30

Fancy Iterators

zip_iterator

// initialize vectors
device_vector<int>  A(3);
device_vector<char> B(3);
A[0] = 10;  A[1] = 20;  A[2] = 30;
B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

// create iterators (types omitted)
begin = make_zip_iterator(make_tuple(A.begin(), B.begin()));
end   = make_zip_iterator(make_tuple(A.end(),   B.end()));

begin[0] // returns tuple(10, 'x')
begin[1] // returns tuple(20, 'y')
begin[2] // returns tuple(30, 'z')

// maximum of [begin, end)
maximum< tuple<int,char> > binary_op;
reduce(begin, end, begin[0], binary_op); // returns tuple(30, 'z')

Page 31

Best Practices

- Fusion: combine related operations together
- Structure of arrays: ensure memory coalescing
- Implicit sequences: eliminate memory accesses

Page 32

Fusion

Combine related operations together to conserve memory bandwidth.

Example: SNRM2 (Euclidean norm)
- Square each element
- Compute the sum of squares and take sqrt()

Page 33

Fusion

Unoptimized implementation

// define transformation f(x) -> x^2
struct square
{
    __host__ __device__
    float operator()(float x) { return x * x; }
};

float snrm2_slow(device_vector<float>& x)
{
    // without fusion
    device_vector<float> temp(x.size());
    transform(x.begin(), x.end(), temp.begin(), square());

    return sqrt( reduce(temp.begin(), temp.end()) );
}

Page 34

Fusion

Optimized implementation (3.8x faster)

// define transformation f(x) -> x^2
struct square
{
    __host__ __device__
    float operator()(float x) { return x * x; }
};

float snrm2_fast(device_vector<float>& x)
{
    // with fusion
    return sqrt( transform_reduce(x.begin(), x.end(), square(), 0.0f, plus<float>()) );
}

Page 35

Structure of Arrays (SoA)

Array of structures (AoS):
- Often does not obey coalescing rules
- device_vector<float3>

Structure of arrays (SoA):
- Obeys coalescing rules
- Components stored in separate arrays: device_vector<float> x, y, z;

Example: rotating 3D vectors; SoA is 2.8x faster.

Page 36

Structure of Arrays (SoA)

struct rotate_float3
{
    __host__ __device__
    float3 operator()(float3 v)
    {
        float x = v.x;
        float y = v.y;
        float z = v.z;

        float rx = 0.36f*x + 0.48f*y + -0.80f*z;
        float ry =-0.80f*x + 0.60f*y +  0.00f*z;
        float rz = 0.48f*x + 0.64f*y +  0.60f*z;

        return make_float3(rx, ry, rz);
    }
};

...

device_vector<float3> vec(N);

transform(vec.begin(), vec.end(), vec.begin(), rotate_float3());

Page 37

Structure of Arrays (SoA)

struct rotate_tuple
{
    __host__ __device__
    tuple<float,float,float> operator()(tuple<float,float,float> v)
    {
        float x = get<0>(v);
        float y = get<1>(v);
        float z = get<2>(v);

        float rx = 0.36f*x + 0.48f*y + -0.80f*z;
        float ry =-0.80f*x + 0.60f*y +  0.00f*z;
        float rz = 0.48f*x + 0.64f*y +  0.60f*z;

        return make_tuple(rx, ry, rz);
    }
};

...

device_vector<float> x(N), y(N), z(N);

transform(make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          make_zip_iterator(make_tuple(x.end(),   y.end(),   z.end())),
          make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          rotate_tuple());

Page 38

Implicit Sequences

Avoid storing sequences explicitly:
- Constant sequences: [1, 1, 1, 1, ...]
- Incrementing sequences: [0, 1, 2, 3, ...]

Implicit sequences require no storage:
- constant_iterator
- counting_iterator

Example: index of the smallest element.

Page 39

Implicit Sequences

// return the smaller of two tuples
struct smaller_tuple
{
    tuple<float,int> operator()(tuple<float,int> a, tuple<float,int> b)
    {
        if (a < b)
            return a;
        else
            return b;
    }
};

int min_index(device_vector<float>& vec)
{
    // create explicit index sequence [0, 1, 2, ...)
    device_vector<int> indices(vec.size());
    sequence(indices.begin(), indices.end());

    tuple<float,int> init(vec[0], 0);
    tuple<float,int> smallest;

    smallest = reduce(make_zip_iterator(make_tuple(vec.begin(), indices.begin())),
                      make_zip_iterator(make_tuple(vec.end(),   indices.end())),
                      init,
                      smaller_tuple());

    return get<1>(smallest);
}

Page 40

Implicit Sequences

// return the smaller of two tuples
struct smaller_tuple
{
    tuple<float,int> operator()(tuple<float,int> a, tuple<float,int> b)
    {
        if (a < b)
            return a;
        else
            return b;
    }
};

int min_index(device_vector<float>& vec)
{
    // create implicit index sequence [0, 1, 2, ...)
    counting_iterator<int> begin(0);
    counting_iterator<int> end(vec.size());

    tuple<float,int> init(vec[0], 0);
    tuple<float,int> smallest;

    smallest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)),
                      make_zip_iterator(make_tuple(vec.end(),   end)),
                      init,
                      smaller_tuple());

    return get<1>(smallest);
}

Page 41

Recap

Best practices:
- Fusion: 3.8x faster
- Structure of arrays: 2.8x faster
- Implicit sequences: 3.4x faster

Page 43

HWU 2011 Ch26-9780123859631 2011/8/22 15:33 Page 359 #1

CHAPTER 26

Thrust: A Productivity-Oriented Library for CUDA

Nathan Bell and Jared Hoberock

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

26.1 MOTIVATION

With the introduction of CUDA C/C++, developers can harness the massive parallelism of the GPU through a standard programming language. CUDA allows developers to make fine-grained decisions about how computations are decomposed into parallel threads and executed on the device. The level of control offered by CUDA C/C++ (henceforth CUDA C) is an important feature: it facilitates the development of high-performance algorithms for a variety of computationally demanding tasks which (1) merit significant optimization and (2) profit from low-level control of the mapping onto hardware. For this class of computational tasks CUDA C is an excellent solution.

Thrust [1] solves a complementary set of problems, namely those that are (1) implemented efficiently without a detailed mapping of work onto the target architecture or those that (2) do not merit or simply will not receive significant optimization effort by the user. With Thrust, developers describe their computation using a collection of high-level algorithms and completely delegate the decision of how to implement the computation to the library. This abstract interface allows programmers to describe what to compute without placing any additional restrictions on how to carry out the computation. By capturing the programmer's intent at a high level, Thrust has the discretion to make informed decisions on behalf of the programmer and select the most efficient implementation.

The value of high-level libraries is broadly recognized in high-performance computing. For example, the widely-used BLAS standard provides an abstract interface to common linear algebra operations. First conceived more than three decades ago, BLAS remains relevant today in large part because it allows valuable, platform-specific optimizations to be introduced behind a uniform interface.

Whereas BLAS is focused on numerical linear algebra, Thrust provides an abstract interface to fundamental parallel algorithms such as scan, sort, and reduction. Thrust leverages the power of C++ templates to make these algorithms generic, enabling them to be used with arbitrary user-defined types and operators. Thrust establishes a durable interface for parallel computing with an eye towards generality, programmer productivity, and real-world performance.

GPU Computing Gems © 2012 Elsevier Inc. All rights reserved.

Appears in GPU Computing Gems: Jade Edition, published 2011 by Morgan Kaufmann Publishers, http://mkp.com/news/3405
Page 44


26.2 DIVING IN

Before going into greater detail, let us consider the program in Listing 26.1, which illustrates the salient features of Thrust.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}

Listing 26.1. A complete Thrust program which sorts data on the GPU.

Thrust provides two vector containers: host_vector and device_vector. As the names suggest, host_vector is stored in host memory while device_vector lives in device memory on the GPU. Like the vector container in the C++ STL, host_vector and device_vector are generic containers (i.e., they are able to store any data type) that can be resized dynamically. As the example shows, containers automate the allocation and deallocation of memory and simplify the process of exchanging data between the host and device.

The program acts on the vector containers using the generate, sort, and copy algorithms. Here, we adopt the STL convention of specifying ranges using pairs of iterators. In this example, the iterators h_vec.begin() and h_vec.end() can be thought of as a pair of int pointers, where the former points to the first element in the array and the latter to the element one past the end of the array. Together the pair defines a range of integers of size h_vec.end() - h_vec.begin().

Page 45

Note that even though the computation implied by the call to the sort algorithm suggests one or more CUDA kernel launches, the programmer has not specified a launch configuration. Thrust's interface abstracts these details. The choice of performance-sensitive variables such as grid and block size, the details of memory management, and even the choice of sorting algorithm are left to the discretion of the library implementor.

26.2.1 Iterators and Memory Spaces

Although vector iterators are similar to pointers, they carry additional information. Notice that we did not have to instruct the sort algorithm that it was operating on the elements of a device_vector or hint that the copy was from device memory to host memory. In Thrust the memory spaces of each range are automatically inferred from the iterator arguments and used to dispatch the appropriate implementation.

In addition to memory space, Thrust's iterators implicitly encode a wealth of information which can guide the dispatch process. For instance, our sort example above operates on ints, a primitive data type with a fundamental comparison operation. In this case, Thrust dispatches a highly-tuned Radix Sort algorithm [2] which is considerably faster than alternative comparison-based sorting algorithms such as Merge Sort [3]. It is important to realize that this dispatch process incurs no performance or storage overhead: metadata encoded by iterators exists only at compile time, and dispatch strategies based on it are selected statically. In general, Thrust's static dispatch strategies may capitalize on any information that is derivable from the type of an iterator.

26.2.2 Interoperability

Thrust is implemented entirely within CUDA C/C++ and maintains interoperability with the rest of the CUDA ecosystem. Interoperability is an important feature because no single language or library is the best tool for every problem. For example, although Thrust algorithms use CUDA features like shared memory internally, there is no mechanism for users to exploit shared memory directly through Thrust. Therefore, it is sometimes necessary for applications to access CUDA C directly to implement a certain class of specialized algorithms, as illustrated in the software stack of Figure 26.1.

Interfacing Thrust to CUDA C is straightforward and analogous to the use of the C++ STL with standard C code. Data that resides in a Thrust container can be accessed by external libraries by

FIGURE 26.1. Thrust is an abstraction layer on top of CUDA C/C++. (Software stack, top to bottom: Application; Thrust alongside BLAS, FFT, ...; CUDA C/C++; CUDA.)

Page 46


size_t N = 1024;

// allocate Thrust container
device_vector<int> d_vec(N);

// extract raw pointer from container
int * raw_ptr = raw_pointer_cast(&d_vec[0]);

// use raw_ptr in non-Thrust functions
cudaMemset(raw_ptr, 0, N * sizeof(int));

// pass raw_ptr to a kernel
my_kernel<<<N / 128, 128>>>(N, raw_ptr);

// memory is automatically freed

(a) Interfacing Thrust to CUDA

size_t N = 1024;

// raw pointer to device memory
int * raw_ptr;
cudaMalloc(&raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
device_ptr<int> dev_ptr = device_pointer_cast(raw_ptr);

// use device_ptr in Thrust algorithms
sort(dev_ptr, dev_ptr + N);

// access device memory through device_ptr
dev_ptr[0] = 1;

// free memory
cudaFree(raw_ptr);

(b) Interfacing CUDA to Thrust

Listing 26.2. Thrust interoperates smoothly with CUDA C/C++.

extracting a "raw" pointer from the vector. The code sample in Listing 26.2 illustrates the use of raw_pointer_cast to obtain an int pointer to the contents of a device_vector.

Applying Thrust algorithms to raw pointers is also straightforward. Once the raw pointer has been wrapped by a device_ptr it can be used like an ordinary Thrust iterator. The wrapped pointer provides the memory space information Thrust needs to invoke the appropriate algorithm implementation and also allows a convenient mechanism for accessing device memory from the host.

Thrust's native CUDA C interoperability is a powerful feature. Interoperability ensures that Thrust always complements CUDA C and that a Thrust plus CUDA C combination is never worse than either Thrust or CUDA C alone. Indeed, while it may be possible to write whole parallel applications entirely with Thrust functions, it is often valuable to implement domain-specific functionality directly in CUDA C. The level of abstraction targeted by native CUDA C affords programmers fine-grained control over the precise mapping of computational resources to a particular problem. Programming at this level provides developers the flexibility to implement exotic or otherwise specialized algorithms. Interoperability also facilitates an iterative development strategy: (1) quickly prototype a parallel application entirely in Thrust, (2) identify the application's hot spots, and (3) write more specialized algorithms in CUDA C and optimize as necessary.

26.3 GENERIC PROGRAMMING

Thrust presents a style of programming emphasizing genericity and composability. Indeed, the vast majority of Thrust's functionality is derived from four fundamental parallel algorithms: for_each, reduce, scan, and sort. For example, the transform algorithm is a derivative of for_each while inner_product is implemented with reduce.

Page 47


Thrust algorithms are generic in both the type of the data to be processed and the operations to be applied to the data. For instance, the reduce algorithm may be employed to compute the sum of a range of integers (a plus reduction applied to int data) or the maximum of a range of floating point values (a max reduction applied to float data). This generality is implemented via C++ templates, which allows user-defined types and functions to be used in addition to built-in types such as int or float or Thrust operators such as plus.

Generic algorithms are extremely valuable because it is impractical to anticipate precisely which particular types and operators users will require. Indeed, while the computational structure of an algorithm is fixed, the number of instantiations of the algorithm is truly limitless. However, it is worth remarking that while Thrust's interface is general, the abstraction affords implementors the opportunity to specialize for specific types and operations known to be important use cases. As with inferences from memory space, these opportunities may be exploited statically.

In Thrust, user-defined operations take the form of C++ function objects, or functors. Functors allow the programmer to adapt a generic algorithm to implement a specific user-defined operation. For example, the code samples in Listing 26.3 implement SAXPY, the well-known BLAS operation, using CUDA C and Thrust respectively. Here, the generic transform algorithm is called with the user-defined saxpy_functor.

26.4 BENEFITS OF ABSTRACTION

In this section we'll describe the benefits of Thrust's abstraction layer with respect to programmer productivity, robustness, and real-world performance.

26.4.1 Programmer Productivity

Thrust's high-level algorithms enhance programmer productivity by automating the mapping of computational tasks onto the GPU. Recall the two implementations of SAXPY shown in Listing 26.3. In the CUDA C implementation of SAXPY the programmer has described a specific decomposition of the parallel vector operation into a grid of blocks with 256 threads per block. In contrast, the Thrust implementation does not prescribe a launch configuration. Instead, the only specifications are the input and output ranges and a functor to apply to them. Otherwise, the two codes are roughly the same in terms of length and code complexity.

Delegating the launch configuration to Thrust has a subtle yet profound implication: the launch parameters can be automatically chosen based on a model of machine performance. Currently, Thrust targets maximal occupancy and will compare the resource usage of the kernel (e.g., number of registers, amount of shared memory) with the resources of the target GPU to determine a launch configuration with highest occupancy. While the maximal occupancy heuristic is not necessarily optimal, it is straightforward to compute and effective in practice. Furthermore, there is nothing to preclude the use of more sophisticated performance models. For instance, a run-time tuning system that examined hardware performance counters could be introduced behind this abstraction without altering client code.

Thrust also boosts programmer productivity by providing a rich set of algorithms for common patterns. For instance, the map-reduce pattern is conveniently implemented with Thrust's sort_by_key and reduce_by_key algorithms, which implement key-value sorting and reduction respectively.

Page 48


__global__
void saxpy_kernel(int n, float a, float * x, float * y)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < n)
        y[i] = a * x[i] + y[i];
}

void saxpy(int n, float a, float * x, float * y)
{
    // set launch configuration parameters
    int block_size = 256;
    int grid_size  = (n + block_size - 1) / block_size;

    // launch saxpy kernel
    saxpy_kernel<<< grid_size, block_size >>>(n, a, x, y);
}

(a) CUDA C

struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
    float operator()(float x, float y)
    {
        return a * x + y;
    }
};

void saxpy(float a, device_vector<float>& x, device_vector<float>& y)
{
    // setup functor
    saxpy_functor func(a);

    // call transform
    transform(x.begin(), x.end(), y.begin(), y.begin(), func);
}

(b) Thrust

Listing 26.3. SAXPY implementations in CUDA C and Thrust.

26.4.2 Robustness
Thrust's abstraction layer also enhances the robustness of CUDA applications. In the previous section we noted that by delegating the launch configuration details to Thrust we could automatically obtain maximum occupancy during execution. In addition to maximizing occupancy, the abstraction layer also


ensures that algorithms “just work,” even in uncommon or pathological use cases. For instance, Thrust automatically handles limits on grid dimensions (no more than 64K), works around limitations on the size of __global__ function arguments, and accommodates large user-defined types in most algorithms. To the degree possible, Thrust circumvents such factors and ensures correct program execution across the full spectrum of CUDA-capable GPUs.

26.4.3 Real-World Performance
In addition to enhancing programmer productivity and improving robustness, the high-level abstractions provided by Thrust improve performance in real-world use cases. In this section we examine two instances where the discretion afforded by Thrust's high-level interface is exploited for meaningful performance gains.

To begin, consider the operation of filling an array with a particular value. In Thrust, this is implemented with the fill algorithm. Unfortunately, a straightforward implementation of this seemingly simple operation is subject to severe performance hazards. Recall that processors based on the G80 architecture (i.e., Compute Capability 1.0 and 1.1) impose strict conditions on which memory access patterns may benefit from memory coalescing [4]. In particular, memory accesses of sub-word granularity (i.e., less than four bytes) are not coalesced by these processors. This artifact is detrimental to performance when initializing arrays of char or short types.

Fortunately, the iterators passed to fill implicitly encode all the information necessary to intercept this case and substitute an optimized implementation. Specifically, when fill is dispatched for smaller types, Thrust selects a “wide” version of the algorithm that issues word-sized accesses per thread. While this optimization is straightforward to implement, users are unlikely to invest the effort of making this optimization themselves. Nevertheless, the benefit, shown in Table 26.1, is worthwhile, particularly on earlier architectures.
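The idea behind the wide fill can be sketched on the host (an illustrative sequential analogue, not Thrust's device code): replicate the byte into a 32-bit word and issue one word-sized store per four elements, with a scalar loop for any leftover tail.

```cpp
#include <cstring>
#include <cstdint>
#include <cstddef>
#include <vector>

// Host sketch of a "wide" byte fill: four bytes are written per store
// instead of one. Thrust's device version applies the same trick so
// that each thread issues word-sized (coalesceable) accesses.
void wide_fill(unsigned char* data, std::size_t n, unsigned char value)
{
    uint32_t word = value * 0x01010101u;   // replicate byte into all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        std::memcpy(data + i, &word, 4);   // one 4-byte store per iteration
    for (; i < n; ++i)
        data[i] = value;                   // scalar tail for the remainder
}
```

The memcpy avoids the aliasing pitfalls of casting a char pointer to a word pointer; on the GPU the equivalent concern is alignment of the destination range.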

Like fill, Thrust's sorting functionality exploits the discretion afforded by the abstract sort and stable_sort functions. As long as the algorithm achieves the promised result, we are free to utilize

Table 26.1 Memory Bandwidth of Two fill Kernels

GPU                data type   naive fill     thrust::fill   Speedup
GeForce 8800 GTS   char          1.2 GB/s      41.2 GB/s     34.15x
                   short         2.4 GB/s      41.2 GB/s     17.35x
                   int          41.2 GB/s      41.2 GB/s      1.00x
                   long         40.7 GB/s      40.7 GB/s      1.00x
GeForce GTX 280    char         33.9 GB/s      75.0 GB/s      2.21x
                   short        51.6 GB/s      75.0 GB/s      1.45x
                   int          75.0 GB/s      75.0 GB/s      1.00x
                   long         69.2 GB/s      69.2 GB/s      1.00x
GeForce GTX 480    char         74.1 GB/s     156.9 GB/s      2.12x
                   short       136.6 GB/s     156.9 GB/s      1.15x
                   int         146.1 GB/s     156.9 GB/s      1.07x
                   long        156.9 GB/s     156.9 GB/s      1.00x


FIGURE 26.2
Sorting 32-bit integers on the GeForce GTX 480: Thrust's dynamic sorting optimizations improve performance by a considerable margin in common use cases. [Plot of sorting performance (M keys/s) versus bits per key.]

sophisticated static (compile-time) and dynamic (run-time) optimizations to implement the sorting operation in the most efficient manner.

As mentioned in Section 26.2.1, Thrust statically selects a highly-optimized Radix Sort algorithm [2] for sorting primitive types (e.g., char, int, float, and double) with the standard less comparison operator. For all other types (e.g., user-defined data types) and comparison operators, Thrust uses a general Merge Sort algorithm. Because sorting primitives with Radix Sort is considerably faster than Merge Sort, this static optimization has significant value.

Thrust also applies dynamic optimizations to improve sorting performance. Before invoking the Radix Sort, Thrust quickly computes the minimum and maximum among the keys to be sorted. Since the cost of Radix Sort is proportional to the number of significant key bits, we can exploit knowledge of the extremal values to reduce the cost of sorting. For instance, when all integer keys are in the range [0, 16), only four bits must be sorted, and we observe a 2.71× speedup versus a full 32-bit sort. The relationship between key bits and radix sort performance is plotted in Figure 26.2.
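The core of this dynamic optimization is inexpensive to compute. A host-side sketch (illustrative only; Thrust's actual implementation also inspects the minimum key to skip common high bits) derives the number of radix passes needed from the largest key:

```cpp
#include <cstdint>
#include <algorithm>
#include <vector>

// Given the keys to be sorted, count how many low-order bits a radix
// sort actually has to process: the bit width of the maximum key.
// Keys in [0, 16) need only 4 of the 32 bits, so most radix passes
// can be skipped entirely.
int significant_bits(const std::vector<uint32_t>& keys)
{
    uint32_t max_key = *std::max_element(keys.begin(), keys.end());
    int bits = 0;
    while (max_key != 0) {
        ++bits;
        max_key >>= 1;
    }
    return bits;
}
```

Since the min/max scan is a single cheap reduction over the keys, its cost is easily repaid whenever the key range is narrow.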

26.5 BEST PRACTICES
In this section we highlight three high-level optimization techniques that programmers may employ to yield significant performance speedups when using Thrust.


26.5.1 Fusion
The balance of computational resources on modern GPUs implies that algorithms are often bandwidth limited. Specifically, computations with low arithmetic intensity, the ratio of calculations per memory access, are constrained by the available memory bandwidth and do not fully utilize the computational resources of the GPU. One technique for increasing the computational intensity of an algorithm is to fuse multiple pipeline stages together into a single operation. In this section we demonstrate how Thrust enables developers to exploit opportunities for kernel fusion and better utilize GPU memory bandwidth.

The simplest form of kernel fusion is scalar function composition. For example, suppose we have the functions f(x) → y and g(y) → z and would like to compute g(f(x)) → z for a range of scalar values. The most straightforward approach is to read x from memory, compute the value y = f(x), and write y to memory; we then do the same to compute z = g(y). In Thrust this approach would be implemented with two separate calls to the transform algorithm, one for f and one for g. While this approach is straightforward to understand and implement, it needlessly wastes memory bandwidth, which is a scarce resource.

A better approach is to fuse the functions into a single operation g(f(x)) and halve the number of memory transactions. Unless f and g are computationally expensive operations, the fused implementation will run approximately twice as fast as the first approach. In general, scalar function composition is a profitable optimization and should be applied liberally.

Thrust enables developers to exploit other, less-obvious opportunities for fusion. For example, consider the two Thrust implementations of the BLAS function SNRM2 shown in Listing 26.4, which computes the Euclidean norm of a float vector.

Note that SNRM2 has low arithmetic intensity: each element of the vector participates in only two floating-point operations, one multiply (to square the value) and one addition (to sum values together). Therefore, SNRM2 is an ideal candidate for fusion, and the transform_reduce implementation, which fuses the square transformation with a plus reduction, should be considerably faster. Indeed this is true: snrm2_fast is fully 3.8 times faster than snrm2_slow for a 16M element vector on a Tesla C1060.
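The fused pattern has a direct host-side analogue in C++17's std::transform_reduce, which mirrors the behavior of thrust::transform_reduce (a sequential sketch for illustration, not the device code):

```cpp
#include <numeric>
#include <functional>
#include <cmath>
#include <vector>

// Host analogue of snrm2_fast: the squaring transformation is fused
// with the summing reduction into a single pass over the data, so
// each element is read exactly once and no temporary array is needed.
float snrm2_host(const std::vector<float>& x)
{
    float sum_of_squares = std::transform_reduce(
        x.begin(), x.end(),
        0.0f,                                // initial value of the reduction
        std::plus<float>(),                  // reduction operator
        [](float v) { return v * v; });      // transformation applied per element
    return std::sqrt(sum_of_squares);
}
```

As in the Thrust version, the unfused alternative (std::transform into a temporary followed by std::accumulate) would read and write the temporary vector, doubling the memory traffic.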

While the previous examples represent some of the more common opportunities for fusion, we have only scratched the surface. As we have seen, fusing a transformation with other algorithms is a worthwhile optimization. However, Thrust would become unwieldy if all algorithms came with a transform variant. For this reason Thrust provides transform_iterator, which allows transformations to be fused with any algorithm. Indeed, transform_reduce is simply a convenience wrapper for the appropriate combination of transform_iterator and reduce. Similarly, Thrust provides permutation_iterator, which enables gather and scatter operations to be fused with other algorithms.

26.5.2 Structure of Arrays
In the previous section we examined how fusion minimizes the number of off-chip memory transactions and conserves bandwidth. Another way to improve memory efficiency is to ensure that all memory accesses benefit from coalescing, since coalesced memory access patterns are considerably faster than non-coalesced transactions.


struct square
{
    __host__ __device__
    float operator()(float x) const
    {
        return x * x;
    }
};

float snrm2_slow(const thrust::device_vector<float>& x)
{
    // without fusion
    device_vector<float> temp(x.size());
    transform(x.begin(), x.end(), temp.begin(), square());

    return sqrt( reduce(temp.begin(), temp.end()) );
}

float snrm2_fast(const thrust::device_vector<float>& x)
{
    // with fusion
    return sqrt( transform_reduce(x.begin(), x.end(), square(), 0.0f, plus<float>()) );
}

Listing 26.4. SNRM2 has low arithmetic intensity and therefore benefits greatly from fusion.

struct float3
{
    float x;
    float y;
    float z;
};

float3 *aos;
...
aos[0].x = 1.0f;

(a) Array of Structures

struct float3_soa
{
    float *x;
    float *y;
    float *z;
};

float3_soa soa;
...
soa.x[0] = 1.0f;

(b) Structure of Arrays

Listing 26.5. Data layouts for three-dimensional float vectors.

Perhaps the most common violation of the memory coalescing rules arises when using a so-called Array of Structures (AoS) data layout. Generally speaking, access to the elements of an array filled with C struct or C++ class variables will be uncoalesced. Only special structures such as uint2 or float4 satisfy the memory coalescing rules.

An alternative to the AoS layout is the Structure of Arrays (SoA) approach, where the components of each struct are stored in separate arrays. Listing 26.5 illustrates the AoS and SoA methods of representing a range of three-dimensional float vectors. The advantage of the SoA method is that regular


struct rotate_tuple
{
    __host__ __device__
    tuple<float,float,float> operator()(tuple<float,float,float>& t)
    {
        float x = get<0>(t);
        float y = get<1>(t);
        float z = get<2>(t);

        float rx =  0.36f * x + 0.48f * y + -0.80f * z;
        float ry = -0.80f * x + 0.60f * y +  0.00f * z;
        float rz =  0.48f * x + 0.64f * y +  0.60f * z;

        return make_tuple(rx, ry, rz);
    }
};

...

device_vector<float> x(N), y(N), z(N);

transform(make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          make_zip_iterator(make_tuple(x.end(),   y.end(),   z.end())),
          make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          rotate_tuple());

Listing 26.6. The zip_iterator facilitates processing of data in structure of arrays format.

access to the x, y, and z components of a given vector is coalesceable (because float satisfies the coalescing rules), while regular access to the float3 structures in the AoS approach is not.

The problem with SoA is that there is nothing to logically encapsulate the members of each element into a single entity. Whereas we could immediately apply Thrust algorithms to AoS containers like device_vector<float3>, we have no direct means of doing the same with three separate device_vector<float> containers. Fortunately, Thrust provides zip_iterator, which provides encapsulation of SoA ranges.

The zip_iterator [5] takes a number of iterators and zips them together into a virtual range of tuples. For instance, binding three device_vector<float> iterators together yields a range of type tuple<float,float,float>, which is analogous to the float3 structure.

Consider the code sample in Listing 26.6, which uses zip_iterator to construct a range of three-dimensional float vectors stored in SoA format. Each vector is transformed by a rotation matrix in the rotate_tuple functor before being written out again. Note that zip_iterator is used for both input and output ranges, transparently packing the underlying scalar ranges into tuples and then unpacking the tuples into the scalar ranges. On a Tesla C1060, this SoA implementation is 2.85× faster than the analogous AoS implementation (not shown).
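As a sanity check on the listing's coefficients, the matrix applied by rotate_tuple is a genuine rotation, so it must preserve vector length. A host-side sketch (illustrative, using std::array in place of Thrust tuples) makes this easy to verify:

```cpp
#include <cmath>
#include <array>

// Same coefficients as the rotate_tuple functor in Listing 26.6,
// applied to a plain 3-vector on the host.
std::array<float,3> rotate(const std::array<float,3>& v)
{
    float x = v[0], y = v[1], z = v[2];
    return {  0.36f * x + 0.48f * y - 0.80f * z,
             -0.80f * x + 0.60f * y + 0.00f * z,
              0.48f * x + 0.64f * y + 0.60f * z };
}

// Euclidean length of a 3-vector; a rotation leaves this unchanged.
float norm(const std::array<float,3>& v)
{
    return std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
}
```

Each row of the matrix has unit length (e.g., 0.36² + 0.48² + 0.80² = 1) and the rows are mutually orthogonal, which is what makes it a rotation.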

26.5.3 Implicit Ranges
In the previous sections we considered ways to efficiently transform ranges of values and ways to construct ad hoc tuples of values from separate ranges. In either case, there was some underlying data


stored explicitly in memory. In this section we illustrate the use of implicit ranges, i.e., ranges whose values are defined programmatically and not stored anywhere in memory.

For instance, consider the problem of finding the index of the element with the smallest value in a given range. We could implement a special reduction kernel for this algorithm, which we'll call min_index, but that would be time-consuming and unnecessary. A better approach is to implement min_index in terms of existing functionality, such as a specialized reduction over (value, index) tuples, to achieve the desired result. Specifically, we can zip the range of values v[0], v[1], v[2], . . .

together with a range of integer indices 0, 1, 2, . . . to form a range of tuples (v[0], 0), (v[1], 1), (v[2], 2), . . . and then implement min_index with the standard reduce algorithm. Unfortunately, this scheme will be much slower than a customized reduction kernel, since the index range must be created and stored explicitly in memory.

To resolve this issue Thrust provides counting_iterator [5], which acts just like the explicit range of values we need to implement min_index, but does not carry any overhead. Specifically, when counting_iterator is dereferenced it generates the appropriate value “on the fly” and yields that value to the caller. An efficient implementation of min_index using counting_iterator is shown in Listing 26.7.

struct smaller_tuple
{
    tuple<float,int> operator()(tuple<float,int> a, tuple<float,int> b)
    {
        // return the tuple with the smaller float value
        if (get<0>(a) < get<0>(b))
            return a;
        else
            return b;
    }
};

int min_index(device_vector<float>& values)
{
    // [begin,end) form the implicit sequence [0, 1, 2, ... values.size())
    counting_iterator<int> begin(0);
    counting_iterator<int> end(values.size());

    // initial value of the reduction
    tuple<float,int> init(values[0], 0);

    // compute the smallest tuple
    tuple<float,int> smallest = reduce(make_zip_iterator(make_tuple(values.begin(), begin)),
                                       make_zip_iterator(make_tuple(values.end(),   end)),
                                       init,
                                       smaller_tuple());

    // return the index
    return get<1>(smallest);
}

Listing 26.7. Implicit ranges improve performance by conserving memory bandwidth.


Here counting_iterator has allowed us to efficiently implement a special-purpose reduction algorithm without the need to write a new, special-purpose kernel. In addition to counting_iterator, Thrust provides constant_iterator, which defines an implicit range of constant value. Note that these implicitly-defined iterators can be combined with the other iterators to create more complex implicit ranges. For instance, counting_iterator can be used in combination with transform_iterator to produce a range of indices with nonunit stride.

In practice there is no need to implement min_index, since Thrust's min_element algorithm provides the equivalent functionality. Nevertheless, the min_index example is instructive of best practices. Indeed, Thrust algorithms such as min_element, max_element, and find_if apply the exact same strategy internally.
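For comparison, the equivalent computation on the host takes two lines with the STL (illustrative; Thrust's min_element behaves analogously on device ranges, returning an iterator whose offset from the beginning of the range is the answer):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Host equivalent of min_index: min_element returns an iterator to the
// smallest value, and the index is its distance from the start of the
// range. No explicit index array is ever materialized.
int min_index_host(const std::vector<float>& values)
{
    auto it = std::min_element(values.begin(), values.end());
    return static_cast<int>(std::distance(values.begin(), it));
}
```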

References
[1] J. Hoberock, N. Bell, Thrust: A Parallel Template Library, Version 1.4.0, 2011.
[2] D. Merrill, A. Grimshaw, Revisiting sorting for GPGPU stream architectures, Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, 2010.
[3] N. Satish, M. Harris, M. Garland, Designing efficient sorting algorithms for manycore GPUs, in: Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium, IEEE Computer Society, Washington, DC, 2009.
[4] NVIDIA Corporation, CUDA C Best Practices Guide v3.2, NVIDIA Corporation, Santa Clara, CA, 2010 (Section 3.2.1).
[5] Boost Iterator Library. www.boost.org/doc/libs/release/libs/iterator/.



Abstract
This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

