
Electronic copy available at: http://ssrn.com/abstract=2519478

BLAS Extensions for Algebraic Pricing Methods

Paolo Regondi
Global Valuation Limited
[email protected]

Mohammad Zubair
Old Dominion University
[email protected]

Claudio Albanese
Global Valuation Limited
claudio.albanese@global-valuation.com

Abstract

Partial differential equation (PDE) pricing methods such as backward and forward induction are typically implemented as unconditionally marginally stable algorithms in double precision for individual transactions. In this paper, we reconsider this strategy and argue that optimal GPU implementations should be based on a quite different strategy involving higher level BLAS routines. We argue that it is advantageous to use conditionally strongly stable algorithms in single precision and to price concurrently sub-portfolios of similar transactions. To support these operator algebraic methods, we propose some BLAS extensions. CUDA implementations of our extensions turn out to be significantly faster than implementations based on standard cuBLAS. The key to the performance gain of our implementation is in the efficient utilization of the memory system of the new GPU architecture.

1. Introduction

Recently, one of the authors of this paper developed a method for combined value-risk analysis within a global market, using a mathematical formalism built around matrix operations that makes it amenable to efficient implementation on massively parallel architectures like GPUs [1]. The method is based on fast exponentiation to find transition probability kernels on long time intervals and matrix-matrix multiplication to backtrack the pricing matrix for an entire portfolio. One of the matrix operations that dominates the overall execution time of the calibration component of the combined value-risk analysis framework is a BLAS-like operation, which we refer to as Sgemv4.

The Sgemv4 computation can be viewed as a multiplication-type operation of a vector of matrices $A_1, A_2, \ldots, A_m$ with a matrix $B$, where $A_i$ is multiplied by a number of columns $\{B_{j_1}, B_{j_2}, B_{j_3}, \ldots, B_{j_{q_i}}\}$, not necessarily contiguous, of matrix $B$. Note that $q_i$, the number of columns of $B$ used in the multiplication, varies and is a function of $i$. In this paper, we discuss different ways of implementing Sgemv4 on a Kepler GPU with varying levels of performance.

A naive implementation is to use a level-2 BLAS [2] routine for this operation. CUDA Toolkit 5.5 is shipped with the CUDA Basic Linear Algebra Subroutines (cuBLAS) library [4]. We make $q_i$ calls of cublasSgemv (part of cuBLAS) for multiplying $A_i$ by $q_i$ columns of $B$. We improve on the naive implementation by collecting the columns $\{B_{j_1}, B_{j_2}, B_{j_3}, \ldots, B_{j_{q_i}}\}$ to form a matrix and using cublasSgemm (part of cuBLAS), a level-3 BLAS [7] routine that replaces the $q_i$ calls of cublasSgemv. We observed that for small values of $q_i$, the performance of cublasSgemm is worse than that of repeated calls of cublasSgemv. The value of $q_i$ in our application ranges from 1 to 50. In our experimentation with matrices of size $1024 \times 1024$, we found that for $q_i < 8$ it is better not to use cublasSgemm. This required that we develop our own BLAS-like extension, Sgemm8, that is optimized for multiplying a matrix with one to eight columns. The NVIDIA cuBLAS library does not support such an extension (see footnote 1).

We developed an optimized implementation of Sgemm8 for the Tesla Kepler architecture. The performance of Sgemm8 is significantly better than that of cublasSgemv or cublasSgemm for matrix multiplication with one to eight vectors. Table 1 summarizes the performance of Sgemm8 as compared to that of the cuBLAS library on a GK104 device. Sgemm8 is 40% better than cublasSgemv for multiplying a matrix with a single vector, and is two times better than cublasSgemm for multiplying a matrix with eight vectors, for matrix sizes that occur in our financial application. The performance gain of Sgemm8 comes from a novel algorithm that utilizes the new intrinsic shuffle function, which allows sharing of data between threads of a warp and helps in reducing memory latency. Note that Sgemm8 is close to a level-2 BLAS routine, and its execution time is dominated by the time to access data from device memory.

We incorporated Sgemm8 in the Sgemv4 computation and observed an overall speedup by a factor of 7 as compared to a base-level implementation of Sgemv4 that uses cublasSgemv.

The rest of the paper is organized as follows. In the next section, we briefly discuss CUDA and the Kepler architecture. We discuss the implementation of Sgemv4 using the CUDA BLAS library in Sect. 3. In Sect. 4, we give details of the kernel Sgemm8 that is key to enhancing the performance of Sgemv4. We discuss performance results of Sgemv4 that utilizes the kernel Sgemm8 and cublasSgemm in Sect. 5. Finally, we conclude in Sect. 6.

Footnote 1. The Intel MKL library does support Sgem2vu, an extension of Sgemv for multiplying a matrix with two vectors [3].

Table 1. Performance comparison of Sgemm8 with cublasSgemv and cublasSgemm for matrix size $1024 \times 1024$.

No. of vectors   Sgemm8 (GFLOPS)   cublasSgemv (GFLOPS)   cublasSgemm (GFLOPS)
1                 52                30                      14
2                 95                30                      27
3                128                30                      31
4                161                30                      54
5                175                30                      68
6                188                30                      82
7                202                30                      94
8                200                30                     107
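The GFLOPS figures in Table 1 can be related to execution time with a small helper. The C sketch below is illustrative only: it assumes the conventional count of $2 n^2 q$ floating-point operations for multiplying an $n \times n$ matrix by $q$ vectors (the paper does not state its counting convention), and the function names are hypothetical.

```c
#include <stddef.h>

/* Floating-point operations for an n x n matrix times q vectors,
 * ASSUMING one multiply and one add per matrix element per vector. */
double matvec_flops(size_t n, size_t q)
{
    return 2.0 * (double)n * (double)n * (double)q;
}

/* GFLOPS achieved when the product takes `seconds` of wall time. */
double gflops(size_t n, size_t q, double seconds)
{
    return matvec_flops(n, q) / seconds / 1.0e9;
}

/* Time in microseconds implied by a given GFLOPS rate. */
double implied_time_us(size_t n, size_t q, double rate_gflops)
{
    return matvec_flops(n, q) / (rate_gflops * 1.0e9) * 1.0e6;
}
```

Under that assumption, the single-vector Sgemm8 rate of 52 GFLOPS at $n = 1024$ corresponds to roughly 40 microseconds per product.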

2. CUDA and NVIDIA Kepler

A typical program on a system with a single GPU device is a C/C++ program that uses CUDA APIs to move data between system memory and GPU device memory, and to launch computation kernels on the GPU [5]. Data is moved between system memory and device memory over the PCI Express (PCIe) bus. These transfers are costly, and therefore applications with a higher computation-to-I/O ratio are better suited for GPU computing. Where possible, these transfers should be minimized, and it is desirable to leave the data on the GPU if a subsequent kernel is going to use the same data. A GPU device uses several memory spaces that differ in their size, access latency, and read/write restrictions. These memory spaces include global, local, shared, constant, and texture memory, as well as registers. Global, local, and texture memory have the greatest access latency, followed by constant memory, registers, and shared memory.

CUDA provides an abstraction of a thread hierarchy that allows computations from different domains to map naturally onto the cores of the underlying hardware. The GPU hardware consists of a number of streaming multiprocessors, each of which in turn consists of multiple cores. Threads are organized in blocks, and one or more blocks run on a streaming multiprocessor. The threads in a block are further partitioned into subgroups of 32 threads referred to as warps. A warp runs on eight or sixteen cores of a streaming multiprocessor over multiple clock cycles. Typically, data sharing between threads of a block is facilitated by shared memory. The Kepler architecture supports another way of sharing data between threads of a warp, namely an intrinsic shuffle function. In this paper, we focus our implementation on the NVIDIA Kepler architecture.
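The partitioning of a block into warps can be made concrete with a small C sketch. It relies on the standard CUDA rule that threads of a block are linearized with the x index varying fastest and that each run of 32 consecutive linear indices forms a warp; the type and function names are hypothetical and serve only as illustration.

```c
enum { WARP_SIZE = 32 };

/* Position of a thread inside a CUDA-style thread block. */
typedef struct {
    int linear;  /* linearized thread index within the block */
    int warp;    /* index of the warp the thread belongs to  */
    int lane;    /* position of the thread inside its warp   */
} thread_pos;

/* Threads are linearized with the x index varying fastest, and each
 * run of 32 consecutive linear indices forms one warp. */
thread_pos locate_thread(int tx, int ty, int dim_x)
{
    thread_pos p;
    p.linear = tx + ty * dim_x;
    p.warp   = p.linear / WARP_SIZE;
    p.lane   = p.linear % WARP_SIZE;
    return p;
}
```

For a block that is 32 threads wide, a thread with indices (tx, ty) therefore sits in warp ty at lane tx.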

The Kepler architecture comes in two models, K10 (GK104) and K20 (GK110). We limit our discussion to K10, as we ran all our experiments on this model. Note that the K10 board has two GK104 devices. However, our performance results will hold on K20 also. All performance results reported in this paper are on a single GK104 device. The Kepler GK104 model has 8 streaming multiprocessors (SMX) with 192 cores on each SMX, for a total of 1536 cores. Kepler has a larger register file per multiprocessor than earlier models, which helps in improving occupancy [6].

[Bar chart; x-axis: No. Columns (1, 2, 3, 4, 5, 6, 7, 8, >8); y-axis: Frequency (0% to 30%).]
Figure 1. Frequency distribution for values of $q_i$.

3. Performance Results using cuBLAS

We tested the performance of Sgemv4 as it occurs in the calibration of a US dollar interest rate model that is part of the counterparty credit risk analysis framework. For this scenario, the number of $A$ matrices is $m = 57$, the size of each $A$ matrix is $1024 \times 1024$, and the size of the $B$ matrix is $1024 \times 1213$. Recall that in the Sgemv4 computation, we multiply a matrix $A_i$ with $q_i$ columns of $B$, for $i = 1$ to 57.

Base Level Using cublasSgemv. The base-level implementation of Sgemv4 was done using cublasSgemv. We make $q_i$ calls of cublasSgemv to multiply $A_i$ with $q_i$ columns of $B$, for $i = 1$ to 57. We collected the frequency distribution of the values of $q_i$ to capture how often we need to multiply a matrix with multiple vectors. The result is shown in Fig. 1. We also calculated the percentage of time spent for $q_i \le 8$ and $q_i > 8$; see Fig. 2. The reason for this was to explore whether using cublasSgemm can help in improving performance. The choice of 8 was based on our experimentation, which indicated that cublasSgemm has reasonable performance beyond 8 columns. In this figure we also include the percentage of time spent in data movement. After multiplying a vector of $B$ with the matrix, we need to store the result vector back in $B$ at the same location. This requires that we use a temporary buffer in device memory to hold the output vector before moving it back to the storage area of $B$. These results indicate that we are spending around 60% of the time on multiplying matrices with $q_i > 8$. This suggests that it may be better to use cublasSgemm when multiplying $A_i$ with multiple vectors, that is, for $q_i > 8$.
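The base-level scheme can be summarized by a CPU reference in C. This is a sketch under stated assumptions, not the authors' GPU code: matrices are taken to be column-major (as in the kernels shown later), the column selections are passed as a flat index array, and the names `sgemv4_base` and `gemv_colmajor` are hypothetical. The temporary vector plays the role of the device buffer described above.

```c
#include <stddef.h>
#include <string.h>

/* y = A * x for a column-major n x n matrix (stand-in for one Sgemv call). */
static void gemv_colmajor(size_t n, const float *A, const float *x, float *y)
{
    for (size_t r = 0; r < n; ++r) y[r] = 0.0f;
    for (size_t c = 0; c < n; ++c)
        for (size_t r = 0; r < n; ++r)
            y[r] += A[r + n * c] * x[c];
}

/* Base-level Sgemv4: for each matrix A_i, multiply it with its q[i]
 * selected columns of B and write each result back over the source column.
 *   A    : m matrices of size n x n, stored back to back, column-major
 *   B    : column-major matrix with n rows, updated in place
 *   q    : q[i] is the number of columns of B used with A_i
 *   cols : concatenated column indices, q[0] entries for A_0, then q[1], ...
 *   tmp  : scratch vector of length n                                     */
void sgemv4_base(size_t m, size_t n, const float *A, float *B,
                 const size_t *q, const size_t *cols, float *tmp)
{
    size_t next = 0;                           /* cursor into cols */
    for (size_t i = 0; i < m; ++i) {
        const float *Ai = A + i * n * n;
        for (size_t t = 0; t < q[i]; ++t) {
            float *b = B + n * cols[next++];
            gemv_colmajor(n, Ai, b, tmp);      /* one Sgemv-style call  */
            memcpy(b, tmp, n * sizeof(float)); /* copy result back to B */
        }
    }
}
```

The copy back into $B$ after every product is the data-movement cost that appears alongside the multiplication time in Fig. 2.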

[Bar chart; y-axis: Percentage of time. cublasScopy 21%, cublasSgemv (<= 8 cols) 19%, cublasSgemv (> 8 cols) 60%.]
Figure 2. Percentage of time spent by cublasSgemv for $q_i \le 8$ and $q_i > 8$. The cost of data movement required to support cublasSgemv is also shown in the figure.

[Bar chart; y-axis: Percentage of time. cublasScopy 61%, cublasSgemv (<= 8 cols) 26%, cublasSgemm (> 8 cols) 13%.]
Figure 3. Percentage of time spent by cublasSgemv for $q_i \le 8$, and cublasSgemm for $q_i > 8$. The cost of data movement required to support cublasSgemm and cublasSgemv is also shown in the figure.

Using cublasSgemv and cublasSgemm. For using cublasSgemm, we need to gather $q_i$ vectors of $B$ into a matrix. We make $q_i$ calls of cublasScopy to collect $q_i$ vectors into a matrix before making a call to cublasSgemm. The performance results of this approach, where we call cublasSgemm for $q_i > 8$, are shown in Fig. 3. Observe that although the execution time for matrix operations has been significantly reduced, the overhead of moving data within device memory has significantly increased. The major reason for this overhead is that we are making $q = \sum_{i=1}^{m} q_i$ copy calls, and each call copies only a small amount of data. To address this issue, we wrote two kernel programs: gather and scatter. The gather program copies all $q$ columns of $B$ scattered in memory to a temporary contiguous area of memory, and the scatter program does the reverse. This reduced the data movement time by a factor of 100. We also observed that for small values of $q_i$ in the range from 2 to 4, cublasSgemm is more expensive than calling cublasSgemv $q_i$ times. The performance results where we call cublasSgemm for $q_i > 8$ along with the efficient gather and scatter are shown in Fig. 4.
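The effect of the gather and scatter programs can be illustrated with a CPU reference in C. The paper does not list the GPU kernels, so the sketch below only mirrors their described behavior: it assumes column-major storage, and the function names and argument layout are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy the q selected columns of a column-major matrix B with n rows
 * into a contiguous n x q buffer, in the order given by cols. */
void gather_columns(size_t n, const float *B, const size_t *cols,
                    size_t q, float *packed)
{
    for (size_t t = 0; t < q; ++t)
        memcpy(packed + n * t, B + n * cols[t], n * sizeof(float));
}

/* Reverse of gather_columns: write packed column t back to column cols[t]. */
void scatter_columns(size_t n, float *B, const size_t *cols,
                     size_t q, const float *packed)
{
    for (size_t t = 0; t < q; ++t)
        memcpy(B + n * cols[t], packed + n * t, n * sizeof(float));
}
```

On the GPU, one launch of each program handles all $q$ columns at once, which is what removes the per-column call overhead described above.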

[Bar chart; y-axis: Percentage of time. Gather & Scatter 2%, cublasSgemv (<= 8 cols) 66%, cublasSgemm (> 8 cols) 32%.]
Figure 4. Percentage of time spent by cublasSgemv for $q_i \le 8$, and cublasSgemm for $q_i > 8$. The cost of data movement using efficient gather and scatter is also shown in the figure.

4. Sgemm8

The Sgemm8 kernel is an extension of BLAS that supports multiplication of a matrix with one to eight vectors. For the case of one vector, Sgemm8 is essentially a standard Sgemv BLAS routine. We first describe the implementation of Sgemm8 for the case of one vector, $y = Ax$, and later show how it can be extended to multiply two to eight vectors. To keep the discussion simple, we consider square matrices.

The key idea of the algorithm is to use the new intrinsic shuffle function, available on devices of compute capability 3.x, which enables data sharing between threads of a warp without going through shared memory. This has two benefits: (a) data is shared between threads with low latency, and (b) the use of shared memory is reduced, which in turn helps in improving occupancy. The shuffle function comes in four flavors [5].

• __shfl(): Copy from a specified source lane
• __shfl_up(): Copy from a lane with lower ID by a specified delta relative to caller
• __shfl_down(): Copy from a lane with higher ID by a specified delta relative to caller
• __shfl_xor(): Copy from a lane based on bitwise XOR of own lane ID
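The data movement performed by the four flavors can be emulated on the CPU. The C sketch below models a full 32-lane warp, with `in[l]` standing for the register value held by lane l and `out[l]` for the value that lane receives. The `warp_*` names are hypothetical, and the sketch assumes the default width of 32 with source lanes, deltas, and masks in the range 0 to 31.

```c
enum { WARP_SIZE = 32 };

/* Emulates the broadcast flavor: every lane reads the value held by lane src. */
void warp_shfl(const float in[WARP_SIZE], int src, float out[WARP_SIZE])
{
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        out[lane] = in[src & (WARP_SIZE - 1)];
}

/* Emulates the "up" flavor: lane l reads from lane l - delta; lanes
 * without a source lane keep their own value. Assumes 0 <= delta < 32. */
void warp_shfl_up(const float in[WARP_SIZE], int delta, float out[WARP_SIZE])
{
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        out[lane] = (lane >= delta) ? in[lane - delta] : in[lane];
}

/* Emulates the "down" flavor: lane l reads from lane l + delta; lanes
 * without a source lane keep their own value. Assumes 0 <= delta < 32. */
void warp_shfl_down(const float in[WARP_SIZE], int delta, float out[WARP_SIZE])
{
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        out[lane] = (lane + delta < WARP_SIZE) ? in[lane + delta] : in[lane];
}

/* Emulates the XOR flavor: lane l reads from lane (l XOR mask).
 * Assumes 0 <= mask < 32. */
void warp_shfl_xor(const float in[WARP_SIZE], int mask, float out[WARP_SIZE])
{
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        out[lane] = in[(lane ^ mask) & (WARP_SIZE - 1)];
}
```

Only the first of these, the broadcast from a chosen source lane, is needed by the kernel described in this section.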

In the proposed implementation, we use the __shfl() function to share a register value of a thread with the other threads in a warp. Fig. 5 illustrates the working of __shfl(xt, j).

The algorithm uses a thread block of size $D_x \times D_y$, where $D_x = 32$ and $D_y = 8$. A column of threads in a block represents a warp. Each warp is assigned to compute a vector of 32 partial values of $y$. A block of threads generates $D_y$ such vectors, which are summed to obtain a final vector of 32 elements of $y$. We now explain the computation for a warp; see Fig. 6.

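[Diagram: lanes 0, 1, 2, ..., j, ..., 31 of a warp, each holding a register xt; the value at lane j is delivered to every lane as gxt.]
Figure 5. Thread j reads element j of x into xt. The shuffle function, as shown in the code segment, broadcasts the value of xt at thread j to all threads in the warp.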

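[Diagram: a block of matrix A with 32 rows and $n_y = n/D_y$ columns, divided into chunks of 32 columns.]
Figure 6. Processing of a block of size $32 \times n_y$ by a warp. The block is processed in chunks of 32 columns.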

A warp processes a block of size $32 \times n_y$ of matrix $A$ in chunks of 32 columns, where $n_y = n/D_y$ (for this discussion we assume $n_y$ is a multiple of 32). At the beginning of the processing of a chunk, a warp loads 32 consecutive elements of $x$ into 32 registers, one per thread, which ensures a coalesced load. These 32 values of $x$ are consumed in the next 32 iterations of chunk processing by the warp. More specifically, in the first iteration the element of $x$ held by thread 0 is shared with all other threads in the warp and multiplied with the first column of the chunk of $A$. In the second iteration, the element of $x$ held by thread 1 is shared with all other threads in the warp and multiplied with the second column of the chunk. This process continues until all 32 elements of $x$ have been consumed. We then repeat the process for the next chunk. A complete listing of the GPU kernel is shown in Fig. 7. The code shown in Fig. 7 works for a square matrix whose size is a multiple of the block size (256 in our case). Lines 15-22 are where a warp processes a block of size $32 \times n_y$. Observe that at the start of processing a chunk, a warp loads 32 consecutive elements of $x$ (line 16). These values of $x$ are consumed when a warp processes a chunk (lines 17-21). In line 23, we store partial results in shared memory. A warp stores 32 partial results in consecutive locations in shared memory, ensuring no bank conflicts.

     1  #define BLOCK_X 32
     2  #define BLOCK_Y 8
     3  __global__ void
     4  newSgemv_kernel( float *A, float *x, float *y, int n)
     5  {
     6      __shared__ float sd[BLOCK_X*BLOCK_Y];
     7      int tx = threadIdx.x; int ty = threadIdx.y;
     8      int Dx = blockDim.x; int Dy = blockDim.y;
     9      int bx = blockIdx.x;
    10      int row = bx * Dx + tx;
    11      float res=0,xt;
    12      int ny = n/Dy; // assume ny is a multiple of 32
    13      int kb = ty*ny;
    14      float gxt;
    15      for (int k=kb; k < kb+ny; k+=32) {
    16          xt = x[k+tx]; // load element of x
    17          for (int j=0; j<32; ++j)
    18          {
    19              gxt = __shfl(xt, j); // thread j shares value of x with others
    20              res += A[row + n*(k+j)] * gxt;
    21          }
    22      }
    23      sd[tx+ty*Dx] = res;
    24      __syncthreads();
    25      while ( Dy > 1 )
    26      {
    27          Dy >>= 1;
    28          if (ty < Dy) sd[tx+ty*Dx] += sd[tx+Dx*(ty+Dy)];
    29          __syncthreads();
    30      }
    31      if ( ty == 0 ) y[row] = sd[tx]; // write final result in Y
    32  }

Figure 7. Sgemm8 kernel code for multiplying a matrix with one vector.
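The chunked scheme can be checked on the CPU with a C emulation that follows the same loop structure as the kernel in Fig. 7. This is a sketch for illustration, not the authors' code: the function name is hypothetical, the inner loops over tx stand in for the 32 threads of a warp running in lockstep, and reading `xt[j]` stands in for the shuffle broadcast.

```c
#include <stddef.h>

enum { DX = 32, DY = 8 };   /* block shape used in the text */

/* CPU emulation of the one-vector scheme for a column-major n x n matrix.
 * Assumes n is a multiple of DX * DY = 256, as in the text. */
void sgemv_warp_emulated(size_t n, const float *A, const float *x, float *y)
{
    size_t ny = n / DY;                        /* columns handled per warp */
    for (size_t bx = 0; bx < n / DX; ++bx) {   /* one thread block per 32 rows */
        float partial[DY][DX];                 /* plays the role of shared memory */
        for (size_t ty = 0; ty < DY; ++ty) {   /* one warp per value of ty */
            float res[DX] = {0};
            size_t kb = ty * ny;
            for (size_t k = kb; k < kb + ny; k += 32) {
                float xt[DX];                  /* one register per lane */
                for (size_t tx = 0; tx < DX; ++tx)
                    xt[tx] = x[k + tx];        /* load 32 consecutive x values */
                for (size_t j = 0; j < 32; ++j) {
                    float gxt = xt[j];         /* stands in for the shuffle */
                    for (size_t tx = 0; tx < DX; ++tx)
                        res[tx] += A[(bx * DX + tx) + n * (k + j)] * gxt;
                }
            }
            for (size_t tx = 0; tx < DX; ++tx)
                partial[ty][tx] = res[tx];
        }
        /* Tree reduction over the DY partial vectors, as in the kernel. */
        for (size_t d = DY / 2; d >= 1; d /= 2)
            for (size_t ty = 0; ty < d; ++ty)
                for (size_t tx = 0; tx < DX; ++tx)
                    partial[ty][tx] += partial[ty + d][tx];
        for (size_t tx = 0; tx < DX; ++tx)
            y[bx * DX + tx] = partial[0][tx];
    }
}
```

Comparing the output against a plain row-by-column product for a 256 × 256 matrix is a quick way to confirm that the column partitioning and the reduction cover every element of $A$ exactly once.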

Extension to multiple vectors. The code shown in Fig. 7 can be easily extended to handle multiple vectors. As an example, we show how we extend the code to handle two vectors; see Fig. 8. The two vectors are stored in column-major order in array x. As in the one-vector case, a warp processes a block of size $32 \times n_y$ of matrix $A$ in chunks of 32 columns (lines 15-26). At the beginning of the processing of a chunk, a warp loads 32 consecutive elements of both vectors (lines 16-17). These 32 values of both vectors are consumed in the next 32 iterations of chunk processing by the warp (lines 18-25). Observe that in the two-vector case, a warp loads 32 consecutive elements of the $A$ matrix and uses them twice. The bandwidth requirement increases only minimally compared to the one-vector case; as a result, the second vector multiplication comes nearly for free. As in the one-vector case, all loads and stores to device memory are coalesced, and there are no bank conflicts when accessing shared memory.
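The reuse of each loaded element of $A$ across several vectors can be sketched in plain C. The paper shows only the two-vector kernel; the generalization to any count up to eight below is an assumption made for illustration, the function name is hypothetical, and the warp decomposition is deliberately left out so that the single load and its repeated use stand out.

```c
#include <stddef.h>

enum { MAX_VECS = 8 };

/* Y = A * X for a column-major n x n matrix A and q <= MAX_VECS vectors
 * stored column-major in X (vector v starts at X + v * n). Each element
 * of A is read once and reused for all q accumulators.
 * Returns 0 on success, -1 if q is out of range. */
int matmul_few_vectors(size_t n, size_t q, const float *A,
                       const float *X, float *Y)
{
    if (q < 1 || q > MAX_VECS) return -1;
    for (size_t r = 0; r < n; ++r) {
        float res[MAX_VECS] = {0};           /* one accumulator per vector */
        for (size_t c = 0; c < n; ++c) {
            float a = A[r + n * c];          /* single load of the A element */
            for (size_t v = 0; v < q; ++v)
                res[v] += a * X[c + n * v];  /* reused for every vector */
        }
        for (size_t v = 0; v < q; ++v)
            Y[r + n * v] = res[v];
    }
    return 0;
}
```

Because the matrix traffic stays fixed while the arithmetic grows with the number of vectors, a memory-bound kernel of this shape can be expected to gain throughput as vectors are added, which is the trend reported in Table 1.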

     1  #define BLOCK_X 32
     2  #define BLOCK_Y 8
     3  __global__ void
     4  newSgemmv_kernel( float *A, float *x, float *y, int n)
     5  {
     6      __shared__ float sd1[BLOCK_X*BLOCK_Y];
     7      __shared__ float sd2[BLOCK_X*BLOCK_Y];
     8      int tx = threadIdx.x; int ty = threadIdx.y;
     9      int Dx = blockDim.x; int Dy = blockDim.y;
    10      int bx = blockIdx.x;
    11      int row = bx * Dx + tx;
    12      float res1=0,res2=0,xt1,xt2;
    13      int ny = n/Dy; // assume ny is a multiple of 32.
    14      int kb = ty*ny; float gxt1,gxt2,a;
    15      for (int k=kb; k < kb+ny; k+=32) {
    16          xt1 = x[k+tx]; // fetch element of first vector
    17          xt2 = x[n+k+tx]; // fetch element of second vector
    18          for (int j=0; j<32; ++j)
    19          {
    20              gxt1 = __shfl(xt1, j);
    21              a = A[row + n*(k+j)];
    22              gxt2 = __shfl(xt2, j);
    23              res1 += a * gxt1;
    24              res2 += a * gxt2;
    25          }
    26      }
    27      sd1[tx+ty*Dx] = res1;
    28      sd2[tx+ty*Dx] = res2;
    29      __syncthreads();
    30      while ( Dy > 1 )
    31      {
    32          Dy >>= 1;
    33          if (ty < Dy) {
    34              sd1[tx+ty*Dx] += sd1[tx+Dx*(ty+Dy)];
    35              sd2[tx+ty*Dx] += sd2[tx+Dx*(ty+Dy)];
    36          }
    37          __syncthreads();
    38      }
    39      if ( ty == 0 ) { // Write final result in Y
    40          y[row] = sd1[tx];
    41          y[n+row] = sd2[tx];
    42      }
    43  }

Figure 8. Sgemm8 kernel code for multiplying a matrix with two vectors.

Performance results of Sgemm8. We compare the performance of Sgemm8 for various matrix sizes with that of cublasSgemv and cublasSgemm; the results are summarized in Figures 9-12. These results indicate that the performance of Sgemm8 is 40% better than that of cublasSgemv for multiplying a matrix with a single vector for matrix sizes up to 1500. For larger matrices, Sgemm8 has consistent performance in the single-vector case and is slightly better than cublasSgemv. The performance of Sgemm8 is up to two times better than that of cublasSgemm or cublasSgemv for multiplying a matrix with multiple columns. We also summarize the performance results for Sgemm8 with a varying number of columns for the matrix size $1024 \times 1024$ that occurs in our financial application; see Fig. 13.

[Plot; x-axis: Matrix Size (N x N); y-axis: GFLOPS (0 to 70); series: Sgemm8, cublasSgemv, cublasSgemm.]
Figure 9. Performance comparison of Sgemm8 with cuBLAS routines for multiplying a matrix with one column.

[Plot; x-axis: Matrix Size (N x N); y-axis: GFLOPS (0 to 150).]
Figure 10. Performance comparison of Sgemm8 with cuBLAS routines for multiplying a matrix with two columns.

[Plot; x-axis: Matrix Size (N x N); y-axis: GFLOPS (0 to 200).]
Figure 11. Performance comparison of Sgemm8 with cuBLAS routines for multiplying a matrix with three columns.

[Plot; x-axis: Matrix Size (N x N); y-axis: GFLOPS (0 to 225).]
Figure 12. Performance comparison of Sgemm8 with cuBLAS routines for multiplying a matrix with four columns.

[Bar chart; x-axis: No. of columns (1 to 8); y-axis: GFLOPS (0 to 250); series: Sgemm8, cublasSgemm.]
Figure 13. Performance results for Sgemm8 with varying number of columns.

5. Performance Results for Sgemv4 using Sgemm8 Kernel

The performance results where we call Sgemm8 for $q_i \le 8$ and cublasSgemm for $q_i > 8$, along with the efficient gather and scatter, are shown in Fig. 14. The performance comparison of the base level with the final implementation, which has all the optimizations (efficient gather/scatter, use of Sgemm8 for 1 to 8 columns, and cublasSgemm for more than 8 columns), is shown in Fig. 15. Observe that the total execution time of the base-level implementation is 6100 ms, compared to 857 ms for the final implementation using all the optimizations, giving an overall speedup of over 7.

[Bar chart; y-axis: Percentage of time. Gather & Scatter 3%, Sgemm8 34%, cublasSgemm (> 8 cols) 63%.]
Figure 14. Percentage of time spent by Sgemm8 for $q_i \le 8$, and cublasSgemm for $q_i > 8$. The cost of data movement using efficient gather and scatter is also shown in the figure.

[Bar chart; y-axis: Time (ms). Base level vs. final implementation: cublasScopy 1310 vs. Gather & Scatter 22; cublasSgemv 1135 vs. Sgemm8 295 (<= 8 cols); cublasSgemv 3655 vs. cublasSgemm 540 (> 8 cols).]
Figure 15. The performance comparison of the base level with the final implementation that has all the optimizations like efficient gather/scatter, use of Sgemm8 for columns 1 to 8, and cublasSgemm for columns greater than 8.
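The routing rule of the final implementation and the speedup arithmetic can be written down in a few lines of C. The sketch is illustrative: the type and function names are hypothetical, and the three-way split of the timings follows the bar labels of Fig. 15.

```c
#include <stddef.h>

typedef enum { USE_SGEMM8, USE_CUBLAS_SGEMM } kernel_choice;

/* Routing rule of the final implementation: up to eight columns go to
 * the custom kernel, more than eight go to the library Sgemm. */
kernel_choice choose_kernel(size_t q)
{
    return (q <= 8) ? USE_SGEMM8 : USE_CUBLAS_SGEMM;
}

/* Time components in milliseconds, split as in Fig. 15. */
typedef struct {
    double data_movement;   /* copy, or gather and scatter        */
    double upto8_cols;      /* products with at most 8 columns    */
    double over8_cols;      /* products with more than 8 columns  */
} timing_ms;

double total_ms(timing_ms t)
{
    return t.data_movement + t.upto8_cols + t.over8_cols;
}

double speedup(timing_ms base, timing_ms optimized)
{
    return total_ms(base) / total_ms(optimized);
}
```

With the values read from Fig. 15, the base level sums to 1310 + 1135 + 3655 = 6100 ms and the final implementation to 22 + 295 + 540 = 857 ms, a ratio of about 7.1.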

6. Conclusion

In this paper, we give an optimized implementation of a financial calibration application that relies heavily on matrix operations. We developed a BLAS-like kernel, Sgemm8, in support of the financial application, which is used for multiplying a matrix with one to eight vectors. We demonstrated that the performance of Sgemm8 is significantly better than that of cublasSgemv or cublasSgemm for matrix multiplication with one to eight vectors. Sgemm8 is 40% better than cublasSgemv for multiplying a matrix with a single vector, and is two times better than cublasSgemm for multiplying a matrix with eight vectors, for matrix sizes that occur in our financial application. The performance gain of Sgemm8 comes from a novel algorithm that utilizes the new intrinsic shuffle function, which allows sharing of data between threads of a warp and helps in reducing latency. We demonstrated that the calibration model speeds up by a factor of over 7 when we use Sgemm8, compared to a base-level implementation using cuBLAS routines.

References

[1] C. Albanese, T. Bellaj, G. Gimonet, and G. Pietronero. Coherent global market simulations and securitization measures for counterparty credit risk. Quantitative Finance, 11(1):1–20, 2011. http://EconPapers.repec.org/.

[2] E. Anderson, Z. Bai, C. H. Bischof, S. Blackford, J. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, USA, 3rd edition, 1999. ISBN 0-89871-447-8. http://www.netlib.org/lapack/lug/.

[3] Intel Corporation. Computes two matrix-vector products using a general matrix, 2013. http://software.intel.com/en-us/node/468648.

[4] NVIDIA Corporation. NVIDIA CUDA Basic Linear Algebra Subroutines, 2013. https://developer.nvidia.com/cublas.

[5] NVIDIA Corporation. CUDA C Programming Guide, 2013. http://docs.nvidia.com/cuda/cuda-c-programming-guide/.

[6] NVIDIA Corporation. Tuning CUDA Applications for Kepler, 2013. http://docs.nvidia.com/cuda/kepler-tuning-guide/.

[7] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 16:1–28, 1990. (Algorithm 679).
