
Electronic copy available at: http://ssrn.com/abstract=1673626

An efficient GPU-based parallel algorithm for pricing multi-asset American options

Duy Minh Dang, Christina C. Christara and Kenneth R. Jackson

Department of Computer Science

University of Toronto, Toronto, ON, M5S 3G4, Canada

{dmdang, ccc, krj}@cs.toronto.edu

Abstract—We develop highly efficient parallel Partial Differential Equation (PDE) based pricing methods on Graphics Processing Units (GPUs) for multi-asset American options. Our pricing approach is built upon a combination of a discrete penalty approach for the linear complementarity problem arising due to the free boundary and a GPU-based parallel Alternating Direction Implicit Approximate Factorization technique with finite differences on uniform grids for the solution of the linear algebraic system arising from each penalty iteration. A timestep size selector implemented efficiently on GPUs is used to further increase the efficiency of the methods. We demonstrate the efficiency and accuracy of the parallel numerical methods by pricing American options written on three assets.

Index Terms—American option, multi-asset, penalty method, Alternating Direction Implicit Approximate Factorization, time adaptivity, Graphics Processing Units, GPUs, parallel computing, finite difference

I. INTRODUCTION

The pricing of an American option is a challenging task, mainly due to the early exercise feature of the option, which leads to an additional constraint that the value of an American option must be greater than or equal to its payoff [15]. This constraint requires special treatment, a fact that makes an explicit closed form solution for an American option intractable for most cases. Consequently, numerical methods must be used. Recently, multi-asset options, i.e. options written on more than one underlying asset, have become increasingly popular. The problem of pricing multi-asset American options is not only mathematically challenging but also very computationally intensive. As a result, there is great interest in developing efficient numerical methods for pricing multi-asset American options.

Although several approaches, such as lattice (tree) methods and Monte Carlo simulations, can be used for pricing an American option, for problems in low dimensions, i.e. less than four dimensions, the partial differential equation (PDE) approach is very popular, due to its efficiency, global character and ease in computing accurate hedging parameters, such as delta and gamma. Using a PDE approach, the American option pricing problem can be formulated as a time-dependent linear complementarity problem (LCP) with the inequalities involving the Black-Scholes PDE and some additional constraints [17]. In this paper, we adopt the penalty method of [6] to solve the LCP. In this approach, a penalty term is added to the discretized equations to enforce the early exercise constraint. The solution of the resulting discrete nonlinear equations at each timestep can be computed via a penalty iteration.¹ An advantage of the penalty method of [6] is that it is readily extendible to handle multi-factor problems. In a multi-dimensional application, applying direct methods, such as LU factorization, to solve the linear system arising at each penalty iteration can be computationally expensive. A very popular alternative is to use iterative methods, such as Biconjugate Gradient Stabilized (BiCGStab), in combination with a preconditioning technique, such as an Incomplete LU factorization [15]. Another possible approach is to employ Alternating Direction Implicit Approximate Factorization (ADI-AF) techniques, which involve solving only a few tridiagonal systems in each spatial dimension. It is rather surprising that, while these efficient techniques have been widely used in the numerical solution of multi-dimensional nonlinear PDEs arising in computational fluid dynamics [18], to the best of our knowledge, these techniques have not been successfully extended to multi-asset American option pricing.

¹ The penalty iteration described in [6] is essentially a Newton iteration, but, to be consistent with [6], we use the term “penalty iteration” throughout this paper.

Over the last few years, the rapid evolution of Graphics Processing Units (GPUs) into powerful, cost-efficient, programmable computing architectures for general purpose computations has provided application potential beyond the primary purpose of graphics processing. In computational finance, although there has been great interest in utilizing GPUs in developing efficient pricing architectures for computationally intensive problems [1], [4], the existing literature on GPU-based numerical methods for multi-asset American option pricing is rather sparse and mostly focuses on Monte Carlo simulations [1] and quadrature integrations [16]. The literature on GPU-based PDE methods for multi-asset American options is even less developed. In addition, to the best of our knowledge, a combination of an efficient GPU-based parallelization of ADI-AF techniques with a penalty approach for the pricing of multi-asset American options has not been previously discussed in the literature. These shortcomings motivated our work.

This paper discusses the application of GPUs to price multi-factor American options in the Black-Scholes framework via a PDE approach. Our approach is built upon the penalty method of [6] and an efficient GPU-based parallel ADI-AF algorithm, extended from [5], for solving the linear algebraic system arising at each penalty iteration. Finite difference (FD) methods on uniform grids are considered for the space discretization of the pricing PDE, while several finite difference schemes, such as Crank-Nicolson and two-level backward difference formula, are used for its time discretization. A timestep size selector, efficiently implemented on the GPU, is used to further increase the performance of the methods. The results of this paper demonstrate the efficiency of the parallel numerical methods and show that GPUs can provide a significant increase in performance over CPUs when pricing multi-factor options with early exercise features. Although we primarily focus on a three-factor model, many of the ideas and results in this paper can be naturally extended to higher-dimensional applications with constraints.

The remainder of this paper is organized as follows. Section II presents a PDE formulation of the pricing problem for a multi-asset American option and the discretization methods. To illustrate our approach, we apply it to American put options on the arithmetic average of three underlying assets. A penalty iteration for the discretized American option and associated ADI-AF schemes are discussed in Section III. Section IV introduces a simple, but effective, timestep size selector. Section V discusses a GPU-based parallel implementation of the ADI-AF methods and of the timestep selector. Numerical results and related discussions are presented in Section VI. Section VII concludes the paper and outlines possible future work.

II. FORMULATION OF THE PRICING PDE AND DISCRETIZATION

A. Formulation

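We denote by $s_i(t)$, $i = 1, 2, 3$, the value at time $t$ of the $i$th underlying asset, by $T$ the expiry time of the option, and by $\tau = T - t$ the time to expiry. For simplicity, let $s = (s_1, s_2, s_3)$. The early exercise constraint leads to the following LCP for the value $u(s, \tau)$ of an American put option [17]

$$
\left\{ \begin{array}{l} \dfrac{\partial u}{\partial \tau} - \mathcal{L}u = 0 \\[4pt] u - u^* \ge 0 \end{array} \right.
\quad \text{or} \quad
\left\{ \begin{array}{l} \dfrac{\partial u}{\partial \tau} - \mathcal{L}u > 0 \\[4pt] u - u^* = 0 \end{array} \right. ,
\qquad s \in \Omega \equiv (0, s_{1,\infty}) \times (0, s_{2,\infty}) \times (0, s_{3,\infty}), \ \tau \in (0, T], \tag{1}
$$

subject to the initial (payoff) condition

$$
u(s, 0) = u^*(s) \equiv \max\Big(E - \sum_{i=1}^{3} w_i s_i, \, 0\Big) \quad \text{on } (\partial\Omega \cup \Omega) \times \{0\}, \tag{2}
$$

and the boundary conditions [11]

$$
u(s, \tau) = u^*(s) \quad \text{on } \partial\Omega \times (0, T], \tag{3}
$$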

where

$$
\mathcal{L}u \equiv \frac{1}{2} \sum_{i,j=1}^{3} \rho_{ij} \sigma_i \sigma_j s_i s_j \frac{\partial^2 u}{\partial s_i \partial s_j} + \sum_{i=1}^{3} (r - d_i) s_i \frac{\partial u}{\partial s_i} - ru. \tag{4}
$$

Here, $\partial\Omega$ is the boundary of $\Omega$; $r > 0$ is the constant riskless interest rate; $d_i \ge 0$ is the constant asset dividend yield; $\sigma_i \ge 0$ is the constant volatility of the $i$th underlying asset; $\rho_{ij}$ is the correlation factor between the $i$th and $j$th assets satisfying $|\rho_{ij}| \le 1$ for $i, j = 1, 2, 3$, and $\rho_{ii} = 1$ for $i = 1, 2, 3$; $E > 0$ is the strike price of the option; $w_i > 0$ is the weight of the $i$th asset in the basket; and $s_{i,\infty}$, $i = 1, 2, 3$, is the right boundary of the spatial domain of the $i$th underlying asset. In the exact mathematical formulation of the problem, $s_{i,\infty} = \infty$, but, in the numerical approximation, we truncate the semi-infinite domain and take $s_{i,\infty}$ to be an appropriately chosen large value, as is explained in more detail in Section VI.
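As a small illustration of the payoff in (2), the following Python sketch evaluates $u^*$ for a put on a weighted basket. The function name and the sample strike, weights and prices are hypothetical, chosen only to make the formula concrete.

```python
def basket_put_payoff(s, w, strike):
    """Payoff u*(s) = max(E - sum_i w_i * s_i, 0) of a put on a weighted basket."""
    if len(s) != len(w):
        raise ValueError("s and w must have the same length")
    basket = sum(wi * si for wi, si in zip(w, s))
    return max(strike - basket, 0.0)


# Hypothetical example: strike E = 100 and equal weights on three assets,
# i.e. a put on the arithmetic average of the three prices.
weights = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
in_the_money = basket_put_payoff((90.0, 80.0, 100.0), weights, 100.0)
out_of_the_money = basket_put_payoff((120.0, 110.0, 100.0), weights, 100.0)
```

With these inputs the basket averages are 90 and 110, so the first payoff is approximately 10 and the second is zero.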

Following [6], we use a penalty parameter $\zeta'$, $\zeta' \to \infty$, and consider the non-linear PDE for the penalty formulation of the price $u(s, \tau)$ of an American put option written on three underlying assets

$$
\frac{\partial u}{\partial \tau} - \mathcal{L}u = \zeta' \max(u^* - u, 0), \qquad s \in \Omega, \ \tau \in [0, T], \tag{5}
$$

subject to the initial and boundary conditions (2) and (3), respectively. The penalty parameter $\zeta'$ effectively ensures that the solution satisfies $u - u^* \ge -\epsilon$ for $0 < \epsilon \ll 1$. Essentially, in the region where $u \ge u^*$, the PDE (5) resembles the three-dimensional (3-D) Black-Scholes equation. On the other hand, when $-\epsilon \le u - u^* < 0$, the 3-D Black-Scholes inequality $\frac{\partial u}{\partial \tau} - \mathcal{L}u > 0$ is satisfied and $u \approx u^*$.

B. Discretization

For the rest of the paper, we adopt the following notation. Let the number of subintervals be $n + 1$, $p + 1$ and $q + 1$, in the $s_1$-, $s_2$- and $s_3$-directions, respectively. The uniform grid mesh widths in the respective direction are denoted by $\Delta s_1 = \frac{s_{1,\infty}}{n+1}$, $\Delta s_2 = \frac{s_{2,\infty}}{p+1}$, and $\Delta s_3 = \frac{s_{3,\infty}}{q+1}$. Let the time interval $[0, T]$, for a given number of subintervals $l$, be partitioned via

$$
0 = \tau_0 < \tau_1 < \ldots < \tau_l = T,
$$

with

$$
\Delta\tau_m = \tau_m - \tau_{m-1}, \qquad c_m = \frac{\Delta\tau_m}{\Delta\tau_{m-1}}, \qquad m = 1, \ldots, l. \tag{6}
$$

Let the gridpoint values of a FD approximation be denoted by

$$
u^m_{i,j,k} \approx u(s_{1i}, s_{2j}, s_{3k}, \tau_m) = u(i\Delta s_1, j\Delta s_2, k\Delta s_3, \tau_m),
$$

where $i = 0, \ldots, n + 1$, $j = 0, \ldots, p + 1$, $k = 0, \ldots, q + 1$, $m = 0, 1, \ldots, l$.

1) Space discretization: For the discretization of the space variables in the differential operator $\mathcal{L}$, we employ second-order central differences in the interior of the rectangular domain $\Omega$. Second-order FD approximations to the first and second partial derivatives of the space variables in (4) are obtained by central schemes, while the cross-derivatives are approximated by a four-point FD stencil. For example, at the reference point $(s_{1i}, s_{2j}, s_{3k}, \tau_m)$, $\frac{\partial u}{\partial s_1}$ and $\frac{\partial^2 u}{\partial s_1^2}$ are respectively approximated by

$$
\frac{\partial u}{\partial s_1} \approx \frac{u^m_{i+1,j,k} - u^m_{i-1,j,k}}{2\Delta s_1}, \qquad
\frac{\partial^2 u}{\partial s_1^2} \approx \frac{u^m_{i+1,j,k} - 2u^m_{i,j,k} + u^m_{i-1,j,k}}{(\Delta s_1)^2}, \tag{7}
$$

while the cross derivative $\frac{\partial^2 u}{\partial s_1 \partial s_2}$ is approximated by

$$
\frac{u^m_{i+1,j+1,k} + u^m_{i-1,j-1,k} - u^m_{i-1,j+1,k} - u^m_{i+1,j-1,k}}{4\Delta s_1 \Delta s_2}. \tag{8}
$$

Similar approximations can be obtained for the remaining spatial derivatives. For brevity, we omit the derivations of (7) and (8), but, using Taylor expansions, it can be verified that each of these formulas has a second-order truncation error, provided that the function $u$ is sufficiently smooth. At the spatial grid $\Omega$, the FD discretization of the spatial differential operator $\mathcal{L}$ of (5) is performed by replacing each spatial derivative appearing in the operator $\mathcal{L}$ with its corresponding FD scheme (as in (7) and (8)). We denote by $\mathcal{L}u^m_{i,j,k}$ the FD discretization of $\mathcal{L}$ at $(s_{1i}, s_{2j}, s_{3k}, \tau_m)$.
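The stencils in (7) and (8) can be checked numerically. The sketch below is only an illustration: it works on a 2-D slice (fixed $k$ and $m$) stored as a nested list `u[i][j]`, and the helper names, mesh widths and test function are assumptions made for this example. Because the stencils are second-order, they are exact (up to rounding) for the low-degree polynomial used here.

```python
def d1_central(u, i, j, h1):
    """Central difference from (7) for du/ds1 at gridpoint (i, j)."""
    return (u[i + 1][j] - u[i - 1][j]) / (2.0 * h1)


def d2_central(u, i, j, h1):
    """Central difference from (7) for d2u/ds1^2 at gridpoint (i, j)."""
    return (u[i + 1][j] - 2.0 * u[i][j] + u[i - 1][j]) / (h1 * h1)


def d2_cross(u, i, j, h1, h2):
    """Four-point stencil (8) for the cross derivative d2u/(ds1 ds2)."""
    return (u[i + 1][j + 1] + u[i - 1][j - 1]
            - u[i - 1][j + 1] - u[i + 1][j - 1]) / (4.0 * h1 * h2)


def f(x, y):
    """Hypothetical smooth test function f(x, y) = x^2 * y + 3 * x * y."""
    return x * x * y + 3.0 * x * y


# Sample f on a small uniform grid with mesh widths h1 and h2.
h1, h2 = 0.5, 0.25
u = [[f(i * h1, j * h2) for j in range(5)] for i in range(5)]

# Evaluate at the gridpoint (i, j) = (2, 2), i.e. (x, y) = (1.0, 0.5).
# Exact values there: f_x = 2xy + 3y = 2.5, f_xx = 2y = 1.0, f_xy = 2x + 3 = 5.0.
fx = d1_central(u, 2, 2, h1)
fxx = d2_central(u, 2, 2, h1)
fxy = d2_cross(u, 2, 2, h1, h2)
```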

2) Time discretization: In this paper, we consider two second-order accurate time discretization schemes, namely the Crank-Nicolson (CN) method and the two-level backward difference formula (BDF2), as well as the first-order accurate fully-implicit method, used primarily for smoothing.

Both the CN and the fully-implicit methods belong to the standard θ-timestepping discretization scheme, in which the time derivative is approximated by a first-order backward difference, while the discretized differential operator is treated as a θ-weighted average between the fully-implicit and the fully-explicit steps. More specifically, when proceeding from time $\tau_{m-1}$ to time $\tau_m$, applying the standard θ-timestepping discretization scheme to (5) gives

$$
(I - \theta\Delta\tau_m \mathcal{L}) u^m_{i,j,k} = (I + (1 - \theta)\Delta\tau_m \mathcal{L}) u^{m-1}_{i,j,k} + \mathcal{P}u^m_{i,j,k}, \tag{9}
$$

where $0 \le \theta \le 1$. Here, $I$ and $\mathcal{P}$ denote the identity and penalty operators, respectively, where $\mathcal{P}$ is defined by

$$
\mathcal{P}u^m_{i,j,k} = \zeta \max(u^*_{i,j,k} - u^m_{i,j,k}, 0),
$$

with $\zeta$ being the penalty factor related to the desired tolerance. Essentially, we have the relation $\zeta' \sim \zeta/\Delta\tau_m$. The boundary conditions (3) are incorporated by setting $u^m_{i,j,k} = u^*_{i,j,k}$, if $i = 0, n + 1$, or $j = 0, p + 1$, or $k = 0, q + 1$, with $u^*_{i,j,k}$ being the payoff value at the reference point $(s_{1i}, s_{2j}, s_{3k}, \cdot)$. In (9), the values $\theta = 1/2$ and $\theta = 1$ give rise to the standard CN and the fully-implicit methods, respectively. It is known that, although the CN method is second-order accurate and unconditionally stable, i.e. no restriction on the timestep sizes is required for stability, this method is prone to producing spurious oscillations [17]. On the other hand, the fully-implicit method is first-order accurate, but is strongly stable (e.g. [13]). To maintain the accuracy of CN as well as smoothness of the solution, we use the Rannacher smoothing technique [14], which applies the fully-implicit method for the first few (usually two) timesteps followed by the CN method on the remaining timesteps.
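The Rannacher schedule amounts to choosing θ per timestep. The sketch below shows one way to express it, together with the amplification factor of the θ-scheme on the scalar test equation $u' = -\lambda u$; that test equation is a standard model problem brought in here for illustration and is not part of the paper's experiments. The function names are hypothetical.

```python
def rannacher_theta(m, smoothing_steps=2):
    """Theta for timestep m (1-based): fully implicit (1.0) for the first
    `smoothing_steps` timesteps, Crank-Nicolson (0.5) afterwards."""
    if m < 1:
        raise ValueError("timestep index m must be >= 1")
    return 1.0 if m <= smoothing_steps else 0.5


def amplification(theta, z):
    """Factor by which one theta-step multiplies the solution of u' = -lambda*u,
    where z = lambda * dtau >= 0:  (1 - (1 - theta) z) / (1 + theta z)."""
    return (1.0 - (1.0 - theta) * z) / (1.0 + theta * z)


thetas = [rannacher_theta(m) for m in range(1, 6)]

# For a stiff mode (large z), CN barely damps and flips the sign each step,
# whereas the fully-implicit step damps the mode almost completely.
cn_factor = amplification(0.5, 1000.0)
implicit_factor = amplification(1.0, 1000.0)
```

The CN factor is close to -1 for stiff modes, which is the mechanism behind the spurious oscillations mentioned above, while the fully-implicit factor is close to 0.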

For the BDF2 scheme, the time derivative in (5) is approximated by the second-order difference formula²

$$
\frac{\partial u}{\partial \tau} = \frac{1}{\Delta\tau_m} \left( \frac{1 + 2c_m}{1 + c_m} u^m - (1 + c_m) u^{m-1} + \frac{c_m^2}{1 + c_m} u^{m-2} \right),
$$

where $c_m$ is the timestep size ratio defined in (6), while the discretized differential operator $\mathcal{L}$ is treated fully-implicitly. This gives rise to the scheme

$$
\left( I - \frac{1 + c_m}{1 + 2c_m} \Delta\tau_m \mathcal{L} \right) u^m_{i,j,k} = \frac{(1 + c_m)^2}{1 + 2c_m} u^{m-1}_{i,j,k} - \frac{c_m^2}{1 + 2c_m} u^{m-2}_{i,j,k} + \mathcal{P}u^m_{i,j,k}, \tag{10}
$$

where the operators $I$ and $\mathcal{P}$ are previously defined. The boundary conditions (3) are incorporated into (10) in the same fashion as in the θ-timestepping method.

² For a uniform partition of the time interval with $\Delta\tau_m = \Delta\tau = \frac{T}{l}$ and $c_m = 1$, we have the well-known BDF2 formula

$$
\frac{\partial u}{\partial \tau} = \frac{3u^m - 4u^{m-1} + u^{m-2}}{2\Delta\tau}.
$$
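The variable-step BDF2 weights can be sanity-checked with a few lines of Python. This is a sketch with hypothetical function names: it returns the three weights of the difference formula above as a function of the ratio $c_m$, recovers the uniform-step weights for $c_m = 1$, and confirms that the formula differentiates a quadratic exactly on a non-uniform partition.

```python
def bdf2_coefficients(c):
    """Weights (a0, a1, a2) such that
    du/dtau at tau_m ~ (a0*u^m + a1*u^{m-1} + a2*u^{m-2}) / dtau_m,
    where c = dtau_m / dtau_{m-1} is the timestep size ratio."""
    if c <= 0.0:
        raise ValueError("the timestep ratio c must be positive")
    return (1.0 + 2.0 * c) / (1.0 + c), -(1.0 + c), c * c / (1.0 + c)


def bdf2_derivative(u_m, u_m1, u_m2, dtau_m, dtau_prev):
    """Variable-step BDF2 approximation of du/dtau at tau_m."""
    a0, a1, a2 = bdf2_coefficients(dtau_m / dtau_prev)
    return (a0 * u_m + a1 * u_m1 + a2 * u_m2) / dtau_m


# Uniform steps (c = 1) give the classical weights 3/2, -2, 1/2.
uniform = bdf2_coefficients(1.0)

# Non-uniform check on u(tau) = tau^2, whose derivative is 2*tau.
dtau_prev, dtau_m = 0.2, 0.1
tau_m2, tau_m1 = 0.0, dtau_prev
tau_m = dtau_prev + dtau_m
approx = bdf2_derivative(tau_m ** 2, tau_m1 ** 2, tau_m2 ** 2, dtau_m, dtau_prev)
exact = 2.0 * tau_m
```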

It is important to note that in the case of the BDF2 method (10), the numerical solution of the first timestep, i.e. timestep $m = 1$, must be obtained using another method. The most natural choice for this is the fully-implicit method, which we use in our experiments. It is worth emphasizing that, since the BDF2 method is L-stable, a stronger property than the unconditional stability of the CN method [7], the BDF2 method has more favorable damping properties than the CN method does. Having good damping properties is particularly important in computing accurate hedging parameters, such as delta and gamma.

It is also important to emphasize that, for both the CN and BDF2 schemes, we also adopt another smoothing technique suggested in [13]. That is, the grids in the experiments are chosen so that there is a gridpoint at the strike $E$ (the initial kink point) along each space dimension.

We adapt the penalty iteration algorithm in [6] to solve the set of discrete nonlinear penalized equations (9) and (10). In the next section, we present the penalty iteration algorithm and associated ADI-AF schemes.

III. PENALTY ITERATION AND AN ASSOCIATED ADI-AF TECHNIQUE

Unless otherwise stated, assume that the mesh points are ordered in the $s_1$-, $s_2$-, then $s_3$-directions. Let $\mathbf{u}^m$ denote the vector of values at time $\tau_m$ on the mesh $\Omega$ that approximates the exact solution $u^m = u(s, \tau_m)$. Furthermore, denote by $\mathbf{u}^*$ the vector of the payoff values on $\Omega$. Let $\kappa$, $\kappa \ge 0$, be the index of the penalty iteration. Let $\mathbf{u}^{m,(\kappa)}$ be the $\kappa$th estimate of $\mathbf{u}^m$, and denote by $\Delta\mathbf{u}^{m,(\kappa)} = \mathbf{u}^{m,(\kappa+1)} - \mathbf{u}^{m,(\kappa)}$ the correction to the $\kappa$th iterate of the penalty iteration at time $\tau_m$.

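At each penalty iteration, the θ-timestepping scheme (9) and the BDF2 scheme (10) must solve an $npq \times npq$ algebraic system of the form [6]

$$
(\mathbf{I} + \theta\Delta\tau_m \mathbf{A} + \mathbf{P}^{m,(\kappa)}) \mathbf{u}^{m,(\kappa+1)} = (\mathbf{I} - (1 - \theta)\Delta\tau_m \mathbf{A}) \mathbf{u}^{m-1} + \mathbf{P}^{m,(\kappa)} \mathbf{u}^* + \Delta\tau_m \mathbf{g}, \tag{11}
$$

and

$$
\left( \mathbf{I} + \frac{1 + c_m}{1 + 2c_m} \Delta\tau_m \mathbf{A} + \mathbf{P}^{m,(\kappa)} \right) \mathbf{u}^{m,(\kappa+1)} = \frac{(1 + c_m)^2}{1 + 2c_m} \mathbf{u}^{m-1} - \frac{c_m^2}{1 + 2c_m} \mathbf{u}^{m-2} + \mathbf{P}^{m,(\kappa)} \mathbf{u}^* + \frac{1 + c_m}{1 + 2c_m} \Delta\tau_m \mathbf{g}, \tag{12}
$$

respectively. Here, $\mathbf{I}$ denotes the identity matrix; $-\mathbf{A}$ is the matrix FD approximation to the differential operator $\mathcal{L}$; $\mathbf{P}^{m,(\kappa)}$ is the diagonal penalty matrix and $\mathbf{g}$ is a vector containing values arising from the boundary conditions. For brevity, we omit the explicit formula for $\mathbf{A}$. The penalty matrix $\mathbf{P}^{m,(\kappa)}$ is defined by

$$
\left( \mathbf{P}^{m,(\kappa)} \right)_{ij} \equiv
\begin{cases}
\zeta & \text{if } u^{m,(\kappa)}_i < u^*_i \text{ and } i = j, \\
0 & \text{otherwise.}
\end{cases} \tag{13}
$$

Note that the matrices on the left-hand side of (11) and (12) are essentially the Jacobians of the discrete nonlinear penalized systems arising from (9) and (10), respectively. In general, if we want to solve (5) with a relative precision $tol$, we should have $\zeta \simeq \frac{1}{tol}$ [2]. For future reference, we decompose the matrix $\mathbf{A}$ into four submatrices $\mathbf{A} = \mathbf{A}_0 + \mathbf{A}_1 + \mathbf{A}_2 + \mathbf{A}_3$. The matrices $\mathbf{A}_1$, $\mathbf{A}_2$ and $\mathbf{A}_3$ are the parts of $\mathbf{A}$ that correspond to the spatial derivatives in the $s_1$-, $s_2$- and $s_3$-directions, respectively, while the matrix $\mathbf{A}_0$ is the part of $\mathbf{A}$ that comes from the FD discretization of the mixed derivative terms in the operator $\mathcal{L}$. The term $ru$ in $\mathcal{L}$ is distributed evenly over $\mathbf{A}_1$, $\mathbf{A}_2$ and $\mathbf{A}_3$. For simplicity, let $\mathbf{D}^{m,(\kappa)} = \mathbf{I} + \mathbf{P}^{m,(\kappa)}$.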
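Since the penalty matrix in (13) is diagonal, only its diagonal needs to be stored. The sketch below builds that diagonal from the current iterate and the payoff vector. The function name and the sample vectors are hypothetical; the value of ζ follows the rule of thumb $\zeta \simeq 1/tol$ for an assumed tolerance of $10^{-6}$.

```python
def penalty_diagonal(u, payoff, zeta):
    """Diagonal of the penalty matrix in (13): zeta where the iterate is
    below the payoff (the early exercise constraint is violated), else 0."""
    if len(u) != len(payoff):
        raise ValueError("u and payoff must have the same length")
    return [zeta if ui < pi else 0.0 for ui, pi in zip(u, payoff)]


# Assumed tolerance tol = 1e-6, hence zeta of about 1/tol = 1e6.
zeta = 1.0e6

# Hypothetical iterate and payoff values at four gridpoints.
u_iter = [10.0, 4.9, 0.0, 2.0]
payoff = [10.0, 5.0, 0.0, 1.0]
diag = penalty_diagonal(u_iter, payoff, zeta)
```

Only the second gridpoint has an iterate strictly below the payoff, so only that entry of the diagonal is nonzero.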

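We adapt the ADI-AF approach discussed in [18] to solve (11) and (12). For brevity, we present only the derivation of the ADI-AF scheme for (11). It is straightforward to apply a similar technique to (12). We first write an ADI-AF scheme for (11) in the form

$$
\begin{aligned}
&(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_1)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_2)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_3)\, \mathbf{u}^{m,(\kappa+1)} \\
&\quad = (\mathbf{I} - (1 - \theta)\Delta\tau_m \mathbf{A}) \mathbf{u}^{m-1} + \mathbf{P}^{m,(\kappa)} \mathbf{u}^* + \Delta\tau_m \mathbf{g} \\
&\qquad + (\mathbf{D}^{m,(\kappa)})^{-1} (\theta\Delta\tau_m)^2 (\mathbf{A}_1\mathbf{A}_2 + \mathbf{A}_1\mathbf{A}_3 + \mathbf{A}_2\mathbf{A}_3) \mathbf{u}^{m,(\kappa)} \\
&\qquad + (\mathbf{D}^{m,(\kappa)})^{-2} (\theta\Delta\tau_m)^3 \mathbf{A}_1\mathbf{A}_2\mathbf{A}_3 \mathbf{u}^{m,(\kappa)} - \theta\Delta\tau_m \mathbf{A}_0 \mathbf{u}^{m,(\kappa)}.
\end{aligned} \tag{14}
$$

We then subtract

$$
(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_1)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_2)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_3)\, \mathbf{u}^{m,(\kappa)}
$$

from both sides of (14). The resulting ADI-AF scheme, referred to as ADI-AF-CN, for the correction $\Delta\mathbf{u}^{m,(\kappa)}$ is given by

$$
\begin{aligned}
&(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_1)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_2)(\mathbf{D}^{m,(\kappa)})^{-1}(\mathbf{D}^{m,(\kappa)} + \theta\Delta\tau_m \mathbf{A}_3)\, \Delta\mathbf{u}^{m,(\kappa)} \\
&\quad = -(\mathbf{I} + \theta\Delta\tau_m \mathbf{A}) \mathbf{u}^{m,(\kappa)} + (\mathbf{I} - (1 - \theta)\Delta\tau_m \mathbf{A}) \mathbf{u}^{m-1} + \mathbf{P}^{m,(\kappa)} (\mathbf{u}^* - \mathbf{u}^{m,(\kappa)}) + \Delta\tau_m \mathbf{g},
\end{aligned} \tag{15}
$$

which can be rewritten equivalently as

$$
\begin{aligned}
(\mathbf{D}^{m,(\kappa)} + \widehat{\Delta\tau}_m \mathbf{A}_1)(\Delta\mathbf{u}^{m,(\kappa)})^{(1)} &= \mathbf{b}^{m,(\kappa)}, && \text{(16a)} \\
(\mathbf{D}^{m,(\kappa)} + \widehat{\Delta\tau}_m \mathbf{A}_2)(\Delta\mathbf{u}^{m,(\kappa)})^{(2)} &= \mathbf{D}^{m,(\kappa)} (\Delta\mathbf{u}^{m,(\kappa)})^{(1)}, && \text{(16b)} \\
(\mathbf{D}^{m,(\kappa)} + \widehat{\Delta\tau}_m \mathbf{A}_3)(\Delta\mathbf{u}^{m,(\kappa)})^{(3)} &= \mathbf{D}^{m,(\kappa)} (\Delta\mathbf{u}^{m,(\kappa)})^{(2)}, && \text{(16c)} \\
\Delta\mathbf{u}^{m,(\kappa)} &= (\Delta\mathbf{u}^{m,(\kappa)})^{(3)}, && \text{(16d)}
\end{aligned}
$$

with $\widehat{\Delta\tau}_m = \theta\Delta\tau_m$, and $\mathbf{b}^{m,(\kappa)}$ being the right-hand side of (15).

Similarly, the ADI-AF scheme for the correction $\Delta\mathbf{u}^{m,(\kappa)}$ for (12), referred to as ADI-AF-BDF2, is given by relations (16a)-(16d), with $\widehat{\Delta\tau}_m = \left( \frac{1 + c_m}{1 + 2c_m} \right) \Delta\tau_m$ and the vector $\mathbf{b}^{m,(\kappa)}$ of the right-hand side of (16a) defined as

$$
\mathbf{b}^{m,(\kappa)} = -(\mathbf{I} + \widehat{\Delta\tau}_m \mathbf{A}) \mathbf{u}^{m,(\kappa)} + \frac{(1 + c_m)^2}{1 + 2c_m} \mathbf{u}^{m-1} - \frac{c_m^2}{1 + 2c_m} \mathbf{u}^{m-2} + \mathbf{P}^{m,(\kappa)} (\mathbf{u}^* - \mathbf{u}^{m,(\kappa)}) + \widehat{\Delta\tau}_m \mathbf{g}.
$$

Thus, both the ADI-AF-CN and ADI-AF-BDF2 schemes require the solution of the linear systems in (16a)-(16c). The corresponding ADI-AF FD penalty algorithm based on the ADI-AF-CN or the ADI-AF-BDF2 scheme is presented in Algorithm 1.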

REMARK 1: For both the ADI-AF-CN and ADI-AF-BDF2 schemes, we use as the initial guess $\mathbf{u}^{m,(0)}$ for the penalty iteration the linear two-level extrapolation of the numerical solution from the two previous timesteps, i.e. $\mathbf{u}^{m,(0)} = (1 + c_m)\mathbf{u}^{m-1} - c_m \mathbf{u}^{m-2}$, except on the first timestep, where $\mathbf{u}^{1,(0)} = \mathbf{u}^0 \equiv \mathbf{u}^*$. In addition to conceptually fitting the ADI-AF-BDF2 scheme, this initial guess also (i) gives rise to more efficient ADI-AF-CN schemes than those using $\mathbf{u}^{m,(0)} = \mathbf{u}^{m-1}$, and (ii) enables consistent and fair efficiency comparisons between the two ADI-AF schemes. Detailed numerical results are presented in Tables I-II, with a discussion in Section VI.

Algorithm 1: ADI-AF FD penalty iteration for American options

1: initialize $\mathbf{u}^{m,(0)}$;
2: construct $\mathbf{P}^{m,(0)}$ using (13);
3: for $\kappa = 0, \ldots,$ until convergence do
4:   carry out (16) to obtain $\Delta\mathbf{u}^{m,(\kappa)}$; set $\mathbf{u}^{m,(\kappa+1)} = \mathbf{u}^{m,(\kappa)} + \Delta\mathbf{u}^{m,(\kappa)}$;
5:   construct $\mathbf{P}^{m,(\kappa+1)}$ using (13);
6:   if $\left[ \max_{1 \le i \le npq} \dfrac{|u^{m,(\kappa+1)}_i - u^{m,(\kappa)}_i|}{\max(1, |u^{m,(\kappa+1)}_i|)} < tol \right]$ or $\left[ \mathbf{P}^{m,(\kappa)} = \mathbf{P}^{m,(\kappa+1)} \right]$ then
7:     break;
8:   end if
9: end for
10: $\mathbf{u}^{m} = \mathbf{u}^{m,(\kappa+1)}$;
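The control flow of Algorithm 1 can be illustrated on a much simpler problem. The sketch below is not the paper's 3-D ADI-AF method: it prices a one-asset American put with fully-implicit timestepping, solves each penalty iteration directly with a tridiagonal (Thomas) solve instead of the sweeps (16), and starts each penalty iteration from $\mathbf{u}^{m-1}$ rather than the extrapolation of Remark 1. All function names and market parameters are hypothetical. It keeps the penalty matrix update of (13) and both stopping tests of Algorithm 1.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def penalty_timestep(u_prev, payoff, dtau, sigma, r,
                     zeta=1.0e6, tol=1.0e-6, max_iter=50):
    """One fully-implicit timestep of the 1-D penalized Black-Scholes PDE."""
    n = len(u_prev)
    u = list(u_prev)  # initial guess (assumption: previous timestep)
    pen = [zeta if u[i] < payoff[i] else 0.0 for i in range(n)]
    for _ in range(max_iter):
        # Boundary rows are identity rows that impose u = payoff.
        lower, diag, upper = [0.0] * n, [1.0] * n, [0.0] * n
        rhs = list(payoff)
        for i in range(1, n - 1):
            # Central differences on s_i = i * ds (ds cancels out).
            alpha = 0.5 * sigma ** 2 * i * i - 0.5 * r * i
            beta = 0.5 * sigma ** 2 * i * i + 0.5 * r * i
            lower[i] = -dtau * alpha
            diag[i] = 1.0 + dtau * (alpha + beta + r) + pen[i]
            upper[i] = -dtau * beta
            rhs[i] = u_prev[i] + pen[i] * payoff[i]
        u_new = solve_tridiagonal(lower, diag, upper, rhs)
        pen_new = [zeta if u_new[i] < payoff[i] else 0.0 for i in range(n)]
        change = max(abs(a - b) / max(1.0, abs(a)) for a, b in zip(u_new, u))
        u = u_new
        if change < tol or pen_new == pen:
            break
        pen = pen_new
    return u


# Hypothetical data: strike 100, sigma = 0.3, r = 0.05, expiry T = 1.
strike, s_max, n_space, n_time = 100.0, 300.0, 60, 50
sigma, r, expiry = 0.3, 0.05, 1.0
ds = s_max / n_space  # ds = 5, so gridpoint 20 sits at the strike
payoff = [max(strike - i * ds, 0.0) for i in range(n_space + 1)]
u = list(payoff)
for _ in range(n_time):
    u = penalty_timestep(u, payoff, expiry / n_time, sigma, r)
```

After the time loop, `u[20]` holds the approximate at-the-money option value, and every entry of `u` stays above the payoff up to a violation of order $1/\zeta$.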

REMARK 2: Due to the similarities between the aforementioned ADI-AF schemes and an ADI method, such as the Douglas and Rachford scheme [10], the fact that the mixed derivatives are treated solely explicitly in these ADI-AF schemes might lead one to expect that second-order convergence of the numerical methods would be lost. This is a typical problem for ADI methods (e.g. see [10]). However, as our numerical results indicate, the ADI-AF schemes presented in this paper exhibit second-order convergence. This does not contradict the aforementioned problem for ADI methods, since these ADI methods are used in a non-iterative context, whereas, in our case, the ADI-AF schemes are applied iteratively. While the first iterate $\mathbf{u}^{m,(1)}$ could be a first-order accurate approximate solution, it seems that, with further penalty iterations, $\mathbf{u}^{m,(\kappa)}$ converges to a second-order accurate approximate solution at each timestep.

A heuristic explanation for this observation is as follows. The total error in the iterative solution at each penalty iteration can be viewed as arising from two sources: (1) the iteration error, which includes the first-order approximate factorization error arising from the AF schemes, and (2) the second-order truncation error. In the first iteration, the dominant source of error is the iteration error, but this error is reduced with further penalty iterations. If enough penalty iterations are performed, the major source of error is the truncation error, rather than the iteration error. Hence, we are able to observe the second-order accuracy of the numerical solution at each timestep. Detailed results are given in Tables I-II, with a discussion in Section VI.

REMARK 3: A possible extension of the ADI-AF schemes presented in this paper is to modify them so that they maintain second-order accuracy for $\mathbf{u}^{m,(\kappa)}$ at each penalty iteration. To this end, after carrying out the steps in (16), a special correction to the cross-derivative terms, similar to those suggested in [3] in the context of ADI timestepping methods, could be added to the right-side vector of (16a), followed by solving an additional tridiagonal linear system along each spatial dimension, similar to (16a)-(16c). However, the computational cost per penalty iteration of an ADI-AF scheme based on this approach is approximately double that of the ADI-AF schemes considered in this paper.

Following [3], we carried out experiments with correction terms of the form $\widehat{\Delta\tau}_m \mathbf{A}_0 \Delta\mathbf{u}^{m,(\kappa)}$ with $\widehat{\Delta\tau}_m$ set accordingly for ADI-AF-CN and ADI-AF-BDF2. Although the resulting ADI-AF methods did reduce the total number of iterations required for convergence compared to the ADI-AF schemes presented in this paper, the reduction was less than 50%. As a result, it seems that a straightforward extension of the ADI-AF-CN/BDF2 schemes based on this correction term is not cost effective. Possibly, a more productive correction term specifically designed for ADI-AF schemes could be developed.

REMARK 4: For the FD discretization for the spatial variables described in (7), if the gridpoints are ordered appropriately, all the linear systems in (16) are block-diagonal with tridiagonal blocks. As a result, the number of floating-point operations per iteration is directly proportional to $npq$, which yields a significant reduction in computational cost compared to the application of a direct method. Moreover, the block diagonal structure of these matrices gives rise to a simple, yet efficient, parallelization for the solution of the linear systems in (16), as discussed in Section V.
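The block-diagonal structure comes from the fact that each sweep in (16) couples gridpoints only along one coordinate direction, so every grid line yields its own independent tridiagonal system. The sketch below lists, for each direction, the linear indices of the interior gridpoints on each line. The indexing convention (0-based, $s_1$ varying fastest) and the function names are assumptions made for this illustration.

```python
def linear_index(i, j, k, n, p):
    """Linear index of interior gridpoint (i, j, k), 0-based, s1 fastest."""
    return i + n * (j + p * k)


def lines_along(direction, n, p, q):
    """Index lists of the independent grid lines along direction 1, 2 or 3.
    Each list corresponds to one tridiagonal block of the sweep."""
    if direction == 1:
        return [[linear_index(i, j, k, n, p) for i in range(n)]
                for k in range(q) for j in range(p)]
    if direction == 2:
        return [[linear_index(i, j, k, n, p) for j in range(p)]
                for k in range(q) for i in range(n)]
    if direction == 3:
        return [[linear_index(i, j, k, n, p) for k in range(q)]
                for j in range(p) for i in range(n)]
    raise ValueError("direction must be 1, 2 or 3")


# Tiny hypothetical grid with n = 4, p = 3, q = 2 interior points.
n, p, q = 4, 3, 2
lines_s1 = lines_along(1, n, p, q)  # p*q lines of length n, stride 1
lines_s2 = lines_along(2, n, p, q)  # n*q lines of length p, stride n
lines_s3 = lines_along(3, n, p, q)  # n*p lines of length q, stride n*p
```

Each direction partitions the $npq$ unknowns into disjoint lines, which is why the lines can be solved independently and, on a parallel machine, concurrently.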

REMARK 5: We now determine the complexity of the ADI-AF penalty algorithm. We assume that all the matrices stored are in sparse format, and that variable timestep sizes are used. The cost for determining these timestep sizes is considered separately in Section IV. Each penalty iteration requires

(i) about $43npq$ flops³ for the matrix-vector multiplications $-(\mathbf{I} + \theta\Delta\tau_m \mathbf{A})\mathbf{u}^{m,(\kappa)}$ and $\mathbf{P}^{m,(\kappa)}(\mathbf{u}^* - \mathbf{u}^{m,(\kappa)})$, and the addition involving the vector $\mathbf{g}$, assuming that the PDE coefficients are available (see Step (x) below);⁴
(ii) about $2npq$ flops for updating the two right-side vectors of (16b)-(16c);
(iii) about $15npq$ flops for updating the three tridiagonal matrices in (16a)-(16c), assuming $\mathbf{A}_1$, $\mathbf{A}_2$, and $\mathbf{A}_3$ are available (see Step (xi) below);
(iv) about $12npq$ flops for the solutions of the three tridiagonal systems in (16a)-(16c);⁵
(v) about $npq$ flops for updating the vector $\mathbf{u}^{m,(\kappa+1)}$;
(vi) about $npq$ flops for checking the stopping criterion.

In addition to the above costs, at each timestep, the ADI-AF-CN and ADI-AF-BDF2 schemes require

(vii) about $3npq$ flops for the initial guess;
(viii) about $40npq$ flops for the matrix-vector multiplication involving $\mathbf{u}^{m-1}$ (ADI-AF-CN) and about $3npq$ for the vector-vector addition of $\mathbf{u}^{m-1}$ and $\mathbf{u}^{m-2}$ (ADI-AF-BDF2).

Moreover, we also need to include

(ix) about $16(np + nq + pq)$ flops for computing values arising from the boundary conditions, assuming that the coefficients of the PDE are available (see Step (x));
(x) about $57npq$ flops for computing the coefficients of the PDE; and
(xi) $21npq$ flops for assembling the three tridiagonal matrices $\mathbf{A}_1$, $\mathbf{A}_2$, and $\mathbf{A}_3$, assuming that the coefficients of the PDE are available (see Step (x)).

³ A flop is one addition, or one multiplication, or one division of two floating-point numbers.

⁴ Since the matrix $\mathbf{A}$ has about 19 nonzero entries per row, the matrix-vector product $\mathbf{A}\mathbf{u}^{m,(\kappa)}$ requires about $38npq$ flops.

⁵ About $4npq$ flops are needed for the solution of each of these tridiagonal systems.

Since the values arising from the boundary conditions are not time-dependent and are relatively cheap to store, in the implementation, we compute the values in Step (ix) only once at the first penalty iteration of the first timestep, and store them for use in subsequent penalty iterations. On the other hand, while the values computed in Steps (x)-(xi) could be stored for use at each subsequent timestep and/or penalty iteration, in our implementation, we recompute these values at each penalty iteration. Recomputing them is often cheaper than the memory traffic needed to retrieve pre-computed values, and it also significantly reduces the memory footprint.

Let $\bar{\kappa}$ denote the total number of penalty iterations over all timesteps required by the penalty method. Thus, the approximate total numbers of flops required by our implementation of the ADI-AF-CN and ADI-AF-BDF2 penalty algorithms are $152npq\bar{\kappa} + 43npql + 16(np + nq + pq)$ and $152npq\bar{\kappa} + 6npql + 16(np + nq + pq)$, respectively.
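The totals can be reproduced from the itemized costs: Steps (i)-(vi) plus the recomputed Steps (x)-(xi) give $43 + 2 + 15 + 12 + 1 + 1 + 57 + 21 = 152$ flops per gridpoint per penalty iteration, and Steps (vii)-(viii) give $3 + 40 = 43$ (ADI-AF-CN) or $3 + 3 = 6$ (ADI-AF-BDF2) per gridpoint per timestep. The following sketch evaluates the two formulas; the function name and the sample grid and iteration counts are hypothetical.

```python
# Per-gridpoint costs taken from Steps (i)-(vi), (x) and (xi) above.
PER_ITERATION = 43 + 2 + 15 + 12 + 1 + 1 + 57 + 21

# Per-gridpoint costs per timestep from Steps (vii)-(viii).
PER_TIMESTEP = {"CN": 3 + 40, "BDF2": 3 + 3}


def total_flops(n, p, q, n_timesteps, n_penalty_iterations, scheme="CN"):
    """Approximate flop count of the ADI-AF penalty algorithm.
    `n_penalty_iterations` is the total over all timesteps."""
    npq = n * p * q
    boundary = 16 * (n * p + n * q + p * q)  # Step (ix), computed once
    return (PER_ITERATION * npq * n_penalty_iterations
            + PER_TIMESTEP[scheme] * npq * n_timesteps
            + boundary)


# Hypothetical example: a 10 x 10 x 10 interior grid, 5 timesteps and
# 12 penalty iterations in total.
flops_cn = total_flops(10, 10, 10, 5, 12, scheme="CN")
flops_bdf2 = total_flops(10, 10, 10, 5, 12, scheme="BDF2")
```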

IV. TIMESTEP SIZE SELECTOR

We use a simple, but effective, timestep size

selector presented in [6] that was shown to work

well for pricing of American options written on one

asset (e.g. see [2] and [6]). The idea underlying this

scheme is to predict a suitable timestep size for the

next timestep, using only information from the cur-

rent and previous timesteps. We extend this timestep

size selector for use with ADI-AF methods applied

to multi-asset American options and investigate its

efficient implementation on GPUs.

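According to the formula in [6], given the current stepsize $\Delta\tau_m$, $m \ge 1$, the new stepsize $\Delta\tau_{m+1}$ is given by

$$\Delta\tau_{m+1} = \left( \min_{1 \le \iota \le npq} \left[ \frac{\text{dnorm}}{\dfrac{|u^{m}_{\iota} - u^{m-1}_{\iota}|}{\max(N, |u^{m}_{\iota}|, |u^{m-1}_{\iota}|)}} \right] \right) \Delta\tau_m. \qquad (17)$$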

Here, dnorm is a user-defined target relative

change, and the scale N is chosen so that the

method does not take an excessively large stepsize

where the value of the option is small. Normally,

for option values in dollars, N = 1 is used. The

total number of flops required for this timestep size

selector is about 3npq per timestep, which results

in an approximate total of 3(l− 1)npq flops for all

timesteps.
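A serial sketch of the selector (17) may help fix ideas. It is written in plain Python for readability and is not the GPU code discussed later; the function name `next_timestep` and the handling of gridpoints whose value did not change are assumptions made here for illustration.

```python
def next_timestep(u_curr, u_prev, dt_curr, dnorm, scale=1.0):
    """Predict the next timestep size following formula (17).

    u_curr and u_prev hold the solution at the current and previous timesteps,
    dnorm is the target relative change and scale plays the role of N.
    """
    factor = float("inf")
    for uc, up in zip(u_curr, u_prev):
        rel_change = abs(uc - up) / max(scale, abs(uc), abs(up))
        if rel_change > 0.0:
            # Assumption: gridpoints that did not change impose no restriction.
            factor = min(factor, dnorm / rel_change)
    # Assumption: if nothing changed anywhere, keep the current stepsize.
    return dt_curr if factor == float("inf") else factor * dt_curr


# Made-up data: the largest relative change is 0.5 / 2.5 = 0.2, so with
# dnorm = 0.4 the stepsize is roughly doubled.
dt_next = next_timestep([0.0, 2.5, 10.5], [0.0, 2.0, 10.0], dt_curr=1e-3, dnorm=0.4)
```

Because the minimum is taken over all gridpoints, the new stepsize is controlled by the gridpoint with the largest relative change.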

V. GPU IMPLEMENTATION

A. GPU device architecture and CUDA

A GPU is a hierarchically arranged multipro-

cessor unit, in which several scalar processors are


grouped into a smaller number of streaming mul-

tiprocessors (SMs). Each SM has shared memory

accessed by all its scalar processors. In addition,

the GPU has global (device) memory (slower than

shared memory) accessed by all scalar processors

on the chip, as well as a small amount of cache

for storing constants. According to the program-

ming model of CUDA, which we adopt, the host

(CPU/master) offloads the intensive work to the

GPU as a single program, called a kernel. Multiple

copies of the kernel, referred to as threads, are

then distributed to the available processors, where

they are executed in parallel. Within the CUDA

framework, threads are grouped into threadblocks,

which are in turn arranged on a grid. Threads in

a threadblock run on at most one multiprocessor,

and can communicate with each other efficiently

via the shared memory, as well as synchronize their

executions. For a more detailed description of the

GPU, as well as a discussion on memory coalescing,

an important issue in optimizing the performance of

CUDA applications, interested readers are referred

to [12]. The NVIDIA Tesla 10-series (T10) GPUs

(Tesla S1060/S1070 - server version), which are

used for the experiments in this paper, consist of

30 independent SMs, each containing 8 processors

running at 1.44GHz, a total of 16384 registers, and

16 KB of shared memory per SM.

B. GPU implementation of the ADI-AF schemes

We first discuss a GPU-based parallel algorithm

for each of the penalty iterations of Algorithm 1.

As an example, we focus on describing the parallel

implementation of the ADI-AF-CN scheme and the

stopping criterion (Line 6) of the penalty algorithm.

The implementation of the ADI-AF-BDF2 scheme is

essentially the same, and hence omitted.

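For presentation purposes, let

$$w^{m-1} = (1-\theta)\Delta\tau_m A u^{m-1}, \qquad w^{(\kappa)} = \theta\Delta\tau_m A u^{m,(\kappa)},$$

$$A_i^{m,(\kappa)} = D^{m,(\kappa)} + \theta\Delta\tau_m A_i, \quad i = 1, 2, 3,$$

$$\Delta u^{(\kappa),i} = D^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(i-1)}, \quad i = 2, 3,$$

and notice that

$$b^{m,(\kappa)} = u^{m-1} - u^{m,(\kappa)} - (w^{m-1} + w^{(\kappa)}) + P^{m,(\kappa)}(u^* - u^{m,(\kappa)}) + \Delta\tau_m g.$$

Here, to simplify the notation, we do not indicate the superscript for the timestep index of the vectors $w^{(\kappa)}$ and $\Delta u^{(\kappa),i}$, $i = 2, 3$. The computation of the ADI-AF-CN scheme (16) and the checking of the stopping criterion of Algorithm 1 consist of the following steps: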

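(i) Step a.1: Compute the matrices $D^{m,(\kappa)}$, and $A_i$, $A_i^{m,(\kappa)}$, $i = 1, 2, 3$, and the vectors $w^{m-1}$, $w^{(\kappa)}$ and $b^{m,(\kappa)}$;

(ii) Step a.2: Solve $A_1^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(1)} = b^{m,(\kappa)}$;

(iii) Step a.3: Compute $\Delta u^{(\kappa),2}$ and solve $A_2^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(2)} = \Delta u^{(\kappa),2}$;

(iv) Step a.4: Compute $\Delta u^{(\kappa),3}$ and solve $A_3^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(3)} = \Delta u^{(\kappa),3}$;

(v) Step a.5: Check the stopping criterion.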

In [5], we describe a parallel ADI timestepping

method implemented efficiently on GPUs for the so-

lution of multi-dimensional linear parabolic PDEs.

We observe similarities between the computation of

the ADI-AF schemes and that of the ADI timestep-

ping method in [5]. More specifically, the compu-

tation of the vector wm−1 in Step a.1 resembles the

explicit Euler predictor step, while Steps a.2-a.4 are

essentially the same as the three implicit corrector

steps in [5], each of which involves solving a

block-diagonal system with tridiagonal blocks along

a spatial dimension. As a result, the GPU-based

parallelization of the ADI-AF scheme considered in

this paper can be viewed as a natural extension of

the parallelization of the ADI timestepping method

presented in [5]. For brevity, we only present the

main steps of the parallel algorithm for the ADI-

AF-CN scheme. A detailed discussion of the parallel

ADI algorithm can be found in [5].

1) Step a.1: We assume that, initially, the vectors

$u^{m-1}$, $u^{m,(\kappa)}$ and $u^*$ are in the global memory,

and any needed constants (model parameters) are

in the constant cache. Note that the data copying

from the host memory to the device memory occurs

on the first timestep only, for the initial condition

(payoff) data and the model constants. Data for

the subsequent timesteps and steps of the ADI-AF

schemes are stored in the global memory.

We partition the $n \times p \times q$ computational grid into 3-D blocks of size $n_b \times p_b \times q$, each of which can be viewed as consisting of q two-dimensional (2-D) blocks, referred to as tiles, of size $n_b \times p_b$. For Step a.1, we let the kernel generate a $\mathrm{ceil}(n/n_b) \times \mathrm{ceil}(p/p_b)$ grid of threadblocks, where ceil denotes the ceiling function. Each of the threadblocks, in turn, consists of $n_b p_b$ threads arranged in a 2-D array of size $n_b \times p_b$. All gridpoints of an $n_b \times p_b \times q$ 3-D block are assigned to one threadblock only, with one thread for each “stack” of q gridpoints in the $s_3$ direction (see Figure 1), and this assignment results in a q-iteration loop in the kernel.

Fig. 1. An illustration of the partitioning approach considered for Step a.1. The computational domain is partitioned into 3-D blocks of size $n_b \times p_b \times q \equiv 4 \times 2 \times 10$, each of which can be viewed as consisting of ten 4×2 tiles, or as 8 (= 4×2) stacks of 10 gridpoints. (Figure labels: n = 8, p = 8, q = 10, $n_b$ = 4, $p_b$ = 2; axes $s_1$, $s_2$, $s_3$.)
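The partitioning for Step a.1 can be mimicked on the host to see which stack of gridpoints each thread owns. The sketch below is plain Python, not CUDA; the helper names `launch_config` and `stack_of_thread` are hypothetical, and the bounds check for grids whose dimensions are not multiples of the tile size is an assumption about how a partial tile would be handled.

```python
import math


def launch_config(n, p, nb, pb):
    """Grid of threadblocks and threadblock shape for an n x p x q domain."""
    grid = (math.ceil(n / nb), math.ceil(p / pb))
    block = (nb, pb)
    return grid, block


def stack_of_thread(block_idx, thread_idx, nb, pb, n, p):
    """(i, j) index of the stack owned by a thread, or None if out of range."""
    i = block_idx[0] * nb + thread_idx[0]
    j = block_idx[1] * pb + thread_idx[1]
    # Assumption: threads that fall outside the grid simply do no work.
    return (i, j) if i < n and j < p else None


# The configuration of Figure 1: n = p = 8 with 4 x 2 tiles.
grid, block = launch_config(8, 8, 4, 2)

# Each thread would then loop over the q gridpoints of its stack.
owned = [
    stack_of_thread((bi, bj), (ti, tj), 4, 2, 8, 8)
    for bi in range(grid[0]) for bj in range(grid[1])
    for ti in range(block[0]) for tj in range(block[1])
]
```

With this mapping every stack in the $s_3$ direction is owned by exactly one thread, which is why the kernel contains a loop of q iterations.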

The computation required in Step a.1 includes

construction of matrices, addition of matrices, mul-

tiplication of vectors and matrices by scalars,

as well as matrix-vector multiplication. With the

approach of data partitioning and assignment to

threads/threadblocks described above, all the above

computations, except possibly the matrix-vector

multiplication, can be executed in parallel in a

natural way, without the need for communication

between threads in different threadblocks. However,

due to the FD schemes (7) and (8), the computation

of matrix-vector multiplications embedded in $w^{m-1}$

and $w^{(\kappa)}$ requires communication between threads

in different threadblocks. More specifically, at each

instance of the q-iteration loop, a threadblock carry-

ing the computation of a tile needs the values (halo

values) of neighbouring gridpoints from adjacent

tiles in the s1 and s2 directions. These adjacent tiles

belong to different threadblocks. In our approach, this type of communication is realized via copies to/from the global memory; although these copies are slow, they involve small amounts of data transfer compared to the amount of computation. Furthermore, because the 16KB of shared memory available per multiprocessor is not sufficient to store many data tiles, we adopt a three-plane strategy, in which, to process a tile of a 3-D block, each threadblock works with three data tiles of size $n_b \times p_b$ and their halo values during each iteration of the loop. As we proceed in the $s_3$-direction, at each iteration, the data of the next tile are loaded, the current tile is processed, and the data of the previous tile are discarded. A similar strategy is described in more detail in Section 4.1.3 of [5].
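The three-plane strategy amounts to a rolling window over the planes in the $s_3$-direction. The following serial Python sketch illustrates only the data movement; the names `sweep_s3`, `load_plane` and `second_difference_s3` are hypothetical, the per-plane work is a toy second difference rather than the FD schemes of the paper, and halo values in the $s_1$- and $s_2$-directions are left out.

```python
def sweep_s3(load_plane, q, process):
    """Visit planes k = 0..q-1 while holding at most three planes at a time."""
    outputs = []
    prev_plane, curr_plane = None, load_plane(0)
    for k in range(q):
        next_plane = load_plane(k + 1) if k + 1 < q else None
        outputs.append(process(prev_plane, curr_plane, next_plane))
        # Rotate the window: the old previous plane is no longer referenced.
        prev_plane, curr_plane = curr_plane, next_plane
    return outputs


def second_difference_s3(prev_plane, curr_plane, next_plane):
    """Toy per-tile work: u[k-1] - 2 u[k] + u[k+1], zero on the two end planes."""
    if prev_plane is None or next_plane is None:
        return [0.0] * len(curr_plane)
    return [a - 2.0 * b + c for a, b, c in zip(prev_plane, curr_plane, next_plane)]


# Example: tiles with 8 gridpoints per plane, q = 10 planes, and u = k^2.
loads = []


def load_plane(k):
    loads.append(k)  # record each simulated read from global memory
    return [float(k * k)] * 8


result = sweep_s3(load_plane, 10, second_difference_s3)
```

In this sketch each plane is read exactly once, and the storage needed at any time is three tiles regardless of q.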

REMARK 6: As explained in [5], the data loading

strategy described above allows memory coalescing

along the s1-, but not along the s2-direction. How-

ever, experimental results indicate that the approach

is highly effective. It is worth emphasizing that the

vector $w^{m-1}$ is computed only in the first penalty

iteration of the mth timestep (see Remark 5). This

vector is then loaded in a coalesced fashion from the

global to the shared memory for use in subsequent

penalty iterations of that timestep.

2) Steps a.2, a.3, a.4: The data partitioning for each of Steps a.2, a.3 and a.4 is different from that for Step a.1 and is motivated by the block structure of the tridiagonal matrices $A_i^{m,(\kappa)}$, $i = 1, 2, 3$, respectively. For example, $A_1^{m,(\kappa)}$ has pq diagonal blocks, each block being $n \times n$ tridiagonal; thus the solution of $A_1^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(1)} = b^{m,(\kappa)}$ (Step a.2) is computed by first partitioning $A_1^{m,(\kappa)}$ and $b^{m,(\kappa)}$ into pq independent $n \times n$ tridiagonal systems, and then assigning each tridiagonal system to one of the pq threads generated, i.e. each thread is assigned n gridpoints along the $s_1$-direction.

In our implementation, each of the 2-D threadblocks used in Steps a.2, a.3 and a.4 has the identical size $r_t \times c_t$, where the values of $r_t$ and $c_t$ are determined by numerical experiments to maximize the performance. The size of the grid of threadblocks is determined accordingly. For example, for the parallel solution of $A_1^{m,(\kappa)} (\Delta u^{m,(\kappa)})^{(1)} = b^{m,(\kappa)}$, a 2-D grid of threadblocks of size $\mathrm{ceil}(p/r_t) \times \mathrm{ceil}(q/c_t)$ is invoked. The details of the implementation are similar to those described in Section 4.2 of [5].
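The per-thread work in Steps a.2 to a.4 is the solution of one tridiagonal system. The sketch below uses the classical Thomas algorithm purely for illustration; the text above only states that each thread solves one tridiagonal system, so the choice of solver, the function names and the storage convention (`lower[0]` and `upper[-1]` unused) are assumptions.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for one tridiagonal system, without pivoting.

    lower[0] and upper[-1] are not part of the matrix and are ignored in effect.
    The matrix is assumed to be such that no pivoting is needed
    (e.g. diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_all_lines(systems):
    """Solve independent systems; on the GPU each would be given to one thread."""
    return [solve_tridiagonal(*system) for system in systems]


# Example: six independent copies of a 3 x 3 system whose solution is (1, 1, 1).
system = ([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
solutions = solve_all_lines([system] * 6)
```

Since the systems are independent, the loop in `solve_all_lines` has no data dependence between iterations, which is what makes the one-thread-per-system assignment possible.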

3) Step a.5: In the current implementation, checking the stopping criterion is done during the kernel launched in Step a.4. More specifically, each thread of the kernel launched in Step a.4, after computing its component of the vector $\Delta u^{(\kappa),3}$ corresponding to the reference point $(s_{1i}, s_{2j}, s_{3k}, \cdot)$, computes the quantity

$$\frac{|u^{m,(\kappa+1)}_{i,j,k} - u^{m,(\kappa)}_{i,j,k}|}{\max(1, |u^{m,(\kappa+1)}_{i,j,k}|)} \qquad (18)$$

and the corresponding row of the penalty matrix $P^{m,(\kappa+1)}$ (one entry). If the quantity (18) is greater than or equal to the tolerance tol, or if the two corresponding rows of the matrices $P^{m,(\kappa)}$ (obtained from the matrix $D^{m,(\kappa)}$) and $P^{m,(\kappa+1)}$ are different, the thread changes the pre-set value of the first or the second, respectively, of two flag variables stored in the global memory. Note that the two pre-set flag variables are copied from the host memory to the device memory before the kernel of Step a.4 is launched. After the kernel has ended, the values of the two flag variables are copied back to the host memory to be checked. (These host-device copies are cheap.) The stopping criterion is satisfied if the two pre-set values were not altered during the kernel. Although it may happen that multiple threads try to write to the same memory location of a flag variable at the same time, it is guaranteed that one of the writes will occur. Although we do not know which one, this does not matter for the purpose of checking the stopping criterion, since every such write stores the same value. Consequently, this approach suffices and works well.
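The flag-based check can be written serially as follows. This is an illustrative Python sketch, not the CUDA kernel: the function name `stopping_criterion_met` is hypothetical, each loop iteration stands in for one thread, and the diagonal penalty matrices are represented simply as lists of their diagonal entries.

```python
def stopping_criterion_met(u_new, u_old, penalty_new, penalty_old, tol):
    """Return True if neither of the two pre-set flags is altered."""
    # Pre-set values, as copied to the device before the kernel launch.
    change_flag = False
    penalty_flag = False
    for un, uo, pn, po in zip(u_new, u_old, penalty_new, penalty_old):
        # Quantity (18) for this gridpoint.
        if abs(un - uo) / max(1.0, abs(un)) >= tol:
            change_flag = True   # every writer stores the same value
        if pn != po:
            penalty_flag = True  # the two penalty rows differ
    # After the "kernel", the host inspects both flags.
    return not change_flag and not penalty_flag


# Made-up data: tiny changes and identical penalty entries.
converged = stopping_criterion_met(
    [1.0, 2.0, 3.0], [1.0, 2.0, 3.0 + 1e-9], [0.0, 1e7, 0.0], [0.0, 1e7, 0.0], tol=1e-6
)
```

Because every thread that writes to a flag writes the same value, the order in which concurrent writes land is irrelevant, and no atomic operation is needed in this sketch's GPU analogue.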

REMARK 7: In the current implementation, the

data between Steps a.1, a.2, a.3 and a.4 are ordered

in the s1-, then s2-, then s3-directions. As a result,

memory coalescence is fully achieved only for the

tridiagonal solves in the s2- and s3-directions, but

not in the s1-direction. (Hence, the loading of components

of the vectors $u^{m,(\kappa)}$ and $u^*$ used for the

checking of the stopping criterion in Step a.5 is fully

coalesced.) See [5] for a more detailed discussion.

C. GPU implementation of the timestep size selector

The key part in implementing the timestep size

selector (17) on the GPU involves finding the min-

imum element of an array of real numbers. In this

regard, we adapt the parallel reduction technique

discussed in [8]. The idea is to partition the array

into multiple sub-arrays of size st, each of which

is assigned to a 1-D threadblock of the same size.

During the first kernel launch, each threadblock

carries out the reduction operation via a tree-based

approach to find the minimum of the corresponding

sub-array and writes the intermediate result to a

location in an array in the global memory. This

array of intermediate minimum elements is then

processed in the same manner by passing it on

to a kernel again. This process is repeated until

the array of partial minimums can be handled by

a kernel launch with only one threadblock of size

st, after which the minimum element of the initial

array is found. In this approach, the kernel launch

serves as a global synchronization point to process

partial results; this is inexpensive because a kernel

launch has negligible overheads. In addition, the

kernel code for all levels of reduction is virtually

the same, except that in the first kernel call, each thread of a threadblock needs to load its components of the vectors $u^m$ and $u^{m-1}$ (in a coalesced pattern) and compute the respective quantity

$$\frac{|u^{m}_{\iota} - u^{m-1}_{\iota}|}{\max(N, |u^{m}_{\iota}|, |u^{m-1}_{\iota}|)},$$

before performing the reduction to find the minimum element of the threadblock. For this loading phase, barrier synchronization among threads in the same threadblock is enforced by using the function __syncthreads().

In the parallel reduction described above, each

element of the array is accessed multiple times,

and therefore, it should really be cached. Since

the GPU used for the experiments does not have

cache, we manually cache the entries of the array

being processed in the shared memory by using a

__shared__ array in the kernel. Although several

advanced optimization techniques, such as unrolling

the last warp, loop unrolling, and processing multi-

ple elements per thread, can be used to improve

the performance of the GPU-based timestep size

selector, we decided not to implement these, since

the current implementation of the timestep size

selector is sufficiently fast and its computation times

occupy a very small fraction, about 1%, of the total

computation times (see Table II).
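The multi-pass, tree-based minimum reduction described in this subsection can be emulated serially. In the Python sketch below, each call to `block_min` plays the role of one threadblock and each pass of the outer loop in `reduce_min` plays the role of one kernel launch; both function names, and the padding of a short sub-array with +infinity, are assumptions made for illustration.

```python
def block_min(chunk):
    """Tree-based minimum of one sub-array, as one threadblock would compute it."""
    vals = list(chunk)  # stands in for the __shared__ array
    size = 1
    while size < len(vals):
        size *= 2
    # Assumption: pad to a power of two with +inf so the tree is complete.
    vals += [float("inf")] * (size - len(vals))
    stride = size // 2
    while stride > 0:
        for t in range(stride):  # each t plays the role of one thread
            vals[t] = min(vals[t], vals[t + stride])
        stride //= 2             # a barrier would separate the levels on a GPU
    return vals[0]


def reduce_min(values, st=128):
    """Repeat block-wise reductions until a single value remains (st >= 2)."""
    data = list(values)
    if not data:
        raise ValueError("reduce_min requires a non-empty array")
    while len(data) > 1:
        # One pass corresponds to one kernel launch over sub-arrays of size st.
        data = [block_min(data[s:s + st]) for s in range(0, len(data), st)]
    return data[0]


# Example with 1000 made-up values: two passes are needed (1000 -> 8 -> 1).
values = [((i * 37) % 101) + 0.5 for i in range(1000)]
smallest = reduce_min(values, st=128)
```

With st = 128, an array of npq values needs only a few passes, since each pass shrinks the array by a factor of up to 128.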

VI. NUMERICAL RESULTS

Although most basket options are written on

arithmetic averages, using geometric averages in-

stead allows us to compute an accurate benchmark

solution using a dimension reduction approach. In

this section, we first present selected numerical

results to compare the efficiency of the numerical

methods applied to the American put options on

geometric averages. We then consider the American

put options on arithmetic averages, and present


results that demonstrate the efficiency of the GPU-

based parallel ADI-AF methods.

Several methods are considered, namely, the ADI-

AF-CN and ADI-AF-BDF2 methods with uniform

timestep sizes (uniform-timestep-size ADI-AF-CN

and uniform-timestep-size ADI-AF-BDF2, respec-

tively), and the ADI-AF-CN and ADI-AF-BDF2

methods with variable timestep sizes automatically

chosen by (17) (variable-timestep-size ADI-AF-CN

and variable-timestep-size ADI-AF-BDF2, respec-

tively).

We use the set of parameters for three assets taken from [11]: E = 100, r = 0.03, T = 0.25, σ1 = σ2 = σ3 = 0.2, d1 = d2 = d3 = 0, ρ12 = ρ13 = ρ23 = 0.5. The spot prices are chosen to be s1(0) = s2(0) = s3(0) = E. We consider the weights of the assets to be $w_1 = w_2 = w_3 = \frac{1}{3}$, so that we have $\sum_{i=1}^{3} w_i s_i(0) = \frac{1}{3}\sum_{i=1}^{3} s_i(0) = E$. The penalty parameter $\zeta = 10^7$ is used. We choose $s_{1,\infty} = s_{2,\infty} = s_{3,\infty} = 3E = 300$. Note that, with this choice of the truncated computational domain and for all grid sizes considered, there is a gridpoint at E (the initial kink point) in each asset price grid.

We used the CUDA 3.1 driver and toolkit, and all

the experiments with the GPU code were conducted

on an NVIDIA Tesla T10 connected to a two quad-core

Intel “Harpertown” host system with Intel

Xeon E5430 CPUs running at 2.66GHz with 8GB

of FB-DIMM PC 5300 RAM. Note that only one

CPU core was employed for the experiments with

the (non-multithreaded) CPU code written by us.

The CPU and GPU computation times, respectively

denoted by “CPU time” and “GPU time”, mea-

sure the total computational times in seconds (s.)

using the CUDA functions cutStartTimer()

and cutStopTimer(). The GPU times include

the overhead for memory transfers from the CPU

to the device memory. If the timestep size selector

(17) is used, the “CPU time” and “GPU time” also

include the total computation times for all timesteps

required by the procedure.

All computations are carried out in double-

precision. The size of each tile used in Step a.1

is chosen to be nb × pb ≡ 32 × 4, and the size

of each threadblock used in the parallel solution of

the independent tridiagonal systems in Steps a.2, a.3

and a.4 is rt × ct ≡ 32 × 4, which appears to be

optimal on a Tesla T10. Regarding the timestep size

selector (17), we use the parameters $\Delta\tau_1 = 10^{-3}$

(the initial timestep size) and dnorm = 0.4 on

the coarsest grids. The value of dnorm is reduced

by a factor of two at each grid refinement, while

∆τ1 is reduced by a factor of four. The size for

each threadblock used in the parallelization of the

timestep size selector is st = 128.

We denote by “iter. #” the total number of

penalty iterations over all timesteps required by the

penalty method. The quantity “speedup” is defined

as the ratio of the CPU time over the correspond-

ing GPU time. The quantity “value” denotes the

spot value of the option, and the quantity “error”

is computed as the absolute difference between

our numerical solution and an accurate reference

solution. To show convergence, we compute the quantity “$\log_\eta$ ratio”, which is defined by

$$\log_\eta \text{ratio} = \log_\eta\left( \frac{u_{ref} - u_{approx}(\Delta s)}{u_{ref} - u_{approx}(\Delta s/\eta)} \right),$$

where $u_{ref}$ is an accurate reference solution. When an accurate reference solution is not available, we estimate the rate of convergence by computing the “change” as the difference in values between a coarser grid and a finer one, and the “ratio” as the ratio of changes between successive grids. For second-order methods, such as those considered in this paper, the quantities “$\log_\eta$ ratio” and “ratio” are expected to be about 2 and 4, respectively. The quantity “work” denotes the approximate total flops required by a method, and is computed as described in Remark 5 and in Section IV.
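The two convergence indicators defined above are simple to compute. The following Python sketch uses made-up values rather than results from this paper; the function names `log_ratio` and `change_ratio` are hypothetical, and `log_ratio` assumes that the two errors have the same sign so that the logarithm is defined.

```python
import math


def log_ratio(u_ref, u_coarse, u_fine, eta=2.0):
    """log_eta of the ratio of errors on grids with spacing ds and ds/eta."""
    return math.log((u_ref - u_coarse) / (u_ref - u_fine), eta)


def change_ratio(v_coarse, v_medium, v_fine):
    """Ratio of successive changes, used when no reference value is available."""
    return (v_medium - v_coarse) / (v_fine - v_medium)


# Made-up values behaving like a second-order method with eta = 2:
# the error drops from 0.04 to 0.01, and the changes from 0.04 to 0.01.
order = log_ratio(1.0, 0.96, 0.99)
ratio = change_ratio(1.0, 1.04, 1.05)
```

For a method of order 2 and η = 2, halving the grid spacing divides the error by about 4, which is why the two indicators are expected to approach 2 and 4.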

A. American options on geometric averages

We take the payoff of a geometric average American put option to be max(E − g(t), 0), where g(t) is defined by $g(t) = \left(\prod_{i=1}^{3} s_i(t)\right)^{1/3}$. Using the multi-dimensional Itô's formula, it can be shown [9] that this option is equivalent to an American put option written on one asset with starting value $g(0) = \left(\prod_{i=1}^{3} s_i(0)\right)^{1/3}$, strike E, volatility

$$\sigma_g = \left( \frac{1}{3^2} \sum_{i,j=1}^{3} \rho_{ij}\sigma_i\sigma_j \right)^{1/2}$$

and risk-neutral drift

$$r_g = r - \left( \frac{1}{3} \sum_{i=1}^{3} \left( d_i + \frac{1}{2}\sigma_i^2 \right) - \frac{1}{2}\sigma_g^2 \right).$$

With the set of parameters used, we have g(0) = 100, $\sigma_g$ = 0.1633, and $r_g$ = 0.03 − 0.0067. The benchmark solution is

3.00448 obtained using an accurate, adaptive, high-

order pricing method developed in [2] for pricing

American put options written on one asset.
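The parameters of the equivalent one-asset problem can be reproduced with a few lines of Python. The function name `geometric_basket_params` is hypothetical, and writing the formulas for a general number of assets k (instead of 3) is an assumption made here for convenience.

```python
import math


def geometric_basket_params(r, sigmas, divs, rho):
    """Volatility and risk-neutral drift of the geometric average of k assets."""
    k = len(sigmas)
    # sigma_g^2 = (1/k^2) * sum_{i,j} rho_ij * sigma_i * sigma_j
    var_g = sum(
        rho[i][j] * sigmas[i] * sigmas[j] for i in range(k) for j in range(k)
    ) / k**2
    sigma_g = math.sqrt(var_g)
    # r_g = r - ( (1/k) * sum_i (d_i + sigma_i^2 / 2) - sigma_g^2 / 2 )
    r_g = r - (sum(d + 0.5 * s * s for d, s in zip(divs, sigmas)) / k - 0.5 * var_g)
    return sigma_g, r_g


# Parameters used in this section: sigma_i = 0.2, d_i = 0, rho_ij = 0.5, r = 0.03.
rho = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
sigma_g, r_g = geometric_basket_params(0.03, [0.2, 0.2, 0.2], [0.0, 0.0, 0.0], rho)
```

For these inputs the sketch gives a volatility close to 0.1633 and a drift close to 0.03 − 0.0067, in agreement with the values quoted above.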

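method | n (s1) | p (s2) | q (s3) | l (τ) | value | error | log2 ratio | iter. # | work (flops)
uniform-timestep-size ADI-AF-CN | 45 | 45 | 45 | 20 | 2.9569 | 4.8e-2 | - | 53 | 8.1×10^8
uniform-timestep-size ADI-AF-CN | 90 | 90 | 90 | 40 | 2.9931 | 1.1e-2 | 2.0 | 125 | 1.5×10^10
uniform-timestep-size ADI-AF-CN | 180 | 180 | 180 | 80 | 3.0015 | 2.9e-3 | 2.0 | 330 | 3.1×10^11
variable-timestep-size ADI-AF-CN | 45 | 45 | 45 | 10 | 2.9619 | 4.3e-2 | - | 54 | 7.8×10^8
variable-timestep-size ADI-AF-CN | 90 | 90 | 90 | 18 | 2.9948 | 1.0e-2 | 2.1 | 112 | 1.3×10^10
variable-timestep-size ADI-AF-CN | 180 | 180 | 180 | 34 | 3.0022 | 2.3e-3 | 2.1 | 280 | 2.6×10^11
uniform-timestep-size ADI-AF-BDF2 | 45 | 45 | 45 | 20 | 2.9571 | 4.7e-2 | - | 63 | 8.8×10^8
uniform-timestep-size ADI-AF-BDF2 | 90 | 90 | 90 | 40 | 2.9931 | 1.1e-2 | 2.0 | 134 | 1.5×10^10
uniform-timestep-size ADI-AF-BDF2 | 180 | 180 | 180 | 80 | 3.0016 | 2.8e-3 | 2.0 | 304 | 2.7×10^11
variable-timestep-size ADI-AF-BDF2 | 45 | 45 | 45 | 10 | 2.9748 | 2.9e-2 | - | 53 | 7.4×10^8
variable-timestep-size ADI-AF-BDF2 | 90 | 90 | 90 | 18 | 2.9990 | 5.5e-3 | 2.4 | 118 | 1.3×10^10
variable-timestep-size ADI-AF-BDF2 | 180 | 180 | 180 | 34 | 3.0034 | 1.1e-3 | 2.3 | 292 | 2.6×10^11

TABLE I
OBSERVED ERRORS FOR AN AT-THE-MONEY AMERICAN PUT OPTION ON THE GEOMETRIC AVERAGE OF THREE ASSETS AND RESPECTIVE ORDERS OF CONVERGENCE FOR VARIOUS METHODS. THE BENCHMARK VALUE IS 3.00448.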

Table I presents selected numerical results for

the at-the-money American put option on the ge-

ometric average of three assets described above

obtained with various methods. The total number of

timesteps, l, for the variable-timestep-size methods

is automatically determined by the timestep size

selector (17). The computed option prices enjoy

second-order convergence, in all cases, a favorable

behavior due to the iterative application of the ADI-

AF schemes; see also Remark 2.

[Figure 2: plot of errors versus flops (flops axis from 10^9 to 10^12, errors axis from 10^-3 to 10^-1) for the uniform-timestep-size CN, uniform-timestep-size BDF2, variable-timestep-size CN and variable-timestep-size BDF2 methods.]

Fig. 2. Efficiency comparison of various methods applied to the geometric average American put option pricing problem.

We next compile an efficiency comparison be-

tween various methods for solving the American put

option on the geometric average of three assets. In

Figure 2, we plot errors (“error”) versus the approx-

imate total number of flops required by each of the

methods (“work”). It is evident that the variable-

timestep-size methods significantly outperform the

uniform-timestep-size methods, with the variable-

timestep-size ADI-AF-BDF2 being the most effi-

cient, followed by the variable-timestep-size ADI-

AF-CN. Between the uniform-timestep-size ADI-

AF-CN and ADI-AF-BDF2 methods, the ADI-AF-

BDF2 is only marginally more efficient. It seems

that the benefit of the timestep size selector (17) is

more pronounced with the ADI-AF-BDF2 scheme

than with the ADI-AF-CN scheme.

Interestingly, we can also see from Table I that the

average number of iterations per timestep required

by a variable-timestep-size ADI-AF method is sig-

nificantly larger than that required by its uniform-

timestep-size counterpart. A possible explanation

for this observation is as follows. For variable-

timestep-size ADI-AF methods, although a few ini-

tial timestep sizes are usually quite small (due to

small initial timestep size ∆τ1), subsequent timestep

sizes increase rapidly. However, a larger timestep

size also gives rise to a larger error in the initial

guess for the solution $u^m$ of (9) or (10) as well

as a larger error in an AF scheme (see (14)). Con-

sequently, more penalty iterations may be required

for the convergence of the penalty iteration at that

timestep (also see Remark 2).

B. American options on arithmetic averages

Table II presents selected numerical results for

the American put option on the arithmetic av-

erage of three assets, described above, obtained

using the two most efficient methods, namely the

variable-timestep-size ADI-AF-CN and ADI-AF-

BDF2 methods. In the last three columns of this

table, we also present the total CPU and GPU

times, expressed in milliseconds (ms.), required by

the timestep size selector (17) for all timesteps,

respectively denoted by “CPU (17)” and “GPU

(17)”, and the respective speedups. Note that these

times are already included in the total CPU and

GPU times (“CPU time” and “GPU time”). Other than the computation times and the speedups, our experiments on the CPU and on the GPU give similar results.

method | n (s1) | p (s2) | q (s3) | l (τ) | value | change | ratio | iter. # | CPU time (s.) | GPU time (s.) | speedup | CPU (17) (ms.) | GPU (17) (ms.) | speedup (17)
variable-timestep-size ADI-AF-CN | 45 | 45 | 45 | 11 | 2.8924 | - | - | 22 | 0.5 | 0.3 | 1.8 | 6.2 | 1.6 | 3.9
variable-timestep-size ADI-AF-CN | 90 | 90 | 90 | 20 | 2.9309 | 3.9e-2 | - | 54 | 12.7 | 1.0 | 12.8 | 115.7 | 14.3 | 8.0
variable-timestep-size ADI-AF-CN | 180 | 180 | 180 | 37 | 2.9408 | 9.9e-3 | 3.9 | 181 | 289.5 | 18.9 | 15.3 | 1677.2 | 175.3 | 9.6
variable-timestep-size ADI-AF-BDF2 | 45 | 45 | 45 | 11 | 2.9059 | - | - | 26 | 0.7 | 0.3 | 2.3 | 6.1 | 1.6 | 3.9
variable-timestep-size ADI-AF-BDF2 | 90 | 90 | 90 | 20 | 2.9348 | 2.9e-2 | - | 66 | 15.3 | 1.2 | 11.9 | 115.2 | 14.3 | 8.0
variable-timestep-size ADI-AF-BDF2 | 180 | 180 | 180 | 37 | 2.9419 | 7.1e-3 | 4.1 | 217 | 336.3 | 21.8 | 15.5 | 1678.2 | 176.3 | 9.6

TABLE II
OBSERVED SPOT PRICES AND PERFORMANCE RESULTS FOR AN AT-THE-MONEY AMERICAN PUT OPTION ON THE ARITHMETIC AVERAGE OF THREE ASSETS OBTAINED USING VARIABLE-TIMESTEP-SIZE ADI-AF-CN AND ADI-AF-BDF2 METHODS. THE REFERENCE PRICE IS 2.94454 [11].

n (s1) | p (s2) | q (s3) | l (τ) | ADI-AF-CN (GFLOP/s) | ADI-AF-BDF2 (GFLOP/s) | (17) (GFLOP/s)
45 | 45 | 45 | 11 | 1.16 | 1.23 | 1.70
90 | 90 | 90 | 20 | 6.65 | 6.20 | 3.04
180 | 180 | 180 | 37 | 9.01 | 8.89 | 3.70

TABLE III
ESTIMATED PERFORMANCE RESULTS IN GFLOP/S FOR THE GPU-BASED VARIABLE-TIMESTEP-SIZE ADI-AF METHODS AND THE TIMESTEP SIZE SELECTOR (17) USING DOUBLE PRECISION.

Table II shows a second-order rate of convergence

for the computed “value”. Moreover, these values

are consistent with the reference price 2.94454

quoted in [11]. Table II also shows that the GPU im-

plementation of each method is significantly faster

than the corresponding CPU implementation for any

size of the discretized problem. In particular, for the

largest grid considered, we observe a speedup ratio

of about 15 for the total computation times, while

the GPU-based timestep size selector is about 10

times faster than its CPU-based counterpart.

Estimated overall rates of computation, reported

in units of GFLOP/s, for the two GPU-based

variable-timestep-size ADI-AF methods and for the

timestep size selector (17) are presented in Table III.

Note that these performance results may actually

be underestimated, since the flops used to compute

these performance results do not include those that

are duplicated on different threads during the com-

putations. An investigation of this topic, however,

is beyond the scope of this short paper; we plan to

address it in the future.

C. Other discussions

It is evident from Tables I-II that the numerical

methods are more efficient, i.e. require a smaller

number of penalty iterations for the same grid sizes,

when applied to pricing an American put option

on arithmetic averages than when applied to an

American put option on geometric averages.

To investigate this further, we look at the payoff functions of the two options. For the option on the arithmetic average, the “kink region” for the payoff, i.e. the region where the first derivative of the payoff with respect to the space variables is not continuous, is a plane segment determined by $(E/w_1, 0, 0)$, $(0, E/w_2, 0)$, and $(0, 0, E/w_3)$. On the other hand, for the option on the geometric average, the kink region, determined by the equation $E - \left(\prod_{i=1}^{3} s_i(0)\right)^{1/3} = 0$, is an unbounded surface (see footnote 6). Thus, in a sense, the topology of the payoff function of a geometric-average American put option is much more complex and harder to handle than that of an arithmetic-average American put option. This may explain the observed difficulty of the numerical methods applied to pricing a geometric average option.

6 In the two-stock case, the kink region of a geometric average option resembles a hyperbola, while that of an arithmetic average option is just a line segment.

VII. CONCLUSIONS AND FUTURE WORK

This paper discusses a GPU-based parallel algorithm for pricing multi-asset American options via a PDE approach. The algorithm incorporates the penalty approach for handling the LCP and parallel ADI-AF methods on GPUs for the solution of the linear algebraic system arising at each penalty iteration. A GPU-based timestep-size-selector is employed to further increase the performance of the numerical methods. Numerical results indicate that the proposed algorithm is very effective for pricing such derivatives. Furthermore, the ADI-AF-BDF2 timestepping method is slightly more efficient than ADI-AF-CN, and the variable-timestep-size ADI-AF-BDF2 and ADI-AF-CN methods are more efficient than their uniform-timestep-size counterparts.

At the time of writing this paper, more powerful

GPUs, with more processors, such as the NVIDIA

Tesla 20-series, based on the “Fermi” architecture,

have become available on the market. The increase

in the number of parallel processors (448 proces-

sors in the Tesla C2050) and significant improve-

ments in the peak double-precision performance

(515 GFLOPS on the Tesla C2050), as well as the

increase in the memory bandwidth (144GB/s on

the Tesla C2050) should increase the performance

of the parallel GPU-based ADI-AF algorithm. In

addition, each SM on the new “Fermi” GPUs has

64KB of on-chip memory that can be configured

between the shared memory (16KB/48KB) and the

L1 cache (48KB/16KB). Tripling the amount of

shared memory could yield performance improve-

ments for the GPU-based ADI-AF methods or lead

to a better design of the parallel algorithm. Also,

with the availability of L1/L2 cache on this new

GPU architecture, the programming is expected to

be much simpler.

We conclude the paper by mentioning some pos-

sible extensions of this work. It would be desirable

to have a theoretical analysis of the second-order

convergence of the ADI-AF techniques observed

in the experiments. In addition, it would be in-

teresting to investigate the damping properties of

the two ADI-AF timestepping methods and their

effects on the Greeks delta and gamma of the

options. It would also be interesting to investigate

other possible extensions of the ADI-AF schemes,

such as those discussed in Remark 3. Moreover,

it would be desirable to develop efficient ADI-

AF schemes for multi-dimensional PDEs with time-

dependent coefficients. These PDEs arise very fre-

quently in financial applications, where the local

volatility functions and/or time-dependent cross-

correlations between stochastic processes in the

model are present. To further increase the accuracy

and efficiency of the numerical methods, support for

non-uniform grids, as in [2], could be incorporated.

From a parallelization perspective, utilizing GPU-

based parallel cyclic reduction techniques for the

solution of the tridiagonal systems in the ADI-

AF schemes is expected to increase the efficiency

of the parallel methods (see Remark 4 in [5]).

Extending the current implementation to a multi-

GPU platform should increase the performance of

the GPU algorithm presented here.

ACKNOWLEDGMENT

This research was supported in part by the Nat-

ural Sciences and Engineering Research Council

(NSERC) of Canada. Access to a GPU cluster

was provided by the Shared Hierarchical Aca-

demic Research Computing Network (SHARCNET:

www.sharcnet.ca).

REFERENCES

[1] L. A. ABBAS-TURKI AND B. LAPEYRE, American options

pricing on multi-core graphic cards, in Proceedings of In-

ternational Conference on Business Intelligence and Financial

Engineering, IEEE Computer Society, 2009, pp. 307–311.

[2] C. CHRISTARA AND D. M. DANG, Adaptive and high-order

methods for valuing American options, To appear in the Journal

of Computational Finance, (2011).

[3] J. CRAIG AND A. SNEYD, An alternating-direction implicit

scheme for parabolic equations with mixed derivatives, Comp.

Math. Appl., 16 (1988), pp. 341–350.

[4] D. M. DANG, Pricing of cross-currency interest rate deriva-

tives on Graphics Processing Units, in Proceedings of the

International Symposium on Parallel & Distributed Processing

(IPDPS), IEEE Computer Society, 2010, pp. 1–8.

[5] D. M. DANG, C. CHRISTARA, AND K. JACKSON, A parallel

implementation on GPUs of ADI finite difference methods for

parabolic PDEs with applications in finance, To appear in the

Canadian Applied Mathematics Quarterly (CAMQ), (2011).

[6] P. A. FORSYTH AND K. VETZAL, Quadratic convergence for

valuing American options using a penalty method, SIAM J. Sci.

Comput., 23 (2002), pp. 2095–2122.

[7] E. HAIRER AND G. WANNER, Solving ordinary differential

equations II: stiff and differential-algebraic problems, Springer-

Verlag, second ed., 1996.

[8] M. HARRIS, S. SENGUPTA, AND J. D. OWENS, Parallel prefix

sum (scan) with CUDA, in GPU Gems 3, NVIDIA, 2010,

pp. 851–877.

[9] J. C. HULL, Options, Futures, and Other Derivatives, Prentice

Hall, seventh ed., 2008.

[10] K. IN’T HOUT AND B. WELFERT, Unconditional stability

of second-order ADI schemes applied to multi-dimensional

diffusion equations with mixed derivative terms, Appl. Numer.

Math, 59 (2009), pp. 677–692.

[11] P. KOVALOV, V. LINETSKY, AND M. MARCOZZI, Pricing

multi-asset American options: A finite element method-of-lines

with smooth penalty, Journal of Scientific Computing, 33

(2007), pp. 209–237.

[12] NVIDIA, NVIDIA Compute Unified Device Architecture pro-

gramming guide version 2.3, NVIDIA Developer Web Site,

(2009). http://developer.download.nvidia.com.

[13] D. M. POOLEY, K. R. VETZAL, AND P. A. FORSYTH, Convergence

remedies for non-smooth payoffs in option pricing,

Journal of Computational Finance, 6 (2003), pp. 25–40.

[14] R. RANNACHER, Finite element solution of diffusion prob-

lems with irregular data, Numerische Mathematik, 43 (1984),

pp. 309–327.


[15] D. TAVELLA AND C. RANDALL, Pricing financial instruments:

The finite difference method, John Wiley & Sons, Chichester,

2000.

[16] A. H. TSE, D. B. THOMAS, AND W. LUK, Option pricing with

multi-dimensional quadrature architectures, in Proceedings of

the 2009 International Conference on Field-Programmable

Technology, IEEE Computer Society, 2009, pp. 427 – 430.

[17] P. WILMOTT, J. DEWYNNE, AND S. HOWISON, Option pric-

ing: Mathematical Models and Computation, Oxford Financial

Press, 1993.

[18] T. P. WITELSKI AND M. BOWEN, ADI schemes for higher-

order nonlinear diffusion equations, Applied Numerical Math-

ematics, 45 (2001), pp. 331–351.
