
Terry Spitz

Citi

14 May 2013

Citi | Markets Quantitative Analysis

Optimising Risk Management

Computational Methods and Technologies for Finance

Agenda

1. Increasing Compute Requirements

2. High Performance Hardware

3. Software Optimisation Technologies

4. Writing efficient parallel code

5. Case Study: Optimising Pricing Models

6. Conclusions

Increasing Compute Requirements

Pre-2008

• More trades

• More complex payoffs, more observation dates

• More complex models, more factors, more complex calibration

Post-2008

• More competitive marketplace demands reduced cost per trade

• More stability/accuracy from more steps/trials

• More market data scenarios

• More regulatory testing, for example hedging simulations

In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.

The free lunch is over. Now welcome to the hardware jungle.

– Herb Sutter (2012)

High Performance Hardware

• NVidia

– GeForce/Quadro (1999-)

– Tesla (2007-)

– Fermi (2010-)

– Kepler (2012-)

• Intel

– Xeon (2004-)

– Larrabee (2006-2009)

– MIC Architecture: Knights Ferry / Intel Xeon Phi (2012-)

• Sony Cell (2005-2009)

• FPGA vendors

Hardware cost

Device                 | Intel Xeon on Grid | NVidia Tesla M2090 | Intel Xeon Phi
-----------------------+--------------------+--------------------+------------------
Cost per year          | $6300/server       | $2000/card         | est. $2000/card
Cores                  | 12x 3GHz           | 512x 1.3GHz        | 62x 1.05GHz
Cost per core per year | $525/core          | $4/core            | est. $31/core
Speed                  | 300 GFlops         | 665 GFlops         | est. 1290 GFlops
Cost per GFlop         | $21/GFlop          | $3/GFlop           | est. $1.6/GFlop
Memory                 | 16 GB              | 6 GB               | 8 GB
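The derived rows are just the raw figures divided out: for the Xeon grid server, $6300 / 12 cores = $525 per core per year, and $6300 / 300 GFlops = $21 per GFlop; the same arithmetic gives the Tesla and Phi columns.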

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

– Donald Knuth (1974)

Software Optimisation Technologies

• Optimise serial code
  – Profile and optimise algorithms & code
  – Better compiler: e.g. Intel CC

• Parallelise on CPU
  – SSE/AVX
  – Multicore
  – OpenMP
  – MPI
  – Grid

• Port to GPU
  – CUDA
  – OpenCL
  – C++ AMP
  – Data parallel: Thrust, Microsoft Accelerator

Software Parallelisation APIs

Raw OS    fork() / pthread_create(…) / CreateThread(…)

OpenMP    #pragma omp parallel for
          for(int i = 0; i < N; i++) { …

TBB/PPL   parallel_for(0, size, [&](int i) { …

Grid      Session::sendTaskInput(Message* taskInput);

CUDA      __global__ void VecAdd(float* A, float* B) {…
          VecAdd<<<blocksPerGrid, threadsPerBlock>>>(…);

OpenCL    kernel = clCreateKernel(program, …);
          clEnqueueNDRangeKernel(queue, kernel, …);

C++ AMP   parallel_for_each(array.extent, [=](index<2> i) restrict(amp) { …

Thrust    thrust::transform(rng.begin(), rng.end(), payoffs.begin(), compute_payoff(…));
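The CUDA row shows only fragments; a minimal, self-contained sketch of the same vector add (the size N, the launch configuration and the bounds guard are assumptions, not from the table):

    #include <cuda_runtime.h>

    // Each thread adds one element; the guard handles N not divisible
    // by the block size.
    __global__ void VecAdd(const float* A, const float* B, float* C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        const int N = 1 << 20;
        const size_t bytes = N * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        // … fill dA/dB with cudaMemcpy(…, cudaMemcpyHostToDevice) …
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);
        cudaDeviceSynchronize();
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }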

Writing efficient GPU code

• Balance memory versus compute bottlenecks

• You need to run > 10,000 threads to keep the GPU busy

• SIMD (Single Instruction, Multiple Data) execution with few divergent branches/loops (see the kernel sketch after this list)

• Find the parallelism, for example:

– Multiple contracts

– Outer loops, e.g. MC Paths

– Third-party functions, e.g. matrix operations, RNG, parallel_reduce

• Extreme optimisation

– coherent memory access, shared memory, block/tile size, synchronisation, atomics, fast_maths, asynch/overlapped operations, …

• Limitations

– No exceptions, virtual functions, STL/vectors, new/delete, debugging, recursive functions, IEEE compliant maths, unsupported types, complex objects, …
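As a concrete illustration of the SIMD and coalescing points above, a hypothetical payoff kernel (the names and the payoff are illustrative, not from the case study): consecutive threads read consecutive addresses, and fmax replaces a divergent if/else.

    // Thread i handles element i: adjacent threads touch adjacent memory,
    // so global loads and stores coalesce into wide transactions.
    __global__ void callPayoffKernel(const double* spots, double* payoffs,
                                     double strike, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            payoffs[i] = fmax(spots[i] - strike, 0.0);  // branch-free payoff
    }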

Porting existing applications

https://developer.nvidia.com/content/assess-parallelize-optimize-deploy

• Analyse

– Identify the hot spots by profiling with one or more realistic data sets

– Estimate performance improvements, considering strong and weak scaling

• Parallelise

– GPU-accelerated libraries (see the Thrust sketch after this list)

– OpenACC directives

– GPU programming languages

• Optimise

• Deploy
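As a sketch of the "GPU-accelerated libraries" route named above: a serial payoff loop can often be replaced wholesale by a Thrust call before any hand-written kernels are attempted (compute_payoff and pricePayoffs are illustrative names, not from the slides):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // Illustrative functor: call payoff at a fixed strike.
    struct compute_payoff
    {
        double strike;
        compute_payoff(double k) : strike(k) {}
        __host__ __device__
        double operator()(double spot) const
        {
            return spot > strike ? spot - strike : 0.0;
        }
    };

    void pricePayoffs(const thrust::device_vector<double>& spots,
                      thrust::device_vector<double>& payoffs, double strike)
    {
        // One library call replaces the hot loop; Thrust picks the launch
        // configuration and runs it on the device.
        thrust::transform(spots.begin(), spots.end(), payoffs.begin(),
                          compute_payoff(strike));
    }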

The key with GPUs is to ask why you want to use them – are you looking to do something a lot cheaper, or a lot faster? Or are you looking to do something that you cannot feasibly do today?

The assess phase pretty much has to have whatever metric is applicable (time to solution, # solutions/second per watt, solution in under 5 minutes – whatever) firmly in mind.

– John Ashley, NVidia

Case Study: Migrating Pricing Models to CUDA

Binomial Tree

BinomialTree::calculateFairValue(…)
    treeResult = tree.calculate();
    applyTerminalCondition();
    for(int i = startBackwardAt; i >= 0; --i)      ← the 'kernel'
    {
        _steps[i]->update(*this, spots, fairs);
        applyExerciseDecision(…)
    }

Binomial Tree Parallel

BinomialTree::calculateFairValue(…)
    treeResult = tree.calculate();
    applyTerminalCondition();
    for(int i = startBackwardAt; i >= 0; --i)
    {
        _steps[i]->update(*this, spotsVector, fairsVector);
        applyExerciseDecision(…)
    }

host:
    device_vector<double> d_spots(numContracts * numSteps);
    device_vector<double> d_fairs(numContracts * numSteps);
    applyTerminalConditions(d_spots, d_fairs);
    binomialTreePricer<<<numContracts>>>(…)
    for(int nContract = 0; …)
        fairValue[nContract] = h_fairs[nContract*numSteps + offset];
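The slide does not show the body of binomialTreePricer; a minimal hypothetical sketch, assuming one block per contract (matching the <<<numContracts>>> launch), a flat step probability and discount factor, terminal payoffs already written by applyTerminalConditions, and the exercise decision omitted:

    // Contract c owns d_fairs[c*numSteps .. c*numSteps + numSteps - 1].
    __global__ void binomialTreePricer(double* d_fairs, int numSteps,
                                       double p, double discount)
    {
        double* fairs = d_fairs + blockIdx.x * numSteps;
        // In-place backward induction: each node becomes the discounted
        // expectation of its two children one step later.
        for (int step = numSteps - 2; step >= 0; --step)
            for (int i = 0; i <= step; ++i)
                fairs[i] = discount * (p * fairs[i + 1] + (1.0 - p) * fairs[i]);
        // fairs[0] now holds this contract's fair value.
    }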

CUDA Approach

CPU:
  1. Prepare data
  2. Loop over 'kernel'
  3. Summarise results

CUDA:
  1. Prepare data
  2. Allocate device memory
  3. Copy data to device
  4. Invoke parallel kernel
  5. Copy data from device
  6. Summarise results
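In CUDA runtime calls, the six steps map onto a fixed host-side pattern; a sketch with illustrative names (prepareInputs, pricingKernel and summarise are not from the slides):

    #include <cuda_runtime.h>
    #include <vector>

    double priceOnGpu()
    {
        // 1. Prepare data (host)
        std::vector<double> h_in = prepareInputs();
        std::vector<double> h_out(h_in.size());
        size_t bytes = h_in.size() * sizeof(double);

        // 2. Allocate device memory
        double *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        // 3. Copy data to device
        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

        // 4. Invoke parallel kernel (replaces the CPU loop over the 'kernel')
        int threads = 256;
        int blocks = (int)((h_in.size() + threads - 1) / threads);
        pricingKernel<<<blocks, threads>>>(d_in, d_out, (int)h_in.size());

        // 5. Copy data from device (synchronises with the kernel)
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

        // 6. Summarise results (host)
        cudaFree(d_in); cudaFree(d_out);
        return summarise(h_out);
    }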

VarianceSwap Pricer

VarianceSwap::calculateFairValue(…)
    VarianceSwapReplication::calculateFairValue(…)
        calcFutureVariance(…)
            integrandStart = createLogContractIntegrand(…)
            integrator.integrate(…)
                for (i = 0; i < degreeDiv2; ++i)
                    calculateEuropean(…)
                        engine.calculate(…)
                            European::calculateFairValue(…)
                                call = new Call(…)
                                integration.integrate(…)
                                    for (i = 0; i < degree; ++i)
                                        double payoff = blackPremium(…)

VarianceSwap Pricer Parallel

VarianceSwap::calculateFairValue(…)
    VarianceSwapReplication::calculateFairValue(…)
        calcFutureVariance(…)
            integrandStart = createLogContractIntegrand(…)
            integrator.integrate(…)
                for (i = 0; i < degreeDiv2; ++i)
                    calculateEuropean(…)
                        engine.calculate(…)
                            European::calculateFairValue(…)
                                call = new Call(…)
                                integration.integrate(…)
                                    for (i = 0; i < degree; ++i)
                                        double payoff = blackPremium(…)      ← the 'kernel'

host:
    device_vector<double> bsPayoffs(…)
    device_vector<double> inputs(…)
    calculateBS<<<varSwapDegree, europeanDegree>>>(…)
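The kernel body is not shown on the slides; one plausible reading of the launch is one block per European in the replication strip (varSwapDegree) and one thread per quadrature node (europeanDegree), each evaluating one blackPremium term. A sketch under that assumption (the flattened input arrays and the device-side blackPremium port are inventions for illustration):

    // Assumed __device__ port of blackPremium(…): undiscounted Black call.
    __device__ double normCdf(double x)
    {
        return 0.5 * erfc(-x * 0.70710678118654752440);  // Phi(x) via erfc
    }

    __device__ double blackPremium(double fwd, double strike, double vol, double t)
    {
        double sd = vol * sqrt(t);
        double d1 = (log(fwd / strike) + 0.5 * sd * sd) / sd;
        return fwd * normCdf(d1) - strike * normCdf(d1 - sd);
    }

    // blockIdx.x = option in the strip, threadIdx.x = quadrature node.
    __global__ void calculateBS(const double* fwd, const double* strike,
                                const double* vol, const double* t,
                                double* bsPayoffs)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        bsPayoffs[idx] = blackPremium(fwd[idx], strike[idx], vol[idx], t[idx]);
    }

    // Launched as on the slide:
    // calculateBS<<<varSwapDegree, europeanDegree>>>(…);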

MonteCarlo Pricer

EuropeanMCPricer::calculateFairValue(…)
{
    MCPathsGeneratorUtils::generatePaths(…)
        pathGen->generateIndependentNormals(…)
            _RNGFactory->createNewRNG(…)
            RNG->getIndependentNormals(…)
                getIndependentUniforms(variates);
                convertUniformstoNormals(variates);
        pathGen->generateCorrelatedNormals(…)
        pathGen->generatePaths(…)
            _model->diffuse(…)
    for( size_t iTrial = 0; iTrial < trials; ++iTrial)
    {
        contract->calcContractualFlows( mcFixingSchedule )
        for(size_t i = 0; i < nFlows; ++i)
        {
            ImplementorUtils::getDefaultImplementor(…)
            calcEngine.calculate( flowImpl, fvRequest, results )
            fairValue += results->getFairValue();
        }
        priceCollector(fairValue);
    }
    result->setFairValue(acc::mean(priceCollector));
    result->set(FairValueStdDev, sqrt(variance(priceCollector)/trials));
}

MonteCarlo Pricer Parallel

EuropeanMCPricer::calculateFairValue(…)
{
    MCPathsGeneratorUtils::generatePaths(…)
        pathGen->generateIndependentNormals(…)
            _RNGFactory->createNewRNG(…)
            RNG->getIndependentNormals(…)
                getIndependentUniforms(variates);
                convertUniformstoNormals(variates);
        pathGen->generateCorrelatedNormals(…)
        pathGen->generatePaths(…)
            _model->diffuse(…)
    for( size_t iTrial = 0; iTrial < trials; ++iTrial)
    {
        contract->calcContractualFlows( mcFixingSchedule )
        for(size_t i = 0; i < nFlows; ++i)
        {
            ImplementorUtils::getDefaultImplementor(…)
            calcEngine.calculate( flowImpl, fvRequest, results )
            fairValue += results->getFairValue();
        }
        priceCollector(fairValue);
    }
    result->setFairValue(acc::mean(priceCollector));
    result->set(FairValueStdDev, sqrt(variance(priceCollector)/trials));
}

host:
    curandCreateGenerator(…)
    curandGenerateNormalDouble(…)
    thrust::device_vector<double> contractFlows(nFlows*trials);
    doTrialKernel<<<nTrials>>>(…)

kernel:
    pathGen->generateCorrelatedNormal(nTrial, …)
    pathGen->generatePath(nTrial, …)
        _model->diffuse(…)
    contract->calcContractualFlows(nTrial, …)

host:
    thrust::transform(contractFlows, flowPVs, PVs);
    double fairValue = thrust::reduce(PVs)/trials;
    double variance = thrust::inner_product(PVs, PVs, 0.0)/trials - fairValue*fairValue;
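The summarisation step expands to standard Thrust reductions; a sketch assuming flowPVs names a per-flow discounting functor and, for brevity, one flow per trial (both assumptions, not from the slides):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>
    #include <thrust/inner_product.h>

    // Assumed discounting functor with a precomputed discount factor.
    struct FlowPV
    {
        double df;
        FlowPV(double d) : df(d) {}
        __host__ __device__ double operator()(double flow) const { return df * flow; }
    };

    void summarise(const thrust::device_vector<double>& contractFlows,
                   double df, double trials)
    {
        thrust::device_vector<double> PVs(contractFlows.size());
        thrust::transform(contractFlows.begin(), contractFlows.end(),
                          PVs.begin(), FlowPV(df));
        double fairValue = thrust::reduce(PVs.begin(), PVs.end(), 0.0) / trials;
        double variance  = thrust::inner_product(PVs.begin(), PVs.end(),
                                                 PVs.begin(), 0.0) / trials
                           - fairValue * fairValue;   // E[X^2] - E[X]^2
        // … store via result->setFairValue / result->set(FairValueStdDev, …)
    }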

Copying Object Graphs

C++ Data Marshalling Best Practices - Cliff Woolley, NVIDIA http://on-demand.gputechconf.com/gtc/2012/presentations/S0377-GTC2012-Data-Marshalling-Practices.pdf

• We need data marshalling (serialization) to GPU

– We are moving data from one physical address space to another

– Virtual function tables must be updated

– Possible differences in structure layout

– Want bus transfers to be as efficient as possible

– Want parallel-friendly data organization to benefit the GPU

• C-style struct: cudaMemcpy and you’re done (even for an array)

– Except for fixing alignment

• TRICKIER CASES

– Virtual functions (device vtable ≠ host vtable)

  - Split off any class containing virtual functions into (see the sketch after this list):

    • a base class containing only Plain Old Data members

    • a derived class containing the virtual functions

– Bitfields

– AoS vs. SoA

– STL
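A minimal sketch of the recommended vtable split (class names are illustrative): the POD base can be cudaMemcpy'd between address spaces as-is, and each side lays the virtual layer, with its own correct vtable, over the copied data.

    // POD base: no vtable, so bitwise copies between host and device are safe.
    struct ModelData
    {
        double vol;
        double rate;
    };

    // Virtual layer: constructed separately on each side, so the vtable
    // pointer is always the right one for that address space.
    class Model : public ModelData
    {
    public:
        __host__ __device__ virtual double drift(double spot, double dt) const
        {
            return spot * exp((rate - 0.5 * vol * vol) * dt);  // GBM drift term
        }
    };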

Copying Object Graphs - Example

class PathGenerator : public Base
{
public:
    __host__ __device__
    shared_ptr<MCPaths> generateCorrelatedNormals(
        shared_ptr<MCPaths> uniformRandomNumbers ) const;
private:
    shared_ptr<PathParams> _params;
    ublas::matrix<double>  _cholesky;
};

[Diagram: host object graph – PathGenerator { vtbl, _params, _cholesky } pointing to PathParams { vtbl, _antithetic, _brownianBridge } and ublas::matrix { _size1, _size2, _data }]

Copying Object Graphs - DeviceWriter

class PathGenerator : public Base
{
public:
    void prepareDeviceMemory(DeviceWriter& writer)
    {
        writer.writeObject(this);
        writer.writeObject(_params);
        writer.writeMatrix(&_cholesky);
    }
};

[Diagram: the DeviceWriter packs the object graph – PathGenerator, PathParams and the matrix data – from the original host layout into one contiguous staging buffer, which is then mirrored as a single contiguous block on the device]

Calling member function on device

file: PathGenerator.cu

#include "cuda_shared_ptr.hpp"
#include "cuda_vector.h"
#include "cuda_matrix.hpp"
#include "MCPaths.h"

PathGenerator::generateCorrelatedNormals(shared_ptr<MCPaths> uniformRandomNumbers)
{
    shared_ptr<thrust::device_vector<double>> correlatedNormals(
        new thrust::device_vector<double>(uniformRandomNumbers->size()));
    DeviceWriter writer;
    prepareDeviceMemory(writer);
    writer.writeObject(uniformRandomNumbers);
    writer.copyToDevice();
    generateCorrelatedNormalKernel<<<1, uniformRandomNumbers->getTrials()>>>(
        writer.getDevicePtr(this),
        writer.getDevicePtr(uniformRandomNumbers),
        thrust::raw_pointer_cast(correlatedNormals->data()));
    return correlatedNormals;
}


file: cuda_shared_ptr.hpp

template<class T>
class shared_ptr
{
    __host__ __device__ T* operator->() const {…}
};

Calling virtual function on device

file: PathGenerator.cu

#include "cuda_shared_ptr.hpp"
#include "cuda_vector.h"
#include "cuda_matrix.hpp"
#include "MCPaths.h"

PathGenerator::generatePaths(shared_ptr<MCPaths> correlatedNormals)
{
    shared_ptr<thrust::device_vector<double>> paths(
        new thrust::device_vector<double>(correlatedNormals->size()));
    DeviceWriter writer;
    prepareDeviceMemory(writer);
    writer.writeObject(correlatedNormals);
    writer.addHostvtbl("BSModel", getvtbl(_bsModel));
    writer.addDevicevtbl("BSModel", getDevicevtbl<BSModel>());
    writer.copyToDevice();
    generatePathKernel<<<1, correlatedNormals->getTrials()>>>(…)
        _model->diffuse(…)   // called inside the kernel via the patched vtable
}


file: Model.h

class ModelBase
{
    __host__ __device__ virtual void diffuse(…) {…}
};

__host__ __device__?

file: PathGenerator.2.h

#ifndef __CUDACC__
shared_ptr<MCPaths> PathGenerator::generateCorrelatedNormals(…) const
{
    shared_ptr<MCPaths> dZs_ptr(new MCPaths(…));
    MCPaths& dZs = *dZs_ptr;
#else
__device__ void DevicePathGenerator::generateCorrelatedNormal(const int iTrial, …) const
{
    MCPathsRaw& dZs = *pdZs;
#endif
    const size_t nStateVariables = dZs.getIndices();
    const size_t nTimeSteps = dZs.getTimeSteps();
#ifndef __CUDACC__
    for( size_t iTrial = 0; iTrial < randomNumbers.getTrials(); ++iTrial)
    {
#endif
        for(size_t index = 0; index < nStateVariables; ++index)
        {
            for(size_t jTime = 0; jTime < nTimeSteps; ++jTime)
            {
                dZs(2*iTrial+1, index, jTime) = crng;
                …
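The same __CUDACC__ test gives the CUDAHOSTDEVICE prefix mentioned in the conclusions that follow; a sketch of the usual macro (discountFactor is an illustrative shared function):

    #include <cmath>

    // Expands to __host__ __device__ under nvcc, to nothing under a
    // plain C++ compiler, so one header serves both builds.
    #ifdef __CUDACC__
    #define CUDAHOSTDEVICE __host__ __device__
    #else
    #define CUDAHOSTDEVICE
    #endif

    CUDAHOSTDEVICE inline double discountFactor(double rate, double t)
    {
        return exp(-rate * t);   // callable from host code and device kernels
    }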

Case Study: Conclusions

1. Understand the balance between performance needs, hardware, development complexity/costs, and risk

2. To CUDA-enable existing libraries:

– Choose appropriate granularity for parallelisation, review code which will need porting to CUDA

– Simple objects are unchanged

– Move object allocation (new/vector resizing) out of lower level functions

– Add *.cu files

• #include cuda-enabled STL and boost headers

• write kernels to dispatch to __device__ member functions

• write host code to prepare/copy objects and call kernel

– Modify/split existing headers

• Use #ifdef __CUDACC__ to wrap device functions/subclasses

• CUDAHOSTDEVICE prefix for shared header functions

– Project files

• Add a build configuration to disable CUDA

3. Issues

– Persisting objects on device between calls (using this paradigm)

– nvcc compiler warnings can only be switched all on or all off

– Debugging errors is painful (Nsight should help)

Citi | Markets Quantitative Analysis

The End