GPU Accelerated Machine Learning for Bond Price Prediction
Venkat Bala, Rafael Nicolas Fermin Cota
Motivation
Primary Goals
• Demonstrate potential benefits of using GPUs over CPUs for machine learning
• Exploit inherent parallelism to improve model performance
• Real-world application using a bond trade dataset
Highlights
Ensemble
• Bagging: Train independent regressors on equal-sized bags of samples
• Generally, performance is superior to any single individual regressor
• Scalable: Each individual model can be trained independently and in parallel
Hardware Specifications
• CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
• GPU: GeForce GTX 1080 Ti
• RAM: 1 TB (DDR4 2400 MHz)
Bond Trade Dataset
Feature Set
• 100+ features per trade
• Trade Size / Historical Features
• Coupon Rate / Time to Maturity
• Bond Rating
• Trade Type: Buy/Sell
• Reporting Delays
• Current Yield / Yield to Maturity
Response
• Trade Price
Modeling Approach
The Machine Learning Pipeline
[Pipeline: Data Processing → Training Set / CV-Test Set → Model Building → Evaluate → Deploy]
Accelerate each stage in the pipeline for maximum performance
Data Preprocessing
Exposing Data Parallelism
• Important stage in the pipeline (Garbage In → Garbage Out)
• Many models rely on input data being on the same scale
• Standardization, log transformations, imputations, polynomial/non-linear feature generation, etc.
• In most cases there is no data dependence, so each operation can be executed independently
• Significant speedups can be obtained using GPUs, given sufficient data/computation
Data Preprocessing: Sequential Approach
Apply function F(·) sequentially to each element in a feature column:

[Diagram: F(·) visits a0 a1 a2 a3 … aN one element at a time]
Data Preprocessing: Parallel Approach
Apply function F(·) in parallel to each element in a feature column:

[Diagram: F(·) applied simultaneously to a0 a1 a2 a3 … aN, producing b0 b1 b2 b3 … bN]
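A minimal CUDA sketch of this pattern (the kernel and the choice of F are ours, not from the original implementation): one thread per element, so the entire column is transformed in a single parallel pass.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical F(·): log1p keeps zero-valued entries finite.
__device__ float F(float x) { return log1pf(x); }

// One thread per element: thread i reads a[i] and writes b[i].
__global__ void apply_F(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = F(a[i]);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f), out(n);
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Enough 256-thread blocks to cover all n elements.
    apply_F<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaMemcpy(out.data(), d_b, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("out[0] = %f\n", out[0]);  // log1p(1) ≈ 0.693
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}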
Programming Details
Implementation Basics
• Task is embarrassingly parallel
• Improve CPU code performance:
  • Auto-vectorization + compiler optimizations
  • Using performance libraries (Intel MKL)
  • Adopting threaded (OpenMP) / distributed computing (MPI) approaches
• Great application case for GPUs (see the streams sketch below):
  • Offload computations onto the GPU via CUDA kernels
  • Launch as many threads as there are data elements
  • Launch several kernels concurrently using CUDA streams
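A hedged sketch of the streams idea (the column count, fill values, and kernel are illustrative): each independent feature column gets its own CUDA stream, so its host-to-device copy and kernel can overlap with work on the other columns.

#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int kCols = 4, n = 1 << 20;
    cudaStream_t streams[kCols];
    float *h_cols[kCols], *d_cols[kCols];

    for (int c = 0; c < kCols; ++c) {
        cudaStreamCreate(&streams[c]);
        cudaMallocHost(&h_cols[c], n * sizeof(float));  // pinned host memory
        cudaMalloc(&d_cols[c], n * sizeof(float));
        for (int i = 0; i < n; ++i) h_cols[c][i] = 1.0f + c;
    }

    // Independent columns -> independent streams: the async copy and
    // kernel for column c may overlap with those of the other columns.
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int c = 0; c < kCols; ++c) {
        cudaMemcpyAsync(d_cols[c], h_cols[c], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<blocks, threads, 0, streams[c]>>>(d_cols[c], 0.5f, n);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    for (int c = 0; c < kCols; ++c) {
        cudaFreeHost(h_cols[c]);
        cudaFree(d_cols[c]);
        cudaStreamDestroy(streams[c]);
    }
    return 0;
}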
Toy Example: Speedup Over Sequential C++
• Log transformation of an array of floats
• N = 2^p elements, where p = log2(N) (a timing sketch follows below)
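A sketch of how such a benchmark might be timed, assuming std::chrono for the sequential CPU baseline and CUDA events for the kernel; this is standard measurement practice, not necessarily the harness behind the figure below, and it excludes transfer time.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void log_kernel(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = logf(a[i]);
}

int main() {
    const int p = 23, n = 1 << p;
    std::vector<float> h(n, 2.0f), out(n);

    // Sequential CPU baseline, timed with std::chrono.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) out[i] = std::log(h[i]);
    auto t1 = std::chrono::steady_clock::now();
    double cpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // GPU kernel timed with CUDA events.
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    log_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, beg, end);

    std::printf("p=%d CPU %.3f ms, GPU %.3f ms, speedup %.1fx\n",
                p, cpu_ms, gpu_ms, cpu_ms / gpu_ms);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}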
[Figure: Speedup over sequential C++ for p = 18-23; vectorized C++ vs. CUDA]
Bond Dataset Preprocessing
Applied Transformations
• Log transformation of highly skewed features (Trade Size, Time to Maturity)
• Standardization (Trade Price & historical prices)
• Missing value imputation
• Winsorizing features to handle outliers (see the fused kernel sketch below)
• Feature generation (price differences, yield measurements)
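As an illustration, two of these transforms can be fused into a single elementwise kernel. In this sketch the fusion choice and all parameters are ours: the clip bounds, mean, and standard deviation are assumed to be precomputed on the training set.

#include <cuda_runtime.h>

// Winsorize to [lo, hi], then standardize with a precomputed mean and
// standard deviation. In practice lo/hi/mu/sigma come from a first pass
// over the training data; here they are placeholder parameters.
__global__ void winsorize_standardize(float* x, int n,
                                      float lo, float hi,
                                      float mu, float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = fminf(fmaxf(x[i], lo), hi);  // clip tail outliers
    x[i] = (v - mu) / sigma;               // zero mean, unit variance
}

// Example launch for a column of n trade prices already on the device:
// winsorize_standardize<<<(n + 255) / 256, 256>>>(d_price, n,
//                                                 lo, hi, mu, sigma);

Fusing the two passes halves the global-memory traffic, which matters for such low-arithmetic-intensity operations.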
Implementation Details
• CPU: C++ implementation using Intel MKL/Armadillo
• GPU: CUDA
GPU Speedup over CPU implementation
• Nearly 10x speedup obtained after CUDA optimizations
[Figure: Speedup over CPU for p = 20-25; unoptimized vs. optimized CUDA]
CUDA Optimizations
Standard Tricks
• Concurrent kernel execution using CUDA streams to maximize GPU utilization
• Use of optimized libraries such as cuBLAS/Thrust
• Coalesced memory accesses
• Maximizing memory bandwidth for operations with low arithmetic intensity
• Caching in GPU shared memory (see the reduction sketch below)
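A sketch of the shared-memory and coalescing ideas together, using a standard block reduction to compute a column sum (e.g., for the mean used in standardization); the kernel is illustrative, not the production code.

#include <cuda_runtime.h>

// Block-level sum in shared memory: each block accumulates its partial
// sums on-chip, so only one atomicAdd per block touches global memory.
// Consecutive threads read consecutive elements (coalesced access).
// Assumes blockDim.x is a power of two.
__global__ void column_sum(const float* x, int n, float* total) {
    extern __shared__ float cache[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, cache[0]);
}

// Launch with dynamic shared memory sized to one float per thread:
// column_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_x, n, d_total);
// The mean is then *d_total / n, feeding the standardization kernel above.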
Model Building
Ensemble Model
Model Choices
• GBT: XGBoost; DNN: TensorFlow/Keras

[Diagram: GBT and DNN models feed into the ensemble model]
Hyperparameter Tuning: Hyperopt
GBT: XGBoost
• Learning Rate
• Max depth
• Minimum child weight
• Subsample, colsample_bytree
• Regularization parameters
DNN: MLPs
• Learning Rate / Decay Rate
• Batch Size
• Epochs
• Hidden layers / Layer width
• Activations / Dropouts
Hyperparameter Tuning: Hyperopt
[Figure: Learning rate values sampled by Hyperopt over 1000 iterations]
XGBoost: Training & Hyperparameter Optimization Time
[Figure: Average GBT training time (hours), Intel(R) Xeon(R) E5-2699 (32 cores) vs. GTX 1080 Ti; GPU speedup ≈ 3x]
TensorFlow/Keras Time Per Epoch
[Figure: TensorFlow/Keras time per epoch (s) for p = 15-18, Intel(R) Xeon(R) E5-2699 (32 cores) vs. GTX 1080 Ti; GPU speedup ≈ 3x]
Model Test Set Performance
[Figure: Predicted vs. actual trade prices on the test set; R² = 0.9858]
Summary
Final Remarks
• Leveraging GPU compute power → dramatic speedups
• Maximum performance when GPUs are incorporated into every stage of the pipeline
• Ensembles: Bagging/Boosting to improve model accuracy/throughput
• Shorter training times allow more experimentation
• Extensive support available
• Deploying this pipeline now on our in-house DGX-1
Questions?