GPU Accelerated Machine Learning for Bond Price Prediction
Venkat Bala, Rafael Nicolas Fermin Cota
Motivation
Primary Goals
• Demonstrate potential benefits of using GPUs over CPUs for machine learning
• Exploit inherent parallelism to improve model performance
• Real-world application using a bond trade dataset
Highlights
Ensemble
• Bagging: Train independent regressors on equal-sized bags of samples
• Generally, performance is superior to any single individual regressor
• Scalable: Each individual model can be trained independently and in parallel
Hardware Specifications
• CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
• GPU: GeForce GTX 1080 Ti
• RAM: 1 TB (DDR4 2400 MHz)
Bond Trade Dataset
Feature Set
• 100+ features per trade
• Trade Size / Historical Features
• Coupon Rate / Time to Maturity
• Bond Rating
• Trade Type: Buy/Sell
• Reporting Delays
• Current Yield / Yield to Maturity
Response
• Trade Price
Modeling Approach
The Machine Learning Pipeline
[Pipeline: Data Processing → Training Set / CV-Test Set → Model Building → Evaluate → Deploy]
Accelerate each stage in the pipeline for maximum performance
Data Preprocessing
Exposing Data Parallelism
• Important stage in the pipeline (Garbage In → Garbage Out)
• Many models rely on input data being on the same scale
• Standardization, log transformations, imputations, polynomial/non-linear feature generation, etc.
• In most cases there is no data dependence, so each operation can be executed independently
• Significant speedups can be obtained using GPUs, given sufficient data/computation
Data Preprocessing: Sequential Approach
Apply function F(·) sequentially to each element in a feature column:

[Diagram: F(·) visits a0 a1 a2 a3 … aN one element at a time]
Data Preprocessing: Parallel Approach
Apply function F(·) in parallel to each element in a feature column:

[Diagram: F(·) applied simultaneously to a0 a1 a2 a3 … aN, producing b0 b1 b2 b3 … bN]
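A minimal CUDA sketch of this pattern (the kernel and the choice of F are ours, not from the original implementation): one thread per element, so the entire column is transformed in a single parallel pass.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical F(·): log1p keeps zero-valued entries finite.
__device__ float F(float x) { return log1pf(x); }

// One thread per element: thread i reads a[i] and writes b[i].
__global__ void apply_F(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = F(a[i]);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f), out(n);
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Enough 256-thread blocks to cover all n elements.
    apply_F<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaMemcpy(out.data(), d_b, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("out[0] = %f\n", out[0]);  // log1p(1) ≈ 0.693
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}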
Programming Details
Implementation Basics
• Task is embarrassingly parallel
• Improve CPU code performance:
  • Auto-vectorization + compiler optimizations
  • Using performance libraries (Intel MKL)
  • Adopting threaded (OpenMP) / distributed computing (MPI) approaches
• Great application case for GPUs (see the streams sketch below):
  • Offload computations onto the GPU via CUDA kernels
  • Launch as many threads as there are data elements
  • Launch several kernels concurrently using CUDA streams
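A hedged sketch of the streams idea (the column count, fill values, and kernel are illustrative): each independent feature column gets its own CUDA stream, so its host-to-device copy and kernel can overlap with work on the other columns.

#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int kCols = 4, n = 1 << 20;
    cudaStream_t streams[kCols];
    float *h_cols[kCols], *d_cols[kCols];

    for (int c = 0; c < kCols; ++c) {
        cudaStreamCreate(&streams[c]);
        cudaMallocHost(&h_cols[c], n * sizeof(float));  // pinned host memory
        cudaMalloc(&d_cols[c], n * sizeof(float));
        for (int i = 0; i < n; ++i) h_cols[c][i] = 1.0f + c;
    }

    // Independent columns -> independent streams: the async copy and
    // kernel for column c may overlap with those of the other columns.
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int c = 0; c < kCols; ++c) {
        cudaMemcpyAsync(d_cols[c], h_cols[c], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<blocks, threads, 0, streams[c]>>>(d_cols[c], 0.5f, n);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    for (int c = 0; c < kCols; ++c) {
        cudaFreeHost(h_cols[c]);
        cudaFree(d_cols[c]);
        cudaStreamDestroy(streams[c]);
    }
    return 0;
}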
Toy Example: Speedup Over Sequential C++
• Log transformation of an array of floats
• N = 2^p elements, where p = log2(N) (a timing sketch follows below)
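A sketch of how such a benchmark might be timed, assuming std::chrono for the sequential CPU baseline and CUDA events for the kernel; this is standard measurement practice, not necessarily the harness behind the figure below, and it excludes transfer time.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void log_kernel(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = logf(a[i]);
}

int main() {
    const int p = 23, n = 1 << p;
    std::vector<float> h(n, 2.0f), out(n);

    // Sequential CPU baseline, timed with std::chrono.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) out[i] = std::log(h[i]);
    auto t1 = std::chrono::steady_clock::now();
    double cpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // GPU kernel timed with CUDA events.
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    log_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, beg, end);

    std::printf("p=%d CPU %.3f ms, GPU %.3f ms, speedup %.1fx\n",
                p, cpu_ms, gpu_ms, cpu_ms / gpu_ms);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}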
[Figure: Speedup over sequential C++ for p = 18-23; vectorized C++ vs. CUDA]
Bond Dataset Preprocessing
Applied Transformations
• Log transformation of highly skewed features (Trade Size, Time to Maturity)
• Standardization (Trade Price & historical prices)
• Missing value imputation
• Winsorizing features to handle outliers (see the fused kernel sketch below)
• Feature generation (price differences, yield measurements)
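As an illustration, two of these transforms can be fused into a single elementwise kernel. In this sketch the fusion choice and all parameters are ours: the clip bounds, mean, and standard deviation are assumed to be precomputed on the training set.

#include <cuda_runtime.h>

// Winsorize to [lo, hi], then standardize with a precomputed mean and
// standard deviation. In practice lo/hi/mu/sigma come from a first pass
// over the training data; here they are placeholder parameters.
__global__ void winsorize_standardize(float* x, int n,
                                      float lo, float hi,
                                      float mu, float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = fminf(fmaxf(x[i], lo), hi);  // clip tail outliers
    x[i] = (v - mu) / sigma;               // zero mean, unit variance
}

// Example launch for a column of n trade prices already on the device:
// winsorize_standardize<<<(n + 255) / 256, 256>>>(d_price, n,
//                                                 lo, hi, mu, sigma);

Fusing the two passes halves the global-memory traffic, which matters for such low-arithmetic-intensity operations.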
Implementation Details
• CPU: C++ implementation using Intel MKL/Armadillo
• GPU: CUDA
GPU Speedup over CPU implementation
• Nearly 10x speedup obtained after CUDA optimizations
[Figure: Speedup over CPU for p = 20-25; unoptimized vs. optimized CUDA]
CUDA Optimizations
Standard Tricks
• Concurrent kernel execution using CUDA streams to maximize GPU utilization
• Use of optimized libraries such as cuBLAS/Thrust
• Coalesced memory accesses
• Maximizing memory bandwidth for operations with low arithmetic intensity
• Caching in GPU shared memory (see the reduction sketch below)
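A sketch of the shared-memory and coalescing ideas together, using a standard block reduction to compute a column sum (e.g., for the mean used in standardization); the kernel is illustrative, not the production code.

#include <cuda_runtime.h>

// Block-level sum in shared memory: each block accumulates its partial
// sums on-chip, so only one atomicAdd per block touches global memory.
// Consecutive threads read consecutive elements (coalesced access).
// Assumes blockDim.x is a power of two.
__global__ void column_sum(const float* x, int n, float* total) {
    extern __shared__ float cache[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, cache[0]);
}

// Launch with dynamic shared memory sized to one float per thread:
// column_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_x, n, d_total);
// The mean is then *d_total / n, feeding the standardization kernel above.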
Model Building
Ensemble Model
Model Choices
• GBT: XGBoost; DNN: TensorFlow/Keras

[Diagram: GBT and DNN models feed into the ensemble model]
Hyperparameter Tuning: Hyperopt
GBT: XGBoost
• Learning Rate
• Max depth
• Minimum child weight
• Subsample, colsample_bytree
• Regularization parameters
DNN: MLPs
• Learning Rate / Decay Rate
• Batch Size
• Epochs
• Hidden layers / Layer width
• Activations / Dropouts
Hyperparameter Tuning: Hyperopt
[Figure: Learning rate values sampled by Hyperopt over 1000 iterations]
XGBoost: Training & Hyperparameter Optimization Time
[Figure: Average GBT training time (hours), Intel(R) Xeon(R) E5-2699 (32 cores) vs. GTX 1080 Ti; GPU speedup ≈ 3x]
TensorFlow/Keras Time Per Epoch
[Figure: TensorFlow/Keras time per epoch (s) for p = 15-18, Intel(R) Xeon(R) E5-2699 (32 cores) vs. GTX 1080 Ti; GPU speedup ≈ 3x]
Model Test Set Performance
[Figure: Predicted vs. actual trade prices on the test set; R² = 0.9858]
Summary
Final Remarks
• Leveraging GPU compute power → dramatic speedups
• Maximum performance when GPUs are incorporated into every stage of the pipeline
• Ensembles: Bagging/Boosting to improve model accuracy/throughput
• Shorter training times allow more experimentation
• Extensive support available
• Deploying this pipeline now on our in-house DGX-1
Questions?