Towards Heterogeneous Solvers for Large-Scale Linear Systems
Stylianos I. Venieris, Grigorios Mingas, Christos-Savvas Bouganis
FPL 2015, London
2 Sept 2015
Introduction – Solving Linear Systems
• Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $m \ge n$, find the vector $x$ that solves
$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2$
[Figure: matrix A (rows = samples, columns = measurements), vector b (target values), vector x (weights)]
[Figure: the same system annotated with a biological-data example; labels: Patients, DNA Nucleotides, Phenotypes]
Introduction – Solving Linear Systems
Linear Systems
Big Data Computing
Infrastructure
• Biology
• Wearables
• Internet
• Bioinformatics
• Econometrics
• Control
Feasibility
Introduction – Data Systems with different structure
• Square system ($m = n$)
• Tall-skinny system ($m \gg n$)
− Many samples (rows)
− A few measurements (columns)
• Anything in between
Introduction – Linear Systems in Genetic Analysis
• Real-life Example:
− Genetic Analysis [1]
− Search for gene combinations by solving lots of linear systems
• Application Characteristics:
− A lot of linear systems
− Linear systems of varying size
− Up to 100,000 rows and a few hundred columns
[Figure: genes search space]
[1] L. Bottolo and S. Richardson, “Evolutionary Stochastic Search for Bayesian Model Exploration”, Bayesian Analysis, vol. 5, no. 3, pp. 583–618, Sep. 2010.
Our focus
• Towards heterogeneity for high performance across matrix sizes
• Novel FPGA solver for tall-skinny linear systems
• Modelling framework
– Performance and resource estimation (compile and runtime)
– Optimal hardware configuration (compile-time)
• Up to 18x speed-up in GFLOPS across matrix sizes compared to existing work
Route Map
• Background
• Towards Heterogeneity for Performance
• An Enhanced FPGA Solver
• Evaluation Results
• Conclusions
Background – Solving Linear Systems
• QR factorisation-based methods are dominant because of their numerical stability
• $A = QR$ with orthogonal $Q$ and upper-triangular $R$
• $Ax = b \Rightarrow QRx = b$
• $Q^T b$: matrix-vector product
• $x = R^{-1} Q^T b$: solution using back-substitution
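The QR-based solution steps can be sketched in NumPy as follows. This is a minimal illustration rather than any of the solvers from the talk: the function names `back_substitute` and `solve_least_squares_qr`, the random test matrix and its size are assumptions made for the example.

```python
import numpy as np

def back_substitute(R, y):
    """Solve R x = y for an upper-triangular R by back-substitution."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the already-known terms, then divide by the diagonal entry.
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def solve_least_squares_qr(A, b):
    """Minimise ||Ax - b||_2 using the thin QR factorisation A = QR."""
    Q, R = np.linalg.qr(A)      # Q: m x n with orthonormal columns, R: n x n
    return back_substitute(R, Q.T @ b)

# Hypothetical tall-skinny test problem (m >> n).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)

x = solve_least_squares_qr(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]   # reference solution
print(np.allclose(x, x_ref))
```

Note that this version forms $Q$ explicitly, which is convenient in NumPy but is exactly the extra work that a solver may want to avoid.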
The Challenge
• Existing solvers
― Target CPUs, GPUs and FPGAs
― Employ different algorithms
― Tailored to specific matrix sizes
• Key Challenge: Sustain high performance across matrix sizes
Heterogeneity for Performance – Different Solvers
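[Figure: solvers across the problem space]
• Platforms: FPGA, GPU, CPU
• Algorithms: CAQR, TSQR, Householder QR
• Problem space: from square systems ($m = n$), through anything in between, to tall-skinny systems ($m \gg n$)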
Heterogeneity for Performance – A Heterogeneous Solver
• Have a heterogeneous bank of solvers
• Select the highest performer based on the matrix size
[Figure: the target application issues matrices A of varying size (e.g. a tall-skinny system); a workload-allocation MUX routes each matrix to one solver in the bank of solvers (FPGA solver, GPU solver, CPU solver)]
Compute Engines
Compute Engines: FPGA Solver
• Based on existing architecture for tall-skinny QR factorisations [2]
• Functionality Extension
– From QR to Linear Systems workloads
– Exploited an algorithmic property for acceleration
• “Concurrent Solution and Factorisation”
• Any no. of rows
• Up to a max no. of columns
– 5x the max no. of columns of existing FPGA work for the same device
[2] A. Rafique et al., FPL 2012.
Compute Engines: FPGA Solver
• Splits the matrix into blocks along its rows
• Any no. of rows
• Up to a max no. of columns
– 5x the max no. of columns of existing FPGA work for the same device
• Configurable parameters
– Size of arithmetic units
– No. of blocks active in parallel in the architecture
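Splitting a tall-skinny matrix into row blocks is the idea behind TSQR, one of the algorithms named earlier in the talk. The sketch below is a software illustration only, assuming a single-level reduction (all local R factors stacked and factorised once); the function name `tsqr_r` and the block count are hypothetical, and it makes no claim about how the FPGA architecture schedules its blocks.

```python
import numpy as np

def tsqr_r(A, num_blocks):
    """Return the R factor of a tall-skinny A via a one-level TSQR."""
    # 1. Split A into blocks along its rows.
    blocks = np.array_split(A, num_blocks, axis=0)
    # 2. Factorise each block independently (this step is parallelisable).
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # 3. Stack the small R factors and factorise the stack once more.
    return np.linalg.qr(np.vstack(local_rs), mode="r")

# Sizes borrowed from the evaluation slides (m = 6400, n = 51); data is random.
rng = np.random.default_rng(1)
A = rng.standard_normal((6400, 51))

R_tsqr = tsqr_r(A, num_blocks=8)
R_ref = np.linalg.qr(A, mode="r")

# R is unique only up to the sign of each row, so compare magnitudes.
print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))
```

Because each block is factorised independently, the number of blocks processed in parallel is a natural tuning knob, which is consistent with it being listed as a configurable parameter.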
Concurrent Solution and Factorisation
1. $A = QR$
2. $Ax = b \Rightarrow QRx = b$
3. $Q^T b$: matrix-vector product
4. $x = R^{-1} Q^T b$: solution
• Numerically stable algorithms do not return $Q$ explicitly
• Additional computations are needed to reconstruct $Q$ and to form $Q^T b$
Concurrent Solution and Factorisation
• Compute $Q^T b$ without forming $Q$ explicitly
[Figure: previous works run a QR algorithm on A to obtain R, with Q reconstructed from partial results; this work runs a modified QR algorithm on A and b to obtain R and $Q^T b$]
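One common way to obtain $Q^T b$ without forming $Q$ is to apply each Householder reflector to $b$ as soon as it is applied to $A$. The sketch below illustrates that idea in NumPy; it is an assumption about how such a scheme can work in general, not a description of the modified QR algorithm used in the talk, and the function name `householder_qr_with_rhs` is hypothetical.

```python
import numpy as np

def householder_qr_with_rhs(A, b):
    """Reduce A to R with Householder reflectors, applying each one to b too.

    Returns (R, c), where R is the n x n upper triangle and c holds the
    first n entries of Q^T b. Q itself is never formed.
    """
    R = np.array(A, dtype=float)
    c = np.array(b, dtype=float)
    m, n = R.shape
    for k in range(n):
        x = R[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                      # column already zero below the diagonal
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])  # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # Apply the reflector H = I - 2 v v^T to the trailing matrix and to b.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        c[k:] -= 2.0 * v * (v @ c[k:])
    return np.triu(R[:n, :]), c[:n]

# Sizes borrowed from the evaluation slides (m = 6400, n = 51); data is random.
rng = np.random.default_rng(2)
A = rng.standard_normal((6400, 51))
b = rng.standard_normal(6400)

R, c = householder_qr_with_rhs(A, b)
x = np.linalg.solve(R, c)   # general solver used for brevity; R is triangular
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_ref))
```

The right-hand side costs one extra vector update per reflector, so the solution is ready as soon as the factorisation finishes.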
Compute Engines: GPU and CPU Solvers
• GPU Solver
– Based on the state-of-the-art work on tall-skinny QR factorisations [3]
– Extended its functionality to Linear Systems by means of the Concurrent Solution and Factorisation
• CPU Solver
– Optimised multithreaded linear algebra library (OpenBLAS)
[3] M. Anderson, C. Ballard, J. Demmel, and K. Keutzer, “Communication-Avoiding QR Decomposition for GPUs”, in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, May 2011, pp. 48–58.
Modelling Framework
• Compile-time Framework
– Performance (in GFLOPS) and resource estimation models for the FPGA solver
– Optimal hardware configuration of the FPGA solver
• Runtime Framework
– Workload allocation among the available solvers
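As a worked illustration of the GFLOPS metric, the formula below assumes the standard Householder QR operation count; the slides do not state which count was used, and $t$ (the measured or modelled solve time in seconds) is a symbol introduced here.

$$\text{flops}_{QR}(m, n) \approx 2mn^2 - \tfrac{2}{3}n^3, \qquad \text{GFLOPS} = \frac{\text{flops}_{QR}(m, n)}{10^9 \, t}$$

For example, $m = 6400$ and $n = 51$ give $2 \cdot 6400 \cdot 51^2 - \tfrac{2}{3} \cdot 51^3 \approx 3.32 \times 10^7$ operations.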
Heterogeneity for Performance – Proposed Design Flow
[Figure: design-flow diagram with the following elements]
• Supplied by application expert: Available Solvers Specifications, Typical Matrix Size
• Compile-time Modelling Framework (Application-Specific): FPGA Resource Info, Hardware Configuration Settings, FPGA Performance Model
• Compile-time GPU and CPU Profiling
• Bank of Solvers: FPGA Solver Instance, GPU and CPU Solvers
• Runtime Workload Allocation: New Workload, FPGA Performance Estimate, Solver Selection, Solution
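The runtime step of the design flow, selecting the highest performer for each new workload, can be sketched as below. Everything numeric here is a made-up placeholder: the three model functions stand in for the FPGA performance model and the GPU and CPU profiling data, and are not measurements from the talk. Only the 275-column limit of the FPGA solver comes from the slides.

```python
from typing import Callable, Dict, Tuple

# A performance model maps a matrix size (m rows, n columns) to estimated GFLOPS.
PerfModel = Callable[[int, int], float]

MAX_FPGA_COLUMNS = 275  # column limit reported for the FPGA solver

def fpga_model(m: int, n: int) -> float:
    # PLACEHOLDER model: favours tall-skinny inputs, unusable beyond the limit.
    if n > MAX_FPGA_COLUMNS:
        return 0.0
    return 40.0 if m >= 10 * n else 5.0

def gpu_model(m: int, n: int) -> float:
    # PLACEHOLDER model: improves with the number of columns, then saturates.
    return min(100.0, 0.2 * n)

def cpu_model(m: int, n: int) -> float:
    # PLACEHOLDER model: flat estimate.
    return 25.0

MODELS: Dict[str, PerfModel] = {
    "fpga": fpga_model,
    "gpu": gpu_model,
    "cpu": cpu_model,
}

def select_solver(m: int, n: int,
                  models: Dict[str, PerfModel] = MODELS) -> Tuple[str, float]:
    """Return the name and estimate of the solver with the highest GFLOPS."""
    estimates = {name: model(m, n) for name, model in models.items()}
    best = max(estimates, key=estimates.get)
    return best, estimates[best]

for size in [(6400, 51), (100, 100), (2000, 1000)]:
    print(size, select_solver(*size))
```

Keeping the models behind a common call signature means an analytical model and a profiling lookup table can be mixed freely in the same bank.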
Evaluation
Experimental Setup
• FPGA
– Xilinx Virtex-6 SX475 at 200 MHz with 2016 DSPs
– Double-precision floating-point
– Up to 275 columns (𝑛 = 275) – 5x the max no. of columns of existing FPGA work
– Any number of rows
– 84.23% BRAM utilization – post place-and-route
– 99.45% DSP utilization – post place-and-route
• GPU
– NVIDIA Tesla K20
– 2496 cores at 706 MHz
• CPU
– Intel i7-4770 at 3.40 GHz, 16 GB RAM, 8 MB cache
– 4 cores, 8 threads
Internal Comparisons
[Figure: CPU vs. proposed approach for no. of columns (𝑛) = 51; annotated speed-ups: 4.67× and 2.74×]
Internal Comparisons
[Figure: CPU vs. proposed approach for no. of rows (𝑚) = 6400; annotated speed-up: 2×]
External Comparisons
• Up to 25.84x speed-up against software
• Up to 32.67x speed-up against CULA
• Up to 18.07x speed-up against existing FPGA work [4]
[Figure: speed-up for no. of columns (𝑛) = 51; series labels: CPU, CPU - GPU]
[4] A. Rafique et al., FPL 2012.
External Comparisons
• Up to 25.84x speed-up against software
• Up to 32.67x speed-up against CULA
• Up to 18.07x speed-up against existing FPGA work [4]
[Figure: speed-up for no. of rows (𝑚) = 6400; series labels: CPU, CPU - GPU]
[4] A. Rafique et al., FPL 2012.
Conclusions
• Different solvers perform better on different matrix sizes
• Using heterogeneous solvers in a complementary way enables the high-performance solution of complex problems in fields such as genetic analysis
Thank You & Questions?