Towards Heterogeneous Solvers for Large-Scale Linear Systems

Stylianos I. Venieris, Grigorios Mingas, Christos-Savvas Bouganis

stylianos.venieris10@imperial.ac.uk

FPL 2015, London

2 Sept 2015

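Introduction – Solving Linear Systems

• Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $m \ge n$, find the vector $x$ that solves

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2$$

[Figure: $A$ holds samples (rows) by measurements (columns), $b$ holds the target values and $x$ the weights. Biological data example: patients (samples), DNA nucleotides (measurements), phenotypes (target values).]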

Introduction – Solving Linear Systems

[Figure labels: Linear Systems, Big Data, Computing Infrastructure, Feasibility]

• Biology

• Wearables

• Internet

• Bioinformatics

• Econometrics

• Control

Introduction – Data Systems with different structure

• Square system: 𝑚 = 𝑛

• Tall-skinny system: 𝑚 ≫ 𝑛

– Many samples (rows)

– A few measurements (columns)

• Anything in between

Introduction – Linear Systems in Genetic Analysis

• Real-life Example:

− Genetic Analysis [1]

− Search for gene combinations by solving lots of linear systems

• Application Characteristics:

− A lot of linear systems

− Linear systems of varying size

− Up to 100,000 rows and a few hundred columns

[Figure: genes, search space]

[1] L. Bottolo and S. Richardson, “Evolutionary Stochastic Search for Bayesian Model Exploration”, Bayesian Analysis, vol. 5, no. 3, pp. 583–618, Sep. 2010.

Our focus

• Towards heterogeneity for high performance across matrix sizes

• Novel FPGA solver for tall-skinny linear systems

• Modelling framework

– Performance and resource estimation (compile and runtime)

– Optimal hardware configuration (compile-time)

• Up to 18x speed-up in GFLOPS across matrix sizes compared to existing works

Route Map

• Background

• Towards Heterogeneity for Performance

• An Enhanced FPGA Solver

• Evaluation Results

• Conclusions

Background – Solving Linear Systems

• QR factorisation-based methods are dominant because of their numerical stability

• $A = QR$ with orthogonal $Q$ and upper-triangular $R$

• $Ax = b \Rightarrow QRx = b$

• $Q^T b$: matrix-vector product

• $x = R^{-1} Q^T b$: solution using back-substitution
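The QR-based solution outlined above can be sketched in plain Python. This is a minimal illustration and not the solver from the talk: it forms $Q$ explicitly with modified Gram-Schmidt, which was chosen here only for brevity, and the names `qr_mgs`, `back_substitute` and `solve_least_squares` are hypothetical.

```python
def qr_mgs(A):
    """Thin QR of an m x n matrix (m >= n, full column rank) by modified Gram-Schmidt.

    Returns Q as a list of n orthonormal columns and R as an n x n list of rows.
    """
    m, n = len(A), len(A[0])
    R = [[0.0] * n for _ in range(n)]
    q_cols = []
    for j in range(n):
        v = [float(A[i][j]) for i in range(m)]  # j-th column of A
        for k, q in enumerate(q_cols):
            # Remove the component of v along each earlier orthonormal column.
            R[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - R[k][j] * qi for qi, vi in zip(q, v)]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5
        q_cols.append([vi / R[j][j] for vi in v])
    return q_cols, R


def back_substitute(R, c):
    """Solve R x = c for upper-triangular R, starting from the last row."""
    n = len(c)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (c[i] - s) / R[i][i]
    return x


def solve_least_squares(A, b):
    """Minimise ||Ax - b||_2 via A = QR, c = Q^T b, then R x = c."""
    q_cols, R = qr_mgs(A)
    c = [sum(qi * bi for qi, bi in zip(q, b)) for q in q_cols]  # Q^T b
    return back_substitute(R, c)


# Fit y = x0 + x1 * t to four points that lie exactly on y = 1 + 2t.
A = [[1, 0], [1, 1], [1, 2], [1, 3]]
b = [1, 3, 5, 7]
x = solve_least_squares(A, b)
```

Forming $Q$ explicitly keeps the three stages (factorise, multiply, back-substitute) visibly separate, at the cost of extra storage and work.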

The Challenge

• Existing solvers

– Target CPUs, GPUs and FPGAs

– Employ different algorithms

– Tailored to specific matrix sizes

• Key Challenge: Sustain high performance across matrix sizes

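Heterogeneity for Performance – Different Solvers

[Figure: existing solvers across the problem space]

• Platforms: FPGA, GPU, CPU

• Algorithms: CAQR, TSQR, Householder QR

• Problem space: from square (𝑚 = 𝑛), through anything in between, to tall-skinny (𝑚 ≫ 𝑛)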

Heterogeneity for Performance – A Heterogeneous Solver

• Have a heterogeneous bank of solvers

• Select the highest performer based on the matrix size

[Figure: the target application issues matrices 𝑨 of varying size, from square to tall-skinny systems; a MUX performs workload allocation over the bank of solvers (FPGA solver, GPU solver, CPU solver).]
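Selecting the highest performer for a given matrix size could look like the following sketch. The function `select_solver` and the three estimator functions are hypothetical, and their formulas are arbitrary placeholders, not measurements or models from the talk. The only figure taken from the slides is the 275-column limit of the FPGA solver.

```python
def select_solver(m, n, estimators):
    """Pick the solver whose estimator predicts the highest GFLOPS for an m x n system."""
    return max(estimators, key=lambda name: estimators[name](m, n))


# Placeholder estimators: arbitrary shapes chosen for illustration only.
def fpga_estimate(m, n, max_cols=275):
    # The FPGA solver accepts any number of rows but only up to max_cols columns.
    return 0.0 if n > max_cols else 50.0 * m / (m + 100.0 * n)


def gpu_estimate(m, n):
    return 25.0 * n / (n + 200.0)


def cpu_estimate(m, n):
    return 10.0


estimators = {"fpga": fpga_estimate, "gpu": gpu_estimate, "cpu": cpu_estimate}

tall_skinny_choice = select_solver(100_000, 51, estimators)
square_choice = select_solver(1_000, 1_000, estimators)
```

Keeping the estimators behind a common `(m, n)` signature means a solver can be added to or removed from the bank without touching the selection logic.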

Compute Engines

Compute Engines: FPGA Solver

• Based on existing architecture for tall-skinny QR factorisations [2]

• Functionality Extension

– From QR to Linear Systems workloads

– Exploited an algorithmic property for acceleration: “Concurrent Solution and Factorisation”

• Any no. of rows

• Up to a max no. of columns

– 5x the max no. of columns of existing FPGA work for the same device

[2] A. Rafique et al., FPL 2012.

Compute Engines: FPGA Solver

• Splits the matrix into blocks along its rows

• Any no. of rows

• Up to a max no. of columns

– 5x the max no. of columns of existing FPGA work for the same device

• Configurable parameters

– Size of arithmetic units

– No. of blocks to be active in parallel in the architecture
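Splitting a tall-skinny matrix into row blocks can be illustrated with a two-level TSQR-style reduction. The sketch assumes the row-block scheme follows the general TSQR pattern (factorise each block, stack the small $R$ factors, factorise again). The hardware architecture itself is not described in this transcript, and `householder_r`, `tsqr_r` and `gram` are hypothetical helper names.

```python
def householder_r(A):
    """Return the upper-triangular factor R (n x n) of an m x n matrix with m >= n."""
    m, n = len(A), len(A[0])
    M = [[float(a) for a in row] for row in A]
    for k in range(n):
        norm = sum(M[i][k] ** 2 for i in range(k, m)) ** 0.5
        if norm == 0.0:
            continue  # column is already zero on and below the diagonal
        alpha = -norm if M[k][k] >= 0 else norm
        v = [M[i][k] for i in range(k, m)]
        v[0] -= alpha  # Householder vector; this sign choice avoids cancellation
        vv = sum(vi * vi for vi in v)
        for j in range(k, n):
            s = 2.0 * sum(vi * M[k + i][j] for i, vi in enumerate(v)) / vv
            for i, vi in enumerate(v):
                M[k + i][j] -= s * vi
    # Entries below the diagonal are rounding noise; return a clean triangle.
    return [[M[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(A, block_rows):
    """R factor of A computed block by block along the rows."""
    n = len(A[0])
    if block_rows < n or len(A) < n:
        raise ValueError("the matrix and every block need at least n rows")
    blocks = [A[i:i + block_rows] for i in range(0, len(A), block_rows)]
    if len(blocks) > 1 and len(blocks[-1]) < n:
        short = blocks.pop()  # merge a short tail into the previous block
        blocks[-1] = blocks[-1] + short
    # Each block is factorised independently, so this step could run in parallel.
    stacked = [row for blk in blocks for row in householder_r(blk)]
    return householder_r(stacked)


def gram(M):
    """M^T M, used to check that R^T R equals A^T A."""
    n = len(M[0])
    return [[sum(r[i] * r[j] for r in M) for j in range(n)] for i in range(n)]


A = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
R = tsqr_r(A, block_rows=3)
```

Because the per-block factorisations are independent, the number of blocks processed at once is a natural parallelism knob.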

Concurrent Solution and Factorisation

1. $A = QR$

2. $Ax = b \Rightarrow QRx = b$

3. $Q^T b$: matrix-vector product

4. $x = R^{-1} Q^T b$: solution

• Numerically stable algorithms do not return $Q$ explicitly

• Additional computations for the reconstruction of $Q$ and for $Q^T b$

Concurrent Solution and Factorisation

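• Compute $Q^T b$ without forming $Q$ explicitly

[Figure: previous works run the QR algorithm on $A$ to obtain $R$ and $Q$, with $Q$ reconstructed from partial results, and then combine $Q$ with $b$; this work runs a modified QR algorithm that yields $R$ and $Q^T b$.]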
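One common way to obtain $Q^T b$ without ever forming $Q$ is to apply each Householder reflection to $b$ at the same time as it is applied to $A$. The sketch below does this by appending $b$ as an extra column. It illustrates the general idea only and should not be read as the exact modification used in the talk; the name `solve_concurrent` is hypothetical.

```python
def solve_concurrent(A, b):
    """Least-squares solve that transforms b alongside A during Householder QR."""
    m, n = len(A), len(A[0])
    # Augment A with b so that every reflection is applied to b as well.
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        norm = sum(M[i][k] ** 2 for i in range(k, m)) ** 0.5
        if norm == 0.0:
            raise ValueError("matrix is rank deficient")
        alpha = -norm if M[k][k] >= 0 else norm
        v = [M[i][k] for i in range(k, m)]
        v[0] -= alpha  # Householder vector that zeroes column k below the diagonal
        vv = sum(vi * vi for vi in v)
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n; column n holds b.
        for j in range(k, n + 1):
            s = 2.0 * sum(vi * M[k + i][j] for i, vi in enumerate(v)) / vv
            for i, vi in enumerate(v):
                M[k + i][j] -= s * vi
    # The top n x n block is now R and the top n entries of the last column are Q^T b.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


A = [[1, 0], [1, 1], [1, 2], [1, 3]]
x_exact = solve_concurrent(A, [1, 3, 5, 7])   # points on y = 1 + 2t
x_noisy = solve_concurrent(A, [1, 3, 5, 8])   # last point perturbed
```

No reflector is stored after it has been applied, so neither the reconstruction of $Q$ nor a separate matrix-vector product is needed.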

Compute Engines: GPU and CPU Solvers

• GPU Solver

– Based on the state-of-the-art work on tall-skinny QR factorisations [3]

– Extended its functionality to Linear Systems by means of the Concurrent Solution and Factorisation

• CPU Solver

– Optimised multithreaded linear algebra library (OpenBLAS)

[3] M. Anderson, C. Ballard, J. Demmel, and K. Keutzer, “Communication-Avoiding QR Decomposition for GPUs”, in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, May 2011, pp. 48–58.

Modelling Framework

• Compile-time Framework

– Performance (in GFLOPS) and resource estimation models for the FPGA solver

– Optimal hardware configuration of the FPGA solver

• Runtime Framework

– Workload allocation among the available solvers
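The compile-time step can be pictured as a search over the configurable parameters, keeping only design points that fit the device and ranking them with the performance model. The sketch below is an assumption about how such a search might be organised. `choose_configuration`, `toy_dsp_cost` and `toy_gflops` are hypothetical, and the two toy models are placeholders rather than the models from the talk. The 2016-DSP budget and the 6400 x 51 matrix size are taken from the evaluation slides.

```python
from itertools import product


def choose_configuration(m, n, dsp_budget, unit_sizes, block_counts,
                         dsp_cost, gflops_model):
    """Exhaustively search the configuration space for the best feasible design point.

    Returns (estimated GFLOPS, unit size, parallel blocks), or None if nothing fits.
    """
    best = None
    for unit_size, blocks in product(unit_sizes, block_counts):
        if dsp_cost(unit_size, blocks) > dsp_budget:
            continue  # this configuration does not fit on the device
        perf = gflops_model(m, n, unit_size, blocks)
        if best is None or perf > best[0]:
            best = (perf, unit_size, blocks)
    return best


def toy_dsp_cost(unit_size, blocks):
    # Placeholder resource model: cost grows with both parameters.
    return 10 * unit_size * blocks


def toy_gflops(m, n, unit_size, blocks):
    # Placeholder performance model: parallelism helps until the matrix runs out of work.
    return 0.4 * min(unit_size, n) * min(blocks, max(1, m // n))


UNIT_SIZES = [8, 16, 32, 64]
BLOCK_COUNTS = [1, 2, 4, 8]
best = choose_configuration(6400, 51, 2016, UNIT_SIZES, BLOCK_COUNTS,
                            toy_dsp_cost, toy_gflops)
```

An exhaustive search is practical here only because the assumed configuration space is tiny; a larger space would call for pruning or a smarter optimiser.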

Heterogeneity for Performance – Proposed Design Flow

[Figure: design-flow diagram. Labels, grouped by stage:]

• Supplied by Application Expert: Typical Matrix Size; Available Solvers Specifications

• Compile-time: Modelling Framework (Application-Specific); FPGA Resource Info; Hardware Configuration Settings; FPGA Performance Model; GPU and CPU Profiling

• Bank of Solvers: FPGA Solver Instance; GPU and CPU Solvers

• Runtime: New Workload; FPGA Performance Estimate; Workload Allocation; Solver Selection; Solution

Evaluation

Experimental Setup

• FPGA

– Xilinx Virtex-6 SX475 at 200 MHz with 2016 DSPs

– Double-precision floating-point

– Up to 275 columns (𝑛 = 275) – 5x the max no. of columns of existing FPGA work

– Any number of rows

– 84.23% BRAM utilization – post place-and-route

– 99.45% DSP utilization – post place-and-route

• GPU

– NVIDIA Tesla K20

– 2496 cores at 706 MHz

• CPU

– Intel i7-4770 at 3.40 GHz, 16 GB RAM, 8 MB cache

– 4 cores, 8 threads

Internal Comparisons

[Figure: internal comparison for no. of columns (𝑛) = 51; plot labels: CPU, Proposed Approach; annotated speed-ups: 4.67× and 2.74×]

Internal Comparisons

[Figure: internal comparison for no. of rows (𝑚) = 6400; plot labels: CPU, Proposed Approach; annotated speed-up: 2×]

External Comparisons

• Up to 25.84x speed-up against software

• Up to 32.67x speed-up against CULA

• Up to 18.07x speed-up against FPGA

[Figure: external comparison for no. of columns (𝑛) = 51; plot labels: CPU, CPU - GPU]

[4] A. Rafique et al., FPL 2012.

External Comparisons

• Up to 25.84x speed-up against software

• Up to 32.67x speed-up against CULA

• Up to 18.07x speed-up against FPGA

[Figure: external comparison for no. of rows (𝑚) = 6400; plot labels: CPU - GPU, CPU]

[4] A. Rafique et al., FPL 2012.

Conclusions

• Different solvers perform better on different matrix sizes

• Using heterogeneous solvers in a complementary way enables the high-performance solution of complex problems in fields such as genetic analysis

Thank You & Questions?