NUMA-aware Matrix-Matrix-Multiplication

Max Reimann, Philipp Otto

Date posted: 22-Jan-2020

Transcript
Page 1: NUMA-aware Matrix-Matrix-Multiplication

NUMA-aware Matrix-Matrix-Multiplication

Max Reimann, Philipp Otto

1

Page 2:

About this talk

• Objective: show how to improve the performance of algorithms on a NUMA system, using matrix-matrix multiplication (MMM) as an example

• Code was written in C with numa.h and pthread.h

• Tested on FSOC machines:

– ubuntu-0101: 2 nodes, 24 cores

– dl980: 8 nodes, 128 cores

• Compiled with gcc -O3

2

Page 3:

Naïve Matrix-Matrix-Multiplication

• We will examine MMM for large n x n matrices

• Runtime complexity: O(n³)

3

Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png

Page 4:

Naïve MMM implementation

4
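The implementation on this slide is a screenshot in the original deck. As a minimal sketch of what a naive triple-loop MMM in C looks like (function name and row-major layout are our assumptions, not the authors' code):

```c
#include <stddef.h>

/* Naive O(n^3) matrix-matrix multiplication, C = A * B,
 * for row-major n x n matrices. */
void naive_mmm(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```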

Page 5:

Performance of Naive vs. MKL

[Chart: execution time in seconds, dl980 on one core; for n = 512 / 1024 / 2048: Naive 0.38 / 11.79 / 98.14 s, MKL 0.02 / 0.13 / 1.02 s]

Page 6:

Intel Math Kernel Library (MKL)

• BLAS: Basic Linear Algebra Subprograms

– Standard for Linear Algebra

• MKL:

– Implements BLAS for Intel hardware

– Vectorized and threaded for highest performance

6

Page 7:

Analysis of Naïve MMM

• Test setup:

– Use the ubuntu-numa machine

– No thread or memory pinning

– Use numatop/pcm

• Performance tools show:

– Unused cores (obvious)

– QPI cannot be fully loaded with one thread

8

Page 8:

Parallelization I

• How can the work be divided?

– 1. Partition computation of matrixC by rows or columns

• Problem: All threads need matrixA and matrixB

• Solution:

– Accept the overhead of remote memory access, or

– Copy the input/output matrices to the other nodes (preprocessing)

9


Page 9:

Parallelization – Partition by rows

10
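The row partition shown here can be sketched with pthread.h as used in the talk; the task struct, function names, and the 64-thread cap are illustrative assumptions, not the authors' code:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
    size_t n, row_begin, row_end;   /* this worker's row range of C */
    const double *A, *B;
    double *C;
} mmm_task;

/* Each worker computes a contiguous block of rows of C = A * B. */
static void *row_worker(void *arg)
{
    mmm_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

/* Partition the n rows of C evenly over nthreads threads (<= 64 here). */
void parallel_mmm(size_t n, size_t nthreads,
                  const double *A, const double *B, double *C)
{
    pthread_t tid[64];
    mmm_task task[64];
    size_t chunk = (n + nthreads - 1) / nthreads;
    for (size_t t = 0; t < nthreads; t++) {
        size_t end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        task[t] = (mmm_task){ n, t * chunk, end, A, B, C };
        pthread_create(&tid[t], NULL, row_worker, &task[t]);
    }
    for (size_t t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

Note that all workers read the full A and B, which is exactly the remote-access problem discussed on the previous slide.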

Page 10:

Parallelization – Partition by rows

[Chart: execution time in seconds, dl980 on 128 cores; for n = 512 / 1024 / 2048: Naive Sequential 0.38 / 11.79 / 98.14 s, Naive Parallel 0.05 / 0.26 / 2.54 s, MKL Parallel 0.19 / 0.27 / 0.28 s]

Page 11:

Parallelization II

• How can the work be divided?

– 2. Partition computation of matrixC by summands

• Benefit:

– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed

– This allows copying only the needed parts to the other nodes

• Disadvantage:

– matrixB has to be transposed so that the memory can be partitioned (preprocessing)

– Locking or merging of matrixC is needed

12

Page 12:

Parallelization II

13
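The implementation slide is an image in the original deck. A sequential sketch of the summand partition described above (in the talk, each partition would run on its own node, followed by the merge step; names are our assumptions):

```c
#include <stddef.h>
#include <string.h>

/* Partition C = sum_k A[:,k] * B[k,:] over the summand index k.
 * Each partition p computes a full-size partial result from its
 * slice of k values; the partials are then merged by addition
 * (the locking/merging step mentioned on the slide). */
void parallel_sum_mmm(size_t n, size_t nparts,
                      const double *A, const double *B, double *C)
{
    memset(C, 0, n * n * sizeof *C);
    size_t chunk = (n + nparts - 1) / nparts;
    for (size_t p = 0; p < nparts; p++) {       /* one node per partition */
        size_t k0 = p * chunk;
        size_t k1 = k0 + chunk < n ? k0 + chunk : n;
        for (size_t i = 0; i < n; i++)
            for (size_t k = k0; k < k1; k++)    /* only this slice of A */
                for (size_t j = 0; j < n; j++)  /* and of B is needed   */
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }
}
```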

Page 13:

Performance of the "Parallel Sum" Method

[Chart: execution time in seconds, dl980 on 128 cores; for n = 512 / 1024 / 2048 / 4096 / 8192: Parallel Sum 1.59 / 2.81 / 3.34 / 14.91 / 218.84 s, Naive Parallel 0.27 / 1.41 / 2.94 / 17.24 / 186.39 s, MKL Parallel 0.19 / 0.27 / 0.28 / 0.43 / 2.41 s]

Page 14:

Strassen

• Runtime complexity:

– Naive algorithm: O(n³)

• Can we do better?

– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity

– Runtime: O(n^(log2 7)) ≈ O(n^2.8)

• Algorithms today reach about O(n^2.37), but are not practical

– Strassen uses only 7 multiplications instead of 8 per recursion step

15

Page 15:

Matrix definition

16

For matrices A, B, C with dimension n = 2^k, k ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

A = [ A1,1  A1,2 ]    B = [ B1,1  B1,2 ]    C = [ C1,1  C1,2 ]
    [ A2,1  A2,2 ]        [ B2,1  B2,2 ]        [ C2,1  C2,2 ]

The conventional algorithm uses 8 (expensive) block multiplications:

C1,1 = A1,1 · B1,1 + A1,2 · B2,1    C1,2 = A1,1 · B1,2 + A1,2 · B2,2
C2,1 = A2,1 · B1,1 + A2,2 · B2,1    C2,2 = A2,1 · B1,2 + A2,2 · B2,2

Page 16:

Strassen’s algorithm

17

Define temporary matrices:

M1 := (A1,1 + A2,2) · (B1,1 + B2,2)
M2 := (A2,1 + A2,2) · B1,1
M3 := A1,1 · (B1,2 − B2,2)
M4 := A2,2 · (B2,1 − B1,1)
M5 := (A1,1 + A1,2) · B2,2
M6 := (A2,1 − A1,1) · (B1,1 + B1,2)
M7 := (A1,2 − A2,2) · (B2,1 + B2,2)

Only 7 multiplications!

Compose the final matrix:

C1,1 = M1 + M4 − M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 − M2 + M3 + M6

Page 17:

Strassen - Example

18

[ A1,1  A1,2 ]   [ B1,1  B1,2 ]   [ C1,1  C1,2 ]
[ A2,1  A2,2 ] · [ B2,1  B2,2 ] = [ C2,1  C2,2 ]

Substituting the Mi by their terms gives back the original formula:

C1,2 = M3 + M5
     = A1,1 · (B1,2 − B2,2) + (A1,1 + A1,2) · B2,2
     = A1,1B1,2 − A1,1B2,2 + A1,1B2,2 + A1,2B2,2
     = A1,1B1,2 + A1,2B2,2

Page 18:

Strassen - Analysis

• Cost: 7 multiplications and 18 additions per recursion step

– versus 8 multiplications and 4 additions for the naïve block scheme

• Commonly considered practical only for large matrices (n > 1000)

– Although our results indicate otherwise (later)

• Define cutoff point for recursion

– If n is sufficiently small, do naïve multiplication

19

Page 19:

Strassen - Implementation

20
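The implementation slide is an image in the original. A compact recursive sketch of Strassen with the BREAK cutoff described above (blocks are copied into contiguous buffers for simplicity; a real implementation would use strides and, as in the talk, threads):

```c
#include <stdlib.h>
#include <string.h>

#define BREAK 64   /* recursion cutoff: below this, do naive MMM */

static void add(size_t n, const double *X, const double *Y, double *R)
{ for (size_t i = 0; i < n * n; i++) R[i] = X[i] + Y[i]; }

static void sub(size_t n, const double *X, const double *Y, double *R)
{ for (size_t i = 0; i < n * n; i++) R[i] = X[i] - Y[i]; }

static void naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

/* Strassen for n a power of two: split into 2x2 blocks, form the
 * seven products M1..M7, compose C from them. */
void strassen(size_t n, const double *A, const double *B, double *C)
{
    if (n <= BREAK) { naive(n, A, B, C); return; }

    size_t h = n / 2, b = h * h, row = h * sizeof(double);
    double *buf = malloc(21 * b * sizeof(double));  /* 21 h x h scratch blocks */
    if (!buf) return;
    double *A11 = buf,          *A12 = buf + b,     *A21 = buf + 2*b,
           *A22 = buf + 3*b,    *B11 = buf + 4*b,   *B12 = buf + 5*b,
           *B21 = buf + 6*b,    *B22 = buf + 7*b,   *M1  = buf + 8*b,
           *M2  = buf + 9*b,    *M3  = buf + 10*b,  *M4  = buf + 11*b,
           *M5  = buf + 12*b,   *M6  = buf + 13*b,  *M7  = buf + 14*b,
           *T1  = buf + 15*b,   *T2  = buf + 16*b,  *C11 = buf + 17*b,
           *C12 = buf + 18*b,   *C21 = buf + 19*b,  *C22 = buf + 20*b;

    for (size_t i = 0; i < h; i++) {                /* copy out the blocks */
        memcpy(A11 + i*h, A + i*n,           row);
        memcpy(A12 + i*h, A + i*n + h,       row);
        memcpy(A21 + i*h, A + (i + h)*n,     row);
        memcpy(A22 + i*h, A + (i + h)*n + h, row);
        memcpy(B11 + i*h, B + i*n,           row);
        memcpy(B12 + i*h, B + i*n + h,       row);
        memcpy(B21 + i*h, B + (i + h)*n,     row);
        memcpy(B22 + i*h, B + (i + h)*n + h, row);
    }

    add(h, A11, A22, T1); add(h, B11, B22, T2); strassen(h, T1, T2, M1);
    add(h, A21, A22, T1);                       strassen(h, T1, B11, M2);
    sub(h, B12, B22, T2);                       strassen(h, A11, T2, M3);
    sub(h, B21, B11, T2);                       strassen(h, A22, T2, M4);
    add(h, A11, A12, T1);                       strassen(h, T1, B22, M5);
    sub(h, A21, A11, T1); add(h, B11, B12, T2); strassen(h, T1, T2, M6);
    sub(h, A12, A22, T1); add(h, B21, B22, T2); strassen(h, T1, T2, M7);

    for (size_t i = 0; i < b; i++) {                /* compose C blocks */
        C11[i] = M1[i] + M4[i] - M5[i] + M7[i];
        C12[i] = M3[i] + M5[i];
        C21[i] = M2[i] + M4[i];
        C22[i] = M1[i] - M2[i] + M3[i] + M6[i];
    }
    for (size_t i = 0; i < h; i++) {                /* copy back into C */
        memcpy(C + i*n,             C11 + i*h, row);
        memcpy(C + i*n + h,         C12 + i*h, row);
        memcpy(C + (i + h)*n,       C21 + i*h, row);
        memcpy(C + (i + h)*n + h,   C22 + i*h, row);
    }
    free(buf);
}
```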

Page 20:

Execution Time: Single-threaded

[Chart: execution time in seconds vs. n = 32 … 2048 on a log scale, dl980 on 1 core, Strassen cutoff BREAK = 64; at n = 512 / 1024 / 2048: Naive 0.38 / 11.79 / 98.14 s, Strassen 0.12 / 0.87 / 6.12 s, MKL 0.02 / 0.13 / 1.02 s]

Page 21:

Parallelization of Strassen I

• Data dependencies:

– Have to do the additions inside Mi before the multiplication

• e.g. M1 = (A1,1 + A2,2) · (B1,1 + B2,2)

– Have to calculate 𝑀𝑖 before calculating C

• 𝐶1,2 = 𝑀3 + 𝑀5

• Easiest solution:

– Calculate the Mi in parallel

– Then calculate 𝐶𝑖,𝑗 in parallel

22

Page 22:

Parallelization of Strassen II

• Recursion level 1 can be scheduled to 7 threads

• Level n can be scheduled to 7^n threads

– Most systems have a power-of-two number of processors

• We used manual parallelization

– 49 distinct functions for the Ms and 16 for the Cs

– Code bloat, and not scalable, BUT:

• Automatic parallelization is hard

– Thread load becomes very unbalanced

– Every level needs 7 temporary matrices

• Exponentially rising memory requirements

23

Page 23:

Execution Time – 49 Threads

[Chart: execution time in seconds, dl980 on 49 cores; for n = 512 / 1024 / 2048 / 4096 / 8192: Naive 0.05 / 0.26 / 2.54 / 27.61 / 228.57 s, Strassen 0.05 / 0.14 / 0.49 / 2.06 / 13.53 s, MKL 0.19 / 0.27 / 0.28 / 0.44 / 1.84 s]

Page 24:

NUMA-Optimizations

• Try to have as much memory local as possible to avoid remote memory access

– Because remote access is slower by a factor of ~1.4

• Partition data and work depending on #nodes and #cores

• Pin threads to nodes with the memory they need

• (Topology for other algorithms)

25

Page 25:

Distributing memory and threads

[Chart: execution time in seconds, parallel naive on ubuntu-numa0101 on 24 cores; for n = 1024 / 2048 / 4096: Distributed Memory and Threads 0.34 / 11.39 / 101.12 s, Neither Distributed 0.35 / 18.34 / 182.45 s, Distributed Threads 0.35 / 21.96 / 204.85 s, Distributed Memory 0.37 / 14.33 / 143.44 s]

Page 26:

DEMO

27

Page 27:

Application of NUMA-Optimizations

• Copy all data to every node:

– Duration of preprocessing: 11.11 s for an 8192x8192 matrix to 8 nodes

• Partition data and move it to the corresponding nodes:

– Duration of preprocessing: 1.03 s for an 8192x8192 matrix to 8 nodes

• Pin threads to nodes:

– int numa_run_on_node(int node);
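The talk pins threads with libnuma's numa_run_on_node(node). As a dependency-free sketch of the same idea using only the glibc affinity API (pinning to a single CPU here; a node pin would put all CPUs of that node into the set):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU. With libnuma, the equivalent
 * node-level pin is numa_run_on_node(node), which sets the affinity
 * mask to all CPUs of that node. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```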

28

Page 28:

Parallelization – Partition by rows: copying memory to different nodes

29

Page 29:

Strassen Memory Distribution Effects

[Chart: execution time in seconds for n = 16384, dl980 on 128 cores, broken down into memory copy / multiplication / result combination; totals 22.15 / 19.61 / 21.32 / 14.67 s for the configurations 6 / 7 / 8 / distributed]

Page 30:

Other optimization techniques

• Tiling

• Vectorization

• Scalar replacement

• Precomputation of constants

• (unrolling)

31

Page 31:

Tiling

• Divide computational work into tiles to leverage cache

• Tile size depends on the cache size

• gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)

33
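A sketch of what such a tiled loop nest looks like, with the tile size derived from the CLS macro mentioned above (the loop order and the doubles-per-cache-line tile choice are our assumptions, not necessarily the slide's exact code):

```c
#include <stddef.h>
#include <string.h>

#ifndef CLS                 /* cache line size in bytes, e.g. from          */
#define CLS 64              /* gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)  */
#endif
#define TILE (CLS / sizeof(double))   /* doubles per cache line */

/* Tiled MMM: iterate over TILE x TILE blocks so the working set of
 * the inner loops stays cache-resident. */
void tiled_mmm(size_t n, const double *A, const double *B, double *C)
{
    memset(C, 0, n * n * sizeof *C);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];   /* scalar replacement */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```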

Page 32:

Performance of Tiling

perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

[Chart: execution time in seconds, dl980 on 128 cores: Not Tiled, not Transposed 97 s; Not Tiled, Transposed 39 s; Tiled, not Transposed 13 s; Tiled, Transposed 12 s; cache-miss counts shown on a log scale up to ~8.6 × 10⁹]

Page 33:

Vectorization

• SIMD: Single Instruction, Multiple Data

• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)

• An instruction is simultaneously applied to multiple floats

• Can only operate efficiently on aligned data (16-byte aligned)

• SSE operates on 128-bit registers

– Newer Intel processors have Advanced Vector Extensions (AVX) with 256-bit registers

– The dl980 machine only supports 128-bit operations

35

Page 34:

Auto-Vectorization

• Can this be done automatically?

– gcc -O3 tries to auto-vectorize

• Only possible for simple statements
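A loop of the kind gcc -O3 can auto-vectorize: unit stride, no aliasing (restrict), no data-dependent control flow. The function is an illustrative example, not from the talk:

```c
#include <stddef.h>

/* Simple enough for the compiler's auto-vectorizer: the restrict
 * qualifiers tell gcc that x and y do not overlap. Compile with
 * gcc -O3 -fopt-info-vec to see the vectorization report. */
void saxpy(size_t n, float a,
           const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```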

36

Page 35:

Assembler

37

Page 36:

Aligned Malloc

38

Example:

• numa_alloc returns addr 0x1232, which is not 16-byte aligned

• We add 15, so addr = 0x1241 (0b1001001000001)

• Now we clear the last 4 bits by ANDing with ~0x0F

• Result: 0x1240 is 16-byte aligned
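The same trick as a function, applied here to plain malloc instead of numa_alloc so the sketch stays self-contained (the original pointer must be kept separately for free()):

```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate by 15 bytes, round the address up, and clear the low
 * 4 bits to get a 16-byte aligned pointer, exactly as in the slide's
 * example. *orig receives the pointer to pass to free(). */
void *aligned_16(size_t size, void **orig)
{
    *orig = malloc(size + 15);
    if (*orig == NULL)
        return NULL;
    return (void *)(((uintptr_t)*orig + 15) & ~(uintptr_t)0x0F);
}
```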

Page 37:

Intrinsics

39

Example

Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Page 38:

Use Parallelism for MMM

• We try to construct a 4x4 matrix multiplication

• How to process rows?

40

[Diagram: elements in continuous memory vs. an access pattern that can't be loaded in one instruction]

Page 39:

Use Parallelism for MMM

• We try to construct a 4x4 matrix multiplication

• How to process rows?

• Idea: process all elements of a row of B in parallel

41

[Diagram: A1,1 is multiplied with the row (B1,1 B1,2 B1,3 B1,4); likewise for A1,2, A1,3, A1,4, and the results are added up]

Page 40:

4x4 Kernel

42
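The kernel on this slide is an image in the original deck. A sketch following the broadcast idea from the previous slide (intrinsic choice and loop structure are our assumptions about the authors' kernel):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* 4x4 SSE kernel: for each output row i, broadcast A[i][k] and
 * multiply it with the whole k-th row of B (4 floats in one 128-bit
 * register), then add up the partial rows. A, B, and C must be
 * 16-byte aligned for the aligned load/store intrinsics. */
void kernel_4x4(const float A[16], const float B[16], float C[16])
{
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]);   /* broadcast A[i][k] */
            __m128 b = _mm_load_ps(&B[k * 4]);      /* row k of B        */
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_store_ps(&C[i * 4], acc);               /* row i of C        */
    }
}
```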

Page 41:

SSE – Single Threaded

[Table: execution time in seconds, dl980 on 1 core]

         n = 1024   2048   4096
naiveSSE     0.27      2     20
tiledSSE     0.48      5     41
tiled           2     24    213
naive          11     97    879

Page 42:

Cache Misses of SSE Variants

[Chart: L1 cache misses and dTLB misses for the naiveSSE and tiledSSE variants; counts up to ~6 × 10⁹]

Page 43:

Performance for Small Matrices

[Chart: execution time in seconds (0–0.25 s) for n = 64 / 128 / 256 / 512, dl980 on 128 cores; series: naiveSSE, tiled, strassen, MKL]

Page 44:

Performance for Large Matrices

[Chart: execution time in seconds for n = 1024 / 2048 / 4096 / 8192, dl980 on 128 cores; series: naiveSSE, tiled, strassenSSE, MKL; values range from 0.17 s to 28.3 s]

Page 45:

Summary

• Analyze algorithm for bottlenecks

– IO optimization

– Hardware specific optimization

• Cache size

• NUMA architecture

• Specific instructions (SSE)

• Try to minimize remote memory access

• Visualisations can facilitate understanding

47

