Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland
Page 1: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

Performance Optimizationsfor NUMA-Multicore Systems

Zoltán Majó

Department of Computer ScienceETH Zurich, Switzerland

Page 2: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

2

About me

ETH Zurich: research assistant
Research: performance optimizations
Assistant: lectures

TUCN
Student Communications Center: network engineer
Department of Computer Science: assistant

Page 3: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

3

Computing

Unlimited need for performance

Page 4: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

4

Performance optimizations

One goal: make programs run fast

Idea: pick a good algorithm
Reduce the number of operations executed
Example: sorting
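To make the comparison on the following slides concrete, here is a small C sketch (my own illustration, not from the slides): a hand-written bubble sort performing ~n^2 comparisons next to the C library's qsort(), which typically needs ~n*log(n).

#include <stdio.h>
#include <stdlib.h>

/* ~n^2 comparisons: each of the n-1 passes scans the unsorted tail */
static void bubble_sort(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (a[j] > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
}

/* comparison callback for qsort(), which runs in ~n*log(n) on average */
static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

int main(void) {
    int a[] = {5, 1, 4, 2, 3};
    int b[] = {5, 1, 4, 2, 3};
    bubble_sort(a, 5);
    qsort(b, 5, sizeof b[0], cmp_int);
    for (int i = 0; i < 5; i++)
        printf("%d %d\n", a[i], b[i]);
    return 0;
}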

Page 5: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

5

Sorting

[Chart: number of operations / execution time [T] vs. input size (n), comparing n^2 and n*log(n)]

Page 6: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

6

Sorting

[Chart: same axes; the n^2 curve is labeled bubble sort]

Page 7: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

7

Sorting

[Chart: same axes; n^2 labeled bubble sort, n*log(n) labeled quicksort]

Page 8: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

8

Sorting

[Chart: same axes; at the largest input size, quicksort is 11X faster than bubble sort]

Page 9: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

9

Sorting

We picked a good algorithm; work done

Are we really done?

Make sure our algorithm runs fast
Operations take time
We assumed 1 operation = 1 time unit (T)

Page 10: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

10

Quicksort performance

[Chart: quicksort execution time [T] vs. input size (n), assuming 1 op = 1 T]

Page 11: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

11

Quicksort performance

[Chart: quicksort execution time for 1 op = 1 T and 1 op = 2 T]

Page 12: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

12

Quicksort performance

[Chart: quicksort execution time for 1 op = 1 T, 2 T, and 4 T]

Page 13: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

13

Quicksort performance

[Chart: quicksort execution time for 1 op = 1 T, 2 T, 4 T, and 8 T]

Page 14: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

14

Quicksort performance

[Chart: quicksort at 1 op = 8 T compared with bubble sort at 1 op = 1 T; bubble sort is 32% faster]

Page 15: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

15

Latency of operations

Best algorithm not enough

Operations are executed on hardware

[Diagram: CPU pipeline with Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

Page 16: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

16

Latency of operations

Best algorithm not enough

Operations are executed on hardware

Hardware must be used efficiently

[Diagram: CPU pipeline with Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

Page 17: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

17

Outline

Introduction: performance optimizations

Cache-aware programming

Scheduling on multicore processors

Using run-time feedback

Data locality optimizations on NUMA-multicores

Conclusions

ETH scholarship

Page 18: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

18

Memory accesses
[Diagram: CPU accessing RAM; 230 cycles access latency]

Page 19: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

19

Memory accesses
[Diagram: CPU performing 16 accesses to RAM; 230 cycles access latency each]
Total access latency = ?
Total access latency = 16 × 230 cycles = 3680 cycles

Page 20: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

20

Caching
[Diagram: CPU and RAM; 230 cycles access latency, no cache yet]

Page 21: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

21

Caching
[Diagram: CPU ← Cache (30 cycles access latency) ← RAM (200 cycles access latency); data moves between RAM and cache in blocks]

Page 22: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

22

Caching
[Diagram: animation step; a requested block is copied from RAM into the cache]

Page 23: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

23

Caching
[Diagram: animation step; a requested block is copied from RAM into the cache]

Page 24: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

24

Caching
[Diagram: animation step; a requested block is copied from RAM into the cache]

Page 25: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

25

Caching
[Diagram: animation step; a requested block is copied from RAM into the cache]

Page 26: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

26

Hits and misses
[Diagram: CPU ← Cache (30 cycles access latency) ← RAM (200 cycles access latency)]
Cache hit: data in cache = 30 cycles
Cache miss: data not in cache = 230 cycles

Page 27: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

27

Total access latency
[Diagram: CPU ← Cache (30 cycles) ← RAM (200 cycles); 16 accesses]
Total access latency = ?
Total access latency = 4 misses + 12 hits = 4 × 230 cycles + 12 × 30 cycles = 1280 cycles

Page 28: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

28

Benefits of caching

Comparison
Architecture w/o cache: T = 230 cycles
Architecture w/ cache: Tavg = 1280 cycles / 16 accesses = 80 cycles → ~2.9X improvement

Do caches always help?
Can you think of an access pattern with bad cache usage?

Page 29: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

29

Caching
[Diagram: CPU ← Cache (35 cycles access latency) ← RAM (200 cycles access latency); an access pattern with bad cache usage]

Page 30: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

30

Cache-aware programming

Today’s example: matrix-matrix multiplication (MMM)

Number of operations: n^3

Compare naïve and optimized implementations
Same number of operations

Page 31: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

31

MMM: naïve implementation

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

[Diagram: C = A × B; C[i][j] is computed from row i of A and column j of B]

Page 32: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

32

MMM
[Diagram: CPU ← Cache (30 cycles access latency) ← RAM (200 cycles access latency); matrices C, A, B in RAM]
Cache hits / total accesses: A[][]: ?/4, B[][]: ?/4

Page 33: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

33

MMM
[Diagram: animation step; a block of A is in the cache]
Cache hits / total accesses: A[][]: 3/4, B[][]: ?/4

Page 34: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

34

MMM
[Diagram: animation step; a block of A is in the cache]
Cache hits / total accesses: A[][]: 3/4, B[][]: ?/4

Page 35: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

35

MMM
[Diagram: animation step; a block of A is in the cache]
Cache hits / total accesses: A[][]: 3/4, B[][]: ?/4

Page 36: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

36

MMM
[Diagram: animation step; blocks of both A and B have moved through the cache]
Cache hits / total accesses: A[][]: 3/4, B[][]: 0/4

Page 37: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

37

MMM: Cache performance

Hit rate
Accesses to A[][]: 3/4 = 75%
Accesses to B[][]: 0/4 = 0%
All accesses: 38%

Can we do better?

Page 38: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

38

Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

Cache-friendly MMM (ikj):

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        r = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] += r * B[k][j];
    }

[Diagram: C = A × B; ikj walks row i of C and row k of B sequentially]

Page 39: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

39

MMM (ikj)
[Diagram: CPU ← Cache (30 cycles access latency) ← RAM (200 cycles access latency); blocks of C, A, and B move through the cache]
Cache hits / total accesses: C[][]: 3/4, B[][]: 3/4

Page 40: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

40

Cache-friendly MMM

Cache-unfriendly MMM (ijk)

A[][]: 3/4 = 75% hit rate

B[][]: 0/4 = 0% hit rate

All accesses: 38% hit rate

Cache-friendly MMM (ikj)

C[][]: 3/4 = 75% hit rate

B[][]: 3/4 = 75% hit rate

All accesses: 75% hit rate

Better performance due to cache-friendliness?

Page 41: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

41

Performance of MMM
[Chart: execution time [s] on a log scale (0.01-10000) vs. matrix size (512-8192) for ijk (cache-unfriendly) and ikj (cache-friendly)]

Page 42: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

42

Performance of MMM
[Chart: same data; for large matrices, ikj is 20X faster than ijk]

Page 43: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

43

Cache-aware programming

Two versions of MMM: ijk and ikj
Same number of operations (~n^3)
ikj 20X better than ijk

Good performance depends on two aspects
Good algorithm
Implementation that takes hardware into account

Hardware
Many possibilities for inefficiencies
We consider only the memory system in this lecture

Page 44: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

44

Outline

Introduction: performance optimizations

Cache-aware programming

Scheduling on multicore processors

Using run-time feedback

Data locality optimizations on NUMA-multicores

Conclusions

ETH scholarship

Page 45: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

45

Cache-based architecture
[Diagram: CPU with L1 cache (10 cycles access latency) and L2 cache (20 cycles access latency); bus controller and memory controller connect to RAM (200 cycles access latency)]

Page 46: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

46

Multi-core multiprocessor
[Diagram: two processor packages, each with four cores; every core has a private L1 cache, each pair of cores shares an L2 cache; bus and memory controllers connect both packages to RAM]

Page 47: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

47

Experiment

Performance of a well-optimized program
soplex from SPEC CPU 2006

Multicore-multiprocessor systems are parallel
Multiple programs run on the system simultaneously
Contender program: milc from SPEC CPU 2006

Examine 4 execution scenarios

Page 48: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

48

Execution scenarios
[Diagram: two quad-core processors (Processor 0, Processor 1) connected to RAM; one placement of soplex and milc on the cores]

Page 49: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

49

Execution scenarios
[Diagram: another placement of soplex and milc on the cores]

Page 50: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

50

Performance with sharing: soplex

[Chart: soplex execution time relative to solo execution (0.0-2.0) for each execution scenario]

Page 51: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

51

Performance with sharing: soplex

[Chart: soplex execution time relative to solo execution (0.0-2.0) for each execution scenario]

Page 52: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

52

Performance with sharing: soplex

[Chart: soplex execution time relative to solo execution (0.0-2.0) for each execution scenario]

Page 53: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

53

Resource sharing

Significant slowdowns due to resource sharing

Why is resource sharing so bad?
Example: cache sharing

Page 54: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

54

Cache sharing
[Diagram: soplex and milc run on two cores that share a cache and RAM]

Page 55: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

55

Cache sharing
[Diagram: animation step; the two programs' data compete for the shared cache]

Page 56: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

56

Resource sharing

Does resource sharing affect all programs?
So far: we considered the performance of soplex under contention
Let us consider a different program: namd

Page 57: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

57

Performance with sharing

[Chart: execution time relative to solo execution (0.0-2.0) for soplex and namd]

Page 58: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

58

Performance with sharing

[Chart: execution time relative to solo execution (0.0-2.0) for soplex and namd]

Page 59: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

59

Resource sharing

Significant slowdown for some programs
soplex: affected significantly
namd: affected less

What do we do about it?

Scheduling can help
Example workload: four instances of soplex and four instances of namd

Page 60: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

60

Execution scenarios
[Diagram: all four soplex instances on one processor, all four namd instances on the other]

Page 61: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

61

Execution scenarios
[Diagram: soplex and namd instances mixed on both processors]

Page 62: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

62

Challenges for a scheduler

Programs have different behaviors (soplex vs. namd)

Behavior not known ahead of time

Behavior changes over time

Page 63: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

63

Single-phased program

Page 64: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

64

Program with multiple phases

Page 65: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

65

Outline

Introduction: performance optimizations

Cache-aware programming

Scheduling on multicore processors

Using run-time feedback

Data locality optimizations on NUMA-multicores

Conclusions

ETH scholarship

Page 66: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

66

Hardware performance counters

Special registers
Programmable to monitor a given hardware event (e.g., cache misses)
Low-level information about hardware-software interaction
Low overhead due to hardware implementation

In the past: undocumented feature
Since Intel Pentium: publicly available description
Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst

Page 67: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

67

Programming performance counters

Model-specific registers
Access: RDMSR, WRMSR, and RDPMC instructions
Ring 0 instructions (available only in kernel mode)

perf_events interface
Standard Linux interface since Linux 2.6.31
UNIX philosophy: performance counters are files

Simple API:
Set up counters: perf_event_open()
Read counters as files

Page 68: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

68

Example: monitoring cache misses

int main() {
    int pid = fork();
    if (pid == 0) {
        exit(execl("./my_program", "./my_program", (char *)NULL));
    } else {
        int status; uint64_t value;
        int fd = perf_event_open(...);
        waitpid(pid, &status, 0);
        read(fd, &value, sizeof(uint64_t));
        printf("Cache misses: %" PRIu64 "\n", value);
    }
}
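The slide elides the perf_event_attr setup. Below is a minimal, Linux-only sketch of how the missing pieces could look: it counts the child's cache misses via the generic PERF_COUNT_HW_CACHE_MISSES event. perf_event_open() has no libc wrapper, so it is invoked through syscall(2); "./my_program" is a placeholder, and error handling is omitted for brevity.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* perf_event_open() has no glibc wrapper; call it via syscall(2) */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        execl("./my_program", "./my_program", (char *)NULL);
        _exit(1);                      /* reached only if execl fails */
    }

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;    /* generic hardware event...   */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* ...cache misses      */
    attr.exclude_kernel = 1;           /* count user-mode misses only */
    attr.exclude_hv = 1;

    /* monitor the child on any CPU; counting starts at open, which
       is close enough to exec for a sketch */
    int fd = perf_event_open(&attr, pid, -1, -1, 0);
    int status;
    waitpid(pid, &status, 0);

    uint64_t value;
    read(fd, &value, sizeof(value));
    printf("Cache misses: %" PRIu64 "\n", value);
    close(fd);
    return 0;
}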

Page 69: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

69

perf_event_open()

Looks simple:

int sys_perf_event_open(
    struct perf_event_attr *hw_event_uptr,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
);

...but struct perf_event_attr has many fields:

struct perf_event_attr {
    __u32 type;
    __u32 size;
    __u64 config;
    union {
        __u64 sample_period;
        __u64 sample_freq;
    };
    __u64 sample_type;
    __u64 read_format;
    __u64 inherit;
    __u64 pinned;
    __u64 exclusive;
    __u64 exclude_user;
    __u64 exclude_kernel;
    __u64 exclude_hv;
    __u64 exclude_idle;
    __u64 mmap;
    /* ... many more fields */

Page 70: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

70

libpfm

Open-source helper library

[Diagram: user program ↔ libpfm ↔ perf_events]
(1) user program passes an event name to libpfm
(2) libpfm sets up perf_event_attr
(3) user program calls perf_event_open()
(4) user program reads the results
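A sketch of steps (1)-(3), assuming libpfm4's pfm_get_os_event_encoding() API; the syscall-based perf_event_open() wrapper is the same as in the earlier sketch. Link with -lpfm.

#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <perfmon/pfmlib.h>
#include <perfmon/pfmlib_perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* (1) take an event name, (2) let libpfm fill in perf_event_attr,
   (3) open the counter; returns the file descriptor or -1 */
int open_counter(const char *event_name, pid_t pid) {
    struct perf_event_attr attr;
    pfm_perf_encode_arg_t arg;

    if (pfm_initialize() != PFM_SUCCESS)  /* load libpfm's event tables */
        return -1;
    memset(&attr, 0, sizeof(attr));
    memset(&arg, 0, sizeof(arg));
    arg.attr = &attr;
    arg.size = sizeof(arg);
    if (pfm_get_os_event_encoding(event_name, PFM_PLM3 /* user level */,
                                  PFM_OS_PERF_EVENT, &arg) != PFM_SUCCESS)
        return -1;
    return perf_event_open(&attr, pid, -1, -1, 0);
}

With this, the Nehalem event named on the next slides could be requested directly by name, e.g. open_counter("OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM", pid).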

Page 71: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

71

Example: measure cache misses for MMM

Determine microarchitecture
Intel Xeon E5520: Nehalem microarchitecture

Look up event needed
Source: Intel Architectures Software Developer's Manual

Page 72: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

72

Software Developer’s Manual

Page 73: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

73

Example: measure cache misses for MMM

Determine microarchitecture
Intel Xeon E5520: Nehalem microarchitecture

Look up event needed
Source: Intel Architectures Software Developer's Manual
Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM

Page 74: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

74

MMM cache misses
[Chart: number of cache misses (log scale) vs. matrix size (512-8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ijk incurs ~30X more cache misses]

Page 75: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

75

Single-phased program

[Chart: single-phased program; performance counters are set up at the start of the run and read at the end]

Page 76: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

76

Program with multiple phases

[Chart: program with multiple phases; performance counters are set up at the start and samples are taken periodically]
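A rough sketch of taking per-interval samples; note this polls the counter at a fixed interval rather than using the kernel's interrupt-driven sampling mode, and fd is assumed to be a counter opened as in the earlier sketches.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* read the counter once per second; resetting after each read turns
   the cumulative count into a per-interval value, making phases visible */
void watch_phases(int fd, int intervals) {
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int t = 0; t < intervals; t++) {
        uint64_t count;
        sleep(1);
        read(fd, &count, sizeof(count));
        printf("interval %d: %" PRIu64 " events\n", t, count);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    }
}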

Page 77: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

77

Membus: multicore scheduler

1. Dynamically determine program behavior
Measure # of loads/stores that cause memory traffic
Hardware performance counters in sampling mode

2. Determine optimal placement based on measurements

Page 78: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

78

Evaluation

Workload with 8 processes
lbm, soplex, gromacs, hmmer from SPEC CPU 2006
Two instances of each program

Experimental results

Page 79: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

79

Evaluation

[Chart: execution time relative to solo execution (0.0-3.0) for lbm, soplex, gromacs, hmmer, and the average; Default Linux vs. Membus]

Page 80: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

80

Evaluation

[Chart: execution time relative to solo execution (0.0-3.0) for lbm, soplex, gromacs, hmmer, and the average; Default Linux vs. Membus]

Page 81: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

81

Evaluation

[Chart: execution time relative to solo execution (0.0-3.0) for lbm, soplex, gromacs, hmmer, and the average; Default Linux vs. Membus; annotated improvements: 16% and 8%]

Page 82: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

82

Summary: multicore processors

Resource sharing critical for performance

Membus: a scheduler that reduces resource sharing

Question: why wasn’t Membus able to improve more?

Page 83: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

83

Memory controller sharing
[Diagram: two quad-core processors; all eight programs' memory accesses go through a shared memory controller to RAM]

Page 84: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

84

Non-uniform memory architecture
[Diagram: each processor gets its own memory controller and local RAM]

Page 85: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

85

Non-uniform memory architecture
[Diagram: each processor has an integrated memory controller and local RAM; the two processors are connected by an interconnect]

Page 86: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

86

Outline

Introduction: performance optimizations

Cache-aware programming

Scheduling on multicore processors

Using run-time feedback

Data locality optimizations on NUMA-multicores

Conclusions

ETH scholarship

Page 87: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

87

Non-uniform memory architecture

[Diagram: two processors, each with four cores, a memory controller (MC), an interconnect (IC), and local DRAM]

Page 88: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

88

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
[Diagram: thread T on Processor 0 accessing data in its local DRAM]

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])

Page 89: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

89

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles
[Diagram: thread T on Processor 0 accessing data in Processor 1's DRAM]

Key to good performance: data locality

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])

Page 90: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

90

Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] (0-60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C]

Page 91: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

91

Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] (0-60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C]

Page 92: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

92

First-touch page placement policy

[Diagram: threads T0 and T1 on different processors; T0 reads/writes page P0 first, so P0 is placed in Processor 0's DRAM]

Page 93: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

93

First-touch page placement policy

[Diagram: T1 touches page P1 first, so P1 is placed in Processor 1's DRAM]

Page 94: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

94

Automatic page placement

First-touch page placement
Often a high number of remote accesses

Data address profiling
Profile-based page placement
Supported by hardware performance counters on many architectures

Page 95: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

95

Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]
[Diagram: the profile shows P0 accessed 1000 times by T0 and P1 accessed 3000 times by T1; each page is placed in the DRAM of the processor whose thread accesses it most]
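Acting on such a profile means migrating pages. One way this could be done on Linux is with move_pages(2) from libnuma; a hedged sketch (the function name, page address, and target node are illustrative, and error handling is omitted; link with -lnuma):

#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */

/* migrate one page to the NUMA node whose thread accesses it most,
   e.g. P1 to node 1 because the profile attributes it to T1 */
static int place_on_node(void *page_addr, int node) {
    void *pages[1] = { page_addr };
    int nodes[1] = { node };
    int status[1];
    /* pid 0 = calling process; MPOL_MF_MOVE moves only pages that are
       not shared with other processes */
    return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}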

Page 96: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

96

Automatic page placement

Compare: first-touch and profile-based page placement
Machine: 2-processor 8-core Intel Xeon E5520
Subset of NAS PB: programs with a high fraction of remote accesses
8 threads with fixed thread-to-core mapping

Page 97: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

97

Profile-based page placement

[Chart: performance improvement over first-touch [%] (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 98: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

98

Profile-based page placement

[Chart: performance improvement over first-touch [%] (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 99: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

99

Profile-based page placement

Performance improvement over first-touch in some cases
No performance improvement in many cases

Why?

Page 100: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

100

Inter-processor data sharing

[Diagram: the profile shows P0 accessed 1000 times by T0, P1 accessed 3000 times by T1, and P2 accessed 4000 times by T0 and 5000 times by T1: P2 is inter-processor shared]

Page 101: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

101

Inter-processor data sharing

[Diagram: same profile, animation step; no single placement of P2 is local to both T0 and T1]

Page 102: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

102

Inter-processor data sharing

Inter-processor shared heap relative to total heap
[Chart: shared heap / total heap [%] (0-60%) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 103: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

103

Inter-processor data sharing

Inter-processor shared heap relative to total heap
[Chart: shared heap / total heap [%] (0-60%) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 104: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

104

Inter-processor data sharing

Inter-processor shared heap relative to total heap vs. performance improvement over first-touch
[Chart: shared heap / total heap [%] (0-60%, left axis) and performance improvement [%] (0-30%, right axis) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 105: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

105

Inter-processor data sharing

Inter-processor shared heap relative to total heap vs. performance improvement over first-touch
[Chart: shared heap / total heap [%] (0-60%, left axis) and performance improvement [%] (0-30%, right axis) for cg.B, lu.C, bt.B, ft.B, sp.B]

Page 106: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

106

Automatic page placement

Profile-based page placement often ineffective
Reason: inter-processor data sharing

Inter-processor data sharing is a program property

We propose program transformations
No time for details now, see results

Page 107: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

107

Evaluation
[Chart: performance improvement over first-touch [%] (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B; profile-based allocation vs. program transformations]

Page 108: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

108

Evaluation
[Chart: performance improvement over first-touch [%] (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B; profile-based allocation vs. program transformations]

Page 109: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

109

Conclusions

Performance optimizations
Good algorithm + hardware-awareness
Example: cache-aware matrix multiplication

Hardware awareness
Resource sharing in multicore processors
Data placement in non-uniform memory architectures

A lot remains to be done...
...and you can be part of it!

Page 110: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

110

ETH scholarship for master's students...
...to work on their master thesis in the Laboratory of Software Technology

Prof. Thomas R. Gross
PhD, Stanford University, MIPS project, supervisor John L. Hennessy
Carnegie Mellon: Warp, iWarp, Fx projects

ETH offers you
Monthly scholarship of CHF 1500–1700 (EUR 1200–1400)
Assistance with finding housing
Thesis topic

Page 111: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

111

Possible Topics

Michael Pradel: Automatic bug finding

Luca Della Toffola: Performance optimizations for Java

Me: Hardware-aware performance optimizations

Page 112: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

OO code positioning
[Diagram: call graph with nodes A, B, C, D, E; the methods are laid out in memory in the order A B C D E, and A, B, C are currently in the cache]

Page 113: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: the same call graph; profiling identifies the hot path]

Page 114: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: with the memory layout A B C D E, a call along the hot path misses in the cache]

Page 115: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: the hot methods are placed next to each other in memory, so the hot-path call hits in the cache]

• JVM
• No profiling
• Constructors

Page 116: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

• Linked list traversal
• Looking for the youngest/oldest person
[Diagram: linked list of Person objects, each with fields next, name, surname, age; the last next pointer is null]

Page 117: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: the cache fills with whole Person objects (next, name, surname, age)]

Page 118: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: animation step; only the next and age fields of each cached Person are actually used]

Page 119: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: after splitting, cache lines hold only the hot fields (next, age), so many more list nodes fit in the cache]
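In C terms, the split in the diagram could look like the sketch below (illustrative names; the slides do this inside a JVM): the traversal touches only the hot fields, so far more nodes fit per cache block.

#include <stddef.h>

struct PersonCold {          /* cold fields, moved out of the node  */
    const char *name;
    const char *surname;
};

struct Person {
    struct Person *next;     /* hot: followed on every step         */
    int age;                 /* hot: compared on every step         */
    struct PersonCold *cold; /* one pointer instead of two strings  */
};

/* finding the youngest person now touches only next and age */
struct Person *youngest(struct Person *head) {
    struct Person *best = head;
    for (struct Person *p = head; p != NULL; p = p->next)
        if (p->age < best->age)
            best = p;
    return best;
}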

Page 120: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

[Diagram: class A with fields a1-a5; profiling counts field accesses (a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000); splitting keeps the hot fields (a3, a5) in A and moves the cold fields (a1, a2, a4) into A$Cold]

• Jikes RVM
• Splitting strategies
• Garbage collection optimizations
• Allocation optimizations

Page 121: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

121

If interested and motivated

Apply @ Prof. Rodica Potolea
Until August 2012

Come to Zurich
Start in February 2013
Work 4-6 months on the thesis

If you have questions
Send e-mail to me: [email protected]
Talk to Prof. Rodica Potolea

Page 122: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland.

122

Thank you for your attention!

