Date post: 05-Jan-2016
Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross

Department of Computer Science, ETH Zurich


Summary

• NUMA multicore systems are unfair to local memory accesses

• Local execution sometimes suboptimal


Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere


NUMA multicores: how it happened

[Figure: First generation: SMP. Diagram labels: cores 0-7, BusC, Northbridge, MC, DRAM memory.]

[Figure: Next generation: NUMA. Diagram labels: cores 0-3 and 4-7, MC, IC, DRAM memory. In the final step each group of four cores has its own MC and DRAM memory.]


Bandwidth sharing

• Frequent scenario: bandwidth shared between cores

• Sharing model for the Intel Nehalem


Evaluation system

Intel Nehalem E5520

2 x 4 cores

8 MB level 3 cache

12 GB DDR3 RAM

5.86 GT/s QPI
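For orientation, the QPI transfer rate can be converted into a peak link bandwidth. The sketch below assumes a full-width QPI link that carries 2 bytes of payload per transfer in each direction; the function name `qpi_gbs_per_direction` is hypothetical, and the result is a theoretical peak, not a measured value from the slides.

```c
#include <assert.h>

/* Assumption: a full-width QPI link moves 2 bytes of data per transfer
 * in each direction. */
#define QPI_BYTES_PER_TRANSFER 2.0

/* Peak bandwidth in GB/s per direction for a rate given in GT/s. */
double qpi_gbs_per_direction(double gt_per_s)
{
    assert(gt_per_s > 0.0);
    return gt_per_s * QPI_BYTES_PER_TRANSFER;
}
```

Under this assumption, 5.86 GT/s corresponds to roughly 11.7 GB/s per direction.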

[Figure: evaluation system. Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a Level 3 cache, a Global Queue, an MC with attached DRAM memory, and QPI.]

Bandwidth sharing: local accesses

[Figure: system diagram highlighting local accesses to DRAM memory through the Global Queue.]

Bandwidth sharing: remote accesses

[Figure: system diagram highlighting remote accesses to DRAM memory through the Global Queue.]

Bandwidth sharing: combined accesses

[Figure: system diagram highlighting combined local and remote accesses to DRAM memory through the Global Queue.]

Global Queue

• Mechanism to arbitrate between different types of memory accesses

• We look at fairness of the Global Queue:

– local memory accesses

– remote memory accesses

– combined memory accesses


Benchmark program

• STREAM triad

    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + SCALAR * c[i];
    }

• Multiple co-executing triad clones
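The triad loop can be wrapped into a small function together with the bandwidth arithmetic. This is a minimal sketch, not the authors' benchmark harness: the names `triad`, `triad_bytes`, and `bandwidth_gbs` are hypothetical, and the byte count assumes the usual STREAM accounting of two arrays read and one array written per pass.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM triad: a[i] = b[i] + scalar * c[i] for i in [0, n). */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}

/* Bytes moved by one triad pass, assuming STREAM-style accounting:
 * two arrays read and one array written, each holding n doubles. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Bandwidth in GB/s (1 GB = 1e9 bytes) for bytes moved in the given time. */
double bandwidth_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

To measure memory rather than cache bandwidth, the arrays would need to be much larger than the last-level cache; the elapsed time of one `triad` call is then passed to `bandwidth_gbs` together with `triad_bytes(n)`.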


Multi-clone experiments

• All memory allocated on Processor 0

• Local clones run on Processor 0; remote clones run on Processor 1

• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R), where (xL, yR) denotes x local clones and y remote clones

GQ fairness: local accesses

[Figure: total bandwidth [GB/s] with local clones only.]

GQ fairness: remote accesses

[Figure: total bandwidth [GB/s] with remote clones only.]

Global Queue fairness

• Global Queue is fair when there are only local or only remote accesses in the system

• What about combined accesses?


GQ fairness: combined accesses

Execute clones in all possible configurations

[Figure: grid of configurations, # local clones (0-4) by # remote clones (0-4), with (2L, 3R) marked as an example.]
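The configuration space can be enumerated with two nested loops. The sketch below is an illustration under stated assumptions: the names `struct config`, `enumerate_configs`, and `core_for_clone` are hypothetical, the empty configuration (0L, 0R) is skipped because nothing would run, and the core numbering (0-3 on Processor 0, 4-7 on Processor 1) is taken from the system diagram.

```c
#include <assert.h>
#include <stddef.h>

#define CORES_PER_PROCESSOR 4  /* evaluation system: 2 x 4 cores */

/* One benchmark configuration (xL, yR). */
struct config {
    int local;   /* clones on Processor 0, where the memory is allocated */
    int remote;  /* clones on Processor 1 */
};

/* Fill out[] with every (xL, yR), 0 <= x, y <= CORES_PER_PROCESSOR,
 * except the empty configuration. Returns the number of entries written. */
size_t enumerate_configs(struct config *out, size_t capacity)
{
    size_t count = 0;
    for (int local = 0; local <= CORES_PER_PROCESSOR; local++) {
        for (int remote = 0; remote <= CORES_PER_PROCESSOR; remote++) {
            if (local == 0 && remote == 0) {
                continue;  /* no clone to run */
            }
            assert(count < capacity);
            out[count].local = local;
            out[count].remote = remote;
            count++;
        }
    }
    return count;
}

/* Core for the k-th clone of a kind: local clones use cores 0-3,
 * remote clones use cores 4-7 (assumed numbering). */
int core_for_clone(int k, int is_remote)
{
    assert(k >= 0 && k < CORES_PER_PROCESSOR);
    return is_remote ? CORES_PER_PROCESSOR + k : k;
}
```

With four cores per processor this yields 24 non-empty configurations; a driver would pin each clone to the core returned by `core_for_clone` before starting the measurement.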


[Figure: total bandwidth [GB/s] for the combined-access configurations.]


Combined accesses

[Figure: total bandwidth [GB/s] per configuration.]

• In configuration (4L, 1R), the remote clone gets 30% more bandwidth than a local clone

• Remote execution can be better than local


Bandwidth sharing model

$\text{bandwidth}_{\text{total}} = \beta \cdot \text{bandwidth}_{\text{local}} + (1 - \beta) \cdot \text{bandwidth}_{\text{remote}}$
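The model can be read as a weighted sum in which a sharing factor weights the local bandwidth and its complement weights the remote bandwidth. The sketch below assumes exactly that reading; the function name `total_bandwidth` and the parameter name `beta` are hypothetical, and any numbers passed to it are illustrative rather than measurements from the slides.

```c
#include <assert.h>

/* Assumed model: total = beta * local + (1 - beta) * remote,
 * with the sharing factor beta in [0, 1]. Bandwidths in GB/s. */
double total_bandwidth(double beta, double bw_local, double bw_remote)
{
    assert(beta >= 0.0 && beta <= 1.0);
    return beta * bw_local + (1.0 - beta) * bw_remote;
}
```

At `beta = 1.0` the total equals the local bandwidth, and at `beta = 0.0` it equals the remote bandwidth; values in between split the total accordingly.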

[Figure: system diagram with clones (C) accessing DRAM memory through the Global Queue.]

Sharing factor (β)

• Characterizes the fairness of the Global Queue

• Dependence of sharing factor on contention?
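If the sharing model is taken to be a weighted sum of a local and a remote bandwidth, the sharing factor can be estimated by solving that sum for the weight. This is a sketch of one possible estimation, not the authors' procedure; the name `estimate_sharing_factor` and its parameters are hypothetical.

```c
#include <assert.h>

/* Assumed model: total = beta * local + (1 - beta) * remote.
 * Solving for beta gives (total - remote) / (local - remote).
 * Requires local != remote; all bandwidths in the same unit. */
double estimate_sharing_factor(double bw_total, double bw_local,
                               double bw_remote)
{
    assert(bw_local != bw_remote);
    return (bw_total - bw_remote) / (bw_local - bw_remote);
}
```

Repeating the estimate while varying the number of contending clones would show how the factor changes with contention.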


Contention affects sharing factor

[Figure: experiment setup. Diagram labels: DRAM, QPI, clones (C), contenders.]

[Figure: sharing factor (β) under contention.]

[Figure: Combined accesses. Total bandwidth [GB/s].]

Contention affects sharing factor

• Sharing factor decreases with contention

• With local contention, remote execution becomes more favorable


The next generation

Intel Westmere X5680

2 x 6 cores

12 MB level 3 cache

144 GB DDR3 RAM

6.4 GT/s QPI

[Figure: Westmere system. Two processors with six cores each (cores 0-B), each with a Level 3 cache, a Global Queue, an IMC with attached DRAM memory, and QPI.]

[Figure: total bandwidth [GB/s] on the Westmere system.]

Conclusions

• Optimizing for data locality can be suboptimal

• Applications:

– OS scheduling (see ISMM’11 paper)

– data placement and computation scheduling

Thank you! Questions?
