Date post: 18-Jan-2018
IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech
Transcript
IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING

Martí Torrents, Raúl Martínez, and Carlos Molina

Computer Architecture Department
UPC – BarcelonaTech

Outline

• Motivation
– Prefetching
– Prefetching in CMPs
– Prefetch adverse behaviors

• Objective
– Proposal
– Code region granularity
– Switch the prefetcher off
– Switch the prefetcher on

• Experimental framework

• Expected Results

Motivation

• The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores

Prefetching

• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may cause
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption

Prefetch in CMPs

• Useful prefetching improves performance
– Avoids network latency
– Reduces memory access latency

• Useless prefetching degrades performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests

Prefetch adverse behaviors

M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.

Prefetch in shared memories

• The prefetcher is distributed

• This entails challenges
– Distributed memory streams
– Distributed prefetch queue
– Statistics are generated and collected at different points

• These challenges make the prefetcher's task more difficult

• Harder to prefetch accurately

M. Torrents, et al. “Prefetching Challenges in Distributed Memories for CMPs”. In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.

Objective

• Maximize the prefetching effect
• By using it only when it is working properly
• Minimizing its adverse effects

Proposal

• Identify when the prefetcher causes a slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with stats

• Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution

• Switch it on again
– When it generates speedup

Code Region Granularity

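• Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code

    mov ebx, 0                  ; loop index
    mov eax, 0                  ; running sum
    mov ecx, 0                  ; scratch register for the loaded element

_Label_1:
    mov ecx, [esi + ebx * 4]    ; load the 4-byte element at index ebx from the array at esi
    add eax, ecx                ; accumulate it
    inc ebx                     ; next index
    cmp ebx, 100
    jne _Label_1                ; loop until 100 elements have been summed

Instruction level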
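
The three granularities named on this slide can be illustrated with a small Python sketch that partitions the listing into regions. This is only an illustration under stated assumptions: `split_regions`, `BRANCH_MNEMONICS`, and the string-based instruction format are hypothetical names chosen here, not part of the presented proposal, and the basic-block rule used (a label starts a block, a branch ends one) is the textbook definition rather than anything specified on the slides.

```python
# Hypothetical helper: split a flat instruction list into code regions.
# Each region is a list of instruction indices.

# Assumed set of control-transfer mnemonics that end a basic block.
BRANCH_MNEMONICS = {"jmp", "jne", "je", "jl", "jg", "call", "ret"}


def split_regions(instructions, granularity):
    """Partition `instructions` at the requested granularity.

    granularity is one of "instruction", "basic_block", or "all".
    """
    if granularity == "instruction":
        # One region per line of the listing.
        return [[i] for i in range(len(instructions))]
    if granularity == "all":
        # The whole code is a single region.
        return [list(range(len(instructions)))]
    if granularity != "basic_block":
        raise ValueError(f"unknown granularity: {granularity}")

    regions, current = [], []
    for i, ins in enumerate(instructions):
        if ins.endswith(":") and current:
            # A label is a branch target, so it starts a new block.
            regions.append(current)
            current = []
        current.append(i)
        if ins.split()[0] in BRANCH_MNEMONICS:
            # A branch ends the current block.
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


CODE = [
    "mov ebx, 0", "mov eax, 0", "mov ecx, 0",
    "_Label_1:", "mov ecx, [esi + ebx * 4]", "add eax, ecx",
    "inc ebx", "cmp ebx, 100", "jne _Label_1",
]

for level in ("instruction", "basic_block", "all"):
    print(level, split_regions(CODE, level))
```

At basic-block granularity this sketch yields two regions for the listing (the three initial `mov` instructions, and the loop starting at `_Label_1`); finer granularity produces more regions to track.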

[Two further slides repeat this assembly listing, labelled "Basic block level" and "All the code".]

• Regions are tagged with statistics
– Accuracy / miss ratio

• Activate or deactivate the prefetcher at every new code region
– According to the statistics of the current code region

• Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code

• Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during the warm-up)
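
A minimal Python sketch of the tagging idea follows. It assumes a profiling phase that reports, per region, whether each issued prefetch turned out to be useful, and a lookup performed whenever execution enters a new region. `RegionTable`, `RegionStats`, the region names, and the 0.5 accuracy threshold are all hypothetical choices made for illustration; the slides do not specify a data structure or a threshold.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Per-region tag (hypothetical layout)."""
    useful_prefetches: int = 0
    total_prefetches: int = 0

    @property
    def accuracy(self):
        if self.total_prefetches == 0:
            return None  # region not profiled yet
        return self.useful_prefetches / self.total_prefetches


class RegionTable:
    """Maps a region identifier to its profiled statistics."""

    def __init__(self, min_accuracy=0.5):  # threshold is an assumption
        self.min_accuracy = min_accuracy
        self.stats = {}

    def record(self, region_id, useful):
        """Profiling step: account for one prefetch issued in region_id."""
        tag = self.stats.setdefault(region_id, RegionStats())
        tag.total_prefetches += 1
        tag.useful_prefetches += int(useful)

    def prefetcher_enabled(self, region_id):
        """Decision taken when execution enters region_id."""
        tag = self.stats.get(region_id)
        if tag is None or tag.accuracy is None:
            return True  # untagged region: leave the prefetcher on
        return tag.accuracy >= self.min_accuracy


table = RegionTable(min_accuracy=0.5)
for useful in (True, True, True, False):
    table.record("loop", useful)   # accuracy 0.75
for useful in (False, False, False, True):
    table.record("init", useful)   # accuracy 0.25

for region in ("init", "loop", "other"):
    print(region, table.prefetcher_enabled(region))
```

Whether the table is filled by a separate profiling run or during warm-up only changes when `record` is called; the lookup at region entry stays the same in this sketch.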

Switching off the prefetcher

• Detect when the prefetcher is useless

• Accuracy
– Useful prefetches / total number of prefetches
– Switch off when the accuracy decreases

• Miss ratio
– Based on the number of misses
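
The two metrics and the "switch off when the accuracy decreases" rule could be expressed as in the Python sketch below. The function names and the `tolerance` parameter are assumptions introduced here; the slides only give the accuracy formula and do not say how large a drop should trigger the switch-off.

```python
def accuracy(useful_prefetches, total_prefetches):
    """Useful prefetches / total number of prefetches."""
    if total_prefetches == 0:
        return 0.0  # assumption: no prefetches counts as zero accuracy
    return useful_prefetches / total_prefetches


def miss_ratio(misses, accesses):
    """Fraction of cache accesses that missed."""
    if accesses == 0:
        return 0.0
    return misses / accesses


def should_switch_off(previous_accuracy, current_accuracy, tolerance=0.05):
    """True when accuracy fell by more than `tolerance` (hypothetical
    parameter) between two consecutive measurement intervals."""
    return previous_accuracy - current_accuracy > tolerance


before = accuracy(30, 40)   # 0.75
after = accuracy(16, 40)    # 0.40
print(should_switch_off(before, after))
```

The tolerance avoids toggling the prefetcher on small fluctuations; a real design would have to pick the interval length and tolerance empirically.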

Switching on the prefetcher

• A switched-off prefetcher does not generate stats

• So it cannot be reactivated based on an accuracy increase

• Reactivate when?
– Based on miss ratio
– After a certain timeout
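
One way to combine the two reactivation options is sketched below in Python. It is an illustration only: `PrefetcherSwitch`, the cycle-based timeout, and the rule that a miss ratio above a limit re-enables the prefetcher are assumptions, since the slides name the two criteria without fixing how they are applied.

```python
class PrefetcherSwitch:
    """Hypothetical on/off controller for one prefetcher."""

    def __init__(self, timeout_cycles=10_000, miss_ratio_limit=0.2):
        self.timeout_cycles = timeout_cycles      # assumed default
        self.miss_ratio_limit = miss_ratio_limit  # assumed default
        self.enabled = True
        self.off_since = None

    def switch_off(self, now):
        self.enabled = False
        self.off_since = now

    def tick(self, now, misses, accesses):
        """Periodic check; returns whether the prefetcher is enabled.

        While the prefetcher is off no accuracy is available, so only the
        cache miss ratio and the elapsed time are consulted.
        """
        if self.enabled:
            return True
        timed_out = now - self.off_since >= self.timeout_cycles
        ratio = misses / accesses if accesses else 0.0
        if timed_out or ratio > self.miss_ratio_limit:
            self.enabled = True
            self.off_since = None
        return self.enabled


switch = PrefetcherSwitch(timeout_cycles=1000, miss_ratio_limit=0.2)
switch.switch_off(now=0)
print(switch.tick(now=100, misses=5, accesses=100))   # still off
print(switch.tick(now=200, misses=30, accesses=100))  # misses rose: back on
```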

Experimental framework

• gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator

• PARSEC 2.1

Simulation environment

Expected Results

• Power savings without losing performance

• Smaller granularity gives more accuracy
– Blocks or super blocks are better than the whole code
– Single instructions are more accurate than blocks or super blocks

• Smaller granularity also requires
– More resources
– More complexity

• Basic block granularity should provide good results with realistic complexity

Q & A

Back up slides

Prefetch Distributed Memory Systems

• Increases the complexity of prefetching

• Challenges without trivial solutions

[Diagram: four CPUs, each with an L1 cache and a prefetcher, and a distributed L2 memory]

[The same diagram is repeated on six further slides, adding these annotations step by step:]

• "@" and "L1 MISS for @", then the label "Distributed patterns"
• "@", "@ + 2", and "@ + 4", then the label "Queue filtering"
• "L1 MISS for @ + 2", then the label "Dynamic profiling"

