TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48...

Page 1: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!

TUNING SLIDE

MICRO-48 Tutorial

December 5, 2015

Fast and Accurate Microarchitectural Simulation with ZSim

Daniel Sanchez, Nathan Beckmann, Anurag Mukkara, Po-An Tsai

MIT CSAIL

Welcome!

Agenda

8:30 – 9:10 Intro and Overview

9:10 – 9:25 Simulator Organization

9:25 – 10:00 Core Models

10:00 – 10:20 Break / Q&A

10:20 – 11:00 Memory System

11:00 – 11:20 Configuration and Stats

11:20 – 11:40 Validation

11:40 – 12:00 Q&A

Introduction and Overview

Motivation

Current detailed simulators are slow (~200 KIPS)

Simulation performance wall

More complex targets (multicore, memory hierarchy, …)

Hard to parallelize

Problem: Time to simulate 1000 cores @ 2 GHz for 1 s at:

200 KIPS: 4 months

200 MIPS: 3 hours

Alternatives?

FPGAs: Fast, good progress, but still hard to use

Simplified/abstract models: Fast but inaccurate
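The 4-month and 3-hour figures follow from simple arithmetic. The sketch below reproduces them under two assumptions that the slide does not state: each simulated core retires about one instruction per cycle, and KIPS/MIPS count aggregate simulated instructions per second of host time.

```python
# Back-of-the-envelope check of the simulation-time figures.
# Assumption (not stated on the slide): IPC of 1 per simulated core.
CORES = 1000
FREQ_HZ = 2e9        # 2 GHz target clock
SIM_SECONDS = 1.0    # simulated time
IPC = 1.0            # assumed


def host_seconds(sim_speed_ips: float) -> float:
    """Host time needed to simulate all target instructions at a given speed."""
    instructions = CORES * FREQ_HZ * SIM_SECONDS * IPC
    return instructions / sim_speed_ips


slow_days = host_seconds(200e3) / 86_400   # 200 KIPS
fast_hours = host_seconds(200e6) / 3_600   # 200 MIPS

print(f"200 KIPS: {slow_days:.0f} days")    # about 116 days, roughly 4 months
print(f"200 MIPS: {fast_hours:.1f} hours")  # about 2.8 hours, roughly 3 hours
```

With a different assumed IPC the absolute numbers shift, but the 1000x gap between the two simulation speeds stays the same.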

ZSim Techniques

Three techniques to make 1000-core simulation practical:

1. Detailed DBT-accelerated core models to speed up sequential simulation

2. Bound-weave to scale parallel simulation

3. Lightweight user-level virtualization to bridge user-level/full-system gap

ZSim achieves high performance and accuracy:

Simulates 1024-core systems at 10s-1000s of MIPS

100-1000x faster than current simulators

Validated against real Westmere system, avg error ~10%

This Presentation is Also a Demo!

ZSim is simulating these slides

OOO Westmere cores running at 2 GHz

3-level cache hierarchy

Will illustrate other features as I present them

Total cycles and instructions simulated (in billions)

Current simulation speed and basic stats (updated every 500 ms)

Activity legend: Busy (> 0.9 cores active); 0.1 < cores active < 0.9; Idle (< 0.1 cores active)

ZSim performance relevant when busy

Running on 2-core laptop CPU @ 1.7 GHz

~12x slower than 16-core server @ 2.6 GHz
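The "~12x slower" figure is consistent with a crude cores-times-clock estimate. The sketch below is only a sanity check under that assumption; it ignores memory bandwidth, cache sizes, and microarchitectural differences between the two hosts. The `activity` helper is a hypothetical illustration of the three-way legend, and how the exact boundary values 0.1 and 0.9 are classified is my choice, not something the slide specifies.

```python
# Assumption: simulation throughput scales with host cores x clock frequency.
laptop = 2 * 1.7     # cores x GHz
server = 16 * 2.6
slowdown = server / laptop
print(f"laptop is ~{slowdown:.1f}x slower")


def activity(cores_active: float) -> str:
    """Classify the average number of active simulated cores (legend thresholds)."""
    if cores_active > 0.9:
        return "busy"
    if cores_active < 0.1:
        return "idle"
    return "partially active"  # boundary values land here by assumption
```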

Main Design Decisions

General execution-driven simulator: functional model + timing model

Functional model: Emulation? (e.g., gem5, MARSSx86) Instrumentation? (e.g., Graphite, Sniper)

Dynamic Binary Translation (Pin)

Functional model “for free”

Base ISA = Host ISA (x86)

Timing model: Cycle-driven? Event-driven?

DBT-accelerated, instruction-driven core + Event-driven uncore

Outline

Introduction

Detailed DBT-accelerated core models

Bound-weave parallelization

Lightweight user-level virtualization

Accelerating Core Models

Shift most of the work to DBT instrumentation phase

Basic block:

mov (%rbp),%rcx

add %rax,%rbx

mov %rdx,(%rbp)

ja 40530a

Instrumented basic block:

Load(addr = (%rbp))

mov (%rbp),%rcx

add %rax,%rbx

Store(addr = (%rbp))

mov %rdx,(%rbp)

BasicBlock(BBLDescriptor)

ja 10840530a

Basic block descriptor: instruction-to-µop decoding; µop dependencies, functional units, latency; front-end delays

Instruction-driven models: Simulate all stages at once for each instruction/µop

Accurate even with OOO if instruction window prioritizes older instructions

Faster, but more complex than cycle-driven
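The split between instrumentation-time and execution-time work can be sketched in a few lines. This is a toy illustration, not ZSim's or Pin's API: `classify`, `instrument`, `Core`, the latency table, and the front-end cost are all hypothetical. The point is that decoding happens once per basic block, while each execution runs only cheap callbacks.

```python
from dataclasses import dataclass

# Hypothetical per-uop latencies, for illustration only.
UOP_LATENCY = {"load": 4, "store": 1, "alu": 1, "branch": 1}


@dataclass(frozen=True)
class BblDescriptor:
    """Static per-block information, computed once at instrumentation time."""
    uops: tuple            # (kind, latency) pairs
    frontend_cycles: int   # hypothetical fixed front-end delay


def classify(ins: str) -> str:
    """Very rough AT&T-syntax classifier covering only the example block."""
    op, _, operands = ins.partition(" ")
    if op.startswith("j"):
        return "branch"
    src, _, dst = operands.partition(",")
    if op == "mov" and src.startswith("("):
        return "load"
    if op == "mov" and dst.startswith("("):
        return "store"
    return "alu"


def instrument(bbl: tuple) -> BblDescriptor:
    """Expensive step: decode the block into uops (done once per block)."""
    uops = tuple((kind, UOP_LATENCY[kind]) for kind in map(classify, bbl))
    return BblDescriptor(uops, frontend_cycles=len(bbl))


class Core:
    """Toy timing model driven by Load/Store/BasicBlock callbacks."""

    def __init__(self):
        self.cycle = 0
        self.loads = 0
        self.stores = 0

    def load(self, addr):
        self.loads += 1

    def store(self, addr):
        self.stores += 1

    def basic_block(self, desc):
        # Toy: serial sum of latencies; a real model tracks dependencies and ports.
        self.cycle += desc.frontend_cycles + sum(lat for _, lat in desc.uops)


BBL = ("mov (%rbp),%rcx", "add %rax,%rbx", "mov %rdx,(%rbp)", "ja 40530a")

desc = instrument(BBL)   # once, when the block is first translated
core = Core()
for _ in range(1000):    # each execution only runs the cheap callbacks
    core.load(0x1000)
    core.store(0x1000)
    core.basic_block(desc)
```

The serial latency sum stands in for the timing model only to keep the sketch short; the slide's instruction-driven models account for µop dependencies, functional units, and front-end delays using the descriptor.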

Detailed OOO Model

OOO core modeled and validated against Westmere

Pipeline: Fetch, Decode, Issue, OOO Exec, Commit

Main Features

Wrong-path fetches

Branch Prediction

Front-end delays (predecoder, decoder)

Detailed instruction to µop decoding

Rename/capture stalls

IW with limited size and width

Functional unit delays and contention

Detailed LSU (forwarding, fences,…)

Reorder buffer with limited size and width

Detailed OOO Model

OOO core modeled and validated against Westmere

Pipeline: Fetch, Decode, Issue, OOO Exec, Commit

Fundamentally Hard to Model

Wrong-path execution

In Westmere, wrong-path instructions don’t affect recovery latency or pollute caches

Skipping OK

Not Modeled (Yet)

Rarely used instructions

BTB

LSD

TLBs

Single-Thread Accuracy

8.5% average IPC error, max 26%, 21/29 within 10%

29 SPEC CPU2006 apps for 50 Billion instructions

Real: Xeon L5640 (Westmere), 3x DDR3-1333, no HT

Simulated: OOO cores @ 2.27 GHz, detailed uncore

Single-Thread Performance

Host: E5-2670 @ 2.6 GHz (single-thread simulation)

29 SPEC CPU2006 apps for 50 Billion instructions

40 MIPS hmean

12 MIPS hmean

~3x between least and most detailed models!

~10-100x faster
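Assuming "hmean" denotes the harmonic mean of per-application simulation speeds (the usual choice for averaging rates), the summary numbers can be reproduced as below. The per-application values are hypothetical placeholders, since the chart data did not survive in this transcript; only the 40 and 12 MIPS figures come from the slide.

```python
from statistics import harmonic_mean

# Hypothetical per-application speeds in MIPS; NOT data from the slide.
sample_mips = [20.0, 40.0, 80.0]
hmean = harmonic_mean(sample_mips)   # 3 / (1/20 + 1/40 + 1/80)
print(f"hmean of sample: {hmean:.1f} MIPS")

# Ratio of the two hmean figures quoted on the slide.
ratio = 40 / 12
print(f"ratio: ~{ratio:.1f}x")
```

The harmonic mean weights slow applications more heavily than the arithmetic mean does, which matches how total simulation time behaves when every application runs for the same instruction count.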

Outline

Introduction

Detailed DBT-accelerated core models

Bound-weave parallelization

Lightweight user-level virtualization

Parallelization Techniques

Parallel Discrete Event Simulation (PDES):

Divide components across host threads

Execute events from each component maintaining illusion of full order

Accurate, not scalable

Lax synchronization: Allow skews above inter-component latencies, tolerate ordering violations

Scalable, inaccurate

[Figure: Core 0, Core 1, L3 Bank 0, L3 Bank 1, and Mem 0 divided between Host Thread 0 and Host Thread 1; numeric labels 5, 10, 15, 15, 10, 5; Skew < 10 cycles]
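The difference between the two schemes comes down to how much clock skew between host threads is allowed. A minimal sketch, assuming a hypothetical `may_run` check and using the 10-cycle bound from the figure label:

```python
def may_run(event_cycle: int, other_thread_cycles: list, max_skew: int) -> bool:
    """Run an event only if it stays within max_skew cycles of the slowest other thread."""
    return event_cycle - min(other_thread_cycles) < max_skew


# Conservative PDES: skew stays below the inter-component latency (10 cycles here),
# so no event can arrive in another thread's past. Threads stall often.
print(may_run(14, [5], max_skew=10))    # within the bound
print(may_run(15, [5], max_skew=10))    # must wait for the other thread

# Lax synchronization: a much larger bound means fewer stalls, but events
# may now be simulated out of order.
print(may_run(900, [5], max_skew=1000))
```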

Characterizing Interference

Path-altering interference

If we simulate two accesses out of order, their paths through the memory hierarchy change

[Figure: Core 0 and Core 1 both issue GETS A to the LLC. One access misses and goes to Mem, the other hits. Shown for both orders (1, 2 and 2, 1); swapping the order swaps which access sees the MISS and which sees the HIT.]

Path-preserving interference

If we simulate two accesses out of order, their timing changes but their paths do not

[Figure: Core 0 and Core 1 access a blocking LLC; GETS A misses and goes to Mem, GETS B hits. Shown for both orders, with steps numbered 1 to 6; the paths are the same in both, only the step order differs.]
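The two kinds of interference can be reproduced with a toy shared cache. This sketch is illustrative only: `simulate` and its `resident` parameter are hypothetical, the LLC has unlimited capacity, and timing is not modeled, so only the paths are compared.

```python
def simulate(accesses, resident=()):
    """Replay (core, line) accesses through a toy shared LLC.

    Returns the path each access takes: 'LLC hit' or 'LLC miss -> Mem'.
    """
    llc = set(resident)
    paths = {}
    for core, line in accesses:
        if line in llc:
            paths[(core, line)] = "LLC hit"
        else:
            paths[(core, line)] = "LLC miss -> Mem"
            llc.add(line)  # the miss fills the line
    return paths


# Path-altering: both cores read line A, so the order decides who misses.
in_order = simulate([("core0", "A"), ("core1", "A")])
swapped = simulate([("core1", "A"), ("core0", "A")])
print(in_order != swapped)

# Path-preserving: A misses and B (already resident) hits in either order.
ab = simulate([("core0", "A"), ("core1", "B")], resident=("B",))
ba = simulate([("core1", "B"), ("core0", "A")], resident=("B",))
print(ab == ba)
```

In the path-preserving case a blocking LLC would still change when each access completes, which is the timing effect the slide refers to.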

Characterizing Interference

Accesses with path-altering interference with barrier synchronization every 1K/10K/100K cycles (64 cores):

[Figure: chart not recoverable; annotated "1 in 10K accesses"]

Path-altering interference extremely rare in small intervals

Strategy:

Simulate path-preserving interference faithfully

Ignore (but optionally profile) path-altering interference

Bound-Weave Parallelization

Divide simulation into small intervals (e.g., 1000 cycles)

Two parallel phases per interval: Bound and weave

Bound phase: Find paths

Weave phase: Find timings

Bound-Weave equivalent to PDES for path-preserving interference

Bound-Weave Example

2-core host simulating 4-core system

1000-cycle intervals

Divide components among 2 domains

[Figure: Cores 0 to 3, each with L1I, L1D, and L2; L3 Banks 0 to 3; Mem Ctrl 0 and Mem Ctrl 1; components split into Domain 0 and Domain 1]

Bound Phase: Parallel simulation until cycle 1000, gather access traces

Weave Phase: Parallel event-driven simulation of gathered traces until actual cycle 1000

Feedback: Adjust core cycles

Bound Phase (until cycle 2000) …

[Figure: host-time timeline for Host Thread 0 and Host Thread 1. Bound phase: Cores 0 to 3 scheduled across the two threads. Weave phase: Domain 0 and Domain 1. Next bound phase: the cores are scheduled again.]
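One interval of the bound, weave, and feedback sequence can be sketched serially as follows. This is a toy under stated assumptions, not ZSim's implementation: the latencies, the single shared bank, and the way feedback is applied (adding the extra delay to each core's clock) are all hypothetical, and the real phases run in parallel across host threads.

```python
MIN_LAT = 20   # hypothetical zero-load latency used in the bound phase
SERVICE = 15   # hypothetical bank occupancy per access, modeled in the weave phase
INTERVAL = 1000


def bound_phase(issue_cycles):
    """Simulate each core with no interference; record (issue, core, min completion)."""
    trace = []
    for core, cycles in issue_cycles.items():
        for c in cycles:
            trace.append((c, core, c + MIN_LAT))
    return trace


def weave_phase(trace):
    """Replay the trace in timestamp order, adding contention at one shared bank."""
    bank_free = 0
    extra = {}
    for issue, core, bound_done in sorted(trace):
        start = max(issue, bank_free)      # wait if the bank is still busy
        bank_free = start + SERVICE
        done = start + MIN_LAT
        # Toy simplification: a delay does not shift the core's later accesses.
        extra[core] = extra.get(core, 0) + (done - bound_done)
    return extra


# Hypothetical access issue cycles for two cores within one interval.
issue_cycles = {0: [100, 400], 1: [105, 700]}

trace = bound_phase(issue_cycles)
extra = weave_phase(trace)
# Feedback: each core starts the next interval late by its accumulated delay.
core_clock = {core: INTERVAL + extra[core] for core in issue_cycles}
print(core_clock)
```

Here core 1's first access arrives while the bank is still serving core 0, so the weave phase finds 10 extra cycles that the bound phase could not see.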

Example: Bound Phase

Host thread 0 simulates core 0, records trace:

[Figure: event trace. Events: Core0 @ 30, L3b1 @ 50 (HIT), Core0 @ 60, L3b0 @ 80 (MISS), Core0 @ 90, Mem1 @ 110 (READ), L3b0 @ 230 (RESP), Core0 @ 250, L3b3 @ 270 (HIT), Core0 @ 290. Edge labels: 20, 20, 20, 30, 30, 100, 30, 120, 20, 20, 20, 40.]

Edges fix minimum latency between events

Minimum L3 and main memory latencies (no interference)

Example: Weave Phase

Host threads simulate components from domains 0 and 1

Host threads only sync when needed

e.g., thread 1 simulates other events (not shown) until cycle 110, syncs

Lower bounds guarantee no order violations

[Figure: the event trace recorded in the bound phase (Core0 @ 30 through Core0 @ 290, same edge labels), with events divided between Host Thread 0 and Host Thread 1]

Example: Weave Phase

Delays propagate as events are simulated:

[Figure: the event trace recorded in the bound phase, divided between Host Thread 0 and Host Thread 1. A row miss adds 50 cycles; updated cycle labels appear one build step at a time: 170, 280, 290, 300, 320, 340.]
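Delay propagation amounts to recomputing event times along minimum-latency edges. The sketch below uses a simplified chain built from event times and latencies that appear in the example; how the edges connect is an assumption, since the figure did not survive extraction, and the chain covers only part of the trace. The `propagate` helper and the event names are hypothetical.

```python
def propagate(root, root_cycle, edges, extra_delay):
    """Recompute event cycles from minimum-latency edges.

    edges: (src, dst, min_latency) tuples, listed in topological order.
    extra_delay: cycles added when leaving an event (e.g. a DRAM row miss).
    """
    cycle = {root: root_cycle}
    for src, dst, lat in edges:
        t = cycle[src] + lat + extra_delay.get(src, 0)
        cycle[dst] = max(cycle.get(dst, 0), t)  # an event waits for its latest input
    return cycle


# Assumed chain: core access -> L3 miss -> memory read -> response -> next access.
EDGES = [
    ("Core0.a", "L3b0.miss", 20),
    ("L3b0.miss", "Mem1.read", 30),
    ("Mem1.read", "L3b0.resp", 120),
    ("L3b0.resp", "Core0.b", 20),
    ("Core0.b", "L3b3.hit", 20),
    ("L3b3.hit", "Core0.c", 20),
]

bound = propagate("Core0.a", 60, EDGES, {})                 # no interference
weave = propagate("Core0.a", 60, EDGES, {"Mem1.read": 50})  # row miss: +50 cycles

for event in bound:
    print(f"{event}: {bound[event]} -> {weave[event]}")
```

Events upstream of the row miss keep their bound-phase times, while everything downstream shifts by 50 cycles, which is why a lower bound from the bound phase is enough to rule out ordering violations.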

Bound-Weave Scalability

Bound phase scales almost linearly

Using novel shared-memory synchronization protocol (later)

Weave phase scales much better than PDES

Threads only need to sync when an event crosses domains

A lot of work shifted to bound phase

Need bound and weave models for each component, but division is often very natural

e.g., caches: hit/miss on bound phase; MSHRs, pipelined accesses, port contention on weave phase
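A minimal sketch of how such a split might look for a cache, assuming a toy model with unlimited capacity and a fixed number of MSHRs. All names here (`ToyCache`, `boundAccess`, `weaveReplay`, `AccessRecord`) are hypothetical and are not zsim's cache interface. The bound-phase call decides hit or miss and charges the zero-load latency; the weave-phase replay adds stalls when all MSHRs are busy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_set>
#include <vector>

// Hypothetical record of one access, written during the bound phase.
struct AccessRecord {
    uint64_t issueCycle;
    bool hit;
};

class ToyCache {
public:
    ToyCache(uint32_t hitLat, uint32_t missLat, uint32_t mshrs)
        : hitLat_(hitLat), missLat_(missLat), mshrs_(mshrs) {
        assert(mshrs > 0);
    }

    // Bound phase: functional hit/miss decision plus zero-load latency.
    // Accesses are assumed to arrive in nondecreasing cycle order.
    uint64_t boundAccess(uint64_t lineAddr, uint64_t cycle) {
        const bool hit = !lines_.insert(lineAddr).second;
        trace_.push_back({cycle, hit});
        return cycle + (hit ? hitLat_ : missLat_);
    }

    // Weave phase: replay the recorded accesses and model MSHR contention.
    // Returns the completion cycle of every recorded access.
    std::vector<uint64_t> weaveReplay() const {
        // Min-heap of the cycles at which occupied MSHRs are released.
        std::priority_queue<uint64_t, std::vector<uint64_t>,
                            std::greater<uint64_t>> busy;
        std::vector<uint64_t> done;
        for (const AccessRecord& a : trace_) {
            if (a.hit) {
                done.push_back(a.issueCycle + hitLat_);
                continue;
            }
            uint64_t start = a.issueCycle;
            while (!busy.empty() && busy.top() <= start) busy.pop();
            if (busy.size() == mshrs_) {  // all MSHRs taken: wait for one
                start = busy.top();
                busy.pop();
            }
            busy.push(start + missLat_);
            done.push_back(start + missLat_);
        }
        return done;
    }

private:
    uint32_t hitLat_, missLat_, mshrs_;
    std::unordered_set<uint64_t> lines_;  // toy tag store, never evicts
    std::vector<AccessRecord> trace_;
};
```

With one MSHR, a 10-cycle hit, and a 100-cycle miss, two misses issued at cycles 0 and 20 get bound-phase completion times of 100 and 120, while the replay reports 100 and 200: the weave result is never earlier than the bound-phase lower bound.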

Bound-Weave Take-Aways

Minimal synchronization:

Bound phase: Unordered accesses (like lax)

Weave: Only sync on actual dependencies

No ordering violations in weave phase

Works with standard event-driven models

e.g., 110 lines to integrate with DRAMSim2

Multithreaded Accuracy

23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM

11.2% avg perf error (not IPC), 10/23 within 10%

Similar differences as single-core results

1024-Core Performance

Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)

Results for the 14/23 parallel apps that scale

[Figure annotations: 200 MIPS hmean, 41 MIPS hmean]

~5x between least and most detailed models!

~100-1000x faster

Bound-Weave Scalability

10.1-13.6x speedup @ 16 cores

Outline

Introduction

Detailed DBT-accelerated core models

Bound-weave parallelization

Lightweight user-level virtualization

Lightweight User-Level Virtualization

No 1Kcore OSs

No parallel full-system DBT

ZSim has to be user-level for now

Problem: User-level simulators limited to simple workloads

Lightweight user-level virtualization: Bridge the gap with full-system simulation

Simulate accurately if time spent in OS is minimal

Lightweight User-Level Virtualization

Multiprocess workloads

Scheduler (threads > cores)

Time virtualization

System virtualization

Simulator-OS deadlock avoidance

Signals

ISA extensions

Fast-forwarding

ZSim Limitations

Not implemented yet:

Multithreaded cores

Detailed NoC models

Virtual memory (TLBs)

Fundamentally hard:

Systems or workloads with frequent path-altering interference (e.g., fine-grained message-passing across whole chip)

Kernel-intensive applications

Summary

Three techniques to make 1Kcore simulation practical

DBT-accelerated models: 10-100x faster core models

Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss

Lightweight user-level virtualization: Simulate complex workloads without full-system support

ZSim achieves high performance and accuracy:

Simulates 1024-core systems at 10s-1000s of MIPS

Validated against real Westmere system, avg error ~10%

Simulator Organization

Main Components

[Diagram: Harness, Config, Global Memory, Driver, User-level virtualization, System Initialization, Stats, Memory system timing models, Core timing models]

ZSim Harness

Most of zsim implemented as a pintool (libzsim.so)

A separate harness process (zsim) controls simulation

Initializes global memory

Launches pin processes

Checks for deadlock

[Diagram: the zsim harness, started with the command below, sets up Global Memory and launches one pin process per process entry in the config]

./build/opt/zsim test.cfg

process0 = {
    command = "ls";
};

process1 = {
    command = "echo foo";
};

pin -t libzsim.so -- ls

pin -t libzsim.so -- echo foo

Global Memory

Pin processes communicate through a shared memory segment, managed as a single global heap

All simulator objects must be allocated in the global heap

[Diagram: the address spaces of Process 0 and Process 1 each contain program code, a local heap, the global heap, and libzsim.so]

Global heap and libzsim.so code in same memory locations across all processes → Can use normal pointers & virtual functions
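The point about normal pointers can be illustrated with a small Linux sketch. It is an assumption-laden illustration rather than a description of how zsim attaches its segment: it uses an anonymous shared `mmap` region and `fork`, simply because a forked child inherits the mapping at the same virtual address. The `Root`, `Counter`, and `sharedPointerDemo` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct Counter { uint64_t value; };
struct Root { Counter* counter; };  // raw pointer stored inside the shared region

// Returns the value the parent reads after the child has incremented the
// counter through the shared pointer, or -1 on failure.
long sharedPointerDemo() {
    const size_t size = 4096;
    void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return -1;

    // Both objects live in the shared region, so the pointer between them
    // is meaningful in every process that maps the region at this address.
    Root* root = new (base) Root{nullptr};
    root->counter = new (static_cast<char*>(base) + 64) Counter{41};

    const pid_t pid = fork();
    if (pid < 0) {
        munmap(base, size);
        return -1;
    }
    if (pid == 0) {
        root->counter->value += 1;  // child follows the same raw pointer
        _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);  // wait for the child's write to complete
    const long result = static_cast<long>(root->counter->value);
    munmap(base, size);
    return result;
}
```

If the region were mapped at different addresses in each process, the stored `Counter*` would be invalid in one of them and offsets would be needed instead; mapping at the same location is what makes plain pointers (and, with the code at the same location, virtual function tables) usable.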

Global Memory Allocation Idioms

Globally-allocated objects: Inherit from GlobAlloc

class SimObject : GlobAlloc { …

STL classes that allocate heap memory: Use g_stl variants

g_vector<uint64_t> cacheLines;

C-style memory allocation (discouraged):

gm_malloc, gm_calloc, gm_free, …

Declare globally-scoped variables under struct zinfo
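The first two idioms can be sketched as follows. This is a hedged illustration of the general technique (a base class that overrides `operator new`, plus an STL allocator), not zsim's implementation: the static bump arena stands in for the shared global heap, and `ArenaAlloc`, `ArenaStlAlloc`, and `a_vector` are hypothetical counterparts of the `GlobAlloc` and `g_vector` names on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Stand-in for the shared segment: a static bump arena that never frees.
namespace arena {
    alignas(64) static unsigned char buf[1 << 16];
    static size_t used = 0;

    inline void* alloc(size_t bytes) {
        // 16-byte alignment; types with stricter alignment are not handled.
        const size_t start = (used + 15) & ~static_cast<size_t>(15);
        if (start + bytes > sizeof(buf)) throw std::bad_alloc();
        used = start + bytes;
        return buf + start;
    }

    inline bool contains(const void* p) {
        const uintptr_t a = reinterpret_cast<uintptr_t>(p);
        const uintptr_t lo = reinterpret_cast<uintptr_t>(buf);
        return a >= lo && a < lo + sizeof(buf);
    }
}

// Base class: `new` on any derived class allocates from the arena.
class ArenaAlloc {
public:
    virtual ~ArenaAlloc() {}
    void* operator new(size_t sz) { return arena::alloc(sz); }
    void operator delete(void*) {}  // bump arena: nothing to release
};

// Minimal C++11 allocator so STL containers also allocate from the arena.
template <typename T>
struct ArenaStlAlloc {
    typedef T value_type;
    ArenaStlAlloc() {}
    template <typename U> ArenaStlAlloc(const ArenaStlAlloc<U>&) {}
    T* allocate(size_t n) {
        return static_cast<T*>(arena::alloc(n * sizeof(T)));
    }
    void deallocate(T*, size_t) {}
};
template <typename T, typename U>
bool operator==(const ArenaStlAlloc<T>&, const ArenaStlAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const ArenaStlAlloc<T>&, const ArenaStlAlloc<U>&) { return false; }

template <typename T>
using a_vector = std::vector<T, ArenaStlAlloc<T>>;

// Both the object and its vector storage end up in the arena.
class SimObject : public ArenaAlloc {
public:
    a_vector<uint64_t> cacheLines;
};
```

The reason the container needs its own allocator is that a plain `std::vector` member would place only its header inside the object; the elements would still come from the process-local heap and be invisible to other processes.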

Initialization Sequence

1. Harness

2. Config

3. Global Memory

4. Driver

5. User-level virtualization

6. System Initialization

7. Stats

8. Memory system timing models

9. Core timing models

Thanks For Your Attention!

Questions?

Backup Slides

Single-Thread Accuracy: Traces

Motivation

Timeline:

2008: Decide to study 1K-core systems for my Ph.D. thesis

2009: Try every sim out there, none fast enough

Got M5+GEMS to 512 threads [ASPLOS 2010], barely usable

2010: Start developing ZSim [ZCache, MICRO 2010]

2011: Make ZSim flexible, scalable, develop detailed models, other groups start using it

2012: Let’s publish a paper and release it…

ZSim design approach:

Make judicious tradeoffs to achieve detailed 1K core sims efficiently

Verify that those tradeoffs result in minor inaccuracies

Disclaimer: Not a silver bullet & tradeoffs may not be accurate for your target system; you should validate the tradeoffs!

Instruction-Driven Timing Models

Cycle/event-driven models: Simulate all stages cycle by cycle

Instruction-driven models: Simulate all stages at once for each ins/uop

Each stage has separate clocks

Ordered queues (FetchQ, UopQ, LoadQ, StoreQ, ROB) model feedback loops between stages

Issue window tracks cycles each FU is used to determine dispatch cycle

Even with OOO, accurate if:

1. IW prioritizes older uops (OK)

2. uop exec times not affected by newer uops (OK except mem uops, ignore for now)

[Diagram: pipeline stages Fetch, Decode, Issue, OOO Exec, Commit]

Instr code drives directly

DBT can accelerate better

Harder to develop

DBT-based Acceleration

With instruction-driven models, can push most overheads into instrumentation phase

Original code (1 basic block):

mov -0x38(%rbp),%rcx
lea -0x2040(%rbp),%rdx
add %rax,%rdx
mov %rdx,-0x2068(%rbp)
cmp $0x1fff,%rax
jne 40530a

Instrumented code:

Load(addr = -0x38(%rbp))
mov -0x38(%rbp),%rcx
lea -0x2040(%rbp),%rdx
add %rax,%rdx
mov %rdx,-0x2068(%rbp)
Store(addr = -0x2068(%rbp))
cmp $0x1fff,%rax
BasicBlock(DecodedBBL)
jne 10840530a

Basic block descriptor:

Type    Src1  Src2  Dst1  Dst2   Lat  PortMsk
Load    rbp   -     rcx   -      -    001000
Exec    rbp   -     rdx   -      3    110001
Exec    rax   rdx   rdx   rflgs  1    110001
StAddr  rbp   -     S0    -      1    000100
StData  rdx   S0    -     -      -    000010
Exec    rax   rip   rip   rflgs  1    000001

Predecoder/decoder delays

Instruction to uop fission

Instruction fusion

Uop dependencies, latency, ports

Parallelization Techniques

Parallel Discrete Event Simulation (PDES): accurate, but scales poorly

Divide components across threads

Execute events from each component maintaining illusion of full order

[Diagram: Core 0, Core 1, L3 Bank 0, L3 Bank 1, and Mem 0 divided between Thread 0 and Thread 1, annotated with the values 5, 10, 15, 15, 10, 5]

Pessimistic PDES: Keep skew between threads below inter-component latency (simple, but excessive sync)

Optimistic PDES: Speculate & roll back on ordering violations (less sync, but heavyweight)

Lax synchronization: Allow skews above inter-component latencies, tolerate ordering violations (scalable, but inaccurate)

Bound-Weave Parallelization

Divide simulation in small intervals (e.g., 1000 cycles)

Two parallel phases per interval: Bound and weave

Bound phase (find paths):

Simulate each core independently using instruction-driven models

Record paths of all accesses through the memory hierarchy

Uncore models assume no interference and use the minimum response time for all accesses, which puts a lower bound on all events

e.g., for a main memory access: uncontended caches, buses, row hit

Weave phase (find timings):

Perform parallel event-driven simulation of recorded events

Leverage prior knowledge of events to scale

Bound-Weave equivalent to PDES for path-preserving interference

Bound-Weave Example

Weave phase: Events spread across two threads

Crossing events (marked in the figure) let threads synchronize only when needed

e.g., thread 1 reaches cycle 110, “L3b0 @ 80” not done → checks thread 0’s progress, requeues itself later

Other synchronization-avoiding mechanisms in paper

[Figure: event graph for Domain 0 split across Thread 0 and Thread 1. Events: Core0 @ 30, L3b1 @ 50 (HIT), Core0 @ 60, L3b0 @ 80 (MISS), Core0 @ 90, Mem1 @ 110 (READ), Mem0 @ 130 (WBACK), L3b0 @ 230 (RESP), Core0 @ 250, L3b0 @ 250 (FREE MSHR), L3b3 @ 270 (HIT), Core0 @ 290]

Bound-Weave Example

Delays propagate across crossings:

[Figure: the same event graph for Domain 0 across Thread 0 and Thread 1, with a row miss adding 50 cycles; the updated cycles shown are 280, 290, 300, 320, and 350]

Events are lower-bounded → No ordering violations

Works with standard event-driven models!

e.g., 110 lines of code to integrate with DRAMSim2

