Designing Memory Systems for Tiled Architectures
Anshuman Gupta
September 18, 2009

Multi-core Processors are abundant

Multi-cores increase the compute resources on the chip without increasing hardware complexity

This keeps power consumption within budget

Examples: AMD Phenom (4-core), Sun Niagara 2 (8-core), Tile64 (64-core), Intel Polaris (80-core)

Multi-Core Processors are underutilized

…
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
…

[Figure: the single-threaded code above scheduled on multi-cores, comparing serial execution with parallel execution]

Software is responsible for keeping the cores busy with parallel instruction streams

Applications are hard to parallelize

Tiled Architectures increase Utilization by enabling Parallelization

Tiled architectures are a class of multi-core architectures

Provide mechanisms to facilitate automatic parallelization of single-threaded programs

Fast On-Chip Networks (OCNs) to connect cores

OCN communication latencies are on the order of 2 + (distance between tiles) cycles*

*Latency for the RAW inter-ALU OCN
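
The latency rule quoted on this slide, 2 + (distance between tiles) cycles, can be sketched as a small function. This is only an illustration: it assumes the tiles sit on a 2D mesh and that "distance" means the Manhattan hop count. The function name and the (row, col) coordinate convention are hypothetical, not taken from RAW.

```python
def ocn_latency(src, dst, base_cycles=2):
    """Estimated inter-ALU latency in cycles between two tiles.

    src, dst: (row, col) tile coordinates on an assumed 2D mesh.
    base_cycles: the fixed part of the slide's '2 + distance' rule.
    """
    # Assumption: distance is the Manhattan hop count between the tiles.
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return base_cycles + hops


if __name__ == "__main__":
    print(ocn_latency((0, 0), (0, 1)))  # neighbouring tiles
    print(ocn_latency((0, 0), (3, 3)))  # opposite corners of a 4x4 mesh
```

Under these assumptions, neighbouring tiles communicate in 3 cycles and the far corners of a 4x4 mesh in 8.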

Automatic Parallelization on Tiled Architectures

…
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
…

[Figure: schedules of the single-threaded code above on multi-cores and on a tiled architecture]

In tiled architectures, dependent instructions can be placed on multiple cores with low penalty due to cheap inter-ALU communication.
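
The effect of communication latency on this six-instruction example can be sketched with a small scheduling model. Everything numeric here is an assumption made for illustration: the instruction latencies (1 cycle for add and subtract, 4 for multiply), the two-core placement, and the 20-cycle latency used to stand in for a conventional multi-core are not from the slides. Only the dependence graph comes from the code above.

```python
# Dependences of the slide's six instructions: instruction -> instructions it reads.
DEPS = {0: [], 1: [0], 2: [1], 3: [0], 4: [3], 5: [4, 2]}
# Assumed execution latencies in cycles (add/sub = 1, multiply = 4).
COST = {0: 1, 1: 4, 2: 1, 3: 4, 4: 4, 5: 1}


def finish_time(placement, comm_latency):
    """Cycle at which instruction 5 completes.

    placement: instruction -> core id. Each core executes its instructions
    in program order; a value that crosses cores arrives comm_latency
    cycles after it is produced.
    """
    core_free = {}  # core -> cycle at which it can start its next instruction
    done = {}       # instruction -> completion cycle
    for instr in sorted(DEPS):  # program order is a valid topological order here
        core = placement[instr]
        ready = core_free.get(core, 0)
        for src in DEPS[instr]:
            delay = 0 if placement[src] == core else comm_latency
            ready = max(ready, done[src] + delay)
        done[instr] = ready + COST[instr]
        core_free[core] = done[instr]
    return done[5]


serial = {i: 0 for i in DEPS}                   # everything on one core
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}    # two dependence chains on two cores

if __name__ == "__main__":
    print("serial:", finish_time(serial, 0))
    print("split, 3-cycle network:", finish_time(split, 3))
    print("split, 20-cycle network:", finish_time(split, 20))
```

With these assumed numbers the split placement beats serial execution when communication costs 3 cycles (13 versus 15) and loses badly at 20 cycles (30), which is the point the slide makes about cheap inter-ALU communication.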

Why aren’t tiled architectures used everywhere?

Automatic parallelization is still very difficult due to slow resolution of remote memory dependencies

Tiled-architecture memory systems have a special requirement: fast memory dependence resolution

What if we add some memory instructions?

…
(*b) = a + 4 … (0)
c = (*b) * 8 … (1)
(*d) = c - 2 … (2)
e = (*h) * 4 … (3)
f = e * 3 … (4)
g = f + (*i) … (5)
…

[Figure: schedules of the single-threaded code with memory instructions on multi-cores and on a tiled architecture]

Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion

Memory Dependence

Static Analysis   Type    a address   b address
No                No      0x1000      0x2000
Must              True    0x1000      0x1000
May               True    0x1000      0x1000
May               False   0x1000      0x2000

[Static placement column: diagrams not recoverable]

Example code:

*a = …
… = *b

foo (int * a, int * b) {
  *a = …
  … = *b
}
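
The table above can be read as a two-step decision: static analysis reports no, must or may alias, and a may-alias pair turns out to be a true or a false dependence only once the runtime addresses are known. A minimal sketch of that reading follows; the function name and string labels are hypothetical.

```python
def classify(static_result, addr_a, addr_b):
    """Classify the dependence between '*a = ...' and '... = *b'.

    static_result: 'no', 'must' or 'may', as reported by static alias analysis.
    addr_a, addr_b: the runtime addresses held by a and b.
    Returns 'no', 'true' or 'false'.
    """
    if static_result == "no":
        return "no"
    if static_result == "must":
        return "true"
    if static_result == "may":
        # Only the runtime addresses can settle a may-alias pair.
        return "true" if addr_a == addr_b else "false"
    raise ValueError(f"unknown static result: {static_result!r}")


if __name__ == "__main__":
    for row in [("no", 0x1000, 0x2000), ("must", 0x1000, 0x1000),
                ("may", 0x1000, 0x1000), ("may", 0x1000, 0x2000)]:
        print(row[0], hex(row[1]), hex(row[2]), "->", classify(*row))
```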

Memory Coherence

Coherent space provides an abstraction of a single data buffer with a single read/write port

Hierarchical implementation of shared memory
◦ Requires coherence protocols to provide the same abstraction

[Figure: Core 0 performs Write A = 1 and Core 1 performs Read A, shown once against a single shared buffer and once against a shared memory with per-core caches; A changes from 0 to 1; labels: Dependence, Signal]

Improving Memory Dependence Resolution

Memory Dependence Resolution Performance depends on:
◦ True Dependence Performance
◦ False Dependence Performance
◦ Coherence System Performance

True Dependence Resolution

Delay 1: determined by the signaling stage
◦ Earlier is better
Delay 2: determined by the signaling delay inside the ordering mechanism
◦ Faster is better
Delay 3: determined by the stalling stage
◦ Later is better
Delays 1 and 3 are determined by the resolution model

[Figure: timeline between a source and a destination instruction showing the signal stage, the signal, the stall stage, and delays 1, 2 and 3]

False Dependence Resolution

False dependencies occur when
◦ Static analysis cannot disambiguate
◦ Memory dependence encoding is not partial
For false dependencies, the dependent instruction should ideally not wait for any signal
◦ Runtime Disambiguation
  Address comparison is done in hardware to declare the dependent instruction free
◦ Speculation
  The dependent instruction is issued speculatively, assuming the dependence is false
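
The two options on this slide differ in when the address comparison happens. The sketch below contrasts them in software. It assumes a load is compared against the addresses of all earlier stores and uses None for a store address that is not yet known. The function names and return labels are hypothetical, not taken from any of the architectures discussed.

```python
def disambiguate(load_addr, earlier_store_addrs):
    """Runtime disambiguation: decide before issuing the load.

    earlier_store_addrs: addresses of earlier stores; None marks an
    address that has not been computed yet.
    """
    if any(addr is None for addr in earlier_store_addrs):
        return "wait"             # cannot yet prove the load independent
    if load_addr in earlier_store_addrs:
        return "true-dependence"  # must wait for the store itself
    return "free"                 # false dependence: issue immediately


def speculate(load_addr, resolved_store_addrs):
    """Speculation: the load has already issued, assuming the dependence is false.

    Called once the earlier store addresses are known.
    """
    return "squash" if load_addr in resolved_store_addrs else "commit"


if __name__ == "__main__":
    print(disambiguate(0x2000, [0x1000, None]))
    print(disambiguate(0x2000, [0x1000, 0x3000]))
    print(speculate(0x2000, [0x1000, 0x2000]))
```

Disambiguation never has to undo work but stalls while any earlier store address is unknown. Speculation never stalls but pays for a squash when the guess is wrong.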

Fast Data Access

Local L1 caches can help decrease average latencies
◦ No network delays
Cache Coherence (CC)
◦ Dynamic access: data location not known statically
◦ Expensive dynamic access in the absence of CC

What features to look out for?

L1 Local
CC
Ordering Point
Resolution
Encoding
Spec
Runtime Disambiguation

Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
◦ RAW
◦ WaveScalar
◦ EDGE
Analysis of Existing Work
Future Work and Conclusion

RAW

A highly static tiled architecture
Array of simple in-order MIPS cores
Scalar Operand Network (SON) for fast inter-ALU communication
Shared address space, local caches and shared DRAMs
No cache coherence mechanism
Software cache management through flush and invalidation

*Taylor et al., IEEE Micro 2002

Artifacts of Software Cache Management

Difficult to keep track of the most up-to-date version of a memory address
All memory accesses can be categorized as
◦ Static Access
  The location of the cache line is known statically
◦ Dynamic Access
  A runtime lookup is required to determine the location of the cache line
  These are really expensive (36 vs 7)

Static-Dynamic Access Ordering

Two static accesses
◦ Synchronization over SON
Dependence between a static and a dynamic access
◦ Synchronizing over SON between
  Static access
  Static requestor or receiver for dynamic access
Execute side resolution
No speculative runahead
False dependencies are as expensive as true dependencies

Summary

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No

Dynamic Access Ordering

Execute side resolution very expensive
Resolution done late in the memory system
Static ordering point
◦ Turnstile tile
◦ One per equivalence class
◦ Equivalence class: set of all memory operations that can access the same memory address
Requests sent on static SON to turnstile
◦ Receives in memory order
In-order dynamic network channels

Summary

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No
RAWdd       Yes       No   Turnstile       Secondary Mem-side  Partial      No    Yes

WaveScalar

A fully dynamic Tiled Architecture with Memory Ordering
Clusters arranged in 2D array connected by mesh dynamic network
Each tile has a store buffer and banked data cache
Secondary memory system made up of L2 caches around the tiles
Cache coherence

*Swanson et al., MICRO 2003

Memory Ordering

WaveScalar preserves memory ordering by using a sequence number for each memory operation in a wave
◦ Unique
◦ Indicates age
Each memory operation also stores its predecessor's and successor's sequence numbers
◦ Use "?" if not known at compile time
There cannot be a memory operation whose possible predecessor has its successor marked as "?" and vice versa
◦ MEM-NOPs
A request is allowed to go ahead if its predecessor has issued
In hardware this ordering is managed in the store buffers
◦ A single store buffer is responsible for handling all memory requests for a dynamic wave

[Figure: Load A, Store B and Load C annotated first with sequence numbers <0>, <1>, <2>, then with <predecessor, sequence number, successor> tuples: Load A <.,0,?>, Store B <0,1,2>, Load C <?,2,.>; after inserting a MEM-NOP: Nop <0,2,3>, Store B <0,1,3>, Load C <?,3,.>; the store buffer receives Load C, Store B and Load A and resolves the "?" fields]
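
The issue rule on this slide can be sketched as a small store-buffer model. This is an interpretation, not WaveScalar's actual hardware: it assumes an operation may issue when either its own predecessor field names the last issued operation, or the last issued operation's successor field names it, which is why both ends of a link cannot be "?". The class and field names are hypothetical. The tuples in the example are the ones in the slide's figure after MEM-NOP insertion.

```python
from dataclasses import dataclass
from typing import Optional

START = -1  # stands for '.' in the predecessor slot (first operation of the wave)
END = -1    # stands for '.' in the successor slot (last operation of the wave)


@dataclass(frozen=True)
class MemOp:
    name: str
    pred: Optional[int]  # None stands for '?'
    seq: int
    succ: Optional[int]  # None stands for '?'


class WaveStoreBuffer:
    """Holds memory operations of one wave and issues them in memory order."""

    def __init__(self):
        self.pending = {}   # seq -> MemOp that has arrived but not issued
        self.last = None    # most recently issued operation
        self.issued = []    # names in issue order

    def _linked(self, op):
        # The first operation of the wave has '.' as predecessor.
        if self.last is None:
            return op.pred == START
        # Otherwise one end of the link must name the other.
        return op.pred == self.last.seq or self.last.succ == op.seq

    def arrive(self, op):
        """Record an arriving operation and issue everything that is now ordered."""
        self.pending[op.seq] = op
        progress = True
        while progress:
            progress = False
            for cand in list(self.pending.values()):
                if self._linked(cand):
                    del self.pending[cand.seq]
                    self.issued.append(cand.name)
                    self.last = cand
                    progress = True
                    break


if __name__ == "__main__":
    buf = WaveStoreBuffer()
    # Operations arrive out of order, as in the slide's store buffer.
    buf.arrive(MemOp("Load C", None, 3, END))
    buf.arrive(MemOp("Store B", 0, 1, 3))
    buf.arrive(MemOp("Load A", START, 0, None))
    print(buf.issued)
```

Load C arrives first but cannot issue until Store B, whose successor field names it, has issued. On the other control path the MEM-NOP <0,2,3> plays the same role for Load C.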

Removing False Load Dependencies

Sequence number based ordering is highly restrictive
◦ Loads are stalled on previous loads
Each memory operation has a ripple number: the sequence number of the last store before it
A memory operation can issue if the operation with its ripple number has issued
◦ Loads can issue OoO
Stores still have total ordering
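
A minimal sketch of ripple numbers, assuming operations are numbered in program order starting at 0. The rule for loads follows the slide. The rule used for stores (a store waits for every earlier operation) is one assumed reading of "total ordering", and the function names are hypothetical.

```python
def ripple_numbers(ops):
    """ops: list of 'L' (load) or 'S' (store) in program order.

    Returns, for each operation, the sequence number of the last store
    before it, or None if no store precedes it.
    """
    ripples, last_store = [], None
    for seq, kind in enumerate(ops):
        ripples.append(last_store)
        if kind == "S":
            last_store = seq
    return ripples


def can_issue(seq, ops, ripples, issued):
    """issued: set of sequence numbers that have already issued."""
    if ops[seq] == "L":
        # A load only waits for the last store before it, not for earlier loads.
        return ripples[seq] is None or ripples[seq] in issued
    # Assumption: a store waits for every earlier operation.
    return all(earlier in issued for earlier in range(seq))


if __name__ == "__main__":
    ops = ["L", "L", "S", "L", "L"]
    ripples = ripple_numbers(ops)
    print(ripples)
    print(can_issue(1, ops, ripples, issued=set()))   # load 1 need not wait for load 0
    print(can_issue(4, ops, ripples, issued={2}))     # load 4 only needs store 2
```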

Summary

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No
RAWdd       Yes       No   Turnstile       Secondary Mem-side  Partial      No    Yes
WaveScalar  No        Yes  Store Buffer    Primary Mem-side    Store Total  No    No

EDGE

A partially dynamic Tiled Architecture with block execution
Array of tiles connected over fast OCNs
Primary memory system is distributed over tiles
Each such tile has address interleaved
◦ Data cache
◦ Load Store Queue
Distributed Secondary Memory System
Cache Coherence

*S. Sethumadhavan et al., ICCD '06

Memory Ordering

Unique 5 bit tag called LSID
◦ Completion of block execution
◦ Ordering of memory operations
DTs get a list of all LSIDs in a block during fetch stage
Memory operations reach a DT
◦ LSID sent to all the DTs
Request issued if all requests with earlier LSIDs completed
◦ Memory side dependence resolution
When all memory operations have completed, block is committed

[Figure: a control tile, execution tiles and interleaved data tiles (DTs) executing the block Ld A <0>, Ld B <1>, St C <2>, Ld C <3>; each DT holds the LSID list <0,1,2,3> and records the LSIDs received so far]
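
The LSID rule above can be sketched as follows. The model is deliberately simplified: it tracks a single LSID list as each data tile would see it, and it treats a request as completing as soon as it issues. The class and method names are hypothetical.

```python
class LsidOrdering:
    """Tracks which memory operations of one block may issue, by LSID."""

    def __init__(self, block_lsids):
        self.block = sorted(block_lsids)  # LSID list received at fetch
        self.arrived = set()              # LSIDs seen so far
        self.completed = []               # LSIDs issued and completed, in order

    def arrive(self, lsid):
        """Record an arriving LSID and issue every request that is now in order."""
        self.arrived.add(lsid)
        while (len(self.completed) < len(self.block)
               and self.block[len(self.completed)] in self.arrived):
            # All earlier LSIDs have completed, so this one may go ahead.
            self.completed.append(self.block[len(self.completed)])
        return list(self.completed)

    def can_commit(self):
        """The block commits once every memory operation has completed."""
        return len(self.completed) == len(self.block)


if __name__ == "__main__":
    dt = LsidOrdering([0, 1, 2, 3])   # Ld A <0>, Ld B <1>, St C <2>, Ld C <3>
    for lsid in (1, 0, 3, 2):         # an example out-of-order arrival
        print(lsid, "->", dt.arrive(lsid), dt.can_commit())
```

Ld C <3> arrives before St C <2> in this example and has to wait for it. That is the total order the next slide relaxes with dependence speculation.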

Dependence Speculation

EDGE memory ordering is very restrictive
◦ Total memory order
Loads execute speculatively
Earlier store to the same address causes squash
◦ Predictor used to reduce squashes

Summary

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No
RAWdd       Yes       No   Turnstile       Secondary Mem-side  Partial      No    Yes
WaveScalar  No        Yes  Store Buffer    Primary Mem-side    Store Total  No    No
EDGE        No        Yes  LSQ             Primary Mem-side    Total        Yes   No

True Dependence Optimization

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No
RAWdd       Yes       No   Turnstile       Secondary Mem-side  Partial      No    Yes
WaveScalar  No        Yes  Store Buffer    Primary Mem-side    Store Total  No    No
EDGE        No        Yes  LSQ             Primary Mem-side    Total        Yes   No

Memory Side Resolution allows more Overlap

[Figure: three panels (RAWsd, EDGE/WaveScalar, RAWdd), each showing Requestor A, Requestor B and a Home Node, with a Tag Buffer and a Turnstile as ordering points; below them, timelines for RAWsd, E/WS and RAWdd showing Request A, Response A, Coherence delay A, Request B, Response B and Coherence delay B]

*The lengths of the bars do not indicate delays

Network Stalls should be avoided

Execute Side Resolution - e
◦ RAWsd
Memory Side Resolution - m
◦ EDGE, WaveScalar
RAW dynamic ordering - mt
◦ Network delay to memory system is overlapped

[Figure: pipeline stages F, Na, E, N$, Tp, Nm, Ts, M, Nc, Nr, W, annotated for e, m and mt]

False Dependence Optimization

Arch        L1 Local  CC   Ordering Point  Resolution          Encoding     Spec  Runtime Disambiguation
RAWsd       Yes       No   OCN             Exec-side           Partial      No    No
RAWdd       Yes       No   Turnstile       Secondary Mem-side  Partial      No    Yes
WaveScalar  No        Yes  Store Buffer    Primary Mem-side    Store Total  No    No
EDGE        No        Yes  LSQ             Primary Mem-side    Total        Yes   No

Partial ordering reduces false deps
Speculation on false deps reduces stalls
Disambiguation should be done early

What's a Good Tiled Architecture Memory System?

Local caches for fast L1 hit
Cache Coherence support for ease in programmability and no dynamic access delays
Fast True Dependence Resolution
◦ Performance comparable to same core placement of operations
◦ Late stalls
◦ Early signaling
Reduction of false dependencies through partial memory operation ordering
Fast False Dependence Resolution
◦ Performance comparable to same core placement of operations
◦ Early runtime memory disambiguation
◦ Speculative memory requests

Conclusion

Auto-parallelization on tiled architectures can benefit from fast memory dependence resolution
◦ Multi-core memory systems were not designed with this goal
Performance of both true and false dependence resolution should be comparable to that of dependent memory instructions placed on the same core
ISA should support partial memory operation ordering to avoid artificial false dependencies
Memory system should have local caches and cache coherence for performance and programmability

Thank You!
Questions?

Dynamic Accesses are expensive

X looks up a global address list and sends a dynamic request to owner Y

Y is interrupted, data is fetched and dynamic request sent to Z

Z is interrupted, data is stored in local cache

One table lookup, two interrupt handlers and two dynamic requests make dynamic loads expensive

Lifted portions represent processor occupancy, while unlifted portions represent network latency

