Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
Transcript
Page 1: Improving Memory System Performance for Soft Vector Processors

Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose

WoSPS – Oct 26, 2008

Page 2: Improving Memory System Performance for Soft Vector Processors

Soft Processors in FPGA Systems

[Diagram: a soft processor, programmed with C + compiler, sits alongside custom logic, designed with HDL + CAD. The trade-off: the soft processor is easier to use; custom logic is faster, smaller, and uses less power.]

Data-level parallelism → soft vector processors

Configurable – how can we make use of this?

Page 3: Improving Memory System Performance for Soft Vector Processors

Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

Each vector instruction holds many units of independent operations. With 1 vector lane, the vadd executes b[0]+=a[0], b[1]+=a[1], …, b[15]+=a[15] one element at a time.

Page 4: Improving Memory System Performance for Soft Vector Processors

Vector Processing Primer

(Same C and vectorized code as the previous slide.)

Each vector instruction holds many units of independent operations. With 16 vector lanes, all sixteen element operations b[0]+=a[0] … b[15]+=a[15] execute in parallel: 16x speedup.
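The lane-based execution model can be mimicked in plain C: a vector instruction's element operations are processed LANES at a time, so a 16-element vadd takes 16 steps with 1 lane but a single step with 16 lanes. A minimal sketch of the idea (illustrative, not the hardware):

```c
/* Model of a vector add b[i] += a[i] executed on `lanes` lanes.
   Returns the number of execution steps (each step is one group
   of element operations the lanes would perform in parallel). */
static unsigned vadd_steps(int *b, const int *a, unsigned vl, unsigned lanes) {
    unsigned steps = 0;
    for (unsigned i = 0; i < vl; i += lanes) {        /* one step per group */
        for (unsigned l = 0; l < lanes && i + l < vl; l++)
            b[i + l] += a[i + l];                     /* done in parallel in HW */
        steps++;
    }
    return steps;
}
```

With vl = 16, one lane takes 16 steps and 16 lanes take 1 step, which is the 16x speedup shown on the slide.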

Page 5: Improving Memory System Performance for Soft Vector Processors

Sub-Linear Scalability

[Chart: cycle performance relative to 1 lane, for 1, 2, 4, 8, and 16 lanes, across autcor, conven, ip_checksum, imgblend, and GMEAN. The 16-lane bars read 4.7, 8.0, 6.0, 5.2, and 3.1 – well short of 16x.]

Vector lanes not being fully utilized

Page 6: Improving Memory System Performance for Soft Vector Processors

Where Are The Cycles Spent?

[Chart: fraction of total cycles (0 to ~0.9) split into memory unit stall cycles and cache miss cycles, for autcor, conven, ip_checksum, imgblend, and AVERAGE, at 16 lanes. Miss cycles average 67%.]

2/3 of cycles are spent waiting on the memory unit, often from cache misses

Page 7: Improving Memory System Performance for Soft Vector Processors

Our Goals

1. Improve the memory system:
   - Better cache design
   - Hardware prefetching

2. Evaluate improvements for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR-133MHz)

Page 8: Improving Memory System Performance for Soft Vector Processors

Current Infrastructure

[Toolflow diagram:
SOFTWARE side – EEMBC C benchmarks are compiled with GCC and linked (ld) with vectorized assembly subroutines assembled by GNU as with vector support, producing an ELF binary; the binary runs on the MINT instruction set simulator (scalar μP + VPU) for verification.
HARDWARE side – the Verilog design (scalar and vector pipelines) is simulated in Modelsim (RTL simulator) to obtain cycle counts, cross-verified against the instruction set simulator; Altera Quartus II v8.0 synthesis reports area and frequency.]

Page 9: Improving Memory System Performance for Soft Vector Processors

VESPA Architecture Design

[Pipeline diagram: a 3-stage scalar pipeline (Icache, decode, RF, ALU, MUX, WB), a 3-stage vector control pipeline (VCRF/VCWB and VSRF/VSWB), and a 6-stage vector pipeline: decode with replicate and hazard-check logic, then per-lane VRRF read, ALU with multiply & saturate, right shift, saturate, MUX, and VRWB, plus a shared memory unit. The scalar and vector pipelines share the Dcache.]

Supports integer and fixed-point operations, and predication. 32-bit datapaths.

Page 10: Improving Memory System Performance for Soft Vector Processors

Memory System Design

[Diagram: VESPA with 16 lanes (lane 0 … lane 15). The scalar core and vector coprocessor connect through a vector memory crossbar to a 4KB Dcache with 16-byte lines, backed by DDR with a 9-cycle access latency.]

vld.w (load 16 contiguous 32-bit words)

Page 11: Improving Memory System Performance for Soft Vector Processors

Memory System Design

[Same diagram, with the Dcache enlarged 4x in both dimensions: 16KB with 64-byte lines.]

vld.w (load 16 contiguous 32-bit words)

Reduced cache accesses + some prefetching
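The benefit of the wider cache line follows from simple arithmetic. A sketch of the access count for a unit-stride vector load (illustrative helper, not VESPA's RTL):

```c
/* Number of cache lines touched by a unit-stride vector load of
   `count` elements of `elem_size` bytes each, assuming the base
   address is line-aligned (a misaligned base may touch one more). */
static unsigned lines_touched(unsigned count, unsigned elem_size,
                              unsigned line_size) {
    unsigned bytes = count * elem_size;           /* total bytes moved */
    return (bytes + line_size - 1) / line_size;   /* ceiling division  */
}
```

For vld.w here (16 words x 4 bytes = 64 bytes): with 16B lines the load touches 4 lines, so up to 4 cache accesses and 4 misses; with 64B lines it touches just 1, which is the reduction in cache accesses the slide refers to. The extra words brought in by the wide line also act as a limited form of prefetching for the next sequential load.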

Page 12: Improving Memory System Performance for Soft Vector Processors

Improving Cache Design

Vary the cache depth & cache line size:
- Using a parameterized design
- Cache line size: 16, 32, 64, 128 bytes
- Cache depth: 4, 8, 16, 32, 64 KB

Measure performance on 9 benchmarks:
- 6 from EEMBC, all executed in hardware

Measure area cost:
- Equate silicon area of all resources used
- Report in units of equivalent LEs

Page 13: Improving Memory System Performance for Soft Vector Processors

Cache Design Space – Performance (Wall Clock Time)

[Chart: speedup vs. the 4KB, 16B-line baseline for depths 4KB-64KB and line sizes 16B, 32B, 64B, 128B. Speedups rise toward roughly 1.9x at the largest configurations (sample points: 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, 1.93). Clock frequencies across configurations range from 122MHz to 129MHz.]

The best cache design almost doubles the performance of the original VESPA
More pipelining/retiming could reduce the clock frequency penalty
Cache line size matters more than cache depth (lots of streaming)

Page 14: Improving Memory System Performance for Soft Vector Processors

Cache Design Space – Area

[Chart: area vs. the 4KB, 16B-line baseline for the same depth/line-size sweep, with block RAM types M4K and MRAM marked. A 64B (512-bit) line is built from M4K blocks, each providing a 16-bit slice of 4096 bits; 32 such M4Ks => 16KB of storage.]

System area almost doubled in the worst case
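The "32 => 16KB" annotation can be checked directly, assuming each M4K supplies a 16-bit-wide slice and holds 4096 bits (a sketch, not a synthesis result):

```c
/* M4K block RAMs needed to supply a full cache line in one access:
   one block per 16-bit slice of the line width. */
static unsigned m4ks_for_line(unsigned line_bytes, unsigned port_bits) {
    return line_bytes * 8 / port_bits;
}

/* Total storage those blocks provide, in KB (each M4K = 4096 bits). */
static unsigned storage_kb(unsigned num_m4ks) {
    return num_m4ks * 4096 / 8 / 1024;
}
```

A 64B (512-bit) line needs 512/16 = 32 M4Ks, which together hold 32 x 4096 bits = 16KB. Any cache depth below 16KB would waste block RAM the line width already forces you to instantiate, which motivates choosing depth to fill the block RAMs.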

Page 15: Improving Memory System Performance for Soft Vector Processors

Cache Design Space – Area

[Same area chart as the previous slide, annotated with two design guidelines.]

a) Choose depth to fill the block RAMs needed for the line size
b) Don't use MRAMs: big, few, and overkill

Page 16: Improving Memory System Performance for Soft Vector Processors

Hardware Prefetching Example

[Diagram: with no prefetching, two consecutive vld.w instructions each miss in the Dcache and each pay the 9-cycle DDR penalty. With prefetching of 3 blocks, the first vld.w misses (one 9-cycle penalty) but brings in 3 extra lines, so the next vld.w hits.]

Page 17: Improving Memory System Performance for Soft Vector Processors

Hardware Data Prefetching

Advantages:
- Little area overhead
- Parallelizes memory fetching with computation
- Uses full memory bandwidth

Disadvantages:
- Cache pollution

We use sequential prefetching, triggered on: a) any miss, or b) a sequential vector instruction miss.

We measure performance/area using a 64B-line, 16KB dcache.
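Trigger policy (a), prefetch K further lines on any miss, can be modeled with a few lines of C. The direct-mapped cache model and names below are illustrative assumptions, not VESPA's RTL:

```c
#include <stdbool.h>

#define LINE_SIZE 64u    /* bytes per line, matching the 64B config */
#define NUM_LINES 256u   /* 16KB direct-mapped, illustrative        */

static unsigned long tags[NUM_LINES];
static bool valid[NUM_LINES];

static void fill_line(unsigned long addr) {
    unsigned long line = addr / LINE_SIZE;
    unsigned idx = line % NUM_LINES;
    tags[idx] = line;
    valid[idx] = true;
}

/* Access one address; on a miss, fetch the line plus K sequential
   lines after it. Returns true on a hit. */
static bool access_with_prefetch(unsigned long addr, unsigned k) {
    unsigned long line = addr / LINE_SIZE;
    unsigned idx = line % NUM_LINES;
    if (valid[idx] && tags[idx] == line)
        return true;                        /* hit */
    for (unsigned i = 0; i <= k; i++)       /* miss: fetch + K prefetches */
        fill_line(addr + i * LINE_SIZE);
    return false;
}
```

With K = 3, a miss at address 0 also brings in the lines at 64, 128, and 192, so later sequential accesses hit; this is the behavior shown in the prefetching example, along with the cache-pollution risk when the prefetched lines evict useful data.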

Page 18: Improving Memory System Performance for Soft Vector Processors

Prefetching K Blocks – Any Miss

[Chart: speedup vs. no prefetching for K = 0, 1, 3, 7, 15, 31, 63 cache lines prefetched, across autcor, conven, viterb, fbital, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN. Peak average speedup is 28%; the best benchmark reaches 2.2x; several benchmarks are not receptive.]

Only half the benchmarks are significantly sped up: max of 2.2x, average of 28%

Page 19: Improving Memory System Performance for Soft Vector Processors

Prefetching Area Cost: Writeback Buffer

A prefetch can evict dirty lines, so there are two options:
- Deny the prefetch, or
- Buffer all dirty lines in a writeback (WB) buffer

Area cost is small:
- 1.6% of system area
- Mostly block RAMs, little logic

No clock frequency impact

[Diagram: prefetching 3 blocks; a vld.w miss pays the 9-cycle DDR penalty while evicted dirty lines drain through the WB buffer.]

Page 20: Improving Memory System Performance for Soft Vector Processors

Any Miss vs Sequential Vector Miss

[Chart: speedup vs. number of cache lines prefetched (0, 1, 3, 7, 15, 31, 63) for the two trigger policies, "any cache miss" and "sequential vector only". The two curves lie on top of each other.]

Collinear – nearly all misses in our benchmarks are sequential vector misses

Page 21: Improving Memory System Performance for Soft Vector Processors

Vector Length Prefetching

Previously: constant # of cache lines prefetched
Now: use a multiple of the vector length, only for sequential vector memory instructions

Guarantees <= 1 miss per vector memory instruction

[Diagram: e.g., a vector load of 32 elements (vld.w over elements 0-31): fetch + prefetch 28*k]
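One way to realize this policy is to size the prefetch on a miss to cover k vector lengths of data, rounded up to whole lines; covering at least the current load is what bounds it to one miss per vector memory instruction. The helper below is an illustrative model, not VESPA's exact logic:

```c
/* Cache lines to request when a sequential vector load misses:
   enough to cover k vector lengths (VL elements of elem_size bytes),
   rounded up to whole lines. Covering at least the current load
   guarantees <= 1 miss for that vector memory instruction. */
static unsigned vl_prefetch_lines(unsigned vl, unsigned elem_size,
                                  unsigned line_size, unsigned k) {
    unsigned bytes = k * vl * elem_size;
    return (bytes + line_size - 1) / line_size;   /* ceiling division */
}
```

For example, with VL = 32, 4-byte elements, and 64B lines: 1*VL prefetching requests 2 lines per miss, and 8*VL requests 16. Because the amount scales with the vector length actually in use, no per-application tuning of K is needed.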

Page 22: Improving Memory System Performance for Soft Vector Processors

Vector Length Prefetching – Performance

[Chart: speedup for prefetch amounts None, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL, across autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN. 1*VL already averages 21% speedup with no cache pollution; the peak average is 29%; the best benchmark reaches 2.2x; some benchmarks are not receptive.]

1*VL prefetching provides good speedup without tuning; 8*VL is best

Page 23: Improving Memory System Performance for Soft Vector Processors

Overall Memory System Performance

[Chart: fraction of total cycles (0 to 0.8) spent on memory unit stall cycles and miss cycles, for three configurations: 16-byte line (4KB) at 67%, 64-byte line (16KB) at 48%, and 64-byte line + prefetch at 31%, with miss cycles down to 4% in the last configuration.]

The wider line + prefetching reduces memory unit stall cycles significantly, and eliminates all but 4% of miss cycles

Page 24: Improving Memory System Performance for Soft Vector Processors

Improved Scalability

[Chart: cycle performance relative to 1 lane for 1, 2, 4, 8, and 16 lanes, across autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN.]

Previous: 3-8x range, average of 5x for 16 lanes
Now: 6-13x range, average of 10x for 16 lanes

Page 25: Improving Memory System Performance for Soft Vector Processors

Summary

Explored cache design:
- ~2x performance for ~2x system area (area growth due largely to the memory crossbar)
- Widened cache line size to 64B and depth to 16KB

Enhanced VESPA with hardware data prefetching:
- Up to 2.2x performance, average of 28% for K=15
- Vector length prefetcher gains 21% on average for 1*VL: good for mixed workloads, no tuning, no cache pollution
- Peak at 8*VL, average of 29% speedup

Overall, improved the VESPA memory system & scalability:
- Decreased miss cycles to 4%, and memory unit stall cycles to 31%

Page 26: Improving Memory System Performance for Soft Vector Processors

Vector Memory Unit

[Diagram: for each lane i (L = # lanes - 1), a MUX selects the address base + stride*i or index_i; requests flow through a memory request queue to the Dcache (memory lanes = 4 in this example). Read data (rddata0 … rddataL) returns through a read crossbar; write data (wrdata0 … wrdataL) passes through a write crossbar into a memory write queue.]

