04 - DSP Architecture and Microarchitecture Andreas Ehliar September 11, 2015 Andreas Ehliar 04 - DSP Architecture and Microarchitecture

04 - DSP Architecture and Microarchitecture

Andreas Ehliar

September 11, 2015


Memory indirect addressing (continued from last lecture)

; Reality check: Data hazards!

; Assembler code v3:

repeat 256,endloop

load r0,DM1[DM0[ptr0++]]

store DM0[ptr1++],r0

endloop:

// 512 clock cycles

- Short discussion break: How to rewrite the code above to avoid the data hazard?
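As a reference point for the discussion, what the loop computes can be written down as a small Python model. This is only a sketch of the loop's effect, not of its timing; the function name `indirect_copy`, the list-based memories, and the sample data are assumptions made for illustration.

```python
def indirect_copy(dm0, dm1, ptr0, ptr1, count):
    """Model of: load r0,DM1[DM0[ptr0++]] ; store DM0[ptr1++],r0."""
    for _ in range(count):
        r0 = dm1[dm0[ptr0]]  # memory indirect load: DM0 supplies the index into DM1
        ptr0 += 1
        dm0[ptr1] = r0       # store the looked-up value back into DM0
        ptr1 += 1
    return dm0


# Hypothetical data: indices live in DM0[0..3], results go to DM0[4..7].
dm0 = [3, 0, 2, 1, 0, 0, 0, 0]
dm1 = [10, 11, 12, 13]
result = indirect_copy(dm0, dm1, ptr0=0, ptr1=4, count=4)
```

Every rewrite of the assembler loop should leave the memories in the same state as this model.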


Memory indirect addressing

; Assembler code v4:

repeat 64,endloop

load r0,DM1[DM0[ptr0++]] ; Unrolls the loop

load r1,DM1[DM0[ptr0++]] ; to avoid data

load r2,DM1[DM0[ptr0++]] ; hazards

load r3,DM1[DM0[ptr0++]]

store DM0[ptr1++],r0

store DM0[ptr1++],r1

store DM0[ptr1++],r2

store DM0[ptr1++],r3

endloop:
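Why unrolling helps can be sketched with a toy timing model. The model below assumes a single-issue, in-order pipeline in which a loaded register becomes readable `result_latency` cycles after the load issues; the latency of 3 cycles, the function names, and the program encoding are all hypothetical and not taken from the lecture's actual pipeline.

```python
def count_cycles(program, result_latency):
    """Count issue cycles for a list of (dest_register, source_registers)."""
    ready = {}   # register name -> first cycle in which its value can be read
    cycle = 0    # next free issue slot
    for dest, sources in program:
        issue = cycle
        for reg in sources:
            # Stall until every source operand is available.
            issue = max(issue, ready.get(reg, 0))
        if dest is not None:
            ready[dest] = issue + result_latency
        cycle = issue + 1
    return cycle


def load(reg):
    return (reg, [])       # writes reg, reads no registers


def store(reg):
    return (None, [reg])   # reads reg, writes no register


# v3: load immediately followed by a dependent store, 256 iterations.
v3 = [load("r0"), store("r0")] * 256
# v4: four loads, then four stores, 64 iterations.
v4 = ([load(f"r{i}") for i in range(4)] +
      [store(f"r{i}") for i in range(4)]) * 64

cycles_v3 = count_cycles(v3, result_latency=3)
cycles_v4 = count_cycles(v4, result_latency=3)
```

Under these assumptions the model reports 1024 cycles for v3 and 512 for v4: the unrolled version places three independent instructions between each load and the store that consumes its result, which is enough to hide the assumed latency.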


Alternative 2: Memory indirect addressing during store

; Assembler code v5:

repeat 256,endloop

load r0,DM0[ptr0++]

store DM0[ptr1++],DM1[r0] // DM0[ptr1++] = DM1[r0]

endloop:

// 512 clock cycles

- Justification: It is better to design the pipeline so that a store, rather than a load, is the instruction that takes a long time to complete.

- (A store will seldom generate data dependencies, whereas a load to a register will easily generate data dependencies, as seen in the first alternative.)


Alternative 2: Memory indirect addressing during store (unrolled)

; Assembler code v6:

repeat 64,endloop

load r0,DM0[ptr0++]

load r1,DM0[ptr0++]

load r2,DM0[ptr0++]

load r3,DM0[ptr0++]

// A store buffer can simplify some

// of the data hazards here.

// (might still need some unrolling)

store DM0[ptr1++],DM1[r0]

store DM0[ptr1++],DM1[r1]

store DM0[ptr1++],DM1[r2]

store DM0[ptr1++],DM1[r3]

endloop:

// 512 clock cycles


Alternative 3: Rewrite loop as follows

- Output stored in DM1 this time around, remaining data in DM0

; Assembler code v7:

load r0,DM0[ptr0++]

repeat 255,endloop

load r0,DM0[r0]

store DM1[ptr1++],r0

load r0,DM0[ptr0++]

endloop:

load r0,DM0[r0]

store DM1[ptr1++], r0

// 768 clock cycles for loop, no improvement (yet)


Alternative 3: Merge instructions

- No real data dependency between the marked instructions, merge these into one!

; Assembler code v7:

load r0,DM0[ptr0++]

repeat 255,endloop

load r0,DM0[r0]

store DM1[ptr1++],r0 // These two instructions

load r0,DM0[ptr0++] // can be merged without

// additional HW cost!

endloop:

load r0,DM0[r0]

store DM1[ptr1++], r0

// 768 clock cycles for loop, no improvement (yet)


Alternative 3: Merge instructions

- A form of software pipelining has been used here
  - (The inner loop operates partly on iteration i, and partly on iteration i+1)

; Assembler code v8:

load r0,DM0[ptr0++] // Prologue

repeat 255,endloop

load r0,DM0[r0]

loadstore r0,DM0[ptr0++], DM1[ptr1++],r0 // Merged store and load

// loadstore does the following:

// DM1[ptr1++] = r0; r0 = DM0[ptr0++]

endloop:

load r0,DM0[r0] // Epilogue

store DM1[ptr1++], r0

// 768 clock cycles for loop, no improvement (yet)
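The prologue / loop body / epilogue structure can be checked against a straightforward loop with a small Python model. This is a sketch of the data flow only; the function names and the list-based memories are assumptions for illustration, and the model says nothing about cycle counts.

```python
def lookup_plain(dm0, ptr0, count):
    """Straightforward version: out[i] = DM0[DM0[ptr0 + i]]."""
    return [dm0[dm0[ptr0 + i]] for i in range(count)]


def lookup_pipelined(dm0, ptr0, count):
    """Software-pipelined version mirroring the assembler structure."""
    dm1 = []
    r0 = dm0[ptr0]             # prologue: load r0,DM0[ptr0++]
    ptr0 += 1
    for _ in range(count - 1):
        r0 = dm0[r0]           # load r0,DM0[r0]          (iteration i)
        dm1.append(r0)         # loadstore, store half    (iteration i)
        r0 = dm0[ptr0]         # loadstore, load half     (iteration i+1)
        ptr0 += 1
    r0 = dm0[r0]               # epilogue: finish the last iteration
    dm1.append(r0)
    return dm1


# Hypothetical data: DM0[0..3] hold indices, DM0[4..7] are unused here.
dm0 = [2, 3, 0, 1, 7, 5, 6, 4]
plain = lookup_plain(dm0, 0, 4)
pipelined = lookup_pipelined(dm0, 0, 4)
```

Both functions produce the same output list; the pipelined one simply starts the index load for iteration i+1 before iteration i has left the loop body, which is what lets the store and the next load share one instruction.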


Alternative 3: Rewrite loop as follows

- Advantage of alternative 3:
  - The pipeline depth of loadstore is the same as the pipeline depth of load and store
  - The instruction may also be useful in other situations, such as when copying values from one memory to another


Conclusions - Instruction set design

- C hides memory addressing costs and loop costs

- At assembly language level, memory addressing must be explicitly executed.

- We can conclude that most memory access and addressing can be pipelined and executed in parallel with the arithmetic operations.


Conclusions - Instruction set design

- One essential ASIP design technique will be grouping the arithmetic and memory operations into one specific instruction if they are used together all the time

- Remember this during lab 4!


Conclusions - Instruction set design

- The way to hide the cost of memory addressing and data access is to design smart addressing models by finding and using regularities of addressing and memory access.

- Addressing regularities:
  - postincremental addressing
  - modulo addressing
  - postincremental with variable step size
  - and bit-reversed addressing.
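The address sequences these regularities produce can be sketched in Python as follows. The function names and parameters are hypothetical; a real address generator computes one address per cycle in hardware rather than building lists.

```python
def post_increment(start, step, count):
    """Post-increment addressing; step != 1 gives a variable step size."""
    return [start + i * step for i in range(count)]


def modulo_addresses(start, step, count, base, size):
    """Post-increment that wraps inside the circular buffer [base, base + size)."""
    addrs = []
    addr = start
    for _ in range(count):
        addrs.append(addr)
        addr = base + (addr - base + step) % size
    return addrs


def bit_reversed(index, bits):
    """Reverse the lowest `bits` bits of index (the access order used by FFTs)."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev


linear = post_increment(4, 2, 3)
circular = modulo_addresses(start=6, step=1, count=4, base=4, size=4)
fft_order = [bit_reversed(i, 3) for i in range(8)]
```

With the sample arguments, `circular` wraps from address 7 back to the buffer base 4, and `fft_order` is the usual 8-point bit-reversed permutation.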


Conclusions - Instruction set design

- An assembly language instruction set must be more efficient than Junior

- Accelerations shall be implemented at arithmetic and algorithmic levels.

- Addressing and data accesses can be executed in parallel with arithmetic computing.

- Program flow control, loop or conditional execution, can also be accelerated


Conclusions - Instruction set design

- A DSP processor will seldom have a pure RISC-like instruction set

- To accelerate important DSP kernels, CISC-like extensions are acceptable (especially if they don’t add any real hardware cost)

- (Also, note that both RISC and CISC are losers in the processor wars today; real processors are typically hybrids)


What if you can’t create an ASIP?

- Trade program memory for performance
  - To avoid control complexity (loop unrolling)
  - To avoid addressing complexity

- Other clever programming tricks
  - Conditional execution
  - (Self-modifying code)
  - Rewrite algorithm
  - etc.


History of DSP architectures

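- Von Neumann architecture vs Harvard architecture

[Figure: Von Neumann architecture (control unit, arithmetic unit, and in-out with a single memory) next to Harvard architecture (control unit, arithmetic unit, and in-out with separate program memory and data memory). Source: Liu2008, Figure 3.3]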


History of DSP architectures

[Figure: three configurations (a), (b), and (c), each built from DP, DM, CP, and PM blocks; (b) and (c) add a MUX. Source: Liu2008, Figure 3.4]

- a) Normal Harvard architecture

- b) Words from PM can be sent to the datapath

- c) Use a dual port data memory


History of DSP architectures

- Efficient FIR filter with only two memories (PM and DM)

Alt 1: Carries coefficient        Alt 2: Override instruction fetch,
as immediate                      fetch data from PM

mac A, DM[AR0++%], -1             conv 8,DM[AR0++%]
mac A, DM[AR0++%], -743           .data -1
mac A, DM[AR0++%], 0              .data -743
mac A, DM[AR0++%], 8977           .data 0
mac A, DM[AR0++%], 16297          .data 8977
mac A, DM[AR0++%], 8977           .data 16297
mac A, DM[AR0++%], 0              .data 8977
mac A, DM[AR0++%], -743           .data 0
mac A, DM[AR0++%], -1             .data -743
rnd A                             .data -1
                                  rnd A

More orthogonal                   No need for wide PM
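Both alternatives perform the same multiply-accumulate sequence, which can be sketched in Python. The coefficients are the ones from the listing; the function name `fir_mac` is hypothetical, the circular buffer is assumed to span the whole `dm` list, and the final `rnd A` rounding step is left out because its exact behaviour is not specified here.

```python
COEFFS = [-1, -743, 0, 8977, 16297, 8977, 0, -743, -1]  # from the listing


def fir_mac(dm, ar0, coeffs=COEFFS):
    """Accumulate one FIR output sample from a circular data buffer."""
    acc = 0
    for c in coeffs:
        acc += dm[ar0] * c          # mac A, DM[AR0], c
        ar0 = (ar0 + 1) % len(dm)   # AR0++% : post-increment with modulo wrap
    return acc, ar0


# Hypothetical input: a unit impulse stored at address 4 of a 9-word buffer.
dm = [0] * 9
dm[4] = 1
acc_from_0, next_ar0 = fir_mac(dm, 0)
acc_from_5, _ = fir_mac(dm, 5)
```

Starting at address 0 the impulse lines up with the centre tap; starting at address 5 the pointer wraps around and the impulse meets the last tap instead.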


History of DSP architectures

[Figure: configurations (d) and (e); labels include DP, DM, CP, PM/DM, PM, MUX, and Cache. Source: Liu2008, Figure 3.5]

- d) Use a small (loop?) cache to allow for one memory to be shared between PM and DM

- e) Typical three memory configuration.


DSP Processor vs DSP Core

[Figure: block diagram of a DSP processor and the DSP core inside it. Labels: bus and arbitration, DSP processor, DSP core, interrupt, timer, MMU, other peripherals, main memories, chip interface, RF, control path, ADG, DM, DM, PM, DMA, ALU, MAC, accelerator, DSP subsystem. Source: Liu2008, Figure 3.2]


Architecture selection

- Selecting a suitable ASIP architecture for the desired application domain

- The decision includes how many function modules are required, how to interconnect these modules (relations between modules), and how to connect the ASIP to the embedded system

- Closely related to instruction set selection if an efficient implementation is desired


Architecture selection

- DSP processor developers have an advantage over general purpose CPU developers (e.g. Intel, AMD, ARM):
  - Known applications
  - Known scheduling requirements
  - Vector based algorithms and processing


Architecture selection

- Challenges of DSP parallelization
  - Hard real time and high performance
  - Low memory and low power costs
  - Data and control dependencies

- Remember Amdahl’s law: Your speedup is ultimately limited by the amount of sequential parts you have in your application.
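Amdahl's law can be made concrete with a short calculation. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel units; the function name and the example numbers are chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when only parallel_fraction of the work is spread over n_units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Hypothetical application in which 90 % of the work can be parallelized.
speedup_10 = amdahl_speedup(0.9, 10)
speedup_huge = amdahl_speedup(0.9, 1_000_000)
```

With 10 units the speedup is roughly 5.3, and even with a million units it stays below 10, because the sequential 10 % of the work does not shrink.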


Ways to speed up a processor - Discussion break

- Programmer visible:
  - VLIW
  - Multiple memories
  - Accelerators
  - SIMD
  - Multicore

- Programmer invisible:
  - Cache
  - Pipelining
  - Superscalar (in- or out-of-order)
  - Data forwarding
  - Branch prediction


Advanced architectures: Dual MAC

[Figure: dual MAC datapath with two multipliers (*) and accumulation units (+) connected to the register files; labels: A, B, AAR, BAR, MO1, MO2. Legend: accumulation arithmetic unit, multiplier, multiplexer, register or a pipeline stage. Source: Liu2008, Figure 3.22]


Advanced architectures: Dual MAC

- Allows you to speed up operations such as FIR filters.

- For example, it can allow you to calculate $y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$ and $y[n+1] = \sum_{k=0}^{N-1} h[k]\,x[n+1-k]$ at the same time.

- Note: Will roughly halve the number of memory accesses. (More on this in a later lecture.)


Advanced architectures: SIMD

[Figure: SIMD architecture. The program memory carries only one instruction; a single I-decoding block is shown with four address / execution unit pairs. Source: Liu2008, Figure 3.24 (modified)]

- Advantage: Low power and area

- Disadvantage: Difficult to use efficiently, very difficult target for a compiler.


Advanced architectures: VLIW

- Why: DSP tasks are relatively predictable
  - A parallel datapath gives higher performance

- How: Very Long Instruction Word
  - Multiple instructions issued per cycle
  - Compiler manages data dependencies

- Challenges
  - Memory issues and on-chip connections
  - Register (fan-out ports) costs
  - Hard compiler target


Advanced architectures: Superscalar

- Analyze instruction flow
  - Run several instructions in parallel
  - (And possibly out of order)


VLIW vs Superscalar

- VLIW:
  - Relatively easy to design and verify the hardware
  - Not code efficient due to instruction size and NOP instructions
  - Hard to keep binary compatibility
  - Hard to create an efficient compiler

- Superscalar:
  - Hard to design and verify the hardware
  - Good code efficiency, relatively small instructions, no NOPs needed
  - Easier to manage compatibility between processor versions


Multicore architectures

- Heterogeneous or homogeneous
  - Well-known heterogeneous architecture: Cell
  - Well-known homogeneous architecture: Modern x86

- Usually harder to program than single-threaded architectures

- Heterogeneous architectures are well suited for ASIPs
  - Standard MCU for main part of application
  - Specialized DSP for performance-critical parts


Summary: Advanced Architectures

- Dual MAC: Easy, not a huge improvement

- SIMD DSP: Very good for regular tasks

- VLIW: Good parallelism but hard for compiler

- Superscalar: Relatively easy for a compiler, but highest silicon cost and verification cost

- Multicore: Whenever a single core is not powerful enough


Summary: Advanced Architectures

[Liu2008, Figure 4.5 (modified)]
