The Role Of ASIP In Programmable Platforms

Post on 15-Jan-2016


1

The Role Of ASIP In Programmable Platforms

2

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIP using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

3

A short story of a design paradigm shift

4

Once upon a time

How do I solve the encryption problem?

5

Data Encryption Standard (DES)

Initial step

(R, L) = Initial_permutation(Din64)

Iterate 16 times

Key generation

(C, D) = PC1(k)
n = rotate_amount (function of iteration count)
C = rotate_right(C, n)
D = rotate_right(D, n)
K = PC2(D, C)

Encryption

R[i+1] = L[i] XOR Permutation(S_Box(K XOR Expansion(R[i])))
L[i+1] = R[i]

Final step

Dout64 = Final_permutation(L, R)
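The Encryption step above has the shape of a Feistel network: each round computes R[i+1] = L[i] XOR f(R[i], K) and L[i+1] = R[i]. A minimal sketch of that structure follows, assuming a toy round function in place of the real Expansion, S_Box, and Permutation tables. The names toy_f, feistel, and NUM_ROUNDS are hypothetical, and this is not DES; it only illustrates why the same loop, run with the round keys in reverse order, undoes the encryption.

```c
#include <stdint.h>

#define NUM_ROUNDS 16

/* Toy stand-in for Permutation(S_Box(K XOR Expansion(R))); NOT the DES function. */
static uint32_t toy_f(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    x = (x << 5) | (x >> 27);      /* rotate left by 5 */
    return x * 2654435761u;        /* arbitrary mixing constant */
}

/* Run the Feistel rounds over a 64-bit block.
 * keys[] is consumed in order for encryption; decrypt != 0 consumes it in reverse. */
static uint64_t feistel(uint64_t block, const uint32_t keys[NUM_ROUNDS], int decrypt)
{
    uint32_t l = (uint32_t)(block >> 32);
    uint32_t r = (uint32_t)block;
    for (int i = 0; i < NUM_ROUNDS; i++) {
        uint32_t k = keys[decrypt ? NUM_ROUNDS - 1 - i : i];
        uint32_t next_r = l ^ toy_f(r, k);   /* R[i+1] = L[i] XOR f(R[i], K) */
        l = r;                               /* L[i+1] = R[i] */
        r = next_r;
    }
    /* Swap the halves on output so the same routine inverts itself. */
    return ((uint64_t)r << 32) | l;
}
```

Calling feistel(feistel(p, keys, 0), keys, 1) returns p for any block p and any key array, which is the property a hardware state machine and a software loop both rely on.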

6

The SW engineer very proudly presented

static unsigned permute(unsigned char *table, int n, unsigned hi, unsigned lo)
{
    int ib, ob;
    unsigned out = 0;
    /* For each output bit, look up its 1-based source bit in the table. */
    for (ob = 0; ob < n; ob++) {
        ib = table[ob] - 1;
        if (ib >= 32) {
            /* Source bit lives in the high word. */
            if (hi & (1u << (ib - 32))) out |= 1u << ob;
        } else {
            /* Source bit lives in the low word. */
            if (lo & (1u << ib)) out |= 1u << ob;
        }
    }
    return out;
}

This code is fast

7

The HW engineer laughed

Initial step

(R, L) = Initial_permutation(Din64)

Iterate 16 times

Key generation

(C, D) = PC1(k)
n = rotate_amount (function of iteration count)
C = rotate_right(C, n)
D = rotate_right(D, n)
K = PC2(D, C)

Encryption

R[i+1] = L[i] XOR Permutation(S_Box(K XOR Expansion(R[i])))
L[i+1] = R[i]

Final step

Dout64 = Final_permutation(L, R)

200 cycles? I can do it in 1!!!

8

The HW engineer presented

[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]

I’ll show you how fast it can be

9

The SW engineer laughed

[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]

I can change this in 1 minute, can you?

10

Realizing that they each had something the other wanted

If only I didn’t have to design the controller

If only I had just the instruction I need

11

They decided to work together

SETKEY ars, art

SETDATA ars, art

DES immediate

GETDATA ars, hilo

[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]

12

and improved the SW solution by 70x

Decryption

SETKEY(K_hi, K_lo);
for (;;) {
    … /* read encrypted data */
    SETDATA(D_hi, D_lo);
    DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write data */
}

Encryption

SETKEY(K_hi, K_lo);
for (;;) {
    … /* read data */
    SETDATA(D_hi, D_lo);
    DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write encrypted data */
}
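The sixteen DES(...) immediates in the two listings follow a fixed pattern. The sketch below records that pattern as a table, under the assumption (an inference, not stated on the slide) that the suffix 1 or 2 selects how far the key halves rotate in that round. The names encrypt_steps, decrypt_steps, and total_rotation are hypothetical. It shows that the decryption order is simply the encryption order reversed.

```c
#include <stddef.h>

#define DES_ROUNDS 16

/* Per-round suffix (1 or 2) copied from the encryption listing. */
static const int encrypt_steps[DES_ROUNDS] = {
    1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1
};

/* Fill out[] with the decryption order: the encryption order reversed. */
static void decrypt_steps(int out[DES_ROUNDS])
{
    for (size_t i = 0; i < DES_ROUNDS; i++)
        out[i] = encrypt_steps[DES_ROUNDS - 1 - i];
}

/* Sum of all per-round steps. A total of 28 means each 28-bit key half
 * is back at its starting alignment after the sixteenth round. */
static int total_rotation(const int steps[DES_ROUNDS])
{
    int sum = 0;
    for (size_t i = 0; i < DES_ROUNDS; i++)
        sum += steps[i];
    return sum;
}
```

Keeping the schedule in a table like this would let a loop issue the sixteen instructions, at the cost of the loop overhead that the unrolled listings avoid.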

13

When the boss asked how, the SW engineer said:

[Diagram: SW Solution = Registers, Datapath, Control, Memory (Program); scorecard row "SW" with columns "Correct" and "Efficient"]

14

and the HW engineer said:

[Diagram: HW Solution = FSM, Storage; scorecard row "HW" with columns "Correct" and "Efficient"]

15

Together, they had the best of both worlds

[Diagram: ASIP combines the SW Solution (Registers, Datapath, Control, Memory (Program)) with the HW Solution (FSM, Storage); scorecard rows "SW" and "HW" with columns "Correct" and "Efficient"]

16

The boss was very happy

[Chart axes: Optimality/integration (e.g. mW, $) and Flexibility/modularity (e.g. time-to-market). Points: special hardware, traditional processors + SW, ASIP. Gaps labeled ~10x on each axis.]

Use Software for Control

Use Application-specific datapath for computation

17

And they worked together happily ever after

18

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIPs using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

19

What Is “EEMBC”?

EDN Embedded Microprocessor Benchmark Consortium

Pronounced “Embassy”

Non-profit consortium, funded by over 40 members

Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more

Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications

Independent laboratory recreates and certifies all benchmark results - no tricks

20

EEMBC Benchmark Suites

Five different benchmark suites:

• Consumer

• Networking

• Telecom

• Automotive

• Office Automation

Each suite comprises a range (five to sixteen) of benchmarks representative of that product category

Example: Consumer

• Image compression, image filtering, color conversion

21

Two Metrics: Out-of-box vs. Optimized

Out-of-Box

• Benchmark C code, no manual code optimization, no assembly coding

Optimized, or “Full-Fury”

Conventional Processors

• Laboriously hand-tuned assembly code

• Rewriting C code to fit the architecture for VLIW or SIMD machines

• Changing Code to Fit the Processor

Xtensa

• Optimized processor using Xtensa processor generator and TIE Compiler

• Changing Processor to Fit the Application!!

22

Xtensa Optimization Process

Step #1: Configure processor via generator GUI

• Compile C-code, evaluate results

• Modify configuration as needed

• “Out of Box” results measurement taken here

Step #2: Profile Code, Add TIE

Step #3: Optimize Code to Utilize TIE instructions

• “Optimized” results measured on final hardware configuration

Same Path Used by Tensilica Customers!

23

Optimized Xtensa Configurations for EEMBC

OUT-OF-BOX: Configured Xtensa (using GUI click-box options), unmodified C-code

OPTIMIZED: Configured Xtensa plus TIE gates & instructions, C-code optimizations

Configuration | Out-of-box gates (base + config.) | Optimized additions | Optimized total gates | Clock
Consumer | 25000 + 37600 (62.6K) | 64.1K TIE | 127K | 200MHz
Network | 25000 + 25000 (50K) | 9.2K TIE | 59K | 200MHz
Telecom | 25000 + 37000 | VECTRA + 18K TIE | 180K | 200MHz

Illustrations conceptual, see EEMBC report for full details

24

EEMBC Consumer Benchmark

[Bar chart: Consumermark by processor, axis 0 to 200; labeled bars: Out-of-box Xtensa, Optimized Xtensa]

25

EEMBC Consumer Benchmark

[Bar chart: Consumermark / MHz by processor, axis 0.00 to 1.00; labeled bars: Out-of-box Xtensa, Optimized Xtensa]

26

EEMBC Networking Benchmark

[Bar chart: Netmark by processor, axis 0 to 14; labeled bars: Out-of-box Xtensa, Optimized Xtensa, AMD K6]

27

EEMBC Networking Benchmark

[Bar chart: Netmark / MHz by processor, axis 0.000 to 0.045; labeled bars: Out-of-box Xtensa, Optimized Xtensa, AMD K6]

28

EEMBC Telecom Benchmark

[Bar chart: Telemark by processor, axis 0 to 100, with one off-scale bar labeled 225.8; labeled bars: Out-of-box Xtensa, Optimized Xtensa, BOPS 2x2]

29

EEMBC Telecom Benchmark

[Bar chart: Telemark / MHz by processor, axis 0.000 to 0.500, with one off-scale bar labeled 1.67; labeled bars: Out-of-box Xtensa, Optimized Xtensa, BOPS 2x2]

30

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIPs using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

31

ASIP Generation Flow

Select processor options

Describe new instructions

Xtensa Processor Generator

In Minutes!

[Diagram blocks: ALU, Pipe, I/O, Timer, MMU, Register File, Cache]

• Tailored, synthesizable HDL uP core

• Optimizing C/C++ Compiler

• Cycle-accurate Simulator

• Assembler

• Linker

• C/C++/asm/inst Debugger

• RTOS

32

Tensilica Instruction Extension (TIE) Lang.

opcode PMAC op2=0 CUST0

state ACC1 40

state ACC2 40

iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}

semantic pmac_sem {PMAC} {

assign ACC1 = ACC1 + ars[15:0] * art[15:0];

assign ACC2 = ACC2 + ars[31:16] * art[31:16];

}

schedule pmac_schd {PMAC} {

use ars 1; use art 1;

use ACC1 2; use ACC2 2;

def ACC1 2; def ACC2 2;

}
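To make the semantic block concrete, here is a minimal C sketch of what one PMAC instruction computes: two 16x16 multiply-accumulates in parallel into two 40-bit states. It assumes the operand fields are treated as unsigned, which the slide does not state. The type pmac_state and the function pmac are hypothetical names for a software model, not the intrinsic that the TIE compiler would generate.

```c
#include <stdint.h>

#define ACC_MASK ((UINT64_C(1) << 40) - 1)   /* states are 40 bits wide */

/* Hypothetical software model of the two TIE states. */
typedef struct {
    uint64_t acc1;   /* models: state ACC1 40 */
    uint64_t acc2;   /* models: state ACC2 40 */
} pmac_state;

/* Model of one PMAC instruction.
 * ACC1 accumulates the product of the low halves of ars and art,
 * ACC2 accumulates the product of the high halves; both wrap at 40 bits. */
static void pmac(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);
    s->acc1 = (s->acc1 + lo) & ACC_MASK;
    s->acc2 = (s->acc2 + hi) & ACC_MASK;
}
```

For example, starting from zeroed states, pmac(&s, 0x00030002, 0x00050004) leaves 2 * 4 = 8 in acc1 and 3 * 5 = 15 in acc2. A model like this can serve as a reference when checking the instruction in simulation.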

33

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIP using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

34

Sample platforms

Network Processor Architecture

[Block diagrams of four platforms: Intel IXP1200, Vitesse PRISM IQ2000, Motorola C-Port CDP C-5, PMC-Sierra VoIP Gateway. The legible diagram shows a StrongArm Core (166 MHz) with Instruction Cache, 8KB Data Cache, and 1KB Mini-Data Cache; six Microengines; Hash Engine; IX Bus Interface (64-bit); Scratch Pad SRAM; SDRAM Controller (64-bit); SRAM Controller (32-bit); PCI Interface (32-bit).]

35

Observations

Heterogeneous processing elements

General purpose processors

Micro-controllers

Dedicated blocks

Heterogeneous communication links

Bandwidth

Latency

Hardware overhead

Communication overhead

36

Two Legs Of Platform Design

[Diagram: Platform Designer, supported by Processing Element Design and Communication Design; background: network processor block diagrams]

37

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIP using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

38

ASIP requirements

Match the performance of hard-wired logic

Offer variety of performance/cost points

Easy to design

Easy to use

[Chart axes: Optimality/integration (e.g. mW, $) and Flexibility/modularity (e.g. time-to-market). Points: special hardware, traditional processors + SW, ASIP. Gaps labeled ~10x on each axis. Callouts: Use Software for Control; Use Application-specific datapath for computation.]

[Background: network processor block diagrams]

39

Fixed Processors Cannot Replace ASIC

[Diagram: Control and Decoder driving a single datapath: Source, RF0, FU0, Result]

Temporal bottleneck: limited functionality

Spatial bottleneck: not enough bandwidth

40

Adding Customized Function Units to Break Temporal Bottleneck

[Diagram: Control, Decoder, Source routing, RF0, FU0 FU1 FU2 FU3, Result routing; overlay: FSM, Storage]

41

Example of Customized Functional Unit

opcode PMAC op2=0 CUST0

state ACC1 40

state ACC2 40

iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}

semantic pmac_sem {PMAC} {

assign ACC1 = ACC1 + ars[15:0] * art[15:0];

assign ACC2 = ACC2 + ars[31:16] * art[31:16];

}

schedule pmac_schd {PMAC} {

use ars 1; use art 1;

use ACC1 2; use ACC2 2;

def ACC1 2; def ACC2 2;

}

42

Effectiveness of Customized Functional Unit

Requirements:

Performance - similar

Cost - similar

Ease of design – similar

TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0];

Ease of use – much easier

C: PMAC(x, y);

43

Adding Processor States to Break Spatial Bottleneck

[Diagram: Control, Decoder, Source routing, RF0, states S0 and S1, FU0 FU1 FU2 FU3, Result routing; overlay: FSM, Storage]

44

Example of Processor States

opcode PMAC op2=0 CUST0

state ACC1 40

state ACC2 40

iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}

semantic pmac_sem {PMAC} {

assign ACC1 = ACC1 + ars[15:0] * art[15:0];

assign ACC2 = ACC2 + ars[31:16] * art[31:16];

}

schedule pmac_schd {PMAC} {

use ars 1; use art 1;

use ACC1 2; use ACC2 2;

def ACC1 2; def ACC2 2;

}

45

Effectiveness of Processor States

Requirements:

Performance – better

Especially when used with pipelined functional units

Cost – higher due to pipelined implementation

Ease of design – very simple

state ACC1 40

Ease of use – very easy

PMAC(x, y); /* implicitly using the states */

x = R_ACC1_Lo(); W_ACC1_Hi(y);

46

Sharing States Using Register Files

[Diagram: Control, Decoder, Source routing, RF0 RF1 RF2, states S0 and S1, FU0 FU1 FU2 FU3, Result routing; overlay: FSM, Storage]

47

Example of a Register File

regfile RF24 24 16 r

operand vs s {RF24[s]}

operand vt t {RF24[t]}

operand vr r {RF24[r]}

iclass rrr {average} {out vr, in vs, in vt}

reference average {

wire [8:0] t2 = vs[23:16] + vt[23:16];

wire [8:0] t1 = vs[15:8] + vt[15:8];

wire [8:0] t0 = vs[7:0] + vt[7:0];

assign vr = {t2[8:1], t1[8:1], t0[8:1]};

}

ctype rgb 24 32 RF24
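The reference block for average can be mirrored in plain C to see what the instruction does to each 8-bit lane. This is a minimal sketch, assuming the 24-bit value is packed into the low bits of a 32-bit word; the function name average_rgb and the channel labels are hypothetical, since the slide only defines bit ranges.

```c
#include <stdint.h>

/* Software model of the TIE `average` instruction on 24-bit packed values.
 * Each 8-bit lane is summed into 9 bits and the low bit is dropped,
 * mirroring the t[8:1] slices in the reference block. */
static uint32_t average_rgb(uint32_t vs, uint32_t vt)
{
    uint32_t t2 = ((vs >> 16) & 0xFFu) + ((vt >> 16) & 0xFFu);  /* bits 23:16 */
    uint32_t t1 = ((vs >> 8) & 0xFFu) + ((vt >> 8) & 0xFFu);    /* bits 15:8  */
    uint32_t t0 = (vs & 0xFFu) + (vt & 0xFFu);                  /* bits 7:0   */
    return ((t2 >> 1) << 16) | ((t1 >> 1) << 8) | (t0 >> 1);
}
```

Because each lane keeps its ninth bit before the shift, a sum such as 0xFF + 0x01 averages to 0x80 rather than wrapping to 0x00, which is the reason the TIE wires are declared nine bits wide.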

48

Crossing the HW/SW Boundary

Working with typed data:

rgb x, y, z; /* C code */

Letting C-Compiler allocate the registers

z = average(x, y); /* assembly: average v1, v4, v6 */

Letting C-Compiler spill the registers

Letting C-Compiler convert to/from other types

yuv a, b;

b = average (a, y);

Auto saved/restored on context switching

49

Effectiveness of Register File

Requirements:

Performance – better

Especially when used with pipelined functional units

Cost – higher due to pipelined implementation

Ease of design – very simple

regfile RF24 24 16 r

Ease of use – very easy

rgb x, y, z;
z = average(x, y);

50

Multi-cycle Instructions

[Diagram: Control, Decoder, Source routing, RF0 RF1 RF2, states S0 and S1, FU0 FU1 FU2 FU3, Result routing; overlay: FSM, Storage]

51

Example of a Multi-cycle Instruction

opcode PMAC op2=0 CUST0

state ACC1 40

state ACC2 40

iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}

semantic pmac_sem {PMAC} {

assign ACC1 = ACC1 + ars[15:0] * art[15:0];

assign ACC2 = ACC2 + ars[31:16] * art[31:16];

}

schedule pmac_schd {PMAC} {

use ars 1; use art 1;

use ACC1 2; use ACC2 2;

def ACC1 2; def ACC2 2;

}

[Diagram: operands ars, art; states ACC1, ACC2]

52

Effectiveness of Multi-cycle Instructions

Requirements:

Performance – usually better

difficult in hard-wired logic

Cost – higher due to bypass and interlock logic

Ease of design – very simple

use arr 3;

Ease of use – very easy and optimized by C Compiler

t = sat_mult(x, y);
z = sat_add(z, t);
t2 = sat_mult(x2, y2);

sat_mult s3, s1, s2
sat_mult s6, s5, s4
sat_add s7, s7, s3

53

Replacing the State Machine

[Diagram: Control driven by a program, Decoder, Source routing, RF0 RF1 RF2, states S0 and S1, FU0 FU1 FU2 FU3, Result routing; overlay: FSM, Storage]

54

Effectiveness of Control Programming

Requirements:

Performance – comparable

0-overhead loop, branch prediction, scheduling

Cost – comparable

Ease of design – very simple

reference BT {…, assign BranchTarget = …; …}

Ease of use – very easy

while, for, if then else, switch, goto, function call

55

Short Summary of ASIP Computing Capability

ASIP:

Performance – comparable

Cost – higher due to pipelined implementation

Ease of design – easy using Xtensa/TIE

Ease of use – very easy using optimizing compiler

[Background: network processor block diagrams]

56

Meet the Communication Requirements

[Diagram: Platform Designer, supported by Processing Element Design and Communication Design; background: network processor block diagrams]

57

Ways for ASIP to Communicate

[Diagram: ASIP containing Functional Units, Regfiles, State, Load/Store Units, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), and Interrupt, connected through the External Interface to MEM, Device, and another ASIP]

58

Communicate Via PIF and Shared Memory

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Simple

• Low cost

• Standard

Cons:

• Long latency

• Limited by PIF width

• Resource contention

• Polling
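The "Polling" entry in the cons list can be illustrated with a one-word mailbox kept in memory that both sides can reach. This is a minimal sketch under stated assumptions: the mailbox type and both functions are hypothetical, nothing here is a Tensilica API, and a real multi-processor system would also need whatever memory-ordering and cache-coherence measures its bus requires.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-word mailbox placed in shared memory. */
typedef struct {
    volatile uint32_t full;   /* 0 = empty, 1 = data valid */
    volatile uint32_t data;
} mailbox;

/* Producer side: returns false if the previous word has not been consumed yet. */
static bool mailbox_try_send(mailbox *m, uint32_t word)
{
    if (m->full)
        return false;
    m->data = word;
    m->full = 1;      /* publish only after the payload is written */
    return true;
}

/* Consumer side: one poll of the flag; returns false when nothing is waiting. */
static bool mailbox_try_recv(mailbox *m, uint32_t *word)
{
    if (!m->full)
        return false;
    *word = m->data;
    m->full = 0;      /* hand the slot back to the producer */
    return true;
}
```

Every call to mailbox_try_recv that returns false is a wasted trip across the shared interface, which is the latency and contention cost the slide attributes to this scheme.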

59

Communicate Via Interrupts

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Simple

• Low cost

• Standard

• Event driven

Cons:

• Very low bandwidth

60

Communicate Via Dual-ported Local Memory

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Fast

Cons:

• High cost

• Special programming

• Limited bandwidth

61

Communicate Via Local Memory Port

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Configurable

• Low latency

• Low cost

Cons:

• Non-standard

• Limited bandwidth

• Special programming

• External HW design

• Exposed to ASIP pipeline

62

Communicate Via Processor States

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Highly configurable

• Low latency

• Low cost

• High bandwidth

Cons:

• Non-standard

• Special programming

• One-way

• Restricted to level signal

• External HW design

63

Communicate Via Instructions

[ASIP interface diagram: Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), Interrupt, External Interface to MEM, Device, ASIP]

Pros:

• Highly configurable

• No latency

• Very low cost

• High bandwidth

Cons:

• Non-standard

• Special programming

• Restricted to edge signal

• External HW design

• Exposed to ASIP pipeline

64

Outline

Using ASIP – a new design paradigm

EEMBC – a case study

Designing ASIP using Xtensa and TIE

Addressing the needs of platforms

ASIP computing capabilities

ASIP communication capabilities

Challenges

65

ASIP Challenges

[Chart axes: Optimality/integration (e.g. mW, $) and Flexibility/modularity (e.g. time-to-market). Points: special hardware, traditional processors + SW, ASIP. Gaps labeled ~10x on each axis. Callouts: Use Software for Control; Use Application-specific datapath for computation.]

Balance computation and communication

• Performance, cost, power

Choose the right instructions

• Flexibility, product longevity, ease of programming

Let HW engineers design ASIP

• No FSMs!

Let SW engineers design ASIP

• Efficient functional units!

Support variety of communication

• Separation of platform designs and system designs

[Background: network processor block diagrams]