Processor Architectures and Program Mapping

Programmable Digital Signal Processors (5kk10, TU/e)

Henk Corporaal, Jef van Meerbergen, Bart Mesman

Topic 2: Programmable Digital Signal Processors

- instruction level parallelism (ILP)
- hardware support for loop control
- attention for high-level data types, e.g. arrays and delay lines (vs. scalars for CPUs)
- architectures are difficult to compare: e.g. for an FFT, the variant (DIT, DIF, radix 2/4) matters, and loop unrolling, scaling, shuffling and initialisation can be included or forgotten
- benchmarking: Berkeley Design Technology Inc. (BDTi); compare to the SPECint benchmarks for CPUs


Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word); examples: C6 and TM


Goal = 1 cycle per iteration

Modifications
- position of the ACR (1 or 2)
- adder/subtractor
- extra pipelines
- asymmetric inputs
- multi-precision
- extra inputs/outputs


DSP data types
- not every signal requires 32 bits
- 2 types of DSP: floating point (FP) and integer
- advantage of FP: most specs are in FP (conversion to integer is time consuming since the behaviour may change)
- disadvantage of FP: cost (area, speed, power)
- wanted: type of the output of an operation = type of the input (because both are stored in RAM)
  - no problem for FP, but it is for integer
  - integer multiplication doubles the number of bits: n * n => 2n
- What about fractional numbers?


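DSP data types: fixed point
- integer and fractional numbers are special cases of fixed point
- type fix<p,q> (ART Designer & SystemC): p bits in total, q of them behind the binary point
- 2's complement: the msb has a negative weight; the scale factor is 2^-q
- example: 1101.101 = -19/8 = -2.375 (7 bits, scale factor 1/8)
- if q = 0 the type is an integer; if q = p-1 it is fractional
- the finite number of bits gives a quantization error
- the same ALU handles all fix types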
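The weights of a fixed-point word can be checked with a small sketch. The function name `fix_to_double` and its argument order are illustrative assumptions, not part of ART Designer or SystemC; the example value is the slide's 1101.101 = -19/8.

```c
#include <stdint.h>

/* Interpret the low p bits of `raw` as a two's-complement fixed-point value
 * with q bits behind the binary point (scale factor 2^-q).
 * Hypothetical helper; assumes 1 <= p <= 31 and 0 <= q < p. */
double fix_to_double(uint32_t raw, int p, int q)
{
    uint32_t mask = (1u << p) - 1u;
    int32_t value = (int32_t)(raw & mask);

    if (raw & (1u << (p - 1)))          /* sign bit set: msb has negative weight */
        value -= (int32_t)(1u << p);

    return (double)value / (double)(1u << q);
}
```

With p = 7 and q = 3 the bit pattern 1101101 (0x6D) evaluates to -2.375; with q = 0 the same bits read as the integer -19.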


DSP data types: multiplication
- example: -19/8 * 97/16 = -1843/128; the product needs more bits than either operand
- Int * Int:  s x x x  *  s y y y  =  s s z z z z z z  (two sign bits)
- if FRCT = 1 the product is shifted left by one position:  s z z z z z z 0
- some processors (C54) have special instructions for fractional numbers (and for a symmetric number domain)
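A minimal sketch of the n * n => 2n multiplication for 16-bit operands follows. The names `mul16` and `high_word` are hypothetical, and the `frct` argument only mimics the effect of an FRCT mode bit; it is not a model of any particular processor.

```c
#include <stdint.h>

/* 16 x 16 -> 32 bit multiply.
 * frct == 0: plain integer product (two sign bits in the 32-bit result).
 * frct != 0: the product is shifted left by one, so that the upper 16 bits
 *            of a fractional x fractional product are again a fractional
 *            number with one sign bit.
 * Note: (-1.0) * (-1.0) does not fit after the shift, which is one reason
 * for a symmetric number domain. */
int32_t mul16(int16_t x, int16_t y, int frct)
{
    int32_t p = (int32_t)x * (int32_t)y;
    if (frct)
        p = (int32_t)((uint32_t)p << 1);
    return p;
}

/* Continue with the msb half only, so the result has the input type again.
 * Assumes an arithmetic right shift for negative values. */
int16_t high_word(int32_t p)
{
    return (int16_t)(p >> 16);
}
```

For example, 0.5 * 0.5 with 15 fractional bits is `high_word(mul16(16384, 16384, 1))`, which gives 8192, i.e. 0.25.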


DSP data types: accumulation
- continuing (after multiplication) with the msb part only represents the limit of the accuracy of the result (it cannot be larger than the accuracy of the inputs)
- more efficient solution: continue with msb + lsb; sum-of-product operations then generate accumulative noise at the 32nd instead of the 16th bit
- addition can still overflow => overflow bits
- double-precision accumulator + extra overflow bits + shift, round, truncate unit


Quantization (Q): rounding, value truncation, magnitude truncation

                        x = 111.11 (-0.25)       x = 111.01 (-0.75)
rounding                + 000.1 = 000  (0)       + 000.1 = 111  (-1)
value truncation        = 111  (-1)              = 111  (-1)
magnitude truncation    + 001.  = 000  (0)       + 001.  = 000  (0)
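The three quantization characteristics can be sketched on plain integers that carry `frac` fractional bits. The function names are hypothetical. One deliberate difference from the slide: for magnitude truncation the sketch adds one lsb less than a full 001., so that exact negative integers are left unchanged.

```c
/* Quantize a two's-complement value with `frac` fractional bits to an integer.
 * Assumes frac >= 1 and an arithmetic right shift for negative values. */

/* rounding: add half an output lsb, then drop the fractional bits */
int quant_round(int v, int frac)
{
    return (v + (1 << (frac - 1))) >> frac;
}

/* value truncation: just drop the fractional bits (rounds toward -infinity) */
int quant_value_trunc(int v, int frac)
{
    return v >> frac;
}

/* magnitude truncation: rounds toward zero */
int quant_magnitude_trunc(int v, int frac)
{
    if (v < 0)
        v += (1 << frac) - 1;   /* bias negative values toward zero */
    return v >> frac;
}
```

With two fractional bits, -0.25 is the integer -1 and -0.75 is the integer -3, which reproduces the 0 / -1, -1 / -1 and 0 / 0 results of the examples above.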


Overflow characteristics: saturation, zeroing, sawtooth
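As a rough illustration of the three overflow characteristics, the sketch below reduces a 32-bit intermediate result to 16 bits. The function names and the 16-bit output width are assumptions chosen for the example.

```c
#include <stdint.h>

/* saturation: clip to the largest or smallest representable value */
int16_t ovf_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* zeroing: an out-of-range result is replaced by zero */
int16_t ovf_zero(int32_t v)
{
    return (v > INT16_MAX || v < INT16_MIN) ? 0 : (int16_t)v;
}

/* sawtooth: keep the low 16 bits, i.e. plain two's-complement wrap-around */
int16_t ovf_sawtooth(int32_t v)
{
    uint16_t low = (uint16_t)((uint32_t)v & 0xFFFFu);
    return (int16_t)(low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low);
}
```

Saturation costs extra logic but keeps the error small, whereas the sawtooth behaviour comes for free with two's-complement arithmetic.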


- Von Neumann (sequential): one program/data memory + EXU
- Harvard: program memory + data memory + EXU
- Modified Harvard: program memory + data memory 1 + data memory 2 + EXU
- Σ c(i) * x(i): goal = 1 cycle per iteration


[Figure: RAM_A and RAM_B feeding the MAC]


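[Figure: FIR filter with delay line x1..x5 (Z^-1 elements), coefficients c1..c5 and output y = Σ ci * xi]
- two nested loops: the time loop and the filter loop over i
- How to update the delay line?
- 1 cycle/tap?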


Solution 1: block move in memory
2 possibilities:
- complete move after every output sample is calculated: read and write the data twice
- move after the read of every datum separately: write the data twice; needs a special instruction (TMS320)


Memory location   output sample 1   output sample 2   output sample 3
1                 x1                x2                x3
2                 x2                x3                x4
3                 x3                x4                x5
4                 x4                x5                x6
5                 x5                x6                x7

Solution 2: indirect addressing
- use a pointer to mark the beginning of the delay line
- update the pointer instead of moving the data
- needs a register to store the pointer
- problem: the delay line trashes the whole memory
- solution: modulo addressing


Memory location   output sample 1   output sample 2   output sample 3   output sample 4   output sample 5
1                 x1                -                 -                 -                 x9
2                 x2                x2                -                 -                 -
3                 x3                x3                x3                -                 -
4                 x4                x4                x4                x4                -
5                 x5                x5                x5                x5                x5
6                 -                 x6                x6                x6                x6
7                 -                 -                 x7                x7                x7
8                 -                 -                 -                 x8                x8
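The two delay-line strategies (block move versus a pointer with modulo addressing) can be compared with the sketch below. It uses a delay line of 5 samples in a modulo range of 8 locations, as in the tables; the names `fir_blockmove`, `fir_modulo` and `ring_t` are hypothetical.

```c
#include <stddef.h>

#define TAPS  5         /* delay-line length, as in the tables */
#define RANGE 8         /* modulo range, at least TAPS locations */

/* Solution 1: move every datum one place, then accumulate. */
long fir_blockmove(long line[TAPS], const long c[TAPS], long x_new)
{
    long acc = 0;
    for (size_t i = 0; i + 1 < TAPS; i++)
        line[i] = line[i + 1];
    line[TAPS - 1] = x_new;
    for (size_t i = 0; i < TAPS; i++)
        acc += c[i] * line[i];
    return acc;
}

/* Solution 2: leave the data in place and move a pointer modulo RANGE. */
typedef struct {
    long   mem[RANGE];
    size_t start;       /* location of the oldest sample */
} ring_t;

long fir_modulo(ring_t *r, const long c[TAPS], long x_new)
{
    long acc = 0;
    r->mem[(r->start + TAPS) % RANGE] = x_new;  /* write behind the newest sample */
    r->start = (r->start + 1) % RANGE;          /* only the pointer is updated */
    for (size_t i = 0; i < TAPS; i++)
        acc += c[i] * r->mem[(r->start + i) % RANGE];
    return acc;
}
```

Both functions return the same output for the same input sequence; the second one touches one memory location per new sample instead of the whole delay line.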

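IIR filter
[Figure: IIR filter with input x, coefficients c1..c4 and a delay line of outputs y1..y5; memory map with y1..y5 in consecutive locations, addressed through a pointer]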


2 filters in the time loop:
    for j = 1..jtaps: Σ d(j) * y(j)
    for i = 1..itaps: Σ c(i) * x(i)
- separate delay lines: y1..y5 (pntr 2, modulo range 2) and x1..x5 (pntr 1, modulo range 1)
- merged: y1..y5 and x1..x5 in one modulo range with a single pointer (pntr 1)
- 2 memory segments => 1 segment


Mapping strategy
[Figure: delay line y1, y2, x1/y3, x2, x3 in one modulo range, addressed by pntr 1]
- define positions in RAM; constraint: variables that form a delay line are in consecutive places
- find a schedule, for example: c1 => c2 => c3 => c4 => c5
- define ACU instructions


[Figure: filter with delay line x1..x8, coefficients c1..c8 and two outputs, ye and yo]


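ACU architecture and instruction set
[Figure: ACU with registers A and S, a modulo unit and an output to the RAM]

Instruction   Output to RAM   reg A   reg S
Read_A        A               A       S
Read_S        S               A       S
incA          A+1             A+1     S
decA          A-1             A-1     S
Step          A+S             A+S     S
Inc_step      S+1             A       S+1

Modulo can be implemented as a mask operation if the size is 2^k: for the range 16 (10 000) .. 23 (10 111) the masked upper bits are held.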
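The mask-based modulo can be sketched as follows. The function name `acu_step` is hypothetical, and the sketch assumes that the modulo range has a power-of-two size and starts at a multiple of that size (as the range 16..23 does).

```c
#include <stdint.h>

/* Step address `a` by `s` inside a modulo range of `size` words.
 * Assumes size = 2^k and a range that starts at a multiple of size,
 * so that the upper address bits can simply be held. */
uint16_t acu_step(uint16_t a, int16_t s, uint16_t size)
{
    uint16_t mask = (uint16_t)(size - 1u);          /* the low k bits may change */
    uint16_t next = (uint16_t)(a + (uint16_t)s);    /* ordinary add, wraps at 16 bits */
    return (uint16_t)((a & (uint16_t)~mask) | (next & mask));
}
```

With a range of 8 words at 16..23, stepping from 23 by +1 gives 16 and stepping from 16 by -1 gives 23, without any comparison against the range bounds.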


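Mapping example
[Figure: delay line y1, y2, x1/y3, x2, x3 in the modulo range at addresses 16..23, with pointer pntr]
Assume initialisation: A = pointer = 17, S = -2

ACU instruction   address
read_A            17
incA              18
incA              19
incA              20
incA              21
step              19
dec               18    (prepare new pointer for next iteration)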


Addressing modes

Mode          Example            Meaning
register      ADD R4, R3         R[R4] = R[R4] + R[R3]
immediate     ADD R4, #3         R[R4] = R[R4] + 3
direct        ADD R4, (100)      R[R4] = R[R4] + Mem[100]
indirect      ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec    ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
indexed       ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
- direct: for static data
- indirect: for arrays
- inc/dec: for stepping through arrays, e.g. x(n)
- indexed: for stepping through arrays, e.g. x(2n)


Addressing modes: extra for DSP
- 8 ARs (address or auxiliary registers) available
- extra indirect modes:

  circular      *ARn±%       post inc/dec by 1, circular
  circular      *ARn±AR0%    post inc/dec by AR0, circular
  bit reverse   *ARn±AR0B    post inc/dec by AR0, bit reversed
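Bit-reversed addressing is typically used to reorder the input or output of a radix-2 FFT. The hardware generates the address sequence as a side effect of the post-modify; the sketch below only computes the same permutation in software, with a hypothetical function name.

```c
/* Reverse the low `bits` bits of index i.
 * For an N-point radix-2 FFT, bits = log2(N). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);    /* move the lsb of i into r */
        i >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 bits) the indices 0, 1, 2, ..., 7 map to 0, 4, 2, 6, 1, 5, 3, 7.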


Incorporation of an ALU
- regular data-flow algorithms ==> MAC: filtering, correlation, windowing, etc.
- decision making ==> ALU:
  - sorting filters (e.g. median filters)
  - interpolation (e.g. sqrt)
  - absolute value calculation
  - logarithmic conversion
  - finite field arithmetic (e.g. Galois field)
  - Viterbi
  - VLC, VLD
  - division


[Figure: controller and datapath: PC with +1 incrementer, interrupt address, stack and reset; program memory and IR; control bus; ACU_A, AR_A, RAM_A, DR_A; ACU_B, AR_B, RAM_B, DR_B; MAC, ALU and Rfile]


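Bus-oriented instruction encoding
[Figure: four instruction formats, selected by a 2-bit field (00, 01, 10, 11)]
- ALU | SX SY DX DY | RF | ACU A B
- MULT | SX SY DX DY | RF | ACU A B
- Imm. data | DX DY | RF | ACU A B
- Next address | BR | Cond | ACU A B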


First solution: Σ c(i) * x(i)
- 6 clock cycles/sample
- limit: pipelines in the controller
- schedule of resources over time (cc); not shown: coefficient RAM + ACU

LABEL   ALU                             MPY-ACC                      RAM         ACU
                                        acc = 0
                                                                                 init (i=0)
        init counter
loop                                                                             incr (=i+1)
                                                                     read x(i)
                                        acc(i)=acc(i-1)+x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

Loopfolding (software pipelining): Σ c(i) * x(i)
- 4 clock cycles/sample
- needs a pre- and postamble

LABEL   ALU                             MPY-ACC                             RAM           ACU
                                        acc(i-1) = 0
                                                                                          init (i=1)
        init counter
                                                                            read x(i)     inc (=i+1)
loop                                    acc(i) = acc(i-1)+x(i)*c(i)         read x(i+1)   incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                                        acc(n-1) = acc(n-2)+x(n-1)*c(n-1)   read x(n)
                                        acc(n) = acc(n-1)+x(n)*c(n)

Hardware support for loop control: Σ c(i) * x(i)
- 1 clock cycle/sample
- repeat instruction and repeat block

LABEL   ALU            MPY-ACC                             RAM           ACU
                       acc(i-1) = 0
                                                                         init (i=1)
        init counter
                                                           read x(i)     inc (=i+1)
        repeat n-2
                       acc(i) = acc(i-1)+x(i)*c(i)         read x(i+1)   incr (=i+2)
                       acc(n-1) = acc(n-2)+x(n-1)*c(n-1)   read x(n)
                       acc(n) = acc(n-1)+x(n)*c(n)
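The effect of loop folding can be imitated in C. The sketch below is only an analogy: a C compiler decides the real schedule, and the function names are hypothetical. It shows the preamble (first read), the folded kernel (MAC of iteration i together with the read for iteration i+1) and the postamble (last MAC).

```c
/* Plain loop: the read and the multiply-accumulate follow each other. */
long mac_plain(const long *c, const long *x, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++) {
        long xi = x[i];              /* read x(i) */
        acc += xi * c[i];            /* acc(i) = acc(i-1) + x(i)*c(i) */
    }
    return acc;
}

/* Folded loop: the read for iteration i+1 overlaps the MAC of iteration i.
 * Assumes n >= 1. */
long mac_folded(const long *c, const long *x, int n)
{
    long acc = 0;
    long xi = x[0];                  /* preamble: first read */
    for (int i = 0; i + 1 < n; i++) {
        long next = x[i + 1];        /* read x(i+1) */
        acc += xi * c[i];            /* MAC on x(i) in the same "cycle" */
        xi = next;
    }
    acc += xi * c[n - 1];            /* postamble: last MAC */
    return acc;
}
```

Both functions compute the same sum; the folded form makes the one-operation-per-resource-per-cycle kernel explicit, which is what a repeat instruction can then execute without branch overhead.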


TMS320C5000
[Figure: datapath with T register, sign control on the inputs, 17*17 multiplier, 40-bit accumulators A and B, multiplexers, fractional mode, 40-bit adder with ZERO/SAT/ROUND, 40-bit ALU, barrel shifter, MSW/LSW select, COMP, TRN and TC; connected to buses C, D, B, P and E]


Motorola 56K family
[Figure: block diagram]
- X memory: 256-by-24-bit RAM and 256-by-24-bit ROM; Y memory: 256-by-24-bit RAM and 256-by-24-bit ROM
- program memory: 2,048-by-24-bit ROM
- address ALU with X, Y and P address buses; 16-bit address bus through the external address switch
- X data, Y data, P data and global data buses; 24-bit data bus through the internal and external data-bus switches
- data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
- program controller: clock (2 bits), interrupt (3 bits)
- on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control (7 bits); 24-bit I/O ports


R.E.A.L.
[Figure: block diagram]
- buses for X data, Y data and Z data (16-bit bus); the program memory provides the Z data
- instruction decoder with 96-bit instructions; program control unit
- two 16-by-16-bit multipliers with inputs X, Y0, Y1 and products P0, P1
- two 40-bit arithmetic-logic units with scaling and saturation
- four 40-bit accumulators with saturation/scale and shift

RD16021 DSP

memories            not included
process             0.35 µm, 5M
voltage             2.7-3.6 V
frequency           39 MHz (Tj = 85 C, 2.7 V, wcp)
area                3.9 mm2
power dissipation   2.1 mW/MHz

Instruction cycle counts for BDTi benchmarks (slide annotations: 16 taps, 40 samples, 8 biquads)

                      DSPgroup  Motorola  ADI        Lucent   TI TMS320  TI TMS320  Lucent    Philips
Function              OAK       DSP561xx  ADSP-218x  DSP16xx  C54x       C62xx      DSP16210  RD16020
Real block FIR        835       925       841        1240     684        334        780       448
Single sample FIR     21        23        22         26       18         17         16        20
Complex block FIR     3018      3043      3122       3123     2922       1294       1681      1470
IIR (8 sections)      51        45        43         65       44         30         38        37
Vector dot product    43        43        43         47       41         29         23        43
Vector add            122       85        83         123      61         36         43        63
Convolution encoder   506       772       818        888      528        188        464       176
FSM                   284       375       198        415      455        147        301       167
256 pnt FFT           16514     12148     10633      21035    13234      4225       9016      5797

Rows with only seven values in the source (listed in source order; the column alignment is unknown):
LMS adaptive          90, 64, 59, 101, 58, 33, 55
Vector maximum        41, 86, 128, 120, 111, 39, 40


Code generation
[Figure: compiler structure, from source to code]
- front end: lexical analysis, syntax analysis, semantic analysis
- intermediate machine-independent representation
- code generation: code selection, register allocation, scheduling
- annotations: 1 instr = // ops; order of instr


Intermediate machine-independent representation

    t1  := a * b
    t2  := c + d
    t3  := t1 + c
    out := t2 * t3

[Figure: the corresponding data-flow graph with inputs a, b, c, d and nodes t1, t2, t3; basic blocks BBi, BBj, BBk]


Code selection
A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]
Notation: ar := ar | ax + ay | af means
    ar := ar + ay  or  ar := ar + af  or  ar := ax + ay  or  ar := ax + af


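Code selection example
[Figure: ADSP datapath (Analog Devices): an ALU (+, -) with inputs x and y and registers ax, ay, ar, af; a MAC (+, -, *) with inputs x and y and registers mx, my, mr, mf; d memory and p memory]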


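Example of code selection = covering of the intermediate representation with RTPs
[Figure: data-flow graph of t1, t2, t3 covered by three patterns, numbered 1 to 3]
RTPs shown on the slide:
    mx := dmem    my := pmem    ax := dmem    ay := pmem    mr := dmem
    ar := ax + ay
    my := ar
    mr := mr * my
    mr := mr + (mx * my)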


Problems
- local decisions which have a global impact
- phase coupling; example: ASAP schedule
  - maximal freedom for scheduling
  - code selection during scheduling
  - register allocation comes afterwards
  - can lead to infeasible solutions


Phase coupling: example 1
[Figure: (a) registers R1, R2, R3 with alu1 and alu2; (b) operations 1 to 4]


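Phase coupling: example 2
Example of coupling between scheduling and register allocation [Mesman]
[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, shown without and with the constraint that applies if u and v share the same register]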


Phase coupling: discussion
[Figure: traditional (heuristic) code generation: application and constraints go in, the result is checked (OK? yes/no); the design space seen by the code generator is compared with the feasible space] [Mesman]
Phase coupling is difficult because of the many constraints originating from irregular interconnect, special-purpose registers and non-orthogonal microcode.


Phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
   - efficiency = exploit instruction level parallelism (ILP)
   - compilation = systematic positioning of registers and regular interconnect
   - = VLIW = Very Long Instruction Word


Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - principles
  - central register file + example TM
  - clustered VLIW + example C6
  - subword parallelism or SIMD


VLIW principles
- multiple parallel FUs, possibly different and pipelined
- pipelining is exposed to the compiler = no interlock mechanism
- load-store architecture: all operands are fetched from/stored in register files, possibly multi-ported
- each FU can receive an instruction every clock cycle
- one instruction = many RISC instructions
- each RISC instruction = one issue slot
- no dependencies between different RISC instructions = orthogonal microcode = compiler friendly


VLIW architecture
[Figure: one register file shared by exec units 1..25, each with its own issue slot (issue slots 1..25); the R&W addresses come from the instruction]
- long instruction words, e.g. (3*7+4)*25 = 625 bits
- many ports on the register file, e.g. 75
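The two numbers can be reproduced with a small calculation. The reading used here is an assumption, since the slide gives only the arithmetic: each issue slot carries 3 register addresses of 7 bits (enough for 128 registers) plus 4 opcode bits, and each slot needs 3 register-file ports. The function names are hypothetical.

```c
/* Width in bits of one long instruction word. */
unsigned vliw_width(unsigned slots, unsigned operands,
                    unsigned addr_bits, unsigned opcode_bits)
{
    return (operands * addr_bits + opcode_bits) * slots;
}

/* Number of register-file ports when every operand has its own port. */
unsigned rf_ports(unsigned slots, unsigned operands)
{
    return slots * operands;
}
```

Under these assumptions, 25 slots give `vliw_width(25, 3, 7, 4)` = 625 bits and `rf_ports(25, 3)` = 75 ports, which shows why a single central register file does not scale to many issue slots.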


VLIW architecture: central register file
[Figure: issue slots 1..3 and exec units 1..9, all connected to one register file]


TM1000 DSPCPU
[Figure: instruction cache (32 kB), PC, instruction register (5 issue slots), five exec units, register file (128 regs, 32 bit, 15 ports), data cache (16 kB)]
Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt


TriMedia TM32A processor
- 0.18 micron
- area: 16.9 mm2
- 200 MHz (typ)
- 1.4 W
- 7 mW/MHz (MIPS = 0.9 mW/MHz)


Synthesised RF area (CMOS18, 64 bit)
Area, speed and power dissipation scale worse than linearly with the number of ports.


[Chart: register file area after place & route, 64-bit registers; x axis: number of ports; y axis: area in mm2; curves for 32, 64 and 128 registers]

Area after P&R (mm2)
Ports (write + read)   32 regs   64 regs   128 regs
3  (1 + 2)             0.17      0.34      0.69
6  (2 + 4)             0.296     0.748     1.49
8  (3 + 5)             0.454     1.1       2.38

Pre-layout results, multi-port register file (CMOS18, 64 bit): area in um2 / delay in ns
Ports (write + read)   32 regs          64 regs          128 regs
3  (1 + 2)             293515 / 1.47    632418 / 1.71    1174135 / 1.87
6  (2 + 4)             487805 / 1.57    908972 / 1.76    1720413 / 2.02
12 (4 + 8)             822034 / 1.65    1483850 / 1.86   2929639 / 2.06
24 (8 + 16)            1395703 / 1.84   2601947 / 2.09   5754701 / 2.2

VLIW architecture: clustered register files
[Figure: three clusters, each with two exec units, a copy unit and its own register file (register files 1..3)]
[Figure: register file 1 with FMUL and FADD; register files 2 and 3 each with IMUL and IADD; example operations: FMUL r1,r2,r3; IADD r1,r2,r3; IMUL r1,r2,r3]
[Figure: register file I0 with IADD_01, IMOV_01 : FU00; IADD_00, LAND_00 : FU01; IMUL_00, SHFT_00 : FU02; register file I1 with IADD_10, IMOV_10 : FU10; IADD_11, LAND_10 : FU01; IMUL_10, SHFT_10 : FU02]


VLIW architecture: clustered register files, discussion
- performance loss (more instructions) compared to a central register file, due to the extra cycle for a copy:
  - 15-20 % for 2 clusters
  - 20-30 % for 4 clusters
- limited scalability:
  - not too many clusters
  - not too many registers within each cluster (too many RF ports)
- copy operations are added in the compiler = the graph changes during scheduling


TMS320C62x VelociTI (fixed point)
[Figure: two datapaths, each with a register file 0-15 (32 bits) and four units (L1, S1, M1, D1 and L2, S2, M2, D2); every unit has dst, src1 and src2 ports; the D units provide the store/load address, with separate store/load data paths]
Unit functions:
- L: int add, logical, bit count
- S: int add, logical, bit manip, shift, constant, branch
- M: int mult (16 => 32)
- D: int add, load/store


VelociTI principles
- parallelism (fetch-decode-execute), max 8 issue slots
- pipelined critical sections (ALU 1 cc, mult 2 cc, 200 MHz)
- RISC (simple, atomic, independent instructions); performance comes from the compiler (pipelining, unrolling)
- load-store
- orthogonal (2 identical datapaths, add on 6 units)
- deterministic (no interlock)
- conditional instructions (= guarding)
- instruction packing


VelociTI encoding
- classical encoding: fetching many nops
- VelociTI encoding: one bit per instruction of the fetch packet A B C D E F G H
  - fully serial:           0 0 0 0 0 0 0 0
  - mixed serial/parallel:  1 1 0 1 0 0 1 0
  - fully parallel:         1 1 1 1 1 1 1 0
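How the per-instruction bits group a fetch packet into execute packets can be sketched as below. The packing of the eight bits into one byte (instruction A in the most significant bit) and the function name are assumptions made for the example; the rule itself is that a set bit chains the next instruction into the same cycle.

```c
#include <stdint.h>

/* Split an 8-instruction fetch packet into execute packets.
 * pbits: one bit per instruction, instruction A in the most significant bit,
 *        so the bit string 11010010 is 0xD2 (illustrative packing).
 * A set bit means: the next instruction issues in the same cycle.
 * sizes[] receives the number of instructions in each execute packet.
 * Returns the number of execute packets, i.e. the number of issue cycles. */
int split_fetch_packet(uint8_t pbits, int sizes[8])
{
    int packets = 0;
    int current = 0;

    for (int i = 0; i < 8; i++) {
        int p = (pbits >> (7 - i)) & 1;
        current++;
        if (!p || i == 7) {          /* this instruction closes the packet */
            sizes[packets++] = current;
            current = 0;
        }
    }
    return packets;
}
```

The three bit strings above give 8 packets of one instruction, 4 packets of sizes 3, 2, 1, 2, and a single packet of 8 instructions, without storing any nops.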


Subword parallelism (custom operators in TM)
[Figure: 1st input operand (byte3..byte0) op 2nd input operand (byte3..byte0) => output operand (byte3..byte0)]
- 32 bits = 4 bytes are processed independently
- e.g. +, -, min, max => quadumin, quadumax, ...


Subword parallelism (custom operators in TM)
+ faster execution
- rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 * 32 bit floating point)


Scalar version, one byte per iteration:

    int size = 1000;
    byte out[size], in1[size], in2[size];

    for (i = 0; i < size; i++) {
        out[i] = in1[i] + in2[i];
    }

Subword version, four bytes per iteration (packet4 and the packet4_* operations are pseudo-operations):

    int size = 1000;
    byte out[size], in1[size], in2[size];

    for (i = 0; i < size; i += 4) {
        packet4 t1 = packet4_load(&in1[i]);
        packet4 t2 = packet4_load(&in2[i]);
        packet4 t3 = packet4_add(t1, t2);
        packet4_store(&out[i], t3);
    }

Subword parallelism: MPEG example
Remark: simple example without inter-loop dependencies

Original loop (the loop bound is lost in the source; n is a placeholder):

    for (i = 0; i < n; i++) {
        temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i] = temp;
    }

Step 1: unroll four times

    for (i = 0; i < n; i += 4) {
        temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
        if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
        destination[i+0] = temp;

        temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
        if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
        destination[i+1] = temp;

        temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
        if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
        destination[i+2] = temp;

        temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
        if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
        destination[i+3] = temp;
    }

Step 2: group identical operations (loop body)

        /* averaging: maps to quadavg */
        temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
        temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
        temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
        temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

        /* add and clip: maps to dspuquadaddui */
        temp0 += idct(i+0); if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
        temp1 += idct(i+1); if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
        temp2 += idct(i+2); if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
        temp3 += idct(i+3); if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;

        destination[i+0] = temp0;
        destination[i+1] = temp1;
        destination[i+2] = temp2;
        destination[i+3] = temp3;
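What the two quad operations do to a 32-bit word can be emulated in portable C. This is a sketch under assumed semantics (a rounded byte-wise average, and a byte-wise sum of an unsigned and a signed byte clipped to 0..255); the function names are hypothetical stand-ins, not the TriMedia operators themselves.

```c
#include <stdint.h>

/* Byte-wise rounded average of four unsigned bytes packed in a 32-bit word:
 * each output byte is (x + y + 1) >> 1. */
uint32_t quad_avg(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int k = 0; k < 4; k++) {
        uint32_t x = (a >> (8 * k)) & 0xFFu;
        uint32_t y = (b >> (8 * k)) & 0xFFu;
        out |= ((x + y + 1u) >> 1) << (8 * k);
    }
    return out;
}

/* Byte-wise sum of an unsigned byte (from u) and a signed byte (from s),
 * clipped to the range 0..255. */
uint32_t quad_add_clip(uint32_t u, uint32_t s)
{
    uint32_t out = 0;
    for (int k = 0; k < 4; k++) {
        int x = (int)((u >> (8 * k)) & 0xFFu);
        int y = (int)((s >> (8 * k)) & 0xFFu);
        int t;
        if (y >= 128)
            y -= 256;                /* reinterpret the byte as signed */
        t = x + y;
        if (t > 255) t = 255;
        else if (t < 0) t = 0;
        out |= (uint32_t)t << (8 * k);
    }
    return out;
}
```

In hardware each of these loops is a single operation on one issue slot, so the four pixels of the unrolled loop body are handled by two operations instead of sixteen or more.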


Will embedded CPUs and DSPs converge?
- converging forces:
  - both include a hardware multiplier
  - trend in DSPs towards caches and RTK
  - trend in DSPs towards C/C++
  - common trend towards VLIW
- diverging forces:
  - deeply embedded code (DSP) vs. end-user SW (CPU)
  - different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions
- VLIW is a good balance between hw and sw, and between efficiency (ILP) and cost
- fundamental problems: code size, interruptability


Trendlines (Exponxxx): Synthesis done by Excel.

A 32-bit register file with 128 registers and 10R 5W ports is realisable in CMOS18; it is 2.7 mm2. However, expanding to 64 bits makes things worse!

The curve is exponential. Therefore, splitting the register file in two will reduce the area. See the vertical line from the 128 reg curve to the 64 reg curve. Next, we can reduce the number of ports per register file. Say 9 ports per file. In this setup the register files will be available only to the local issue slots of the cluster.

It will also help in power, since the RF area is smaller.
