lect2

Date post: 09-Jan-2016
Upload: usama-javed
  • Processor Architectures and Program Mapping

    Programmable Digital Signal Processors

    5kk10, TU/e

    Henk Corporaal, Jef van Meerbergen, Bart Mesman

    Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

  • Topic 2: Programmable Digital Signal Processors

    - instruction level parallelism (ILP)
    - hardware support for loop control
    - attention for high-level data types, e.g. arrays and delay lines (vs. scalars for CPUs)
    - difficult to compare architectures: e.g. for an FFT, DIT or DIF, radix 2 or 4, loop unrolling, scaling, shuffling and initialisation can be included or forgotten
    - benchmarking: Berkeley Design Technology Inc. (BDTi) (compare to the SPECint benchmarks for CPUs)

  • Outline

    - architectures for programmable DSPs
        - multiplier-accumulator
        - modified Harvard architecture
        - extension with an ALU (decision making)
        - controller architectures
    - examples: TI, Motorola, Philips
    - code generation
    - recent developments: VLIW (Very Long Instruction Word); examples: C6 and TM

  • Goal = 1 cycle per iteration

    Modifications:
    - position of the ACR (1 or 2)
    - adder/subtractor
    - extra pipelines
    - asymmetric inputs
    - multi-precision
    - extra inputs/outputs

  • DSP data types

    - not every signal requires 32 bits
    - 2 types of DSP: floating point and integer
    - advantage of FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
    - disadvantage of FP: cost (area, speed, power)
    - wanted: type of the output of an operation = type of the input (because both are stored in RAM)
        - no problem for FP, but a problem for integer
        - integer multiplication doubles the number of bits: n * n => 2n
    - What about fractional numbers?

  • DSP data types

    Integer and fractional numbers are a special case of fixed point: fix<p,q> (ART Designer & SystemC), with p bits in total of which q are fractional.

    Example: 11101101 read as fix<8,3> = -19/8 = -2.375

    - bit weights: -2^4 2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 (2's complement: the msb has a negative weight)
    - scale factor 1/8
    - quantization error
    - if q = 0 then integer, e.g. int
    - if q = p-1 then fractional, e.g. int
    - the same ALU handles fix<p,q> for the different values of q
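
    A minimal sketch of how a bit pattern can be read as a fixed-point number, assuming the word length is the length of the bit string and q counts the fractional bits. The helper name `fix_value` is hypothetical and is not part of ART Designer or SystemC.

```python
from fractions import Fraction

def fix_value(bits: str, q: int) -> Fraction:
    """Interpret a 2's complement bit string that has q fractional bits."""
    p = len(bits)
    raw = int(bits, 2)
    if bits[0] == "1":             # the msb carries the negative weight -2^(p-1)
        raw -= 1 << p
    return Fraction(raw, 1 << q)   # scale factor 1 / 2^q

print(fix_value("11101101", 3))    # -19/8
print(float(fix_value("11101101", 3)))  # -2.375
```

    With q = 0 the same function returns plain integers, which is the sense in which integers are a special case of fixed point.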

  • DSP data types

    Multiplication example:

          11101101            -19/8
        x 01100001             97/16
        = 1111100011001101    -1843/128

    Int x Int:

          s x x x
        x s y y y
        ---------
          s s z z z z z z
       => s z z z z z z 0    if FRCT = 1

    Some processors (C54) have special instructions for fractional numbers (and a symmetric number domain, with limits set by 2^(n-1)).
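
    The doubled sign bit and the FRCT shift can be sketched as follows. This is an illustration under assumptions, not a model of a specific processor: the word length defaults to 16 bits, and `to_signed` and `multiply` are hypothetical helper names.

```python
def to_signed(value: int, bits: int) -> int:
    """Wrap an integer into the signed range of a `bits`-wide word."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def multiply(x: int, y: int, n: int = 16, frct: bool = False) -> int:
    """n x n signed multiply giving a 2n-bit product."""
    product = x * y              # s s z z ... z: two sign bits
    if frct:
        product <<= 1            # s z z ... z 0: one sign bit, fractional result
    return to_signed(product, 2 * n)

half = 1 << 14                   # 0.5 when 16 bits are read as a fraction (15 fractional bits)
msw = multiply(half, half, frct=True) >> 16
print(msw / (1 << 15))           # 0.25
print(multiply(-19, 97, n=8))    # -1843
```

    In this sketch the most negative fraction squared, `multiply(-32768, -32768, frct=True)`, wraps around to the most negative 32-bit value, which is one motivation for a symmetric number domain.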

  • DSP data types

    - continue (after multiplication) with the msb part only
        - represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
        - more efficient solution
    - continue with msb + lsb
        - sum-of-product operations generate accumulated noise at the 32nd vs. the 16th bit
    - still overflow for addition => overflow bits
    - double-precision accumulator + extra overflow bits + shift, round, truncate unit

  • Rounding, value truncation, magnitude truncation

    Rounding:

        111.11 (-0.25) + 000.1 = 000 (0)
        111.01 (-0.75) + 000.1 = 111 (-1)

    Value truncation:

        111.11 (-0.25) = 111 (-1)
        111.01 (-0.75) = 111 (-1)

    Magnitude truncation:

        111.11 (-0.25) + 001. = 000 (0)
        111.01 (-0.75) + 001. = 000 (0)
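
    The three quantization rules can be sketched on integers that carry q fractional bits. The function names are hypothetical. One deliberate difference from the slide is flagged in the code: for negative inputs the sketch adds 2^q - 1 rather than a full 001., so that exact integers are also handled.

```python
def value_truncate(x: int, q: int) -> int:
    """Drop q fractional bits; the result moves toward minus infinity."""
    return x >> q

def rounding(x: int, q: int) -> int:
    """Add half an lsb of the result (000.1 for q = 2), then truncate."""
    return (x + (1 << (q - 1))) >> q

def magnitude_truncate(x: int, q: int) -> int:
    """Drop q fractional bits; the result moves toward zero."""
    if x < 0:
        x += (1 << q) - 1   # assumption: 2^q - 1 instead of the slide's 001.
    return x >> q

# 111.11 = -0.25 is the integer -1 with q = 2; 111.01 = -0.75 is -3
for x in (-1, -3):
    print(x / 4, rounding(x, 2), value_truncate(x, 2), magnitude_truncate(x, 2))
# -0.25 0 -1 0
# -0.75 -1 -1 0
```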

  • Overflow characteristics (figure): saturation, zeroing, sawtooth
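
    A small sketch of the three overflow characteristics for a signed word of a given width. The function names are hypothetical, and "sawtooth" is taken here to mean ordinary 2's complement wrap-around, which is an assumption about the figure.

```python
def limits(bits: int):
    """Smallest and largest value of a signed `bits`-wide word."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def saturate(x: int, bits: int) -> int:
    lo, hi = limits(bits)
    return max(lo, min(hi, x))          # clip to the nearest limit

def zeroing(x: int, bits: int) -> int:
    lo, hi = limits(bits)
    return x if lo <= x <= hi else 0    # replace an overflowed result by zero

def sawtooth(x: int, bits: int) -> int:
    lo, _ = limits(bits)
    return (x - lo) % (1 << bits) + lo  # wrap around

for f in (saturate, zeroing, sawtooth):
    print(f.__name__, f(200, 8))
# saturate 127
# zeroing 0
# sawtooth -56
```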

  • Goal = 1 cycle per iteration of the sum of c(i) * x(i)

    - Von Neumann (sequential): one program/data memory + EXU
    - Harvard: program memory + data memory + EXU
    - Modified Harvard: program memory + data memory 1 + data memory 2 + EXU

  • Figure: RAM_A, RAM_B, MAC

  • Figure: filter with delay line x1 ... x5 (Z^-1 elements), coefficients c1 ... c5 and output y = sum of ci * xi

    - time loop; filter loop over i
    - How to update the delay line?
    - 1 cycle/tap?

  • Solution 1: block move in memory

    2 possibilities:
    - complete move after every output sample is calculated: read and write the data twice
    - move after the read of every datum separately: write the data twice; need for a special instruction (TMS320)

    | Memory location | Output sample 1 | Output sample 2 | Output sample 3 |
    |-----------------|-----------------|-----------------|-----------------|
    | 1               | x1              | x2              | x3              |
    | 2               | x2              | x3              | x4              |
    | 3               | x3              | x4              | x5              |
    | 4               | x4              | x5              | x6              |
    | 5               | x5              | x6              | x7              |
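
    The first possibility (a complete move after every output sample) can be sketched as below. The function name and the orientation of the delay line (index 0 holds the newest sample) are assumptions made for the illustration.

```python
def fir_block_move(delay, coeffs, sample):
    """Compute one output sample, then move the whole delay line by one place."""
    delay[0] = sample
    y = sum(c * x for c, x in zip(coeffs, delay))
    # Complete move after the output: every datum is read and written once more.
    for k in range(len(delay) - 1, 0, -1):
        delay[k] = delay[k - 1]
    return y

delay = [0, 0, 0]
coeffs = [1, 2, 3]
print([fir_block_move(delay, coeffs, s) for s in (1, 0, 0, 0)])  # [1, 2, 3, 0]
```

    The inner `for` loop is the cost that the slide points at: it touches every location of the delay line for each output sample.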

  • Solution 2: indirect addressing

    - use of a pointer to mark the beginning of the delay line
    - update the pointer instead of moving the data
    - problem: trashing of the whole memory
    - solution: modulo addressing
    - need for a register to store the pointer

    | Memory location | Output sample 1 | Output sample 2 | Output sample 3 | Output sample 4 | Output sample 5 |
    |-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
    | 1               | x1              |                 |                 |                 | x9              |
    | 2               | x2              | x2              |                 |                 |                 |
    | 3               | x3              | x3              | x3              |                 |                 |
    | 4               | x4              | x4              | x4              | x4              |                 |
    | 5               | x5              | x5              | x5              | x5              | x5              |
    | 6               |                 | x6              | x6              | x6              | x6              |
    | 7               |                 |                 | x7              | x7              | x7              |
    | 8               |                 |                 |                 | x8              | x8              |
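
    Pointer update with modulo addressing can be sketched as a circular buffer. This is an illustration under assumptions: the class name is hypothetical, and the modulo range is taken equal to the number of taps, so that each new sample overwrites the oldest one.

```python
class CircularDelayLine:
    """Delay line addressed through a pointer with modulo addressing."""

    def __init__(self, taps: int):
        self.buf = [0] * taps
        self.ptr = 0                       # marks the newest sample

    def push(self, sample):
        # Update the pointer instead of moving the data; the modulo keeps
        # the pointer inside the buffer.
        self.ptr = (self.ptr - 1) % len(self.buf)
        self.buf[self.ptr] = sample        # overwrites the oldest datum

    def fir(self, coeffs):
        n = len(self.buf)
        return sum(c * self.buf[(self.ptr + k) % n] for k, c in enumerate(coeffs))

line = CircularDelayLine(3)
outputs = []
for s in (1, 0, 0, 0):
    line.push(s)
    outputs.append(line.fir([1, 2, 3]))
print(outputs)                             # [1, 2, 3, 0]
```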

  • IIR filter

    Figure: input x, output y, delay line y1 ... y5 (Z^-1 elements), coefficients c1 ... c4.

    Memory map: y1 y2 y3 y4 y5, addressed through a pointer.

  • 2 filters

    time loop:

        for j = 1..jtaps: d(j) * y(j)
        for i = 1..itaps: c(i) * x(i)

    - 2 memory segments: y1 ... y5 with pntr 2 and modulo range 2; x1 ... x5 with pntr 1 and modulo range 1
    - 2 memory segments => 1 segment: y1 ... y5 x1 ... x5 with pntr 1 and a single modulo range

  • Mapping strategy

    - define positions in RAM; constraint: variables that form a delay line are placed in consecutive places
    - find a schedule; example: c1 => c2 => c3 => c4 => c5
    - define ACU instructions

    Figure: memory map y1, y2, x1/y3, x2, x3 inside one modulo range, addressed by pntr 1.

  • Figure: filter with delay line x1 ... x8 (Z^-1 elements), coefficients c1 ... c8 and two outputs ye and yo

  • ACU architecture and instruction set

    Figure: registers A and S, a modulo unit, output to RAM.

    | Instruction | Output to RAM | reg A | reg S |
    |-------------|---------------|-------|-------|
    | Read_A      | A             | A     | S     |
    | Read_S      | S             | A     | S     |
    | incA        | A+1           | A+1   | S     |
    | decA        | A-1           | A-1   | S     |
    | Step        | A+S           | A+S   | S     |
    | Inc_step    | S+1           | A     | S+1   |

    Modulo can be implemented as a mask operation if the size is 2^k. Example: 16 = 10 000 to 23 = 10 111; the masked upper bits (10) hold.
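
    The mask operation can be sketched as follows, assuming the modulo range is a block of 2^k words that starts at a multiple of 2^k. The function name is hypothetical.

```python
def modulo_step(address: int, step: int, size: int) -> int:
    """Advance `address` by `step` inside an aligned block of `size` = 2^k words."""
    assert size & (size - 1) == 0, "the mask trick needs a power-of-two size"
    mask = size - 1
    # The upper bits hold the block base; only the lower k bits are updated.
    return (address & ~mask) | ((address + step) & mask)

print(modulo_step(23, 1, 8))    # 16: 10 111 + 1 wraps to 10 000
print(modulo_step(16, -1, 8))   # 23
print(modulo_step(21, -2, 8))   # 19
```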

  • Mapping example

    Assume initialisation: A = pointer = 17, S = -2; modulo range 16 ... 23.

    | ACU instruction | Address |
    |-----------------|---------|
    | read_A          | 17      |
    | incA            | 18      |
    | incA            | 19      |
    | incA            | 20      |
    | incA            | 21      |
    | step            | 19      |
    | dec             | 18 (prepare new pointer for next iteration) |

    Figure: memory map y1, y2, x1/y3, x2, x3 inside the modulo range, addressed by pntr.

  • Addressing modes

    | Mode         | Example           | Meaning                                                   |
    |--------------|-------------------|-----------------------------------------------------------|
    | register     | ADD R4, R3        | R[R4] = R[R4] + R[R3]                                     |
    | immediate    | ADD R4, #3        | R[R4] = R[R4] + #3                                        |
    | direct       | ADD R4, (100)     | R[R4] = R[R4] + Mem[100]                                  |
    | indirect     | ADD R4, (R3)      | R[R4] = R[R4] + Mem[R[R3]]                                |
    | w. inc/dec   | ADD R4, (R3)      | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1             |
    | indexed      | ADD R4, (R3 ± R2) | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]         |

    Remarks:
    - direct = for static data
    - indirect = for arrays
    - inc/dec = for stepping through arrays, e.g. x(n)
    - index = for stepping through arrays, e.g. x(2n)

  • Addressing modes: extra for DSP

    - 8 ARs (address or auxiliary registers) available
    - extra indirect modes:

    | Mode                 | Syntax         | Meaning                             |
    |----------------------|----------------|-------------------------------------|
    | circular             | *ARn ± %       | post inc/dec by 1, circular         |
    | circular             | *ARn ± AR0 %   | post inc/dec by AR0, circular       |
    | bit reverse          | *ARn ± AR0 B   | post inc/dec by AR0, bit reversed   |
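
    Bit-reversed post-increment is commonly described as an addition whose carry propagates in the reverse direction. The sketch below emulates that by reversing the bits, adding, and reversing back; the helper names and the 8-point example with a step of 4 are assumptions for illustration, not the instruction semantics of a specific DSP.

```python
def bit_reverse(value: int, bits: int) -> int:
    """Mirror the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def reverse_carry_add(address: int, step: int, bits: int) -> int:
    """Add `step` to `address` with the carry running from msb to lsb."""
    total = bit_reverse(address, bits) + bit_reverse(step, bits)
    return bit_reverse(total & ((1 << bits) - 1), bits)

address, order = 0, []
for _ in range(8):
    order.append(address)
    address = reverse_carry_add(address, 4, 3)   # step = N/2 for N = 8
print(order)                                     # [0, 4, 2, 6, 1, 5, 3, 7]
```

    The resulting order is the index permutation that a radix-2 FFT needs for its input or output shuffling.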

  • Incorporation of an ALU

    - regular data-flow algorithms ==> MAC: filtering, correlation, windowing, etc.
    - decision making ==> ALU:
        - sorting filters (e.g. median filters)
        - interpolation (e.g. sqrt)
        - absolute value calculation
        - logarithmic conversion
        - finite field arithmetic (e.g. Galois field)
        - Viterbi
        - VLC, VLD
        - division

  • Figure: controller (+1, PC, interrupt address, stack, reset, program memory, IR) connected through a control bus to ACU_A, AR_A, RAM_A, DR_A, ACU_B, AR_B, RAM_B, DR_B, the MAC, the ALU and the Rfile

  • Bus-oriented instruction encoding

    | Format | Fields                                   |
    |--------|------------------------------------------|
    | 00     | ALU, SX, SY, DX, DY, RF, ACU A, B        |
    | 01     | MULT, SX, SY, DX, DY, RF, ACU A, B       |
    | 10     | Imm. data, DX, DY, RF, ACU A, B          |
    | 11     | Next address, BR, Cond, ACU A, B         |

  • First solution

    Sum of c(i) * x(i): 6 clock cycles/sample. Limit: pipelines in the controller.

    Schedule of resources (ALU, MPY-ACC, RAM, ACU) against time in clock cycles, one line per cycle; the coefficient RAM + ACU are not shown:

              acc = 0
              init (i=0)
              init counter
        loop: incr (=i+1)
              read x(i)
              acc(i) = acc(i-1) + x(i)*c(i)
              dec counter
              branch to loop if counter > 0
              nop

  • Loop folding (software pipelining)

  • Loop folding (software pipelining)

    Sum of c(i) * x(i): 4 clock cycles/sample, with pre- and postamble.

    Schedule of resources (ALU, MPY-ACC, RAM, ACU), one line per cycle; operations separated by "|" share a cycle:

              acc(i-1) = 0
              init (i=1)
              init counter
              read x(i) | inc (=i+1)
        loop: acc(i) = acc(i-1) + x(i)*c(i) | read x(i+1) | incr (=i+2)
              dec counter
              branch to loop if counter > 0
              nop
              acc(n-1) = acc(n-2) + x(n-1)*c(n-1) | read x(n)
              acc(n) = acc(n-1) + x(n)*c(n)

  • Hardware support for loop control

    Sum of c(i) * x(i): 1 clock cycle/sample, using a repeat instruction and repeat block.

    Schedule of resources (ALU, MPY-ACC, RAM, ACU), one line per cycle; operations separated by "|" share a cycle:

                    acc(i-1) = 0
                    init (i=1)
                    init counter
                    read x(i) | inc (=i+1)
        repeat n-2: acc(i) = acc(i-1) + x(i)*c(i) | read x(i+1) | incr (=i+2)
                    acc(n-1) = acc(n-2) + x(n-1)*c(n-1) | read x(n)
                    acc(n) = acc(n-1) + x(n)*c(n)

  • Outline

    - architectures for programmable DSPs
        - multiplier-accumulator
        - modified Harvard architecture
        - extension with an ALU (decision making)
        - controller architectures
    - examples: TI, Motorola, Philips
    - code generation
    - recent developments: VLIW (Very Long Instruction Word); examples: C6 and TM

  • TMS320C5000

    Figure: datapath with T register, sign control blocks, 17 x 17 multiplier with fractional mode, 40-bit adder with ZERO, SAT and ROUND, 40-bit ALU, 40-bit accumulators A and B, multiplexers, barrel shifter, MSW/LSW select, and a COMP unit with TRN and TC.

  • Motorola 56K family

    Figure labels:
    - address bus, 16 bits: X address, Y address, P address; address ALU; external address switch
    - X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
    - Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
    - program memory: 2,048-by-24-bit ROM
    - data bus, 24 bits: X data, Y data, P data, global data; internal data-bus switch, external data-bus switch
    - data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
    - program controller
    - on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control
    - external signals: clock, interrupt and I/O ports (labelled 2, 3, 24 and 7 bits in the figure)

  • R.E.A.L.

    Figure labels:
    - buses for X data, Y data, Z data
    - instruction decoder, 96-bit instructions
    - program control unit; program memory (Z data), 16-bit bus
    - two 16-by-16-bit multipliers (labels X, Y0, Y1, P0, P1)
    - scale units
    - two 40-bit arithmetic-logic units, with saturation
    - four 40-bit accumulators; saturation/scale shift

  • RD16021 DSP

    | Parameter         | Value                                  |
    |-------------------|----------------------------------------|
    | memories          | not included                           |
    | process           | 0.35 µm, 5M                            |
    | voltage           | 2.7-3.6 V                              |
    | frequency         | 39 MHz (Tj = 85 °C, 2.7 V, wcp)        |
    | area              | 3.9 mm2                                |
    | power dissipation | 2.1 mW/MHz                             |

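  • Instruction cycle counts for BDTi benchmarks

    Annotations in the figure: 16 taps, 40 samples, 8 biquads.

    | Function            | DSP Group OAK | Motorola DSP561xx | ADI ADSP-218x | Lucent DSP16xx | TI TMS320 C54x | TI TMS320 C62xx | Lucent DSP16210 | Philips RD16020 |
    |---------------------|---------------|-------------------|---------------|----------------|----------------|-----------------|-----------------|-----------------|
    | Real block FIR      | 835           | 925               | 841           | 1240           | 684            | 334             | 780             | 448             |
    | Single sample FIR   | 21            | 23                | 22            | 26             | 18             | 17              | 16              | 20              |
    | Complex block FIR   | 3018          | 3043              | 3122          | 3123           | 2922           | 1294            | 1681            | 1470            |
    | IIR (8 sections)    | 51            | 45                | 43            | 65             | 44             | 30              | 38              | 37              |
    | Vector dot product  | 43            | 43                | 43            | 47             | 41             | 29              | 23              | 43              |
    | Vector add          | 122           | 85                | 83            | 123            | 61             | 36              | 43              | 63              |
    | Convolution encoder | 506           | 772               | 818           | 888            | 528            | 188             | 464             | 176             |
    | FSM                 | 284           | 375               | 198           | 415            | 455            | 147             | 301             | 167             |
    | 256 pnt FFT         | 16514         | 12148             | 10633         | 21035          | 13234          | 4225            | 9016            | 5797            |

    Two rows survive with only seven of the eight values, so their columns cannot be assigned with certainty:
    - LMS adaptive: 90, 64, 59, 101, 58, 33, 55
    - Vector maximum: 41, 86, 128, 120, 111, 39, 40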

  • Outline

    - architectures for programmable DSPs
        - multiplier-accumulator
        - modified Harvard architecture
        - extension with an ALU (decision making)
        - controller architectures
    - examples: TI, Motorola, Philips
    - code generation
    - recent developments: VLIW (Very Long Instruction Word)

  • Front end and code generation

    source => front end => intermediate machine-independent representation => code generation => code

    - front end: lexical analysis, syntax analysis, semantic analysis
    - code generation: code selection, register allocation, scheduling
    - scheduling: 1 instr = // ops; order of instr

  • Intermediate machine-independent representation

        t1 := a * b
        t2 := c + d
        t3 := t1 + c
        out := t2 * t3

    Figure: data-flow graph of these operations (inputs a, b, c, d; intermediate values t1, t2, t3) and a graph of basic blocks BBi, BBj, BBk.

  • Code selection

    A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]

    Notation: ar := ar | ax + ay | af means

        ar := ar + ay  or
        ar := ar + af  or
        ar := ax + ay  or
        ar := ax + af
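
    The "|" notation is a compact way of writing the cross product of the operand alternatives. A small sketch, with a hypothetical helper `expand_rtp` that takes the alternatives as lists:

```python
from itertools import product

def expand_rtp(dest, left, op, right):
    """List every register transfer covered by one RTP with '|' alternatives."""
    return [f"{dest} := {a} {op} {b}" for a, b in product(left, right)]

for rt in expand_rtp("ar", ["ar", "ax"], "+", ["ay", "af"]):
    print(rt)
# ar := ar + ay
# ar := ar + af
# ar := ax + ay
# ar := ax + af
```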

  • Code selection example

    Figure: ADSP datapath [Analog Devices]. The d memory and p memory feed an ALU (+, -; inputs x and y) with registers ax, ay, ar, af, and a MAC (*, +, -; inputs x and y) with registers mx, my, mr, mf.

  • Examples of RTPs on the ADSP-210 datapath

        mr | mf := mr + (ar | mr | mx) * (my | mf)
        mr | mf := mr - (ar | mr | mx) * (my | mf)
        mr | mf := (ar | mr | mx) * (my | mf)
        ar | af := (mr | ar | ax) + (ay | af)
        ar | af := (mr | ar | ax) - (ay | af)

  • Example of code selection = covering of the intermediate representation with RTPs

    Figure: the data-flow graph (a * b = t1, c + d = t2, t1 + c = t3, t2 * t3) covered by three patterns, numbered 1 to 3.

    Loads:

        mx := dmem
        my := pmem
        ax := dmem
        ay := pmem
        mr := dmem

    Register transfers:

        mr := mr + (mx * my)
        ar := ax + ay
        my := ar
        mr := mr * my

  • Problems

    - local decisions which have a global impact
    - phase coupling; example:
        - asap schedule: maximal freedom for scheduling
        - code selection during scheduling
        - register allocation comes afterwards: can lead to infeasible solutions

  • Phase coupling: example 1

    Figure: registers R1, R2, R3 and units alu1, alu2; operations 1 to 4 shown in two variants (a) and (b).

  • Phase coupling: example 2

    Example of coupling between scheduling and register allocation [Mesman]

    Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, drawn once in general and once for the case in which u and v share the same register.

  • Phase coupling: discussion

    Phase coupling is difficult because of many constraints originating from irregular interconnect, special-purpose registers and non-orthogonal microcode.

    Figure [Mesman]: traditional (heuristic) code generation takes the application and the constraints and loops on an "OK?" test (yes/no); the design space seen by the code generator is drawn against the feasible space.

  • Phase coupling: discussion

    It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.

    Solution:
    1. Solve code generation for DSPs
    2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
        - efficiency = exploit instruction level parallelism (ILP)
        - compilation = systematic positioning of registers and regular interconnect
        - = VLIW = Very Long Instruction Word

  • Outline

    - architectures for programmable DSPs
        - multiplier-accumulator
        - modified Harvard architecture
        - extension with an ALU (decision making)
        - controller architectures
    - examples: TI, Motorola, Philips
    - code generation
    - recent developments: VLIW (Very Long Instruction Word)
        - principles
        - central register file + example TM
        - clustered VLIW + example C6
        - subword parallelism or SIMD

  • VLIW principles

    - multiple parallel FUs, possibly different and pipelined
    - pipelining is exposed to the compiler = no interlock mechanism
    - load-store architecture: all operands are fetched from/stored in register files, possibly multi-ported
    - each FU can receive an instruction every clock cycle
    - one instruction = many RISC instructions
    - each RISC instruction = one issue slot
    - no dependencies between different RISC instructions = orthogonal microcode = compiler friendly

  • VLIW architecture

    Figure: one register file shared by exec units 1 ... 25, each with its own issue slot (issue slot 1 ... issue slot 25); the instruction supplies the R&W addresses.

    - long instruction words, e.g. (3*7+4)*25 = 625
    - many ports on the register file, e.g. 75
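
    The two example numbers can be reproduced with a short calculation. The reading of the formula is an assumption: 3 register operands per issue slot, 7 address bits per operand (128 registers), 4 opcode bits, and one register-file port per operand. The function name is hypothetical.

```python
def vliw_cost(slots: int, operands: int = 3, reg_bits: int = 7, opcode_bits: int = 4):
    """Return (instruction width in bits, register-file ports) for a flat VLIW."""
    width = (operands * reg_bits + opcode_bits) * slots
    ports = operands * slots
    return width, ports

print(vliw_cost(25))   # (625, 75)
print(vliw_cost(5))    # (125, 15)
```

    Under the same assumptions a 5-slot machine needs 15 ports, which is the port count the deck quotes for the TM1000 register file.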

  • VLIW architecture: central Register File

    Figure: one register file serving issue slots 1 ... 3, with exec units 1 ... 9 grouped under the issue slots.

  • TM1000 DSPCPU

    - register file: 128 regs, 32 bit, 15 ports
    - instruction register: 5 issue slots, each feeding an exec unit
    - data cache (16 kB), instruction cache (32 kB), PC
    - functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

  • TriMedia TM32A processor

    - 0.18 micron
    - area: 16.9 mm2
    - 200 MHz (typ)
    - 1.4 W
    - 7 mW/MHz (MIPS = 0.9 mW/MHz)

  • Synthesised RF area (CMOS18, 64 bit)

    Area, speed and power dissipation grow more than linearly with the number of ports.

    Chart: register file area after place & route, 64-bit registers (area in mm2 against number of ports)

    | Nr of ports      | 32 regs | 64 regs | 128 regs |
    |------------------|---------|---------|----------|
    | 3 (wp=1, rp=2)   | 0.17    | 0.34    | 0.69     |
    | 6 (wp=2, rp=4)   | 0.296   | 0.748   | 1.49     |
    | 8 (wp=3, rp=5)   | 0.454   | 1.1     | 2.38     |

    Chart data: pre-layout results for a multi-port register file (area in um2 and delay in ns against number of ports)

    | Nr of ports       | Area 32 regs | Area 64 regs | Area 128 regs | Delay 32 regs | Delay 64 regs | Delay 128 regs |
    |-------------------|--------------|--------------|---------------|---------------|---------------|----------------|
    | 3 (wp=1, rp=2)    | 293515       | 632418       | 1174135       | 1.47          | 1.71          | 1.87           |
    | 6 (wp=2, rp=4)    | 487805       | 908972       | 1720413       | 1.57          | 1.76          | 2.02           |
    | 12 (wp=4, rp=8)   | 822034       | 1483850      | 2929639       | 1.65          | 1.86          | 2.06           |
    | 24 (wp=8, rp=16)  | 1395703      | 2601947      | 5754701       | 1.84          | 2.09          | 2.2            |

  • VLIW architecture: clustered Register Files

    Figure: three clusters, each with its own register file, two exec units and a copy unit (register file 1: exec units 1 and 2; register file 2: exec units 3 and 4; register file 3: exec units 5 and 6).

  • VLIW architecture: clustered Register Files

    Figure: register file 1 with FMUL and FADD, register file 2 with IMUL and IADD, register file 3 with IMUL and IADD. Example operations: FMUL r1,r2,r3; IADD r1,r2,r3; IMUL r1,r2,r3.

  • VLIW architecture: clustered Register Files

    Figure labels:
    - register file I0: IADD_01, IMOV_01 (FU00); IADD_00, LAND_00 (FU01); IMUL_00, SHFT_00 (FU02)
    - register file I1: IADD_10, IMOV_10 (FU10); IADD_11, LAND_10 (FU01); IMUL_10, SHFT_10 (FU02)

  • VLIW architecture: clustered Register Files (discussion)

    - performance loss (more instructions) compared to a central Register File, due to the extra cycle for a copy
        - 15-20 % for 2 clusters
        - 20-30 % for 4 clusters
    - limited scalability
        - not too many clusters
        - not too many registers within each cluster (too many RF ports)
    - adding copy ops in the compiler = the graph changes during scheduling

  • TMS320C62x VelociTI (fixed point)

    Figure: two datapaths, each with a register file 0-15 (32 bits) and four units (L1, S1, M1, D1 and L2, S2, M2, D2). Every unit has dst, src1 and src2 connections; the D units supply the store/load addresses, with separate store/load data paths.

    Unit functions, in the order listed in the figure:
    - L: int add, logical, bit count
    - S: int add, logical, bit manip, shift, constant, branch
    - M: int mult (16 => 32)
    - D: int add, load/store

  • VelociTI principles

    - parallelism (fetch-decode-execute), max 8 issue slots
    - pipeline critical sections (alu 1 cc, mult 2 cc, 200 MHz)
    - RISC (simple, atomic, independent instructions); performance comes from the compiler (pipelining, unrolling)
    - load-store
    - orthogonal (2 identical DPs, add on 6 units)
    - deterministic (no interlock)
    - conditional instructions (= guarding)
    - instruction packing

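  • VelociTI encoding

    Classical encoding: fetching many nops. Every instruction word has 8 slots, and the slots without a useful instruction (A ... H) hold a nop (n).

    VelociTI encoding: the 8 instructions A B C D E F G H are fetched together, and one bit per instruction tells whether the next instruction executes in parallel with it.

    | Bits     | Execution             | Words needed with classical encoding (figure) |
    |----------|-----------------------|-----------------------------------------------|
    | 00000000 | fully serial          | 8 words, one instruction each                 |
    | 11010010 | mixed serial/parallel | 4 words: A B C, D E, F, G H                   |
    | 11111110 | fully parallel        | 1 word: A B C D E F G H                       |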
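
    How the bits group a fetched packet into sets of instructions that execute together can be sketched as follows. The sketch assumes the usual reading of the parallel bit (1 = the next instruction executes in the same cycle as this one); the function name is hypothetical.

```python
def execute_packets(fetch_packet: str, p_bits: str):
    """Split a fetch packet into groups of instructions issued in the same cycle."""
    packets, current = [], ""
    for instr, p in zip(fetch_packet, p_bits):
        current += instr
        if p == "0":             # chain ends here; the next instruction starts a new cycle
            packets.append(current)
            current = ""
    if current:                  # a trailing 1 has no successor inside this packet
        packets.append(current)
    return packets

print(execute_packets("ABCDEFGH", "00000000"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
print(execute_packets("ABCDEFGH", "11010010"))  # ['ABC', 'DE', 'F', 'GH']
print(execute_packets("ABCDEFGH", "11111110"))  # ['ABCDEFGH']
```

    No nops are stored in any of the three cases, which is the code-size advantage over the classical encoding.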

  • Instruction cycle counts for BDTi benchmarks (the table of the earlier slide with the same title, shown again)

  • Subword parallelism (custom operators in TM)

    Figure: 1st input operand (byte3, byte2, byte1, byte0) op 2nd input operand (byte3, byte2, byte1, byte0) => output operand (byte3, byte2, byte1, byte0).

    - 32 bits = 4 bytes are processed independently
    - examples: +, -, min, max => quadumin, quadumax, ...

  • Subword parallelism (custom operators in TM)

    + faster execution
    - rewrite effort (e.g. different types for in- and outputs)

    Typical example: graphics (4 * 32 bit floating point)

    Scalar version:

        int size = 1000;
        byte out[size], in1[size], in2[size];

        for (i = 0; i < size; i++) {
            out[i] = in1[i] + in2[i];
        }

    Subword-parallel version, 4 bytes per iteration:

        int size = 1000;
        byte out[size], in1[size], in2[size];

        for (i = 0; i < size; i += 4) {
            packet4 t1 = packet4_load(&in1[i]);
            packet4 t2 = packet4_load(&in2[i]);
            packet4 t3 = packet4_add(t1, t2);
            packet4_store(&out[i], t3);
        }
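
    What a packed byte add has to guarantee is that no carry crosses a byte boundary. On a machine without such an operator this can be emulated on an ordinary 32-bit word; the sketch below shows one well-known way, with a hypothetical function name and with wrap-around (not saturating) byte arithmetic as an assumption.

```python
LOW7 = 0x7F7F7F7F    # lower 7 bits of every byte
MSB = 0x80808080     # top bit of every byte

def packed_add_bytes(a: int, b: int) -> int:
    """Add four bytes packed in 32-bit words; each byte wraps modulo 256."""
    low = (a & LOW7) + (b & LOW7)     # carries stop at bit 7 of each byte
    return low ^ ((a ^ b) & MSB)      # add the top bits without a carry out

print(hex(packed_add_bytes(0x01FF7F80, 0x01010101)))   # 0x2008081
```

    In the printed result the bytes are 0x02, 0x00, 0x80 and 0x81: the byte 0xFF + 0x01 wraps to 0x00 without disturbing its neighbour.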

  • Subword parallelism: MPEG example

        for (i = 0; i < ...; i++) {
            temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
            if (temp > 255) temp = 255;
            else if (temp < 0) temp = 0;
            destination[i] = temp;
        }

    Remark: simple example without interloop dependencies.

  • Loop unrolled four times:

        for (i = 0; i < ...; i += 4) {
            temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
            if (temp > 255) temp = 255;
            else if (temp < 0) temp = 0;
            destination[i+0] = temp;

            temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
            if (temp > 255) temp = 255;
            else if (temp < 0) temp = 0;
            destination[i+1] = temp;

            temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
            if (temp > 255) temp = 255;
            else if (temp < 0) temp = 0;
            destination[i+2] = temp;

            temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
            if (temp > 255) temp = 255;
            else if (temp < 0) temp = 0;
            destination[i+3] = temp;
        }

  • Loop body regrouped so that each group maps to one custom operator:

        /* group 1 => quadavg */
        temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
        temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
        temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
        temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

        /* group 2 => dspuquadaddui */
        temp0 += idct(i+0);
        if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
        temp1 += idct(i+1);
        if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
        temp2 += idct(i+2);
        if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
        temp3 += idct(i+3);
        if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;

        /* group 3 => one 32-bit store */
        destination[i+0] = temp0;
        destination[i+1] = temp1;
        destination[i+2] = temp2;
        destination[i+3] = temp3;
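
    The effect of the two operators on four bytes can be emulated on plain lists. The semantics used here are assumptions read off the regrouped loop body (a rounded average per byte, then an add of a signed value clipped to 0 ... 255); the Python function names merely echo the operator names and are not the TriMedia intrinsics themselves.

```python
def quadavg(a, b):
    """Per-byte rounded average of two groups of four unsigned bytes."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]

def quad_clipped_add(unsigned_bytes, signed_values):
    """Per-byte add of a signed value, clipped to the range 0 ... 255."""
    return [max(0, min(255, u + s)) for u, s in zip(unsigned_bytes, signed_values)]

back = [10, 255, 0, 1]
forward = [20, 255, 1, 2]
idct = [-20, 5, -1, 100]

prediction = quadavg(back, forward)
print(prediction)                          # [15, 255, 1, 2]
print(quad_clipped_add(prediction, idct))  # [0, 255, 0, 102]
```

    Two calls replace the four copies of the average, add and clip statements, which is where the speed-up of the subword-parallel version comes from.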

  • Will embedded CPUs and DSPs converge?

    Converging forces:
    - both include a hardware multiplier
    - trend in DSPs towards caches and RTK
    - trend in DSPs towards C/C++
    - common trend towards VLIW

    Diverging forces:
    - deeply embedded code (DSP) vs. end-user SW (CPU)
    - different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

    Conclusions on VLIW:
    - good balance between hw and sw, between efficiency (ILP) and cost
    - fundamental problems: code size, interruptability

    Trendlines (Exponxxx): Synthesis done by Excel.

    A 32-bit register file with 128 registers and 10R/5W ports is realisable in CMOS18; its area is 2.7 mm2. However, expanding to 64 bits makes things worse!

    The curve is exponential. Therefore, splitting the register file in two will reduce the area. See the vertical line from the 128 reg curve to the 64 reg curve. Next, we can reduce the number of ports per register file. Say 9 ports per file. In this setup the register files will be available only to the local issue slots of the cluster.

    It will also help in power, since the RF area is smaller.

