Date post: | 09-Jan-2016 |
Upload: | usama-javed |
Processor Architectures and Program Mapping
Programmable Digital Signal Processors
5kk10, TU/e
Henk Corporaal, Jef van Meerbergen, Bart Mesman
Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Topic 2: Programmable Digital Signal Processors
- instruction level parallelism (ILP)
- hardware support for loop control
- attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
- difficult to compare architectures
  - e.g. FFT: DIT, DIF, radix 2/4
  - loop unrolling, scaling, shuffling, initialisation can be included or forgotten
- benchmarking: Berkeley Design Technology Inc. (BDTi) (compare to SpecInt benchmarks for CPUs)

Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - examples: C6 and TM
Goal = 1 cycle per iteration
Modifications
- position ACR (1 or 2)
- adder/subtractor
- extra pipelines
- asymmetric inputs
- multi-precision
- extra inputs/outputs
DSP data types
- not every signal requires 32 bits
- 2 types of DSP: floating point and integer
- advantage of FP: most specs are in FP (conversion to integer is time consuming since the behaviour may change)
- disadvantage of FP: cost (area, speed, power)
- wanted: type of the output of an operation = type of the input (because both are stored in RAM)
  - no problem for FP, but it is for integer
  - integer multiplication doubles the number of bits: n * n => 2n
- What about fractional numbers?
DSP data types
- integer and fractional numbers are special cases of fixed point: fix<p,q> (ART Designer & SystemC), with p bits in total and q bits after the binary point
  - if q = 0 then integer, e.g. int
  - if q = p-1 then fractional
- 2's complement: the msb has a negative weight
- example: 1101101 with scale factor 1/8 = -19/8 = -2.375
- the bits below the lsb are lost: quantization error
- the same ALU handles all these fix types
DSP data types
- example: -19/8 * 97/16 = -1843/128
- Int * Int:
      s x x x
    * s y y y
    ---------
      s s z z z z z z
   => s z z z z z z 0   if FRCT = 1
- some processors (C54) have special instructions for fractional numbers (and a symmetric number domain, bounded by 2^(n-1))
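The fractional multiplication above can be sketched in C. This is a minimal sketch, assuming 16-bit fractional inputs (Q15); the function names are made up for illustration and are not C54 instructions. The left shift removes the duplicated sign bit, which is what the slide attributes to FRCT = 1.

```c
#include <stdint.h>

/* Multiply two 16-bit fractional numbers (assumed Q15 format).
 * The 16 x 16 product has two sign bits (s s z ... z); shifting left
 * by one gives s z ... z 0. The corner case -1 * -1 does not fit,
 * which is one reason for a symmetric number domain. */
int32_t q15_mul_frct(int16_t x, int16_t y)
{
    int32_t p = (int32_t)x * (int32_t)y;      /* s s z z ... z */
    /* shift done on unsigned to avoid undefined behaviour; the
     * conversion back assumes two's complement wrap-around */
    return (int32_t)((uint32_t)p << 1);       /* s z z ... z 0 */
}

/* Continue with the most significant word only (back to 16 bits).
 * Assumes an arithmetic right shift for negative values. */
int16_t q15_mul_msw(int16_t x, int16_t y)
{
    return (int16_t)(q15_mul_frct(x, y) >> 16);
}
```

With 16384 representing 0.5, q15_mul_msw(16384, 16384) gives 8192, i.e. 0.25.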
DSP data types
- continue (after multiplication) with the msb part only
  - represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
  - more efficient solution
- continue with msb + lsb
  - sum-of-product operations generate accumulated noise at the 32nd vs. the 16th bit
- still overflow for addition => overflow bits
- double precision accumulator + extra overflow bits + shift, round, truncate unit
Quantization characteristics Q(x) versus x (figures): rounding, value truncation, magnitude truncation

rounding (add 000.1, then drop the fraction bits)
  111.11 (-0.25) + 000.1 => 000 (0)
  111.01 (-0.75) + 000.1 => 111 (-1)
value truncation (drop the fraction bits)
  111.11 (-0.25) => 111 (-1)
  111.01 (-0.75) => 111 (-1)
magnitude truncation (for negative values add 001., then drop the fraction bits)
  111.11 (-0.25) + 001. => 000 (0)
  111.01 (-0.75) + 001. => 000 (0)
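The three quantization modes can be written down in a few lines of C. This is a sketch under the assumption of two fraction bits, as in the 111.11 examples; FRAC_BITS and the function names are illustrative only.

```c
#include <stdint.h>

#define FRAC_BITS 2   /* assumed: two fraction bits, as in 111.11 */

/* Value truncation: drop the fraction bits (rounds towards minus infinity).
 * Assumes an arithmetic right shift for negative values. */
int32_t quant_value_trunc(int32_t x)
{
    return x >> FRAC_BITS;
}

/* Rounding: add half an output lsb (000.1), then drop the fraction bits. */
int32_t quant_round(int32_t x)
{
    return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
}

/* Magnitude truncation: rounds towards zero.
 * C integer division truncates towards zero. */
int32_t quant_magnitude_trunc(int32_t x)
{
    return x / (1 << FRAC_BITS);
}
```

Here -0.25 is the raw value -1 and -0.75 is the raw value -3, so the six cases of the slide can be checked directly.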
Overflow characteristics (figures): saturation, zeroing, sawtooth
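A sketch of the three overflow characteristics, assuming a 16-bit result taken from a wider intermediate value; the function names are illustrative.

```c
#include <stdint.h>

/* Saturation: clip to the nearest representable value. */
int16_t ovf_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Zeroing: an out-of-range result is replaced by zero. */
int16_t ovf_zero(int32_t v)
{
    return (v > INT16_MAX || v < INT16_MIN) ? 0 : (int16_t)v;
}

/* Sawtooth: wrap around modulo 2^16, as plain 2's complement does. */
int16_t ovf_sawtooth(int32_t v)
{
    uint16_t low = (uint16_t)((uint32_t)v & 0xFFFFu);
    return (low >= 0x8000u) ? (int16_t)((int32_t)low - 0x10000)
                            : (int16_t)low;
}
```

Saturation costs extra hardware but keeps the error small; sawtooth comes for free and produces a full-scale error on overflow.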
- Von Neumann (sequential): EXU with one combined program/data memory
- Harvard: EXU with a program memory and a data memory
- Modified Harvard: EXU with a program memory and two data memories (data mem. 1, data mem. 2)
Goal for sum c(i) * x(i) = 1 cycle per iteration
Figure: RAM_A and RAM_B feeding the MAC
Figure: FIR filter with delay line x1 .. x5 (Z^-1 elements), coefficients c1 .. c5, multipliers and an adder producing y
- time loop around the filter loop over i: sum ci * xi
- How to update the delay line?
- 1 cycle/tap?
Solution 1: block move in memory
2 possibilities:
- complete move after every output sample is calculated
  - read and write the data twice
- move after the read of every datum separately
  - write the data twice
  - need for a special instruction (TMS320)
Memory location | output sample 1 | output sample 2 | output sample 3
1               | x1              | x2              | x3
2               | x2              | x3              | x4
3               | x3              | x4              | x5
4               | x4              | x5              | x6
5               | x5              | x6              | x7
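The block move can be combined with the read of every datum, as in the second possibility of Solution 1. A minimal C sketch, assuming x[0] holds the newest sample; fir_blockmove is an illustrative name, not a TMS320 instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* One output sample of an n-tap FIR filter.
 * The delay line x[0..n-1] is shifted by one place while it is read:
 * each datum is moved right after it has been used. */
int32_t fir_blockmove(int16_t *x, const int16_t *c, size_t n, int16_t in)
{
    if (n == 0) return 0;
    int32_t acc = 0;
    for (size_t i = n - 1; i > 0; --i) {
        x[i] = x[i - 1];                 /* move the datum one place */
        acc += (int32_t)c[i] * x[i];     /* and use it for this tap  */
    }
    x[0] = in;                           /* newest sample enters */
    acc += (int32_t)c[0] * x[0];
    return acc;
}
```

Every sample is written n times during its life in the delay line, which is the cost that Solution 2 avoids.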
Solution 2: indirect addressing
- use a pointer to mark the beginning of the delay line
- update the pointer instead of moving the data
- problem: trashing of the whole memory
- solution: modulo addressing
- need for a register to store the pointer
Memory location | output sample 1 | output sample 2 | output sample 3 | output sample 4 | output sample 5
1               | x1              |                 |                 |                 | x9
2               | x2              | x2              |                 |                 |
3               | x3              | x3              | x3              |                 |
4               | x4              | x4              | x4              | x4              |
5               | x5              | x5              | x5              | x5              | x5
6               |                 | x6              | x6              | x6              | x6
7               |                 |                 | x7              | x7              | x7
8               |                 |                 |                 | x8              | x8
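The pointer update with modulo addressing corresponds to a circular buffer in C. A minimal sketch; the delay_line struct and fir_circular are illustrative names, and the modulo is done with the % operator, whereas a DSP does it in the ACU.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int16_t *buf;    /* delay-line storage                */
    size_t   len;    /* modulo range                      */
    size_t   head;   /* pointer: index of the newest datum */
} delay_line;

/* One output sample of an FIR filter with ntaps <= d->len.
 * Only the pointer moves; each datum is written exactly once. */
int32_t fir_circular(delay_line *d, const int16_t *c, size_t ntaps, int16_t in)
{
    d->head = (d->head + 1) % d->len;    /* update the pointer */
    d->buf[d->head] = in;                /* overwrite the oldest datum */

    int32_t acc = 0;
    size_t idx = d->head;
    for (size_t i = 0; i < ntaps; ++i) {
        acc += (int32_t)c[i] * d->buf[idx];
        idx = (idx + d->len - 1) % d->len;   /* step back with wrap-around */
    }
    return acc;
}
```

Without the modulo the pointer would walk through the whole memory, as noted under Solution 2.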
IIR filter (figure): delay line y1 .. y5 (Z^-1 elements) fed from the output y, coefficients c1 .. c4, input x
Memory map: y1 .. y5 addressed through a pointer

2 filters
time loop:
  for j = 1..jtaps: sum d(j) * y(j)
  for i = 1..itaps: sum c(i) * x(i)
- 2 memory segments: x1 .. x5 with pntr 1 and modulo range 1; y1 .. y5 with pntr 2 and modulo range 2
- => 1 segment: y1 .. y5 followed by x1 .. x5, with pntr 1 and one modulo range
Mapping strategy
- define positions in RAM
  - constraint: variables that form a delay line go in consecutive places
- find a schedule
  - example: c1 => c2 => c3 => c4 => c5
- define ACU instructions
Memory map (figure): y1, y2, x1/y3, x2, x3 in one modulo range, pntr 1
Figure: filter with delay line x1 .. x8 (Z^-1 elements), coefficients c1 .. c8 and two adders producing the outputs ye and yo
ACU architecture and instruction set
(figure: registers A and S, modulo unit, output to RAM)

Instruction | Output to RAM | reg A | reg S
Read_A      | A             | A     | S
Read_S      | S             | A     | S
incA        | A+1           | A+1   | S
decA        | A-1           | A-1   | S
Step        | A+S           | A+S   | S
Inc_step    | S+1           | A     | S+1

Modulo can be implemented as a mask operation if the size is 2^k:
  16 = 10 000
  23 = 10 111
  mask = hold (the masked upper bits are held, the lower k bits wrap around)
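The mask operation can be sketched as follows, assuming a buffer of size 2^k that starts at a multiple of 2^k (16 .. 23 for k = 3); acu_step_modulo is an illustrative name.

```c
#include <stdint.h>

/* Address update A := A + S inside a modulo range of size 2^k.
 * The upper bits of A are held; only the lower k bits are updated. */
uint16_t acu_step_modulo(uint16_t a, int16_t s, unsigned k)
{
    uint16_t low_mask = (uint16_t)((1u << k) - 1u);
    uint16_t next = (uint16_t)(a + (uint16_t)s);   /* plain A + S */
    return (uint16_t)((a & (uint16_t)~low_mask)    /* hold upper bits */
                      | (next & low_mask));        /* wrap lower bits */
}
```

For example, stepping from 23 by +1 gives 16, and stepping from 16 by -1 gives 23.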
Mapping example
Memory (figure): addresses 16 .. 23; y1, y2, x1/y3, x2, x3 in the modulo range; pntr
Assume initialisation: A = pointer = 17, S = -2

ACU instruction | address
read_A          | 17
incA            | 18
incA            | 19
incA            | 20
incA            | 21
step            | 19
decA            | 18   (prepare new pointer for next iteration)
Addressing modes

mode       | example           | meaning
register   | ADD R4, R3        | R[R4] = R[R4] + R[R3]
immediate  | ADD R4, #3        | R[R4] = R[R4] + 3
direct     | ADD R4, (100)     | R[R4] = R[R4] + Mem[100]
indirect   | ADD R4, (R3)      | R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec | ADD R4, (R3)±     | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
indexed    | ADD R4, (R3±R2)   | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
- direct: for static data
- indirect: for arrays
- inc/dec: for stepping through arrays, e.g. x(n)
- index: for stepping through arrays, e.g. x(2n)
Addressing modes: extra for DSP
- 8 ARs (address or auxiliary registers) available
- extra indirect modes:

mode        | syntax     | meaning
circular    | *ARn±%     | post inc/dec by 1 - circular
circular    | *ARn±AR0%  | post inc/dec by AR0 - circular
bit reverse | *ARn±AR0B  | post inc/dec by AR0 - bit rev.
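Bit-reversed addressing is typically used to reorder FFT data. A software sketch of the index mapping only; the hardware mode produces the same sequence without a loop, and bit_reverse is an illustrative helper.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of index i (bits <= 16).
 * For an 8-point FFT (bits = 3): 0,1,2,3,4,5,6,7 -> 0,4,2,6,1,5,3,7. */
uint16_t bit_reverse(uint16_t i, unsigned bits)
{
    uint16_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (uint16_t)((r << 1) | (i & 1u));   /* shift out lsb, shift in at the other end */
        i >>= 1;
    }
    return r;
}
```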
Incorporation of an ALU
- regular data-flow algorithms ==> MAC
  - filtering, correlation, windowing etc.
- decision making ==> ALU
  - sorting filters (e.g. median filters)
  - interpolation (e.g. sqrt)
  - absolute value calculation
  - logarithmic conversion
  - finite field arithmetic (e.g. Galois field)
  - Viterbi
  - VLC, VLD
  - division
Figure: controller and datapath
- controller: PC (+1, interrupt address, stack, reset), program memory, IR, control bus
- datapath: ACU_A, AR_A, RAM_A, DR_A; ACU_B, AR_B, RAM_B, DR_B; MAC, ALU, Rfile
Bus-oriented instruction encoding

opcode | fields
00     | ALU, SX, SY, DX, DY, RF, ACU A/B
01     | MULT, SX, SY, DX, DY, RF, ACU A/B
10     | Imm. data, DX, DY, RF, ACU A/B
11     | Next address, BR, Cond, ACU A/B
First solution for sum c(i) * x(i)
- 6 clock cycles/sample
- limit pipelines in the controller
- schedule: resources versus time (cc)
- not shown: coefficient RAM + ACU
Schedule, one line per cycle (resources: ALU, MPY-ACC, RAM, ACU)

LABEL | operations
      | MPY-ACC: Acc = 0; ACU: init (i=0)
      | ALU: init counter
loop  | ACU: incr (=i+1)
      | RAM: read x(i)
      | MPY-ACC: acc(i) = acc(i-1) + x(i)*c(i)
      | ALU: dec counter
      | branch to loop if counter > 0
      | nop
Loopfolding (software pipelining) for sum c(i) * x(i)
- pre- and postamble
- 4 clock cycles/sample
Schedule, one line per cycle (resources: ALU, MPY-ACC, RAM, ACU)

LABEL | operations
      | MPY-ACC: acc(i-1) = 0; ACU: init (i=1)
      | ALU: init counter
      | RAM: read x(i); ACU: incr (=i+1)
loop  | MPY-ACC: acc(i) = acc(i-1) + x(i)*c(i); RAM: read x(i+1); ACU: incr (=i+2)
      | ALU: dec counter
      | branch to loop if counter > 0
      | nop
      | MPY-ACC: acc(n-1) = acc(n-2) + x(n-1)*c(n-1); RAM: read x(n)
      | MPY-ACC: acc(n) = acc(n-1) + x(n)*c(n)
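The folded schedule can be mimicked in C: the read of the next datum is moved into the iteration that multiplies the current one, which creates the pre- and postamble. A sketch with 0-based indices; dot_folded is an illustrative name, and on a sequential CPU the statements of course do not run in the same cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* sum c(i) * x(i) with the loop folded (software pipelined) once. */
int32_t dot_folded(const int16_t *x, const int16_t *c, size_t n)
{
    if (n == 0) return 0;
    int32_t acc = 0;
    int16_t xi = x[0];                   /* preamble: first read */
    for (size_t i = 0; i + 1 < n; ++i) {
        int16_t next = x[i + 1];         /* read x(i+1) ...           */
        acc += (int32_t)xi * c[i];       /* ... alongside MAC on x(i) */
        xi = next;
    }
    acc += (int32_t)xi * c[n - 1];       /* postamble: last MAC, no read */
    return acc;
}
```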
Hardware support for loop control for sum c(i) * x(i)
- 1 clock cycle/sample
- repeat instruction and repeat block

Schedule, one line per cycle (resources: ALU, MPY-ACC, RAM, ACU)

LABEL      | operations
           | MPY-ACC: acc(i-1) = 0; ACU: init (i=1)
           | ALU: init counter
           | RAM: read x(i); ACU: incr (=i+1)
repeat n-2 | MPY-ACC: acc(i) = acc(i-1) + x(i)*c(i); RAM: read x(i+1); ACU: incr (=i+2)
           | MPY-ACC: acc(n-1) = acc(n-2) + x(n-1)*c(n-1); RAM: read x(n)
           | MPY-ACC: acc(n) = acc(n-1) + x(n)*c(n)
TMS320C5000 (figure): datapath with T register, 17 x 17 multiplier, 40-bit adder, 40-bit ALU, 40-bit accumulators A and B, barrel shifter, sign control, fractional mode, zero/saturate/round units, MSW/LSW select, COMP, TRN, TC
Motorola 56K family (figure)
- X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
- Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
- program memory: 2,048-by-24-bit ROM
- address ALU with X, Y and P address buses; 16-bit address bus through the external address switch
- 24-bit data bus; X data, Y data, P data and global data buses; internal and external data-bus switches
- data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
- program controller; clock and interrupt inputs
- on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control; I/O ports
R.E.A.L. (figure)
- buses for X data, Y data, Z data
- instruction decoder, 96-bit instructions; program control unit
- program memory (Z data), 16-bit bus
- two 16-by-16-bit multipliers (inputs X, Y0, Y1; outputs P0, P1)
- two 40-bit arithmetic-logic units, with scale and saturation
- four 40-bit accumulators; saturation/scale, shift
RD16021 DSP

memories          | not included
process           | 0.35 µm, 5M
voltage           | 2.7-3.6 V
frequency         | 39 MHz (Tj = 85 C, 2.7 V, wcp)
area              | 3.9 mm2
power dissipation | 2.1 mW/MHz
Instruction cycle counts for BDTi benchmarks
(benchmark parameters: 16 taps, 40 samples, 8 biquads)

Function            | DSPgroup OAK | Motorola DSP561xx | ADI ADSP-218x | Lucent DSP16xx | TI TMS320 C54x | TI TMS320 C62xx | Lucent DSP16210 | Philips RD16020
Real block FIR      | 835          | 925               | 841           | 1240           | 684            | 334             | 780             | 448
Single sample FIR   | 21           | 23                | 22            | 26             | 18             | 17              | 16              | 20
Complex block FIR   | 3018         | 3043              | 3122          | 3123           | 2922           | 1294            | 1681            | 1470
IIR (8 sections)    | 51           | 45                | 43            | 65             | 44             | 30              | 38              | 37
Vector dot product  | 43           | 43                | 43            | 47             | 41             | 29              | 23              | 43
Vector add          | 122          | 85                | 83            | 123            | 61             | 36              | 43              | 63
Convolution encoder | 506          | 772               | 818           | 888            | 528            | 188             | 464             | 176
FSM                 | 284          | 375               | 198           | 415            | 455            | 147             | 301             | 167
256 pnt FFT         | 16514        | 12148             | 10633         | 21035          | 13234          | 4225            | 9016            | 5797

Rows with one empty cell whose column cannot be recovered (values in source order):
LMS adaptive        | 90, 64, 59, 101, 58, 33, 55
Vector maximum      | 41, 86, 128, 120, 111, 39, 40
Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
Compiler flow (figure)
- source => front end: lexical analysis, syntax analysis, semantic analysis
- => intermediate machine independent representation
- => code generation: code selection, register allocation, scheduling
  - 1 instr = parallel (//) ops
  - order of instr
- => code
Intermediate machine independent representation
- data flow graph (figure) with inputs a, b, c, d and intermediate values t1, t2, t3:
    t1  := a * b
    t2  := c + d
    t3  := t1 + c
    out := t2 * t3
- control flow graph (figure) of basic blocks BBi, BBj, BBk
Code selection
- A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]
- Notation: ar := ar | ax + ay | af means
    ar := ar + ay or ar := ar + af or ar := ax + ay or ar := ax + af
Code selection example: ADSP [Analog Devices] (figure)
- d memory and p memory
- ALU (+, -) with inputs x, y and registers ax, ay, ar, af
- MAC (*, +, -) with inputs x, y and registers mx, my, mr, mf
Examples of RTPs on the ADSP-210 datapath
- mr | mf := mr + (ar | mr | mx) * (my | mf)
- mr | mf := mr - (ar | mr | mx) * (my | mf)
- mr | mf := (ar | mr | mx) * (my | mf)
- ar | af := (mr | ar | ax) + (ay | af)
- ar | af := (mr | ar | ax) - (ay | af)
Example of code selection = covering of the intermediate representation with RTPs
(figure: the data flow graph with a, b, c, d, t1, t2, t3, covered by the RTPs labelled 1, 2 and 3)
- loads: mx := dmem, my := pmem, ax := dmem, ay := pmem, mr := dmem
- mr := mr + (mx * my)
- ar := ax + ay
- my := ar
- mr := mr * my
Problems
- local decisions which have a global impact
- phase coupling: example
  - asap schedule
  - maximal freedom for scheduling
  - code selection during scheduling
  - register allocation comes afterwards
  - can lead to infeasible solutions
phase coupling: example 1
(figure: registers R1, R2, R3 and units alu1, alu2; cases (a) and (b); operations 1 .. 4)
phase coupling: example 2
Example of coupling between scheduling and register allocation [Mesman]
(figure: operations Pu, Cu, Pv, Cv on values u and v, and their ordering if u and v share the same register)
phase coupling: discussion
- figure [Mesman]: application => traditional code generation (heuristic) => constraints OK? (yes / no); feasible space versus the design space seen by the code generator
- Phase coupling is difficult because of many constraints originating from irregular interconnect, special purpose registers and non-orthogonal microcode.
phase coupling: discussion
- It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.
- Solution:
  1. Solve code generation for DSPs
  2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
     - efficiency = exploit instruction level parallelism (ILP)
     - compilation = systematic positioning of registers and regular interconnect
     - = VLIW = Very Long Instruction Word
Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - principles
  - central register file + example TM
  - clustered VLIW + example C6
  - subword parallelism or SIMD
VLIW principles
- multiple parallel FUs, possibly different and pipelined
- pipelining is exposed to the compiler = no interlock mechanism
- load-store architecture: all operands fetched from/stored in register files, possibly multi-ported
- each FU can receive an instruction every clock cycle
- one instruction = many RISC instructions
- each RISC instruction = one issue slot
- no dependencies between different RISC instructions = orthogonal microcode = compiler friendly
VLIW architecture
(figure: one register file shared by exec units 1 .. 25, each with its own issue slot 1 .. 25; R&W addresses come from the instruction)
- long instruction words, e.g. (3*7+4)*25 = 625 bits
- many ports on the register file, e.g. 75
VLIW architecture: central Register File
(figure: one register file; issue slots 1 .. 3 and exec units 1 .. 9, three exec units per issue slot)
TM1000 DSPCPU (figure)
- register file: 128 regs, 32 bit, 15 ports
- instruction register: 5 issue slots, feeding 5 exec units
- data cache (16 kB), instruction cache (32 kB), PC
- functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
TriMedia TM32A processor
- 0.18 micron
- area: 16.9 mm2
- 200 MHz (typ)
- 1.4 W
- 7 mW/MHz (MIPS = 0.9 mW/MHz)
Synthesised RF area (CMOS18, 64 bit)
Area, delay and power dissipation grow more than linearly with the number of ports.
Register file area after place & route (chart data; CMOS18, 64-bit registers), area in mm2

Nr of ports | 32 regs | 64 regs | 128 regs
3           | 0.17    | 0.34    | 0.69
6           | 0.296   | 0.748   | 1.49
8           | 0.454   | 1.1     | 2.38

Pre-layout Module Compiler results, multi-port register file (CMOS18, 64 bit; experiments dated 9/27/01): area in um2, delay in ns

Nr of ports (wp, rp) | area 32 regs | area 64 regs | area 128 regs | delay 32 regs | delay 64 regs | delay 128 regs
3 (1, 2)             | 293515       | 632418       | 1174135       | 1.47          | 1.71          | 1.87
6 (2, 4)             | 487805       | 908972       | 1720413       | 1.57          | 1.76          | 2.02
12 (4, 8)            | 822034       | 1483850      | 2929639       | 1.65          | 1.86          | 2.06
24 (8, 16)           | 1395703      | 2601947      | 5754701       | 1.84          | 2.09          | 2.2
VLIW architecture: clustered Register Files
(figure: three clusters, each with its own register file, two exec units and a copy unit: register file 1 with exec units 1-2, register file 2 with exec units 3-4, register file 3 with exec units 5-6)

VLIW architecture: clustered Register Files
- register file 1: FMUL, FADD
- register file 2: IMUL, IADD
- register file 3: IMUL, IADD
- example operations: FMUL r1,r2,r3; IADD r1,r2,r3; IMUL r1,r2,r3
VLIW architecture: clustered Register Files
- REGISTERFILE I0
  - IADD_01, IMOV_01 : FU00
  - IADD_00, LAND_00 : FU01
  - IMUL_00, SHFT_00 : FU02
- REGISTERFILE I1
  - IADD_10, IMOV_10 : FU10
  - IADD_11, LAND_10 : FU01
  - IMUL_10, SHFT_10 : FU02
VLIW architecture: clustered Register Files
Discussion
- performance loss (more instructions) compared to a central Register File (due to the extra cycle for a copy)
  - 15-20 % for 2 clusters
  - 20-30 % for 4 clusters
- limited scalability
  - not too many clusters
  - not too many registers within each cluster (too many RF ports)
- copy ops are added in the compiler = the graph changes during scheduling
TMS320C62x VelociTI (fixed point)
(figure: two datapaths, each with a register file 0-15 (32 bits) and four units with dst, src1, src2 ports: L1, S1, M1, D1 and L2, S2, M2, D2; store/load data and store/load address paths)
- L: int add, logical, bit count
- S: int add, logical, bit manip, shift, constant, branch
- M: int mult (16 => 32)
- D: int add, load/store
VelociTI principles
- parallelism (fetch-decode-execute) (max 8 issue slots)
- pipeline critical sections (alu 1 cc, mult 2 cc, 200 MHz)
- RISC (simple, atomic, independent instructions): performance comes from the compiler (pipelining, unroll)
- load-store
- orthogonal (2 identical DP, add on 6 units)
- deterministic (no interlock)
- conditional instructions (= guarding)
- instruction packing
Classical encoding: fetching many nops (n = nop)
VelociTI encoding: instructions A .. H plus one bit per instruction

                      | classical encoding                                      | VelociTI encoding
Fully serial          | A .. H in 8 long instructions, all other slots are nops | ABCDEFGH 00000000
Mixed serial/parallel | A .. H in fewer long instructions, still many nops      | ABCDEFGH 11010010
Fully parallel        | ABCDEFGH in one long instruction                        | ABCDEFGH 11111110
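The bit strings above can be decoded with a short loop. This sketch assumes that a 1 means "the next instruction runs in parallel with this one" and a 0 ends the group; split_execute_packets and the array layout are illustrative, not a TI API.

```c
#include <stddef.h>

/* Split 8 fetched instructions into groups that execute together.
 * pbits[i] == 1: instruction i+1 executes in parallel with instruction i.
 * packet_len receives the size of every group; the return value is
 * the number of groups (= cycles needed to issue A .. H). */
size_t split_execute_packets(const int pbits[8], size_t packet_len[8])
{
    size_t npackets = 0;
    size_t len = 0;
    for (size_t i = 0; i < 8; ++i) {
        ++len;
        if (pbits[i] == 0 || i == 7) {   /* group ends here */
            packet_len[npackets++] = len;
            len = 0;
        }
    }
    return npackets;
}
```

For 11010010 this yields four groups: A B C, D E, F and G H. No nops have to be fetched in any of the three cases.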
Subword parallelism (custom operators in TM)
- 32 bits = 4 bytes are processed independently
- figure: 1st input operand (byte3, byte2, byte1, byte0) op 2nd input operand (byte3 .. byte0) => output operand (byte3 .. byte0)
- ex.: +, -, min, max => quadumin, quadumax, ...

Subword parallelism (custom operators in TM)
- advantage: faster execution
- disadvantage: rewrite effort (e.g. different types for in- and outputs)
- typical example: graphics (4 * 32 bit floating point)
/* scalar version: one byte per iteration */
int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i++) {
    out[i] = in1[i] + in2[i];
}

/* subword version: four bytes per iteration */
int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i += 4) {
    packet4 t1 = packet4_load(in1 + i);
    packet4 t2 = packet4_load(in2 + i);
    packet4 t3 = packet4_add(t1, t2);
    packet4_store(out + i, t3);
}
Subword parallelism: MPEG example
Remark: simple example without interloop dependencies

for (i = 0; i < ...; i++) {
    temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i] = temp;
}
Loop unrolled four times:

for (i = 0; i < ...; i += 4) {
    temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
    if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
    destination[i+0] = temp;

    temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
    if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
    destination[i+1] = temp;

    temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
    if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
    destination[i+2] = temp;

    temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
    if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
    destination[i+3] = temp;
}

Loop body regrouped per operation type:

    /* = quadavg */
    temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
    temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
    temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
    temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

    /* = dspuquadaddui */
    temp0 += idct(i+0); if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
    temp1 += idct(i+1); if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
    temp2 += idct(i+2); if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
    temp3 += idct(i+3); if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;

    destination[i+0] = temp0;
    destination[i+1] = temp1;
    destination[i+2] = temp2;
    destination[i+3] = temp3;
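What the two custom operators compute can be emulated in portable C. This is a sketch under assumptions: the _emul function names are made up, the byte order within the word is arbitrary, and the second operand of the clipped add is assumed to hold signed bytes; the exact TriMedia semantics should be checked against its documentation.

```c
#include <stdint.h>

/* Byte-wise rounded average of four packed unsigned bytes:
 * per byte (x + y + 1) >> 1, as in the first group of the loop body. */
uint32_t quadavg_emul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        uint32_t x = (a >> (8 * k)) & 0xFFu;
        uint32_t y = (b >> (8 * k)) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << (8 * k);
    }
    return r;
}

/* Byte-wise add of unsigned bytes (a) and signed bytes (b),
 * clipped to 0 .. 255, as in the second group of the loop body. */
uint32_t quadadd_clip_emul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        int32_t x = (int32_t)((a >> (8 * k)) & 0xFFu);
        int32_t y = (int32_t)((b >> (8 * k)) & 0xFFu);
        if (y >= 128) y -= 256;                 /* interpret as signed byte */
        int32_t s = x + y;
        if (s > 255) s = 255; else if (s < 0) s = 0;
        r |= (uint32_t)s << (8 * k);
    }
    return r;
}
```

With these two helpers one loop iteration handles four pixels: destination = quadadd_clip_emul(quadavg_emul(back, forward), idct) on packed words.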
Will embedded CPUs and DSPs converge?
Converging forces
- both include a hardware multiplier
- trend in DSPs towards caches and RTK
- trend in DSPs towards C/C++
- common trend towards VLIW
Diverging forces
- deeply embedded code (DSP) vs. end-user SW (CPU)
- different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
Conclusions: VLIW
- good balance between hw and sw, between efficiency (ILP) and cost
- fundamental problems: code size, interruptability
Trendlines (Exponxxx): Synthesis done by Excel.
A 32-bit register file with 128 registers, 10R 5W, is realisable in CMOS18. Its area is 2.7 mm2. However, expanding to 64 bits makes things worse!
The curve is exponential. Therefore, splitting the register file in two will reduce the area; see the vertical line from the 128 reg curve to the 64 reg curve. Next, we can reduce the number of ports per register file, say to 9 ports per file. In this setup each register file is available only to the local issue slots of its cluster.
It will also help in power, since the RF area is smaller.