
Code optimization

The Case of “The SHARC versus the Minnow” Part 2 – The Byte of the SHARC

M. R. Smith

Department of Electrical and Computer Engineering, University of Calgary, Alberta, Canada T2N 1N4

Email: smithmr @ ucalgary.ca
Phone: 403 – 220 – 6142
Fax: 403 – 282 – 6855

Developed for Electronic Design Magazine – April 2000
Updated July 2000, to be published October 2000
Not for general distribution until that time

A classic choice facing the embedded system designer is whether to add to legacy code, upgrade an existing working system by adding a co-processor, or go for an entirely new system. In this second part of a two-part article Mike looks at the available speed improvements by switching from DSP algorithms on an existing CISC processor to a DSP processor. Coding practices to get the most optimized code from a compiler are discussed. Techniques for further code optimizations, including parallel operations, are detailed.


Techniques to add DSP capability to an existing CISC system were discussed in the first part of this article (XXXXXXX). “C” code for part of a frequency analysis system was developed. Listing 1 shows a procedure that will generate both the instantaneous and average power of a complex-valued array.

It was shown that with this DSP code developed on the Software Development Systems’ 68K environment, the compiler could be persuaded to produce code almost as efficient as hand optimization. The words “the compiler could be persuaded” are key. The developer had to rewrite the code in the format shown in Listing 2. Here explicit pointer operations are used to force the compiler to generate the faster auto-incrementing addressing modes available on the 68K processor. Other speed improvements include evaluation of constant expressions not recognized by the compiler optimizer, and introducing a faster “down-counting” form of the loop.
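Listing 2 as reproduced later counts up; the “down-counting” variant mentioned above might look like the following sketch, which is illustrative only (it is not one of the article’s listings) and reuses the variables of Listing 2.

/* Hypothetical down-counting form of the Listing 2 loop body: counting
   towards zero often lets a compiler use a cheaper "decrement and branch
   if non-zero" sequence instead of a separate compare.                   */
short int count;
for (count = Npts; count != 0; count--) {
    re_power = *real++;
    im_power = *imag++;
    temp = re_power * re_power + im_power * im_power;
    *power++ = temp;
    totalpower += temp;
}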

Changing code writing from the explicit indexing into an array (Listing 1) to pointer operations (Listing 2) to gain performance is not particularly onerous. However, depending on the application, it may not be sufficient. In this part of the article, we discuss the advantages that could arise from using a processor customized for DSP operations. Examples are taken from code generated for the Analog Devices ADSP-21061 SHARC processor using the White Mountain Visual-DSP development environment. Techniques for increasing the use of parallel operations to produce faster code are discussed.

// short ints are 16-bit values on this machine
short int Power(short int real[], short int imag[],
                short int power[], short int Npts)
{
    short int count = 0;
    short int totalpower = 0;
    short int re_power, im_power;

    for (count = 0; count < Npts; count++) {
        re_power = real[count] * real[count];
        im_power = imag[count] * imag[count];
        power[count] = re_power + im_power;
        totalpower += re_power + im_power;
    }

    return (totalpower / Npts);
}

Listing 1: This DSP algorithm calculates the instantaneous power of an array of complex numbers and returns the average power. The time to perform this procedure is determined by the instructions in the loop, rather than the subroutine overhead, provided the number of points, Npts, in the array is not small.


Ignoring Speed Aspects, would the DSP Processor be better anyway?

In the last article, we discussed that the first consideration in adding new material to 68K legacy code was not one of speed but “will the code work?” Typically the decision to switch to a DSP processor is associated with speed. However, the developer should also ask whether there are other advantages to switching.

A comparison of the time it takes to execute a program on any processor can be obtained from the formula

    Execution Time = (Instructions / Program) * (Average Cycles / Instruction) * (Time / Cycle)
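As an illustration with hypothetical numbers (not taken from the article): a routine of 1,000 instructions averaging 4 cycles per instruction on a 25 MHz (40 ns) clock takes 1,000 * 4 * 40 ns = 160 microseconds, and halving any one of the three factors halves the execution time.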

Updating from an older 16-bit CISC processor to a newer 16-bit DSP processor allows a switch to more recent technology. This leads to a decrease in execution time as the Time / Cycle gets smaller, especially with faster implementation of the multiplication logic.

Switching to a 32-bit DSP processor offers other advantages. The wider data bus means that there are no penalties associated with using 32-bit data operations. DSP algorithms involve repeated multiplications and additions which can quickly lead to overflow of a 16-bit number representation and inaccurate results. Many 32-bit processors have pipelined floating-point operations. The pipelining means that there are no speed penalties when using FP operations rather than integer. Floating point algorithms are easier to design as the automatic renormalization of FP numbers removes the problems associated with number overflow. However, the designer should realize that a 32-bit FP operation has roughly the same precision as a 24-bit integer operation [1].
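As a rough illustration of how quickly the 16-bit representation overflows, consider the following sketch. It is a hypothetical host-side example, not one of the article’s listings.

/* Hypothetical illustration: accumulating squared 16-bit samples into a
   16-bit total overflows almost immediately, while a wider accumulator
   keeps the correct value.                                              */
#include <stdio.h>

int main(void)
{
    short int sample  = 200;      /* modest 16-bit input value          */
    short int total16 = 0;        /* 16-bit accumulator                 */
    long int  total32 = 0;        /* wider accumulator for comparison   */
    int i;

    for (i = 0; i < 10; i++) {
        /* 200 * 200 = 40000, which no longer fits in a signed 16-bit
           word, so the running 16-bit total wraps on typical
           2's-complement targets.                                      */
        total16 = (short int)(total16 + sample * sample);
        total32 = total32 + (long int)sample * sample;
    }
    printf("16-bit total = %d, 32-bit total = %ld\n", total16, total32);
    return 0;
}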

A 32-bit DSP processor offers other advantages. The wider data bus allows wider instructions to be fetched, which improves performance by decreasing Cycles / Instruction. In addition the wider data bus could provide enough bits to allow the description of parallel operations to decrease the overall number of instructions that need to be executed.

A particularly useful characteristic of DSP processors is the presence of hardware loops, hardware circular buffers and even zero-overhead bit-reverse addressing that is useful during FFT operations. Other features available with the more recent DSP processors include alternate register banks (for faster interrupt handling) and large on-board data and instruction caches.

The SHARC processor has a single cycle instruction where three memory accesses (2 data and an instruction), two memory address adjustments and a floating point multiplication are performed at the same time as parallel addition and subtraction operations. That provides for a 4000% improvement over the 68K processor without even changing the clock speed! The problem then becomes how to code so that the possible speed improvement becomes a true speed improvement.

// short ints are 16-bit values on this machine
short int Power(short int *real, short int *imag,
                short int *power, short int Npts)
{
    short int count = 0;
    short int totalpower = 0;
    short int re_power, im_power;
    short int temp;

    for (count = 0; count < Npts; count++) {
        re_power = *real++;
        im_power = *imag++;
        temp = re_power * re_power + im_power * im_power;
        *power++ = temp;
        totalpower += temp;
    }

    return (totalpower / Npts);
}

Listing 2: The “C” code must be written to explicitly use pointer arithmetic before the faster auto-incrementing indirect addressing instructions are generated by the SDS 68K compiler. The code generated from this listing had twice the performance of the code from Listing 1 and approached the performance of hand-optimized code.


First step in Customization – Variables in Registers

As with the 68K processor, the first step in customizing a SHARC routine for speed is to move frequently used variables into registers rather than leaving them in external memory. This is particularly important as the SHARC has a basic LOAD/STORE architecture, which does not support the direct memory to memory operations available on a CISC processor.

Listing 3 shows the stages of establishing a stack frame by the Visual DSP “C” compiler. Even this simple task uncovers some of the basic differences between the 68K CISC and 21K DSP architectures. There is no 21K equivalent to the complex MOVE Multiple operation that describes the storage of many non-volatile registers to memory in a single instruction. Instead each register is individually moved to the “C” stack. This difference eats up more program space on the SHARC than the equivalent 68K ROM space. However there is no speed penalty as the SHARC memory access is the more efficient.

Note the two-stage operation needed to store the SHARC index registers, I0, I1, I2, but the single stage operation to store the data registers, R3, R6 etc. This is a consequence of the Data Address Generators (DAG) block on the SHARC. The DAGs are separate ALUs dedicated to address calculations, which can occur in parallel with the standard data COMPUTE operations.

Two DAGs are needed as the SHARC can parallel three data operations – one along the “data” data bus and another along the “program” data bus. The third is from the “instruction-cache” data bus controlled by the program sequencer logic unit. The architecture to support the multiple data accesses means that there is not a direct path for the registers of a DAG to be saved into the memory block controlled by that specific DAG. The lack of the data path is not critical, as such operations are needed infrequently.

The SHARC provides a hardware stack for high speed subroutine entry and return. However this stack is small and not capable of handling the extensive stack usage needed to support “C”. As with the 68K processor, one of the SHARC index registers (i6) is set aside as a frame pointer to allow efficient handling of stack frame operations. A second index register (i7) acts as a C-TOP-OF-STACK pointer to play the software equivalent of the 68K hardware stack pointer. Special SHARC instructions, not listed in the standard USER manual, are available to support “C” stack operations. For more information on the logistics of coding “C” subroutines on the SHARC see the online article SHARC IN THE “C” [2].

#define CSTACKTOP I7
#define FP        I6

// Establish the stack frame
// LINK #-32, SP EQUIVALENT
modify(CSTACKTOP,-8);

// Save the non-volatile data registers (rn)
// and non-volatile index registers (in)
// MOVEM EQUIVALENT
dm(-9,FP)=r3;
dm(-8,FP)=r6;
dm(-7,FP)=r10;
dm(-6,FP)=r11;
dm(-5,FP)=r13;

// Saving certain index registers to the data memory
// stack takes two steps – architecture considerations
r2=i0;  dm(-4,FP)=r2;
r2=i1;  dm(-3,FP)=r2;
r2=i2;  dm(-2,FP)=r2;

// The first three subroutine parameters must be moved
// from the data registers in which they were
// passed into index registers
i0=r4;   // &real[0];
i1=r8;   // &imag[0];
i2=r12;  // &power[0];

// The fourth parameter is passed, a la 68K, via the stack
r11=dm(1,FP);  // Npts;

Listing 3: The stack frame from the ADSP-21K code has considerable similarity to the 68K stack frame establishment once changes in processor architecture have been taken into account.


The first three parameters are passed on the SHARC via registers, rather than through the memory-based “C” stack as commonly occurs with 68K code generation. Passing parameters via registers is not always as efficient as it sounds. The SHARC architecture does not support general-purpose registers. This means that the pointer parameters of Power(), although passed in data registers (r4, r8 and r12), must then be moved into index registers before use. The fourth parameter, Npts, was passed, 68K fashion, on the stack. With today’s processors with on-chip memory, there is really no significant time penalty using either register or memory stack parameter passing conventions. The only exception would be the time lost passing parameters during subroutine calls embedded in a frequently used loop. Then it is a compiler rather than an architecture problem, as such situations are best handled using in-line subroutines to avoid subroutine overhead.
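Where a small helper really is called from inside a frequently executed loop, the in-lining mentioned above might look like the following sketch. The helper name and the use of the C inline keyword are assumptions for illustration, not something taken from the article’s listings.

/* Hypothetical sketch: an inline helper invites the compiler to expand
   the body at each call site, so no call, return or parameter-passing
   overhead is left inside the loop.  Assumes a compiler that honours
   the "inline" keyword or an equivalent pragma.                        */
static inline short int complex_power(short int re, short int im)
{
    return (short int)(re * re + im * im);
}

/* Used inside a loop, e.g.
       power[count] = complex_power(real[count], imag[count]);          */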

The Minnow inside the SHARC

It is possible to program the SHARC as if it were a 68K-style processor. Listing 4 shows this approach. Here code was developed for the Power() subroutine using the White Mountain Visual-DSP compiler with all optimization deactivated. This SHARC code will run much faster than the equivalent 68K code. However, as the code does not take into account the differences between the two architectures, there is much hidden inefficiency.

The for-loop code is converted into the while-loop format rather than the more efficient do-while format. On a 68K processor, there is a small penalty for using one format of loop over another. However the high degree of pipelining present in the SHARC architecture means that every jump instruction has a series of hidden NOPs associated with it as the three-stage instruction pipeline is flushed.
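In “C” terms the two loop shapes differ roughly as follows. This is an illustrative sketch, not compiler output from the article, and the function names are hypothetical.

/* Illustrative only: the same summation written in the two loop shapes.
   On a deeply pipelined processor the do-while form executes one fewer
   taken jump per iteration.                                             */
short int sum_while(short int *data, short int Npts)
{
    short int count = 0, total = 0;
    while (count < Npts) {        /* test at the top of every pass       */
        total += *data++;
        count++;                  /* ends with a jump back to the test   */
    }
    return total;
}

short int sum_do_while(short int *data, short int Npts)
{
    short int count = 0, total = 0;
    if (Npts > 0) {               /* single guard test before the loop   */
        do {
            total += *data++;
            count++;
        } while (count < Npts);   /* one conditional jump per pass       */
    }
    return total;
}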

Both the 68K and 21K processors have instructions that support direct indexing through an array. The 68K has an indirect addressing mode where an address register (A0) can be used to point to the start of an array, and a data register (D0) used to step through the array elements.

MOVE.L (A0, D0), D1

The equivalent 21K instruction is the pre-modify operation

R1 = dm(M4, I4)

where the SHARC index and modify registers (I4 and M4) play the same roles as the 68K address and data index registers.

The 68K D0 register can be efficiently used both as an array index and as a loop counter. However, the modify register, M4, forms part of the SHARC DAG block, which permits direct mathematical operations on the index registers but not on the modify registers. The loop counter must be transferred into the modify register before each memory access.

Other interesting SHARC architectural features are revealed by the code in Listing 4. The single cycle SHARC multiplication operation must be specified with the signed-signed-integer (SSI) syntax as the SHARC supports both integer and fractional number representations. The SHARC supports three-register operations rather than the 68K two-register instructions. Finally, all short int operations are 32 bits wide (rather than 16) with no time penalty. The 32-bit wide data operations mean that the SHARC DSP algorithm is less likely to overflow the short int number representations.

The SHARC processor does not have the inefficient, multi-cycle division operation present on the 68K processor. In fact the SHARC does not have a division instruction at all! Instead division is handled through a subroutine based around an 8-bit approximation generated from a RECIPR instruction. A peculiarity of the Visual-DSP “C” compiler is that it generates specific subroutine calls when asked to perform integer division, but inline-codes the equivalent floating point operations. This is strange since the subroutine overhead (around 10 cycles) to call, and return from, the division code is as large as the division code itself.

Many of the coding inefficiencies disappear when the “C” code has been rewritten using pointer arithmetic (Listing 2) rather than array indexing (Listing 1). Part of the Visual-DSP “C” model is to use a dedicated modify register (dm_one in Listing 5) to store constants associated with basic array handling.

However, with the more efficient code, the inefficiencies associated with the loop control become more important. We can define Loop Efficiency as

    Loop Efficiency = (Useful Cycles In Loop) / (Total Cycles In Loop)

The Loop Efficiency drops from 75% for the code in Listing 4, to around 50% for the faster loop in Listing 5.

        r13=0;
        r3=0;                      // loop counter

_L$500007:                         // conditional test – for loop
        comp(r3, r11);             // 2 hidden NOPs during non-delayed branch
        if ge jump(pc,_L$500009);

! line 53                          // determining re_power
        m4=r3;                     // offset calculation
        r2=dm(m4,i0);              // dm[i0+M4]
        m4=r3;
        r1=dm(m4,i0);
        r6=r2*r1 (SSI);            // SSI not SSF

! line 54                          // determining im_power
        m4=r3;
        r12=dm(m4,i1);             // dm[i1+M4]
        m4=r3;
        r8=dm(m4,i1);
        r10=r12*r8 (SSI);

! line 55                          // storing power[count]
        r2=r10+r6;
        m4=r3;
        dm(m4,i2)=r2;              // dm[i2+M4]

! line 56                          // calculating total power
        r1=r6+r10;
        r13=r13+r1;

! line 57                          // incrementing the loop counter
        r3=r3+1;
        jump(pc,_L$500007);        // 2 NOPS

! line 59                          // NO DIVISION INSTRUCTION
_L$500009:
        r8=r11;
        r4=r13;
        // Instructions hidden in delay slot
        cjump ___divsi3 (DB);
        dm(i7,m7)=r2;
        dm(i7,m7)=pc;

Listing 4: The non-optimized ADSP-21K code from Listing 1 bears many similarities to the 68K code. There are many architectural differences – single cycle instructions, single cycle multiplications, delayed branches and no explicit division instruction.

        r13=0;
        r11=0;                     // Loop counter
        dm_one = 1;                // Modify register

_L$750010:                         // conditional test – for loop
        comp(r11, r14);            // 2 Hidden NOPS
        if ge jump(pc,_L$750012);

! line 71                          // Accessing memory using a post-modify mode
                                   // MOVE (I0)+, R3 EQUIVALENT
        r3=dm(i0,dm_one);

! line 72                          // MOVE (I1)+
        r6=dm(i1,dm_one);

! line 73                          // Calculating the power
        r2=r3*r3 (SSI);
        r1=r6*r6 (SSI);
        r10=r1+r2;

! line 74                          // MOVE R10, (I2)+
        dm(i2,dm_one)=r10;

! line 75                          // Calculating the total power
        r13=r13+r10;

! line 76                          // incrementing the loop counter
        r11=r11+1;
        jump(pc,_L$750010);        // 2 NOPS

_L$750012:                         // NO DIVISION INSTRUCTION
        r8=r14;
        r4=r13;
        cjump ___divsi3 (DB);
        dm(i7,m7)=r2;
        dm(i7,m7)=pc;

Listing 5: The non-optimized ADSP-21K code from the “C” routine using pointer arithmetic (Listing 2) is more efficient than the code generated for array indexing. Note that 21K short ints are 32 bits long rather than 16, with no loss of speed.


Speed improvement available by activating the compiler optimizer

Listing 6 shows the changes in code efficiency on activating the optimizer when compiling SHARC “C” code involving explicit array indexing (Listing 1). First, the loop efficiency has jumped from 75% to 100% with the activation of the SHARC processor’s zero-overhead hardware loop capability.

Even more important, the speed of the key operations in the loop has jumped by almost 100% as the number of instructions has dropped from 15 to 8. The optimizer built into the Visual-DSP compiler has recognized the simple mode of indexing through the arrays and has switched to the more efficient post-modify addressing mode associated with incrementing a pointer through an array. Note that the optimizer did not recognize the common expression

re_power + im_power

in the code, which is why the loop body in Listing 6 is 8 cycles compared to 7 cycles for the loop body in Listing 5, where the developer had incorporated the common expression optimization directly into the “C” code.

The “C” code of Listing 2 was developed directly using the incrementing-pointer form of array access. Activating the optimizer increases the loop efficiency from 50% to 100% with the introduction of hardware loops. However, the instructions in the loop body itself remain at 6 cycles.

Some interesting characteristics of the SHARC ALU architecture are exposed at the start of Listing 6. Prior to entering the hardware loop, there is a test on the number of points, Npts, to be processed. There is a 21K COMP instruction that is the equivalent of the 68K CMP instruction to set the ALU flags prior to a conditional jump. However, in this code the specialized ALU COMPUTE operation PASS is used to set the flags. Unlike the 68K register-to-register MOVE instructions, SHARC operations of the form

    i6 = r4

do not change the ALU flags, so that the condition set by the PASS operation is unchanged until needed by the (later) conditional jump.

        // Getting the Npts parameter and testing it
        r10=dm(1,FP);
        r10=pass r10;              // Test occurs

        // Setting the index registers
        i0=r4;
        i1=r8;
        i2=r12;

! line 52                          // Using the result of earlier test
        if le jump(pc,_L$500009);

        // 1 memory access is brought out of the loop
        r2=dm(i0,dm_one);

        // Zero-overhead, hardware loop
        // Note the strange end-loop address
        lcntr=r10, do(pc,_L$566002-1) until lce;

        // Optimized memory accesses
        r0=dm(i1,dm_one);
        r13=r2*r2 (SSI);
        r11=r0*r0 (SSI);
        r8=r11+r13;
        dm(i2,dm_one)=r8;
        r4=r13+r11;
        r2=dm(i0,dm_one);
        r6=r6+r4;                  ! end loop

_L$566002:                         // Label is “1 past the loop”

_L$500009:                         // Division subroutine
        r8=pass r10;
        r4=pass r6;
        cjump ___divsi3 (DB);
        dm(i7,m7)=r2;
        dm(i7,m7)=pc;

Listing 6: The ADSP-21K compiler optimization recognizes the inefficiencies in the multiple use of the offset addressing modes used in the “C” of Listing 1 and switches to a post-modify (auto-incrementing) mode. Note the introduction of zero-overhead hardware-controlled for-loops.


The PASS operation offers several advantages over the COMP instruction. First, it is faster, as the SHARC LOAD/STORE architecture only permits comparison between registers, and not between registers and constants.

R10 = PASS R10;     // Set Flags

but

R11 = 0;            // Build Constant
COMP(R10, R11);     // Set Flags

The second advantage is the use of the PASS instruction when invoking the SHARC parallel instruction capability. There are insufficient bits available in the SHARC’s 48-bit opcode to describe the generalized movement of any two of the many SHARC registers at the same time, as in the instruction

R4 = R5, R6 = R7; // ILLEGAL

However, there are sufficient opcode bits to describe a combination of a more specific COMPUTE instruction operating in parallel with a more general register operation or memory access operation

R4 = PASS R5, R6 = R7;
or
R4 = PASS R5, R6 = dm(I4, M4);

Later in this article, it will be seen that the ability to parallel PASS instructions with other operations is a key feature in obtaining maximum code efficiency.

Trying to persuade the compiler to take advantage of other SHARC speed features

The SHARC architecture has been built for efficient DSP algorithm production. Listing 7 shows an attempt to activate these features directly from code written in “C”.

The code employs floating point variables rather than integer variables. Floating-point variables automatically renormalize rather than overflow the number representation. This simplifies code development. There are no time penalties associated with this approach as SHARC floating point operations are as fast as SHARC integer operations. However, the developer should remember that the 32-bit floating-point number representation has the same precision as a 24-bit integer representation.

The SHARC architecture supports many parallel operations involving the multiplier and adder. The loop has been “unrolled” to “encourage” the compiler to use these parallel operations. Optimization could include paralleling the additions in the first part of the loop (green) with multiplications from the second part of the loop (red). This particular optimization will only work if the number of points to be processed, Npts, is even. This is not a real limitation for most DSP algorithms, but its implications will be discussed later.

// This routine attempts to gain efficiencies using
// the SHARC Harvard architecture with its
// Data Memory Data Bus (DM) (default)
// and Program Memory Data Bus (PM).

// The loop has been unrolled to offer the compiler
// opportunities to optimize the code by
//   Multiple memory accesses in the same instruction
//   Parallel operations of multiplication and addition

float Power(float dm *real, float pm *imag,
            float dm *power, short int Npts)
{
    short int count = 0;
    float totalpower = 0;
    float re_power, im_power;
    float temp;

    // Following unrolled code works for Npts divisible by 2
    if ((Npts % 2) != 0)
        exit(0);

    for (count = 0; count < Npts / 2; count++) {
        re_power = *real++;
        im_power = *imag++;
        temp = re_power * re_power + im_power * im_power;
        *power++ = temp;
        totalpower += temp;

        re_power = *real++;
        im_power = *imag++;
        temp = re_power * re_power + im_power * im_power;
        *power++ = temp;
        totalpower += temp;
    }
    return (totalpower / Npts);
}

Listing 7: The SHARC architecture offers a Super Harvard architecture to bring in data values along two busses, dm and pm, in parallel and also the ability to combine multiplication and addition operations.


The SHARC architecture supports parallel multiple memory accesses along a “program” memory data bus (pm) and a “data” memory data bus (dm). Clashes between data and instruction fetches along the pm data bus are avoided by storing instructions in a cache. Only instructions that would conflict with pm data fetches are stored in the cache.

To “encourage” the compiler to activate this feature during code generation, the complex-number array has its real components, real, placed in a data memory data location and its imaginary components, imag, placed in a program memory data location. This is achieved through an extension of the “C” language where parameters of the form

float dm *real;
float pm *imag;

are passed. The arrays must be declared as global variables

float dm real[Npts];
float pm imag[Npts];

since the standard “C” auto variables would only be placed on the “C” stack, which is built entirely in data memory.

The optimized code generated by the Visual-DSP “C” compiler from Listing 7 is shown in Listing 8. For clarity, the instructions associated with the first and second parts of the unrolled loop have been colour coded.

Speeding the Optimized Compiler Output

Using the syntax float dm * and float pm * to describe the subroutine pointers has resulted in code where the floating-point data values are brought in along both of the SHARC data busses

R13 = dm(I0, dm_one);
R3  = pm(I8, pm_one);

There are some interesting architectural features revealed in these two instructions. It would appear that once again integer values are being accessed from memory, as the values are stored in integer registers (R3 and R13) rather than floating-point registers (F3 and F13). On the SHARC, each data register can be used for storage of either floating-point or integer values. The floating point characteristics reside in the ALU rather than in the bit pattern stored in memory or a register.

The two memory banks can be accessed in parallel only when index and modify registers from DAG1 are used to access data memory and index registers from DAG2 are used to access program memory.

        // Conditional Test
        r11=pass r11;              // check loop counter
        if le jump(pc,_L$750014);

        // Instruction moved outside the loop
        // Looks like an integer access of a float variable
        r13=dm(i0,dm_one);         // real[] on dm

        // Hardware loop
        lcntr=r11, do(pc,_L$816004-1) until lce;

        // Access to imag[] data along pm bus as wanted
        r3=pm(i8,pm_one);          // imag[] on pm
        F8=F13*F13;
        F12=F3*F3;
        F13=F8+F12;

        // Part of second part of the loop
        r3=pm(i8,pm_one);          // imag[] on pm

        F10=F10+F13;
        dm(i1,dm_one)=r13;         // power[] on dm

        r13=dm(i0,dm_one);         // real[] on dm

        F9=F13*F13;
        F14=F3*F3;
        F13=F9+F14;

        dm(i1,dm_one)=r13;         // power[] on dm

        F10=F10+F13;

        r13=dm(i0,dm_one);         // real[] on dm
                                   ! end loop

_L$816004:

Listing 8: The compiler recognizes that access can occur along the pm data bus but makes no attempt to parallel these fetches with dm data bus accesses. It takes 7 cycles to process one data point.


The compiler has prepared for this optimizing by generating code using the appropriate DAG1 registers, I0 and M6 (dm_one = +1), together with the DAG2 registers I8 and M14 (pm_one = +1). Hand optimization is needed to produce the parallel operations

R13 = dm(I0, dm_one), R3 = pm(I8, pm_one);

to remove two cycles from the 14 cycle loop execution time. The total loop cycle count can be further reduced to 10 by paralleling addition and memory store operations using the R13 (F13) register

F10=F10+F13, dm(I1,dm_one)=R13

As shown in Listing 9, further savings can be obtained. The dual memory fetches to dm and pm memory from the second part of the loop (red) can be overlapped with the initial multiplication in the first part of the loop (green). However, the paralleling of these instructions introduces data dependencies between the use of F3 and F13 in the first and second parts of the loop. This dependency is broken by changing the compiler code to use registers F1 and F4. It will become apparent shortly why these particular registers were chosen rather than F4 and F14.

To save the final two cycles it is necessary to parallel the two addition operations from the first part of the loop with the multiplications in the second part of the loop. These operations could be made to occur in conjunction with memory accesses to form parallel instructions involving half of the SHARC data registers in a single instruction

Fa = Fb * Fc, Fd = Fe + Ff, Fg = dm(Iu,Mv), Fh = pm(Ix,My);

There would not be enough room in the 48-bit wide SHARC opcode if this instruction were so general as to allow the use of any data registers in any position. Restrictions are placed on certain of the registers used to reduce the number of bits needed to describe the operation.

Multiplication registers:
    Fb is one of F0, F1, F2 or F3
    Fc is one of F4, F5, F6 or F7

Addition registers:
    Fe is one of F8, F9, F10 or F11
    Ff is one of F12, F13, F14 or F15

To allow these parallel operations to be programmed requires some forethought. If all registers could be used during parallel operations, then the sequence

R1=dm(I0,dm_one), R4=pm(I8,pm_one);
F9 = F1 * F1,  F13 = F8 + F12;
F14 = F4 * F4, F10 = F10 + F13;

could be performed. The opcode bit limitation means that multiplication operations of the form F1 * F1 or F4 * F4 can’t be described in parallel with addition operations. However the operations F1 * F5 (with F5=F1) and F2 * F4 (with F2=F4) can be described. It takes only a little ingenuity to place equal values in two registers without adding additional cycles to the loop. Equality can be achieved by dual accesses to memory

IF   R1 = dm(I0, dm_zero)
then R5 = dm(I0, dm_one);

or register to register transfers

IF   R1 = dm(I0, dm_one)
then F5 = F1;
or   F5 = PASS F1;

        // Hardware loop
        lcntr=r11, do(pc,_L$816004-1) until lce;

        // Dual access along dm and pm data busses
        r13=dm(i0,dm_one), r3=pm(i8,pm_one);

        // pm_zero contains zero to suppress the auto-incrementing mode
        F8=F13*F13, r1=dm(i0,dm_one), r4=pm(i8,pm_zero);

        // The value in F1 must be passed over to F5 in order to
        // prepare for the combined multiplication and addition operation
        F12=F3*F3, F5=F1;

        // Accessing pm memory is an alternate approach to preparing
        // for parallel multiplication and addition operations
        // One cycle overhead first time round the loop
        F9=F1*F5, F13=F8+F12, r2=pm(i8,pm_one);

        F14=F2*F4, F10=F10+F13, dm(i1,dm_one)=r13;

        F13=F9+F14;
        F10=F10+F13, dm(i1,dm_one)=r13;
                                   ! end loop

_L$816004:

Listing 9: The code produced by the optimizing compiler can be reduced from 14 cycles to 7 cycles by hand-customizing the parallel operations permitted with the SHARC architecture.


With the on-board SHARC memory, there is not the penalty in doing a memory to register transfer rather than a register to register transfer that there would be on the 68K processor. Which of the three approaches is taken depends on which parallel operations are available for use without penalty in other parts of the loop.

Theoretical Maximum Speed of the Loop Code

We have seen it is a fairly straightforward procedure to reduce the 14 cycle loop produced by the optimizing compiler to 7 cycles. However, what is the theoretical maximum speed of this loop, and how can this speed be achieved in practice? Figure 1 shows the resource usage for the instructions needed to calculate the instantaneous and average power. Registers have been assigned to allow parallelization of a number of these calculations.

Maximum speed is achieved when a resource is used to a maximum. There are two cycles each associated with additions, multiplications, and program memory operations. In theory, if the data memory accesses could be ignored, the loop cycle count could be reduced from seven cycles per power calculation to just 2. This optimum coding sequence is shown in Figure 2 with a number of power calculations occurring in parallel – unrolling the loop.

Figure 1. The resource usage associated with the instructions to calculate the instantaneous and average power of the numbers in a complex-valued array.

Figure 2. The resource usage associated with 7 sets of instructions to calculate power. Note the optimal use of the multiplication, addition and pm memory access resources. Resource conflicts occur with the dm memory access resource.


Achieving the maximum throughput on the SHARC

The simplest way to avoid the dm data memory access conflict is to move two of these accesses to a separate line. This would produce code where it would take an average of 3 cycles to produce the instantaneous and average power levels of the complex-valued arrays. However, it is possible to duplicate the register values using an additional memory access combined with a COMPUTE operation. This technique is demonstrated in Figure 3, which shows the final code “rerolled” back into a loop.

Note the stages of the optimized algorithm. First there are a series of instructions used to “prime” the computational pipeline prior to the loop. Then come the operations within the loop itself. Finally there are instructions used to “empty” the computational pipeline. For this particular set of code it also proved possible to account for situations when Npts was odd, something that was not straightforward to optimize with the original “C” code.
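In “C” terms, the odd-Npts case can be handled by peeling one element off before the unrolled loop, roughly as in the following sketch. It is hypothetical, reuses the variables of Listing 7, and stands in for what the article does at the assembly level.

/* Hypothetical sketch: process one point up front when Npts is odd, so
   the unrolled loop only ever sees an even number of remaining points. */
if ((Npts % 2) != 0) {
    re_power = *real++;
    im_power = *imag++;
    temp = re_power * re_power + im_power * im_power;
    *power++ = temp;
    totalpower += temp;
}
/* ... then the unrolled loop of Listing 7 runs for Npts / 2 passes ... */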

Conclusion

In this second part of a two-part article, we have taken a brief look at the SHARC – Analog Devices’ ADSP-2106X series of DSP processors. We compared the characteristics of the SHARC with a simple CISC processor – the Motorola 68K, whose equivalents are frequently found in embedded systems. We compared these two processors to show why, even with equal clock speeds, the SHARC has the capability of outperforming the CISC processor by around 4000%. However, this was a theoretical speed improvement rather than an actual one.

It was shown that, with the optimizer activated, the White Mountain Visual-DSP development environment generated 2106X assembly language code with a fair degree of speed. By hand optimization of “unrolled” code, it was fairly straightforward to activate the parallel operations available with the SHARC architecture to improve the code speed by another 200%. A detailed analysis of the code at the “resource” level revealed techniques that allowed a further 300% speed improvement. The final code had a speed that was close to the theoretical maximum processor speed.

The author is currently working on a package, DIGICAP, to automate the process of producing high speed code for the SHARC processor.

Figure 3. Utilizing a hardware loop to avoid pipeline stalls, the final code maximizes the use of the resources and has a speed approaching the theoretical maximum throughput – addition, multiplication and two memory accesses all occurring in parallel.


About the Author

Mike Smith is a professor in Electrical and Computer Engineering at the University of Calgary, Canada. He teaches in the area of introductory and advanced microprocessors and controllers. His research area is in high speed DSP algorithms for biomedical applications.

References

[1] M. Smith and L. E. Turner, “Are you damaging your data through a lack of bit cushions?” To be published in Circuit Cellar Online.

www.circuitcellar.com/online

