Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses

JACK W. DAVIDSON and SANJAY JINTURKAR
{jwd, sj3e}@virginia.edu
Department of Computer Science, Thornton Hall
University of Virginia, Charlottesville, VA 22903 U.S.A.

ABSTRACT

As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck for microprocessors. This has occurred because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little to help the situation with respect to current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load/store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the applicability of the transformation are made at run time. This dynamic analysis significantly increases the probability of the transformation being applied. The code improvement transformation was implemented and added to the repertoire of code improvements of an existing retargetable optimizing back end. Using three current architectures as evaluation platforms, the effectiveness of the transformation was measured on a set of compute- and memory-intensive programs. Interestingly, the effectiveness of the transformation varied significantly with respect to the instruction-set architecture of the tested platform. For one of the tested architectures, improvements in execution speed ranging from 5 to 40 percent were observed. For another, the improvements in execution speed ranged from 5 to 20 percent, while for yet another, the transformation resulted in slower code for all programs.

1 INTRODUCTION

Processor speeds are increasing much faster than memory speeds. For example, microprocessor performance has increased by 50 to 100 percent in the last decade, while memory performance has increased by only 10 to 15 percent. Additional hardware support such as larger, faster caches [Joup90], software-assisted caches [Call91], speculative loads [Roge92], stream memory controllers


[McKe94], and machines with wider memory buses, helps, but the problem is serious enough that performance gains by any approach, including software, are worth pursuing. Furthermore, even with additional hardware, processors often do not obtain anywhere near their peak performance with respect to their memory systems.

This paper describes a code improvement transformation that attempts to utilize a processor's memory system more effectively by coalescing narrow loads and stores of width N bits into more efficient wide loads and stores of width N x c, where c is a multiple of two and the processor can fetch N x c bits efficiently. The terms narrow and wide are relative to the target architecture. On a 16-bit architecture, for example, two narrow loads of bytes (8 bits) that are in consecutive memory locations might be coalesced into a single wide load of 16 bits. Similarly, on a 64-bit architecture, four narrow stores of shortwords (16 bits) that are in consecutive memory locations and properly aligned might be coalesced into a single wide store of 64 bits.
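To make the idea concrete, the following sketch (ours, not taken from the paper) mimics the transformation in portable C: four adjacent 16-bit loads are replaced by a single 64-bit load, after which the individual halfwords are extracted from the wide value with shifts and masks. The function names are hypothetical, and memcpy is used only as a standards-safe way to express one wide load.

```c
#include <stdint.h>
#include <string.h>

/* Four narrow (16-bit) loads from consecutive memory locations. */
uint64_t sum4_narrow(const uint16_t *p)
{
    return (uint64_t)p[0] + p[1] + p[2] + p[3];
}

/* The same computation after coalescing: one wide (64-bit) load,
 * followed by shift/mask extraction of the four 16-bit fields.
 * (The order the fields come out in depends on endianness; the
 * sum over all four fields does not.) */
uint64_t sum4_wide(const uint16_t *p)
{
    uint64_t q;
    memcpy(&q, p, sizeof q);              /* one wide load of 4 halfwords */
    uint64_t s = 0;
    for (int i = 0; i < 4; i++)
        s += (q >> (16 * i)) & 0xFFFFu;   /* extract one 16-bit field */
    return s;
}
```

A real compiler would of course emit the wide load directly; the point is that the four narrow accesses collapse into one memory reference plus register-only extraction.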

As the paper shows, the analysis to perform such transformations is difficult but doable, and in many cases well worth the effort. The two questions that the analysis must answer are: Is the transformation safe, and is the transformation profitable? Safety analysis determines whether the transformation can be done without changing the semantics of the program. The two key components of the safety analysis address aliasing and data alignment issues. Alias analysis, in particular, is extremely difficult when the source language contains unrestricted pointers [Land92, Land93]. The problem is further compounded because for many codes where this transformation would be beneficial, the code is structured so that aliasing and data alignment hazards cannot precisely be determined via interprocedural, compile-time analysis. The paper describes a new technique, called run-time alias and alignment analysis, that neatly solves this problem.

Profitability analysis determines whether the transformation will result in code that runs faster. This is perhaps the most difficult part of the analysis because memory coalescing interacts with other code improvements. For example, to expose more narrow, consecutive memory references for possible coalescing, loops are sometimes unrolled by the optimizer. However, naive loop unrolling may cause the size of a loop to grow larger than the instruction cache, and any gains in performance by memory coalescing may be more than offset by degraded cache performance. Similarly, memory coalescing collects memory accesses that are distributed throughout the loop into a single reference. This gathering of dependencies into a single instruction can adversely affect instruction scheduling. The paper discusses

SIGPLAN '94, 6/94, Orlando, Florida, USA. © 1994 ACM 0-89791-662-x/94/…



how these and other issues are resolved so that memory coalescing yields code that runs faster, not slower.

The following section briefly discusses work related to reducing memory bandwidth requirements of programs. Section 2 describes the algorithm with emphasis on the analyses required to apply the transformation safely and profitably. Section 3 describes the implementation of the algorithm in an existing retargetable back end called vpo [Beni89, Beni94]. Using a C front end and vpo, the effectiveness of the transformation was evaluated on three processors: DEC's Alpha [Digi92], Motorola's 88100 [Moto91], and Motorola's 68030 [Moto85]. Section 4 contains a summary.

1.1 RELATED WORK

Software approaches to the memory bandwidth problem focus on reducing the memory bandwidth requirements of programs. For example, there is a plethora of research describing algorithms for register allocation, a fundamental transformation for reducing a program's memory bandwidth requirements. Register allocation identifies variables that can be held in registers. By allocating the variable to a register, memory loads and stores previously necessary to access the variable are eliminated. An evaluation of the register coloring approach to register allocation showed that up to 75 percent of the scalar memory references can be removed using these techniques [Chow90].

Cache blocking and register blocking are code transformations that also reduce a program's memory bandwidth requirement. These transformations can profitably be applied to codes that process large sets of data held in arrays. For example, consider the multiplication of two large arrays. By large, we mean that the arrays are much larger than the size of the machine's data cache.

Because the arrays are larger than the cache, processing the entire array results in data being read from memory to the cache many times. Cache blocking, however, transforms the code so that a block of the array that will fit in the cache is read in once, used many times, and then replaced by the next block. The performance benefits from this transformation can be quite good. Lam, Rothberg, and Wolf [Lam91] show that for multiplication of large, floating-point arrays, cache blocking can easily triple the performance of a cache-based system.
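As an illustrative sketch of the technique (the array and block sizes below are arbitrary choices of ours, not taken from [Lam91]), a cache-blocked matrix multiply processes one B x B tile at a time so that each tile is brought into the cache once and reused:

```c
#define N 64
#define B 16   /* block size chosen so a B x B tile fits in the data cache */

/* Cache-blocked matrix multiply: the loop nest is tiled so that each
 * B x B block of A and Bm is read into the cache once and reused many
 * times, instead of streaming whole rows through the cache repeatedly. */
void matmul_blocked(double A[N][N], double Bm[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += B)          /* tile the iteration space */
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}
```

The arithmetic is identical to the untiled loop nest; only the order of the memory accesses changes.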

Register blocking is similar in concept to cache blocking, but instead of transforming code to reduce the number of redundant loads of array elements into the cache, it transforms code so that unnecessary loads of array elements are eliminated. Register blocking can be considered a specific application of scalar replacement of subscripted variables [Call90, Dues93] and loop unrolling. Scalar replacement identifies reuse of subscripted variables and replaces them by references to temporary scalar variables. Unrolling the loop exposes a block of these subscripted variables to which scalar replacement can be applied. Register blocking and cache blocking can be used in combination to reduce the number of memory references and cache misses.
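A minimal sketch of scalar replacement combined with unrolling, using a matrix-vector product as an example of our own choosing (function names are hypothetical):

```c
#define N 8

/* Straightforward matrix-vector product: y[i] is re-loaded and re-stored
 * on every inner iteration, and each x[j] is re-loaded for every row. */
void matvec_naive(double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] = y[i] + A[i][j] * x[j];
}

/* Register-blocked version: the outer loop is unrolled by two, and the
 * subscripted values y[i] and y[i+1] are scalar-replaced so they can
 * stay in registers; each load of x[j] now feeds two rows. */
void matvec_blocked(double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i += 2) {
        double t0 = y[i], t1 = y[i + 1];   /* scalar replacement */
        for (int j = 0; j < N; j++) {
            double xj = x[j];              /* one load shared by two rows */
            t0 += A[i][j] * xj;
            t1 += A[i + 1][j] * xj;
        }
        y[i] = t0;
        y[i + 1] = t1;
    }
}
```

Both versions perform the same floating-point operations in the same per-row order; the blocked version simply issues far fewer loads and stores.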

Another program transformation that reduces a program's memory bandwidth requirements is called recurrence detection and optimization [Beni91]. Knuth defines a recurrence relation as one that defines each element of a sequence in terms of the preceding elements [Knut73]. Recurrence relations appear in the solutions to many compute- and memory-intensive problems. Interestingly, codes containing recurrences often cannot be vectorized. Consider the following C code:

    for (i = 2; i < n; i++)
        x[i] = z[i] * (y[i] - x[i-1]);

This is the fifth Livermore loop, which is a tri-diagonal elimination below the diagonal. It contains a recurrence since x[i] is defined in terms of x[i-1]. By detecting the fact that a recurrence is being evaluated, code can be generated so that the x[i] computed on one iteration of the loop is held in a register and is obtained from that register on the next iteration of the loop. For this loop, the transformation yields code that saves one memory reference per loop iteration.
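The register-carried form of this recurrence can be sketched in C as follows (the function names are ours, and the scalar `carry` stands in for the register the compiler would use):

```c
/* The fifth Livermore loop as compiled naively: x[i-1] is re-loaded
 * from memory on every iteration even though it was just computed
 * and stored by the previous iteration. */
void tridiag_naive(int n, double x[], const double y[], const double z[])
{
    for (int i = 2; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* With the recurrence detected, the value produced on one iteration is
 * carried in a scalar (i.e., a register) to the next iteration, saving
 * one memory load per iteration. */
void tridiag_reg(int n, double x[], const double y[], const double z[])
{
    double carry = x[1];              /* x[i-1] held in a register */
    for (int i = 2; i < n; i++) {
        carry = z[i] * (y[i] - carry);
        x[i] = carry;                 /* store still needed for the result */
    }
}
```

The two versions perform identical floating-point operations, so the results match exactly; only the redundant load disappears.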

For machines with wide buses (the size of the bus is greater than the size of a single-precision floating-point value), it is possible to compact some number of floating-point loads into a single reference [Alex93]. Indeed, the work reported here is a generalization and extension of this technique applied to data of any size. We call this technique memory access coalescing. This technique can be used with the techniques mentioned previously.

2 MEMORY ACCESS COALESCING

2.1 MOTIVATION

To describe memory access coalescing and highlight the potential hazards that must be handled by an optimizer, consider the C code in Figure 1a. The code computes the dot product of two vectors containing 16-bit integers. The code is taken from a signal processing application, and 16 bits was sufficient to represent the dynamic range of the sampled signal.

Figure 1b contains the DEC Alpha machine code in register transfer lists (RTLs) generated by our compiler. Because the DEC Alpha is a relatively new architecture, and because it has some interesting characteristics that affect code generation, a few relevant details of the architecture are described. There are 32 64-bit fixed-point registers, and all operations are performed on 64-bit registers. The load and store instructions can move 32-bit (longword) or 64-bit (quadword) quantities from and to memory. Memory addresses must be naturally aligned. Data that is 2^N bytes in size is naturally aligned if it is stored at an address that is a multiple of 2^N. To accommodate loading and storing of data that is unaligned, the architecture contains unaligned loads and stores of 64 bits. These instructions fetch the aligned quadword that contains the unaligned data. There is a full complement of arithmetic and logical instructions that manipulate 64-bit values. In addition, there are three instructions, add, subtract, and multiply, that operate on 32-bit data. In a departure from other RISC architectures, the Alpha does not include instructions for loading and storing bytes or shortwords (16 bits). Instead, architectural support is provided for extracting 8-bit (byte) and 16-bit (shortword) quantities from 64-bit registers. For example, there are instructions for efficiently extracting and inserting bytes or shortwords from/into a register. The rationales for these design decisions are outlined in the Alpha architecture handbook [Digi92].

With this information in mind, the code in Figure 1b can be explained. In the RTL code, q[n] and r[n] refer to fixed-point registers; r[n] is used when the operation is 32-bit. In the code, Q[addr] refers to quadword memory. The unaligned load instruction at line 12 fetches the aligned quadword that contains a[i]. It is necessary to use an unaligned load because the base addresses of a and b are not guaranteed to be aligned on a quadword boundary, but they are guaranteed to be aligned on a shortword boundary. The instructions at lines 14 through 16 extract the shortword from the quadword. Line 14 computes the offset of the shortword within the register.



int dotproduct(short a[], short b[], int n)
{
    int c, i;
    c = 0;
    for (i = 0; i < n; i++)
        c += a[i] * b[i];
    return c;
}

Figure 1a. Dot-product loop.

 1. r[4] = 0;
 2. // test n for zero-trip loop
 3. r[0] = r[31] - r[18];
 4. PC = r[0] >= 0 -> L15;
 5. // compute address of a+n*2
 6. q[6] = r[18] << 1;
 7. q[6] = q[6] + q[16];
 8. L17:
    . . .
11. // load quad containing a[i]
12. q[2] = Q[(q[16]) & ~7];
13. // extract and sign extend a[i]
14. q[8] = q[16] + 2;
15. q[1] = EQH[q[2], q[8]];
16. r[1] = q[1] >> 48;
17. // load quad containing b[i]
18. q[3] = Q[(q[17]) & ~7];
19. // extract and sign extend b[i]
20. q[8] = q[17] + 2;
21. q[2] = EQH[q[3], q[8]];
22. r[2] = q[2] >> 48;
23. // compute product and accumulate
24. r[1] = r[1] * r[2];
25. r[4] = r[4] + r[1];
26. // advance to next array elements
27. q[17] = q[17] + 2;
28. q[16] = q[16] + 2;
29. // test for loop termination
30. q[0] = q[16] - q[6];
31. PC = q[0] < 0 -> L17;
32. L15:
33. r[0] = r[4];

Figure 1b. Original code for loop.

 1. r[4] = 0;
 2. // test for zero-trip loop
 3. r[0] = r[31] - r[18];
 4. PC = r[0] >= 0 -> L15;
 5. // compute loop termination
 6. q[6] = r[18] << 1;
 7. q[6] = q[6] + q[16];
 8. L17:
    . . .
11. // load quad containing a[i]
12. q[21] = Q[q[16]];
13. // extract a[i] (two bytes)
14. q[8] = q[16] + 2;
15. q[1] = EQH[q[21], q[8]];
16. r[1] = q[1] >> 48;
17. // load quad containing b[i]
18. q[20] = Q[q[17]];
19. // extract b[i] (two bytes)
20. q[8] = q[17] + 2;
21. q[2] = EQH[q[20], q[8]];
22. r[2] = q[2] >> 48;
23. // compute dot product
24. r[1] = r[1] * r[2];
25. r[4] = r[4] + r[1];
26. // extract a[i+1]
27. q[8] = q[16] + 4;
28. q[1] = EQH[q[21], q[8]];
29. r[1] = q[1] >> 48;
30. // extract b[i+1]
31. q[8] = q[17] + 4;
32. q[2] = EQH[q[20], q[8]];
33. r[2] = q[2] >> 48;
34. // compute dot product
35. r[1] = r[1] * r[2];
36. r[4] = r[4] + r[1];
    . . .
    // adv to next a[i] & b[i]
    q[16] = q[16] + 8;
    q[17] = q[17] + 8;
    // test for loop termination
    q[0] = q[16] - q[6];
    PC = q[0] < 0 -> L17;
45. L15:
46. r[0] = r[4];

Figure 1c. Unrolled loop with coalesced memory references.

Figure 1: DEC Alpha code for dot-product.


 1  // Main routine to coalesce memory accesses
 2  proc CoalesceMemoryAccesses(CurrFunction) is
 3      // Consider each loop in the current function.
 4      ∀ LOOP ∈ CurrFunction.Loop do
 5          LOOP.InductionVars ← FindInductionVars(LOOP)
 6          // Unroll the loop. If it fits in the cache, use it, else use the rolled loop.
 7          CurrFunction.Loop ← {UnRollLoopIfProfitable(LOOP)} ∪ CurrFunction.Loop
 8          // Classifies memory references into different partitions if a unique identifier is found to distinguish
 9          // a set of such references. Thus, all references to an array A passed as a parameter will have a loop-
10          // invariant register (most probably the register containing the start address of A) as their partition
11          // identifier.
12          ClassifyMemoryReferencesIntoPartitions(LOOP)
13          // Calculate relative offsets of the memory references belonging to same partition from the induction
14          // variable. If a constant offset is not found, it is not safe to do memory coalescing. Sort the offsets.
15          CalculateRelativeOffsets(LOOP)
16          EliminateInductionVariables(LOOP)
17          // Attempt wide reference optimization
18          WideRefOptimization(LOOP, CurrFunction)
19      enddo
20  endproc

Figure 2: Memory access coalescing algorithm main loop.

The RTL

    q[1] = EQH[q[2], q[8]]

shifts register q[2] left by the number of bytes specified by the low three bits of q[8], inserts zeros into the vacated bit positions, and then extracts 8 bytes into register q[1]. Line 16 sign extends the shortword. Lines 18 through 22 perform a similar operation for b[i].

The code in Figure 1b, as it stands, is fairly tight. However, the loop fetches the same quadwords for the respective arrays every four iterations. Thus, for every four iterations six redundant loads are executed. It is these redundant memory accesses that memory coalescing eliminates. By unrolling the loop four times, and applying the memory coalescing algorithm, the optimizer produces the code in Figure 1c. Notice that there are still two loads in the loop (lines 12 and 18), but the modified loop iterates one-fourth as many times as the loop of Figure 1b. Thus, the original loop performs 2n memory references, while the coalesced loop performs n/2 memory references, for a savings of 75 percent.
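The effect of the transformation can be mimicked in portable C (a sketch under our own naming, assuming n is a multiple of four and ignoring the alignment hazards the paper addresses): in each unrolled iteration, one 64-bit load per array replaces four 16-bit loads, and the halfwords are extracted in registers.

```c
#include <stdint.h>
#include <string.h>

/* Original loop: one narrow (16-bit) load per array per iteration. */
int dotproduct_narrow(const int16_t a[], const int16_t b[], int n)
{
    int c = 0;
    for (int i = 0; i < n; i++)
        c += a[i] * b[i];
    return c;
}

/* Unrolled by four with coalesced loads: one wide (64-bit) load per
 * array covers four iterations; the 16-bit elements are extracted with
 * shifts and sign-extending truncation. Assumes n % 4 == 0. */
int dotproduct_coalesced(const int16_t a[], const int16_t b[], int n)
{
    int c = 0;
    for (int i = 0; i < n; i += 4) {
        uint64_t qa, qb;
        memcpy(&qa, &a[i], sizeof qa);   /* wide load of a[i..i+3] */
        memcpy(&qb, &b[i], sizeof qb);   /* wide load of b[i..i+3] */
        for (int k = 0; k < 4; k++) {
            int16_t av = (int16_t)(qa >> (16 * k));  /* extract + sign extend */
            int16_t bv = (int16_t)(qb >> (16 * k));
            c += av * bv;
        }
    }
    return c;
}
```

Because both arrays are extracted with the same shifts, corresponding elements stay paired regardless of byte order, and the accumulated product is unchanged.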

At this point, the transformation may seem rather straightforward. However, there are subtle details that must be addressed. First, the code in Figure 1c assumes that the starting addresses of the vectors a and b are aligned on a quadword boundary. This may, or may not, be true. If it is not true, the first memory reference to an unaligned address will trap. Second, the code also assumes that the length of the vectors is divisible by four. This too may, or may not, be true. If it is not true, the loop will fetch data outside the arrays and possibly fault (we'd be lucky if it did), but it is more likely that an incorrect result will silently be computed. Third, the loop body has gotten larger, and the assumption is that any potential negative effects due to increasing the size of the loop will be offset by the gains resulting from reducing the number of memory references. This may or may not be a reasonable assumption.

2.2 MEMORY ACCESS COALESCING ALGORITHM

These safety and profitability issues mentioned above, and others, must be handled by the memory access coalescing algorithm. The C code in Figure 1a highlights the difficulty of this analysis. For this routine, standard intraprocedural analysis cannot gather the necessary information to safely coalesce memory references. The vectors and n, the number of elements in the arrays, are parameters. Interprocedural analysis would help, but often the routines of interest are part of a library and are not accessible until link time. One could limit the applicability of the algorithm to routines where static, compile-time analysis is sufficient, but our experience shows that this would eliminate most opportunities for applying the algorithm. The approach taken here attempts to do the analysis at compile time, if possible, but if it is not possible, code is generated to check the safety issues at run time. Our evaluations show that it is generally possible to do this in a way that the impact of the extra code for checking is negligible.

Figure 2 contains the high-level portion of the algorithm. Due to space limitations, the entire algorithm cannot be presented. The focus here is the profitability and safety analysis. Line 7 determines if it is profitable to unroll the loop. Our heuristic is that if the original loop will fit in the instruction cache, then the algorithm must ensure that the unrolled loop will fit as well. In addition, this routine, if necessary, produces code to execute the loop body enough times so that the number of iterations of the main loop is a multiple of the unrolling factor. Line 12 analyzes the memory references of the loop and partitions them into disjoint sets for later analysis [Beni91]. The memory access coalescing is done by WideRefOptimization. After identifying candidate memory references for coalescing, DoProfitabilityAnalysisAndModify is called. The algorithm is in Figure 3.

The algorithm makes a copy of the loop and performs memory coalescing on it. This involves not only coalescing the memory accesses, but inserting code to extract the required information from the coalesced memory reference. After doing this, it calls the scheduler to schedule both loops and does a comparison. If it appears advantageous to use the coalesced loop, then various



// Do cost/benefit analysis before doing memory coalescing
proc DoProfitabilityAnalysisAndModify(LOOP, S, T, WideSize, CurrFunction) is
    Inst ← ∅
    // Do data hazard analysis for possible aligned and unaligned wide references
    AlignedWideType ← DoHazardAnalysis(LOOP, S, ALIGNED, LOOP.PossibleAliases, Inst, WideSize,
            AlignedWideReferencePosition, AlignedWideReferenceAddress)
    UnAlignedWideType ← DoHazardAnalysis(LOOP, S, ¬ALIGNED, LOOP.PossibleAliases, Inst,
            WideSize, UnAlignedWideReferencePosition, UnAlignedWideReferenceAddress)
    // Check if a valid wide memory reference which can replace the narrow references is found
    if IsValidType(AlignedWideType) ∨ IsValidType(UnAlignedWideType) then
        // Make a copy of the loop. Schedule the instructions in the original loop and find the number of
        // cycles necessary. Then insert appropriate wide references in the copy of the loop and schedule it
        // too. If the latter requires fewer cycles, then go ahead.
        LCOPY ← DoReplication(LOOP)
        // Calculate the cycles required by the original loop by static scheduling
        CyclesforOriginalLoop ← Schedule(LOOP)
        if IsValidType(AlignedWideType) then
            InsertWideReferences(LCOPY, S, AlignedWideReferencePosition, AlignedWideReferenceAddress)
        endif
        if IsValidType(UnAlignedWideType) then
            InsertWideReferences(LCOPY, T, UnAlignedWideReferencePosition, UnAlignedWideReferenceAddress)
        endif
        // Calculate the cycles required by the loop after replacing the narrow references by wide ones.
        CyclesforCopiedLoop ← Schedule(LCOPY)
        if CyclesforCopiedLoop < CyclesforOriginalLoop then
            // If the alignment checking for the aligned wide address under consideration is not there,
            // then insert a check that will allow the execution of the LCOPY if the address is actually
            // aligned at run time
            if ¬AlignmentCheckExists(LOOP, WideType, WideReferenceAddress) then
                InsertAlignmentCheckInPreheader(LOOP.Preheader, LOOP.Label, LCOPY.Label,
                        WideReferenceAddress, WideType)
                InsertAliasingChecksInPreheader(LOOP.Preheader, LOOP.Label, LCOPY.Label, Inst)
            else
                // Else just use the LCOPY instead of the original one, since this is better
                Target ← FindTargetOfUnalignedAddress(LOOP.Preheader)
                InsertAliasingChecksInPreheader(LOOP.Preheader, LCOPY.Label, Target.Label, Inst)
                ChangeTargetOfAlignmentCheck(LOOP.Preheader, LOOP.Label, LCOPY.Label)
                CurrFunction.Loop ← CurrFunction.Loop - {LOOP}
            endif
            CurrFunction.Loop ← CurrFunction.Loop ∪ {LCOPY}
        endif
        if LOOP ∈ CurrFunction then
            ∀ I ∈ S ∪ T do
                I.Modify ← TRUE
            enddo
        endif
        return TRUE
    endif
    return FALSE
endproc

Figure 3: Profitability analysis algorithm.



// Check if there are any data hazards
proc IsHazard(S, WideReferencePosition, ReferenceType, C, Inst) is
    ∀ M ∈ S do
        // A wide load is inserted before the dominating load of all the narrow loads. So the narrow load
        // reference is called the BottomInst. A wide store is inserted after the dominated store of all the
        // narrow stores. So the narrow store is called the TopInst.
        if ReferenceType = LOAD then
            BottomInst ← M
            TopInst ← WideReferencePosition
        else
            TopInst ← M
            BottomInst ← WideReferencePosition
        endif
        // The narrow and the wide reference have to lie in the same basic block
        if BottomInst.BasicBlock ≠ TopInst.BasicBlock then
            Return(TRUE)
        endif
        CurrInst ← BottomInst
        // Check all the instructions between the BottomInst and TopInst
        while (CurrInst ← CurrInst.PrevInst) ∧ CurrInst ≠ TopInst do
            // We cannot allow a load between two stores, all belonging to same partition. If they do not lie
            // in same partition, there is a possibility of aliasing, which can probably be detected only at
            // run time.
            if IsStore(M.Reference) ∧ IsLoad(CurrInst.Reference) then
                if M.Partition = CurrInst.Partition then
                    if ReferenceSameLocation(M.Reference, CurrInst.Reference) ∧
                            ¬IsNeededTodoNarrowStoreOnly(M.Reference, CurrInst.Reference) then
                        Return(TRUE)
                    endif
                else
                    Inst ← Inst ∪ DoAliasDetection(CurrInst.Partition.Load, M.Partition.Store, C)
                endif
            // We cannot allow a store between two load or store references.
            elseif IsStore(CurrInst.Reference) then
                if M.Partition = CurrInst.Partition then
                    if ReferenceSameLocation(M.Reference, CurrInst.Reference) then
                        Return(TRUE)
                    endif
                elseif IsLoad(M.Reference) then
                    Inst ← Inst ∪ DoAliasDetection(CurrInst.Partition.Store, M.Partition.Load, C)
                else
                    Inst ← Inst ∪ DoAliasDetection(CurrInst.Partition.Store, M.Partition.Store, C)
                endif
            endif
            FindBaseAndDisplacementOfAddress(CurrInst.Reference, Base, Displacement)
            // If the base register has been modified, then the coalescing may not be safe.
            if IsModifiedBase(CurrInst.Reference, Base) then
                Return(TRUE)
            endif
        endwhile
    endfor
    // No data hazards were found
    Return(FALSE)
endproc

Figure 4: Hazard analysis algorithm.



[Figure 5 is a flow graph: the loop preheader tests for potential hazards; if a hazard is detected, the original loop body iterates n times; otherwise the original loop body iterates n mod unrollfactor times and the coalesced loop iterates n / unrollfactor times.]

Figure 5: Flow graph showing alignment and alias checks.

    checks are done to see if it is necessary to put alignment checks inthe preheader of the loop. Additionally, if it is not possible to dostatic alias detection (for example, do the memory referencesoverlap), then code is inserted in the preheader to do the checks atrun time.

A second key algorithm is IsHazard, which does the safety analysis. This routine is contained in Figure 4. Most of the analysis here is straightforward. The routine assures that coalesced memory references are in the same basic block and that sequential consistency is preserved. In addition, if aliasing cannot be resolved statically, the routine DoAliasDetection is called, which generates code that will be inserted in the loop preheader to check for potential aliasing problems (e.g., two arrays overlap in memory). If, at run time, an alias condition is detected, the original safe loop is executed.

The result is code represented by the flow graph in Figure 5. For the example in Figure 1, this results in additional code being added to the loop preheader for each possible alias pair. In particular, the following instructions are added:

    // q[16]: address of a
    // q[17]: address of b; q[18]: n
    q[1] = q[17] + q[18];
    q[0] = q[16] < q[1];
    PC = q[0] == 0 -> L16;
    q[1] = q[16] + q[18];
    q[0] = q[17] < q[1];
    PC = q[0] > 0 -> L13;
    L16:
    q[0] = q[18] % 4;
    PC = q[0] != 0 -> L13;
    q[0] = q[16] & 7;
    PC = q[0] != 0 -> L13;
    q[0] = q[17] & 7;
    PC = q[0] != 0 -> L13;
    // do memory coalesced unrolled loop
    . . .
    L13:
    // do original safe loop

The code appearing before L16 checks to make sure the arrays do not overlap, while the code after L16 checks for the ability to unroll and that the arrays are properly aligned.
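In C, a guard corresponding to these preheader checks might be sketched as follows (the function name and the unroll factor of four are our assumptions, matching the Figure 1 example):

```c
#include <stdint.h>
#include <stddef.h>

/* Run-time guard for the coalesced loop: execute it only if the vectors
 * do not overlap, n is divisible by the unroll factor, and both base
 * addresses are quadword (8-byte) aligned; otherwise fall back to the
 * original safe loop. */
int can_run_coalesced(const int16_t *a, const int16_t *b, int n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    size_t bytes = (size_t)n * sizeof(int16_t);

    int overlap = pa < pb + bytes && pb < pa + bytes;  /* alias check   */
    return !overlap
        && n % 4 == 0                                  /* unroll check  */
        && (pa & 7) == 0 && (pb & 7) == 0;             /* alignment     */
}
```

This is exactly the structure of the preheader code above: each failing test branches to the original loop (L13), and only when every test passes does control reach the coalesced loop.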

3 IMPLEMENTATION AND RESULTS

A prototype of the memory access coalescing algorithm has been implemented in an existing retargetable compiler, and tested on platforms containing the following processors: DEC Alpha, Motorola 88100, and Motorola 68030. Using the set of compute- and memory-intensive kernel loops listed in Table I, the effectiveness of the algorithm was evaluated. These benchmarks were chosen because they represent realistic code, and they contain loops that are memory-intensive and contain memory references that are candidates for memory access coalescing. Memory access coalescing, unlike register allocation, code motion, induction variable elimination, etc., is not a code improvement that applies to code in general. However, like cache blocking, register blocking, iteration space tiling, software pipelining, and recurrence detection and optimization, it does apply to a small set of important codes, and as the results show, it provides high payoff when it does apply.

The results for the DEC Alpha are presented in Table II. All timings were gathered by running each program ten times on a single-user machine. The two highest execution times and the two lowest were discarded, and an average of the remaining six times was taken. Column 2 (labeled cc -O) is the execution time taken by code produced by the native compiler with the loop unrolled. Column 3 is the execution time taken by the code produced by our compiler, again with the loop unrolled. The loops were unrolled so that the effect of memory access coalescing could be isolated and observed. Column 4 contains the average execution time for the benchmark when loads were coalesced, and column 5 contains the average execution time when both loads and stores were coalesced. Column 6 contains the percentage speedup. The first thing to notice is that the optimizing compiler in which the memory access coalescing algorithm is embedded is comparable to the native compiler. This indicates that the speedups in column 6 are not artifacts of embedding the algorithm in a poor compiler. The second thing to notice is that, in general, the percentage speedup is quite good.

Table III contains similar timing information for the Motorola 88100-based platform. It is interesting to note that the code with both loads and stores coalesced runs slower than the code with just loads coalesced. The reason is that the Motorola 88100 has efficient instructions for extracting bytes and words from a 32-bit



Program       Description                                                    Lines of Code
Convolution   Gradient directional edge convolution of a 500 by 500          154
              black and white image [Lind91]
Image add     Image addition of two 500 by 500 black and white frames         48
Image xor     Image exclusive-or of two 500 by 500 black and white frames     48
Translate     Translate 500 by 500 black and white image to a new position    48
Eqntott       Part of the SPEC 89 benchmark suite                            146
Mirror        Generate mirror image of 500 by 500 black and white image       50

Table I: Compute- and memory-intensive benchmarks.

Program             cc -O   vpcc/vpo -O   vpcc/vpo -O        vpcc/vpo -O        Percent Savings
                                          (coalesce loads)   (coalesce loads    ((col3 - col5) /
                                                             and stores)        col3 x 100)
Convolution         16.67      17.76          15.62              15.76              11.26
Image add           17.41      17.71          11.48              10.44              41.05
Image add (16-bit)  12.03      12.02           8.97               8.13              32.36
Image xor           17.43      17.49          11.48              10.48              40.08
Translate           11.46      10.52           8.45               7.04              33.11
Eqntott             19.17      21.55          20.72              20.72               3.86
Mirror              15.62      14.49          12.63               9.84              32.09

Table II: DEC Alpha execution times (in seconds) and percent improvement.

register, but there are no instructions for inserting bytes and words into a register without affecting the other bytes or words in the register. Thus, code must be generated using logical instructions to place the word into the proper position in the register. These sequences outweigh the gains of coalescing stores. However, coalescing loads was profitable, exhibiting speedups of up to 25 percent.

We also implemented the algorithm in a compiler for the Motorola 68030. Unfortunately, in all cases the code ran slower. Inspection of the code revealed that while the Motorola 68030 has instructions for extracting bytes and words, these are much more expensive than simply loading the bytes and words directly. This again highlights how most optimizations are machine dependent.

4 SUMMARY

We have described an algorithm for coalescing redundant memory accesses in loops. If possible, static analysis is used to resolve safety and profitability issues. However, in most interesting cases, it is necessary to rely on run-time tests to handle aliasing and alignment issues. Such code is relatively easy to generate. Typically, 10 to 15 instructions must be added in the loop preheader to check for possible hazards. The results on two machines show that the technique can result in substantial speedups. For the DEC



                                                                             Percent Savings
    Program        cc -O    vpcc/vpo -O   vpcc/vpo -O        vpcc/vpo -O     ((col2 - col3) /
                                          (coalesce loads)   (coalesce loads   col2) x 100
                                                              and stores)

    Convolution     22.86    22.82         18.87              22.64           17.3
    Image add       15.33    25.74         12.97              13.45           15.39
    Image xor       15.34    15.34         12.94              13.7            15.64
    Translate       16.32    17.52         13.49              16.91           24.46
    Eqntott        130.3    145.0         143.2              143.2             1.3
    Mirror          20.52    19.23         16.03              16.89           16.64

    Table III: Motorola 88100 execution times (in seconds) and percent improvement.

    Alpha, we observed speedups ranging from 3 percent up to 40 percent. For the Motorola 88100, we observed speedups of a few percent up to 25 percent, while for the Motorola 68030 the technique resulted in slower code.

    ACKNOWLEDGEMENTS

    This work was supported in part by National Science Foundation grant CCR-9214904.

    REFERENCES

    [Alex93]   Alexander, M. J., Bailey, M. W., Childers, B. R., Davidson, J. W., and Jinturkar, S., "Memory Bandwidth Optimizations for Wide-Bus Machines," Proceedings of the 26th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1993, pp. 466-475.

    [Aho86]    Aho, A. V., Sethi, R., and Ullman, J. D., Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, MA, 1986.

    [Beni94]   Benitez, M. E., and Davidson, J. W., "The Advantages of Machine-Dependent Global Optimization," Proceedings of the International Conference on Programming Languages and System Architectures, Springer Verlag Lecture Notes in Computer Science, Zurich, Switzerland, March 1994, pp. 105-124.

    [Beni91]   Benitez, M. E., and Davidson, J. W., "Code Generation for Streaming: an Access/Execute Mechanism," Proceedings of the Fourth International Symposium on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991, pp. 132-141.

    [Beni89]   Benitez, M. E., and Davidson, J. W., "A Portable Global Optimizer and Linker," Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988, pp. 329-338.

    [Call91]   Callahan, D., Kennedy, K., and Porterfield, A., "Software Prefetching," Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991, pp. 40-52.

    [Call90]   Callahan, D., Carr, S., and Kennedy, K., "Improving Register Allocation for Subscripted Variables," Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990, pp. 53-65.

    [Chow90]   Chow, F. C., and Hennessy, J. L., "The Priority-Based Coloring Approach to Register Allocation," ACM Transactions on Programming Languages and Systems 12(4):501-536, October 1990.

    [Digi92]   Alpha Architecture Handbook, Digital Equipment Corporation, 1992.

    [Dues93]   Duesterwald, E., Gupta, R., and Soffa, M. L., "A Practical Data Flow Framework for Array Reference Analysis and its Use in Optimizations," Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, NM, June 1993, pp. 68-77.

    [Joup90]   Jouppi, N., "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers," Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, May 1990, pp. 364-373.

    [Knut73]   Knuth, D. E., The Art of Computer Programming, Volume 1: Fundamental Algorithms, Addison-Wesley, Reading, MA, 1973.

    [Lam91]    Lam, M., Rothberg, E. E., and Wolf, M. E., "The Cache Performance and Optimizations of Blocked Algorithms," Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991, pp. 63-74.

    [Land93]   Landi, W., Ryder, B. G., and Zhang, S., "Interprocedural Modification Side Effect Analysis with Pointer Aliasing," Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, NM, June 1993, pp. 56-67.

    [Land92]   Landi, W., and Ryder, B. G., "A Safe Approximation Algorithm for Interprocedural Pointer Aliasing," Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992, pp. 235-248.

    [Lind91]   Lindley, C. A., Practical Image Processing in C, John Wiley and Sons, Inc., New York, NY, 1991.

    [McFa91]   McFarling, S., "Procedure Merging with Instruction Caches," Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, June 1991, pp. 71-79.

    [McKe94]   McKee, S. A., Klenke, R. H., Schwab, A. J., Wulf, W. A., Moyer, S. A., and Aylor, J. H., "Experimental Implementation of Dynamic Access Ordering," Proceedings of the 27th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1994.

    [Moto91]   MC88110: Second Generation RISC Microprocessor User's Manual, Motorola, Inc., Phoenix, AZ, 1991.

    [Moto85]   MC68020 32-bit Microprocessor User's Manual, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985.

    [Roge92]   Rogers, A., and Li, K., "Software Support for Speculative Loads," Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 1992, pp. 38-50.

    [Spec89]   Systems Performance Evaluation Cooperative, c/o Waterside Associates, Fremont, CA, 1989.


