
Dynamic Cache Splitting *

Toni Juan, Dolors Royo and Juan J. Navarro
Computer Architecture Department, Universitat Politecnica de Catalunya
Gran Capita s/n, Modul D6, E-08034 Barcelona (Spain)
e-mail: [email protected]

Abstract

A novel hardware mechanism for cache management, which we call the split cache, is presented. With some help from the compiler, a program can dynamically split the cache into several caches, each of any power-of-two size, and choose in which cache each data element is mapped. When properly used, this scheme can reduce conflict misses to nearly zero. Moreover, as the total capacity of the original cache is used in a more optimal way, capacity misses can also be reduced.

1 Introduction

The increasing gap between processor speed and main memory speed has made cache misses the performance bottleneck. Thus, it is very important to reduce cache misses as much as possible. Three types of cache misses can be distinguished [2]: compulsory, capacity and conflict. The first two depend on the size of the working set of the code and on the size of the cache memory. The third type depends on the cache organization and, for a given cache capacity, varies with the degree of associativity [1]. Conflict misses arise when different data structures compete for the same cache lines during program execution. Fortunately, in many cases the programmer/compiler knows which data structures are candidates to conflict with others and which of these data structures will be needed in the near future. Therefore, if the compiler could manage the cache as a local memory, the number of conflict misses could be reduced almost to zero.

In this paper we propose a hardware mechanism and a simple extension of the instruction set that allow the compiler to dynamically divide the physical cache into a variable number of logically separate caches of different sizes, and to choose in which partition each memory reference is mapped. In this way, conflicts can easily be avoided by assigning different cache partitions to different data structures. Each data structure can be assigned to a cache partition whose size best fits its needs. When our new cache organization is properly used, conflict misses become nearly zero. Another advantage is that some capacity misses disappear due to the better use of the total cache capacity.

The paper is organized as follows. In Section 2 related work is presented and the main differences with our approach are highlighted. The split cache organization is described in Section 3 and some design parameters are studied. In Section 4 some general ideas about management of the split cache are used to evaluate the performance gain on some examples. We finish with conclusions and future work in Section 5.

* This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-880/92) and by the EU (ESPRIT Project APPARC 6634).

2 Related work

For a long time there have been many studies on improving the performance of the cache memory. Usually, the contribution of the data cache misses to the execution time can be modeled as

    T_misses = Misses · CPM · T_c        (1)

where CPM (cycles per miss) is the number of cycles needed to service a miss, Misses is the total number of misses and T_c is the processor cycle time.

Reducing any of the terms in equation (1) improves performance. For example, [3], [8], [7] and [9] aimed at the reduction of CPM. In [6] and [9] the goal was to reduce the number of misses, and in [5] the objective is to reduce the memory cycle time. It is interesting to note that most of these solutions reduce one term of equation (1) at the expense of increasing one or both of the other two terms. Consequently, these solutions are only useful when the reduction of one term surpasses the increase in the other two. As an example of this tradeoff, if the original system has a direct mapped cache, the number of misses can be reduced by replacing the direct mapped cache with a set associative cache of the same capacity. However, set associative caches have a longer access time, and the processor cycle time T_c could be increased.

Our proposal aims at the reduction of the Misses term in equation (1), at the expense of a possible (though not unavoidable) increase in T_c. The main difference of our proposal resides in allowing the compiler to allocate data in different cache partitions based on its knowledge of the program data structures. With this approach, we maximize the use of the full cache capacity and obtain a substantial reduction in the number of conflict misses. Our mechanism almost offers the power of a local memory without all the problems associated with its management.

3 Split cache

The cache memory is split into several partitions under software control; the size of each partition is independent of the size of the others as long as it is a power of two. To simplify the explanation, we are going to assume that

1. all cache lines that configure a partition are consecutive lines of the original cache, and

2. each cache line resides in only one partition, so there are no intersections between partitions (this restriction is eliminated in Section 5).

From the point of view of the program, this is equivalent to having several cache memories. Now each memory reference has to indicate through which partition it wishes to be cached. From the point of view of cache management, each partition behaves like an independent cache, that is, the address mapping in a given partition (direct, set associative, ...) is performed in the same way as in the original cache but assuming its size.

To avoid cache conflicts, all the different data sets that the compiler considers could conflict are going to use different cache partitions. The size of each partition is chosen trying to satisfy the memory needs of the data that is going to be mapped in it. For example, given the ji form of dense matrix by vector multiplication,

    DO j= 1, N
      xr= x(j)
      DO i= 1, N
        y(i)= y(i) + A(i,j)*xr
      ENDDO
    ENDDO

one way to avoid all the possible conflicts is to assign a different partition to each variable in this program. Figure 1a shows the data structures, their relationships and the order in which the data are accessed. Figure 1b shows how each variable is mapped into a different partition (each variable has its own cache).

Figure 1: Example. a) Data structures, b) Cache distribution

The compiler divides the data cache into several partitions. Then, the compiler maps all the data structures onto those partitions. When generating memory instructions, the compiler adds the partition identifier to the instruction so that the translation hardware knows which partition should be used to perform the memory access. To force any address from a given data set to map into its assigned partition, some bits of the address that are used to access the cache are fixed to 0 or 1 independently of their original value. With this simple address transformation we can select the concrete sets of the cache that can be used.

Consider a processor with 16 address bits and a direct mapped cache with a capacity of 16 lines and a line size of 32 bytes. Figure 2 shows, for the previous example, the transformation applied at run time to all memory references associated with the x variable, assuming that x is only allowed to use 4 lines (those with line addresses, in binary, 1000, 1001, 1010 and 1011).
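For instance, restricting x to those four lines amounts to forcing the two high-order bits of the 4-bit line index to 1 and 0, while the two low-order bits keep their original value (the concrete bit pattern below is only an illustration):

    original index = i3 i2 i1 i0   →   transformed index = 1 0 i1 i0        (e.g. 0111 → 1011)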

Figure 2: Example

Now, a possible hardware implementation of the address translation is described.

3.1 Hardware requirements

Suppose a w-way set associative cache with line size 2^l and 2^s sets, and let b be the size in bits of the mask and value fields. We call original index the s bits of the original address used to select the cache set, and transformed index the bits in the same positions of the transformed address. The s − b least significant bits of the original and transformed index have the same value, whereas any number of bits (from 0 to b) of the b most significant bits of the transformed index are always fixed to the same value, independently of their value in the original index. For example, in figure 2: w = 1 (a direct mapped cache), l = 5, s = 4 and b = 2.

A direct access table, the translation table, is needed to store the address transformation associated with each identifier. The number of table entries determines the maximum number of simultaneous partitions; this number is limited to 2^s (in that case, each set of the original cache can be seen as a partition).

Each entry of the translation table has two fields that determine the address translation to be applied when a memory reference is issued:

1. the mask, which has b bits associated with the b most significant bits of the index. Each bit that is set to 1 in the mask indicates that the associated bit has to be fixed in the transformed index, independently of its original value.

2. the value, which is a vector of b bits and indicates to which value each bit of the transformed index with a 1 in the corresponding bit of the mask has to be fixed. That is, for 1 ≤ i ≤ b:

    if (Mask(i) == 1) then
      transformed_index(s-b+i) = Value(i)
    else
      transformed_index(s-b+i) = Original_index(s-b+i)
    endif

and obviously b ≤ s.

To obtain the transformed address, the correct entry of the translation table is selected and the mask is used to select between the bits of the original address and the associated bits of the value. This means that we need b multiplexers with two one-bit inputs. The cache memory with the new hardware to configure a split cache can be seen in figure 3.
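The following C fragment is a minimal software model of the translation table and of the per-bit selection just described; the code and its names are only illustrative (they are not part of the proposed hardware), and the example entry reproduces the mask = 11, value = 10 case used for x in figure 2.

    #include <stdio.h>

    #define L 5            /* line size 2^L = 32 bytes       */
    #define S 4            /* 2^S = 16 sets, direct mapped   */
    #define B 2            /* bits of mask and value         */

    struct tt_entry {      /* one translation-table entry */
        unsigned mask;     /* B bits: which high index bits are fixed */
        unsigned value;    /* B bits: the value they are fixed to     */
    };

    /* Apply the per-bit selection: the S-B low index bits pass through,
     * and each of the B high bits is taken from `value` when the
     * corresponding mask bit is 1, or from the original index otherwise. */
    unsigned transform_index(unsigned index, struct tt_entry e)
    {
        unsigned out = index;
        for (int i = 0; i < B; i++) {
            unsigned bit = S - B + i;                    /* position in the index */
            if ((e.mask >> i) & 1u) {
                out &= ~(1u << bit);                     /* clear the bit ...     */
                out |= ((e.value >> i) & 1u) << bit;     /* ... and force it      */
            }
        }
        return out;
    }

    int main(void)
    {
        /* Entry used for the x variable in figure 2: both high index bits
         * fixed, to 1 and 0, so x can only use lines 1000..1011.          */
        struct tt_entry x_entry = { 0x3 /* 11 */, 0x2 /* 10 */ };

        unsigned addr = 0x0CE4;                          /* an arbitrary address */
        unsigned index = (addr >> L) & ((1u << S) - 1);
        printf("index %X -> %X\n", index, transform_index(index, x_entry));
        return 0;
    }

With this entry, any original index is mapped into the range 1000-1011, which is exactly the behavior required for x in the example above.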

Figure 3: Cache splitting hardware

An example of two different partitionings that can be obtained, depending on the number of index bits controlled by the mask, for a fixed number of translation table entries is now presented. In figure 4a a cache with eight sets is divided into four independent partitions of the same size; since b = 2, each of the partitions has the minimum size allowed. In figure 4b the same cache is again divided into four partitions, but mask and value can control one bit more than in the previous case (b = 3). In both cases the full cache capacity is used, but the second example has more resolution to choose the partition sizes. From this example two conclusions can be drawn: first, the bigger b is, the smaller the partitions can be. Second, all the partitions achievable with a given value of b can also be obtained with any configuration where the number of bits of mask and value is larger. A more detailed justification is given below.

[Figure 4 (see caption below). a) Translation table with b = 2: masks 11, 11, 11, 11 and values 00, 01, 10, 11 define four equal partitions (two sets each) of an eight-set cache. b) Translation table with b = 3: masks 100, 110, 111, 111 and values 0XX, 10X, 110, 111 define four partitions of four, two, one and one sets.]

Figure 4: Two different cache partitionings

Design parameters

Some relations between the number of cache partitions allowed and the number of bits in mask and value can be drawn. The compiler can use from 1 to e entries (where e is the number of translation table entries). The idea is that the entries used will be defined so as to map all the cache capacity without overlap between partitions.

If mask and value have b bits each, there are 2^s sets in the original cache and the number of bits set to 1 in the mask of entry i (all the bits fixed to 1 are consecutive, beginning at the high-order bit of the mask) is denoted m(i), then the number of sets assigned to that partition is 2^s / 2^m(i) = 2^(s − m(i)). To use all the cache capacity without overlap between partitions it is necessary that

    Σ_{i=1..e} 2^(s − m(i)) = 2^s

The number of bits set to 0 in the mask determines the size of the partition; the greater the number of zeros, the greater the size of the partition. The biggest partition is obtained when all the mask bits are set to 0; in this case one table entry covers all the original cache capacity. The smallest partition is obtained when all the mask bits are set to 1; if all the mask bits are set to 1 and b = s, the minimum size of a partition is one set.

A partition is a subset of contiguous sets in the cache. The base address of a partition in the cache depends on the bits fixed (how many bits are fixed and their position in the mask) and on the value they are fixed to. This address can be determined as

    Base_Address(i) = Σ_{j=1..m(i)} Value(i)_j · 2^(s − j)

where Value(i)_j denotes the j-th most significant bit of the value field of entry i.
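These relations can be made concrete with a small program. The following illustrative C fragment (the function names are not part of the proposal; the entries are those of figure 4b, with the don't-care bits of value written as 0) computes the size and the base set of each partition from its mask and value, and checks that the whole cache is covered.

    #include <stdio.h>

    #define S 3                         /* 2^S = 8 sets, as in figure 4 */
    #define B 3                         /* bits of mask and value       */

    struct tt_entry { unsigned mask, value; };

    /* Number of mask bits set to 1: they are the m(i) high-order index bits. */
    static int m_of(unsigned mask)
    {
        int m = 0;
        while (mask) { m += mask & 1u; mask >>= 1; }
        return m;
    }

    /* Partition size in sets: 2^(S - m(i)). */
    static int part_size(struct tt_entry e) { return 1 << (S - m_of(e.mask)); }

    /* Base set: the fixed value bits placed in the high-order index bits. */
    static int part_base(struct tt_entry e)
    {
        return (int)((e.value & e.mask) << (S - B));
    }

    int main(void)
    {
        /* Figure 4b: masks 100, 110, 111, 111 with values 0XX, 10X, 110, 111. */
        struct tt_entry t[4] = {
            { 0x4, 0x0 }, { 0x6, 0x4 }, { 0x7, 0x6 }, { 0x7, 0x7 }
        };
        int covered = 0;
        for (int i = 0; i < 4; i++) {
            printf("partition %d: base set %d, %d sets\n",
                   i + 1, part_base(t[i]), part_size(t[i]));
            covered += part_size(t[i]);
        }
        printf("sets covered: %d of %d\n", covered, 1 << S);
        return 0;
    }

Running this sketch yields partitions of 4, 2, 1 and 1 sets starting at sets 0, 4, 6 and 7, which together cover the eight sets of the cache exactly once, as required by the condition above.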

Area requirements

The hardware required to implement the address translation mechanism causes an increase in the area devoted to memory management. The amount of extra memory depends on b and e. It is interesting to have b as big as possible because the partitions can then be smaller (better use of the available capacity, see figure 4). However, for each bit added to the mask and value fields, a new bit has to be added to the tag in order to determine whether a given line is in the cache or not. Moreover, the area devoted to the translation table increases.

On the other hand, as the number of entries e of the translation table increases, more different data streams can be kept apart in the cache without any interference between them, but the area needed for the translation table also increases.

Usually e, the number of entries in the translation table, is going to be small because every memory instruction has to encode the entry to be used and we assume that there is little space left in the instruction encoding. Furthermore, the number of bits for mask and value (b) will be small because the bigger b is, the smaller the partitions are, and with small partitions more partitions are needed to cover all the cache capacity.

Effect on the processor cycle time

The original address of any data reference has to be transformed into a new address. This address translation has to be done in a way that does not affect the critical path of the processor. There are two steps in determining the new address. First, for each memory reference it is necessary to determine the address transformation (mask and value) associated with its partition; this is accomplished by an access to the translation table. Second, the transformation has to be applied to the memory address; this is accomplished by the pass through the multiplexers. The first step can be done as soon as the instruction is available, for example in the decode stage. This means that the mask and its associated value can be obtained in parallel with the original address calculation. Usually, in a pipelined processor, the cycle time is determined either by the ALU (address calculation) or by the memory stage (cache access). Then, the application of the address transformation should be placed in the stage that does not determine the cycle time. Since the table access is done in the decode stage, the multiplexers are selected before the cache address has been calculated. Thus, only a two-gate delay has to be placed in the less restrictive stage. If both stages are restrictive, the cycle time of the processor will be increased by the delay introduced by these two levels of gates. However, in many cases the address calculation stage can be redesigned to anticipate the address bits needed for the translation process, and no modification of the cycle time is produced.

In the remainder of this paper we assume that the processor cycle time is not increased by our mechanism.

3.2 Extension of the instruction set

The process of defining a partition is as simple as determining which bits of the index have to be fixed and what concrete value they get. The hardware mechanism presented requires an extension of the instruction set to properly manage the new structures introduced. In particular, one instruction is needed to write the mask and the value into the translation table.
Moreover, every memory access has to indicate the translation table entry to be used to translate the memory address. The assembler instructions could be something like

    Partition  Entry, Fixed_mask, Value_mask
    Load       Entry, @, Rd
    Store      Entry, Rf, @
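As an illustration of how a compiler could use this extension, the following C fragment sketches the setup for the ji loop of Section 3 on the cache used in Section 4 (2^8 sets, b = 2). The helper emit_partition and the printed form are hypothetical, and the partition sizes are those chosen later in Section 4.3.2 (half of the cache for y, one quarter for A, one quarter for x).

    #include <stdio.h>

    #define S 8     /* 2^S = 256 sets: the 8K-byte direct mapped cache of Section 4 */
    #define B 2     /* two bits of mask and value, as in Section 4                  */

    /* Hypothetical stand-in for the Partition instruction: in real code the
     * compiler would emit "Partition entry, mask, value"; here we just print it. */
    static void emit_partition(int entry, unsigned mask, unsigned value)
    {
        printf("Partition %d, mask=%u%u, value=%u%u\n",
               entry, (mask >> 1) & 1u, mask & 1u, (value >> 1) & 1u, value & 1u);
    }

    int main(void)
    {
        /* One possible assignment for the ji matrix-vector loop:
         * half of the cache for y, one quarter for A, one quarter for x. */
        emit_partition(1, 0x2 /* 10 */, 0x0 /* 0X */);  /* y: sets 0..127   */
        emit_partition(2, 0x3 /* 11 */, 0x2 /* 10 */);  /* A: sets 128..191 */
        emit_partition(3, 0x3 /* 11 */, 0x3 /* 11 */);  /* x: sets 192..255 */

        /* Inside the loop, each reference would then name its entry, e.g.
         *   Load 1, y(i), Rd     Load 2, A(i,j), Rd     Load 3, x(j), Rd   */
        return 0;
    }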

The usual way to use the new instructions is as follows: first, the compiler determines the number of partitions needed and their sizes; then it generates the mask and value that define each partition and loads the translation table. Every memory reference then uses the suitable entry of the translation table.

It is important to note that this scheme is dynamically changeable. Thus, the compiler can use different cache partitionings in different parts of the program as the memory needs change (number and size of the data structures being accessed). This way, the program dynamically adjusts the data distribution over the cache to minimize the number of misses.

4 Evaluation and examples of use

4.1 Partitioning criteria

Our objective is to reduce conflict misses. These misses can be subdivided into cross-interferences, when data from different data structures compete for the same lines, and auto-interferences, when the same data structure competes for the same cache lines. We define a stream as the sequence of memory references due to a concrete memory access instruction.

The main criterion to determine what data is mapped in a given partition could be to separate the memory accesses to each different data structure. This way, the compiler can guarantee that no cross-interferences will happen (see figure 1). However, the same data structure can be accessed by more than one stream. If different streams are assigned to different partitions, all the possible conflicts (auto- and cross-interferences) can be eliminated.

The size of the partition assigned to one stream depends on the locality type of that stream. A stream with spatial locality only needs one cache line to exploit that locality. Thus, the smallest possible partition will be assigned to streams having only spatial locality. When a stream has temporal locality it will be mapped in a partition where all its working set fits (when possible). Blocking techniques [4] and all the well known compilation techniques used to generate code for vector processors can be used to reduce the working set of the program and improve the locality in all the partitions.

When there are more active streams than possible partitions, some streams have to coexist in the same partition. The compiler will try to mix those streams that have the least temporal locality. Moreover, when the compiler does not know how to distribute the data in the cache, there is always the possibility of defining only one partition that covers all the cache memory and mapping all the streams on it (as in the traditional cache).

4.2 Coherence issues

When assigning streams to partitions, the compiler has to be able to determine whether more than one stream references the same memory positions. Clearly, if two different streams that access some common positions are assigned to different partitions, a coherence problem appears. Another coherence problem arises when two different processes share data structures. In both situations, the compiler has to guarantee that the shared data is mapped in an equivalent partition, that is, a partition that begins at the same cache position and has the same size.

In any case, when the compiler cannot guarantee that all the references to a given data element go through the same cache position, it can define a specific partition to hold this kind of data.

4.3 Examples

To evaluate the effectiveness of the separation of streams we have performed experiments using matrix by matrix and matrix by vector multiplication.
The experiments consist of simulations of the behavior of the cache system using address traces. To obtain these traces, we have instrumented by hand the Fortran code of each of the algorithms. The performance metric used is the number of misses per flop (MPF), because it can be translated easily to execution time (recall that we assume that our translation hardware does not modify the processor cycle time), and we compare the use of the original cache with the use of cache splitting.

The main characteristics of the cache system are: 1K words (or 8K bytes), direct mapped, 4 words per line (l = 4), and translation table entries with two bits for mask and value (b = 2).

4.3.1 Matrix by Matrix multiplication

We use the jki form of matrix multiplication to evaluate the performance that can be obtained when using the split cache.

    DO j= 1, N
      DO k= 1, N
        b_aux= B(k,j)
        DO i= 1, N
          C(i,j)= C(i,j) + A(i,k)*b_aux
        ENDDO
      ENDDO
    ENDDO

When the problem does not fit in the cache, a significant part of the MPF is due to capacity misses. In addition, there are interference misses, which increase with the size of the problem. Furthermore, these interferences produce very large spikes, since for specific values of the matrix size (the leading dimension of the matrix) columns of matrix A are mapped onto the same cache lines as those of matrix C, giving a pathological interference pattern. Using cache splitting we can eliminate all these interferences and obtain better performance and a flat behavior for any matrix dimension. Depending on the problem size there are two cache partitionings that give the best performance. When the whole matrix A fits in one partition of half the cache capacity, all its temporal locality is reused; B and C go through two different partitions of any size to avoid conflicts. The second possibility is when A does not fit in half the cache; then this partition is assigned to C, and A and B are mapped into different partitions of any size. If the problem size is unknown at compilation time, the program chooses the best partitioning at execution time.

The improvement of the split cache over the traditional cache can be seen in figure 5; the best partitioning has been used at each point for the split cache.

4.3.2 Matrix by Vector multiplication

The code for the ji form of matrix by vector multiplication was presented in Section 3. Since matrix A has no temporal locality, the compulsory misses are very important. Therefore, it is essential to avoid interferences on the x and y vectors and exploit their spatial and temporal locality.

On a conventional cache, the performance of this algorithm depends on how the columns of matrix A are mapped in the cache relative to vector y. This results in a variable performance, as shown in figure 6. This can be avoided using cache splitting in the following way. The y vector can use half of the cache capacity. All references to matrix A can go through one quarter, although one line would have been enough to exploit its spatial locality. The last quarter of the cache is used for vector x. This partitioning of the cache and assignment of the reference streams obtains very good performance for a large range of problem sizes. If C is the cache capacity in words then, for N = 1 to C/2 there are only compulsory misses, whereas for N = C/2 to C the part of y that is reused diminishes continuously, and finally for N ≥ C the vector y cannot be reused. However, for any problem size we avoid all pathological behavior.

The MPF obtained can be seen in figure 6; the split version always obtains better performance, and at some points the performance improvement is close to 500%.
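The miss counts behind these MPF figures come from trace-driven simulation. The following fragment is only a minimal illustration of such an experiment, not the simulator used for the measurements: a direct mapped cache is simulated line by line and, when the split is enabled, the mask/value transformation is applied to the set index before the lookup. The memory layout and partition choice below are our assumptions for the ji matrix-vector loop.

    #include <stdio.h>

    #define LINE_WORDS 4            /* 4 words per line        */
    #define SETS       256          /* 1K words, direct mapped */

    static long tag_of[SETS];       /* tag stored in each set (-1 = empty) */
    static long misses;

    /* One simulated reference to word address `addr`.
     * mask/value fix some of the two high-order index bits when split != 0. */
    static void access_word(long addr, int split, unsigned mask, unsigned value)
    {
        long line = addr / LINE_WORDS;
        unsigned set = (unsigned)(line % SETS);
        if (split)
            set = (set & ~(mask << 6)) | ((value & mask) << 6);  /* b = 2 */
        long tag = line / SETS;
        if (tag_of[set] != tag) { misses++; tag_of[set] = tag; }
    }

    int main(void)
    {
        for (int i = 0; i < SETS; i++) tag_of[i] = -1;

        /* Trace of the ji matrix-vector loop, N x N column-major matrix.
         * Partitions as in 4.3.2: y -> half, A -> one quarter, x -> one quarter. */
        long N = 600, A0 = 0, x0 = N * N, y0 = x0 + N;
        for (long j = 0; j < N; j++) {
            access_word(x0 + j, 1, 0x3, 0x3);                 /* x(j)       */
            for (long i = 0; i < N; i++) {
                access_word(y0 + i, 1, 0x2, 0x0);             /* y(i) load  */
                access_word(A0 + j * N + i, 1, 0x3, 0x2);     /* A(i,j)     */
                access_word(y0 + i, 1, 0x2, 0x0);             /* y(i) store */
            }
        }
        printf("MPF = %.3f\n", (double)misses / (2.0 * N * N)); /* 2 flops/iter */
        return 0;
    }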

[Plot: MPF versus matrix order (0 to 800) for the normal cache and the split cache.]

Figure 5: MPF for the jki form of matrix by matrix multiplication

5 Conclusions and future work

A combined hardware/software mechanism called split cache has been presented. This scheme allows the compiler to dynamically split the cache into several separate caches (partitions), in many cases without increasing the cycle time. Every partition can have a different size. With this new cache organization, the compiler can easily separate each stream from the others, eliminating the conflict misses. Moreover, the full cache memory is used better, meaning that some capacity misses can also be removed.

Our mechanism is especially useful for direct mapped caches but can also be used in set associative caches. However, when it is used in a set associative cache, the number of conflict misses is already smaller, so a lower performance improvement is expected.

Some interesting points that are being studied are:

- Determining compiler algorithms to get the best performance out of the new cache while guaranteeing that there are no coherence problems.

- Application and evaluation of cache splitting for instruction caches. The main idea is to avoid interferences between subroutines.

- Improving the performance of mixed caches. Depending on the data and instruction working sets of the program, the cache capacity can be dynamically distributed. This means that the best features of split and mixed caches can be obtained.

- From the point of view of the operating system, a split cache could be used to avoid interferences between the system and the user processes. A cache-demanding process could even get a separate part of the cache so that no other process could flush the cache of that process.

- Using the same mechanism, the split cache with the same hardware presented in this paper, to define partitions that can be included inside other partitions. Some interferences would still remain but the capacity would again be used better.

- A selective prefetching mechanism can be implemented. Depending on the locality characteristics of a stream, a partition could automatically prefetch data lines only when it is really useful.

[Plot: MPF versus matrix order (0 to 2000) for the normal cache and the split cache.]

Figure 6: MPF for the ji form of matrix by vector multiplication

- A full hardware management of the cache partitioning and referencing process. With this approach it is not necessary to modify the instruction set of the processor.

References

[1] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.

[2] Mark D. Hill and Alan Jay Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1630, Dec 1989.

[3] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of the Int. Symp. on Computer Architecture, pages 364-373, 1990.

[4] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In ASPLOS, pages 67-74, 1991.

[5] Lishing Liu. Partial address directory for cache access. IEEE Transactions on Very Large Scale Integration Systems, 2(2):226-239, Jun 1994.

[6] A. Seznec and F. Bodin. Skewed associative caches. Technical Report 1655, INRIA, Mar 1992.

[7] Andre Seznec. DASC cache. In Proc. of the 1st Int. Symp. on High-Performance Computer Architecture, pages 134-143, Jan 1995.

[8] Kimming So and Rudolph N. Rechtschaffen. Cache operations by MRU change. IEEE Transactions on Computers, 37(6):700-709, Jun 1988.

[9] O. Temam and Y. Jegou. Using virtual lines to enhance locality exploitation. In Proc. of the Int. Conf. on Supercomputing, pages 1-12, 1994.

