
Locality-Based Online Trace Compression

Yue Luo, Student Member, IEEE, and Lizy Kurian John, Senior Member, IEEE

Abstract—Trace-driven simulation is one of the most important techniques used by computer architecture researchers to study the behavior of complex systems and to evaluate new microarchitecture enhancements. However, modern benchmarks, which largely resemble real-world applications, result in long and unmanageable traces. Compression techniques can be employed to reduce the storage requirements of traces. Special trace compression schemes such as Mache and PDATS/PDI take advantage of spatial locality to compress memory reference addresses. In this paper, we propose the Locality-Based Trace Compression (LBTC) method, which employs both the spatial locality and the temporal locality of program memory references. It efficiently compresses not only the address but also the other attributes associated with each memory reference. In addition, LBTC is designed to be simple and on-the-fly. For traces with addresses and other attributes, LBTC improves the compression ratio by a factor of 2 over PDI.

Index Terms—Measurement techniques, tracing, performance analysis and design aids, modeling of computer architecture.

1 INTRODUCTION

SIMULATION is one of the most important techniques that computer architecture researchers use to understand the behavior of complex systems and to evaluate new microarchitecture enhancements. Currently, most research in this area uses either execution-driven simulation or trace-driven simulation. Trace-driven simulation remains an important technique, especially for studying complex commercial applications, because it is usually a heroic task to set up and tune execution-driven simulators to simulate runs of commercial server benchmarks (e.g., TPC-W) while making sure that the execution delivers the best performance and meets the run rules at the same time. Traces, on the other hand, can be collected once by server performance experts and then shared with system designers relatively easily. Moreover, if only part of the system (e.g., the memory hierarchy) is to be studied, execution-driven full-system simulation is often not necessary and tends to be slower than trace-driven simulation.

One of the major problems of tracing is the high cost of storing the traces. Modern benchmarks from both scientific and commercial workloads largely resemble real applications. They often run for minutes even on GHz machines, resulting in huge trace files. Certainly, storage costs and capacities are improving at rates comparable to those of processor speeds, but managing terabytes of trace data is not a simple task. Therefore, various techniques have been used to reduce trace volumes.

Trace reduction techniques can be classified into lossy and lossless approaches. The lossy approach aims at retaining the representative part of the original trace and discards the remainder. The most commonly used lossy trace reductions are sampling and filtering. Trace sampling stores relatively short bursts of references. The bursts can be selected randomly, at regular intervals, or based on a clustering method [19], [23]. Trace filtering discards all references that hit in a small “filter cache” [1], [16], [18], [20]. Lossless trace reduction is often called trace compression. While retaining all the information in the original trace, it reduces the size of the trace file by filtering out the redundancy in the trace. Researchers have long recognized that there is abundant redundancy in program address streams [9]. The first-order Markov entropy is about 1-2 bits per address [3]. Yet encoding based purely on coding theory is too complex and computationally expensive to implement [3]. General-purpose compression algorithms like LZW [24] (used in Unix compress) or LZ77 [25] (used in gzip) can certainly be applied to reduce the trace file size and achieve a good compression ratio. Trace compression techniques, on the other hand, usually offer better compression by taking advantage of redundancy specific to the traces. Most trace compression techniques work together with general-purpose compression algorithms.

Different applications require different types of traces. As a result, different compression techniques have been proposed to compress these traces. HAFT [5] effectively compresses traces of heap allocation events (e.g., malloc, free, etc.) for studying dynamic memory allocation performance. Hamou-Lhadi and Lethbridge [10] developed a comprehension-driven compression framework that compresses traces of procedure calls. In this paper, we study techniques to compress program execution traces for processor simulation. Some of these techniques rely heavily on analysis of the program’s control flow information [6], [8], [12], [15]. Although they often give the best compression ratio, they usually need either to parse the trace multiple times, requiring large intermediate storage, or to instrument the source code or binary code of the benchmark program, a complex process often limited by the availability of tools. Instrumenting the program for tracing a full system with multiple processes and operating system activity is extremely hard. This paper focuses on online (one-pass, without intermediate storage) trace compression techniques that do not require code instrumentation. Similar techniques, such as Mache, PDATS/PDI, SBC, and VPC, are discussed in the next section.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 6, JUNE 2004 723

The authors are with the Department of Electrical and Computer Engineering, University of Texas at Austin, 1 University Station, Austin, TX 78712. E-mail: {luo, ljohn}@ece.utexas.edu.

Manuscript received 3 July 2003; revised 10 Dec. 2003; accepted 6 Jan. 2004. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-0076-0703.

0018-9340/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.

In this paper, we present a trace compression scheme, Locality-Based Trace Compression (LBTC), which is suited for online compression of full-system traces. Previous methods usually handle address-only traces (Mache, PDATS) and, possibly, instruction words (PDI). The LBTC scheme accommodates other attributes and events in addition to address streams and the instructions themselves. Therefore, it is good for memory hierarchy simulators as well as detailed trace-driven microarchitecture simulators. The compression method is based on the same spatial and temporal locality principles that have been successfully exploited by microprocessor hardware caches for decades. The resulting compression ratio is about 2X better than the PDI format.

2 RELATED WORK

Previous trace compression schemes such as Mache [17] and PDATS/PDI [11] make use of the small offsets between addresses of successive memory references. The Mache scheme [17] records addresses of three types of memory references: instruction fetch, data read, and data write. A 2-bit “label” is used to differentiate the type of reference. Three variables, initialized to zero, contain the last address in the trace for each label. Upon a new memory reference (i.e., a (label, address) pair), the offset is computed between this address and the last address for this label. If the offset is smaller than a given threshold, it is emitted to the trace file with the label; otherwise, the full address is emitted. Because of spatial locality, the offset usually requires fewer bits than the 32-bit address. An example scheme [17] packs the data into a 16-bit word with 2 bits reserved for the label and 14 bits for the signed address offset, implying a threshold of 2^13 = 8,192. Finally, the processed trace is passed to a general-purpose one-pass compression scheme such as the Unix compress utility. This seemingly simple compression scheme is very effective: Mache (with compress) almost always creates a file at least 10 times smaller than the original trace file and over three times smaller than that produced by compress alone.
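The Mache packing described above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names are ours, and list-of-tuples output stands in for the real binary format.

```python
# Sketch of Mache-style offset encoding. Assumptions: 32-bit addresses and
# the 16-bit packed word of the example scheme (2-bit label + 14-bit
# signed offset).

THRESHOLD = 1 << 13  # 14-bit signed offset covers -8192..8191

def mache_encode(trace):
    """trace: iterable of (label, address); label 0/1/2 = ifetch/read/write."""
    last = {0: 0, 1: 0, 2: 0}          # last address per label, initially zero
    out = []
    for label, addr in trace:
        offset = addr - last[label]
        if -THRESHOLD <= offset < THRESHOLD:
            # pack label (high 2 bits) and offset (low 14 bits) into 16 bits
            out.append(("short", (label << 14) | (offset & 0x3FFF)))
        else:
            out.append(("full", label, addr))  # fall back to the full address
        last[label] = addr
    return out
```

A real implementation would write the packed words to a file and pipe them through compress; the tuples here just make the record boundaries visible.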

The PDATS scheme [11] records the memory reference address and an optional timestamp indicating when the reference occurred. A trace record in PDATS format consists of a header byte, a repetition count byte (0-1 byte), an address offset (0-4 bytes), and an optional timestamp offset. Like Mache, PDATS also takes advantage of spatial locality by encoding the offsets of the addresses and timestamps, but it extends the Mache scheme in several ways. First, PDATS allows variable-length address offsets (1, 2, or 4 bytes). The offset is no longer limited to 14 bits. If its absolute value is less than 128, the offset occupies only 1 byte. Second, PDATS takes advantage of “sequentiality.” Sequentiality is an extreme form of spatial locality in which references in a stream progress monotonically through contiguous memory locations. For example, the addresses in a sequential stream of instructions from a processor with 32-bit-wide instructions show a constant stride of 4. It is observed that over 90 percent of instruction fetches are from sequential locations. Therefore, this special offset of 4 can be encoded in the trace record header with far fewer bits, instead of occupying a whole offset byte. Finally, nearly half of all the references examined are from the same stream and have the same offset as the immediately preceding reference. When repetition of the offset occurs, a repeat flag is set. The byte following the header byte specifies the number of times that this offset is repeated contiguously in the original trace. As a result, PDATS improves compression by 2X over Mache before compress is applied and shows a 25 percent improvement after compress is applied.
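The variable-length offsets and the repetition count can be sketched as follows. This is our simplification: timestamps and the reference-type split are omitted, and records are Python dicts rather than packed header/offset bytes.

```python
def offset_width(offset):
    """Smallest PDATS-style signed offset width: 1, 2, or 4 bytes."""
    if -128 <= offset < 128:
        return 1
    if -32768 <= offset < 32768:
        return 2
    return 4

def pdats_encode(addresses, stride=4):
    """Encode an address stream as (offset, width, repeat) records.
    The common sequential offset `stride` is flagged in the header and
    costs no offset bytes; a repeated offset bumps a repetition count."""
    records, last_addr, last_off = [], 0, None
    for addr in addresses:
        off = addr - last_addr
        if records and off == last_off:
            records[-1]["repeat"] += 1      # same offset again: count it
        else:
            records.append({
                "offset": off,
                "width": 0 if off == stride else offset_width(off),
                "repeat": 0,
            })
        last_addr, last_off = addr, off
    return records
```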

The Mache and PDATS traces contain only the address information; they are suited for memory hierarchy simulations. When we need to evaluate processor designs, the trace must contain several attributes (e.g., the instruction word) in addition to memory addresses. The PDI format [11] augments PDATS by including the instruction words. In PDI, addresses are compressed using a subset of the PDATS techniques, while the instruction words are compressed using a dictionary-based approach. It is observed that the 256 most frequent instruction words account for 56 percent to 99.9 percent of the instructions executed, with a median of 86 percent. To compress the instruction words, a dictionary of the 256 most frequent instruction words, which requires only a 1-byte index to access, is stored at the beginning of the trace file. Each occurrence of these instruction words in the trace is encoded as a 1-byte index into the dictionary. On a MIPS R3000 machine, the compressed SPEC92 traces in PDI format with instruction words are only 33 percent larger than traces in PDATS format (without instruction words) [11].
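A dictionary encoder in the spirit of PDI might look like this. It is a sketch under our own representation: the real format stores the dictionary at the head of the trace file and packs each index as a single byte.

```python
from collections import Counter

def build_dictionary(instruction_words, size=256):
    """The `size` most frequent instruction words, as a PDI-style dictionary."""
    return [word for word, _ in Counter(instruction_words).most_common(size)]

def pdi_encode(instruction_words, dictionary):
    """Replace dictionary words with a 1-byte index; others stay verbatim."""
    index = {word: i for i, word in enumerate(dictionary)}
    out = []
    for word in instruction_words:
        if word in index:
            out.append(("idx", index[word]))   # 1 byte in the real format
        else:
            out.append(("raw", word))          # full instruction word
    return out
```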

Most recently, two new online trace compression techniques have been proposed. Like PDI, Stream-Based Compression (SBC) [14] also handles instruction words. It compresses the instruction sequence and the data address sequence separately. The data address sequence is compressed with offset encoding similar to PDATS. The instructions are divided into “streams.” A stream is a sequential run of instructions from the target of a taken branch to the first taken branch in the sequence. Because the program often repeatedly executes the same stream, each stream is identified and stored in a stream table. Indices into the stream table, instead of the streams themselves, are stored in the compressed trace file, resulting in 2 to 5 orders of magnitude reduction in Dinero+ traces [7]. Because the whole stream table is loaded into memory during decompression, its memory usage is not bounded. The memory required by SBC decompression is approximately proportional to the instruction footprint of the program being traced.
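The stream-table idea can be sketched as follows. This is a simplification that detects stream boundaries from the PC sequence alone and assumes a fixed instruction length; real SBC works on decoded instruction records.

```python
def sbc_streams(pcs, inst_len=4):
    """Split a PC trace into streams and replace each stream with a
    table index. A new stream begins whenever the previous PC did not
    fall through sequentially (i.e., a taken branch was encountered)."""
    table, indices, stream, prev = {}, [], [], None
    for pc in pcs:
        if prev is not None and pc != prev + inst_len:
            # flush the finished stream: reuse its index if seen before
            key = tuple(stream)
            indices.append(table.setdefault(key, len(table)))
            stream = []
        stream.append(pc)
        prev = pc
    if stream:
        key = tuple(stream)
        indices.append(table.setdefault(key, len(table)))
    return table, indices
```

Repeated streams cost only an index, which is why loops compress so well under this scheme.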

Another trace compression technique is the value prediction-based compression scheme (VPC) proposed by Burtscher and Jeeradit [4]. Many value prediction schemes have been proposed in processor architecture research for improving the performance of the microprocessor, whereas Burtscher and Jeeradit used value prediction to compress traces. A hybrid value predictor consisting of several subpredictors is created in the compression program. The predicted value is compared with the true value in the original trace. If the two values are equal, then only a bit is stored in the compressed trace indicating a correct prediction. Otherwise, the true value is stored in the compressed trace. The same predictor is also used in the decompression program, which replaces the “correct prediction” flag bit with the predicted value to recover the original trace. VPC can compress load value traces by a factor of about 34 on average. The memory usage is bounded by the size of the value predictor. Because of the complexity of the value predictor, decompression is slow compared to other methods.
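The flag-bit scheme can be illustrated with a single last-value subpredictor. The real VPC uses a hybrid of several subpredictors and a bit-packed output; this sketch keeps only the essential compress/decompress symmetry.

```python
def vpc_compress(values):
    """Store a 'hit' flag when the predictor is right, else the value."""
    predicted, out = 0, []
    for v in values:
        out.append(("hit",) if v == predicted else ("miss", v))
        predicted = v                    # last-value predictor update
    return out

def vpc_decompress(records):
    """Run the same predictor to replace each flag with its prediction."""
    predicted, values = 0, []
    for rec in records:
        v = predicted if rec[0] == "hit" else rec[1]
        values.append(v)
        predicted = v
    return values
```

Because compressor and decompressor update identical predictor state, a one-bit flag is enough to reproduce the original value exactly.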

3 LOCALITY-BASED TRACE COMPRESSION

In this section, we present the proposed Locality-Based Trace Compression (LBTC) method, which is suitable for online compression of full-system traces. It compresses the trace on-the-fly without complicated analysis. It is intended to handle both CISC (variable instruction length) and RISC instruction sets efficiently. Trace files typically consist of trace records, each of which corresponds to a memory reference (data read, data write, or instruction fetch), an exception, or another event (e.g., a cache flush). A trace record in LBTC captures not only the address of the memory reference but also other system information that is important for accurate memory hierarchy simulation (e.g., whether the memory location is cacheable) and accurate processor microarchitecture simulation (e.g., the instruction word). The different types of information recorded in a trace record are called attributes. Structurally, a trace record consists of a header and some attribute fields following it. The header tells the type of the trace record and specifies the attribute fields (e.g., whether a field exists and how long it is). Table 1 gives a list of these attributes. Different applications require different attributes to be recorded in the trace. After exploring the locality in the trace, each user can devise the optimal format for his or her application. Exceptions and other events are also captured in the trace file, but they occur so rarely compared to memory references that their coding has no significant effect on the size of the trace. Their trace records are not discussed hereafter.

The proposed LBTC scheme employs two techniques. One technique, offset encoding, is inherited from the Mache and PDATS formats, which take advantage of the spatial locality in memory references. Spatial locality is the property that the next address accessed will probably be very close to the last address accessed. As a result, recording the small offset between the addresses of successive references requires fewer bits than recording the full addresses in the trace file. The encoding of the instruction physical address is presented here as an example. First, the offset between the addresses of the current instruction and the last instruction is computed. We allocate a 2-bit field (pa_code) in the trace record header to indicate the length of the offset (see Table 2). Then, the offset is stored after the header in 0, 1, 2, or 4 bytes. Please note that, when the address of the current instruction is contiguous to the last instruction (pa_code = 00), the current address can be calculated by adding the length of the last instruction word to the last address. Thus, no extra offset byte is needed. A difference between PDATS/PDI and LBTC is the way sequentiality is exploited. The PDATS/PDI format encodes an offset of 4 as 00. It works well for MIPS instructions, which have a fixed length of 4 bytes. But, for x86, the instruction length ranges from 1 to 15 bytes. The pa_code = 00 encoding allows LBTC to encode variable-length as well as fixed-length instructions efficiently, because contiguous instructions are often executed unless interrupted by a taken branch. Virtual addresses are treated in a similar way. In most cases, the current reference is in the same page as the last reference. Only one bit is needed to flag this situation, where no extra byte for the offset is needed. In other cases, the offset of the page number is stored (1 to 3 bytes). A 4-byte offset is needed only in extreme cases.
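For the instruction physical address, the encoding just described might be sketched as follows. The 0/1/2/4-byte widths and the pa_code = 00 contiguous case follow the text; the exact bit patterns assigned to the other three codes are our assumption, not a reading of Table 2.

```python
def encode_pa(addr, last_addr, last_inst_len):
    """Return (pa_code, offset_bytes) for an instruction physical address.
    pa_code 0b00 means contiguous with the previous instruction, so the
    decoder recomputes the address from the last instruction's length and
    no offset bytes are stored. Other codes select a 1-, 2-, or 4-byte
    signed offset (bit patterns here are hypothetical)."""
    if addr == last_addr + last_inst_len:
        return 0b00, b""                      # sequential: zero offset bytes
    offset = addr - last_addr
    for code, width in ((0b01, 1), (0b10, 2), (0b11, 4)):
        try:
            return code, offset.to_bytes(width, "little", signed=True)
        except OverflowError:
            continue                          # try the next wider field
    raise ValueError("offset does not fit in 4 bytes")
```

Because the contiguous case uses the previous instruction's length rather than a fixed stride of 4, the same code path handles x86's 1- to 15-byte instructions.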

TABLE 1
Example Attributes Recorded in a Memory Reference Trace Record

TABLE 2
LBTC Trace Record Header Bits for Instruction

The other technique in LBTC is based on the “static” property of most of the attributes and the well-known temporal locality of memory references. Because we are storing instruction words and other attributes in addition to physical addresses, the attribute information takes more than half of the space. Encoding the 256 most frequent instruction words in 1 byte, as in the PDI format, is not effective enough for our application, as we will show in Section 4.1. Fortunately, most of the attributes are “static.” They do not change frequently from one dynamic access to another if the references are to the same memory addresses or initiated by the same instruction. For example, after a non-self-modifying benchmark program is loaded, the (static) instruction at a specific address will remain the same until the program terminates (or is swapped out onto disk) and a new module is loaded at the same memory area. Therefore, the next (dynamic) instruction fetched from the same address will probably have the same instruction word. This holds true for many other attributes such as virtual addresses, memory type, etc. For all the “static” attributes, temporal locality can be effectively employed for compression. Memory references from programs are known to show abundant temporal locality: If a memory address is accessed, it will probably be accessed again in the very near future. Therefore, we emulate a small direct-mapped cache structure in software, called a compression cache, to keep the recently seen memory references. If the next memory reference hits in the cache, all the “static” attributes can be retrieved from it without being stored in the trace file. We index the compression cache with the physical address of instruction fetches.
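The compression cache just described can be emulated in a few lines. This is a sketch under our own representation: trace records become Python dicts, and the field names are hypothetical. The paper's cache holds 32K entries and is probed with one bit-wise AND plus one indexing operation, which the lookup below mirrors.

```python
class CompressionCache:
    """Direct-mapped software cache of recently seen instruction records,
    indexed by instruction physical address."""

    def __init__(self, entries=32 * 1024):
        assert entries & (entries - 1) == 0, "power of two so AND can mask"
        self.mask = entries - 1
        self.entries = [None] * entries

    def lookup(self, record):
        """Return True on a hit. Every attribute must match: the same
        physical address may later hold a different instruction (new
        module loaded, self-modifying code), so address equality alone
        is not enough for correctness."""
        idx = record["pa"] & self.mask      # one AND plus one index
        hit = self.entries[idx] == record
        if not hit:
            self.entries[idx] = record      # install on miss
        return hit

def compress(records, cache):
    out = []
    for rec in records:
        if cache.lookup(rec):
            out.append(("hit", rec["pa"]))  # static attributes omitted
        else:
            out.append(("miss", rec))       # all attributes stored
    return out
```

In the real format a hit still stores the address offset and a header flag; only the “static” attributes are elided.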

In our scheme, no separate data cache is created for data references. Instead, the data reference attributes are attached to the instruction before them. A data reference that appears after an instruction is probably initiated by that instruction, but there is also a slight possibility that the data reference was caused by DMA operations or by TLB misses. If the data reference attached to a dynamic instruction is the same as the data reference of the last execution of the same static instruction, and the instruction is found in the cache, we call it a data reference hit. The data reference record is then retrieved from the compression cache with the instruction physical address as the index. In this way, on a data record hit, even the data address can be omitted from the trace. Only one bit in the header is needed to indicate that the data can be retrieved from the cache during decompression. This works because many instructions access the same memory address as in their last execution. Another approach is to create a separate data cache indexed by the physical address of data references. The trade-off is briefly discussed in Section 4.2. The compression cache that we implemented contains only 32K entries; thus, it does not tax the virtual memory system. It is also very fast because one access involves only one bit-wise AND operation and one indexing operation. Set-associative or fully associative data structures may improve the compression ratio but require much more time to access. The working of the compression cache is analogous to that of a hardware cache.

The C-like pseudocode in Fig. 1 illustrates the steps in LBTC compression. The direct-mapped compression cache structure is not part of a trace file but is created on-the-fly and used in the program. Only the cache parameter (i.e., the number of entries) needs to be passed to the compression and decompression programs. The tracer is a utility such as the trace module in Simics [13], from which we get the original uncompressed trace records. A cache_entry_t structure contains the trace record of an instruction fetch and the data memory reference records following it. The emit statement in the figure writes the trace information to the compressed trace file.

Fig. 1. Pseudocode illustrating the LBTC algorithm.

Table 3a gives an example to illustrate the steps used to produce a compressed trace. Column 1 assigns a number to each trace record. Column 2 gives an ID to each static instruction. Column 3 numbers the dynamic data references. Column 4 shows the physical addresses of the memory references. The next four columns are the contents of the compressed trace. Column 5 is a bit showing the type of the trace record (I for instruction, D for data). The algorithm converts the physical addresses to address offsets, which are stored in the trace file in two’s complement using the minimum number of bytes required to hold the offset. As in the Mache and PDATS/PDI formats, the address offsets are only calculated between successive references of the same type (data access or instruction fetch), but we do not further split data accesses into subtypes. The first address of each type is simply reproduced in the compressed file. The offsets stored in the trace file are in column 6. Column 7 is the 1-bit field in every trace record header indicating whether the reference can be retrieved from the compression cache (M for cache miss, H for cache hit). In this example, we assume a direct-mapped cache with four entries. The physical address of the instruction is used to calculate the index into the cache, whose operations are shown in the last column. Before each instruction is put into the cache, the cache is probed to see if the instruction is already there. If the instruction is already in the cache (a cache hit), only the address offset is stored in the trace file with a flag indicating a cache hit. On a cache miss, all the attributes are stored. Each data reference is associated with the instruction before it, and is thus in the same cache entry as the instruction.

TABLE 3
Example of Trace Compression Using LBTC: (a) Trace Compression Procedure, (b) Compression Cache Content after Record #5, (c) Compression Cache Content after Record #8, (d) Compression Cache Content after Record #12, (e) Compression Cache Content after Record #15

At the time of compression, if the physical address of aninstruction is found in the cache, all attributes of theinstruction (or data) trace record will be compared withthose in the cache (lines 14 and 25 in Fig. 1). To ensurecorrectness, only when every attribute matches will it beconsidered a cache hit. Finding the same physical addressin the cache does not guarantee that the same instruction isfound because the operating system may have loaded a newprogram at the same memory area or the program might beself-modifying. Table 3b to Table 3e show the contents ofthe compression cache at the selected points duringcompression. After trace record #5, instructions I1, I2, I3,and the data references are put into the cache (Table 3b).Trace records #6, #7, and #8 are found in the cache; the fullattributes are not stored in the trace file (Table 3c).Records #9 to #12 incur cache misses (Table 3d) and arestored in the trace with full attributes. Please note that I4replaced I3 just as in a regular hardware cache operation.Trace record #13 and one following data reference (D6) hitin the cache (Table 3e), but the next data reference (D7) is acache miss and all its attributes must be stored. After theencoding, a general-purpose compression utility such asgzip is applied to further reduce the file size.

4 RESULTS

We gathered traces using Simics/x86 [13], an x86 full-system simulator. In our experiments, Simics simulates a Pentium II machine running Red Hat Linux 7.3. We select SPECint2000 [21] to represent CPU-intensive benchmarks and SPECweb99 [22] to represent commercial servers. For SPECweb, we use Apache 2.0 as the web server and mod_perl [2] to speed up some dynamic web operations. Unless otherwise noted, about one hundred million instructions are gathered for each benchmark. Approximately two billion instructions are skipped for every SPECint2000 benchmark before tracing.

4.1 Compression Ratio

Many general-purpose compression algorithms can be used to reduce trace file sizes. We compare with gzip, one of the most popular general-purpose compression utilities. Gzip implements the LZ77 algorithm [25], which strives to find repetitive patterns within a sliding window. We dump the Simics trace file in its uncompressed raw binary format. To keep the raw trace file size manageable, 10 million instructions, instead of 100 million, are collected in this gzip comparison experiment. Fig. 2 shows the size of the compressed trace normalized to the size of the raw trace file. The columns for SPECint show average values over all programs. The error bars show the maximum and minimum values. The Simics raw trace format uses a union structure to accommodate the maximum size of the different types of trace records. Therefore, each trace record in the trace file may contain unused bits. These unused bits may take on arbitrary values, making an otherwise repetitive record look different and rendering gzip less effective. To ensure a fair comparison, we forced these unused bits to be zero. We also modified the Simics trace module and set its data value field to zero because we do not trace the data values of memory loads/stores in LBTC. Fig. 2 shows that gzip reduces the size of the trace file by an order of magnitude, but LBTC alone yields a better compression ratio. LBTC wins by taking advantage of knowledge of the structure of the trace record, whereas gzip blindly searches for repetition. The most notable property of LBTC in Fig. 2, which is also true for PDATS/PDI and Mache, is that it is complementary to general-purpose compression techniques. As shown in the third column for each benchmark, gzip can further compress the LBTC format by a ratio of 4-6. On average, a trace record takes 0.357 bytes in gzipped LBTC files.

Fig. 3 compares different trace compression techniques. File sizes are normalized to our baseline compression. The baseline, denoted "offset encoding" in the figure, compresses only the memory reference addresses (physical and virtual) by offset encoding; it is essentially the Mache format with all the additional attributes uncompressed. Offset encoding of physical and virtual addresses is also used in PDI and LBTC in our experiment. The PDI format [11] does further compression by encoding the 256 most frequent instruction words in 1 byte each. When the top 256 instructions are obtained from the trace itself, the scheme is denoted "specific." This approach cannot be used online because it requires two passes: the first pass scans the trace to identify the 256 most common instruction words, and the second pass does the encoding. An alternative approach uses a "generic" dictionary that is selected for x86 after examination of a collection of traces. In our experiment, the generic dictionary is obtained by calculating the top 256 instructions over all benchmark programs. This usually yields less compression, but permits online compression. Fig. 3b shows the same results after the files are further compressed by gzip. As shown in Fig. 3a, the generic

728 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 6, JUNE 2004

Fig. 2. Normalized trace file size for gzip and LBTC.

dictionary offers little compression for SPECweb. Even within the SPECint group, some benchmarks are so different that the generic dictionary is of little use. LBTC offers 2.5X better compression than PDI (generic) and is about 2X better after gzip compression is applied.

We can see that the PDI format is not very effective in our environment. The PDI format is most effective when the following three conditions are met:

1. The instruction word constitutes the major part of a trace record besides the address information.

2. The 256 most frequent instruction words account for most dynamic instruction words in the trace.

3. The 256 most frequent instruction words are long, so that replacing them with 1-byte indices provides good savings.

As for condition 1, besides the instruction word and the physical address, we also trace the virtual address, the memory type, and other information. PDI compresses only the instruction word and thus is not effective for the other information. On a cache hit, LBTC retrieves all information from the cache. Therefore, storing extra information is not a problem as long as this information does not change frequently from one execution of the same instruction to the next.

Condition 2 is met for most SPECint programs. As shown in Table 4, the top 256 instructions account for an average of 90 percent of all dynamic instructions. Yet, the instructions used in commercial servers like SPECweb are more diversified: the top 256 instructions cover only 56 percent of the total instructions. Unlike MIPS, for which PDI was first designed, x86's instruction encoding is fairly compact. The average instruction length is about 2.9 bytes, and for the 256 most common instructions the average length is even smaller. Therefore, as for condition 3, replacing the instruction words with 1 byte does not provide as much compression as on MIPS, whose instruction length is 4 bytes.
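The two-pass "specific" PDI scheme described above can be sketched as follows. This is an illustrative Python sketch, not the PDI implementation: the escape-byte layout, the length prefix, and the 0xFF marker are assumptions made for the example, and real x86 instruction words are shown as short byte strings.

```python
from collections import Counter

ESCAPE = 0xFF  # hypothetical escape marker for words outside the dictionary

def build_dictionary(trace, size=255):
    # Pass 1: pick the most frequent instruction words (up to `size`,
    # reserving one code point for the escape marker).
    top = [word for word, _ in Counter(trace).most_common(size)]
    return {word: index for index, word in enumerate(top)}

def pdi_encode(trace, dictionary):
    # Pass 2: emit a 1-byte index for dictionary hits; otherwise emit
    # an escape byte, a length byte, and the raw instruction word.
    out = bytearray()
    for word in trace:
        if word in dictionary:
            out.append(dictionary[word])
        else:
            out.append(ESCAPE)
            out.append(len(word))
            out.extend(word)
    return bytes(out)
```

Because pass 1 needs the whole trace before pass 2 can start, this form cannot run online; the "generic" variant simply precomputes the dictionary from other traces and skips pass 1.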

4.2 Statistics Supporting Compression

Offset encoding takes advantage of spatial locality by encoding the offset between consecutive addresses in fewer bytes. Our method is similar to Mache and PDATS [11] in this aspect, and readers can obtain more details and statistics about offset encoding from the original reference [11]. LBTC also employs temporal locality by caching a small number of previously seen trace records. The cache is used to compress information other than the instruction physical address, including the instruction word, virtual address, etc., which is mostly "static" (it does not change across dynamic references to the same memory location). A subsequent trace record can then probably be retrieved from the cache if the same instruction has been seen not long before. Table 5 shows the hit ratio of the compression cache.
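Offset encoding of this kind can be sketched as follows. The sketch assumes a length-prefix byte before each delta; this layout is illustrative and is not the exact Mache or PDATS byte format.

```python
def encode_offsets(addresses):
    # Encode each address as a signed delta from the previous one, using
    # the smallest of 1, 2, 4, or 8 bytes; a length byte precedes each
    # delta.  (Illustrative layout, not the exact PDATS encoding.)
    out = bytearray()
    prev = 0
    for addr in addresses:
        delta = addr - prev
        for width in (1, 2, 4, 8):
            if -(1 << (8 * width - 1)) <= delta < (1 << (8 * width - 1)):
                break
        out.append(width)
        out.extend(delta.to_bytes(width, 'little', signed=True))
        prev = addr
    return bytes(out)

def decode_offsets(data):
    # Invert encode_offsets, recovering the original address stream.
    addrs, prev, i = [], 0, 0
    while i < len(data):
        width = data[i]
        prev += int.from_bytes(data[i + 1:i + 1 + width], 'little', signed=True)
        addrs.append(prev)
        i += 1 + width
    return addrs
```

Since most consecutive references fall close together, most deltas fit in one or two bytes, which is where the spatial-locality savings come from.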

Even though the compression cache is a small direct-mapped 32K-entry cache, more than 94.5 percent of the instruction records can be retrieved from it, and about half of the data records are found in it. The relatively low data hit rate is due to indexing the cache with the instruction physical address. The other possible locality-based approach is to use a separate data cache indexed with the address of the data reference, which would improve the cache hit ratio for data references to about 90 percent. This would reduce the attributes stored in the trace file, but every data reference record would require an address as the index. In our implementation, if a data record hits in the cache, even the data address is not necessary in the trace file: only one bit indicating a data hit needs to be present in the trace record header. Since the physical address is a significant part of our data trace record, our implementation gives a better overall compression ratio than a separate data cache indexed by the address of the data reference. However, in other environments, if the data hit ratio indexed by instruction address is low and the data trace record contains much more than the physical address, then a separate data cache would be justified.

LUO AND JOHN: LOCALITY-BASED ONLINE TRACE COMPRESSION 729

Fig. 3. Normalized trace file size for different trace compression formats. (a) Without gzip. (b) With gzip.

TABLE 4. Statistics of the 256 Most Common Instructions

TABLE 5. Compression Cache Hit Rate
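The compression-cache mechanism can be sketched as follows. Only the direct-mapped lookup keyed by the instruction physical address follows the paper; the record contents, the hit/miss return convention, and the miss-path encoding are placeholders for illustration.

```python
CACHE_BITS = 15  # 32K entries, matching the paper's direct-mapped cache

class CompressionCache:
    """Direct-mapped cache of previously seen trace records, indexed by
    the instruction's physical address (a sketch of LBTC's use of
    temporal locality)."""

    def __init__(self):
        self.entries = [None] * (1 << CACHE_BITS)

    def compress(self, pc, attrs):
        # On a hit, the attributes are recoverable from the decompressor's
        # identical cache, so only a header bit is needed; on a miss, the
        # full record is emitted and the entry is installed.
        index = pc & ((1 << CACHE_BITS) - 1)
        if self.entries[index] == (pc, attrs):
            return True, b''                    # hit: one header bit suffices
        self.entries[index] = (pc, attrs)
        return False, repr(attrs).encode()      # miss: emit attrs in full
```

The decompressor maintains an identical cache and replays the same updates, so a hit bit in the record header is enough to reconstruct all of the cached attributes.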

4.3 Access Time

Trace access (decompression) time is also an important metric for evaluating a trace compression method. In this section, we compare the access times of the different compression formats as well as the execution-driven tracing time. Trace access time is heavily affected by disk access time. In our experiment, we use a fast disk configuration. The access times were measured by reading each trace file from a SCSI RAID (level 0, two disks) attached to an 8-processor Dell PowerEdge 8450 server running Red Hat Linux 7.3. The real (elapsed) time spent reading and converting each trace was measured using the Unix time utility.

Again, the access times to the compressed trace files (not gzipped) are normalized to the "offset encoding" format and shown in Table 6. All formats perform similarly in terms of access time, which indicates that the compression cache adds very little time overhead; LBTC is as fast as the PDI format. If the disk is slow, LBTC is expected to be relatively faster because its better compression ratio means less time is spent reading from the disk than with the other compression methods. We also measured the access time to trace files further compressed by gzip. The files were gunzipped and piped into the trace decompressors. On our multiprocessor machine, gunzip and the trace decompressor run in parallel, and the access time stays almost the same as in Table 6.
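Streaming a gzipped trace into a decoder without materializing the whole decompressed file can be sketched as follows; the 16-byte fixed record size is an assumption made for the example, not the LBTC record layout.

```python
import gzip

def read_records(path, record_size=16):
    # Stream fixed-size records out of a gzipped trace file, decompressing
    # incrementally so the full decompressed trace never touches the disk.
    with gzip.open(path, 'rb') as f:
        while True:
            record = f.read(record_size)
            if len(record) < record_size:
                return  # end of trace (any short tail is dropped)
            yield record
```

At the shell, the same effect is obtained with a pipeline such as `gunzip -c trace.gz | decoder` (where `decoder` stands for the trace decompressor); on a multiprocessor, the two processes then run in parallel, as in the measurement above.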

The last column shows the normalized time for generating the trace information in Simics, i.e., the overhead of execution-driven simulation. We use the original trace module shipped with Simics and turn on raw-format tracing, which just dumps the trace records without any processing. We redirect the trace dump to /dev/null, so there are no disk writes. It is interesting to note that decompressing a trace is faster than executing the benchmark in Simics. This is because executing a benchmark in a full-system simulator is a costly process in itself. Moreover, because Simics is a commercial product, to keep the source code secret, the trace module is "hooked" to the simulator core through a callback API, which adds further large overhead.

4.4 Design Space Exploration

Since our algorithm is based on the same principle of locality as hardware caches, any approach that enhances the cache hit ratio should also improve the compression ratio. For comparison, we emulate a 2M-entry direct-mapped cache and an infinite fully associative cache. The compressed file sizes, after gzip is applied, are shown in Table 7, normalized to the 32K-entry case. Table 7 shows that the 32K-entry direct-mapped cache works fairly well: the average reduction in file size for SPECint is only about 10.8 percent when the 2M-entry direct-mapped cache is used. Yet, some applications, notably SPECweb and eon from SPECint2000, can benefit from larger caches. Fortunately, a larger direct-mapped cache is as fast as a small one, so when memory is abundant, a larger cache could be used. Moving further to an infinite fully associative cache gives less than 1 percent additional improvement. Therefore, a fully associative cache, which requires a slow associative search, is not needed.

LBTC could also be combined with PDI compression, but our experiments show that this achieves little further compression. In addition, using PDI online requires creating a generic instruction word dictionary, which is a difficult task. Therefore, we believe that adding PDI to LBTC is unnecessary.

5 CONCLUSION

Processor memory references exhibit spatial and temporal locality. Previous research has employed spatial locality to compress address traces by encoding the offsets between consecutive reference addresses, which works well for traces that contain only addresses. In this paper, we have proposed a new technique, LBTC, that takes advantage of temporal locality by using a small data structure emulating a cache. LBTC can efficiently compress traces that carry more information in addition to addresses. It is shown to improve the compression ratio by about 2X over the PDI format, which uses a dictionary to compress instruction words. The algorithm is simple and fast and can be used online in conjunction with general-purpose compression algorithms.


TABLE 6. Normalized Access Time

TABLE 7. Normalized File Size for Different Compression Cache Configurations (with gzip)

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments. This research was partially supported by the US National Science Foundation under grant number 0113105 and by IBM, Intel, Motorola, and AMD Corporations.

REFERENCES

[1] A. Agarwal and M. Huffman, "Blocking: Exploiting Spatial Locality for Trace Compaction," Proc. 1990 ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, pp. 48-57, 1990.

[2] Apache Software Foundation, "mod_perl Home Page," http://perl.apache.org/, 2003.

[3] J.C. Becker, A. Park, and M. Farrens, "An Analysis of the Information Content of Address Reference Streams," Proc. 24th Ann. Int'l Symp. Microarchitecture, pp. 19-24, 1991.

[4] M. Burtscher and M. Jeeradit, "Compressing Extended Program Traces Using Value Predictors," Proc. 12th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT), pp. 159-169, Sept. 2003.

[5] T. Chilimbi, R. Jones, and B. Zorn, "Designing a Trace Format for Heap Allocation Events," ACM SIGPLAN Notices, vol. 36, no. 1, pp. 35-49, Jan. 2001.

[6] L. DeRose, K. Ekanadham, J.K. Hollingsworth, and S. Sbaraglia, "SIGMA: A Simulator Infrastructure to Guide Memory Analysis," Proc. 2002 ACM/IEEE Conf. Supercomputing, pp. 1-13, 2002.

[7] J. Edler and M.D. Hill, "Dinero IV Trace-Driven Uniprocessor Cache Simulator," http://www.cs.wisc.edu/~markhill/DineroIV/, 2003.

[8] E.N. Elnozahy, "Address Trace Compression through Loop Detection and Reduction," Proc. 1999 ACM SIGMETRICS Int'l Conf. Measurement and Modeling of Computer Systems, pp. 214-215, 1999.

[9] D.W. Hammerstrom and E.S. Davidson, "Information Content of CPU Memory Referencing Behavior," Proc. Fourth Ann. Symp. Computer Architecture, pp. 184-192, 1977.

[10] A. Hamou-Lhadj and T.C. Lethbridge, "Compression Techniques to Simplify the Analysis of Large Execution Traces," Proc. 10th Int'l Workshop Program Comprehension (IWPC '02), pp. 159-168, June 2002.

[11] E. Johnson, J. Ha, and M.B. Zaidi, "Lossless Trace Compression," IEEE Trans. Computers, vol. 50, no. 2, pp. 158-173, Feb. 2001.

[12] J.R. Larus, "Efficient Program Tracing," Computer, vol. 26, no. 5, pp. 52-61, May 1993.

[13] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," Computer, vol. 35, no. 2, pp. 50-58, Feb. 2002.

[14] A. Milenkovic and M. Milenkovic, "Stream-Based Trace Compression," Computer Architecture Letters, vol. 2, Sept. 2003.

[15] A.R. Pleszkun, "Techniques for Compressing Program Address Traces," Proc. 27th Ann. IEEE/ACM Int'l Symp. Microarchitecture, pp. 32-40, 1994.

[16] T.R. Puzak, "Analysis of Cache Replacement Algorithms," PhD dissertation, Univ. of Massachusetts, 1985.

[17] A.D. Samples, "Mache: No-Loss Trace Compaction," Proc. 1989 ACM SIGMETRICS Int'l Conf. Measurement and Modeling of Computer Systems, pp. 89-97, 1989.

[18] C.D. Schieber and E.E. Johnson, "RATCHET: Real-Time Address Trace Compression Hardware for Extended Traces," ACM SIGMETRICS Performance Evaluation Rev., vol. 21, nos. 3-4, pp. 22-32, Apr. 1994.

[19] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), pp. 45-57, 2002.

[20] A.J. Smith, "Two Methods for the Efficient Analysis of Memory Address Trace Data," IEEE Trans. Software Eng., vol. 3, no. 1, 1977.

[21] Standard Performance Evaluation Corp., http://www.spec.org/cpu2000/, 2004.

[22] Standard Performance Evaluation Corp., http://www.spec.org/web99/, 2004.

[23] R. Todi, "SPEClite: Using Representative Samples to Reduce SPEC CPU2000 Workload," Proc. Fourth Ann. IEEE Int'l Workshop Workload Characterization, pp. 15-23, 2001.

[24] T.A. Welch, "A Technique for High-Performance Data Compression," Computer, vol. 17, no. 6, pp. 8-19, June 1984.

[25] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Information Theory, vol. 23, no. 3, pp. 337-343, 1977.

Yue Luo received the MS degree in electronics from Peking University. He is a PhD student in the Electrical and Computer Engineering Department at the University of Texas at Austin. His research interests include microprocessor modeling, computer architecture simulation, and performance analysis. He is a student member of the IEEE.

Lizy Kurian John received the PhD degree in computer engineering from The Pennsylvania State University in 1993. She received the Bachelor's degree in electronics and telecommunication from the University of Kerala, India, and the Master's degree in computer engineering from the University of Texas at El Paso. She is an associate professor in the Electrical and Computer Engineering Department at the University of Texas (UT) at Austin and is a UT Austin Engineering Foundation Centennial Teaching Fellow. Her research interests include high-performance processor and memory architectures, low-power design, reconfigurable architectures, rapid prototyping, Field Programmable Gate Arrays, workload characterization, etc. She has published papers in the IEEE Transactions on Computers, IEEE Transactions on VLSI, ACM/IEEE International Symposium on Computer Architecture (ISCA), IEEE Micro Symposium (MICRO), IEEE High-Performance Computer Architecture Symposium (HPCA), ACM International Symposium on Low Power Electronics and Design (ISLPED), etc., and holds a patent on Field Programmable Memory Cell Arrays. Her research has been supported by the US National Science Foundation (NSF), the State of Texas Advanced Technology Program, the US Defense Advanced Research Projects Agency, and IBM, Intel, Motorola, DELL, AMD, and Microsoft Corporations. She is a recipient of an NSF CAREER award, a Junior Faculty Enhancement Award from Oak Ridge Associated Universities, an IBM Austin Center for Advanced Studies (CAS) Fellowship, the Texas Exes Teaching Award (2004), the UT Austin Engineering Foundation Faculty Award (2001), the Halliburton, Brown and Root Engineering Foundation Young Faculty Award (1999), etc. She is a senior member of the IEEE and a member of the IEEE Computer Society, the ACM, and ACM SIGARCH. She is also a member of Eta Kappa Nu, Tau Beta Pi, and Phi Kappa Phi.



