Memory Subsystem Performance of Programs Using Copying Garbage Collection
Authors:
Amer Diwan
David Tarditi
Eliot Moss
Presented by: Ronen Shabo
Introduction
Heap allocation with copying garbage collection is believed to have poor memory subsystem performance.
However, with the appropriate memory subsystem organization, heap allocation can have good memory subsystem performance.
Agenda
Background.
Memory subsystem
Cache
Write buffer
Page mode
CPI
Copying garbage collection
SML
Related work
Methodology
Results and Analysis
Conclusions
Cache
CPUs keep getting faster relative to DRAM memory chips.
A solution to this problem is to add a small, fast memory called a cache.
A cache works by reducing the average memory access time.
This is possible because memory accesses exhibit temporal and spatial locality.
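The reduction in average access time can be written with the standard textbook relation below. The symbols are assumptions introduced here for illustration and do not appear on the slides: T_hit is the cache access time, m is the miss ratio, and T_miss is the extra penalty paid on a miss.

```latex
T_{\text{avg}} = T_{\text{hit}} + m \cdot T_{\text{miss}}
```

Locality keeps m small, so T_avg stays close to T_hit even when T_miss is large.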
Cache
[Figure: cache organization. Labels: set, associativity, block, subblock, tag, valid bit (V).]
Cache Hit Policies
On read hit
Read the word from cache.
Write through:
Write the word to cache and memory.
Write back:
Write the word to cache.
Mark the block as dirty.
When a block is evicted from the cache, write it to memory if it is dirty.
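The two write-hit policies can be sketched as follows. This is a minimal illustration, not the simulator used in the study; the names Block, write_hit and evict, and the use of Python dictionaries for the block contents and main memory, are assumptions made here.

```python
class Block:
    """One cache block (hypothetical model): word offsets map to values."""
    def __init__(self):
        self.words = {}
        self.dirty = False


def write_hit(block, offset, value, memory, addr, policy):
    """Apply a write that hits in the cache under the given policy."""
    block.words[offset] = value          # both policies update the cache
    if policy == "write-through":
        memory[addr] = value             # memory is updated immediately
    elif policy == "write-back":
        block.dirty = True               # memory is updated only on eviction
    else:
        raise ValueError(policy)


def evict(block, memory, base_addr):
    """On eviction, a dirty block is written to memory."""
    if block.dirty:
        for offset, value in block.words.items():
            memory[base_addr + offset] = value
        block.dirty = False
```

With write-through, memory is always current, so eviction never has to write anything; with write-back, the memory traffic is deferred until the block leaves the cache.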
Cache miss policies
On a read miss, the block is copied from main memory.
Write no allocate:
Do not allocate block in the cache.
Send the write to main memory, without putting the write in the cache.
Write allocate, no subblock placement:
Allocate a block in the cache.
Fetch the corresponding memory block from main memory.
Write the word to cache and to memory.
Write allocate, subblock placement:
Allocate a block in the cache.
Write the word to the cache and to memory.
Invalidate the remaining words in the block.
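The three write-miss policies above can be compared with a small sketch. It is illustrative only: the Line class, the policy strings, and the assumption of 4 words per block (a 16-byte block of 4-byte words) are introduced here, and each subblock is taken to be one word. The function returns how many words had to be fetched from main memory, which is the cost that subblock placement avoids.

```python
WORDS_PER_BLOCK = 4  # assumption: 16-byte block, 4-byte words


class Line:
    """One cache line with a per-word (subblock) valid bit. Hypothetical model."""
    def __init__(self):
        self.tag = None
        self.valid = [False] * WORDS_PER_BLOCK
        self.data = [0] * WORDS_PER_BLOCK


def write_miss(line, tag, offset, value, memory, policy):
    """Handle a write miss; return the number of words fetched from memory."""
    base = tag * WORDS_PER_BLOCK
    fetched = 0
    if policy == "allocate":
        # Write allocate, no subblock placement: fetch the whole block first.
        line.tag = tag
        for i in range(WORDS_PER_BLOCK):
            line.data[i] = memory.get(base + i, 0)
            line.valid[i] = True
        fetched = WORDS_PER_BLOCK
        line.data[offset] = value
    elif policy == "allocate-subblock":
        # Write allocate, subblock placement: no fetch, only the written word is valid.
        line.tag = tag
        line.valid = [False] * WORDS_PER_BLOCK
        line.data[offset] = value
        line.valid[offset] = True
    elif policy != "no-allocate":
        raise ValueError(policy)
    memory[base + offset] = value  # the write goes to memory in every policy
    return fetched
```

For a program that mostly writes to freshly allocated memory, the fetch in the "allocate" case brings in data that is about to be overwritten, which is why the fetch count is the quantity of interest.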
Memory Subsystem
Write buffer:
A queue of writes that are waiting to be sent to main memory.
Page mode:
Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to the same DRAM page.
CPI (Cycles Per useful Instruction):
The number of CPU cycles needed to complete a program divided by the total number of useful instructions.
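The CPI definition can be written as a one-line computation. The function name and parameters are hypothetical; the sketch assumes that the useful instructions are the executed instructions minus the nops.

```python
def cpi(total_cycles, instructions_executed, nops=0):
    """Cycles per useful instruction, assuming nops are the only non-useful instructions."""
    useful = instructions_executed - nops
    if useful <= 0:
        raise ValueError("no useful instructions")
    return total_cycles / useful
```

For example, a run of 1,500 cycles that executes 1,050 instructions, 50 of them nops, has a CPI of 1500 / 1000 = 1.5.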
Copying Garbage Collection
Two memory areas: FROMSPACE and TOSPACE.
Memory allocation is done from FROMSPACE.
When FROMSPACE is full, all the live objects are moved from FROMSPACE to TOSPACE.
The two areas then exchange names.
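The copy step can be sketched as below. This is a simplified illustration that uses a breadth-first scan in the style of Cheney's algorithm, which the slides do not name; the Ref class, the representation of a space as a Python list of objects, and the function name collect are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A pointer: the index of the target object within a space (hypothetical)."""
    index: int


def collect(roots, from_space):
    """Copy every object reachable from roots into a new to-space.

    Each object is a list of fields; a field is either a Ref or a plain value.
    Returns the updated roots and the to-space.
    """
    to_space = []
    forward = {}  # from-space index -> to-space index of the copy

    def copy(ref):
        if ref.index not in forward:
            forward[ref.index] = len(to_space)
            # Shallow copy: its fields still point into from-space for now.
            to_space.append(list(from_space[ref.index]))
        return Ref(forward[ref.index])

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):  # to_space grows as objects are copied
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = copy(field)  # redirect the pointer to the copy
        scan += 1
    return new_roots, to_space
```

After collect returns, the caller treats the returned to-space as the new FROMSPACE, which corresponds to exchanging the names. Objects that are not reachable from the roots are never copied, so the cost of a collection depends on the amount of live data only.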
Generational Copying GC
Split objects into multiple areas by age.
Scan the areas holding older objects less frequently.
Copy long-surviving objects to the older generation areas.
SML
Standard ML
Call by value
Safe
Polymorphic
Functional
Garbage collection
SML/NJ compiler
Makes allocation cheap and function calls fast.
Allocation is done in-line.
Aggressive β-reduction (in-lining) of function calls is used.
Registers are used extensively.
Procedure activation records are allocated on the heap instead of the stack.
Related work
Advantages of this work:
This work distinguishes between read misses and write misses and their penalties.
Previous work used overall miss ratios.
This work models the entire memory subsystem, including the write buffer and DRAM page mode.
Previous work did not model the entire memory subsystem.
A study of the effect of cache write policies on the performance of C and Fortran programs supports our conclusion that write allocate with subblock placement is the preferred architecture.
Methodology
Tools:
QPT - Used to produce memory traces for SML/NJ programs.
Tycho - Used for the memory subsystem simulation.
Performance:
Performance numbers are in CPI. All instructions other than nops are considered useful.
Benchmarks:
The benchmarks are the eight programs listed in the following table:
Program   Description
CW        The Concurrency Workbench, a tool for analyzing networks of finite-state processes expressed in Milner's Calculus of Communicating Systems.
Leroy     An implementation of the Knuth-Bendix completion algorithm.
Lexgen    A lexical-analyzer generator, processing the lexical description of Standard ML.
Life      The game of Life implemented using lists.
PIA       The Perspective Inversion Algorithm, which decides the location of an object in a perspective video image.
Simple    A spherical fluid-dynamics program.
VLIW      A Very-Long-Instruction-Word instruction scheduler.
YACC      An implementation of an LALR(1) parser generator, processing the grammar of Standard ML.
Cont.
Program   Inst Fetches   Allocations (words)
CW        523,245,978    56,467,440
Leroy     312,086,438    67,733,930
Lexgen    328,422,283    33,046,349
Life      413,536,662    37,849,681
PIA       122,215,151    13,047,041
Simple    604,611,016    67,261,664
VLIW      399,812,033    59,496,919
YACC      133,043,324    17,015,250
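The per-program allocation rate can be derived from the table. The sketch below copies the two columns as given and divides them; the names are hypothetical, the word size of 4 bytes is an assumption, and instruction fetches are used as a stand-in for the instruction count.

```python
# (instruction fetches, words allocated), copied from the table above
BENCHMARKS = {
    "CW": (523_245_978, 56_467_440),
    "Leroy": (312_086_438, 67_733_930),
    "Lexgen": (328_422_283, 33_046_349),
    "Life": (413_536_662, 37_849_681),
    "PIA": (122_215_151, 13_047_041),
    "Simple": (604_611_016, 67_261_664),
    "VLIW": (399_812_033, 59_496_919),
    "YACC": (133_043_324, 17_015_250),
}

WORD_BYTES = 4  # assumption: 32-bit words


def allocation_rates(table, word_bytes=WORD_BYTES):
    """Return {program: (words per instruction, bytes per instruction)}."""
    return {
        name: (words / fetches, words * word_bytes / fetches)
        for name, (fetches, words) in table.items()
    }


if __name__ == "__main__":
    for name, (wpi, bpi) in allocation_rates(BENCHMARKS).items():
        print(f"{name:8s} {wpi:.3f} words/inst  {bpi:.3f} bytes/inst")
```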
Memory Subsystem Simulation
The memory features and penalties used in this study are restricted to those of currently popular RISC workstations.
All simulations use:
Write buffer (depth 6)
Page mode
Separate data and instruction caches
Write-through policy
The simulations vary:
Cache size: 8K-128K
Direct-mapped and two-way set-associative caches (with LRU replacement)
Block size: 16 and 32 bytes
Write allocate versus write no allocate
Subblock placement versus no subblock placement
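The space of simulated configurations can be enumerated as a cross product of the parameters above. This is only a sketch: the slides give the cache size as a range, so the power-of-two steps are an assumption, and it is not stated whether every combination was actually simulated. All names are hypothetical.

```python
from itertools import product

CACHE_SIZES_KB = [8, 16, 32, 64, 128]  # assumption: power-of-two steps within 8K-128K
ASSOCIATIVITY = [1, 2]                 # direct-mapped, two-way set-associative
BLOCK_BYTES = [16, 32]
WRITE_MISS = ["write-allocate", "write-no-allocate"]
SUBBLOCK = [True, False]

# Held constant in every simulation, as listed above.
FIXED = {"write_buffer_depth": 6, "page_mode": True, "write_hit": "write-through"}


def configurations():
    """Yield one dictionary per point of the full cross product."""
    for size, ways, block, miss, sub in product(
        CACHE_SIZES_KB, ASSOCIATIVITY, BLOCK_BYTES, WRITE_MISS, SUBBLOCK
    ):
        yield {
            "cache_kb": size,
            "ways": ways,
            "block_bytes": block,
            "write_miss": miss,
            "subblock": sub,
            **FIXED,
        }
```

Under these assumptions the full cross product has 5 x 2 x 2 x 2 x 2 = 80 points per benchmark.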
Results and Analysis
Analysis of SML/NJ programs:
Programs do heap allocation at a rate of 0.2-0.4 words per instruction.
The majority of writes are initialization writes.
Writes come in bunches, as they initialize newly allocated areas.
An aggressive write policy is necessary:
To avoid waiting for writes to memory, use a write buffer and fast page mode.
To avoid reading the cache block on a write miss, use a write-allocate cache policy with subblock placement.
Results
[Figure: CW summary graphs]
[Figure: CW results for write allocate with subblock placement, block size 16]
Conclusions
Write miss policy and subblock placement:
It is clear from this study that the best cache organization is write allocate with subblock placement. (Surprisingly, for direct-mapped caches larger than 64K, the memory subsystem overhead of SML/NJ programs is acceptable: less than 16%.)
The performance of write allocate without subblock placement is almost identical to that of write no allocate without subblock placement. (An address is read soon after being written, even for an 8K cache. Since our programs allocate 0.4-0.9 bytes per instruction, the read of a block occurs within 9K-20K instructions of the write.)
Associativity:
Increasing the associativity improves the CPI. (This improvement is smaller than the one obtained from subblock placement.)
Higher associativity improves instruction cache performance but has little impact on the data cache. (Much of the instruction cache penalty is due to conflict misses, while the data cache penalty is due to capacity misses.)
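One reading of the 9K-20K figure is that it is the number of instructions it takes for allocation to sweep through an 8K cache at the stated allocation rates. The arithmetic can be checked with the sketch below; the function name is hypothetical and the interpretation is an assumption, not something the slides state explicitly.

```python
def instructions_to_fill(cache_bytes, alloc_bytes_per_instruction):
    """Instructions executed while allocation writes cache_bytes of new data."""
    return cache_bytes / alloc_bytes_per_instruction


if __name__ == "__main__":
    cache = 8 * 1024
    print(round(instructions_to_fill(cache, 0.9)))  # fastest allocation rate
    print(round(instructions_to_fill(cache, 0.4)))  # slowest allocation rate
```

At 0.9 bytes per instruction the 8K cache is swept in roughly 9,100 instructions, and at 0.4 bytes per instruction in roughly 20,500, which matches the 9K-20K range.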
Conclusions
Block size:
Increasing the block size from 16 to 32 bytes improves performance.
Cache size:
Increasing the cache size improves performance.
Most of the CPI improvement comes from the instruction cache. (From related work, we expect to see a sharp improvement once the cache can fit the allocation area; 512K is large enough for most benchmarks.)
Write buffer:
A six-deep write buffer with page mode is sufficient to absorb the bursty writes. (Their contribution to CPI is negligible.)
Summary
An in-depth study of the memory subsystem was made, and the results show that:
Programs with intensive heap allocation performed poorly on most memory subsystems.
However, on some machines (DECstation 5000/200) the performance was good.
The most crucial parameter for good performance was subblock placement; with it, the overhead was under 16% for caches larger than 64K.
Associativity and cache size (up to 128K) were more important for the instruction cache.
Higher associativity and larger block size made only a small contribution.