Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors

Lakshmana R Vittanala (Intel)        Mainak Chaudhuri (IIT Kanpur)

Memory Compression and Decompression
Talk in Two Slides (1/2)
Memory footprint of data-intensive workloads is ever-increasing
– We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
Dirty blocks evicted from the last level of the cache are sent to the home node
– Compress in the home memory controller
A last-level cache miss request from a node is sent to the home node
– Decompress in the home memory controller
Talk in Two Slides (2/2)
No modification to the processor
– Cache hierarchy sees decompressed blocks
All changes are confined to the directory-based cache coherence protocol
– Leverage spare core(s) to execute compression-enabled protocols in software
– Extend the directory structure for compression book-keeping
Use a hybrid of two compression algorithms
– On 16 nodes, for seven scientific computing workloads: 73% storage saving on average with at most 15% increase in execution time
Contributions
Two major contributions
– First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
– First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
  Makes the solution attractive in many-core systems
Sketch
Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Programmable Protocol Core
Past studies have considered off-die programmable protocol processors
– These offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
– Recent studies show almost no performance loss [IEEE TPDS, Aug'07]
Programmable Protocol Core
In our simulated system, each node contains
– One complex out-of-order issue core which runs the application thread
– One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
– An on-die integrated memory controller, network interface, and router
Compression/decompression algorithms are integrated into the directory protocol software
Programmable Protocol Core
[Figure: node block diagram — an OOO core (IL1/DL1, L2) and an in-order protocol core/protocol processor (IL1/DL1) share an on-die memory controller attached to SDRAM, with a router connecting the node to the network]
Anatomy of a Protocol Handler
On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
– Calculates the directory address if home node (simple hash function on the transaction address)
– Reads the 64-bit directory entry if home node
– Carries out simple integer arithmetic operations to figure out coherence actions
– May send messages to remote nodes
– May initiate transactions to the local OOO core
Baseline Directory Protocol
Invalidation-based three-state (MSI) bitvector protocol
– Derived from the SGI Origin MESI protocol and improved to handle early and late intervention races better
[Figure: 64-bit directory entry (64-bit datapath) — 4-bit state field (states: L, M, and two busy states), 16-bit sharer vector, 44 unused bits]
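One concrete reading of this entry as C helpers follows; the bit positions (state in the top 4 bits, sharer vector in the low 16 bits) are an assumption for illustration, since the slide fixes only the field widths.

```c
#include <stdint.h>

/* Illustrative view of the 64-bit directory entry: a 4-bit state field,
 * a 16-bit sharer bitvector, and 44 unused bits. Bit positions are
 * assumed (state in bits 63:60, sharers in bits 15:0). */
static unsigned dir_state(uint64_t e)   { return (unsigned)(e >> 60); }
static unsigned dir_sharers(uint64_t e) { return (unsigned)(e & 0xffffu); }

static uint64_t dir_add_sharer(uint64_t e, int node)
{
    return e | (1ULL << node);   /* one bit per node, 16 nodes maximum */
}

static uint64_t dir_set_state(uint64_t e, unsigned s)
{
    return (e & ~(0xfULL << 60)) | ((uint64_t)(s & 0xfu) << 60);
}
```

The protocol handler's "simple integer arithmetic" on the entry amounts to operations of exactly this kind: mask, shift, and OR on a single 64-bit word.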
Directory Protocol Extensions
Compression support
– All handlers that update memory blocks need extension with the compression algorithm
– Two major categories: writeback handlers and GET intervention response handlers
  The latter involve a state demotion from M to S and hence require an update of the memory block at home
  GETX interventions do not require a memory update as they involve an ownership hand-off only
Decompression support
– All handlers that access memory in response to last-level cache miss requests
Directory Protocol Extensions
Compression support (writeback cases)
[Figure: remote writeback — SP sends WB through its protocol core SPP to the home protocol core HPP; HPP compresses the block into DRAM and returns WB_ACK]
Directory Protocol Extensions
Compression support (writeback cases)
[Figure: local writeback — HP sends WB to HPP, which compresses the block into DRAM]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: requester RP sends GET via RPP to HPP; HPP forwards GET to the dirty node DP; DP replies with PUT to the requester and a sharing writeback (SWB) to HPP, which compresses the block into DRAM]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: requester RP sends GET via RPP to HPP; HPP forwards GET to the home's own core HP, which replies with PUT; HPP compresses the block into DRAM and delivers PUT (uncompressed data) to the requester]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: the home core HP sends GET to HPP, which forwards it to the dirty node DP; DP replies with PUT; HPP compresses the block into DRAM and returns PUT (uncompressed data) to HP]
Directory Protocol Extensions
Decompression support
[Figure: remote miss — RP sends GET/GETX via RPP to HPP; HPP decompresses the block from DRAM and returns PUT/PUTX]
Directory Protocol Extensions
Decompression support
[Figure: local miss — HP sends GET/GETX to HPP, which decompresses the block from DRAM and returns PUT/PUTX]
Compression Algorithms
Consider each 64-bit chunk of a 128-byte cache block at a time

Algorithm I
  Original                   Compressed    Encoding
  All zero                   Zero bytes    00
  MS 4 bytes zero            LS 4 bytes    01
  MS 4 bytes = LS 4 bytes    LS 4 bytes    10
  None                       64 bits       11

Algorithm II
  Differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes
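The two per-double-word encoders can be sketched in C as follows. The function names and the payload-size convention are illustrative, not from the paper; encoding 00 is assumed to store nothing.

```c
#include <stdint.h>

/* Algorithm I: classify one 64-bit double-word into a 2-bit encoding
 * and report how many payload bytes the compressed block stores. */
static int algo1_encode(uint64_t dw, int *payload_bytes)
{
    uint32_t ms = (uint32_t)(dw >> 32);
    uint32_t ls = (uint32_t)dw;
    if (dw == 0)  { *payload_bytes = 0; return 0; } /* 00: all zero      */
    if (ms == 0)  { *payload_bytes = 4; return 1; } /* 01: store LS 4 B  */
    if (ms == ls) { *payload_bytes = 4; return 2; } /* 10: store LS 4 B  */
    *payload_bytes = 8; return 3;                   /* 11: store 64 bits */
}

/* Algorithm II: identical except encoding 10 means the LS 4 bytes are
 * zero and the MS 4 bytes are stored. */
static int algo2_encode(uint64_t dw, int *payload_bytes)
{
    uint32_t ms = (uint32_t)(dw >> 32);
    uint32_t ls = (uint32_t)dw;
    if (dw == 0)  { *payload_bytes = 0; return 0; }
    if (ms == 0)  { *payload_bytes = 4; return 1; }
    if (ls == 0)  { *payload_bytes = 4; return 2; } /* 10: store MS 4 B  */
    *payload_bytes = 8; return 3;
}
```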
Compression Algorithms
Ideally, we would compute the compressed size by both algorithms for each of the 16 double-words in a cache block and pick the best
– Overhead is too high
Trade-off #1
– Speculate based on the first 64 bits
– If MS 32 bits ^ LS 32 bits = 0, use Algorithm I (covers two cases of Algorithm I)
– If MS 32 bits & LS 32 bits = 0, use Algorithm II (covers three cases of Algorithm II)
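The speculation test above is two single-cycle integer operations on the block's first double-word; a minimal sketch (the default when neither test fires is an assumption, since the slide does not say):

```c
#include <stdint.h>

/* Pick a compression algorithm by speculating on the first 64 bits.
 * Returns 1 for Algorithm I, 2 for Algorithm II. */
static int pick_algorithm(uint64_t first_dw)
{
    uint32_t ms = (uint32_t)(first_dw >> 32);
    uint32_t ls = (uint32_t)first_dw;
    if ((ms ^ ls) == 0) return 1; /* halves equal: covers cases 00 and 10 of Algorithm I  */
    if ((ms & ls) == 0) return 2; /* no overlapping ones: covers 00, 01, 10 of Algorithm II */
    return 1;                     /* no pattern detected; defaulting to I is an assumption */
}
```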
Compression Algorithms
Trade-off #2
– If the compression ratio is low, it is better to avoid decompression overhead
  Decompression is fully on the critical path
– After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (best: 48 bytes)
– Abort compression and store the entire block uncompressed as soon as the threshold is crossed
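The abort check can be folded into the per-double-word loop, as sketched below with Algorithm I inlined. This is a sketch under stated assumptions: the header packs 2-bit codes low-bits-first, code 00 stores nothing, and the exact abort condition (strictly exceeding maxCsz) is our reading of "as soon as the threshold is crossed".

```c
#include <stdint.h>
#include <string.h>

#define MAX_CSZ 48  /* best threshold reported in the talk, in bytes */

/* Compress a 128-byte block (16 double-words) with Algorithm I,
 * tracking the running compressed size; abort and signal "store
 * uncompressed" (-1) once the size would cross MAX_CSZ. */
static int compress_block(const uint64_t dw[16], uint8_t out[128],
                          uint32_t *header)
{
    int csz = 0;
    *header = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t ms = (uint32_t)(dw[i] >> 32), ls = (uint32_t)dw[i];
        int code, bytes;
        if (dw[i] == 0)    { code = 0; bytes = 0; } /* 00: nothing stored */
        else if (ms == 0)  { code = 1; bytes = 4; } /* 01: LS 4 bytes     */
        else if (ms == ls) { code = 2; bytes = 4; } /* 10: LS 4 bytes     */
        else               { code = 3; bytes = 8; } /* 11: raw 64 bits    */
        if (csz + bytes > MAX_CSZ)
            return -1;                     /* threshold crossed: abort  */
        if (bytes == 4)
            memcpy(out + csz, &ls, 4);
        else if (bytes == 8)
            memcpy(out + csz, &dw[i], 8);
        *header |= (uint32_t)code << (2 * i); /* 2-bit code per dword   */
        csz += bytes;
    }
    return csz;
}
```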
Compression Algorithms
Meta-data
– Required for decompression
– Most meta-data are stored in the unused 44 bits of the directory entry
– The cache controller generates the uncompressed block address, so directory address computation remains unchanged
– 32 bits to locate the compressed block
  The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to have a cushion for future use
  32 bits allow us to address 32 GB of compressed memory
Compression Algorithms
Meta-data
– Two bits to identify the compression algorithm
  Algorithm I, Algorithm II, uncompressed, all zero
  All-zero blocks do not store anything in memory
– For each 64 bits, need to know one of four encodings
  Maintained in a 32-bit header (two bits for each of the 16 double-words)
– Optimization to speed up relocation: store the size of the compressed block in the directory entry
  Requires four bits (16 double-words maximum)
– 70 bits of meta-data per compressed block
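The 70 bits of meta-data can be pictured as a struct; the field names are hypothetical, and only the widths come from the slides. The 32-bit location, 2-bit algorithm state, and 4-bit size fit in the directory entry's 44 unused bits, while the 32-bit per-double-word header is shown in the same struct purely for illustration.

```c
#include <stdint.h>

/* Illustrative packing of the per-block compression meta-data. */
struct comp_meta {
    uint32_t loc;         /* compressed-block address in 8-byte units (32 GB reach) */
    uint32_t header;      /* 2-bit encoding for each of the 16 double-words */
    unsigned algo  : 2;   /* Algorithm I / Algorithm II / uncompressed / all zero */
    unsigned csize : 4;   /* compressed size in double-words (relocation speed-up) */
};

/* The address and size arithmetic implied by the 8-byte granularity: */
static uint64_t comp_addr(const struct comp_meta *m)  { return (uint64_t)m->loc << 3; }
static unsigned comp_bytes(const struct comp_meta *m) { return m->csize * 8u; }
```

With the values from the decompression example on the next slide (loc = 0x4fd1276a, csize = 0101b = 5), comp_addr gives 0x4fd1276a << 3 and comp_bytes gives 40 bytes.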
Decompression Example
Directory entry information
– 32-bit address: 0x4fd1276a
  Actual address = 0x4fd1276a << 3
– Compression state: 01
  Algorithm II was used
– Compressed size: 0101
  Actual size = 40 bytes (not used in decompression)
Header information
– 32-bit header: 00 11 10 00 00 01 …
  Upper 64 bits used encoding 00 of Algorithm II
  Next 64 bits used encoding 11 of Algorithm II
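The header walk implied by this example can be sketched as follows; reading the low-order header bits as describing the first double-word is an assumption (the slide does not fix the bit order), and code 10 is given Algorithm II's meaning.

```c
#include <stdint.h>
#include <string.h>

/* Expand a compressed block back to 16 double-words by consuming the
 * 32-bit header two bits at a time (Algorithm II interpretation). */
static void decompress_block(const uint8_t *comp, uint32_t header,
                             uint64_t out[16])
{
    int off = 0;  /* byte offset into the compressed payload */
    for (int i = 0; i < 16; i++) {
        uint32_t half;
        switch ((header >> (2 * i)) & 3u) {
        case 0:                                          /* all zero    */
            out[i] = 0;
            break;
        case 1:                                          /* MS 4 B zero */
            memcpy(&half, comp + off, 4); off += 4;
            out[i] = half;
            break;
        case 2:                                          /* LS 4 B zero */
            memcpy(&half, comp + off, 4); off += 4;
            out[i] = (uint64_t)half << 32;
            break;
        default:                                         /* raw 64 bits */
            memcpy(&out[i], comp + off, 8); off += 8;
            break;
        }
    }
}
```

Every double-word is reconstructed in order, which is why decompression sits fully on the critical path of a miss.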
Performance Optimization
Protocol thread occupancy is critical
– Two protocol cores
– Out-of-order NI scheduling to improve protocol core utilization
– Cached message buffer (filled with the writeback payload)
  16 uncached loads/stores to the message buffer are needed during compression if it is not cached
  Caching requires invalidating the buffer contents at the end of compression (coherence issue)
  Flushing dirty contents occupies the datapath, so we allow only cached loads
– Compression ratio remains unaffected
Storage Saving
[Figure: percentage storage saving (0%–80% scale) for Barnes, FFT, FFTW, LU, Ocean, Radix, Water; visible bar labels include 16%, 21%, 66%, and 73%]
Slowdown
[Figure: slowdown (1.00–1.60 scale) for Barnes, FFT, FFTW, LU, Ocean, Radix, Water under configurations 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, 2PP+OOO NI+CL; visible labels: 2%, 5%, 7%, 1%, 11%, 15%, 8%]
Memory Stall Cycles
[Figure: memory stall cycle comparison (chart content not recoverable)]
Protocol Core Occupancy
Dynamic instruction count and handler occupancy

            w/o compression       w/ compression
  Barnes    29.1 M (7.5 ns)       215.5 M (31.9 ns)
  FFT       82.7 M (6.7 ns)       185.6 M (16.7 ns)
  FFTW      177.8 M (10.5 ns)     417.6 M (22.7 ns)
  LU        11.4 M (6.3 ns)       29.2 M (14.8 ns)
  Ocean     376.6 M (6.7 ns)      1553.5 M (24.1 ns)
  Radix     24.7 M (8.1 ns)       87.0 M (36.9 ns)
  Water     62.4 M (5.5 ns)       137.3 M (8.8 ns)

Occupancy is still hidden under the fastest memory access (40 ns)
Related Work
Dictionary-based
– IBM MXT
– X-Match
– X-RL
– Not well suited to cache-block grain
Frequent pattern-based
– Applied to on-chip cache blocks
Zero-aware compression
– Applied to memory blocks
See the paper for more details
Summary
Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
The compression-enabled handlers run on simple core(s) of a multi-core node
The protocol core occupancy increases significantly, but can still be hidden under memory access latency
On seven scientific computing workloads, our best design saves 16% to 73% of memory while slowing down execution by at most 15%
THANK YOU!