Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors

Lakshmana R Vittanala (Intel)        Mainak Chaudhuri (IIT Kanpur)

Memory Compression and Decompression
Talk in Two Slides (1/2)
Memory footprint of data-intensive workloads is ever-increasing
– We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
Dirty blocks evicted from the last level of the cache are sent to the home node
– Compress in the home memory controller
A last-level cache miss request from a node is sent to the home node
– Decompress in the home memory controller
Talk in Two Slides (2/2)
No modification to the processor
– Cache hierarchy sees decompressed blocks
All changes are confined to the directory-based cache coherence protocol
– Leverage spare core(s) to execute compression-enabled protocols in software
– Extend the directory structure for compression book-keeping
Use a hybrid of two compression algorithms
– On 16 nodes, for seven scientific computing workloads: 73% storage saving on average with at most 15% increase in execution time
Contributions
Two major contributions
– First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
– First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
  Makes the solution attractive in many-core systems
Sketch
Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Programmable Protocol Core
Past studies have considered off-die programmable protocol processors
– These offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
– Recent studies show almost no performance loss [IEEE TPDS, Aug'07]
Programmable Protocol Core
In our simulated system, each node contains
– One complex out-of-order issue core which runs the application thread
– One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
– An on-die integrated memory controller, network interface, and router
Compression/decompression algorithms are integrated into the directory protocol software
Programmable Protocol Core
[Figure: node block diagram — an OOO core (IL1/DL1, L2) and an in-order protocol core/protocol processor (IL1/DL1) share an on-die memory controller attached to SDRAM, with a router connecting the node to the network]
Anatomy of a Protocol Handler
On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
– Calculates the directory address if home node (simple hash function on the transaction address)
– Reads the 64-bit directory entry if home node
– Carries out simple integer arithmetic operations to figure out coherence actions
– May send messages to remote nodes
– May initiate transactions to the local OOO core
Baseline Directory Protocol
Invalidation-based three-state (MSI) bitvector protocol
– Derived from the SGI Origin MESI protocol and improved to handle early and late intervention races better
[Figure: 64-bit directory entry (64-bit datapath) — 4-bit state field (states: L, M, and two busy states), 16-bit sharer vector, 44 unused bits]
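One concrete reading of this entry as C helpers follows; the bit positions (state in the top 4 bits, sharer vector in the low 16 bits) are an assumption for illustration, since the slide fixes only the field widths.

```c
#include <stdint.h>

/* Illustrative view of the 64-bit directory entry: a 4-bit state field,
 * a 16-bit sharer bitvector, and 44 unused bits. Bit positions are
 * assumed (state in bits 63:60, sharers in bits 15:0). */
static unsigned dir_state(uint64_t e)   { return (unsigned)(e >> 60); }
static unsigned dir_sharers(uint64_t e) { return (unsigned)(e & 0xffffu); }

static uint64_t dir_add_sharer(uint64_t e, int node)
{
    return e | (1ULL << node);   /* one bit per node, 16 nodes maximum */
}

static uint64_t dir_set_state(uint64_t e, unsigned s)
{
    return (e & ~(0xfULL << 60)) | ((uint64_t)(s & 0xfu) << 60);
}
```

The protocol handler's "simple integer arithmetic" on the entry amounts to operations of exactly this kind: mask, shift, and OR on a single 64-bit word.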
Directory Protocol Extensions
Compression support
– All handlers that update memory blocks need extension with the compression algorithm
– Two major categories: writeback handlers and GET intervention response handlers
  The latter involve a state demotion from M to S and hence require an update of the memory block at home
  GETX interventions do not require a memory update as they involve an ownership hand-off only
Decompression support
– All handlers that access memory in response to last-level cache miss requests
Directory Protocol Extensions
Compression support (writeback cases)
[Figure: remote writeback — SP sends WB through its protocol core SPP to the home protocol core HPP; HPP compresses the block into DRAM and returns WB_ACK]
Directory Protocol Extensions
Compression support (writeback cases)
[Figure: local writeback — HP sends WB to HPP, which compresses the block into DRAM]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: requester RP sends GET via RPP to HPP; HPP forwards GET to the dirty node DP; DP replies with PUT to the requester and a sharing writeback (SWB) to HPP, which compresses the block into DRAM]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: requester RP sends GET via RPP to HPP; HPP forwards GET to the home's own core HP, which replies with PUT; HPP compresses the block into DRAM and delivers PUT (uncompressed data) to the requester]
Directory Protocol Extensions
Compression support (intervention cases)
[Figure: the home core HP sends GET to HPP, which forwards it to the dirty node DP; DP replies with PUT; HPP compresses the block into DRAM and returns PUT (uncompressed data) to HP]
Directory Protocol Extensions
Decompression support
[Figure: remote miss — RP sends GET/GETX via RPP to HPP; HPP decompresses the block from DRAM and returns PUT/PUTX]
Directory Protocol Extensions
Decompression support
[Figure: local miss — HP sends GET/GETX to HPP, which decompresses the block from DRAM and returns PUT/PUTX]
Compression Algorithms
Consider each 64-bit chunk of a 128-byte cache block at a time

Algorithm I
  Original                   Compressed    Encoding
  All zero                   Zero bytes    00
  MS 4 bytes zero            LS 4 bytes    01
  MS 4 bytes = LS 4 bytes    LS 4 bytes    10
  None                       64 bits       11

Algorithm II
  Differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes
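The two per-double-word encoders can be sketched in C as follows. The function names and the payload-size convention are illustrative, not from the paper; encoding 00 is assumed to store nothing.

```c
#include <stdint.h>

/* Algorithm I: classify one 64-bit double-word into a 2-bit encoding
 * and report how many payload bytes the compressed block stores. */
static int algo1_encode(uint64_t dw, int *payload_bytes)
{
    uint32_t ms = (uint32_t)(dw >> 32);
    uint32_t ls = (uint32_t)dw;
    if (dw == 0)  { *payload_bytes = 0; return 0; } /* 00: all zero      */
    if (ms == 0)  { *payload_bytes = 4; return 1; } /* 01: store LS 4 B  */
    if (ms == ls) { *payload_bytes = 4; return 2; } /* 10: store LS 4 B  */
    *payload_bytes = 8; return 3;                   /* 11: store 64 bits */
}

/* Algorithm II: identical except encoding 10 means the LS 4 bytes are
 * zero and the MS 4 bytes are stored. */
static int algo2_encode(uint64_t dw, int *payload_bytes)
{
    uint32_t ms = (uint32_t)(dw >> 32);
    uint32_t ls = (uint32_t)dw;
    if (dw == 0)  { *payload_bytes = 0; return 0; }
    if (ms == 0)  { *payload_bytes = 4; return 1; }
    if (ls == 0)  { *payload_bytes = 4; return 2; } /* 10: store MS 4 B  */
    *payload_bytes = 8; return 3;
}
```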
Compression Algorithms
Ideally, we would compute the compressed size by both algorithms for each of the 16 double-words in a cache block and pick the best
– Overhead is too high
Trade-off #1
– Speculate based on the first 64 bits
– If MS 32 bits ^ LS 32 bits = 0, use Algorithm I (covers two cases of Algorithm I)
– If MS 32 bits & LS 32 bits = 0, use Algorithm II (covers three cases of Algorithm II)
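The speculation test above is two single-cycle integer operations on the block's first double-word; a minimal sketch (the default when neither test fires is an assumption, since the slide does not say):

```c
#include <stdint.h>

/* Pick a compression algorithm by speculating on the first 64 bits.
 * Returns 1 for Algorithm I, 2 for Algorithm II. */
static int pick_algorithm(uint64_t first_dw)
{
    uint32_t ms = (uint32_t)(first_dw >> 32);
    uint32_t ls = (uint32_t)first_dw;
    if ((ms ^ ls) == 0) return 1; /* halves equal: covers cases 00 and 10 of Algorithm I  */
    if ((ms & ls) == 0) return 2; /* no overlapping ones: covers 00, 01, 10 of Algorithm II */
    return 1;                     /* no pattern detected; defaulting to I is an assumption */
}
```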
Compression Algorithms
Trade-off #2
– If the compression ratio is low, it is better to avoid decompression overhead
  Decompression is fully on the critical path
– After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (best: 48 bytes)
– Abort compression and store the entire block uncompressed as soon as the threshold is crossed
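The abort check can be folded into the per-double-word loop, as sketched below with Algorithm I inlined. This is a sketch under stated assumptions: the header packs 2-bit codes low-bits-first, code 00 stores nothing, and the exact abort condition (strictly exceeding maxCsz) is our reading of "as soon as the threshold is crossed".

```c
#include <stdint.h>
#include <string.h>

#define MAX_CSZ 48  /* best threshold reported in the talk, in bytes */

/* Compress a 128-byte block (16 double-words) with Algorithm I,
 * tracking the running compressed size; abort and signal "store
 * uncompressed" (-1) once the size would cross MAX_CSZ. */
static int compress_block(const uint64_t dw[16], uint8_t out[128],
                          uint32_t *header)
{
    int csz = 0;
    *header = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t ms = (uint32_t)(dw[i] >> 32), ls = (uint32_t)dw[i];
        int code, bytes;
        if (dw[i] == 0)    { code = 0; bytes = 0; } /* 00: nothing stored */
        else if (ms == 0)  { code = 1; bytes = 4; } /* 01: LS 4 bytes     */
        else if (ms == ls) { code = 2; bytes = 4; } /* 10: LS 4 bytes     */
        else               { code = 3; bytes = 8; } /* 11: raw 64 bits    */
        if (csz + bytes > MAX_CSZ)
            return -1;                     /* threshold crossed: abort  */
        if (bytes == 4)
            memcpy(out + csz, &ls, 4);
        else if (bytes == 8)
            memcpy(out + csz, &dw[i], 8);
        *header |= (uint32_t)code << (2 * i); /* 2-bit code per dword   */
        csz += bytes;
    }
    return csz;
}
```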
Compression Algorithms
Meta-data
– Required for decompression
– Most meta-data are stored in the unused 44 bits of the directory entry
– The cache controller generates the uncompressed block address, so directory address computation remains unchanged
– 32 bits to locate the compressed block
  The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to have a cushion for future use
  32 bits allow us to address 32 GB of compressed memory
Compression Algorithms
Meta-data
– Two bits to identify the compression algorithm
  Algorithm I, Algorithm II, uncompressed, all zero
  All-zero blocks do not store anything in memory
– For each 64 bits, need to know one of four encodings
  Maintained in a 32-bit header (two bits for each of the 16 double-words)
– Optimization to speed up relocation: store the size of the compressed block in the directory entry
  Requires four bits (16 double-words maximum)
– 70 bits of meta-data per compressed block
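The 70 bits of meta-data can be pictured as a struct; the field names are hypothetical, and only the widths come from the slides. The 32-bit location, 2-bit algorithm state, and 4-bit size fit in the directory entry's 44 unused bits, while the 32-bit per-double-word header is shown in the same struct purely for illustration.

```c
#include <stdint.h>

/* Illustrative packing of the per-block compression meta-data. */
struct comp_meta {
    uint32_t loc;         /* compressed-block address in 8-byte units (32 GB reach) */
    uint32_t header;      /* 2-bit encoding for each of the 16 double-words */
    unsigned algo  : 2;   /* Algorithm I / Algorithm II / uncompressed / all zero */
    unsigned csize : 4;   /* compressed size in double-words (relocation speed-up) */
};

/* The address and size arithmetic implied by the 8-byte granularity: */
static uint64_t comp_addr(const struct comp_meta *m)  { return (uint64_t)m->loc << 3; }
static unsigned comp_bytes(const struct comp_meta *m) { return m->csize * 8u; }
```

With the values from the decompression example on the next slide (loc = 0x4fd1276a, csize = 0101b = 5), comp_addr gives 0x4fd1276a << 3 and comp_bytes gives 40 bytes.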
Decompression Example
Directory entry information
– 32-bit address: 0x4fd1276a
  Actual address = 0x4fd1276a << 3
– Compression state: 01
  Algorithm II was used
– Compressed size: 0101
  Actual size = 40 bytes (not used in decompression)
Header information
– 32-bit header: 00 11 10 00 00 01 …
  Upper 64 bits used encoding 00 of Algorithm II
  Next 64 bits used encoding 11 of Algorithm II
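The header walk implied by this example can be sketched as follows; reading the low-order header bits as describing the first double-word is an assumption (the slide does not fix the bit order), and code 10 is given Algorithm II's meaning.

```c
#include <stdint.h>
#include <string.h>

/* Expand a compressed block back to 16 double-words by consuming the
 * 32-bit header two bits at a time (Algorithm II interpretation). */
static void decompress_block(const uint8_t *comp, uint32_t header,
                             uint64_t out[16])
{
    int off = 0;  /* byte offset into the compressed payload */
    for (int i = 0; i < 16; i++) {
        uint32_t half;
        switch ((header >> (2 * i)) & 3u) {
        case 0:                                          /* all zero    */
            out[i] = 0;
            break;
        case 1:                                          /* MS 4 B zero */
            memcpy(&half, comp + off, 4); off += 4;
            out[i] = half;
            break;
        case 2:                                          /* LS 4 B zero */
            memcpy(&half, comp + off, 4); off += 4;
            out[i] = (uint64_t)half << 32;
            break;
        default:                                         /* raw 64 bits */
            memcpy(&out[i], comp + off, 8); off += 8;
            break;
        }
    }
}
```

Every double-word is reconstructed in order, which is why decompression sits fully on the critical path of a miss.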
Performance Optimization
Protocol thread occupancy is critical
– Two protocol cores
– Out-of-order NI scheduling to improve protocol core utilization
– Cached message buffer (filled with the writeback payload)
  16 uncached loads/stores to the message buffer are needed during compression if it is not cached
  Caching requires invalidating the buffer contents at the end of compression (coherence issue)
  Flushing dirty contents occupies the datapath, so we allow only cached loads
– Compression ratio remains unaffected
Storage Saving
[Figure: percentage storage saving (0%–80% scale) for Barnes, FFT, FFTW, LU, Ocean, Radix, Water; visible bar labels include 16%, 21%, 66%, and 73%]
Slowdown
[Figure: slowdown (1.00–1.60 scale) for Barnes, FFT, FFTW, LU, Ocean, Radix, Water under configurations 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, 2PP+OOO NI+CL; visible labels: 2%, 5%, 7%, 1%, 11%, 15%, 8%]
Memory Stall Cycles
[Figure: memory stall cycle comparison (chart content not recoverable)]
Protocol Core Occupancy
Dynamic instruction count and handler occupancy

            w/o compression       w/ compression
  Barnes    29.1 M (7.5 ns)       215.5 M (31.9 ns)
  FFT       82.7 M (6.7 ns)       185.6 M (16.7 ns)
  FFTW      177.8 M (10.5 ns)     417.6 M (22.7 ns)
  LU        11.4 M (6.3 ns)       29.2 M (14.8 ns)
  Ocean     376.6 M (6.7 ns)      1553.5 M (24.1 ns)
  Radix     24.7 M (8.1 ns)       87.0 M (36.9 ns)
  Water     62.4 M (5.5 ns)       137.3 M (8.8 ns)

Occupancy is still hidden under the fastest memory access (40 ns)
Related Work
Dictionary-based
– IBM MXT
– X-Match
– X-RL
– Not well suited to cache-block grain
Frequent pattern-based
– Applied to on-chip cache blocks
Zero-aware compression
– Applied to memory blocks
See the paper for more details
Summary
Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
The compression-enabled handlers run on simple core(s) of a multi-core node
The protocol core occupancy increases significantly, but can still be hidden under memory access latency
On seven scientific computing workloads, our best design saves 16% to 73% of memory while slowing down execution by at most 15%
THANK YOU!