Adaptive Cache Compression for High-Performance Processors Alaa Alameldeen and David Wood University...

Adaptive Cache Compression for High-Performance Processors

Alaa Alameldeen and David Wood

University of Wisconsin-Madison

Wisconsin Multifacet Project

http://www.cs.wisc.edu/multifacet

ISCA 2004

Overview

Design of high performance processors
  Processor speed improves faster than memory
  Memory latency dominates performance
  Need more effective cache designs

On-chip cache compression
  + Increases effective cache size
  - Increases cache hit latency

Does cache compression help or hurt?


Does Cache Compression Help or Hurt?

[Figure: normalized runtime (y-axis, 0 to 1.2) for apache and ammp under No Compression, Compression, and Adaptive.]

Adaptive Compression determines when compression is beneficial


Outline

Motivation
Cache Compression Framework
  Compressed Cache Hierarchy
  Decoupled Variable-Segment Cache
Adaptive Compression
Evaluation
Conclusions


Compressed Cache Hierarchy

[Diagram labels: Instruction Fetcher; Load-Store Queue; L1 I-Cache (Uncompressed); L1 D-Cache (Uncompressed); L1 Victim Cache; L2 Cache (Compressed); Compression Pipeline; Decompression Pipeline; Uncompressed Line Bypass; From Memory; To Memory]


Decoupled Variable-Segment Cache

Objective: pack more lines into the same space

2-way set-associative with 64-byte lines
Tag contains address tag, permissions, LRU (replacement) bits
Add two more tags
Add compression size, status, more LRU bits
Divide data area into 8-byte segments
Data lines composed of 1-8 segments

[Diagram: Tag Area and Data Area for one set. Tag entries, each with compression status and compressed size in segments:]
  Addr A  uncompressed  3
  Addr B  compressed    2
  Addr C  compressed    6
  Addr D  compressed    4
Callouts: Compression Status; Compressed Size; Tag is present but line isn’t
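
The set layout above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design: the names Tag, segments_for, and resident_lines are hypothetical, and the fill policy (place lines in recency order, stop at the first one that does not fit) is an assumption made for the example. The segment size, line size, and the four example tags come from the slides.

```python
from dataclasses import dataclass
from math import ceil

SEGMENT_BYTES = 8      # slides: data area divided into 8-byte segments
LINE_BYTES = 64        # slides: 64-byte lines
SEGMENTS_PER_LINE = LINE_BYTES // SEGMENT_BYTES   # 8 segments when uncompressed
SEGMENTS_PER_SET = 2 * SEGMENTS_PER_LINE          # space for two uncompressed lines


@dataclass
class Tag:
    address: str
    compressed: bool   # compression status
    csize: int         # compressed size in segments, 1 to 8

    def stored_segments(self) -> int:
        # An uncompressed line occupies a full 8 segments regardless of csize.
        return self.csize if self.compressed else SEGMENTS_PER_LINE


def segments_for(compressed_bytes: int) -> int:
    """Round a compressed size in bytes up to whole 8-byte segments."""
    return max(1, ceil(compressed_bytes / SEGMENT_BYTES))


def resident_lines(tags_mru_first):
    """Return (addresses whose data fits, segments used).

    Assumed policy: fill the data area in recency order and stop at the
    first line that no longer fits.
    """
    used, resident = 0, []
    for tag in tags_mru_first:
        need = tag.stored_segments()
        if used + need > SEGMENTS_PER_SET:
            break
        used += need
        resident.append(tag.address)
    return resident, used


# The four tags from the diagram, most recently used first.
example_set = [
    Tag("A", compressed=False, csize=3),
    Tag("B", compressed=True, csize=2),
    Tag("C", compressed=True, csize=6),
    Tag("D", compressed=True, csize=4),
]
print(resident_lines(example_set))
```

With these values A takes 8 segments, B takes 2, and C takes 6, which fills all 16 segments, so D keeps its tag but has no data in the set.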


Outline

Cache Compression Framework
Adaptive Compression
  Key Insight
  Classification of L2 accesses
  Global compression predictor
Evaluation




Use past to predict future

Key Insight: LRU Stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts

Benefit(Compression) > Cost(Compression)?
  Yes: Compress future lines
  No: Do not compress future lines


Cost/Benefit Classification

Classify each cache reference
  Four-way SA cache with space for two 64-byte lines
  Total of 16 available segments

[Diagram: LRU Stack and Data Area; extracted entry: Addr B compressed 2]




An Unpenalized Hit

Read/Write Address A
  LRU Stack order = 1 ≤ 2: Hit regardless of compression
  Uncompressed Line: No decompression penalty
  Neither cost nor benefit







A Penalized Hit

Read/Write Address B
  LRU Stack order = 2 ≤ 2: Hit regardless of compression
  Compressed Line: Decompression penalty incurred
  Compression cost







An Avoided Miss

Read/Write Address C
  LRU Stack order = 3 > 2: Hit only because of compression
  Compression benefit: Eliminated off-chip miss







An Avoidable Miss

Read/Write Address D
  Line is not in the cache but tag exists at LRU stack order = 4
  Missed only because some lines are not compressed
  Potential compression benefit
  Sum(CSize) = 15 ≤ 16


An Unavoidable Miss

Read/Write Address E
  LRU stack order > 4: Compression wouldn’t have helped
  Line is not in the cache and tag does not exist
  Neither cost nor benefit
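
The five cases above can be put into one classification routine. The sketch below is an illustration under stated assumptions, not the authors' implementation: the names Tag and classify are hypothetical, and the avoidable-miss test (the compressed sizes of every tag down to the missing line's stack depth sum to at most 16 segments) is one reading of the "Sum(CSize) = 15 ≤ 16" note on the Avoidable Miss slide.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    address: str
    compressed: bool   # compression status
    csize: int         # compressed size in segments
    resident: bool     # False when the tag is present but the line is not


def classify(stack, address, uncompressed_lines=2, segments_per_set=16):
    """Classify one access against an LRU stack (most recently used first)."""
    for depth, tag in enumerate(stack, start=1):
        if tag.address != address:
            continue
        if tag.resident:
            if depth > uncompressed_lines:
                # Deeper than an uncompressed set could hold: hit only
                # because compression made room.
                return "avoided miss"
            return "penalized hit" if tag.compressed else "unpenalized hit"
        # Tag without data. Assumed test: would the line have fit if every
        # line at or above this depth had been stored compressed?
        needed = sum(t.csize for t in stack[:depth])
        return "avoidable miss" if needed <= segments_per_set else "unavoidable miss"
    return "unavoidable miss"  # no matching tag at all


# The example set from the slides, most recently used first.
stack = [
    Tag("A", compressed=False, csize=3, resident=True),
    Tag("B", compressed=True, csize=2, resident=True),
    Tag("C", compressed=True, csize=6, resident=True),
    Tag("D", compressed=True, csize=4, resident=False),
]
for addr in "ABCDE":
    print(addr, classify(stack, addr))
```

Running it on addresses A through E reproduces the five slide titles in order, with D classified as avoidable because 3 + 2 + 6 + 4 = 15 ≤ 16.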







Compression Predictor

Estimate: Benefit(Compression) - Cost(Compression)

Single counter: Global Compression Predictor (GCP)
  Saturating up/down 19-bit counter

GCP updated on each cache access
  Benefit: Increment by memory latency
  Cost: Decrement by decompression latency
  Optimization: Normalize to decompression latency = 1

Cache Allocation
  Allocate compressed lines if GCP ≥ 0
  Allocate uncompressed lines if GCP < 0
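
A minimal software sketch of such a predictor follows. Several details are assumptions rather than facts from the slides: the class name GlobalCompressionPredictor is hypothetical, the 19-bit counter is modeled as a signed two's-complement range, and both avoided and avoidable misses are treated as benefit events. The 400-cycle memory latency and 5-cycle decompression latency are taken from elsewhere in this deck.

```python
class GlobalCompressionPredictor:
    """Saturating up/down counter that estimates Benefit - Cost."""

    def __init__(self, bits=19, memory_latency=400, decompression_latency=5):
        # Assumption: the counter is signed, so 19 bits span -2^18 .. 2^18 - 1.
        self.max_value = (1 << (bits - 1)) - 1
        self.min_value = -(1 << (bits - 1))
        # Normalize so that the decompression latency counts as 1.
        self.benefit = memory_latency // decompression_latency
        self.cost = 1
        self.value = 0

    def update(self, access_class):
        """Update the counter for one classified L2 access."""
        if access_class in ("avoided miss", "avoidable miss"):
            # Assumed benefit events: a miss was, or could have been, eliminated.
            self.value = min(self.max_value, self.value + self.benefit)
        elif access_class == "penalized hit":
            # Cost event: a hit paid the decompression latency.
            self.value = max(self.min_value, self.value - self.cost)
        # Unpenalized hits and unavoidable misses leave the counter unchanged.

    def should_compress(self):
        """Allocate compressed lines when the counter is non-negative."""
        return self.value >= 0


gcp = GlobalCompressionPredictor()
gcp.update("penalized hit")
print(gcp.value, gcp.should_compress())
gcp.update("avoided miss")
print(gcp.value, gcp.should_compress())
```

With the default latencies one avoided miss outweighs 80 penalized hits, which is the effect of normalizing to a decompression latency of 1.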


Outline

Cache Compression Framework
Evaluation
  Simulation Setup
  Performance



Simulation Setup

Simics full system simulator augmented with:
  Detailed OoO processor simulator [TFSim, Mauer, et al., 2002]
  Detailed memory timing simulator [Martin, et al., 2002]

Workloads:
  Commercial workloads:
    Database servers: OLTP and SPECJBB
    Static Web serving: Apache and Zeus
  SPEC2000 benchmarks:
    SPECint: bzip, gcc, mcf, twolf
    SPECfp: ammp, applu, equake, swim


System configuration

A dynamically scheduled SPARC V9 uniprocessor

Configuration parameters:
  L1 Cache:            Split I&D, 64KB each, 2-way SA, 64B line, 2 cycles/access
  L2 Cache:            Unified 4MB, 8-way SA, 64B line, 20 cycles + decompression latency per access
  Memory:              4GB DRAM, 400-cycle access time, 128 outstanding requests
  Processor pipeline:  4-wide superscalar, 11-stage pipeline: fetch (3), decode (3), schedule (1), execute (1+), retire (3)
  Reorder buffer:      64 entries


Simulated Cache Configurations

Always: All compressible lines are stored in compressed format
  Decompression penalty for all compressed lines

Never: All cache lines are stored in uncompressed format
  Cache is 8-way set associative with half the number of sets
  Does not incur decompression penalty

Adaptive: Our adaptive compression scheme


Performance

[Figure: normalized runtime (y-axis, 0 to 1.2) under Never, Always, and Adaptive for SpecINT (bzip, gcc, mcf, twolf), SpecFP (ammp, applu, equake, swim), and Commercial (apache, zeus, oltp, jbb).]

Callouts: 35% Speedup; 18% Slowdown; Bug in GCP update

Adaptive performs similar to the best of Always and Never


Effective Cache Capacity

Cache Miss Rates

[Figure: normalized miss rate (y-axis, 0 to 1) for ammp, gcc, mcf, and apache under Never and Always.]

                                   ammp    gcc    mcf     apache
  Penalized hits per avoided miss  6709    489    12.3    4.7
  Misses per 1000 instructions     0.09    2.52   12.28   14.38
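
The penalized-hits-per-avoided-miss figures can be compared against a rough break-even point. The sketch below is a first-order estimate, not a result from the talk: it assumes the four values are listed in the chart's benchmark order (ammp, gcc, mcf, apache), charges each penalized hit the 5-cycle decompression latency, credits each avoided miss the 400-cycle memory latency, and ignores latency overlap in the out-of-order core. The names BREAK_EVEN and compression_pays_off are hypothetical.

```python
MEMORY_LATENCY = 400          # cycles, from the system configuration
DECOMPRESSION_LATENCY = 5     # cycles, from the FPC slides

# Compression breaks even when the cycles lost to penalized hits equal the
# cycles saved by avoided misses: hits * 5 == misses * 400.
BREAK_EVEN = MEMORY_LATENCY / DECOMPRESSION_LATENCY

# Assumed column order: ammp, gcc, mcf, apache.
penalized_hits_per_avoided_miss = {
    "ammp": 6709,
    "gcc": 489,
    "mcf": 12.3,
    "apache": 4.7,
}


def compression_pays_off(ratio, break_even=BREAK_EVEN):
    """True when there are few enough penalized hits per avoided miss."""
    return ratio < break_even


verdict = {
    name: compression_pays_off(ratio)
    for name, ratio in penalized_hits_per_avoided_miss.items()
}
print(BREAK_EVEN, verdict)
```

Under these assumptions the break-even ratio is 80 penalized hits per avoided miss. On this estimate mcf and apache fall below it and ammp and gcc far above it, which matches the benchmarks the talk lists as helped and hurt by compression.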


Adapting to L2 Sizes

[Figure: ammp, normalized runtime (y-axis, 0 to 1.2) for L2 sizes 256K, 1M, 4M, and 16M under Never, Always, and Adaptive.]

  L2 size                          256K    1M     4M      16M
  Penalized hits per avoided miss  0.93    5.7    6503    326000



Conclusions

Cache compression increases cache capacity but slows down cache hit time
  Helps some benchmarks (e.g., apache, mcf)
  Hurts other benchmarks (e.g., gcc, ammp)

Our Proposal: Adaptive compression
  Uses (LRU) replacement stack to determine whether compression helps or hurts
  Updates a single global saturating counter on cache accesses

Adaptive compression performs similar to the better of Always Compress and Never Compress


Backup Slides

Frequent Pattern Compression (FPC)
Decoupled Variable-Segment Cache
Classification of L2 Accesses
(LRU) Stack Replacement
Cache Miss Rates
Adapting to L2 Sizes: mcf
Adapting to L1 Size
Adapting to Decompression Latency: mcf
Adapting to Decompression Latency: ammp
Phase Behavior: gcc
Phase Behavior: mcf
Can We Do Better Than Adaptive?



Decoupled Variable-Segment Cache

Each set contains four tags and space for two uncompressed lines

Data area divided into 8-byte segments

Each tag is composed of:
  Address tag
  Permissions
  CStatus: 1 if the line is compressed, 0 otherwise
  CSize: Size of compressed line in segments
  LRU/replacement bits

Callout: Same as uncompressed cache


Frequent Pattern Compression

A significance-based compression algorithm

Related Work:
  X-Match and X-RL Algorithms [Kjelso, et al., 1996]
  Address and data significance-based compression [Farrens and Park, 1991, Citron and Rudolph, 1995, Canal, et al., 2000]

A 64-byte line is decompressed in five cycles

More details in technical report:
  “Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches,” Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).


Frequent Pattern Compression (FPC)

A significance-based compression algorithm combined with zero run-length encoding
  Compresses each 32-bit word separately
  Suitable for short (32-256 byte) cache lines
  Compressible Patterns: zero runs, sign-ext. 4,8,16-bits, zero-padded half-word, two SE half-words, repeated byte
  A 64-byte line is decompressed in a five-stage pipeline

More details in technical report:
  “Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches,” Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).
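
To make the pattern list concrete, the sketch below classifies a single 32-bit word into one of the named patterns. It is an illustration only and does not reproduce the encoding from the technical report: the function names are hypothetical, the order in which patterns are tried is an assumption, "zero-padded half-word" is assumed to mean a word whose low 16 bits are zero, and "two SE half-words" is assumed to mean that each half-word is a sign-extended byte. Zero runs span several words, so a single zero word is simply reported as "zero".

```python
def sign_extends(value, bits, width):
    """True if `value` (unsigned, `width` bits wide) is the sign extension
    of its own low `bits` bits."""
    low = value & ((1 << bits) - 1)
    if low & (1 << (bits - 1)):
        # Sign bit set: fill the upper bits with ones.
        low |= ((1 << width) - 1) & ~((1 << bits) - 1)
    return low == value


def classify_word(word):
    """Name the first pattern from the slide's list that a 32-bit word matches."""
    word &= 0xFFFFFFFF
    if word == 0:
        return "zero"
    for bits in (4, 8, 16):
        if sign_extends(word, bits, 32):
            return f"sign-extended {bits}-bit"
    if word & 0xFFFF == 0:
        return "zero-padded half-word"          # assumed: low half is zero
    high, low = word >> 16, word & 0xFFFF
    if sign_extends(high, 8, 16) and sign_extends(low, 8, 16):
        return "two sign-extended half-words"   # assumed: each half is an SE byte
    if word == (word & 0xFF) * 0x01010101:
        return "repeated byte"
    return "uncompressed"


for sample in (0x00000000, 0xFFFFFFF8, 0x0000007F, 0x00001234,
               0x12340000, 0x0012FFF0, 0xABABABAB, 0x12345678):
    print(f"{sample:#010x}", classify_word(sample))
```

A full compressor would apply a check like this to each of the sixteen words in a 64-byte line and merge adjacent zero words into runs.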


Classification of L2 Accesses

Cache hits:
    Unpenalized hit: Hit to an uncompressed line that would have hit without compression
  - Penalized hit: Hit to a compressed line that would have hit without compression
  + Avoided miss: Hit to a line that would NOT have hit without compression

Cache misses:
  + Avoidable miss: Miss to a line that would have hit with compression
    Unavoidable miss: Miss to a line that would have missed even with compression


(LRU) Stack Replacement

Differentiate penalized hits and avoided misses?
  Only hits to top half of the tags in the LRU stack are penalized hits

Differentiate avoidable and unavoidable misses?

Is not dependent on LRU replacement
  Any replacement algorithm for top half of tags
  Any stack algorithm for the remaining tags


Cache Miss Rates

[Figure: normalized miss rate (y-axis, 0 to 1.2) for ammp, gcc, mcf, and apache under Never, Always, and Adaptive.]


Adapting to L2 Sizes

[Figure: mcf, normalized runtime (y-axis, 0 to 1.2) for L2 sizes 256K, 1M, 4M, and 16M under Never, Always, and Adaptive.]

  L2 size                          256K    1M     4M      16M
  Penalized hits per avoided miss  11.6    4.4    12.6    2x10^6


Adapting to L1 Size


Adapting to Decompression Latency

[Figure: mcf, normalized runtime (y-axis, 0 to 1.2) versus decompression latency (x-axis, 0 to 25 cycles) under Never, Always, and Adaptive.]


Adapting to Decompression Latency

[Figure: ammp, normalized runtime (y-axis, 0 to 2) versus decompression latency (x-axis, 0 to 25 cycles) under Never, Always, and Adaptive.]


Phase Behavior

[Figure panels: Predictor Value (K); Cache Size (MB)]

Phase Behavior

[Figure panels: Predictor Value (K); Cache Size (MB)]


Can We Do Better Than Adaptive?

Optimal is an unrealistic configuration: Always with no decompression penalty

[Figure: normalized runtime (y-axis, 0 to 1.2) under Never, Always, Adaptive, and Optimal for bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, and jbb.]
