Gaining Insights into Multi-Core Cache Partitioning:
Bridging the Gap between Simulation and Real Systems

Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², and P. Sadayappan²
¹ Department of ECE, Iowa State University
² Department of CSE, The Ohio State University
Shared Caches Can Be a Critical Bottleneck in Multi-Core Processors

L2/L3 caches are shared by multiple cores:
Intel Xeon 51xx (2 cores / L2), AMD Barcelona (4 cores / L3), Sun T2 (8 cores / L2), ...

Effective cache partitioning is critical to address the bottleneck caused by conflicting accesses in shared caches.

Several hardware cache partitioning methods have been proposed, with different optimization objectives:
Performance: [HPCA’02], [HPCA’04], [Micro’06]
Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
QoS: [ICS’04], [ISCA’07]

(Figure: multiple cores sharing one L2/L3 cache.)
Limitations of Simulation-Based Studies

Excessive simulation time: whole programs cannot be evaluated; it would take several weeks or months to complete a single SPEC CPU2006 benchmark. As the number of cores continues to increase, simulation capability becomes even more limited.

Absence of long-term OS activities: interactions between the processor and the OS affect performance significantly.

Proneness to simulation inaccuracy: bugs in the simulator, and the impossibility of modeling many dynamics and details of the real system.
Our Approach to Address the Issues

Design and implement OS-based cache partitioning:
Embed the cache partitioning mechanism in the OS by enhancing the page coloring technique, supporting both static and dynamic cache partitioning.

Evaluate cache partitioning policies on commodity processors:
Execution- and measurement-based: run applications to completion and measure performance with hardware counters.
Four Questions to Answer
Can we confirm the conclusions made by the simulation-based studies?
Can we provide new insights and findings that simulation is not able to?
Can we make a case for our OS-based approach as an effective option to evaluate multicore cache partitioning designs?
What are the advantages and disadvantages of OS-based cache partitioning?
Outline

Introduction
Design and implementation of OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results
Conclusion
OS-Based Cache Partitioning Mechanisms

Static cache partitioning:
Predetermines the amount of cache blocks allocated to each program at the beginning of its execution.
Page coloring enhancement: divide the shared cache into multiple regions and partition the cache regions through OS page address mapping.

Dynamic cache partitioning:
Adjusts cache quotas among processes dynamically.
Page re-coloring: dynamically change processes’ cache usage through OS page address re-mapping.
Page Coloring

(Figure: a virtual address splits into virtual page number and page offset; address translation maps it to a physical page number and page offset. The address of a physically indexed cache splits into cache tag, set index, and block offset. The page color bits are the bits shared by the physical page number and the set index, and they are under OS control.)

Physically indexed caches are divided into multiple regions (colors).
All cache lines in a physical page are cached in one of those regions (colors).
The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits.
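To make the mechanism concrete, here is a minimal sketch of how a page color can be derived from a physical address. The constants (4KB pages, 16 colors) match the platform described later in the talk, but the helper name and bit layout are illustrative assumptions, not code from the paper.

```python
PAGE_SHIFT = 12      # assumed 4 KB pages
NUM_COLORS = 16      # the study divides the L2 cache into 16 colors

def page_color(phys_addr: int) -> int:
    # The color is the low bits of the physical page number that
    # overlap the cache set index (illustrative layout).
    return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1)

# Every line within one 4 KB page maps to the same color, so choosing
# the physical page chooses the cache region:
page_base = 0x12345000
assert all(page_color(page_base + off) == page_color(page_base)
           for off in range(0, 4096, 64))
```

By handing a process only physical pages whose color bits fall in its allotted set, the OS confines that process to the corresponding slice of the cache.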
Enhancement for Static Cache Partitioning

(Figure: physical pages are grouped into page bins 1, 2, 3, 4, …, i, i+1, i+2, … according to their page color; the OS address mapping gives Process 1 and Process 2 disjoint sets of bins, partitioning the physically indexed cache.)

The shared cache is partitioned between two processes through address mapping.
Cost: main memory space needs to be partitioned too (co-partitioning).
Dynamic Cache Partitioning

Why? Programs have dynamic behaviors, and most proposed schemes are dynamic.
How? Page re-coloring.
How to handle the overhead? Measure it with performance counters and remove it from the results, emulating hardware schemes.
Dynamic Cache Partitioning through Page Re-Coloring

(Figure: a page links table with one linked list per color, 0 through N−1; the allocated colors mark the lists currently assigned to the process.)

Page re-coloring: allocate a page in the new color, copy the memory contents, and free the old page.
Pages of a process are organized into linked lists by their colors.
Memory allocation guarantees that pages are evenly distributed across all the lists (colors) to avoid hot spots.
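The three re-coloring steps above can be sketched as follows. The per-color lists mirror the page links table on the slide; the class and method names are illustrative, not the kernel implementation.

```python
from collections import defaultdict

class ColoredPages:
    """Toy model of per-color page lists with re-coloring (illustrative)."""
    def __init__(self):
        self.lists = defaultdict(list)   # color -> pages currently in that color

    def allocate(self, color, contents):
        page = {"color": color, "data": contents}
        self.lists[color].append(page)
        return page

    def recolor(self, page, new_color):
        # 1. allocate a page in the new color, 2. copy the contents into it,
        new_page = self.allocate(new_color, page["data"])
        # 3. free the old page
        self.lists[page["color"]].remove(page)
        return new_page
```

For example, re-coloring a page from color 2 to color 5 leaves the color-2 list empty while the data survives in the color-5 list.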
Control the Page Migration Overhead

Control the frequency of page migration: frequent enough to capture application phase changes, but not so frequent that it introduces large page migration overhead.

Lazy migration: avoid unnecessary page migration.
Observation: not all pages are accessed between their two migrations.
Optimization: do not migrate a page until it is accessed.

After the optimization, the page migration overhead is 2% on average, and up to 7%.
Lazy Page Migration

(Figure: the process page lists for colors 0 through N−1 before and after the allocated colors change; pages left in the de-allocated colors are not migrated until they are accessed, avoiding unnecessary page migration.)
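A minimal sketch of the lazy scheme: repartitioning only records a page’s new target color, and the copy is deferred until the page is next touched (in the real system this would be triggered on access). The class and its fields are illustrative assumptions.

```python
class LazyPage:
    """A page that migrates to its target color only on access (illustrative)."""
    def __init__(self, color, data):
        self.color, self.data = color, data
        self.target = None       # pending color, recorded at repartition time
        self.migrations = 0      # count of actual copies performed

    def set_target(self, new_color):
        self.target = new_color  # cheap: no copy happens here

    def access(self):
        if self.target is not None and self.target != self.color:
            self.color = self.target   # migrate now (a copy in the real system)
            self.migrations += 1
        self.target = None
        return self.data
```

A page whose color assignment changes twice between accesses is copied once, not twice, which is where the reduction to roughly 2% average overhead comes from.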
Outline

Introduction
Design and implementation of OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results
Conclusion
Experimental Environment

Dell PowerEdge 1950: two-way SMP with Intel dual-core Xeon 5160 processors, a shared 16-way 4MB L2 cache, and 8GB of Fully Buffered DIMM memory.
Red Hat Enterprise Linux 4.0 with a 2.6.20.3 kernel; performance counter tool Pfmon from HP.
The L2 cache is divided into 16 colors.
Benchmark Classification

29 benchmarks from SPEC CPU2006, classified by whether they are sensitive to L2 cache capacity:

Red group (6 benchmarks): IPC(1MB L2) / IPC(4MB L2) < 80%. Giving red benchmarks more cache brings a big performance gain.
Yellow group (9 benchmarks): 80% ≤ IPC(1MB L2) / IPC(4MB L2) < 95%. Giving yellow benchmarks more cache brings a moderate performance gain.
Otherwise, classify by how extensively the benchmark accesses the L2 cache:
Green group (6 benchmarks): ≥ 14 accesses per 1K cycles; give them a small cache share.
Black group (8 benchmarks): < 14 accesses per 1K cycles; cache insensitive.
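The classification rule above can be written as one small function. The thresholds come from the slide; the function and argument names are illustrative.

```python
def classify(ipc_1mb, ipc_4mb, accesses_per_kcycle):
    """Classify a benchmark by L2 capacity sensitivity (thresholds from the slide)."""
    ratio = ipc_1mb / ipc_4mb
    if ratio < 0.80:
        return "red"      # big gain from extra cache
    if ratio < 0.95:
        return "yellow"   # moderate gain from extra cache
    if accesses_per_kcycle >= 14:
        return "green"    # accesses L2 heavily but is insensitive: give it little cache
    return "black"        # cache insensitive
```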
Workload Construction

From the 6 red, 9 yellow, and 6 green benchmarks, we construct 27 two-core workloads covering representative benchmark combinations:
RR (3 pairs), RY (6 pairs), RG (6 pairs), YY (3 pairs), YG (6 pairs), GG (3 pairs).
Outline

Introduction
OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results: performance and fairness
Conclusion
Performance – Metrics

We divide the metrics into evaluation metrics and policy metrics [PACT’06].
Evaluation metrics: the optimization objectives; not always available at run time.
Policy metrics: used to drive dynamic partitioning policies; available at run time. Examples: sum of IPCs, combined cache miss rate, and combined cache misses.
Static Partitioning

Total number of cache colors: 16. Each program gets at least two colors, and each program gets 1GB of memory to avoid swapping (because of co-partitioning).
Try all possible partitionings for all workloads: (2:14), (3:13), (4:12), …, (8:8), …, (13:3), (14:2).
For each partitioning, get the value of the evaluation metrics, and compare the performance of all partitionings with the performance of the shared cache.
Performance – Optimal Static Partitioning

(Figure: performance gain of optimal static partitioning, from 1.00 to 1.25, for workload types RR, RY, RG, YY, YG, and GG under four evaluation metrics: throughput, average weighted speedup, normalized SMT speedup, and fair speedup.)

The results confirm that cache partitioning has a significant performance impact.
Different evaluation metrics yield different performance gains.
RG-type workloads have the largest performance gains (up to 47%).
The other workload types also show performance gains (2% to 10%).
A New Finding

Workload RG1: 401.bzip2 (red) + 410.bwaves (green).
Intuitively, giving more cache space to 401.bzip2 (red) should increase the performance of 401.bzip2 significantly and decrease the performance of 410.bwaves only slightly.
However, we observe that giving cache space from one program to the other increases the performance of both.
(Figure: memory bandwidth utilization, ranging over 2.70–3.05 GB/s, and average memory access latency, ranging over 140–156 ns, across partitionings from 2:14 to 14:2.)
Insight into Our Finding
We have the same observation in RG4, RG5, and YG5.
This was not observed by simulation, because the simulators did not model the main memory sub-system in detail and assumed a fixed memory access latency.
This shows the advantage of our execution- and measurement-based study.
Performance – Dynamic Partition Policy

A simple greedy policy, emulating the policy of [HPCA’02]:
Init: partition the cache as (8:8).
Run the current partitioning (P0:P1) for one epoch; if the workload has finished, exit.
Try one epoch for each of the two neighboring partitionings, (P0−1 : P1+1) and (P0+1 : P1−1).
Choose the next partitioning based on the best policy metric measurement, and repeat.
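The greedy loop above can be sketched as follows. Here `measure` stands in for reading the chosen policy metric (e.g. combined miss rate, lower is better) from hardware counters over one epoch; the function and its signature are illustrative assumptions, not the paper’s interface.

```python
def next_partition(p0, measure, total=16, min_colors=2):
    """One greedy step: probe the current split and its two neighbors,
    keep whichever yields the best (lowest) policy metric."""
    candidates = [p0]
    if p0 - 1 >= min_colors:              # neighbor (P0-1 : P1+1)
        candidates.append(p0 - 1)
    if total - (p0 + 1) >= min_colors:    # neighbor (P0+1 : P1-1)
        candidates.append(p0 + 1)
    # run one epoch per candidate and keep the best measurement
    return min(candidates, key=lambda c: measure(c, total - c))
```

Starting from (8:8), repeating this step moves one color per epoch toward the partitioning the metric favors, without ever violating the two-color minimum per program.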
Performance – Static & Dynamic

Combined miss rate is used as the policy metric.
For RG-type workloads and some RY-type workloads, static partitioning outperforms dynamic partitioning.
For RR- and YY-type workloads and the remaining RY-type workloads, dynamic partitioning outperforms static partitioning.
Fairness – Metrics and Policy [PACT’04]

Evaluation metric: FM0, the difference in slowdown; smaller is better.
Policy metrics: FM1 to FM5.
Policy: repartitioning and rollback.
Fairness – Results

Dynamic partitioning can achieve better fairness if we use FM0 as both the evaluation metric and the policy metric.
None of the policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable to static partitioning, although a strong correlation was reported in the simulation-based study [PACT’04].
In our experiments, none of the policy metrics has a consistently strong correlation with FM0. Differences between the two studies:
Ours: SPEC CPU2006 (ref input), trillions of instructions completed, 4MB L2 cache.
[PACT’04]: SPEC CPU2000 (test input), less than one billion instructions, 512KB L2 cache.
Conclusion

Confirmed some conclusions made by simulation studies.
Provided new insights and findings: giving cache space from one program to another can increase the performance of both, and the correlation between evaluation and policy metrics for fairness is poor.
Made a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning designs.
Advantages of OS-based cache partitioning: it works on commodity processors, enabling an execution- and measurement-based study.
Disadvantages: co-partitioning (which may underutilize memory) and page migration overhead.
Ongoing Work

Reduce migration overhead on commodity processors.
Cache partitioning at the compiler level: partition the cache at the object level.
Hybrid cache partitioning methods: remove the cost of co-partitioning and avoid page migration overhead.
Gaining Insights into Multi-Core Cache Partitioning:
Bridging the Gap between Simulation and Real Systems

Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², and P. Sadayappan²
¹ Iowa State University  ² The Ohio State University
Thanks!
Backup Slides
Fairness – Correlation between Evaluation Metrics and Policy Metrics (Reported by [PACT’04])

(Figure: correlations Corr(M1,M0) through Corr(M5,M0) for apsi+equake, gzip+apsi, swim+gzip, tree+mcf, and the average over 18 workloads, on a scale from −1 to 1.)

A strong correlation was reported in the simulation study [PACT’04].
Fairness – Correlation between Evaluation Metrics and Policy Metrics (Our Result)

(Figure: correlations of FM1, FM3, FM4, and FM5 with FM0 for workloads YY1–YY3, YG1–YG6, and GG1–GG3, on a scale from −1 to 1.)

None of the policy metrics has a consistently strong correlation with FM0. Differences from [PACT’04]: SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input); trillions of instructions completed vs. less than one billion; a 4MB L2 cache vs. a 512KB L2 cache.