CMP-MSI, Feb. 11th 2007
Core to Memory Interconnection Implications for Forthcoming
On-Chip Multiprocessors
Carmelo Acosta (1)
Francisco J. Cazorla (2)
Alex Ramírez (1,2)
Mateo Valero (1,2)
(1) UPC-Barcelona
(2) Barcelona Supercomputing Center
Overview
Introduction
Simulation Methodology
Results
Conclusions
Introduction
As process technology advances, the question of what to do with the available transistors becomes more important.
The current trend is to replicate cores:
• Intel: Pentium4, Core Duo, Core 2 Duo, Core 2 Quad
• AMD: Opteron Dual-Core, Opteron Quad-Core
• IBM: POWER4, POWER5
• Sun Microsystems: Niagara T1, Niagara T2
Introduction
[Figure: Power4 (CMP) and Power5 (CMP+SMT) chips]
The memory subsystem (green) occupies more than half of the chip area.
Introduction
Each L1 is connected to each L2 bank with a bus-based interconnection network.
Goal
Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?
No. We have to revisit well-known SMT ideas.
Instruction Fetch Policy
ICOUNT
[Figure: pipeline diagram, Fetch to ROB]
ICOUNT
[Figure: pipeline diagram, Fetch to ROB; an L2 miss occurs and fetch is stalled]
The processor's resources are balanced between the running threads. However, all resources devoted to the blue thread remain unused until its L2 miss is resolved.
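The balancing behavior described above can be illustrated with a minimal sketch. It assumes the usual description of ICOUNT in the SMT literature: fetch priority goes to the thread with the fewest instructions in the pre-issue stages. The function name `icount_pick` and the dictionary of per-thread counters are hypothetical, not taken from the slides or from SMTsim.

```python
def icount_pick(in_flight, n_fetch=1):
    """Pick the threads that may fetch this cycle.

    in_flight maps a thread id to the number of its instructions that are
    currently in the pre-issue stages (assumed counter, one per thread).
    Threads with the fewest in-flight instructions get priority; ties are
    broken by thread id so the result is deterministic.
    """
    ranked = sorted(in_flight, key=lambda t: (in_flight[t], t))
    return ranked[:n_fetch]


# A thread blocked on an L2 miss keeps its entries, so its count stays high
# and it simply loses fetch priority; its resources are not released.
print(icount_pick({"blue": 12, "red": 5}))
```

Note that the sketch only chooses who fetches. It does nothing about the entries already held by a blocked thread, which is the weakness the slide points out.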
FLUSH
[Figure: pipeline diagram, Fetch to ROB; an L2 miss triggers FLUSH]
All resources devoted to the pending instructions of the blue thread are freed.
FLUSH
[Figure: pipeline diagram, Fetch to ROB; after the L2 miss the blue thread is stalled]
The freed resources allow additional forward progress.
Late detection of L2 misses motivates L2 miss prediction.
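The FLUSH behavior described in these two slides can be sketched as follows. This is a toy model under stated assumptions, not the simulator's implementation: the `Core` class, its fields, and the choice to squash only the instructions younger than the missing load are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Shared ROB modeled as (thread, sequence number) pairs in program order.
    rob: list = field(default_factory=list)
    # Threads whose fetch is stalled while an L2 miss is outstanding.
    stalled: set = field(default_factory=set)

    def on_l2_miss(self, thread, load_seq):
        """Squash the thread's instructions younger than the missing load,
        stall its fetch, and return the number of entries freed."""
        before = len(self.rob)
        self.rob = [(t, s) for (t, s) in self.rob
                    if t != thread or s <= load_seq]
        self.stalled.add(thread)
        return before - len(self.rob)

    def on_miss_resolved(self, thread):
        """Let the thread fetch again once its L2 miss has been serviced."""
        self.stalled.discard(thread)


core = Core(rob=[("blue", 1), ("red", 1), ("blue", 2), ("blue", 3)])
freed = core.on_l2_miss("blue", load_seq=1)
print(freed, core.rob, core.stalled)
```

The entries returned by `on_l2_miss` are what the other threads can use to make forward progress while the blue thread waits.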
Single vs Multi Core
[Figure: a single core (I$, D$) connected to L2 banks b0 to b3, next to a multicore whose cores (each with I$ and D$) share L2 banks b0 to b3]
More pressure on both:
• Interconnection Network
• Shared L2 banks
Single vs Multi Core
[Figure: a single core (I$, D$) connected to L2 banks b0 to b3, next to a multicore whose cores (each with I$ and D$) share L2 banks b0 to b3]
More unpredictable L2 access latency: bad for FLUSH.
Simulation Methodology
Trace-driven SMT simulator derived from SMTsim.
C2T2, C3T2 and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core).
[Figure: multicore with per-core I$ and D$ connected to L2 banks b0 to b3]
[Table: core details (* per thread)]
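As a small illustration of the CXTY naming used for the configurations, the sketch below decodes a label into its core and thread counts. The helper name `parse_config` and the returned dictionary keys are hypothetical and only serve the example.

```python
import re


def parse_config(name):
    """Decode a CXTY label: X cores, Y threads per core."""
    match = re.fullmatch(r"C(\d+)T(\d+)", name)
    if match is None:
        raise ValueError(f"not a CXTY configuration name: {name!r}")
    cores, threads = int(match.group(1)), int(match.group(2))
    return {
        "cores": cores,
        "threads_per_core": threads,
        "total_threads": cores * threads,
    }


for label in ("C2T2", "C3T2", "C4T2"):
    print(label, parse_config(label))
```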
Simulation Methodology
Instruction Fetch Policies:
ICOUNT
FLUSH
Workloads classified by type:
• ILP: all threads have good memory behavior.
• MEM: all threads have bad memory behavior.
• MIX: mixes both types of threads.
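The classification above can be written as a short sketch. The slides do not say how a thread is judged to have good or bad memory behavior, so the function takes that verdict as a boolean per thread; the name `classify_workload` is hypothetical.

```python
def classify_workload(thread_is_memory_bound):
    """Label a workload from per-thread flags.

    thread_is_memory_bound: one boolean per thread, True when the thread has
    bad memory behavior (how that is decided is outside this sketch).
    """
    flags = list(thread_is_memory_bound)
    if not flags:
        raise ValueError("a workload needs at least one thread")
    if all(flags):
        return "MEM"
    if not any(flags):
        return "ILP"
    return "MIX"


print(classify_workload([False, False]))
print(classify_workload([True, False]))
```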
Results: Single-Core (2 threads)
FLUSH yields a 22% average speedup over ICOUNT in MIX workloads. The gains come mainly from MEM and MIX workloads.
Results: Multi-Core (2 threads/core)
FLUSH drops to a 9% average slowdown relative to ICOUNT in a four-core multicore.
More cores, less speedup.
Results: L2 Hit Latency on Multi-Core
More cores: higher L2 hit latency and more dispersion.
[Figure: L2 hit latency (cycles)]
Results: L2 miss prediction
In this four-core example, the best choice is to predict an L2 miss after 90 cycles.
Results: L2 miss prediction
However, in this other four-core example, the best choice is not to predict L2 misses at all.
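Why a single threshold cannot suit every case can be seen in a minimal sketch. It assumes that the prediction is a simple timeout: a load is guessed to have missed in L2 once it has been outstanding for a given number of cycles. The function names and the latency values are purely illustrative and are not measurements from the study.

```python
def predict_l2_miss(cycles_outstanding, threshold=90):
    """Guess that a load missed in L2 once it has waited `threshold` cycles.

    threshold=None models the "do not predict" choice: FLUSH then waits for
    the real L2 miss signal. Treating exactly `threshold` cycles as a miss
    is an assumption of this sketch.
    """
    if threshold is None:
        return False
    return cycles_outstanding >= threshold


def false_triggers(l2_hit_latencies, threshold):
    """Count L2 hits that would be wrongly treated as misses."""
    return sum(predict_l2_miss(lat, threshold) for lat in l2_hit_latencies)


# Illustrative L2 hit latencies (cycles) on a congested interconnect.
slow_hits = [20, 95, 40, 120]
print(false_triggers(slow_hits, 90))
print(false_triggers(slow_hits, None))
```

With widely dispersed hit latencies, a fixed threshold flushes threads whose loads were about to hit, which is consistent with the slide's observation that not predicting can be the better choice.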
Conclusions
Future high-degree CMPs open new challenging research topics in CMP+SMT cooperation.
The CMP outer cache level and interconnection characteristics may heavily affect SMT intra-core performance.
For example, FLUSH relies on a predictable L2 hit latency, which is heavily affected in a CMP+SMT scenario.
FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.
Thank you
Questions?