Utilizing Shared Data in Chip Multiprocessors
with the Nahalal Architecture
Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser
The Technion – Israel Institute of Technology
Slide 2
CMPs severely stress on-chip caches:
- Capacity
- Bandwidth
- Latency
Data sharing complicates our life:
- Contention on shared data
- Synchronization
Caches are a principal challenge in CMPs.
How to organize & handle data in CMP caches?
Slide 3
Outline:
- Caches in CMP
  - Cache-in-the-Middle layout
  - Application characterization
- Nahalal solution
  - Overview
  - Results
- Putting Nahalal into practice
  - Line search
  - Scalability
- Summary
Slide 4
Tackling Cache Latency via NUCA
Due to the growing wire delay, hit time depends on physical location [Agarwal et al., ISCA 2000]:
- Non-uniform access times: closer data => smaller hit time
- Aim for vicinity of reference: locate data lines closer to their clients
NUCA - Non-Uniform Cache Architecture [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04]
[Figure: a distributed L2 cache of 64 banks (0-63), surrounded by processors P0-P7]
Dynamic NUCA (DNUCA) migrates cache lines towards the processors that access them [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04].
Source: [Keckler et al., ISSCC 2003]
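To make the DNUCA mechanism concrete, below is a minimal sketch in Python with entirely hypothetical parameters (BASE_HIT, HOP_COST, the 1-D distance model): hit time grows with the distance between the requesting processor and the bank holding the line, and each hit migrates the line one bank closer to the requester. This illustrates the idea only; it is not the exact policy of the cited papers.

```python
# Toy DNUCA model (hypothetical, not the policy of Kim et al. or
# Beckmann & Wood): hit latency grows with the distance between the
# requesting CPU and the bank holding the line, and each hit pulls
# the line one bank closer to the requester.

BASE_HIT = 4   # assumed fixed bank access latency (cycles)
HOP_COST = 2   # assumed interconnect latency per hop (cycles)

class DNUCACache:
    def __init__(self, num_banks):
        self.num_banks = num_banks
        self.location = {}                     # line address -> bank index

    def distance(self, cpu, bank):
        # Toy 1-D distance; a real floorplan would use 2-D NoC hops.
        return abs(cpu - bank)

    def access(self, cpu, addr):
        # New lines start in a middle bank (arbitrary choice).
        bank = self.location.setdefault(addr, self.num_banks // 2)
        latency = BASE_HIT + HOP_COST * self.distance(cpu, bank)
        # Gradual migration: move the line one bank toward the requester.
        if bank < cpu:
            self.location[addr] = bank + 1
        elif bank > cpu:
            self.location[addr] = bank - 1
        return latency

cache = DNUCACache(num_banks=8)
print([cache.access(cpu=0, addr=0x40) for _ in range(4)])  # [12, 10, 8, 6]
```

Repeated hits shrink the latency as the line migrates toward CPU 0, which is exactly the vicinity-of-reference effect DNUCA aims for.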
Slide 5
Cache-In-the-Middle Layout (CIM)
Shared L2 cache:
- Higher capacity utilization
- Single copy => no inter-cache coherence
- Banked DNUCA, interconnected using a Network-on-Chip (NoC)
[Figure: CIM floorplan - L2 banks 0-7 in the middle of the die, with CPU0-CPU3 above and CPU4-CPU7 below; shown alongside the distributed L2 NUCA layout]
[Beckmann et al., MICRO'06], [Beckmann and Wood, MICRO'04]
Slide 6
Remoteness of Shared Data
Shared data inevitably resides far from (some of) its clients, resulting in long access times.
[Figure: in both the distributed L2 and CIM layouts, a shared line inevitably sits far from some of the processors that access it]
Slide 7
Observations on Memory Accesses
For many parallel applications (Splash-2, SpecOMP, Apache, SPECjbb, STM, ...):
1. Access to shared lines is substantial.
2. Shared lines are shared by many processors.
3. A small number of lines accounts for a large fraction of the total accesses.
In short: a small number of lines, shared by many processors, is accessed numerous times
⇒ the shared hot lines effect.
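As an illustration, the sketch below quantifies these observations from a memory-access trace; the trace format of (cpu, line_address) pairs and all names here are hypothetical, not tooling from the paper.

```python
# Hypothetical trace analysis illustrating the shared-hot-lines effect.
# A trace is a sequence of (cpu, line_address) pairs.
from collections import Counter, defaultdict

def shared_hot_lines(trace, top_n=10):
    accesses = Counter()            # line -> total access count
    sharers = defaultdict(set)      # line -> set of CPUs that touched it
    for cpu, line in trace:
        accesses[line] += 1
        sharers[line].add(cpu)

    shared = {l for l, s in sharers.items() if len(s) > 1}
    total = sum(accesses.values())
    # Observation 1: fraction of all accesses that go to shared lines.
    shared_frac = sum(accesses[l] for l in shared) / total
    # Observation 3: fraction of accesses absorbed by the top-N hot lines.
    hot_frac = sum(c for _, c in accesses.most_common(top_n)) / total
    return shared_frac, hot_frac

trace = [(0, 0x100), (1, 0x100), (2, 0x100), (0, 0x200), (3, 0x100)]
print(shared_hot_lines(trace, top_n=1))   # (0.8, 0.8): one hot shared line
```

On real traces of the workloads above, the point of the slide is that both fractions are high at once: the hot lines and the shared lines largely coincide.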
Slide 8
Shared Data Hinders Cache Performance
[Figure: in the CIM layout, shared lines end up far from some of CPU0-CPU7]
What can be done better?
- Bring shared data closer to all processors
- Preserve vicinity of private data
Slide 9
This Has Been Addressed Before
[Figure: aerial view of the Nahalal cooperative village, annotated with processors P0-P7 as an overview of the Nahalal cache organization]
Slide 10
Nahalal Layout
A new architectural differentiation of cache lines, according to the way the data is used: private vs. shared.
- A designated area for shared data lines in the center: a small & fast structure, close to all processors
- Outer rings used for private data: preserves the vicinity of private data
A more realistic layout:
[Figure: Nahalal floorplan - a shared bank in the center of the die, surrounded by CPU0-CPU7 and their private banks 0-7]
Slide 11
Nahalal Cache Management
Where does the data go?
- First access: the line goes to the private yard of the requester.
- Accesses by additional cores: the line goes to the middle.
- On eviction from an over-crowded middle, a line can go to any sharer's private yard.
In typical workloads, virtually all accesses to shared data are satisfied from the middle.
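A minimal sketch of the placement rules above, assuming toy structures for the private yards and the shared middle bank; capacity accounting, victim selection, and coherence bookkeeping are deliberately simplified and hypothetical, so this is an illustration of the policy rather than the full protocol.

```python
import random

class NahalalL2:
    """Toy model of Nahalal placement: first touch goes to the
    requester's private yard; a touch by a second core promotes the
    line to the shared middle; eviction from a full middle demotes
    the victim into one of its sharers' yards."""
    def __init__(self, num_cpus, middle_capacity):
        self.yard = {c: set() for c in range(num_cpus)}   # private yards
        self.middle = {}              # shared middle: line -> sharer CPUs
        self.middle_capacity = middle_capacity
        self.owner = {}               # line -> owning CPU while private

    def access(self, cpu, line):
        if line in self.middle:                  # shared: served from middle
            self.middle[line].add(cpu)
        elif line not in self.owner:             # first access: private yard
            self.owner[line] = cpu
            self.yard[cpu].add(line)
        elif self.owner[line] != cpu:            # second core: promote
            first = self.owner.pop(line)
            self.yard[first].discard(line)
            if len(self.middle) >= self.middle_capacity:
                self._evict_from_middle()
            self.middle[line] = {first, cpu}

    def _evict_from_middle(self):
        victim, sharers = self.middle.popitem()  # simplistic victim choice
        dest = random.choice(sorted(sharers))    # any sharer's yard will do
        self.yard[dest].add(victim)
        self.owner[victim] = dest

l2 = NahalalL2(num_cpus=8, middle_capacity=1024)
l2.access(0, 0x100)          # first touch: CPU0's private yard
l2.access(1, 0x100)          # second core: promoted to the middle
assert 0x100 in l2.middle
```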
Slide 12
Simulations
Full-system simulation via SIMICS:
- 8-processor CMP
- Private L1 for each processor (32 KByte)
- 16 MByte of shared L2
[Figure: the two simulated layouts - CIM (Cache In the Middle) and Nahalal]
Capacity split (16 MByte total in both):
- CIM: 2 MB near each processor (8 x 2 MB = 16 MB)
- Nahalal: 1.875 MB near each processor plus 1 MB in the middle (8 x 1.875 MB + 1 MB = 16 MB)
Slide 13
Cache Performance
Nahalal yields a 26.8% improvement in average cache hit time, up to 41.1% in apache.
[Figure: average cache hit time in clock cycles (0-50), CIM vs. NAHALAL, for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, and HashTable; per-benchmark improvements range from 3.9% to 41.1%]
Slide 14
Average Distance – Shared vs. Private
- Nahalal shortens the distance to shared data.
- The distance to private data remains roughly the same.
[Figure: average relative distance (0-3), shared vs. private, for the same nine benchmarks]
Slide 15
Putting Nahalal into Practice
- Line search: how to find a line within the cache (see the sketch after this list)
- Line migration: when and where to move a line between places in the cache
- Scalability: how far can we take the Nahalal structure
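For the line-search question, here is a hedged sketch of one plausible lookup order over the toy NahalalL2 model from the earlier sketch: probe the structures nearest the requester first and fall back to the remote yards last. This is an illustration of the problem, not necessarily the paper's exact search mechanism.

```python
def find_line(cpu, line, cache):
    """Hypothetical Nahalal line search over the toy NahalalL2 model.
    Returns where the line was found, or None on an L2 miss."""
    # 1. Requester's own private yard: the closest structure, probed first.
    if line in cache.yard[cpu]:
        return ("yard", cpu)
    # 2. Shared middle: close to every core and holds the shared hot lines.
    if line in cache.middle:
        return ("middle", None)
    # 3. Other cores' yards: the farthest, searched (or snooped) last.
    for other, lines in cache.yard.items():
        if other != cpu and line in lines:
            return ("yard", other)
    return None   # L2 miss: fetch from off-chip memory
```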
“The difference between theory and practice is always larger in practice than it is in theory” [Peter H. Salus]
Slide 16
Summary
- The weakness of state-of-the-art caches: remoteness of shared data.
- Software behavior: the shared-hot-lines effect.
- Shared data hinders cache performance.
- The Nahalal cache architecture places shared lines closer to all processors while preserving the vicinity of private data.
- A new architectural differentiation of cache lines: not all data should be treated equally; a data-usage-aware design.
Questions?
Slide 17
Backup
Slide 18
Scalability Issues
This has (also) been addressed before: a cluster of Garden Cities (Ebenezer Howard, 1902) parallels a clustered Nahalal CMP design.
[Figure: Howard's Garden City cluster diagram next to a clustered Nahalal CMP; the villages Nahalal and Kfar Yehoshua are labeled]