Utilizing Shared Data in Chip Multiprocessors
with the Nahalal Architecture
Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser
The Technion – Israel Institute of Technology
Slide 2
CMPs severely stress on-chip caches:
- Capacity
- Bandwidth
- Latency
Data sharing complicates our life:
- Contention on shared data
- Synchronization
Caches are a principal challenge in CMPs.
How to organize & handle data in CMP caches?
Slide 3
Outline:
- Caches in CMP
  - Cache-in-the-Middle layout
  - Application characterization
- Nahalal solution
  - Overview
  - Results
- Putting Nahalal into practice
  - Line search
  - Scalability
- Summary
Slide 4
Tackling Cache Latency via NUCA
Due to the growing wire delay, hit time depends on physical location [Agarwal et al., ISCA 2000]:
- Non-uniform access times: closer data => smaller hit time
- Aim for vicinity of reference: locate data lines closer to their clients
NUCA - Non-Uniform Cache Architecture [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04]
[Figure: a distributed L2 cache of 64 banks (0-63), surrounded by processors P0-P7]
Dynamic NUCA (DNUCA) migrates cache lines towards the processors that access them [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04].
Source: [Keckler et al., ISSCC 2003]
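To make the DNUCA mechanism concrete, below is a minimal sketch in Python with entirely hypothetical parameters (BASE_HIT, HOP_COST, the 1-D distance model): hit time grows with the distance between the requesting processor and the bank holding the line, and each hit migrates the line one bank closer to the requester. This illustrates the idea only; it is not the exact policy of the cited papers.

```python
# Toy DNUCA model (hypothetical, not the policy of Kim et al. or
# Beckmann & Wood): hit latency grows with the distance between the
# requesting CPU and the bank holding the line, and each hit pulls
# the line one bank closer to the requester.

BASE_HIT = 4   # assumed fixed bank access latency (cycles)
HOP_COST = 2   # assumed interconnect latency per hop (cycles)

class DNUCACache:
    def __init__(self, num_banks):
        self.num_banks = num_banks
        self.location = {}                     # line address -> bank index

    def distance(self, cpu, bank):
        # Toy 1-D distance; a real floorplan would use 2-D NoC hops.
        return abs(cpu - bank)

    def access(self, cpu, addr):
        # New lines start in a middle bank (arbitrary choice).
        bank = self.location.setdefault(addr, self.num_banks // 2)
        latency = BASE_HIT + HOP_COST * self.distance(cpu, bank)
        # Gradual migration: move the line one bank toward the requester.
        if bank < cpu:
            self.location[addr] = bank + 1
        elif bank > cpu:
            self.location[addr] = bank - 1
        return latency

cache = DNUCACache(num_banks=8)
print([cache.access(cpu=0, addr=0x40) for _ in range(4)])  # [12, 10, 8, 6]
```

Repeated hits shrink the latency as the line migrates toward CPU 0, which is exactly the vicinity-of-reference effect DNUCA aims for.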
Slide 5
Cache-In-the-Middle Layout (CIM)
Shared L2 cache:
- Higher capacity utilization
- Single copy => no inter-cache coherence
- Banked DNUCA, interconnected using a Network-on-Chip (NoC)
[Figure: CIM floorplan - L2 banks 0-7 in the middle of the die, with CPU0-CPU3 above and CPU4-CPU7 below; shown alongside the distributed L2 NUCA layout]
[Beckmann et al., MICRO'06], [Beckmann and Wood, MICRO'04]
Slide 6
Remoteness of Shared Data
Shared data inevitably resides far from (some of) its clients, resulting in long access times.
[Figure: in both the distributed L2 and CIM layouts, a shared line inevitably sits far from some of the processors that access it]
Slide 7
Observations on Memory Accesses
For many parallel applications (Splash-2, SpecOMP, Apache, SPECjbb, STM, ...):
1. Access to shared lines is substantial.
2. Shared lines are shared by many processors.
3. A small number of lines accounts for a large fraction of the total accesses.
In short: a small number of lines, shared by many processors, is accessed numerous times
⇒ the shared hot lines effect.
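As an illustration, the sketch below quantifies these observations from a memory-access trace; the trace format of (cpu, line_address) pairs and all names here are hypothetical, not tooling from the paper.

```python
# Hypothetical trace analysis illustrating the shared-hot-lines effect.
# A trace is a sequence of (cpu, line_address) pairs.
from collections import Counter, defaultdict

def shared_hot_lines(trace, top_n=10):
    accesses = Counter()            # line -> total access count
    sharers = defaultdict(set)      # line -> set of CPUs that touched it
    for cpu, line in trace:
        accesses[line] += 1
        sharers[line].add(cpu)

    shared = {l for l, s in sharers.items() if len(s) > 1}
    total = sum(accesses.values())
    # Observation 1: fraction of all accesses that go to shared lines.
    shared_frac = sum(accesses[l] for l in shared) / total
    # Observation 3: fraction of accesses absorbed by the top-N hot lines.
    hot_frac = sum(c for _, c in accesses.most_common(top_n)) / total
    return shared_frac, hot_frac

trace = [(0, 0x100), (1, 0x100), (2, 0x100), (0, 0x200), (3, 0x100)]
print(shared_hot_lines(trace, top_n=1))   # (0.8, 0.8): one hot shared line
```

On real traces of the workloads above, the point of the slide is that both fractions are high at once: the hot lines and the shared lines largely coincide.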
Slide 8
Shared Data Hinders Cache Performance
[Figure: in the CIM layout, shared lines end up far from some of CPU0-CPU7]
What can be done better?
- Bring shared data closer to all processors
- Preserve vicinity of private data
Slide 9
This Has Been Addressed Before
[Figure: aerial view of the Nahalal cooperative village, annotated with processors P0-P7 as an overview of the Nahalal cache organization]
Slide 10
Nahalal Layout
A new architectural differentiation of cache lines, according to the way the data is used: private vs. shared.
- A designated area for shared data lines in the center: a small & fast structure, close to all processors
- Outer rings used for private data: preserves the vicinity of private data
A more realistic layout:
[Figure: Nahalal floorplan - a shared bank in the center of the die, surrounded by CPU0-CPU7 and their private banks 0-7]
Slide 11
Nahalal Cache Management
Where does the data go?
- First access: the line goes to the private yard of the requester.
- Accesses by additional cores: the line goes to the middle.
- On eviction from an over-crowded middle, a line can go to any sharer's private yard.
In typical workloads, virtually all accesses to shared data are satisfied from the middle.
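A minimal sketch of the placement rules above, assuming toy structures for the private yards and the shared middle bank; capacity accounting, victim selection, and coherence bookkeeping are deliberately simplified and hypothetical, so this is an illustration of the policy rather than the full protocol.

```python
import random

class NahalalL2:
    """Toy model of Nahalal placement: first touch goes to the
    requester's private yard; a touch by a second core promotes the
    line to the shared middle; eviction from a full middle demotes
    the victim into one of its sharers' yards."""
    def __init__(self, num_cpus, middle_capacity):
        self.yard = {c: set() for c in range(num_cpus)}   # private yards
        self.middle = {}              # shared middle: line -> sharer CPUs
        self.middle_capacity = middle_capacity
        self.owner = {}               # line -> owning CPU while private

    def access(self, cpu, line):
        if line in self.middle:                  # shared: served from middle
            self.middle[line].add(cpu)
        elif line not in self.owner:             # first access: private yard
            self.owner[line] = cpu
            self.yard[cpu].add(line)
        elif self.owner[line] != cpu:            # second core: promote
            first = self.owner.pop(line)
            self.yard[first].discard(line)
            if len(self.middle) >= self.middle_capacity:
                self._evict_from_middle()
            self.middle[line] = {first, cpu}

    def _evict_from_middle(self):
        victim, sharers = self.middle.popitem()  # simplistic victim choice
        dest = random.choice(sorted(sharers))    # any sharer's yard will do
        self.yard[dest].add(victim)
        self.owner[victim] = dest

l2 = NahalalL2(num_cpus=8, middle_capacity=1024)
l2.access(0, 0x100)          # first touch: CPU0's private yard
l2.access(1, 0x100)          # second core: promoted to the middle
assert 0x100 in l2.middle
```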
Slide 12
Simulations
Full-system simulation via SIMICS:
- 8-processor CMP
- Private L1 for each processor (32 KByte)
- 16 MByte of shared L2
[Figure: the two simulated layouts - CIM (Cache In the Middle) and Nahalal]
Capacity split (16 MByte total in both):
- CIM: 2 MB near each processor (8 x 2 MB = 16 MB)
- Nahalal: 1.875 MB near each processor plus 1 MB in the middle (8 x 1.875 MB + 1 MB = 16 MB)
Slide 13
Cache Performance
Nahalal yields a 26.8% improvement in average cache hit time, up to 41.1% in apache.
[Figure: average cache hit time in clock cycles (0-50), CIM vs. NAHALAL, for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, and HashTable; per-benchmark improvements range from 3.9% to 41.1%]
Slide 14
Average Distance – Shared vs. Private
- Nahalal shortens the distance to shared data.
- The distance to private data remains roughly the same.
[Figure: average relative distance (0-3), shared vs. private, for the same nine benchmarks]
Slide 15
Putting Nahalal into Practice
- Line search: how to find a line within the cache (see the sketch after this list)
- Line migration: when and where to move a line between places in the cache
- Scalability: how far can we take the Nahalal structure
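For the line-search question, here is a hedged sketch of one plausible lookup order over the toy NahalalL2 model from the earlier sketch: probe the structures nearest the requester first and fall back to the remote yards last. This is an illustration of the problem, not necessarily the paper's exact search mechanism.

```python
def find_line(cpu, line, cache):
    """Hypothetical Nahalal line search over the toy NahalalL2 model.
    Returns where the line was found, or None on an L2 miss."""
    # 1. Requester's own private yard: the closest structure, probed first.
    if line in cache.yard[cpu]:
        return ("yard", cpu)
    # 2. Shared middle: close to every core and holds the shared hot lines.
    if line in cache.middle:
        return ("middle", None)
    # 3. Other cores' yards: the farthest, searched (or snooped) last.
    for other, lines in cache.yard.items():
        if other != cpu and line in lines:
            return ("yard", other)
    return None   # L2 miss: fetch from off-chip memory
```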
“The difference between theory and practice is always larger in practice than it is in theory” [Peter H. Salus]
Slide 16
Summary
- The weakness of state-of-the-art caches: remoteness of shared data.
- Software behavior: the shared-hot-lines effect.
- Shared data hinders cache performance.
- The Nahalal cache architecture places shared lines closer to all processors while preserving the vicinity of private data.
- A new architectural differentiation of cache lines: not all data should be treated equally; a data-usage-aware design.
Questions?
Slide 17
Backup
Slide 18
Scalability Issues
This has (also) been addressed before: a cluster of Garden Cities (Ebenezer Howard, 1902) parallels a clustered Nahalal CMP design.
[Figure: Howard's Garden City cluster diagram next to a clustered Nahalal CMP; the villages Nahalal and Kfar Yehoshua are labeled]