Part 13 Memory management, Many-Cores (CMP), and Crossbars


Part 13

Memory management, Many-Cores (CMP), and Crossbars

Computer Architecture

Slide Sets

WS 2012/2013

Prof. Dr. Uwe Brinkschulte
M.Sc. Benjamin Betting


Chip-Multiprocessors (CMP)/Multi-/Many-Cores

Possible Classification?


Processor Parameters (< 2005)


CMT configurations (< 2008)


Sun UltraSPARC T1 (Niagara-1)

General:

• Server Chip-Multiprocessor (CMP)

• Developed by Sun Microsystems (2005)

• Extended to Niagara-2 (2008)

Goal:

Designed for high throughput and excellent performance/Watt on server workloads

HSA:

• 8x scalar pipelined processing cores on the die (32-bit SPARC, 4-way MT)

• L2-Cache coupling (UMA, DDR2 controllers)


Niagara-1 Block Diagram


Niagara-1 DIE (90nm process)


Niagara-1 SPARC Core Pipeline

• six stages deep (shallow pipeline)

• low speculation (branch target buffer + precompute branch logic)

• single issue (IPC = 1.0)

• 4-way fine-grain multithreading (cycle-by-cycle interleaved + priority LRU)


Multithreading on Niagara-1

• Switching between available threads each cycle, with priority given to the least recently used thread (see the sketch below)

• Threads become unavailable because of long-latency instructions such as loads, branches, multiplies, and divides

• Threads also become unavailable because of pipeline "stalls" such as cache misses, traps, and resource conflicts

• Designed from the ground up as a 32-thread CMP
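
The thread-select rule named above (each cycle, issue from the available thread that was used least recently) can be pictured with a minimal sketch. The 4-thread count follows the slides; all type and function names below are invented for this illustration, not taken from Sun's design.

/* Minimal sketch of LRU thread selection among 4 hardware threads.
 * "available" is assumed to be maintained elsewhere: long-latency
 * instructions and pipeline stalls clear it, completion sets it again. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

typedef struct {
    bool available[NUM_THREADS];         /* ready to issue this cycle?  */
    int  last_issue_cycle[NUM_THREADS];  /* cycle of the last issue     */
} thread_select_state;

/* Returns the thread id to issue this cycle, or -1 if none is available. */
int select_thread(thread_select_state *s, int current_cycle)
{
    int chosen = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!s->available[t])
            continue;
        /* among the available threads, prefer the least recently issued one */
        if (chosen < 0 || s->last_issue_cycle[t] < s->last_issue_cycle[chosen])
            chosen = t;
    }
    if (chosen >= 0)
        s->last_issue_cycle[chosen] = current_cycle;
    return chosen;
}

int main(void)
{
    thread_select_state s = { { true, true, false, true }, { 5, 2, 0, 7 } };
    printf("issue thread %d\n", select_thread(&s, 10));  /* thread 1: LRU of the available ones */
    return 0;
}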


Niagara-1 SPARC Thread Scheduling

Figure: thread selection with all threads available


Memory Resources on Niagara-1

There are 5 core components to consider when describing the memory architecture of the Niagara-1 processor:

1. SPARC pipelines (cores)
2. L1-Caches
3. L2-Caches
4. DRAM controller
5. IO Devices (out of scope)

Hint: 1. and 2. also involve the on-chip interconnection network between the components, e.g., buses, crossbars etc.


L1-Caches

Each SPARC core contains its own L1-Caches, separate for instructions (L1-I) and data (L1-D) and shared between the 4 threads.

L1-I:

• 16 Kbyte, 4-way set-associative, block size of 32 bytes (line size)
• two instruction fetches each cycle (one speculative)

L1-D:

• 8 Kbyte, 4-way set-associative, block size of 16 bytes
• write-through policy and an 8-entry store buffer (execution past stores)

→ small L1-Caches: 3 clocks latency for a cache hit and miss rates in the range of 10% (see the address sketch below)
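
To make the L1-I parameters concrete, here is a minimal sketch of how a physical address splits into offset, set index, and tag for a 16 Kbyte, 4-way cache with 32-byte lines (16384 / (4 × 32) = 128 sets). The constants follow the slide; the struct and function names are invented for this illustration and are not taken from the SPARC documentation.

/* Address decomposition for a 16 KB, 4-way set-associative cache with
 * 32-byte lines: 128 sets, so 5 offset bits, 7 index bits, rest is tag. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE   32u
#define NUM_WAYS     4u
#define CACHE_SIZE  (16u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (NUM_WAYS * LINE_SIZE))   /* = 128 */

typedef struct {
    uint64_t tag;        /* identifies the line within its set   */
    uint32_t set_index;  /* which of the 128 sets                */
    uint32_t offset;     /* byte offset within the 32-byte line  */
} cache_addr;

cache_addr decompose(uint64_t paddr)
{
    cache_addr a;
    a.offset    = (uint32_t)(paddr % LINE_SIZE);
    a.set_index = (uint32_t)((paddr / LINE_SIZE) % NUM_SETS);
    a.tag       = paddr / ((uint64_t)LINE_SIZE * NUM_SETS);
    return a;
}

int main(void)
{
    cache_addr a = decompose(0x12345678u);
    printf("tag=0x%llx set=%u offset=%u\n",
           (unsigned long long)a.tag, a.set_index, a.offset);
    return 0;
}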


L1-Caches

Why choose small L1-Caches with only 4-way set-associativity?

"Commercial server applications tend to have large working sets, so the L1-Caches would have to be much larger to achieve significantly lower miss rates, but the Niagara designers observed that the incremental performance gained by larger caches did not merit the area increase. [...] In Niagara, the four threads of each core are very effective at hiding the latencies from L1 and L2 misses. [...] Therefore, the smaller Niagara level-one cache sizes are a good tradeoff between miss rates, area, and the ability of other threads in the processor core to hide latency."

(James Laudon, Sun Microsystems)


L2-Caches

The L2-Cache is a single on-chip cache, commonly shared for instructions and data, banked 4 ways and pipelined.

• 3 Mbytes total, 12-way set-associative, block size of 64 bytes
• Banked across 4 L2 banks, interleaved at 64-byte granularity
• Bank selection: physical address bits [7:6] (see the sketch below)
• 23 clocks latency for an L1-D cache miss, 22 clocks for L1-I
• Cache coherency: full MESI-based protocol between L1 and L2
• Line-replacement algorithm: some sort of LRU
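
A minimal sketch of the bank-selection rule from the list above: with 64-byte lines, address bits [5:0] are the line offset and bits [7:6] pick one of the four banks, so consecutive lines rotate through banks 0, 1, 2, 3. The function name is made up for this illustration.

/* Select the L2 bank from a physical address: bits [7:6] -> bank 0..3. */
#include <stdint.h>
#include <stdio.h>

static unsigned l2_bank(uint64_t paddr)
{
    return (unsigned)((paddr >> 6) & 0x3u);
}

int main(void)
{
    /* Consecutive 64-byte lines map to banks 0,1,2,3,0,...
     * ("interleaved at 64-byte granularity"). */
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)
        printf("address 0x%03llx -> bank %u\n",
               (unsigned long long)addr, l2_bank(addr));
    return 0;
}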


L2-Caches

Single shared L2-Cache:

Advantage: A single shared on-chip cache eliminates cache-coherence misses in the L2 and replaces them with low-latency shared communication between L1 and L2.

Disadvantage: It also implies a longer access time to the L2, because the cache cannot be located close to all of the processor cores on the chip. Furthermore, heavily used banks can become a bottleneck.


NxM Crossbar Interconnect

Purpose:

Niagara's crossbar interconnect provides and manages fast communication links between the processor cores, the L2-Cache banks, and other shared resources on the chip (e.g., FPU, IO bridge etc.)

Reminder: What is a crossbar?

• non-blocking NxM interconnection network
• N inputs, M outputs (individual switches at each cross point)
• high memory bandwidth, up to several GB/s (see the sketch after this list)
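
As a quick illustration of the non-blocking property, the following toy model (assumed structure and names, not Niagara's actual logic) can route any input to any output; a request only fails when two inputs claim the same output in the same cycle, which is exactly the conflict case the arbitration slides later deal with.

/* Toy model of a non-blocking NxM crossbar: one switch per (input, output)
 * cross point, so routing fails only on an output conflict, never because
 * of internal blocking. */
#include <stdbool.h>
#include <stdio.h>

#define N_INPUTS  8
#define M_OUTPUTS 5

typedef struct {
    bool output_busy[M_OUTPUTS];   /* per-cycle reservation of each output */
} crossbar;

/* Try to route input -> output for the current cycle. */
bool crossbar_route(crossbar *x, int input, int output)
{
    (void)input;                   /* inputs never conflict with each other */
    if (x->output_busy[output])
        return false;              /* output already claimed: needs arbitration */
    x->output_busy[output] = true;
    return true;
}

int main(void)
{
    crossbar x = { { false } };
    printf("core 0 -> bank 2: %s\n", crossbar_route(&x, 0, 2) ? "granted" : "conflict");
    printf("core 5 -> bank 2: %s\n", crossbar_route(&x, 5, 2) ? "granted" : "conflict");
    printf("core 3 -> bank 0: %s\n", crossbar_route(&x, 3, 0) ? "granted" : "conflict");
    return 0;
}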


NxM Crossbar Example

Niagara-1 CPU Cache Crossbar (CCX)

The CCX contains two main blocks (one for each direction):

• Processor-Cache Crossbar (PCX), 8x5, forward crossbar
• Cache-Processor Crossbar (CPX), 6x8, backward crossbar


Niagara-1 Processor-Cache Crossbar (PCX)

• Accepts packets from a source (any of the eight SPARC CPU cores) and delivers the packet to its destination (any one of the four L2-Cache banks, the I/O bridge, or the FPU)
• A source sends a packet and a destination ID to the PCX
• A packet is sent on a separate 124-bit wide parallel bus (40 bits address, 64 bits data, and the rest for control)
• The destination ID is sent on a separate 5-bit parallel bus
• Each source connects to the PCX with its own separate bus
• The PCX sends a grant to the source after dispatching a packet to its destination (handshake signal)
• When a destination reaches its limit, it sends a stall signal to the PCX (except the FPU)

→ 8 buses connect the CPUs to the PCX (a sketch of this handshake follows below)
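
The request/grant/stall handshake just listed can be summarised in a small sketch. The field widths follow the slide (124-bit packet, 5-bit destination ID); everything else, including the packing of the packet into C fields, the count of 5 destination ports, and all names, is an assumption of this illustration, not Sun's interface.

/* Simplified model of the PCX handshake: a source presents a packet plus a
 * destination ID; the PCX dispatches it unless the destination is stalling,
 * then returns a grant to that source. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PCX_SOURCES 8   /* SPARC cores                                   */
#define PCX_DESTS   5   /* destination ports of the 8x5 forward crossbar */

typedef struct {
    uint64_t addr;      /* 40 address bits of the 124-bit packet */
    uint64_t data;      /* 64 data bits                          */
    uint32_t control;   /* remaining control bits                */
    uint8_t  dest_id;   /* 5-bit destination ID, on its own bus  */
} pcx_packet;

typedef struct {
    bool stall[PCX_DESTS];    /* a destination at its limit raises stall */
    bool grant[PCX_SOURCES];  /* handshake back to the requesting source */
} pcx_state;

bool pcx_request(pcx_state *pcx, int source, const pcx_packet *pkt)
{
    unsigned dest = pkt->dest_id % PCX_DESTS;
    if (pcx->stall[dest])
        return false;            /* destination full: packet must wait */
    /* ...the packet would be forwarded to 'dest' here...               */
    pcx->grant[source] = true;   /* grant: source may reuse its bus     */
    return true;
}

int main(void)
{
    pcx_state pcx = {0};
    pcx_packet p = { .addr = 0x1234, .data = 0xdeadbeef, .control = 0, .dest_id = 2 };
    printf("dispatched=%d grant[0]=%d\n", pcx_request(&pcx, 0, &p), pcx.grant[0]);
    return 0;
}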


Niagara-1 PCX - Block Diagram


Niagara-1 PCX - Issues

Advantage: Non-blocking access, overall more than 200 Gbytes/s of bandwidth

Problem: Bus collisions may occur when multiple sources send a packet to the same destination

Solution: When multiple sources send a packet to the same destination, the PCX buffers each packet and arbitrates its delivery to the destination. The CCX does not modify or process any packet.

→ Extending the PCX with arbitration (one arbiter for each destination)


• 5 identical arbiters with 16-entry deep FIFO queues (max. 2 entries per source)
• up to 96 queued transactions (see the arbiter sketch below)
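
A minimal sketch of such a per-destination arbiter, under the parameters above (a 16-entry FIFO, at most 2 entries per source, 8 sources); the oldest-first dispatch order and all names are assumptions of this illustration rather than details taken from Sun's design.

/* One per-destination PCX arbiter: queue incoming requests in a FIFO and
 * dispatch them to the destination one at a time, granting the source whose
 * packet was delivered. */
#include <stdbool.h>
#include <stdio.h>

#define FIFO_DEPTH     16
#define MAX_PER_SOURCE  2
#define NUM_SOURCES     8

typedef struct {
    int source[FIFO_DEPTH];        /* which source queued each entry    */
    int head, count;               /* simple ring buffer                */
    int per_source[NUM_SOURCES];   /* entries currently held per source */
} pcx_arbiter;

/* Enqueue a request; refusing it here corresponds to stalling that source. */
bool arbiter_enqueue(pcx_arbiter *a, int src)
{
    if (a->count == FIFO_DEPTH || a->per_source[src] == MAX_PER_SOURCE)
        return false;
    a->source[(a->head + a->count) % FIFO_DEPTH] = src;
    a->count++;
    a->per_source[src]++;
    return true;
}

/* Dispatch the oldest queued packet; returns the granted source, or -1. */
int arbiter_dequeue(pcx_arbiter *a)
{
    if (a->count == 0)
        return -1;
    int src = a->source[a->head];
    a->head = (a->head + 1) % FIFO_DEPTH;
    a->count--;
    a->per_source[src]--;
    return src;
}

int main(void)
{
    pcx_arbiter arb = {0};
    arbiter_enqueue(&arb, 3);
    arbiter_enqueue(&arb, 1);
    int first  = arbiter_dequeue(&arb);
    int second = arbiter_dequeue(&arb);
    printf("granted source %d, then %d\n", first, second);
    return 0;
}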

Niagara-1 PCX - Arbiter Data Flow


Niagara-1 Cache-Processor Crossbar (CPX)

• Opposite data-transfer direction to the PCX (backward)
• 6 sources (L2 banks, FPU, and IO bridge) and 8 destinations (SPARC cores)
• A packet is sent on a separate 145-bit wide parallel bus (128 bits data, and the rest for control)
• The destination ID is sent on a separate 8-bit parallel bus
• The CPX sends a grant to the source after dispatching a packet to its destination
• Unlike the PCX, the CPX does not receive a stall from any of its destinations
• Contains 8 identical arbiters with 8 queues and a two-entry deep FIFO

→ 6 buses connect the sources to the CPX


Niagara-1 CPX - Block Diagram


Superscalar vs. CMP


IPC rates


CMP Throughput vs. Power

• simple in-order CMPs can achieve the same performance at a lower power level as an equivalent complex out-of-order CMP at high power

→ simple CMPs achieve better performance/Watt


Niagara-1 Heat Dissipation

