RAMP Gold Update Zhangxi Tan, Krste Asanovic, David Patterson UC Berkeley August, 2008.

RAMP Gold Update

Zhangxi Tan, Krste Asanovic, David Patterson
UC Berkeley

August, 2008

2

A functional model for RAMP Gold

- RAMP Gold: A RAMP emulation model for Parlab manycore
  - Single-socket tiled manycore target
  - SPARC v8 -> v9
  - Split functional/timing model, both in hardware
    - Functional model: Executes ISA
    - Timing model: Capture pipeline timing detail (can be cycle accurate)
  - Host multithreading of both functional and timing models
- Built on BEE3 system
  - Four Xilinx Virtex 5 LX110T
- Functional model implementation in this talk

[Figure: the Parlab manycore target is split into a Functional Model (pipeline, arch state) and a Timing Model (pipeline, timing state)]

3

A RAMP Emulator

- “RAMP blue” as a proof of concept
  - 1,008 32-bit RISC cores on 105 FPGAs of 21 BEE2 boards
- A bigger “RAMP blue” with more FPGAs for Parlab?
  - Less interesting ISA
  - High-end FPGAs cost thousands of dollars
  - CPU cores (@90 MHz) are even slower than memory!
    - Waste memory bandwidth
  - High CPI, low pipeline utilization
  - Poor emulation performance/FPGA
- Need a high-density and more efficient design

4

RAMP Gold Implementation

- Goal:
  - Performance: maximize aggregate emulated instruction throughput (GIPS/FPGA)
  - Scalability: scale with reasonable resource consumption
- Design for FPGA fabric
  - SRAM nature of FPGA: RAMs are cheap!
    - Efficient for state storage, but expensive for logic (e.g. multiplexer)
  - Need to reconsider some traditional RISC optimizations
    - Bypassing network is against “smaller, faster” on FPGAs
    - ~28% LUT reduction, ~18% frequency improvement on SPARC v8 implementation, w/o result forwarding
  - DSPs are perfect for ALU implementation
  - Circuit performance limited by routing
    - Longer pipeline
    - Carefully mapped FPGA primitives
- Emulation latencies: e.g. across FPGAs, memory network
- “High” frequency (targeting 150 MHz)

5

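Host multithreading

- Single hardware pipeline with multiple copies of CPU state
- Fine-grained multithreading
- Not multithreading target

[Figure: a target model of 64 CPUs (CPU1, CPU2, ... CPU63, CPU64) mapped onto one functional-model pipeline on the FPGA. A 6-bit thread select chooses among per-thread PC and GPR copies; the shared datapath is +1 PC increment, I$, IR, decode (DE), X/Y operands, ALU, D$]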
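A minimal software sketch of the idea on this slide: one shared execution loop steps many target CPUs by keeping a separate copy of architectural state per thread and rotating a thread selector. The class names, the toy add-immediate instruction format, and the 32-register context are assumptions made for illustration; they are not taken from the RAMP Gold RTL.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Architectural state of one target CPU (illustrative subset)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)


class HostMultithreadedPipeline:
    """One shared 'pipeline' that steps many target CPUs in turn."""

    def __init__(self, num_threads, program):
        self.contexts = [ThreadContext() for _ in range(num_threads)]
        self.program = program    # toy ops: (dst, src, imm) means dst = src + imm
        self.next_thread = 0      # round-robin thread-select counter

    def step(self):
        """Execute one instruction for the selected target CPU; return its id."""
        tid = self.next_thread
        ctx = self.contexts[tid]
        if ctx.pc < len(self.program):
            dst, src, imm = self.program[ctx.pc]
            ctx.regs[dst] = (ctx.regs[src] + imm) & 0xFFFFFFFF
            ctx.pc += 1
        # A context switch only advances the selector; no state is copied.
        self.next_thread = (tid + 1) % len(self.contexts)
        return tid


pipe = HostMultithreadedPipeline(num_threads=64, program=[(1, 0, 5), (2, 1, 7)])
order = [pipe.step() for _ in range(128)]   # two full round-robin passes
```

After the two passes every one of the 64 contexts has run both instructions independently, even though only one execution loop exists.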

6

Pipeline Architecture

[Figure: pipeline block diagram, grouped by logical stage]
- Thread Selection: static thread selection (round robin); special registers (pc/npc, wim, psr, thread control registers)
- Instruction Fetch 1 (issue address request) and Instruction Fetch 2 (compare tag): I-Cache (nine 18kb BRAMs), tag/data read request, tag compare result, microcode ROM; a 32-bit instruction or a synthesized micro instruction is passed on
- Decode: resolve branch, decode register file address
- Register File Access 1 & 2*, Register File Access 3: 32-bit multithreaded register file (four 36kb BRAMs), 2-cycle pipelined read; decode ALU control / exception detection; operands imm, pc, OP1, OP2
- Execution: simple ALU (1 DSP) / LDST decoding; MUL/DIV/SHF (4 DSPs); special register handling (RDPSR/RDWIM); unaligned address detection / store preparation
- Memory 1 and Memory 2: issue load (issue address); D-Cache (nine 18kb BRAMs), tag/data read request, tag / 128-bit data read & select; mem request under cache miss; 128-bit read & modify data; 128-bit memory interface (I and D side)
- Write Back/Exception: load align / write back; trap/IRQ handling; generate microcode request
- Legend: LUT RAM (clk x2), LUT ROM, BRAM (clk x2), DSP (clk x2)

- Single-issue in-order pipeline (integer only)
  - 11 pipeline stages (no forwarding) -> 7 logical stages
- Static thread scheduling, zero-overhead context switch
- Avoid complex operations with “microcode”
  - E.g. traps, ST
- Physical implementation
  - All BRAM/LUTRAM/DSP blocks in double-clocked or DDR mode
  - Extra pipeline stages for routing
  - ECC/parity protected BRAMs
    - Deep submicron effect on FPGAs
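The slide pairs static round-robin scheduling with an 11-stage pipeline that has no forwarding. A common argument for why this works in fine-grained multithreaded designs is that a thread's previous instruction has already left the pipeline before its next one issues, provided there are at least as many threads as stages. The slides do not spell this reasoning out, so the check below is a sketch under that assumption; the function names are hypothetical.

```python
def issue_schedule(num_threads, cycles):
    """Thread id issued on each cycle under static round-robin selection."""
    return [cycle % num_threads for cycle in range(cycles)]


def max_in_flight_per_thread(num_threads, pipeline_depth, cycles=1000):
    """Largest number of instructions one thread has in the pipeline at once."""
    schedule = issue_schedule(num_threads, cycles)
    worst = 0
    for start in range(cycles - pipeline_depth + 1):
        # The instructions in flight are those issued in the last `depth` cycles.
        window = schedule[start:start + pipeline_depth]
        worst = max(worst, max(window.count(t) for t in set(window)))
    return worst


full_config = max_in_flight_per_thread(num_threads=64, pipeline_depth=11)
too_few_threads = max_in_flight_per_thread(num_threads=4, pipeline_depth=11)
```

With 64 threads no thread ever has two instructions in flight, so no result needs to be bypassed between them. With only 4 threads the same pipeline would hold several instructions from one thread and would need interlocks or forwarding.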

7

Implementation Challenges

- CPU state storage
  - Where? How large? Does it fit on FPGA?
- Minimize FPGA resource consumption
  - E.g. mapping ALU to DSPs
- Host cache & TLB
  - Need cache?
  - Architecture and capacity
  - Bandwidth requirement and R/W access ports
    - Host multithreading amplifies the requirement

8

State storage

- Complete 32-bit SPARC v8 ISA w. traps/exceptions
- All CPU states (integer only) are stored in SRAMs on FPGA
  - Per context register file -- BRAM
    - 3 register windows stored in BRAM chunks of 64
    - 8 (global) + 3*16 (reg window) = 56
  - 6 special registers
    - pc/npc -- LUTRAM
    - PSR (Processor State Register) -- LUTRAM
    - WIM (Window Invalid Mask) -- LUTRAM
    - Y (High 32-bit result for MUL/DIV) -- LUTRAM
    - TBR (Trap Base Register) -- BRAM (packed with regfile)
  - Buffers for host multithreading (LUTRAM)
- Maximum 64 threads per pipeline on Xilinx Virtex5
  - Bounded by LUTRAM depth (6-input LUTs)
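To make the register file arithmetic concrete, here is one possible way to map a (thread, current window, register number) triple onto a per-thread chunk of 64 BRAM words. SPARC v8 defines r0-r7 as globals, r8-r15 as outs, r16-r23 as locals and r24-r31 as ins, with the ins of window w overlapping the outs of window w+1. The exact layout inside the chunk is an assumption for illustration, not the layout used in the RAMP Gold RTL.

```python
NWINDOWS = 3            # register windows per thread (from the slide)
CHUNK = 64              # BRAM words reserved per thread (from the slide)
GLOBALS = 8
REGS_PER_WINDOW = 16    # 8 outs + 8 locals; ins alias another window's outs


def regfile_address(thread_id, cwp, reg):
    """Map (thread, current window pointer, r0..r31) to a BRAM word address."""
    if not (0 <= reg < 32 and 0 <= cwp < NWINDOWS):
        raise ValueError("register or window out of range")
    if reg < 8:         # globals are shared by all windows of a thread
        offset = reg
    elif reg < 24:      # outs (r8-r15) and locals (r16-r23) of this window
        offset = GLOBALS + cwp * REGS_PER_WINDOW + (reg - 8)
    else:               # ins (r24-r31) are the outs of window cwp + 1
        offset = GLOBALS + ((cwp + 1) % NWINDOWS) * REGS_PER_WINDOW + (reg - 24)
    return thread_id * CHUNK + offset


used_words = {regfile_address(0, w, r) for w in range(NWINDOWS) for r in range(32)}
```

Under this layout each thread touches 56 distinct words, which fits in its 64-word chunk and leaves a few words spare.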

9

Mapping SPARC ALU to DSP

- Xilinx DSP48E advantage
  - 48-bit add/sub/logic/mux + pattern detector
  - Easy to generate ALU flags: < 10 LUTs for C, O
  - Pipelined access over 500 MHz

10

DSP advantage

- Instruction coverage (two DSPs / pipeline)
  - 1 cycle ALU (1 DSP)
    - LD/ST (address calculation)
    - Bit-wise logic (and, or, …)
    - SETHI (value bypass)
    - JMPL, RETT, CALL (address calculation)
    - SAVE/RESTORE (add/sub)
    - WRPSR, RDPSR, RDWIM (XOR op)
  - Long latency ALU instructions (1 DSP)
    - Shift/MUL (2 cycles)
- 5%~10% logic savings for 32-bit data path
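The slides note that the carry and overflow flags ("C, O") cost only a few LUTs on top of the DSP adder. As a software reference for what those flags must compute, the sketch below models the SPARC v8 integer condition codes (N, Z, V, C) for 32-bit add and subtract. It assumes "O" on the slide means the overflow flag that SPARC calls V. It is a behavioural model only and says nothing about how the DSP48E is actually wired.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """32-bit add returning (result, flags) with SPARC-style N, Z, V, C."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    a31, b31, r31 = (a >> 31) & 1, (b >> 31) & 1, (result >> 31) & 1
    flags = {
        "N": r31,
        "Z": int(result == 0),
        "V": int(a31 == b31 and r31 != a31),   # same-sign operands, sign flipped
        "C": (full >> 32) & 1,                 # carry out of bit 31
    }
    return result, flags


def subcc(a, b):
    """32-bit subtract returning (result, flags); C is set on borrow."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    a31, b31, r31 = (a >> 31) & 1, (b >> 31) & 1, (result >> 31) & 1
    flags = {
        "N": r31,
        "Z": int(result == 0),
        "V": int(a31 != b31 and r31 != a31),   # different signs, sign flipped
        "C": int(a < b),                       # unsigned borrow
    }
    return result, flags
```

For example, `addcc(0x7FFFFFFF, 1)` overflows into the sign bit and sets N and V, while `subcc(0, 1)` borrows and sets N and C.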

11

Host Cache/TLB

- Accelerating emulation performance!
  - Need separate model for target cache
- Per thread cache
  - Split I/D direct-mapped write-allocate write-back cache
    - Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    - 64-thread configuration: 256B I$, 256B D$
    - Size doubled in 32-thread configuration
  - Non-blocking cache, 64 outstanding requests (max)
  - Physical tags, indexed by virtual or physical address
    - $ size < page size
  - 67% BRAM usage
- Per thread TLB
  - Split I/D direct-mapped TLB: 8 entries ITLB, 8 entries DTLB
  - Dummy currently (VA = PA)
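The per-thread cache geometry above can be worked through as an address split. With 32-byte blocks and 256 bytes per thread, each thread owns 8 lines, so an address divides into a 5-bit block offset, a 3-bit index and a tag. The way the thread id is combined with the index to address a shared array is an assumption for illustration; the slides do not give the actual indexing scheme.

```python
BLOCK_BYTES = 32                  # from the slide
CACHE_BYTES_PER_THREAD = 256      # 64-thread configuration, from the slide
LINES_PER_THREAD = CACHE_BYTES_PER_THREAD // BLOCK_BYTES    # 8 lines
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                  # 5 bits
INDEX_BITS = LINES_PER_THREAD.bit_length() - 1              # 3 bits


def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (LINES_PER_THREAD - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def shared_array_line(thread_id, addr):
    """Assumed layout: each thread owns a contiguous group of lines."""
    _, index, _ = split_address(addr)
    return thread_id * LINES_PER_THREAD + index


total_lines = 64 * LINES_PER_THREAD
```

Under this layout 64 threads need 512 lines in total, which is consistent with the 512-entry tag and data arrays shown on the Cache-Memory Architecture slide. Because index plus offset cover only 8 address bits, well below any page size, the index is the same for virtual and physical addresses.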

12

Cache-Memory Architecture

- Cache controller
  - Non-blocking pipelined access (3 stages) matches CPU pipeline
  - Decoupled access/refill: allow pipelined, OOO mem accesses
  - Tell the pipeline to “replay” inst. on miss
  - 128-bit refill/write back data path
    - Fill one block in 2 cycles

[Figure: cache-memory block diagram]
- Storage: tag array (parity), 512 x 36, in one RAMB18SDP; data array (ECC), 512 x 72 x 4, in four RAMB36SDP (x72)
- Integer pipeline: Memory Stage (1) prepares the LD/ST address; Memory Stage (2) does load select/routing, runs the cache FSM (hit, exception, etc.) and read & modify; the Exception/Write Back Stage does load align/sign and pipeline state control, and decides on replay
- Pipeline to cache: lookup index, 64-bit data + tag, write back
- Cache to memory controller: memory command FIFO carrying mem ops, memory request address and victim data; refill index and 128-bit refill / write back data

13

Example: A distributed memory non-cache coherent system

- Eight multithreaded SPARC v8 pipelines in two clusters
  - Each thread emulates one independent node in target system
  - 512 nodes/FPGA
  - Predicted emulation performance:
    - ~1 GIPS/FPGA (10% I$ miss, 30% D$ miss, 30% LD/ST)
    - x2 compared to naïve manycore implementation
- Memory subsystem
  - Total memory capacity 16 GB, 32MB/node (512 nodes)
  - One DDR2 memory controller per cluster
  - Per FPGA bandwidth: 7.2 GB/s
  - Memory space is partitioned to emulate distributed memory system
  - 144-bit wide credit-based memory network
- Inter-node communication (under development)
  - Two-level tree network to provide all-to-all communication
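The figures on this slide can be cross-checked with a little arithmetic. The node count and memory per node follow directly from the slide's numbers. The throughput estimate is only a back-of-the-envelope model and not the authors' stated method: it assumes the 150 MHz target clock, one issue slot per pipeline per cycle, miss rates counted per access, and exactly one replayed issue slot per cache miss.

```python
PIPELINES = 8
THREADS_PER_PIPELINE = 64
CLOCK_HZ = 150e6                    # target frequency from the goals slide
TOTAL_MEMORY_BYTES = 16 * 2**30     # 16 GB

I_MISS = 0.10                       # I$ miss rate per instruction fetch
D_MISS = 0.30                       # D$ miss rate per load/store
LDST_FRACTION = 0.30                # fraction of instructions that are LD/ST

nodes = PIPELINES * THREADS_PER_PIPELINE
memory_per_node_mib = TOTAL_MEMORY_BYTES // nodes // 2**20


def estimated_gips(replay_slots_per_miss=1):
    """Aggregate throughput if every miss wastes a fixed number of issue slots."""
    peak_ips = PIPELINES * CLOCK_HZ
    slots_per_instruction = 1 + replay_slots_per_miss * (
        I_MISS + LDST_FRACTION * D_MISS
    )
    return peak_ips / slots_per_instruction / 1e9


estimate = estimated_gips()
```

The 512 nodes and 32 MB per node fall out exactly. Under the one-replay-per-miss assumption the estimate lands close to the ~1 GIPS/FPGA quoted on the slide, against a peak of 1.2 GIPS with no misses.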

14

Project Status

- Done with RTL implementation
  - ~7,200 lines of synthesizable SystemVerilog code
  - FPGA resource utilization per pipeline on Xilinx V5 LX110T
    - ~3% logic (LUT), ~10% BRAM
    - Max 10 pipelines, but back off to 8 or less for timing model
- Built RTL verification infrastructure
  - SPARC v8 certification test suite (donated by SPARC International) + SystemVerilog
  - Can be used to run more programs but very slow (~0.3 KIPS)

15

RTL Verification Flow in SW

[Figure: verification flow]
- Inputs: SPARC V8 Verification Suite (.S or .C), customized linker script (.lds)
- GNU SPARC v8 compiler/linker (sparc-linux-gcc, sparc-linux-as, sparc-linux-ld) produces ELF big-endian binaries
- ELF to BRAM translator
- Modelsim SE/Questasim 6.4 simulates the RAMP Gold SystemVerilog source files / netlist (.sv, .v) with the Xilinx Unisim library
- SPARC v8 disassembler (host binary): host disassembler C implementation using the GNU libbfd library (from GNU binutils), connected through the SystemVerilog DPI interface
- Simulation log / console output goes to the checker
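The flow includes an "ELF to BRAM translator" but the slides do not describe its output format. As a rough sketch of the final step such a tool might perform, the function below packs a raw big-endian byte image into 32-bit words and emits them in the text format read by the standard Verilog `$readmemh` system task. The function name and the choice of `$readmemh` as the target are assumptions, and the ELF parsing itself is left out.

```python
def bytes_to_readmemh(image, base_word_address=0):
    """Render a big-endian byte image as $readmemh text, one 32-bit word per line."""
    # Pad with zero bytes so the image is a whole number of words.
    padded = image + b"\x00" * (-len(image) % 4)
    lines = [f"@{base_word_address:08x}"]          # start address, in words
    for i in range(0, len(padded), 4):
        word = int.from_bytes(padded[i:i + 4], "big")
        lines.append(f"{word:08x}")
    return "\n".join(lines)


# Two example instruction words; 0x01000000 is the SPARC nop encoding.
text = bytes_to_readmemh(bytes.fromhex("0100000081c3e008"))
```

Keeping the words big-endian matches the ELF binaries in the flow, so a simulated BRAM loaded this way sees instruction words in the same byte order the SPARC toolchain produced.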

16

Verification in progress

- Tested instructions
  - All SPARC v7 ALU instructions: add/sub, logic, shift
  - All integer branch instructions
  - All special instructions: register window, system registers
- Working on: LD/ST and Trap
- More verification after P&R and on HW
  - Work with the rest of the RAMP Gold infrastructure
- Lessons so far
  - Infrastructure is not trivial, and very few sample designs are available (have to build our own!)
  - Multithreaded state complicates the verification process!
    - Buffers and shared FU interfaces

17

Thank you

