ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs


Shinya Takamaeda-Yamazaki†‡, Shintaro Sano†, Yoshito Sakaguchi†, Naoki Fujieda†, Kenji Kise†

†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow, Japan

10:00–10:25 March 23, 2012 ARC 2012 @Hong Kong

ScalableCore System 3.3

■ Tile architecture simulator built from multiple FPGAs
  ● Achieves SCALABLE simulation speed

Shinya Takamaeda-Y. Tokyo Tech 2

[Figure: a target node (Core, DMAC, Local Memory, Router) with DRAM Controllers; each FPGA holds a Target Core plus System Functions]

ARC 2012 @HongKong

Contents

■ Background and Motivation
■ Proposal: ScalableCore System
■ Detailed Architecture
■ Evaluation
■ Case Study: Task Allocation on Many-core
■ Conclusion


Background: Many-core Era

Intel Single Chip Cloud Computer 48 cores (x86)

TILERA TILE-Gx100 100 cores (MIPS)


Simulation Target Many-core: M-Core [11]

■ Simple tile architecture with a 2D mesh network
  ● Like the Cell/B.E., a Node has no caches, only a local memory
  ● Parallel programs communicate by DMA transfers among the Nodes

[Figure: an M-Core Node (Core, DMAC, Local Memory, Router) in the mesh, with DRAM Controllers]

[11] Uehara, K. et al. A Study of an Infrastructure for Research and Development of Many-Core Processors, UPDAS-2010

Simulation on a SW Simulator Takes a Lot of Time!

■ The simulation speed of a SW simulator drops as the number of target cores increases
  ● First, a SW simulator is very slow! (a slowdown of 1000x or more)
  ● And achieving scalable speed is DIFFICULT!

Simulation speed of SimMc (M-Core simulator) on Core i7 870, 4GB memory, gcc 4.5.2 (-O3)

Simulation frequency [KHz]
# Node        16     36     64     100
SimMc (MM)    89.1   28.3   14.0   8.8
SimMc (NQ)    90.4   28.4   14.1   8.9
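The chart values can be turned into an aggregate work rate to see why the software simulator does not scale. The sketch below multiplies each SimMc (MM) frequency by the node count; the variable names are illustrative and not taken from SimMc.

```python
# SimMc (MM) simulation frequency in KHz, as read from the chart.
nodes = [16, 36, 64, 100]
simmc_mm_khz = [89.1, 28.3, 14.0, 8.8]

# Aggregate work rate: thousands of simulated node-cycles per second.
node_kcycles_per_sec = [round(n * f, 1) for n, f in zip(nodes, simmc_mm_khz)]

# How much slower the 100-node run is than the 16-node run.
slowdown_16_to_100 = simmc_mm_khz[0] / simmc_mm_khz[-1]

for n, work in zip(nodes, node_kcycles_per_sec):
    print(f"{n:3d} nodes: {work:7.1f} K node-cycles/s")
print(f"16 -> 100 nodes: {slowdown_16_to_100:.1f}x lower frequency")
```

The total work rate does not grow with the node count (it even shrinks somewhat), so the simulated frequency falls roughly in proportion to the number of nodes.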

Motivation

■ Accelerate many-core simulations for efficient research
  ● Ex) Task allocation on many-core processors
■ SCALABLE simulation speed even for a large core count
■ How to scale the simulation speed?
  ● In this study, our target architecture is M-Core
    • Tile architecture with 2D mesh network
  ● Map the target processor onto multiple FPGAs

[Figure: the many-core processor is partitioned and mapped onto FPGAs]



Our Solution: ScalableCore System

■ Multiple FPGA units together compose the whole target processor

[Figure: mapping the target many-core to multiple FPGAs. The drawing shows 16 ScalableCore Units (FPGA + SRAM, one per processor core) and 4 Memory Units (FPGA + DRAM, off-chip memory behind the DRAM Controller), a DC 5V power supply, and a USB-Serial link to the Host. Each ScalableCore Unit holds the Target Core (Core, DMAC, Local Memory, Router) plus System Functions.]

ScalableCore System 3.3 for 100-Nodes


[Figure: the 100-Node system, 46.7cm x 60.0cm]

■ Memory Unit (for DRAM Controller): FPGA+DRAM board
■ ScalableCore Unit (for Processor Core): FPGA+SRAM board
  ● Holds Core, DMAC, Local Memory, Router (R), and System Functions

Our Original FPGA Boards

■ We developed them from scratch!
■ ScalableCore Unit: FPGA+SRAM board (4.67cm x 6.0cm)
  ● Xilinx Spartan-6 XC6SLX16
  ● 512KB SRAM (8-bit, 1-port read/write)
  ● Configuration ROM
■ Memory Unit: FPGA+DRAM board (4.67cm x 6.0cm)
  ● Xilinx Spartan-6 XC6SLX16
  ● 16MB DRAM
  ● Configuration ROM

(ASIDE) ScalableCore System 1.1 [9]

■ Previous system, up to 64 (8x8) Nodes

[Figure: ScalableCore Unit (v1.1) and ScalableCore Board (connecting the Units)]

[9] Takamaeda-Y. S. et al. An FPGA-based Scalable Simulation Accelerator for Tile Architectures, ACM CAN-39 (2011)

It’s Scalable!

■ 1 (1x1) ScalableCore Unit

[Figure: ScalableCore Unit, Memory Unit, and Power Supply Unit with USB-Serial IC to the Host PC]

It’s Scalable!!

■ 16 (4x4) ScalableCore Units

It’s Scalable!!!

■ 64 (8x8) ScalableCore Units

It’s Scalable!!!!

■ 128 (16x8) ScalableCore Units



Logic Hierarchy of ScalableCore Unit

[Figure: logic hierarchy, with the following labeled blocks]

■ Target Core (Node of M-Core): Core, DMAC, Local Memory (Interface), Router
■ System Functions: Ser/Des, Memory Multiplexer, Device Controller, State Machine Controller
■ Interface Register


ScalableCore Unit Architecture

[Figure: block diagram of the ScalableCore Unit FPGA (Spartan-6) and its off-chip devices. Blocks labeled in the figure:]

■ Core: Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit
■ DMAC: DMA Generator/Receiver, DMA Register
■ Local Memory: Memory Multiplexer, Memory Controller, SRAM Controller
■ Router: Arbiter, XBAR
■ Four Ser/Des links to/from adjacent Units, with Interface Registers (IR)
■ RS232C Controller, through RS232C-USB to the Host PC (USB)
■ Off-chip devices: SRAM, Clock, Reset, Configuration ROM (XCF04S) with JTAG port

Memory Unit Architecture

■ DRAM instead of the SRAM in a ScalableCore Unit
  ● 16MB DRAM on board
■ DRAM Emulator instead of Core/Router

[Figure: block diagram of the Memory Unit FPGA (Spartan-6). Blocks labeled in the figure: DMAC (DMA Generator/Receiver, DMA Register), DRAM Emulator with DRAM Timing Model, Memory Multiplexer, Memory Controller, Off-chip DRAM Controller, Interface Registers (IR), and one Ser/Des link. Off-chip devices: DRAM, Clock, Reset, Configuration ROM (XCF04S) with JTAG port.]

Local Barrier Synchronization

■ Handshaking with only the 4 neighbor FPGAs
  ● The handshaking overhead is constant: it does NOT grow as the number of target cores increases
  ● Achieves scalable simulation speed

[Figure: timeline of Cycle 1 and Cycle 2 for Unit 4 and its four neighbors, Units 0 to 3. In each cycle the unit sends to Units 0, 1, 2, and 3 and receives from its neighbors before the next cycle starts.]
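A minimal behavioral sketch of the idea, assuming each unit may start its next target cycle only after every mesh neighbor has delivered the data of the current cycle. The random scheduling and all function names are assumptions made for illustration; this is not the system's actual handshake protocol.

```python
import random

def neighbors(x, y, width, height):
    """Yield the mesh neighbors (at most 4) of unit (x, y)."""
    for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny

def simulate(width, height, target_cycles, seed=0):
    """Advance every unit to target_cycles using only neighbor handshakes.

    Returns (largest cycle skew seen between adjacent units,
             largest number of handshakes one unit performs per cycle).
    """
    rng = random.Random(seed)
    cycle = {(x, y): 0 for x in range(width) for y in range(height)}
    units = list(cycle)
    max_skew = 0
    max_handshakes = 0
    while any(c < target_cycles for c in cycle.values()):
        u = rng.choice(units)  # assumption: units run at arbitrary relative speed
        if cycle[u] >= target_cycles:
            continue
        nbrs = list(neighbors(*u, width, height))
        # Local barrier: proceed only when no neighbor is behind this unit.
        if all(cycle[n] >= cycle[u] for n in nbrs):
            cycle[u] += 1
            max_handshakes = max(max_handshakes, len(nbrs))
            for n in nbrs:
                max_skew = max(max_skew, abs(cycle[u] - cycle[n]))
    return max_skew, max_handshakes

print(simulate(4, 4, 20))    # 16 units
print(simulate(10, 10, 10))  # 100 units
```

In this model adjacent units never drift more than one cycle apart, and the per-cycle handshake count is bounded by 4 whether the mesh has 16 or 100 units, which is the property the slide attributes to local barriers.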


Virtual Cycle

■ Multiple FPGA clock cycles make up 1 target clock cycle
  ● Virtual hardware built from simple FPGA equipment

[Figure: timeline of one virtual cycle (Virtual Cycle N, Virtual Cycle N+1, …). Within 1 virtual cycle time, the unit does the following:]

  ● Drives the circuit of the target components (Core, DMAC, Router), proceeding the target circuit state
  ● Processes the memory accesses via the Memory Multiplexer, interleaving DMAC Read, Core (IF), DMAC Write, and Core (L/S)
  ● Sends and receives the synchronized data via Serial I/O in all four directions (North, East, West, South), from "Start sending" to "Finish synchronization"
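The length of a virtual cycle is not stated on this slide, but an average can be derived from two figures given in the evaluation: the 40MHz FPGA clock and the 1142 KHz simulation frequency. The helper name below is illustrative.

```python
def fpga_cycles_per_virtual_cycle(fpga_clock_hz, simulated_hz):
    """Average number of FPGA clock cycles spent per simulated target cycle."""
    return fpga_clock_hz / simulated_hz

# 40 MHz FPGA clock, 1142 KHz simulated target frequency (evaluation slides).
budget = fpga_cycles_per_virtual_cycle(40e6, 1142e3)
print(f"{budget:.1f} FPGA cycles per virtual cycle")
```

Under these numbers, one target cycle costs about 35 FPGA cycles on average, covering the target update, the interleaved memory accesses, and the serial exchange with the neighbors.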

Target Description in Verilog HDL

■ “EN” signal to update all flip-flops in the target
  ● Driven by the outer state machine controller for every virtual cycle
  ● Cleanly separates TARGET and SYSTEM (with interface registers)

    always @(posedge CLK or negedge RST_X) begin
      if (!RST_X) begin
        // Asynchronous active-low reset
        if_id_invalid <= 1;
        if_id_pc      <= 0;
      end else if (EN) begin
        // Target state advances only when EN is asserted
        if (!if_id_stall) begin
          if_id_invalid <= if_id_flush;
          if_id_pc      <= icache_addr;
        end
      end
    end

When (EN == 1), update all flip-flops
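The effect of the EN gate can be sketched in software. The model below mirrors the IF/ID register from the Verilog fragment and pairs it with a hypothetical controller that asserts EN on the last FPGA cycle of each virtual cycle; the controller timing and the 35-cycle period are assumptions for illustration only.

```python
class IfIdRegister:
    """Python stand-in for the IF/ID pipeline register on the slide."""

    def __init__(self):
        # Reset values, as in the Verilog reset branch.
        self.invalid = 1
        self.pc = 0

    def clock(self, en, stall, flush, icache_addr):
        """One FPGA clock edge; state changes only when en is asserted."""
        if en and not stall:
            self.invalid = flush
            self.pc = icache_addr

def run(fpga_cycles, cycles_per_virtual_cycle):
    """Clock the register for fpga_cycles; return (final pc, virtual cycles)."""
    reg = IfIdRegister()
    virtual_cycles = 0
    for t in range(fpga_cycles):
        # Hypothetical controller: EN pulses once per virtual cycle.
        en = (t % cycles_per_virtual_cycle) == cycles_per_virtual_cycle - 1
        reg.clock(en, stall=False, flush=0, icache_addr=reg.pc + 4)
        virtual_cycles += int(en)
    return reg.pc, virtual_cycles

print(run(70, 35))  # two virtual cycles elapse in 70 FPGA cycles
```

Between EN pulses the register holds its value, so the target sees one clean clock edge per virtual cycle no matter how many FPGA cycles the system functions need.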


Evaluation

■ Resource Usage (for each ScalableCore Unit)
  ● Floorplan of FPGA
  ● LUT/Reg/BRAM/DSP usage of each FPGA
■ Simulation Speed (vs. software-based simulator)
  ● Frequency [KHz]: # simulated cycles per second
  ● # Node of target: 16 to 100


Node Micro Architecture of Target

■ Core
  ● MIPS32 ISA, 5-stage, single-issue, in-order
    • No FPU support (Future Work)
  ● 2 memory ports (Inst, Load/Store)
■ DMA Controller
  ● 2 memory ports (32-bit DMA Read, 32-bit DMA Write)
■ Router
  ● 5 I/O ports, 4-stage (NRC/VA, SA, ST, LT)
  ● 2 virtual channels, FIFO size=4, credit-based flow control
■ Local Memory
  ● Access latency=1, 512KB, 32-bit
  ● 4 memory ports (Inst, Load/Store, DMA Read, DMA Write)
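Credit-based flow control with a FIFO of size 4 can be sketched as follows. This is a generic single-channel model, assuming credits return instantly when a flit leaves the FIFO; the class and method names are made up for the example and do not describe the M-Core router's implementation.

```python
from collections import deque

class CreditLink:
    """One virtual channel between an upstream sender and a downstream FIFO."""

    def __init__(self, fifo_size=4):
        self.fifo = deque()
        self.credits = fifo_size  # sender-side count of free downstream slots

    def send(self, flit):
        """Upstream side: forward a flit only if a credit is available."""
        if self.credits == 0:
            return False  # downstream FIFO is full, so the sender stalls
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def drain(self):
        """Downstream side: pop one flit and return its credit upstream."""
        if not self.fifo:
            return None
        self.credits += 1
        return self.fifo.popleft()

link = CreditLink(fifo_size=4)
accepted = [link.send(i) for i in range(6)]  # only the first 4 are accepted
first = link.drain()                         # frees one slot, returns a credit
retry = link.send(4)                         # succeeds now that a credit exists
print(accepted, first, retry)
```

Because the sender tracks free slots itself, the downstream FIFO can never overflow and no flit is dropped; a real router would add a delay before the credit arrives back upstream.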

ScalableCore Unit Floorplan (XC6SLX16)

[Figure: FPGA floorplan with regions for Router, Local Memory (Memory Controller), DMA Controller, Core, and System Function]

Resource Utilization of ScalableCore Unit

■ FPGA: Spartan-6 XC6SLX16
■ The system functions do NOT consume a serious share of the resources
  ● System: 20% of LUTs and 15% of Regs (of LX16)
  ● Target: 64% of LUTs and 14% of Regs (of LX16)

Module                LUT    Register   BRAM   LUTRAM   DSP
System Function       1700   2693       16     0        0
Core                  1920   713        3      0        6
DMA Controller        444    378        0      0        0
DMA Register          590    535        0      0        0
Router                2475   959        0      280      0
Target Total          5429   2585       3      280      6
Total                 7129   5278       19     280      6
Percent Utilization   84%    29%        31%    N/A      6%

Simulation Speed

■ Environment
  ● ScalableCore System 3.3 (FPGA-based simulator of M-Core)
    • Freq.: 40MHz (SerDes: 80MHz)
  ● SimMc (software simulator of M-Core)
    • Intel Core i7 870, Memory 4GB, gcc 4.5.2 (-O3), Ubuntu Server 11.04
■ # Node
  ● 16 (4x4), 36 (6x6), 64 (8x8), 100 (10x10)


Evaluation: Simulation Speed [KHz]

■ ScalableCore System achieves a constant simulation frequency: good weak scaling
■ As the number of target cores increases, the relative speed increases!!
  ● At 100 Nodes, ScalableCore System runs 129x faster

Simulation frequency [KHz]
# Node              16     36     64     100
SimMc (MM)          89.1   28.3   14.0   8.8
SimMc (NQ)          90.4   28.4   14.1   8.9
ScalableCore (MM)   1142   1142   1142   1142
ScalableCore (NQ)   1142   1142   1142   1142

Relative speed (ScalableCore / SimMc)
# Node          16     36     64     100
Relative (MM)   12.8   40.4   81.4   129.9
Relative (NQ)   12.6   40.2   80.8   128.5
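The relative speeds follow directly from the two frequency series. The sketch below recomputes them from the rounded chart values; small differences from the slide in the first decimal place are expected, presumably because the slide used unrounded frequencies.

```python
nodes = [16, 36, 64, 100]
# Software simulator frequencies in KHz, as read from the chart.
simmc_khz = {
    "MM": [89.1, 28.3, 14.0, 8.8],
    "NQ": [90.4, 28.4, 14.1, 8.9],
}
scalablecore_khz = 1142.0  # constant for every node count

def relative_speed(sw_khz):
    """Speedup of ScalableCore over the software simulator per node count."""
    return [round(scalablecore_khz / f, 1) for f in sw_khz]

speedup = {name: relative_speed(freqs) for name, freqs in simmc_khz.items()}
for name, values in speedup.items():
    print(name, dict(zip(nodes, values)))
```

Since the FPGA system's frequency stays flat while SimMc's falls with the node count, the speedup grows from about 13x at 16 nodes to about 130x at 100 nodes.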


Case Study: Task Allocation on Many-core

■ The task allocation pattern affects performance
  ● Communication latency, packet contention
■ ScalableCore System for task allocation testing
  ● RMAP: pattern-based task allocation on a 2D mesh [7]
  ● Simulation time is reduced from 43h to 20min

[Figure: two allocations of 4 applications (A to D), 16 nodes per application. Normal Allocation (4 Apps): each application occupies its own contiguous block of 16 nodes. RMAP X4 (4 Apps): the applications are interleaved by repeating the pattern below.]

A B C D
B C D A
C D A B
D A B C

[7] Sano, S. et al. Pattern-based systematic task mapping for many-core processors, UPDAS-2011
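The two allocations in the figure can be generated with a few lines of code. The sketch assumes an 8x8 mesh in which the normal allocation gives each application one 4x4 quadrant, and it reproduces the interleaved pattern with a diagonal rule, (row + column) mod 4. That rule matches the figure but is not taken from the RMAP paper, and the function names are illustrative.

```python
APPS = "ABCD"

def normal_allocation(size=8, block=4):
    """Each app gets one contiguous block x block quadrant (assumed layout)."""
    per_row = size // block
    return [[APPS[(r // block) * per_row + (c // block)] for c in range(size)]
            for r in range(size)]

def interleaved_allocation(size=8):
    """Diagonal interleaving that reproduces the 'RMAP X4' figure pattern."""
    return [[APPS[(r + c) % len(APPS)] for c in range(size)]
            for r in range(size)]

def nodes_per_app(grid):
    """Count how many nodes each application receives."""
    counts = {app: 0 for app in APPS}
    for row in grid:
        for app in row:
            counts[app] += 1
    return counts

for name, grid in (("Normal", normal_allocation()),
                   ("Interleaved", interleaved_allocation())):
    print(name, nodes_per_app(grid))
    for row in grid:
        print(" ".join(row))
```

Both layouts give every application 16 nodes; they differ only in where those nodes sit on the mesh, which is what changes communication latency and packet contention.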


Conclusion

■ ScalableCore System 3.3: a scalable FPGA-based simulation system for tile architecture evaluations
  ● Multiple FPGAs together correspond to a single whole processor
  ● Development of original boards
  ● Two key techniques
    • Virtual cycle, Local Barrier Synchronization
  ● 129x faster simulation than the software simulator (100 Nodes)
■ Future Work
  ● Virtually combining multiple FPGAs to hold a large core
  ● Time-multiplexed execution for better hardware utilization

