Page 1: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

ScalableCore System: A Scalable Many-core Simulator by Employing over 100 FPGAs

Shinya Takamaeda-Yamazaki†‡,

Shintaro Sano†, Yoshito Sakaguchi†, Naoki Fujieda†, Kenji Kise†

†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow, Japan

10:00–10:25 March 23, 2012 ARC 2012 @Hong Kong

Page 2: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

ScalableCore System 3.3

• Tile architecture simulator built from multiple FPGAs
  • Achieves SCALABLE simulation speed

[Figure: target node (Core, DMAC, Local Memory, Router "R") and DRAM Controllers; each FPGA holds the Target Core plus System Functions]

Page 3: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Contents

• Background and Motivation
• Proposal: ScalableCore System
• Detailed Architecture
• Evaluation
• Case Study: Task Allocation on Many-core
• Conclusion


Page 5: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Background: Many-core Era

Intel Single Chip Cloud Computer 48 cores (x86)

TILERA TILE-Gx100 100 cores (MIPS)


Page 6: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Simulation Target Many-core: M-Core [11]

• Simple tile architecture with a 2D mesh network
  • Like Cell/B.E., each Node has no caches, only a local memory
  • Parallel programs communicate by DMA transfers among the Nodes

[Figure: mesh of Nodes, each with a Core, DMAC, Local Memory, and Router (R), together with DRAM Controllers]

[11] Uehara, K. et al. A Study of an Infrastructure for Research and Development of Many-Core Processors, UPDAS-2010

Page 7: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Simulation on a SW Simulator Takes a Lot of Time!

• The simulation speed of a SW simulator drops as the number of target cores increases
  • First, a SW simulator is very slow! (a slowdown of 1000x or more)
  • And achieving scalable speed is DIFFICULT!

Simulation speed of SimMc (M-Core simulator) on Core i7 870, 4GB memory, gcc 4.5.2 (-O3)

Simulation frequency [KHz]:

# Node   SimMc (MM)   SimMc (NQ)
16       89.1         90.4
36       28.3         28.4
64       14.0         14.1
100      8.8          8.9
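To make the scaling problem concrete, the following sketch multiplies each measured SimMc frequency (MM benchmark) by its node count. The "aggregate node-cycles per second" metric and all variable names are assumptions introduced here for illustration; the slides only report the per-cycle frequency.

```python
# Measured SimMc simulation frequency in KHz (MM benchmark), copied from the chart.
simmc_khz = {16: 89.1, 36: 28.3, 64: 14.0, 100: 8.8}

def aggregate_knode_cycles(freq_khz_by_nodes):
    """Thousands of node-cycles simulated per second for each node count."""
    return {n: n * f for n, f in freq_khz_by_nodes.items()}

agg = aggregate_knode_cycles(simmc_khz)
slowdown = simmc_khz[16] / simmc_khz[100]

for n in sorted(agg):
    print(f"{n:3d} nodes: {simmc_khz[n]:5.1f} KHz, {agg[n]:7.1f} K node-cycles/s")
print(f"16 -> 100 nodes: simulated clock drops {slowdown:.1f}x")
```

Under this reading, the software simulator's total work rate does not grow with the target size, so the simulated clock falls by roughly 10x when going from 16 to 100 nodes.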

Page 8: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Motivation

• Accelerate many-core simulations for efficient research
  • Ex) Task allocation on many-core processors
• SCALABLE simulation speed even with a large core count
• How can the simulation speed be scaled?
  • In this study, our target architecture is M-Core
    • Tile architecture with 2D mesh network

Map the target processor onto multiple FPGAs

[Figure: a Many-core Processor is partitioned and mapped onto multiple FPGAs]


Page 10: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Our Solution: ScalableCore System

• Multiple FPGA units together compose the whole target processor

[Figure: Mapping the Target Many-core to Multiple FPGAs (ScalableCore System)]
• ScalableCore Unit (Processor Core): FPGA + SRAM; holds the Target Core (Core, DMAC, Local Memory, Router "R") plus System Functions; the diagram shows 16 of them
• Memory Unit (Off-chip Memory): FPGA + DRAM; stands in for a DRAM Controller; the diagram shows 4 of them
• Power: DC5V
• Host connection: USB-Serial to USB

Page 11: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

ScalableCore System 3.3 for 100-Nodes

[Photo: the assembled system, labeled 46.7cm by 60.0cm]
• ScalableCore Unit (for Processor Core): FPGA+SRAM board, carrying Core, DMAC, Local Memory, Router (R), and System Functions
• Memory Unit (for DRAM Controller): FPGA+DRAM board

Page 12: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Our Original FPGA Boards

• We developed them from scratch!
• ScalableCore Unit: FPGA+SRAM board (4.67cm x 6.0cm)
  • Xilinx Spartan-6 XC6SLX16
  • 512KB SRAM (8-bit, 1-port read/write)
  • Configuration ROM
• Memory Unit: FPGA+DRAM board (4.67cm x 6.0cm)
  • Xilinx Spartan-6 XC6SLX16
  • 16MB DRAM
  • Configuration ROM

Page 13: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

(ASIDE) ScalableCore system 1.1 [9]

• Earlier system supporting up to 64 (8x8) Nodes

[Photo: ScalableCore Unit (v1.1) and the ScalableCore Board that connects the Units]

[9] Takamaeda-Y. S. et al. An FPGA-based Scalable Simulation Accelerator for Tile Architectures, ACM CAN-39 (2011)

Page 14: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

It’s Scalable !

• 1 (1x1) ScalableCore Unit

[Photo: one ScalableCore Unit, a Memory Unit, and a Power Supply Unit with a USB-Serial IC connecting to the Host PC]

Page 15: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

It’s Scalable !!

• 16 (4x4) ScalableCore Units

Page 16: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

It’s Scalable !!!

• 64 (8x8) ScalableCore Units

Page 17: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

It’s Scalable !!!!

• 128 (16x8) ScalableCore Units


Page 19: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Logic Hierarchy of ScalableCore Unit

• Target Core (Node of M-Core)
  • Core
  • DMAC
  • Local Memory (Interface)
  • Router
• System Functions
  • Ser/Des
  • Memory Multiplexer
  • Device Controller
  • State Machine Controller
  • Interface Register

Page 20: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

ScalableCore Unit Architecture

[Figure: block diagram of the ScalableCore Unit FPGA (Spartan-6) and its off-chip devices]
• Core: Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit
• DMAC: DMA Generator/Receiver, DMA Register
• Local Memory: Memory Controller
• Router: Arbiter, XBAR
• System functions: State Machine Controller, Interface Registers (IR), Memory Multiplexer, SRAM Controller, RS232C Controller, four Ser/Des links to/from adjacent Units, Clock, Reset
• Off-chip devices: SRAM, Configuration ROM (XCF04S) with JTAG port, RS232C-USB to Host PC (USB)

Page 21: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Memory Unit Architecture

• DRAM instead of the SRAM used in the ScalableCore Unit
  • 16MB DRAM on board
• DRAM Emulator instead of the Core/Router

[Figure: block diagram of the Memory Unit FPGA (Spartan-6) and its off-chip devices]
• DMAC: DMA Generator/Receiver, DMA Register
• DRAM Emulator: Memory Controller, DRAM Timing Model
• System functions: State Machine Controller, Interface Registers (IR), Memory Multiplexer, Off-chip DRAM Controller, Ser/Des, Clock, Reset
• Off-chip devices: DRAM, Configuration ROM (XCF04S) with JTAG port

Page 22: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Local Barrier Synchronization

• Handshaking with only the 4 neighboring FPGAs
  • The handshaking overhead is constant: it does NOT grow as the number of target cores increases
  • Achieves scalable simulation speed

[Figure: timing chart for Cycle 1 and Cycle 2. In each cycle a unit sends to Units 0, 1, 2, and 3 and receives from Units 0, 1, 2, and 3, its four mesh neighbors.]
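The idea of synchronizing only with neighbors can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware protocol: the function names, the random scheduling of units, and the rule "a unit runs its next cycle once every neighbor has completed at least as many cycles as it has" are illustrative choices made here.

```python
import random

def neighbors(x, y, w, h):
    """Up to four mesh neighbors (north, south, west, east) of unit (x, y)."""
    cand = [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
    return [(i, j) for i, j in cand if 0 <= i < w and 0 <= j < h]

def simulate(w, h, target_cycles, seed=0):
    """Advance every unit to target_cycles using only neighbor checks.

    Returns the per-unit cycle counts and the largest cycle gap ever
    observed between two adjacent units.
    """
    rng = random.Random(seed)
    units = [(x, y) for y in range(h) for x in range(w)]
    done = {u: 0 for u in units}
    max_gap = 0
    while min(done.values()) < target_cycles:
        u = rng.choice(units)            # units progress in an arbitrary order
        if done[u] >= target_cycles:
            continue
        nbrs = neighbors(u[0], u[1], w, h)
        # Local barrier: look at 4 neighbors at most, never at the whole mesh.
        if all(done[v] >= done[u] for v in nbrs):
            done[u] += 1
            gap = max((abs(done[u] - done[v]) for v in nbrs), default=0)
            max_gap = max(max_gap, gap)
    return done, max_gap

done, max_gap = simulate(4, 4, target_cycles=20)
print(max_gap)
```

In this model adjacent units never drift more than one cycle apart, and the check each unit performs involves the same number of neighbors whether the mesh is 4x4 or 10x10, which is the property the slide attributes to the scheme.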

Page 23: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Virtual Cycle

• Multiple FPGA clock cycles correspond to 1 target clock cycle
  • Virtual hardware built from simple FPGA equipment

[Figure: timeline of 1 Virtual Cycle Time, from Virtual Cycle N to Virtual Cycle N+1]
• Drive the circuit of the target components (Core, DMAC, Router): proceeding the target circuit state
• Process the memory accesses via the Memory Multiplexer, interleaved: DMAC Read, Core (IF), DMAC Write, Core (L/S)
• Data Sender via Serial I/Os: sends the synchronized data North, East, West, and South (begins at "Start sending")
• Data Receiver via Serial I/Os: receives the synchronized data from North, East, West, and South (ends at "Finish synchronization")
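One reason a virtual cycle needs several FPGA clocks is the gap between the target's local memory and the physical SRAM. The sketch below combines two figures from the slides (4 target memory ports at 32 bits; a physical SRAM that is 8 bits wide with 1 port). The accounting itself is an assumption made here: it supposes every port issues one full-word access per virtual cycle and that each SRAM access moves one byte, which the slides do not state.

```python
# Figures taken from the slides; the accounting model is an assumption.
TARGET_PORTS = ["Inst", "Load/Store", "DMA Read", "DMA Write"]  # 4 local-memory ports
TARGET_WORD_BITS = 32   # local memory width in the target
SRAM_WIDTH_BITS = 8     # physical SRAM: 8-bit, 1 port

def sram_accesses_per_virtual_cycle(ports, word_bits, sram_bits):
    """Worst-case physical SRAM accesses needed to serve every target port once."""
    accesses_per_word = -(-word_bits // sram_bits)   # ceiling division
    return len(ports) * accesses_per_word

n_accesses = sram_accesses_per_virtual_cycle(
    TARGET_PORTS, TARGET_WORD_BITS, SRAM_WIDTH_BITS)
print(n_accesses)
```

Under these assumptions, a single target cycle can require up to 16 sequential SRAM accesses, which is consistent with the slide's point that memory accesses are interleaved within one virtual cycle.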

Page 24: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Target Description in Verilog HDL

• An "EN" signal enables the update of all flip-flops in the target
  • Driven by the outer State Machine Controller once every virtual cycle
  • Cleanly separates TARGET and SYSTEM (via Interface Registers)

    always @(posedge CLK or negedge RST_X) begin
      if(!RST_X) begin                    // asynchronous active-low reset
        if_id_invalid <= 1;
        if_id_pc      <= 0;
      end else if(EN) begin               // target state advances only when EN == 1
        if(!if_id_stall) begin
          if_id_invalid <= if_id_flush;
          if_id_pc      <= icache_addr;
        end
      end
    end

When (EN == 1), update all flip-flops
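The effect of the EN gate can be mimicked in software. The class below is a hypothetical Python model of the two registers in the fragment above; the class name, method names, and the schedule of one EN pulse per 4 FPGA clocks are assumptions for illustration only.

```python
class IfIdStage:
    """Software model of the if_id_invalid / if_id_pc registers."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Mirrors the !RST_X branch of the Verilog fragment.
        self.invalid = 1
        self.pc = 0

    def clock(self, en, stall, flush, icache_addr):
        """One FPGA clock edge. State changes only when en is 1 and stall is 0."""
        if en and not stall:
            self.invalid = flush
            self.pc = icache_addr

stage = IfIdStage()
updates = 0
# Hypothetical schedule: EN is high on 1 FPGA clock out of every 4.
for fpga_clk in range(8):
    en = 1 if fpga_clk % 4 == 3 else 0
    before = stage.pc
    stage.clock(en, stall=0, flush=0, icache_addr=0x100 + 4 * fpga_clk)
    updates += stage.pc != before
print(updates, hex(stage.pc))
```

Eight FPGA clocks produce only two target-visible updates here, so the target circuit sees one clock per virtual cycle regardless of how many FPGA clocks the system functions consume in between.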


Page 26: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Evaluation

• Resource Usage (for each ScalableCore Unit)
  • Floorplan of FPGA
  • LUT/Reg/BRAM/DSP usage of each FPGA
• Simulation Speed (vs. software-based simulator)
  • Frequency [KHz]: # simulated cycles per sec
  • # Node of target: 16 to 100

Page 27: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Node Micro Architecture of Target

• Core
  • MIPS32 ISA, 5-stage, single-issue, in-order
    • No FPU support (future work)
  • 2 memory ports (Inst, Load/Store)
• DMA Controller
  • 2 memory ports (32-bit DMA Read, 32-bit DMA Write)
• Router
  • 5 I/O ports, 4 stages (NRC/VA, SA, ST, LT)
  • 2 virtual channels, FIFO size = 4, credit-based flow control
• Local Memory
  • Access latency = 1, 512KB, 32-bit
  • 4 memory ports (Inst, Load/Store, DMA Read, DMA Write)

[Figure: Node with Core, DMAC, Local Memory, and Router (R)]

Page 28: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

ScalableCore Unit Floorplan (XC6SLX16)

[Figure: FPGA floorplan with regions for Router, Local Memory (Memory Controller), DMA Controller, Core, and System Function]

Page 29: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Resource Utilization of ScalableCore Unit

• FPGA: Spartan-6 XC6SLX16
• The system functions do NOT consume a serious share of the resources
  • System: 20% LUTs and 15% Regs (of LX16)
  • Target: 64% LUTs and 14% Regs (of LX16)

Module               LUT    Register  BRAM  LUTRAM  DSP
System Function      1700   2693      16    0       0
Core                 1920   713       3     0       6
DMA Controller       444    378       0     0       0
DMA Register         590    535       0     0       0
Router               2475   959       0     280     0
Target Total         5429   2585      3     280     6
Total                7129   5278      19    280     6
Percent Utilization  84%    29%       31%   N/A     6%
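As a consistency check, the "Target Total" and "Total" rows can be recomputed from the per-module rows. The sketch below only re-adds the numbers printed in the table; the data layout and function name are illustrative, and the percentage row is left as reported because the slides do not give the device capacities used to derive it.

```python
# Columns: LUT, Register, BRAM, LUTRAM, DSP (values copied from the table).
target_modules = {
    "Core":           (1920, 713, 3, 0, 6),
    "DMA Controller": (444, 378, 0, 0, 0),
    "DMA Register":   (590, 535, 0, 0, 0),
    "Router":         (2475, 959, 0, 280, 0),
}
system_function = (1700, 2693, 16, 0, 0)

def column_sums(rows):
    """Add resource tuples column by column."""
    return tuple(sum(col) for col in zip(*rows))

target_total = column_sums(target_modules.values())
total = column_sums([system_function, target_total])
system_lut_share = system_function[0] / total[0]

print(target_total)
print(total)
print(f"system functions use {system_lut_share:.0%} of the occupied LUTs")
```

The recomputed sums match the table, and the system functions account for a little under a quarter of the LUTs that are actually occupied.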

Page 30: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Simulation Speed

• Environment
  • ScalableCore system 3.3 (FPGA-based simulator of M-Core)
    • Freq.: 40MHz (SerDes: 80MHz)
  • SimMc (software simulator of M-Core)
    • Intel Core i7 870, 4GB memory, gcc 4.5.2 (-O3), Ubuntu Server 11.04
• # Node
  • 16 (4x4), 36 (6x6), 64 (8x8), 100 (10x10)

Page 31: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Evaluation: Simulation Speed [KHz]

• ScalableCore System achieves a constant simulation frequency: good weak scaling
• As the number of target cores increases, the relative speed increases!!
  • With 100 Nodes, ScalableCore system runs 129x faster

Simulation frequency [KHz]:

# Node   SimMc (MM)   SimMc (NQ)   ScalableCore (MM)   ScalableCore (NQ)
16       89.1         90.4         1142                1142
36       28.3         28.4         1142                1142
64       14.0         14.1         1142                1142
100      8.8          8.9          1142                1142

Relative speed (ScalableCore vs. SimMc):

# Node   Relative (MM)   Relative (NQ)
16       12.8            12.6
36       40.4            40.2
64       81.4            80.8
100      129.9           128.5
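The relative speeds follow directly from dividing the constant ScalableCore frequency by each SimMc frequency. The sketch below redoes that division with the rounded chart values, so small deviations from the reported ratios are expected. The last quantity, FPGA clocks per virtual cycle, is a derived estimate (40MHz unit clock divided by 1142KHz) that the slides do not state explicitly.

```python
SCALABLECORE_KHZ = 1142.0   # constant for every node count in the chart
FPGA_CLOCK_KHZ = 40_000.0   # 40MHz unit clock from the environment slide

simmc_khz = {
    "MM": {16: 89.1, 36: 28.3, 64: 14.0, 100: 8.8},
    "NQ": {16: 90.4, 36: 28.4, 64: 14.1, 100: 8.9},
}
reported = {
    "MM": {16: 12.8, 36: 40.4, 64: 81.4, 100: 129.9},
    "NQ": {16: 12.6, 36: 40.2, 64: 80.8, 100: 128.5},
}

def relative_speed(bench, nodes):
    """ScalableCore frequency divided by the SimMc frequency."""
    return SCALABLECORE_KHZ / simmc_khz[bench][nodes]

worst_error = max(abs(relative_speed(b, n) - reported[b][n])
                  for b in reported for n in reported[b])
fpga_clocks_per_virtual_cycle = FPGA_CLOCK_KHZ / SCALABLECORE_KHZ

for b in reported:
    for n in sorted(reported[b]):
        print(f"{b} {n:3d} nodes: {relative_speed(b, n):6.1f}x "
              f"(reported {reported[b][n]}x)")
print(f"about {fpga_clocks_per_virtual_cycle:.0f} FPGA clocks per virtual cycle")
```

Because the FPGA system's frequency does not depend on the node count, the ratio grows exactly as fast as the software simulator slows down.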


Page 33: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Case Study: Task Allocation on Many-core

• The task allocation pattern affects performance
  • Communication latency, packet contention
• ScalableCore system for task allocation testing
  • RMAP: Pattern-based task allocation on 2D mesh [7]
  • Simulation time is reduced from 43h to 20min

[Figure: two allocations of 4 applications (A, B, C, D) over 64 nodes]

Normal Allocation (4 Apps): each application occupies one contiguous 4x4 block of nodes.

RMAP X4 (4 Apps): the following 4x4 pattern is repeated four times:

A B C D
B C D A
C D A B
D A B C

[7] Sano, S. et al. Pattern-based systematic task mapping for many-core processors, UPDAS-2011
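The two layouts in the figure can be regenerated with a few lines of code. This sketch only reproduces the patterns visible on the slide; it is not the RMAP algorithm from [7]. The 8x8 arrangement, the quadrant order of the normal allocation, the function names, and the "same-application links" count are all assumptions added here for illustration.

```python
APPS = "ABCD"

def normal_allocation(size=8):
    """Each app gets one contiguous quadrant (quadrant order is an assumption)."""
    half = size // 2
    return [[APPS[2 * (y // half) + (x // half)] for x in range(size)]
            for y in range(size)]

def rmap_x4_allocation(size=8):
    """Diagonal 4x4 pattern from the slide, tiled across the mesh."""
    return [[APPS[(x + y) % 4] for x in range(size)] for y in range(size)]

def same_app_links(grid):
    """Count adjacent node pairs (horizontal and vertical) running the same app."""
    n = len(grid)
    horiz = sum(grid[y][x] == grid[y][x + 1] for y in range(n) for x in range(n - 1))
    vert = sum(grid[y][x] == grid[y + 1][x] for y in range(n - 1) for x in range(n))
    return horiz + vert

for row in rmap_x4_allocation(4):
    print(" ".join(row))

normal_links = same_app_links(normal_allocation())
rmap_links = same_app_links(rmap_x4_allocation())
speedup = (43 * 60) / 20    # 43h versus 20min, both from the slide
print(normal_links, rmap_links, speedup)
```

The reported reduction from 43h to 20min corresponds to a factor of 129, which agrees with the 100-Node speedup quoted in the evaluation.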


Page 35: ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs

Conclusion

• ScalableCore system 3.3: a scalable FPGA-based simulation system for tile architecture evaluations
  • Multiple FPGAs together correspond to a single whole processor
  • Development of original boards
  • Two key techniques
    • Virtual Cycle, Local Barrier Synchronization
  • 129x faster simulation than the software simulator (100 Nodes)
• Future Work
  • Virtually combining multiple FPGAs to simulate one large core
  • Time-multiplexed driving for better hardware utilization

