ScalableCore System: A Scalable Many-core Simulator by Employing over 100 FPGAs
Shinya Takamaeda-Yamazaki†‡,
Shintaro Sano†, Yoshito Sakaguchi†, Naoki Fujieda†, Kenji Kise†
†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow, Japan
10:00–10:25 March 23, 2012 ARC 2012 @Hong Kong
ScalableCore System 3.3
■ Tile architecture simulator by multiple FPGAs
  ● Achieving SCALABLE simulation speed
Shinya Takamaeda-Y. Tokyo Tech 2
[Figure: a target node (Local Memory, DMAC, Core, Router "R") with DRAM Controllers; each FPGA holds a Target Core plus System Functions]
ARC 2012 @HongKong
Contents
■ Background and Motivation
■ Proposal: ScalableCore System
■ Detailed Architecture
■ Evaluation
■ Case Study: Task Allocation on Many-core
■ Conclusion
Background: Many-core Era
Intel Single Chip Cloud Computer 48 cores (x86)
TILERA TILE-Gx100 100 cores (MIPS)
Simulation Target Many-core: M-Core [11]
■ Simple tile architecture with a 2D mesh network
  ● Like Cell/B.E., a Node has no caches, only local memory
  ● Parallel programs communicate by DMA among the Nodes
[Figure: a Node (Local Memory, DMAC, Core, Router "R") in the 2D mesh, with DRAM Controllers]
[11] Uehara, K. et al. A Study of an Infrastructure for Research and Development of Many-Core Processors, UPDAS-2010
Simulation on a SW simulator takes a lot of time!
■ The simulation speed of a SW simulator drops as the # of target cores increases
  ● First, a SW simulator is very slow! (a slowdown of 1000x or more)
  ● And achieving scalable speed is DIFFICULT!
Simulation speed of SimMc (M-Core simulator) on Core i7 870, 4GB memory, gcc 4.5.2 (-O3)
[Figure: SimMc simulation speed, Freq. [KHz] vs. # Node]
# Node        16     36     64    100
SimMc (MM)   89.1   28.3   14.0    8.8
SimMc (NQ)   90.4   28.4   14.1    8.9
Motivation
■ Accelerating many-core simulations for efficient research
  ● Ex) Task allocation on many-core processors
■ SCALABLE simulation speed even for large core counts
■ How to scale the simulation speed?
  ● In this study, our target architecture is M-Core
    • Tile architecture with 2D mesh network
Map the target processor onto multiple FPGAs
[Figure: Many-core Processor, Partition, Map]
Our Solution: ScalableCore System
■ Multiple FPGA units together compose the whole target processor
[Figure: Mapping the Target Many-core to Multiple FPGAs (ScalableCore System). The drawing shows 16 ScalableCore Units (FPGA + SRAM; Processor Core, holding a Target Core plus System Functions) and 4 Memory Units (FPGA + DRAM; Off-chip Memory, DRAM Controller), with Power DC5V and a Host connected via USB-Serial]
ScalableCore System 3.3 for 100-Nodes
[Photo: the assembled system, 46.7cm x 60.0cm. Memory Unit (for DRAM Controller): FPGA+DRAM board. ScalableCore Unit (for Processor Core): FPGA+SRAM board holding Local Memory, DMAC, Core, Router ("R") and System Functions]
Our Original FPGA Boards
■ We developed them from scratch!
■ ScalableCore Unit: FPGA+SRAM board
  ● Xilinx Spartan-6 XC6SLX16
  ● 512KB SRAM (8-bit, 1-port read/write)
  ● Configuration ROM
■ Memory Unit: FPGA+DRAM board
  ● Xilinx Spartan-6 XC6SLX16
  ● 16MB DRAM
  ● Configuration ROM
[Photos: the two boards, each 4.67cm x 6.0cm]
(ASIDE) ScalableCore system 1.1 [9]
■ Past system, up to 64 (8x8) Nodes
[Photos: ScalableCore Unit (v1.1); ScalableCore Board (connecting the Units)]
[9] Takamaeda-Y. S. et al. An FPGA-based Scalable Simulation Accelerator for Tile Architectures, ACM CAN-39 (2011)
It’s Scalable!
■ 1 (1x1) ScalableCore Unit
[Photo: ScalableCore Unit, Memory Unit, and Power Supply Unit with USB-Serial IC to Host PC]
It’s Scalable!!
■ 16 (4x4) ScalableCore Units
It’s Scalable!!!
■ 64 (8x8) ScalableCore Units
It’s Scalable!!!!
■ 128 (16x8) ScalableCore Units
Logic Hierarchy of ScalableCore Unit
■ Target Core (Node of M-Core)
  ● Core, DMAC, Local Memory (Interface), Router
■ System Functions
  ● Ser/Des, Memory Multiplexer, Device Controller, State Machine Controller, Interface Register
ScalableCore Unit Architecture
[Figure: block diagram of the ScalableCore Unit FPGA (Spartan-6) and its off-chip devices.
 Target blocks: Core (Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit), DMAC (DMA Generator/Receiver, DMA Register), Local Memory (Memory Controller), Router (Arbiter, XBAR).
 System blocks: Memory Multiplexer, SRAM Controller, RS232C Controller, State Machine Controller, four Ser/Des links to/from Adjacent Units, Interface Registers (IR), Clock, Reset.
 Off-chip Devices: SRAM, Configuration ROM (XCF04S) with JTAG port, RS232C-USB to Host PC (USB)]
Memory Unit Architecture
■ DRAM instead of the SRAM in the ScalableCore Unit
  ● 16MB DRAM on board
■ DRAM Emulator instead of Core/Router
[Figure: block diagram of the Memory Unit FPGA (Spartan-6) and its off-chip devices.
 Blocks: DMAC (DMA Generator/Receiver, DMA Register), DRAM Emulator, Memory Controller, DRAM Timing Model, Memory Multiplexer, Off-chip DRAM Controller, State Machine Controller, one Ser/Des link, Interface Registers (IR), Clock, Reset.
 Off-chip Devices: DRAM, Configuration ROM (XCF04S) with JTAG port]
Local Barrier Synchronization
■ Handshaking with only the 4 neighbor FPGAs
  ● The handshaking overhead is constant; it does NOT grow with the # of target cores
  ● Achieves scalable simulation speed
[Figure: timeline of one unit over Cycle 1 and Cycle 2. In each cycle it sends to Units 0, 1, 2, 3 and receives from Units 0, 1, 2, 3; the accompanying mesh drawing labels units 0 to 4]
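The constant-overhead claim of local barrier synchronization can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware protocol: the functions `neighbors` and `simulate` are hypothetical names, the units advance in lock-step inside one Python loop, and the "message" is just the sender's current cycle number.

```python
from itertools import product

def neighbors(x, y, width, height):
    """Return the existing mesh neighbors (north, east, south, west) of (x, y)."""
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]

def simulate(width, height, virtual_cycles):
    """Advance every unit by `virtual_cycles`, synchronizing only with neighbors.

    Returns (final cycle count per unit, max handshakes per unit per cycle).
    """
    units = list(product(range(width), range(height)))
    cycle = {u: 0 for u in units}
    max_handshakes = 0
    for _ in range(virtual_cycles):
        # Phase 1: every unit sends its current cycle number to each neighbor.
        inbox = {u: [] for u in units}
        for (x, y) in units:
            for n in neighbors(x, y, width, height):
                inbox[n].append(cycle[(x, y)])
        # Phase 2: a unit may proceed once all of its neighbors have reported
        # the same cycle number; no global barrier is consulted.
        for u in units:
            expected = len(neighbors(*u, width, height))
            assert len(inbox[u]) == expected
            assert all(c == cycle[u] for c in inbox[u])
            max_handshakes = max(max_handshakes, expected)
        for u in units:
            cycle[u] += 1
    return cycle, max_handshakes

if __name__ == "__main__":
    for side in (4, 6, 8, 10):
        _, handshakes = simulate(side, side, virtual_cycles=3)
        print(side * side, "units:", handshakes, "handshakes per unit per cycle")
```

In this model the per-unit handshake count is bounded by 4 for every mesh size, which is the property the slide relies on for scalable speed.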
Virtual Cycle
■ Multiple FPGA clock cycles make up 1 target clock cycle
  ● Virtual hardware built from simple FPGA equipment
[Figure: timeline of 1 virtual cycle (Virtual Cycle N, Virtual Cycle N+1, …), with four parallel activities.
 Proceeding Target Circuit State: drive the circuit of the target components (Core, DMAC, Router).
 Memory Access via Memory Multiplexer: process the memory accesses, interleaved as DMAC Read, Core (IF), DMAC Write, Core (L/S).
 Data Sender via Serial I/Os: starting at "Start sending", send the synchronized data via Serial I/O (North, East, West, South).
 Data Receiver via Serial I/Os: receive the synchronized data via Serial I/O (North, East, West, South) until "Finish synchronization"]
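One reason a virtual cycle needs several FPGA cycles can be estimated from figures given elsewhere in the deck: the target Local Memory is 32-bit with 4 ports, while the board SRAM is 8-bit with 1 port. The sketch below is an assumption-laden estimate, not a description of the actual controller: it assumes every port is served once per virtual cycle and that each SRAM access moves one byte. The function names are hypothetical, and the port-by-port order is just one possible interleaving.

```python
# Port names in the order shown on the slide's memory-access timeline.
PORT_ORDER = ["DMAC Read", "Core (IF)", "DMAC Write", "Core (L/S)"]

def sram_accesses_per_virtual_cycle(ports=4, port_width_bits=32, sram_width_bits=8):
    """Minimum SRAM accesses needed to serve every local-memory port once."""
    accesses_per_port = -(-port_width_bits // sram_width_bits)  # ceiling division
    return ports * accesses_per_port

def schedule(port_width_bits=32, sram_width_bits=8):
    """List (port, byte index) pairs in one order a multiplexer could issue them."""
    n_bytes = -(-port_width_bits // sram_width_bits)
    return [(port, b) for port in PORT_ORDER for b in range(n_bytes)]

if __name__ == "__main__":
    print(sram_accesses_per_virtual_cycle(), "SRAM accesses per virtual cycle")
    for port, byte_index in schedule():
        print(port, "byte", byte_index)
```

Under these assumptions, memory access alone takes at least 16 SRAM accesses per virtual cycle, before counting serial communication with the neighbors.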
Target Description in Verilog HDL
■ An “EN” signal enables the update of all flip-flops in the target
  ● Driven by the outer State Machine Controller for every virtual cycle
  ● Cleanly separates TARGET and SYSTEM (with Interface Registers)
    always @(posedge CLK or negedge RST_X) begin
        if (!RST_X) begin
            if_id_invalid <= 1;
            if_id_pc      <= 0;
        end else if (EN) begin
            if (!if_id_stall) begin
                if_id_invalid <= if_id_flush;
                if_id_pc      <= icache_addr;
            end
        end
    end
When (EN == 1), update all flip-flops
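The effect of the EN gate can be mimicked in software. The following Python model mirrors the register names of the Verilog fragment; the class `IfIdStage`, the helper `run_virtual_cycles`, and the choice to assert EN on the last FPGA cycle of each virtual cycle are illustrative assumptions, since the slides do not say when in the virtual cycle EN is raised.

```python
class IfIdStage:
    """Software model of the if_id registers from the Verilog fragment."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Same values the Verilog assigns while RST_X is low.
        self.if_id_invalid = 1
        self.if_id_pc = 0

    def clock(self, en, if_id_stall, if_id_flush, icache_addr):
        """One FPGA clock edge: the registers change only when EN is asserted."""
        if en and not if_id_stall:
            self.if_id_invalid = if_id_flush
            self.if_id_pc = icache_addr

def run_virtual_cycles(stage, addresses, fpga_cycles_per_virtual_cycle):
    """Assert EN once per virtual cycle (assumed: on its last FPGA cycle)."""
    for addr in addresses:
        for i in range(fpga_cycles_per_virtual_cycle):
            en = 1 if i == fpga_cycles_per_virtual_cycle - 1 else 0
            stage.clock(en, if_id_stall=0, if_id_flush=0, icache_addr=addr)
    return stage.if_id_pc

if __name__ == "__main__":
    stage = IfIdStage()
    print(run_virtual_cycles(stage, [0x0, 0x4, 0x8], fpga_cycles_per_virtual_cycle=35))
```

However many FPGA cycles pass, the target state advances exactly one step per EN pulse, which is what lets the system logic take its time between target cycles.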
Evaluation
■ Resource Usage (for each ScalableCore Unit)
  ● Floorplan of FPGA
  ● LUT/Reg/BRAM/DSP usage of each FPGA
■ Simulation Speed (vs. software-based simulator)
  ● Frequency [KHz]: # of simulated cycles per second
  ● # Node of target: 16 to 100
Node Micro Architecture of Target
■ Core
  ● MIPS32 ISA, 5-stage, single-issue, in-order
    • No FPU support (future work)
  ● 2 memory ports (Inst, Load/Store)
■ DMA Controller
  ● 2 memory ports (32-bit DMA Read, 32-bit DMA Write)
■ Router
  ● 5 I/O ports, 4 stages (NRC/VA, SA, ST, LT)
  ● 2 virtual channels, FIFO size = 4, credit-based flow control
■ Local Memory
  ● Access latency = 1, 512KB, 32-bit
  ● 4 memory ports (Inst, Load/Store, DMA Read, DMA Write)
ScalableCore Unit Floorplan (XC6SLX16)
[Figure: FPGA floorplan with regions for Router, Local Memory (Memory Controller), DMA Controller, Core, and System Function]
Resource Utilization of ScalableCore Unit
■ FPGA: Spartan-6 XC6SLX16
■ The system functions do NOT consume a serious share of the resources
  ● System: 20% LUTs and 15% Regs (of LX16)
  ● Target: 64% LUTs and 14% Regs (of LX16)

Module                 LUT   Register   BRAM   LUTRAM   DSP
System Function       1700       2693     16        0     0
Core                  1920        713      3        0     6
DMA Controller         444        378      0        0     0
DMA Register           590        535      0        0     0
Router                2475        959      0      280     0
Target Total          5429       2585      3      280     6
Total                 7129       5278     19      280     6
Percent Utilization    84%        29%    31%      N/A    6%
Simulation Speed
■ Environment
  ● ScalableCore system 3.3 (FPGA-based simulator of M-Core)
    • Freq.: 40MHz (SerDes: 80MHz)
  ● SimMc (software simulator of M-Core)
    • Intel Core i7 870, 4GB memory, gcc 4.5.2 (-O3), Ubuntu Server 11.04
■ # Node
  ● 16 (4x4), 36 (6x6), 64 (8x8), 100 (10x10)
Evaluation: Simulation Speed [KHz]
■ ScalableCore System achieves a constant simulation frequency: good weak scaling
■ As the # of target cores increases, the relative speed increases!!
  ● With 100 Nodes, ScalableCore system runs 129x faster
[Figure: simulation speed, Freq. [KHz] vs. # Node]
# Node               16     36     64    100
SimMc (MM)          89.1   28.3   14.0    8.8
SimMc (NQ)          90.4   28.4   14.1    8.9
ScalableCore (MM)   1142   1142   1142   1142
ScalableCore (NQ)   1142   1142   1142   1142

[Figure: Relative Speed vs. # Node]
# Node             16     36     64    100
Relative (MM)     12.8   40.4   81.4  129.9
Relative (NQ)     12.6   40.2   80.8  128.5
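The relative-speed figures follow from dividing the ScalableCore frequency by the SimMc frequency at each node count. The short script below redoes that arithmetic from the rounded values printed on the charts, so its results can differ from the slide's labels in the last digit. It also derives how many 40MHz FPGA cycles correspond to one simulated cycle, a number the slides do not state explicitly. All constant and function names are illustrative.

```python
NODES = [16, 36, 64, 100]
# Simulation speed in KHz, as labeled on the charts.
SIMMC_KHZ = {
    "MM": [89.1, 28.3, 14.0, 8.8],
    "NQ": [90.4, 28.4, 14.1, 8.9],
}
SCALABLECORE_KHZ = 1142.0  # the same for every node count and both workloads
FPGA_KHZ = 40_000.0        # 40MHz system clock of ScalableCore system 3.3

def relative_speed(workload):
    """ScalableCore speed divided by SimMc speed, per node count."""
    return [SCALABLECORE_KHZ / f for f in SIMMC_KHZ[workload]]

def fpga_cycles_per_simulated_cycle():
    """Average number of FPGA clock cycles spent on one target cycle."""
    return FPGA_KHZ / SCALABLECORE_KHZ

if __name__ == "__main__":
    for workload in ("MM", "NQ"):
        for n, r in zip(NODES, relative_speed(workload)):
            print(f"{workload} {n:>3} nodes: {r:.1f}x")
    print(f"{fpga_cycles_per_simulated_cycle():.1f} FPGA cycles per simulated cycle")
```

Because the ScalableCore frequency stays flat while SimMc slows roughly in proportion to the node count, the ratio grows almost linearly with the number of nodes.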
Case Study: Task Allocation on Many-core
■ The task allocation pattern affects performance
  ● Communication latency, packet contention
■ ScalableCore system for task allocation testing
  ● RMAP: pattern-based task allocation on 2D-mesh [7]
  ● Simulation time is reduced from 43h to 20min
[Figure: two allocations of 4 Apps (A, B, C, D)]
Normal Allocation (4 Apps): each App occupies its own 4x4 block of Nodes
    A A A A
    A A A A
    A A A A
    A A A A
  (likewise for B, C, and D)
RMAP X4 (4 Apps): the following 4x4 pattern, shown in 4 blocks
    A B C D
    B C D A
    C D A B
    D A B C
[7] Sano, S. et al. Pattern-based systematic task mapping for many-core processors, UPDAS-2011
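The two letter grids on this slide can be regenerated with a few lines of code. This sketch reproduces only the patterns as drawn; it is not the RMAP algorithm of [7]. The 8x8 mesh size and the 2x2 arrangement of the four blocks are assumptions, since the extracted slide text lists the blocks without their positions, and the function names are hypothetical.

```python
def normal_allocation(apps="ABCD", block=4):
    """Give each app one contiguous block x block region (assumed 2x2 layout)."""
    side = 2 * block
    return [[apps[(r // block) * 2 + (c // block)] for c in range(side)]
            for r in range(side)]

def rmap_x4(apps="ABCD", side=8):
    """Rotate the app order by one position per row, as in the slide's 4x4 tile."""
    n = len(apps)
    return [[apps[(r + c) % n] for c in range(side)] for r in range(side)]

def show(grid):
    """Render a grid as space-separated rows."""
    return "\n".join(" ".join(row) for row in grid)

if __name__ == "__main__":
    print("Normal Allocation (4 Apps)")
    print(show(normal_allocation()))
    print("RMAP X4 (4 Apps)")
    print(show(rmap_x4()))
```

Both layouts give each app the same number of Nodes; they differ only in whether an app's Nodes are clustered or spread across the mesh, which is what changes communication latency and packet contention.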
Conclusion
■ ScalableCore system 3.3: a scalable FPGA-based simulation system for tile architecture evaluations
  ● Multiple FPGAs together correspond to a single whole processor
  ● Development of original boards
  ● Two key techniques
    • Virtual Cycle, Local Barrier Synchronization
  ● 129x faster simulation than the software simulator (100 Nodes)
■ Future Work
  ● Virtually combining multiple FPGAs for a large core
  ● Time-multiplexed execution for better hardware utilization