Large-Scale HPC systems based on Heterogeneous multicore processors
Toshikazu Ebisuzaki (RIKEN)
Contents
• What are heterogeneous many-core processors?
• Introduction to GYOUKOU
• Prospects of Exaflops Computing
Homogeneous vs. heterogeneous
[Figure: a homogeneous many-core processor, in which cores share an interconnect (examples: GPU, PEZY-SC1), and a homogeneous many-core processor system, in which a general-purpose processor with its memory is connected through an interconnect to several homogeneous many-core processors.]
Homogeneous vs. heterogeneous
[Figure: a homogeneous many-core processor (cores and an interconnect; examples: GPU, PEZY-SC1) beside a heterogeneous many-core processor, which combines cores, a processor, memory and an interconnect (examples: SW26010, PEZY-SC2); and a heterogeneous many-core processor system, in which several heterogeneous many-core processors are linked through an interconnect.]
Heterogeneous Manycore Processors
• SW26010 Sunway TaihuLight
→talk of Professor Liu
• PEZY-SC2
– Gyoukou JAMSTEC
– Shoubu Sys.B RIKEN ACCC
– Suiren Blue KEK
– Ajisai RIKEN AICS
– Satsuki RIKEN CAP
PEZY-SC/SC2 Specification
Item | PEZY-SC | PEZY-SC2
Process | TSMC28HPM | TSMC16FFPGL
Freq.: Core | 733MHz | 1GHz
Freq.: Peripherals | 66MHz | 66MHz
Memory: Cache (Chip Total) | L1: 1MB, L2: 4MB, L3: 8MB | L1: 12MB, L2: 12MB, LLC: 40MB
Memory: Scratch Pad | 16MB (16KB/PE) | 40MB (20KB/PE)
IPs: Control CPU | ARM926 x 2 (Management, Debug); Cache L1: 32KB x 2, L2: 64KB | MIPS64R6 (P6600), 6 cores (General Purpose)
IPs: PCIe I/F | PCIe Gen3, 8 Lanes, 4 Ports (8GB/s x 4 = 32GB/s) | PCIe Gen4, 8 Lanes, 4 Ports (64GB/s)
IPs: DDR I/F | DDR4 64bit 2,400MHz, 8 Ports (19.2GB/s x 8 = 153.6GB/s) | Custom TCI Stacked DRAM, 4 Ports, 2TB/s (available on phase-2 version, fall 2017); DDR4 3,200MHz, 4 Ports, 100GB/s
Num. of PE (MIMD core) | 1,024 | 2,048
Peak Performance | 3.0 TFlops (Single Precision), 1.5 TFlops (Double Precision) | 8.2 TFlops (Single Precision), 4.1 TFlops (Double Precision)
Power (typical) | 70W (Leak: 10W, Dynamic: 60W) | 130W (Estimated)
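The peak-performance and DDR bandwidth figures in the specification can be cross-checked with simple arithmetic. The following is a minimal Python sketch, assuming that peak throughput is PEs x clock x 4 FP ops/cycle (the ALU rate quoted on the Hierarchical Architecture slide, taken here as the single-precision rate) and that double precision runs at half that rate. The function names are illustrative only and do not belong to any PEZY tool.

```python
def peak_tflops(num_pe: int, clock_ghz: float, ops_per_cycle: float) -> float:
    """Peak TFlops = PEs x clock (GHz) x FP ops per cycle, scaled from GFlops."""
    return num_pe * clock_ghz * ops_per_cycle / 1000.0


def ddr_bandwidth_gbs(bus_bits: int, mega_transfers: float, ports: int) -> float:
    """Aggregate GB/s = bytes per transfer x MT/s x number of ports."""
    return bus_bits / 8 * mega_transfers * ports / 1000.0


# Assumption: 4 FP ops/cycle is the single-precision rate per PE.
sc1_sp = peak_tflops(1024, 0.733, 4)   # PEZY-SC
sc2_sp = peak_tflops(2048, 1.0, 4)     # PEZY-SC2
# Assumption: double precision runs at half the single-precision rate.
sc1_dp, sc2_dp = sc1_sp / 2, sc2_sp / 2

# PEZY-SC: DDR4, 64-bit bus, 2,400 MT/s, 8 ports.
sc1_ddr = ddr_bandwidth_gbs(64, 2400, 8)

print(f"PEZY-SC : {sc1_sp:.1f} / {sc1_dp:.1f} TFlops (SP/DP), DDR {sc1_ddr:.1f} GB/s")
print(f"PEZY-SC2: {sc2_sp:.1f} / {sc2_dp:.1f} TFlops (SP/DP)")
```

Under these assumptions the computed values round to the 3.0/1.5 TFlops, 8.2/4.1 TFlops and 153.6GB/s entries listed in the specification.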
PEZY-SC2 Block Diagram
[Figure: block diagram of PEZY-SC2. Eight prefectures (16 cities, 256 PEs each) and their LLC blocks are joined by crossbar buses (labelled "XbarBus 256bit x32 xbar", with 256bit x8 links) inside a block labelled "state"; around them are four TCI DRAM stacks (8GB, 512GB/s each), four DDR4 DIMM channels (25GB/s each), four PCIe x8 ports, the MIPS64R6 x 6 processor, and a path labelled "Uncached Access".]
Hierarchical Architecture
PEZY-SC2 (2,048PE), Prefecture (256PE), City (16PE), Village (4PE), PE
• PE
– Program Counter × 8
– L1 Instruction Cache 64bit × 512w (4KB)
– ALU 4 FP ops/cycle
– Register File 32bit × 512w (2KB)
– Local Storage 32bit × 5120w (20KB)
• Village (4PE)
– 4 PEs
– L1 Data Cache 2KB
• City (16PE)
– 4 Villages
– L2 Data Cache 64KB
– L2 Instruction Cache 32KB
– Special Function Unit
• Prefecture (256PE)
– 16 Cities
• PEZY-SC2 (2,048PE)
– 8 Prefectures
– 16 × LLC 2560KB
– 4 × TCI DRAM 512GB/s
– 6 × MIPS
– 4 × DDR4 DIMM 25GB/s
– 4 × PCIe Gen4 x8
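The hierarchy multiplies out to the chip-level totals quoted in the specification (2,048 PEs, 40MB of local storage, 40MB of LLC). A minimal Python sketch of that arithmetic follows. The `locate` function and its flat PE numbering are hypothetical: the slides do not say how PE IDs map onto the hierarchy, so the ordering used here is only one plausible assumption.

```python
from typing import NamedTuple

PES_PER_VILLAGE = 4
VILLAGES_PER_CITY = 4          # City (16PE) / Village (4PE)
CITIES_PER_PREFECTURE = 16     # Prefecture (256PE) / City (16PE)
PREFECTURES_PER_CHIP = 8       # PEZY-SC2 (2,048PE) / Prefecture (256PE)

TOTAL_PES = (PES_PER_VILLAGE * VILLAGES_PER_CITY
             * CITIES_PER_PREFECTURE * PREFECTURES_PER_CHIP)

LOCAL_STORAGE_KB_PER_PE = 20
LLC_BLOCKS, LLC_BLOCK_KB = 16, 2560

total_local_storage_mb = TOTAL_PES * LOCAL_STORAGE_KB_PER_PE // 1024
total_llc_mb = LLC_BLOCKS * LLC_BLOCK_KB // 1024


class PELocation(NamedTuple):
    prefecture: int
    city: int
    village: int
    pe: int


def locate(pe_id: int) -> PELocation:
    """Split a flat PE id into hierarchy coordinates (hypothetical numbering)."""
    if not 0 <= pe_id < TOTAL_PES:
        raise ValueError(f"pe_id must be in [0, {TOTAL_PES})")
    rest, pe = divmod(pe_id, PES_PER_VILLAGE)
    rest, village = divmod(rest, VILLAGES_PER_CITY)
    prefecture, city = divmod(rest, CITIES_PER_PREFECTURE)
    return PELocation(prefecture, city, village, pe)


print(TOTAL_PES, total_local_storage_mb, total_llc_mb)
print(locate(300))
```

The three totals agree with the "Num. of PE", "Scratch Pad" and "LLC" entries of the PEZY-SC2 specification.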
Die Plot
[Figure: die plot, 27172.32 (um) x 23695.200 (um)]
Processing Element
• 2-way superscalar, in-order issue / out-of-order completion
• 16-stage pipeline
• Fine-grain time-sliced multi-threading (like HEP, Sun Niagara)
• 8 hardware threads / PE (4 active threads, 4 inactive threads)
• Simplified data forwarding / pipeline control
• No hardware branch prediction mechanism
• L1/L2 cache coherence is NOT supported by hardware
[Figure: pipeline timing diagram. Thread 0 to Thread 3 enter the 16 pipeline stages (IF1 IF2 IF3 ID RA1 RA2 RA3 RA4 EX1 EX2 EX3 EX4 WB1 WB2 FB CB, with LA1 and LA2) on successive clock cycles; the l.chgthread and l.actthread instructions act on thread slots labelled TH0F to TH3F and TH0B to TH3B.]
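The fine-grain time-sliced scheme can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration, not a description of the actual hardware: it assumes the four active threads issue in strict round-robin order, one per clock, and that each issue slot holds a pair of threads (using the TH0F/TH0B style labels from the figure) between which a `chgthread` call toggles. The class and method names are hypothetical.

```python
class ProcessingElement:
    """Toy model: 4 issue slots, each holding a pair of hardware threads."""

    SLOTS = 4

    def __init__(self) -> None:
        # Which thread of each pair is currently active ('F' or 'B').
        # Assumption: all 'F' threads start out active.
        self.active = ["F"] * self.SLOTS
        self.cycle = 0

    def issue(self) -> str:
        """Return the thread that issues this clock (round-robin over slots)."""
        slot = self.cycle % self.SLOTS
        self.cycle += 1
        return f"TH{slot}{self.active[slot]}"

    def chgthread(self, slot: int) -> None:
        """Swap the active and inactive thread of one slot (assumed semantics)."""
        self.active[slot] = "B" if self.active[slot] == "F" else "F"


pe = ProcessingElement()
before = [pe.issue() for _ in range(4)]
pe.chgthread(1)                      # slot 1 switches to its other thread
after = [pe.issue() for _ in range(4)]
print(before)
print(after)
```

Because a different thread issues on every clock in this model, consecutive instructions from one thread are several cycles apart, which is consistent with the slide's points about simplified data forwarding and the absence of branch prediction.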
Instruction Set Architecture
• Original ISA
– Focused on scientific calculation, image processing, AI and deep learning
– Supports double-precision / single-precision / half-precision floating point
• Register File (per thread)
– Integer 64b x 32
– Floating point 64b x 32
• Multi-processor Support
– change thread (switch active / inactive thread)
– cache flush (each cache level)
– barrier synchronization (each hierarchy level)
Software Environment
• We provide PZCL, an OpenCL-like framework
– Develop both host-processor code and PEZY-SC2 code
– LLVM is used in the PZCL compiler
• Special functions for PEZY-SC2 control
– sync (barrier synchronization)
– flush (write back from the specified cache)
– get_pid, get_tid (get PE ID / thread ID)
– chgthread (switch active / inactive thread)
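To show how a PE ID and thread ID can be used to divide work, here is a plain Python emulation. It is not PZCL code: `pid` and `tid` merely stand in for the values that get_pid and get_tid are described as returning, and the global-index formula and strided loop are assumptions borrowed from common OpenCL-style practice rather than anything stated in the slides.

```python
NUM_PE = 2048          # PEs per PEZY-SC2
THREADS_PER_PE = 8     # hardware threads per PE


def my_indices(pid: int, tid: int, n: int) -> range:
    """Indices handled by hardware thread (pid, tid) in a strided loop.

    Assumption: threads are numbered pid * THREADS_PER_PE + tid.
    """
    gid = pid * THREADS_PER_PE + tid
    stride = NUM_PE * THREADS_PER_PE
    return range(gid, n, stride)


def saxpy_emulated(a: float, x: list, y: list) -> list:
    """Compute a*x + y, visiting elements the way each thread would."""
    out = list(y)
    for pid in range(NUM_PE):
        for tid in range(THREADS_PER_PE):
            for i in my_indices(pid, tid, len(x)):
                out[i] = a * x[i] + y[i]
    return out


print(saxpy_emulated(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

On real hardware every (pid, tid) pair would run concurrently, and a sync and flush would be needed before other PEs read the results, since cache coherence is not handled by hardware.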
ZettaScaler-2.0 1st system
System Overview
26 tanks, 832-node system (32 nodes / tank)
Model: ZettaScaler-2.0
Nodes: 832
Vendor: ExaScaler Inc.
Processor: Xeon D-1571
Speed: 1,300 MHz
Sockets per Node: 1
Cores per Socket: 16
Accelerator/CP: PEZY-SC2
Accelerators/CP per Node: 16
Cores per Accelerator/CP: 2,048
Operating System: Linux CentOS 7.3
Primary Interconnect: InfiniBand EDR
Memory per Node (GB): 1,088
GYOUKOU
Immersion Cooling Tanks
26 Tanks @ JAMSTEC
16 Bricks / Tank
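The system-level counts follow from the per-tank, per-brick and per-node figures given on these slides. The Python sketch below multiplies them out. The nominal double-precision figure is only a spec-sheet upper bound, assuming every PEZY-SC2 reaches the 4.1 TFlops listed in the specification; it is not a measured result, and the variable names are illustrative.

```python
TANKS = 26
BRICKS_PER_TANK = 16
NODES_PER_BRICK = 2          # "1 brick = 2 nodes"
SC2_PER_NODE = 16
PES_PER_SC2 = 2048
SC2_PEAK_DP_TFLOPS = 4.1     # from the PEZY-SC2 specification

nodes = TANKS * BRICKS_PER_TANK * NODES_PER_BRICK
chips = nodes * SC2_PER_NODE
pes = chips * PES_PER_SC2

# Assumption: every chip sustains its listed peak; real runs will be lower.
nominal_dp_pflops = chips * SC2_PEAK_DP_TFLOPS / 1000.0

print(f"nodes={nodes}, PEZY-SC2 chips={chips}, PEs={pes:,}")
print(f"nominal DP peak: {nominal_dp_pflops:.1f} PFlops")
```

The node count reproduces the 832 nodes of the system overview.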
System Network
[Figure: system network. Within each tank, bricks of SC2x16 nodes connect through their HCAs (H) to IB switches labelled "down 32, up 4"; the up links from all tanks go to a 648-port director switch.]
H: EDR HCA MCX-455AH
8-port InfiniBand EDR / tank
Tank Switch: Mellanox SB7790
Total 8 x 26 = 208-port connection using a 648-port director switch, Mellanox CS7500
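The port counts can be checked in a few lines. This Python sketch assumes that each node attaches two EDR links to the tank switches (the node slide lists 2 x InfiniBand EDR per node) and that the "down 32, up 4" label applies to each tank switch; both are readings of the figure rather than statements in the text.

```python
TANKS = 26
NODES_PER_TANK = 32
HCA_PORTS_PER_NODE = 2       # assumption: both EDR links go to tank switches
SWITCH_DOWN = 32             # down links per tank switch (figure label)
SWITCH_UP = 4                # up links per tank switch (figure label)
DIRECTOR_PORTS = 648

down_links_per_tank = NODES_PER_TANK * HCA_PORTS_PER_NODE
# Ceiling division: switches needed to terminate all down links in a tank.
switches_per_tank = -(-down_links_per_tank // SWITCH_DOWN)
uplinks_per_tank = switches_per_tank * SWITCH_UP
director_ports_used = TANKS * uplinks_per_tank
oversubscription = SWITCH_DOWN / SWITCH_UP

print(f"{switches_per_tank} switches/tank, {uplinks_per_tank} uplinks/tank")
print(f"{director_ports_used} of {DIRECTOR_PORTS} director ports used")
print(f"oversubscription {oversubscription:.0f}:1")
```

With these assumptions the model yields 8 uplinks per tank and 208 director ports, the same numbers as stated on the slide.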
Brick
1 brick = 2 nodes, 32 x PEZY-SC2
Ultimate High Density Implementation
• 1 Base Carrier Board
• 8 Sub Carrier Boards
• 32 PEZY-SC2 Module Cards
• 1 Dual-XeonD Module Card
• 4 InfiniBand EDR HCAs
Node Overview
Node
1 x Xeon D-1571 (16 cores, 1.3GHz)
16 x PEZY-SC2 (2,048 cores, 1GHz)
Multi-Layer PCIe Internal Network (Gen3 x16, 128Gbps + 128Gbps)
Inter-SC2 Ring Network (PCIe Gen4 x8, 128Gbps + 128Gbps)
2 x InfiniBand Inter-node Network (EDR 100Gbps)
[Figure: node block diagram. A Xeon D-1571 (16 cores / 32 threads, 1.3GHz, TB 2.1GHz) connects through six PLX PEX9797 PCI Express fabric switches (Gen3 x16 links, 128Gbps + 128Gbps) to 16 PEZY-SC2 chips in four groups of four and to InfiniBand EDR adapters (labelled "IB EDR 2CH", 100Gbps x 2); the PEZY-SC2 chips are also linked by Gen4 x8 connections (128Gbps + 128Gbps) through blocks labelled CN.]
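The "128Gbps + 128Gbps" labels are the raw per-direction rates of the two link types. A short Python check follows, using the standard PCIe per-lane signalling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding. The helper names are illustrative, and the payload figure ignores protocol overhead beyond line encoding.

```python
# Per-lane signalling rate in GT/s, per direction.
GT_PER_LANE = {3: 8.0, 4: 16.0}
# Gen3 and Gen4 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def raw_gbps(gen: int, lanes: int) -> float:
    """Raw signalling rate per direction in Gbps."""
    return GT_PER_LANE[gen] * lanes


def payload_gbytes_per_s(gen: int, lanes: int) -> float:
    """Approximate usable GB/s per direction after line encoding."""
    return raw_gbps(gen, lanes) * ENCODING_EFFICIENCY / 8


gen3_x16 = raw_gbps(3, 16)   # multi-layer internal network links
gen4_x8 = raw_gbps(4, 8)     # inter-SC2 ring links

print(f"Gen3 x16: {gen3_x16:.0f} Gbps raw, {payload_gbytes_per_s(3, 16):.2f} GB/s usable")
print(f"Gen4 x8 : {gen4_x8:.0f} Gbps raw, {payload_gbytes_per_s(4, 8):.2f} GB/s usable")
```

Both link types come out at the same raw rate, which is why the Gen3 x16 and Gen4 x8 links carry the same 128Gbps label.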
ExaFlops Computing (10^18 flops/s)
• Computational power ≈ a human brain
• Deep learning ≈ matrix algebra
– Single/double precision is not necessary
– Half precision is sufficient
• Exaflops machines will surpass human brains
→ relief of humans from boring work
• Full and real-time emulation of a human brain
→ studies of human brains
→ experimental philosophy, literature, theology
Conclusions
• Heterogeneous many-core processors:
– Next processor architecture for HPC
– Sunway SW26010
– PEZY-SC2
• GYOUKOU is based on PEZY-SC2
• Prospects of ExaFlops Computing
Computational power ≈ a human brain