Connect. Challenge. Inspire.
All Rights Reserved, Copyright© FUJITSU LIMITED 2015
ADAC Japan 2017
Jan 25th, 2017
Fujitsu processor history and future
Takumi Maruyama Senior Director AI Platform Division Advanced System Research & Development Unit
Agenda
K computer
Fujitsu processor development history
HPC
UNIX/Mainframe
Future development plan
Post K
AI processor: DLU
Summary
All Rights Reserved, Copyright 2017 FUJITSU LIMITED
RIKEN K computer
WR#1 10.51 PFlops (2011/11)
K computer “Still the best”
[Table: K computer rankings and awards by year, 2011 to 2016: 1. TOP500 List, 2. Gordon Bell Prize (including Finalist), 3. HPC Challenge Awards (HPL, RandomAccess, STREAM, FFT), 4. Graph500]
(Source: ISC 2015 Long term failure analysis of 10 petascale supercomputer, RIKEN)
K computer “Highly reliable”
The CPU failure rate of the K computer is about a quarter of that of Blue Waters.
Fujitsu technologies to realize K computer
High Performance Processor: 8 cores
Liquid Cooling
4 processors
6D torus network
864 racks
82,944 Compute nodes
5,184 IO nodes
High density rack: 24 boards
“K” (京) is the nickname RIKEN has used for the “Next-Generation Supercomputer” since July 2010.
©RIKEN
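The slide names a 6D torus network. As a minimal sketch of what "torus" means for addressing, the following computes the direct neighbors of a node when every axis wraps around. The function name and the shape used here are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from typing import List, Tuple


def torus_neighbors(coord: Tuple[int, ...], shape: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return the distinct direct neighbors of `coord` in a torus of the given shape.

    Each axis wraps around, so stepping off one edge re-enters at the other.
    On an axis of size 2 both directions reach the same node, so it is listed once.
    """
    neighbors: List[Tuple[int, ...]] = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % size  # wrap-around link
            candidate = tuple(moved)
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


# Hypothetical 6D shape, for illustration only.
shape = (4, 4, 4, 2, 3, 2)
origin = (0, 0, 0, 0, 0, 0)
nbrs = torus_neighbors(origin, shape)
print(len(nbrs), "neighbors of", origin)
```

Because of the wrap-around, no node sits on an "edge" of the machine: every node has the same number of links, which keeps communication patterns uniform across the system.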
Fujitsu processor development History - HPC - UNIX/GS
Fujitsu Processor development
Perpetual evolution, Always targeting No.1
Mainframe/UNIX/HPC + AI incremental development
[Figure: processor roadmap by period (2000~2003, 2004~2007, 2008~2011, 2012~2015, 2016~) and technology generation (350nm, 250nm / 220nm, 180nm, 130nm, 90nm, 65nm, 45nm, 40nm, 28nm, 20nm; milestone labels: Tr=1B, CMOS, Cu)
Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21 1600, GS21 2600, Next GS
SPARC64 processors (HPC, UNIX): SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 X, SPARC64 X+, SPARC64 XIfx, SPARC64 XII, Post-K ARM
AI: DLU
Performance features: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, Multi-core Multi-thread, HPC-ACE, System on Chip, Hardware Barrier, Virtual Machine Architecture, Software on Chip, High-speed Interconnect
Reliability features: $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, Error Checkers/History]
SPARC64™ VII
Architecture Features
• 4 cores x 2 threads (SMT)
• Embedded 6MB L2$
• 2.5GHz
• Jupiter Bus
Fujitsu 65nm CMOS
• 21.31mm x 20.86mm
• 600M transistors
• 456 signal pins
135 W (max)
• 44% power reduction per core from SPARC64™ VI
[Figure: die photo with labeled blocks: Core 1, Core 2, Core 3, Instruction Control, Execution Unit, L1 Cache Control, L1I Cache, L1D Cache, L2 Cache Data, L2 Cache Tag, L2 Cache Control, System Interface, System I/O]
SPARC64™ VII in HPC & UNIX server
[Figure: SPARC64™ VII die photo with the same block labels as on the previous slide]
[Figure: two system board diagrams
Supercomputer FX1: SPARC64™ VII, JSC, DDR2
SPARC Enterprise: four CPUs, SC and MC chips, DDR2
Link widths shown: 32B, 32B, 20B(in)+12B(out)]
The same processor was used in both HPC and UNIX servers; only the SC (system controller LSI) was different.
SPARC64™ VIIIfx Chip [HPC]
Architecture Features
• 8 cores
• HPC-ACE (128-bit SIMD)
• Shared 6 MB L2$
• Embedded Memory Controller
• 2 GHz
Fujitsu 45nm CMOS
• 22.7mm x 22.6mm
• 760M transistors
• 1271 signal pins
Performance (peak)
• 128GFlops
• 64GB/s memory throughput
Power
• 58W (TYP, 30℃)
• Water Cooling: Low leakage power and High reliability
[Figure: die photo with labeled blocks: Core0 to Core7, L2$ Data (x2), L2$ Control, MAC (x4), DDR3 interface (x2), HSIO]
SPARC64™ X+ Chip [UNIX]
Architecture Features
• 16 cores x 2 SMT threads
• Shared 24 MB L2$
• Memory and I/O Controllers
• HPC-ACE (128bit SIMD)
• SWoC (Software on Chip)
28nm CMOS
• 24.0mm x 25.0mm
• 2,990M transistors
• 1,500 signal pins
• 3.7GHz
Performance (peak)
• 473GFlops
• 102GB/s memory throughput
[Figure: die photo with labeled blocks: 16 cores, L2 Cache Data (x2), L2 Cache Control, MAC (x2), DDR3 Interface (x2), SERDES / Inter-CPU, SERDES / PCIe Gen3]
SPARC64™ XIfx Chip [HPC]
Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2 (256bit SIMD)
• 24 MB L2 cache
• HMC, Tofu2, PCI Gen3
20nm CMOS
• 3,750M transistors
• 1,001 signal pins
• 2.2GHz
Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2 (in/out)
• Tofu2 125GB/s x 2 (in/out)
[Figure: die photo with labeled blocks: 32 cores, 2 assistant cores, L2 cache (x2), MAC (x4), HMC interface (x2), Tofu2 interface, Tofu2 controller, PCI interface, PCI controller]
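The peak figures quoted on the three chip slides can be cross-checked with simple arithmetic: peak GFlops = cores x clock (GHz) x floating-point operations per cycle per core. The sketch below does that check. The `flops_per_cycle` values (8, 8 and 16) are an assumption chosen so that the products land on the quoted peaks; they are not stated on the slides. For SPARC64 XIfx only the 32 computing cores are counted.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Peak throughput in GFlops: cores x clock (GHz) x flops per cycle per core."""
    return cores * ghz * flops_per_cycle


# (cores, GHz, assumed flops/cycle/core, peak quoted on the slide in GFlops)
chips = {
    "SPARC64 VIIIfx": (8, 2.0, 8, 128.0),
    "SPARC64 X+": (16, 3.7, 8, 473.0),
    "SPARC64 XIfx": (32, 2.2, 16, 1100.0),
}

for name, (cores, ghz, fpc, quoted) in chips.items():
    estimate = peak_gflops(cores, ghz, fpc)
    print(f"{name}: estimated {estimate:.1f} GFlops, quoted {quoted:.0f} GFlops")
```

Under these assumptions the doubling of the per-core figure from 8 to 16 lines up with the move from 128bit SIMD (HPC-ACE) to 256bit SIMD (HPC-ACE2).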
HPC-ACE @SPARC64™ VIIIfx (High Performance Computing - Arithmetic Computational Extensions)
Fujitsu’s unique ISA extension to SPARC-V9 for HPC
• Large register sets
• 128bit SIMD
• Software controlled Cache
• FP Trigonometric Functions
• Conditional operation
• FP Reciprocal Approximation of Divide/Square-root
The SPARC architecture is tolerant of ISA enhancements.
[Figure: Register extension to SPARC-V9. INT Reg and FP Reg are each divided into V9 and Ext. parts; labels: Register Window, SIMD (basic), SIMD (extended); register counts shown: 32, 224, 32, 32, 160, 32]
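The feature list above includes FP reciprocal approximation for divide and square-root. A common way to use such an approximation is to refine it with Newton-Raphson steps, each of which roughly doubles the number of correct bits. The sketch below shows that generic technique; `approx_reciprocal` is a software stand-in written for this example and is not the HPC-ACE instruction, whose precision and semantics are not given on the slide.

```python
import math


def approx_reciprocal(a: float) -> float:
    """Low-precision estimate of 1/a for a > 0 (software stand-in for a hardware estimate).

    Writes a = m * 2**e with 0.5 <= m < 1 and uses the linear fit
    1/m ~= 48/17 - (32/17) * m, whose relative error is at most 1/17.
    """
    m, e = math.frexp(a)
    estimate = 48.0 / 17.0 - (32.0 / 17.0) * m
    return math.ldexp(estimate, -e)


def reciprocal(a: float, iterations: int = 4) -> float:
    """Refine the estimate with Newton-Raphson: x <- x * (2 - a * x)."""
    if a <= 0.0:
        raise ValueError("this sketch only handles positive inputs")
    x = approx_reciprocal(a)
    for _ in range(iterations):
        x = x * (2.0 - a * x)  # uses only multiply and subtract, no divide
    return x


for value in (3.0, 7.5, 1000.0):
    print(value, reciprocal(value), 1.0 / value)
```

The refinement step uses only multiplies and subtracts, so it maps well onto fused multiply-add pipelines and avoids a long-latency divide.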
Software on Chip @SPARC64™ X+
HW for SW
• ISA extensions to accelerate specific software functions with HW
The targets
• Decimal operation (IEEE754 decimal and NUMBER)
• Cypher operation (AES/DES)
• Database acceleration
HW implementation
• The HW engines for SWoC are implemented in the FPU
  • To fully utilize 128 FP registers & software pipelining
• Implemented as instructions rather than as a dedicated co-processor, to maximize flexibility of SW
• Avoid complication due to “CISC” type instructions
  • Various “RISC” type instructions are newly defined instead
  • 18 insts. for Decimal, and 10 insts. for Cypher operation
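Decimal operation is one of the SWoC targets. As a small illustration of why decimal arithmetic is treated separately from binary floating point, the sketch below compares the two using Python's software `decimal` module. It only illustrates the class of operation; it does not use or model the SWoC instructions.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum differs from the binary value nearest to 0.3.
binary_sum = 0.1 + 0.2

# Decimal arithmetic keeps base-10 digits exactly, as business and
# database workloads expect.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print("binary :", binary_sum, binary_sum == 0.3)
print("decimal:", decimal_sum, decimal_sum == Decimal("0.3"))
```

Software decimal arithmetic like this is much slower than binary floating point, which is the kind of gap that dedicated decimal instructions are meant to close.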
SPARC64™ VII Pipeline [HPC/UNIX]
Pipeline stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit
[Figure: pipeline block diagram, 4-core. Labeled resources:]
• L1 I$: 64KB, 2Way
• Branch Target Address: 8Kentry
• Decode & Issue
• Reservation stations: RSE 8x2Entry, RSA 10Entry, RSF 8x2Entry, RSBR 10Entry
• GUB: 32Registers; GPR: 156Registers x2
• FPR: 64Registers x2; FUB: 48Registers
• Execution units: EXA, EXB, EAGA, EAGB, FLA, FLB
• Fetch Port: 16Entry; Store Port: 16Entry; Store Buffer: 16Entry
• L1 D$: 64KB, 2Way
• CSE: 64Entry
• PC x2; Control Registers x2
• L2$: 6MB/12MB, 12Way
• System Bus Interface
SPARC64™ VIIIfx Pipeline [HPC]
Pipeline stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit
[Figure: pipeline block diagram, 8-core. Labeled resources:]
• L1 I$: 32KB, 2Way
• Branch Target Address: 1Kentry, 2Way
• Decode & Issue
• Reservation stations: RSE 10Entry, RSA 10Entry, RSF 8x2Entry, RSBR 6Entry
• GUB: 32Registers; GPR: 188Registers
• FPR: 256Registers; FUB: 48x2Registers
• Execution units: EXA, EXB, EAGA, EAGB, FLA, FLB, FLC, FLD
• Fetch Port: 20Entry; Store Port: 8Entry; Write Buffer: 5Entry
• L1 D$: 32KB, 2Way
• CSE: 48Entry
• PC; Control Registers
• L2$: 6MB, 12Way
• Memory Controller; DIMM
SPARC64™ X Pipeline [UNIX]
Pipeline stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit
[Figure: pipeline block diagram, 16-core. Labeled resources:]
• L1 I$: 64KB, 4Way
• Branch Target Address: 4Kentry
• Pattern History Table: 16Kentry
• Decode & Issue
• Reservation stations: RSE 24Entry, RSA 24Entry, RSF 20Entry, RSBR 16Entry
• GUB: 64Registers; GPR: 156Registers (shown twice)
• FPR: 128Registers (shown twice); FUB: 64Registers
• Execution units: EXA, EXB, EXC, EXD, EAGA, EAGB, FLA, FLB, FLC, FLD, Decimal, Cypher
• Fetch Port: 32Entry; Store Port: 24Entry; Write Buffer: 10Entry
• L1 D$: 64KB, 4Way
• CSE: 96Entry
• PC (shown twice); Control Registers (shown twice)
• L2$: 24MB, 24Way
• Memory Controller; DIMM
• IO Controller: PCI-GEN3
• Router: CPU-CPU I/F
SPARC64™ XIfx Pipeline [HPC]
Pipeline stages: Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit
[Figure: pipeline block diagram, 34 cores. Labeled resources:]
• L1 I$: 64KB, 4ways
• Branch Target Address; Pattern History Table; Local Pattern Table
• Decode & Issue
• Reservation stations: RSE, RSA, RSF, RSBR
• GUB; GPR: 188Registers
• FPR: 128x4 Reg.; FUB
• Execution units: EXA, EXB, EXC, EXD, EAGA, EAGB, FLA, FLB (FLA/FLB labels repeated)
• Fetch Port; Store Port; Write Buffer
• L1 D$: 64KB, 4Way
• CSE; PC; Control Registers
• L2$
• MAC (HMC Controller); HMC
• PCI Controller: PCI-GEN3
• Tofu 2 controller: CPU-CPU I/F
Fujitsu’s processor design approach
Fully utilize the latest semiconductor technology
Enhance/change ISA (Instruction Set Architecture) to meet requirements: HPC-ACE, SWoC
Shared micro-Architecture across HPC/UNIX/GS
Perpetual evolution over generations
Future Fujitsu processor development - Post K - AI Processor (DLU™)
Post-K Goals and Approaches
Post-K Goals
High application performance and good power efficiency
Keeping application compatibility while advancing from predecessors
Good usability and better accessibility for users
Our Approaches
Developing high performance and scalable, custom CPU cores
【Performance】 Wider SIMD & high memory BW, mathematical acc. Primitives
【Scalability】 Scalable many core, zero OS jitter (assistant core)
【Power efficiency】 The best device tech, power control functions, optimal resources
Maintaining performance balance and supporting advanced features
• High memory BW, “Tofu” interconnect, and RIKEN advanced system software
Adopting ARM standard architecture
• Co-operation with ARM/Linux community and utilization of open source software
• Getting involved in the ARM HPC ecosystem
Post-K CPU ISA: ARM SVE
FUJITSU, as a lead partner in ARM SVE development, contributes to the specification of ARM SVE (Scalable Vector Extension) for application performance
FUJITSU ARM core incorporates FUJITSU’s proven supercomputer microarchitecture
ARM SVE, plus optional functions and Tofu, maintains programming models and performance balance
Post-K complies with ARM’s standard frameworks (SBSA, etc.) for compatibility among platforms
Functions for Perf.    Post-K      FX100    FX10     K computer
SVE incorporated
  SIMD                 512bit      256bit   128bit   128bit
  FMA4                 ✔           ✔        ✔        ✔
  Math. acc. prim.*    ✔Enhanced   ✔        ✔        ✔
Optional functions
  Inter-core barrier   ✔           ✔        ✔        ✔
  Sector cache         ✔Enhanced   ✔        ✔        ✔
  Prefetch modes       ✔Enhanced   ✔        ✔        ✔
Interconnect Tofu      ✔Enhanced   ✔        ✔        ✔
*Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...
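SVE stands for Scalable Vector Extension: code is written without fixing the vector width, and a per-lane predicate handles the loop tail. The sketch below is a conceptual model of that style in plain Python, run at the three SIMD widths from the table. The function name `vla_axpy` and its parameters are hypothetical; this is not SVE assembly or intrinsics.

```python
from typing import List


def vla_axpy(a: float, x: List[float], y: List[float], vector_bits: int) -> List[float]:
    """Compute a*x + y one 'vector' at a time without hard-coding the vector width.

    Lanes hold 64-bit doubles, so a 512bit vector has 8 lanes. A per-lane
    predicate switches off lanes that would run past the end of the arrays.
    """
    lanes = vector_bits // 64
    out = list(y)
    n = len(x)
    i = 0
    while i < n:
        predicate = [i + lane < n for lane in range(lanes)]  # active lanes
        for lane, active in enumerate(predicate):
            if active:
                out[i + lane] = a * x[i + lane] + y[i + lane]
        i += lanes
    return out


x = [float(v) for v in range(1, 12)]  # 11 elements: not a multiple of any lane count
y = [1.0] * len(x)
for bits in (128, 256, 512):
    print(bits, vla_axpy(2.0, x, y, bits))
```

The same loop body gives the same result at every width, which is the property that lets one binary run on implementations with different vector lengths.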
DLU™ Goals
DLU™ (Deep Learning Unit), FY2018~
DLU™ features
• Architecture designed for Deep Learning
• High performance HBM2 memory
• Low power design
➔ Goal: 10x Performance/Watt compared to others
Supercomputer K technologies
• Massively parallel: Apply the supercomputer interconnect technology
➔ Ability to handle large scale neural networks
(The photo is for illustration only and differs from the actual product.)
DLU™ architecture
DLU™ (Deep Learning Unit)
• New ISA for Deep learning
• High density FMA
[Figure: DLU block diagram: Host I/F, HBM2, and DPU-0, DPU-1, ... DPU-n connected by an on-chip network; each DPU contains multiple DPEs. Large scale DLU interconnect through off-chip network (Fujitsu’s interconnect technology).]
DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element
ISA: Newly developed for Deep learning
Micro-Architecture
• Simple pipeline to remove HW complexity
• On chip network to share data between DPUs
Utilize Fujitsu’s HPC experience such as high density FMA and high speed interconnect
➔ Maximize performance (throughput) / watt
DLU™ Future Roadmap
Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors.
[Figure: roadmap of performance per watt (×10) over time; time axis labels: FY2018, FY2021]
Fujitsu
• K computer processor development technology
• Architecture optimized for DL
• Large scale network
The 1st Generation
• Accelerator
• Needs separate Host CPU
The 2nd Generation
• Host CPU embedded
• Inter-DLU direct connection
• Large scale neural network
Future
• Neuro computing
• Combinatorial optimization architecture
* Subject to change without notice
New Arch. Rivals Quantum Computers
New Architecture
• Architecture designed for combinatorial optimization problems
• Uses a basic optimization circuit, based on digital circuitry and conventional semiconductor technology
• Hierarchical structure for optimal data movement and parallel calculation
Acceleration Technology
• Calculates multiple candidates at once, parallel execution
• Detects and escapes from the local minimum states by adding score
[Figure: New architecture; Acceleration using basic optimization circuit]
http://www.fujitsu.com/global/about/resources/news/press-releases/2016/1020-02.html
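The two acceleration ideas named on this slide, evaluating multiple candidates at once and adding a score to escape local minima, can be sketched as an annealing-style search over binary variables. This is a minimal illustration under stated assumptions, not Fujitsu's circuit or algorithm: the energy function, the matrix `Q`, the `offset_step` parameter and the acceptance rule are all chosen for this example.

```python
import math
import random
from typing import List, Sequence, Tuple


def energy(bits: Sequence[int], q: Sequence[Sequence[float]]) -> float:
    """Quadratic cost of a bit assignment: sum over i, j of q[i][j] * b_i * b_j."""
    n = len(bits)
    return sum(q[i][j] * bits[i] * bits[j] for i in range(n) for j in range(n))


def anneal(q: Sequence[Sequence[float]], steps: int = 2000, temperature: float = 1.0,
           offset_step: float = 0.5, seed: int = 0) -> Tuple[List[int], float]:
    """Search for a low-energy bit assignment; returns (best_state, best_energy)."""
    rng = random.Random(seed)
    n = len(q)
    state = [rng.randint(0, 1) for _ in range(n)]
    current = energy(state, q)
    best_state, best = list(state), current
    offset = 0.0  # the "added score": grows while the search is stuck
    for _ in range(steps):
        # Evaluate every single-bit flip as a candidate (done in parallel in hardware).
        accepted = []
        for k in range(n):
            trial = list(state)
            trial[k] ^= 1
            delta = energy(trial, q) - current
            biased = delta - offset
            if biased <= 0 or rng.random() < math.exp(-biased / temperature):
                accepted.append((k, delta))
        if not accepted:
            offset += offset_step  # stuck in a local minimum: make moves easier
            continue
        k, delta = rng.choice(accepted)
        state[k] ^= 1
        current += delta
        offset = 0.0
        if current < best:
            best_state, best = list(state), current
    return best_state, best


# Small hypothetical problem with integer weights.
Q = [
    [-1.0, 2.0, 0.0, 0.0],
    [0.0, -1.0, 2.0, 0.0],
    [0.0, 0.0, -1.0, 2.0],
    [0.0, 0.0, 0.0, -1.0],
]
best_bits, best_value = anneal(Q, seed=1)
print(best_bits, best_value)
```

Checking all single-bit flips in each step is the software analogue of calculating multiple candidates at once; the growing offset is one way to model the added score that lets the search leave a local minimum.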
Summary
[Figure: FJ Processors timeline
HPC: FX1 (2008, SPARC64™ VII, 65nm, 4core); K Computer (2010, SPARC64™ VIIIfx, 45nm, 8core); FX10 (2012, SPARC64™ IXfx, 40nm, 16core); FX100 (2015, SPARC64™ XIfx, 20nm, 34core); Post “京” (Post-K)
UNIX: SPARC Enterprise (SPARC64™ VII, 65nm, 4core); SPARC Enterprise (2011, SPARC64™ VII+, 65nm, 4core); M10 (2013, SPARC64™ X, 28nm, 16core); SPARC64™ XII
Mainframe: GS21 model 2600 (2014, 8core); Next GS
AI: DLU]
Utilization of the latest Semiconductor technologies
Evolution to meet different requirements between HPC/UNIX/Mainframe/AI
An example of different requirements: Single thread performance vs Throughput
[Figure: scatter plot of GFlops (log scale, 10 to 10,000; Throughput) versus GHz (0 to 4; Single thread performance [Amdahl’s law]). Points: SPARC64™ VII, SPARC64™ VIIIfx, SPARC64™ IXfx, SPARC64™ X+, SPARC64™ XIfx, DLU. Regions: UNIX processor, HPC processor, AI processor.]
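The figure ties single thread performance to Amdahl’s law: the speedup from n parallel workers is 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A minimal sketch follows; the function name and the sample fractions are illustrative and are not taken from the slide.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for p in (0.5, 0.95, 0.999):
    for n in (8, 32, 1024):
        print(f"p={p}, n={n}: speedup {amdahl_speedup(p, n):.1f}x")
```

When a workload has a noticeable serial part, adding cores stops helping and a faster single thread matters more; when nearly all of the work is parallel, throughput-oriented designs with many cores keep scaling.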
Summary
Fujitsu has successfully designed various processors for decades.
Fujitsu’s processor design wins have come from:
• Instruction set architecture (ISA) enhancements
• Shared micro-architecture with perpetual evolution over generations
• Semiconductor technology improvements
Fujitsu will take a similar design approach for Post-K and DLU.
• ISA and micro-architecture are becoming more important due to the limitation of Moore’s law.
Fujitsu will continue to develop processors to meet the needs of a new era.