TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data
Satoshi Matsuoka
Professor, Global Scientific Information and Computing (GSIC) Center, Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
Rakuten Technology Conference 2013, 2013/10/26, Tokyo, Japan
Supercomputers from the Past
Fast, Big, Special, Inefficient, Evil device to conquer the world…
Let us go back to the mid ’70s: birth of “microcomputers” and arrival of commodity computing (start of my career)
• Commodity 8-bit CPUs: Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, …
• Led to hobbyist computing
  – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, …
  – System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, …
• And led to early personal computers
  – Commodore PET, Tandy TRS-80, Apple II
  – (in Japan): Hitachi Basic Master, NEC CompoBS / PC-8001, Fujitsu FM-8, …
Supercomputing vs. Personal Computing in the late 1970s
• Hitachi Basic Master (1978) – “The first PC in Japan”
  – Motorola 6802 @ 1 MHz, 16 KB ROM, 16 KB RAM
  – Linpack in BASIC: approx. 70-80 FLOPS (about 1/1,000,000 of a Cray-1)
• We got “simulation” done (in assembly language)
  – Nintendo NES (1982): MOS Technology 6502 @ 1 MHz (same as the Apple II)
  – “Pinball” by Matsuoka & Iwata (now CEO of Nintendo): realtime dynamics + collision + lots of shortcuts, averaging ~a few KFLOPS
• Cf. Cray-1 (1976) running Linpack: 80-90 MFlops (est.)
Then things got accelerated around the mid ’80s to mid ’90s (rapid commoditization towards what we use now)
• PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast-FP x86), Motorola 68000/020/030/040, … to today's Xeons, GPUs, Xeon Phis
  – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, …
• Storage evolution: cassettes and floppies to HDDs, optical disks, and now Flash
• Network evolution: RS-232C to Ethernet, now to FDR InfiniBand
• PC (incl. I/O): IBM PC “clones” and Macintoshes; ISA to VLB to PCIe
• Software evolution: CP/M to MS-DOS to Windows, Linux, …
• WAN evolution: RS-232 + modem + BBS, to modem + Internet, to ISDN/ADSL/FTTH broadband, DWDM backbones, LTE, …
• Internet evolution: email + ftp to the Web, Java, Ruby, …
• Then clusters, Grids/Clouds, 3-D gaming, and the Top500 all started in the mid ’90s(!) and commoditized supercomputing
Modern Day Supercomputers
Now supercomputers “look like” IDC servers
High-end COTS components dominate
Linux-based machines with a standard + HPC OSS software stack
[Timeline figure, 1957-2012: “Reclaimed No.1 Supercomputer Rank in the World”]
Top Supercomputers vs. Global IDC
DARPA study: 2020 Exaflop (10^18), 100 million to 1 billion cores
K Computer (#1 2011-12), RIKEN AICS: Fujitsu SPARC64 VIIIfx (Venus) CPU, 88,000 nodes, 800,000 CPU cores, ~11 Petaflops (~10^16), 1.4 Petabytes memory, 13 MW power, 864 racks, 3,000 m²
C.f. Amazon ~= 450,000 nodes, ~3 million cores
#1 2012: IBM BlueGene/Q “Sequoia”, Lawrence Livermore National Lab: IBM PowerPC system-on-chip, 98,000 nodes, 1.57 million cores, ~20 Petaflops, 1.6 Petabytes memory, 8 MW, 96 racks
Tianhe-2 (#1 2013), Guangzhou, China: 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon, 18,000 nodes, >3 million CPU cores, 54 Petaflops, 0.8 Petabyte memory, 20 MW power, ??? racks, ??? m²
Scalability and Massive Parallelism
• More nodes & cores ⇒ massive increase in parallelism (CPU cores ~= parallelism)
• Faster, “bigger” simulation ⇒ qualitative difference (GOOD!)
• But ideal linear scaling is difficult to achieve: limitations in power, cost, reliability, and scaling (BAD!)
TSUBAME2.0
2006: TSUBAME1.0 as No.1 in Japan — exceeding all university centers COMBINED (45 TeraFlops)
Total 85 TeraFlops, #7 on the Top500, June 2006
C.f. Earth Simulator: 40 TeraFlops, #1 2002-2004
TSUBAME2.0, Nov. 1, 2010: “The Greenest Production Supercomputer in the World”
TSUBAME 2.0 New Development (32 nm / 40 nm process)
[Figure: >400 GB/s mem BW, 80 Gbps NW BW, ~1 kW max; >1.6 TB/s mem BW, >12 TB/s mem BW, 35 kW max; >600 TB/s mem BW, 220 Tbps NW bisection BW, 1.4 MW max]
Performance Comparison of CPU vs. GPU
[Chart: peak performance (GFLOPS) and memory bandwidth (GB/s), GPU vs. CPU]
x5-6 socket-to-socket advantage in both compute and memory bandwidth, at the same power (200 W GPU vs. 200 W CPU + memory + NW + …)
TSUBAME2.0 Compute Node (Thin Node)
HP SL390G7 (developed for TSUBAME 2.0), productized as HP ProLiant SL390s
GPU: NVIDIA Fermi M2050 x 3 (515 GFlops, 3 GB memory per GPU)
CPU: Intel Westmere-EP 2.93 GHz x 2 (12 cores/node)
Multi I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) — 3 GPUs + 2 IB QDR
Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB x 2 or 120 GB x 2
InfiniBand QDR x 2 (80 Gbps)
Per node: 1.6 TFlops, 400 GB/s memory BW, 80 Gbps network, ~1 kW max
Full system: 2.4 PFlops total, ~100 TB memory, ~200 TB SSD
TSUBAME2.0 Storage Overview — 11 PB total (7 PB HDD, 4 PB tape)
• Parallel file system volumes (Lustre on DDN SFA10k units): “Global Work Space” #1-#3 (/work0, /work9, /work19) and “Scratch” (/gscr0), 3.6 PB at 30-60 GB/s — concurrent parallel I/O (e.g. MPI-IO) and read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Home volumes (GPFS with HSM, GPFS #1-#4 on SFA10k): 1.2 PB, served via cNFS / clustered Samba with GPFS and via NFS/CIFS/iSCSI by BlueARC — home storage for compute nodes, cloud-based campus storage services, system applications, iSCSI
• HSM / long-term backup: 2.4 PB HDD + ~4 PB tape
• Node-local SSDs (“thin node SSD”, “fat/medium node SSD”) for scratch: 130 TB (⇒ 500 TB-1 PB), 250 TB aggregate, 300-500 GB/s — fine-grained R/W I/O (checkpoints, temporary files, Big Data processing)
• HPCI / Grid storage: data transfer service between SCs/CCs
• Connected over an InfiniBand QDR network for LNET and other services (QDR IB x4 × 20, QDR IB x4 × 8, 10 GbE × 2)
3,500 fiber cables, >100 km, with DFB silicon photonics; end-to-end 7.5 GB/s, >2 µs; non-blocking 200 Tbps bisection
2010: TSUBAME2.0 as No.1 in Japan — > all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops)
Total 2.4 Petaflops, #4 on the Top500, Nov. 2010
“Greenest Production Supercomputer in the World” — the Green500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010)
TSUBAME wins awards…
3 times more power efficient than a laptop!
ACM Gordon Bell Prize 2011: 2.0 Petaflops dendrite simulation
Special Achievements in Scalability and Time-to-Solution — “Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”
Commendation for Science & Technology by the Minister of Education, Culture, Sports, Science and Technology, 2012 (文部科学大臣表彰)
Prize for Science & Technology, Development Category: Development of the Greenest Production Peta-scale Supercomputer — Satoshi Matsuoka, Toshio Endo, Takayuki Aoki
Precise blood-flow simulation of an artery on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy)
Personal CT scan + simulation ⇒ accurate diagnostics of cardiac illness
• 5 billion red blood cells + 10 billion degrees of freedom
• MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.)
• Combined Lattice-Boltzmann (LB) simulation for plasma, coupled with Molecular Dynamics (MD) for red blood cells
• Realistic geometry (from CAT scan)
• Two levels of parallelism: CUDA (on GPU) + MPI
• 1 billion mesh nodes for the LB component, 100 million RBCs
• Red blood cells (RBCs) are represented as ellipsoidal particles
• Fluid: blood plasma (Lattice Boltzmann); body: red blood cells (extended MD)
• The irregular mesh is partitioned with PT-SCOTCH, taking the cutoff distance into account
• ACM Gordon Bell Prize 2011 Honorable Mention: 4,000 GPUs, 0.6 Petaflops
Lattice-Boltzmann LES with coherent-structure SGS model [Onodera & Aoki 2013]
• The model parameter is locally determined from the second invariant of the velocity-gradient tensor (Q) and the energy dissipation (ε)
• Suited to turbulent flow around complex objects and to large-scale parallel computation
• Coherent-structure Smagorinsky model (see the sketch below)
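As a reference, one common way the coherent-structure Smagorinsky closure is written (a sketch of the standard formulation; the authors' exact constants and notation may differ):

$$ \nu_{\mathrm{SGS}} = C\,\Delta^{2}\,\lvert\bar{S}\rvert,\qquad C = C_{\mathrm{CSM}}\,\lvert F_{CS}\rvert^{3/2},\qquad F_{CS} = \frac{Q}{E},\qquad Q = \tfrac12\left(W_{ij}W_{ij} - S_{ij}S_{ij}\right),\quad E = \tfrac12\left(W_{ij}W_{ij} + S_{ij}S_{ij}\right) $$

Since |Q| ≤ E, the coherent-structure function satisfies |F_CS| ≤ 1, so the model parameter stays bounded locally without any averaging — which is what makes it attractive for complex geometries and large-scale parallel runs.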
Computational Area – Entire Downtown Tokyo
Major part of Tokyo, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku; 10 km × 10 km
Building data: Pasco Co. Ltd. TDM 3D
Achieved 0.592 Petaflops using over 4,000 GPUs (15% efficiency)
Map ©2012 Google, ZENRIN (Shinjuku, Shinagawa, Shibuya, central Tokyo)
Area Around the Metropolitan Government Building
Flow profile at 25 m height above ground; domain 640 m × 960 m; prevailing wind direction shown
Map data ©2012 Google, ZENRIN
ASUCA Typhoon Simulation on TSUBAME2.0: 500 m resolution, 4792 × 4696 × 48 grid, 437 GPUs (×1000 the resolution of current practice)
C.f. current weather forecasts: 5 km resolution (inaccurate cloud simulation)
CFD analysis over a car body
Calculation conditions:
• Number of grid points: 3,623,878,656 (3,072 × 1,536 × 768)
• Grid resolution: 4.2 mm (13 m × 6.5 m × 3.25 m domain)
• Number of GPUs: 288 (96 nodes)
• Inflow: 60 km/h
LBM, DrivAer reference geometry (BMW-Audi), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München
3,000 × 1,500 × 1,500 grid; Re = 1,000,000
Industry program: TOTO Inc. — TSUBAME (150 GPUs) vs. in-house cluster
Accelerating in-silico screening and data mining
Drug discovery with Astellas Pharma for drugs against tropical diseases such as dengue fever
100-million-atom MD simulation — M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)
Mixed-precision Amber on TSUBAME2.0 for industrial drug discovery: x10 faster with mixed precision, 75% energy efficient (nucleosome, 25,095 particles)
$500 million to $1 billion development cost per drug — even a 5-10% improvement of the process will more than pay for TSUBAME
Towards TSUBAME 3.0: Interim Upgrade of TSUBAME2.0 to 2.5 (early fall 2013)
• Upgrade the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X
• SFP/DFP peak from 4.8 PF / 2.4 PF ⇒ 17 PF / 5.7 PF (c.f. the K Computer: 11.2 / 11.2)
• Acceleration of important apps: considerable improvement (summer 2013)
• Fermi GPUs replaced: 3 × 1,408 = 4,224 GPUs
• Significant capacity improvement at low cost and without a power increase
• TSUBAME3.0 in 2H2015

TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade
HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5), productized as HP ProLiant SL390s, modified for TSUBAME2.5
GPU: NVIDIA Kepler K20X x 3 (1,310 GFlops DFP, 6 GB memory per GPU); NVIDIA Fermi M2050: 1,039/515 GFlops ⇒ NVIDIA Kepler K20X: 3,950/1,310 GFlops (SFP/DFP)
CPU: Intel Westmere-EP 2.93 GHz x 2
Multi I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) — 3 GPUs + 2 IB QDR
Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB x 2, 120 GB x 2
InfiniBand QDR x 2 (80 Gbps)
Peak node performance: 4.08 TFlops, ~800 GB/s memory BW, 80 Gbps network, ~1 kW max
2013: TSUBAME2.5 No.1 in Japan in single-precision FP, 17 Petaflops
~= the K Computer (11.4 Petaflops SFP/DFP)
Total 17.1 Petaflops SFP, 5.76 Petaflops DFP
> all university centers COMBINED (9 Petaflops SFP)
TSUBAME2.0 ⇒ TSUBAME2.5 (thin node × 1,408 units)
• Node machine: HP ProLiant SL390s (no change)
• CPU: Intel Xeon X5670 (6-core 2.93 GHz, Westmere) × 2 (no change)
• GPU: NVIDIA Tesla M2050 × 3, 448 CUDA cores (Fermi), SFP 1.03 TFlops / DFP 0.515 TFlops, 3 GiB GDDR5, 150 GB/s peak, ~90 GB/s STREAM memory BW ⇒ NVIDIA Tesla K20X × 3, 2,688 CUDA cores (Kepler), SFP 3.95 TFlops / DFP 1.31 TFlops, 6 GiB GDDR5, 250 GB/s peak, ~180 GB/s STREAM memory BW
• Node performance (incl. CPU turbo boost): SFP 3.40 TFlops / DFP 1.70 TFlops, ~500 GB/s peak, ~300 GB/s STREAM ⇒ SFP 12.2 TFlops / DFP 4.08 TFlops, ~800 GB/s peak, ~570 GB/s STREAM
• Total system performance: SFP 4.80 PFlops / DFP 2.40 PFlops, peak ~0.70 PB/s, STREAM ~0.440 PB/s memory BW ⇒ SFP 17.1 PFlops (×3.6) / DFP 5.76 PFlops (×2.4), peak ~1.16 PB/s, STREAM ~0.804 PB/s memory BW (×1.8)
Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.]
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution
• Weak scaling on TSUBAME (single precision); mesh size per GPU + 4 CPU cores: 4096 × 162 × 130
  – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 × 6,480 × 13,000 mesh
  – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 × 5,022 × 16,640 mesh
• Goal: developing lightweight strengthening materials by controlling microstructure — towards a low-carbon society
Peta-scale stencil application: a large-scale LES wind simulation using the Lattice Boltzmann Method [Onodera, Aoki et al.]
[Chart: performance (TFlops) vs. number of GPUs, weak scalability in single precision (N = 192 × 256 × 256 per GPU), overlap versions on TSUBAME 2.0 and 2.5]
• TSUBAME 2.5: 1,142 TFlops on 3,968 GPUs (288 GFlops/GPU)
• TSUBAME 2.0: 149 TFlops on 1,000 GPUs (149 GFlops/GPU) — ×1.93 per GPU
• 10,080 × 10,240 × 512 grid on 4,032 GPUs
• Large-scale wind simulation for a 10 km × 10 km area in metropolitan Tokyo
The above peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012.
• An LES wind simulation of a 10 km × 10 km area at 1 m resolution had never been done before anywhere in the world.
• We achieved 1.14 PFLOPS using 3,968 GPUs on the TSUBAME 2.5 supercomputer.
AMBER pmemd benchmark (nucleosome = 25,095 atoms), ns/day on TSUBAME2.0 (M2050) vs. TSUBAME2.5 (K20X) — Dr. Sekijima, Tokyo Tech
[Chart: CPU-only MPI runs (1, 2, 4 nodes; 12 cores/node) and GPU runs with 1, 2, 4, 8 × M2050 or K20X; e.g. 3.44 ns/day on 8 × M2050 vs. 11.39 ns/day on 8 × K20X]
Application performance, TSUBAME2.0 ⇒ TSUBAME2.5 (boost ratio):
• Top500 / Linpack (PFlops): 1.192 ⇒ 2.843 (×2.39)
• Green500 / Linpack (GFlops/W): 0.958 ⇒ >2.400 (×>2.50)
• Semi-definite programming, nonlinear optimization (PFlops): 1.019 ⇒ 1.713 (×1.68)
• Gordon Bell dendrite stencil (PFlops): 2.000 ⇒ 3.444 (×1.72)
• LBM LES whole-city airflow (PFlops): 0.600 ⇒ 1.142 (×1.90)
• Amber 12 pmemd, 4 nodes / 8 GPUs (ns/day): 3.44 ⇒ 11.39 (×3.31)
• GHOSTM genome homology search (sec): 19,361 ⇒ 10,785 (×1.80)
• MEGADOC protein docking (vs. 1 CPU core): 37.11 ⇒ 83.49 (×2.25)
TSUBAME Evolution: Towards Exascale and Extreme Big Data
Awards: Graph500 No. 3 (2011)
[Roadmap: TSUBAME2.5, 5.7 PF, Phase 1 fast I/O (250 TB, 300 GB/s, 30 PB/day) ⇒ TSUBAME3.0, 25-30 PF, 2015H2, 1 TB/s ⇒ Phase 2 fast I/O (5-10 PB, 10 TB/s, >100 million IOPS, 1 ExaB/day)]
DoE Exascale Parameters: ×1000 power efficiency in 10 years
• System peak: 2 PetaFlops (2010) ⇒ 100-200 PetaFlops (2015) ⇒ 1 ExaFlop (2020)
• Power: Jaguar 6 MW / TSUBAME2.0 1.3 MW ⇒ 15 MW ⇒ 20 MW
• System memory: 0.3 PB / 0.1 PB ⇒ 5 PB ⇒ 32-64 PB
• Node performance: 125 GF / 1.6 TF ⇒ 0.5 TF or 7 TF ⇒ 1 TF or 10 TF
• Node memory BW: 25 GB/s / 0.5 TB/s ⇒ 0.1 TB/s or 1 TB/s ⇒ 0.4 TB/s or 4 TB/s
• Node concurrency: 12 / O(1000) ⇒ O(100) or O(1000) ⇒ O(1000) or O(10000)
• # Nodes: 18,700 / 1,442 ⇒ 50,000 or 5,000 ⇒ 1 million or 100,000
• Total node interconnect BW: 1.5 GB/s / 8 GB/s ⇒ 20 GB/s ⇒ 200 GB/s
• MTTI: O(days) ⇒ O(1 day) ⇒ O(1 day)
(2010 values: Jaguar / TSUBAME2.0; the paired values in later columns reflect alternative node design points) ⇒ roughly a billion cores
Challenges of Exascale (FLOPS, bytes, …) (10^18)! Various physical limitations surface all at once
• # CPU cores: ~1 billion, at low power — c.f. the total # of smartphones sold globally = 400 million
• # Nodes: 100K to several million — c.f. the K Computer ~100K, Google ~1 million
• Memory: hundreds of PB to ExaB — c.f. total memory of all PCs (300 million) shipped globally in 2011 ~ 1 ExaB (BTW 2^64 ~= 1.8×10^19 = 18 ExaB)
• Storage: several ExaB — c.f. Google storage ~2 Exabytes (200 million users × 7 GB+)
• All of this at 20 MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
Focused Research Towards TSUBAME 3.0 and Beyond, Towards Exa
• Green computing: ultra power-efficient HPC
• High-radix bisection networks: HW, topology, routing algorithms, placement, …
• Fault tolerance: group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Scientific “extreme” Big Data: ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems: pushing the envelope of low power vs. capacity vs. BW, exploiting the deep hierarchy with new algorithms to decrease Bytes/Flops
• Post-petascale programming: OpenACC and other many-core programming substrates, task parallelism
• Scalable algorithms for many-core: apps/system/HW co-design
Bayesian fusion of model and measurement
• Model-based estimation of execution time (the prior) is fused with measured execution-time data, using a standard conjugate normal model:
  $y_i \mid \mu,\sigma^2 \sim N(\mu,\sigma^2)$, $\mu \mid \sigma^2 \sim N(\mu_0,\sigma^2/\kappa_0)$, $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0,\sigma_0^2)$
• Posterior predictive distribution after n measurements:
  $y_{n+1}\mid y_1,\dots,y_n \sim t_{\nu_n}\!\left(\mu_n,\ \sigma_n^2\,(1+1/\kappa_n)\right)$, with
  $\kappa_n=\kappa_0+n$, $\mu_n=(\kappa_0\mu_0+n\bar y)/\kappa_n$, $\nu_n=\nu_0+n$,
  $\nu_n\sigma_n^2=\nu_0\sigma_0^2+(n-1)s^2+\kappa_0 n(\bar y-\mu_0)^2/\kappa_n$
(A small sketch of this fusion follows.)
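As a small illustration of this fusion, here is a minimal Python sketch assuming the conjugate normal model above; the prior width (20% of the model estimate) and all names are illustrative choices, not taken from the actual auto-tuning framework.

import numpy as np

def posterior_predictive(model_estimate, measurements, kappa0=1.0, nu0=1.0, sigma0=None):
    """Fuse a model-based runtime estimate (used as the prior mean) with
    measured runtimes via the conjugate normal / scaled-inverse-chi-square
    update; returns mean, scale and degrees of freedom of the Student-t
    posterior-predictive distribution for the next measurement."""
    y = np.asarray(measurements, dtype=float)
    n = len(y)
    mu0 = float(model_estimate)
    if sigma0 is None:
        sigma0 = 0.2 * mu0            # illustrative prior spread: 20% of the estimate
    ybar = y.mean()
    s2 = y.var(ddof=1) if n > 1 else 0.0
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_n = nu0 + n
    # posterior sum of squares: prior + within-data + shrinkage towards the model
    ss = nu0 * sigma0**2 + (n - 1) * s2 + kappa0 * n * (ybar - mu0)**2 / kappa_n
    sigma2_n = ss / nu_n
    scale = np.sqrt(sigma2_n * (1.0 + 1.0 / kappa_n))
    return mu_n, scale, nu_n

# e.g. the cost model predicts 2.0 s, and three measured runs are observed
print(posterior_predictive(2.0, [2.3, 2.1, 2.4]))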
ABCLibScript: algorithm selection
!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
   <target region 1 (algorithm 1)>
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
   <target region 2 (algorithm 2)>
!ABCLib$ select sub region end
!ABCLib$ static select region end
Annotations: specification of before-execution auto-tuning and of the algorithm-selection process; input variables used by the cost-definition functions; the cost-definition functions; target regions 1 and 2
JST-CREST “Ultra Low Power (ULP)-HPC” Project, 2007-2012
• Power optimization using novel components in HPC: MRAM/PRAM/Flash etc., ultra multi-core (slow & parallel, & ULP), ULP-HPC SIMD-vector (GPGPU, etc.), ULP-HPC networks
• Power-aware and optimizable applications: performance models, algorithms, auto-tuning for performance & power
[Figure: low-power vs. high-performance model curves and the optimization point — ×10 power efficiency, towards ×1000 improvement in 10 years]
Aggressive power saving in HPC: methodologies (Enterprise/Business Clouds vs. HPC)
• Server consolidation: Good for Clouds, NG for HPC
• DVFS (Dynamic Voltage/Frequency Scaling): Good for Clouds, Poor for HPC
• New devices: Poor for Clouds (cost & continuity), Good for HPC
• New HW & SW architecture: Poor for Clouds (cost & continuity), Good for HPC
• Novel cooling: Limited for Clouds (cost & continuity), Good for HPC (high thermal density)
How do we achieve ×1000? Process shrink ×100, × many-core GPU usage ×5, × DVFS & other low-power SW ×1.5, × efficient cooling ×1.4 (100 × 5 × 1.5 × 1.4 ≈ 1050) ⇒ ×1000!
ULP-HPCProject2007-12
Ultra GreenSupercomputingProject 2011-15
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from GPU performance counters
• Linear regression model using performance counters as explanatory variables: $P \approx c_0 + \sum_{i=1}^{n} c_i x_i$ (counters $x_i$, coefficients $c_i$)
• Prevents overtraining by ridge regression; determines optimal parameters by cross-fitting
• Validated against average power consumption measured with a high-resolution power meter
• High accuracy (average error 4.7%); accurate even with DVFS
• Future: model-based power optimization — a linear model shows sufficient accuracy, opening the possibility of optimizing Exascale systems with O(10^8) processors (a small sketch follows)
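A minimal sketch of such a model in Python; the counter names, sample values and the closed-form ridge solve are illustrative, not the paper's actual feature set or training procedure.

import numpy as np

def fit_gpu_power_model(counters, measured_power, lam=1e-3):
    """Fit a linear power model  P ~= c0 + sum_i c_i * x_i  from
    performance-counter samples, with an L2 (ridge) penalty to
    prevent overtraining."""
    X = np.column_stack([np.ones(len(measured_power)), counters])
    A = X.T @ X + lam * np.eye(X.shape[1])      # (X^T X + lam*I) c = X^T y
    return np.linalg.solve(A, X.T @ measured_power)

def predict_power(coeffs, counters):
    X = np.column_stack([np.ones(len(counters)), counters])
    return X @ coeffs

# toy data: 2 normalized counters (e.g. memory accesses, FP instructions), 5 samples
counters = np.array([[1.0, 0.2], [0.8, 0.5], [0.3, 0.9], [0.6, 0.6], [0.9, 0.1]])
power = np.array([110.0, 120.0, 135.0, 125.0, 105.0])    # measured watts
coeffs = fit_gpu_power_model(counters, power)
print(predict_power(coeffs, counters))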
Power efficiency of the dendrite application, from TSUBAME1.0 through the JST-CREST ULP-HPC prototype, running the Gordon Bell dendrite app
TSUBAME-KFC: Ultra-Green Supercomputer Testbed [2011-2015]
Fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container
• GRC submersion rack: processors 80-90°C ⇒ coolant oil 35-45°C (coolant oil: SpectraSyn 8)
• Heat exchanger: oil 35-45°C ⇒ water 25-35°C
• Cooling tower: water 25-35°C ⇒ outdoor air
Compute nodes: NEC/SMC 1U server × 40; per node: Intel Ivy Bridge 2.1 GHz 6-core × 2, NVIDIA Tesla K20X GPU × 4, 64 GB DDR3 memory, 120 GB SSD, 4x FDR InfiniBand 56 Gbps
Total peak: 210 TFlops (DP), 630 TFlops (SP)
Facility: 20-foot container (16 m²)
Targets: world's top power efficiency (>3 GFlops/W), average PUE 1.05, lower component power; field-test the ULP-HPC results
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
Shooting for #1 on the Nov. 2013 Green500!
Machine — Power (incl. cooling) — Linpack Perf (PF) — Linpack MFLOPS/W — Factor — Total Mem BW TB/s (STREAM) — Mem BW MByte/s/W
• Earth Simulator 1: 10 MW, 0.036, 3.6, 13,400, 160, 16
• TSUBAME1.0 (2006Q1): 1.8 MW, 0.038, 21, 2,368, 13, 7.2
• ORNL Jaguar (XT5, 2009Q4): ~9 MW, 1.76, 196, 256, 432, 48
• TSUBAME2.0 (2010Q4): 1.8 MW, 1.2, 667, 75, 440, 244
• K Computer (2011Q2): ~16 MW, 10, 625, 80, 3,300, 206
• BlueGene/Q (2012Q1): ~12 MW?, 17, ~1,400, ~35, 3,000, 250
• TSUBAME2.5 (2013Q3): 1.4 MW, ~3, ~2,100, ~24, 802, 572
• TSUBAME3.0 (2015Q4-2016Q1): 1.5 MW, ~20, ~13,000, ~4, 6,000, 4,000
• EXA (2019-20): 20 MW, 1,000, 50,000, 1, 100K, 5,000
(Generation-to-generation jump factors annotated on the roadmap: ×31.6, ~×20, ×34, ~×13.7)
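As a sanity check, the efficiency column is simply Linpack performance divided by total power; e.g. for the TSUBAME2.5 row:

$$ \frac{\sim 3\times 10^{9}\ \mathrm{MFLOPS}}{1.4\times 10^{6}\ \mathrm{W}} \approx 2{,}100\ \mathrm{MFLOPS/W} $$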
Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabyte/Year
Principal Investigator: Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology

The current “Big Data” are not really that big…
• Typical “real” definition: “mining people's privacy data to make money”
• Corporate data are usually warehoused in silos ⇒ limited volume: gigabytes to terabytes, seldom petabytes
• Processing involves simple O(n) algorithms, or ones that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity “web” servers linked with 1 Gbps networks running Hadoop/HDFS
• A vicious cycle of stagnation in innovation…
• NEW: breaking down of silos ⇒ convergence of supercomputing with Extreme Big Data
But “Extreme Big Data” will change everything
• “Breaking down of silos” (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in science & engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m × n), …
• Will become the NORM for competitiveness reasons
We will have tons of unknown genes
• Metagenome analysis: directly sequencing uncultured microbiomes obtained from target environments and analyzing the sequence data
  – Finding novel genes from unculturable microorganisms
  – Elucidating the composition of species/genes in an environment
• Example microbiomes: human body (gut microbiome), sea, soil
[Slide courtesy Yutaka Akiyama @ Tokyo Tech]

Results from the Akiyama group @ Tokyo Tech: ultra-high-sensitivity “big data” metagenome sequence analysis of the human oral microbiome (inner and outer sides of the tooth rows, dental plaque; metabolic pathway map)
• Required > 1 million node-hours on the K computer
• World's most sensitive sequence analysis (based on an amino-acid similarity matrix)
• Discovered at least three microbiome clusters with functional differences (integrating 422 experimental samples taken from 9 different oral parts)
• 572.8 M reads/hour on 82,944 nodes (663,552 cores) of the K computer (2012)
Extreme Big Data in Genomics
Lincoln Stein, Genome Biology, vol. 11(5), 2010: sequencing data (bp) per dollar grows roughly ×4,000 every 5 years, c.f. HPC ×33 in 5 years — the impact of new-generation sequencers
[Slide courtesy Yutaka Akiyama @ Tokyo Tech]
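To put the two growth rates on the same footing, the implied annual factors are roughly:

$$ 4000^{1/5} \approx 5.3\ \text{per year (sequenced bp per \$)},\qquad 33^{1/5} \approx 2.0\ \text{per year (HPC)} $$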
Extremely “Big” Graphs
• Large-scale graphs in various fields:
  – US road network: 24 million vertices, 58 million edges
  – Twitter follow-ship network (2009): 61.6 million vertices, 1.47 billion edges
  – Neuronal network (Human Brain Project): 89 billion vertices, 100 trillion edges
  – Cyber-security: 15 billion log entries / day
• Goal: fast and scalable graph processing by using HPC
[Chart: problem sizes on a log2(# of vertices) vs. log2(# of edges) plane — USA road networks (NY, LKS, USA), Twitter (tweets/day), the Human Brain Project, and the Graph500 problem classes (Toy, Mini, Small, Medium, Large, Huge), spanning roughly 1 billion to 1 trillion vertices and edges]
• K computer, 65,536 nodes: Graph500 5,524 GTEPS
• Android tablet (Tegra 3, 1.7 GHz, 1 GB RAM): 0.15 GTEPS, 64.12 MTEPS/W
Towards continuous billion-scale social simulation with real-time streaming data (Toyotaro Suzumura, IBM / Tokyo Tech)
• Target area: the planet (OpenStreetMap), 7 billion people
• Input data: road network for the planet (OpenStreetMap), 300 GB (XML); trip data for 7 billion people, 10 KB (1 trip) × 7 billion = 70 TB; real-time streaming data (e.g. social sensors, physical data)
• Simulated output for 1 iteration: 700 TB
Graph500 “Big Data” Benchmark: Kronecker graph, BSP problem
Quadrant probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a small generator sketch follows)
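For illustration, a simplified Python sketch of the Graph500-style Kronecker (R-MAT) edge generator with these quadrant probabilities; the reference generator additionally permutes vertex labels and perturbs the probabilities, which is omitted here.

import random

A, B, C, D = 0.57, 0.19, 0.19, 0.05   # Graph500 quadrant probabilities

def kronecker_edge(scale, rng=random):
    """Generate one edge of a 2^scale-vertex Kronecker graph by choosing
    one quadrant per bit level."""
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < A:
            pass                       # top-left quadrant: both bits stay 0
        elif r < A + B:
            dst |= 1                   # top-right
        elif r < A + B + C:
            src |= 1                   # bottom-left
        else:
            src |= 1
            dst |= 1                   # bottom-right
    return src, dst

def generate(scale, edgefactor=16):
    """Edge list using the benchmark's default of 16 edges per vertex."""
    return [kronecker_edge(scale) for _ in range(edgefactor * (1 << scale))]

edges = generate(scale=10)             # 1,024 vertices, 16,384 edges
print(len(edges), edges[:3])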
November 15, 2010 — “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL, now Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list.”
Reality: Top500 supercomputers dominate; no cloud IDCs at all
TSUBAME2.0: #3 (Nov. 2011), #4 (June 2012)
A major northern-Japanese cloud datacenter (2013)
• Juniper MX480 × 2 at the edge (to the Internet), Juniper EX8208 × 2 core switches, Juniper EX4200 zone switches (2 per zone, virtual chassis), zones of 700 nodes, 10 GbE / LACP uplinks
• 8 zones, 5,600 nodes total; injection 1 Gbps/node, bisection 160 Gbps

Supercomputer: Tokyo Tech TSUBAME 2.0 (#4 Top500, 2010)
• Advanced silicon photonics 40G, single CMOS die, 1490 nm DFB, 100 km fiber
• ~1,500 nodes, compute & storage, full-bisection multi-rail optical network
• Injection 80 Gbps/node, bisection 220 Tbps
⇒ more than ×1000 the cloud datacenter's bisection!
But what does “220 Tbps” mean? Global IP traffic, 2011-2016 (source: Cisco)
By type (PB per month / average bitrate in Tbps), 2011 through 2016, with CAGR 2011-2016:
• Fixed Internet: 23,288 / 32,990 / 40,587 / 50,888 / 64,349 / 81,347 PB per month (28% CAGR); 71.9 / 101.8 / 125.3 / 157.1 / 198.6 / 251.1 Tbps
• Managed IP: 6,849 / 9,199 / 11,846 / 13,925 / 16,085 / 18,131 PB per month (21%); 21.1 / 28.4 / 36.6 / 43.0 / 49.6 / 56.0 Tbps
• Mobile data: 597 / 1,252 / 2,379 / 4,215 / 6,896 / 10,804 PB per month (78%); 1.8 / 3.9 / 7.3 / 13.0 / 21.3 / 33.3 Tbps
• Total IP traffic: 30,734 / 43,441 / 54,812 / 69,028 / 87,331 / 110,282 PB per month (29%); 94.9 / 134.1 / 169.2 / 213.0 / 269.5 / 340.4 Tbps
The TSUBAME2.0 network has TWICE the capacity of the global Internet being used by 2.1 billion users.
“Convergence” at future extreme scale for computing and data (in clouds?)
• HPC: ×1000 in 10 years, CAGR ~= 100%
• IDC: ×30 in 10 years; server unit sales flat (replacement demand), CAGR ~= 30-40% (source: “Assessing trends over time in performance, costs, and energy use for servers”, Intel, 2009)
What does this all mean?
• “Leveraging of mainframe technologies in HPC has been dead for some time.”
• But will leveraging Cloud/Mobile be sufficient? NO! They are already falling behind, and will be perpetually behind
  – CAGR of Clouds 30%, HPC 100%: all the data supports it
  – Stagnation in network, storage, scaling, …
• Rather, HPC will be the technology driver for future Big Data, for Cloud/Mobile to leverage — rather than re-purposed standard servers
Future “Extreme Big Data”:
• NOT mining terabytes of silo data
• Peta- to zettabytes of data, ultra-high-bandwidth data streams
• Highly unstructured, irregular data; complex correlations between data from multiple sources
• Extreme capacity, bandwidth, and compute all required
Extreme Big Data is not just traditional HPC! — analysis of required system properties
[Radar chart comparing extreme-scale computing, Big Data analytics, and a BDEC knowledge-discovery engine along axes such as processor speed, memory/ops, OPS, approximate computation, local persistent storage, read/write performance, communication-latency tolerance, communication-pattern variability, power-optimization opportunities, and algorithmic variety — slide courtesy Alok Choudhary, Northwestern U.]
EBD Research Scheme
• Supercomputers today are compute- and batch-oriented; cloud IDCs have very low bandwidth & efficiency
• Convergent architecture (Phases 1-4): large-capacity NVM, high-bisection network (node diagram as in the Phase 4 slide below)
• EBD system software, incl. the EBD object system
• Co-design with future non-silo Extreme Big Data apps: large-scale metagenomics; massive sensors and data assimilation in weather prediction; ultra-large-scale graphs and social infrastructures — Exascale Big Data HPC co-design
• Co-designed data structures: graph store, EBD bag, EBD KVS (distributed KVS), Cartesian plane
Phase 4 (2019-20): DRAM + NVM + CPU with 3D/2.5D die stacking — the ultimate convergence of BD and EC
[Node diagram: PCB with TSV interposer; a high-powered main CPU plus low-power CPUs, each with stacked DRAM and NVM/Flash; 2 Tbps HBM, 4-6 HBM channels, 1.5 TB/s DRAM & NVM BW; 30 PB/s I/O BW possible — 1 Yottabyte/year]
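A rough check of the “1 Yottabyte/year” figure, assuming the 30 PB/s aggregate I/O bandwidth were sustained for a full year:

$$ 30\ \mathrm{PB/s} \times 3.15\times 10^{7}\ \mathrm{s/year} \approx 9.5\times 10^{23}\ \mathrm{B/year} \approx 1\ \mathrm{YB/year} $$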
Preliminary I/O Performance Evaluation on GPU and NVRAM
How should local storage be designed for next-generation supercomputers?
• Designed a local I/O prototype using 16 mSATA SSDs on a RAID card (capacity: 4 TB, read bandwidth: 8 GB/s)
[Charts: aggregate bandwidth vs. number of mSATA SSDs (raw 4 KB vs. RAID0 1 MB / 64 KB), and throughput vs. matrix size for transfers from 8 mSATA SSDs to a GPU]
• ~7.39 GB/s from 16 mSATA SSDs (RAID0 enabled)
• ~3.06 GB/s from 8 mSATA SSDs to the GPU
Large-Scale BFS Using NVRAM
1. Introduction: large-scale graph processing appears in various domains, and the DRAM it requires keeps increasing; flash devices are spreading — pros: price per bit, energy consumption; cons: latency, throughput
2. Hybrid BFS: switch between top-down and bottom-up traversal depending on the frontier size (# of frontier vertices n_frontier, # of all vertices n_all, switching parameters α, β)
3. Proposal: ① offload the small-access data to NVRAM, ② run BFS reading that data from NVRAM — using NVRAM for large-scale graph processing can keep the performance degradation small
4. Evaluation: DRAM only 5.2 GTEPS vs. DRAM+SSD 2.8 GTEPS (47.1% down), swept over the switching parameter α (β = 10α vs. β = 0.1α) — i.e. we could halve the DRAM with a 47.1% performance penalty (130 M vertices, 2.1 G edges)
• We are working on multiplexed I/O, which improves the NVRAM's I/O performance
• C.f. Pearce et al.: 13× larger datasets at 52 MTEPS (1 TB DRAM, 12 TB NVRAM)
(A minimal sketch of the top-down / bottom-up switching follows.)
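A minimal Python sketch of the direction-optimizing (hybrid) BFS idea; the switching rule here is a simplified frontier-fraction heuristic, not the exact α/β criterion used in the work above.

def hybrid_bfs(adj, source, alpha=15.0):
    """Level-synchronous BFS that expands top-down while the frontier is
    small and bottom-up once the frontier covers a large fraction of the
    graph.  adj maps each vertex to a set of neighbours."""
    n_all = len(adj)
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        bottom_up = len(frontier) * alpha > n_all      # simplified switch rule
        next_frontier = set()
        if bottom_up:
            # bottom-up: every unvisited vertex scans for a parent in the frontier
            for v, nbrs in adj.items():
                if v not in level and nbrs & frontier:
                    next_frontier.add(v)
        else:
            # top-down: frontier vertices push to their unvisited neighbours
            for u in frontier:
                for v in adj[u]:
                    if v not in level:
                        next_frontier.add(v)
        depth += 1
        for v in next_frontier:
            level[v] = depth
        frontier = next_frontier
    return level

g = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
print(hybrid_bfs(g, 0))               # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}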
Algorithm Kernels on EBD: High-Performance Sorting
• Fast algorithms: distribution-based vs. comparison-based
  – Classical N log N comparison sorts (quicksort, mergesort, bitonic sort, …)
  – LSD radix sort (Thrust): short, fixed-length keys, integer sorts
  – MSD radix sort: variable-length / long keys, high efficiency on small alphabets (e.g. apple, apricot, banana, kiwi — you don't have to examine all characters), well suited to computational genomics (A, C, G, T)
• Efficient implementation: GPUs are good at counting; scalability matters
• MapReduce: Hadoop is easy to use but not that efficient; hybrid approaches (the best mix yet to be found), good for GPU nodes, balancing I/O and computation
(A sketch of MSD radix sort on the DNA alphabet follows.)
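A tiny Python sketch of MSD radix sort on the 4-letter DNA alphabet, showing the bucket-per-character recursion and early termination on exhausted keys; it is illustrative only, not the GPU implementation.

ALPHABET = "ACGT"
RANK = {c: i + 1 for i, c in enumerate(ALPHABET)}    # bucket 0 = "key exhausted"

def msd_radix_sort(keys, pos=0):
    """Sort strings over {A,C,G,T} by distributing on the character at
    position pos, then recursing into each non-empty bucket."""
    if len(keys) <= 1:
        return list(keys)
    buckets = [[] for _ in range(len(ALPHABET) + 1)]
    for k in keys:
        buckets[RANK[k[pos]] if pos < len(k) else 0].append(k)
    out = list(buckets[0])            # exhausted keys are identical: already in place
    for b in buckets[1:]:
        out.extend(msd_radix_sort(b, pos + 1))
    return out

reads = ["GATTACA", "ACGT", "ACG", "TTT", "ACGA", "GAT"]
print(msd_radix_sort(reads))          # ['ACG', 'ACGA', 'ACGT', 'GAT', 'GATTACA', 'TTT']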
Twitter network (application of the Graph500 benchmark): follow-ship network 2009, 41 million vertices and 1.47 billion edges; an (i, j) edge means user i follows user j

Frontier size per BFS level, with source user 21,804,357:
Lv 0: 1 (0.00%) / Lv 1: 7 (0.00%) / Lv 2: 6,188 (0.01%) / Lv 3: 510,515 (1.23%) / Lv 4: 29,526,508 (70.89%) / Lv 5: 11,314,238 (27.16%) / Lv 6: 282,456 (0.68%) / Lv 7: 11,536 (0.03%) / Lv 8: 673 / Lv 9: 68 / Lv 10: 19 / Lv 11: 10 / Lv 12: 5 / Lv 13-15: 2 each — total 41,652,230 reachable vertices (six degrees of separation)

Our NUMA-optimized BFS on a 4-way Xeon system: 69 ms / BFS ⇒ 21.28 GTEPS
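The GTEPS figure is consistent with the graph size, assuming TEPS counts the ~1.47 billion follow edges traversed per BFS:

$$ \frac{1.47\times 10^{9}\ \text{edges}}{69\times 10^{-3}\ \text{s}} \approx 21.3\times 10^{9}\ \text{TEPS} = 21.3\ \text{GTEPS} $$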
100,000-Fold: EBD “Convergent” System Overview
• EBD application co-design and validation (Tasks 5-1 to 5-3, Task 6): large-scale genomic correlation; data assimilation in large-scale sensors and exascale atmospherics; large-scale graphs and social-infrastructure apps
• EBD programming system (Task 3)
• EBD “converged” real-time resource scheduling (Task 4)
• EBD distributed object store on 100,000 NVM extreme compute-and-data nodes (Task 2): KVS, graph store, EBD bag, EBD KVS, Cartesian plane
• EBD performance modeling & evaluation
• Ultra-parallel, low-power I/O on the EBD “convergent” supercomputer (Task 1): ~10 TB/s ⇒ ~100 TB/s ⇒ ~10 PB/s, built on ultra-high-BW / low-latency NVM and networks, processor-in-memory, and 3D stacking
• Platforms: TSUBAME 2.0/2.5 ⇒ TSUBAME 3.0
Summary
• TSUBAME1.0 → 2.0 → 2.5 → 3.0 → …
  – TSUBAME2.5: number 1 in Japan, 17 Petaflops SFP
  – A template for future supercomputers and IDC machines
• TSUBAME3.0 in early 2016
  – New supercomputing leadership
  – Tremendous power efficiency, extreme big data, extremely high reliability
• Lots of background R&D for TSUBAME3.0 and towards exascale
  – Green computing: ULP-HPC & TSUBAME-KFC
  – Extreme Big Data: convergence of HPC and IDC!
  – Exascale resilience
  – Programming with millions of cores
  – …
• Please stay tuned! (乞うご期待。Thank you for your support!)