The Design, Deployment, and Evaluation of the CORAL Pre-Exascale
Systems
Sudharshan S. Vazhkudai, Bronis R. de Supinski, Arthur S. Bland, Al Geist, James Sexton, Jim Kahle, Christopher J. Zimmer, Scott Atchley, Sarp Oral, Don E. Maxwell, Veronica G. Vergara Larrea, Adam Bertsch, Robin Goldstone, Wayne Joubert, Chris Chambreau, David Appelhans, Robert Blackmore, Ben Casses, George Chochia, Gene Davison, Matthew A. Ezell, Tom Gooding, Elsa Gonsiorowski, Leopold Grinberg, Bill Hanson, Bill Hartner, Ian Karlin, Matthew L. Leininger, Dustin Leverman, Chris Marroquin, Adam Moody, Martin Ohmacht, Ramesh Pankajakshan, Fernando Pizzano, James H. Rogers, Bryan Rosenburg, Drew Schmidt, Mallikarjun Shankar, Feiyi Wang, Py Watson, Bob Walkup, Lance D. Weems, Junqi Yin
Oak Ridge National Lab, Lawrence Livermore National Lab, IBM
CORAL Collaboration: ORNL, ANL, LLNL (2012)

Objective: Procure 3 leadership computers, to be sited at Oak Ridge, Lawrence Livermore, and Argonne in 2017. The RFP requested >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x that of Titan or Sequoia.

Approach:
• Competitive process: one RFP (issued by LLNL) led to 2 R&D contracts and 3 computer procurement contracts
• For risk reduction and to meet a broad set of requirements, 2 architectural paths were selected, and Oak Ridge and Argonne were required to choose different architectures
• Once selected, a multi-year Lab-Awardee relationship to co-design the computers
• Both R&D contracts jointly managed by the 3 Labs
• Each lab managed and negotiated its own computer procurement contract, with options to meet its specific needs
• It was understood that the long procurement lead time can impact architectural characteristics and designs

Then-current DOE leadership computers: Titan (ORNL), Sequoia (LLNL), and Mira (ANL), all in production since 2012.
CORAL Workloads
• Lab missions differ
  – Open science: understanding the natural world
  – Stockpile certification
• But workload needs are similar
  – Scalable science: RFP benchmarks LSMS, QBOX, HACC, Nekbone
  – Throughput: RFP benchmarks CAM-SE, UMT, AMG, MCB
Additional Target Requirements
• Memory capacity
  – 4 PB for direct application use
  – At least 2 GB per MPI process
• Power < 20 MW
  – Roughly $20M per year of power; OpEx should not exceed CapEx
• High performance for data-centric workloads
  – Graph500, Hash, Integer Sort
• Burst buffers
  – Needed to bridge the performance gap between node memory and the parallel file system
Overall Architecture: Accelerator-Based Supercomputers (Summit / Sierra)

• Nodes & FLOPS: 4,608 nodes (2 CPUs + 6 GPUs each), 200 PF / 4,320 nodes (2 CPUs + 4 GPUs each), 125 PF
• Compute & memory: IBM POWER9 + NVIDIA Volta; 2.4 PB DDR + 442 TB HBM2 = 2.8 PB / 1.1 PB DDR + 277 TB HBM2 = 1.4 PB
• Burst buffer: 7.4 PB, 9.7 TB/s / 6.9 PB, 9.1 TB/s
• Storage: 250 PB GPFS, 2.5 TB/s / 150 PB GPFS, 1.5 TB/s
• Interconnect: Mellanox IB EDR; 1:1 fat tree, 115 TB/s bisection BW / 2:1 fat tree, 54 TB/s bisection BW
• Racks: 256 / 240 compute racks @ 18 nodes per rack; 24-39 racks @ 2 ESS each; 9-18 racks @ 1 Mellanox director switch each; 4-5 racks of management servers and networking
• Management infrastructure: Power servers, with InfiniBand and Ethernet management networks
IBM POWER9 and NVIDIA Volta V100
• POWER9
  – Up to 24 cores fabricated; 22 available to applications to improve manufacturing yield
  – PCI-Express 4.0: twice the bandwidth of PCIe 3.0
  – NVLink 2.0: high-bandwidth links to the GPUs; coherent, single address space
• Volta
  – 5,120 CUDA cores (64 on each of 80 SMs)
  – 640 new Tensor cores (8 on each of 80 SMs)
  – 300 GB/s NVLink
  – 7.5 FP64 TFLOPS | 15 FP32 TFLOPS | 120 Tensor OPS
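A rough cross-check of the quoted peaks (a back-of-the-envelope sketch; the GPU clock of roughly 1.47 GHz is an assumption, not stated on the slide): FP64 ≈ 80 SMs × 32 FP64 units × 2 flops/FMA × 1.47 GHz ≈ 7.5 TFLOPS; FP32 ≈ 5,120 CUDA cores × 2 × 1.47 GHz ≈ 15 TFLOPS; Tensor ≈ 640 cores × 64 FMAs/cycle × 2 × 1.47 GHz ≈ 120 TOPS.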
Summit / Sierra Node Configuration

[Node diagram, one half node shown for each system: every POWER9 socket has its DRAM at 170 GB/s (256 GB per Summit node, 128 GB per Sierra node), a 64 GB/s X-Bus SMP link to the other socket, and a 16 GB/s PCIe Gen4 path toward the shared EDR IB NIC (2 x 12.5 GB/s injection). NVLink connects each socket to its GPUs and the GPUs to one another at 50 GB/s per link on Summit (3 GPUs per socket) and 75 GB/s on Sierra (2 GPUs per socket). Each ~7 TF Volta GPU has 16 GB of HBM2 at 900 GB/s, and the node-local NVMe sustains 6.0 GB/s reads and 2.1 GB/s writes.]
System Balance Ratio Analysis
                                            Summit    Sierra    Titan
Memory subsystem to intra-node connectivity ratios
  HBM BW : DDR BW                           15.8      10.6      4.9
  HBM BW : CPU-GPU BW                       18        12        39
  Per-GPU HBM BW : GPU-GPU BW               18        12        -
  DDR BW : CPU-GPU BW                       1.13      1.13      8
  DDR : HBM (capacity)                      5.3       4         5.3
  HBM capacity : GPU-GPU BW (GB per GB/s)   0.32      0.43      -
Memory subsystem to FLOPS ratios
  Memory capacity : FLOPS                   0.01      0.01      0.03
  Memory BW : FLOPS                         0.13      0.13      0.17
Interconnect subsystem to FLOPS ratios
  Injection BW : FLOPS                      0.0006    0.0009    0.004
  Bisection BW : FLOPS                      0.0006    0.0004    0.004
Other system ratios
  FS capacity : Memory capacity             89        107       42
  FLOPS : Power                             15.4      10.4      3
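As a sanity check on how these ratios follow from the node-level figures quoted elsewhere in this deck (a back-of-the-envelope sketch, rounded as in the table): Summit HBM BW : DDR BW ≈ (6 GPUs × 900 GB/s) / (2 sockets × 170 GB/s) = 5400 / 340 ≈ 15.9, and Summit DDR : HBM capacity = 512 GB / (6 × 16 GB) ≈ 5.3; the corresponding Sierra figures are (4 × 900) / 340 ≈ 10.6 and 256 / 64 = 4.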
I/O Design
[I/O architecture diagram summarized below; compute nodes reach the storage building blocks through the Mellanox fat-tree network.]

IBM Elastic Storage Server (ESS) GL4 building block: 2 x 2U POWER9 NSD servers plus 4 x 4U 106-disk enclosures, for 422 x 10 TB HDDs per GL4.

Center-wide parallel file systems:
• ORNL (Alpine, serving Summit): 77 ESS GL4s, 32,494 x 10 TB HDDs, 250 PB usable capacity, 2.5 TB/s to/from Summit, with connectivity to ORNL center-wide systems
• LLNL (serving Sierra): 56 ESS GL4s, 23,632 x 10 TB HDDs, 150 PB usable capacity, 1.5 TB/s to/from Sierra, with gateway nodes connecting other LLNL systems

Node-local burst buffer (each compute node carries a 1.6 TB NVMe: 6 GB/s read / 2.1 GB/s write, 1 million read IOPS / 170K write IOPS):
• Summit (ORNL): 4,608 compute nodes and NVMes, 7.4 PB raw capacity, 23.65 TiB/s read / 9.3 TiB/s write aggregate bandwidth, 4.6 billion read / 783 million write IOPS
• Sierra (LLNL): 4,320 compute nodes and NVMes, 6.9 PB raw capacity, 22.17 TiB/s read / 8.7 TiB/s write aggregate bandwidth, 4.3 billion read / 734 million write IOPS
Interconnect
• Both systems are built on Mellanox EDR InfiniBand
• Switch-IB 2 switches and ConnectX-5 HCAs provide:
  – Adaptive routing
  – Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
  – Dynamically connected (DC) queue pairs
  – Hardware tag matching
  – GPUDirect (see the CUDA-aware MPI sketch below)
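With GPUDirect, a CUDA-aware MPI (Spectrum MPI is based on Open MPI, which offers this) can move data directly from GPU memory without explicit host staging in user code. A minimal sketch, assuming a CUDA-aware MPI build and two ranks; the buffer size is illustrative:

```c
/* cuda_aware_pingpong.c: pass device pointers directly to MPI (needs a CUDA-aware MPI) */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 24;                          /* 16 Mi doubles = 128 MiB */
    double *dbuf;
    cudaMalloc((void **)&dbuf, (size_t)n * sizeof(double));

    if (rank == 0)          /* send straight from GPU memory */
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```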
Software Strategy Builds on CORAL Collaboration, Open Source, and OpenPOWER
Provide a complete HPC software solution for Power systems: fully integrated, tested, and supported; exploits Power hardware features; focused on maintainability and reliability.

• xCAT (hardware discovery and provisioning): open source; supports multiple hardware and installation options
• CSM (new system management software developed for CORAL): integrates system management and job management functions; jitter-aware design; data collection aligned to the system clock; big-data RAS analysis; open source in 2018
• Spectrum LSF (job and workflow management): superior job throughput at scale; enhanced job feedback for users and administrators; RESTful APIs to simplify integration into business processes; absolute priority scheduling; enhancements for advance reservations; pending job limits
• Spectrum MPI & PPT (optimized MPI solution): based on Open MPI; new features driven by CORAL innovation, including enablement of unique hardware features; high-performance collective library; enhanced multithreaded performance; IBM Power Parallel Performance Toolkit for analyzing application performance, including a single timeline tracking CPU, GPU, and MPI activity
• Compilers: a rich collection of compilers supporting multiple programming models (CUDA, OpenMP 4.x, OpenACC); optimized for Power; open-source and fully supported proprietary compiler options (see the offload sketch below)
• ESSL/PESSL (Engineering and Scientific Subroutine Libraries): tuned for each Power architecture; serial, SMP (OpenMP), GPU (CUDA), SIMD (VSX), and SPMD (MPI) subroutines; callable from Fortran, C, and C++; Parallel ESSL SMP libraries
• Spectrum Scale (integrated file system): single namespace up to 250 PB; 2.5 TB/s large-block sequential I/O performance; 2.6M file creates/s for 32 KB files in unique directories; 50K file creates/s to a single shared directory
• Burst buffer (new system productivity feature for CORAL): asynchronous checkpoint transfer from on-node NVMe to the file system; stage-in of data to expedite job start; stage-out releases the CPU to start the next job; flash wear, health, and usage monitoring; open source in 2018
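As an illustration of the directive-based models this compiler stack targets, a minimal OpenMP 4.x target-offload kernel follows (a sketch only; the array size and names are illustrative):

```c
/* saxpy_offload.c: OpenMP 4.x target offload of a SAXPY loop to the GPU */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    const float a = 2.0f;

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* map the arrays to the device and spread the loop over GPU teams/threads */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```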
Memory Interconnect Evaluation
• CPU STREAM measurements (ci = core isolation); the Triad kernel is sketched below
• Slight differences across columns reflect memory configuration differences

CPU STREAM rates (GB/s):

            Peak/ci   Peak/ci   Peak     Peak     Sierra   Sierra
Cores       40        42        40       44       40       44
Copy        272.9     273.5     273.1    274.6    277.3    278.3
Scale       269.6     270.6     269.5    271.4    274.4    275.7
Add         268.8     269.8     268.7    270.6    273.5    274.9
Triad       273       273.9     273.5    275.3    277.7    279
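For reference, the Triad kernel behind these rates looks like the following OpenMP sketch (simplified; the published STREAM benchmark adds repetition, timing, and validation rules, and the array size here is illustrative):

```c
/* stream_triad.c: simplified STREAM Triad; compile e.g. with -O3 -fopenmp */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)   /* 128 Mi doubles (~1 GiB per array), well beyond cache */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];          /* Triad: a = b + s*c */
    double t1 = omp_get_wtime();

    /* three arrays touched per iteration: one store plus two loads */
    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```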
NVLink
• NVLink results show significant bandwidth, both host to accelerator and accelerator to accelerator (a host-to-device measurement sketch follows the tables)

POWER9 to GPUs (GB/s):

MPI process count     1        2        3        4        5        6
Peak   HTOD           45.93    91.85    137.69   183.54   229.18   274.82
Peak   DTOH           45.95    91.9     137.85   183.8    225.64   268.05
Peak   BIDIR          85.7     172.59   223.54   276.34   277.39   278.07
Butte  HTOD           68.66    137.39   206.05   275.47   —        —
Butte  DTOH           68.91    137.48   203.8    271.12   —        —
Butte  BIDIR          126.06   255.47   270.72   283.08   —        —

GPU to GPU (GB/s):

                          No PTP                                PTP
                          Socket 0   Socket 1   Cross-socket    Socket 0   Socket 1   Cross-socket   Peak
Peak   Unidirectional     33.18      25.84      30.32           46.33      46.55      25.89          50
Peak   Bidirectional      54.48      27.91      49.02           93.02      93.11      21.63          100
Butte  Unidirectional     41.27      24.72      31.04           69.49      69.49      31.05          75
Butte  Bidirectional      58.63      25.55      49.17           139.15     124.3      49.15          150
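The host-to-device rows above follow the usual pinned-memory copy measurement; a minimal CUDA sketch of that pattern (one process per GPU in the actual runs; the transfer size and iteration count here are illustrative):

```c
/* htod_bw.c: time pinned-host to device copies over NVLink; build with nvcc */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 256UL << 20;          /* 256 MiB per transfer */
    const int iters = 50;
    void *hbuf, *dbuf;

    cudaMallocHost(&hbuf, bytes);              /* pinned host memory */
    cudaMalloc(&dbuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpyAsync(dbuf, hbuf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("HTOD: %.1f GB/s\n", (double)bytes * iters / (ms / 1e3) / 1e9);

    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```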
Adaptive Routing
• Adaptive routing should enable applications to capture more network bandwidth
• mpiGraph: average single-port bandwidth increases from 5.7 GB/s to 9.5 GB/s with adaptive routing enabled (a single-pair measurement sketch follows)

[Figure: mpiGraph bandwidth maps with adaptive routing on vs. off]
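mpiGraph aggregates many simultaneous pairwise transfers; the underlying single-pair measurement is essentially the following MPI sketch (message size and repetition count are illustrative):

```c
/* p2p_bw.c: point-to-point bandwidth between rank 0 and rank 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int bytes = 1 << 20;                 /* 1 MiB messages */
    const int iters = 1000;
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("P2P bandwidth: %.2f GB/s\n", (double)bytes * iters / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```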
SHARP
• Allreduce comparison (measurement pattern sketched below)
  – Spectrum MPI libcoll
  – Mellanox FCA
  – Mellanox SHARP
• SHARP barrier: ~7 us at full system

[Figure: OSU Allreduce latency for a 2 KiB payload from 8 up to 2,048 nodes, comparing IBM Spectrum MPI, Open MPI, FCA, and SHARP; time in microseconds on the y-axis]
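The comparison uses the OSU allreduce latency pattern; a minimal sketch of that measurement for the 2 KiB payload shown (iteration count is illustrative):

```c
/* allreduce_latency.c: average MPI_Allreduce time for a 2 KiB payload */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 2048 / sizeof(float);    /* 512 floats = 2 KiB */
    const int iters = 10000;
    float in[512], out[512];
    for (int i = 0; i < count; i++) in[i] = 1.0f;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Allreduce(2 KiB): %.2f us\n", (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```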
I/O Subsystem (Summit – Alpine – Disk only)
• Single GL4 performance (average)
  – Sequential POSIX write/read: 33 GB/s / 39 GB/s
  – Random POSIX write/read: 30 GB/s / 40 GB/s
  – 32 KB POSIX transactions (create/open + write(32 KB) + close): 42K+/s
• Aggregate performance (77 GL4s, average)
  – Sequential POSIX write/read: 2.47 TB/s / 2.72 TB/s
  – Random POSIX write/read: 2.38 TB/s / 3.07 TB/s
  – MPI-IO (file per process) write/read: 2 TB/s / 2.1 TB/s
  – MPI-IO (single shared file) write/read: 2.1 TB/s / 1.9 TB/s (see the sketch below)
• Metadata performance (77 GL4s, average)
  – 936K+ file creates/s with unique directories
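The MPI-IO single-shared-file rate corresponds to a collective write pattern like the following sketch (the file path and block size are illustrative; the reported rates come from full-scale benchmark runs rather than this toy example):

```c
/* shared_file_write.c: each rank writes one block into a single shared file */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int block = 16 << 20;                 /* 16 MiB per rank */
    char *buf = malloc(block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/alpine/scratch/shared.dat",   /* illustrative path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    /* collective write: every rank writes its block at its own offset in one shared file */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Shared-file write: %.2f GB/s\n", (double)block * nranks / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```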
Burst Buffer Measurements
• FIO used to compare the TDS and the burst buffers (a per-node write sketch follows)
• Node-local NVMe shows linear scaling: per-node bandwidth remains consistent as nodes are added
• TDS/PFS performance degrades with too many nodes, from 35 GB/s down to 31 GB/s
• Oct 18 production measurement: 23 TB read, 9 TB written

[Figure: per-node bandwidth (GB/s) at 8, 16, 32, and 128 nodes, burst buffer test vs. PFS]
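The per-node burst-buffer rates were gathered with FIO; as a rough equivalent, a minimal POSIX sketch of a sequential-write timing to the node-local NVMe (the mount point and sizes are illustrative):

```c
/* bb_write.c: time a sequential write to the node-local NVMe */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const size_t chunk = 4 << 20;        /* 4 MiB writes */
    const int nchunks = 1024;            /* 4 GiB total */
    char *buf = malloc(chunk);
    memset(buf, 0xab, chunk);

    int fd = open("/mnt/bb/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);  /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nchunks; i++)
        write(fd, buf, chunk);
    fsync(fd);                           /* ensure data reaches the device before stopping the clock */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Write: %.2f GB/s\n", (double)chunk * nchunks / secs / 1e9);

    close(fd);
    free(buf);
    return 0;
}
```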
AMG 2013
• AMG is a throughput app with 2 phases, setup and solve
  – Setup shows CPU benefits
  – Solve shows GPU impact

[Figure: Figure of Merit (FOM) for the setup and solve phases at 1, 6, and 9 threads, comparing Summit with 6 GPUs/node, Summit with 4 GPUs/node, and Sierra with 4 GPUs/node; setup-phase FOMs are on the order of 1E7 and solve-phase FOMs on the order of 1E10 across the three configurations]
UMT 2013
• NVLink offers significant benefit
• The higher CPU-GPU bandwidth per GPU positively impacts Sierra

[Figure: FOM per GPU: Summit with 6 GPUs 5.82E+08, Summit with 4 GPUs 5.70E+08, Sierra with 4 GPUs 7.56E+08]
CORAL2 Data Science Benchmark
[Figure: speedup over the Titan baseline for the CORAL-2 deep learning benchmarks (CANDLE, RNN, CNN-googlenet, CNN-vgg, CNN-alexnet, CNN-overfeat) on SummitDev and Summit; labeled speedups range from roughly 3.5x to 6x]

[Figure: speedup over the Titan baseline for the CORAL-2 big data benchmarks (PCA, K-Means, SVM, based on pbdR) on SummitDev and Summit at 1, 2, and 4 nodes]
Conclusion / Lessons Learned
• Future collaborative procurements
  – Group procurements offer a mix of expertise from multiple labs
  – NRE provides significant benefits, but collaborations require compromises on the topics and the shape of the solutions
• System designers
  – HBM is very important; however, bandwidth between CPUs and accelerators cannot be neglected
  – Multi-tiered storage: future designs must carefully reconcile performance and transparency
• Users
  – Fitting data in HBM offers the best benefit on Summit
  – Datasets too large for HBM benefit from the more balanced Sierra design
  – A single address space eases porting
Five Gordon Bell Finalists Credit Summit Supercomputer
The finalists, representing Oak Ridge, Lawrence Berkeley, and Lawrence Livermore National Laboratories and the University of Tokyo, leveraged Summit's unprecedented computational capabilities to tackle a broad range of science challenges and produced innovations in machine learning, data science, and traditional modeling and simulation to maximize application performance. The Gordon Bell Prize winner will be announced at SC18 in Dallas in November. Finalists include:

• An ORNL team led by data scientist Robert Patton that scaled a deep learning technique on Summit to produce intelligent software that can automatically identify materials' atomic-level information from electron microscopy data.
• An LBNL and Lawrence Livermore National Laboratory team led by physicists André Walker-Loud and Pavlos Vranas that developed improved algorithms to help scientists predict the lifetime of neutrons and answer fundamental questions about the universe.
• An ORNL team led by computational systems biologist Dan Jacobson and OLCF computational scientist Wayne Joubert that developed a genomics algorithm capable of using mixed-precision arithmetic to attain exascale speeds.
• A team from the University of Tokyo led by associate professor Tsuyoshi Ichimura that applied AI and mixed-precision arithmetic to accelerate the simulation of earthquake physics in urban environments.
• A Lawrence Berkeley National Laboratory-led collaboration that trained a deep neural network to identify extreme weather patterns from high-resolution climate simulations.
CORAL board (1 node) showing the water cooling
Summit / Sierra Node Configuration
                              Summit               Sierra               Titan
CPU                           2 POWER9             2 POWER9             1 AMD Opteron (Interlagos)
Cores                         44 (22 per P9)       44 (22 per P9)       16
Memory                        512 GB               256 GB               32 GB
Memory bandwidth              340 GB/s             340 GB/s             51.2 GB/s
SMP bus                       X-Bus, 64 GB/s       X-Bus, 64 GB/s       NA
CPUs : GPUs                   2:6                  2:4                  1:1
GPU                           6 Volta V100         4 Volta V100         1 Tesla K20X
SMs                           480                  320                  14
GPU DP FLOPS                  42 TF                28 TF                1.31 TF
GPU memory                    96 GB HBM2           64 GB HBM2           6 GB GDDR5
GPU memory bandwidth          5.4 TB/s             3.6 TB/s             250 GB/s
NVLink BW                     50 GB/s per GPU      75 GB/s per GPU      NA
SSD capacity                  1.6 TB               1.6 TB               NA
SSD write BW                  2.1 GB/s             2.1 GB/s             NA
Interconnect injection BW     2 x 12.5 GB/s EDR    2 x 12.5 GB/s EDR    1 x 5.5 GB/s Gemini