The Design, Deployment, and Evaluation of the CORAL Pre-Exascale
Systems
Sudharshan S. Vazhkudai, Bronis R. de Supinski, Arthur S. Bland, Al Geist, James Sexton, Jim Kahle, Christopher J. Zimmer, Scott Atchley, Sarp Oral, Don E. Maxwell, Veronica G. Vergara Larrea, Adam Bertsch, Robin Goldstone, Wayne Joubert, Chris Chambreau, David Appelhans, Robert Blackmore, Ben Casses, George Chochia, Gene Davison, Matthew A. Ezell, Tom Gooding, Elsa Gonsiorowski, Leopold Grinberg, Bill Hanson, Bill Hartner, Ian Karlin, Matthew L. Leininger, Dustin Leverman, Chris Marroquin, Adam Moody, Martin Ohmacht, Ramesh Pankajakshan, Fernando Pizzano, James H. Rogers, Bryan Rosenburg, Drew Schmidt, Mallikarjun Shankar, Feiyi Wang, Py Watson, Bob Walkup, Lance D. Weems, Junqi Yin
Oak Ridge National Lab, Lawrence Livermore National Lab, IBM
CORAL Collaboration: ORNL, ANL, LLNL (2012)

Objective: Procure 3 leadership computers, to be sited at Oak Ridge, Lawrence Livermore, and Argonne in 2017. The RFP requested >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x that of Titan or Sequoia.

Approach:
• Competitive process: one RFP (issued by LLNL) led to 2 R&D contracts and 3 computer procurement contracts
• For risk reduction and to meet a broad set of requirements, 2 architectural paths were selected, and Oak Ridge and Argonne were required to choose different architectures
• Once selected, a multi-year Lab-Awardee relationship to co-design the computers
• Both R&D contracts jointly managed by the 3 Labs
• Each lab managed and negotiated its own computer procurement contract, with options to meet its specific needs
• It was understood that the long procurement lead time can impact architectural characteristics and designs

Then-current DOE leadership computers: Titan (ORNL), Sequoia (LLNL), and Mira (ANL), all in production since 2012.
CORAL Workloads
• Lab missions differ
  – Open science: understanding the natural world
  – Stockpile certification
• But workload needs are similar
  – Scalable science: RFP benchmarks LSMS, QBOX, HACC, Nekbone
  – Throughput: RFP benchmarks CAM-SE, UMT, AMG, MCB
Additional Target Requirements
• Memory capacity
  – 4 PB for direct application use
  – At least 2 GB per MPI process
• Power < 20 MW
  – Roughly $20M per year of power; OpEx should not exceed CapEx
• High performance for data-centric workloads
  – Graph500, Hash, Integer Sort
• Burst buffers
  – Needed to bridge the performance gap between node memory and the parallel file system
Overall Architecture: Accelerator-Based Supercomputers (Summit / Sierra)

• Nodes & FLOPS: 4,608 nodes (2 CPUs + 6 GPUs each), 200 PF / 4,320 nodes (2 CPUs + 4 GPUs each), 125 PF
• Compute & memory: IBM POWER9 + NVIDIA Volta; 2.4 PB DDR + 442 TB HBM2 = 2.8 PB / 1.1 PB DDR + 277 TB HBM2 = 1.4 PB
• Burst buffer: 7.4 PB, 9.7 TB/s / 6.9 PB, 9.1 TB/s
• Storage: 250 PB GPFS, 2.5 TB/s / 150 PB GPFS, 1.5 TB/s
• Interconnect: Mellanox IB EDR; 1:1 fat tree, 115 TB/s bisection BW / 2:1 fat tree, 54 TB/s bisection BW
• Racks: 256 / 240 compute racks @ 18 nodes per rack; 24-39 racks @ 2 ESS each; 9-18 racks @ 1 Mellanox director switch each; 4-5 racks of management servers and networking
• Management infrastructure: Power servers, with InfiniBand and Ethernet management networks
IBM POWER9 and NVIDIA Volta V100
• POWER9
  – Up to 24 cores fabricated; 22 available to applications to improve manufacturing yield
  – PCI-Express 4.0: twice the bandwidth of PCIe 3.0
  – NVLink 2.0: high-bandwidth links to the GPUs; coherent, single address space
• Volta
  – 5,120 CUDA cores (64 on each of 80 SMs)
  – 640 new Tensor cores (8 on each of 80 SMs)
  – 300 GB/s NVLink
  – 7.5 FP64 TFLOPS | 15 FP32 TFLOPS | 120 Tensor OPS
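A rough cross-check of the quoted peaks (a back-of-the-envelope sketch; the GPU clock of roughly 1.47 GHz is an assumption, not stated on the slide): FP64 ≈ 80 SMs × 32 FP64 units × 2 flops/FMA × 1.47 GHz ≈ 7.5 TFLOPS; FP32 ≈ 5,120 CUDA cores × 2 × 1.47 GHz ≈ 15 TFLOPS; Tensor ≈ 640 cores × 64 FMAs/cycle × 2 × 1.47 GHz ≈ 120 TOPS.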
Summit / Sierra Node Configuration

[Node diagram, one half node shown for each system: every POWER9 socket has its DRAM at 170 GB/s (256 GB per Summit node, 128 GB per Sierra node), a 64 GB/s X-Bus SMP link to the other socket, and a 16 GB/s PCIe Gen4 path toward the shared EDR IB NIC (2 x 12.5 GB/s injection). NVLink connects each socket to its GPUs and the GPUs to one another at 50 GB/s per link on Summit (3 GPUs per socket) and 75 GB/s on Sierra (2 GPUs per socket). Each ~7 TF Volta GPU has 16 GB of HBM2 at 900 GB/s, and the node-local NVMe sustains 6.0 GB/s reads and 2.1 GB/s writes.]
System Balance Ratio Analysis
                                            Summit    Sierra    Titan
Memory subsystem to intra-node connectivity ratios
  HBM BW : DDR BW                           15.8      10.6      4.9
  HBM BW : CPU-GPU BW                       18        12        39
  Per-GPU HBM BW : GPU-GPU BW               18        12        -
  DDR BW : CPU-GPU BW                       1.13      1.13      8
  DDR : HBM (capacity)                      5.3       4         5.3
  HBM capacity : GPU-GPU BW (GB per GB/s)   0.32      0.43      -
Memory subsystem to FLOPS ratios
  Memory capacity : FLOPS                   0.01      0.01      0.03
  Memory BW : FLOPS                         0.13      0.13      0.17
Interconnect subsystem to FLOPS ratios
  Injection BW : FLOPS                      0.0006    0.0009    0.004
  Bisection BW : FLOPS                      0.0006    0.0004    0.004
Other system ratios
  FS capacity : Memory capacity             89        107       42
  FLOPS : Power                             15.4      10.4      3
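As a sanity check on how these ratios follow from the node-level figures quoted elsewhere in this deck (a back-of-the-envelope sketch, rounded as in the table): Summit HBM BW : DDR BW ≈ (6 GPUs × 900 GB/s) / (2 sockets × 170 GB/s) = 5400 / 340 ≈ 15.9, and Summit DDR : HBM capacity = 512 GB / (6 × 16 GB) ≈ 5.3; the corresponding Sierra figures are (4 × 900) / 340 ≈ 10.6 and 256 / 64 = 4.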
I/O Design
[I/O architecture diagram summarized below; compute nodes reach the storage building blocks through the Mellanox fat-tree network.]

IBM Elastic Storage Server (ESS) GL4 building block: 2 x 2U POWER9 NSD servers plus 4 x 4U 106-disk enclosures, for 422 x 10 TB HDDs per GL4.

Center-wide parallel file systems:
• ORNL (Alpine, serving Summit): 77 ESS GL4s, 32,494 x 10 TB HDDs, 250 PB usable capacity, 2.5 TB/s to/from Summit, with connectivity to ORNL center-wide systems
• LLNL (serving Sierra): 56 ESS GL4s, 23,632 x 10 TB HDDs, 150 PB usable capacity, 1.5 TB/s to/from Sierra, with gateway nodes connecting other LLNL systems

Node-local burst buffer (each compute node carries a 1.6 TB NVMe: 6 GB/s read / 2.1 GB/s write, 1 million read IOPS / 170K write IOPS):
• Summit (ORNL): 4,608 compute nodes and NVMes, 7.4 PB raw capacity, 23.65 TiB/s read / 9.3 TiB/s write aggregate bandwidth, 4.6 billion read / 783 million write IOPS
• Sierra (LLNL): 4,320 compute nodes and NVMes, 6.9 PB raw capacity, 22.17 TiB/s read / 8.7 TiB/s write aggregate bandwidth, 4.3 billion read / 734 million write IOPS
Interconnect
• Both systems are built on Mellanox EDR InfiniBand
• Switch-IB 2 switches and ConnectX-5 HCAs provide:
  – Adaptive routing
  – Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
  – Dynamically connected (DC) queue pairs
  – Hardware tag matching
  – GPUDirect (see the CUDA-aware MPI sketch below)
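With GPUDirect, a CUDA-aware MPI (Spectrum MPI is based on Open MPI, which offers this) can move data directly from GPU memory without explicit host staging in user code. A minimal sketch, assuming a CUDA-aware MPI build and two ranks; the buffer size is illustrative:

```c
/* cuda_aware_pingpong.c: pass device pointers directly to MPI (needs a CUDA-aware MPI) */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 24;                          /* 16 Mi doubles = 128 MiB */
    double *dbuf;
    cudaMalloc((void **)&dbuf, (size_t)n * sizeof(double));

    if (rank == 0)          /* send straight from GPU memory */
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```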
Software Strategy Builds on CORAL Collaboration, Open Source, and OpenPOWER
Provide a complete HPC software solution for Power systems: fully integrated, tested, and supported; exploits Power hardware features; focused on maintainability and reliability.

• xCAT (hardware discovery and provisioning): open source; supports multiple hardware and installation options
• CSM (new system management software developed for CORAL): integrates system management and job management functions; jitter-aware design; data collection aligned to the system clock; big-data RAS analysis; open source in 2018
• Spectrum LSF (job and workflow management): superior job throughput at scale; enhanced job feedback for users and administrators; RESTful APIs to simplify integration into business processes; absolute priority scheduling; enhancements for advance reservations; pending job limits
• Spectrum MPI & PPT (optimized MPI solution): based on Open MPI; new features driven by CORAL innovation, including enablement of unique hardware features; high-performance collective library; enhanced multithreaded performance; IBM Power Parallel Performance Toolkit for analyzing application performance, including a single timeline tracking CPU, GPU, and MPI activity
• Compilers: a rich collection of compilers supporting multiple programming models (CUDA, OpenMP 4.x, OpenACC); optimized for Power; open-source and fully supported proprietary compiler options (see the offload sketch below)
• ESSL/PESSL (Engineering and Scientific Subroutine Libraries): tuned for each Power architecture; serial, SMP (OpenMP), GPU (CUDA), SIMD (VSX), and SPMD (MPI) subroutines; callable from Fortran, C, and C++; Parallel ESSL SMP libraries
• Spectrum Scale (integrated file system): single namespace up to 250 PB; 2.5 TB/s large-block sequential I/O performance; 2.6M file creates/s for 32 KB files in unique directories; 50K file creates/s to a single shared directory
• Burst buffer (new system productivity feature for CORAL): asynchronous checkpoint transfer from on-node NVMe to the file system; stage-in of data to expedite job start; stage-out releases the CPU to start the next job; flash wear, health, and usage monitoring; open source in 2018
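As an illustration of the directive-based models this compiler stack targets, a minimal OpenMP 4.x target-offload kernel follows (a sketch only; the array size and names are illustrative):

```c
/* saxpy_offload.c: OpenMP 4.x target offload of a SAXPY loop to the GPU */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    const float a = 2.0f;

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* map the arrays to the device and spread the loop over GPU teams/threads */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```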
Memory Interconnect Evaluation
• CPU STREAM measurements (ci = core isolation); the Triad kernel is sketched below
• Slight differences across columns reflect memory configuration differences

CPU STREAM rates (GB/s):

            Peak/ci   Peak/ci   Peak     Peak     Sierra   Sierra
Cores       40        42        40       44       40       44
Copy        272.9     273.5     273.1    274.6    277.3    278.3
Scale       269.6     270.6     269.5    271.4    274.4    275.7
Add         268.8     269.8     268.7    270.6    273.5    274.9
Triad       273       273.9     273.5    275.3    277.7    279
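For reference, the Triad kernel behind these rates looks like the following OpenMP sketch (simplified; the published STREAM benchmark adds repetition, timing, and validation rules, and the array size here is illustrative):

```c
/* stream_triad.c: simplified STREAM Triad; compile e.g. with -O3 -fopenmp */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)   /* 128 Mi doubles (~1 GiB per array), well beyond cache */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];          /* Triad: a = b + s*c */
    double t1 = omp_get_wtime();

    /* three arrays touched per iteration: one store plus two loads */
    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```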
NVLink
• NVLink results show significant bandwidth, both host to accelerator and accelerator to accelerator (a host-to-device measurement sketch follows the tables)

POWER9 to GPUs (GB/s):

MPI process count     1        2        3        4        5        6
Peak   HTOD           45.93    91.85    137.69   183.54   229.18   274.82
Peak   DTOH           45.95    91.9     137.85   183.8    225.64   268.05
Peak   BIDIR          85.7     172.59   223.54   276.34   277.39   278.07
Butte  HTOD           68.66    137.39   206.05   275.47   —        —
Butte  DTOH           68.91    137.48   203.8    271.12   —        —
Butte  BIDIR          126.06   255.47   270.72   283.08   —        —

GPU to GPU (GB/s):

                          No PTP                                PTP
                          Socket 0   Socket 1   Cross-socket    Socket 0   Socket 1   Cross-socket   Peak
Peak   Unidirectional     33.18      25.84      30.32           46.33      46.55      25.89          50
Peak   Bidirectional      54.48      27.91      49.02           93.02      93.11      21.63          100
Butte  Unidirectional     41.27      24.72      31.04           69.49      69.49      31.05          75
Butte  Bidirectional      58.63      25.55      49.17           139.15     124.3      49.15          150
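The host-to-device rows above follow the usual pinned-memory copy measurement; a minimal CUDA sketch of that pattern (one process per GPU in the actual runs; the transfer size and iteration count here are illustrative):

```c
/* htod_bw.c: time pinned-host to device copies over NVLink; build with nvcc */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 256UL << 20;          /* 256 MiB per transfer */
    const int iters = 50;
    void *hbuf, *dbuf;

    cudaMallocHost(&hbuf, bytes);              /* pinned host memory */
    cudaMalloc(&dbuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpyAsync(dbuf, hbuf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("HTOD: %.1f GB/s\n", (double)bytes * iters / (ms / 1e3) / 1e9);

    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```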
Adaptive Routing
• Adaptive routing should enable applications to capture more network bandwidth
• mpiGraph: average single-port bandwidth increases from 5.7 GB/s to 9.5 GB/s with adaptive routing enabled (a single-pair measurement sketch follows)

[Figure: mpiGraph bandwidth maps with adaptive routing on vs. off]
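mpiGraph aggregates many simultaneous pairwise transfers; the underlying single-pair measurement is essentially the following MPI sketch (message size and repetition count are illustrative):

```c
/* p2p_bw.c: point-to-point bandwidth between rank 0 and rank 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int bytes = 1 << 20;                 /* 1 MiB messages */
    const int iters = 1000;
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("P2P bandwidth: %.2f GB/s\n", (double)bytes * iters / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```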
SHARP
• Allreduce comparison (measurement pattern sketched below)
  – Spectrum MPI libcoll
  – Mellanox FCA
  – Mellanox SHARP
• SHARP barrier: ~7 us at full system

[Figure: OSU Allreduce latency for a 2 KiB payload from 8 up to 2,048 nodes, comparing IBM Spectrum MPI, Open MPI, FCA, and SHARP; time in microseconds on the y-axis]
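The comparison uses the OSU allreduce latency pattern; a minimal sketch of that measurement for the 2 KiB payload shown (iteration count is illustrative):

```c
/* allreduce_latency.c: average MPI_Allreduce time for a 2 KiB payload */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 2048 / sizeof(float);    /* 512 floats = 2 KiB */
    const int iters = 10000;
    float in[512], out[512];
    for (int i = 0; i < count; i++) in[i] = 1.0f;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Allreduce(2 KiB): %.2f us\n", (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```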
I/O Subsystem (Summit – Alpine – Disk only)
• Single GL4 performance (average)
  – Sequential POSIX write/read: 33 GB/s / 39 GB/s
  – Random POSIX write/read: 30 GB/s / 40 GB/s
  – 32 KB POSIX transactions (create/open + write(32 KB) + close): 42K+/s
• Aggregate performance (77 GL4s, average)
  – Sequential POSIX write/read: 2.47 TB/s / 2.72 TB/s
  – Random POSIX write/read: 2.38 TB/s / 3.07 TB/s
  – MPI-IO (file per process) write/read: 2 TB/s / 2.1 TB/s
  – MPI-IO (single shared file) write/read: 2.1 TB/s / 1.9 TB/s (see the sketch below)
• Metadata performance (77 GL4s, average)
  – 936K+ file creates/s with unique directories
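The MPI-IO single-shared-file rate corresponds to a collective write pattern like the following sketch (the file path and block size are illustrative; the reported rates come from full-scale benchmark runs rather than this toy example):

```c
/* shared_file_write.c: each rank writes one block into a single shared file */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int block = 16 << 20;                 /* 16 MiB per rank */
    char *buf = malloc(block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/alpine/scratch/shared.dat",   /* illustrative path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    /* collective write: every rank writes its block at its own offset in one shared file */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Shared-file write: %.2f GB/s\n", (double)block * nranks / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```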
Burst Buffer Measurements
• FIO used to compare the TDS and the burst buffers (a per-node write sketch follows)
• Node-local NVMe shows linear scaling: per-node bandwidth remains consistent as nodes are added
• TDS/PFS performance degrades with too many nodes, from 35 GB/s down to 31 GB/s
• Oct 18 production measurement: 23 TB read, 9 TB written

[Figure: per-node bandwidth (GB/s) at 8, 16, 32, and 128 nodes, burst buffer test vs. PFS]
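The per-node burst-buffer rates were gathered with FIO; as a rough equivalent, a minimal POSIX sketch of a sequential-write timing to the node-local NVMe (the mount point and sizes are illustrative):

```c
/* bb_write.c: time a sequential write to the node-local NVMe */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const size_t chunk = 4 << 20;        /* 4 MiB writes */
    const int nchunks = 1024;            /* 4 GiB total */
    char *buf = malloc(chunk);
    memset(buf, 0xab, chunk);

    int fd = open("/mnt/bb/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);  /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nchunks; i++)
        write(fd, buf, chunk);
    fsync(fd);                           /* ensure data reaches the device before stopping the clock */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Write: %.2f GB/s\n", (double)chunk * nchunks / secs / 1e9);

    close(fd);
    free(buf);
    return 0;
}
```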
AMG 2013
• AMG is a throughput app with 2 phases, setup and solve
  – Setup shows CPU benefits
  – Solve shows GPU impact

[Figure: Figure of Merit (FOM) for the setup and solve phases at 1, 6, and 9 threads, comparing Summit with 6 GPUs/node, Summit with 4 GPUs/node, and Sierra with 4 GPUs/node; setup-phase FOMs are on the order of 1E7 and solve-phase FOMs on the order of 1E10 across the three configurations]
UMT 2013
• NVLink offers significant benefit
• The higher CPU-GPU bandwidth per GPU positively impacts Sierra

[Figure: FOM per GPU: Summit with 6 GPUs 5.82E+08, Summit with 4 GPUs 5.70E+08, Sierra with 4 GPUs 7.56E+08]
CORAL2 Data Science Benchmark
[Figure: speedup over the Titan baseline for the CORAL-2 deep learning benchmarks (CANDLE, RNN, CNN-googlenet, CNN-vgg, CNN-alexnet, CNN-overfeat) on SummitDev and Summit; labeled speedups range from roughly 3.5x to 6x]

[Figure: speedup over the Titan baseline for the CORAL-2 big data benchmarks (PCA, K-Means, SVM, based on pbdR) on SummitDev and Summit at 1, 2, and 4 nodes]
Conclusion / Lessons Learned
• Future collaborative procurements
  – Group procurements offer a mix of expertise from multiple labs
  – NRE provides significant benefits, but collaborations require compromises on the topics and the shape of the solutions
• System designers
  – HBM is very important; however, bandwidth between CPUs and accelerators cannot be neglected
  – Multi-tiered storage: future designs must carefully reconcile performance and transparency
• Users
  – Fitting data in HBM offers the best benefit on Summit
  – Datasets too large for HBM benefit from the more balanced Sierra design
  – A single address space eases porting
Five Gordon Bell Finalists Credit Summit Supercomputer
The finalists, representing Oak Ridge, Lawrence Berkeley, and Lawrence Livermore National Laboratories and the University of Tokyo, leveraged Summit's unprecedented computational capabilities to tackle a broad range of science challenges and produced innovations in machine learning, data science, and traditional modeling and simulation to maximize application performance. The Gordon Bell Prize winner will be announced at SC18 in Dallas in November. Finalists include:

• An ORNL team led by data scientist Robert Patton that scaled a deep learning technique on Summit to produce intelligent software that can automatically identify materials' atomic-level information from electron microscopy data.
• An LBNL and Lawrence Livermore National Laboratory team led by physicists André Walker-Loud and Pavlos Vranas that developed improved algorithms to help scientists predict the lifetime of neutrons and answer fundamental questions about the universe.
• An ORNL team led by computational systems biologist Dan Jacobson and OLCF computational scientist Wayne Joubert that developed a genomics algorithm capable of using mixed-precision arithmetic to attain exascale speeds.
• A team from the University of Tokyo led by associate professor Tsuyoshi Ichimura that applied AI and mixed-precision arithmetic to accelerate the simulation of earthquake physics in urban environments.
• A Lawrence Berkeley National Laboratory-led collaboration that trained a deep neural network to identify extreme weather patterns from high-resolution climate simulations.
CORAL board (1 node) showing the water cooling
Summit / Sierra Node Configuration
                              Summit               Sierra               Titan
CPU                           2 POWER9             2 POWER9             1 AMD Opteron (Interlagos)
Cores                         44 (22 per P9)       44 (22 per P9)       16
Memory                        512 GB               256 GB               32 GB
Memory bandwidth              340 GB/s             340 GB/s             51.2 GB/s
SMP bus                       X-Bus, 64 GB/s       X-Bus, 64 GB/s       NA
CPUs : GPUs                   2:6                  2:4                  1:1
GPU                           6 Volta V100         4 Volta V100         1 Tesla K20X
SMs                           480                  320                  14
GPU DP FLOPS                  42 TF                28 TF                1.31 TF
GPU memory                    96 GB HBM2           64 GB HBM2           6 GB GDDR5
GPU memory bandwidth          5.4 TB/s             3.6 TB/s             250 GB/s
NVLink BW                     50 GB/s per GPU      75 GB/s per GPU      NA
SSD capacity                  1.6 TB               1.6 TB               NA
SSD write BW                  2.1 GB/s             2.1 GB/s             NA
Interconnect injection BW     2 x 12.5 GB/s EDR    2 x 12.5 GB/s EDR    1 x 5.5 GB/s Gemini