Designing Scalable HPC, Deep Learning, Big Data and Cloud Middleware for Exascale Systems
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at the HPC-AI UK 2019 Conference (September '19)
Follow us on https://twitter.com/mvapich
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PFlops reached in 2017; 149 PFlops in 2018
• An ExaFlop (1 EFlops) system is expected in 2020-2021!
Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
[Figure: Spark, Hadoop, and Deep Learning jobs mapped onto the same physical compute infrastructure]
Presentation Overview
• MVAPICH Project
– MPI and PGAS Library with CUDA-Awareness
• HiBD Project
– High-Performance Big Data Analytics Library
• HiDL Project
– High-Performance Deep Learning
• Public Cloud Deployment
– Microsoft-Azure and Amazon-AWS
• Conclusions
Parallel Programming Models Overview
• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 operate on a single shared memory
• Distributed Memory Model (MPI - Message Passing Interface): each process has its own memory and communicates via messages
• Partitioned Global Address Space (PGAS) Model (OpenSHMEM, UPC, Chapel, X10, CAF, ...): private memories combined with a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
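To make the hybrid MPI+PGAS idea concrete, here is a minimal sketch that mixes two-sided MPI with one-sided OpenSHMEM puts. It assumes a runtime that supports launching both models in the same job (as MVAPICH2-X does); the variable names and the workload are purely illustrative.

```c
/* Hybrid MPI + OpenSHMEM (PGAS) sketch: each PE writes its ID into a
 * symmetric array on PE 0 with a one-sided put, while two-sided MPI
 * remains usable in the same program. Assumes a unified runtime
 * such as MVAPICH2-X. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();                               /* start the PGAS runtime in the same job */
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    long *ids = shmem_malloc(npes * sizeof(long));  /* symmetric (globally addressable) memory */
    ids[me] = -1;
    shmem_barrier_all();

    shmem_long_p(&ids[me], (long)me, 0);        /* one-sided put into PE 0's copy of `ids` */
    shmem_barrier_all();

    if (me == 0) {
        long sum = 0;
        for (int i = 0; i < npes; i++) sum += ids[i];
        printf("sum of PE ids on PE 0: %ld\n", sum);
    }

    MPI_Barrier(MPI_COMM_WORLD);                /* two-sided MPI in the same program */
    shmem_free(ids);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```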
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,025 organizations in 89 countries
– More than 589,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the TACC Frontera System
MVAPICH2 Release Timeline and Downloads
[Figure: Cumulative number of downloads from the OSU site, Sep '04 to Apr '19, growing to nearly 600,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.2, MV2-X 2.3rc2, MV2-GDR 2.3.2, MV2-Virt 2.2, MV2-MIC 2.0, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
– Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
– Transport Protocols: RC, SRD, UD, DC
– Modern Features: UMR, ODP, SR-IOV, Multi-Rail
• Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport Mechanisms: Shared Memory, CMA, XPMEM, IVSHMEM
– Modern Features: Optane*, NVLink, CAPI*
* Upcoming
MVAPICH2 Software Family (Requirements -> Library)
• MPI with IB, iWARP, Omni-Path, and RoCE -> MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE -> MVAPICH2-X
• MPI with IB, RoCE & GPU and support for Deep Learning -> MVAPICH2-GDR
• HPC Cloud with MPI & IB -> MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE -> MVAPICH2-EA
• MPI energy monitoring tool -> OEMT
• InfiniBand network analysis and monitoring -> OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance -> OMB
Convergent Software Stacks for HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
One-way Latency: MPI over IB with MVAPICH2
[Figure: Small-message and large-message one-way latency (us) vs. message size for six fabrics; small-message latencies are in the 1.0-1.2 us range (callouts: 1.11, 1.19, 1.01, 1.15, 1.04, 1.1 us)]
• TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
• ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
• ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
Bandwidth: MPI over IB with MVAPICH2
[Figure: Unidirectional bandwidth (MBytes/sec) vs. message size for the six fabrics above, with peaks of 3,373 / 6,356 / 12,083 / 12,366 / 12,590 / 24,532 MB/s; bidirectional bandwidth peaks of 6,228 / 12,161 / 21,227 / 21,983 / 24,136 / 48,027 MB/s]
(Same platforms as on the previous slide.)
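Numbers like these are normally collected with the OSU Micro-Benchmarks (OMB, shipped with MVAPICH2). The following is only a simplified ping-pong sketch of how one-way latency is measured; the real osu_latency benchmark adds warm-up iterations and sweeps over message sizes.

```c
/* Simplified ping-pong latency sketch (run with exactly 2 ranks, e.g.
 * `mpirun -np 2 ./a.out`). One-way latency is half the round-trip time. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int msg_size = 8;          /* small message, in bytes */
    const int iters = 10000;
    char buf[8] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));
    MPI_Finalize();
    return 0;
}
```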
Startup Performance on TACC Frontera
• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2
[Figure: MPI_Init time (milliseconds) on Frontera for Intel MPI 2019 and MVAPICH2 2.3.2, from 56 to 57,344 processes; callouts at 4.5s and 3.9s at the largest scale]
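As a rough illustration of how MPI_Init time can be measured at the application level, here is a small sketch; MPI_Wtime is not usable before MPI_Init, so a POSIX clock is used instead. The actual measurements above come from more careful methodology.

```c
/* Sketch: timing MPI_Init at the application level. MPI_Wtime() cannot be
 * called before MPI_Init, so a POSIX monotonic clock is used instead. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    MPI_Init(&argc, &argv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double local = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double slowest;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Startup is complete only when the slowest rank is up, so report the max. */
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Init time (max over ranks): %.3f s\n", slowest);

    MPI_Finalize();
    return 0;
}
```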
MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
• OSU Micro Benchmark, 64 PPN
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
[Figure: MPI_Allreduce latency (us) vs. message size (4B-4KB and 8KB-256KB) for MVAPICH2, MVAPICH2-OPT, and IMPI]
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available since MVAPICH2-X 2.3b.
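For reference, a minimal osu_allreduce-style sketch that times MPI_Allreduce at the 32 KB message size discussed above; warm-up iterations and the message-size sweep of the real benchmark are omitted, and the buffer contents are illustrative.

```c
/* Sketch of an MPI_Allreduce latency measurement at a 32 KB payload. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int count = 32 * 1024 / sizeof(float);   /* 32 KB payload */
    const int iters = 1000;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *sbuf = malloc(count * sizeof(float));
    float *rbuf = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) sbuf[i] = 1.0f;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Allreduce latency (32 KB): %.2f us\n",
               (t1 - t0) * 1e6 / iters);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```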
Shared Address Space (XPMEM)-based Collectives Design
• "Shared Address Space"-based true zero-copy Reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4MB Allreduce
[Figure: OSU_Allreduce and OSU_Reduce latency (us) for 16KB-4MB messages with 256 processes on Broadwell, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X-2.3rc1; callouts at 73.2, 37.9, 36.1, and 16.8 us]
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Available since MVAPICH2-X 2.3rc1.
Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand
• Up to 33% performance improvement in the P3DFFT application with 448 processes
• Up to 29% performance improvement in the HPL application with 896 processes
• PPN = 28; Memory Consumption = 69%
[Figure: P3DFFT time per loop (lower is better) and High Performance Linpack (HPL) GFLOPS (higher is better) for MVAPICH2 Async, MVAPICH2 Default, and IMPI 2019 on up to 896 processes; improvement callouts of 8%, 12%, 26%, 27%, 29%, and 33% at various process counts]
A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D.K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018. Enhanced version accepted for the PARCO Journal. Available since MVAPICH2-X 2.3rc1.
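The pattern that benefits from this asynchronous progress design is computation overlapped with non-blocking communication, roughly as in the sketch below (MPI-3 non-blocking collectives; the compute loop is just a stand-in for application work).

```c
/* Sketch: overlapping independent computation with a non-blocking collective
 * (MPI-3 MPI_Iallreduce). With asynchronous progress, the reduction can
 * advance while the application computes; without it, progress may happen
 * only when the application re-enters the MPI library (e.g., in MPI_Wait). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sbuf[1024], rbuf[1024], work = 0.0;
    for (int i = 0; i < 1024; i++) sbuf[i] = rank;

    MPI_Request req;
    MPI_Iallreduce(sbuf, rbuf, 1024, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Stand-in for application work that does not depend on the reduction. */
    for (int i = 0; i < 1000000; i++)
        work += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("rbuf[0] = %f, work = %f\n", rbuf[0], work);
    MPI_Finalize();
    return 0;
}
```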
Intra-node Point-to-Point Performance on OpenPOWER
• Platform: Two nodes of OpenPOWER (POWER9-ppc64le) CPU using Mellanox EDR (MT4121) HCA
• Intra-socket small-message latency as low as 0.22 us with MVAPICH2-2.3.1
[Figure: Intra-socket small/large message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3.1 and SpectrumMPI-2019.02.07]
Point-to-point: Latency & Bandwidth (Intra-socket) on Mayer
• MVAPICH2-X delivers up to 8.2x better results than OpenMPI+UCX
[Figure: Small/medium/large message latency (us) and bandwidth (MB/s) comparing MVAPICH2-X and OpenMPI+UCX]
Point-to-point: Latency & Bandwidth (Inter-socket) on Mayer
• MVAPICH2-X shows up to 3.5x, 5x, and 8.3x better results than OpenMPI+UCX across the latency and bandwidth ranges
[Figure: Small/medium/large message latency (us) and bandwidth (MB/s), inter-socket, comparing MVAPICH2-X and OpenMPI+UCX]
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• At Sender: MPI_Send(s_devbuf, size, ...);
• At Receiver: MPI_Recv(r_devbuf, size, ...);
• Standard MPI interfaces used for unified data movement; GPU data movement is handled inside MVAPICH2
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
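Expanding the snippet above into a minimal, self-contained sketch: device pointers are passed straight to MPI_Send/MPI_Recv and the CUDA-aware library moves the data. This assumes a CUDA-aware build such as MVAPICH2-GDR (run with MV2_USE_CUDA=1), exactly two ranks, and one GPU per rank.

```c
/* Sketch: passing GPU (device) pointers directly to MPI, as on the slide
 * above; no explicit cudaMemcpy staging is needed with a CUDA-aware MPI. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    const int n = 1 << 20;                 /* 1M floats (~4 MB) */
    float *devbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                      /* illustrative: one GPU per rank */
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0)        /* device pointer passed directly to MPI_Send */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```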
Optimized MVAPICH2-GDR Design
• GPU-GPU inter-node latency as low as 1.85 us (11X improvement over the non-GDR path)
• Roughly 9x-10x higher inter-node bandwidth and bi-directional bandwidth with MV2-GDR 2.3 vs. MV2 (no GDR)
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2 (NO-GDR) vs. MV2-GDR 2.3 across 1 byte-8 KB messages]
MVAPICH2-GDR-2.3; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
• About 2X higher average time steps per second (TPS) with MV2+GDR over MV2 for both the 64K- and 256K-particle runs
[Figure: Average time steps per second (TPS) for 64K and 256K particles on 4-32 processes, MV2 vs. MV2+GDR]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
• 2X improvement on 32 GPU nodes (Wilkes GPU cluster)
• 30% improvement on 96 GPU nodes (8 GPUs/node) on the CSCS GPU cluster
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application
[Figure: Normalized execution time with Default, Callback-based, and Event-based designs on the CSCS (16-96 GPUs) and Wilkes (4-32 GPUs) clusters]
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand
• MVAPICH2-X outperforms Intel MPI 18.1.163 by up to 31% (per-benchmark callouts: 31%, 29%, 5%, -12%, 1%, and 11%)
• Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes with 28 PPN, interconnected with 100Gbps Mellanox MT4115 EDR ConnectX-4 HCA
[Figure: Execution time (s) of MILC, Leslie3D, POP2, LAMMPS, WRF2, and LU with Intel MPI 18.1.163 vs. MVAPICH2-X]
Application Scalability on Skylake and KNL (Stampede2)
• MiniFE (1300x1300x1300, ~910 GB) on up to 8,192 processes (KNL: 64 PPN, Skylake: 48 PPN)
• NEURON (YuEtAl2012) and Cloverleaf (bm64, MPI+OpenMP with NUM_OMP_THREADS = 2) scaled to several thousand processes
• Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K
[Figure: Execution time (s) of MiniFE, NEURON, and Cloverleaf with MVAPICH2 on the Stampede2 Skylake (48 PPN) and KNL (64/68 PPN) partitions]
Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi @SDSC, and Samuel Khuvis @OSC. Testbed: TACC Stampede2 using MVAPICH2-2.3b
Presentation Overview
• MVAPICH Project
– MPI and PGAS Library with CUDA-Awareness
• HiBD Project
– High-Performance Big Data Analytics Library
• HiDL Project
– High-Performance Deep Learning
• Public Cloud Deployment
– Microsoft-Azure and Amazon-AWS
• Conclusions
Data Management and Processing on Modern Datacenters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline): HDFS, MapReduce, Spark
Convergent Software Stacks for HPC, Big Data and Deep Learning
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 315 organizations from 35 countries
• More than 31,100 downloads from the project site
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
HiBD Release Timeline and Downloads
[Figure: Cumulative number of downloads (up to roughly 30,000) over time, annotated with releases from RDMA-Hadoop 1.x 0.9.0 and RDMA-Memcached 0.9.1 / OHB-0.7.1 through RDMA-Hadoop 2.x 1.3.0, RDMA-Spark 0.9.5, RDMA-HBase 0.9.1, and RDMA-Memcached 0.9.6 / OHB-0.9.3]
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Performance Numbers of RDMA for Apache Hadoop 3.x (NEW)
• Evaluated on the OSU-RI2 Cluster
• The RDMA-IB design improves the job execution time of TestDFSIO by 1.4x - 3.4x compared to IPoIB (100Gbps)
• For the TestDFSIO throughput experiment, the RDMA-IB design shows an improvement of 1.2x - 3.1x compared to IPoIB (100Gbps)
• Support for Erasure Coding is being worked out (presented at HPDC '19)
[Figure: TestDFSIO latency (execution time in seconds) and throughput (MBps) for 120 GB, 240 GB, and 360 GB data sizes, IPoIB (100Gbps) vs. RDMA-IB (100Gbps)]
Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure
RDMA for Apache Spark Distribution
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDDs
– http://hibd.cse.ohio-state.edu
Performance Evaluation on SDSC Comet - HiBench PageRank
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores (768/1536M, 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
[Figure: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes on 32 worker nodes (768 cores) and 64 worker nodes (1,536 cores), IPoIB vs. RDMA]
Presentation Overview
• MVAPICH Project
– MPI and PGAS Library with CUDA-Awareness
• HiBD Project
– High-Performance Big Data Analytics Library
• HiDL Project
– High-Performance Deep Learning
• Public Cloud Deployment
– Microsoft-Azure and Amazon-AWS
• Conclusions
Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
– Unusually large message sizes (on the order of megabytes)
– Most communication based on GPU buffers
• Existing State-of-the-art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: Scale-up vs. scale-out performance positioning of cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI, with the desired design combining both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
Convergent Software Stacks for HPC, Big Data and Deep Learning
High-Performance Deep Learning
• CPU-based Deep Learning
• GPU-based Deep Learning
• RDMA-Enabled TensorFlow
Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning
• CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
• Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using All_Reduce
• The proposed designs show good scalability with increasing system size
[Figure: CNTK AlexNet training execution time (s) on 28-224 processes (batch size = default, iterations = 50, PPN = 28) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; callouts of 20% and 9%]
Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Available since the MVAPICH2-X 2.3rc1 release.
Deep Learning on Frontera
• TensorFlow, PyTorch, and MXNet are widely used Deep Learning frameworks
• Optimized by Intel using the Math Kernel Library for DNN (MKL-DNN) for Intel processors
• Single-node performance can be improved by running multiple MPI processes
[Figure: Impact of batch size (1-256) on ResNet-50 images/sec for TensorFlow, PyTorch, and MXNet, and performance improvement from running multiple MPI processes per node (1-16 PPN)]
A. Jain et al., Scaling Deep Learning Frameworks on Frontera using MVAPICH2 MPI, under review
Deep Learning on Frontera
• Scaled MVAPICH2-X on 2,048 nodes of Frontera for distributed training using TensorFlow
• Observed 260K images per second for ResNet-50 on 2,048 nodes
• ResNet-50 can be trained in 7 minutes on 2,048 nodes (114,688 cores)
[Figure: Images/sec scaling from 1 to 2,048 nodes for Intel MPI 19.0.5 vs. MVAPICH2-X 2.3.2 (with ideal curve), and per-framework scaling (Keras TF_CNN, MXNet, PyTorch) up to 256 nodes]
A. Jain et al., Scaling Deep Learning Frameworks on Frontera using MVAPICH2 MPI, under review
MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale
• Optimized designs in the upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs: up to 1.6X lower latency and 1.7X higher bandwidth
• Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect
[Figure: Latency (us) for 4B-16KB messages and bandwidth (GB/s) for 32MB-256MB messages on 1,536 GPUs, plus 128MB-message bandwidth scaling from 24 to 1,536 GPUs, comparing MVAPICH2-GDR-2.3.2 with NCCL 2.4 (and SpectrumMPI 10.2.0.11 / OpenMPI 4.0.1 in the scaling plot)]
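The communication pattern behind these numbers is a large-message Allreduce over GPU-resident gradients. A minimal sketch of that operation, assuming a CUDA-aware MPI such as MVAPICH2-GDR and one GPU per rank, looks like this; deep learning frameworks issue essentially this reduction (via MPI or NCCL) to average gradients across ranks.

```c
/* Sketch: large-message gradient averaging with MPI_Allreduce on GPU memory.
 * Assumes a CUDA-aware MPI such as MVAPICH2-GDR and one GPU per rank. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    const int count = 32 * 1024 * 1024;        /* 32M floats = 128 MB of gradients */
    float *grads;

    MPI_Init(&argc, &argv);
    cudaSetDevice(0);                          /* illustrative: one GPU per MPI rank */
    cudaMalloc((void **)&grads, (size_t)count * sizeof(float));

    /* ... backward pass fills `grads` on the GPU ... */

    /* In-place sum across all ranks, operating directly on device memory. */
    MPI_Allreduce(MPI_IN_PLACE, grads, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* ... scale by 1/nranks and apply the optimizer update ... */

    cudaFree(grads);
    MPI_Finalize();
    return 0;
}
```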
Distributed Training with TensorFlow and MVAPICH2-GDR
• ResNet-50 training using the TensorFlow benchmark on SUMMIT -- 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (~1.2 million) images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 = 324 seconds, i.e., about 5.4 minutes!
• MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k
[Figure: Images per second (thousands) from 1 to 1,536 GPUs for NCCL-2.4 vs. MVAPICH2-GDR-2.3.2; errors were observed for NCCL2 beyond 96 GPUs]
Platform: The Summit Supercomputer (#1 on Top500.org) with 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
MVAPICH2-GDR on Container with Negligible Overhead
• GPU-GPU inter-node latency, bandwidth, and bi-bandwidth inside a Docker container closely match native performance
[Figure: GPU-GPU inter-node latency (us), bandwidth (MB/s), and bi-bandwidth (MB/s) across 1 byte-4 MB messages, Docker vs. Native]
MVAPICH2-GDR-2.3.2; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct-RDMA
RDMA-TensorFlow Distribution
• High-Performance Design of TensorFlow over RDMA-enabled Interconnects
– High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
– RDMA-based data communication
– Adaptive communication protocols
– Dynamic message chunking and accumulation
– Support for RDMA device selection
– Easily configurable for different protocols (native InfiniBand and IPoIB)
• Current release: 0.9.1
– Based on Google TensorFlow 1.3.0
– Tested with
• Mellanox InfiniBand adapters (e.g., EDR)
• NVIDIA GPGPU K80
• CUDA 8.0 and CUDNN 5.0
– http://hidl.cse.ohio-state.edu
• OpenPOWER support will be coming soon
Performance Benefit for RDMA-TensorFlow (Inception3)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
– Up to 20% performance speedup over default gRPC (IPoIB) for 8 GPUs (4 nodes)
– Up to 34% performance speedup over default gRPC (IPoIB) for 16 GPUs (8 nodes)
– Up to 37% performance speedup over default gRPC (IPoIB) for 24 GPUs (12 nodes)
[Figure: Images/second vs. batch size (16, 32, 64) on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
Using HiDL Packages for Deep Learning on Existing HPC Infrastructure
Presentation Overview
• MVAPICH Project
– MPI and PGAS Library with CUDA-Awareness
• HiBD Project
– High-Performance Big Data Analytics Library
• HiDL Project
– High-Performance Deep Learning
• Public Cloud Deployment
– Microsoft-Azure and Amazon-AWS
• Conclusions
MVAPICH2-Azure 2.3.2
• Released on 08/16/2019
• Major Features and Enhancements
– Based on MVAPICH2-2.3.2
– Enhanced tuning for point-to-point and collective operations
– Targeted for Azure HB & HC virtual machine instances
– Flexibility for 'one-click' deployment
– Tested with Azure HB & HC VM instances
Performance of Radix
• Total execution time on HC instances (lower is better): MVAPICH2-X is up to 3x faster than HPCx
• Total execution time on HB instances (60, 120, and 240 processes): MVAPICH2-X is up to 38% faster than HPCx
[Figure: Radix execution time (seconds) vs. number of processes (Nodes x PPN) on Azure HC and HB instances, MVAPICH2-X vs. HPCx]
MVAPICH2-X-AWS 2.3
• Released on 08/12/2019
• Major Features and Enhancements
– Based on MVAPICH2-X 2.3
– New design based on the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport protocol
– Support for XPMEM-based intra-node communication for point-to-point and collectives
– Enhanced tuning for point-to-point and collective operations
– Targeted for AWS instances with Amazon Linux 2 AMI and EFA support
– Tested with c5n.18xlarge instances
Application Performance on AWS
• Up to 10% performance improvement for miniGhost on 8 nodes (288 processes) with MVAPICH2-X over OpenMPI
• Up to 27% better performance with CloverLeaf on 8 nodes using the SRD-based design (MV2X-SRD)
[Figure: miniGhost and CloverLeaf execution time (seconds) on 72 (2x36), 144 (4x36), and 288 (8x36) processes, comparing MV2X (UD and SRD) with OpenMPI]
S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter, Hot Interconnects 2019
Concluding Remarks
• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of designing convergent software stacks
• The presented solutions enable the HPC, Big Data, and Deep Learning communities to take advantage of current and next-generation systems
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Silver ISV Member for the OpenPOWER Consortium + Products
• X-ScaleSolutions has joined the OpenPOWER Consortium as a silver ISV member
• Provides flexibility:
– To have the MVAPICH2, HiDL, and HiBD libraries integrated into the OpenPOWER software stack
– To be a part of the OpenPOWER ecosystem
– To participate with different vendors in the bidding, installation, and deployment process
• Introduced two new integrated products with support for OpenPOWER systems (presented at the OpenPOWER North America Summit)
– X-ScaleHPC
– X-ScaleAI
– Send an e-mail to [email protected] for a free trial!!
7th Annual MVAPICH User Group (MUG) Meeting
• August 19-21, 2019; Columbus, Ohio, USA
• Keynote Speakers
– Dan Stanzione, Texas Advanced Computing Center (TACC)
– Robert Harrison, Director of the Institute for Advanced Computational Science (IACS) and the Brookhaven Computational Science Center (CSC)
• Tutorials
• ARM
• IBM
• Mellanox
• OSU/MVAPICH2
Slides and Videos of the talks are available from
http://mug.mvapich.cse.ohio-state.edu
• Invited Speakers
– Gregory Blum Becker, Lawrence Livermore National Laboratory
– Nicholas Brown, EPCC, The University of Edinburgh (United Kingdom)
– Gene Cooperman, Northeastern University
– Hyun-Wook Jin, Konkuk University (South Korea)
– Jithin Jose, Microsoft Azure
– Minsik Kim, KISTI Supercomputing Center (South Korea)
– Pramod Kumbhar, Blue Brain Project, EPFL (Switzerland)
– Naoya Maruyama, Lawrence Livermore National Laboratory
– Heechang Na, Ohio Supercomputer Center
– Vikram Saletore, Intel
– Jeffrey Salmond, University of Cambridge (United Kingdom)
– Gilad Shainer, Mellanox
– Sameer Shende, Paratools and University of Oregon
– Sayantan Sur, Intel
– Shinichiro Takizawa, RWBC-OIL, AIST (Japan)
– Mahidhar Tatineni, San Diego Supercomputer Center (SDSC)
– Karen Tomko, Ohio Supercomputer Center
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (M.S.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborty (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (M. S.)
– S. Xu (M.S.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– M. S. Ghazimeersaeed
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
– Q. Zhou (Ph.D.)
Multiple Positions Available in My Group
• Looking for Bright and Enthusiastic Personnel to join as
– PhD Students
– Post-Doctoral Researchers
– MPI Programmer/Software Engineer
– Hadoop/Big Data Programmer/Software Engineer
– Deep Learning and Cloud Programmer/Software Engineer
• If interested, please send an e-mail to [email protected]
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Follow us on https://twitter.com/mvapich