Comet Virtual Clusters – What’s underneath?
Philip Papadopoulos, San Diego Supercomputer Center
Overview
NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)
Comet: System Characteristics
• Total peak flops: ~2.1 PF
• Dell primary integrator
• Intel Haswell processors w/ AVX2
• Mellanox FDR InfiniBand
• 1,944 standard compute nodes (46,656 cores)
  • Dual CPUs, each 12-core, 2.5 GHz
  • 128 GB DDR4 2133 MHz DRAM
  • 2 x 160 GB SSDs (local disk)
• 36 + 72 GPU nodes
  • Same as standard nodes, plus
  • Two NVIDIA K80 cards, each with dual Kepler GPUs (36 nodes)
  • Two NVIDIA P100 GPUs (72 nodes)
• 4 large-memory nodes
  • 1.5 TB DDR4 1866 MHz DRAM
  • Four Haswell processors/node
  • 64 cores/node
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
  • 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
  • 7.6 PB, 200 GB/s; Lustre
  • Scratch & persistent storage segments
• Durable Storage (Aeon)
  • 6 PB, 100 GB/s; Lustre
  • Automatic backups of critical data
• Home directory storage
• Gateway hosting nodes
• Virtual image repository
• 100 Gbps external connectivity to Internet2 & research and education networks
Comet Network Architecture: InfiniBand compute, Ethernet storage

[Architecture diagram. Recoverable details:
• 27 racks, each with 72 Haswell nodes (320 GB local SSD); 7 x 36-port FDR switches per rack wired as a full fat-tree, 4:1 oversubscription between racks
• 36 GPU nodes and 4 large-memory nodes on the same fabric
• Core InfiniBand: 2 x 108-port FDR switches; mid-tier InfiniBand: 18 switches
• IB-Ethernet bridges: 4 x 18-port each, into two Arista 40GbE switches
• Performance Storage: 7.6 PB, 200 GB/s, 32 storage servers
• Durable Storage: 6 PB, 100 GB/s, 64 storage servers
• Juniper 100 Gbps router and data mover nodes for research and education network access (Internet2)
• Additional support components (not shown for clarity): 10 GbE Ethernet management network, node-local storage, home file systems, VM image repository, login, data mover, management, and gateway hosting nodes]
Fun with IB ↔ Ethernet Bridging
• Comet has four (4) Ethernet ↔ IB bridge switches
  • 18 FDR links, 18 40GbE links (72 total of each)
  • 4 x 16-port + 4 x 2-port LAGs on the Ethernet side
• Issue #1: significant bandwidth limitation, cluster → storage
  • Why? (IB routing)
  1. Each LAG group has a single IB local ID (LID)
  2. IB switches are destination-routed – by default, all sources for the same destination LID take the same route (port)
• Solution: change the LID mask count (LMC) from 0 to 2 → every LID becomes 2^LMC addresses. At each switch level, there are now 2^LMC routes to a destination LID (better route dispersion)
• Drawbacks: IB can have about 48K endpoints. When you increase the LMC for better route balancing, you reduce the size of your network: at LMC=2 → 12K nodes, at LMC=3 → 6K nodes.
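A minimal sketch of what the LMC change looks like in practice, assuming OpenSM is the fabric's subnet manager; the options-file path and whether Comet applied it via the config file or the command line are assumptions, not taken from the slide:

# /etc/opensm/opensm.conf (excerpt) -- default is lmc 0x00, i.e. one LID per port
lmc 0x02
# Or, equivalently, when launching the subnet manager directly:
$ opensm -l 2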
[Diagram: IB nodes routing through an IB switch toward the single LID of a LAG on the bridge]
More IB to Ethernet Issues
PROBLEM: Losing Ethernet paths from nodes to storage
• Mellanox bridges use PROXY ARP
  • When an IPoIB interface on a compute node ARPs for IP address XX.YY, the bridge "answers" with its own MAC address. When it receives a packet destined for IP XX.YY, it forwards it (layer 2) to the appropriate MAC.
• The vendor advertised that it could handle 3K proxy ARP entries per bridge. Our network config worked for 18+ months.
• Then a change in opensm (the subnet manager): whenever a subnet change occurred, an ARP flood ensued (2K nodes each asking for O(64) Ethernet MAC addresses).
• The bridge CPUs were woefully underpowered, taking minutes to respond to all the ARP requests. Lustre wasn't happy.
• ⇒ Redesigned the network from layer 2 to layer 3 (using routers inside our Arista fabric).
[Diagram: an IPoIB node asks "Who has XX.YY?"; the IB/Ethernet bridge (MAC bb) answers "I do, at bb" on behalf of the Ethernet host with IP XX.YY (MAC aa) – proxy ARP – and forwards traffic through the Arista switch/router toward the Lustre storage.]
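A rough sketch of what the layer-3 redesign means from a compute node's point of view (all addresses and prefixes here are hypothetical, not Comet's actual numbering): instead of relying on the bridge to proxy-ARP for every Lustre server, the node ARPs only for a router interface on the Arista fabric and routes the Ethernet storage subnet through it.

# Hypothetical IPoIB address on the compute node
$ ip addr add 10.21.0.42/16 dev ib0
# Reach the Lustre/storage subnet via a gateway on the Arista fabric
$ ip route add 10.22.0.0/16 via 10.21.255.254 dev ib0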
Virtualized Clusters on Comet
Goal: provide a near bare-metal HPC performance and management experience

Target use: projects that could manage their own cluster, and
• can't fit OUR software environment, and
• don't want to buy hardware, or
• have bursty or intermittent need
Nucleus (persistent services) exposes an API for:
• Requesting nodes
• Console & power
• Scheduling
• Storage management
• Coordinating network changes
• VM launch & shutdown

[Diagram: Nucleus managing a persistent virtual front end, idle disk images, active virtual compute nodes, and their disk images, attached and synchronized]

User perspective: the user is a system administrator – we give them their own HPC cluster.
User-Customized HPC: 1:1 physical-to-virtual compute node

[Diagram: on the physical side, a frontend, virtual frontend hosting, and a disk image vault sit on the public network, with physical compute nodes on a private network; on the virtual side, each virtual cluster has its own virtual frontend and private network, with virtual compute nodes mapped 1:1 onto physical compute nodes.]
High Performance Virtual Cluster Characteristics
[Diagram: a virtual frontend and virtual compute nodes, connected by both a private Ethernet and InfiniBand]
• All nodes have: private Ethernet, InfiniBand, local disk storage
• Virtual compute nodes can network boot (PXE) from their virtual frontend
• All disks retain state (user configuration is kept between boots)
• InfiniBand virtualization: ~8% latency overhead, nominal bandwidth overhead
Comet: Providing Virtualized HPC for XSEDE

Bare Metal "Experience"
• Can install a virtual frontend from a bootable ISO image
• Subordinate nodes can PXE boot
• Compute nodes retain disk state (turning off a compute node is equivalent to turning off power on a physical node)
• ⇒ We don't want cluster owners to learn an entirely "new way" of doing things.
  • Side comment: you don't always have to run the way "Google does it" to do good science.
• ⇒ If you have tools to manage physical nodes today, you can use those same tools to manage your virtual cluster.
Benchmark Results
Single Root I/O Virtualization in HPC
• Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters
  • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts
  • Allows DMA to bypass the hypervisor and go directly to VMs
• SR-IOV enables a virtual HPC cluster w/ near-native InfiniBand latency/bandwidth and minimal overhead
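As a hedged illustration of what exposing virtual functions involves on a ConnectX-3 host (the VF count and the use of module options rather than firmware tools are assumptions, not Comet's documented configuration; SR-IOV must also be enabled in the HCA firmware and BIOS):

# /etc/modprobe.d/mlx4_core.conf -- ask the mlx4 driver for 8 virtual functions,
# none probed on the host so they stay free for PCI passthrough to guest VMs
options mlx4_core num_vfs=8 probe_vf=0
# After a driver reload/reboot, each VF shows up as its own PCI function:
$ lspci | grep -i mellanox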
MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones
MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
WRF Weather Modeling
• 96-core (4-node) calculation
• Nearest-neighbor communication
• Test case: 3-hr forecast, 2.5 km resolution of the continental US (CONUS)
• Scalable algorithms
• 2% slower w/ SR-IOV vs. native IB

WRF 3.4.1 – 3 hr forecast
MrBayes: Software for Bayesian inference of phylogeny
• Widely used, including by the CIPRES gateway
• 32-core (2-node) calculation
• Hybrid MPI/OpenMP code
• 8 MPI tasks, 4 OpenMP threads per task
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test case: 218 taxa, 10,000 generations
• 3% slower with SR-IOV vs. native IB
Quantum ESPRESSO
• 48-core (3-node) calculation
• CG matrix inversion – irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• Test case: DEISA AUSURF 112 benchmark
• 8% slower w/ SR-IOV vs. native IB
RAxML: Code for maximum-likelihood-based inference of large phylogenetic trees
• Widely used, including by the CIPRES gateway
• 48-core (2-node) calculation
• Hybrid MPI/Pthreads code
• 12 MPI tasks, 4 threads per task
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified
• 19% slower w/ SR-IOV vs. native IB
NAMD: Molecular Dynamics, ApoA1 Benchmark
• 48-core (2-node) calculation
• Test case: ApoA1 benchmark
• 92,224 atoms, periodic, PME
• Binary used: NAMD 2.11, ibverbs, SMP
• Directly used the prebuilt binary, which uses ibverbs for multi-node runs
• 23% slower w/ SR-IOV vs. native IB
Accessing Virtual Cluster Capabilities – a much smaller API than OpenStack/EC2/GCE
• REST API
• Command-line interface
• Command shell for scripting
• Console access
• (Portal)

The user does NOT see: Rocks, Slurm, etc.
Cloudmesh – Command-Line Interface
Developed by IU collaborators
• The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
• We leverage this easy-to-use CLI, allowing the use of Comet as infrastructure for virtual cluster management.
• Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it is possible to extend it to other systems like Jetstream, Bridges, etc.
• Plans for customizable launchers available through the command line or a browser – these can target specific application user communities.
Reference: https://github.com/cloudmesh/client
Comet Cloudmesh Client (selected commands)
• cm comet cluster ID
  • Show the cluster details
• cm comet power on ID vm-ID-[0-3] --walltime=6h
  • Power nodes 0–3 on for 6 hours
• cm comet image attach image.iso ID vm-ID-0
  • Attach an image
• cm comet boot ID vm-ID-0
  • Boot node 0
• cm comet console vc4
  • Open a console (here for cluster vc4)
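A sketch of how these commands compose for a cluster whose ID is vc4; the node range and walltime are just examples:

$ cm comet cluster vc4                              # show the cluster details
$ cm comet power on vc4 vm-vc4-[0-3] --walltime=6h  # power nodes 0-3 on for 6 hours
$ cm comet console vc4                              # watch the console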
Getting Started
• http://cloudmesh.github.io/client/tutorials/comet_cloudmesh.html
• List of ISO images that a user can use to install a frontend

$ cm comet iso list
1: CentOS-7-x86_64-NetInstall-1511.iso
2: ubuntu-16.04.2-server-amd64.iso
3: ipxe.iso
...<snip>...
19: Fedora-Server-netinst-x86_64-25-1.3.iso
20: ubuntu-14.04.4-server-amd64.iso
• Attach ISO (Ubuntu), boot the frontend, connect to the console

$ cm comet iso attach 2 vctNN
$ cm comet power on vctNN
$ cm comet console vctNN
The cluster owner has console access starting at BIOS boot (for any node in the cluster)
SDSC Policy
• Virtual frontends (VFEs) can be up 7 x 24 x 365
  • Typical config is 8 GB memory, 36 GB disk, 4 cores
  • Multiple VFEs run on a single physical host
• Compute nodes are treated as (parallel) jobs in our batch system
  • Users request nodes to be turned on/off.
  • The Cloudmesh client hides that a request to turn on a node is actually a batch job submission to SLURM.
  • A compute node retains its disk state, the MAC address of its Ethernet, and the GUID of its virtualized IB → powering off a virtual compute node is just like powering off physical hardware.
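Roughly, the idea is that a "power on" request becomes a batch job whose lifetime is the requested walltime. The sbatch parameters and wrapper script below are purely illustrative (hypothetical), since the real submission is generated internally by Comet's management layer:

# What the cluster owner types:
$ cm comet power on vctNN vm-vctNN-0 --walltime=6h
# Roughly what happens underneath (hypothetical partition and script names):
$ sbatch --nodes=1 --time=06:00:00 --partition=virt start-vm.sh vm-vctNN-0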
"Fun" with KVM and SR-IOV
• Issue: virtual compute nodes are allocated 120 of the node's 128 GB of memory. Sometimes it would take a very long time (20 minutes) for a KVM virtual container to start.
• Root cause: KVM wants to allocate a contiguous block of physical memory. When a node has been running for a while, this isn't likely.
  • Hammer: reboot the physical node
  • More subtle (works mostly): release all caches/buffers, as sketched below
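A minimal sketch of the "release caches/buffers" workaround on the physical host, using standard Linux knobs run as root; this usually, but not always, frees enough contiguous memory for KVM:

$ sync                                   # flush dirty pages to disk
$ echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries and inodes
$ echo 1 > /proc/sys/vm/compact_memory   # ask the kernel to defragment free memory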
• When a cluster node is allocated, we assign its virtual IB adapter a fixed GUID.
  • This requires some handstands with virtual function assignment within the physical node.
VM Disk Management
● Each VM gets a 36 GB disk (small SCSI) – this is adjustable
● Disk images are persistent through reboots
● Two central NASes (ZFS-based) store all disk images
● A VM can be allocated on any physical compute node in Comet
● Two solutions:
  o iSCSI (network-mounted disk)
  o Disk replication on nodes
Non-performant approach: VM disk management via iSCSI only

[Diagram: every virtual compute-x boots from an iSCSI target (e.g., iqn.2001-04.com.nas-0-0-vm-compute-x) exported by the central NAS to the compute nodes.]

This is what OpenStack supports. Big issue: bandwidth bottleneck at the NAS.
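For concreteness, a hedged sketch of attaching such a node disk from a compute node with open-iscsi; the portal name nas-0-0 is an assumption, and the target IQN follows the naming shown on the slide:

$ iscsiadm -m discovery -t sendtargets -p nas-0-0        # list targets exported by the NAS
$ iscsiadm -m node -T iqn.2001-04.com.nas-0-0-vm-compute-x -p nas-0-0 --login
# The node disk now appears as a local block device that the VM can boot from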
A hybrid solution via replication
● The initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
● During normal operation, Comet moves a node disk to the physical host that is running the node's VM, and then disconnects from the NAS
  o All node-disk operations are local to the physical host
  o This fundamentally enables scale-out w/o a $1M NAS
● At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot
VM Disk Management Replication
Replication states:
1. Unused / unmapped
2. Init disk: NAS -> VM
   a. Move disk image
   b. Merge temporary modifications
3. Steady state: mapped
4. Release disk: VM -> NAS
5. Unused / unmapped
1.a Init Disk
[Diagram: the NAS exports target iqn.2001-04.com.nas-0-0-vm-compute-x to virtual compute-x while the disk is replicated toward the compute node.]
An iSCSI mount on the NAS enables the virtual compute node to boot immediately.
● Read operations come from the NAS
● Write operations go to local disk
1.b Init Disk
[Diagram: NAS, compute nodes, virtual compute-x, and its targets during disk migration.]
During boot, the disk image on the NAS is migrated to the physical host.
● The read-only and read/write images are then merged into one local disk
● The iSCSI mount is disconnected
2. Steady State
[Diagram: virtual compute-x running from its local node disk, with snapshots flowing back to the NAS.]
During normal operation:
● The node disk is snapshotted
● Incremental snapshots are sent to the NAS (replicated back to the NAS)
● Timing/load/experiment will tell us how often we can do this
3. Release Disk
[Diagram: virtual compute-x powered off; its node disk on the compute node is released back to the NAS.]
At shutdown, any unsynced changes are sent back to the NAS.
● When the last snapshot has been sent, the virtual compute node can be rebooted on another system
Current implementation
https://github.com/rocksclusters/img-storage-roll
Some Technical Details
● NAS and physical nodes use ZFS as the native file system
  o A node disk is defined inside ZFS as a ZVOL (a raw disk volume)
● ZVOLs
  o Can be snapshotted using native ZFS utilities
  o Full and incremental snapshots can be sent over the network using ZFS send/recv + ssh (or another protocol)
  o VMs simply see a raw disk
● The ZVOL is the disk image
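A minimal sketch of the snapshot/replication mechanism, with hypothetical pool and volume names (tank/vm-compute-x on the physical host, nas/vm-compute-x on the NAS); in production this is orchestrated by the img-storage-roll, not run by hand:

$ zfs snapshot tank/vm-compute-x@sync1                                        # snapshot the ZVOL
$ zfs send tank/vm-compute-x@sync1 | ssh nas-0-0 zfs recv nas/vm-compute-x    # full copy the first time
$ zfs snapshot tank/vm-compute-x@sync2
$ zfs send -i @sync1 tank/vm-compute-x@sync2 | ssh nas-0-0 zfs recv nas/vm-compute-x   # incremental afterwards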
Virtual Cluster Projects
• Open Science Grid: University of California, San Diego; Frank Wuerthwein (in production)
• Virtual cluster for the PRAGMA/GLEON lake expedition: University of Florida; Renato Figueiredo
• Deploying the Lifemapper species modeling platform with virtual clusters on Comet: University of Kansas; James Beach
• Adolescent Brain Cognitive Development Study: NIH-funded, 19 institutions
• The Comet goal was O(20) virtual clusters (not 1000s)