Comet Virtual Clusters – What’s underneath?
Philip Papadopoulos, San Diego Supercomputer Center
Overview
NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)
Comet: System Characteristics
• Total peak flops: ~2.1 PF
• Dell primary integrator
• Intel Haswell processors w/ AVX2
• Mellanox FDR InfiniBand
• 1,944 standard compute nodes (46,656 cores)
  • Dual CPUs, each 12-core, 2.5 GHz
  • 128 GB DDR4 2133 MHz DRAM
  • 2 x 160 GB SSDs (local disk)
• 36 + 72 GPU nodes
  • Same as standard nodes, plus
  • Two NVIDIA K80 cards, each with dual Kepler GPUs (36 nodes)
  • Two NVIDIA P100 GPUs (72 nodes)
• 4 large-memory nodes
  • 1.5 TB DDR4 1866 MHz DRAM
  • Four Haswell processors/node
  • 64 cores/node
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
  • 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
  • 7.6 PB, 200 GB/s; Lustre
  • Scratch & persistent storage segments
• Durable Storage (Aeon)
  • 6 PB, 100 GB/s; Lustre
  • Automatic backups of critical data
• Home directory storage
• Gateway hosting nodes
• Virtual image repository
• 100 Gbps external connectivity to Internet2 & research and education networks
Comet Network Architecture: InfiniBand compute, Ethernet storage

[Architecture diagram. Recoverable details:
• 27 racks, each with 72 Haswell nodes (320 GB local SSD); 7 x 36-port FDR switches per rack wired as a full fat-tree, 4:1 oversubscription between racks
• 36 GPU nodes and 4 large-memory nodes on the same fabric
• Core InfiniBand: 2 x 108-port FDR switches; mid-tier InfiniBand: 18 switches
• IB-Ethernet bridges: 4 x 18-port each, into two Arista 40GbE switches
• Performance Storage: 7.6 PB, 200 GB/s, 32 storage servers
• Durable Storage: 6 PB, 100 GB/s, 64 storage servers
• Juniper 100 Gbps router and data mover nodes for research and education network access (Internet2)
• Additional support components (not shown for clarity): 10 GbE Ethernet management network, node-local storage, home file systems, VM image repository, login, data mover, management, and gateway hosting nodes]
Fun with IB ↔ Ethernet Bridging
• Comet has four (4) Ethernet ↔ IB bridge switches
  • 18 FDR links, 18 40GbE links (72 total of each)
  • 4 x 16-port + 4 x 2-port LAGs on the Ethernet side
• Issue #1: significant bandwidth limitation, cluster → storage
  • Why? (IB routing)
  1. Each LAG group has a single IB local ID (LID)
  2. IB switches are destination-routed – by default, all sources for the same destination LID take the same route (port)
• Solution: change the LID mask count (LMC) from 0 to 2 → every LID becomes 2^LMC addresses. At each switch level, there are now 2^LMC routes to a destination LID (better route dispersion)
• Drawbacks: IB can have about 48K endpoints. When you increase the LMC for better route balancing, you reduce the size of your network: at LMC=2 → 12K nodes, at LMC=3 → 6K nodes.
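A minimal sketch of what the LMC change looks like in practice, assuming OpenSM is the fabric's subnet manager; the options-file path and whether Comet applied it via the config file or the command line are assumptions, not taken from the slide:

# /etc/opensm/opensm.conf (excerpt) -- default is lmc 0x00, i.e. one LID per port
lmc 0x02
# Or, equivalently, when launching the subnet manager directly:
$ opensm -l 2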
[Diagram: IB nodes routing through an IB switch toward the single LID of a LAG on the bridge]
More IB to Ethernet Issues
PROBLEM: Losing Ethernet paths from nodes to storage
• Mellanox bridges use PROXY ARP
  • When an IPoIB interface on a compute node ARPs for IP address XX.YY, the bridge "answers" with its own MAC address. When it receives a packet destined for IP XX.YY, it forwards it (layer 2) to the appropriate MAC.
• The vendor advertised that it could handle 3K proxy ARP entries per bridge. Our network config worked for 18+ months.
• Then a change in opensm (the subnet manager): whenever a subnet change occurred, an ARP flood ensued (2K nodes each asking for O(64) Ethernet MAC addresses).
• The bridge CPUs were woefully underpowered, taking minutes to respond to all the ARP requests. Lustre wasn't happy.
• ⇒ Redesigned the network from layer 2 to layer 3 (using routers inside our Arista fabric).
[Diagram: an IPoIB node asks "Who has XX.YY?"; the IB/Ethernet bridge (MAC bb) answers "I do, at bb" on behalf of the Ethernet host with IP XX.YY (MAC aa) – proxy ARP – and forwards traffic through the Arista switch/router toward the Lustre storage.]
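A rough sketch of what the layer-3 redesign means from a compute node's point of view (all addresses and prefixes here are hypothetical, not Comet's actual numbering): instead of relying on the bridge to proxy-ARP for every Lustre server, the node ARPs only for a router interface on the Arista fabric and routes the Ethernet storage subnet through it.

# Hypothetical IPoIB address on the compute node
$ ip addr add 10.21.0.42/16 dev ib0
# Reach the Lustre/storage subnet via a gateway on the Arista fabric
$ ip route add 10.22.0.0/16 via 10.21.255.254 dev ib0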
Virtualized Clusters on Comet
Goal: provide a near bare-metal HPC performance and management experience

Target use: projects that could manage their own cluster, and
• can't fit OUR software environment, and
• don't want to buy hardware, or
• have bursty or intermittent need
Nucleus (persistent services) exposes an API for:
• Requesting nodes
• Console & power
• Scheduling
• Storage management
• Coordinating network changes
• VM launch & shutdown

[Diagram: Nucleus managing a persistent virtual front end, idle disk images, active virtual compute nodes, and their disk images, attached and synchronized]

User perspective: the user is a system administrator – we give them their own HPC cluster.
User-Customized HPC: 1:1 physical-to-virtual compute node

[Diagram: on the physical side, a frontend, virtual frontend hosting, and a disk image vault sit on the public network, with physical compute nodes on a private network; on the virtual side, each virtual cluster has its own virtual frontend and private network, with virtual compute nodes mapped 1:1 onto physical compute nodes.]
High Performance Virtual Cluster Characteristics
[Diagram: a virtual frontend and virtual compute nodes, connected by both a private Ethernet and InfiniBand]
• All nodes have: private Ethernet, InfiniBand, local disk storage
• Virtual compute nodes can network boot (PXE) from their virtual frontend
• All disks retain state (user configuration is kept between boots)
• InfiniBand virtualization: ~8% latency overhead, nominal bandwidth overhead
Comet: Providing Virtualized HPC for XSEDE

Bare Metal "Experience"
• Can install a virtual frontend from a bootable ISO image
• Subordinate nodes can PXE boot
• Compute nodes retain disk state (turning off a compute node is equivalent to turning off power on a physical node)
• ⇒ We don't want cluster owners to learn an entirely "new way" of doing things.
  • Side comment: you don't always have to run the way "Google does it" to do good science.
• ⇒ If you have tools to manage physical nodes today, you can use those same tools to manage your virtual cluster.
Benchmark Results
Single Root I/O Virtualization in HPC
• Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters
  • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts
  • Allows DMA to bypass the hypervisor and go directly to VMs
• SR-IOV enables a virtual HPC cluster w/ near-native InfiniBand latency/bandwidth and minimal overhead
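As a hedged illustration of what exposing virtual functions involves on a ConnectX-3 host (the VF count and the use of module options rather than firmware tools are assumptions, not Comet's documented configuration; SR-IOV must also be enabled in the HCA firmware and BIOS):

# /etc/modprobe.d/mlx4_core.conf -- ask the mlx4 driver for 8 virtual functions,
# none probed on the host so they stay free for PCI passthrough to guest VMs
options mlx4_core num_vfs=8 probe_vf=0
# After a driver reload/reboot, each VF shows up as its own PCI function:
$ lspci | grep -i mellanox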
MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones
MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
WRF Weather Modeling
• 96-core (4-node) calculation
• Nearest-neighbor communication
• Test case: 3-hr forecast, 2.5 km resolution of the continental US (CONUS)
• Scalable algorithms
• 2% slower w/ SR-IOV vs. native IB

WRF 3.4.1 – 3 hr forecast
MrBayes: Software for Bayesian inference of phylogeny
• Widely used, including by the CIPRES gateway
• 32-core (2-node) calculation
• Hybrid MPI/OpenMP code
• 8 MPI tasks, 4 OpenMP threads per task
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test case: 218 taxa, 10,000 generations
• 3% slower with SR-IOV vs. native IB
Quantum ESPRESSO
• 48-core (3-node) calculation
• CG matrix inversion – irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• Test case: DEISA AUSURF 112 benchmark
• 8% slower w/ SR-IOV vs. native IB
RAxML: Code for maximum-likelihood-based inference of large phylogenetic trees
• Widely used, including by the CIPRES gateway
• 48-core (2-node) calculation
• Hybrid MPI/Pthreads code
• 12 MPI tasks, 4 threads per task
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified
• 19% slower w/ SR-IOV vs. native IB
NAMD: Molecular Dynamics, ApoA1 Benchmark
• 48-core (2-node) calculation
• Test case: ApoA1 benchmark
• 92,224 atoms, periodic, PME
• Binary used: NAMD 2.11, ibverbs, SMP
• Directly used the prebuilt binary, which uses ibverbs for multi-node runs
• 23% slower w/ SR-IOV vs. native IB
Accessing Virtual Cluster Capabilities – a much smaller API than OpenStack/EC2/GCE
• REST API
• Command-line interface
• Command shell for scripting
• Console access
• (Portal)

The user does NOT see: Rocks, Slurm, etc.
Cloudmesh – Command-Line Interface
Developed by IU collaborators
• The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
• We leverage this easy-to-use CLI, allowing the use of Comet as infrastructure for virtual cluster management.
• Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it is possible to extend it to other systems like Jetstream, Bridges, etc.
• Plans for customizable launchers available through the command line or a browser – these can target specific application user communities.
Reference: https://github.com/cloudmesh/client
Comet Cloudmesh Client (selected commands)
• cm comet cluster ID
  • Show the cluster details
• cm comet power on ID vm-ID-[0-3] --walltime=6h
  • Power nodes 0–3 on for 6 hours
• cm comet image attach image.iso ID vm-ID-0
  • Attach an image
• cm comet boot ID vm-ID-0
  • Boot node 0
• cm comet console vc4
  • Open a console (here for cluster vc4)
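A sketch of how these commands compose for a cluster whose ID is vc4; the node range and walltime are just examples:

$ cm comet cluster vc4                              # show the cluster details
$ cm comet power on vc4 vm-vc4-[0-3] --walltime=6h  # power nodes 0-3 on for 6 hours
$ cm comet console vc4                              # watch the console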
Getting Started
• http://cloudmesh.github.io/client/tutorials/comet_cloudmesh.html
• List of ISO images that a user can use to install a frontend

$ cm comet iso list
1: CentOS-7-x86_64-NetInstall-1511.iso
2: ubuntu-16.04.2-server-amd64.iso
3: ipxe.iso
...<snip>...
19: Fedora-Server-netinst-x86_64-25-1.3.iso
20: ubuntu-14.04.4-server-amd64.iso
• Attach ISO (Ubuntu), boot the frontend, connect to the console

$ cm comet iso attach 2 vctNN
$ cm comet power on vctNN
$ cm comet console vctNN
The cluster owner has console access starting at BIOS boot (for any node in the cluster)
SDSC Policy
• Virtual frontends (VFEs) can be up 7 x 24 x 365
  • Typical config is 8 GB memory, 36 GB disk, 4 cores
  • Multiple VFEs run on a single physical host
• Compute nodes are treated as (parallel) jobs in our batch system
  • Users request nodes to be turned on/off.
  • The Cloudmesh client hides that a request to turn on a node is actually a batch job submission to SLURM.
  • A compute node retains its disk state, the MAC address of its Ethernet, and the GUID of its virtualized IB → powering off a virtual compute node is just like powering off physical hardware.
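Roughly, the idea is that a "power on" request becomes a batch job whose lifetime is the requested walltime. The sbatch parameters and wrapper script below are purely illustrative (hypothetical), since the real submission is generated internally by Comet's management layer:

# What the cluster owner types:
$ cm comet power on vctNN vm-vctNN-0 --walltime=6h
# Roughly what happens underneath (hypothetical partition and script names):
$ sbatch --nodes=1 --time=06:00:00 --partition=virt start-vm.sh vm-vctNN-0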
"Fun" with KVM and SR-IOV
• Issue: virtual compute nodes are allocated 120 of the node's 128 GB of memory. Sometimes it would take a very long time (20 minutes) for a KVM virtual container to start.
• Root cause: KVM wants to allocate a contiguous block of physical memory. When a node has been running for a while, this isn't likely.
  • Hammer: reboot the physical node
  • More subtle (works mostly): release all caches/buffers, as sketched below
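A minimal sketch of the "release caches/buffers" workaround on the physical host, using standard Linux knobs run as root; this usually, but not always, frees enough contiguous memory for KVM:

$ sync                                   # flush dirty pages to disk
$ echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries and inodes
$ echo 1 > /proc/sys/vm/compact_memory   # ask the kernel to defragment free memory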
• When a cluster node is allocated, we assign its virtual IB adapter a fixed GUID.
  • This requires some handstands with virtual function assignment within the physical node.
VM Disk Management
● Each VM gets a 36 GB disk (small SCSI) – this is adjustable
● Disk images are persistent through reboots
● Two central NASes (ZFS-based) store all disk images
● A VM can be allocated on any physical compute node in Comet
● Two solutions:
  o iSCSI (network-mounted disk)
  o Disk replication on nodes
Non-performant approach: VM disk management via iSCSI only

[Diagram: every virtual compute-x boots from an iSCSI target (e.g., iqn.2001-04.com.nas-0-0-vm-compute-x) exported by the central NAS to the compute nodes.]

This is what OpenStack supports. Big issue: bandwidth bottleneck at the NAS.
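For concreteness, a hedged sketch of attaching such a node disk from a compute node with open-iscsi; the portal name nas-0-0 is an assumption, and the target IQN follows the naming shown on the slide:

$ iscsiadm -m discovery -t sendtargets -p nas-0-0        # list targets exported by the NAS
$ iscsiadm -m node -T iqn.2001-04.com.nas-0-0-vm-compute-x -p nas-0-0 --login
# The node disk now appears as a local block device that the VM can boot from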
A hybrid solution via replication
● The initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
● During normal operation, Comet moves a node disk to the physical host that is running the node's VM, and then disconnects from the NAS
  o All node-disk operations are local to the physical host
  o This fundamentally enables scale-out w/o a $1M NAS
● At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot
VM Disk Management Replication
Replication states:
1. Unused / unmapped
2. Init disk: NAS -> VM
   a. Move disk image
   b. Merge temporary modifications
3. Steady state: mapped
4. Release disk: VM -> NAS
5. Unused / unmapped
1.a Init Disk
[Diagram: the NAS exports target iqn.2001-04.com.nas-0-0-vm-compute-x to virtual compute-x while the disk is replicated toward the compute node.]
An iSCSI mount on the NAS enables the virtual compute node to boot immediately.
● Read operations come from the NAS
● Write operations go to local disk
1.b Init Disk
[Diagram: NAS, compute nodes, virtual compute-x, and its targets during disk migration.]
During boot, the disk image on the NAS is migrated to the physical host.
● The read-only and read/write images are then merged into one local disk
● The iSCSI mount is disconnected
2. Steady State
[Diagram: virtual compute-x running from its local node disk, with snapshots flowing back to the NAS.]
During normal operation:
● The node disk is snapshotted
● Incremental snapshots are sent to the NAS (replicated back to the NAS)
● Timing/load/experiment will tell us how often we can do this
3. Release Disk
[Diagram: virtual compute-x powered off; its node disk on the compute node is released back to the NAS.]
At shutdown, any unsynced changes are sent back to the NAS.
● When the last snapshot has been sent, the virtual compute node can be rebooted on another system
Current implementation
https://github.com/rocksclusters/img-storage-roll
Some Technical Details
● NAS and physical nodes use ZFS as the native file system
  o A node disk is defined inside ZFS as a ZVOL (a raw disk volume)
● ZVOLs
  o Can be snapshotted using native ZFS utilities
  o Full and incremental snapshots can be sent over the network using ZFS send/recv + ssh (or another protocol)
  o VMs simply see a raw disk
● The ZVOL is the disk image
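A minimal sketch of the snapshot/replication mechanism, with hypothetical pool and volume names (tank/vm-compute-x on the physical host, nas/vm-compute-x on the NAS); in production this is orchestrated by the img-storage-roll, not run by hand:

$ zfs snapshot tank/vm-compute-x@sync1                                        # snapshot the ZVOL
$ zfs send tank/vm-compute-x@sync1 | ssh nas-0-0 zfs recv nas/vm-compute-x    # full copy the first time
$ zfs snapshot tank/vm-compute-x@sync2
$ zfs send -i @sync1 tank/vm-compute-x@sync2 | ssh nas-0-0 zfs recv nas/vm-compute-x   # incremental afterwards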
Virtual Cluster Projects
• Open Science Grid: University of California, San Diego; Frank Wuerthwein (in production)
• Virtual cluster for the PRAGMA/GLEON lake expedition: University of Florida; Renato Figueiredo
• Deploying the Lifemapper species modeling platform with virtual clusters on Comet: University of Kansas; James Beach
• Adolescent Brain Cognitive Development Study: NIH-funded, 19 institutions
• The Comet goal was O(20) virtual clusters (not 1000s)