Introduction to the NPACI Rocks Clustering Toolkit:
Building Manageable COTS Clusters
Philip M. Papadopoulos, Mason J. Katz, Greg Bruno
Who We Are
Philip Papadopoulos - Parallel message passing expert (PVM and Fast Messages)
Mason Katz - Network protocol expert (x-kernel, Scout and Fast Messages)
Greg Bruno - 10 years of experience with NCR's Teradata systems, builders of clusters which drive very large commercial databases
All three of us have worked together for the past 2 years building NT and Linux clusters
Who is "NPACI Rocks"?
Key people from the UCB Millennium Group: Prof. David Culler, Eric Fraser, Brent Chun, Matt Massie, Albert Goto
People from SDSC: Bruno, Katz, Papadopoulos (Distributed Computing Group); Kenneth Yoshimoto (Scheduling); Keith Thompson, Bill Link (Grid); the Storage Resource Broker (SRB) Group; and you!
Why We Do Clusters – Frankly, we love it
Building high-performance systems which have killer price/performance is a gas
NPACI is about building pervasive infrastructure. Supported, transferable cluster infrastructure was missing from our “portfolio”.
Enabling others to build their own clusters and “do scientific simulation” is a blast.
We wanted a management system that would allow us to rapidly experiment with new low-level system software (and recover when things didn’t go quite right)
“Protect ourselves from ourselves”
What We’ll Cover
Rocks philosophies
Hardware components
Software packages - theory and practice
Lab
What we thought we “Learned”
Clusters are phenomenal price/performance computational engines, but are hard to manage
Cluster management is a full-time job which gets linearly harder as one scales out.
"Heterogeneous" nodes are a bummer (network, memory, disk, MHz, current kernel version).
You Must Unlearn What You Have Learned
Installation/Management
Need to have a strategy for managing cluster nodes
Pitfalls:
Installing each node "by hand" - difficult to keep software on nodes up to date; management effort increases as node count increases
Disk imaging techniques (e.g., VA Disk Imager) - difficult to handle heterogeneous nodes; treats the OS as a single monolithic image
Specialized installation programs (e.g., IBM's LUI, or RWCP's multicast installer) - let Linux packaging vendors do their job
Penultimate: RedHat Kickstart
Define the packages needed for the OS on nodes; Kickstart gives a reasonable measure of control
Need to fully automate to scale out (Rocks)
Scaling out
Evolve to management of "two" systems
The front end(s): login host; users' home areas, passwords, groups; cluster configuration information
The compute nodes: disposable OS image; let software manage node heterogeneity; parallel (re)installation
Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, …)
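As an illustration of the database-driven approach, a minimal sketch of a "report" that regenerates /etc/hosts from the node table (the database, table and column names here are hypothetical; Rocks ships its own makehosts report for this):

  # Rebuild /etc/hosts from the cluster database (hypothetical schema).
  mysql --batch --skip-column-names cluster \
        -e 'SELECT ip, name FROM nodes' |
  while read ip name; do
      echo "$ip	$name"
  done > /etc/hosts.new && mv /etc/hosts.new /etc/hosts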
NPACI Rocks Toolkit – rocks.npaci.edu
Techniques and software for easy installation, management, monitoring and update of clusters
Installation
Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster
Management and update philosophies
Trivial to completely reinstall any (or all) nodes
Nodes are 100% automatically configured
Use of DHCP and NIS for configuration
Use RedHat's Kickstart to define the set of software that defines a node
All software is delivered as RedHat Packages (RPMs)
Encapsulate configuration for a package (e.g., Myrinet)
Manage dependencies
Never try to figure out if node software is consistent - if you ever ask yourself this question, reinstall the node
More Rocksisms
Leverage widely-used (standard) software wherever possible
Everything is in RedHat Packages (RPMs)
RedHat's "kickstart" installation tool
SSH, Telnet, existing open source tools
Write only the software that we need to write
Focus on simplicity
Commodity components - for example: x86 compute servers, Ethernet, Myrinet
Minimal - for example: no additional diagnostic or proprietary networks
Rocks is a collection point of software for people building clusters
It will evolve to include cluster software and packaging from more than just SDSC and UCB
<[your-software.i386.rpm] [your-software.src.rpm] here>
Hardware
Many variations on a basic layout
(Diagram: front-end node(s) on the public Ethernet; a Fast-Ethernet switching complex and a gigabit network switching complex connecting the compute nodes; power distribution, with net-addressable units as an option)
Frontend and Compute Nodes
Choices
Uni- or dual-processor Intel - Linux is, in reality, an Intel OS
Rackmount vs. desktop chassis - rackmount is "essential" for large installations
SCSI vs. IDE - performance is a non-issue; price and serviceability are the real considerations (note: rackmount servers usually are SCSI)
User integration versus system integrator
Our nodes
Dual PIIIs (733, 800 and 933 MHz [Compaq, IBM]); 1.0+ GHz as we expand
½ GB per node (1 GB would be better)
Hot-swap SCSI on these nodes
We integrate our hardware
Networks
High-performance networks
Myrinet, Giganet, Servernet, Gigabit Ethernet, etc.
Ethernet-only clusters are Beowulf-class
Management networks (the light side)
Ethernet - 100 Mbit; the management network is used to manage compute nodes and launch jobs
Nodes are in private IP (192.168.x.x) space; the front-end does NAT (see the sketch after this list)
Ethernet - 802.11b; easy access to the cluster via laptops (plus, wireless will change your life)
Evil management networks (the dark side)
A serial "console" network is not necessary
A KVM (keyboard/video/mouse) switching system adds too much complexity, cables, and cost
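A minimal sketch of the NAT setup on the front-end, assuming a 2.4 kernel with iptables and eth1 as the public interface (Rocks configures this automatically; the exact mechanism may differ):

  # Masquerade traffic from the private compute-node network out the public interface.
  echo 1 > /proc/sys/net/ipv4/ip_forward
  iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE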
Power Distribution
Highly desirable to have network-addressable power distribution units
Can remotely power cycle compute nodes
Instrumented, which helps determine power needs
(Diagram: power distribution unit with an Ethernet port and power sockets)
Other Helpful Hardware When All Else Fails
When a node appears to be sick:
Issue a "reinstall" command over the network
If still dead, instruct the network-addressable power distribution unit to power cycle the node (this reinstalls the OS)
If still dead, roll up the "crash cart" (monitor and keyboard)
Leatherman: A Must-Have For Any Self-Respecting Clusters Person
Current Configuration of the Meteor Cluster
Rocks v2.02
Frontends
100 nodes
50 GB RAM
Ethernet (for management)
Myrinet
Servernet (working through some bugs)
Software
RedHat Supplied Software
7.0 base + updates
RPM - RedHat Package Manager
Kickstart - method for unattended server installation
Community Software
Myricom's General Messaging (GM)
MPICH (GM device, Ethernet device)
Portable Batch System
Maui
PVM
Intel's Math Kernel Library - math functions tuned for Intel processors
NPACI Rocks Software
Cluster-dist
A tool used to assemble the latest RedHat, community and Rocks packages into a distribution which is used by compute nodes during reinstallation
Shoot-node and eKV (Ethernet Keyboard and Video)
Initiate a compute node reinstallation
Monitor compute node reinstallations over Ethernet with telnet
Cluster-admin and cluster-ssl
Tools to create user accounts and user SSL certificates
Rexec (UC Berkeley)
Launch and control parallel jobs (SSL-based authentication)
Ganglia (UC Berkeley)
Cluster monitoring
Software Details
Cluster-dist
Integrates RedHat packages from:
RedHat (mirror) - base distribution + updates
Contrib directory
Locally produced packages
Packages from rocks.npaci.edu
Produces a single updated distribution that resides on the front-end
Is a RedHat distribution with patches and updates applied
Different kickstart files and different distributions can co-exist on a front-end to add flexibility in configuring nodes.
Remote re-installation: Shoot-node and eKV
Rocks provides a simple method to remotely reinstall a node (once it has been installed the first time)
By default, hard power cycling will cause a node to reinstall itself.
With no serial (or KVM) console, we are able to watch a node as it installs
(Screenshots: remotely starting reinstallation on two nodes, 192.168.254.254 and 192.168.254.253, watched via eKV over telnet)
Starting Jobs
SSH-based MPI-Launch
Provides full integration with the Myrinet port reservation capability of Usher/Patron
SSL-based Rexec
Better control of jobs on remote nodes; sane signal propagation
Batch system: PBS + Maui
PBS provides queue definition and node monitoring
Maui has rich scheduling policies: standing and future reservations; query the number of nodes "available" now
PBS – Portable Batch System
Three standard components to PBS:
MOM - node health reporting daemon and job launch daemon on every node
Server - on the front-end: queue definition, aggregation of node information
Scheduler - policies for what job to run out of which queue at what time
We added a fourth:
Configuration - get cluster node configuration from our SQL database.
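For reference, the PBS server's node list (server_priv/nodes under the PBS home directory) is a plain text file with one line per compute node; a minimal sketch of the kind of output pbs-config-sql generates from the database (hostnames and CPU counts are illustrative):

  # server_priv/nodes - one line per node, np = number of processors
  compute-0-0 np=2
  compute-0-1 np=2
  compute-1-0 np=2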
PBS RPM Packaging
Repackaged PBS (sane packaging + enhancements)
Added "chkconfig-compatible" start-up scripts
4 packages:
pbs (server and scheduler) (should be divided again)
pbs-mom
pbs-config-sql (Python script to generate database report)
pbs-common (files needed by all three packages)
A Rocks 2.0 base installation (automatically) defines a default queue with all nodes available in the queue
http://pbs.mrj.com is a good starting point for PBS
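To verify the default setup, the standard PBS client commands can be used (a quick check, assuming the PBS binaries live under /usr/apps/pbs as on the other slides):

  /usr/apps/pbs/bin/qstat -q       # list queues and their limits
  /usr/apps/pbs/bin/pbsnodes -a    # show the state of every compute node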
PBS Server defaults (and changing them)
Startup script: "/etc/rc.d/init.d/pbs-server start"
/usr/apps/pbs/pbs.default is "sourced" every time PBS is started:

# $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38 bruno Exp $
#
# A basic pbs setup that creates a queue called default and starts scheduling
#
# Create queues and set their attributes.
#
# Create and define queue default
#
# 1 node default, 1hr walltime
create queue default
set queue default queue_type = Execution
set queue default resources_default.nodes = 1
set queue default resources_default.walltime = 1:00:00
set queue default enabled = True
set queue default started = True

PBS.defaults (cont'd)

#
# Set server attributes.
#
# Assume maui scheduler will be installed
set server managers = maui@frontend-0
set server operators = maui@frontend-0
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server scheduling = false
PBS will ignore queue creation if a queue already exists.
Modifying the default setup (simple queue creation)
Use qmgr to create a new queue:

# /usr/apps/pbs/bin/qmgr
Max open servers: 4
Qmgr: create queue single
Qmgr: set queue single queue_type = execution
Qmgr: set queue single enabled = true
Qmgr: set queue single acl_hosts = compute-1-0
Qmgr: set queue single started = true

Use qmgr to save the configuration:
/usr/apps/pbs/bin/qmgr -c "print server" > /usr/apps/pbs/pbs.default
Maui Scheduler
We use Maui as our scheduler for PBS
mauischeduler.sourceforge.net
http://havi.supercluster.org/documentation/maui
Add the "single" queue definition so that Maui understands it. This goes in /usr/spool/maui/maui.cfg:

SRNAME[0] single
SRHOSTLIST[0] compute-1-0

Restart Maui:
% /etc/rc.d/init.d/maui restart
Submit a job to PBS:
% /usr/apps/pbs/bin/qsub -q single mytest.sh
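The slides do not show mytest.sh itself; a minimal sketch of such a job script (the resource requests are illustrative):

  #!/bin/sh
  #PBS -l nodes=1                 # one node
  #PBS -l walltime=0:10:00        # ten minutes of wall clock time
  echo "Running on `hostname`"
  sleep 60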
Monitoring your cluster
PBS has a GUI called xpbsmon - gives a nice graphical view of the up/down state of nodes
SNMP status - use the extensive SNMP MIB defined by the Linux community to find out many things about a node: installed software, uptime, load (see the sketch at the end of this slide)
Ganglia (UCB) - IP multicast-based monitoring system
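A quick illustration of the SNMP approach, assuming the node runs an SNMP agent with a "public" read community (command syntax varies between ucd-snmp/net-snmp versions):

  snmpwalk -v1 -c public compute-0-0 system    # sysDescr, sysUpTime, etc.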
Ganglia - http://www.millennium.berkeley.edu/ganglia/
Dendrite on each node - multicasts the state of the machine on significant changes
Load averages, disk consumption, memory, etc.
Beacons every minute if there are no significant deltas
Axons - collection daemons (at least one per cluster)
Ganglia client - sorts the measured variables to find the set of hosts that match a desired criterion
E.g., X MB free memory, load below Y
Can act as a "vexec" resource for Rexec.
Ganglia - text output

[phil@slic01 ~]$ /usr/sbin/ganglia load_one
compute-1-5 0.07
compute-0-9 0.08
compute-1-3 0.14
compute-2-0 0.15
compute-2-8 0.18
compute-2-5 0.27
frontend-0 0.36
compute-3-11 0.82
compute-23 1.06
compute-22 1.19
compute-3-4 1.96
compute-3-9 1.99
compute-3-10 1.99
compute-3-2 2.00
compute-3-3 2.09
compute-3-7 2.12
compute-3-5 2.99
compute-3-6 3.0
“Hidden” Software
Some Tools that assist in automation.
Users generally will not see these tools
Profile scripts run at a user's first login
Usher-patron (Myrinet port reservation)
Insert-ethers (add nodes to a cluster)
Cluster-sql package - reports to build service-specific config files
Cluster-admin - node reinstallation; creating accounts (NIS, auto.home map creation)
Cluster-ssl - generate keys for SSL authentication (rexec)
Usher/Patron
Tool to simplify using installed Myricom hardware
Eliminates a central "database" to decide which Myrinet ports are currently in use
(Myricom driver installed with a separate source RPM)
Usher daemon runs on each compute node - takes reservation requests for access to the limited set of Myrinet ports (RPC-based)
Reservations time out if not claimed.
Patron - works with usher to request and claim ports
Integrated with MPI-Launch - automatically creates the node file needed for MPICH-GM
First Login Profile Scripts
On first login, all users, including root, are prompted to build an SSH public/private key pair
Makes sense because SSH is the only way to gain login access to the nodes
NIS is updated (passwd, auto.home, etc.)
Additionally, the first time root logs in, an SSL certificate authority is generated which is used to sign users' SSL certificates
The SSL certificate and root's public SSH key are then propagated to the compute node kickstart file
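Roughly what the key-generation step amounts to (a sketch; the actual profile script prompts interactively and its options may differ):

  # Generate an SSH key pair; an empty passphrase allows password-less
  # logins to the compute nodes.
  ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -N ""
  cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys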
insert-ethers
Used to populate the "nodes" MySQL table
Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages (see the sketch below)
Extracts the MAC address and, if it is not in the table, adds the MAC address and hostname to the table
For every new entry:
Rebuilds /etc/hosts and /etc/dhcpd.conf
Reconfigures NIS
Restarts DHCP and PBS
Hostname is <basename>-<cabinet>-<chassis>
Configurable to change the hostname, e.g., when adding new cabinets
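For illustration, the discovery step boils down to watching syslog for DHCPDISCOVER lines from unknown MAC addresses; a minimal sketch (not the actual insert-ethers code):

  # Pull MAC addresses out of dhcpd's DHCPDISCOVER syslog lines as they arrive.
  tail -f /var/log/messages | \
  awk '/DHCPDISCOVER from/ { for (i = 1; i < NF; i++) if ($i == "from") print $(i+1) }' | \
  while read mac; do
      echo "new MAC seen: $mac"    # insert-ethers would add it to the nodes table here
  done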
dhcp_options – One More Important MySQL Table
Created by the frontend kickstart file (based on user input from the Rocks configuration web page)
Used by makedhcp to construct the header of /etc/dhcpd.conf
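For context, the header that makedhcp writes is a standard ISC dhcpd global/subnet section; a sketch with illustrative values (the real values come from the dhcp_options table):

  subnet 192.168.254.0 netmask 255.255.255.0 {
      option domain-name "local";
      option routers 192.168.254.1;      # gateway on the private network (illustrative)
      default-lease-time 1200;
  }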
Configuration Derived from Database
(Diagram: the MySQL DB drives makehosts -> /etc/hosts, makedhcp -> /etc/dhcpd.conf, and pbs-config-sql -> the PBS node list; insert-ethers performs automated node discovery of Node 0 … Node N and populates the database)
Futures
Attack the storage problem
Keep the global view of storage that NFS gives us, but address the scalability problem
Source high bandwidth from the cluster into the WAN
Apply our cluster bring-up automation to easily attach clusters to the grid
Continue to improve cluster monitoring
Configure a monitoring GUI (e.g., NetSaint) to extract data from Ganglia
Get node health (fan speed, temperature, disk error rate) into Ganglia
Technologies
Processors: IA-64 and Alpha
Networks: Infiniband
2.4 kernel (will rev our distribution at RedHat 7.1)
Lab
Front-end Node
Node seen by the external world
Performs Network Address Translation (NAT)
NFS server(s) for user home areas - beware of scalability issues!
Compilers, libraries
Configuration for nodes: DHCP server, NIS domain controller, NTP server, web server, MySQL server
Installation server for defining the system on nodes
Method(s) to start jobs on compute nodes: batch system (PBS + Maui), interactive launching of jobs
Installing a Front-end Machine
Build ks.cfg from https://rocks.npaci.edu/site.htm
Define your root password, NIS domain, public IP address
Boot CD - full ISO image is available for download; burn your own!
Enter "frontend" at the boot prompt
Sit back. Time varies depending on the speed of the CPU and CDROM of the frontend
The entire distribution is being copied to /home/install/cluster-dist
Building a Distribution with cluster-dist
Directory structure
Build the mirror from the mirror host (emulates mirroring from rocks.npaci.edu)
Build the distro: cluster-dist dist
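A hedged sketch of the rebuild step, using the path and command the earlier slides mention (the mirror step and any additional cluster-dist arguments are site-specific):

  cd /home/install          # where the frontend install left the distribution
  cluster-dist dist         # reassemble RedHat base + updates + contrib + local RPMs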
Installing Compute Nodes
Log in as root to the frontend
Execute: tail -f /var/log/messages | insert-ethers
Back on the compute node: boot the CD
From a laptop: examine the MySQL database through a browser
Reinstalling Compute Nodes
With shoot-node - frontend-driven reinstallation
With power on/off - a "Hail Mary" to recover from a bad software state
Structure of a Rocks Kickstart File
Site.h
Look at ks.cfg.in: preamble, packages, %pre, %post
"%post" and "%post --nochroot"
#include
For more info: http://redhat.com/support/manuals/RHL-7-Manual/ref-guide/ch-kickstart2.html
Updating and Augmenting Your Distribution with cluster-dist
Add a new package to a compute node in ks.cfg.in; kickstart the compute node
Add a new package to the distro, then have the compute node pick it up:
Put a package in "contrib"
Build distro -> kickstart
Add a new local package (usher):
Bump the version number
Build the RPM
Show where the RPM gets put
Build distro
Show the new home for the usher RPM
Build distro -> kickstart
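A hedged sketch of the "build the RPM" step for a local source package, assuming the RedHat 7.0-era rpm tool (newer systems use rpmbuild; paths and package names are illustrative):

  rpm -ba usher.spec                    # build binary + source RPMs from the spec file
  ls /usr/src/redhat/RPMS/i386/         # the freshly built usher binary RPM lands here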
Adding Users with create-account
As root:
create-account yoda
passwd yoda
ssl-genuser yoda
ypcat shows how the data made it into NIS:
ypcat passwd
ypcat auto.home
Look at cluster nodes with Netscape
Database info
Launching and Controlling Jobs
REXEC Rexec … sleeper
PBS
Add a new queue to Maui
Cluster Monitoring with ganglia
Look at multiple different values from ganglia
ganglia - lists commands
ganglia load_one
ganglia cpu_system cpu_idle cpu_nice
Go to millennium.berkeley.edu to see live demo
Auto configuration of node
Pop in Myrinet Card
After reinstallation, a startup script uses "lspci" to determine if a Myrinet card is on the PCI bus
If yes:
Automatically compile the Myrinet device driver from the source RPM
Install the Myrinet module
The module is then guaranteed compatible with the running kernel
Eliminates managing binary device drivers for different kernel configurations
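A minimal sketch of the detection logic (the GM package name, rebuild command and module name are illustrative, not the actual Rocks script):

  # Detect a Myrinet card and rebuild the GM driver against the running kernel.
  if /sbin/lspci | grep -qi myri; then
      rpm --rebuild /usr/src/redhat/SRPMS/gm-*.src.rpm       # hypothetical source RPM name
      rpm -ivh /usr/src/redhat/RPMS/i386/gm-*.i386.rpm       # install the freshly built driver
      /sbin/modprobe gm                                      # load the module (illustrative name)
  fi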