Strategies in Cluster Design
Gerolf Ziegenhain, TU Kaiserslautern, Germany
Outline of This Talk
● Look at the technologies once again
● Provide more detail for making decisions
● What to consider?
● What should be avoided at all costs?
● Provide keywords / directions for further reading
● A less structured talk
● Based on personal experience
Making Decisions
● Strategic decisions:
– Made once; changes are difficult and expensive
● The setup itself is relatively easy
● Therefore, know some numbers (per person, group, university):
– #jobs
– Runtime of jobs
– CPUs per job
– Memory per job
– Coupling of the system: latency / bandwidth
– HDD storage (also consider final storage)
(a rough sizing sketch follows below)
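A back-of-the-envelope sketch of how such numbers translate into a first cluster size; every figure below is a hypothetical placeholder, not a recommendation:

    # Rough cluster sizing from per-group job statistics (hypothetical numbers).
    jobs_per_month    = 200     # #jobs
    runtime_hours     = 24      # average runtime per job
    cpus_per_job      = 16      # cores per job
    mem_per_job_gb    = 32      # memory per job
    output_per_job_gb = 50      # HDD storage written per job

    wallclock_hours = 30 * 24   # hours in a month
    utilization     = 0.7       # realistic average cluster utilization

    core_hours_needed    = jobs_per_month * runtime_hours * cpus_per_job
    cores_needed         = core_hours_needed / (wallclock_hours * utilization)
    storage_per_month_tb = jobs_per_month * output_per_job_gb / 1024

    print(f"~{cores_needed:.0f} cores, "
          f"{mem_per_job_gb} GB RAM per {cpus_per_job}-core job, "
          f"~{storage_per_month_tb:.1f} TB of new data per month")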
Buy or Build?
● Buying
– Less work
– Higher cost
– You will get more than you actually want
– The vendor may help with consulting
● Building yourself
– More work
– High learning effect
– Lower cost
– You will get exactly what you buy
Technological Overview
[Diagram: typical cluster layout with overhead servers (DHCP, NIS, firewall, queue, syslog, boot, mirror, admin), login nodes (Login1, Login2), NAS servers (NAS1-NAS3), user workstations (User1-User3), and the compute nodes]
Components of a Cluster
A Word on Entropy
● Managing 10 workstations differs a lot from managing a cluster
● Entropy of cables
– Sort them immediately
– Use colors
– Use hook-and-loop tape
– Use printed labels
Choice of Hardware
● Nodes
● Networking
● Overhead servers
Choosing Nodes
Example: Google
● Stock hardware
● Custom-built low-tech cases
● Modular approach
● Components
– Mainboard, CPU, memory
– 2x HDD (striped)
– UPS battery
● Advantages:
– Cheap
– High learning effect
Example: BlueGene/P
● PowerPC
● Custom-built
– Boards
– Chips
– Networking
● Advantage:
– Scales very well
Buy a Rack
● The common Beowulf cluster
● Buy ready-built 19" pizza-box servers
● Mount them in a 19" rack
– Usually 42U
● Advantages
– Less work
– High packing density
Use Ready-Built Desktops
Processors and Architectures?
● Know your problem
● What to know about your algorithms?
– How much memory?
– Can the problem easily be decomposed? (see the toy example below)
– What precision is required?
● Libraries
– Do they exist for your problem (e.g. QM calculations)?
– Do they run on all architectures?
● Choices:
– Architecture (usually AMD / Intel is a good choice)
– #CPUs
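As a toy illustration of "can the problem easily be decomposed": an embarrassingly parallel parameter sweep scales almost linearly with #CPUs. The function and parameter list here are hypothetical placeholders:

    from multiprocessing import Pool

    def run_one_case(parameter):
        # Placeholder for one independent simulation run.
        return parameter ** 2

    if __name__ == "__main__":
        parameters = range(64)            # 64 independent jobs
        with Pool(processes=8) as pool:   # e.g. one node with 8 cores
            results = pool.map(run_one_case, parameters)
        print(sum(results))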
Storage Management
● Know your problem
● Parameters to know
– How much HDD space?
– What is the typical bandwidth?
● Evaluating 100 GB files in real time?
● Writing out 1 TB files?
● Choices:
– NAS (multiple?)
– SAN
– Distributed filesystem
(see the transfer-time estimate below)
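A quick sanity check of what those file sizes mean; the sustained bandwidths are rough ballpark assumptions, not measurements:

    # Time to move a file at a given sustained bandwidth (ballpark figures).
    def transfer_time_hours(size_gb, bandwidth_mb_per_s):
        return size_gb * 1024 / bandwidth_mb_per_s / 3600

    for name, bw in [("Gbit Ethernet (~100 MB/s)", 100),
                     ("10 Gbit / Infiniband (~1000 MB/s)", 1000)]:
        print(f"{name}: 100 GB file in {transfer_time_hours(100, bw)*60:.0f} min, "
              f"1 TB file in {transfer_time_hours(1024, bw):.1f} h")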
Backup
● RAID ≠ backup
– You can still kill your data with rm -rf /my_stuff
● Incremental backup of (a sketch follows below)
– Critical user configuration
– Configuration files
– The complete overhead server installation
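A minimal sketch of one way to do incremental backups, driving rsync's hard-link snapshots (--link-dest) from Python; source and target paths are placeholders:

    import subprocess, datetime, pathlib

    SOURCE  = "/etc/"                                  # placeholder source
    BACKUPS = pathlib.Path("/backup/overhead-server")  # placeholder target
    BACKUPS.mkdir(parents=True, exist_ok=True)

    today  = BACKUPS / datetime.date.today().isoformat()
    latest = BACKUPS / "latest"

    # Unchanged files are hard-linked against the previous snapshot,
    # so each daily snapshot only costs the space of the changed files.
    cmd = ["rsync", "-a", "--delete"]
    if latest.exists():
        cmd.append(f"--link-dest={latest}")
    cmd += [SOURCE, str(today)]
    subprocess.run(cmd, check=True)

    # Point "latest" at the new snapshot.
    if latest.is_symlink():
        latest.unlink()
    latest.symlink_to(today.name)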
Networking
Types
● Know your problem
● Choices
– Bandwidth
● Gbit < Infiniband
● Gbit: channel bonding possible
– Latency
● Gbit > SCI (i.e. Gbit has the higher latency)
– Scalability
● Stacked network switches
● Fat-tree architecture
(see the latency/bandwidth estimate below)
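A rough model of why both numbers matter for tightly coupled jobs: transfer time ≈ latency + size / bandwidth. The latency and bandwidth values below are only order-of-magnitude assumptions:

    # transfer_time = latency + message_size / bandwidth
    interconnects = {
        "Gbit Ethernet": {"latency_s": 50e-6, "bandwidth_B_s": 120e6},
        "Infiniband":    {"latency_s": 2e-6,  "bandwidth_B_s": 1500e6},
    }

    for size in (1e3, 1e6, 1e9):   # 1 kB, 1 MB, 1 GB messages
        for name, net in interconnects.items():
            t = net["latency_s"] + size / net["bandwidth_B_s"]
            print(f"{name:14s} {size:>10.0e} B: {t*1e3:8.3f} ms")
        print()

For small messages the latency term dominates, which is why loosely coupled jobs run fine on Gbit while tightly coupled ones benefit from Infiniband.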
Switches
● Important parameters
– Backbone speed
● Throughput when all ports are under load? (e.g. a non-blocking 48-port Gbit switch needs roughly 96 Gbit/s of backplane capacity)
– Can it be configured?
● Auto-sensing
● IP
● ARP
● ...
– Stackable?
– (Uplink ports?)
Which #Cores/Node is Optimal?
● Currently cheapest cost per core: 8 cores per node
● Small systems (up to ~48 nodes)
– Doesn't matter, because one switch is enough
● Average systems
– Do you need all-to-all connections?
– Use separate rings, or change the network topology
– If you want to stick to a single-switch network: the current optimum is 16 CPUs per node
● Big systems
– Go for a fat-tree network :)
Infrastructure Requirements
● Cooling
– Every W burned in a CPU ⇒ heat that must be removed
● Stable power supply
– Blackouts?
– Fluctuations in voltage level
● Cheap power supplies will break on fluctuations
Notes on Power Consumption
● Less power consumption ⇒ less heat ⇒ fewer defects(?)
● Running costs per year can easily reach the initial investment!
– Do the math ⇒ a blade center could also pay off! (see the estimate below)
● Do not switch all nodes on / off at once
– Voltage peaks!
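A minimal "do the math" sketch for the yearly running costs; node count, power draw, and electricity price are hypothetical placeholders:

    nodes          = 64     # hypothetical cluster size
    watts_per_node = 350    # CPU plus cooling overhead, rough guess
    price_per_kwh  = 0.20   # EUR, placeholder electricity price

    kwh_per_year = nodes * watts_per_node / 1000 * 24 * 365
    print(f"~{kwh_per_year:.0f} kWh/year, "
          f"~{kwh_per_year * price_per_kwh:.0f} EUR/year")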
Decomposition of the Servers
Why Separate Login Nodes?
● User interaction
● May hang due to user jobs
● Security
– SSH ports are open
– May be hacked
● Configuration of user packages
– The system is closer to the bleeding edge
Splitting Servers
● Easily >10 overhead tasks
● Why not put them all on one big server?
– Security (one hole ⇒ everything broken)
– Stability
– Maintenance
● Updates (what was done 3 years ago?)
● Dependencies (how do software packages interfere?)
● No plug-in structure (no testing of different variants)
● Solution
– Split the tasks ⇒ >10 overhead servers
– Problem:
● Cost
● Hardware failures?
Combining Servers
● Use Xen
● Host servers: 1...3 physical machines
– Tolerant against hardware failures
● Further advantages
– Greatly reduced costs
– Complete rollback possible
– Try different configurations
● Experiments are possible on a limited budget
– Clear separation of tasks
Administration
Administration Policies
● Interaction with human beings
– Difficult social aspects
– A good administrator is never noticed (the system just works)
● Who has the root password?
● Who will document what has been done?
● Split the work, but communicate:
– Design decisions
– Buying, writing grant proposals
– Installation, bug fixing
– Educating end users
Administration Policies
● User interaction
– Keep the users informed (mailing list)
– Monitor the system to catch problems before they occur (a minimal sketch follows below)
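A minimal sketch of what such monitoring can look like on a plain Linux node; mount points and thresholds are arbitrary placeholders:

    import os, shutil

    # Full filesystems and overloaded nodes are the two most common
    # sources of "the cluster is broken" mails; warn before users notice.
    for mountpoint in ("/", "/home", "/scratch"):   # placeholder mount points
        try:
            usage = shutil.disk_usage(mountpoint)
        except FileNotFoundError:
            continue
        percent = usage.used / usage.total * 100
        if percent > 90:
            print(f"WARNING: {mountpoint} is {percent:.0f}% full")

    load1, _, _ = os.getloadavg()
    if load1 > os.cpu_count() * 1.5:
        print(f"WARNING: load average {load1:.1f} on {os.cpu_count()} cores")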
Managing Different Groups
● Impossible!
● Each group has to provide at least one person for
– Managing user education
– Monitoring performance
– Knowing the needs (⇒ cluster design decisions)
⇒ Sharing an administrator is not possible!
● Sharing resources: possible & meaningful
What is the Critical Data?
● What data has to be kept safe?
– User programs
– Final data
– May be put on a RAID mirror
● What data can be exposed to potential loss?
– Temporary files
– May be put on a RAID stripe
Compilation
● Custom user programs / libraries
● Where to install?
– /usr/local/ (system-wide)
– $HOME (per-user)
● Autotools make it possible to install complete packages in the home directory (./configure --prefix=$HOME)
⇒ Depends on how often the code changes
● Choosing a compiler
– GNU compilers are good & free
– Special CPU instructions: buy a compiler
● Intel compiler
● Portland (PGI) compiler
Security
● University networks are
– Insecure
– Prized targets
● Risks
– SSH password login
– Open ports
– Updating
● Keep up to date with serious bugs!
– Users
● Therefore (attacks will happen on a daily basis!)
– Use a firewall
– Monitor the system for odd behavior (a small example follows below)
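One small example of monitoring for odd behavior: counting failed SSH logins per source address. The log path is the Debian-style default and may differ on other systems:

    import collections, re

    failed = collections.Counter()
    # /var/log/auth.log is the Debian-style location; adjust for your distribution.
    with open("/var/log/auth.log", errors="replace") as log:
        for line in log:
            match = re.search(r"Failed password .* from (\S+)", line)
            if match:
                failed[match.group(1)] += 1

    for address, count in failed.most_common(10):
        print(f"{count:6d} failed SSH logins from {address}")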
Operating Systems
Which Operating System?
● Different OSes / distributions exist
– But the configuration is widely compatible
– The way of doing things differs slightly in detail
● E.g. directories / files
– Watch out for licenses: BSD, GPL, ...
● The OS should provide basic, stable & secure functionality
– Linux
● Debian
● RedHat
● SuSE (slow, costly, small community)
– FreeBSD (more secure, but somewhat older versions)
– OpenBSD (most secure)
Updating or Not?
● Motivations
– Stability
– Security
– Features
● Possible solution:
– Keep login servers and the firewall up to date
– Keep compute nodes stable (out of date)
– Works only if the nodes are on an inner (private) network
Rolling Your Own Distribution
● A possible solution for installation issues
● Possibilities
– A from-scratch distribution
– Modify an existing distribution
– Compile only custom packages (/usr/local/bin)
– Keep system HDD images and clone them
Lessons Learned
● Reproducible?
– Making a distribution is exhausting
● Documentation (wiki)
– Someday you will have to hand over
– Or reinstall
● Keep a complete package mirror
– Packages may vanish
The Gentoo Approach
● Use source packages
● Autotools ⇒ binary files
● Create special configuration files for the dependencies
– In Gentoo: portage (→ corvix: egatrop)
– In BSD: ports
● Alternatives
– Linux From Scratch
● Missing the configuration files
● Relies on autotools
– Arch Linux
● The websites are good sources for step-by-step howtos
The Debian Approach
● Compile once, distribute binary packages
● Create custom packages with a single command
● Advantages
– Extremely fast
– Easier to maintain for a large number of servers
– Embedded devices use a similar package architecture
Our Solution
● Stable base system:
– Debian overlays
● Additional package source with custom packages
– Xen images of the installed Debian system ⇒ even faster reinstallations
● Custom software
– E.g. user-requested libraries
– Compiled in ~
Other Cluster Distributions
● Debian-based / RedHat-based distributions exist
– E.g. RocksCluster, CentOS, PelicanHPC, Corvix, ...
● A good source of howtos
● Good as a cheat sheet
● But
– HPC is inherently customized
– Flexibility is highest with a customized installation
– None of these distros solved a problem that we had
Thank you!
● Acknowledgements
– Prof. Dr. rer. nat. Herbert M. Urbassek, TU Kaiserslautern, Germany